InfoQ Homepage Articles The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem

The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem

Jul 30, 2020 12 min read

InfoQ Article Contest

Share your knowledge Win a ticket to a QCon event
or an InfoQ Dev SummitFind out more

Key Takeaways

Globally, there is an increasing appetite for data delivered in real time. As both producers and consumers are more and more interested in faster experiences and instantaneous data transactions, we are witnessing the emergence of the real-time API.
When it comes to event-driven APIs, engineers can choose between multiple different protocols. Options include the simple webhook, the newer WebSub, popular open protocols such as WebSockets, MQTT or SSE, or even streaming protocols, such as Kafka. In addition to choosing a protocol, engineers also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based).
Client-initiated models are the best choice for the “last mile” delivery of data to end-user devices. These devices only need access to data when they are online (connected) and don’t care what happens when they are disconnected. Due to this fact, the complexity of the producer is reduced, as the server-side doesn’t need to be stateful.
In the case of streaming data at scale, engineers should adopt a server-initiated model. The responsibility of sharding data across multiple connections and managing those connections rests with the producer, and other than the potential use of a client-side load balancer, things are kept rather simple on the consumer side.
To truly benefit from the power of real-time data, the entire tech stack needs to be event-driven. Perhaps we should start talking more about event-driven architectures than about event-driven APIs.

Globally, there is an increasing appetite for data delivered in real time. Since both producers and consumers are more and more interested in faster experiences and instantaneous data transactions, we are witnessing the emergence of the real-time API.

This new type of event-driven API is suitable for a wide variety of use cases. It can be used to power real-time functionality and technologies such as chat, alerts, and notifications or IoT devices. Real-time APIs can also be used to stream high volumes of data between different businesses or different components of a system.

This article starts by exploring the fundamental differences between the REST model and real-time APIs. Up next, we dive into some of the many engineering challenges and considerations involved in building a reliable and scalable event-driven ecosystem, such as choosing the right communication protocol and subscription model, managing client, and server-side complexity, or scaling to support high-volume data streams.

What exactly is a real-time API?

Usually, when we talk about data being delivered in real time, we think about speed. By this logic, one could assume that improving REST APIs to be more responsive and able to execute operations in real time (or as close as possible) makes them real-time APIs. However, that’s just an improvement of an existing condition, not a fundamental change. Just because a traditional REST API can deliver data in real time does not make it a real-time API.

The basic premise around real-time APIs is that they are event-driven. According to the event-driven design pattern, a system should react or respond to events as they happen. Multiple types of APIs can be regarded as event-driven, as illustrated below.

The real-time API family. Source: Ably

Streaming, Pub/Sub, and Push are patterns that can be successfully delivered via an event-driven architecture. This makes all of them fall under the umbrella of event-driven APIs.

Unlike the popular request-response model of REST APIs, event-driven APIs follow an asynchronous communication model. An event-driven architecture consists of the following main components:

Event producers—they push data to channels whenever an event takes place.
Channels—they push the data received from event producers to event consumers.
Event consumers—they subscribe to channels and consume the data.

Let’s look at a simple and fictional example to better understand how these components interact. Let’s say we have a football app that uses a data stream to deliver real-time updates to end-users whenever something relevant happens on the field. If a goal is scored, the event is pushed to a channel. When a consumer uses the app, they connect to the respective channel, which then pushes the event to the client device.

Note that in an event-driven architecture, producers and consumers are decoupled. Components perform their task independently and are unaware of each other. This separation of concerns allows you to more reliably scale a real-time system and it can prevent potential issues with one of the components from impacting the other ones.

Compared to REST, event-driven APIs invert complexity and put more responsibility on the shoulders of producers rather than consumers.

REST vs event-driven: complexity is inverted. Source: Ably

This complexity inversion relates to the very foundation of the way event-driven APIs are designed. While in a REST paradigm the consumer is always responsible for maintaining state and always has to trigger requests to get updates. In an event-driven system, the producer is responsible for maintaining state and pushing updates to the consumer.

Event-driven architecture considerations

Building a dependable event-driven architecture is by no means an easy feat. There is an entire array of engineering challenges you will have to face and decisions you will have to make. Among them, protocol fragmentation and choosing the right subscription model (client-initiated or server-initiated) for your specific use case are some of the most pressing things you need to consider.

While traditional REST APIs all use HTTP as the transport and protocol layer, the situation is much more complex when it comes to event-driven APIs. You can choose between multiple different protocols. Options include the simple webhook, the newer WebSub, popular open protocols such as WebSockets, MQTT or SSE, or even streaming protocols, such as Kafka.

This diversity can be a double-edged sword—on one hand, you aren’t restricted to only one protocol; on the other hand, you need to select the best one for your use case, which adds an additional layer of engineering complexity.

Besides choosing a protocol, you also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based). Note that some protocols can be used with both models, while some protocols only support one of the two subscription approaches. Of course, this brings even more engineering complexity to the table.

In a client-initiated model, the consumer is responsible for connecting and subscribing to an event-driven data stream. This model is simpler from a producer perspective: if no consumers are subscribed to the data stream, the producer has no work to do and relies on clients to decide when to reconnect. Additionally, complexities around maintaining state are also handled by consumers. Wildly popular and effective even when high volumes of data are involved, WebSockets represent the most common example of a client-initiated protocol.

In contrast, with a server-initiated approach, the producer is responsible for pushing data to consumers whenever an event occurs. This model is often preferable for consumers, especially when the volume of data increases, as they are not responsible for maintaining any state—this responsibility sits with the producer. The common webhook—which is great for pushing rather infrequent low-latency updates—is the most obvious example of a server-initiated protocol.

We are now going to dive into more details and explore the strengths, weaknesses, and engineering complexities of client-initiated and server-initiated subscriptions.

Client-initiated vs. server-initiated models—challenges and use cases

Client-initiated models are the best choice for last-mile delivery of data to end-user devices. These devices only need access to data when they are online (connected) and don’t care what happens when they are disconnected. Due to this fact, the complexity of the producer is reduced, as the server-side doesn’t need to be stateful. The complexity is kept at a low level even on the consumer side: generally, all client devices have to do is connect and subscribe to the channels they want to listen to for messages.

There are several client-initiated protocols you can choose from. The most popular and efficient ones are:

WebSocket. Provides full-duplex communication channels over a single TCP connection. Much lower overhead than half-duplex alternatives such as HTTP polling. Great choice for financial tickers, location-based apps, and chat solutions.
MQTT. The go-to protocol for streaming data between devices with limited CPU power and/or battery life, and networks with expensive or low bandwidth, unpredictable stability, or high latency. Great for IoT.
SSE. Open, lightweight, subscribe-only protocol for event-driven data streams. Ideal for subscribing to data feeds, such as live sport updates.

Among these, WebSocket is arguably the most widely-used protocol. There are even a couple of proprietary protocols and open solutions that are built on top of raw WebSockets, such as Socket.IO. All of them are generally lightweight, and they are well supported by various development platforms and programming languages. This makes them ideal for B2C data delivery.

Let’s look at a real-life use case to demonstrate how WebSocket-based solutions can be used to power a client-initiated event-driven system. Tennis Australia (the governing body for tennis in Australia) wanted a solution that would allow them to stream real-time rally and commentary updates to tennis fans browsing the Australian Open website. Tennis Australia had no way of knowing how many client devices could subscribe to updates at any given moment, nor where these devices could be located throughout the world. Additionally, client devices are generally unpredictable—they can connect and disconnect at any moment.

Due to these constraints, a client-initiated model where a client device would open a connection whenever it wanted to subscribe to updates was the right way to go. However, since millions of client devices could connect at the same time, it wouldn’t have been scalable to have a 1:1 relationship with each client device. Tennis Australia was interested in keeping engineering complexity to a minimum—they wanted to publish one message every time there was an update and distribute that message to all connected client devices via a message broker.

In the end, instead of building their own proprietary solution, Tennis Australia chose to use Ably as the message broker. This enables Tennis Australia to keep things very simple on their side—all they have to do is publish a message to Ably whenever there’s a score update. The message looks something like this:

var ably = new Ably.Realtime('API_KEY');
var channel = ably.channels.get('tennis-score-updates');

// Publish a message to the tennis-score-updates channel
channel.publish('score', 'Game Point!');

Ably then distributes that message to all connected client devices over WebSockets, by using a pub/sub approach, while also handling most of the engineering complexity on the producer side, such as connection churn, backpressure, or message fan-out.

Things are kept simple for consumers as well. All a client device had to do is open a WebSocket connection and subscribe to updates:

// Subscribe to messages on channel
channel.subscribe('score', function(message) {
  alert(message.data);
});

Traditionally, developers use the client-initiated model to build apps for end-users. It’s a sensible choice since protocols like WebSockets, MQTT, or SSE can be successfully used to stream frequent updates to a high number of users, as demonstrated by the Tennis Australia example. However, it’s hazardous to think that client-initiated models scale well when high-throughput streams of data are involved—I’m referring to scenarios where businesses exchange large volumes of data, or where an organization is streaming information from one system to another.

In such cases, it’s usually not practical to stream all that data over a single consumer-initiated connection (between one server that is the producer, and another one that is the consumer). Often, to manage the influx of data, the consumer needs to shard it across multiple nodes. But by doing so, the consumer also has to figure out how to distribute these smaller streams of data across multiple connections and deal with other complex engineering challenges, such as fault tolerance.

We’ll use an example to better illustrate some of the consumer-side complexities. For example, let’s say you have two servers (A and B) that are consuming two streams of data. Let’s imagine that server A fails. How does server B know it needs to pick up additional work? How does it even know where server A left off? This is just a basic example, but imagine how hard it would be to manage hundreds of servers consuming hundreds of data streams. As a general rule, when there’s a lot of complexity involved, it’s the producer’s responsibility to handle it; data ingestion should be as simple as possible for the consumer.

That’s why in the case of streaming data at scale you should adopt a server-initiated model. This way, the responsibility of sharding data across multiple connections and managing those connections rests with the producer. Things are kept rather simple on the consumer side—they would typically have to use a load balancer to distribute the incoming data streams to available nodes for consumption, but that’s about as complex as it gets.

Webhooks are often used in server-initiated models. The webhook is a very popular pattern because it’s simple and effective. As a consumer, you would have a load balancer that receives webhook requests and distributes them to servers to be processed. However, webhooks become less and less effective as the volume of data increases. Webhooks are HTTP-based, so there’s an overhead with each webhook event (message) because each one triggers a new request. In addition, webhooks provide no integrity or message ordering guarantees.

That’s why for streaming data at scale you should usually go with a streaming protocol such as AMQP, Kafka, or ActiveMQ, to name just a few. Streaming protocols generally have much lower overheads per message, and they provide ordering and integrity guarantees. They can even provide additional benefits—idempotency, for example. Last but not least, streaming protocols enable you to shard data before streaming it to consumers.

It’s time to look at a real-life implementation of a server-initiated model. HubSpot is a well-known developer of marketing, sales, and customer service software. As part of its offering, HubSpot provides a chat service (Conversations) that enables communication between end-users. The organization is also interested in streaming all that chat data to other HubSpot services for onward processing and persistent storage. Using a client-initiated subscription to successfully stream high volumes of data to their internal message buses is not really an option. For this to happen, HubSpot would need to know what channels are active at any point in time, to pull data from them.

To avoid having to deal with complex engineering challenges, HubSpot decided to use Ably as a message broker that enables chat communication between end-users. Furthermore, Ably uses a server-initiated model to push chat data into Amazon Kinesis, which is the data processing component of HubSpot’s message bus ecosystem.

High-level overview of HubSpot chat architecture. Source: Ably

Consumer complexity is kept to a minimum. HubSpot only has to expose a Kinesis endpoint and Ably streams the chat data over as many connections as needed.

A brief conclusion

Hopefully, this article offers a taste of what real-time APIs are and helps readers navigate some of the many complexities and challenges of building an effective real-time architecture. It is naive to think that by improving a traditional REST API to be quicker and more responsive you get a real-time API. Real time in the context of APIs means so much more.

By design, real-time APIs are event-driven; this is a fundamental shift from the request-response pattern of RESTful services. In the event-driven paradigm, the responsibility is inverted, and the core of engineering complexities rests with the data producer, with the purpose of making data ingestion as easy as possible for the consumer.

But having one real-time API is not enough—this is not a solution to a problem. To truly benefit from the power of real-time data, your entire tech stack needs to be event-driven. Perhaps we should start talking more about event-driven architectures than about event-driven APIs. After all, can pushing high volumes of data to an endpoint (see HubSpot example above for details) even be classified as an API?

About the Author

Matthew O’Riordan is the technical co-founder of Ably, a global, cloud-based real-time messaging platform that provides APIs used by thousands of developers and businesses. Matthew has been a programmer for over 20 years and first started working on commercial internet projects in the mid-90s when Internet Explorer 3 and Netscape were still battling it out. While he enjoys coding, the challenges he faces as an entrepreneur starting and scaling businesses is what drives him. Matthew has previously started and successfully exited from two previous tech businesses.

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem

InfoQ Article Contest

Key Takeaways

Related Sponsored Content

What exactly is a real-time API?

Event-driven architecture considerations

Client-initiated vs. server-initiated models—challenges and use cases

A brief conclusion

About the Author

Rate this Article

This content is in the Streaming topic

Related Topics:

Related Editorial

Popular across InfoQ

The InfoQ Newsletter