Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations From Zero to a Hundred Billion: Building Scalable Real-Time Event Processing at DoorDash

From Zero to a Hundred Billion: Building Scalable Real-Time Event Processing at DoorDash



Allen Wang discusses the design of the event system including major components of event producing, event processing with Flink and streaming SQL, event format and schema validation.


Allen Wang is currently a tech lead at data platform at DoorDash. He is the architect for the Iguazu event processing system and a founding member of the real-time streaming platform team. Prior to joining DoorDash, he was a lead in the real time data infrastructure team at Netflix where he created the Kafka infrastructure for Netflix’s Keystone data pipeline.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Wang: My name is Allen Wang. I've been building real time data infrastructure for the past seven years. First at Netflix for the Keystone Data Pipeline and currently at DoorDash, building a real time event processing system called Iguazu. The journey of building those systems give me an opportunity to find some common patterns in successful real time data architectures.

Successful Patterns in Building Realtime Event Processing

In my mind, there are four principles that have passed the test of time. First, decoupling stages. We'll talk about what those stages are, and why, and how to make that decoupling happen. Second, leveraging stream processing framework. We will cover what stream processing framework can do for you and how you can pick one. Third, creating abstractions. For that, I will list the common abstractions that you can create to facilitate the adoption of your system. Finally, fine-grained failure isolation and scalability. Failures will happen, how do you isolate the failures and avoid bottleneck in scalability? I know those concepts may sound a bit abstract at this point, so I'm going to use a real time event processing system we created at DoorDash as an example, to give you some concrete ideas.

Real Time Events Use Case at DoorDash

First, let's talk about use cases of real time events. At DoorDash, real time events are an important data source to gain insight into our business, and help us make data driven decisions, just to give you a few examples how real time events are used at DoorDash. First, almost all events need to be reliably transported to our data warehouse, Snowflake, or other online analytical datastores with low latency for business analysis. For example, the dasher assignment team relies on the assignment events to detect any bugs in their algorithm at near real time. The second use case is mobile application health monitoring. Some mobile events will be integrated with our time series metric backend for monitoring and alerting so that the teams can quickly identify issues in the latest mobile application releases. The third use case is sessionization. We will like to group user events into sessions and generate session attributes at real time so that we can better analyze the user behaviors and push real time recommendations.

Historically, DoorDash has a few data pipelines that get data from our legacy monolithic web application and ingest the data into Snowflake. Each pipeline is built differently and can only handle one event. They involve mixed data transports, multiple queuing systems, and multiple third-party data services, which make it really difficult to control the latency, maintain the cost, and identify scalability bottlenecks in operations. We started to rethink this approach and decided to build a new system to replace those legacy pipelines and address the future event processing needs that we anticipated. First, it should support heterogeneous data sources, including microservices, and mobile or web applications, and be able to deliver the events to different destinations. Second, it should provide low latency and reliable data ingest into our data warehouse with a reasonable cost. Third, real time events should be easily accessible for data consumers. We want to empower all teams to create their own processing logic and tap into the streams of real time data. Finally, to improve data quality, we want to have end-to-end schema enforcement and schema evolution.

Introducing Iguazu

Two years ago, we started the journey of creating from scratch a real time event processing system named Iguazu. The important design decisions we made is to shift the strategy from heavily relying on third party data services to leveraging open source frameworks that can be customized and better integrates with the DoorDash infrastructure. Fast forward to today, we scaled Iguazu from processing just a few billion events to hundreds of billions of events per day with four nines of delivery rate. Compared to the legacy pipelines, the end-to-end latency to Snowflake is reduced from a day to just a few minutes. This is the architecture overview of Iguazu. The data ingestion for microservices and mobile clients are enabled through our Iguazu Kafka proxy before it lands into Kafka. We use Apache Kafka for Pub/Sub and decoupling the producers from consumers. Once the data lands into Kafka, we use stream processing applications built on top of Apache Flink for data transformation. That is achieved through Flink's data stream APIs and Flink SQL. After the stream process is done, the data is sent to different destinations, including S3 for data warehouse, and data lake integrations, Redis for real time features, and Chronosphere for operational metrics. In the following slides, I'll discuss the details of each of those major components.

Simplify and Optimize Event Producing

Let's first talk about producing events. The main focus is to make event producing as easy as possible and optimized for the workload of analytical data. To give more context, let me first explain the requirements for processing analytical data. First, analytical data usually comes with high volume and growth rate. As an example, in the process of making restaurant order on DoorDash, tens of thousands of analytical events are published. As we strive to improve user experience and delivery quality, more user actions are tracked at fine granularity, and these lead to the higher growth rate of analytical data volume compared with order volume. Therefore, compared with transactional data, analytical data requires higher scalability and cost efficiency. On the other hand, because analytical data are almost always processed and analyzed in aggregated fashion, minor data loss will not affect the quality of analysis. For example, if you collect 1 million results in an A/B test, and randomly drop 100 events from the results, most likely it will not affect the conclusion of the experiment. Typically, we have found a data loss of less than 0.1% will be acceptable for analytical events.

Leveraging Kafka REST Proxy

Let's look at the measures that we have taken to ensure the efficiency and the scalability in analytical event publishing. As I mentioned in the overview of the Iguazu architecture, we choose Kafka as the central Pub/Sub system. One challenge we face is how we can enable average DoorDash service to easily produce events through Kafka. Even though Kafka has been out there for quite some time, there are still teams struggling with the producer internals, tuning the producer properties, and making the Kafka connection. The solution we created is to leverage the Confluent Kafka REST proxy. The proxy provides a central place where we can enhance and optimize event producing functionalities. It provides the abstractions over Kafka with HTTP interface, eliminating the need to configure Kafka connections from all services and making event publishing much easier.

The Kafka REST proxy provides all the basic features that we need out of the box. One critical feature is that it supports event batching. You probably know that batching is a key factor in improving efficiencies in data processing. Why would it specifically help improving efficiency in event publishing to Kafka? What we have found out is that the Kafka brokers workload is highly dependent on the rate of the producing requests. The higher the rate of producing requests, the higher the CPU utilization on the program, and more workers will likely be needed. How can you produce to your Kafka with high data volume better than low rate of produce requests? As a formula in the slide suggests, you need to increase the number of events per request, which essentially means increasing the batch size. Batching comes in with tradeoff of higher latency in event publishing. However, processing of analytical events are typically done in an asynchronous fashion and does not require sub-second latency in getting back the final results. Small increase in the event publishing latency is an acceptable tradeoff. With Kafka REST proxy, you can batch events in each request from client side and rely on the Kafka client leader proxy to further batch them before producing to the broker. The result is that you will get the nice effect of event batching across client instances and applications. Let's say you have an application that publishes events at a very low rate, which would lead to inefficient batching. The proxy will be able to mix these low volume events with other high-volume events in one batch. It makes event publishing very efficient and greatly reduce the workload for the Kafka brokers.

Proxy Enhancements

The Kafka REST proxy provides all the basic features we need out of the box, but to further improve its performance and scalability, we added our own feature like multi-cluster producing, and asynchronous request processing. Multi-cluster producing means the same proxy can produce to multiple Kafka clusters. Each topic will be mapped to a cluster, and this ensures that we can scale beyond one Kafka cluster. It also enables us to migrate topics from one Kafka cluster to another, which helps to balance the workload and improve the cost efficiency of Kafka clusters. Asynchronous request processing means the proxy will respond to the produce request as soon as it pulls the Kafka records into the Kafka clients producer buffer without waiting for the broker's acknowledgment. This has a few advantages. First, it significantly improves the performance of the proxy and helps reduce the back pressure on the proxy's clients. Second, asynchronous request processing means the proxies spend less time block waiting for the broker's response and more time for the proxy to process requests, which lead to better batching and throughput. Finally, we understand that this asynchronous mode means clients may not get the actual clear response from the broker and may lead to data loss. To mitigate this issue, we added automated retries on the proxy side on behalf of the client. The number of retries is configurable and each subsequent retry will be done on a randomly picked partition to maximize the chance of success. The result is minimal data loss of less than 0.001%, which is well in range of acceptable data loss level for analytical events.

Event Processing with Flink

Now that I have covered events producing, let's focus on what we have done to facilitate event consuming. One important objective for Iguazu is to create a platform for easy data processing. Apache Flink's layered API architecture fits perfectly with this objective. We choose Apache Flink, also because of its low latency processing, native support of processing based on event time, and fault tolerance and built-in integrations with a wide range of sources and sinks, including Kafka, Redis, Elasticsearch, and S3. Ultimately, we need to understand what stream processing framework can do for you. We'll demonstrate that by looking at a simple Kafka consumer. This is a typical Kafka consumer. First, it gets records from Kafka in a loop. Then it will update a local state using the records, and it just retrieves and produce a result from the state. Finally, it will push the result to a downstream process, perhaps over the network.

On the first look, the code is really simple and does the job. However, questions will arise over time. First, note that the code does not commit Kafka offset, and this will likely lead to failures to provide any delivery guarantee when failures occur. Where should the offset commit be added to the code to provide the desired delivery guarantee be it at-least once, at-most once, or exactly-once. Second, the local state object is stored in memory and it will be lost when the consumer crashes. It should be persisted along with the Kafka offset so that the state can be accurately restored upon failure recovery. How can we persist and restore the state when necessary? Finally, the parallelism of Kafka consumer is limited by the number of partitions of the topic being consumed. What if the bottleneck of the application is in processing of the records or pushing the results to downstream, and we need higher parallelism than the number of partitions. Apparently, it takes more than a simple Kafka consumer to create a scalable and fault tolerant application. This is where stream processing framework like Flink shines. As shown in this code example, Flink helps you to achieve delivery guarantees, automatically persists and restores application state through checkpointing. It provides a flexible way to assign compute resources to different data operators. These are just a few examples that stream processing framework can offer.

One of the most important features from Flink is layered APIs. At the bottom, process function allows engineers to create highly customized code and have precise control on handling events, state, and time. The next level up, Flink offers data stream APIs with built-in high-level functions to support different aggregations and windowing, so that engineers can create a stream processing solution with just a few lines of code. On the top we have SQL and table APIs, which offer casual data users the opportunity to write Flink applications in a declarative way, using SQL instead of code. To help people at DoorDash leverage Flink, we have created a stream processing platform. Our platform provides a base frame Docker image with all the necessary configurations that are well integrated with the rest of the DoorDash infrastructure. Flink's high availability setup and Flink internal metrics will be available out of the box. For better failure isolation and the ability to scale independently, each Flink job is deployed in a standalone mode as a separate Kubernetes service. We support two abstractions in our platform, data stream APIs for engineers and a Flink SQL for casual data users.

You may be wondering why Flink SQL is an important abstraction we want to leverage. Here's a concrete example. Real time features are an important component in machine learning, for model training and prediction. For example, to predict an ETA of a DoorDash delivery order requires up to date store order count for each restaurant, which we call a real time feature. Traditionally, creating a real time feature requires coding history and parsing the application and transforms and arrays the events into real time features. Creating Flink applications requires a big learning curve for machine learning engineers and becomes a bottleneck when tens or hundreds of features need to be created. The application created often have a lot of boilerplate code that are replicated across multiple applications. Engineers also take the shortcut to bundle the calculation of multiple features in one application, which lacks failure isolation or the ability to allocate more resources to a specific feature calculation.


To meet those challenges, we decided to create a SQL based DSL framework called Riviera, where all the necessary processing logic and wiring are captured in a YAML file. The YAML file creates a lot of high-level abstractions, for example, connecting to a Kafka source, and producing to certain things. To create a real time feature, engineers only need to create one YAML file. Riviera achieved great results. The time needed to develop a new feature is reduced from weeks to hours. The feature engineering code base is reduced by 70% after migrating to Riviera. Here's a Rivera DSL example, to show how Flink SQL is used to calculate store order count for each restaurant, which is an important real time feature used in production for model prediction.

First, you need to specify the source sink and a schema used in the application. In this case, the source is a Kafka topic and a sink is a Redis cluster. The process logic is expressed as a SQL query. You can see that we used the built-in COUNT function to aggregate the order count. We also used hot window. The hot window indicates that you want to process the data that are received in the last 20 minutes and refresh the results every 20 seconds.

Event Format and Schema: Protocol between Producers and Consumers

In the above two sections, we covered event producing and consuming in Iguazu. However, without a unified event format, it's still difficult for producers and consumers to understand each other. Here, we will discuss the event format, and schemas, which serve as a protocol between producers and consumers. From the very beginning, we defined a unified event format. The unified event format includes an envelope and payload. The payload contains the schema encoded event properties. The envelope contains context of the event, for example, event creation time, metadata, including encoding method and the references to the schema. It also includes a non-schematized JSON blob called custom attributes. This JSON section in the envelope gives users a choice, where they can store certain data and evolve them freely without going through the formal schema evolution. This flexibility proves to be useful at an early stage of event creation, where frequent adjustments of schema definition are expected. We created serialization libraries for both event producers and the consumers to interact with this standard event format. In Kafka, the event envelope is stored as a Kafka record header, and the schema encoded payload is stored as a record value. Our serialization library takes the responsibility of converting back and forth between the event API and the properly encoded Kafka record so that the applications can focus on their main logic.

We have leveraged Confluent schema registry for generic data processing. First, let me briefly introduce schema registry. As we know, schema is a contract between producers and consumers on data or events that they both interact with. To make sure producers and consumers agree on the schema, one simple way is to present the schema as a POJO class, which is available for both producers and consumers. However, there's no guarantee that the producers and the consumers will have the same version of the POJO. To ensure that a change of the schema is propagated from the producer to the consumer, they must coordinate on when the new POJO class will be made available for each party. Schema registry helps to avoid this manual coordination between producers and consumers on schema changes. It is a standalone REST service to store schemas and serve the schema lookup requests from both producers and consumers using schema IDs. Where schema changes are registered with schema registry, it will enforce compatibility rules and reject incompatible schema changes.

To leverage the schema registry for generic data processing, schema ID is embedded in the event payload so that the downstream consumers can look it up from schema registry without relying on the object classes on the runtime class path. Both protobuf and Avro schemas are supported in our serialization library and the schema registry. We support protobuf schema, because almost all of our microservices are based on gRPC and protobuf supported by a central protobuf Git repository. To enforce this single source of truth, avoid duplicate schema definition and to ensure the smooth adoption of Iguazu, we decided to use the protobuf as the primary schema type. On the other hand, the Avro schema is still better supported than protobuf in most of the data frameworks. Where necessary, our serialization library takes the responsibility of seamlessly converting the protobuf message to Avro format and vice versa.

One decision we have to make is when we should allow schema update to happen. There are two choices, at build time or producers runtime. It is usually tempting to let producers freely update the schema at the time of event publishing. There are some risks associated with that. It may lead to data loss because any incompatible schema change will fail the schema registry update and cause runtime failures in event publishing. It will also lead to spikes of schema registry update requests causing potential scalability issues for the schema registry. Instead, it will be ideal to register and update the schema at build time to catch incompatible schema changes early in the development cycle, and reduce the update aka call volume to the schema registry.

One challenge we faced is how we can centrally automate the schema update at build time. The solution we created is to leverage the central repository that manages all of our protobuf schemas and integrate the schema registry update as part of its CI/CD process. When a protobuf definition is updated in the pull request, the CI process will validate the change with the schema registry and it will fail if the change is incompatible. After the CI passes, and the pull request is merged, the CD process will actually register or update the schema registry and publish the compiled protobuf JAR files. The CI/CD process not only eliminates the overhead of manual schema registration, but also guarantees the early detection of incompatible schema changes and the consistency between the [inaudible 00:26:34] protobuf class families and the schemas in the schema registry. Ultimately, this automation avoids schema update at runtime and the possible data loss due to incompatible schema changes.

Data Warehouse and Data Lake Integration

So far, we have talked about event producing, consuming, and event format. In this slide, I'll give some details on our data warehouse integration. Data Warehouse integration is one of the key goals of Iguazu. Snowflake is our main data warehouse, and we expect events to be delivered to Snowflake with strong consistency and low latency. The data warehouse integration is implemented as a two-step process. In the first step, data is consumed by a Flink application from a Kafka topic and uploaded to S3 in the parquet format. This step leverages S3 to decouple the stream processing from Snowflake so that Snowflake failures will not impact stream processing. It will also provide a backfill mechanism from S3 given Kafka's limited retention. Finally, having the parquet files on S3 enables date lake integrations as the parquet data can be easily converted to any desired table format. The implementation of uploading data to S3 is done through Flink's StreamingFileSink. When completing an upload as part of Flink's checkpoint, StreamingFileSink guarantees strong consistency and exactly once delivery. StreamingFileSink also allows customized bucketing on S3 which means you can flexibly partition the data using any fields. This optimization greatly facilitates the downstream consumers.

At the second step, data is copied from S3 to Snowflake tables via Snowpipe. Snowpipe is a Snowflake service to load data from external storage at near real time. Triggered by SQS messages, Snowpipe copies data from S3 as soon as they become available. Snowpipe also allows simple data transformation during the copy process. Given that Snowpipe is declarative and easy to use, it's a good alternative compared to doing data transformation in stream processing. One important thing to note is that each event has its own stream processing application for S3 offload and its own Snowpipe. As a result, we can scale pipelines for each event individually and isolate the failures.

Working Towards a Self-Serve Platform

So far, we covered end-to-end data flows from clients to data warehouse, and here we want to discuss the operational aspect of Iguazu and see how we are making it a self-serve to reduce operational burdens. As you recall, to achieve failure isolation, each event in Iguazu has its own pipeline from Flink job to Snowpipe. This requires a lot more setup work and makes the operation a challenge. This diagram shows the complicated steps required to onboard a new event, including creation of the Kafka topic, schema registration, creation of a Flink stream processing job, and creation of the Snowflake objects. We really want to automate all these manual steps to improve operational efficiency. One challenge we face is how we can accomplish it under the infrastructure as code principle. At DoorDash, most of our operations, from creating Kafka topic, to setting up service configurations involve some pull request to different Terraform repositories and requires code review. How can we automate all these? To solve the issue, we first worked with our infrastructure team to set up the right pull approval process, and then automate the pull request using GitHub automation. Essentially, a GitHub app is created, where we can programmatically create and merge pull requests. We also leverage the Cadence workflow engine, and implemented the process as a reliable workflow. This pull automation reduced the event onboarding time from days to minutes. We get the best of both worlds, where we achieve automations, but also get versioning, all the necessary code reviews, and consistent state between our code and the infrastructure.

To get us one step closer to self-serve, we created high level UIs for a user to onboard the event. This screenshot shows a schema exploration UI, where users can search for a schema, using brackets, pick the right schema subject and version, and then onboard the event from there. This screenshot shows the Snowpipe integration UI where users can review and create Snowflake table schemas and Snowpipe schemas. Most importantly, we created a service called Minions that does the orchestration, and have it integrated with Slack so that we get notifications on each step it carries out. On this screenshot, you can see about 17 actions it has taken in order to onboard an event, including all the pull requests, launching the Flink job, and creating Snowflake objects.


Now that we covered the architecture and designs of all the major components in Iguazu, I'd like to once again review the four principles that I emphasized. The first principle is to decouple different stages. You want to use a Pub/Sub or messaging system to decouple data producers from consumers. This not only relieves the producers from back pressure, but also increases the data durability for downstream consumers. There are a couple of considerations you can use when choosing the right Pub/Sub or messaging system. First, it should be simple to use. I tend to choose the one that's built with a single purpose and do it well. Secondly, it should have high durability and throughput guarantees. It should be highly scalable, even when there's a lot of consumers and high data fanout. Similarly, stream processing should be decoupled from any downstream processes, including the data warehouse, using a cost-effective cloud storage or data lake. This guarantees that in case there's any failures in the data warehouse, you don't have to stop the stream processing job and you can always use the cheaper storage for backfill.

As a second principle, it pays to invest on the right stream processing framework. There are some considerations you can use when choosing the right stream processing framework. First, it should support multiple abstractions so that you can create different solutions according to the type of the audience. It should have integrations with sources and sinks you need, and support a variety of data formats like Avro, JSON, protobuf, or parquet, so that you can choose the right data format in the right stage of stream processing. The third principle is to create abstractions. In my opinion, abstraction is an important factor to facilitate the adoption of any complicated system. When it comes to the real time data infrastructure, there are a few common abstractions worth considering. For example, leveraging Kafka REST proxy, creating event API and serialization library, providing a DSL framework like Flink SQL, and providing high level UIs with orchestrations. Finally, you need to have fine-grained failure isolation and scalability. For that purpose, you should aim to create independent stream processing jobs or pipelines for each event. This avoids resource contention, isolates failures, and makes it easy to scale each pipeline independently. You also get the added benefits of being able to provide different SLAs for different events and easily attribute your costs at event level. Because you are creating a lot of independent services and pipelines to achieve failure isolation, it is crucial to create orchestration service to automate things and reduce the operational overhead.

Beyond the Architecture

Here are some final thoughts that go beyond the architecture. First, I find it important to build products using a platform mindset. Ad hoc solutions are not only inefficient, but also difficult to scale and operate. Secondly, picking the right framework and creating the right building blocks is crucial to ensure success. Once you have researched and find the sweet spots of those frameworks, you can easily create new products by combining a few building blocks. This dramatically reduces the time to building your products, and the effort to maintain the platform. I would like to use a Kafka REST proxy as an example. We use it initially for event publishing from internal microservices. When the time comes for us to develop a solution for mobile events, we found the proxy is also a good fit because it supports batching and JSON payload, which are important for mobile events. With a little investment, we made it work as a batch service, which save us a lot of development time.

Questions and Answers

Anand: Would you use a REST proxy if loss is not an option?

Wang: I mentioned that we added the extra capability to do asynchronous request processing. That is designed purely for analytical data, because it would introduce maybe very minor data loss, because we are not giving the actual broker acknowledgments back to the client. You can also use the REST proxy in a different way, where it would give precisely what the broker acknowledgment is. Basically, you're adding a date hop in the middle, but client would know definitely whether this producing request is successful or not, or it can timeout, sometimes it can timeout. Let's say, you send something to the proxy and the proxy crashed, and then you don't hear anything back from the proxy, but on the client side, you will know it timed out, then you will try to do some kind of retry.

Anand: I'm assuming REST requests coming to the REST proxy, you accumulate it in some buffer, but you're responding back to the REST clients if they're mobile or whatever, that you've received the data, but it's uncommitted. Therefore, you can do larger writes and get a higher throughput sense.

Wang: Yes, of course, that also had to be changed, if you are really looking for lossless data publishing. It's not desirable to have that behavior.

Anand: Also curious about alternatives to Flink. One person is using Akka. They're moving off Akka towards Flink, and they were wondering what your alternative is.

Wang: I haven't used Akka before. Flink is the primary one that I use. I think Flink probably provides more data processing capability than Akka including all the abstractions it provides, the SQL support, and the fault tolerance built on top of Flink's internal state. I would definitely recommend using these kinds of frameworks that provide these capabilities.

Anand: Also curious about alternatives to Flink. We used to use Airflow plus Beam on data flow to build a similar framework. Did your team evaluate Beam? Why choose Flink? We usually hear about Beam and Flink, because Beam is a tier above.

Wang: I think Beam does not add a lot of advantages in streaming.

Anand: Beam is just an API layer, and it can be used with Spark streaming or with Flink.

Wang: Unless you want to have the capability of being able to migrate your workload to a completely different stream processing engine, Beam is one of those choices you can use. I also know that some of the built-in functionalities on streaming parts and edges are not transferable to another. I think most likely they're not. You are using certain built-in functionalities of those stream processing frameworks, which is not exposed with Beam. It's actually hard to migrate to a different one. I think Beam is a very good concept, but in reality, I feel it's actually more efficient to stick to the native stream processing framework?

Anand: It's a single dialect if the two engines below have the same functionality, but the two engines are diverging and adding their own, so having a single dialect is tough.

How do you handle retries in the event processing platform? Are there any persistence to the problematic events with an eventual reprocessing?

Wang: The retries, I'm sure what is that context? In both the event publishing stage, and event processing stage or consuming stage, you can do retries. Let's say in publishing stage, of course, when you try to publish something, the broker gives you a negative response, you can always retry.

Anand: What they're saying is, how do you handle retries in the event processing platform? The next one is, if you have a bad message, do you persist these away, like in a dead letter queue, with an eventual reprocessing of it once it's patched?

Wang: In the context of data processing or consuming the data, if you push something to the downstream and that failed, or you're processing something that failed, you can still do a retry, but I think dead letter queue is probably the ultimate answer. That you can push the data back to a different Kafka topic, for example, and have the application consumed again from that dead letter queue. You can add a lot of controls on like, how many times you want to retry, and how long you want to keep retrying. I think dead letter queue is essential if you really want to minimize your data loss.

Anand: Apache Kafka is a popular choice, especially for real time event stream processing, I'm curious of presenter's opinion on simpler event services such as Amazon SNS. How does one decide if you truly need something as powerful as Kafka, or perhaps a combination of SNS and SQS for added message durability? Could they be enough for a use case?

Wang: I think SNS as a queue has helped different kinds of use cases, they are more like point-to-point messaging system. Kafka on the other side, it's really helped emphasize on streaming. For example, in Kafka, you always consume a batch of messages and your offset commit is on the base level batch, not individual messages. Kafka has another great advantage where you can have different consumers consume from the same queue or same topic and it will not affect each other. Data fanout is a very good use case of Kafka. Having data fanout is a little bit difficult to deal with, with SNS and SQS.

Anand: I've used SNS, SQS together. SNS is topic based, but if a consumer joins late, they don't get historical data, so then you basically have like multiple consumer groups as SQS topics. Then each consumer group has to link to that SQS topic. What it didn't work for, for me, is Kafka, when the producer wants a durable, it doesn't matter how many consumer groups you add later, different apps. With SNS, SQS, you end up adding the SQS topic, or the SQS queue for another consumer group, but there's no historical data. I think that's the difference. Kafka supports that, SNS, SQS doesn't actually solve that problem.

Are there any plans in exploring Pulsar to replace Kafka? What are your thoughts on Pulsar if you already explored it?

Wang: We actually explored Pulsar in my previous job at Netflix. At that time, Pulsar was not super mature, it still had some stability issues. In the end, I think Kafka is still simple to use. I think Pulsar has tried to do a lot of things. It's tried to accomplish both, supporting stream processing and also supporting this point-to-point messaging. It tried to accommodate both. It becomes a bit more complicated or complex architecture, than Kafka. For example, I think Pulsar relies on another service or open source framework called BookKeeper, Apache BookKeeper. There's multiple layers of services in between. We think that deployment installation operation could be a headache, so we choose Kafka, we want to stick with Kafka for its simplicity.

Anand: You have the option of doing go time versus directly hooked in. I think that option is only at publish time, like producer time.


See more presentations with transcripts


Recorded at:

Jun 09, 2023