InfoQ Homepage Presentations Resilient Real-Time Data Streaming across the Edge and Hybrid Cloud

Resilient Real-Time Data Streaming across the Edge and Hybrid Cloud

Bookmarks

View Presentation

Speed:

45:34

Summary

Kai Waehner explores different architectures and their trade-offs for transactional and analytical workloads. Real-world examples include financial services, retail, and the automotive industry.

Bio

Kai Waehner is Field CTO at Confluent. He works with customers across the globe and with internal teams. Kai’s main area of expertise lies within the fields of Data Streaming, Analytics, Hybrid Cloud Architectures, Internet of Things, and Blockchain. Kai is a regular speaker at international conferences such as Devoxx, ApacheCon, and Kafka Summit.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Waehner: This is Kai Waehner from Confluent. I will talk about resilient real-time data streaming across the edge and hybrid cloud. In this talk, you will see several real world deployments and very different architectures. I will cover all the pros and cons so that you would understand the tradeoffs to build a resilient architecture.

Outline

I will talk a little bit about why you need resilient enterprise architectures and what the requirements are. Then cover how real-time data streaming will help with that. After that, I cover several different architectures, and use real world examples across different industries. I talk about automotive, banking, retail, and the public sector. The main goal is to show you a broad spectrum of architectures I've seen in a real world so that you can learn how you architect your own architecture to solve problems.

Resilient Enterprise Architectures

Let's begin with the definition of a resilient enterprise architecture, and why we want to do that. Here's an interesting example from Disney. Disney and their theme parks, Disney World in the U.S., had a problem. There was an AWS Cloud outage. What's the problem with the theme parks, you might wonder? The theme parks are still there, but the problem is that Disney uses a mobile app to collect all the customer data and provide a good customer experience. If the cloud is not available anymore, then you cannot use the theme park, you cannot do rides, you cannot order in the restaurants anymore. I think this is a great example why resilient architectures are super important, and why they can look very differently depending on the use case. That's what I will cover, show you edge, hybrid, and cloud-first architectures with all their pros and cons. In this case, maybe a little bit of edge computing and resiliency might help to make the customers happy in the theme parks.

There's very different reasons why you need resilient architectures. This is a survey, one of my former colleagues and friends Gwen Shapira did, actually the query is from 2017 but it's still more or less up to date. The point is to show there's very different reasons why people build resilient architectures. One DC might get nuked, yes, and disaster recovery is a key use case. Sometimes it's also about latency and distance or implications of the cost regarding where you replicate your data. Sometimes it also is legal reasons. There's many reasons why resilient architectures are required. If we talk about that, there's two terms we need to understand. It doesn't matter what technology you use, like we today talk about real-time streaming with Kafka. In general, you need to solve the problem of the RPO and the RTO. The RPO is the Recovery Point Objective. This defines how much data you will lose in case of a downtime or a disaster. On the other side, the RTO, the Recovery Time Objective means, what's the actual recovery period before we are online, again, and your systems work again? Both of these terms are very important. You always need to ask yourself, what do I need? Initially, many people say, of course, I need zero downtime and zero data loss. This is very often hard to architect, as we have seen in the Disney example before, and what I will show you in many different examples. Keep that in mind, always ask yourself, what downtime or data loss is ok for me? Based on that, you can architect what you need.

In the end, zero RPO requires synchronous replication. This means even if one node is down, if you're replicated synchronously, it guarantees that you have zero data loss. On the other side, if you need zero downtime, then you need a seamless failover. It's pretty easy definitions, but it's super hard to architect that. That's what I will cover. That also shows you why real-time data is so important here, because if you have replication in a batch process from one node to another, or from one cloud region to another, that takes longer. Instead, if you replicate data in real-time, even in case of disaster, you lose much less data with that.

Real-Time Data Streaming with the Apache Kafka Ecosystem

With that, this directly brings me to real-time data streaming. In the end, it's pretty easy. Real-time data beats slow data. Think about that in your business for your use cases. Many of the people in this audience are not business people, but architects or developers. Nevertheless, think about what you're implementing. Or when you talk to your business colleagues, ask yourself and your colleagues, if you want to get this data and process it, is it better now or later? Maybe later could be in a few minutes, hours, or days. In almost all use cases, it's better to process data continuously in real-time. This can be to increase the customer experience, to reduce the cost, or to reduce the risk. There's many options why you would do that. Here's just a few examples. The point is, real-time data almost always beats slow data. That's important for the use cases. Then also, of course, for the resiliency behind the architectures, as we discussed regarding downtime and data loss, which directly impacts how your customer experience is or how high the risk is for your business.

When we talk about processing data in real-time, then Apache Kafka is the de facto standard for that. There's many vendors behind it. I've worked for Confluent, but there's many other vendors. Also even companies that don't use Kafka, but another framework, they at least use the Kafka protocol very often, because it became a standard. While this presentation is about success stories and architectures around Apache Kafka, obviously, the same can be applied to other technologies, like when you use a cloud native service from a cloud provider, like AWS Kinesis, and so on, or maybe you're trying to use Apache Pulsar or whatever else you want to try. The key point, however, is that Kafka is not just a messaging platform, or ingestion layer, like many people think. Yes, it's real-time messaging at any scale, but in addition to that, a key component is the storage. You store all the events in your infrastructure, within the event streaming platform. With this, you achieve true decoupling between the producers and consumers. You handle the backpressure automatically, because often consumers cannot handle the load from the producer side. You can also replay existing data.

If you think about these characteristics of the storage for the decoupling and backpressure handling, that's a key component you can leverage for building a resilient architecture. That's what we will see in this talk about different approaches, no matter if you're at the edge, in the cloud, or in a hybrid architecture. In addition to that, Kafka is also about data integration with Kafka Connect, and data processing, in real-time, continuously, with tools like Kafka Streams, or KSQL. That in the end is what Kafka is. It's not just a messaging system. This is super important when we talk about use cases in general, but then also about how to build resilient architectures, because messaging alone doesn't help. It's about the combination of messaging and storage, and integration, and processing of data.

Kafka is a commit log. This means it's append only from the producer side. Then consumers consume it whenever they need or can, in real-time, in milliseconds, in near real-time, in seconds, or maybe in batch, or in an interactive way with a request. It's very flexible, and you can also replay the data. That's the strength of the storage behind the streaming log. With that, Kafka is a resilient system by nature. While you can deploy a single broker of Kafka, and that's sometimes done at the edge, but in most cases you'll deploy Kafka as a distributed system, and with that, you get the resiliency out of the box. Kafka is built for failure. This means a broker can go down, the network can go down, a disk can break. This doesn't matter because if you configure Kafka correctly, you can still guarantee zero downtime and zero data loss. With that, that's the reason why Kafka is not just used for analytics use cases, but also for mission critical workloads. Without going into more detail, keep in mind, Kafka is a highly available system. It provides other features like rolling upgrades and backwards compatibility between server and client so that you can continuously run your business. That's what I understand under a resilient system. You don't have downtime from maintenance, or if an issue occurs with the hardware.

Here's an example for a Kafka architecture. In this case, I'm showing a few Confluent components, obviously working for Confluent. Even if you use open source Kafka or maybe another vendor, you glue together the right components from the Kafka ecosystem. In this case, the central data hub is the platform, these are the clusters. Then you also use Kafka Connect for data integration, like to a database or an IoT interface, and then you consume all this data. However, the messaging part is not really what adds the business value, the business value is to continuously process the data. That's what stream processing does. In this case, we're using Kafka native technologies like KSQL and Kafka Streams. Obviously, you could also use something like Apache Flink, or any other stream processing engine for that. Then of course, there's also data sink, so Kafka Connect can also be used to ingest data into a data lake or database, if you don't need or don't want to process data only in real-time. This is then the overall architecture, which has different components that work together to build an integration layer, and to build business logic on top of that.

Global Event Streaming

With that then, if we once again think a little bit about bigger deployments, you can deploy it in very different flavors. Here's just a few examples of that, like replication between different regions, or even continents. This is also part of the Kafka ecosystem. You build one infrastructure to have a highly scalable and reliable infrastructure. That even replicates data across regions or continents, or between on-prem and the cloud, or between multi-cloud. You can do this with open source Kafka using MirrorMaker 2.0, or with some more advanced tools like at Confluent, we have cluster linking that directly connects clusters together on the Kafka protocol as a commercial offering, no matter what you choose for the replications. The point is, you can deploy Kafka very differently depending on the SLAs. This is what we discussed in the beginning about the RPO and RTO. Ideally, you have zero downtime, and no data loss, like in the orange cluster, but you stretch one cluster across regions. That's too hard to deploy for every use case, so sometimes you have aggregation clusters, where some downtime might be ok, especially in analytics use cases, like in yellow. With that, keep in mind, you can deploy Kafka and its ecosystem in very different architectures. It has always pros and cons and different complexity. That's what you should be aware of when you think about how to architect a resilient architecture across edge, hybrid, and multi-cloud.

Here's one example for that. This is now a use case from the shipping industry. Here now you see what's happening in the real world. You have very different environments and infrastructures. Let's get started on the top right. Here you see the cloud. In this case now, ideally you have a serverless offering, because in the cloud, in best case, you don't have to worry about infrastructure. It has elastic scaling, and you have consumption based pricing for that. For example, with Confluent Cloud in my case, again. Here you have all your real-time workloads and integrate with other systems. In some cases, however, you still have data on-prem. On the bottom right, you'll see where a Kafka cluster is running in the data center, connecting to traditional databases, old ERP systems, the mainframe, or whatever you need there. This also replicates data to the cloud and the other way back. Once again, as I said in the beginning, real-time data beats slow data. It's not just true for business, but also for these kinds of resiliency and replication scenarios. Because if you replicate data in real-time, in case of disaster, you're still not falling behind much and don't lose much data, because it's replicated to the cloud in real-time. Then on the other side, on the left side, we have edge use cases where you deploy a small version of Kafka, either as a single node like in the drone, or still as a mission critical cluster with three brokers on a ship, for edge analytics, disconnected from the cloud, or from the data center. This example is just to make clear, there's many different options how to deploy Kafka in such a real world scenario. It always has tradeoffs. That's what I will cover in the next sections.

Cloud-First and Serverless Industrial IoT in Automotive

Intentionally, I chose examples across different industries, so that you see how that is deployed. Also, this is now very different kinds of architectures with different requirements and setups. Therefore, the resiliency is also very different depending on where you deploy Kafka and why you deploy it there. Let me get started with the first example. This is now about cloud-first and serverless. This is where everybody's going. Ideally, everything is in the cloud, and you don't have to manage it, you just use it. That's true, if it's possible. Here is an example where this is possible. This might surprise some people. This is why I chose this use case. Actually, it is a use case where BMW is running event streaming in the cloud as a serverless offering on Azure. However, they are actually directly connecting to their smart factories at the edge. In this case, BMW's goal is to not worry about hardware or infrastructure if you don't need to, so they deploy in the cloud first, what's possible. In this case, they leverage event streaming to consume all the data from the smart factories, machines, PLCs, sensors, robots, all the data is flowing into the cloud in real-time via a direct connection to the Azure Cloud. Then the data is processed in real-time at scale. A key reason why BMW chose this architecture is that they get the data into the cloud once. Then they provide it as a data hub with Kafka to every business unit and application that needs access to it, tap into the data. With Kafka, and because it's also, as I discussed, a storage system, it doesn't matter what technology consumes the data. It doesn't matter what communication paradigms consumes the data. It's very flexible. You can connect to that from your real-time consumer, Kafka native maybe. You can consume from it from your data lake, more near real-time or batch. You can connect to it via a web service and REST API from a mobile app. This is the architecture that BMW chose here. This is very interesting, cloud only, but still connecting to data at the edge. The cloud infrastructure is resilient. Because it also has a direct connection to the edge, this works well from an SLA perspective for BMW.

Let me go a little bit deeper into the stream processing part, because that's also crucial when you want to build resilient architectures for resilient applications. When we talk about data streaming, then we get sensor data in all the time. Then we can build applications around that. An application can be anything, like in this case, we're doing condition monitoring on temperature spikes. In this case, we're using Java code with Kafka Streams. Here, you see every single event is processed by itself. That's pretty straightforward. This scales for millions of events per second. This can run everywhere where a Kafka cluster is. In this case, if we deploy that in a use case like BMW, this is running in the cloud, and typically, next to the serverless environment for low latency. There can also be advanced use cases where you do stateful processing. In this case, you do not just process each event by itself, but different events, and correlate them for whatever the use case is. In this case, we are creating a sliding window and continuously monitor the last seconds or minutes, or in this case, hour, to detect a spike in this case of a temperature for continuous anomaly detection. As you can see here, you're very flexible with use cases and deploy them. Still, it's a resilient application, because it's based on Kafka, with all the characteristics Kafka has. That's also true for the application, not just for the server side. This code, in this case, it's KSQL now, automatically handles failover. It handles latency issues. It handles disconnectivity, outages of hardware, because it's all built into the Kafka protocol to solve these problems.

Then you can even do more advanced use cases, like in this case, we've built a user defined function to embed a TensorFlow model. In this case, we are doing real-time scoring of analytic models applied against every single event, or against a stateful aggregation of events, depending on your business logic. The point here is that you can build very resilient real-time applications with that. In the same way, this is also true for the replication you do between different Kafka clusters, because it's also using the Kafka protocol. It is super interesting if you want to replicate data and process it even across different data centers.

Multi-Region Infrastructure for Core Banking

This was an example where we see everything in the cloud. However, reality is, sometimes resiliency needs even more than that. Sometimes you need multi-region deployments. This means even in case of disaster, a cloud goes down or a data center goes down, you need business continuity. This is in the end where we have an RTO of zero, where we have no downtime, and an RTO and RPO of zero, no data loss. Here's an example from JPMorgan. This is financial services. This is typically in most critical deployments with compliance and a lot of legal constraints so that you need to guarantee that you don't lose data even in case of disaster. In this case, JPMorgan deploys a separate independent Kafka cluster to two different data centers. Then they replicate the data between the data centers using real-time replication. They also handle the switchover, so if one data center is down, they also switch the producers and the consumers to the other data center. This is obviously a very complex approach, so you need to get that right, including the testing and all these things. As you can see the link here, JPMorgan Chase talked about it in 45 minutes, just about this implementation. It's super interesting to learn how the end users deploy such an environment.

However, having said that, this example is still not 100% resilient. This is the use case where we replicate data between two Kafka clusters asynchronously. In this case, it's done with Confluent Replicator, but the same would be true with Mirrormaker if you use open source. The case is, if there is disaster, you still lose a little bit of data, because you're replicating data asynchronously between the data centers. It's still good enough for most use cases, because it's super hard to do synchronous replication between data centers, especially if they are far away from each other, because then you have latency problems and inconsistency. Always understand the tradeoffs between your requirements and how hard it is to implement them.

I want to show you another example, however. Here now, we're really stretching a single Kafka cluster across regions. With this solution then, you have synchronous replication. With that, you can really guarantee zero data loss even in case a complete data center goes down. This is a feature of Confluent platform. This is not available in open source. This shows how you can then implement your own solution or buy something where you can then solve the problems that are even harder to do. In this case, how this works is that, as discussed, you need synchronous replication between data centers, to guarantee zero data loss. However, because you get latency issues and inconsistencies, then, what we did here, we provide the option so that you can decide which topics are replicated synchronously between the brokers within a single stretch Kafka cluster. As you see in the picture on the left side, the transactional workloads, that's what we replicate in a synchronous way, zero data loss even in case of disaster. For the not so relevant data, we replicate it asynchronously within the single Kafka cluster. In this case, here is data loss in case of disaster. Now you can decide per business case what you replicate synchronously, so you still have the performance guarantees, but still also can keep your SLAs for the critical datasets. This is now a really resilient data center. We battled tested this before the GA across the U.S., with U.S.-West, Central, and East. This is really across hundreds of miles, so not just next to each other. This is super powerful, but therefore much harder to implement, and to deploy. That's the tradeoff of this.

On the other side, here's another great open source example. This is Robinhood. Their mission is to democratize finance for all. What I really like about their use cases using Kafka is that they're using it for everything. They're using it for analytical use cases, and also for mission critical use cases. In their screenshots from their presentation, you see that they're doing stock trading, clearing, crypto trading, messaging and push notifications. All these critical applications are running in real-time across the Kafka cluster between the different applications. This is one more point to mention again, so still too many people in my opinion think about Kafka just for data ingestion into a data lake and for analytical workloads. Kafka is ready to process transactional data in mission critical scenarios without data loss.

Another example of that is Thought Machine. This is just one more of the examples to building transactional use cases. This is a core banking platform and a cloud native way for transactional workloads. It's running on top of Kafka, and provides the ecosystem to build core banking functionalities. Here, you see the true decoupling between the different consumers' microservices, and not all of them need to be real-time in milliseconds, you could also easily connect a batch consumer. Or you can also easily connect Python clients so that your data scientist can use his Jupyter Notebook and replay historical data from the Kafka cluster. There are many options here. This is my key point, you can use it for analytics, but also for transactional data, and resilient use cases like core banking.

Even more interestingly, Kafka has a transaction API. That's also what many people don't know. The transaction API actually is intentionally called Exactly-Once Semantics. Because in distributed systems, transactions work very differently than in your traditional Oracle or IBM MQ integration where you do a two-phase commit protocol. Two-phase commit doesn't scale, so it doesn't work in distributed systems. You need to have another solution. I have no idea how this works under the hood. This is what the smart engineers of the Kafka community built years ago. The point is, as you see in the left side, you have a transaction API to solve the business problem to guarantee that each message that is produced once from a producer is durable and consumed exactly once from each consumer, and that's what you need in a transaction business case. Just be aware that this exists. This has not much performance impact. This is optional, but you can use it if you have transactional workloads. If you want to have resiliency end-to-end, then you should use it, it's much easier than removing duplicates by yourself.

Hybrid Cloud for Customer Experiences in Retail

After we talked a lot about transactional workloads, including multi-region stretch clusters, there is more resilient requirements. Now let's talk about hybrid cloud architectures. I actually choose one of the most interesting examples, I think. This is Royal Caribbean, so cruise ships for tourists. As you can imagine, each ship has some IT data. Here, it's a little bit more than that. They are running mission critical Kafka clusters on each ship, for doing point of sale integration, recommendations and notifications to customers, edge analytics, reservation, loyalty platform. All these things you need to do in such a business. They are running this on each ship, because each ship has very bad internet connection, and it's very expensive. They need to have a resilient architecture at the edge for real-time data. Then, when one of the ships comes back to the harbor, then you have very good internet connection for a few hours, then you can replicate all the data into the cloud. You can do this for every ship after every tour, and in the cloud. Then you can integrate with your data lake for doing analytics. You can integrate with your CRM and loyalty platform for synchronizing the data. Then the ship goes on the next tour. This is a very exciting use case for hybrid architectures.

Here is how this looks like in more general. You have one bigger Kafka cluster. This is maybe running in the cloud, like here, or maybe in your data center, where you connect to your traditional IT systems like a CRM system, or a third party payment provider. Then also, you see in the bottom, we also integrate with the small local Kafka clusters, they are required for the resilient edge computing in the retail store or on the ship. They also communicate with the central big Kafka cluster, all of that in real-time in a reliable way, using the Kafka protocol. If you go deeper into this edge, like in this case, the retail store, or in the example from before from the ship, here, now you can do all the edge computing, whatever that is. For example, you can integrate with the point of sale and with the payment. Even if you're disconnected from the cloud, you can do payments and sell things. Because if this doesn't work, then your business is down. This happens in reality. All of this happens at the edge. Then if there's good internet connectivity, of course, you're replicated back to the cloud.

We have customers, especially now in retail in malls, where during the day, the Wi-Fi is very bad, so they can't replicate much. During the night, when there is no customers, then they replicate all the data from the day into the cloud. This is a very common cloud infrastructure or hybrid infrastructure. Kafka is so good here, because it's not just a messaging system, it's also a storage system. With this storage, you truly decouple the things, so you can still do the point of sale, but you cannot push it to the cloud where your central system is. You keep it in the Kafka log, and automatically when there is Kafka connection to the cloud, then it starts replicating. It's all built into the framework. This is how you build resilient architectures easily by leveraging the tools for that and not building that by yourself because it's super hard.

With that, we see how omnichannel works much better also with Kafka. Again, it's super important to understand that Kafka is a storage system that truly decouples things and stores things for later replay. Like in this case, we had a newsletter first, 90 and 60 days ago. Then, 10 and 8 days ago, we used our Car Configurator. We configured a car and changed it. Then at day zero, we're going into the dealership, and here the salesperson already knows all historical and real-time information about me. He knows real-time because it's a location with service with my app, I'm walking into the store. In that moment, the data about me is replayed from the Kafka log historical data. In some cases, even better, it's advanced analytics, where the salesperson even gets a recommendation from the AI engine in the backend, recommending a specific discount because of your loyalty and because of your history, and so on. This is the power of building omnichannel, because it's not just a messaging system, but it's a resilient architecture within one platform where you do messaging in real-time, but also storage for replayability, and data integration and data processing for correlation at the right context at the right time.

Disconnected Edge for Safety and Security in the Public Sector

Let me go even deeper into the edge. This is about safety critical, and cybersecurity, and these kinds of use cases. There's many examples. The first one here is, if you think about a train system. The train is on the rails all the time. Here, you also can do data processing at the edge, like in a retail store. This has a resilient architecture, because the IT is small, but it's running on computers in the train. With that, also, it's very efficient while it's resilient. You don't need to connect to the cloud all the time when you want to understand, what is the estimated time of arrival? You get this information once pushed from the cloud into the train, and each customer on the train can consume the data from a local broker. The same is when you, for example, consume any other data or when you want to do a seat reservation in the restaurant on the train. This train is completely decoupled from another train, so it can be local edge processing. Once again, because it's not just messaging, but a complete platform around data processing and integration, you can connect all these other systems, no matter if they're real-time, or a file based legacy integration, like in a train you often have a Windows Server running. That's totally flexible how you do that.

Then this is also more about safety and about criticality for cyber-attacks. Here's an example, where we have Devon Energy, in the oil and gas business. Here, on the left side, you see these small yellow boxes at the edge. This is one of our hardware partners where we have deployed Confluent platform to do edge computing in real-time. As you see in the picture, they're collecting data, but they're also processing the data. Because in these resilient architectures, not everything should go to the cloud, it should be processed at the edge. It's much more cost efficient. Also, the internet connection is not perfect. You do most of the critical workloads at the edge, while you still replicate some of the aggregated data into the cloud from each edge site. This is a super powerful example about these hybrid architectures, where the edge is often disconnected and only connects from time to time. Or sometimes you really run workloads only at the edge and don't connect them at all to the internet for cybersecurity reasons. That's then more an air-gapped environment. That's what we also see a lot these days.

With that, my last example is also about defense. Really critical and super interesting. That's something where we need a resilient streaming architecture even closer to the edge. You see the command post at the top, this is where a mission critical Kafka cluster is running for doing compute and analytics around this area in the military, for the command post. Each soldier actually also has installed a very small computer, which also is running a Kafka broker, not a cluster here, it's just a single broker, it's good enough. This is collecting sensor data, when the soldier is moving around. Like he's making pictures, he's collecting other data, whatever. Even if he's outside of the internet wide area, he can continue collecting data, because it's stored on the Kafka broker. Then when he's going back into the Wide Area Network, the Kafka automatically starts replicating the data into the command post. From there, you can do the analytics in the command post for this region. You can also replicate information back to Confluent Cloud in this case, where we collect data from all the different command posts to make more central decisions. With this example, I think it has shown you very well how you can design very different architectures, all of them resilient, depending on the requirements you have, on the infrastructure you have, and the SLAs you have.

Why Confluent?

Why do people work with us? Many people are using Kafka, and it's totally fine. It's a great open source tool. The point is, it's more like a car engine. You should ask yourself, do you want to build your own car, or do you want to get a real car, a complete car that is safe and secure, that provides the operations, the monitoring, the connectivity, all these things. Then, that's why people use Confluent platform, the self-managed solution. Or if you're in the cloud, and you're even more lucky, because there we're providing the self-driving car level 5, which is the only truly fully managed offering of Kafka and its ecosystem in the cloud. In this case, across all clouds, including AWS, Azure, Google, and Alibaba in China. This is our business model. That's why people come to us.

Questions and Answers

Watt: One of the things I found really interesting was the edge use cases that you have. I wanted to find out, like when you have the edge cases where you've got limited hardware, so you want to try and preserve the data. There's a possibility that the edge clusters or servers won't actually be able to connect to the cloud for quite some time. Are there any strategies for ensuring that you don't lose data by virtue of the fact that you just run out of storage on the actual edge cluster?

Waehner: Obviously, at the edge, you typically have limited hardware. It will of course depend on the setup you have. Like in a small drone, you're very limited. Last week, I had a call with a customer who was working with wind turbines. They actually expect that they are sometimes offline for a complete week. That really means high volumes of data. Depending on this setup, of course, you need to keep storage because without storage, you cannot keep it in a case of disaster, like disconnectivity. Therefore, of course, you have to plan depending on the use case. The other strategy, however, is then, if you really get disconnected longer than you expect, maybe not just a week in this case, but then really a month, because this is really complex to fix, sometimes in such an environment.

Then the other strategy is a little bit of workaround because even if you store data at the edge, in this case, there are still different kinds of value in data. In that case, you can embed a very simple rules engine by saying, for example, if you're disconnected longer than a week, then only store data, which is XYZ, but not ABC. Or the other option is instead of then storing all the data, in this case now we pre-process the data at the edge and only store the aggregations of the data, or filter it out in the beginning already. Because reality is in these high volume datasets, like in a car today, you produce a few terabytes per day, and in a wind turbine even more, and in these cases, anyway, most of the data is not really relevant. There is definitely workarounds. This again, like I discussed in the beginning, always depends on the SLAs you define, how much data can you lose? If a disaster strikes, whatever that is, what do you do then? You also need to plan for that. These are some of the workarounds for that kind of disaster then.

Watt: That's really interesting, actually, because I think sometimes people will think, I just have to have a simple strategy, like first-in, first-out, and I just lose all the last things. Actually, if you have that computing power at the edge, you can actually decide based on the context, so whether it's a week or a month, maybe you discard different data and only send up.

How good is the transactional feature of Kafka? Have you used it in any of your clients in production?

Waehner: People are always surprised when they see this feature, actually. I've worked for Confluent now for five years. Five years ago, this feature was released. Shortly after I started, exactly once a month, things were introduced into Kafka, including that transaction API. The API is more powerful than most people think. You can even send events to more than a single Kafka topic. You open your [inaudible 00:38:28], you say, send messages to topic one, to topic two, to topic three. Either you send all of them end-to-end, or none of them. It's transactional behavior, including rollback. It's really important to understand that how this is implemented is very different from a transactional workload, from a traditional system like an IBM MQ connecting to a mainframe, or to an Oracle database. That is typically implemented with two-phase commit transactions, a very complex protocol that doesn't scale well. In Kafka, it works very differently. For example, now you use idempotent producers, a very good design pattern in distributed architectures. End-to-end, as an end user, you don't have to worry, it works. It's battle tested. Many customers are using it, and so you really don't have to worry about that. Many customers have this in production, and this is super battle tested in the last years.

Watt: We've got some clients as well actually, that take advantage of this. As you say, it's been around for a little while. It has certain constraints, which you need to be aware of, but it certainly is a good feature.

Waehner: In general, data streaming is a different paradigm. If you're in an Oracle database, you typically do one transaction, and that's what you think about. If you talk about data streaming, you typically have come data in, then you correlate it with other data, then you do the business transaction, and you do more data. The real added value of the transactional feature is not really just about the last consumer but it's about the end-to-end pipeline with different Kafka applications. If you use transactional behavior within this end-to-end pipeline where you have different Kafka applications in the middle, then with transactional behavior, you don't have to check for duplicates in each of these locations, and so on. Again, because the performance impact is very low, it's really something like only 10%. This is a huge win if you build real stream processing applications. You really also not just have to think differently about how the transactions work under the hood, but also that a good stream processing application uses very different design patterns than a normal web server and database API application.

Watt: Another question is asking about eventual consistency and real-time, and what the challenges are that you see from that perspective.

Waehner: That's another great point because of eventual consistency. That's, in the end, the drawback you have in such a distributed system. It depends a little bit on the application you build. The general rule of thumb is that also, you simply build applications differently. In the end, it's ok if you do receive sometimes a message not after 5 milliseconds, but sometimes if you have a spike of p99, and it's 100 millisecond, because if you have built the application in the right way, this is totally ok for most applications. If you really need something else, there is other patterns like that. I actually wrote a blog post on my blog about comparing the JMS API for message queues compared to Kafka. One key difference is that a JMS API, out of the box, provides a request-reply pattern. That's something what you also can do in Kafka, but it's done very differently. Here again, there's examples for that. Also, if you check my blog, there's a link to the Spring Framework, which uses the Spring JMS template that even can implement synchronous request-reply patterns with Kafka.

The key point here to understand is, don't think about your understanding from MQ, or ESPs, or traditional transactions and try to re-implement it with Kafka, but really take a look at a design pattern on the market, like from Martin Fowler, and so on for distributed systems. There you learn, if you do it differently, you build applications in another way, but that's how this is expected then. Then you can build the same business logic, like in the example I brought up about Robinhood. Robinhood is building the most critical transactional behavior for trading applications. That's all running via Kafka APIs. You can do it, but you need to get it right. Therefore, the last hint for that, the earlier you get it right in the beginning, the better. I've seen too many customers that tried it for a year by themselves, and then they asked for review. We told them, this doesn't look good. The recommendation is really that in the early stages, ask an expert to review the architecture, because stream processing is a different pattern than request-reply, and then transactional behavior. You really need to do it the right way from the beginning. As we all know, in software development, the later you change something or have to change it, the more costly it is.

Watt: Are there cases actually where Kafka is not the right answer? Nothing is perfect. It's not the answer to everything. What cases is this not really applicable?

Waehner: There's plenty of cases. It's a great point because many people use it wrongly or try to use it wrongly. A short list of things. Don't use it as a proxy to thousands or hundreds of thousands of clients. That's where REST proxy is good, or MQTT is good, or similar things are good. Kafka is not for embedded systems. Safety critical applications are C, C++, or Rust. Deterministic real-time, that's not Kafka. That's even faster than Kafka. Also, Kafka is not, for example, an API management layer. That's big if I use something like MuleSoft, or Apigee, or Kong these days, the streaming engines get in the direction. Today, if you want to use API management for monetization, and so on, that's not Kafka. There is a list of things it does not do very well. It's not built for that. I have another blog post, which is really called, "When Not to Use Apache Kafka." It goes through this list in much more detail, because it's super important at the beginning of your project to also understand when not to use it and how to combine it with others. It's not competing with MQTT. It's not competing with a REST proxy. Understand when to use the right tools here and how to combine them.

See more presentations with transcripts

Recorded at:

Jan 27, 2023

Kai Waehner

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?