
Event-Based Architectures: the Hard Parts



Raymond Roestenburg and Sergey Bykov discuss event-driven architectures and some of the challenges they present.


Raymond Roestenburg is technical lead at Akka Platform Team @lightbend. Sergey Bykov is SDE @temporalio.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Breck: I work in real-time systems, and have for a while. Astrid's talk discussed thinking of the electrical grid as a distributed system, and how we are going to manage all these renewable devices on the electrical grid. That's inherently a real-time problem. Our shift to these really event-based real-time systems is out of necessity. It's not just a fad. The same thing in the work that I do. If we think of ride sharing, or if we think of financial transactions, these things are all very real-time, event-based systems. We've had a lot of success in the transition to these systems. Often, we don't hear about some of the real hard parts, especially in living with these systems over time. That's what I want to focus this panel discussion on.


I'm Colin Breck. I lead the cloud platforms team at Tesla, in the Energy Division, and mainly focused on real-time services for battery charging, and discharging, solar power generation, and vehicle charging.

Bykov: I'm Sergey Bykov. Currently I work at Temporal Technologies, a startup that is focusing on building a workflow engine for code-first workflows to automate a lot of things. Before joining Temporal, I spent like forever at Microsoft, in groups ranging from servers to embedded, eventually going into cloud and building cloud technologies with an angle on actors and streaming, highly biased towards gaming scenarios, where event-based is very important.

Roestenburg: My name is Raymond Roestenburg. I'm a tech lead at Lightbend. I'm a member of the Akka team, where I work on Akka Serverless, Akka Platform, and related technologies. I specialize in event-based systems, highly scalable distributed systems, and data streaming. I've been working with event-based streaming for about 15 years. I counted the other day, I've been using the JVM for 25 years or so. That's a quarter of a century. Also, I wrote a book on Akka, "Akka in Action." There's a second edition now available for early access. That's what I do nowadays.

Batch Systems and Real-Time Systems

Breck: I think there was lots of talk about the Lambda architecture a few years ago, where we would have parallel batch systems and real-time systems. There's a lot of tension in that. Maintaining data models and code that does aggregations or these kinds of things in two parallel systems is a real burden. Even answering certain queries across those two systems becomes a huge problem. I think there was maybe a notion that, as we move to purely event-based systems, purely real-time systems, we can do away with the batch system. In my experience, that isn't the case. We're still living with batch systems. Do you also see that? Do you have opinions on how batch systems and real-time systems are maybe complementary?

Roestenburg: I think they are complementary, in many cases. I think Lambda in some shape or form is still used. People obviously want to move away from the complexity of these systems. For instance, in machine learning, it makes a lot of sense to have training on your larger historical data, and then do the inference or scoring on the streaming information that passes by. That's a very useful way of using a Lambda-like architecture. There's obviously, as we know, lots of work in trying to figure out a better architecture that is simpler to use and that removes some duplication in the code. There is the well-known and much written about Kappa architecture, which is very often used with Kafka, and there is a Delta architecture, which is something that Databricks is doing with Delta tables. They are basically making it very easy for you to do upserts on top of very large historical data: upserting real-time information on top of existing data.

At the same time, it also depends on how you look at streaming. You can look at streaming from an in-application operation, or as a data oriented ETL pipeline. There's quite a few different areas in which you could use streaming, of course. If we're talking about streaming between microservices, for me, it's quite different from the streaming case where you're doing ETL and you're basically purifying data, and doing ML Ops or machine learning type systems.

Bykov: I generally agree. I think these are complementary scenarios, and they come from optimizing for different things. Batch systems have always been much more efficient. Granular, per-event processing is more expensive, but the latency goes way down. It can be much more contextualized. It can be more rapid in reacting, whether it is a device operation, some alerts, or a financial transaction like fraud prevention. Another factor I see is that there are different kinds of tooling. For people that do Hadoop, Hive queries, or machine learning, it's a different skill than extending queries in more real-time systems. Again, they're complementary.

None of these technologies die. I was at Costco the other day, resolving some problem, and I saw they still use an AS/400 for the backend. I'm sure Lambda will stay for a long time. I think Kappa may be more like what I've been seeing recently, where there is some ingestion layer where things go through the streaming system, but then they diverge: one fork goes into more real-time processing, and the other goes literally into blob storage to create batches for the batch system. Some cross between Lambda and Kappa, I think, is what's more popular.

Data and Code Governance in Batch and Event-based Systems

Breck: I definitely see in our work, things like late reporting IoT devices, and doing certain types of aggregation, say like a daily rollup and these kinds of things, are just much easier in batch systems, or more reliable in batch systems, easier to iterate on in batch systems than in purely event-based systems. I think a lot of the tension there is how do you maintain data models or code that maybe you're doing aggregations or derivations and those kinds of things? Do you have experience or have you seen people having success maybe managing data governance or code governance across systems when they're doing one thing in an event-based system, in a real-time system, but then also bringing that maybe to their data warehouse or their batch systems?

Roestenburg: What I've seen so far is that those are very often separate systems. It's easier to do it that way. Very often in the real-time processing, because of what you just said, the possibilities for aggregation are different. If you have batch, you can do more query-like things. There's far more state available. The moment you are doing stateful processing, it makes a lot of sense to have batch or micro-batch, for instance, Spark on top of Parquet, those kinds of systems. On the real-time side, you keep stuff in memory, but it's more transient. You're processing, you keep some information, but you can't bring the whole world in there because then it will take too long to stay in real-time. The actual logic will be different. Even though it might seem like you're duplicating, what's being duplicated is very often the schemas and the data formats. There is a need to combine both the real-time and the historical data, so there's a similarity there. I think the logic is very often quite different.

Bykov: I agree with that. In developer experience, if you make a change, speaking about schema evolution and evolution of systems, for real-time you have different constraints with rollout of the new version of your streaming system. You cannot usually go much further back and reprocess. While in the batch system, you can say, let's go a month back, or even a year in some cases, and reprocess everything.

The other difference I saw is that real-time systems are much less concerned with data protection and governance laws. In the batch system, you have to think upfront how you're going to delete this data, or how you're going to prove that you handle it correctly. Real-time systems are more ephemeral: you have aggregates, or some products persisted, so there are fewer objects to hold.

Breck: I think this is a particularly interesting subject area, actually, you have a lot of decisions being made in real-time systems that then aren't necessarily represented historically. You can imagine in some operational technology, where an operator or an algorithm makes a certain decision based on real-time data, but that's actually not persisted. You can't actually audit or evaluate how the system made its decision, I think it's actually quite interesting.

State of the Art for Workflow Management in Event-based Systems

Breck: One of the challenges I keep seeing over and over again in event-based systems is that there's a chain of events that happen when we get an event. You can see that in Astrid's talk as well, there's a time component to things. It's not just that we've made a decision to discharge a battery, but we want it to follow this profile over time, and maybe have other triggers in the system that are dependent on it. It blows up into, essentially, a workflow. There's a workflow defined through a series of events that trigger it. Of course, there are implications there for state management, whether that's state of devices, or whether that's some notion of timing and retries, or giving up on things, or event windowing, the watermarks that are used in streaming systems, these things. I know that's close to your heart, Sergey, and what you have been working on recently. What do you think the state of the art is for workflow management in event-based systems?

Bykov: I'm not sure I can claim something as state of the art. I think what you're pointing to is the key question, because data needs to be processed in context. That context is a user or a device, some session. Within that context, you have recent history, you have timing, which is much harder to reason about in the context of millions of users or devices. This is where you trade the efficiency of batching for less efficient processing that is much easier to understand, reason about, and evolve. You have watermarks and some events that happened a certain time ago, and this is where you get into what you can call a workflow or business process, which is some set of rules that say, ok, if this happens three times, and then if the time is more than x, do this action. Then you can change that logic. Somebody reviewing this change will understand, ok, we just did it this way. Maybe I'm biased towards more real-time, more contextualized, small-scale processing versus batch, but that's how I'm thinking about it.
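The kind of rule Bykov describes can be written as ordinary code that keeps a little state per context. What follows is a minimal Python sketch under one assumed reading of the rule: fire when the same context reports three events inside a time window. The class name, the parameters, and the idea of resetting the count after firing are all hypothetical choices made for illustration, not something the panel specified.

```python
from collections import defaultdict, deque


class RepeatedEventRule:
    """Hypothetical rule: fire when one context (a device, a user, a session)
    reports `threshold` events within `window_seconds`."""

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        # Recent event timestamps, kept separately for each context.
        self.recent = defaultdict(deque)

    def on_event(self, context_id, timestamp):
        """Record one event; return True when the rule fires for this context."""
        times = self.recent[context_id]
        times.append(timestamp)
        # Forget events that have fallen out of the time window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # assumed behavior: start counting again after firing
            return True
        return False
```

Because the rule is plain code, changing the threshold or the window is a small, reviewable diff, which is the property being argued for here.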

Developer Experience in Event-based Workflows

Breck: Maybe talking about the developer experience there is like, there are certain engines for managing workflows. I haven't seen developers get that excited about using these and they often become quite constraining after a certain point. You get a POC, it is ok. Especially if you're driving real-time services, like imagine bidding into the energy market and these kinds of things. Those workflow engines haven't served people that well. Then I've also seen people that don't even go near that stuff, and they're essentially building their own workflow engine from scratch, and dealing with all these hard problems. How do you see that evolving? Do you think the developer experience in terms of event-based workflow will get better and we'll be building on top of platforms that serve those needs?

Bykov: This is a hot topic, actually. I think the contention is between code-first versus declarative or data-driven workflow definitions. I think I'm squarely in the code-first camp because there, the developer writes code, and then can reason about and debug the code and see what's going on, versus things that are defined in some simple XML, like XAML, or YAML, all those things. They look simple, like you said. In the POC, we have three steps, and then add more. Then they evolve over time, and before you know it, we have this huge blob of some ad hoc DSL, and it's very difficult to figure out what went wrong. Again, developer experience, and especially dealing with failures, I think, is paramount and is often underestimated when people get into this, "I'll just write some DSL and handle it." Then two years later, somebody who inherited this system cannot manage it. It's like, leave it as-is. My bias is squarely towards code-first, or code-based, to be developer friendly.

Roestenburg: For me as well, I've always been in the code-first camp. I would agree with you there, Sergey. I've seen a lot of systems where people start using a few Kafka topics at first, and then they get more Kafka topics. Then over time, it becomes very difficult to track where everything goes. You can obviously add all kinds of logging, all kinds of tracing stuff. It's definitely not a simple thing. Being able to easily debug and have a good developer experience is, I think, something that the future hopefully brings us.

Bykov: On the very same visit to Costco when I saw the AS/400, I ran into an old friend of mine who is working at this huge payment processing company, and he was telling exactly the story you were mentioning. He's reviewing the architecture of systems and services when people ask what's wrong. He sees this a lot: it starts with a Kafka topic, and then gets into a dead-letter queue. Then there are three layers of dead-letter queues and some ad hoc logic to deal with that. Because yes, it's simple at the start, before you get into these real-life cases.

Dealing with Complexity in Event-based Systems

Breck: I'm definitely dealing with this, you transition fully into these event-based architectures, often largely around some messaging system at the core. Each new problem you have, like wanting a dead-letter queue or something like that, you produce back to Kafka, something like that. Then you tack on one more layer. It reminds me a little bit of the third talk in the track of this microservice sprawl comes first. Then you need to figure out solutions for data governance, for federation, for solving the hard problems. Yes, I'm not sure we've quite got there yet with event-based systems. How do you see that evolving maybe? Say, a company who has standardized on Kafka, and it's pretty easy to create a new service against it or create a new topic and eventually that becomes fairly intertwined. It's hard to tell where events are going. It's hard to change the system over time. Are there companies that are doing that well? What are the techniques they're using?

Roestenburg: There are different things that you can do, though, eventually, you have to build up a picture of what's going on. You can imagine a whole bunch of topics with processing parts in between, and you need to understand where things go. One of the things that you can do is, in the metadata or the header information, similar to CloudEvents, for instance, put contextual information about what the message was, but you would still have to extract it from every topic and then build up a bigger picture. You can do the same thing with log tracing systems, where you can see which components called which. This is definitely not easy to do. You have to start doing it from the beginning. This is one of the difficult issues with event-based systems: from the beginning, you have to build things in, because otherwise you can't see later on what's going on.
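As a rough illustration of carrying context in message headers and rebuilding the bigger picture afterwards, here is a minimal Python sketch. The header fields `trace_id` and `path` and all three function names are hypothetical; they are not CloudEvents attributes or Kafka APIs, and the sketch only handles a linear chain of processors, not fan-out.

```python
import uuid


def new_headers(source):
    """Headers for a message entering the system (field names are hypothetical)."""
    return {"trace_id": str(uuid.uuid4()), "path": [source]}


def forward_headers(headers, processor):
    """Headers for an outgoing message: same trace, with this hop appended."""
    return {"trace_id": headers["trace_id"], "path": headers["path"] + [processor]}


def build_lineage(observed_headers):
    """Collect headers read from every topic and keep the longest path per trace.

    This only reconstructs a linear chain; fan-out would need a graph.
    """
    lineage = {}
    for headers in observed_headers:
        known = lineage.get(headers["trace_id"], [])
        if len(headers["path"]) > len(known):
            lineage[headers["trace_id"]] = headers["path"]
    return lineage
```

Each processor would call `forward_headers` before producing, and a separate job would feed the headers it reads from all topics into `build_lineage`. A message that is lost simply shows up as a path that stops early.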

For instance, one very simple thing, if you have a processor that reads from a topic and writes to another topic, when you're processing from that topic, and you get any corruption, basically the only option you have is to drop the message and to continue, because otherwise your system is stuck, you have a poisoned queue. The dead-letter queues that you mentioned, very often are not very useful, because it was corrupted data. How are you going to read this? It needs a lot of human intervention. You would have to go to previous versions of code, see what it was maybe. By that time your system is already a lot further. These interactions become very difficult.
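The "drop the message and continue" option can be sketched in a few lines. This is a minimal Python illustration, assuming messages are JSON-encoded bytes and using a plain list as a stand-in for a topic; `process_topic` and the dead-letter record layout are hypothetical names chosen for the example.

```python
import json


def process_topic(raw_messages, handle):
    """Apply `handle` to each decodable message and set aside the ones that are not.

    Returns (outputs, dead_letters). A corrupt message never blocks the
    messages behind it.
    """
    outputs, dead_letters = [], []
    for offset, raw in enumerate(raw_messages):
        try:
            event = json.loads(raw)
        except (ValueError, TypeError) as err:
            # Keep the raw payload, the offset, and the error for a human to inspect.
            dead_letters.append({"offset": offset, "raw": raw, "error": str(err)})
            continue
        outputs.append(handle(event))
    return outputs, dead_letters
```

The dead-letter entries here still hold payloads that could not be decoded, which is exactly why they tend to need human intervention later.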

Yes, I think building up more context per message, and keeping this context with the message, either inside the message itself, in the domain itself, or as metadata on the message, can help you to later on create a picture of what's going on. As you can understand, the moment you lose a message, that's something you don't see. It's very difficult to then see where it actually went. I haven't seen any stream lineage tools or something that can automatically trace back to where things began and how they fanned out. I haven't seen that so far.

Bykov: We see a bunch of things with customers, where we started referring to them as a queues and duct tape architecture. The queues by themselves are great, but there is this challenge of when there's a poisoned message, or just pure failure to talk to another system, which may be down, or maybe there's some interruption, and then retries, exponential backoff, backpressure. This set of concerns is pretty standard when you go from one stage to another or from one microservice to another. The challenge there is to have a simple to use, unambiguous set of tools for developers, which make it very clear what the behavior of the system will be without writing code for this duct tape. Have the duct tape be part of the system, and then you can reason about what's going on. You can trace what's going on by looking, in a standardized, unified way, at the progress of this processing, within context. I think context comes first.
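Retries with exponential backoff are a typical piece of that duct tape. Here is a minimal Python sketch of the idea; `call_with_backoff` and its parameters are hypothetical, and a production version would usually add jitter and retry only on errors known to be transient.

```python
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The delay doubles after each failure and is capped at `max_delay`.
    `sleep` is injectable so the behavior can be checked without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller decide what to do
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

When every service writes its own variant of this, the behavior of the whole pipeline becomes hard to predict, which is the argument for making it a standard part of the platform.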

Visibility into State Management & Workflow Progress for Millions of Instances

Breck: Maybe back to that context, let's maybe think of an event-based workflow that has lots of this context, maybe applied to an IoT scenario, when you have millions of them. How do we get visibility into state management, workflow progress, those kinds of things, when that is defined not as a single instance, but say millions of them?

Bykov: It depends on the scenario. In some scenarios, like in IoT, there may be transient state, and that's ok because there'll be a subsequent update, maybe a minute later, that will invalidate the previous state anyway. It's not important to alert in this case. Telemetry is another example, where there may be some hiccup in some metrics, and generally at high volume it's ok because they keep being reissued. There are different scenarios, like payment transactions, where it's absolutely unacceptable to fail to handle them, and then you need to alert. Then you need to stop processing, and maybe even page a developer or an engineer to go figure it out.

Using CloudEvents to Solve Metadata Problems

Breck: Are you seeing anybody use CloudEvents to solve the metadata problems you're just talking about of traceability of events, or correlating events to certain actions?

Bykov: I don't. Maybe people do that, but I have not myself seen that.

Roestenburg: Yes, we are using CloudEvents in Akka Serverless, for instance, and there is some additional information that we keep, but it's not necessarily meant, at this point in any case, to be extended for this information. You could use Kafka's normal headers as well. It's not necessarily CloudEvents, you could use anything. Although it is handy to have a particular format that you keep for all your stuff, whatever you're doing, so you can very easily inspect it: ok, if I put everything in CloudEvents, I can very easily see at least some of these metrics, some of these headers. The other thing you could obviously do is, in your stateful processing, keep information about what's going on and then provide that in telemetry again. It's very much case by case, depending on what you want to do.

Bykov: I think telemetry is key. If you emit enough telemetry, then you can separate systemic issues versus some one-off things, and then you can decide where you want to invest time.

Breck: That's been my experience as well. I think CloudEvents sounds very interesting, but I haven't seen much adoption of it. I've seen more of what you're mentioning, Ray, with using Kafka headers, or propagating request IDs, even all the way through Kafka. You can even tie a gRPC or an HTTP request to an event that ended up in Kafka and got processed later on to do something.

Roestenburg: You could do it inside your messages or outside as a header, it depends. The nice thing about putting it in the headers is that they are always readable, so you wouldn't get a corrupt header unless something really horrible happened to Kafka. That's the benefit of using these envelopes.

Breck: I've seen people wanting to do transactional things, actually using the Kafka offsets and propagating those as metadata. Especially when you route an event to another topic, and then to another topic, because if you maintain the original offset, then you can actually do some transactional things against that, which is interesting.

Roestenburg: We've done that, but mostly within the service that is processing the topic. Where you do it is in your writes to the output. When you write to any output, even to a database, in the same transaction you write the offset where you left off, so in that way you can do effectively-once processing within that service. It's very hard. I don't know how to do that over many services. One of those hops is fine, and so you can build these services that can die at any time and continue where they left off. You might have to deal with some duplication. In the case where you write the offsets to the output where you're writing, if that has a transactional model, then you can basically be effectively-once for processing.
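The technique of writing the offset in the same transaction as the output can be sketched with SQLite from the Python standard library. This is a minimal illustration under stated assumptions: a Python list indexed by offset stands in for the topic, and the table names `totals` and `progress` and the function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the output table and the offset table in the same database."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS totals "
               "(key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS progress "
               "(topic TEXT PRIMARY KEY, next_offset INTEGER NOT NULL)")
    db.commit()
    return db


def next_offset(db, topic):
    """Offset of the first record that has not been applied yet."""
    row = db.execute("SELECT next_offset FROM progress WHERE topic = ?",
                     (topic,)).fetchone()
    return row[0] if row else 0


def process(db, topic, log):
    """Apply every unprocessed (key, amount) record from `log`."""
    for offset in range(next_offset(db, topic), len(log)):
        key, amount = log[offset]
        with db:  # one transaction: commit on success, roll back on error
            cur = db.execute("UPDATE totals SET total = total + ? WHERE key = ?",
                             (amount, key))
            if cur.rowcount == 0:
                db.execute("INSERT INTO totals (key, total) VALUES (?, ?)",
                           (key, amount))
            # The offset is stored in the same transaction as the output.
            db.execute("INSERT OR REPLACE INTO progress (topic, next_offset) "
                       "VALUES (?, ?)", (topic, offset + 1))
```

If the service dies between records, restarting `process` resumes from the stored offset, so within this one service no record is applied twice. As noted in the discussion, this does not extend across service boundaries.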

Bykov: Same here, I've seen it within the scope of a service, but crossing service boundaries usually requires a different mechanism.

Developer Experience and Ecosystems

Breck: We're going to talk a bit about developer experience and ecosystems. A lot of event-based systems, they have their own libraries, or they have their own developer experience. I think of Kafka and Kafka streams and the systems people build around that. Most of my experience in the last five years has been using Akka Streams, even around Kafka, because I can interface with all sorts of different systems, and I'm not just tied into one ecosystem. Do you think that's a model that will continue to evolve in libraries in lots of different languages, or do you think there are certain advantages of having ecosystem specific libraries? Do you have opinions there? What are you seeing develop?

Bykov: My opinion here is that what's most important is establishing patterns, and Kafka did establish one. It's essentially the partitioned log. It's not even a queuing system. For example, when I had to explain years ago what Azure Event Hubs is, it's essentially hosted Kafka. It's the same thing, like Kinesis. Once this pattern is established, people can use different services like that in the same manner. Then switching from library [inaudible 00:29:46] is speaking the same language. In terms of abstractions, I think it's easy. Trying to unify it into one layer that hides everything, I think that's dangerous. I think we made that mistake early in streaming, back in the days when we provided a single interface for queuing systems with very different semantics, like log-based versus SQS or Azure Queue versus even in-memory, like TCP direct messages. I think that's a dangerous thing to try to hide.

Roestenburg: I definitely agree with that. I think it's very good to take advantage of these very specific libraries. Streaming is not just streaming, there are so many different things you can do. You can do what you said with Akka Streams, which obviously I like. It's a particular kind of in-application streaming that you do there. The moment you want to do data oriented streaming, it's a whole different thing. For instance, Akka Streams doesn't have things like windowing and watermarks. It's possible to build those things, but there's far better support in Spark or Flink, which are completely different models, so you basically need specific libraries. I don't think its place is inside a programming language. It's a danger as well to expect that of a programming language. It's very specific needs, if you look further than just memory constraints, processing of data.
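To make windowing and watermarks a little more concrete, here is a minimal Python sketch of a tumbling event-time window that closes when a watermark passes its end. It is a simplified, single-process illustration, not how Spark or Flink implement it; the class name, the fixed `allowed_lateness` watermark strategy, and the choice to simply count and drop late events are all assumptions made for the example.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per fixed-size event-time window, closing windows by watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = None
        self.dropped_late = 0

    def add(self, event_time):
        """Add one event; return the (window_start, count) pairs that just closed."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Assumed strategy: the watermark trails the newest event time seen.
        watermark = self.max_event_time - self.allowed_lateness
        start = event_time - event_time % self.window_size
        if start + self.window_size <= watermark:
            self.dropped_late += 1  # its window has already been emitted
        else:
            self.open_windows[start] += 1
        closed = sorted(s for s in self.open_windows
                        if s + self.window_size <= watermark)
        return [(s, self.open_windows.pop(s)) for s in closed]
```

Even this small version has to decide what to do with late data and when a result is final, which hints at why dedicated engines have far better support for it.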

Apache Beam

Breck: Are you seeing any use of Apache Beam as the model for solving some of those problems?

Roestenburg: I sadly have not. I've always been hoping to see Apache Beam being used more. I do see things like Flink. The actual standards of Apache Beam being used, is not something I actually know. I was expecting it. That's a very large standard for these things.

The Evolution of Developer Experience for Building Event-based Systems

Breck: Let's talk maybe about the evolution of these systems. I think a lot of event-first systems, streaming systems, people have built those. I have lots of experience with actors over the last many years. I know both of you have a lot of experience with actor-like systems. I know, Sergey, you wrote an article saying it's such a loaded term, it's hard to use sometimes. Most of what I build, we run it ourselves. We're building these distributed, stateful actor systems that model entities, and we're building it ourselves. It seems like, more recently, we're starting to see this evolution on top of cloud platforms with Azure Durable Functions, or Cloudflare Durable Objects. It seems like this location transparent, stateful actor model is almost becoming the way of programming in the cloud in the future. Do you agree? What do you think that evolution is in terms of developer experience for building event-based systems?

Bykov: That's been my belief that it would come to that serverless hosted, cloud hosted systems based on the actor or context oriented computations. I think it's inevitable because there's this data gravity. Yes, you can deploy and run your own replicated databases in the whole estate, but most people don't want to. They want to outsource this problem. When it comes to hosting your compute, of course, we can all run on Kubernetes clusters. Do we want to, becomes the question if it's cost effective to run it in the cloud provider. Then before you know, you outsource most of this platform, you can start to focus on the actual application. I think that's inevitable for a lot of scenarios, if not for most, that would be much easier, more economical to just run it in the cloud.

Roestenburg: In all cases, you want to focus on the business. A lot of these systems, which are wonderful, they are also very hard to run. Kafka is quite notorious in this respect. You can get these distributed logs, Pulsar, Kafka, you can get them run for you. Nowadays even with infinite storage, tiered storage, those things are very hard to set up yourself. Even if you would be able to manage a Kafka cluster, how are you going to set up all of this extra stuff that you need to not run out of space? A similar thing you can see with Google Spanner where there's also an infinite amount of storage. There, I think it's very interesting that people love transactions in the sense of simplicity. That comes back as well in what Databricks is doing with the Delta Lake stuff. I think people are moving towards, ok, it's very interesting, all these technologies, but I want to focus on a simple model, and have a lot of the Ops taken care of for me, instead of having to deal with all these complexities.

Bykov: One other thing I forgot to mention is compliance. If you host it yourself, getting SOC 2 compliance, or other compliance, is just a pain. If it's outsourced, then you just present their SOC 2 documents and you're good.

Roestenburg: Even security to some extent will be better arranged than what you are going to do yourself. Although there's still obviously always a big open problem there.

The Saga Pattern

Breck: Maybe back to workflows. I hope most people are familiar with the Saga pattern, the notion of compensating transactions and taking care of things when they fail. Is that actually being successfully applied? My impression is that most people that try this, learn that it's actually really difficult to apply the Saga pattern correctly. That there is really no such thing as rolling back the world. You can compensate in certain ways sometimes, but you can't go backwards.

Bykov: I'm not sure I agree with you actually. Yes, it's compensation. You're not rolling the clock back. In the classic example of reservations, yes, you can cancel the other reservations if one of them failed or got declined. I think there is nothing wrong with Saga as a pattern. I think Saga is a restricted case of a general workflow. It's like, what do you want to do if this step in your business logic fails? Saga just says, you unwind what you allocated before. In a more general case, you might want to do something else. I don't see that as necessarily a special, separate case that's different from a normal workflow.

Breck: What I've seen is depending on the state that you arrive in you can't actually compensate. The reason you can't move forward is also the reason you can't compensate. It's almost like head-of-line blocking again, you're stuck.

Bykov: That's fine. Because in this case, like in a general workflow logic, you need to have a step or a case for notifying. Saying, this thing needs to be manually handled, or call manager, or something. It's a more general escape hatch for something that did not progress automatically on the predefined simple logic.
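The exchange above, compensation plus an escape hatch for the cases that cannot be compensated, can be sketched as follows. This is a minimal Python illustration; `run_saga`, the step tuple layout, and the `notify_operator` callback are hypothetical names, and real workflow engines also persist progress so a saga survives a crash, which this sketch does not.

```python
def run_saga(steps, notify_operator):
    """Run (name, action, compensate) steps in order.

    If an action fails, undo the completed steps in reverse order. A
    compensation that fails as well is handed to `notify_operator`
    for manual handling. Returns True if every step succeeded.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for done_name, undo in reversed(completed):
                try:
                    undo()
                except Exception as err:
                    # Escape hatch: this step could not be unwound automatically.
                    notify_operator(done_name, err)
            return False
        completed.append((name, compensate))
    return True
```

In the reservations example, a declined car rental would cancel the hotel and then the flight, and a hotel cancellation that itself fails would be routed to a person instead of blocking the rest of the unwinding.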

Roestenburg: I think one of the issues is that you really have to keep it in the business domain, in the logic of the business. Modeling things as Sagas purely in the business domain makes sense, because you can have these different scenarios, you compensate or you continue. The really tricky part is that you might be tempted to think you could use Sagas for intermittent errors, or technical errors. The service wasn't there, and then you might have already called it. You might have already reserved a table, you just didn't get a response back. On the business level, where you have more context, you can then say, this workflow ends because something went wrong and I'm just going to get someone to look at it. If you try to solve technical problems with the Saga, then you're going to have a big problem.




Recorded at:

Apr 29, 2022