Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Opportunities and Pitfalls of Event-driven Utopia

Opportunities and Pitfalls of Event-driven Utopia



Bernd Rücker goes over the concepts, the advantages, and the pitfalls of event-driven utopia. He shares real-life stories or points to source code examples.


Bernd Rücker is co-founder and developer advocate at Camunda. Previously, he has helped automating highly scalable core workflows at global companies including T-Mobile, Lufthansa, Zalando. He is currently focused on new workflow automation paradigms that fit into modern architectures around distributed systems, microservices, domain-driven design, event-driven architecture and reactive systems.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Rücker: I called my talk "Opportunities and Pitfalls of Event-Driven Utopia." That's probably a strange title. What motivated me to do this talk, actually? What I saw over the last couple of years, and especially the last year, is that more and more people, more companies, more of our customers as well are picking up event-driven architecture. It's not a new topic, there is a book even called "Event-Driven Architecture" which is quite old, but I really see that adopted more and more.

Here’s evidence where you can also see that. We are at QCon, so you probably also read InfoQ. They did that architecture trends report; for example, in 2018 they had event-driven architectures for the early adopters. Already in 2019 in Q1, they moved that to Early Majority. That's just one indication for what I also see in the market, that event-driven architectures are picked up more and more nowadays.

At the same time, we're seeing that we don't have yet a common understanding of what event-driven architecture means. This is a good blog post actually, by Martin Fowler. He also did a keynote on that, I think it was in GOTO Chicago, which is on YouTube so you can watch that as well. He said, "We also saw that at ThoughtWorks, for example, we discussed event-driven applications with a couple of people there, and we recognized that when people talk about events, they actually mean something quite different." It's really hard to talk about even-driven in a bigger audience, because everybody thinks of different things. So I think it makes sense to go over that and to understand a couple of the concepts.

First I want to talk about what I called "Events on the inside." How can I use event-driven to build better applications or better services? I talk about a single service here, not a complete set of microservice. It's one single service. How can I use events there in order to do things different? Then when we start to understand that, we can also talk about, "Can we use events to communicate between services to collaborate between services, for example to communicate between them?" I called that "Events on the outside." You probably know Chris Richardson. He wrote the "Microservices Patterns" book in Manning, and he also talks about events on the inside and outside. I think that makes very much sense. You probably spotted this if you read the slides here very carefully: that's topic one, that's topic three, so there is a topic missing here. I added something which I call "Events inside out." I want to go into that later on, I don't want to introduce that now. That's basically the agenda what I want to go over today.

Events on the Inside

If we look at how can we leverage events on the inside of an application or service, I first want to have a look at how we built applications over the last one or two decades. You probably know this kind of architecture. By the way, that's a hint that's throughout the whole slide deck. Whenever you see that papyrus here, it means that’s a traditional architecture. That's probably not the best practice architecture for the future. It's not necessarily event-driven at all. It's what we did over the last years. You probably know that and that's what Ted Neward once referred to as a BBC architecture, box-arrow-box-arrow-cylinder - that's all diagrams you ever need. We can express every architecture by that, but you probably know that. We build an application and we use a relational database underneath.

There's the great thing about that architecture, and that's very often forgotten. When we now discuss that, it's very often forgotten. The great thing about that architecture is there's a relational database management system, and that's really great. It gives you a lot of features, we shouldn't forget about that. It gives you ACID transactions - you probably know ACID, atomic and consistent, isolated, durable. That's awesome. You can build awesome applications by that. The database does a lot of things for you which are pretty hard to do. I care about consistency for example, and this kind of things.

Whenever it's possible to use a relational database, I think it's still the best option. The problem with that architecture is the database. The database is not really what we nowadays very often call web-scale or internet scale. If you're Amazon or probably even a much smaller company but offer your services build right, it probably doesn't deliver the scalability you need, and especially resiliency is very important. You can either go for an Oracle GoldenGate, or MySQL has a couple of clustering options that are not easy to set up, that are not easy to operate, so that's already hard to do. Then if you add some geographically distributed things to that, it gets really hard.

We have a couple of customers doing that, replicating between Europe and the U.S., and that adds so much latency and so many problems, that this is probably not the best way to use a relational database. That's a limit of the relational database here. If you don't hit the limit, go with the relational database. I'm a metaphor person, I'm a picture person, I remember things better if I have a picture for that. I was in the mood of making bad animations for that slide, so I basically used Kubernetes, that's the cloud native thing at the moment. You can imagine it like that: the relational database doesn't fit in Kubernetes. You can see that by the shape, it doesn't fit. That's a thing you can probably remember. It's not really the scalability architecture.

If you look at that more seriously, you'll probably come across Pat Helland. I'm a big Pat Helland fan. He worked at Amazon, Salesforce, Microsoft, a couple of the big companies. He had a talk quite a while back where he said, "Immutability changes everything." We have to think about immutability in order to build these resilient and scalable applications. What does that mean? If you look at an example - and this is a very common example - let's assume you built that BBC architecture and you built an accounting application or a bank account application or something like that. You use a database, you have a table where you say, "This account has this balance at the moment." That's what I call persistent state; you persist the current state. If you withdraw money, you do an update on that state. That's what everybody knows here.

Why does it scale so bad? If you're not replicating the database, for example, you have one in the U.S. and one in Europe, you have to make sure they're in sync. If they're not in sync, and you do an update on the balance in Europe and an update on the balance in the U.S., you have a conflict which is hard to resolve, because you updated to one amount here to another amount here, which one wins? That's really hard to answer. There are a couple of these reasons why this doesn't really scale. What's the alternative? The alternative is to persist the change you want to make, not the latest state, but the change you want to make.

That's very often known as these append-only logs, or event logs. The way this works is, I just have a couple of these immutable things, I created the bank account then I transferred money to it, and probably I withdraw money, and so on. I don't update the current state, I just write what we nowadays normally name events, which are immutable facts. I can never change them. That makes things much easier. Then the current balance, I just derive by reading all this kind of facts. I can always derive that.

In this scenario, it's much easier. If I have a withdraw in Europe and in the U.S., for example, and they are not in sync currently, I might make a wrong decision about the current balance, and allow to withdraw too much money but I don't have a conflict in the updates. I withdraw $100 here, €100 here, and I'm fine, I know what to do. That's much easier to scale; this is how we can imagine it.

This is what's called event sourcing. That means I use the events, the change, as my data store for the application. That works like this: imagine you have a customer service, and you have your event store. By the way, an event store technically could be different things. You could even implement that by a relational database. Nowadays, you also have special applications for that. There are a couple of applications which are really focusing on being an event store. You just store these kinds of events and then you probably want to create a customer.

Now, you have an execution loop inside your own service, your application to execute that. First of all, you have to read back all the events from the past, and build some internal state of the application, because you have to know that state. In this example, you have to know if you already know the customer. In the withdrawal example, you have to know the current balance of the account, so you have to rebuild that state internally.

The big difference is I don't store that, I can throw it away. It's just for the purpose of basically validating invariance. Can I withdraw the money? Do I have enough balance? That's the decision I have to make on that. That's why I need the internal state. I make the validations, I decide, "Ok, I can create that customer or I can withdraw the amount," and then you write the event to the log. As soon as you know, "That's okay. The event is getting a real fact," it's written to the log. Then you might even publish that to the outside world. We're coming back to that in a second, that's an event on the outside. That's a typical loop you go through in event sourcing.

There's one great thing about that. I talked about ACID transactions at the very beginning, like consistency. There's only one requirement for an atomic operation here. Whenever you insert events in the event store, which is probably not only one, but a couple of them, this has to be atomic. You have to make sure you either write all of the events or none, but you don't need more atomicity. I want to make the counter example to get that across. That's again, the papyrus, so that's a traditional architecture. I have the customer service, and then I would go for relational database, for example, so I create the customer. That means I insert the customer into the table and database. That's my business logic. Then I commit that, that's an atomic operation again.

Let's assume whenever I create the customer, the customer service should also make sure an account is opened. I probably want to call or want to propagate something in order to make the account being opened. That's normally a remote communication here. If you're talking microservices for example, or in serverless, it means that the customer is one box. The account is one box which lives somewhere else. That's a remote call.

If it's a remote call, we don't have ACID transactions available. We can't do a transaction which spawns that business logic in the database, for example, sending a message on a message queue. Normally, some people are thinking, "I can do that, there's two-phase commit. There are these distributed transactions." If you're still believing that, I would recommend to read this paper. It's a very good paper, "Life beyond Distributed Transactions" from Pat Helland. I just used his own quote, "Grown-Ups don't use distributed transactions." The summary is, it doesn't scale. The second thing is it's too complex, it doesn't really work, nobody understands it. So don't use these distributed transactions. Assume you don't have them.

In this scenario you say, "But I want to have it atomic. I want to create the customer in my database and open the account; either both or none. What can I do?" You have to decide for an ordering of the commits. Either you commit that first - and then it might happen that the application crashes, you never open the account - or you open the account first and you never commit to the database. That's inverse, then you have an account without a customer, so that's bad. What can you do? That's very often referred to as the so-called outbox pattern.

What people start doing is they write what very often is called a job. They write something in the database and say, "When I created the customer, I still have to create an account. That's a job, it's not yet done, but I write it to the same database - in a separate table, but the same database. I execute that later on." That makes it possible for me to have that job creation and the real business logic in one transaction. Now it's kind of atomic, because I make sure I do both, even if it's eventual consistency that I execute the job later on. There might be a delay, milliseconds, seconds, probably minutes - that depends on the application - but I make sure both will happen.

That's a very common pattern, I actually see that applied relatively often. There are a couple of ways of implementing that. Actually, that's not a focus for me today, but just for the reference. Either, you write your own scheduler - I see that applied very often. You write a scheduler which just regularly checks the job table and then executes these kinds of jobs. I saw more often than I would have expected reading database transaction logs. If you know that logo, that's Debezium. That's a framework for replicating databases very often across data centers. What they do is they also read the database transaction log in order to replicate. People leverage that in order to read the database transaction log to see, "There was a job created. We have to do something." That's a possibility, or you can leverage, "That's my domain. That's my world where I work a lot." You could leverage a workflow engine in order to have a very easy workflow of doing that. There are a couple of ways of doing that, it's not impossible.

Probably, you even want to add idempotency. I’m not talking about that today, but I still believe, if you do distributed systems, idempotency is the number one problem you should solve very well in order to save you a lot of trouble later on. Idempotency here means I probably get the create customer message multiple times. I get retries, I get redeliveries, so I probably have that message multiple times. I want to make sure I don't process it multiple times. Very often, that architecture, you also capture that request in a separate table. If it's a message, I might save that in my own database.

I even saw that people are splitting that up into multiple transactions to make sure that a request is captured before it's executed to make sure you don't have the message coming again in milliseconds and being processed twice. What I’m trying to get across is there's quite some complexity involved in doing that right, and we're doing relatively simple things here.

That's a great thing if you're doing these event source systems. Basically, you don't have all these problems. The outbox pattern is solved, because you only have that one atomic operation for the event, and then you have the same loop again. It's included in that architecture, so a couple of these problems you don't have to solve yourself. That's, I think, one of the advantages of using event sourcing underneath.

I want to give you a more concrete example. So far, it’s been the "Hello world!" example of event sourcing, but I found it more interesting to dive deeper, to dive into real systems and look at how they behave. I wanted to bring you what I know best.

I want to quickly introduce myself, just a couple of words. My name is Bernd Rucker, I'm co-founder of Camunda. What we're doing is an open-source workflow engine. I always add that warning now - I'm a human being, I have my own opinion, keep that in mind. I'm a workflow guy, so you probably need that in order to [inaudible 00:16:57]. What we have - and it's actually interesting, and I’ve discussed this a lot recently – is two different workflow engines as an open-source project. One is called Camunda and that's the one we have had for years. Then we have a relatively new project called Zeebe.

People are asking, "Why are you doing two different workflow engines which basically do the same thing? They solve the same problem. Why?" The reason is simple. We have Camunda which does persistent state, it uses a relational database. That's how you built applications 10 years ago. And we have Zeebe, where we changed that in order to have persistent change, in order to be more scalable, in order to be more resilient. If you look at that in particular, then that works like this. I'm not sure if you're familiar with workflow engines, but it's a relatively simple concept. You can define these flows where you say, "Whenever I kick off a new instance, this should happen first. Whenever that's done, this should happen second, for example, then I'm done." You can make more complicated things, but that's the basic idea. Activities in a sequence.

If you have that in a relational database - how it works, in a simplified slide - I want to kick off a new workflow instance, so I insert in the right database table. I now have a new workflow instance which currently is in that activity, and that's it. Then somebody can pick that up, does whatever needs to be done to retrieve the payment, and calls back the workflow engine and then says, "I did that, it's done." What we do then is we update the current state and say, "This workflow instance is no longer in that activity. It's now in that activity,” and so on. Actually, to be honest, we also write a couple of database tables where we keep the changes for audit purposes, but that's an additional thing. You can even configure it to not write that to save disk space or to optimize performance, for example. That's additional, the basic thing is we update things.

How does it work if you go event-sourced? That's a bit different. If we want to kick off a new workflow instance, it works like that. We write a couple of these events. The first is we want to have a workflow instance created, and then basically the workflow engine does a couple of steps in order to do that. It knows, "Ok, then I need the workflow instance to be created, the start event occurred." Because we start here, the sequence flow is taken, so we walk that way, and an activity is activated - that's our terminology - and a task is created for the retrieve payment. I have these events written to the log.

The next thing that happens is somebody says, "Ok, I can do that, I can retrieve payment." It logs the task - that's a detail, it's not that important. Then it completes the task. Again, we're writing that to the log, "Complete the task." Ok, the task is completed, the activity is completed. We leave that. The sequence flow is taken, and so on. I hope you get the idea. We are just writing that.

Why is that cool? Let's go into the handling loop here. If you look in the inside of how we have built that engine, it works like that. We have that workflow engine we call Broker - because that's what you nowadays call systems if they're in a distributed system - and we have that event log. Let's assume I want to complete the task, that activity. I send in a command - I’ll talk about commands later on really in depth. I send that to the Broker, and the Broker adds it to its log. The next thing that happens is we store that in the event log. In our case, that's simply writing it to disk. We're writing that to disk, and we're replicating it. What we have here are followers - that's a typical concept in a distributed system. If you use Kafka or ZooKeeper or others, what it does is it builds an old distributed system, and knows "Ok, I have to replicate that to a couple of peers in order to make sure it's there. I don't lose data." So it's replicated.

As soon as it's replicated, we can process it. We don't process it before it's replicated, by the way. Then we process it, then we need the internal state in order to do that. We are using RocksDB for that. RocksDB is a framework from Netflix, that's a key value store, a very efficient one, a very fast one, which can also flush to disk. We're building that state internally in RocksDB. There's one important thing here, and I'll talk about it very briefly, but I think it’s one of the core concepts here. Then we have one single thread which executes this kind of commands on the RocksDB.

Why is that so important? If it's a single thread, we don't have contention. We don't need the isolation a relational database would give you, because we don't have competing threads writing to the same data. It's always one thread. We are having a clear sequence, it's totally clear what happens. It's easy to understand and it's easy to build. That's a very important concept. How is that scalable? I’ll come back to that in a second.

We process that on the internal state, and then - that's also interesting - as soon as we validated that the command is executable, that we can do it, we send back a response to our clients and say, "Yes, we completed the task." If you're really looking at it on a very detailed level, you might qualify that as kind of a lie. It's not yet executed, we haven't yet done anything, but we have validated that we can execute it. That's enough because now we know that we can always execute it. It's validated, it's written to the log, it has a clear sequence. It's replicated, we cannot lose the data. It's totally clear what happens, so we kind of already have it executed. That's why we are sending the response early on. We save latency, we improve performance by doing that. Only when we do that, afterwards, we write all the events which occurred because of that, like what I had on the slide earlier. The activity is completed, the next activity is entered, and so on. That might be a lot of events actually.

We do that afterwards. Then we append the event and we replicate it. The trick is that we can replay that. If something happens while appending the events or replicating, if the system crashes, that's no problem; we can always restart it from that command, and it will produce exactly the same result. It's a single thread, it has a clear ordering. It will produce exactly the same result. That's one of the tricks you can use in this kind of event source systems.

That's, by the way, one of the things we do differently. If you search the internet for command sourcing, that's an interesting exercise, actually. You'll find a lot of threads where they discuss whether it’s a good idea or bad, or evil. That's one of the things we do different, we explicitly store the command in order to have that faster response time. It's very often discussed if that's a good idea. For us, it's a good idea, so it's always about tradeoffs.

What else do we do different? We actually do persist the internal state. I talked about not needing that, you can always replay the events. Yes, that's true, but replaying the events take time. To sincrease performance, we persist the internal state so we can start up faster, meaning we replicate that to remote peers, so they don't have to rebuild the internal state, because for our experiments so far, that's faster than rebuilding it. These are optimizations.

That's also interesting. The idea of event sourcing, if you read about it, is always, "We have the events from the beginning of time, and we keep them for the whole time." We can always replay everything from the world. That obviously needs a lot of memory, you have to store a lot of things. Then kicks in something called log compaction, "At some point in time, we delete old events we don't need anymore." That's the thing. It's really hard to decide which events I don't need any more if you're a workflow engine, because some workflows run for a month, so we can’t just delete old events. They're probably still very valid.

For us, a really hard challenge was to compact the log, because then you're going through a whole log and deleting, "For this process instance, this workflow instance ended like a month ago. We can probably delete it, but the events are here, there, and there." We try to be fast and efficient, so we even optimize for disk writing performance. That's not a good option; then you end up with a fragmented log. What we do is, as soon as a command is completely executed, as soon as we don't need it anymore, we delete it from the log. As you can see there, we are not purely event source.

I want to quickly talk about CAP. The CAP theorem says, "There is consistency, there's availability and partition. You can only have two of them.” Basically, you can't decide for partitioning. If you have a distributed system, you can always have partitions. The analogy, the thought experiment, is marriage. Marriage is an atomic operation; either both are married or none of them are. We do that in a very monolithic way. We meet at a church, everybody. Every resource is blocked until we're done. It's a very inefficient thing to do that, but we can make sure we're consistent.

If you transport that to a distributed system, you want to be more efficient in marrying, then we can do it via the phone, remote. Then you can think about it. The priests asks, "Do you want to marry her?" I say, "Yes." Then, "Do you want to marry him?" and the line breaks. You don't know the answer, you don't know what happened. I could be married, I could not - I don't know. Then you have two options. The first option is, "I don't know. Just in case I have a date this evening, I want to be available. Otherwise, if I'm not married, and then I miss that date, that's kind of a waste." So I go for dating, then I go for availability. Or, I could say, "No, I don't know, I have to wait until I know. I don't date, probably I'm married." That's consistency.

You have to make that choice. You have to make that choice very consciously. For example, in Zeebe, we definitely decided for consistency. These are details and this the event source system; if we can replicate, we're not executing, so we go for consistency. In the account withdrawal example, you can decide; if you have a partition, if Europe and U.S. are not connected, you can decide if you want to withdraw the money, go for availability. Or, if you want to wait, go for consistency, and do not give the people too much money. That's the decision you always have to make in this kind of systems.

How do we scale? I have a single thread, and a single thread is not that scalable. We do partitioning. You have that single thread, but then we have multiple partitions each partition has a thread. You can distribute them on one machine with multi threads, or you can even distribute them remotely. Then for every single workflow instance, we are sure on which partition it is executed. Then for every workflow instance, we have a single thread. This is how you scale these systems. That's event sourcing.

If I have any event-sourced system, the next problem I have, I can't ask the event log questions like, "Which workflow instances do I have? Where do they currently wait? Are they taking too long?" and this kind of questions. You can't ask that to an event log. What you need is to build in another separate read model - a query model - which you can ask questions. In our case, that's Elastic. We push out all the events to Elastic, build our own read model, and this we can ask with our tooling, for example, in order to show the operations what's going on. You have to think about that.

As a summary for events on the inside, I think it's a good way of building really resilient and highly scalable systems. That's awesome, but you have to think about a lot of things. My experience so far is that - and this is, I think, the important part - there's not that much industry experience available. Probably, there are not that many people who understand it. Probably most of them are here at QCon, and if you go to other conferences, like typical developer conferences, nobody understands this. It's really hard. I would really think about it if I want to go that way. Go there only if you really need it.

Events Inside Out

If you go more on the outside, I call that inside-out. What's that? If you think of our system where we have that event store and we have a service, normally, if you're event-driven, very often you also use messaging. I rarely see that you have a REST API for that, but it's probably also the case. Let's assume you have messaging. One idea which is probably not that surprising, is to say, "I had that event store down there. I had the messaging down there. Why shouldn't emerge that too? That's probably a good idea." Then you have a shared event store where you say, "You don't send me the event via message and I write it in my log; write it in my log yourself." It sounds a bit weird, but it's actually advertised relatively good. If you look for Kafka, for example, and this messaging as a single source of truth, it's not explicitly written, but it sounds like this is what you should probably do.

I personally am not that convinced that this is a good idea, so I always make that exclamation mark there as a warning, because now you won't map your internals. Personally, I'm a big fan of having my local data of the service, which I can totally control, and having a clear API. I'm actually not a big fan of that, so I personally wouldn't go that direction.

What you can do - but I think it's important to make the distinction here - is you can use something like Kafka as, let's say, the input topic, the input queue. You can use it as a transport. That's fine, it's still persistent, by the way. That changes a lot of things. It's not like in RabbitMQ, you consume a message, it's gone, you have consumed it. In Kafka, it's persistent. That changes a couple of things; how you can work with that. That's event inside out.

Events on the Outside

I want to dedicate the rest of the time to the outside. How can I use events to communicate between services? That's actually what probably much more of you do on a daily basis. Let's say 20 years back, we have a CRM system and we have other systems. Whenever you change the address of the customer, you have to inform all the other systems. That was a big time of enterprise application integration tools, or probably [inaudible 00:33:15]. How do I make that efficient, that a customer changes the address in the billing system?

With event notification, you turn that around. What you basically do is you say, "The customer, the CRM system, that's a very general component in my whole architecture. That should produce events, put it on some topic, don't know who is interested in that." The billing system knows, "I should listen for address changes," and consumes it. The direction of dependencies, now the billing system knows about address changes from the customer, and not the other way around - the customer service knows about billing and has to inform it, because then you end up with the customer CRM system knowing basically each and every system. That's the idea of event notification.

There's one interesting discussion, which I don't want to go into detail today, but just as a reference, whenever you do that, you will end up with discussions on "What's the payload of my event?" In this case, a pure event notification if you go for terminology, would be, "The address has changed. Now it's your problem." That means I probably have to ask for, "What's the new address? I want to have that. I need that."

Much more often, actually, I see something which - if you go for terminology - should be Event-Carried State Transfer. I'm not only saying, "That has changed," but I add the data. That's a discussion in itself, so that's a whole other talk. What data should I include in there? It could be like this. I could send a delta. That was the old address, and that was a new address. I could send just the, "The address has changed." That's the new situation, that's a new state. That will be an option. I could even discuss about granularity, "Has address changed?" or "Do I send a customer change whenever something changes in a customer?" Especially in domain-driven design, if you go to these communities, they normally discuss it in a way that, "Yes, address changed. That's a technical thing. Why should I care about an address change? I care about the business requirement behind that, so the customer has moved." That's a different event. I’ll spare that discussion today. It's very interesting and, to be honest, it's very hard to get that right, because as soon as you have these events defined and circulating around, it's pretty hard to change them.

The other one - and I want to actually dive into that in more depth, there I also have much more experience, so I can talk more in-depth about that - is a decision; if you want to call somebody else, or if you want to be notified by an event. That's also a very complex decision. We don't yet have a lot of understanding of how to do that right. I'll give an example. Let's say you want to change the address. You go to a web interface and say, "I want to change my address." Let's assume you get a notification email which says, "Hey, your address has changed. Is that correct? Please confirm that." You could imagine that requirement.

If we implement that, we might end up with two different services: one for the CRM, the customer management, and one for sending the notifications. Now we have two possibilities. The first is, "Please change the address," so the customer service publishes an event like "address change". Not yet, it needs to be confirmed, so it's more like "address change requested." Notification service, "If there's an address change request, then I send out an email, wait for the notification." When it's notified, I can say, "Address change," or "Address change confirmed," something like that.

In this case, I have the direction of dependency from notification to customer. Notification needs to know about customer. The alternative is the other way around. The customer doesn't send an event, but it knows, "There's a notification service somewhere. I can leverage that." I send them a command to send an email, I wait for the response, and then this probably sends back an event like, "The confirmation was approved. Then I probably generate the address change event." See the difference? In this case, I have the direction of dependency from customer to notification.

The thing is, that's a case-by-case decision. There's not a single rule where you say, "Go always event-driven," and that's actually what I see a lot in the current projects. They think of, "Event-driven - that's the future. We want to be in the future, so we go event-driven," but it's not that easy. You should always think about, "Do I have any notifications?" or "Do I want to send commands?" That's quite a challenge.

There is one thing - and it's basically only about terminology - I think it's really important. If you look at that, what makes that even more important is that people mix it up very often with, for example, communication protocols. They think, "If I command somebody, 'change the address,' that's REST RPC. I want to do that. I want to be event-driven." But that's not true, both can be messaging, can put a command in a message, send it over, no problem. It can be REST even if it's event-driven; REST feeds is an example, to be event-driven with REST technology. It's totally independent of the question of what kind of transport I want to use.

Especially nowadays, you have a lot of technical possibilities of being event-driven. For example, you might be Kafka. That's what everybody is thinking, "Event-driven, I use Kafka," but it might also be a topic on Rabbit. It might be REST feeds. It might be Webhooks -Webhooks are event-driven. Especially, if you look into the whole serverless or cloud environments, they have a lot of ways of being event-driven. Basically everything is event-driven there. That's, by the way, one of the reasons why I think it will get more and more important over the next years. That's all event-driven, it's not only something like Kafka.

The problem then is, we have that event, and an event is always a fact, something that already happened in the past, immutable. We have the commands. The command is an intent. "I want somebody else to do something." Some people then discuss, "Yes, but that 'I want somebody to do something' is already a fact, like an event I want," but that's kind of a weird discussion sometimes. I have commands, and sometimes I want to know something from somebody else, so I have queries. That's actually very much agreed on, even if you read through CQRS communities and this kind of things. We have Event, Command, and Queries. That makes sense. It's even in the "Enterprise Integration Patterns" book. I think it talks about events, commands and queries.

The problem is, a couple of years back that was normally called a message, because we had messaging systems. Sometimes nowadays, if you use Kafka, they talk about record. I can live with a record, but a lot of people refer to, let's say, the common class of that, to be an event. That's actually a clash we currently have in a lot of discussions. When people say event, they either mean the content, the event, the fact, or they mean, "I'm sending a message." That can lead to very weird discussions. That's why I'm really putting a lot of emphasis on that. It's only terminology, but I think it's important. Personally, I try to really talk about messages or records. In Zeebe, for example, we called it record as well, not event, because that's a clash. Then you have events, commands, and queries.

What I see a lot as a result of that are what I call commands in disguise. "The customer needs to be sent a message to confirm the address change intend," That's a command - I want somebody else to do something. If that helps you in order to get your architecture through, it might be ok to do it. One thing which is very important from an architecture point of view is that the alternative - sending a message - normally uses the terminology, the wording of the recipient, so the service which provides that service, which is more natural, which normally is a much more stable API versus the command in disguise, reduces the words of the sender who wants to do something which is normally not that stable.

Let's do a couple of examples to make that more concrete. Let's assume you have a payment service. Then for me it's very natural that other services who want to retrieve a payment sends a command, because otherwise, the payment service needs to know all the services which probably need a payment. That does make a lot of sense. In this case, I want to have it in this direction. The address change, we already talked about that. For me, that's a very good example where I want to have it in this direction because the customer service should know about all the clients which are interested in address changes.

If you go for a notification service which probably does all the notifications for an order fulfillment, it could be completely event-driven, because it knows about what stages the order could be in in order to send the right notification, so that event-driven makes a lot of sense for me. It could be different if you have a general notification service, which sends out notifications for everything in your company. It’s very weird that this should know about orders, then you have to send the command. It's really a case-by-case decision. Those are just a couple of examples to get you thinking.

When I discussed the talk with Suzanne and when I prepared for it, she had a good example as well. They had a document management system, that was the context. They used DDD, and they had a couple of contexts like for caring about the document, for caring about a certain page, and probably a certain attachment, and so on, and they had an authentication service. They started with an event-driven approach where there were a lot of events, like whatever document is created, a new page is added or whatever, where you had to add authentication information.

If you do this kind of thing, you end up with every change, and each and every single bounded context leads to a change in the authentication service. That's what we call a distributed monolith. You can't independently change anything anymore. That's a pretty bad situation to be in. In this case, it got pretty clear they re-factored it to have a clear stable API, basically. The services down there, the bounded context, called the authentication service to do something.

Enough of the examples. The next challenge is event change, one of my favorite topics. Let's assume you want to create a customer. Let's assume you do a couple of checks before you do that. There are a couple of services you have for that and an event bus. It could work like that. The registration says "There's a registration requested," that's an event. Credit check says, "Ok, then I have to check the customer. I did that." Then Address check knows, "If the credit was checked, I need to check or verify the address." If that is done, the customer can be created, and it's registered. That's an event chain. That's actually how we can implement these kinds of flows.

If you do that, it's pretty hard to get an overview of how that works. Personally, it's hard to understand the system. During runtime, you see what's happening. I personally love the blog post from Martin Fowler. He says, "The danger is that it's very easy to make nicely decoupled systems without realizing that you're losing sight of a larger-scale flow, and thus set yourself up for trouble in future years." That's a danger you have with these kinds of systems.

If we want to change it, let's assume we get it somehow. We can work with that, but we want to change it. By the way, that's a real example from a customer I discussed that with. They wanted to add an additional check - let's assume something like criminal registered check. If you want to do that, and you want to add it, you have to change a couple of things in that system. The first is you have to remove that notification, and now you have to listen for the address checked and the criminal check. Then you have to turn to the customer, no longer to listen to the address check, but to the criminal check. You have to have two changes which you have to coordinate. Not a good idea in a microservice environment. Not really decoupling.

Now, you could argue that this is bad event flow. It's not a good event flow, I do agree; probably, you should do it differently. Let's try again from scratch. You can't do that that often with a real-life product, but let's do it for now. Registration requested, both address check and credit check listen to that, so it's in parallel now. It's changing a bit, but it gets easier. They produce their results, the customer aggregates that, does everything. Now we can just simply add the criminal check without changing anything else. That's good, but you still have to change the customer service to basically cooperate the result in order to do something. You still have two changes in your microservices.

I think that's my point here. I think that that's unavoidable. You have these two changes; you want to have these two changes. You want to change your system. The thing is, how easy is it to do that? If you now have requirements for the sequence of these kinds of things that you want to do there, that gets pretty hard to do. If you want to change something, if you say, "We want to keep it stable. Don't worry, we don't have to touch it. You just have to remove that stick down there." That shouldn't be that hard. That's it, that's all we ask. That's how it feels, that's what I see in a lot of systems currently.

I always use this picture, this is choreography. That's this event ping pong, we just saw a beautiful dance. We add a new microservice and it's part of the dance. It's not what I see in a real system. What's the alternative? That’s one of the last things I want to talk about. It's orchestration. Orchestration means that in this case, I want to have it registered. Now, the credit check or address check doesn't listen to that. We have a separate service that might be merged with the customer service. I did it as a separate service, but it can be the customer service. That depends on different things.

This listens to it, and then it sends out commands. It says, "Credit check, please do that for me," and waits for the response. As soon as it did that, "Address check, please do that for me," waits for the response. Then it says, "Ok, now we're done." Now you have that one point where you can change, where you can remove the stick easily. That's important to keep in mind, and that's very often connected to the thought that you need these commands. Not everything is event-driven.

If you look at the change scenario, that gets easier again. We have these address checks. We want to have the criminal check additionally. What do we have to change? We have to change the customer on-boarding, because this one now knows, "Ok, we have to have that additional." If you compare that, and I wrote an InfoWorld article which is linked here, which compares the changes, the coupling you have to make, from my perspective, it's exactly the same. You have to change the customer service. You have to change the criminal or add the criminal check service, in both scenarios. Why do I emphasize that? Because I think it's nonsense that purely choreographed systems are less coupled than the others. If I have a communication between two things, it's coupling. You can't avoid that, you can't decide how to do that. Yes, in my world, you can do orchestration workflow engines, obviously, but you don't have to.

Commands and events decide about the direction of the dependency. That's a very important thought. As soon as you have that, and beware of event chains, you lose sight and you don't have less coupling. That's simply not true. You should balance choreography and orchestration in a way that it makes sense for your architecture. It's not easy, but it's very important.

I walked through the events on the inside, the event sourcing part. I shared event stores, and events versus command. That was quite a lot of things, actually, I hope I got across at least a couple of the ideas. In summary, it's not easy; events on the inside are even harder, but it's doable. You can do that, and it's probably worth it if you need it.


See more presentations with transcripts


Recorded at:

Oct 07, 2019