Complex Event Flows in Distributed Systems



Bernd Ruecker demonstrates how the new generation of lightweight and highly scalable state machines eases the implementation of long-running services. He shares how to handle complex logic and flows that require proper reactions to failures, timeouts and compensating actions, and provides guidance backed by code examples to illustrate alternative approaches.


Bernd Ruecker is co-founder and developer advocate at Camunda. Previously, he helped automate highly scalable core workflows at global companies including T-Mobile, Lufthansa, and Zalando. He is currently focused on new workflow automation paradigms that fit into modern architectures around distributed systems, microservices, domain-driven design, event-driven architecture and reactive systems.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Ruecker: I want to talk about complex event flows in distributed systems, and basically what I want to do is talk about common hypotheses I see out there. We have quite a rise of, let's say, a hype, but also real adoption, of event-driven systems out there. And commonly, I see people talking about it like, "Now we have these event-driven systems and they are the magic thing in order to decouple our systems, and everything we do should be event-driven." I want to look at that first. And then I want to look at the second thing which normally comes up: "If I do these event-driven things, I don't need orchestration at all. Orchestration is evil because it introduces coupling again and we don't want to have that." And that's very often connected with the third thing, that "this workflow technology is not required in that new world at all. It's painful tooling from the past, so we don't want to have that in this new world."

And that's basically my plan for today, to go over these things and comment on them. Obviously it contains opinion, my personal opinion. Wes [Reisz] said that yesterday already: QCon definitely also likes to have opinionated talks. I mean, that's what you do - you talk about your own experiences. I just want to make that transparent.

And Jonas [Bonér] already introduced me. I'm a co-founder of Camunda; we're an open-source workflow automation vendor. I've contributed to a couple of open-source engines in the past. What I've basically done all my professional life is work with workflow engines. That means I look through the lens of workflow automation onto a lot of problems, and that's probably good to know. But in that time, over the last 15 years, I saw thousands of customers doing different workflows in different industries with different architectures, and that's what I try to bring together today. That's my plan.

Simplified Example: Dash Button

I love using examples. The example I use today is motivated by Zalando. You probably know Zalando; they're active in the U.K. as well. Sometimes they position themselves as the Amazon for shoes and clothes. They're a German startup, and they're growing really quickly. And they also have, obviously, an order fulfillment process. I cannot show the Zalando order fulfillment stuff here on stage and in the public slides.

I do a much simpler workflow here, and that's motivated by the Amazon Dash button. You probably know that; it recently got forbidden in Germany, but I think it's still available in the U.K. The idea is you have the Dash button and you put it next to the washing machine. If the washing powder is empty, you just press the button, and then one box of washing powder gets ordered at Amazon and shipped to you - a very easy order fulfillment thing. That's the example I use here. In the background, that's still an order-fulfillment-like workflow: you press the button and then you have to pay for it. That's normally done with the payment details you already stored in your account. Then it obviously has to be fetched from the warehouse and shipped to you, right? That's the workflow which should happen in the background.


If you want to design such a system, nowadays you probably want to look into DDD, "Domain-Driven Design." That's something I can totally recommend; it's a really good book and it talks about something called a bounded context. You have to slice your domain into these contexts where it makes sense to have really one domain language, and these kinds of things. It could be that you end up with these four bounded contexts here. It could be different - it depends on a lot of factors - but for today, I want to make the example with these four domains. And if you go for microservices, normally these are natural candidates to be microservices. You have the checkout communicating with the button. You have the payment, getting the money. You have inventory, doing everything about what's in stock and fetching it from stock, and you have shipment in order to get it to the customer. That normally makes quite a lot of sense.

I don't want to define microservices here. I just want to mention one thing which I think is important to keep in mind; I will come back to that later on. My view on microservices - why are we doing microservices? We want to have autonomous teams caring about the separate microservices. Every microservice has its own team caring about it and its own data set where the data is stored independently - so not a joint database, not a shared database. And we want to run on different, whatever it is, virtual machines or Kubernetes pods. It's separated so everybody can run on their own, can deploy on their own. That's the basic idea; that's just the introductory stuff.

Events Decrease Coupling

Let's look at how you can build a system out of that with event notifications. And let's do an example. It's not in the Dash button, but you could imagine something like: if you press the button, it should quickly blink green to signal that we can ship within 24 hours. That could be a requirement you have to implement in such a scenario. And the naive approach of implementing that would be - I mean, the checkout service communicates with the button, so it has to know whether to blink or not, but it doesn't know it. It doesn't know about what's in stock. So the naive approach would be, "Hey, I ask the service which knows about it," which is the inventory. So I could do a request and response, an RPC style of thing, and ask the inventory, "Hey, do you have that item in stock? Can I ship that in 24 hours?" I get a response and then I can blink or not. It's not necessarily really RPC here. It doesn't have to be HTTP or REST or something like that. It can be messaging - a request message and waiting for the response. I'm not talking about protocols here.

But it means that I couple these together in a way that if the inventory is not available, I cannot make the button blink. I might add some latency because I have to ask over a network. Maybe inventory has to do a couple of other things. There are a couple of downsides to this approach; the most obvious is the temporal coupling. If inventory is not available, checkout cannot do that.
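The request/response variant can be sketched in a few lines. Everything here is a made-up illustration under the talk's assumptions - `InventoryClient`, `shouldBlinkGreen` and the item name are not a real API - but it shows where the temporal coupling lives:

```java
import java.util.Map;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the synchronous (RPC-style) variant:
// checkout blocks on a remote call to inventory before deciding to blink.
public class CheckoutService {

    // Stand-in for the remote inventory service; in reality this would be
    // an HTTP/REST call or a request/response message pair.
    interface InventoryClient {
        boolean inStock(String item) throws TimeoutException;
    }

    private final InventoryClient inventory;

    public CheckoutService(InventoryClient inventory) {
        this.inventory = inventory;
    }

    // Temporal coupling: if inventory is down or slow, the button cannot
    // blink, because we block on the answer over the network.
    public boolean shouldBlinkGreen(String item) {
        try {
            return inventory.inStock(item);
        } catch (TimeoutException e) {
            return false; // inventory unavailable -> no 24h promise
        }
    }

    public static void main(String[] args) {
        Map<String, Boolean> stock = Map.of("washing-powder", true);
        CheckoutService checkout =
            new CheckoutService(item -> stock.getOrDefault(item, false));
        System.out.println(checkout.shouldBlinkGreen("washing-powder")); // true
    }
}
```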

The alternative would be to use an event-driven approach where it's more like, "Hey, we have that inventory down there," and inventory sends out events whenever something happens that is relevant to what's in stock. So, for example, "Hey, there are new goods in stock," or something just got moved away from stock. And you could send these events to the event bus, to the outer world - it's kind of a broadcast thing - and then the checkout service could say, "Hey, I'm interested in that." I read all these events and I build a local data set of what's in stock or not - only that limited amount of data which is interesting for me. A lot of people think about that as a cache, which I think is valid for now. But the important thing is that now I don't have to ask inventory in order to say if it's shipping or not. It also has disadvantages. I mean, it might be off by a couple of milliseconds, so that's about eventual consistency. It might not be the real amount in stock, but there are a couple of advantages to doing this kind of thing. That's one way of thinking about event-driven.
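The event-driven alternative can be sketched the same way; the event type `StockChanged` and its fields are invented for illustration, standing in for whatever inventory really broadcasts:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the event-driven variant: inventory broadcasts
// stock events, and checkout maintains its own local, eventually
// consistent view - a "cache" it can answer from without a remote call.
public class StockView {

    // The event type inventory broadcasts; the name is made up.
    record StockChanged(String item, int delta) {}

    private final Map<String, Integer> localStock = new HashMap<>();

    // Subscribed on the event bus: applied whenever inventory emits.
    public void onStockChanged(StockChanged event) {
        localStock.merge(event.item(), event.delta(), Integer::sum);
    }

    // No remote call: checkout answers from its local copy, accepting
    // that it may be a few milliseconds behind reality.
    public boolean canShipIn24h(String item) {
        return localStock.getOrDefault(item, 0) > 0;
    }
}
```

If inventory is down, checkout keeps answering from its last known state - that is exactly the temporal decoupling the talk describes, traded against eventual consistency.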

Event-Driven Architecture

There's another example in this small order fulfillment thing which I also think is quite a good example of using event notifications. If you have events flowing around in the system which say something like, "Hey, there was an order placed," or "Hey, there was payment received," or "Hey, there were goods shipped," you could build a notification service - some service which sends out emails or SMS texts to customers, completely on its own. Then you don't have to think, in the left-hand services, about when to send an email. Did the customer give their GDPR-compliant consent for us to send emails? You can really centralize that in that one service. You have one team only caring about notifications. That's a good way of thinking about it, because then you have a clear responsibility for what this service should do and you really get it out of the other services. So, that's a good part of the notifications.
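A notification service like that is purely a subscriber; it decides on its own which events warrant a message. The event types and texts below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a notification service reacting to domain events
// on its own, so no upstream service has to know about emails at all.
public class NotificationService {

    record DomainEvent(String type, String customer) {}

    private final List<String> sentEmails = new ArrayList<>();

    // Subscribed to the broadcast; ignores anything it doesn't care about.
    public void onEvent(DomainEvent event) {
        switch (event.type()) {
            case "OrderPlaced"     -> send(event.customer(), "We got your order!");
            case "PaymentReceived" -> send(event.customer(), "Payment confirmed.");
            case "GoodsShipped"    -> send(event.customer(), "Your parcel is on its way.");
            default -> { } // not interested in other events
        }
    }

    private void send(String customer, String text) {
        // GDPR consent checks and channel selection would live here,
        // centralized in the one team that owns notifications.
        sentEmails.add(customer + ": " + text);
    }

    public List<String> sentEmails() { return sentEmails; }
}
```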

So yes, events can decrease coupling, and I think there are really good use cases for that. But as always, if you have that hammer, people will start applying it to a lot of things. What I saw emerging in a lot of situations over the last years is that, if you have these services and you have to implement that workflow - pay, fetch and ship - you can do that with event notifications as well. It would look like checkout says, "Hey, there was an order placed. Somebody pressed the button." And then probably you say, "Okay, payment is first." So payment listens to that order placed event, does something, whatever it has to do, and then emits another event where it says, "Hey, the payment was received." And inventory knows, "Oh, if there was money received, I have to fetch the goods." And then it emits another event and shipment says, "Goods fetched, I should ship them," and so on and so forth.
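The chain described above can be made concrete with a toy event bus; the event names and the single-handler-per-event setup are simplifications for illustration. Note that no class states the order "pay, then fetch, then ship" - the sequence only exists implicitly in who subscribed to what:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the peer-to-peer event chain: each service only
// declares which event it reacts to and which event it emits in turn.
// Nobody owns the end-to-end flow - it emerges from the subscriptions.
public class EventChain {

    private final Map<String, List<Function<String, String>>> subscriptions = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    public void subscribe(String eventType, Function<String, String> handler) {
        subscriptions.computeIfAbsent(eventType, t -> new ArrayList<>()).add(handler);
    }

    public void publish(String eventType) {
        log.add(eventType);
        for (Function<String, String> handler :
                subscriptions.getOrDefault(eventType, List.of())) {
            String next = handler.apply(eventType);
            if (next != null) publish(next); // the chain continues implicitly
        }
    }

    public List<String> log() { return log; }

    public static void main(String[] args) {
        EventChain bus = new EventChain();
        bus.subscribe("OrderPlaced",     e -> "PaymentReceived"); // payment service
        bus.subscribe("PaymentReceived", e -> "GoodsFetched");    // inventory service
        bus.subscribe("GoodsFetched",    e -> "GoodsShipped");    // shipment service
        bus.publish("OrderPlaced");
        System.out.println(bus.log());
        // [OrderPlaced, PaymentReceived, GoodsFetched, GoodsShipped]
    }
}
```

To see the full sequence you have to read every subscription - or watch the runtime log - which is exactly the "losing sight" problem discussed next.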

Obviously, you can implement this, and that's what I call a complex event flow, because very often it's not just four services, it's much more. You can implement that flow by event notifications. The problem I see with this approach is that you're kind of losing sight of what's going on. It's hard to understand how the workflow on the right-hand side is implemented there. You really have to look at the events during runtime to see what's really going on, and that personally gives me a weird feeling. I don't really feel comfortable with that. I've discussed it very often over the last years. Very often I get the response, "You're that workflow guy. You're that orchestration guy. Your tool is not on the slide. There's no place for your tool on the slide, so obviously you don't like that."

But that's not the real reason. Actually, whoever knows me personally knows that I would rather dump the product if it doesn't fit the architecture. So, I really have a bad feeling about that. The good thing is, now it's 2019, so you find a lot of blog posts out there which back up that bad feeling I have as well. The most prominent one is from Martin Fowler. He blogged at the beginning of 2017 and said, "The danger is that it's very easy to make nicely decoupled systems" - I mean, that's not yet the danger, that's why we're doing it - "without realizing that you're losing sight of that larger-scale flow and thus set yourself up for trouble in future years." So, that captures very well the feeling I have for this kind of architecture as well.

Monitoring Workflows across Microservices – Typical Approaches

If it's just about losing sight, about not knowing what's going on, one approach of regaining it would be monitoring or tracing and these kinds of tools. I recently wrote an article, published last week on InfoQ, about monitoring workflows across these collaborating microservices, and in it I go through a couple of approaches you could take in order to regain sight into these kinds of workflows. One thing you could do is distributed tracing with OpenTracing-style tools. There are a couple of them, like Zipkin or Jaeger, or if you're on AWS, X-Ray. These tools can basically give you some kind of, let's say, call stack throughout your distributed system, throughout different components, where you send the first request and see how it rippled through the whole system. That gives you an idea, but it has the downside of being very technical. It normally focuses on really very technical details - the exact message, method calls, services invoked, functions invoked. It doesn't have a business perspective on that. It's much too fine-grained. Another problem is that tracing normally just samples, so they record maybe 10% of all requests and you don't get a complete picture of what's going on. Very often, you only look at failure cases, not at the happy cases. So this is, I think, not the best way of approaching it. It's an important piece, but it's more focused on technical things.

What most customers I know are doing is building some kind of, whatever you want to call it, probably a data lake nowadays. What they really do is take Elasticsearch, put it somewhere, get all the events - if they have an event bus - and store them. And then they try to make sense out of that in order to understand the flow, what's going on. If they're good at that, you can even hook in your own visualizations. There are frameworks like bpmn.io - it's open source. You can just build a small HTML page in order to visualize events on a BPMN model, graphically. So, there are ways of doing that. I think that's a very powerful approach. The downside is that you have to put some effort into it to make it happen; it doesn't come for free.

There's another range of tools, normally called process mining, which I'm conceptually really curious about. I like what they're doing. They try to make sense out of events. They try to work out, from these different events, what the normal flow is and what workflows are going on here. They try to do discovery on that. The problem is most of these tools really focus on log file analysis. So if you're using Kafka or whatever, you're not really in that world. They're also not very developer-friendly at the moment. So, I hope this will evolve a bit more and give you some more insight.

And there's the last approach, which I want to talk about later on, not now - I call it process tracking, and I'll come back to that in a second. If you want to read about all of this, there is the InfoQ article; I summarized it there.

Emerging Behavior

But there's one problem that still remains. Now you have some insight, but what you still have is something - I took this from a talk by Stefan Tilkov at microXchg last year in Berlin, where he talked about microservice patterns and anti-patterns. The talk is on YouTube; it's linked there. I totally recommend you look through it. On one slide he said, "Okay, but this is getting to be emerging behavior," because it emerges during runtime; you don't design it upfront. And I actually like the real slide where he also calls it emergent behavior, and I think that's what a lot of people here know as well. For me personally - I know it's a totally unfair comparison - it often reminds me of database triggers: "Hey, I write something in the database. Something happens. I have no idea what." So, you have to be at least cautious about this emerging behavior. I'm not saying it's a bad idea. I'm just saying there's risk involved as well.

Peer-to-Peer Event Chains

And I can make another example which makes it, let's say, very transparent what I mean here. Let's assume you want to change that workflow which is going on here. Let's just assume you want to fetch the goods before you wait for the payment, which is actually not that unrealistic. I mean, we have same-day delivery nowadays. Amazon sometimes fetches goods from stock before you've checked out. That's an amazing thing actually, because they can predict what you might order - if you have recurring orders, they already fetch it before you order it. And I heard they're pretty good at that. So, it's not unrealistic to change the order here to get faster. Everything gets faster.

If you want to do that, visually it would mean changing the sequence of these two things. On the left-hand side, it means that I have to change a couple of event subscriptions. And I actually like this effect, because what we can see is that you have to change three services down there. It's better than REST calls - if you have a REST-based architecture, that means you have to change all four of them. But there are still three services I have to change, and I have to coordinate the deployment, because I probably have ongoing orders in my system. They're flowing around somewhere and I changed that order. I have to make sure, whenever they arrive in payment, that they know whether the goods were already fetched or not, otherwise it will not work. So, I have a versioning problem. It's not totally trivial to solve. And it probably also means that you have to redeploy at the same time and you have to talk about that; you have to coordinate between the microservice teams.

And that's actually exactly what we don't want to do with microservices. That's the slide - I learned that metaphor from Eric Evans, the guy who wrote the DDD book. It's a three-legged race; it means we bind teams together by their feet, right? If they have to coordinate, I bind them together. And what you can see is they run slower than if they would run on their own, and the risk of falling down is much higher. With microservices, we want to cut the tape that binds them together, but that's not happening here. We tape them together again, because we have to coordinate.

This is not really in the spirit of microservices, from my perspective. I even used this picture a lot recently. In a lot of talks, whenever they talk about event-driven systems and choreography and these kinds of things, they use that metaphor - I never totally understood why - of a dance, where you don't have a central orchestra, a conductor, but these professionals just dance because they know how to dance. And you can add another professional and it will be a beautiful dance, because they know how to behave in that kind of dance. But if I look at the companies really adopting it, it's, from my perspective, a bit different. It's probably still a dance, but it's hard to manage sometimes. It's hard to understand what's going on. It's probably still fun, but maybe also a bit painful. So, I see a lot of risk involved there.

We did a recent survey - just to give you some numbers; it's probably not the most credible survey out there - and the number one challenge people saw with microservices was the lack of visibility into end-to-end business processes and workflows spanning multiple services. That was one of the top challenges the people we asked saw.

Extract the End-To-End Responsibility

What can you do about that? In this example, what I would do is extract the end-to-end responsibility of fulfilling an order into its own microservice. For me, in that example, that's totally natural to do. I mean, if you look at companies like Zalando, that's the main purpose why they are there: they want to fulfill orders. It's kind of naive to think there will not be a microservice doing that, because there are a lot of people who care about SLAs, who care about whether fulfillment is efficient, or who care about whether any order is stuck or taking too long or whatever. There are a lot of questions people ask about the orders. So, it makes a lot of sense to have this as its own microservice in that example.

And then you can change the picture of how the communication goes. You would say, "Probably that's still event-driven - somebody pushed the button." And then you have the order microservice which can listen to that - that's still an event subscription, that doesn't change - but then you could change to using commands in order to basically command the payment service to do something for you. It's not event-driven. It's commanding somebody: "Hey, I want you to collect payment for me." And then probably that raises another event when it's done, like payment received, and it can go on. And that's an important concept. Actually, if you look at DDD for example, domain events are there. Domain commands are not a concept there, and I think they should [be]. For me, that's an important thing in order to get that going. And so on and so forth: probably the order service then commands the inventory and the shipment, and then you have a single point where you can control the sequence of things.

And for me, that's not an evil thing. Every time you have two services communicating, you will have some kind of coupling; there's no way around that. It either means that you're coupling on the receiving side, because then the receiver knows which event it's listening to and which data will be in that event, or, if the order sends out a command, you're coupling on the sending side, because the sending side knows which command to send and which data to put there.
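The distinction can be made concrete with two message shapes. The type and field names here are invented for illustration; the point is where the knowledge, and therefore the coupling, sits:

```java
// Hypothetical sketch of the two message kinds. With an event, the sender
// has no idea who listens; with a command, the receiver has no idea who sent it.
public class Messages {

    // Event: a fact about the past, broadcast by its source.
    // Coupling sits on the receiving side - payment chose to subscribe.
    record OrderPlacedEvent(String orderId, long amountCents) {}

    // Command: an intent, addressed to one logical receiver.
    // Coupling sits on the sending side - order chose to command payment.
    record CollectPaymentCommand(String orderId, long amountCents) {}

    // Both can travel over the same channel (HTTP, Kafka, AMQP, ...);
    // the distinction is conceptual, not a question of protocol.
    static String describe(Object message) {
        if (message instanceof OrderPlacedEvent e) {
            return "fact: order " + e.orderId() + " was placed";
        }
        if (message instanceof CollectPaymentCommand c) {
            return "intent: collect " + c.amountCents()
                 + " cents for order " + c.orderId();
        }
        return "unknown message";
    }
}
```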


And in the first case, the checkout has no idea who picks up the event; in the second case, the payment has no idea who commanded it, which is much more natural for me. If you think of payment, why should it know about orders? It shouldn't. If you think that through and you're building something like Stripe, collecting credit card charges for your customers, being event-driven and listening to order placed would probably mean that every time somebody uses your service, you have to redeploy. That's kind of weird thinking, I think. So, for me, it's quite natural to have these kinds of commands there.

Also, when I talk about that, a lot of people think, "Hey, but you're proposing some kind of REST-ish thing there. We don't want to do REST. We want to do asynchronous messaging." Okay, but it's not about the communication channel. You can send events over messages; you can send commands over messages. For me, that's totally fine. It's just a different concept. And what I actually see a lot in real-life architectures is that people are doing commands, but in disguise, because everything has to be an event. And then you have something like a "payment required" event. That's a command - "do it for me." It's not an event. That's kind of weird. But if that works for you, I'm happy with that, as long as you're aware that it's actually a command.

Smart ESB-like Middleware

I think commands are an important concept to keep in mind, and I'm pretty much aligned with CQRS thinking there as well, so I think that's not too controversial now. And for me, commands are orchestration, because a command tells somebody to do something - that's orchestration. As soon as I say something about orchestration, it's normally, "Oh no. We don't want to do orchestration. We are loosely coupled." I think that comes from thinking back in the, let's say, old SOA days, when orchestration was often thought of in a way like this: you have these services and then you have the ESB, the bus, and you do the orchestration in the middleware, in the infrastructure, in the bus, because that's where these ESB and BPM tools lived, and there you did the orchestration. In microservice thinking, that's a fundamentally wrong approach, because then you have a central tool somewhere. You probably always have to redeploy the services on the left-hand side together with the orchestration flow. That's a three-legged race again. So, there are a lot of reasons why this doesn't work.

But this is not what I had on the slide. You probably also know this famous thing from Martin Fowler where he said no, we don't want ESBs, those are smart pipes; we want dumb pipes and smart endpoints. And that's very much agreed on nowadays. What I did is I didn't say the orchestration should be part of the infrastructure. The orchestration should be part of one of your microservices, right? It's still in a microservice. So you can still do it in whatever code you like; that doesn't matter. But as soon as you command somebody, for me, it's orchestration, and for me that makes a lot of sense and totally fits into that picture.

God Services

If you read further on the microservice journey, you probably come across the microservices book from Sam Newman at O'Reilly, where in some chapter he writes, "If you do orchestration, you have the risk that you end up with a few smart god services that tell anemic CRUD services what to do." That's a risk. If I visualize it like this, the order sucks in all the logic, and all the other services are not doing anything anymore. Well, I understand why this is a risk, and you should definitely watch out to avoid it, but it's not what happens automatically if you do orchestration. I think that happens if you do bad API design. Then you can get these god services.

And I want to make an example of that as well, in order to make the point. Let's assume order and payment - you already know them - and we send that command: "Hey, payment, collect money for me." Payment doesn't do that on its own, because it's probably a credit card charge, so it has to ask a credit card service in order to do it - probably something like a SaaS service somewhere - waits for the response, and then tells the order back what happened. Relatively easy to understand. What could happen is what I said earlier: you press that Dash button, so it takes the credit card saved in your account. It might be expired. The credit card might get rejected. What a lot of services and a lot of developers do in this kind of situation is just pass that problem on to the order service: "Hey, the credit card was rejected." And now you start leaking domain concepts of payment into order, because order shouldn't know about credit cards. It shouldn't. Especially if you get new requirements, like: "He pressed that Dash button. He wants to give us money. So, let him do that. If the credit card expired, send him an email and ask him to update his credit card." That's what GitHub and other tools are doing. If they want to recharge the yearly fee and your credit card is expired, they send you an email: "Hey, you have two weeks. Update your credit card. Then we're good."

And if you want to do these kinds of things and you have this architecture, normally you start doing that in the order service, probably because you have some kind of orchestration tooling or workflows there - and then you get a god service. But that's a bad API of the payment service. What you should do instead is ask yourself who is responsible for that problem, and in this case, it's clearly the payment service. It handles credit cards. So, I would do an architecture design where the payment service handles that and ultimately sends out a response, like payment received or failed. Nothing about credit cards, nothing. Either received or failed. But if we have these requirements about waiting two weeks or whatever, it normally means that this service has to be able to be long-running - what I call long-running - because now you have to wait, for example, for two weeks in order to get it going. But this gives you a much cleaner API, and it avoids these kinds of god services quite a bit.
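The "clean API" version of payment can be sketched like this; the interface, method names and outcomes are assumptions for illustration, not a real payment API:

```java
// Hypothetical sketch of the cleaner payment API: credit card problems are
// handled inside the payment service, and the order service only ever sees
// "received" or "failed" - never anything about credit cards.
public class PaymentService {

    enum Outcome { PAYMENT_RECEIVED, PAYMENT_FAILED }

    // Stand-in for the external credit card SaaS the talk mentions.
    interface CreditCardGateway {
        boolean charge(String customer, long amountCents);
    }

    private final CreditCardGateway gateway;

    public PaymentService(CreditCardGateway gateway) {
        this.gateway = gateway;
    }

    public Outcome collect(String customer, long amountCents) {
        if (gateway.charge(customer, amountCents)) {
            return Outcome.PAYMENT_RECEIVED;
        }
        // In the real service this is where the long-running part starts:
        // email the customer, wait up to two weeks for new card details,
        // retry - and only then answer. The order never learns why it failed.
        return Outcome.PAYMENT_FAILED;
    }
}
```

Because the rejected-card handling stays inside payment, the order service keeps a single orchestration concern and does not turn into the god service described above.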

Handling State

The problem I see with that - or the problem I see teams struggling with as soon as I say it - is they say, "Yes, but payment should be stateless." Stateless services were a good place to be. That was fun to code, but if you want to be long-running and need state, that's kind of ugly; I don't want to do that. If you look at state handling - it's 2019 - there are ways of handling state. The most naive thing is what I always call "persist things", because it could be many things. It could be an entity, a database table; it could be a persistent actor; it could be a document, whatever you like. So there are ways of persisting things. The problem is that this very often involves additional things you have to do, and projects understand that really quickly, because they have to do, for example, some scheduling - you have to recheck if the two weeks have already expired. You have to do versioning, which I mentioned earlier; something is always going on, so you have to think about that. You probably have to operate it at scale. You want to have operations tooling where you can look into problems. So, there are a lot of things you probably get subsequently.
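The do-it-yourself variant looks roughly like this - a sketch under the talk's assumptions, with invented names, where the pending payment is stored with a deadline and a scheduler you also have to build periodically checks it:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "persist it yourself" approach: a pending
// payment is stored with a deadline, and your own scheduler periodically
// checks whether the two weeks are up - the extra machinery (scheduling,
// versioning, operations tooling) a workflow engine would otherwise give you.
public class PendingPayments {

    record Pending(String orderId, Instant deadline) {}

    private final List<Pending> pending = new ArrayList<>();

    public void waitForNewCard(String orderId, Instant now) {
        // Would be an INSERT into a table, a persistent actor, or a document.
        pending.add(new Pending(orderId, now.plus(Duration.ofDays(14))));
    }

    // Run by your own scheduler (cron job, @Scheduled, ...): returns the
    // orders whose waiting period has expired so the payment can fail.
    public List<String> expire(Instant now) {
        List<String> expired = new ArrayList<>();
        pending.removeIf(p -> {
            if (!now.isBefore(p.deadline())) {
                expired.add(p.orderId());
                return true;
            }
            return false;
        });
        return expired;
    }
}
```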

Workflow Engines

That's, from my perspective, not the best way of doing it. There are these other types of tools which are either called state machines or, nowadays most often, workflow engines. And as soon as I propose that in a project, people normally go, "No, workflow engines - that's crap. That's from the past. They're really complex, these three-letter-acronym vendors, big boxes, hard to understand, very proprietary. We don't want that." And that's not true anymore, and for me, that's also an important message. So yes, there are these kinds of tools, very often marketed in a way where you say, "Low code - we don't need all of you. We don't need developers in order to do the workflows." That's crap, understandably. It doesn't work. In my experience, that doesn't work at all, except probably for very simple workflows. I always call that death by properties panel, because you have to click through a lot of property panels in order to do anything. You cannot just copy and paste something from GitHub; you have to watch a 10-minute YouTube video instead. That's crappy, so don't do that. Definitely not!

But there are different tools emerging, and actually a lot of them. If you look at the cloud space, there are things like AWS Step Functions, Azure Durable Functions, and Google Cloud Composer, and the CNCF, the Cloud Native Computing Foundation, has its own workflow subgroup. So, there is a lot going on around workflow. If you look at the Silicon Valley stacks, everybody built something there - Uber's Cadence, Netflix's Conductor, Airbnb's Apache Airflow. All of these kinds of tools are emerging at the moment. If you look at the open-source space, there are quite a lot of what I would call lightweight workflow engines out there which you can easily leverage. Most of the tools here, except the AWS one, are open source, so you can easily get started with them. What do I mean by lightweight? I don't want to go into any details. If you want to see some real code or run some real code, approach me later on; I have my laptop with me. I'm happy to show whatever you want to see.

For today, just to give you an idea - and I use Java here, but you could use any language - what you can do there, for example, is bootstrap a workflow engine in one line of code with a default configuration, and you can define a workflow in code if you prefer. What happens in the background is that the real graphical model is generated automatically, and you could also model graphically, if you prefer. But I often have the impression that developers like the code first because they understand what's hidden in there. And when they see graphical models, they're more like, "Oh, there's some magic behind them." I'm a bit afraid of that. Aren't you?

But you can start with a DSL. And that's BPMN - by the way, it's an ISO standard, and a lot of these tools implement it. I totally love the language, so I can recommend at least looking at it. The adoption worldwide is quite good. And then you can attach code - in this case, Java code - which is executed when the workflow runs through a certain step, and then I can start instances of that. And this slide, this code, you can copy and paste into your Eclipse. If you're in Java, you can directly run it. That's it! Now you have a state machine. Now you can do these kinds of long-running things in the service. You can have something like charging the credit card. If that doesn't work, I probably send out some information to the customer, and then I wait for them to respond, and now I have the notion of time. This is persistent; I can wait for seven days, I can do something else. It's very easy to implement these kinds of things.
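The slides with the real engine code aren't in this transcript, so here is a minimal hand-rolled sketch of the core idea - a persistent-state payment flow as a state machine. All names (`PaymentFlow`, the `State` values) are hypothetical; a real workflow engine would persist each transition and handle the seven-day timer for you.

```java
// Hypothetical sketch: the "charge card, else ask customer and wait" flow
// as a plain state machine. A real engine persists state between steps.
public class PaymentFlow {
    public enum State { NEW, CHARGING, WAITING_FOR_CUSTOMER, COMPLETED, FAILED }

    private State state = State.NEW;

    // Try to charge; on failure, inform the customer and wait (possibly days).
    public State charge(boolean cardOk) {
        state = State.CHARGING;
        state = cardOk ? State.COMPLETED : State.WAITING_FOR_CUSTOMER;
        return state;
    }

    // The customer eventually responds with updated payment details (or not).
    public State customerResponded(boolean newCardOk) {
        if (state != State.WAITING_FOR_CUSTOMER) {
            throw new IllegalStateException("not waiting for the customer");
        }
        state = newCardOk ? State.COMPLETED : State.FAILED;
        return state;
    }

    public State state() { return state; }
}
```

The point of the real tooling is that `WAITING_FOR_CUSTOMER` can safely last seven days because the instance lives in a database, not in memory.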

Distributed Systems

Now, we're in distributed systems nowadays. And actually, I'm quite happy to be on Jonas' track today, because I stole that metaphor from him; I learned it from him the first time. It's a metaphor for distributed systems. You have these different components out there, which might be the small hut there. That might be one Java program. You have ACID transactions there, you have threading under control, you're in one environment, you pretty much know what's going on there. But when you open the door, you face that rough ocean - that's the network - and incredibly bad things will happen with the network. Peter Deutsch, "Fallacies of Distributed Computing": the first fallacy is that the network is reliable. It's obviously not.

And because of this, we also face a lot of new problems. Just to make an easy example here: if I do that payment and credit card thing, and let's assume I call a REST API here, what can always happen is that this is not available. It's the network - it might really be the internet. So, it might not be there; it might be a network hiccup, it might be a downtime, whatever. You should plan for this. And that's the same reasoning I did a minute ago, because the first, naive reflex of most developers is, "Oh, there's a network exception. I'll just throw that to my client." And then you throw it to the order service, and the order service should handle it, and the order service grows bigger and becomes a god service. That's a bad idea. This is clearly about handling the credit card communication; it should be part of the payment service. But in order to do that, it probably has to be long-running, especially if you really want to survive outages or downtimes, because then you can retry for minutes or probably even hours. It's order fulfillment - it doesn't have to be done in microseconds. You have minutes or probably even hours to solve this kind of thing. So that's also long-running.
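The "retry inside the payment service instead of throwing to the client" idea can be sketched in a few lines. This is a hypothetical helper, not any engine's API; a real workflow engine would persist the retry state instead of sleeping in a thread, which is what makes hour-long retries practical.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: keep retrying a flaky remote call (e.g. the credit
// card REST API) inside the payment service, instead of leaking the
// network exception to the order service.
public class Retry {
    public static <T> T withRetries(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                  // hiccup or downtime: try again later
                Thread.sleep(sleepMillis); // an engine persists state instead of sleeping
            }
        }
        throw last; // give up -> now raise a "payment failed" event to the order service
    }
}
```

Only after the retries are exhausted does the payment service publish a business-level failure, keeping its API clean.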

You can make the same case with messaging, by the way. If you send out a message and never get a response, or probably get a late response, you have to take care of that. You have to persist it, you have to wait for it. And that's relatively easy to do in this kind of tool. These are all examples of long-running. And there's even one more - and that's my favorite example, actually, because a lot of people are not thinking about it, which is surprising. If you look at distributed systems, there's one characteristic you cannot avoid. Let's assume you do a REST call and you get a network exception. You have no idea what just happened. It could be that the network was broken while you tried to reach the service provider - you never reached it, it was broken before. Or you probably reached the service provider, it started to do something for you, and it exploded. Did it commit the transaction or not? You have no idea. Did it do everything correctly and the response got lost? That could happen as well.
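Because a network exception tells you nothing about whether the provider did its work, retries must be safe to repeat. One common way to get that (my addition here, not something named on the slides) is an idempotency key, sketched below with hypothetical names; the client generates the key once and reuses it on every retry, so the provider can deduplicate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the provider side: an idempotency key lets the
// payment provider recognize "I already did this" on a retried request.
public class ChargeService {
    private final Map<String, String> processed = new HashMap<>();
    private int chargesExecuted = 0;

    // The caller generates the key once and reuses it across all retries.
    public String charge(String idempotencyKey, long amountCents) {
        if (processed.containsKey(idempotencyKey)) {
            return processed.get(idempotencyKey); // duplicate: return the same result
        }
        chargesExecuted++; // the card is actually charged only once
        String result = "charged-" + amountCents;
        processed.put(idempotencyKey, result);
        return result;
    }

    public int chargesExecuted() { return chargesExecuted; }
}
```

With this in place, "I got a network exception, so I retry" is safe even when the first attempt actually succeeded and only the response got lost.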

So, whenever you do that call and get a network exception, or send out a message and never get a response, you have no idea what happened - something could have been done or not - and that's something you have to think about. Especially if you think about payments, you have to think about it. It could be something like, "I tried to charge the credit card. I even did retries for an hour, but I gave up, and then I raised a payment-failed event. I say, order, we cannot do that." But then I have to think about, for example, a cancellation or a refund or whatever the strategy is - but I have to think about it. And again, it has to be persistent, because if the service was not really reliable when I tried to call it the first time, chances are low that it's available when I do the cancellation. So this is, again, long-running.

Workflows inside Service Boundaries

And then I think you get a much better and a much cleaner API for the payment service, and that will definitely avoid these kinds of god services. That's, for me, an important way of thinking. And what's connected to that very often - that's also an important thought - is, if I have this kind of workflow now as part of the payment service, it's an implementation detail of the payment service. It's just its internal logic. You might leverage some tool like a workflow engine in order to make it easier to implement, but you might also decide not to. You can store it in the database, or use persistent actors, or saga implementations from vendors, or whatever. So, you can do that however you want - it's an implementation detail. You don't see it from the outside, you don't see it in the API.

And you may have other services that also have this kind of logic - like, for example, the order fulfillment, which is, let's say, the end-to-end capability, and which also happens to have a workflow. But it's a different workflow; it has no details about how I really do the payment. It just says, "I command a payment, and now I wait for it to finish." Not much more. So, you really separate these things. Now, to what I call the BPM monolith - I blogged about that, because that's what most people have in mind when I say something like workflow: "Oh no, you want to do this kind of huge thing which cares about everything, where it says, 'If the charging doesn't go through, I ask the customer to update a lot of things,' where in one workflow model you mix all the different bounded contexts." And obviously that's a bad idea. Then you end up with a three-legged race again: you have one model where you don't have one clear owner. That doesn't make sense. So, you have to have models which have one clear owner, rooted in one microservice, if you do this kind of architecture. If you do a monolith, that might be good, by the way. We start to see a trend back towards appreciating monoliths, and if that's what you're doing, you're definitely fine with it; there's no problem in doing that. But if you do microservices, that's wrong. That's the important thing here.

Life beyond Distributed Transactions

One last thought on that. I had a talk yesterday on this; I called it "Lost in Transaction." What we see is that we have a lot of problems nowadays because we don't have these ACID transactions anymore. If we have two components which are communicating remotely in a distributed system, we cannot simply use a transaction manager in order to commit or roll back. If you don't believe me, there's the paper to read in order to understand why. It's a really good paper - I'm a big Pat Helland fan. And there are basically two reasons. The first is that it doesn't scale. If you have a bigger system, it doesn't scale, because these kinds of transaction managers are a bottleneck. And the second is that it's too complex to understand - developers don't understand it, operations don't understand it - so you will never really master it in production anyway. So, don't use these kinds of distributed transactions. And he puts it as: "Grown-ups don't use distributed transactions."

But we still have that requirement of doing things where we say, "This should be all or nothing." This is a lot about where the mindset now shifts towards eventual consistency. And one way - I'm just scratching the surface here - one way of implementing it is also known as the saga pattern. So, what I could do is extend the payment a bit and say, "The first thing is, I look at the customer account to see whether there is credit, a voucher, or something on the account. If yes, I deduct it from there, and only if that's not sufficient, then I charge the credit card." But these are two different microservices I use here, or two different services. If this goes wrong, I cannot roll back - I already did that first thing. So, what I do instead is compensating logic, an undo. It's a kind of business rollback, and therefore I have to define it. And BPMN, by the way, can express that: I say, "If this activity was successfully executed in the past, please call this undo method." Workflow engines which understand BPMN can do that out of the box, and that's also quite powerful. I'm not sure if it's fair to give homework at QCon, but I thought it's a good idea to start that: try to do this with a pure event-driven approach with the services I sketched. I did that exercise once, and it's really hard, because it's kind of complex logic. But if you send me a good example, I'll raffle something off, whatever, I'll make something up. No worries, send me an email.
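The core of the saga idea can be sketched in plain Java - run the steps in order, and if one fails, run the undo of every previously completed step in reverse order. A BPMN engine gives you this via compensation events; the names here (`Saga`, `Step`) are hypothetical and only illustrate the mechanism.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the saga pattern: forward steps plus a
// "business rollback" that compensates completed steps in reverse order.
public class Saga {
    public interface Step {
        void run();        // the forward action, e.g. "deduct voucher"
        void compensate(); // its undo, e.g. "re-credit voucher"
    }

    // Returns true if all steps succeeded, false if it had to compensate.
    public static boolean execute(List<Step> steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.run();
                done.push(step); // remember for potential compensation
            } catch (RuntimeException e) {
                while (!done.isEmpty()) {
                    done.pop().compensate(); // undo in reverse order
                }
                return false;
            }
        }
        return true;
    }
}
```

In the talk's example, "deduct from customer account" succeeds, "charge credit card" fails, and the deduction gets compensated - exactly the case a pure event-driven implementation makes hard to follow.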


I started with monitoring and managing workflows - getting insights into them. I call that BizDevOps, and that has started to be a thing as well, so you find BizDevOps back there too. What I talked about so far is a lot about, "Yes, of course, I have this workflow engine as a state machine, as a help for the developer to do something," and for me, that's a lot about living documentation. Dan North can talk about that for ages as well, all these kinds of things. So yes, that's true, but that's one part of the story. The other part is that these are living documentation that is really executed - probably test cases too. These graphical models are really code. And that's what a lot of people really miss in this emerging-behavior thing: this is not a picture in PowerPoint, it's code.

But it's also about operations, and that, for me, is one of the important things. If something fails during runtime - some service call that's faulty, some bug, whatever - you have a clear understanding. The tools might look different depending on the product you use, but all the products have something like this, where you can look into it and see, "There are failed instances. There is a problem." It might not only be 317; it could be 200,000, probably. You look into it, you can investigate the problem, you can probably restart from there. You can understand it, probably without even involving a core developer, and that's an important thing. And you get a lot of statistical data, a lot of insight into what's going on. You can combine that - I'll come back to that - but for me, that's an important point. You have to balance it: you have parts of the system where you want to have this orchestration in the services, and you have parts of the system which really are choreographed by pure event notifications. That's about BizDevOps.


Coming back (I had that at the very beginning) to that monitoring and managing article, there's one last thought in there which I skipped at the very beginning: I call it tracking. That's actually an interesting approach; we did that at a couple of customers recently. Assume you have an event-driven system - let's say you have some kind of event bus, for example Kafka or RabbitMQ or whatever it is - and you have these services sending all of their events. What you could do as a first step is attach something like a workflow engine in order to have a tracking workflow. This doesn't control anything, it doesn't steer anything, it just listens. But whenever you have some order-placed event, you start an instance, and then you wait for the other events to happen. This already gives you an idea whether the system really behaves like you modeled it. First of all, you get all the tooling from these vendors, where you can see something like statistics, operations, this cockpit thing I just showed. You already get a lot of support for managing the SLA, so that's a good start.
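A pure tracking workflow is little more than event correlation. Here is a hypothetical, minimal sketch - `OrderTracker` and the event names are my own invention, not any vendor's API - showing the shape of it: start an instance on the order-placed event, advance it on every later event, control nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a tracking workflow: it only listens to events
// from the bus (e.g. Kafka or RabbitMQ) and records how far each order got.
public class OrderTracker {
    private final Map<String, String> lastEventPerOrder = new HashMap<>();

    // Called for every event consumed from the bus.
    public void onEvent(String orderId, String eventType) {
        if ("OrderPlaced".equals(eventType)) {
            lastEventPerOrder.put(orderId, eventType); // start a tracking instance
        } else if (lastEventPerOrder.containsKey(orderId)) {
            lastEventPerOrder.put(orderId, eventType); // advance the instance
        }
        // Event for an unknown order: the system is not behaving as modeled,
        // which is exactly the kind of thing worth alerting on.
    }

    public String statusOf(String orderId) {
        return lastEventPerOrder.getOrDefault(orderId, "unknown");
    }
}
```

Once this passive view exists, adding SLA timers and then real orchestration, step by step, is the migration path described next.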

And what you can easily do then is start on a journey towards using more orchestration. That's what customers are currently doing: those that started with this pure event-driven, choreographed approach now move towards more orchestration. For example, they just add these so-called timers in BPMN, so they can look at the SLA and say, "This takes too long, I'll do something else," and whatever. And now this tracking approach starts to get active, starts to use a bit of orchestration, and then you can extend that step by step.

There's a good case study. If you fancy that, there's a YouTube link - a 30-minute video where Vodafone basically talks about what they did, and they did more or less the same thing. They had a system which was completely connected under the hood, which was hard to manage, and they started with a pure tracking workflow in order to see what's going on. Then they migrated, step by step, to an orchestration flow to a certain extent, and removed the links under the cover. And that way, they got to kind of an orchestration approach. So that's a good source to learn from as well.

I actually love to pull up a couple of quotes, because that's what I see out there. 24 Hour Fitness, the biggest fitness company in the U.S., did the same thing: they started with a pure messaging approach and said, more or less, "Before mapping processes explicitly, we had no idea what's going on." And that's a common pattern; you see it again and again. And I have, for example, a good quote - there's a link - from Josh Wulf. He wrote a Medium post where he said that introducing some kind of workflow addressed the core issue in a distributed system: where is the source of truth for the coordination of these services? And the system they were replacing used a complex peer-to-peer choreography that required reasoning across multiple codebases to understand. That's the problem we should solve there.

So, I think workflow engines are an important piece of the puzzle. Look for lightweight solutions; don't use the old-school big-vendor stuff there. That's basically, by the way, to go back to the beginning, exactly what Zalando does. They have a kind of Kafka layer underneath, and they're using this kind of workflow - a Camunda engine - on top of that in order to do order fulfillment.


If you want to see more code - I'm actually a code guy, so I hate doing only slides today - you have the full example on GitHub, in code. You can run it directly, or you can approach me later on and I can show you what it really means in code. If you look at my GitHub account, you can find it; it basically has all the services I sketched on the slides. It uses Kafka underneath, but it's basically a Spring example - Spring Cloud - so it could be Rabbit, it could be anything. You can easily exchange things and really look at what it means to get going.


To wrap it up: events decrease coupling. Yes - sometimes. There are good use cases for event chains, so I think that's definitely something you should have in your toolbox. But I would avoid these complex peer-to-peer event chains, because they are hard to understand - emerging behavior - and hard to change. So that's a risk you have there, and you should be totally aware of it.

Orchestration needs to be avoided? Only sometimes. I mean, that's the same thing from another perspective. Yes, there are use cases where you want to be event-driven or choreographed. And you definitely don't want to use an ESB if you're doing microservices at a big scale. Smart endpoints, dumb pipes are a good idea, but you have to balance orchestration and choreography. Orchestration starts with the commanding of others from within the microservice. And that makes a lot of sense, actually.

And then, if you have long-running requirements, use tools for that, like a workflow engine. And I tried to make the case that if you're able to make services long-running, you get a better API out of it. You can have the ownership better distributed, the responsibilities better distributed. I think that's important - to have something in your toolbox for that. That's all I have. Thank you very much.




Recorded at:

Apr 03, 2019