Key Takeaways
- Recent discussions around the microservice architectural style have promoted the idea of event-driven architectures to effectively decouple your services.
- Domain events are great for decentralised data management, generating read models or tackling cross-cutting concerns. However, you should not implement complex peer-to-peer event chains. Using commands to coordinate other services will reduce your coupling even further.
- Centrally managed ESBs don’t fit into a microservices architecture. Smart endpoints and dumb pipes are preferable. However, don’t dismiss coordinating services for fear of introducing central control: important business capabilities need a home.
- In the past, BPM and workflow engines were very vendor-driven, and so there are many horrible “zero-code” tools in the market scaring away developers. However, lightweight and easy-to-use frameworks now exist, and many of them are open source.
- Do not invest time in writing your own state machines, but instead leverage existing workflow tools. This will help you to avoid accidental complexity.
If you have been following the recent discussions around the microservice architectural style, you may have heard this advice: “To effectively decouple your (micro)services you have to create an event-driven architecture”.
The idea is backed by the Domain-Driven Design (DDD) community, which provides the nuts and bolts for leveraging domain events and shows how they change the way we think about systems.
Although we are generally supportive of event orientation, we asked ourselves what risks arise if we use them without further reflection. To answer this question we reviewed three common hypotheses:
- Events decrease coupling
- Central control needs to be avoided
- Workflow engines are painful (this might seem unrelated, but we will show the relevant connection later on)
We have previously presented the results of our review in a talk delivered at muCon London 2017 (slides and recording available), O’Reilly Software Architecture London (slides and recording) and KanDDDinsky (slides and recording).
The talk is based on experiences in different real-life projects, but uses a fictional retail example motivated by the order fulfillment process of Zalando (as shared in this meetup).
We assume we have four bounded contexts resulting in four dedicated services (which might be microservices, self-contained systems or some other form of service): Checkout, Payment, Inventory and Shipment.
How to decouple using events
Let’s assume the Checkout service should give feedback to the user if an item is in stock and can be shipped right away. The Checkout service can ask the Inventory service about the amount in stock using request/response, but this couples Checkout to Inventory (in terms of availability, response times, etc.).
An alternative approach is that Inventory publishes any change to the amount of in-stock items as a broadcast event to let the world know about it. Checkout can listen to these events and save the current amount in stock internally, and then answer related questions locally. The information is obviously a copy and might not be absolutely consistent. However, some degree of eventual consistency is typically sufficient and a necessary tradeoff in distributed systems.
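To make this concrete, here is a minimal sketch of how Checkout could maintain such a local copy using Spring Kafka. The topic name and event shape are our assumptions for illustration, not taken from the sample application discussed later:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Assumed shape of the event broadcast by Inventory.
record StockLevelChangedEvent(String articleId, int amountInStock) {}

@Component
class LocalStockLevels {

    // Local, eventually consistent copy of the stock levels.
    private final Map<String, Integer> stockByArticle = new ConcurrentHashMap<>();

    // Assumes a JSON deserializer is configured for this topic.
    @KafkaListener(topics = "inventory-events")
    public void on(StockLevelChangedEvent event) {
        stockByArticle.put(event.articleId(), event.amountInStock());
    }

    // Checkout answers "is it in stock?" locally, without calling Inventory.
    public boolean isInStock(String articleId) {
        return stockByArticle.getOrDefault(articleId, 0) > 0;
    }
}
```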
Another use case is to hook in non-core functionality like cross-cutting concerns. Consider that you want to send out customer notifications for important steps in your order fulfillment. A Notification service could be implemented in a completely autonomous manner and store notification preferences and contact data for customers. It can then send customer emails on certain events like “Payment Received” or “Goods Shipped” without any change required in other services. This makes event-driven architectures (EDA) very flexible and it can become simple to add new services or extend existing ones.
The risk of peer-to-peer event chains
Once teams get started with event-driven architectures, they often become obsessed with events - events provide amazing decoupling, so let’s use them for everything! Problems start to arise when you implement end-to-end flows like the order fulfillment business process via peer-to-peer event chains. Let’s assume a rather trivial flow. It could be implemented in a way that the next service in the chain always knows when it has to do something.
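To make the chain concrete, here is a hypothetical sketch of one link in it; event and topic names are assumed for the example:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Hypothetical events, assumed for illustration.
record OrderPlacedEvent(String orderId) {}
record PaymentReceivedEvent(String orderId) {}

// Inside the Payment service: the trigger ("act after an order is placed")
// is hard-coded here, so the overall sequence lives nowhere explicitly.
@Component
class PaymentEventChainStep {

    private final KafkaTemplate<String, Object> kafka;

    PaymentEventChainStep(KafkaTemplate<String, Object> kafka) {
        this.kafka = kafka;
    }

    @KafkaListener(topics = "order-events")
    public void on(OrderPlacedEvent event) {
        // ... charge the credit card (omitted) ...
        // Emit the next event; Inventory subscribes to it in the same style.
        kafka.send("payment-events", new PaymentReceivedEvent(event.orderId()));
    }
}
```

Inventory, Shipment and so on each subscribe to the previous service’s event in the same style, which is exactly how the knowledge of the sequence gets smeared across all participants.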
This works. However, the problem is that no one ever has a clear overview, making the flow hard to understand and - more importantly - hard to change. Additionally, keep in mind that realistic flows are not that simple, and typically involve many more services. We have seen microservice landscapes where this leads to a situation where a complex system of services does something, but nobody really knows what exactly, or how.
Now, think of how we would implement a simple change request to fetch the goods before we do the payment. You have to adjust and redeploy several services for a simple change in the sequence of steps. This is generally an anti-pattern within the context of microservices, as the core design principle of this architectural style is to strive for less coupling and more service autonomy. Accordingly, we advise you to think twice before using events to implement peer-to-peer flows, especially when expecting considerable complexity.
Commands, but without the need for central control
A more sensible approach to tackle this flow is to implement it in a dedicated service. This service can act as a coordinator, and send commands to the others -- for example, to initiate the payment. This is often a more natural approach: we would generally not consider it good design if the Payment service had to know about all of its consumers by subscribing to the manifold business events that trigger payment retrieval. The following scenario avoids this coupling, and could be described as an orchestrated approach: “Order orchestrates Payment, Inventory and Shipment to fulfill the business for the customer.”
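A minimal sketch of such a coordinator, again with assumed topic names and message shapes; the command name follows the article’s example:

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Hypothetical command message.
record RetrievePaymentCommand(String orderId, long amountInCents) {}

// Inside the Order service: the sequence of steps lives here, and Payment
// no longer needs to know who its callers are.
@Component
class OrderFulfillment {

    private final KafkaTemplate<String, Object> kafka;

    OrderFulfillment(KafkaTemplate<String, Object> kafka) {
        this.kafka = kafka;
    }

    public void startFor(String orderId, long amountInCents) {
        // First step of the flow: tell Payment what to do ...
        kafka.send("payment-commands",
                new RetrievePaymentCommand(orderId, amountInCents));
        // ... and continue with inventory and shipment as the corresponding
        // events come back, all coordinated from this one place.
    }
}
```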
However, as soon as we speak about orchestration, some people think of “magical” Enterprise Service Buses (ESBs) or centralised Business Process Management (BPM) solutions. There were a lot of bad experiences with related tools in the past, as this often meant you had to give up easy testability or deployment automation for complex, proprietary tooling. James Lewis and Martin Fowler laid down some of the foundations for microservices architectures when they suggested using “smart endpoints and dumb pipes” instead.
However, the picture above doesn’t suggest a smart pipe. The orchestrating service handles the order fulfillment as a first-class citizen and implements it in a dedicated service: Order. Such services can be implemented in any way you like, with any technology stack you like. The difference is simply that you now have a dedicated place where you can understand the flow, and change it by changing one service only.
Sam Newman describes another risk of such an orchestrating Order service in his book Building Microservices -- over time this can develop into a “god service” sucking in all business logic while others degrade into “anemic” services, or in an even worse case become CRUD-like “Entity” services. Does this happen because we sometimes prefer commands over events? Or is it because of orchestration? No. Let’s quickly revisit Martin Fowler’s “smart endpoints”. What constitutes a smart endpoint? It’s about good API design. For the Payment service you could design an effective coarse-grained API that can act on the Retrieve Payment command and emit either a Payment Received or Payment Failed event. No internals of payment handling like customer credits or credit card failures are exposed. In this case, the service doesn’t get anemic just because it’s orchestrated (or to put it more simply, “used”) in some other context.
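A sketch of what such a coarse-grained endpoint could look like on the wire; the message shapes and topics are assumptions for illustration:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Assumed message shapes.
record RetrievePaymentCommand(String orderId, long amountInCents) {}
record PaymentReceivedEvent(String orderId) {}
record PaymentFailedEvent(String orderId) {}

// One coarse-grained command in, one of two business events out; customer
// credits, credit cards and retries all stay hidden behind this API.
@Component
class PaymentEndpoint {

    private final KafkaTemplate<String, Object> kafka;

    PaymentEndpoint(KafkaTemplate<String, Object> kafka) {
        this.kafka = kafka;
    }

    @KafkaListener(topics = "payment-commands")
    public void on(RetrievePaymentCommand command) {
        boolean charged = chargeSomehow(command.orderId(), command.amountInCents());
        kafka.send("payment-events", charged
                ? new PaymentReceivedEvent(command.orderId())
                : new PaymentFailedEvent(command.orderId()));
    }

    private boolean chargeSomehow(String orderId, long amountInCents) {
        // Placeholder for the internal payment logic.
        return true;
    }
}
```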
Respect the potentially long running nature of services
In order to design smart endpoints and provide a valuable API to your clients, you must recognise that many services are potentially long running because they need to resolve business problems behind the scenes. Let’s assume that in case of an expired credit card we give the customer a chance to update it (taking inspiration from GitHub, where you have two weeks before they close your account after a payment failure). If the Payment service does not care about the details of waiting for the customer, it would push the responsibility for that requirement to its user, the Order service. However, it is much cleaner and more in line with the DDD idea of bounded contexts to keep that responsibility inside Payment. Waiting for a customer to provide new credit card details means that the payment could still either be retrieved or fail. Thus, the Payment API becomes very clear, and the service easy to use. However, in some cases it can now take two weeks before we get a business response. This is what we call a “long running” business flow.
Implementing persistent state for long running services
Long running services need to keep persistent state somehow. In our case we have to remember that the payment is not yet retrieved, but also that the order is not yet fulfilled and in a state of waiting for that payment. These states have to survive system restarts. Of course, handling persistent state is not a new problem, and there are two typical solutions for it:
- Implement your own persistent “thing”, e.g. an Entity or Persistent Actor
- Ask yourself whether you have ever built an order table with a column called status. There you are (a sketch follows this list)
- Leverage a state machine or workflow engine.
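For illustration, here is a minimal JPA sketch of the first option; the entity and field names are our own:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

// The classic "order table with a status column". It starts out harmless,
// but every allowed transition has to be enforced in application code, and
// timeouts, retries and reporting tend to grow a home-grown state machine
// around it, as described below.
@Entity
class PurchaseOrder {

    @Id
    private String orderId;

    // The hand-rolled persistent state, e.g. "PLACED",
    // "WAITING_FOR_PAYMENT", "FULFILLED".
    private String status;

    // Getters, setters and transition checks omitted.
}
```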
From our own experience we can report that implementing your own persistence mechanism for state handling often leads to home-grown state machines. This is because you often face subsequent requirements such as timeout handling (“hey, let’s add a scheduler to the game”), visibility and reporting (“why can’t the business folks just use SQL to query the information?”), or monitoring operations if something goes wrong (“hmmm”).
The reason it is so common to implement your own state machine is not only the “Not-Invented-Here” syndrome, but also preconceptions around workflow and old-fashioned BPM tools in the market. Many developers have had painful experiences with tools that were typically positioned as “zero-code”. Such tools were sold to business departments with the idea of getting rid of developers, which of course has not yet happened. Instead, the tools were handed over to IT departments and remained “alien” there. These tools were often heavyweight and proprietary, and developers experienced what we call “death-by-properties-panel”.
Lightweight state machines and workflow engines
Lightweight and flexible business process engines do exist, and can be used like every other library -- with very few lines of code. They position themselves not as “zero-code”, but as a tool in the developer’s toolbox. They solve the hard problems of state machines and the pay-off is often seen early in many projects.
Such tools allow you to define flows either graphically with the ISO-standard BPMN, or with other flow languages often based on JSON, YAML or language-dependent DSLs (e.g. in Java or Go). One important aspect is that the flow definitions are effectively source code, as they will be executed directly. Execution means that the state machine knows how to transition from one state to the other.
Mature flow languages like BPMN offer quite powerful concepts, for example, handling time and timeouts, or sophisticated business transactions. Because there are so many projects leveraging BPMN, we know that we can tackle even tricky requirements with it.
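As a flavour of “flow definitions as source code”, here is a minimal sketch using Camunda’s fluent BPMN model API. The task names follow our retail example, the external-task topics are assumptions, and a real model would also add the timeout and compensation handling described below:

```java
import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class OrderFlowDefinition {

    // This model is directly executable by the engine: the state machine
    // knows from it how to transition from one state to the next.
    public static BpmnModelInstance define() {
        return Bpmn.createExecutableProcess("order")
            .startEvent()
            .serviceTask("retrieve-payment").name("Retrieve payment")
                .camundaType("external").camundaTopic("payment")
            .serviceTask("fetch-goods").name("Fetch goods")
                .camundaType("external").camundaTopic("inventory")
            .serviceTask("ship-goods").name("Ship goods")
                .camundaType("external").camundaTopic("shipment")
            .endEvent()
            .done();
    }
}
```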
In the above example, workflow instances wait for a Goods Fetched event, but only until a certain timeout triggers. If that happens the business transaction is compensated -- meaning all compensating activities are executed, and in this case the payment will be refunded. The state machine keeps track of already executed activities and is therefore able to trigger all necessary compensating actions. This allows the state machine to coordinate business transactions -- the underlying idea is also referred to as the Saga pattern.
Leveraging graphical notations to define such flows also adds to the idea of living documentation -- documentation aligned to your running system in such a way that it cannot become out of sync with the actual behaviour. Some tools provide special support for unit testing certain scenarios, including long running behaviour. In Camunda, for example, every test run generates an HTML output which highlights the executed scenario, something that can easily be hooked into normal continuous integration (CI) reports. This way the graphical model adds even more value.
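A minimal JUnit sketch in that spirit, using the camunda-bpm-assert helpers; the process file name order.bpmn and the activity id are assumptions matching the earlier sketch:

```java
import org.camunda.bpm.engine.runtime.ProcessInstance;
import org.camunda.bpm.engine.test.Deployment;
import org.camunda.bpm.engine.test.ProcessEngineRule;
import org.junit.Rule;
import org.junit.Test;

import static org.camunda.bpm.engine.test.assertions.bpmn.BpmnAwareTests.assertThat;
import static org.camunda.bpm.engine.test.assertions.bpmn.BpmnAwareTests.runtimeService;

public class OrderFlowTest {

    @Rule
    public ProcessEngineRule processEngine = new ProcessEngineRule();

    @Test
    @Deployment(resources = "order.bpmn")
    public void waitsForPaymentAfterStart() {
        ProcessInstance order = runtimeService().startProcessInstanceByKey("order");

        // The instance is now parked in its long running payment step; the
        // state survives restarts because the engine persists it.
        assertThat(order).isStarted().isWaitingAt("retrieve-payment");
    }
}
```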
Workflows live inside service boundaries
A very important concept is that using business workflow frameworks and tooling is a decentralised decision that is made by each and every service team. The state machine should be an implementation detail that is not visible externally from a service. There is no need for any central workflow tool, and the state machine should just be a library that is used to enable the long running behavior of some of your services more easily.
Another way of looking at this is that such a workflow engine is a logical part of your service. Depending on the tool of your choice, it can run embedded in your application process (e.g. using Java, Spring and Camunda), as a separate process accessed via simple language clients (e.g. using Java or Go and Zeebe), or be used via a REST API (e.g. using Camunda or Netflix Conductor). Having this infrastructure available frees services from the burden of implementing state handling themselves, letting them focus on the business logic. You can design good service APIs and really smart endpoints because you can easily decide to make services potentially long running.
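To close the loop, a sketch of the embedded variant (e.g. via the Camunda Spring Boot starter), where starting the long running flow is an ordinary method call inside the service; the topic and event shape are assumed as in the earlier sketches:

```java
import java.util.Map;

import org.camunda.bpm.engine.RuntimeService;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Assumed event shape, as sketched earlier.
record OrderPlacedEvent(String orderId) {}

@Component
class OrderFlowStarter {

    private final RuntimeService runtimeService;

    OrderFlowStarter(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    @KafkaListener(topics = "order-events")
    public void on(OrderPlacedEvent event) {
        // One persistent workflow instance per order keeps the state;
        // the orderId doubles as the business key for later lookups.
        runtimeService.startProcessInstanceByKey(
                "order", event.orderId(), Map.of("orderId", event.orderId()));
    }
}
```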
Some code please
To ensure we are not only discussing these concepts theoretically, we have developed a small sample application showing all these concepts in action. The running code is available on GitHub. We have used Java and only open source components (Spring Boot, Camunda, Apache Kafka) so that it is very easy to explore and experiment with.
Wrap-Up
With a complex topic like business workflow modelling we could only scratch the surface by challenging some common hypotheses. Here are the key takeaways:
- Do events decrease coupling? Only sometimes! To be honest, events are great for decentralised data management, generating read models or tackling cross-cutting concerns. However, you should not implement complex peer-to-peer event chains. If you are tempted to do this, make sure that you send commands instead and are not afraid of orchestrating other services.
- Does centralised control need to be avoided? Only in some respects! We agree that centrally managed ESBs don’t fit into a microservices architecture. Smart endpoints and dumb pipes are preferable. However, don’t forget that all important business capabilities need a home. If you design smart endpoints you will not end up with bloated god services. Smart endpoints will often mean that you will have potentially long running services which internally handle all business problems that they are responsible for.
- Are workflow engines painful? Only some of them! In the past, BPM and workflow engines have been over-hyped concepts that were very vendor-driven, and so there are many horrible “zero-code” tools in the market. However, lightweight and easy-to-use frameworks exist, and most of them are even open source. They can run in a decentralised fashion, and can solve some hard developer problems. Don’t invest time in writing your own state machines, and instead leverage existing tools.
About the Authors
Bernd Rücker has coached countless real-life software projects, helping customers implement business logic centered around long running flows. Examples include the order process of the rapidly growing start-up Zalando, selling clothes worldwide, and the provisioning process for SIM cards at several big telecommunication organisations. During that time he contributed to various open source workflow engines. Bernd is also an author of the best-selling book Real-Life BPMN and co-founder of Camunda. He is totally enthusiastic about how flows will be implemented in next-generation architectures.
Martin Schimak has been working for over a decade in complex domains like energy trading, telecommunication or wind tunnel organization. As a coder, he has a soft spot for readable and testable APIs, and enjoys working with sophisticated but lean state machines and process engines. As a domain “decoder”, Martin is into Domain-Driven Design and integration methods which shift his focus from technology to the user value of what he does. He is a contributor to several projects on GitHub, and speaks at meetups and conferences like ExploreDDD, O'Reilly Software Architecture Conference and KanDDDinsky. He blogs at plexiti.com and in his hometown Vienna he organizes meetups around Microservices and DDD.
Community comments
+1
by Tom Baeyens,
Thanks Bernd & Martin for the great article. I totally agree that 'such a workflow engine is a logical part of your service'. Each team should have their own flow service, putting the flow inside the smart endpoint.
I also believe this approach is going to become increasingly relevant. With the trend towards microservices, more coding is moving from monolithic HTTP handlers towards asynchronous event handling between microservices. These interactions mostly occur over HTTP (read: non-transactional) transport. To keep dependent microservices consistent, you need to keep good track of the individual service updates.
I was also triggered by "Do you prefer coded or graphical DSLs?" In the 3 DSLs shown, the control flow is based on state machine principles and a diagram. With RockScript.io I'm exploring a different approach where sagas (aka long running flows or workflows) can be expressed in a scripting language like JavaScript. The first results are really encouraging because specifying transformations between the activities (aka local transactions or service functions) becomes a lot easier.
-- Tom
State Machines
by James Woods,
Overall an interesting article but the BPMN diagrams seem to confuse events, states, and processes. State machines work best as a concept when those are clearly separated. You should be able to look at the diagram and clearly identify what state the machine is in, and see what events cause transitions to what states.
ESB
by James Woods,
I don't see your justification for not using an Enterprise Service Bus. An ESB is just a pipe that allows messages to be broadcast. What is the alternative? Do you suggest chaining processes together with REST interfaces? If so that would either require the clients to know about the consumers of their outputs or use some service registry to dynamically discover them. What is the advantage there?
Re: State Machines
by Martin Schimak,
Hi James! Your comments are valuable to me because I always appreciate feedback and like to learn about how people read - or like to read - diagrams. With respect to BPMN diagrams, there are actually different styles out there. My focus often is to reduce them as much as possible to what is needed in order to discuss them with domain experts. Flexible engines allow you to wire commands, activities and events in ways which either show more or less "technical aspects". To give an example for such a "reduced" visualisation: when one of the diagrams above says "Charge credit card" as one node in the flow, we could technically use and wire it in the following way: when arriving at the node, issue an async command message, e.g. via some messaging infrastructure, then wait in that state, which represents that the activity triggered by that command has not yet completed. Later we could correlate an event "CreditCardCharged" which would now successfully complete that activity and transition to the next node. Modeling the graphical representation like that has upsides and downsides, of course. Trade-offs, like always...
... now regarding your comment about ESBs. We actually use the term to separate central components containing certain business logic (like e.g. a bit more complex routing rules wiring together a "business process") from a "dumb pipe", which would e.g. be implemented with some messaging queue, which really just publishes e.g. an event via a topic or routes a command via a queue. So in fact we are very much thinking in terms of asynchronous messaging here, but these commands and events could also be exchanged asynchronously via e.g. REST feeds.
All the best and thank you for sharing your thoughts!
Re: ESB
by Richard Clayton,
I would think there is a large overlap between ESBs and orchestration engines like the ones listed in the article. In fact, I believe some ESBs integrate BPM engines (I'm reminded of Mule and Drools and perhaps Activiti in something else - I don't remember).
Perhaps the primary difference is that ESBs tend towards heterogeneous composition of services, where some of these other systems have very specific mechanisms for executing actions. For instance, Amazon Step Functions uses Lambda for actions, but allows integration with external services via API; however, the external service has to know how to call the Step Function API, not the other way around. In this case, homogeneity may allow teams to build/integrate easier and faster than forcing the orchestration engine to have to work around the idiosyncrasies of each service/method integrated. Of course, the benefits have a lot more to do with cultural/political aspects of the employing organization than they do with the merits of either technology or pattern.
In some time, I suspect we will see a convergence of what was traditionally considered an ESB with orchestration capabilities and these cloud orchestration engines. As you point out, they are not so different from ESBs and I would argue that if your organization believes it could deliver faster with an ESB, you should do so.
Re: ESB
by Gustavo Concon,
I think there are two problems around ESBs.
1. Centralized tools frequently create dependencies between teams. Imagine 5 teams working on different contexts, with their own rhythm and deployment cadence. In my experience, ESBs and BPMs are always a "point of sync" between these teams, which have to handle dependencies and are locked into the same release of this kind of component. This always slows down the development flow to manage merges and rebases.
2. These tools have so much capability that you lose control, and some teams make them smart, breaking the concept of dumb pipes and adding business logic to that layer.
Solutions like Kafka and Zuul are designed to be more "do one thing well", so they can be modularized to not couple teams and keep the dumb pipes approach.
Re: ESB
by Bernd Ruecker,
Hi Gustavo.
I agree that a lot of companies used ESB & BPM tools in a way that made them a "point of sync".
But: we emphasise in this article that it doesn't have to be this way. You can see a workflow engine (or BPM if you prefer) as an "implementation detail" of a service. So you leverage the workflow engine as a state machine whenever you have requirements for it. This is a team-local decision about make (build your own state machine with all the complexity involved) or buy (which might be just "use" in the case of open source). I wrote about this in blog.bernd-ruecker.com/avoiding-the-bpm-monolit...
And yes, it might need some experience to use the tools wisely and end up with neither smart pipes nor god services. This is why we wrote this article :-) And it is true for each and every methodology and tool out there: you have to apply it wisely. If the alternative is to build the state management yourself, I know what I would choose (as I see home-grown state machine beasts very regularly. Nothing you want to have at home :-)
Did you see that we use Kafka + Workflow Engine (service local) in this example: github.com/flowing/flowing-retail/? There you have clear responsibilities for each component of the stack.
Cheers
Bernd
+1
by daw sevenhundred,
I realise I'm late to this article, but it's brilliant. Someone had to articulate the tradeoffs of event-based systems.