Key Takeaways
- Peer-to-peer communication between components can lead to emergent behavior, which is challenging for developers, operators and business analysts to understand.
- You need to maintain an overview of all the back-and-forth communication that goes on in order to fulfill a business capability.
- Solutions that provide such an overview range from distributed tracing, which typically misses the business perspective; data lakes, which require some effort to tailor to what you need to know; process tracking, where you have to model the workflow to be tracked; process mining, which can discover the workflow; all the way through to orchestration, which comes with visibility built in.
- In this article we argue that you need to balance orchestration and choreography in a microservices architecture in order to be able to understand, manage and change the system.
Last year I met an architect from a big e-commerce organisation. He told me that they do all the right things and divide their functionality into smaller pieces along domain boundaries, even if they don’t call this architectural style "microservices". Then we talked about how these services collaborate to carry out business logic that crosses service boundaries, as this is typically where the rubber meets the road. He told me their services interact via events published on an event bus, an approach typically known as "choreography" (a concept explained in greater detail later). They considered this to be optimal in terms of decoupling. But the problem they face is that it becomes hard to understand what’s happening, and even harder to change anything. "This is not like the choreographed dances you see in the slides of some microservices talks; this is unmanageable pogo jumping!"
This matches what other customers tell me, e.g. Josh Wulf from Credit Sense said, "the system we are replacing uses a complex peer-to-peer choreography that requires reasoning across multiple codebases to understand."
Let’s investigate this further using a simplified example. Assume you build an order fulfillment application. You choose to implement the system using an event-driven architecture and you use, for example, Apache Kafka as an event bus. Whenever somebody places an order, an event is fired from the checkout service and picked up by a payment service. This payment service now collects money and fires an event which gets picked up by an inventory service.
Choreographed event flow
The advantage of this way of working is that you can easily add new components to the system. Assume you want to build a notification service that sends emails to your customers; you can simply add a new service and subscribe to the relevant events, without touching any of the other services. You can then manage communication settings and handle all the complexity of GDPR-compliant customer notifications in that one central place.
This architectural style is called choreography, as there is no orchestrator that tells others what to do. Instead, every component emits events and others can react to them. The assumption is that this style reduces coupling between the components and that systems become easier to develop and change, which is true for the sketched notification service.
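To make this concrete, here is a minimal sketch of such a notification service, assuming Apache Kafka as the event bus and the plain Kafka Java client; the topic name "order-events" and the JSON payload format are illustrative assumptions, not part of the original example:

```java
// Minimal sketch of the new notification service: it subscribes to existing
// events without any of the other services having to change.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class NotificationService {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "notification-service");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("order-events")); // assumed topic name
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // React only to the events relevant for customer notifications,
          // e.g. send a GDPR-compliant email for an order placed event.
          System.out.println("would notify customer for event: " + record.value());
        }
      }
    }
  }
}
```

The important point is that none of the existing services had to change to make this subscriber work.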
Losing Sight of the Flow of Events
In this article I want to concentrate on the question I am asked most often whenever I discuss this architecture: how do we avoid losing sight (and probably control) of the flow of events? In a recent survey, Camunda (the company I work for) asked about the adoption of microservices. 92% of all respondents at least consider microservices, and 64% already do microservices in some form. It is more than just hype. But in that survey we also asked about challenges and found a clear confirmation of this risk: the top answer was lack of visibility into end-to-end business processes that span multiple services.
Remember architectures based on a lot of database triggers? Architectures where you never knew exactly what would happen if you did this - and wait - why did that happen now? Challenges with reactive microservices sometimes remind me a bit of this, even if the comparison is admittedly a stretch.
Establishing Visibility
But what can we do about it? The following approaches can help you get back visibility, but each has its own pros and cons:
- Distributed tracing (e.g. Zipkin or Jaeger)
- Data lakes or analytic tools (e.g. Elastic)
- Process mining (e.g. ProM)
- Tracking using workflow automation (e.g. Camunda)
Please be aware that all of these approaches observe a running system and inspect the instances flowing through it. I do not know of any static analysis tool that yields useful information here.
Distributed Tracing
Distributed tracing aims to trace call stacks across different systems and services. This is done by creating unique trace ids that are typically added generically to certain headers (e.g. HTTP or messaging headers). If everybody in your universe understands or at least forwards these headers, you can leave breadcrumbs while a request hops through different services.
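As a minimal sketch of the underlying idea, a service simply forwards (or creates) a trace id as a header on every outgoing call. The header name "X-Trace-Id" is a made-up example here; real tracing systems use formats such as B3 (Zipkin) or W3C Trace Context, and usually instrument the propagation automatically rather than by hand:

```java
// Sketch of propagating a trace id via an HTTP header so downstream services
// can attach their own spans to the same trace.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class TracePropagation {

  public static void main(String[] args) throws Exception {
    // Either reuse the trace id from the incoming request or start a new trace.
    String traceId = UUID.randomUUID().toString();

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://payment-service/charge")) // assumed downstream service
        .header("X-Trace-Id", traceId) // the breadcrumb that everybody forwards
        .POST(HttpRequest.BodyPublishers.ofString("{\"orderId\":\"42\"}"))
        .build();

    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```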
Distributed tracing is typically used to understand how requests flow through the system, to pinpoint failures or to investigate the root of performance bottlenecks. The great thing about distributed tracing is that there are mature tools with a lively ecosystem around them. So it is relatively easy to get started, even if you typically have to (potentially invasively) instrument your applications or containers.
So why not use this to actually understand how business processes emerge by events flowing through our system? Well, basically two reasons make it hard to apply distributed tracing for this use case:
- Traces are hard to understand for non-engineers. My personal experiments aiming to show traces to non-tech people failed miserably; it was far better to invest some time to redraw the same information with boxes and arrows. And even though all the information about method calls and messages is very useful for understanding communication behaviors, it is too fine-grained to capture the essence of cross-service business processes.
- To manage the overwhelming mass of fine-grained data, distributed tracing uses so-called sampling. This means only a small portion of all requests are collected. Typically, more than 90% of the requests are never recorded. A good take on this is Three Pillars with Zero Answers - towards a New Scorecard for Observability. So you never have a complete view of what’s happening.
Data Lakes or Analytic Tools
So, out-of-the-box tracing will probably not be the way to go. The logical next step is to do something comparable, but bespoke to the problem at hand. That basically means not collecting traces, but instead collecting meaningful business or domain events that you probably have flowing around already anyway. This often boils down to building a service that listens to all events and stores them in a data store that can take some load. Currently a lot of our customers use Elastic for this purpose.
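A minimal sketch of such a collector could look like the following, assuming Kafka topics carrying the domain events and Elasticsearch's plain REST indexing API; the topic names and the "business-events" index are assumptions for illustration:

```java
// Sketch of a collector that listens to all business events and indexes each
// one as a document in Elasticsearch so it can be searched and visualized later.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EventCollector {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "event-collector");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    HttpClient http = HttpClient.newHttpClient();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      // Subscribe to every topic that carries business events (assumed names).
      consumer.subscribe(List.of("order-events", "payment-events", "inventory-events"));
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
          HttpRequest index = HttpRequest.newBuilder()
              .uri(URI.create("http://localhost:9200/business-events/_doc"))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(record.value()))
              .build();
          http.send(index, HttpResponse.BodyHandlers.ofString());
        }
      }
    }
  }
}
```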
This is a powerful mechanism which is relatively easy to build. Most customers who work in an event-driven manner have this setup already. The biggest barrier to introduction is often the question of who will operate such a tool within a large organization, as it definitely needs to be managed by some centralized facility. It is also easy to build your own user interfaces on top of this to surface the information relevant for certain questions.
One shortcoming is the lack of graphics to make sense out of a long list of events. But you could build that into this infrastructure by, for example, projecting the events onto a visualisation such as a BPMN diagram.
Lightweight frameworks like bpmn.io allow you to add information to such a diagram in simple HTML pages (an example can be found here), which could also be packaged into a Kibana plugin.
This model is not executed by some workflow engine; it is a diagram used to visualize the captured events in a different way. In that sense you have some freedom as to the granularity it shows, and it is also OK to have models that show events from different microservices in one diagram, as that is what you are especially interested in: the big picture. The good news is that this diagram does not stop you from deploying changes to individual services, so it does not hinder agility in your organization. The tradeoff is that it introduces the risk of the diagram becoming outdated compared with the current state of the system running in production.
Process Mining Tools
In the above approach you have to explicitly model the diagram you use for visualization, but if the nature of the event-flow is not known in advance, it needs to be discovered first.
This process discovery can be done by process mining tools. They can derive the overall blueprint and show it graphically, often even allowing you to dig into a lot of detailed data, especially around bottlenecks or optimization opportunities.
This sounds like a perfect fit for our problem. Unfortunately, these tools are most often used to discover process flows within legacy architectures, so they focus on log file analysis and are not really good at ingesting live event streams. Another issue is that they are either very scientific and hard to use (like ProM) or rather heavyweight (like Celonis). So in our experience it is often impractical to introduce these tools into typical microservice endeavors.
Nevertheless, process discovery and mining add interesting capabilities into the mix in order to get visibility into your event-flows and business processes. I hope that there will be technology emerging soon that offers comparable functionality but is also lightweight, developer-friendly and easily adoptable.
Tracking via Workflow Automation
Another interesting approach is to model the workflow, but then deploy and run it on a real workflow engine. The workflow model is special in the sense that it only tracks events and does not actively do anything itself. So it does not steer anything -- it simply records. I talked about this at the Kafka Summit San Francisco 2018, and the recording includes a live demo using Apache Kafka and the open source workflow engine Zeebe.
This option is especially interesting as there is a lot of innovation in the workflow engine market, resulting in the emergence of tools that are lightweight, developer-friendly and highly scalable. I wrote about this in Events, Flows and Long-Running Services: A Modern Approach to Workflow Automation. The obvious downside is that you have to model the workflow upfront. But in contrast to pure event monitoring, this model is executed on a workflow engine: you start a workflow instance for incoming events or correlate events to an existing workflow instance. This also allows conformance checks -- does reality match what you modeled?
This approach also allows you to leverage the complete tool chain of workflow automation platforms, which lets you see what’s currently going on, monitor SLAs, detect stuck instances, and do extensive analysis on historical audit data.
Sample workflow monitoring (from camunda.com)
When I validated this approach with customers it was easy to set up. We just had to build a generic component that picks up events from the bus and correlates them to the workflow engine. Whenever an event could not be correlated, we used a small decision table to decide whether it could be ignored or should raise an incident to be checked later. We also instrumented the workflow engines used within microservices to execute business logic so that they emit certain events (e.g. workflow instance started, ended or milestone reached) to be used in the big picture.
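As an illustration, such a generic correlation component could look roughly like the following sketch, assuming a recent Zeebe Java client; the process id "order-fulfillment-tracking", the use of the event type as message name and the orderId as correlation key are assumptions, and the decision-table fallback for non-correlatable events is omitted for brevity:

```java
// Sketch of a generic component that correlates business events from the bus
// to a purely tracking workflow model running on Zeebe.
import io.camunda.zeebe.client.ZeebeClient;

public class EventCorrelator {

  private final ZeebeClient zeebe = ZeebeClient.newClientBuilder()
      .gatewayAddress("localhost:26500")
      .usePlaintext()
      .build();

  // Called for every business event picked up from the bus.
  public void onEvent(String eventType, String orderId, String payloadJson) {
    if ("OrderPlaced".equals(eventType)) {
      // Start a tracking instance for every new order.
      zeebe.newCreateInstanceCommand()
          .bpmnProcessId("order-fulfillment-tracking") // assumed process id
          .latestVersion()
          .variables("{\"orderId\":\"" + orderId + "\"}")
          .send()
          .join();
    } else {
      // Correlate all other events to the running tracking instance;
      // the model only records milestones, it does not steer anything.
      zeebe.newPublishMessageCommand()
          .messageName(eventType)
          .correlationKey(orderId)
          .variables(payloadJson)
          .send()
          .join();
    }
  }
}
```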
This workflow tracking is a bit like event-monitoring, but with a business process focus. Unlike tracing, it can record 100% of your business events and provide a view that is suitable for the various stakeholders.
The Business Perspective
One big advantage of having the business process available in monitoring is that you understand the context. For any given instance you can always see how and why it ended up in its current state, which path it did not take (but that other instances often do), and which events or data led to certain decisions. You can also get an idea of what might happen in the near future. This is something you miss in other forms of monitoring. And even if it is not currently hip to discuss the alignment between business and IT, it is absolutely necessary that non-engineers also understand the business processes and how events flow through the various microservices.
A Journey from Tracking to Managing
Process tracking is fine, as it gives you operational monitoring, reporting, KPIs and visibility - an important pillar for keeping agility. But in current projects this tracking approach is actually just the first step on a journey towards more management and orchestration in your microservice landscape.
A simple example could be that you start to monitor timeouts for your end-to-end process. Whenever this timeout is hit, some action is taken automatically. In the following example we would inform the customer of a delay after 14 days, but still keep waiting. After 21 days we would give up and cancel the order.
One interesting aspect in the above picture is the sending of the cancel order command. This is orchestration -- and it is sometimes discussed controversially.
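To give an impression of what this could look like in code, here is a minimal sketch of a job worker that executes the cancel order step once the 21-day timer fires, assuming the Zeebe Java client and a service task with job type "cancel-order" in the model; both names are illustrative assumptions:

```java
// Sketch of a worker that picks up the "cancel order" step after the timeout
// fires and sends the corresponding command to the owning service.
import io.camunda.zeebe.client.ZeebeClient;

public class CancelOrderWorker {

  public static void main(String[] args) {
    ZeebeClient zeebe = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build();

    zeebe.newWorker()
        .jobType("cancel-order") // assumed job type from the model
        .handler((client, job) -> {
          // This is where the cancel order command would be sent to the
          // responsible service, e.g. as a message on the bus.
          System.out.println("cancel order " + job.getVariablesAsMap().get("orderId"));
          client.newCompleteCommand(job.getKey()).send().join();
        })
        .open(); // the worker keeps polling for jobs until the application shuts down
  }
}
```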
Orchestration
I often hear that orchestration should be avoided with the argument that it introduces coupling or violates autonomy of single microservices. And of course it is true that orchestration can be done badly, but it also can be done in a way that aligns with microservices principles while adding a lot of value to the business. At InfoQ New York 2018 I talked specifically about this misconception.
In its essence, orchestration for me means that one service can command another to do something. That’s it. That’s not tighter coupling; it is just coupling the other way round. Take the order example. It might be a good idea that the checkout service just emits an order placed event without knowing who processes it. The order service listens to that order placed event. The receiver knows about the event and decides to do something about it; the coupling is on the receiving side.
It is different with the payment, because it would be quite unnatural for the payment service to know what the payment is for. But it would need that knowledge in order to react to the right events, like order placed or order created. That also means it would have to be changed whenever you want to receive payments for new products or services. Many projects work around this unfavorable coupling by issuing payment required events, but these are not events, as the sender wants somebody else to do something. This is a command! The order service commands the payment service to retrieve the money. In this case the sender knows about the command to send and decides to use it; the coupling is on the sending side.
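To make the difference tangible, here is a minimal sketch of the order service sending such a command, assuming Kafka and a dedicated "payment-commands" topic owned by the payment service; the topic name and payload format are assumptions:

```java
// Sketch of the command-style interaction: the order service decides to send
// the command, so the coupling lives on the sending side.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class OrderService {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // A command expresses intent: the order service wants the payment service
      // to do something, unlike an event, which only states a fact.
      String retrievePayment =
          "{\"type\":\"RetrievePayment\",\"orderId\":\"42\",\"amount\":99.95}";
      producer.send(new ProducerRecord<>("payment-commands", "42", retrievePayment));
    }
  }
}
```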
Every communication between two services involves some degree of coupling in order to be effective, but depending on the problem at hand it may be more appropriate to put that coupling on one side rather than the other.
The order service might even be responsible for orchestrating more services and keeping track of the sequence of steps in order fulfillment. I discussed the advantages in detail in the above-mentioned talk. The tricky part is that a good architecture needs to balance orchestration and choreography, which is not always easy to do.
But for this article I wanted to focus on visibility, and there is an obvious advantage to orchestration using a workflow engine: the model is not only the code that executes the orchestration; it can also be used directly to provide visibility into the flow.
Summary
It is essential to get visibility into your business processes, independently of how they are implemented. I discussed multiple possibilities, and in most real-world situations it boils down to some event monitoring using Elastic-like tools or process tracking using workflow engines. The right choice may depend on the use case and the roles involved: business analysts need to understand the data collected across all instances at the right granularity, whereas operations needs to look into one specific instance at varying granularity and probably wants tools to resolve system-level incidents quickly.
If you are a choreographed shop, process tracking can lead you on a journey towards more orchestration, which I think is a very important step in keeping control of your business processes in the long run. Otherwise, you might "make nicely decoupled systems with event notification, without realizing that you’re losing sight of that larger-scale flow, and thus set yourself up for trouble in future years", as Martin Fowler puts it. If you are working on a more greenfield system, you should find a good balance between orchestration and choreography right from the beginning.
However, regardless of the implementation details of your system, make sure you have a business-friendly view of your business processes implemented by collaborating services.
About the Author
Bernd Ruecker is co-founder and technologist at Camunda. Previously, he helped automate highly scalable core workflows at global companies including T-Mobile, Lufthansa and Zalando. He is currently focused on new workflow automation paradigms that fit into modern architectures around distributed systems, microservices, domain-driven design, event-driven architecture and reactive systems.