Know the Flow! Microservices and Event Choreographies

Key Takeaways

  • In a microservices architecture it is not uncommon to encounter flows which are long running and stretch across the boundary of individual microservices.
  • Event based architectures with corresponding choreographies are one increasingly common way to handle this, and they typically decrease coupling.
  • Central to the idea is that every microservice publishes an event whenever something of business relevance happens inside of it, and other services can subscribe to these events. This might be done using asynchronous messaging or perhaps a REST based feed.
  • The article authors explore a pattern they call “event command transformation” as a way of reasoning about these cross-cutting concerns while avoiding central controllers.
  • For implementing the flow you can leverage existing lightweight state machines and workflow engines. These engines can be embedded into your microservice, avoiding any central tool or central governance. 

Let's assume we want to design a microservices architecture to implement some reasonably complex "end-to-end" use case, e.g. the order fulfillment of a web based retailer. Obviously, this will involve multiple microservices. Consequently, we have to manage a business process that stretches across the boundary of individual services. But when striving for nicely decoupled microservices, some challenges arise related to such cross-service flows. In this article, we therefore introduce helpful patterns such as the “event command transformation” and present technology approaches to tackle the complexity of coding flows that stretch across microservices, without introducing any central controller.

First, we have to come up with an initial microservice landscape and define the boundaries of microservices and their scope. Our goal is to minimize coupling between various services and keep them independently deployable. By doing so we want to maximize the autonomy of the teams; for every microservice there should be one cross-functional team taking care of it. As this is particularly important for us, we decided to follow a more coarse-grained approach and design just a few self-contained services built around business capabilities. This results in the following microservices:

  • Payment Service – the team is responsible for dealing with everything related to "money"
  • Inventory Service – the team is responsible for taking care of stock items
  • Shipment Service – the team is responsible for "moving stuff to customers"

The web shop itself will probably be comprised of several more microservices, e.g. Search, Catalogue etc. As we focus on the order fulfillment, we are only interested in one web shop related service that allows the customer to place an order:

  • Checkout Service - the team deals with checking out the customer's shopping cart

This service will ultimately trigger the order fulfillment to start.

Long Running Flows

We have to consider one important characteristic of the overall order fulfillment: it is a long running flow of actions to be carried out. Referring to the term “long running”, we mean that it can take minutes, hours or even weeks until the order processing is complete.

Consider the following example: whenever a credit card is rejected during payment, the customer has one week to provide new payment details. That means the order might have to wait for one week. Such long running behavior imposes requirements on the implementation approach, which we discuss in greater detail in this article.

Event Collaboration

We do not discuss the pros and cons of communication patterns in this article, but rather decided to illustrate our topic by means of an event centric communication pattern in between services.

Central to the idea of event collaboration is that all microservices will publish events when something business relevant happens inside of them. Other services may subscribe to that event and do something with it, e.g. store the associated information in a form optimal for their own purposes. At some later point in time, a subscribing microservice can use that information to carry out its own service without being dependent on calling other services. Therefore, with event collaboration a high degree of temporal decoupling in between services becomes a default. Furthermore, it becomes easy and natural to achieve the kind of decentral data management we look for in a microservices architecture.

The concept is well understood in Domain Driven Design, a discipline currently accelerating in the slipstream of microservices and the "new normal" of interacting, distributed systems in general.

Note that event collaboration could be implemented with asynchronous messaging but could also be implemented by other means. Microservices could e.g. publish REST based feeds of their events which could be consumed by other services on a regular basis.
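
To make the publish/subscribe idea concrete, here is a minimal in-memory sketch in plain Java. It is not how any particular broker works: SimpleEventBus, ShipmentAddressBook and the payload keys orderId and shippingAddress are hypothetical names chosen for illustration, and a real system would replace the bus with asynchronous messaging or an event feed. What it is meant to show is that the publisher does not know its subscribers, and that a subscriber stores the data it will need later in its own form.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for a message broker or event feed. */
class SimpleEventBus {
  private final Map<String, List<Consumer<Map<String, String>>>> subscribers = new HashMap<>();

  /** Registers a handler for all future events of the given type. */
  void subscribe(String eventType, Consumer<Map<String, String>> handler) {
    subscribers.computeIfAbsent(eventType, key -> new ArrayList<>()).add(handler);
  }

  /** Delivers the event to every subscriber; the publisher does not know who they are. */
  void publish(String eventType, Map<String, String> payload) {
    for (Consumer<Map<String, String>> handler : subscribers.getOrDefault(eventType, List.of())) {
      handler.accept(payload);
    }
  }
}

/** The shipment side keeps its own copy of the data it needs later. */
class ShipmentAddressBook {
  private final Map<String, String> addressByOrderId = new HashMap<>();

  /** Subscribes to Order Placed events and stores the shipping address locally. */
  void listenTo(SimpleEventBus bus) {
    bus.subscribe("OrderPlaced", event ->
        addressByOrderId.put(event.get("orderId"), event.get("shippingAddress")));
  }

  /** Answers from local data, without calling the checkout service. */
  String addressFor(String orderId) {
    return addressByOrderId.get(orderId);
  }
}
```

Once the address book has listened to the bus, shipment can look up an address at any later point in time without the checkout service being available, which is the temporal decoupling described above.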

Event Command Transformation

Our order fulfillment starts with the event Order Placed. The first thing that must happen in our minimum viable order fulfillment is the customer's payment. The payment service successfully finishes with the event Payment Received after which we take care of consignment of the goods in the stock (Goods Fetched) and the shipping to the customer (Goods Shipped). So we have a clear flow of events - at least in the "happy" scenario. One could now easily create a chain of events as depicted in Figure 1.

Figure 1: Each of the microservices is listening to the previous one in the chain

As much as we support the fundamental idea of event collaboration, these types of event chains provide a suboptimal approach for implementing the end-to-end logic of whole business processes. We see it happen for the noble goal of reducing coupling but these solutions might even increase coupling. Let’s dive into that.

An event is by definition meant to inform you about a relevant fact that occurred and that some other service might be interested in. But the moment we require a service to follow up on an event, we use that event as if it had the semantics of a command. The consequence is that we end up with tighter coupling than necessary.

In our example, the payment service listens to the event Order Placed in the checkout service. Now the payment service has to know at least something about checkouts. But it’s better if it doesn’t for the following reasons:

  • Consider that our organization probably needs payment services for various reasons and not just when retail orders are placed. The payment service would have to be adjusted and redeployed whenever we want to bind the payment service to a new event even though the specifics of how exactly payments are carried out do not change at all.
  • Consider simple business requirements like changing the order of some steps. If the business wishes to make sure that goods are correctly fetched before the customer is charged, three services would have to be adjusted at the same time: Payment now listens to Goods Fetched, while Inventory listens to Order Placed and shipment now subscribes to Payment Received.
  • Consider we issue invoices for special orders, e.g. for VIP customers. Now, not only does the payment service have to understand the rule as it needs to decide if the payment has to be done whenever an order is placed, but also the inventory service has to understand that just listening to Payment Received events will bring the overall process for VIPs to a halt!

Therefore, we recommend what we call the event command transformation pattern. Make sure that the component responsible for making a business decision (payment is needed now) transforms an event (Order Placed) into a command (Retrieve Payment). The command can be sent to the receiving service (payment) without that service knowing about the client, and without incurring the disadvantages described above.

Note that "issuing commands" for long running flows does not necessarily mean using request/reply oriented protocols. It can also be implemented by other means: microservices could listen to asynchronous command messages in a similar way to how they already listen to events. Furthermore, note that event command transformation also takes place when the event subscriber transforms an event into an internal command. We recommend that the transformation be made by the party responsible for deciding that “something needs to happen.”

But who is that party in our example? Should the checkout service issue the Retrieve Payment command? No. Reconsider the change scenarios given above. All of them suggest that we need a separate microservice handling some of the end-to-end logic of the order fulfillment.

  • Order - the team is responsible for dealing with the end-to-end logic of the customer facing core capability of the business - fulfilling orders.

This service does the event command transformation. It transforms Order Placed into Retrieve Payment. It might decide autonomously to do that for Non-VIP customers only. It might also consult another microservice first which encapsulates the rules for what constitutes a VIP customer. Such an end-to-end service improves decoupling massively when being compared to a puristic event collaboration as described.
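
The transformation itself can be very small. The following is a sketch in plain Java under stated assumptions: the classes OrderPlaced, RetrievePayment and OrderEventTransformer, their fields, and the VIP rule passed in as a predicate are hypothetical and not taken from the example project. It only illustrates that the decision "payment is needed now" lives in the order service, while the resulting command carries nothing checkout-specific.

```java
import java.util.Optional;
import java.util.function.Predicate;

/** Event: a fact that already happened in the checkout service. */
final class OrderPlaced {
  final String orderId;
  final String customerId;
  final long amountInCents;

  OrderPlaced(String orderId, String customerId, long amountInCents) {
    this.orderId = orderId;
    this.customerId = customerId;
    this.amountInCents = amountInCents;
  }
}

/** Command: an instruction to the payment service, free of checkout details. */
final class RetrievePayment {
  final String reference;
  final long amountInCents;

  RetrievePayment(String reference, long amountInCents) {
    this.reference = reference;
    this.amountInCents = amountInCents;
  }
}

/** The order service owns the decision that a payment is needed now. */
class OrderEventTransformer {
  // Hypothetical VIP rule; it could also be answered by another microservice.
  private final Predicate<String> isVipCustomer;

  OrderEventTransformer(Predicate<String> isVipCustomer) {
    this.isVipCustomer = isVipCustomer;
  }

  /** Transforms the event into a command, or into nothing for invoiced VIP orders. */
  Optional<RetrievePayment> on(OrderPlaced event) {
    if (isVipCustomer.test(event.customerId)) {
      return Optional.empty(); // VIPs are invoiced, so no payment is retrieved now
    }
    return Optional.of(new RetrievePayment(event.orderId, event.amountInCents));
  }
}
```

Changing the VIP rule or the order of steps then touches only this one service; the payment service keeps accepting the same command.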

But how can we prevent the introduction of an end-to-end service from resulting in a "God-like" service that holds most of the crucial business logic and delegates to "anemic" (CRUD) services? As this would eliminate a lot of the benefits of event choreographies, authors such as Sam Newman in Building Microservices advise against God services. Furthermore, isn’t a commanding service using the orchestration principle, which is perceived as the enemy of loose coupling?

Choreography vs. Orchestration - Decentral Governance for Business Processes

Avoiding God services and central controllers is a question of taking the responsibilities and autonomy of the teams seriously. Having end-to-end responsibility for an order in a highly decentral organization does not mean that you constantly interfere with the responsibilities of other teams such as payment. On the contrary, having the end-to-end responsibility for orders means that "payment" is a black box for you. You are only in charge of asking it to perform its work (Retrieve Payment) and waiting for its completion (Payment Received).

Consider the previously mentioned business requirement that whenever a credit card is rejected, the customer has one week to provide new payment details. We could be tempted to implement such logic in the order service but only if the commands offered by the payment service are very fine-grained. If the payment team takes its own business capabilities and associated responsibilities seriously, it will determine that it’s responsible to collect payments even if this potentially takes longer than just attempting to charge a credit card. The payment team can guard against any God-like service tendencies by providing a few coarse-grained, potentially long running capabilities instead of a myriad of fine-grained or even CRUD-like functions. This idea is depicted in Figure 2.

Figure 2: End-to-end flow logic is decentrally governed, the responsibilities are distributed

In a highly decentral organization, the end-to-end order service will be as lean as possible because most aspects of its end-to-end process will be managed autonomously by other services specializing in their own business capability. The payment service serves as an example for that principle: it's the responsibility of the payment team to implement everything necessary to collect the payment.

This is a crucial aspect to consider and a common misconception when talking about the implementation of business processes: it does not necessarily mean that you design the overall process in one piece and let a central orchestrator carry it out, like it was advertised in the old SOA and BPM days. The ownership for the process and the needed flow logic can be distributed. How much will primarily depend on your organizational structure which should also be reflected in your service landscape (see Conway’s Law). Following this approach, you do not end up with a central, monolithic controller.

If you now think that splitting up the end-to-end flow logic increases the complexity of your system you might be right. Similar trade-offs apply to introducing a microservices architecture in the first place: Monolithic approaches are often easier but will reach their limits when the system grows and can no longer be handled by one single team. It's just about the same with flow logic.

To sum up what we discussed so far: choreography is a fundamental pattern for a microservices architecture. We recommend following that pattern as an important rule of thumb. But when it comes to business processes, don’t create puristic event chains but implement decentral flow logic and use the event command transformation pattern instead. The microservice responsible for deciding on an action should also be responsible for transforming an event into a command.

Flow Logic Implementation

Let’s look at the implementation of long running flow logic. Long running flows require their state to be saved, as you might have to wait an arbitrary time. State handling is not a new thing to do. That’s what databases are for. So an easy approach is to store the order state as part of some entity, e.g. as shown in Code Snippet 1.

public class OrderStatus {
  boolean paymentReceived = false;
  boolean goodsFetched = false;
  boolean goodsShipped = false;
}

Code Snippet 1: A simplified order status to be used as part of some entity

Or you might use your favorite actor framework. We discuss only basic options here. All of this works to some extent, but typically you face additional requirements as soon as you start implementing the states needed for long running behavior: how can you implement waiting for seven days? How can you handle errors and retries? How can you evaluate cycle time for orders? Under which circumstances do orders get canceled because of missing payments? How can you change the flow if you always have some orders somewhere in the processing line?
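
To illustrate just the first of these questions, a hand-rolled answer might start out like the following sketch. The class WaitingOrder and its methods are hypothetical and not part of the example project. The sketch also assumes that some scheduler calls checkTimeout regularly, which is exactly the kind of infrastructure you would have to build, persist and monitor yourself.

```java
import java.time.Duration;
import java.time.Instant;

/** Hand-rolled state for one order that waits for new payment details. */
class WaitingOrder {
  static final Duration PAYMENT_GRACE_PERIOD = Duration.ofDays(7);

  private final Instant cardRejectedAt;
  private boolean paymentReceived = false;
  private boolean canceled = false;

  WaitingOrder(Instant cardRejectedAt) {
    this.cardRejectedAt = cardRejectedAt;
  }

  /** Records a successful payment unless the order was already canceled. */
  void paymentReceived() {
    if (!canceled) {
      paymentReceived = true;
    }
  }

  /** Cancels the order once the grace period has passed without a payment. */
  void checkTimeout(Instant now) {
    boolean deadlinePassed = !now.isBefore(cardRejectedAt.plus(PAYMENT_GRACE_PERIOD));
    if (!paymentReceived && deadlinePassed) {
      canceled = true;
    }
  }

  boolean isCanceled() {
    return canceled;
  }

  boolean isPaymentReceived() {
    return paymentReceived;
  }
}
```

Even this small class leaves retries, error handling, reporting and changes to the flow for orders already in progress unanswered.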

This can lead to a lot of coding which ends up in a home-grown framework, and teams working on affected projects complain about the enormous amount of effort buried in it. So we want to have a look at a different approach: leveraging existing frameworks. In this article, we use the open source engine from Camunda to illustrate concrete code examples. Let's have a look at Code Snippet 2.

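// Assumes engine is an already configured org.camunda.bpm.engine.ProcessEngine
// and Bpmn is org.camunda.bpm.model.bpmn.Bpmn
engine.getRepositoryService().createDeployment()
  // addModelInstance needs a resource name; the .bpmn suffix lets the engine recognize the model
  .addModelInstance("order.bpmn", Bpmn.createExecutableProcess("order")
    .startEvent()
      .serviceTask().name("Retrieve payment").camundaClass(RetrievePaymentAdapter.class)
      .serviceTask().name("Fetch goods").camundaClass(FetchGoodsAdapter.class)
      .serviceTask().name("Ship goods").camundaClass(ShipGoodsAdapter.class)
    .endEvent().camundaExecutionListenerClass("end", GoodsShippedAdapter.class)
  .done()
).deploy();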

Code Snippet 2: The order flow can be expressed in code, e.g. by using Java

The engine now runs instances of this flow, keeps track of their state and stores it persistently, so that the state survives crashes and long periods of waiting. The missing adapter logic can be easily coded, too, as shown in Code Snippet 3:

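import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

public class RetrievePaymentAdapter implements JavaDelegate {
  @Override
  public void execute(DelegateExecution ctx) throws Exception {
    // Prepare payload for the outgoing command
    // (assumption for illustration: the business key identifies the order)
    String payload = ctx.getProcessBusinessKey();
    // publishCommand and addEventSubscription are helpers of the example, not engine API
    publishCommand("RetrievePayment", payload);
    addEventSubscription("PaymentReceived", ctx);
  }
}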

Code Snippet 3: Additional logic needed can be coded with adapters, e.g. by using Java

Such an engine can also handle more complex requirements. The following flow catches all errors when charging the credit card. The flow moves forward in an alternative way and asks the customer to update their details. As we don’t know if and when the customers will do that, we then have to wait for an incoming message from them (or technically speaking most probably from some UI or other microservice). But we wait only for seven days and then we automatically end the flow and issue a Payment Failed event. Compare Code Snippet 4.

Bpmn.createExecutableProcess("payment")
  .startEvent()
    .serviceTask().id("charge").name("Charge credit card").camundaClass(ChargeCreditCardAdapter.class)
      .boundaryEvent().error()
        .serviceTask().name("Ask customer to update credit card").camundaClass(AskCustomerAdapter.class)
        .receiveTask().id("wait").name("Wait for new credit card data").message("CreditCardUpdated")
          .boundaryEvent().timerWithDuration("P7D") // time out after 7 days
          .endEvent().camundaExecutionListenerClass("end", PaymentFailedAdapter.class)
        .moveToActivity("wait").connectTo("charge") // retry with new credit card data
    .moveToActivity("charge")
  .endEvent().camundaExecutionListenerClass("end", PaymentCompletedAdapter.class)
  .done();

Code Snippet 4: The flow logic now allows for a time frame of a week to update credit card data
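
Timer durations such as the seven-day timeout are written as ISO 8601 durations, where day components go before the T separator and only hours, minutes and seconds go after it. The following small check uses only java.time and a hypothetical helper class TimerDurations; it says nothing about how a particular engine parses its timers, but it is a quick way to validate such strings before putting them into a flow definition.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

/** Hypothetical helper for sanity-checking ISO 8601 duration strings. */
class TimerDurations {
  /** Returns true if java.time can parse the text as a duration. */
  static boolean isValid(String isoDuration) {
    try {
      Duration.parse(isoDuration);
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  /** Converts a valid duration string into whole hours. */
  static long hours(String isoDuration) {
    return Duration.parse(isoDuration).toHours();
  }
}
```

With this helper, "P7D" is accepted and corresponds to 168 hours, whereas "PT7D" is rejected because a day component is not allowed after the T.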

We will point to some other potentially interesting aspects later in this article, e.g. to visualize such flows. For now, we summarize that you can leverage such a state machine to handle your state and define powerful flows around state transitions.

Embeddable Workflow

Such a state machine is a simple library that can be embedded into your microservice. In the source code examples provided with this article, you can see how to start the Camunda engine as part of a microservice implemented in Java; this could also be done via Spring Boot or similar frameworks.

Let’s highlight this: every microservice that implements long running flows must tackle the requirements around flow and state handling. So, should every microservice use an engine like Camunda's? The team responsible for a microservice may decide to, but such a decision will not necessarily be the same across all teams. In a microservices architecture we typically find decentral governance regarding technological choices. A team might very well use a different framework or even decide to hardcode their flows. Or they use the same framework but in different versions. There isn’t necessarily any central component involved when you introduce a workflow engine. We clearly advocate against imposing unnecessary enterprise architecture standards in a microservice environment.

Embeddable doesn’t have to mean that the engine runs inside your own process. Especially in the polyglot world of microservices, the programming language of the engine might not fit that of your service. In that case you can deploy the engine in a standalone manner and talk to it remotely. This could e.g. be done via REST, and more efficient ways are on the horizon. The important aspect here is that the responsibility for the engine lives with the team owning the surrounding microservice; it’s not a centrally provided engine (Figure 3).

Figure 3: Teams decide decentrally to leverage and embed an engine for their flow logic - or not

In the proposed decentralized architecture you have multiple workflow engines where every single one only sees a part of the overall flow. That poses new requirements on proper process monitoring which aren’t yet solved. But depending on the product there are workarounds, or you can leverage existing monitoring tools in your microservice universe (like the Elastic stack for example). For this purpose, it also helps to introduce an artificial transaction id or trace id which you hand over with each and every service invocation in the chain. We plan to write a blog post dedicated to this topic as this is especially important in the more complex operational environment of collaborating microservices.
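
Handing over such an id can be sketched as follows. This is an illustration in plain Java, not the mechanism of any particular product: the Envelope class and the header name traceId are assumptions, and a real system would map the header onto its messaging or HTTP headers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical message envelope; the header name "traceId" is an assumption. */
final class Envelope {
  static final String TRACE_ID = "traceId";

  final String name;
  final Map<String, String> headers;

  private Envelope(String name, Map<String, String> headers) {
    this.name = name;
    this.headers = headers;
  }

  /** Starts a new end-to-end flow with a fresh artificial trace id. */
  static Envelope startOfFlow(String name) {
    Map<String, String> headers = new HashMap<>();
    headers.put(TRACE_ID, UUID.randomUUID().toString());
    return new Envelope(name, headers);
  }

  /** Creates a follow-up event or command that carries the same trace id onward. */
  Envelope followUp(String nextName) {
    Map<String, String> nextHeaders = new HashMap<>();
    nextHeaders.put(TRACE_ID, traceId());
    return new Envelope(nextName, nextHeaders);
  }

  String traceId() {
    return headers.get(TRACE_ID);
  }
}
```

If every service creates its outgoing events and commands via followUp, all messages and flow instances belonging to one order can later be found by a single id, regardless of which engine or log they ended up in.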

The Power of Graphics

“I love code, and I love DSLs. Graphic UIs are terrible” is a statement we often hear when talking to developers. It’s understandable because very often graphical models hinder the way developers like to work by what we call “death-by-properties-panel.” The models might also hide complex configurations made under the hood. But this aversion should not stand in the way of an important fact: graphical representations are extremely handy during operations as you don’t have to dig into code to understand the current state or exceptional situations. And you can leverage the graphics to discuss the model with business stakeholders, requirements engineers, operators, or other developers. Often after discussing and modeling a flow (graphically) in a short workshop, we hear comments like “now I finally understood what we already do for years!” Visibility also makes it easier to change flows down the line as you know how they are currently implemented (don’t forget, the flow is running code) and you can easily point to areas where they should be changed.

With workflow engines you can get a graphical representation of the flow. However, we often see one very important aspect missing: being able to define flows not only in a graphical format but also in code or by a simple DSL as shown above. The code example we gave above can be presented in auto-layout and monitored as shown in the figure below. Many projects we know use graphical models as it’s often easier to follow. It comes especially in handy if you have complex flows including parallel paths which are hard to understand in code but easy to spot in the graphics. The graphical model is often directly saved in the BPMN 2.0 standard. But we also know of projects using the coded DSL successfully.

Figure 4: The power of graphics - from business users to developers to operations

When building your own end-to-end monitoring solution, you can still easily visualize a graphical flow with lightweight JavaScript frameworks like bpmn.io as we demonstrate in the code examples. You just read the process models and their current states from different engines via an API and show all running instances for the already mentioned artificial transaction id.

The granularity of the flows shown in monitoring should reflect the event collaborations we introduced earlier which correspond to events being meaningful for the domain expert. That makes these flows readable for all kinds of project participants. The flow should actually be seen as part of the domain logic and centered around the ubiquitous language as promoted by DDD. “When exactly do we do the payment?” is then easy to answer for everybody – from business users to developers to operations.

Handle Complex Flow Requirements

As we all know: the devil is in the details. As soon as we leave the cozy island of one single microservice, we no longer have atomic transactions at hand, we experience latency and "eventual consistency", and we have to communicate remotely with potentially unreliable partners. Developers therefore have to deal with failures a lot, also with regard to business transactions which can’t be carried out as atomic transactions.

There is a lot of power in workflow engines for these use cases, especially when using a BPMN tool as introduced. We give an example in Figure 5, using the graphical format this time. We catch the error that goods are not available and trigger a so-called compensation. The compensation mechanism of the engine knows which activities were successfully executed in the past and will automatically execute the defined compensation activities, Refund payment in this case. One can leverage this functionality, which nicely implements the so-called Saga pattern.

Figure 5: In case the ordered goods turn out not to be available, the payment is refunded

Note that the shown logic still lives inside a (potentially very lean) service, the Order Service, whereas other parts of the overall flow will be maintained by the teams responsible for those parts. There is no need for any central controller – the flow logic is distributed.
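
To show what the compensation mechanism does conceptually, here is a minimal saga sketch in plain Java. It is not the engine's implementation and keeps its state only in memory, whereas an engine persists it; SimpleSaga, SagaStep and SagaAction are hypothetical names. The sketch remembers which steps completed and, when a later step fails, runs their compensations in reverse order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** An action that may fail, e.g. because goods are not available. */
interface SagaAction {
  void run() throws Exception;
}

/** One step of the flow together with the action that undoes it. */
final class SagaStep {
  final String name;
  final SagaAction action;
  final Runnable compensation;

  SagaStep(String name, SagaAction action, Runnable compensation) {
    this.name = name;
    this.action = action;
    this.compensation = compensation;
  }
}

/** Minimal saga: compensates all completed steps in reverse order on failure. */
class SimpleSaga {
  private final Deque<SagaStep> completed = new ArrayDeque<>();
  private final List<String> log = new ArrayList<>();

  /** Runs the steps in order and returns true if all of them succeeded. */
  boolean run(List<SagaStep> steps) {
    for (SagaStep step : steps) {
      try {
        step.action.run();
        completed.push(step); // most recently completed step ends up on top
        log.add("done: " + step.name);
      } catch (Exception e) {
        log.add("failed: " + step.name);
        while (!completed.isEmpty()) {
          SagaStep undone = completed.pop();
          undone.compensation.run();
          log.add("compensated: " + undone.name);
        }
        return false;
      }
    }
    return true;
  }

  List<String> log() {
    return log;
  }
}
```

For a saga with the steps Retrieve payment, Fetch goods and Ship goods, a failure in Fetch goods triggers only the compensation of Retrieve payment, i.e. the refund; Ship goods is never started.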

Why are State Machines not Commodity for Microservices then?

Existing tools providing the flow logic capabilities needed for long-running services are often named workflow or BPM engines. However, mistakes were made around Business Process Management (BPM) in “the old SOA days” which gave it a bad reputation, especially among developers. They think they get an inflexible, monolithic, developer-averse and expensive tool which forces them to follow some model-driven, proprietary, zero-code approach. And some BPM vendors really deliver platforms which are not usable in the microservices universe. But it’s important to note that there are lightweight open source engines available which provide an easy-to-use, embeddable state machine as shown above. You can leverage these tools to handle the flow instead of re-inventing the wheel, saving you time, a very precious commodity as we all know.

One important aspect of overcoming misconceptions is to take wording seriously. The flows we present here are not necessarily “business processes”, particularly if you “just” want to have a bunch of collaborating microservices forming a business transaction. The flows may also not be “workflows”, as this term is often perceived as involving humans doing some manual work. That’s why we often just talk about “flows”, which works fine for different use cases and different stakeholders.

Example code

The use case presented here is not just pure theory. In order to make the concepts concrete and explainable, we developed the order fulfillment example as a working system composed of multiple microservices. You can find the source code online on GitHub.

Conclusions

Microservices and event driven architectures go very well together. Event choreographies enable decentral data management, typically decrease coupling, and work well for the kind of long running "background" processes we focus on in this article.

Most of the end-to-end flow logic required to support long running business processes should be distributed across the microservices. Every microservice implements the part of the flow it’s responsible for, according to its own business capabilities. We recommend transforming events into commands inside the services responsible for the business decision that something needs to happen. A service responsible for the remaining end-to-end logic can be as lean as possible, but in our view it is better to have one than to rely on non-transparent and tightly coupled event chains.

For implementing the flow you can leverage existing lightweight state machines and workflow engines. These engines can be embedded into your microservice, avoiding any central tool or central governance. You can see them as a library helping the developer. As a bonus, you get graphical representations of the flow helping you throughout your project. You might have to overcome some common misconceptions about workflow or BPM in your company but believe us, it’s worth it!

About the Authors

Bernd Rücker has helped many customers to implement business logic centered around long running flows, for example the order fulfillment process of the rapidly growing start-up Zalando, which sells clothes worldwide, or the provisioning process for e.g. SIM cards at a couple of big telecommunication companies. During that time he contributed to various open source projects, wrote two books and co-founded Camunda. Currently he thinks about how flows will be implemented in next generation architectures.

Martin Schimak has been into long running flows for 15 years, in fields as diverse as energy trading, wind tunnel organization and contract management of telecommunication companies. As a coder, he has a soft spot for readable APIs and testable specs and has made manifold contributions on GitHub. As a domain “decoder”, he is on a first name basis with Domain-Driven Design as well as BPMN, DMN and CMMN. He is also co-editor of the German software magazine OBJEKTspektrum.

Community comments

  • SAM Pattern

    by Jean-Jacques Dubray,

    Bernd, Martin:

    thank you for such a precise and complete article on the topic. I am not sure if you ever came across or considered using the SAM Pattern for orchestration. Classical State Machines (Petri Nets) tend to have some issues. TLA+ offers a much better foundation for this kind of state machines.

    A couple of years ago I created a Java and JavaScript library (someone is actually currently porting it to C#). I would not recommend using "Sagas" or "Workflow". This is a small example that illustrates how the library works.

    The main advantage of SAM is that it gives you a robust state machine structure, in code. No need to use a different language. Even better the state machine structure is nearly invisible to the developer.

  • AsyncAPI specification

    by Francisco Méndez Vilas,

    Nice article. I guess you might be interested in the AsyncAPI specification: asyncapi.org.

  • Great!

    by Leonardo Rafaeli,

    Very good article lads, congratulations.

    Question: How scalable is Camunda? Assuming I spawn loads of containers, will Camunda work well against parallel processing? Or it will centralize all workflow instance information into a single server?

  • Re: Great!

    by Bernd Ruecker,

    Hi Leonardo.

    Camunda is very scalable, we have big customers using it in huge scenarios. The limitation of the current architecture is the relational database used underneath. So you can spawn a lot of containers and work can be perfectly distributed, but a logical cluster meets on the database, which can become the bottleneck in very high-load scenarios. The few customers who are facing this use logical shards to run multiple logical Camunda instances. With microservices every microservice can quite naturally become a shard on its own, easing load on the various Camunda instances very much. So this typically is no problem.

    If you still worry about scalability one additional hint: we work on a next generation engine which implements persistence differently (basically using event sourcing). We will release the first version and put it open source next month - stay tuned :-)

    Cheers
    Bernd

  • Re: Great!

    by Martin Schimak,

    Hi Leonardo,

    many thanks for your encouragement. :-) As it's out now, I add the link to the new open source project Bernd was referring to in his answer: zeebe.io/

    Cheers,
    Martin.

  • zero code

    by carol mcdonald,

    In numerous areas in the past there have been zero-code approaches which could not keep up with technology changes or became legacy for various reasons, and got replaced by newer programming. I remember numerous times when programming was going to be completely replaced by various GUI tools, yet we are still programming :) Programmers like to program, non-programmers like a GUI fix.
