Common Pitfalls in Microservice Integration: Bernd Rücker at QCon London

In a microservices architecture, every microservice is a separate application, typically with its own data storage, that communicates with other microservices over a network. This creates a highly distributed environment, and with that come challenges, Bernd Rücker explained in his presentation at QCon London 2018. The talk explored common pitfalls in microservice integration and solutions that included workflow engines, either embedded in a service or offered as a service.

Communication

Communication is complex, and this complexity should not be hidden; instead, a service should be designed to handle failures internally. Rücker, co-founder of Camunda, uses an example from his own experience where he tried to check in for a flight and expected a boarding pass in return. Instead, he got an error message saying that the airline had technical issues and asking him to try again later. For him this is bad design and a typical problem: instead of asking the customer to retry, the service provider should handle the error itself and send the boarding pass once it has managed to create it.

The same behaviour can be applied in service-to-service communication. If a service can resolve failures internally, it should retry and asynchronously return a response when it is available. This encapsulates the error handling, thus making the API both cleaner and simpler.

Rücker calls this stateful retries, and in his experience a common reason for not using this pattern is the perceived complexity of the state handling it needs. Often, state is stored as a persistent entity or document, but a scheduler and other components are also needed, for instance to trigger the retries. His recommendation is to use a workflow engine to take care of all these details. He also points out that even though retries should often be used, there may be use cases in a domain where error handling should be left to the client.
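The moving parts described above (persisted state plus a scheduler) can be sketched in a few lines. This is a minimal illustration, not code from the talk: the names StatefulRetrier, PendingRequest and run_due are hypothetical, an in-memory dict stands in for a database, and the exponential backoff is an assumed policy. A workflow engine would replace all of it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class PendingRequest:
    """State that would normally be persisted between retry attempts."""
    request_id: str
    payload: str
    attempts: int = 0
    next_attempt_at: float = 0.0
    result: Optional[str] = None


class StatefulRetrier:
    """Hypothetical helper: retries a downstream call on the caller's behalf."""

    def __init__(self, call: Callable[[str], str],
                 max_attempts: int = 5, base_delay: float = 1.0) -> None:
        self.call = call
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        # Stand-in for a persistent table of pending requests.
        self.store: Dict[str, PendingRequest] = {}

    def submit(self, request_id: str, payload: str) -> None:
        # Accept the request immediately; the result is delivered later.
        self.store.setdefault(request_id, PendingRequest(request_id, payload))

    def run_due(self, now: float) -> None:
        """One scheduler tick: attempt every unfinished request that is due."""
        for req in self.store.values():
            if req.result is not None or req.attempts >= self.max_attempts:
                continue
            if req.next_attempt_at > now:
                continue
            req.attempts += 1
            try:
                req.result = self.call(req.payload)
            except ConnectionError:
                # Assumed policy: exponential backoff before the next attempt.
                delay = self.base_delay * 2 ** (req.attempts - 1)
                req.next_attempt_at = now + delay
```

The client calls submit once and is notified (or polls) when result is set; the retries themselves never surface in the API.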

Asynchronicity

For Rücker, there are many advantages to using asynchronous communication, and often this implies using messaging. A problem that then arises is timeouts. In his boarding pass example, had the process been implemented with messaging and the boarding pass message was never created or somehow got lost, it would again be up to the customer to handle the failure. What is needed on the server side is some form of monitoring that discovers messages that are lost or arrive late. Commonly this is done using messaging middleware, but Rücker has met some customers that have implemented this using a workflow engine. However, this requires an engine that behaves like a message queue by using the pull principle.
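A rough sketch of such server-side monitoring, under stated assumptions: every outgoing message carries a correlation ID, and a periodic job asks which replies are overdue. The ReplyWatchdog class and its method names are hypothetical, and the state is held in memory only for brevity.

```python
from typing import Dict, List


class ReplyWatchdog:
    """Hypothetical monitor for replies that are lost or arrive late."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        # correlation ID -> time by which the reply is expected
        self.deadlines: Dict[str, float] = {}

    def sent(self, correlation_id: str, now: float) -> None:
        # Record the deadline when the request message is sent.
        self.deadlines[correlation_id] = now + self.timeout

    def received(self, correlation_id: str) -> None:
        # A reply arrived; stop watching this correlation ID.
        self.deadlines.pop(correlation_id, None)

    def overdue(self, now: float) -> List[str]:
        # Correlation IDs whose deadline has passed without a reply.
        return sorted(cid for cid, deadline in self.deadlines.items()
                      if deadline <= now)
```

Whatever calls overdue can then resend the request or escalate, so the failure is handled by the provider rather than by the customer.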

Distributed transactions

In distributed systems, ACID transactions don’t work unless you use two-phase commits, and two-phase commits are now mostly seen as too complicated; here Rücker refers to a paper by Pat Helland: Life Beyond Distributed Transactions: An Apostate’s Opinion. Instead, Rücker prefers long-running business transactions in cases where several activities must be carried out with "all or nothing" semantics. One solution for this is the Saga pattern, where you work with multiple steps, eventual consistency, and compensations if something fails.
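The core of the Saga pattern fits in a short function. The sketch below is an illustration under simplifying assumptions, not Rücker's implementation: run_saga and the Step type are hypothetical names, steps run in-process, and nothing is persisted, which a real long-running transaction would need.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo everything that succeeded so far, most recent first.
            # The failed step itself is assumed to have had no effect.
            for _, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensate))
    return True
```

For example, if a trip booking consists of hotel, car and flight, and the flight step raises, the car and then the hotel are cancelled, leaving the system eventually consistent rather than atomically rolled back.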

To use sagas, every involved service provider must offer compensation activities and Rücker strongly recommends that they also be idempotent. In network communication there are three failure scenarios that you cannot differentiate between:

The request didn’t reach the service provider
The request did reach the provider, but it failed during processing
The request was processed, but the response from the provider was lost

One solution when an error is detected is to ask the service provider about the request, but this means it must be possible to distinguish it from other requests. The common approach is to just retry the request, but this means it must be idempotent, and he mentions four types of idempotency:

Natural, for instance when setting a specific state
Business, where you have a business identifier, like an email address
Unique ID, generated by the client
Request hash, where the service recognizes a request by a hash of the message
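Two of these types, the client-generated unique ID and the request hash, can be sketched as follows. This is a hedged example: PaymentService, charge and request_hash are hypothetical names, the processed-request table is an in-memory dict, and hashing canonical JSON with SHA-256 is one assumed way to fingerprint a message.

```python
import hashlib
import json
from typing import Dict, Optional


def request_hash(payload: dict) -> str:
    """Fingerprint a request; key order must not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PaymentService:
    """Hypothetical provider whose charge operation is safe to retry."""

    def __init__(self) -> None:
        # Idempotency key -> receipt already issued (stand-in for a table).
        self.processed: Dict[str, str] = {}
        self.charges = 0

    def charge(self, payload: dict, request_id: Optional[str] = None) -> str:
        # Prefer the client-generated ID; fall back to a hash of the message.
        key = request_id or request_hash(payload)
        if key in self.processed:
            # Duplicate delivery: return the earlier result, charge nothing.
            return self.processed[key]
        self.charges += 1
        receipt = f"receipt-{self.charges}"
        self.processed[key] = receipt
        return receipt
```

With this in place, a client that cannot tell which of the three failure scenarios occurred can simply resend the request. One caveat of the hash variant is that two intentionally identical requests are also treated as one, which is why a client-generated ID is often the safer choice.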

Rücker notes that an embedded workflow engine can implement the saga pattern, and points out that in a microservice-based system there are commonly multiple engines within the different microservices, each handling different workflows. He emphasizes that the engines are embedded; there is no central engine that every workflow has to pass through.

Looking at the space of workflow engines and state machines, he notes that there are several open source frameworks, and that new frameworks have emerged during the last one to two years. In the serverless space, AWS has created Step Functions and other cloud vendors are at least thinking in this direction.

Rücker has published sample code implementing his ideas. This presentation was not recorded, but most presentations at the conference were, and will be available on InfoQ over the coming months.

The next QCon conference, QCon.ai, will focus on AI and machine learning and is scheduled for April 9 – 11, 2018, in San Francisco. QCon London 2019 is scheduled for March 4 - 8, 2019.
