Resilience in Deep Systems


Key Takeaways

  • When a system grows deep it complicates the company’s ability to quickly diagnose and respond to errors or performance bottlenecks. 
  • Look to encapsulate and focus a service on a real business need to balance service granularity and the depth of the system.
  • Look to define a set of cohesive and loosely coupled services that communicate asynchronously as much as possible to increase the fault tolerance of deep systems.
  • Use the right tools to overcome the observability challenges that come with deep systems so that when a problem is happening it’s easy to understand where and why it is happening.
  • Maximize your system’s resilience by bringing together a balanced service granularity, asynchronous communication and the right tools to increase observability. 

When you use a microservices architecture to build a successful product that needs to grow rapidly, sooner or later you realize that your system has become "deep."

The depth of a system can be considered as the number of microservices layers in the application stack.
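To make the definition concrete, here is a minimal sketch that measures depth as the number of services on the longest call chain from an entry point. The service names and the `CALLS` graph are hypothetical, and the sketch assumes the call graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "gateway": ["orders", "catalog"],
    "orders": ["payments", "inventory"],
    "payments": ["fraud-check"],
    "catalog": [],
    "inventory": [],
    "fraud-check": [],
}

def system_depth(calls, entry):
    """Return the number of service layers on the longest call chain from entry."""
    @lru_cache(maxsize=None)
    def depth(service):
        downstream = calls.get(service, [])
        # A leaf service counts as one layer; otherwise add the deepest child.
        return 1 + max((depth(d) for d in downstream), default=0)
    return depth(entry)

# Longest chain here: gateway -> orders -> payments -> fraud-check
print(system_depth(CALLS, "gateway"))  # 4
```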

Today’s leading cloud technologies, such as service meshes, containers, and serverless computing, enable teams to easily add many microservices layers to their system.

A microservice within such a system is not actually independent - it relies on other microservices and vice versa.

When microservices communication grows deep, it complicates the company’s ability to quickly diagnose errors or performance bottlenecks.

Therefore, deep systems are a serious challenge for R&D teams who want to sustain resilience, fault-tolerance, and performance.

Without the right mindset and the right tooling, the product and its customers are put at risk.

It’s possible to build complex systems, with deep chains of microservices, without compromising on resilience.

Here are three ingredients that can help you nourish resilience in deep systems.

1. Service granularity

Don’t follow the hype; make each service correspond to a real business capability.

When designing complex applications using microservice architecture, we’re looking to define a set of cohesive and loosely-coupled services. One of the biggest questions in that regard is, how will we break down our application into microservices?

Because microservices architecture essentially follows the Unix philosophy of "Do one thing and do it well," you could simply say that each atomic function should be a microservice (that is the hype). While it sounds perfect in theory, simply following that philosophy can create an enormous number of microservices. Are you going to be successful in maintaining that many services effectively?

"I’m going to provision the 'integer' service in us-west-1," a hyped developer once said.

In reality, we found that defining microservices that correspond to a real business capability results in a sane number of self-contained pieces of business functionality. Those pieces can still be highly cohesive, loosely coupled, scalable, testable, and built and owned by a small enough team, all of which are the pillars of microservices.

Moreover, it’s a common practice to avoid duplicating code across microservices by creating a shared library, i.e. DRY (Don’t Repeat Yourself). DRY is an important concept, yet it is sometimes overhyped. In reality, we found that shared libraries sometimes couple our microservices to each other, reducing the isolation and independence between them. They also slow teams down when making changes, since a team is not always fully aware of how other teams use the library. In fact, balancing microservices granularity goes hand in hand with finding the right amount of shared libraries. Both shared libraries and code duplication are a burden in a hyper-granular microservices architecture. But when microservice granularity is balanced, the price of code duplication can pay off as increased independence.

2. Sharing data - safe and consistent

Be Aware: The worst monolith you can have is a distributed monolith

Microservices can communicate in a variety of synchronous and asynchronous ways.
As your system grows, the connections between microservices become more complex. Communicating in a fault-tolerant way and keeping the data that moves between services consistent and fresh become a huge challenge.

Sometimes microservices must communicate in a synchronous way. However, using synchronous communication, like REST, across the entire deep system makes the components in the chain very tightly coupled to each other. It increases the dependency on the network’s reliability. Also, every microservice in the chain needs to be fully available to avoid data inconsistency or, worse, a system outage if one of the links in the chain is down. In reality, we found that such a deep system behaves more like a monolith, or more precisely a distributed monolith, which prevents you from enjoying the full benefits of microservices.
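A back-of-the-envelope sketch shows why depth hurts a fully synchronous chain. It assumes, purely for illustration, that every service has the same availability and that failures are independent, so a request succeeds only if every hop succeeds. The 99.9% figure is an assumed input, not a measurement.

```python
def chain_availability(per_service_availability, depth):
    """Availability of a synchronous chain in which every hop must succeed."""
    return per_service_availability ** depth

# Assumed per-service availability of 99.9%, at increasing chain depth.
for depth in (1, 5, 10, 20):
    print(depth, round(chain_availability(0.999, depth), 4))
```

Under these assumptions, a ten-hop chain of 99.9% services ends up at roughly 99.0%, and the loss keeps compounding as layers are added.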

Using an asynchronous, event-driven architecture enables your microservices to publish fresh data updates to other microservices. Unlike synchronous communication, adding more subscribers to the data is easy and will not hammer the publisher service with more traffic.

Asynchronous systems are "eventually consistent". This means that if a microservice lags in consuming data updates, its copy of the data may not be the latest. On the other hand, in read-intensive, high-throughput services, synchronous communication requires complex cache management and purging mechanisms, service discovery, and retry techniques. By using "push" instead of "pull," the system can handle data updates in near real time and shed all that overhead.
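The push model and its eventual consistency can be sketched with a toy in-memory bus. This is only a stand-in for a real event bus, and every name in it (`EventBus`, `LocalPriceCopy`, the `price-updated` topic) is hypothetical. Publishing only enqueues an event; consumers update their local copies when they catch up.

```python
from collections import defaultdict, deque

class EventBus:
    """Toy in-memory stand-in for an event bus (illustration only)."""
    def __init__(self):
        self._subscribers = defaultdict(list)
        self._pending = deque()

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Publishing only enqueues; the publisher never calls consumers directly.
        self._pending.append((topic, event))

    def deliver_pending(self):
        # Stands in for consumers catching up on the stream.
        while self._pending:
            topic, event = self._pending.popleft()
            for handler in self._subscribers[topic]:
                handler(event)

class LocalPriceCopy:
    """Subscriber that keeps its own copy of the data fresh from pushed events."""
    def __init__(self, bus):
        self.prices = {}
        bus.subscribe("price-updated", self._on_price_updated)

    def _on_price_updated(self, event):
        self.prices[event["sku"]] = event["price"]

bus = EventBus()
checkout, search = LocalPriceCopy(bus), LocalPriceCopy(bus)

bus.publish("price-updated", {"sku": "sku-1", "price": 9.99})
stale_view = dict(checkout.prices)  # the consumer has not caught up yet
bus.deliver_pending()               # now both copies converge
print(stale_view, checkout.prices, search.prices)
```

Adding a third `LocalPriceCopy` changes nothing for the publisher: it still publishes one event per update, and the bus handles the fan-out.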

The fact is that event-driven architecture is challenging and not trivial to adopt. You will have to introduce a new tool to your tech stack, the event bus (e.g. Kafka), and learn how to handle a stream of events instead of just responding to a synchronous REST request. Even though adoption is challenging and requires extra effort, many companies shift to such architectures because they help reduce data inconsistency issues and increase the resilience of deep systems.

3. Distributed storytelling

Don’t lose sight of the flow of events; increase observability.

Observability isn’t just knowing that a problem is happening, but knowing why it is happening.
The growing number of microservices layers and the volume of data they log are challenging for both R&D teams and business analysts. Troubleshooting will sometimes require reasoning across multiple codebases, teams, and dashboards. Unfortunately, it sometimes comes with "on-call shifts from hell" and "blame games" between teams.

Assume that you already collect logs and metrics from your various microservices into a centralized analytical tool (like ElasticSearch and/or a data lake) and that you have a set of well-defined alerts. When trying to figure out an issue in a deep system, you might still find yourself losing sight and saying: "we don’t have enough visibility, let’s create a dashboard" or "OMG! We have too many dashboards."

At that point, ask yourself: how can a dashboard or a log trace tell a story?
Here are some ideas on visualizing the "what and why" of what’s happening:

Correlation ID/Trace ID

Deep systems are composed of chains of microservices, each of which logs meaningful operational data. By assigning a correlation ID at the beginning of a workflow (whether it’s an HTTP request or a triggered job), and then propagating it in every log message, in subsequent requests down the chain, and in the response, you gain the ability to track a complete flow through all communication channels. Looking at your logs, whether for one specific correlation ID or grouped by correlation ID, can tell a great story that helps you quickly find and pinpoint issues across your deep system.
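A minimal sketch of that propagation, using only the Python standard library, might look like the following. The header name `X-Correlation-ID` and the helper names `start_workflow` and `outgoing_headers` are assumptions made for illustration; use whatever convention your services agree on.

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical header name; pick whatever your services agree on.
CORRELATION_HEADER = "X-Correlation-ID"
_correlation_id = ContextVar("correlation_id", default="-")

class CorrelationIdFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""
    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True

def start_workflow(incoming_headers):
    """Reuse the caller's ID if present; otherwise this is the start of a flow."""
    cid = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid

def outgoing_headers():
    """Headers to attach to every downstream request and to the response."""
    return {CORRELATION_HEADER: _correlation_id.get()}

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start_workflow({})               # edge service: no incoming ID, so generate one
logger.info("order received")    # the log line is prefixed with the ID
downstream = outgoing_headers()  # pass the same ID along on the next hop
```

A downstream service calls `start_workflow` with the incoming headers, so it logs under the same ID instead of minting a new one.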

Distributed tracing

Distributed tracing systems, like Zipkin and Jaeger, enable you to trace call stacks across different systems and services. They also work by automatically creating unique trace IDs. However, unlike the plain "Correlation ID," these systems come with powerful UI tools for drilling deep into your system and pinpointing issues. Also unlike the "Correlation ID," tracing usually runs on only a small portion of your traffic, so although each trace is very detailed, you don’t get a complete view. Moreover, distributed tracing can sometimes miss the business perspective, since traces can be hard for non-engineers (e.g. business analysts, support) to understand.
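To illustrate what tracing "a small portion of your traffic" can mean, here is a sketch of a deterministic sampling decision derived from the trace ID. It is an illustrative technique under assumed names (`should_sample`, `sample_rate`), not a description of how Zipkin or Jaeger implement sampling.

```python
import hashlib

def should_sample(trace_id, sample_rate):
    """Deterministic decision: every service that sees trace_id agrees on it."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# With an assumed 10% rate, only about one in ten flows gets a full trace.
sampled = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
print(sampled)
```

Because the decision depends only on the trace ID, a flow is either traced end to end or not at all, which is why the unsampled flows still need correlation IDs in the logs.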

Monitor workflow automation

Do you have a business SLA where an event needs to be processed within a certain amount of time while going through a chain of microservices? Tools like Zeebe (a workflow engine for microservices orchestration) enable you to monitor for timeouts or other workflow errors and to configure error-handling strategies such as automatic retries or escalation to teams that can resolve an issue manually. Zeebe can record 100% of your business events and provide visibility into the state of a company’s end-to-end workflows, including the number of in-flight workflows, average workflow duration, errors within a workflow, etc. Such a tool is usually more suitable for the various non-engineer stakeholders. If you have ever tried to collect that telemetry in a deep system, you know it isn’t a trivial task.
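The kind of telemetry described above can be sketched with a toy tracker. This is not the Zeebe API; `WorkflowMonitor` and its methods are hypothetical names, and timestamps are passed in as plain seconds to keep the example deterministic.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowMonitor:
    """Toy tracker for end-to-end workflow SLAs (not the Zeebe API)."""
    sla_seconds: float
    _started: dict = field(default_factory=dict)
    _durations: list = field(default_factory=list)

    def started(self, workflow_id, at):
        self._started[workflow_id] = at

    def completed(self, workflow_id, at):
        # Record the duration and drop the workflow from the in-flight set.
        self._durations.append(at - self._started.pop(workflow_id))

    def in_flight(self):
        return len(self._started)

    def average_duration(self):
        return sum(self._durations) / len(self._durations) if self._durations else None

    def breaching(self, now):
        """In-flight workflows that have already exceeded the SLA."""
        return sorted(w for w, t in self._started.items() if now - t > self.sla_seconds)

monitor = WorkflowMonitor(sla_seconds=30)
monitor.started("order-1", at=0)
monitor.started("order-2", at=10)
monitor.completed("order-1", at=12)
print(monitor.in_flight(), monitor.average_duration(), monitor.breaching(now=45))
```

A real workflow engine would act on the breaching list, for example by retrying a step or escalating to a team.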

The combination of these observability techniques can enable you to focus on and pinpoint issues faster. Many tech companies have found that it can also spare teams who are not needed for incident resolution from being pulled in unnecessarily.

Putting it all together

The right mindset and the right tooling come together.

The first key to resilience is to start by balancing granularity. If you choose to build your microservices in an extremely granular way, maintenance, data consistency, and overall observability can simply become a nightmare. Resist the temptation to go all-in with granularity and shared libraries. Prefer to define microservices that answer real business capabilities and examine whether your shared code is speeding you up or holding you back.

Your microservices will have to communicate, share data, and keep that data fresh. So the second key to resilience is to shift your mindset towards asynchronous, event-driven architecture wherever possible. It will be hard and time-consuming at the beginning. Synchronous approaches such as REST are easier to implement, while event-driven is less trivial. However, sooner is better. Do it before your traffic grows and causes your microservices to hammer each other to death just to keep their copies of the data fresh and consistent. Try to avoid wasting precious resources in a very error-prone environment.

With that in mind, the third key to resilience is to use the right tools to increase observability. That will enable you, when the need arises, to introduce more business capabilities (i.e. microservices) and more communication channels between them without losing sight. Remember that as your product and technology stack grow, so does the amount of telemetry you need to monitor and analyze. Being able to easily reason about issues means a healthier business and happier teams.

About the Author

Amir Souchami is Chief Architect, ironSource Aura. With a great passion for technology, he’s constantly learning the latest to stay sharp and create highly scalable solutions with a positive business ROI. Amir loves working with teams and individuals to envision and achieve their goals. Follow up with him to chat about: Empathy, Yoga, Hiking, Startups, AdTech, Machine Learning, Stream Processing, Continuous Delivery, and Microservices.

