Chaos Engineering and Observability with Visual Metaphors


Key Takeaways

  • For modern software systems, observability is not about mathematical equations. It is about how people interact with and try to understand their complex systems. 
  • Chaos engineering relies on observability, which makes it possible to detect a deviation from the steady state of a system. Observability, in turn, benefits from chaos engineering, which helps to discover and overcome the weaknesses of a system.
  • Observability feeds on the signals that a system emits and that provide the raw data about the behaviour of the system. However, observability is limited not only by the quality of those signals, but also by the way in which those signals are visualised and interpreted.
  • Considering that chaos engineering, observability and visualisation involve humans and their individual interpretations, it is a fact that the designers of dashboards can bias those interpretations. In this sense, visual metaphors are not a guarantee that we are interpreting these data in a proper way.
  • Dashboards based on visual metaphors can provide more useful data than classical visualisations. However, both strategies can easily be biased; for example, in a study, most participants noted that the overall results were biased by poorly designed bar and line charts that did not show the important cutoff points in the plots.

Since leading technology companies such as Netflix, Slack, and LinkedIn adopted chaos engineering to withstand unexpected disruptions in production, the discipline has become mainstream. Along the way, observability has played a critical role, bringing the power of data and monitoring to engineers, who now have strategies to understand their systems, determine how they will behave when something fails, and add resilience and reliability.

Chaos engineering and observability are two closely connected disciplines. According to Russ Miles, “The principles of observability turn the systems into inspectable and debuggable crime scenes, and chaos engineering encourages and leverages observability as it seeks to help to pre-emptively discover and overcome the system weaknesses”. Chaos engineering demands observability because, to execute a chaos experiment with confidence, observability must detect when the system is normal and how it deviates from that steady state as the experiment’s method is executed. This relationship is illustrated in Figure 1.

Figure 1. Chaos engineering and observability

Both academia and the tech industry have made a huge effort to provide tools for practising chaos engineering and observability. However, the visualisation of metrics and the appropriate selection of visual strategies are still limited. This article introduces a new actor: visual metaphors. Specifically, it provides the conceptual foundations of chaos engineering and observability, presents the state of the art of visualisation techniques available in the market, and shows how treemaps, gauge charts, and geocentric and city metaphors can enrich the spectrum of visual strategies for observing chaos.

Foundations of Chaos Engineering and Observability

In chaos engineering, chaos, resilience and reliability are key concepts; in observability, monitoring, metrics and dashboards are critical when humans want to observe their systems. Before delving into the relationship between chaos engineering and observability, it is important to have those definitions clear.

Chaos engineering is defined by the Principles of Chaos Engineering as the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. To address the uncertainty of distributed systems at scale, chaos engineering provides a methodology based on experimentation that follows four steps. The first step is to define the steady state, a measurable output of a system that indicates normal behaviour. The second step is to formulate a hypothesis, a statement of the expected consequences of altering the steady state. The third step is to introduce real-world events, such as servers that crash or hard drives that malfunction, in order to confirm or disprove the hypothesis. Finally, the fourth step is to analyse the difference in steady state between the control group and the experimental group.
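The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `measure_success_rate` is a hypothetical stand-in that simulates a probe with random numbers, and the metric, baseline, tolerance and failure probabilities are invented for the example rather than taken from any real system or tool.

```python
import random
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Step 1: a measurable output that indicates normal behaviour."""
    metric: str
    baseline: float
    tolerance: float  # allowed relative deviation, e.g. 0.05 for 5%

    def holds(self, observed: float) -> bool:
        return abs(observed - self.baseline) <= self.tolerance * self.baseline


def measure_success_rate(fault_injected: bool, requests: int = 1000, seed: int = 42) -> float:
    """Hypothetical probe: simulates the success rate of a service."""
    rng = random.Random(seed)
    failure_probability = 0.20 if fault_injected else 0.01  # invented values
    successes = sum(rng.random() >= failure_probability for _ in range(requests))
    return successes / requests


def run_experiment(steady_state: SteadyState) -> dict:
    # Step 2: hypothesis (assumed here): the success rate stays within
    # tolerance of the baseline even when the fault is injected.
    # Step 3: introduce the real-world event in the experimental group only.
    control = measure_success_rate(fault_injected=False)
    experimental = measure_success_rate(fault_injected=True)
    # Step 4: analyse the difference in steady state between the two groups.
    return {
        "control": control,
        "experimental": experimental,
        "difference": control - experimental,
        "hypothesis_holds": steady_state.holds(experimental),
    }


steady_state = SteadyState(metric="success_rate", baseline=0.99, tolerance=0.05)
report = run_experiment(steady_state)
print(report)
```

In a real experiment the simulated probe would be replaced by measurements from the monitoring platform, and a disproved hypothesis would point to a weakness worth investigating.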

Observability is the ability to fully understand a system. In control theory, it is defined as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In software engineering in particular, observability can be described as the art of asking the right questions, providing the correct answers and building knowledge from the data collected.

Monitoring is different from observability, and it is important to understand the difference. Monitoring is about collecting, processing, aggregating, and displaying real-time quantitative data about a system, while observability is about processing and analysing that data, allowing teams to actively understand and debug the behaviour of their systems. For modern software systems, observability is not about mathematical equations. It is about how people interact with and try to understand their complex systems.

In this sense, monitoring involves reading the signals that systems emit as numbers, which are called metrics. A metric is a single number, with tags optionally appended for grouping and searching; examples include query counts and types, error counts and types, processing times, and server lifetimes. These values are visualised in dashboards, which are applications that provide a summary view of a service’s core metrics.
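As a rough illustration of a metric as "a single number with optional tags", the sketch below models one in Python and groups values by a tag, as a dashboard query might. The `Metric` class, the `sum_by_tag` helper and the sample values are hypothetical and do not correspond to any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A single number plus optional tags for grouping and searching."""
    name: str
    value: float
    tags: dict[str, str] = field(default_factory=dict)


def sum_by_tag(metrics: list[Metric], name: str, tag: str) -> dict[str, float]:
    """Total the values of one metric name, grouped by the value of one tag."""
    totals: dict[str, float] = defaultdict(float)
    for metric in metrics:
        if metric.name == name and tag in metric.tags:
            totals[metric.tags[tag]] += metric.value
    return dict(totals)


# Invented sample data for illustration only.
samples = [
    Metric("error_count", 3, {"service": "ms_payments", "type": "timeout"}),
    Metric("error_count", 1, {"service": "ms_payments", "type": "http_500"}),
    Metric("error_count", 2, {"service": "ms_patients", "type": "timeout"}),
    Metric("query_count", 120, {"service": "ms_patients"}),
]

errors_by_service = sum_by_tag(samples, "error_count", "service")
errors_by_type = sum_by_tag(samples, "error_count", "type")
print(errors_by_service)
print(errors_by_type)
```

The same error counts answer two different questions depending on the tag used for grouping, which is what makes tags useful for searching.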

Traditionally, dashboards have been built on line charts, pie charts or bar plots. Considering that observability depends on the signals that systems emit and on the quality with which those signals are visualised and interpreted, it is important to provide the best tools and designs. If colours, legends and scales are not used properly, some visualisations can be limiting and confusing for operators. The next section provides the state of the art of monitoring and observability, and describes some of those limitations in more detail.

Monitoring and Observability

Monitoring and observability have become two of the most essential capabilities for engineering teams and, in general, for modern digital enterprises that want to deliver excellence in their solutions. Since there are many reasons to monitor and observe systems, Google has documented four golden signals, metrics that define what it means for a system to be healthy and that are the foundation of current observability and monitoring platforms. The four signals are described below:

Latency is the time that a service takes to serve a request. It is important to distinguish the latency of successful requests from that of failed requests: an HTTP 500 error triggered by the loss of a connection to a database or other critical backend might be served very quickly, which can skew overall latency figures. Error latency still needs to be tracked, since a slow error is even worse than a fast error.

Traffic is a measure of how much demand is being placed on the system. It determines how much stress the system is taking at a given time from users or transactions running through the service. For a web service, for example, this measurement is usually HTTP requests per second. By monitoring real-user interactions and traffic in the application or service, engineering teams can see how the system holds up to changes in demand and how they should scale their resources to meet that demand.

Errors are measured as the rate of requests that fail, either explicitly or implicitly. Monitoring error cases can differ drastically depending on the system and the component that is failing. That is why engineering teams need to monitor the rate of errors across the entire system but also at the individual service level. It’s also important to prioritise which errors are critical and which ones are less dangerous.

Finally, saturation is the signal that describes the utilisation of a resource, such as memory, I/O or CPU. Considering that many systems degrade in performance before they reach 100% utilisation, having a saturation target is essential. It allows us to answer questions like: how much more capacity does the service have? And what level of saturation ensures service performance and availability for customers?
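To make the four signals concrete, the following sketch derives them from a small window of request records. It is a simplification under stated assumptions: the `Request` type, the sample values and the 80% saturation target are invented, only explicit failures (HTTP 5xx) are counted as errors, and the latency of failed requests is kept separate from that of successful ones so that fast failures do not skew the figure.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def golden_signals(requests: list[Request], window_seconds: float,
                   cpu_utilisation: float, saturation_target: float = 0.8) -> dict:
    """Summarise latency, traffic, errors and saturation for one time window."""
    succeeded = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        "latency_ok_ms": mean(succeeded),      # latency of successful requests
        "latency_error_ms": mean(failed),      # latency of failed requests
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": cpu_utilisation,
        "over_target": cpu_utilisation >= saturation_target,
    }


# Invented two-second window with four requests, one of which failed quickly.
window = [Request(100, 200), Request(300, 200), Request(20, 500), Request(200, 200)]
signals = golden_signals(window, window_seconds=2, cpu_utilisation=0.85)
print(signals)
```

Production systems usually report latency as percentiles rather than a mean; the mean is used here only to keep the example short.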

Traditional Visualisations for Monitoring

Nowadays, the four golden signals described in the previous section are monitored using traditional methods such as line charts, bar plots or pie charts.

A line chart is the most common strategy to visualise the behaviour of the four golden signals of a system over time, as illustrated in Figure 2.

Figure 2. Line charts in a fictional project

Line charts present different challenges with colour, legends, axis titles and series because the variables converge, cross, and generally tangle together. If the creators of a dashboard do not use the proper visual assets, this type of graphic can become one of the most confusing charts.

Another common chart is the bar chart, which is used to present categorical data with rectangular bars whose heights or lengths are proportional to the values that they represent. Some cloud providers use them to represent categorical data from logs, as shown in Figure 3.

Figure 3. Bar plot in a fictional project

Finally, although less used, pie charts are a simple way to represent and compare proportions in the distribution of data. They are most effective when one proportion is dominant, such as a half or three-quarters. Using more than a few wedges, in more than a few colours, creates a sameness among wedges that makes it hard to compare values.

Considering these limitations, the next section presents a different way to visualise the four golden metrics. Since this article is about chaos engineering, this technique is analysed in a scenario in which an incident is reported.

Visual Metaphors as a Proposal to Visualise Chaos

To overcome the limitations mentioned before, this article proposes a new strategy to visualise chaos in production. The proposal is based on a concept conceived in other fields of science: visual metaphors. Visual metaphors are a strategy to map from concepts and objects of an application domain to a system of similarities and analogies. A computer metaphor is the basic idea of assimilation between interactive visual objects and model objects. Its role is to promote a better understanding of the semantics of an object. A familiar example is placing a panther in front of a sports car in a picture, suggesting that the product has comparable qualities of speed, power, and endurance.

Some examples include maps, cities and geometric scenes. Figure 4 shows the city metaphor, a popular method for visualising properties of program code. Many projects have employed this metaphor to visualise properties of software repositories, for example. Existing research has mapped packages to neighbourhoods and classes to buildings.


Figure 4. City metaphor in a fictional project, taken from here.

In this case, the metaphor represents the classes as buildings, and the packages as neighbourhoods in which the buildings reside. Each edge of the building is used to map properties of the classes.
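The mapping behind a city metaphor can be sketched as a plain data transformation. The choice of properties below (methods for height, attributes for width, lines of code for colour) is an assumption made for illustration and is not taken from the figure; the class and package names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass
class ClassMetrics:
    package: str
    name: str
    methods: int
    attributes: int
    lines_of_code: int


@dataclass
class Building:
    name: str
    height: float  # assumed mapping: number of methods
    width: float   # assumed mapping: number of attributes
    colour: str    # assumed mapping: highlights very large classes


def to_building(cls: ClassMetrics, loc_threshold: int = 500) -> Building:
    """Map the properties of one class to the dimensions of one building."""
    colour = "red" if cls.lines_of_code > loc_threshold else "grey"
    return Building(cls.name, float(cls.methods), float(cls.attributes), colour)


def build_city(classes: list[ClassMetrics]) -> dict[str, list[Building]]:
    """Group buildings into neighbourhoods, one neighbourhood per package."""
    city: dict[str, list[Building]] = {}
    for cls in classes:
        city.setdefault(cls.package, []).append(to_building(cls))
    return city


# Hypothetical classes, invented for the example.
classes = [
    ClassMetrics("billing", "InvoiceService", methods=12, attributes=4, lines_of_code=650),
    ClassMetrics("billing", "TaxRule", methods=3, attributes=2, lines_of_code=80),
    ClassMetrics("patients", "PatientRecord", methods=7, attributes=9, lines_of_code=210),
]

city = build_city(classes)
for neighbourhood, buildings in city.items():
    print(neighbourhood, buildings)
```

The same structure could represent operational data instead of code, for instance one building per microservice with its height driven by CPU utilisation.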

Presenting an experiment for visualising incidents

To identify the perception of engineers involved in operations activities, 28 of them were surveyed regarding classical dashboards and visual metaphors. Specifically, they were asked about incidents in which the four golden signals (errors, latency, traffic and saturation) were visualised using classical dashboards and visual metaphors.

The study consisted of specific questions about an incident, for which two visualisations were provided: one with a traditional chart and another with a visual metaphor. For each situation, the value of each type of visualisation was analysed. In the next paragraphs, each question and its analysis are presented.

Regarding the demographics, the 28 participants had backgrounds distributed among backend, frontend and full stack engineers, software architects, data engineers and site reliability engineers. The largest group was backend engineers, as illustrated in Figure 5.

Figure 5. Demographic data

The first question was about the saturation signal. Two dashboards, a line chart and a city metaphor, were used to ask about the state of five microservices: ms_authentication, ms_patients, ms_payments, ms_medications and ms_appointments. These microservices were part of a fictional healthcare system.

Specifically, the question was: using a traditional dashboard (see Figure 6) and a visual metaphor (see Figure 7), which microservice was impacted? The correct answer was ms_authentication.

Figure 6. Traditional line chart

Figure 7. Visual city metaphor

As illustrated in Figure 8, some participants changed their answer to the correct one when they used the visual metaphor.

Figure 8. Answers from participants using traditional charts vs visual metaphors

All participants agreed that the microservice impacted by high CPU utilisation was ms_authentication. In this case, the visual metaphor was more useful than the traditional chart: the line chart was confusing and poor in colours, whereas the shapes and sizes in the metaphor changed the perception of the participants.

For the errors signal, a classical bar chart and a treemap were used to ask the participants to calculate the average number of errors for each microservice, as illustrated in Figures 9 and 10.

Figure 9. Traditional Bar Plot for visualising Errors

Figure 10. Visual treemap metaphor for visualising errors

The correct answer was ms_appointments, and although some participants did not choose it, many of them changed their answer when they used the visual metaphor, as Figure 11 illustrates.

Figure 11. Answers of participants using Traditional Charts vs Visual Metaphors for visualising errors
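A treemap gives each item a rectangle whose area is proportional to its value, which is why the largest contributor tends to stand out. The sketch below uses a single-level slice layout, the simplest treemap layout, with made-up error counts; it is not the layout or the data behind Figure 10.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    label: str
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def slice_layout(values: dict[str, float], width: float, height: float) -> list[Rect]:
    """Split a width x height canvas into vertical strips, largest value first."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("a treemap needs a positive total")
    rects: list[Rect] = []
    x = 0.0
    for label, value in sorted(values.items(), key=lambda item: item[1], reverse=True):
        strip_width = width * value / total  # area is proportional to value
        rects.append(Rect(label, x, 0.0, strip_width, height))
        x += strip_width
    return rects


# Invented error counts per microservice, for illustration only.
errors = {
    "ms_authentication": 10,
    "ms_patients": 20,
    "ms_payments": 10,
    "ms_medications": 20,
    "ms_appointments": 40,
}

rects = slice_layout(errors, width=100, height=50)
for rect in rects:
    print(f"{rect.label}: x={rect.x:.0f}, width={rect.width:.0f}, area={rect.area:.0f}")
```

Real treemap implementations typically use a squarified layout to keep rectangles close to square, which makes areas easier to compare than the thin strips produced here.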

Regarding the traffic signal, a classical bar chart and a geocentric metaphor were used to ask the participants which third-party service received the most traffic. In this case, the interaction between the original microservices and four new third-party services (srv_ldap, srv_goverment, srv_assurance and srv_authentication) was analysed. Figure 12 shows this integration using a bar plot and Figure 13 presents the same traffic values using a geocentric metaphor. In the metaphor, the circles represent the services and microservices, and the lines represent the connections among them.

Figure 12. Traditional bar plot for visualising traffic between microservices and third-party services

Figure 13. Visual geocentric metaphor for visualising traffic between microservices and third-party services

Despite using lines and sizes to represent the connections and the traffic load among the microservices and third-party services, the metaphor was confusing for the participants. It is possible that the size of the circles explains the lower percentage of answers for srv_ldap, which was the correct answer and is represented by the green portion of the pie chart (see Figure 14).

Figure 14. Answers of participants using traditional charts vs visual metaphors for visualising traffic between microservices and third-party services.

Finally, we analysed the latency signal using a bar plot visualisation versus a gauge metaphor. Both visualisations are illustrated in Figures 15 and 16, respectively.

Figure 15. Traditional bar plot for visualising Latency signals.

Figure 16. Visual gauge metaphor for visualising Latency of microservices.

In this case, the metaphor clearly did not add value for the participants. The correct answer was ms_patients, and the answers are illustrated in Figure 17.

Figure 17. Answers of participants using traditional charts vs visual metaphors for visualising latency between microservices and third-party services.

Conclusions of Introducing Visual Metaphors

The visualisation of chaos, and specifically of incidents in production, presents several challenges for industry and academia focused on observability. As we showed in this article, since chaos engineering, observability and visualisation involve the interaction of humans with machines, bias in the interpretations is a constant risk. Through a study in which 28 engineers answered 12 questions related to classical dashboards versus visual metaphors, it was possible to conclude that observability is limited not only by the quantity and quality of the signals, but also by the way in which those signals are visualised and interpreted. The conclusion was that visual metaphors can perform better than classical dashboards. However, since both involve humans, neither is a guarantee that operators are interpreting the data in an incident in a proper way.

Interested in learning more about observability in chaos engineering?

I would like to recommend these three books:

  • My first recommendation is the book Chaos Engineering Observability by Russ Miles. In this book, the author demonstrates how to bring your chaos experiments into the world of system observability. Chaos observability enables teams to surface, debug, and visualise chaos experiments across the system in real time.
  • Another great reference is the book Observability Engineering by Charity Majors, Liz Fong-Jones and George Miranda, who work at Honeycomb, a company that specialises in observability. I love this book because it treats observability as a field of engineering. As you know, observability is critical for engineering, managing, and improving complex business-critical systems, so it should be considered a discipline, even a role inside an organisation.
  • Finally, I would like to mention a great reference focused on chaos engineering: a practical book by Mikolaj Pawlikowski, published by Manning. The book documents labs and practical experiments for simulating real-life failures. The author helps readers maximise the benefits of chaos engineering by learning to think like a chaos engineer, with examples that cover a whole spectrum of software.
