
Chaos Engineering Observability with Visual Metaphors



Yury Niño Roa introduces a new actor: visual metaphors, discussing visualisation and how to use colours, textures, and shapes to create mental models for observability and chaos engineering.


Yury Niño Roa is a Software Engineer with 8+ years of experience designing, implementing and managing the development of software applications using Agile methodologies such as Scrum and Kanban. She has 3+ years of DevOps and SRE experience supporting, automating and optimizing mission-critical deployments, leveraging configuration management, CI/CD, and DevOps processes.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Roa: I am Yury Niño. I'm from Colombia. I am here to speak about observability, chaos engineering, and visual metaphors. I work as a cloud infrastructure engineer at Google. Also, I am a chaos engineering advocate in my country. I am going to provide definitions for three concepts: observability, visualization, and, of course, chaos. In the second part, I am going to explain the classical charts and dashboards that we are using to monitor our systems, and I am going to show you several weaknesses of these charts. With this context, I am going to explore another point of view, the so-called visualization with metaphors. Finally, I am presenting the results of a survey that I applied among some colleagues. With this survey, I tried to see the effectiveness of the classical charts and dashboards, and to identify whether visual metaphors could be useful for improving the observability of our software systems.

The Royal Botanical Expedition to New Granada

I am Colombian. People say many things about my country, such as we grow delicious coffee, or that we have beautiful landscapes. There is another awesome thing about my country, the Royal Botanical Expedition to New Granada. It took place between 1783 and 1816 in Colombia, Ecuador, Panama, Venezuela, Peru, and the north of Brazil. The expedition was led by Jose Celestino Mutis, a botanist, mathematician, and illustrator. Jose Celestino Mutis is recognized because, for 25 years, he documented the flora and fauna of New Granada using more than 20,000 drawings. Here's Jose Celestino Mutis. His illustrations are visual treasures of the flora and fauna of our country, and the best visualization of the Royal Botanical Expedition.

Probably many of you are asking why I am speaking about illustrations of plants and insects at a software conference. The answer is that humans are highly visual creatures. Probably it was one of the reasons for Jose Celestino Mutis to draw more than 5000 flowers and insects. According to research, half of the human brain is directly or indirectly devoted to processing visual information. In the brain, for example, neurons devoted to visual processing take up about 30%, as compared with 8% for touch, and just 3% for hearing. In this investigation, scientists have seen that at least 65% of people are visual learners. The results also show that presentations using visual aids were found to be 43% more persuasive than unaided presentations.


In our context, visualizations, charts, and graphics are super important. Here, you're seeing the timeline of an incident during and now touching to a software release. It was taken from the book "Incident Management Operations," a really good book. The first instructions from the command manager was to check the analytics dashboard, but the access to the dashboard was not working yet. Let me now review some definitions that we should have clear before trying to learn about observability and chaos engineering. Observability is being able to fully understand our systems. In control theory, for example, observability is defined as a measure of how well internal states of a system can be inferred from knowledge of its external outputs. For me, observability is about asking questions, providing answers, and building knowledge about our systems. Here, another important definition, for modern software systems, observability is not about mathematical equations, it is about how people interact with and try to understand the complex systems.

Observability is different from monitoring, and it is super important to understand why. According to the Google SRE book, monitoring is about collecting, processing, aggregating, and displaying real-time quantitative data about our system. There are many reasons to monitor a system. One is analyzing long-term trends, for example, monitoring how big my database is and how fast it is growing. Another, very common, is alerting: if something is broken, somebody should be notified to fix it. A third is building dashboards, as we saw in the incident timeline. Dashboards answer basic questions about our services, and they are our first tool to try to understand what is happening. We monitor our systems through the signals that they are sending. These signals are called metrics. A metric is a single number, with tags optionally appended for grouping and searching, such as query counts, error counts, processing times, and server lifetimes. According to Jason English, data visualization is a more general concept, because it involves designing and engineering a human-computer interface to allow better human cognition and analysis of metrics such as data streams and archived data. Finally, a dashboard is an application, usually web based, that provides a summary view of a service's core metrics. A dashboard may have filters and selectors, with the objective of exposing the metrics most important to the users.
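To make the definitions of a metric and a dashboard concrete, here is a minimal Python sketch. It is not taken from the talk: the `Metric` class, the `filter_by_tags` and `dashboard_summary` helpers, and the sample values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metric:
    """A single number with optional tags for grouping and searching."""
    name: str
    value: float
    tags: Dict[str, str] = field(default_factory=dict)


def filter_by_tags(metrics: List[Metric], **wanted: str) -> List[Metric]:
    """Act like a dashboard filter: keep metrics whose tags match all wanted pairs."""
    return [m for m in metrics
            if all(m.tags.get(key) == val for key, val in wanted.items())]


def dashboard_summary(metrics: List[Metric]) -> Dict[str, float]:
    """Build a summary view: total value per metric name."""
    summary: Dict[str, float] = {}
    for m in metrics:
        summary[m.name] = summary.get(m.name, 0.0) + m.value
    return summary


# Made-up sample data for a hypothetical "payments" service.
metrics = [
    Metric("query_count", 120, {"service": "payments", "region": "us"}),
    Metric("query_count", 80, {"service": "payments", "region": "eu"}),
    Metric("error_count", 3, {"service": "payments", "region": "us"}),
]

print(dashboard_summary(metrics))
print(dashboard_summary(filter_by_tags(metrics, region="us")))
```

The tags are what make filters and selectors possible: the same list of numbers can be summarized for the whole service or narrowed to one region.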

Since this talk is about graphics, dashboards, visualizations, and observability, I put those definitions in this sketch. Observability is being able to fully understand a system's health by monitoring and analyzing metrics. Monitoring is about collecting, processing, aggregating, and displaying real-time metrics of a system. A metric, an important term here, is a single number with tags optionally appended, such as query counts, processing times, and server lifetimes. Visualization involves designing and engineering a human-computer interface, for example a metric dashboard, to allow human cognition. A dashboard is an application that provides a summary view of a set of metrics about a system. Finally, I would like to introduce a new concept here, chaos. This talk is about observability, but it is also about chaos engineering. Chaos is a state of turbulence in a system whose consequences are unpredictable and random.

Relation Between Observability and Chaos Engineering

What is the relation between observability and chaos engineering? According to the website Principles of Chaos, which contains a manifesto for chaos engineers, chaos engineering is the discipline of experimenting on a system in order to be confident in the system's capability to face turbulent conditions in production. Chaos engineering and observability are closely connected. In my view, both concepts can be related using this expression: chaos engineering is the sum of chaos, observability, and resilience. To confidently execute a chaos experiment, observability must detect when the system is normal and how it deviates from that steady state as the experiment is executed. In this expression, there is an important concept: knowledge. Specifically, chaos plus observability gives us the parts for defining knowledge in this context. If we identify that something is not normal with our system, and we are able to determine how our system will respond to a chaotic situation, we could say that we know the system. Precisely, knowledge is the concept that connects these two concepts: chaos engineering and observability. Take a look at this definition for observability. Observability can be defined as the sum of metrics plus questions plus answers. Observability is about having tools for asking the proper questions and providing the correct answers. In this definition the concept of knowledge is present again, considering that if you know the answers to these questions, you know the system. Here is a summary of what I was trying to explain. Both concepts are complementary and they are bridged by an important concept, knowledge. In this sense, chaos engineering is leveraged by observability, since it allows us to detect a deviation from the steady state of a system. Observability is leveraged by chaos engineering, since it helps to discover and overcome the weaknesses of the system.
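The idea that observability must detect how the system deviates from its steady state during an experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not prescribe a detection rule, so the three-standard-deviations tolerance, the function names, and the latency samples below are all hypothetical.

```python
from statistics import mean, stdev
from typing import List, Tuple


def steady_state(baseline: List[float]) -> Tuple[float, float]:
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(baseline), stdev(baseline)


def deviates(sample: float, baseline: List[float], tolerance: float = 3.0) -> bool:
    """Flag a sample that lies more than `tolerance` standard deviations
    away from the baseline mean (an assumed rule, not one from the talk)."""
    mu, sigma = steady_state(baseline)
    return abs(sample - mu) > tolerance * sigma


# Made-up latency samples in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99, 100]   # before the experiment
during_experiment_ms = [101, 150, 240]                # while chaos is injected

flags = [deviates(x, baseline_latency_ms) for x in during_experiment_ms]
print(flags)
```

The baseline plays the role of the steady state; the flagged samples are the knowledge gained from combining chaos with observability.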


Let me focus on observability again. I would like to share this: observability feeds on the signals that a system emits, which provide the raw data about the system's behavior. Observability is limited by the signals and the quality of the signals that a system puts out. I am talking about the four golden signals: latency, saturation, traffic, and errors. Let me remind you of a short definition for each one, using these beautiful sketch notes from Denise Yu. Latency is defined as the time that it takes to service a request. It is a symptom of degraded performance in a system during an incident, for example. Traffic is a measure of how much demand is being placed on the system. Some examples include the number of HTTP requests, sessions, and errors. Errors are the rate of requests that failed, for example, HTTP 500 errors. Finally, saturation is about the utilization of a resource, for example, the utilization of the CPU or the memory.
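As a rough illustration of the four golden signals, the following Python sketch derives them from a handful of request records. The `Request` type, the `golden_signals` function, the choice of mean latency, and every number in the example are assumptions made for this sketch rather than anything shown in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to service the request
    status: int         # HTTP status code


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_capacity: float) -> Dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one time window."""
    n = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: mean time to service a request (a percentile is also common).
        "latency_ms": sum(r.duration_ms for r in requests) / n if n else 0.0,
        # Traffic: demand placed on the system, here requests per second.
        "traffic_rps": n / window_s,
        # Errors: rate of requests that failed, here HTTP 5xx responses.
        "error_rate": failed / n if n else 0.0,
        # Saturation: utilization of a resource, here CPU.
        "saturation": cpu_used / cpu_capacity,
    }


# Made-up sample: four requests observed in a two-second window.
sample = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
signals = golden_signals(sample, window_s=2.0, cpu_used=3.0, cpu_capacity=4.0)
print(signals)
```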

How are we seeing those signals? Here, I have a visualization of a set of dashboards in Google Cloud Platform that are showing the behavior of a system. Here I have a question for you. How many of you see chaos here? Chaos is the deviation from the normal state of a system. In my case, I see a problem in the dashboard QPS per region. Although the line chart is the most common visualization for these types of incidents, it is confusing because, according to the title, we are seeing a count of queries, but the y-axis maps time in seconds. It is important to mention that if we don't use the proper colors, layers, and variables in the axes, one of the simplest charts could become one of the most confusing. Another common chart is the bar chart. A bar chart is a graph that represents categorical data with rectangular bars, with heights or lengths proportional to the values that they represent. The challenge is the same. If we don't use the proper categories, the chart could be confusing. Considering those limitations, what about visualization? Are these the proper charts to visualize chaos? Do you remember that this talk is also about chaos engineering?

Visual Metaphors

I am going to introduce a new definition here, visual metaphor. Visual metaphors are mappings from concepts and objects of the simulated application domain to a system of similarities and analogies. A computer metaphor is considered the basic idea for simulation between interactive visual objects and model objects of the application domain. Some examples include maps, cities, and geometric figures. See this illustration here with a beautiful map. The city metaphor is a popular method of visualizing properties of program code.

Many projects have employed this metaphor to visualize properties of software repositories, for example. Existing research has used cities to visualize packages, classes, and size of tools, cyclomatic complexities. I am going to show you more details in the next slide. Here, for example, we have a city metaphor for showing the properties of our software systems. In this case, a city metaphor represents Java packages as neighborhoods, Java classes as buildings, and building dimensions as class properties, such as cyclomatic complexity.
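The mapping behind the city metaphor (packages as neighborhoods, classes as buildings, class properties as building dimensions) can be sketched as a small data transformation. In this hypothetical Python example, the choice of method count for width and cyclomatic complexity for height is an assumption for illustration, and the package and class names are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JavaClass:
    package: str
    name: str
    methods: int
    cyclomatic_complexity: int


@dataclass
class Building:
    name: str
    width: int   # assumed mapping: number of methods
    height: int  # assumed mapping: cyclomatic complexity


def build_city(classes: List[JavaClass]) -> Dict[str, List[Building]]:
    """Group classes into neighborhoods (one per package) of buildings."""
    city: Dict[str, List[Building]] = defaultdict(list)
    for c in classes:
        city[c.package].append(
            Building(c.name, width=c.methods, height=c.cyclomatic_complexity))
    return dict(city)


# Invented classes for a fictional code base.
classes = [
    JavaClass("com.clinic.auth", "LoginService", 12, 30),
    JavaClass("com.clinic.auth", "TokenStore", 5, 8),
    JavaClass("com.clinic.billing", "InvoiceBuilder", 9, 21),
]
city = build_city(classes)
tallest = max((b for blocks in city.values() for b in blocks),
              key=lambda b: b.height)
print(sorted(city), tallest.name)
```

A renderer would then draw each neighborhood as a block of buildings, so the tallest building immediately points at the most complex class.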


With the intention of identifying the perception of engineering teams involved in software operation activities, I applied a study consisting of specific questions about an incident, in which two visualizations were provided: one with a traditional chart, for example, line charts or bar plots, and another using visual metaphors. For each situation, the value of each type of visualization was analyzed. Twenty-eight engineers were surveyed regarding traditional dashboards and visual metaphors. Specifically, they were asked about an incident in four categories or metrics: errors, latency, traffic, and saturation, which were visualized using classical dashboards versus visual metaphors. The backgrounds of the participants were distributed among backend, frontend, and full stack engineers, software architects, data engineers, and site reliability engineers. Most participants came from backend development, as is illustrated here.

The first question was about the saturation signal. Basically, two dashboards were used here, a line chart and a city metaphor, for asking about the state of five microservices: microservice authentication, microservice patients, microservice payments, microservice medications, and microservice appointments. These microservices were part of a fictional healthcare system. Specifically, the question was, using this traditional line chart, which microservice was impacted? Here, each line represents the utilization of CPU by each microservice. For example, the orange line represents the utilization of the payments microservice. The correct answer for this question was microservice authentication. The chart is confusing, since it is not clear which lines and colors represent each microservice. Probably, this line chart was confusing for our participants, since the answers were distributed among several options, and just 55% selected the correct answer. Remember, the correct answer is microservice authentication. See this slice, orange in the pie. On the other side, it is curious that 11.1% of participants chose payments, a service that effectively had a high consumption, but in the previous day is not [inaudible 00:17:49].

I asked the same question but now using a visual city metaphor. I used a building to represent each microservice, for example, a pharmacy represents the medications microservice. I used silhouettes of people to map the level of saturation. The number of people is proportional to the utilization of the CPU. Finally, I used the red color to represent, in other words, if the saturation is higher than a value, the building is painted in orange. As you see, the visual metaphor was more useful than the traditional dashboard. All participants agreed that the microservice impacted by a high utilization of CPU was authentication. Although this did not manage to prove my hypothesis, it is a fact that colors, shapes, and sizes change the perception of the participants. In the open answers from some participants that you are seeing here, for example, the first one says that the city metaphor was very useful to see the current state of the CPU, although they note that the city metaphor didn't show the behavior over time.

About the other signals: for the second golden signal, a classical bar chart and a treemap were used to ask the participants to calculate the average of errors for each microservice, as is illustrated here. If you calculate the average, you can see that the correct answer was microservice appointments. Although participants didn't choose it, many changed their answer when they used the visual metaphor. This figure illustrates that I am stuck. Which was selected just by 38% of the participants. It is very curious that 88% of the participants think that the correct answer is authentication, just for having more nodes, but not necessarily having more errors.
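The saturation encoding used in the city metaphor (one building per microservice, a number of silhouettes proportional to CPU utilization, and a highlight color above a threshold) could be sketched as follows. This is a hypothetical text rendering, not the tool used in the study: the 80% threshold, the cap of ten silhouettes, and the CPU percentages are invented for illustration.

```python
from typing import Dict, List


def building_for(service: str, cpu_percent: int,
                 threshold_percent: int = 80, max_people: int = 10) -> Dict:
    """Map one microservice to a building in the city metaphor."""
    return {
        "service": service,
        # Silhouettes of people are proportional to CPU utilization.
        "people": cpu_percent * max_people // 100,
        # Above the (assumed) threshold the building gets the alert color.
        "alert": cpu_percent > threshold_percent,
    }


def render(building: Dict) -> str:
    """Draw a building as one text row: '!' marks the alert color."""
    marker = "!" if building["alert"] else " "
    return f'{marker} {building["service"]:<15}{"o" * building["people"]}'


# Invented CPU utilization for the five microservices of the fictional system.
cpu_by_service = {
    "authentication": 92,
    "patients": 35,
    "payments": 60,
    "medications": 20,
    "appointments": 41,
}
buildings: List[Dict] = [building_for(s, cpu) for s, cpu in cpu_by_service.items()]
alerts = [b["service"] for b in buildings if b["alert"]]

for b in buildings:
    print(render(b))
```

Because both the crowd size and the highlight point at the same building, the saturated service stands out without the reader having to match line colors to a legend.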

With a treemap, the distribution percentage changed, but the majority continued thinking that the correct answer is authentication. Here is a summary visualization of the distribution of answers for the traditional dashboard, for the visual metaphor, and the correct answer. It is interesting because it allows us to conclude that visual metaphors are not a guarantee that we are interpreting those data in a proper way. As you see, in the second case, using a visual metaphor, just 32.1% chose the correct answer.

Regarding traffic signals, a classical bar chart and a geometric metaphor were used for asking the participants which third-party service receives more traffic. In this case, the interaction between the original microservices and the new third-party services (service LDAP, service government, service assurance, and service authentication) was analyzed. They are external or third-party services that interact with our microservices. This figure shows this integration using a bar plot and a geometric metaphor. In the metaphor, the circles represent the services and microservices, and the lines connecting them represent the relations among them. In spite of having lines and sizes for representing the connections and the traffic load among these microservices and third-party services, the metaphor was confusing for the participants, as you see here. It is possible that the size of the circle could be associated with at least percentage of service LDAP. That is the correct answer, which is represented by the green portion in the pie. Finally, most people answered that the metaphors were more useful, as is illustrated here. As you see, the majority chose visual metaphors in order to get better results.
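The geometric metaphor (circles for services, lines for relations, size for traffic load) is essentially a weighted graph. The following Python sketch shows one way to derive circle sizes from traffic. The edge list and request counts are made up, and scaling circle area (rather than radius) with traffic is a design assumption of this sketch, not something stated in the talk.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each line in the metaphor: (microservice, third-party service, requests).
# All numbers are invented for illustration.
edges: List[Tuple[str, str, int]] = [
    ("authentication", "service LDAP", 500),
    ("patients", "service government", 120),
    ("payments", "service assurance", 200),
    ("appointments", "service LDAP", 150),
]


def traffic_per_service(lines: List[Tuple[str, str, int]]) -> Counter:
    """Total traffic arriving at each third-party service."""
    totals: Counter = Counter()
    for _source, target, requests in lines:
        totals[target] += requests
    return totals


def circle_radii(totals: Counter, max_radius: float = 50.0) -> Dict[str, float]:
    """Scale circle area with traffic, so radius grows with the square root."""
    peak = max(totals.values())
    return {service: max_radius * (load / peak) ** 0.5
            for service, load in totals.items()}


totals = traffic_per_service(edges)
radii = circle_radii(totals)
busiest = totals.most_common(1)[0][0]
print(busiest, radii)
```

Scaling by area keeps a circle with twice the traffic from looking four times as large, which is one way size alone can mislead readers of such a chart.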

Key Takeaway Points

For modern software systems, observability is not about mathematical equations. It is about how people interact with and try to understand complex systems. A second important point here is that, since chaos engineering and observability involve humans and their individual interpretations, designers of dashboards can bias those interpretations. In this sense, visual metaphors are not a guarantee that we are interpreting this data in a proper way. Finally, it is important to keep in mind that observability feeds on the signals that a system emits, which provide the raw data about the behavior of the system.

Questions and Answers

Bangser: It'd be really interesting to hear from you about why you decided to explore the strategy around different visual metaphors when visualizing incidents. How'd you get started with this?

Roa: Regarding your question, why I decided to explore this strategy, because in my experience, chaos engineering, observability, and visualizations in performance, as I have mentioned in my presentation. The individual in presentations, it is a fact that the designers of dashboards can bias those interpretations. That is my main motivation for this study. The bias is the main topic here. Since classical dashboards can lead to bias, I was wondering that if we have an alternative option to explore our dashboards, it could be highly valuable for our operator systems, for engineers, cloud engineers. Out of that, I thought that dashboards based on visual metaphors can provide more useful data than classical visualizations. However, after the study that I shared, I discovered that both strategies have the same worries, because for example, when I was showing the third study related to the geometric metaphor, the participants were confused with the metaphor. For me, the main motivation is related to bias. With this study, while I was preparing this presentation, I discovered that both the strategies and any strategies could be biased, because we are interacting with humans. That is really challenging for dashboard designers.

Bangser: I've heard a saying before that there's lying and then there's statistics. The joke behind that saying is that, depending on what frame and what lens you put on statistics, it can really show the bias of what you want people to see. What are the things you're showing, and so exploring, what has been traditional about our visualizations, and what that turns into bias is important, because we may be not as aware of the biases we're building in, because this is just how it's always been done. This is really important.

Do you think that the perception of understanding could be evaluated to improve the visual metaphor dashboards?

Roa: Yes. I agree, because that is an input for us. That is an input for us as designers, the perception of the human. Probably we can provide more strategies and more metaphors that cover more perceptions. That is a fact, we have a limitation here that is the interpretations, experience, and backgrounds of the readers of dashboards, but effectively, I think the perception, understanding. We have some frameworks in the literature, in order to analyze this perception in order to get the best input for designing more strategies. Because at this moment, we are limited to line charts and bar charts, that is the charts available in the cloud providers, for example. Although some tools specialize in observability and monitoring, have more strategies to monitor our systems. That is a fact. We have a lot of possibilities. At this moment, we have few strategies for monitoring, but we have a universe of metaphors. Although some of them can be related to the business as my study related to healthcare. I used buildings related to healthcare, hospital, pharmacies, medication buildings. That is a great opportunity to create many tools and share our thoughts about this topic.

Bangser: So often, it's that cat and mouse game of the tools exist, so people start using them more, and then the tools are encouraged to become broader and affect more things. It's hard to get started without those tools. Do you have any suggestions of tools to create the visualizations that you did? Specifically, that city visualization, how would you suggest other people get started with that?

Roa: For my study, I designed the metaphors for this case. I used common tools to design and to provide the treemaps. There are some tools in the market, for example, for the city, there is a tool whose visualization is 3D. I am going to share the link in Slack, because as a result of a paper that I published in the past, I created a tool that provides some visualizations. These visualizations are focused on visualizing software, on visualizing the characteristics of the software. Specifically for monitoring, I don't know of tools in the market. I am going to share with you some tools that could be extended or used here. I have to recognize that for this study, I drew the circles and lines in order to test my perception about this topic.

Bangser: How important is color when you're looking at these visual metaphors. I would maybe extend this to ask as well about how you deal with accessibility when color is a big part of what you were trying to show at times with red versus green, and things like that?

Roa: The color is really important. The color is very important here, because, for example, in my metaphor with the hospital, I use the red color. When you see a red part or a red section in your dashboard, that takes your attention immediately, because we are familiar with these colors. Red represents fire, represents an alert. Green represents that all is well. It is really important to use the proper colors. For example, the third metaphor, where I was using the geometric metaphor, is really curious for me, because I was expecting that these metaphors could be more valuable for our participants. It was confusing, because I used the same colors, I used blue and gray. I didn't use, for example, the red or green colors. I tried to use size and shapes in this case, and it was confusing for our participants. I think the color is really important, and it is really important to use it in a way that is familiar for humans, because in our understanding, in our experience, the red color, for example, represents alerts. I should take advantage of that.

Bangser: It makes sense that for a large part of the population that is the first thing we look at. I am privileged in that I do not have any colorblindness. I look at red. I grew up in a culture where red means stop, or bad, or error, and so that sits well for me. How do you think the industry can take on board making that geometric shape and size that you tried to use in the last example, more common for people so that it therefore makes it more accessible and less dependent on color, which is something which may or may not work for everybody due to colorblindness and other aspects.

Roa: That is a fact also considering the accessibility headings. I am thinking at this moment, that is a great opportunity to run another experiment, because probably we are ignoring there are some other persons with this challenge to access our tools. We need to consider the standards for accessibility for them. Precisely, I was reading a study published by InfoQ related to this topic, with 10 guidelines to build applications that are accessible for our users. I think we can design an experiment for this, but I didn't consider this topic in the study. That is a fact. I think that it's really interesting to explore these considerations also.

Bangser: Yes, there's just so many angles. You have to try and tackle a lot of them. One side we definitely want to include is accessibility, and there's so many others, though, that you were able to get insight into during the study, which was really interesting.

I realized this was a great question that came in around the animation, because everything that you showed us was static, even the arrows that had motion in them in the sense that they were pointing in a direction, they were just stationary arrows. Have you thought about adding animation or movement to your visualizations?

Roa: Yes, it could be great to have this opportunity, because we could have more variables for showing the situation or the state of a system. In this case, it is important to consider that if we have a lot of variables and a lot of things in the same dashboard, it could be confusing also. Because for example, when you have movement, you can distract with these movements in the dashboard. It is a really good idea for a dashboard, but we need to consider there is a risk considering that, for example, if you have a dashboard with fast movement in the dashboard, and you have the proper dashboard in this section of your page with the static situation, probably you can distract with the other dashboard. It is important to consider that. Yes, so the lack of movement. That is an interesting discussion for this topic, because the movement can provide a lot more information for us, and it could be highly valuable for our readers. In the same case, I need to run experiments, and I need to go to the users in order to understand them. I think probably a user experience expert could be valuable or could be very useful for us here. I think we need to explore all options, in order to provide the proper visualization for our users. That is a great opportunity for industry and academia. That is important for academia considering that we have, for example, people studying these topics related to visualization, related to human factors, related to accessibility, it is a great topic for a PhD thesis, for example. There is a great opportunity to explore these topics in academia also.

Bangser: You mentioned there a really interesting lever to pull on, which is the number of criteria that you can use when you add color or shape or motion. These are all ways in which you can describe different attributes. As you add more attributes, it can get more confusing.

I was just curious, you seemed almost a bit surprised by some of the study results, the things that were confusing for people, and you didn't get the results you were expecting. What do you think maybe caused some of that surprise, or those unexpected results?

Roa: Probably the third study, related to traffic signals, with a classical bar chart and a geometric metaphor, because that was really surprising for me, considering that I was expecting that the circles and lines could provide more information for the users. The reality was another reality. The traffic signals were the surprising ones for me. In the metaphor, the circles represent the services and microservices, and the lines represent the relations with the third-party services. In spite of having lines and sizes, I think my problem with this metaphor was related to the color, because I didn't use the proper colors here. In the other case, for this incident, this chaos, probably the line charts and their simplicity could be more useful for the attendees. In conclusion, I think the main cause is related to humans, is related to our perceptions of the systems, because each of us is a unique universe with different experience, with different backgrounds. The main root cause for this confusion was related to the background, for example. When I explored the answers in detail, I found that the backend software engineers share similar perceptions, and the frontend engineers share similar perceptions, which are different, for example, from those of the cloud infrastructure engineers or persons who work on operations topics in an engineering team. I think the background and experience are the main cause. That is the challenge for designers of these types of dashboards.

Bangser: That is a challenge. It makes sense, though, that what you're used to seeing every day you make assumptions, or you start to read in. I remember when, for example, the three bars meaning like open up a sidebar in an app, became new, but now it's become something that people are aware of, and that can start to build a repertoire. When you're dealing with such a broad base, backend, frontend, operations, all of that, that can be really hard.

Do you think that these new visual metaphors are something that can be brought into the industry in the future, despite all these challenges around different backgrounds, and all those kinds of things?

Roa: Yes, and I hope that it will be useful for them. I hope that, in the future, we have the possibility to interact with our cloud using metaphors. I think that it could be great for us. It is valuable considering the open answers from our participants. I think the visualization of chaos, and specifically of incidents in production, presents several challenges for industry and academia. I would like to open this gate and this universe of metaphors for our industry providers. For example, some cloud providers work with treemaps and heat maps, in order to provide more strategies for our system operators. They are working on that. At this moment, we don't have the possibility, for example, to use a city metaphor or a geometric metaphor, because those metaphors are related to the business, to the organization, and to a specific business topic. I think that, for example, we could provide tools for building these metaphors, tools that allow us to draw or to design our dashboards in order to connect our business, our concerns, and our priorities with our dashboards. If we have the possibility to design the dashboards in our cloud provider, it could be great. It could generate value for our system operators. I think that that is challenging, but there is an open gate for creating things as far as our imagination allows.

Bangser: As you say, we have to bring industry and academia together to solve these problems. What's really exciting is if the cloud providers do start working in this space, they operate at such a scale that we can start to really get feedback into academia and start to actually run studies at scale and get feedback on that. That'd be a very exciting opportunity for the industry.




Recorded at:

Feb 24, 2023