
How Do We Utilize Chaos Engineering to Become Better Cloud-Native Engineers?

Key Takeaways

  • The evolution of cloud-native technologies and new architectural approaches brings great benefits to the companies that adopt them, but it also becomes challenging as teams and systems scale.
  • Cloud-native engineers are closer to the product and the customer's needs.
  • Being a cloud-native engineer means it's not enough to know the programming language you work in well; you must also know the platform and cloud-native technologies you rely on.
  • Chaos Engineering is an awesome method for training engineers on cloud-native principles and boosting their confidence when responding to production failures.
  • Investing in engineering training, such as the "On-call like a king" workshop that we invented, enhances your engineering culture and creates a unique learning environment.

The evolution of cloud-native technologies and the need to scale engineering have led organizations to restructure their teams and embrace new architectural approaches, such as microservices. These changes enable teams to take end-to-end ownership of their deliveries and increase their velocity.

As a result of this evolution, engineers today are closer to the product and the customer's needs. There is still a long way to go, though; companies are still struggling with how to bring engineers closer to their customers so that they understand their business impact in depth: what do they solve, how do they influence the customer, and what is their impact on the product? There is a shift in the engineering mindset: we ship products, not just code!

With great power comes great responsibility

We embrace this transition, which brings many benefits to the companies adopting it. On the other hand, as the team and system scale, it becomes more challenging to write new features that solve a particular business problem, and understanding service behavior becomes much more complex.

When talking about the challenges and the transition to Microservices, I usually like to refer to this great talk: “Journey from Monolith to Microservices & DevOps” by Aviran Mordo (Wix) given at the GOTO 2016 conference.

Such advanced approaches bring great value, but as engineers we are now writing apps that are part of a wider collection of services built on a particular platform in the cloud. Ben Sigelman calls these "deep systems" in his recent posts and talks, and since images are better than words, this one explains it all:

[Image: "deep systems" diagram. Source]

As part of transitioning to being more cloud native and distributed, and relying on orchestrators (such as Kubernetes) at your foundation, engineers face more and more challenges that they didn't have to deal with before. Just one example: when you are on call for an incident and have to identify the root cause quickly, or at least recover fast, this usually requires a different set of expertise (e.g., 33% of your deployment's pods cannot be rescheduled due to a lack of node availability in your cluster).
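To make that example concrete, here is roughly how such a failure surfaces: the affected pods stay in the Pending phase, and the scheduler records why in the pod status (visible via kubectl describe pod or kubectl get pod -o yaml). The snippet below is an illustrative sketch, not output from our system; the node count and reason are hypothetical.

```yaml
# Illustrative excerpt of a Pending pod's status when the cluster is out of capacity.
# The counts and reason shown here are hypothetical.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/6 nodes are available: 6 Insufficient cpu."
```

Recognizing that this is a capacity and scheduling problem, rather than a bug in the application code, is exactly the kind of platform expertise referred to here.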

The engineer evolution at a glance

Being a cloud-native engineer is fun! But it is also challenging. These days engineers aren't just writing code and building packages; they are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, containerize their app, and ship it to a variety of environments. It isn't enough to know these things at a high level. Being a cloud-native engineer means you should keep adapting your knowledge and understanding of the cloud-native technologies you depend on.

Besides the toolbox you are using, building cloud-native applications involves taking into account many moving parts, such as the platform you are building on, the database you are using, and more. Obviously, there are great tools and frameworks out there that abstract some of this complexity away from you as an engineer, but being blind to it might hurt you some day (or night). If you haven't heard of the "Fallacies of distributed computing," I really suggest you read up on them. They are here to stay; you should be aware of them and be prepared.
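As a minimal illustration of what "writing the relevant Kubernetes resource YAMLs" means in practice, here is a sketch of a Deployment manifest. The service name, image, port, and probe path are hypothetical placeholders rather than anything from our system.

```yaml
# A minimal Deployment sketch; the name, image, port, and probe path are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service
  labels:
    app: orders-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
    spec:
      containers:
        - name: orders-service
          image: registry.example.com/orders-service:1.4.2
          ports:
            - containerPort: 8080
          # Requests and limits drive scheduling decisions, which is why capacity
          # issues like the "Insufficient cpu" example above matter to engineers.
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```

Every field here has operational consequences: replicas and resources affect scheduling and cost, and the readiness probe determines when traffic reaches a pod. Keeping that understanding up to date is part of the job.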

What did we do to cope with these challenges?

We utilized Chaos Engineering for that purpose! We created a series of workshops called "On-call like a king." We have found this method pretty useful, and I think it is worth sharing our practices.

The main goal of Chaos Engineering is captured in the Principles of Chaos Engineering: "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

The idea of Chaos Engineering is to identify weaknesses and reduce uncertainty when building distributed systems. As I mentioned above, building distributed systems at scale is challenging, and since such systems tend to be composed of many moving parts, leveraging Chaos Engineering practices to reduce the blast radius of failures has proved itself a great method for that purpose.

We leverage Chaos Engineering principles to achieve other things beyond this main objective. The "On-call like a king" workshops aim to achieve two goals in parallel: (1) train engineers on production failures that we have had recently; and (2) train engineers on cloud-native practices and tooling, and on how to become better cloud-native engineers!

How are the workshop sessions composed?

  1. The session starts with a quick introduction of the motivation: why we are having this session and what we are going to do this time, making sure the audience is aligned on the flow.

[Image: workshop slide]

  2. Sometimes we utilize the session as a great opportunity to communicate recent architecture, platform, or process changes, such as updates to the on-call process or adaptations to core service flows.

  3. We work on two production incident simulations, and the overall session shouldn't be longer than 60 minutes; we have found that we lose engineers' concentration in longer sessions. If you work hybrid, it is better to run these sessions when you are in the same workspace, as we have found that to be more productive.

Before we dive into one of the sessions, let me share with you how we do on-call.

We have weekly engineering shifts and a NOC team that monitors our system 24/7. There are three alert severities defined: SEV1, SEV2, and SEV3 (ranging from urgent to monitor-only). In the case of a SEV1, the first priority is to get the system back to a normal state. The on-call engineer leads the incident, understands the high-level business impact so they can communicate it, and, when specific expertise is needed to bring the system back to a functional state, makes sure the relevant team or service owner is at their keyboard to lead it.
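The article does not describe our alerting stack, but to illustrate how such severities can be attached to alerts, here is a hedged sketch of a Prometheus-style alerting rule that labels an alert as SEV1. The metric names, service label, and threshold are hypothetical.

```yaml
# Hypothetical Prometheus-style alerting rule; metric names, labels, and the
# threshold are illustrative only.
groups:
  - name: checkout-availability
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
        for: 5m
        labels:
          severity: SEV1   # urgent: pages the on-call engineer immediately
        annotations:
          summary: "Checkout error rate has been above 5% for 5 minutes"
```

Routing by the severity label, where a SEV1 pages immediately and a SEV3 only creates something to monitor, is one way to implement the urgent-to-monitor scale described above.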

Our "On-call like a king" workshop sessions try to be as close to real life as possible by simulating real production scenarios in one of our environments. Such real-life scenarios enable engineers to build confidence in handling a real production incident. Since we utilize Chaos Engineering here, I suggest executing a real experiment; we use one of our load-test environments for that purpose. We use LitmusChaos to run these chaos experiments, but you can use any other tool you like, or you can simply simulate the incident manually. We started manually, so don't rush to adopt a specific chaos engineering tool. You will quickly see that when engineers are practicing, rather than just listening to someone explain, the session becomes very productive.
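As a rough sketch of what such an experiment can look like with LitmusChaos, the ChaosEngine below runs a pod-delete experiment against a target deployment in a load-test namespace. The namespace, labels, and service account names are hypothetical, and the exact fields depend on the LitmusChaos version you run, so treat this as an outline rather than a copy-paste manifest.

```yaml
# Sketch of a LitmusChaos pod-delete experiment; the namespace, labels, and
# service account are hypothetical, and fields vary between Litmus versions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: orders-service-chaos
  namespace: load-test
spec:
  engineState: active
  appinfo:
    appns: load-test
    applabel: "app=orders-service"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the experiment runs (seconds)
              value: "120"
            - name: CHAOS_INTERVAL         # pause between successive pod deletions (seconds)
              value: "15"
            - name: PODS_AFFECTED_PERC     # kill roughly a third of the pods at a time
              value: "33"
```

Whether you use a tool like this or delete pods by hand, the point is the same: the engineers in the workshop see a realistic failure unfolding and have to investigate it with the tools they would use on call.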

Right after the introduction slides, the session continues with a slide explaining the incident that we are going to simulate. We usually give some background on what is going to happen, present some metrics of the current behavior, and show an alert that has just been triggered:

[Image: workshop slide]

Then we give engineers some time to review the incident by themselves. We pause their analysis from time to time and encourage them to ask questions. We have found that the discussions about the incident are a great place for knowledge sharing.

If you are sitting together in the same space, it can be pretty nice because you can see who is doing what, and then you can ask them to show which tools they use and how they got there. 

What I really like about these sessions is that they trigger conversations, and engineers ask each other to share the CLI commands or tools that make their lives easier while debugging an incident.

Drive the conversations by asking questions that let you cover the topics you want to train on: ask one engineer to present the metrics dashboard to look at, ask another to share their logging queries, and ask a third to present their tracing and show how to find the relevant trace.

You sometimes need to moderate the conversation a bit, as time flies pretty fast and you need to bring back the focus.

During the discussion, point out interesting architectural aspects that you would like the engineers to know about. Encourage engineers to speak up by asking questions about these areas of interest, enabling them to suggest new design approaches or to highlight challenges they have been thinking about lately and add them to the technical debt backlog.

At the end of every challenge, ask somebody to present their end-to-end analysis. It makes things clearer for people who might not feel comfortable asking questions in such large forums, engineers who have just been onboarded to the team, or junior engineers who want to learn more.

Make sure you record the whole meeting and share the meeting notes right after the session. They are a great way for people to be reminded of what was done and also a fantastic source of knowledge for your onboarding training process.

We have found that these sessions are an awesome playground for engineers. I must admit that I didn't think about using Chaos Engineering for these simulations at first. We started by manually simulating our incidents, or simply presenting some of the evidence we had gathered at the time of the failure to drive conversations about it. As we moved forward, we leveraged chaos tools for that purpose. Besides being trained to become better cloud-native engineers, the on-call engineers feel more comfortable in their shifts and understand the tools available to them to respond quickly.

I thought this would be good to share because we always talk about Chaos Engineering experiments as a way to build more reliable systems, but you can also leverage them to invest in your engineering teams' training.

Good luck!
