
How to Use Chaos Engineering to Break Things Productively

Sep 02, 2019 15 min read

Key Takeaways

The benefits of chaos engineering. This process makes for more comprehensive and effective vulnerability testing, which ultimately benefits the customer.
Is there a downside? Not really, unless you prefer to discover system weaknesses on the fly in the real world, which is not recommended.
Overview of chaos engineering. A quick breakdown of how chaos engineering is the best way to stress a system and the four steps to implement it.
1. Defining a measurable steady state that represents normal circumstances to use as a baseline.
2. Developing a hypothesis that this state will continue in both control and challenge groups.
3. Introducing realistic network stressors into a challenge group, such as server crashes, hardware malfunctions, and severed connections.
4. Attempting to invalidate the hypothesis by noting differences in behavior between control and challenge groups after chaos is introduced.

Repeat the process through automation. Automated tools like Metasploit, Nmap, and VPNs allow you to change variables and expand testing exponentially.
Chaos engineering best practices. Understand when to use manual testing, how to apply containerized chaos testing, as well as how to confine the blast radius.
Testing tools and resources. As chaos testing gains popularity, expect an already impressive selection of tools and resources to increase.

More people connected to more servers, increased reliance on complex distributed networks, and a proliferation of apps in development mean more opportunities for data leaks and breaches.

Modern problems require modern solutions, as Amazon found out the hard way. Netflix escaped with minor inconvenience by being prepared.

What did they do differently?

Amazon Web Services (AWS), Amazon's cloud-based platform, experienced an outage on September 20, 2015, that crashed their servers for several hours and affected many vendors. Netflix experienced the issue as a blip because its engineers had been there and done that when the company changed its service delivery model. That experience led the engineering team to craft a unique solution for software production testing.

The solution? Chaos as a preventative for calamity. It's predicated on the idea of failure as the rule rather than the exception, and it led to the development of the first dedicated chaos engineering tools. Known collectively as the Simian Army, they include Chaos Monkey, Chaos Kong, and the newest member of the family, Chaos Automation Platform (ChAP).

What Are the Benefits of Chaos Engineering in DevOps?

Focusing only on a network environment and the associated security considerations (because the world of chaos engineering is quite large), we have already seen it as a positive force in an already strong cybersecurity market for improving business risk mitigation, fostering customer confidence, and reducing the workload for IT teams. If you're a business owner, you'll be blessed with happier engineers, reduced risk of revenue loss, and lower maintenance costs.

Customers, whether B2B or B2C, will enjoy greater service availability that's more reliable and less prone to disruptions. Tech teams will be able to reduce failure incidents and gain deeper insight into how their apps work. It will also lead to better design, a faster mean time to respond to SEVs (high-severity incidents), and fewer repeat incidents.

Is There a Downside?

Critics feel that chaos engineering is just another industry buzzword or cover up for apps that were poorly designed in the first place. Some chaos engineering proponents opine that this is the result of an ego-driven mentality. If you're confident in your capabilities and work product, there should be nothing to fear in testing their limits.

Chaos engineering is meant to eliminate the eight fallacies of distributed computing that plague many developers and software engineers who are new to distributed networks, while providing a system for more refined testing.

These incorrect assumptions are that:

Networks are reliable
Latency is zero
Bandwidth is infinite
Networks are secure
Topology never changes
Each system has only one administrator, who never changes
Transport cost is zero
Networks are homogeneous

A quick look at internet usage statistics around the world demonstrates the need for a focus on innovative network testing at all phases of software development. Achieving that means taking a non-traditional approach to DevOps.

Overview of Chaos Engineering and Use Cases

Cloud-based, distributed networks enable a level of scalability that was previously unseen. Because these networks are more complex and have built-in uncertainty by the nature of how they function, it's essential for software engineers to utilize an empirical approach to testing for vulnerabilities that's systematic and innovative.

This can be achieved through controlled experimentation that creates chaos in an effort to determine how much stress any given system can withstand. The goal is to observe and identify systemic weaknesses. According to principlesofchaos.org, this experiment should follow a four-step process that involves:

1. Defining a measurable steady state that represents normal circumstances to use as a baseline.
2. Developing a hypothesis that this state will continue in both control and challenge groups.
3. Introducing realistic network stressors into a challenge group, such as server crashes, hardware malfunctions, and severed connections.
4. Attempting to invalidate the hypothesis by noting differences in behavior between control and challenge groups after chaos is introduced.

The wisdom behind this process proposes that the more difficult it is to disrupt the steady state of the challenge group, the greater confidence developers can have in the strength and integrity of their applications. Vulnerabilities uncovered provide a starting point for improvement before deployment.
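The four steps can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production harness: `simulated_checkout`, the failure rates, and the tolerance are all assumptions made up for the example, and the "service" is a random number generator standing in for real traffic.

```python
import random


def simulated_checkout(rng, failure_rate):
    """Hypothetical stand-in for one real request; True means success."""
    return rng.random() >= failure_rate


def success_rate(rng, failure_rate, requests=1000):
    """Fraction of simulated requests that succeed."""
    results = [simulated_checkout(rng, failure_rate) for _ in range(requests)]
    return sum(results) / len(results)


def run_experiment(seed=42, injected_failure_rate=0.2, tolerance=0.02):
    rng = random.Random(seed)  # seeded so the run is repeatable

    # Step 1: measure the steady state under normal conditions.
    baseline = success_rate(rng, failure_rate=0.01)

    # Step 2: the hypothesis is that control and challenge groups stay
    # within `tolerance` of each other.

    # Step 3: leave the control group alone, stress the challenge group.
    control = success_rate(rng, failure_rate=0.01)
    challenge = success_rate(rng, failure_rate=0.01 + injected_failure_rate)

    # Step 4: try to invalidate the hypothesis by comparing the groups.
    deviation = abs(control - challenge)
    return {
        "baseline": baseline,
        "control": control,
        "challenge": challenge,
        "hypothesis_holds": deviation <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment())
```

In a real experiment the two groups would be slices of live or replayed traffic, and the injected failure would be an actual stressor rather than a probability.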

Now, let's take a deeper dive into how introducing controlled chaos can change the way software is engineered.

Establish and Define Your Steady State and Metrics

Your steady state is the normal, expected behavior of an app with no apparent flaws or vulnerabilities. This can vary depending on the purpose of the app and use case, so it's important to determine your steady state at the outset of the experiment, define the metrics that will be used to evaluate performance, and decide on acceptable outcomes.

For example, online shoppers would be expected to fill their carts and proceed through the checkout process without incident. Metrics used to evaluate the outcome might include Key Performance Indicators (KPIs) like shopping cart abandonment rates (which are often disconcertingly high to retailers) or latency and service level agreement (SLA) guarantees like uptime percentages. You can gather this type of data by evaluating output gathered over a specified time period that would represent typical performance at the steady state.
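As a rough sketch of how such a baseline could be computed, the snippet below summarizes a window of observations into a 95th-percentile latency and an availability percentage. The sample numbers and the function names are invented for illustration; real data would come from your monitoring system.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def steady_state(latencies_ms, outcomes):
    """Summarize a measurement window into baseline metrics.

    latencies_ms: response times observed during the window
    outcomes:     True for each successful request, False for each failure
    """
    return {
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "availability_pct": 100 * sum(outcomes) / len(outcomes),
    }


# Made-up sample window: ten requests, one slow and one failed.
latencies = [120, 130, 110, 140, 500, 125, 135, 115, 128, 132]
outcomes = [True] * 9 + [False]

if __name__ == "__main__":
    print(steady_state(latencies, outcomes))
```

Percentiles tend to describe a steady state better than averages because a single slow request, like the 500 ms one here, can skew a mean without reflecting typical behavior.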

Develop a Hypothesis

You should make an educated guess about the expected outcome using datasets and other information. The data should support your hypothesis and rely on measurable output rather than internal system attributes. This forces software engineers to focus on verifying that the system works rather than delving into how or why it works, which would miss the point of the experiment.

Using the shopping cart example, engineers might use data from the checkout process working at optimal levels and hypothesize that an app will allow the customer to load their cart from the product page, proceed to the checkout page, tabulate the order, estimate a delivery date, and produce an invoice without glitches, redirects, or other errors.
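One way to make a hypothesis like this testable is to write it down as a measurable threshold. The sketch below assumes a hypothetical list of checkout step names and treats a session as successful only if it completed every step in order; the names and the 70% threshold are illustrative, not taken from any real system.

```python
from dataclasses import dataclass

# Hypothetical step names for the checkout flow described above.
CHECKOUT_STEPS = [
    "load_cart",
    "checkout_page",
    "tabulate_order",
    "estimate_delivery",
    "produce_invoice",
]


@dataclass
class Hypothesis:
    metric: str
    minimum: float

    def holds(self, observed: float) -> bool:
        """The hypothesis survives if the observed value meets the minimum."""
        return observed >= self.minimum


def completion_rate(sessions):
    """Fraction of sessions that completed every checkout step in order."""
    completed = sum(1 for steps in sessions if steps == CHECKOUT_STEPS)
    return completed / len(sessions)


if __name__ == "__main__":
    sessions = [
        CHECKOUT_STEPS,
        CHECKOUT_STEPS[:3],  # shopper dropped out after tabulating the order
        CHECKOUT_STEPS,
        CHECKOUT_STEPS,
    ]
    hypothesis = Hypothesis(metric="checkout_completion", minimum=0.7)
    print(hypothesis.holds(completion_rate(sessions)))
```

Note that the check looks only at observable output (which steps finished), not at how the system produced it.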

Force Chaos to Ensue

There are a number of variables that can be simulated or otherwise introduced into the process. These should reflect actual issues that might occur when an app is in use, and they should be prioritized by likelihood of occurrence. Problems that can be introduced include hardware-related issues like malfunctions or a server crash, as well as process errors related to sudden traffic spikes or sudden growth.

For example, what might happen during the whole online shopping experience if a seasonal sale results in a larger than expected customer response? You can also simulate the effects of your server being the target of a DDoS attack that's designed to crash your network. Any event that would disrupt the steady state is a candidate for experimentation.
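At the application level, stressors like these can be approximated with a small wrapper that adds latency or raises an error before a call goes through. The class below is a hedged sketch: `FaultInjector` and `price_cart` are hypothetical names, and real chaos tools usually inject faults at the network or infrastructure layer rather than inside the process.

```python
import random
import time


class FaultInjector:
    """Wraps calls and injects latency or connection errors at a set rate."""

    def __init__(self, error_rate=0.0, extra_latency_s=0.0, seed=None):
        self.error_rate = error_rate
        self.extra_latency_s = extra_latency_s
        self._rng = random.Random(seed)  # seed makes a run reproducible

    def call(self, func, *args, **kwargs):
        if self.extra_latency_s:
            time.sleep(self.extra_latency_s)  # simulate a slow network hop
        if self._rng.random() < self.error_rate:
            raise ConnectionError("injected fault: severed connection")
        return func(*args, **kwargs)


def price_cart(items):
    """Hypothetical piece of checkout logic used as the call under test."""
    return sum(items)


if __name__ == "__main__":
    chaos = FaultInjector(error_rate=0.3, extra_latency_s=0.01, seed=7)
    for attempt in range(5):
        try:
            print(attempt, chaos.call(price_cart, [10, 20, 30]))
        except ConnectionError as exc:
            print(attempt, exc)
```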

Results vs. Hypothesis

Compare your results to the original hypothesis. Did the system perform as anticipated, exceed expectations, or produce worse results? This evaluation shouldn't be undertaken in a vacuum; it should include input from the team members and services that were involved in conducting the experiment. Gather all related data and take the time to reflect on what it means, where the system functioned properly, and which areas need improvement or a complete overhaul.

Repeat the Process Through Automation

Again, keep in mind that we're focusing on network security for this article, and any particular software or tools are discussed from that narrower perspective.

There are several ways that you can expand the testing in order to increase your knowledge and find potential solutions. Once you've resolved one area of concern, reset the testing criteria or parameters and run the experiment again with a new hypothesis. You can also expand the blast radius by increments with each test, introducing new or more powerful stressors into the testing environment in order to gauge the limits of your system.

The idea is to introduce as much controlled chaos as possible, one element at a time, in order to determine the maximum limits of your system before it breaks down completely. This can be done by introducing automation after the initial test. That way, you can easily run additional tests with new factors or predictions, scale the testing, or redefine parameters.
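An automated escalation loop might look like the sketch below, which raises a single stressor in fixed increments until the system stops meeting expectations. `toy_system` and its capacity of 750 requests per second are invented purely so the example can run; in practice the callable would trigger a real test run and evaluate its metrics.

```python
def find_breaking_point(system_under_test, start=100, step=100, limit=2000):
    """Increase load in steps until the system fails or the limit is hit.

    system_under_test: callable taking a load level, returning True if the
                       steady state held at that level.
    """
    load = start
    last_ok = None
    while load <= limit:
        if not system_under_test(load):
            return {"last_ok": last_ok, "first_failure": load}
        last_ok = load
        load += step
    return {"last_ok": last_ok, "first_failure": None}


def toy_system(requests_per_second, capacity=750):
    """Hypothetical system that copes with up to `capacity` requests/second."""
    return requests_per_second <= capacity


if __name__ == "__main__":
    print(find_breaking_point(toy_system))
```

Changing only one variable per run keeps the result attributable: if the hypothesis fails at a given level, you know which stressor caused it.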

For network vulnerability testing, two of the most effective automation tools to deploy in this environment are Metasploit and Nmap.

Metasploit is an open source framework that allows ethical hackers to probe for vulnerabilities through the eyes and mindset of bad actors. Because you're using the tools and techniques of a criminal, often through an interface that requires you to disable protections like anti-virus software and firewalls, running the tests behind a VPN gives you the freedom to find flaws without needlessly endangering data integrity or putting the rest of your system at risk.

With Nmap, which is a free, open source network mapper, you're able to gain a more robust view of the entire network environment during testing. It scans hosts to report which are up, which ports and services they expose, and, where it can infer them, details such as operating system and uptime. Its most important function in chaos engineering is remote host detection. This way, testers aren't going into the situation blind.

Since a huge part of chaos engineering involves making a hypothesis and then proving or disproving it, obtaining as much information as possible about the remote networks, including OS and software, allows you to make predictions based on known vulnerabilities rather than guessing and hoping for the best. In other words, it takes the testing team from "known unknowns" to "known knowns".

You can scan a single host or take a look at an entire subnetwork just by changing the command.

Single host scan:

nmap target
# nmap target.com
# nmap 192.168.1.1

Subnetwork overview:

nmap target/cidr
# nmap 192.168.1.1/24

Scanning a range of addresses requires only the addition of a dash and the final number in the range, like this:

nmap target-50
# nmap 192.168.1.1-50

That will scan the range of IP addresses from 192.168.1.1 through 192.168.1.50. You can also list several targets on one command line, separated by spaces: nmap target1 target2 target3.
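To see which addresses the CIDR and dash forms above actually cover before you scan, you can expand them with Python's standard `ipaddress` module. This sketch does not invoke Nmap; the helper names are made up for the example, and the dash parser handles only the simple last-octet form shown above.

```python
import ipaddress


def expand_cidr(spec):
    """List every address in the block containing `spec`, e.g. 192.168.1.1/24.

    strict=False lets the spec name a host inside the block rather than the
    network address itself.
    """
    network = ipaddress.ip_network(spec, strict=False)
    return [str(address) for address in network]


def expand_last_octet_range(spec):
    """Expand 'a.b.c.start-end' into individual addresses."""
    prefix, _, last = spec.rpartition(".")
    start, end = (int(part) for part in last.split("-"))
    return [f"{prefix}.{n}" for n in range(start, end + 1)]


if __name__ == "__main__":
    block = expand_cidr("192.168.1.1/24")
    print(len(block), block[0], block[-1])
    targets = expand_last_octet_range("192.168.1.1-50")
    print(len(targets), targets[0], targets[-1])
```

Checking the expansion first is a cheap way to confirm that a scan stays inside the range you are authorized to test.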

Both of these tools are open source and run on all major operating systems.

The virtual private network (VPN) mentioned above can also come in handy for this type of experimentation in more than a simple defensive role. Many consumer-focused VPN services apply AES-256 encryption to incoming and outgoing traffic, which is generally handy, but the important feature here is that they let you change your apparent source IP address and location. That makes a VPN well suited to chaos experimentation: you can simulate users from other countries or regions and add them as variables, modeling the kind of geographically distributed traffic that is often used in cyberattacks like DDoS, which we discussed earlier.

Chaos Engineering Best Practices

Although chaos engineering has not yet replaced more traditional forms of quality assurance (QA) testing, it has been around long enough to have evolved some procedures and best practices.

One of the first things you should pay attention to is minimizing your blast radius. This is especially essential when testing in production because you can cause real pain to real customers in the process. Choose the smallest possible point of impact that will provide meaningful results without increasing undue risk. You can gradually expand the danger zone with subsequent testing, if that is necessary for you to gather the data and insight you need.
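One common way to confine a blast radius is to expose only a small, stable percentage of users to the experiment. The sketch below hashes a user ID into one of 100 buckets, so the same users are selected on every run and raising the percentage only adds users. The function name, the experiment label, and the user ID format are assumptions for illustration.

```python
import hashlib


def in_blast_radius(user_id, percent, experiment="checkout-latency"):
    """Return True if this user falls inside the experiment's blast radius.

    Hashing the experiment name together with the user ID gives each
    experiment its own stable, evenly spread selection of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (1, 5, 25):
        affected = sum(in_blast_radius(u, percent) for u in users)
        print(f"{percent}% target -> {affected} of {len(users)} users")
```

Because a user's bucket never changes, expanding from 1% to 5% keeps the original 1% inside the experiment, which makes results from successive runs easier to compare.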

If it helps reduce your level of anxiety, you can perform a dry run using manual testing in a simulated environment, analyze the results, and then repeat or scale in a live production arena in order to obtain more relevant, real-world outcomes. It's also a good idea to have a backup plan in place in the event that you need to roll back your experiment or kill it.

You should be prepared for the fact that the backup plan might also fail. The idea is to introduce controlled chaos, but you can never be certain how a complex system will perform until it has been put through its paces.

Another practice that's becoming common is to containerize chaos testing. The orchestrated environment of containers allows you to deploy more experiments in isolation at service levels without a high risk of disruption. You can destroy one container at a time to create a little chaos, and the technology will create a new replica to replace it when you're done.

Containerized environments also make it much easier in real-world applications to create and deploy new containers quickly if existing ones disappear due to a server crash or other issue. Chaos testing lets you verify that the scheduler actually does this, rather than simply trusting it, which is imperative in this environment.

For containerized experimentation, one way forward is a four-step process that leads you from testing for factors you’re aware of towards those that you aren’t, though there’s no law against switching things around depending on personal preference:

Testing for "Known Knowns" or things that you should already be aware of and understand. For example, you know that when one node or replica container shuts down, it will disappear from the node cluster. New replicas will be created and re-added to the cluster.

Experimenting for "Known Unknowns" or elements that you are aware of but do not fully understand. This is where you know the above, but don't know how much time will pass between the destruction of one replica and the creation of a new one.

Checking "Unknown Knowns," which are things that you comprehend but which are beyond your perception. In this case, you don't know the mean time for creation of new replicas on a specific date or in a certain environment, but you do know how many there were and how many will be created to replace them.

Looking at "Unknown Unknowns" for which you have neither knowledge nor awareness. For instance, you don't know what will happen during a total system shutdown or whether the virtual region failover will be effective because you have no previous trials or basis for comparison.

Kube Monkey is a version of the Netflix team's Chaos Monkey that's made especially for testing in a containerized environment.
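The replica behavior described in the "Known Knowns" and "Known Unknowns" steps can be modeled with a toy scheduler. The sketch below is a simulation only: `ReplicaSet`, the tick-based clock, and the three-tick startup time are assumptions, not the behavior of any real orchestrator. It destroys one replica and counts how many ticks pass before the desired count is restored.

```python
class ReplicaSet:
    """Toy model of a scheduler that keeps `desired` replicas running."""

    def __init__(self, desired, startup_ticks=3):
        self.desired = desired
        self.startup_ticks = startup_ticks
        self.running = desired
        self.starting = []  # remaining startup ticks per pending replica

    def kill_one(self):
        """Chaos action: destroy a single running replica."""
        if self.running > 0:
            self.running -= 1

    def tick(self):
        """Advance one time step: finish startups, then schedule new ones."""
        self.starting = [remaining - 1 for remaining in self.starting]
        self.running += sum(1 for remaining in self.starting if remaining <= 0)
        self.starting = [r for r in self.starting if r > 0]
        missing = self.desired - self.running - len(self.starting)
        self.starting.extend([self.startup_ticks] * missing)


def ticks_to_recover(replica_set, max_ticks=100):
    """Count ticks until the running count matches the desired count."""
    for elapsed in range(1, max_ticks + 1):
        replica_set.tick()
        if replica_set.running == replica_set.desired:
            return elapsed
    return None


if __name__ == "__main__":
    replicas = ReplicaSet(desired=3)
    replicas.kill_one()
    print("recovered after", ticks_to_recover(replicas), "ticks")
```

In this model recovery takes one tick for the scheduler to notice the loss plus the startup time. Measuring the equivalent figure on a real cluster is what turns a known unknown into a known known.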

Testing Tools and Resources

Once you learn how to embrace chaos effectively, there are numerous possibilities for deployment. It's best to perform these experiments in production, but you can employ a number of tools to simulate realistic events without jeopardizing actual systems or users. Many automation tools are available that are efficient and reliable in both live and simulated environments.

- Manual testing is simple if you run just a few servers in a containerized runtime environment. You can halt random servers from the management UI, or SSH into a few containers and destroy them. This can also be automated via a chaos testing tool called Pumba.

- All of the tools in the Simian Army are now open source and completely free to use. You can find them on Netflix's GitHub page.

- GitHub hosts a lively chaos engineering community, a growing knowledge base, and additional resources.

- The Chaos Toolkit allows you to inject anomalies into multiple platforms by defining the policies as JSON files.
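To give a feel for what such a JSON policy looks like, the sketch below builds an experiment document in Python and serializes it. The top-level keys follow the Chaos Toolkit experiment format as I understand it (a steady-state hypothesis with probes, a method of actions, and rollbacks), so check the toolkit's own documentation before relying on it. The URL, the container name `checkout-1`, and the titles are hypothetical.

```python
import json

# Hypothetical experiment: all names, URLs, and values are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Checkout stays available when one container is killed",
    "description": "Kill a single checkout container and probe the endpoint.",
    "steady-state-hypothesis": {
        "title": "Checkout endpoint responds",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/checkout",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-container",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "kill checkout-1",
            },
        }
    ],
    "rollbacks": [],
}

rendered = json.dumps(experiment, indent=2)

if __name__ == "__main__":
    print(rendered)
```

Keeping experiments as declarative files means they can be reviewed, versioned, and rerun like any other piece of configuration.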

Looking for a few experiments to start with? Here are two easy ones to help you loosen up and get used to the idea and feel of chaos engineering.

Resource Exhaustion Experiment

This is a common problem for website owners who are on shared (often cheap) hosting platforms or experience rapid, unexpected growth that strains resources, pushing the limits of the CPU or RAM.

This single-instance test involves an attack on the disk, CPU, and memory, with the expected outcome of a graceful response as the system goes down. However, the attack will multiply across all layers, resulting in the system entering brownout mode or firing alerts before traffic is rerouted.
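A deliberately tame version of such an attack can be written with nothing but the standard library. The sketch below burns one CPU core for a bounded time and holds a capped amount of memory. The function names and the 64 MB cap are assumptions chosen to keep the example safe to run on a laptop; dedicated stress tools are more appropriate for real experiments.

```python
import time


def burn_cpu(seconds):
    """Spin in a tight loop for `seconds`; returns the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def hold_memory(megabytes, cap_mb=64):
    """Allocate `megabytes` of memory, refusing anything above the cap.

    The cap is a safety guard so that a typo cannot exhaust the host.
    """
    if megabytes > cap_mb:
        raise ValueError(f"refusing to allocate more than {cap_mb} MB")
    return bytearray(megabytes * 1024 * 1024)


if __name__ == "__main__":
    block = hold_memory(16)  # keep a reference so the memory stays held
    spins = burn_cpu(0.5)
    print(f"held {len(block)} bytes, spun {spins} iterations")
```

Built-in limits like the time bound and the memory cap are the code-level equivalent of confining the blast radius.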

Unavailable DNS Experiment

Next to hardware reliability, DNS servers are critical to keeping networks up and running. Simulating an unavailable DNS server will allow you to have a comprehensive recovery plan in place for the eventuality that this rare event happens to your network. You'll also obtain greater insight into how your applications respond in the event of server loss.

The attack simulates a DNS server black hole for a single occurrence in order to test the hypothesis that inbound traffic will be reduced and startup may fail to initiate. How will your team deal with this if it happens in real life?
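Within a single Python process, an unavailable DNS server can be approximated by temporarily replacing `socket.getaddrinfo` so that every lookup fails. This is only a sketch under that assumption: a real black hole drops packets at the network level and tends to produce timeouts rather than immediate errors. `resolve_or_fallback` and the fallback address are hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def dns_blackhole():
    """Make every DNS lookup in this process fail until the block exits."""
    original = socket.getaddrinfo

    def failing_getaddrinfo(*args, **kwargs):
        raise socket.gaierror(
            socket.EAI_NONAME, "injected fault: DNS unavailable"
        )

    socket.getaddrinfo = failing_getaddrinfo
    try:
        yield
    finally:
        socket.getaddrinfo = original  # always restore normal resolution


def resolve_or_fallback(hostname, fallback_ip):
    """Hypothetical recovery path: use a known address when DNS is down."""
    try:
        return socket.getaddrinfo(hostname, 443)[0][4][0]
    except socket.gaierror:
        return fallback_ip


if __name__ == "__main__":
    with dns_blackhole():
        # 203.0.113.10 is a documentation-range placeholder address.
        print(resolve_or_fallback("example.com", "203.0.113.10"))
```

Running the application's startup path inside the `with` block is a quick way to find out whether it fails fast, hangs, or falls back as intended.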

Final Thoughts

Testing in production (TiP) is a standard software engineering practice. There's no question that DevOps teams will discover failures at some point in the production process. The difference between the traditional approach and chaos engineering is whether those failures arrive unexpectedly or are introduced intentionally in order to measure systemic strengths and weaknesses.

Chaos engineering was born of necessity as a means of specifically targeting and tackling vulnerabilities in large-scale distributed network models. Embracing it may require a cultural shift: from protecting your applications at all costs, to allowing a little danger into your life.

It doesn't really break your system.

Introducing controlled chaos allows you to learn more about it so you can build it to be more resilient. Learn more and become part of the discussion by applying to join the Google Chaos Community forum.

About the Author:

Sam Bocetta is a former security analyst, having spent the bulk of his career as a network engineer for the Navy. He is now semi-retired, and educates the public about security and privacy technology. Much of Sam's work involved penetration testing ballistic systems. He analyzed Navy networks looking for entry points, then created security-vulnerability assessments based on his findings. Further, he helped plan, manage, and execute sophisticated "ethical" hacking exercises to identify vulnerabilities and reduce the risk posture of enterprise systems used by the Navy (both on land and at sea). The bulk of his work focused on identifying and preventing application and network threats, lowering attack vector areas, removing vulnerabilities, and general reporting. He was able to identify weak points and create new strategies that bolstered those networks against a range of cyber threats. Sam worked in close partnership with architects and developers to identify mitigating controls for vulnerabilities identified across applications, and performed security assessments to emulate the tactics, techniques, and procedures of a variety of threats.
