How We Improved Application’s Resiliency by Uncovering Our Hidden Issues Using Chaos Testing

Key Takeaways

  • Chaos testing is a disciplined approach to testing a system’s integrity that can be carried out at the unit, integration, and system levels
  • Chaos testing has well-defined principles: building a hypothesis, varying real-world events, running experiments in production, automating them, and limiting the blast radius
  • Chaos testing offers many advantages for the business and for testers, such as surfacing scenarios that appear only with live data streams; it also carries disadvantages and risks, for instance testing in production may disrupt a service and bring the system to a halt
  • Chaos testing should start on a small part of the system; as confidence builds, the blast radius can be expanded
  • The resolve to perform chaos testing should come from management, who should know upfront what they want out of the tests and the risks involved

 

As software has become more complex, and with the rise of microservices and distributed infrastructure, it has become very hard to control system failures. In the past, when infrastructure was developed and managed on premises, sysadmins found it much easier to maintain. Now that systems are hosted on globally distributed infrastructure, it is hard to predict which failures might occur.

Chaos testing is the act of deliberately disrupting and breaking an application system in order to build resilience. It is generally performed on production systems, which makes it extremely sensitive.

In this article, I list the chaos testing principles outlined by Netflix and explain the advantages and disadvantages of chaos testing, so that readers can decide whether they want to perform it. I also explain why we should convince management to perform chaos tests, weighing the benefits against the risks.

What is chaos testing?

Chaos testing is a highly disciplined approach to testing a system’s integrity by proactively simulating and identifying failures in a given environment before they lead to unplanned downtime or a negative user experience.

It can also be defined as a method of testing distributed software that purposely introduces failures and faulty scenarios to verify how the system behaves in the face of random disruptions. These disruptions can cause applications to respond unpredictably and break under pressure.

As an example, while working for a banking client, we came across an issue that happened once a day. Each day, for a few seconds, the site became unresponsive and then this error showed up on screen: "Bank Website down, not working". After a few seconds it started responding normally again. The issue was not reproducible on staging environments, but after each release it was reported on production.

After we convinced the client that we needed to watch the production systems closely, we used chaos testing. First, we decided to perform the testing on a Sunday between 12:00 AM and 4:00 AM, when traffic was lowest. We started shutting down services randomly, one at a time, and observing the impact on the overall system. Soon we found that one API, API-A, got its data feed from a third-party API, API-B, which had never been in the scope of testing. On inspecting API-A closely, we found that once a day there was a lag of 31 seconds in receiving data from API-B. We passed our findings to the third party and asked them to investigate. They found that API-B hung for exactly 31 seconds when their servers' clock rolled over from AM to PM at noon; that kept API-A waiting and ultimately caused the system to hang.

It was a learning experience: while testing, we had focused on our application alone, without taking into account external code and our dependencies on it. We modified our strategy and included chaos testing in our test plan.
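
A hang like the one API-A suffered can often be contained by putting an upper bound on how long a caller waits for a dependency. The following is a minimal Python sketch and not the code used in the project: `call_with_timeout`, `slow_feed`, `fast_feed`, and the timeout values are hypothetical names and numbers chosen for illustration, and the stalled dependency is simulated with a short sleep standing in for the 31-second delay.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it takes longer than timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The dependency is hanging: degrade instead of keeping the caller waiting.
        return fallback
    finally:
        # Do not block on the stuck worker; it finishes in the background.
        executor.shutdown(wait=False)


def slow_feed():
    """Hypothetical stand-in for API-B during its stall (shortened to 0.3 s)."""
    time.sleep(0.3)
    return {"rates": "fresh"}


def fast_feed():
    """Hypothetical stand-in for API-B when it responds normally."""
    return {"rates": "fresh"}


if __name__ == "__main__":
    cached = {"rates": "cached"}
    # A healthy dependency answers well within the limit.
    print(call_with_timeout(fast_feed, 2.0, cached))
    # A stalled dependency is cut off and the cached value is served instead.
    print(call_with_timeout(slow_feed, 0.05, cached))
```

Whether a cached or default value is an acceptable fallback depends on the data; for some feeds, failing fast with a clear error may be the safer choice.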

Principles of chaos testing

Chaos engineering is made up of five main principles:

  1. Identify a steady state
    We should define a "steady state", or control, as a measurable system output that indicates normal working behaviour (in most cases, an error rate well below one percent).
  2. Hypothesize that the system will hold its steady state
    Once a steady state has been determined, it must be hypothesized that it will continue in both control and experimental conditions.
  3. Ensure minimal impact to your users
    During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimises any negative impact to your users. Your team will be responsible for ensuring all tests are focused on specific areas and should be ready for incident response as needed.
  4. Introduce chaos
    Once you are confident that your system is working, your team is prepared, and the impact areas are contained, you can start running your chaos testing applications. Try to introduce different variables to simulate real world scenarios, including everything from a server crash to malfunctioning hardware and severed network connections. It’s best to test in a non-production environment so you can monitor how your service or application would react to these events without directly affecting the live version and active users.
  5. Monitor and repeat
    With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. The goal is to try to disprove the hypothesis from step two and to build a more reliable system in the process.
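
The five principles above can be sketched as a small experiment loop. This is an illustrative Python sketch under stated assumptions, not a real chaos tool: `FakeService` is a hypothetical two-replica service simulated in memory, the one percent threshold is taken from the steady-state description above, and "introducing chaos" simply means marking one replica as down.

```python
import random


class FakeService:
    """Hypothetical two-replica service used only to illustrate the experiment."""

    def __init__(self, retry_on_failure, seed=42):
        self.retry_on_failure = retry_on_failure
        self.replica_up = [True, True]
        self.rng = random.Random(seed)

    def handle_request(self):
        """Return True if the request succeeds, False otherwise."""
        first = self.rng.randrange(2)
        if self.replica_up[first]:
            return True
        if self.retry_on_failure:
            # Fail over to the other replica.
            return self.replica_up[1 - first]
        return False


def error_rate(service, requests=1000):
    """Measure the fraction of failed requests (the steady-state metric)."""
    failures = sum(1 for _ in range(requests) if not service.handle_request())
    return failures / requests


def run_experiment(service, threshold=0.01):
    # 1. Identify the steady state.
    baseline = error_rate(service)
    # 2. Hypothesis: the error rate stays below the threshold under failure.
    # 3. Limit the impact: only one replica is touched, and it is always restored.
    # 4. Introduce chaos.
    service.replica_up[0] = False
    try:
        during_chaos = error_rate(service)
    finally:
        service.replica_up[0] = True
    # 5. Monitor: compare the observation with the hypothesis.
    return {
        "baseline": baseline,
        "during_chaos": during_chaos,
        "hypothesis_holds": during_chaos < threshold,
    }


if __name__ == "__main__":
    print(run_experiment(FakeService(retry_on_failure=True)))
    print(run_experiment(FakeService(retry_on_failure=False)))
```

In this toy setup the service with failover keeps its steady state when a replica is lost, while the one without failover disproves the hypothesis and so reveals a weakness to fix.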

Chaos testing test pyramid

Over the years, the IT industry has experienced some rather dramatic changes in the design, building, and operational scale of computer systems, leading to more complex systems. The cumulative effect is large-scale distributed systems with more opportunities for failure.

The goal of chaos engineering is to educate and inform the organization about unknown vulnerabilities and previously unanticipated outcomes of a computer system. A primary focus of these complex testing procedures is to identify hidden problems that could arise in production environments before they cause an outage outside of the organization’s control. Only then can the disaster recovery team address systemic weaknesses and enhance the system’s overall fault tolerance and resiliency. Hence, chaos testing is carried out at various levels.

A typical test pyramid consists of three areas of common testing:

  1. Unit Level
    The primary objective of unit tests is to evaluate an individual component’s specific, expected behaviours. The component being tested must be isolated from its usual dependencies, while the chaos engineering team controls the behaviour of those dependencies with the help of mocks. The worst-case scenarios are tested against expected behaviour.
  2. Integration Level
    Individual components interact with each other and hence integration tests focus on the interactions and interrelationships between individual components. Engineers ideally run these tests automatically following the successful unit testing of the individual components. These integration tests can be very useful in determining the stable state or common operational metrics of complex applications and systems.
  3. System Level
    Systems tests proactively evaluate how the entire computer system reacts under the increased stress of a particular, worst-case failure scenario. Only in real-world conditions involving standard production environments can the disaster recovery team definitively determine the steady state behaviours of the individual components and their integration protocols within the overall architecture.
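
At the unit level, a fault can be injected through a mock in place of the real dependency. The sketch below uses Python's standard `unittest.mock`; the component `account_summary`, the client method `get_balance`, and the "degraded" status are hypothetical names invented for this example.

```python
from unittest import mock


def account_summary(balance_client, account_id):
    """Hypothetical component under test: builds a view from a balance service."""
    try:
        balance = balance_client.get_balance(account_id)
    except (TimeoutError, ConnectionError):
        # Worst case: the dependency is unavailable, so return a degraded view.
        return {"account": account_id, "balance": None, "status": "degraded"}
    return {"account": account_id, "balance": balance, "status": "ok"}


def test_summary_survives_dependency_timeout():
    # The mock replaces the real dependency and injects the failure.
    client = mock.Mock()
    client.get_balance.side_effect = TimeoutError("injected fault")
    result = account_summary(client, "ACC-1")
    assert result["status"] == "degraded"
    assert result["balance"] is None
    client.get_balance.assert_called_once_with("ACC-1")


def test_summary_with_healthy_dependency():
    client = mock.Mock()
    client.get_balance.return_value = 120.5
    result = account_summary(client, "ACC-1")
    assert result == {"account": "ACC-1", "balance": 120.5, "status": "ok"}


if __name__ == "__main__":
    test_summary_survives_dependency_timeout()
    test_summary_with_healthy_dependency()
    print("unit-level chaos tests passed")
```

The same pattern extends to other worst cases, such as malformed responses or slow replies, by changing what the mock returns or raises.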

Advantages and disadvantages of chaos testing

Advantages:

By testing the limits of your applications, you gain insights that benefit your development teams and your overall business. Here are the benefits of a healthy, well-managed chaos engineering practice.

  1. Increases resiliency and reliability
    Chaos testing enriches the organisation’s intelligence about how software performs under stress and how to make it more resilient.
  2. Accelerates innovation
    Intelligence from chaos testing funnels back to developers who can implement design changes that make software more stable, resulting in improved production quality.
  3. Advances collaboration
    Developers aren’t the only group to see advantages. The entire technical group gains insights that will lead to faster response times and better collaboration.
  4. Speeds incident response
    By learning what failure scenarios are possible, the disaster recovery teams can speed up troubleshooting, repairs, and incident response.
  5. Improves customer satisfaction
    Increased resilience and faster response times lead to less downtime. Greater innovation and collaboration from development and SRE teams means better software that meets new customer demands quickly with efficiency and high performance.
  6. Boosts business outcomes
    Chaos testing can extend an organisation’s competitive advantage through faster time-to-value, saving time, money, and resources, and producing a better bottom line.

The more resilient an organisation’s software is, the more customers can enjoy its services without distraction or disappointment.

Disadvantages:

Although the benefits of chaos testing are clear, it is a practice that should be undertaken with deliberation. Here are the top concerns and challenges.

  1. Unnecessary damage
    The major concern with chaos testing is the potential for unnecessary damage. Chaos engineering can lead to a real-world loss that exceeds the allowances of justifiable testing. To limit the cost of uncovering application vulnerabilities, organisations should avoid tests that overrun the designated blast radius. The goal is to control the blast radius so you can pinpoint the cause of failure.
  2. Lack of observability
    Without comprehensive observability, it can be difficult to understand critical dependencies vs non-critical dependencies. A lack of visibility can also make it difficult for teams to determine the exact root cause of an issue, which can complicate remediation plans.
  3. Unclear starting system state
    Another issue is having a clear picture of the starting state of the system before the test is run. Without this clarity, teams can have difficulty understanding the true effects of the test. This can diminish the effectiveness of chaos testing and can put downstream systems at greater risk.
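
One way to keep the blast radius under control is to expose only a fixed share of users to the experiment and to stop as soon as the error rate crosses a limit. The following Python sketch illustrates the idea; the function names `in_blast_radius` and `should_abort`, the hashing scheme, and the five percent abort threshold are assumptions made for this example rather than part of any particular tool.

```python
import hashlib


def in_blast_radius(user_id, percent):
    """Deterministically decide whether a user falls inside the experiment group.

    percent is the share of users to expose, from 0 to 100.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to a stable bucket between 0 and 9999.
    bucket = int.from_bytes(digest[:8], "big") % 10000
    return bucket < percent * 100


def should_abort(observed_error_rate, abort_threshold=0.05):
    """Signal that the experiment must stop and the fault must be rolled back."""
    return observed_error_rate >= abort_threshold


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    exposed = [u for u in users if in_blast_radius(u, 5)]
    print(f"{len(exposed)} of {len(users)} users are inside the blast radius")
    print("abort at 8% errors:", should_abort(0.08))
```

Because the bucket is derived from the user ID, the same users stay inside the group for the whole experiment, and raising `percent` later only adds users to it, which matches the idea of expanding the blast radius as confidence grows.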

Convincing upper management to do chaos testing

Chaos testing is a new approach in which any failure may lead to whole systems shutting down. Hence, it is critical to convince your bosses before performing chaos testing. Here is what we can do:

  1. Educate them
    Biologist Henri Laborit wrote in 1976: "Faced to unknown experience, man has only three choices: fight, do nothing or flee". This subject is quite new, with little visibility, so it is important to allow your boss to discover the concept at their own pace, avoiding instinctive rejection. You may start by sharing some interesting papers on the subject on internal or external social networks.
  2. Tell the right story
    Once you have educated them, you should adapt your story to their concerns, questions or objections. Never hesitate to play with emotions; they are a major factor in decision-making. The most obvious emotion to play with is fear: fear of a major outage that will impact your revenue. For instance, a five-minute outage represents around one million dollars for Google/Alphabet, one hundred thousand dollars for Netflix, and three million for Apple.

Moreover, an incident is exactly the moment to be opportunistic and advance your case by proposing new practices that will limit the impact of future incidents.

One of the best ways to do so is to talk about resilience: the ability to recover quickly from difficulties.

Things to take care of when planning for chaos testing

Chaos testing is a new concept, but many of us already had the mindset for it, and sometimes performed it without knowing that it was chaos testing. It has its own principles, benefits, and pitfalls. I would advise all teams to weigh the pros and cons of conducting these tests before formulating a plan. Be very clear about what you want to achieve from these disruptive tests. Get permission from your bosses and convince them why it is important to carry out these tests. Once they are convinced, lay out a plan that defines the blast radius of your tests. Monitoring should be in place, with the systems under full observation. In short, a lot of preparation is needed before you even begin. If the preparation is right and the intentions are clear, these tests will yield a lot of valuable insight.

Community comments

  • Steady state

    by Nirmalya Sengupta,

    The topic is well-elucidated but the point remains:

    How does one identify a steady state!


    In the bank example given, one may argue that, but for those N minutes, the system is always in the steady state. Therefore, the temporal aspect is important here. Now, if that is indeed the case, which strip of time along the day should we scan: every second? every minute? every hour?

    Please do not get me wrong. I understand the motivation behind Chaos Testing and I know that it is useful in many situations. But the indeterminate behaviour of any distributed system throws a spoke in the wheel, IMHO. In order to prove the behaviour of the solution as a whole, I must consider not only the what of a failure situation but also the when of that situation. And that is tricky.

    Just an observation.
