Chaos Testing of Microservices
The world is naturally chaotic, and we should both plan for and test that our systems can handle this chaos, Rachel Reese claimed at the recent QCon London conference, where she described how Jet, an e-commerce company launched in July 2015, works with microservices and chaos engineering.
Reese emphasizes how important it is to test the interactions in your environment. Even when every component has been tested, that does not mean the interactions between them are solid or that the components can be used together in production; the interactions have to be tested too. She calls Jet a “the right tool for the right job” company, and for her chaos testing is one of the right tools.
Reese defines a microservice as an application of the Single Responsibility Principle (SRP) at the service level. Because Jet takes a functional view of microservices, the definition also requires that a microservice has an input and produces an output. The benefits she sees in using microservices include simplified scalability, the ability to release independently, and a more even distribution of complexity. Jet runs somewhere between 400 and 1,000 microservices spread over 10-15 teams, mainly written in F# (a functional-first programming language).
Reese notes that chaos engineering is not about wreaking havoc with the code for fun. Instead, she defines it as:
Controlled experiments on a distributed system that help to build confidence in the system’s ability to tolerate the inevitable failures.
Referring to the Principles of Chaos, Reese defines four steps in chaos engineering:
- Define “normal” (the normal state of the system).
- Assume “normal” will continue in both a control group and an experimental group.
- Introduce chaos: servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Look for a difference in behaviour between the control group and the experimental group.
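The four steps above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not something shown in the talk: the flaky dependency is simulated with random numbers, retries stand in for whatever resilience mechanism a real service might have, and the names `handle_request`, `success_rate`, `run_experiment`, the 20% injected failure rate, and the 5% tolerance are all hypothetical.

```python
import random


def handle_request(rng, failure_rate, retries):
    """Simulate one request against a flaky dependency.

    Each attempt fails with probability `failure_rate`; the request
    succeeds if any of the `retries + 1` attempts succeeds.
    """
    return any(rng.random() >= failure_rate for _ in range(retries + 1))


def success_rate(rng, failure_rate, retries, requests=1000):
    """Fraction of simulated requests that succeed."""
    successes = sum(
        handle_request(rng, failure_rate, retries) for _ in range(requests)
    )
    return successes / requests


def run_experiment(retries, injected_failure_rate=0.2, tolerance=0.05, seed=0):
    """Compare a control group with a group that has chaos injected."""
    rng = random.Random(seed)
    # Steps 1 and 2: "normal" is the success rate with no injected faults,
    # and the control group is left alone so it stays there.
    control = success_rate(rng, 0.0, retries)
    # Step 3: introduce chaos in the experimental group only.
    experimental = success_rate(rng, injected_failure_rate, retries)
    # Step 4: look for a difference between the two groups.
    difference = control - experimental
    return {
        "control": control,
        "experimental": experimental,
        "tolerated": difference <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(retries=0))  # no resilience mechanism
    print(run_experiment(retries=3))  # retries absorb most injected faults
```

In this toy setup, the variant without retries shows a clear gap between the two groups, while the variant with retries stays within the tolerance. A real experiment would replace the simulated failure rate with actual fault injection and the success rate with whatever metric defines "normal" for the system.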
More specifically this means:
- Build a hypothesis that defines the normal behaviour and state of the system in terms of measures such as throughput and latency.
- Vary real-world events, such as spikes in traffic and other things that can make the system chaotic.
- Run experiments in production to guarantee authenticity of the tests.
- Automate experiments to run continuously.
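As a rough illustration of the first, second, and last points, a hypothesis can be written down as explicit bounds on throughput and latency and then checked automatically after each injected event. The sketch below makes several assumptions that do not come from the talk: the `Metrics` and `Hypothesis` types, the two experiment functions, and every number in them are invented, and the experiments return canned measurements instead of touching a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Measurements observed while an event is active (hypothetical)."""
    throughput_rps: float
    p99_latency_ms: float


@dataclass(frozen=True)
class Hypothesis:
    """Bounds that describe "normal"; the thresholds are made up."""
    min_throughput_rps: float
    max_p99_latency_ms: float

    def holds(self, observed):
        return (
            observed.throughput_rps >= self.min_throughput_rps
            and observed.p99_latency_ms <= self.max_p99_latency_ms
        )


def traffic_spike():
    """Stand-in for a real load test; returns canned measurements."""
    return Metrics(throughput_rps=900.0, p99_latency_ms=180.0)


def instance_crash():
    """Stand-in for killing an instance; returns canned measurements."""
    return Metrics(throughput_rps=250.0, p99_latency_ms=400.0)


def run_experiments(hypothesis, experiments):
    """Run every experiment and record whether the hypothesis held."""
    return {name: hypothesis.holds(run()) for name, run in experiments.items()}


HYPOTHESIS = Hypothesis(min_throughput_rps=400.0, max_p99_latency_ms=200.0)
EXPERIMENTS = {"traffic_spike": traffic_spike, "instance_crash": instance_crash}

if __name__ == "__main__":
    for name, held in run_experiments(HYPOTHESIS, EXPERIMENTS).items():
        print(f"{name}: {'hypothesis held' if held else 'hypothesis violated'}")
```

Running continuously would then amount to invoking `run_experiments` from a scheduler or build pipeline and alerting on any entry where the hypothesis was violated.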
The benefits of chaos engineering that Reese has found include:
- Outages triggered by testing occur during the daytime, so problems are fixed then instead of at 3 a.m.
- Engineers start to design for failure.
- It makes systems healthier by preventing outages from happening later on.
Looking at their experiences, Reese notes that they are not yet testing in production. As a start-up, the company's primary objective has been launching and getting everything right. Right now they are testing in QA, at random times throughout the daytime.
One of their most “interesting” disasters happened a few months ago, when their manual testers noticed that their search engine was down, resulting in cascading issues downstream. The reason for this failure was that the chaos testing had restarted the search engine in the wrong way. Thanks to this single failure they were able to find 5-6 different issues.
Reese concludes by claiming:
If availability matters, you should be testing for it.