At QCon London 2016 Peter Alvaro and Kolton Andrus shared lessons learned from a fruitful collaboration between academia and industry, which ultimately resulted in the creation of a novel method for automating failure injection testing at Netflix. Core learnings included: work backwards from what you know - this can often bound the solution search space, for example, within Netflix it was known that the end-user experience is vital; meet in the middle - defining and working towards shared goals in combination with effective collaboration is essential; and adapt the theory to the reality - as the real world is often ‘messy’.
Alvaro, assistant professor at Santa Cruz, and Andrus, founder of Gremlin Inc and ex-Netflix ‘chaos’ engineer, began the talk by stating that the goals of academia and industry are often different, which can make successful collaboration a challenge. Alvaro, the self-described ‘professor’, stated that he enjoys modelling problems and prototyping solutions. Success within academia is typically defined by high citation impact (h-index), successful grant applications, and university department ranking. Andrus, the ‘practitioner’, said that he enjoys proactively finding operational issues and productionising solutions. Success within an industry context is often defined as system availability, a small number of failure incidents, and reduced operational burden.
Potential for the collaboration arose during 2014, when Andrus was responsible for implementing ‘failure as a service’ within Netflix, which built upon the well-documented ‘chaos engineering’ work. There are two key concepts for failure testing in production: failure scope and injection points. Bounding the potential impact of a failure test is referred to as the ‘failure scope’. The scope can be an individual user, an individual device, or a percentage of the total user base. ‘Injection points’ are key inflexion points within the system where failure is known to happen, e.g. remote procedure call (RPC) boundaries, caching layers and the persistence tier.
The ultimate goal of failure testing is for genuine production failures to not trigger outages, and for the system to gracefully degrade without the need to page engineers. Andrus demonstrated the type of conversations that occur when this is implemented correctly.
One of the best feelings is when you come in to the office and you chat to a co-worker, and they say ‘Hey, do you know that service X fell over last night?’, and you get to say ‘No - were you paged?’, ‘No, were you?’, ‘No.’
I love talking about outages, after the fact, in business hours, instead of in the middle of the night.
Traditionally within Netflix the process of failure testing has been manually implemented. Andrus would meet with service teams, discuss what could go wrong and determine failure scenarios, and manually implement and execute corresponding failure tests. Believing there was a better way, Andrus searched on the Internet and discovered a recording of Alvaro’s RICON talk, ‘Lineage-driven fault injection’ and the associated paper, which he believed could potentially be used to build automated failure testing in a safe way that didn’t require manually exercising failures.
Andrus and Alvaro traded ideas over lunch, and were determined to push the state-of-the-art forward within the domain of failure testing. Common ground was established based on the concept of ‘freedom and responsibility’, a theme that is embraced by both Netflix and academia. Key shared goals were 1) prove that models of reality actually work in reality, 2) show that it scales and 3) find real bugs. Accordingly, they decided to work closely over the summer months, and Alvaro joined the Netflix team on-site as a contractor.
The initial questions that were asked included “how big is the space of possible failures within a typical microservice architecture?” and “how many combinations of failures are there in that space?” Alvaro commented that, with a quick ‘back of the envelope’ calculation for a microservice system involving 100 services, the search space would be 2^100 executions. This is how many failure test experiments would be required to exhaustively search for bugs, and running these tests takes a non-trivial amount of time. If the depth of the search was constrained to bugs involving just a single failure then only 100 executions would be required. However, if the number of simultaneous failures was expanded to 4, then the search space is 3 million. Expand the number of failures to 7, and the search space is 16 billion.
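The back-of-the-envelope figures can be reproduced with a few lines of Python. This is only a sketch under one assumption: that each figure counts the ways of choosing exactly k failing services out of 100, while the exhaustive case counts every subset of the 100 services (2 to the power 100). The function and variable names are illustrative and do not come from the talk. Under that assumption the 4-failure count comes out at roughly 3.9 million and the 7-failure count at roughly 16 billion, the same order of magnitude as the figures quoted.

```python
from math import comb

SERVICES = 100  # size of the hypothetical system discussed in the talk


def exact_failure_combinations(services: int, failures: int) -> int:
    """Number of ways to choose exactly `failures` services to fail together."""
    return comb(services, failures)


# Exhaustive search: every service is either failed or healthy.
exhaustive = 2 ** SERVICES

single = exact_failure_combinations(SERVICES, 1)
four = exact_failure_combinations(SERVICES, 4)
seven = exact_failure_combinations(SERVICES, 7)

print(f"exhaustive search: {exhaustive:.3e} executions")
print(f"exactly 1 failure: {single:,}")
print(f"exactly 4 failures: {four:,}")
print(f"exactly 7 failures: {seven:,}")
```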
Fault tolerance from some sort of abstract academic level is really just redundancy.
A system that is fault tolerant must provide more ways of getting to some good outcome than we anticipate there will be failures.
An easy to automate approach is to conduct a random search (essentially “throwing darts at the search space”), but this would take too long. An alternative approach would be to implement an engineer-guided search, using domain-expertise and intuition to identify potential weaknesses of a service. The problem with this approach is that it is not automatable - it scales only with people.
Typical approaches to verification ask an open-ended, hard question “could a bad thing ever happen?”. However, this gives very little direction in how to carry out a search. One of the key concepts of the lineage-driven fault injection paper was to begin with a good system outcome and ask “exactly why did a good thing happen, and could we have made any wrong turns along the way?”. Faults are cuts in the lineage graph such that all supports are broken and an error occurs. The ‘lineage graph’ of a successful outcome can be used to identify failure tests to experiment with.
After the combinations of things that could have gone wrong along the lineage path have been determined, the resulting collection of paths is essentially a boolean formula in conjunctive normal form, which can be solved with a highly efficient satisfiability (SAT) solver. The resulting solutions are sets of failure injection combinations that can be experimented with. From this initial work Alvaro created a prototype system named “Molly”, which implements the following algorithm:
- Start with a successful outcome. Work backwards
- Ask why it happened: Lineage
- Convert lineage to a boolean formula and solve
- Lather, rinse, repeat
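The loop above can be sketched in a few dozen lines of Python. This is a toy illustration and not Molly's implementation: a brute-force search for minimal hitting sets stands in for the SAT solver, and the names `minimal_hitting_sets`, `find_bug`, `toy_system`, `primary` and `fallback` are all hypothetical. Each observed ‘support’ (a set of components that was sufficient for success) acts as one clause of the formula, and a candidate fault set must break at least one component in every clause.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Optional, Set

Support = FrozenSet[str]  # components that together produced a good outcome
Faults = FrozenSet[str]   # components to fail together in one experiment


def minimal_hitting_sets(supports: Iterable[Support], max_faults: int) -> List[Faults]:
    """Return minimal fault sets (up to max_faults) that break every support.

    Brute force stands in for a SAT solver here: each support is one clause,
    and a solution must contain at least one component from every clause.
    """
    supports = list(supports)
    components = sorted(set().union(*supports))
    found: List[Faults] = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            faults = frozenset(candidate)
            if any(prev <= faults for prev in found):
                continue  # a smaller solution already covers this one
            if all(faults & support for support in supports):
                found.append(faults)
    return found


def find_bug(run: Callable[[Faults], Optional[Support]], max_faults: int) -> Optional[Faults]:
    """Repeat: solve for a fault set, inject it, and learn from the outcome.

    `run` injects the faults and returns the support that produced success,
    or None if the good outcome did not happen.
    """
    first = run(frozenset())
    if first is None:
        return frozenset()  # the outcome fails even without injected faults
    supports: Set[Support] = {first}
    tried: Set[Faults] = set()
    while True:
        untried = [h for h in minimal_hitting_sets(supports, max_faults) if h not in tried]
        if not untried:
            return None  # no fault set within the bound breaks every support
        faults = untried[0]
        tried.add(faults)
        outcome = run(faults)
        if outcome is None:
            return faults  # a bug: these failures together prevent success
        supports.add(outcome)  # new lineage: another way success can happen


def toy_system(faults: Faults) -> Optional[Support]:
    """Hypothetical request served by 'primary', falling back to 'fallback'."""
    for alternative in (frozenset({"primary"}), frozenset({"fallback"})):
        if not (alternative & faults):
            return alternative
    return None


print(find_bug(toy_system, max_faults=1))  # tolerates any single failure
print(find_bug(toy_system, max_faults=2))  # both failing together breaks it
```

In the toy system, failing `primary` reveals the fallback path as a second support, after which the only remaining hypothesis is to fail both components together. The search therefore needs two experiments rather than an enumeration of every fault combination.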
Applying the algorithm either discovers a bug, which Alvaro half-jokingly referred to as a ‘good outcome’, or finds none, in which case the algorithm is repeated. The first challenge encountered when productionising the algorithm for use within Netflix failure testing was defining the notion of ‘success’ for a typical user request within the Netflix stack. A simple 200 HTTP status code does not always indicate success. Andrus stated that the Amazon core leadership principle ‘start with the customer and work backwards’ provided inspiration - the question of primary importance was “did the customer see success?”
The Netflix stack provides real user monitoring (RUM) to measure both the system characteristics and customer experience. An asynchronous stream of RUM data is constantly sent from the client to the Netflix backend service, which can be joined with a stream of user request data issued by the client, and successful (and failed) outcomes identified. Netflix also utilise a distributed tracing system with annotated failure points, which allows the combination (sets) of failure injection sites generated from the SAT solver to be determined and configured with failure scenarios.
Redundancy is implicit within the Netflix system with, for example, the deployment of services across multiple availability zones and regions. Execution within the code, for example, using the Hystrix circuit-breaker, allows fallback operations to be defined if a service in a zone fails. This effectively provides redundancy over time, or ‘redundancy through history’, with multiple attempted executions allowing a successful outcome. This redundancy defines the lineage graph of a Netflix request.
Alvaro commented that the discovery of this approach was only possible with the close collaboration between academia and industry, and an attempt to ‘meet in the middle’ over common goals. Andrus also stressed the collaboration was only possible by working closely together on-site, with frequent discussions and whiteboard sessions.
It’s the beauty of being side-by-side, being able to have these conversations, and the whiteboard diagrams that come from it. Not just a quick email or a chat, but being together onsite and working together on a regular basis.
With the failure injection points and lineage graph defined, the remaining step was to implement the algorithm in code. Determined to prove his software writing ability as an academic, Alvaro implemented the algorithm that is now deployed into production within Netflix.
Alvaro: I’ve committed real code that I’m proud to say is now running in Netflix…
Andrus: ...well minus all of those println() statements
Alvaro: ... yeah, sorry about the println()s
With the algorithm implemented, it was now a case of “lather, rinse, repeat”. If a user request coming into the system doesn’t trigger failure, then it would be a case of re-running the request with different failure injection schedules (replay is essential to the algorithm’s success). However, not every request endpoint is idempotent within the Netflix distributed system, and request traffic cannot simply be replayed without unintended consequences.
This potential issue was solved by looking at the replay as a ‘balls and bins’ problem, where requests can be grouped based on ‘sameness’ i.e. if two requests travel through the system in the same way, and can fail the same way, then they are conceptually the same for the purpose of failure testing replay. However, requests cannot easily be matched before they have completed execution, but the sameness of requests must be identified upfront in order to allow the failure injection points to be set correctly. Essentially a function was required that could map a request to the corresponding trace (lineage) through the Netflix system.
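The ‘balls and bins’ idea can be illustrated with a short Python sketch. Everything here is hypothetical: the `ROUTES` table, the service names and the `predict_from_path` function are stand-ins for whatever mapping from request to trace is available, and none of them reflect actual Netflix services. The point is only that, once such a mapping exists, requests predicted to touch the same set of services fall into the same bin and can substitute for one another during replay.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, Iterable, List

Request = Dict[str, str]
Signature = FrozenSet[str]  # the set of services a request is predicted to call

# Hypothetical lookup table standing in for the request-to-trace function.
ROUTES: Dict[str, Signature] = {
    "/home": frozenset({"api", "recommendations", "ratings"}),
    "/play": frozenset({"api", "playback"}),
}


def predict_from_path(request: Request) -> Signature:
    """Predict which services a request will call, before it executes."""
    return ROUTES.get(request["path"], frozenset())


def group_by_signature(
    requests: Iterable[Request],
    predict: Callable[[Request], Signature],
) -> Dict[Signature, List[Request]]:
    """Place each request (ball) into the bin for its predicted signature."""
    bins: Dict[Signature, List[Request]] = defaultdict(list)
    for request in requests:
        bins[predict(request)].append(request)
    return dict(bins)


incoming = [
    {"user": "u1", "path": "/home"},
    {"user": "u2", "path": "/play"},
    {"user": "u3", "path": "/home"},
]
request_classes = group_by_signature(incoming, predict_from_path)
for signature, members in request_classes.items():
    print(sorted(signature), [m["user"] for m in members])
```

With bins like these, a later request from the same class can carry the next failure injection schedule, so a non-idempotent request never has to be replayed.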
This problem could potentially have been solved via supervised machine learning. However, due to timescales this approach was not practical. An alternative approach was found by utilising the Falcor framework, an open source Netflix JavaScript library for efficient data fetching, which specifies (for a limited subset of requests) which backend services will be called before a request is made. This allows the request graph to be generated upfront. Alvaro commented that although this was not a perfect solution, it did capture the spirit of what they were trying to achieve, and accordingly, the concept of ‘adapt the theory to the reality’ was essential to make progress. After several weeks of productionising the solution Alvaro and Andrus determined that this novel approach to automating failure injection was successful.
In the final section of the talk, the ‘Netflix AppBoot’ case study was presented, in which the new approach to failure injection testing was applied to the initial request issued from a device starting (booting) the Netflix application. The failure search space for this request included approximately 100 services, and although 2^100 experiments would be required for an exhaustive search, the new failure testing approach resulted in only 200 experiments being performed in order to exhaust the search space. Six critical bugs were found and fixed.
Future work from the academic perspective includes search prioritisation, richer lineage collection, and the exploration of temporal interleavings (i.e. order of failure). From the industry perspective, Andrus suggested that richer device metrics, a more effective method for request class creation, and better experimentation selection would be the focus of future work.
The video of Peter Alvaro and Kolton Andrus’ “Monkeys in Lab Coats: Applying Failure Testing Research @Netflix” QCon London talk can be found on InfoQ.