Netflix has introduced Failure Injection Testing (FIT) to keep the Simian Army, particularly Latency Monkey, from impacting customers when it tests in the production environment.
FIT allows the impact of failure to be controlled by supplying metadata that indicates the limits of the test. Zuul's filtering mechanism uses the metadata to add information to the request context that specifies the limits of the test. Each injection point checks the request context to see if a failure is specified for that component. If it is, that component simulates that failure.
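The flow described above can be sketched roughly as follows. This is a minimal illustration, not Zuul's or FIT's actual API: the header name `x-fit-failures`, its `component=failure` format, and the names `RequestContext`, `edge_filter` and `failure_for` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical names throughout; this is not Zuul's or FIT's actual API.
FIT_HEADER = "x-fit-failures"  # assumed header carrying the test metadata


@dataclass
class RequestContext:
    """Per-request state that travels with the call through the services."""
    failures: Dict[str, str] = field(default_factory=dict)


def edge_filter(headers: Dict[str, str]) -> RequestContext:
    """Play the role of the edge filter: copy test metadata into the context.

    Assumed metadata format: "component=failure;component=failure".
    """
    context = RequestContext()
    raw = headers.get(FIT_HEADER, "")
    for entry in filter(None, raw.split(";")):
        component, _, failure = entry.partition("=")
        context.failures[component.strip()] = failure.strip()
    return context


def failure_for(context: RequestContext, component: str) -> Optional[str]:
    """What an injection point asks: is a failure specified for this component?"""
    return context.failures.get(component)
```

A request without the metadata yields an empty context, so every injection point behaves normally for ordinary traffic.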
A failure scenario might exercise the loss of a particular service, or of part of that service such as persistence or caching. Another scenario might test the minimum set of services needed to stream video. The injected failure itself might be an added delay on a service call, or an outright failure such as the persistence layer being unavailable.
The impact of the test is then expanded gradually. If the initial failure test is handled appropriately for a test account or a specific device, the failure can be expanded to a small percentage of production requests. If those requests also behave appropriately, the failure can be allowed to affect all production requests. The more the results of a test demonstrate that the system degrades gracefully, the more failure is allowed to occur.
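One way to picture this staged widening is a scope object that decides which requests receive the failure. The sketch below is an assumption about how such limits could be expressed; `FailureScope`, `in_scope`, the stage values and the hash-based sampling are illustrative and are not taken from Netflix's implementation.

```python
import zlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FailureScope:
    """Limits of a test (hypothetical fields, not FIT's real schema)."""
    accounts: FrozenSet[str] = frozenset()  # explicitly targeted test accounts
    devices: FrozenSet[str] = frozenset()   # explicitly targeted devices
    percent: float = 0.0                    # share of all other requests, 0-100


def in_scope(scope: FailureScope, account: str, device: str, request_id: str) -> bool:
    """Decide whether this request should carry the failure metadata."""
    if account in scope.accounts or device in scope.devices:
        return True
    # Stable hash so the same request id always gets the same decision.
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10000
    return bucket < scope.percent * 100


# Assumed rollout stages, widened only after the previous stage behaves well.
STAGES = [
    FailureScope(accounts=frozenset({"test-account"})),
    FailureScope(accounts=frozenset({"test-account"}), percent=1.0),
    FailureScope(percent=100.0),
]
```

Hashing the request id rather than drawing a random number keeps the decision repeatable, which makes a failing request easier to reproduce.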
Netflix’s services are built from components that provide injection points for inserting failure. Examples of these components are Hystrix, Ribbon, EVCache and Astyanax. Hystrix isolates failures and defines fallbacks. Ribbon provides internal load balancing. EVCache is a distributed in-memory cache. Astyanax is a client library for Cassandra. Each of these components examines the FIT context that Zuul has inserted into the request to determine how it should respond. Possible responses include increased latency, a 500-level response code, or a thrown exception.
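The three kinds of response can be sketched as a wrapper around a component call. This is only an illustration under stated assumptions: the instruction shape (`type`, `ms`, `code`), the `InjectedFailure` exception and `call_with_injection` are hypothetical and do not reflect the real interfaces of Hystrix, Ribbon, EVCache or Astyanax.

```python
import time
from typing import Callable, Dict, Optional


class InjectedFailure(Exception):
    """Raised when the test metadata asks a component to throw."""


def call_with_injection(
    component: str,
    fit_context: Dict[str, dict],
    real_call: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Wrap a component call with a hypothetical injection point.

    fit_context maps a component name to an instruction such as
    {"type": "delay", "ms": 200}, {"type": "status", "code": 503},
    or {"type": "exception"}. This shape is an assumption.
    """
    instruction = fit_context.get(component)
    if instruction is None:
        return real_call()                      # no failure requested
    kind = instruction["type"]
    if kind == "delay":
        sleep(instruction["ms"] / 1000.0)       # added latency, then proceed
        return real_call()
    if kind == "status":
        return {"status": instruction["code"]}  # simulated 500-level response
    if kind == "exception":
        raise InjectedFailure(component)
    raise ValueError(f"unknown failure type: {kind}")
```

Passing `sleep` as a parameter lets a test observe the injected delay without actually waiting.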
One of the problems in testing, or even in trying to recreate a past outage, is knowing what could fail. An internal trace tool can track requests to find all injection points on a given path. Those injection points are then used to create failure scenarios as previously described. Essentially, a failure scenario is a set of injection points and their associated behaviors (succeed, or fail in a particular way).
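Read that way, a scenario is a small data structure derived from a trace. The sketch below assumes a trace can be reduced to an ordered list of injection point names; `injection_points`, `build_scenario` and the example names are hypothetical, and the internal trace tool's real output format is not described in the source.

```python
from typing import Dict, List

# A trace is modeled as the ordered list of injection points a request touched.
# A scenario maps each injection point to a behavior: "succeed" or a failure.
Scenario = Dict[str, str]


def injection_points(trace: List[str]) -> List[str]:
    """Distinct injection points on the path, in first-seen order."""
    return list(dict.fromkeys(trace))


def build_scenario(trace: List[str], failures: Dict[str, str]) -> Scenario:
    """Every point on the path succeeds unless a failure is assigned to it."""
    points = injection_points(trace)
    unknown = set(failures) - set(points)
    if unknown:
        raise ValueError(f"not on the traced path: {sorted(unknown)}")
    return {point: failures.get(point, "succeed") for point in points}
```

Rejecting failures for points that are not on the traced path catches scenarios that could never take effect for that request.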
Netflix is developing a set of automated failure tests that can be run continuously. These automated tests would identify every dependency of a particular function and fail each one individually. An example of a function is a user launching Netflix on a device, browsing to pick a video, and beginning to stream.
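The idea of failing each dependency individually can be sketched as a simple loop. This is an illustration only: `fail_each_dependency`, the toy function and the dependency names are assumptions, and a real test would drive actual requests carrying failure metadata instead of calling a local stand-in.

```python
from typing import Callable, Dict, Iterable, Set


def fail_each_dependency(
    dependencies: Iterable[str],
    run_function: Callable[[Set[str]], bool],
) -> Dict[str, bool]:
    """Run the function once per dependency with only that dependency failed.

    run_function receives the set of failed dependencies and returns True
    if the user-facing function still worked (possibly in degraded form).
    """
    return {dep: run_function({dep}) for dep in dependencies}


# Toy stand-in for "launch, browse, start streaming": assumed to need only
# "playback" strictly, with the other dependencies having fallbacks.
def toy_stream_function(failed: Set[str]) -> bool:
    return "playback" not in failed
```

The resulting map separates dependencies the function tolerates losing from those it cannot do without, which is the information a continuously running test would report.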