AWS engineers recently wrote about AWSSSMChaosRunner, an open source chaos engineering tool that they used for fault injection testing in Prime Video. The tool is built on AWS Systems Manager, which can execute arbitrary commands on EC2 instances, and the team used it to mitigate latency-related issues.
AWSSSMChaosRunner uses AWS Systems Manager to remotely execute commands against a specific set of EC2 instances. The commands, specified declaratively as a collection, define the set of faults to inject.
Varun Jewalikar, software engineer at Prime Video, and Adrian Hornsby, principal developer advocate (Architecture) at AWS, write that typical chaos engineering experiments include simulating resource exhaustion and a failed or slow network. There are countermeasures for such scenarios but "they are rarely adequately tested, as unit or integration tests generally can't validate them with high confidence".
AWS Systems Manager performs operational tasks across AWS resources through an agent component called SSM Agent, which is pre-installed by default on certain Windows and Linux AMIs. Systems Manager has the concept of "Documents", which are similar to runbooks and are executed by the agent. It can also run simple shell scripts, a feature that AWSSSMChaosRunner leverages. The SendCommand API in SSM runs commands across multiple instances, which can be selected by AWS tags. CloudWatch can be used to view logs from all the instances in a single place.
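As a rough illustration of the SendCommand call described above, the following Python sketch builds a request that runs shell commands on all instances carrying a given tag. It is not taken from AWSSSMChaosRunner. The function names, the tag key and value, and the default timeout are assumptions made for this example. The request fields (DocumentName, Targets, Parameters, TimeoutSeconds, CloudWatchOutputConfig) and the AWS-RunShellScript document are part of the public SSM API.

```python
from __future__ import annotations

from typing import Any


def build_send_command_request(
    commands: list[str],
    tag_key: str,
    tag_values: list[str],
    timeout_seconds: int = 60,
) -> dict[str, Any]:
    """Build keyword arguments for the SSM SendCommand API.

    Hypothetical helper for illustration; only the returned field names
    come from the SSM API itself.
    """
    if not commands:
        raise ValueError("at least one shell command is required")
    return {
        # AWS-managed document that runs shell commands on Linux instances.
        "DocumentName": "AWS-RunShellScript",
        # Select instances by tag instead of listing instance IDs.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": tag_values}],
        "Parameters": {"commands": commands},
        "TimeoutSeconds": timeout_seconds,
        # Ship command output to CloudWatch Logs so it can be viewed in one place.
        "CloudWatchOutputConfig": {"CloudWatchOutputEnabled": True},
    }


def send(request: dict[str, Any]) -> str:
    """Send the request with boto3 and return the SSM command ID."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("ssm").send_command(**request)
    return response["Command"]["CommandId"]
```

Calling build_send_command_request(["uptime"], "Service", ["checkout"]) produces a request aimed at every instance tagged Service=checkout. Passing the result to send() requires boto3, AWS credentials, and instances with a running SSM Agent.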
The agent takes care of security aspects such as creating a user for execution on the EC2 instance. Examples of what the chaos runner can do include silently dropping all outgoing TCP traffic on a specific port, introducing network latency on an interface, and hogging CPU. The currently supported failure injections operate either at the infrastructure layer or at the AWS service layer.
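The faults listed above can be pictured as a declarative collection that is translated into shell commands. The Python sketch below shows one way to do that with common Linux tools (iptables, tc with netem, and a busy loop). The class names, the plan function, and the exact commands are assumptions for illustration, and AWSSSMChaosRunner's own implementation may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Protocol, Sequence

# Conservative pattern for interface names, used to keep input shell-safe.
_INTERFACE_NAME = re.compile(r"^[A-Za-z0-9_.-]{1,15}$")


class Fault(Protocol):
    def inject(self) -> list[str]: ...
    def revert(self) -> list[str]: ...


@dataclass(frozen=True)
class DropOutgoingTcp:
    """Silently drop all outgoing TCP traffic to one destination port."""
    port: int

    def __post_init__(self) -> None:
        if not 1 <= self.port <= 65535:
            raise ValueError(f"invalid TCP port: {self.port}")

    def inject(self) -> list[str]:
        return [f"iptables -A OUTPUT -p tcp --dport {self.port} -j DROP"]

    def revert(self) -> list[str]:
        return [f"iptables -D OUTPUT -p tcp --dport {self.port} -j DROP"]


@dataclass(frozen=True)
class NetworkLatency:
    """Add a fixed delay to all packets leaving one network interface."""
    interface: str
    delay_ms: int

    def __post_init__(self) -> None:
        if not _INTERFACE_NAME.match(self.interface):
            raise ValueError(f"invalid interface name: {self.interface!r}")
        if self.delay_ms <= 0:
            raise ValueError("delay_ms must be positive")

    def inject(self) -> list[str]:
        return [f"tc qdisc add dev {self.interface} root netem delay {self.delay_ms}ms"]

    def revert(self) -> list[str]:
        return [f"tc qdisc del dev {self.interface} root netem"]


@dataclass(frozen=True)
class CpuHog:
    """Keep every core busy for a fixed time; the load ends on its own."""
    duration_seconds: int

    def __post_init__(self) -> None:
        if self.duration_seconds <= 0:
            raise ValueError("duration_seconds must be positive")

    def inject(self) -> list[str]:
        busy_loop = f"timeout {self.duration_seconds} sh -c 'while :; do :; done'"
        return [f'for i in $(seq "$(nproc)"); do {busy_loop} & done; wait']

    def revert(self) -> list[str]:
        return []  # nothing to undo, the busy loops time out by themselves


def plan(faults: Sequence[Fault]) -> tuple[list[str], list[str]]:
    """Return (inject commands in order, revert commands in reverse order)."""
    inject = [cmd for fault in faults for cmd in fault.inject()]
    revert = [cmd for fault in reversed(faults) for cmd in fault.revert()]
    return inject, revert
```

For example, plan([DropOutgoingTcp(443), NetworkLatency("eth0", 200)]) yields two command lists that could be passed to a remote execution mechanism such as SSM. Reverting in reverse order is a design choice that mirrors how nested changes are usually undone.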
AWSSSMChaosRunner originated from a set of SSM documents written for injecting faults into AWS resources. According to the article, after the document is executed using standard SSM Agent APIs, the load generation component simulates real-world traffic against the applications. AWSSSMChaosRunner can also be used against ECS, but not against Lambda, as the latter is a fully managed service. There are other approaches for injecting faults into AWS Lambda.
Prime Video, which uses AWS services under the hood, used AWSSSMChaosRunner to test performance when dependent services have high latency. Jewalikar and Hornsby mention that it helped them fix a bug in the ElastiCache timeout configuration.
There are other libraries for running chaos engineering experiments, one of the earliest being Netflix's Chaos Monkey. Other companies have written their own frameworks, such as LinkedIn's Project Waterbear and Twitter's Python library. Companies like Gremlin offer fault injection as a service.
The source code for AWSSSMChaosRunner is available on GitHub.