Increasing the Resilience of APIs with Chaos Engineering

The Gremlin team has described a simple chaos experiment as a method of validating that an organisation's APIs are resilient. Using the principles of chaos engineering and techniques like running "game days" (a fire drill for IT systems and people) can provide value, as can the appropriate use of commercial and open source tooling emerging within this space.

Tammy Butow, principal site reliability engineer at Gremlin Inc, begins the blog post by noting that although many organisations expose their services (and provide core business value) through web-based APIs, in her experience these APIs and the associated infrastructure are often considered "second-class citizens". As an organisation scales, there is a risk that a failure in the API layer can result in a degraded user experience or a high severity incident. In a related pattern, increased usage of an API can also increase the strain on the associated backend system, and the load exerted as the number of requests increases may not have a linear relationship with performance and reliability. Engineers should therefore formulate and run experiments to understand the impact of increased load, degraded systems, and infrastructure failure, and ultimately design systems to mitigate these risks.

Butow suggests that one of the best ways to develop this understanding and design experiments is through the use of the principles of Chaos Engineering and Game Days. For readers unfamiliar with the phrase, at QCon San Francisco Adrian Cockcroft described game days as "the fire drill for IT". Unexpected application behaviour or infrastructure failure often causes engineers to intervene and make the situation worse; in everyday life fire drills save lives in the event of a real fire, because people are trained how to react, and in the world of IT game days perform the same function.

The Gremlin blog post provides sample scripts to simulate heavy load against a typical API gateway, and describes how the commercial Gremlin "resilience as a service" SaaS platform can be used to inject failure (such as high CPU or memory usage, or complete termination) into the compute instance running the API gateway. Butow stresses in the blog post (and in her previous QCon London talk) that monitoring and observability must be in place before beginning to run chaos experiments.
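As a rough illustration of the kind of load script described above, the following Python sketch fires concurrent requests and summarises the error rate and 95th-percentile latency. It is not the script from the Gremlin blog post: the function names (`http_get`, `run_load`, `summarise`) and any URL passed to them are assumptions made for this example, and it uses only the Python standard library.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=5.0):
    """Issue one GET request; return the HTTP status code, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except Exception:
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def run_load(send, total_requests, concurrency):
    """Call send() total_requests times across `concurrency` worker threads.

    Returns a list of (status, latency_seconds) tuples.
    """
    def timed_call(_):
        start = time.perf_counter()
        status = send()
        return status, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def summarise(results):
    """Reduce raw results to request count, error rate, and p95 latency."""
    if not results:
        raise ValueError("no results to summarise")
    latencies = sorted(latency for _, latency in results)
    errors = sum(1 for status, _ in results if not 200 <= status < 300)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    p95_index = max(0, int(round(0.95 * len(latencies))) - 1)
    return {
        "requests": len(results),
        "error_rate": errors / len(results),
        "p95_seconds": latencies[p95_index],
    }
```

A run against a hypothetical gateway endpoint might look like `summarise(run_load(lambda: http_get("https://api.example.test/health"), 500, 50))`. Capturing these numbers before and during a failure injection gives a simple baseline to compare against, which fits the emphasis on having monitoring in place first.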

"Running chaos experiments on a consistent basis is one of many things you can do to begin measuring the resiliency of your APIs. Making sure you have good visibility (monitoring) and increasing your fallback coverage will all help strengthen your own systems."

The discipline of chaos engineering is moving into mainstream adoption, driven by commercial chaos tool and service businesses like Gremlin, but also by pioneers in the space like Netflix (the creators of the original Chaos Monkey), community-led efforts such as the Chaos Toolkit, and enterprise organisations like Expedia and Bloomberg (the latter of which has released the Kubernetes-specific "PowerfulSeal" chaos tool as open source).

The rise in popularity of chaos engineering is leading organisations to consider whether to build or buy associated tooling. Here the Gremlin team recommends that engineers consider tradeoffs such as the Total Cost of Ownership (TCO) of building their own chaos experimentation platform, the capability (and desirability) of exposing internal systems to an external SaaS platform, the current team skill set, and how much control the team will need over the platform roadmap. Several thought leaders within this space, such as John Allspaw, co-founder at Adaptive Capacity Labs, are cautioning that the human side of resilience engineering should not be forgotten, and is in fact more important than the associated tooling.

Additional information on Gremlin can be found on the organisation's website, and the inaugural Chaos Conf will run on 28th September 2018 in San Francisco.
