
Gremlin Announces General Availability of Status Checks


Gremlin, an organization focused on chaos engineering, recently announced the general availability of Status Checks. This new feature automatically validates that systems are healthy and ready before chaos experiments run in production. Status Checks integrate with CI/CD pipelines and with third-party tools such as PagerDuty, Datadog, New Relic, or any other monitoring tool.

Safety is one of the "core aspects" of Gremlin. Matthew Fornaciari, CTO and co-founder of Gremlin, elaborated:

"Since launch in 2017, we’ve had a big red HALT button that makes it simple for Gremlin users to reactively rollback experiments, should an attack negatively impact the customer experience. Today, companies that have matured are automating more of their experiments with CI/CD, and they need a way to programmatically check the health of their systems and proactively stop an experiment. That’s Status Checks."

In the past, companies addressed safety concerns by running experiments in staging environments and then applying what they learned to production. Because staging environments do not always mirror production accurately, this approach has limited value. A Status Check makes an API call to a third-party monitoring or alerting endpoint, evaluates the status code, request-response time, and JSON response body, and then returns a result (pass or fail) depending on system conditions. If the system is healthy, Gremlin runs a chaos engineering experiment on it before continuing with another status check.
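The evaluation described above can be illustrated with a minimal Python sketch. It is not Gremlin's implementation or API: the names `ProbeResult`, `evaluate`, and `run_guarded`, the 500 ms default threshold, and the JSON key and value are all assumptions made for illustration. The sketch only shows the general shape of the idea: check status code, response time, and JSON body, return pass or fail, and re-check before each step of an experiment.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class ProbeResult:
    """What one call to a monitoring endpoint returned (hypothetical shape)."""
    status_code: int
    elapsed_ms: float
    body: str


def evaluate(result: ProbeResult,
             max_elapsed_ms: float = 500.0,
             json_key: Optional[str] = None,
             expected_value: Any = None) -> bool:
    """Return True (pass) only if every configured condition holds."""
    # Condition 1: the endpoint answered with a 2xx status code.
    if not 200 <= result.status_code < 300:
        return False
    # Condition 2: the response arrived within the allowed time.
    if result.elapsed_ms > max_elapsed_ms:
        return False
    # Condition 3 (optional): a field in the JSON body has the expected value.
    if json_key is not None:
        try:
            payload = json.loads(result.body)
        except json.JSONDecodeError:
            return False
        if not isinstance(payload, dict) or payload.get(json_key) != expected_value:
            return False
    return True


def run_guarded(probe: Callable[[], ProbeResult],
                steps: Sequence[Callable[[], None]],
                **conditions: Any) -> int:
    """Check health before each experiment step; return how many steps ran."""
    completed = 0
    for step in steps:
        if not evaluate(probe(), **conditions):
            break  # system looks unhealthy: stop before injecting more failure
        step()
        completed += 1
    return completed
```

In this sketch, `probe` would wrap an HTTP request to whatever monitoring endpoint is in use, and each entry in `steps` would trigger one stage of an experiment. Keeping the evaluation separate from the HTTP call makes the pass or fail logic easy to test without a live system.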

Expanding on the approach, Ana Margarita Medina, chaos engineer at Gremlin, said, "The point of chaos engineering is not to add unnecessary chaos. You want to control the chaos in your system." Taking this point further, Matt Schillerstrom, product manager at Gremlin, explained to SD Times in an email, "It’s very important to note that Gremlin doesn’t advocate for ‘chaos’ — the term chaos engineering can be a little misleading. We advocate for hypothesis-driven testing, in order to tame the chaos. To better understand our systems in order to prevent chaos. It does no one any good to be attacking infrastructure that’s already under stress."

Status Checks give organizations additional confidence and safety, so that chaos experiments can be scheduled regularly and automated.

While the world's top-performing IT organizations have adopted chaos engineering, safety remains a major concern for companies that have yet to adopt the practice. Jim Scheibmeir, senior principal analyst at Gartner, said, "Many organizations approach the concept of Chaos Engineering with the attitude that the practice is far too risky to execute into production. The reality is that avoiding Chaos Engineering is equivalent to embracing crisis engineering."
