Gremlin Releases "Resilience as a Service" SaaS Platform to Enable More Effective Chaos Engineering

Gremlin Inc has released Gremlin, a "resilience as a service" SaaS-based platform, which allows organisations to "break things on purpose" and conduct chaos experiments in order to help prevent application downtime before it happens. Gremlin allows the controlled injection of resource, network and state failure to managed infrastructure so that engineers can observe the behaviour of the system operating under these conditions. Gremlin also provides an 'undo' button, and automatically cleans up if things go wrong.

The concept of chaos engineering and resilience testing has become increasingly popular over the last year, even though pioneers such as Netflix have been talking about this for quite some time. Tooling such as Netflix's Chaos Monkey (and the associated Simian Army collection) is relatively mainstream, and many recent conference presentations feature a mention of chaos. However, use of this technology often requires an advanced level of infrastructure and operational skill, the ability to design and execute experiments, and available resources to manually orchestrate the failure scenarios in a controlled manner -- chaos engineering is not simply about breaking things in production.

The Gremlin platform provides a web-based GUI for executing and managing chaos experiments on compute instances on which a Gremlin daemon (agent) has been installed. The daemon can be installed on Linux via Debian and RPM packages, and there is also a Docker installation option (with accompanying Kubernetes support) for organisations running applications within containers.

Gremlin chaos as a service application UI

The Gremlin web UI allows a series of failure "attacks" to be issued in a controlled fashion against infrastructure, including:

  • Resource Gremlins
    • CPU: Generates high load for one or more CPU cores.
    • Memory: Allocates a specific amount of RAM.
    • IO: Puts read/write pressure on I/O devices such as hard disks.
    • Disk: Writes files to disk to fill it to a specific percentage.
  • Network Gremlins
    • Blackhole: Drops all matching network traffic.
    • Latency: Injects latency into all matching egress network traffic.
    • Packet loss: Induces packet loss into all matching egress network traffic.
    • DNS: Blocks access to DNS servers.
  • State Gremlins
    • Shutdown: Reboots or halts the host operating system, allowing you to test, for example, how your system behaves when losing one or more cluster machines.
    • Time travel: Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
    • Process killer: An attack which kills the specified process, which can be used to simulate application or dependency crashes.
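To make one of these primitives concrete, the Disk attack's "fill to a specific percentage" behaviour, combined with automatic cleanup, could be sketched roughly as follows. This is a minimal Python illustration and not Gremlin's implementation: the names `bytes_needed` and `filler_file` are hypothetical, and a real agent would have to deal with many more edge cases (concurrent writers, quotas, reserved blocks).

```python
import contextlib
import os
import tempfile

CHUNK_BYTES = 1024 * 1024  # write filler data in 1 MiB chunks


def bytes_needed(total_bytes: int, used_bytes: int, target_percent: float) -> int:
    """Return how many extra bytes must be written to reach target_percent usage.

    Hypothetical helper: returns 0 if the disk is already at or above the target.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    target_used = int(total_bytes * target_percent / 100)
    return max(0, target_used - used_bytes)


@contextlib.contextmanager
def filler_file(directory: str, size_bytes: int):
    """Write size_bytes of filler into directory and always remove it on exit.

    The 'finally' clause plays the role of the undo step: the file is deleted
    whether the observation phase finishes normally or raises an exception.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="chaos-fill-")
    try:
        with os.fdopen(fd, "wb") as handle:
            remaining = size_bytes
            while remaining > 0:
                step = min(CHUNK_BYTES, remaining)
                handle.write(b"\0" * step)
                remaining -= step
        yield path  # the caller observes the system while the disk is fuller
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)
```

In practice the inputs to `bytes_needed` could come from `shutil.disk_usage(path)`, which reports `total` and `used` bytes for the filesystem containing `path`.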

Attacks can be run ad hoc, programmatically, or on a schedule. Scheduled attacks can be restricted to certain days and to a specified time window, and it is possible to set the maximum number of attacks a schedule can spawn. Gremlin provides an 'undo' button and automatically cleans up if things go wrong. Security is "built in from the ground up" using least permissions, multi-factor authentication, auditing, and role based access control (RBAC).
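The scheduling constraints described above (allowed days, a time window, and a cap on spawned attacks) can be pictured as a simple gate that is checked before each attack starts. The sketch below is purely illustrative: `AttackSchedule` and its fields are hypothetical names, not part of any Gremlin API, and the window is assumed not to cross midnight.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class AttackSchedule:
    """Hypothetical model of a schedule; not Gremlin's actual data model."""

    allowed_weekdays: frozenset[int]  # Monday is 0, Sunday is 6
    window_start: time                # inclusive start of the daily window
    window_end: time                  # exclusive end of the daily window
    max_attacks: int                  # cap on attacks this schedule may spawn
    spawned: int = 0                  # attacks spawned so far

    def may_spawn(self, now: datetime) -> bool:
        """Return True only if every scheduling constraint allows an attack now."""
        if self.spawned >= self.max_attacks:
            return False
        if now.weekday() not in self.allowed_weekdays:
            return False
        return self.window_start <= now.time() < self.window_end

    def record_spawn(self) -> None:
        """Count one spawned attack against the cap."""
        self.spawned += 1


# Example: weekdays only, during office hours, at most two attacks.
office_hours = AttackSchedule(
    allowed_weekdays=frozenset({0, 1, 2, 3, 4}),
    window_start=time(9, 0),
    window_end=time(17, 0),
    max_attacks=2,
)
```

Keeping the check separate from the bookkeeping (`may_spawn` versus `record_spawn`) means the gate can be queried without side effects, for example by a UI that previews when the next attack could run.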

The core value proposition of Gremlin is to allow engineers to initiate, control and observe how a system behaves under various failure conditions caused by the attacks. Gremlin does not provide automated canarying (offered, for example, by Barometer) or automated failure detection (offered, for example, by LightStep), but instead provides a comprehensive set of failure primitives that can be used to design experiments and observe what happens when issues occur within a complex distributed system. Gremlin also does not require modifications to a deployment pipeline or networking infrastructure, and so can be utilised more easily for a range of infrastructure and deployment paradigms (e.g. bare metal, cloud/IaaS, or containers).

InfoQ recently sat down with Kolton Andrus, CEO and co-founder of Gremlin Inc, and discussed the role of chaos engineering within software systems, the value of resilience testing, and the future of the Gremlin platform.

InfoQ: Could you briefly introduce yourself please, and also explain a little about your background in chaos engineering?

Kolton Andrus: I'm Kolton Andrus, CEO and co-founder of Gremlin, and we're building more resilient systems through chaos engineering. We're offering failure-as-a-service.

At Amazon and Netflix, I served as a Call Leader, responsible for managing and resolving high-pressure incidents to keep our services up. We were all too familiar with the pain of middle-of-the-night disruptions to identify and fix critical issues.

While the roots of chaos engineering have been around for over a decade, and many great open source tools have helped to pioneer the way, the industry needs more. All of us at Gremlin have solved this problem before, so we're familiar with what engineers need to safely and securely execute these experiments, and then to automate them, in a simple and easy to use tool.

InfoQ: Chaos engineering appears to be a current "hot topic" -- there are numerous blog posts being created, and the topic also appeared in Nora Jones' AWS re:Invent keynote -- but it's not a particularly new topic (both Netflix and AWS have been talking about this for some time). Why do you think this is popular now?

Andrus: AWS and Netflix experienced the need sooner, but the complexity of distributed systems is beginning to affect all companies, big and small; we're at an inflection point. The new way of building distributed systems is much more complicated, making it hard to know what will fail and when. In the old world, software was running in a controlled, bare metal environment with fewer variables. In the new world, software is reliant on infrastructure and services outside of our control.

The adoption of cloud computing and the trend of microservices have created infrastructure that continues to mature and reveal new ways to develop, deploy, and operate applications that were never before possible. However, this has created a complexity gap - systems are too complex for any engineer, or team of engineers, to understand, and that has made failure inevitable.

InfoQ: What is the biggest benefit of chaos engineering or resilience testing? Is this appropriate only for large organisations with dedicated infrastructure teams?

Andrus: A lot of companies are operating in reactive mode when it comes to outages. What many don't realize is that you can identify these bugs and prevent them before they wreak havoc, saving time and headache for your engineering team, and also saving millions to the bottom line. Any team that is on-call and supports a service or system can benefit from this approach, in reduced operational burden and downtime.

Chaos engineering allows for thoughtful, planned experiments to introduce failure into systems to identify and fix unknown faults - ultimately building robust and resilient systems. It's like a flu shot for your application.

Traditionally, people think of chaos engineering as randomly breaking systems, but we believe it works best when it's highly curated and automated as a part of regular testing. This will enable teams to create benchmarks to measure against, as well as ensure the system is constantly improving as variables evolve.
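One way to picture a curated, automated experiment of this kind is as a regular test with three phases: measure the steady state, inject a failure, and confirm that the system degrades gracefully and then recovers. The following is a minimal in-process Python sketch under stated assumptions. `RecommendationClient`, `homepage`, and `injected_failure` are hypothetical names invented for illustration, and the failure is simulated with a flag rather than injected by Gremlin or any other tool.

```python
import contextlib


class RecommendationClient:
    """Hypothetical dependency that can be switched into a failing mode."""

    def __init__(self) -> None:
        self.failing = False

    def fetch(self, user_id: str) -> list:
        if self.failing:
            raise ConnectionError("injected failure")
        return [f"item-for-{user_id}"]


def homepage(client: RecommendationClient, user_id: str) -> list:
    """System under test: falls back to a static list if the dependency fails."""
    try:
        return client.fetch(user_id)
    except ConnectionError:
        return ["popular-item"]


@contextlib.contextmanager
def injected_failure(client: RecommendationClient):
    """Turn the failure on, and always turn it off again afterwards."""
    client.failing = True
    try:
        yield
    finally:
        client.failing = False


def run_experiment(client: RecommendationClient, user_id: str = "u1"):
    """Return (baseline, degraded, recovered) responses for comparison."""
    baseline = homepage(client, user_id)       # steady state before the attack
    with injected_failure(client):
        degraded = homepage(client, user_id)   # behaviour during the failure
    recovered = homepage(client, user_id)      # steady state after cleanup
    return baseline, degraded, recovered
```

Recording the baseline alongside the degraded and recovered results is what gives a team a benchmark to compare against from one run to the next.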

InfoQ: Can you talk a little about the biggest challenges for organisations looking to implement chaos engineering?

Andrus: The biggest challenge today is education. Chaos engineering as a discipline is just starting to become a topic of discussion within the engineering community, though many don't know where to begin. On the other side, executive teams don't realize that many outages are preventable, and that's a preconceived notion we have to correct.

We've just launched Gremlin, and for us we see this next year as focused entirely on education. That's why we partner with companies to help them understand the philosophy and how it's beneficial to them.

One of the ways we do that is through Gamedays, where we work directly with engineering teams to test drive Gremlin attacks before they've even signed on to become a customer. That way they can see firsthand how their system will react in the face of failure and the value the tool brings.

InfoQ: What has the uptake of Gremlin been like? Can you share any success stories, or explain ways in which the tool has been used effectively?

Andrus: As I've mentioned earlier, we've just launched, but early signs are strong. We're working with more than 12 customers, including the likes of Expedia, Twilio, Confluent, Remind, and many others.

We ran a Gameday with a team from Confluent. They had run this previously by hand, and it had taken their entire team a full day. We were able to run all of their scenarios and more, in under two hours. Twilio has blogged about automating some of their failure experiments using Gremlin.

InfoQ: Can you share any of the roadmap for Gremlin? What does the future hold for chaos engineering?

Andrus: What is next for chaos is more awareness and increased adoption. Ultimately, I see a world where this will become something that every company is creating budget for, and it will be incorporated into university CS curricula. We're starting to see a few schools do it already, and we think it's just the beginning.

InfoQ: Thanks for answering our questions today! Is there anything else you would like to share with the InfoQ audience?

Andrus: We're excited to help make the internet more reliable for everyone. We all rely heavily on it, and when things break it has a substantial impact on our lives. By helping teach the industry, and helping our customers be successful, we can have a small impact on our society as a whole.

Gremlin Inc has recently secured a Series A round of funding worth $7.5M, and early customers include Expedia, Twilio and Confluent. Additional information on Gremlin can be found on the website, and further information on The Principles of Chaos Engineering can be found on the website run by the Chaos Community.

Community comments

  • Could be useful...

    by Greg Liebowitz,

    I considered something similar and registered faultlab.io but never went further :) I assume non-Netflix clients won't be keen on deliberately attacking prod, so you still need an automated mechanism to stress test the system in isolation, as well as have instrumentation to identify faults and event causality.
