Kolton Andrus on Gremlin’s Newly Announced SaaS Chaos Engineering Product and Running Game Days

Dec 22, 2017

Gremlin is a Software as a Service that lets you plan, control and undo Chaos engineering experiments built by engineers with experience from Netflix, AWS, Dropbox and others. In this podcast Wes talks to Kolton Andrus about the Gremlin product and architecture and related topics such as running Game Days.

Key Takeaways

Chaos Engineering is about thoughtful, planned experiments. An example might be a Game Day where we get a group of engineers in a room to whiteboard out what can go wrong, and then run experiments to test our assumptions, such as what happens when log files fill up the disk.
In general the advice is to minimise your blast radius; start with the smallest thing that you can do that will teach you something about your system - a single instance, a single container - and once you’ve got faith in that, scale up and repeat.
Gremlin, which is based on the idea of chaos engineering or “resilience as a service”, is now available as a product, with early customers including Twilio, Expedia, Confluent and Remind. There are open-source alternatives, but to our knowledge it’s the first enterprise product.
Examples of the Gremlins you can inject include things that consume resources (for example memory, CPU, disk-space, IO overhead), things that change the state of the host or the VM (kill processes, time travel such as a leap second or a clock skew, rebooting hosts and containers), and things that change the network (we can’t resolve DNS, the service is slow and so on).
The Gremlin client is written in Rust. Rust has deterministic memory management and treats failure as a first-class citizen: function signatures encode success or failure in a result type. The service is written in Java running on AWS, and the web interface is written in Ember.

What’s the big news for Gremlin?

1:45 Yesterday was our big launch day - we publicly spoke about what we’ve been building.
2:00 Gremlin - our failure as a service - is now publicly available, and is enterprise ready.
2:15 We closed our Series A round earlier this year - $7.5 million.

What is resilience as a service?

3:30 Failure as a service resonates with engineers, but it is a means to an end.
3:40 What we’re really focussed on is helping companies be more resilient to failure.
4:05 We want to make software stronger.
4:30 It’s a bit like inoculation from a virus - we cause a little bit of harm, but similar to what would happen normally.
4:35 It helps to verify our defences.

What is chaos engineering?

4:55 It’s about thoughtful, planned experiments that teach us how things can go wrong.
5:10 We love Chaos Monkey - it was a pioneering software release.
5:15 However, because of Chaos Monkey people think that the only way to do chaos engineering is to randomly break stuff.
5:20 We don’t think that’s the right way to go - thoughtful experiments are much better both at revealing where the system is weak and at training your teams.
5:35 It’s not just the technology - it’s the people as well.

What is a thoughtful planned experiment?

5:55 A game day is a group of engineers in a room thinking about what could go wrong.
6:15 We have engineers whiteboard out their service.
6:20 Often the first iteration doesn’t have all the details - where the configuration is stored, for example.
6:30 Next we draw out all of the network connections and how the components connect to each other.
6:35 We can then look at what would happen if we break a connection, and predict how the system would behave.
6:45 Just knowing what could go wrong, and what you think could happen, helps you learn a lot.
6:50 The next step is to execute the experiment.
7:00 At Gremlin we provide the tooling and help plan the game day, but we let the engineers drive.
7:15 Often what we see is that our assumptions were incorrect and there are some subtleties as to what happens.

So you could run an experiment on, say, log files filling up a drive?

7:45 We wrote a blog post about that [https://www.gremlin.com/wont-get-fooled-again/] because we ran into this exact problem ourselves.
8:05 So we ran a game day and filled up our disks, and made sure that log rotation happened and that things continued to operate as expected.

How do you actually run the experiment?

8:35 There’s a couple of ways this can be done.
8:40 There’s a concept of a blast radius - you want to start with the smallest thing that will teach you something about your system.
8:50 For example, it might be a single host, single container, or in the lambda world, a single function.
9:00 You test that assumption - and if things break, the cost of that breakage is low.
9:10 Once you see how it behaves, and you see how it affects your system, you increase the blast radius.
9:20 Think about what could go wrong; plan the experiment; run it on the smallest thing you can learn from and scale up as appropriate.
9:45 A lot of people tend to build up their confidence in a staging environment - if you then switch to running in production, reset the blast radius and start small again.
10:10 At the end of the experiment, when you have fixed any issues, you then automate it so you can’t be hit by it again.
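
The start-small, then scale-up loop described above can be sketched in Rust, the language the Gremlin client is written in. This is an illustration only, not Gremlin's implementation: `escalate`, `Outcome` and the doubling schedule are all assumptions made for the example, and the experiment itself is a stand-in closure.

```rust
/// Hypothetical outcome of one experiment run against a set of targets.
#[derive(Debug, PartialEq)]
enum Outcome {
    Healthy,
    Degraded,
}

/// Runs `experiment` against progressively larger slices of `targets`,
/// doubling the slice each time, and stops at the first degraded result.
/// Returns the largest blast radius (number of targets) that stayed healthy.
fn escalate<F>(targets: &[&str], mut experiment: F) -> usize
where
    F: FnMut(&[&str]) -> Outcome,
{
    let mut radius = 1; // start with the smallest scope that can teach us something
    let mut last_healthy = 0;
    while radius <= targets.len() {
        if experiment(&targets[..radius]) == Outcome::Degraded {
            break; // stop here, fix the issue, then repeat at this size
        }
        last_healthy = radius;
        if radius == targets.len() {
            break; // every target has been covered
        }
        radius = (radius * 2).min(targets.len());
    }
    last_healthy
}

fn main() {
    let hosts = ["web-1", "web-2", "web-3", "web-4", "web-5"];
    // Stand-in experiment: pretend the system copes with at most two failing hosts.
    let safe = escalate(&hosts, |slice| {
        if slice.len() <= 2 {
            Outcome::Healthy
        } else {
            Outcome::Degraded
        }
    });
    println!("largest healthy blast radius: {} host(s)", safe);
}
```

When moving from staging to production, the same loop would simply be started again from a radius of one.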

Are there any products like Gremlin on the market?

10:55 There are a lot of open-source projects in varying levels of maturity, but we think we’re the only enterprise-ready product.
11:40 Simplicity is a key tenet of our product.

What does Gremlin give a software developer?

12:25 In a game day process, the developers are the ones who know the systems.
12:35 When it comes to running experiments quickly, Gremlin provides the platform to do that.
12:50 We think safety is important, and having an undo button in case things go wrong is a key differentiator.
13:30 For filling up disks, a Gremlin will write files to a particular location - but if you undo it, it will delete those files.
13:40 We also have a failsafe mode - if something goes wrong in the execution of that test, then we will automatically halt and undo those tests.
14:00 Another example is the CPU Gremlin, which consumes resources - if you undo that, you’re just killing that process.
14:10 We also have another Gremlin which can modify system time on the box - but we have the ability to revert to what the time should be.
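
One way to picture the undo button for the disk example is Rust's `Drop` trait, which runs cleanup code at a known point. The sketch below is an assumption about how such an attack could be structured, not a description of Gremlin's code: `DiskFill`, its `start` function and the file name are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::PathBuf;

/// Hypothetical disk-fill attack: writes filler to one file and removes it on undo.
struct DiskFill {
    path: PathBuf,
}

impl DiskFill {
    /// Writes `bytes` bytes of filler to `path`.
    fn start(path: PathBuf, bytes: usize) -> io::Result<DiskFill> {
        // Build the guard first, so a failed write still cleans up a partial file.
        let fill = DiskFill { path };
        let mut file = File::create(&fill.path)?;
        file.write_all(&vec![0u8; bytes])?;
        Ok(fill)
    }
}

impl Drop for DiskFill {
    /// The undo step: runs on an explicit `drop`, when the value goes out of
    /// scope, and when the stack unwinds after a panic.
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("diskfill-sketch.bin");
    let attack = DiskFill::start(path.clone(), 1024 * 1024)?;
    println!("filler exists: {}", path.exists());
    drop(attack); // press the undo button
    println!("filler exists after undo: {}", path.exists());
    Ok(())
}
```

Tying the cleanup to the value's lifetime means a failure partway through the attack takes the same undo path as a deliberate halt.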

So what kind of Gremlins have you built?

14:35 We have three top-level categories: things that consume resources, things that change the state of the host, and things that change the network.
15:00 The resource gremlins are ones that max out CPU, hold onto memory, eat disk space, or add IO overhead.
15:15 The state ones are things like killing processes, simulating what happens when daylight savings or leap seconds are applied, or rebooting hosts.
15:55 The network ones are important - we live in a micro-services world, where one remote micro-service can break your own.
16:10 For example, what happens when we can’t perform DNS lookups, or what happens if a service is slow or experiencing a lot of packet loss or corruption.
16:30 We make these easy to scope - for example, we can limit traffic to www.google.com.
16:35 We can also fail Amazon specific services; for example, inhibit traffic from S3 on us-east-1.

How do you maintain the blast radius with Gremlin?

17:05 It’s about how you do chaos engineering properly, as well as the tooling built into Gremlin.
17:10 It’s a process approach - you want to start small.
17:20 We make it very easy in Gremlin to filter and run experiments on a small set of hosts.
17:25 For example, show me all the hosts in this service, or in this zone.

How do you handle fleets of machines?

18:00 We made it very easy to automate the deployment of Gremlin.
18:20 However you manage your infrastructure, Gremlin fits in seamlessly.

What is the purpose of the APIs?

18:30 Being a developer-focussed company, we built the command line interface first, followed by the API and then lastly the web-based interface.
18:45 Everything you can do in the web interface, you can do via the API.
18:55 You can do manual testing and exploration via the web interface, but you will want to ultimately automate it, so having an API lets that happen.

What have the customer experiences been like?

19:20 It’s been great to see the interest in the industry as a whole.
19:35 For a lot of large enterprises, chaos engineering is something that they’re looking at doing in the future.
19:40 The value we’ve seen at the beginning is about saving engineers’ time.
19:45 We ran a game day with Confluent; they had run one themselves earlier that week, and it took their team an entire day to run all the failures they wanted.
19:55 We showed up and ran the entire thing in two hours.
20:30 Expedia talked at QCon SF about their exploratory use of Gremlin.
20:40 Now they’re moving into doing more automated testing and making it a regular part of their process.
20:50 They had some great content on scorecards and on the maturity and readiness of teams.
20:55 It’s part tooling, part teaching.

What’s the story around containers and orchestrators?

21:45 We built Gremlin using containers from the beginning.
21:55 It was clear we needed to support containers as first-class citizens.
22:00 You can install Gremlin as a Debian package, a RedHat package, or as a container.
22:10 You can configure the container to attack itself, or to attack other containers running on the same host.
22:15 The ability to run attacks only on containers that match specific [Kubernetes] labels can help limit the blast radius.
22:30 We test with Kubernetes and Amazon ECS and support the tools that our customers are using.
22:35 The ability to break failure testing down into finer granularity is powerful.
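
Label-based targeting can be sketched as a simple filter. The types and function names below (`Container`, `has_labels`, `select_targets`) are hypothetical and say nothing about the Gremlin or Kubernetes APIs; they only illustrate how matching on labels narrows the set of targets.

```rust
use std::collections::HashMap;

/// Hypothetical container record; the field names are illustrative only.
struct Container {
    name: String,
    labels: HashMap<String, String>,
}

/// Convenience constructor for the example.
fn container(name: &str, labels: &[(&str, &str)]) -> Container {
    Container {
        name: name.to_string(),
        labels: labels
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// True when every key/value pair in `selector` is present on the container.
fn has_labels(c: &Container, selector: &[(&str, &str)]) -> bool {
    selector
        .iter()
        .all(|(k, v)| c.labels.get(*k).map(String::as_str) == Some(*v))
}

/// Names of the containers an attack would be allowed to touch.
/// An empty selector selects nothing, so a missing filter cannot widen the blast radius.
fn select_targets<'a>(containers: &'a [Container], selector: &[(&str, &str)]) -> Vec<&'a str> {
    if selector.is_empty() {
        return Vec::new();
    }
    containers
        .iter()
        .filter(|c| has_labels(c, selector))
        .map(|c| c.name.as_str())
        .collect()
}

fn main() {
    let running = vec![
        container("api-1", &[("app", "api"), ("env", "staging")]),
        container("api-2", &[("app", "api"), ("env", "prod")]),
        container("db-1", &[("app", "db"), ("env", "staging")]),
    ];
    let targets = select_targets(&running, &[("app", "api"), ("env", "staging")]);
    println!("attack targets: {:?}", targets);
}
```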

How does this apply to cloud providers?

23:00 Gremlin is built to run on Linux - we don’t care where you’re running, whether it’s on bare metal, in a data center, on Azure, Google or AWS.
23:15 We have some nice-to-haves if you’re running on a cloud provider like AWS, where we can run some attacks against its services out of the gate.
23:25 Being able to test both sides of the migration to see if they’re ready is important.

Is this safe for production?

23:45 Engineers really like the tools and the ability to run these things, but they have concerns that the security teams won’t like them.
23:50 That’s another one of our core tenets; security is a first-class citizen, and it is a problem that open-source tools leave as an exercise for the reader.
24:10 For example, we don’t run as root, we use permissions everywhere, and we communicate with our control plane via secure out-of-band channels.
24:25 Our security engineers are ex-RSA and have engaged pen-testers to investigate our service and clients.
24:40 This summer, we built in the enterprise security features like single sign-on, SAML 2.0, multi-factor authentication, role based access control and auditing.

How do you build a gremlin?

25:15 Rust is the language we chose to write the client in.
25:20 Rust treats failure as a first-class citizen, so we’re ideologically aligned.
25:25 If you’re writing something that needs to be performant, low-level and able to affect the network, Rust builds on everything that’s been learned in the last couple of decades of programming languages.

What was the learning curve like?

25:45 Rust was new to us, so the first few months of the company we were learning that.
26:00 Whenever someone joins our team we’ve had them write a new Gremlin.
26:10 People love to learn a new language and play with Rust.

What are the key features of Rust?

26:20 Rust has deterministic memory management; there is no garbage collector, and memory is freed at known points (when a value goes out of scope), so you have a very consistent profile.
26:30 Failure is a first-class citizen in Rust, so instead of the C convention of returning a -1, the signatures in Rust allow you to encode the success or failure of the call.
26:50 When things fail, you have to handle it - and the compiler does a great job of checking you up front, making it safe.
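
A small example shows what encoding failure in the signature looks like. The function and error type here (`cores_to_consume`, `AttackError`) are invented for illustration and are not taken from the Gremlin client; the point is only that the return type `Result<usize, AttackError>` states up front that the call can fail, where a C-style function might return -1.

```rust
use std::fmt;

/// Hypothetical error type for a resource attack.
#[derive(Debug, PartialEq)]
enum AttackError {
    InvalidPercent(u8),
    NoCores,
}

impl fmt::Display for AttackError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AttackError::InvalidPercent(p) => write!(f, "percent must be 1-100, got {}", p),
            AttackError::NoCores => write!(f, "host reports zero cores"),
        }
    }
}

/// Works out how many cores a CPU attack should occupy.
fn cores_to_consume(total_cores: usize, percent: u8) -> Result<usize, AttackError> {
    if total_cores == 0 {
        return Err(AttackError::NoCores);
    }
    if percent == 0 || percent > 100 {
        return Err(AttackError::InvalidPercent(percent));
    }
    // Round up so that any non-zero percentage occupies at least one core.
    Ok((total_cores * percent as usize + 99) / 100)
}

fn main() {
    // The caller has to unpack the Result to get at the number,
    // so the failure case cannot be silently skipped.
    match cores_to_consume(8, 50) {
        Ok(n) => println!("consuming {} cores", n),
        Err(e) => println!("refusing to run: {}", e),
    }
}
```

`Result` is also marked `#[must_use]` in the standard library, so the compiler warns when a returned `Result` is ignored entirely.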

How do the Gremlin services communicate?

27:20 We wrote the service in Java running on AWS - we’re using some Amazon services under the cover.
27:40 The web tier is written in Ember - something I used at Netflix.
27:45 We’re evaluating whether to stick with Ember or move to React.
28:00 The client polls the service over an outbound SSL connection, asking if there is work to do.
28:10 That gives us a nice touchpoint, where if we ever lose communication with the service then we know something has gone wrong and we can roll back.
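
The poll-and-roll-back behaviour can be modelled as a small state machine. This is a sketch under stated assumptions: the `Poll` and `State` enums, the `Client` struct and the missed-poll threshold are all hypothetical, and the network call is replaced by values fed in directly.

```rust
/// What a single poll of the control plane can return (hypothetical).
enum Poll {
    NoWork,
    Start,       // the service asks the client to begin an attack
    Unreachable, // the outbound request failed
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Idle,
    Attacking,
}

/// Hypothetical client that counts consecutive failed polls and
/// rolls back once an assumed threshold is reached.
struct Client {
    state: State,
    missed_polls: u32,
    max_missed: u32,
}

impl Client {
    fn new(max_missed: u32) -> Self {
        Client { state: State::Idle, missed_polls: 0, max_missed }
    }

    /// Applies the result of one poll to the client state.
    fn on_poll(&mut self, poll: Poll) {
        match poll {
            Poll::Unreachable => {
                self.missed_polls += 1;
                if self.missed_polls >= self.max_missed {
                    // Lost contact with the service: halt and undo any running attack.
                    self.state = State::Idle;
                }
            }
            Poll::Start => {
                self.missed_polls = 0;
                self.state = State::Attacking;
            }
            Poll::NoWork => {
                self.missed_polls = 0;
            }
        }
    }
}

fn main() {
    let mut client = Client::new(3);
    let polls = vec![
        Poll::Start,
        Poll::NoWork,
        Poll::Unreachable,
        Poll::Unreachable,
        Poll::Unreachable,
    ];
    for poll in polls {
        client.on_poll(poll);
        println!("state: {:?}", client.state);
    }
}
```

Because the client only makes outbound requests, silence from the service is itself a signal, which is what lets the rollback decision be made locally.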

What’s the scalability like?

28:45 We do regular game days and load tests - before we came out of beta we tested with thousands of clients and attacks.
29:00 We’ve tested what happens when Dynamo gets slow or fails.
29:10 We have another load test scheduled for next week which is going to have tens of thousands of clients.

What overhead does the Gremlin client have?

29:30 The client binary is around 25 MB.
29:40 When we’re not doing anything on the system, the usage should be minimal.
29:50 The way that we poll and manage things in the background, there’s no discernible overhead or load unless you are running an attack.

What are you doing in the community?

30:15 More than building a good tool, you have to build culture and community.
30:25 A lot of the plans for this year are to teach the community about chaos engineering.
30:30 Obviously if we can’t help our customers have a more resilient service and prove our value, we won’t be around.
30:40 We have a vested interest in ensuring that everyone is successful, whether they’re using Gremlin or not.
30:50 We have a publicly available Slack channel [http://tinyurl.com/chaoseng] to allow people to come and ask questions.
31:05 We’re putting together some white papers - for example, What is Chaos Engineering?

What do you mean by cost of downtime?

32:05 There are a couple of metrics: IHS Markit did a survey and estimated that US companies are losing $700 billion to outages every year.
32:20 A PagerDuty survey put the cost of downtime at $5k a minute, or $300k an hour.
32:30 From my time at Amazon, if an e-commerce site goes down on Black Friday, it can be hundreds of thousands of dollars per minute.
32:40 That drives home the point that every time there’s an outage, not only do your customers lose faith in you and your brand suffer, there is also a financial cost.
10:30 Engineers have to be paged, then triage, find the root cause, follow up and fix - like an iceberg, a lot of people just see the outage at the top but there’s a lot more afterwards to understand and prevent recurrence.
33:10 Then there’s the money lost because you can’t do business.
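
The per-minute and per-hour figures quoted from the PagerDuty survey are the same number in different units, which a couple of lines of arithmetic confirm. The function name `outage_cost` and the 45-minute outage are made up for the example; only the $5k-a-minute figure comes from the notes above.

```rust
/// Direct cost of an outage in whole dollars, given a per-minute cost.
fn outage_cost(dollars_per_minute: u64, minutes: u64) -> u64 {
    dollars_per_minute * minutes
}

fn main() {
    let per_minute = 5_000; // the survey figure quoted above
    println!("one hour of downtime: ${}", outage_cost(per_minute, 60));
    println!("a 45-minute outage: ${}", outage_cost(per_minute, 45));
}
```

This covers only the lost business; the engineering time spent on paging, triage and follow-up comes on top.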

Resources

Related Podcast: Nora Jones on Establishing, Growing and Maturing a Chaos Engineering Practice
Gremlin, Inc.
More on Chaos Engineering
The Art of Chaos Engineering Track at QCon San Francisco
Slack Channel