Chaos Engineering with Containers



Ana Medina discusses the benefits of using Chaos Engineering to inject failures in order to make our container infrastructure more reliable. She also shares how to improve container monitoring and observability and lessons learned from running Chaos Engineering GameDays with Gremlin customers.


Ana Medina is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Medina: I'm glad someone else thinks I'm cool. My mom's usually telling me I'm cool, but there's only so much I can trust her on that. But thank you for coming to my talk. I hope all of you are enjoying the ice cream you're eating. I kind of walked by break and I was like, "Man, I want ice cream but I should probably get to my talk." So I'm jealous basically. But welcome to Chaos Engineering with Containers. My name is Ana Medina.

A little bit about me. I'll start off by saying that I'm a proud Costa Rican and Nicaraguan, so a proud Latina. I'm currently working at Gremlin as a chaos engineer. Gremlin is a company focused on risk resiliency tools and we've been focusing on doing chaos engineering for now. If you're interested in learning more about them and hearing the pitch, they have a booth outside. Prior to joining Gremlin, I was actually at Uber for two years. I got a chance to work on some really cool stuff there. And basically it was mostly on the SRE team, on the cloud infrastructure team and on developer platform.

So I got lucky and when I joined Uber I actually joined their chaos engineering team. They had something similar to Chaos Monkey, if anyone's very familiar with that. Chaos Monkey is an open source project that Netflix came out with for chaos engineering. It basically just shuts down AWS instances very randomly. That came out various years ago and since then Netflix has come up with a new tool. They had to build their own tool for that so I got a chance to come on board for that and I got to be on call for it. I got to learn the pains for it and tried to make it a healthier application.

Then after that, I moved over to the cloud infrastructure team at Uber. And that was actually really cool, because Uber is running completely on bare metal and they actually wanted to make the switch over to the cloud, but they wanted to do that in a vendor-agnostic way. So we were building an abstraction layer for that, and we were trying to do that by leveraging AWS and GCP to spin up an Uber-ready data center in a really short period of time. For upper management, the OKR was basically this: the last bare metal data center that they had built took about six months, and by using the cloud they wanted to speed that up and make it 24 hours. So it was actually a really crazy moonshot project. I recently talked to a manager at Uber and they're actually still running on bare metal. Some projects do have access to running stuff on the cloud, which is really cool.

Apart from that, I've also worked at a small startup in Miami, a credit union in south Florida, Google, Quicken Loans and I got a chance to do college research with Stanford University and Miami Dade College. I'm also a proud college dropout. I'm a self-taught engineer. I began coding at the age of 13 and it was just like my curiosity. I pressed the button that said “insert html” on the Microsoft Publisher application, and then I went on AltaVista and I went down the rabbit hole of just teaching myself how to create websites, then later that became back into mobile applications. And now here I am in the infrastructure space. But let's get started.

With the raise of hands, how many of you here have heard of chaos engineering? All right, cool. And with another raise of hands, how many of you have actually run a chaos engineering experiment? So hopefully by the end of this talk, all of you are a little bit encouraged to get started with chaos engineering. There's actually a few ways that you can just kickstart on running chaos engineering in your side project or in your company.

So what is chaos engineering? Chaos engineering is thoughtful, planned experiments designed to reveal the weaknesses in our systems. The CEO of the company I work for came out with this really cool analogy. Chaos engineering is like a vaccine where you inject something harmful in order to build an immunity. So it's not about breaking things at all; it's thinking about how do I make my infrastructure, my services and my systems more reliable and more resilient? But why do we need something like chaos engineering? Well, the number one thing that we like talking about is microservices. We have seen that as companies move over to microservices, their systems start getting more complex, and they stop being able to pinpoint exactly where their incidents come from. At the same time that they're moving to microservices, some of them start adopting cloud. And once all of these things are happening in their systems, those abstractions are getting heavier and heavier and the systems just get more complex. So with chaos engineering, you actually start looking at how every single little thing in that system can actually fail.

Our companies continue growing; we continue serving more customers, we continue going to other parts of the world. And with that, we need to make sure that our companies continue providing good experiences for every single customer. And it's hard to make sure that the users are having good experiences by just doing testing alone. You actually need to start thinking about what other failure modes can happen, and maybe look at what other companies are doing. In general, downtime is extremely, extremely expensive. There's a report out there that shows that a company loses about $300,000 for every hour that it's down. And of course, this completely depends on what type of company it is, or if maybe they're a company that's actually running a sale just for five hours and they actually have an outage in two of those hours. The actual losses for that downtime are probably a lot more than $300,000 an hour.

We've also learned that failure will happen. And when failure happens, well we also need to think about that our dependencies will fail. We've come to a time that as a company it's a little hard to just necessarily be like, "Hey customers, we're sorry we're down, our cloud provider has some issues." And you throw the blame on your provider. At that point, your customers don't necessarily care what type of provider you have, whether you're running on bare metal or cloud. They necessarily just care that they were able to use your website, your application and actually do something. So when you think about the dependencies failing, you actually want to think about how do I actually prepare for that, and not necessarily prepare in the sense of like, "Oh crap, there's an outage. Now I know that this happens." Well, you have to start thinking about that before those outages even happen.

Apart from that, pager fatigue and burnout really hurt. I myself suffered through burnout and that's the reason I left the infrastructure organization at Uber. And with chaos engineering, you're actually able to battle a little bit of engineering fatigue, and just burnout in general. With chaos engineering, you can actually test your monitoring, your metrics, your thresholds and make sure that the pages that your engineers are getting actually mean something, and that they come with an actionable item that the engineer can actually take. Apart from that, when you're doing chaos engineering, the engineers are actually able to focus on building features and not necessarily just triaging tasks. So you're able to build and move fast.

There's this really cool quote by Charity Majors. She's the CEO of Honeycomb. "Chaos engineering without observability is just chaos." And when you have just chaos, you might just feel like that Elmo, where everything's on fire and you don't necessarily know where to go or where to look.

Prerequisite of Chaos Engineering

So before doing chaos engineering, there's a few things you must have, and the number one thing is that you actually must have some monitoring and some observability in place. Without monitoring and observability, you actually don't know how your systems or services are doing currently and you definitely won't know how they're doing when you actually run your chaos engineering experiment. Another prereq for chaos engineering is actually having on-call and incident management. Not only is this a good SRE practice for making sure that you actually try keeping your company up and running all the time, it's also a good way to know which are actually the most critical services that you have in your company, and how you set those apart. Do you have different tiers of services, so that you're able to say, well, if these five services go down, we actually have a SEV0 incident? That means that those are the ones that are causing the most customer impact, and you have more engineers on hand trying to triage that incident as well.

You also want to know the cost of downtime per hour. This in general is a good metric to have if you're working on something like resiliency and reliability. And I also think that it's pretty cool because if you're actually working on doing something like chaos engineering in your company, you're actually able to go up to upper management when perf review comes around and say, "Well, you see the company's losing X amount of money for every hour it's down. By me actually helping on doing chaos engineering we've actually managed to bring incidents down like three X from what we had three months ago." And you can actually come up with a formula for how much you're saving the company.
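The "formula" mentioned here is not spelled out in the talk. One minimal way to sketch it, assuming you track hours of downtime per period and use a flat cost per hour (both the function name and the sample figures below are illustrative, not from the talk):

```python
def downtime_savings(cost_per_hour, hours_down_before, hours_down_after):
    """Estimate money saved in a period as downtime hours drop.

    Assumes a flat cost per hour of downtime, which is a simplification:
    an outage during a sale or peak traffic costs more than the average.
    """
    if min(cost_per_hour, hours_down_before, hours_down_after) < 0:
        raise ValueError("inputs must be non-negative")
    return cost_per_hour * (hours_down_before - hours_down_after)


# Hypothetical quarter: 6 hours of downtime before, 2 hours after.
saved = downtime_savings(300_000, 6, 2)
print(f"Estimated savings: ${saved:,.0f}")
```

A negative result would mean downtime went up, which is also worth reporting.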

Use Cases for Chaos Engineering

What are some use cases for chaos engineering? Well, the number one I like talking about is outage reproduction. We have to learn from every single postmortem that we have inside a company. We can go ahead and have good principles and actually be able to do these really cool postmortems combined with action items. But there's usually no followup meeting in place where we actually go back and make sure that the fixes we said we were going to make after this postmortem have actually gone in. So you can actually reproduce those outages in a very controlled way with chaos engineering. You're able to think about, “Well, the last two outages that we've had were because of our API, so now I actually want to do chaos engineering with the API and specifically focus on just that.”

With outage reproduction, the cool thing about this too is that we've actually gotten to a point where the culture of postmortems is a little bit more open and more blameless. A lot more companies are actually sharing what their postmortems are, and what has actually affected their systems and services. And with all this information, you're now able to actually look at all those postmortems and think about, “Hey, I actually use those same systems and services” or, “My infrastructure is very similar to this. It's crazy that they had an outage like that. Well, now we can go into my company and actually triage something like that.” So a good example can just be the most recent GitHub outage, where a database was partitioned from the network for just 43 seconds and that led to like a 24-hour outage. So basically, you're able to now think about like, “Oh, maybe I should make sure that my databases are still reliable when it comes to network issues,” and actually run a chaos engineering experiment from there.

On-call training is one of my favorite reasons why chaos engineering should be practiced. How many of you in the room, when they went on-call just got thrown on a pager and said, “You're on-call, go for it”? Yes? Kind of like the common scenario for it. That was actually my case. When I joined Uber, I actually joined as an intern. I was their first [inaudible 00:11:44] intern and I was put on-call my second week, and this was a production service. And it basically was like, “Here's your pager [inaudible 00:11:53], here's the runbooks, have fun.” There's not so much fun you can have when it's Wednesday at 4:00 in the morning and there's a huge incident going on in your chaos engineering service. You don't necessarily want to be that person that wakes up your team just to walk through the incident with them. And at the same time I looked at that runbook and it was 180 days old.

So here I was like, “Cool, I get to execute all these commands, but I can also get fired tomorrow just because of this.” And as an intern, that was actually extremely scary. So looking back at that experience, looking at ways to actually make on-call training a lot easier, well, with chaos engineering you can do just that. You can actually reproduce exact incidents or possible incidents and actually onboard an engineer to production that way, or onboard an engineer to on-call. This lets them see exactly what they need to do in the technical space. They actually get to mentally be in the space of an actual incident and learn how they need to perform in it.

Another cool way to use chaos engineering is to strengthen your products or just in general test the reliability of your services. At Gremlin, we actually do this very, very often. Not only is it because we want to dogfood our own chaos engineering tools, but in general we want to make sure that we're also serving reliability to our customers. So we used to have something called Failure Fridays. Now we have moved it over to Thursday, so we're calling it Takedown Thursdays. But basically, every two weeks a lead engineer gets put on it and they come up with something that they actually want to test and a few chaos engineering experiments they want to run. We're a remote company, so this also means making sure that there's a Google doc available, there's a Zoom room available and there's a Slack channel also. So we're all distributed and we all come online and we run through these chaos engineering experiments together. There are columns on that doc that explain who's going to be running each experiment, what the expected outcome is, and what the actual results are. And then you actually get to find out a lot about your products that way. You can actually make sure that you're feeling the pain of your customers before they actually get to feel that pain.

Another one that I'll actually talk a little bit more about later on is using it to battle test new infrastructure or any new services that are coming on. We're in a tech space that's moving extremely fast and with that, we just have companies that are providing us with these really cool services, especially in the infrastructure space. But how do we actually decide which services we want to use and how do we make sure that they're as reliable as these companies are telling us? Well, with chaos engineering you can put them to battle side-by-side and actually decide which one is more reliable for your needs.

Use Cases for Chaos Engineering - Containers

And specific to containers, well, same thing as battle testing. Just like with cloud infrastructure, now you can actually test the reliability of a cloud provider's managed Kubernetes, basically. And that would be putting EKS, AKS and GKE to the test, in terms of how do they actually react when I'm turning off my containers? How does their autoscaling work? Or just in general how do they interact with the rest of my dependencies? Autoscaling is a perfect one that you want to be testing when you're using containers. You get the promise of containers, that autoscaling will be taken care of if you've actually set it up correctly, and that reliability will actually be done for you. So when you talk to people who are just getting started with Kubernetes, a lot of the reasons that they're jumping on Kubernetes is just for its reliability and the ease of coming on board to it. So now they just basically believe that if you shut down a container, it's going to come back up. But that's not necessarily always true. Containers can get stuck and you actually need to manually triage them to bring them back up.

Logs and disk failures are something that can constantly happen with containers. With containers, you're set for the reliability of the infrastructure, but then come all the dependencies that the rest of your systems and applications continue interacting with. How do you actually deal with your logs filling up? We’ve talked to companies a lot about the outages that have affected their customers the most. And we always hear about those big outages that happened just because they didn't handle the logs filling up the disk very well, or just in general, because anything else that caused a disk failure was not handled correctly.

What is the process of running a chaos engineering experiment? There's this really cool diagram that kind of explains it. You run it actually like a scientific experiment. You want to be thoughtful, you want to be planned, so you don't want to just go ahead and think about, “Hey, how can I break production? I'm going to go ahead and do that.” Or, “I'm going to do it on 100% of my hosts or 100% of my containers, or inject 100% of the effect of the attack into my infrastructure.” So first, you want to think about the blast radius. And the blast radius is basically the impact that your chaos engineering experiment is going to have. So what that means is that you don't want to start off doing chaos engineering in production. You don't want to start off with a chaos engineering experiment affecting all your infrastructure, nor running the attack at the max capacity that it can have.

So you want to think about running this specifically on just two hosts out of the 10 that you have. And maybe first in your dev environment or your QA environment, the closest thing that you have to production. So once you have the mindset of what percentages you want to run this experiment at, you want to come forward and form a hypothesis. And a hypothesis will look a little bit more like, what do I think happens if I inject 200 milliseconds of latency into my calls to DynamoDB, or MySQL, or S3 or things like that? So you come up with a hypothesis in the sense of, "Hey, if I inject 200 milliseconds of latency, I think I'm not going to be breaching my SLA in terms of how my customer actually gets to interact with the website or my API. I think I have that covered, and reliability is not going to be an issue." So now I actually want to test that hypothesis. I go ahead and I actually get 20% of my fleet and I inject those 200 milliseconds of latency into the calls that I have for DynamoDB. Then when I go look at the customer experience, I actually get to see how it's going.
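To make the blast radius and hypothesis concrete, here is a minimal sketch. It only picks targets and checks the hypothesis on paper; it does not inject any latency. The host names, the baseline latency and the SLA figure are all made-up assumptions for illustration.

```python
import math
import random


def pick_targets(hosts, fraction, seed=None):
    """Choose a random subset of hosts to limit the blast radius."""
    if not hosts:
        raise ValueError("no hosts to choose from")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, math.ceil(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


def hypothesis_holds(baseline_ms, injected_ms, sla_ms):
    """True if baseline latency plus injected latency stays within the SLA."""
    return baseline_ms + injected_ms <= sla_ms


# Hypothetical fleet of 10 hosts; attack 20% of it.
fleet = [f"host-{i}" for i in range(10)]
targets = pick_targets(fleet, 0.20, seed=1)
print("Targets:", targets)

# Hypothetical numbers: 120 ms baseline, 200 ms injected, 500 ms SLA.
print("Hypothesis holds on paper:", hypothesis_holds(120, 200, 500))
```

The paper check is only the prediction. The point of the experiment is to compare it against what monitoring shows once the latency is really injected.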

But like I mentioned earlier, the number one thing that you want to have in place is monitoring and observability. This is the perfect chance to go look at your monitoring and make sure that things are running properly, that your customers theoretically wouldn't be seeing issues if you were running this in production. And you want to think about what the abort conditions would have been in case this was running in production. When would you have stopped this experiment? It could have been when I start seeing my customers drop by a certain percentage, or when it's taking too long for the images to come up, or when in general things are completely failing.
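Abort conditions like these can be written down before the experiment starts so that nobody has to make the call under pressure. The sketch below is one possible shape for that check; the `Snapshot` fields and every threshold are assumptions chosen for illustration, and in practice the numbers would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """A point-in-time reading from monitoring (hypothetical fields)."""
    active_customers: int
    image_load_ms: float
    error_rate: float


def abort_reasons(baseline, current, max_customer_drop=0.10,
                  max_image_load_ms=3000.0, max_error_rate=0.05):
    """Return the list of abort conditions that are currently met."""
    reasons = []
    if baseline.active_customers > 0:
        drop = 1 - current.active_customers / baseline.active_customers
        if drop > max_customer_drop:
            reasons.append(f"customers dropped by {drop:.0%}")
    if current.image_load_ms > max_image_load_ms:
        reasons.append("images are taking too long to load")
    if current.error_rate > max_error_rate:
        reasons.append("error rate is too high")
    return reasons


baseline = Snapshot(active_customers=1000, image_load_ms=400.0, error_rate=0.01)
current = Snapshot(active_customers=850, image_load_ms=3500.0, error_rate=0.02)
print(abort_reasons(baseline, current))
```

An empty list means the experiment can keep running; anything else means halt the attack and roll back.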

But if you see that you ran a chaos engineering experiment and it didn't go as well as planned, where you see that your systems or services have some weak points, well, you don't want to just stop there and give up. You actually want to go, make sure that you fix the reliability. You might have to do some more extra coding. You might have to think about how do you make your infrastructure more reliable or coming up with different secondary providers that you can have in order to protect yourself from failure. And then go ahead and run this chaos engineering experiment all over again and pretty much run it until you see it be successful as you completely continue maximizing the percentage that you're going to be running this experiment on.
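The fix, rerun and widen loop described here can be sketched as a small ramp. This is only an outline under assumptions: `run_at` is a hypothetical callback that runs the experiment at a given fraction of the fleet and returns whether the hypothesis held, and the step sizes are arbitrary.

```python
def ramp_experiment(run_at, steps=(0.1, 0.25, 0.5, 1.0)):
    """Run an experiment at increasing blast radius, stopping at the first failure.

    Returns (largest_passing_fraction, failing_fraction). Either value is
    None when there is no such step.
    """
    last_pass = None
    for fraction in steps:
        if not run_at(fraction):
            return last_pass, fraction
        last_pass = fraction
    return last_pass, None


# Hypothetical system that copes with up to 25% of hosts being attacked.
passed, failed = ramp_experiment(lambda fraction: fraction <= 0.25)
print(f"Held up to {passed:.0%}, broke at {failed:.0%}")
```

After fixing whatever broke at the failing step, the same ramp is run again from the start until every step passes.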

Monitoring/Observability

So with that, in terms of monitoring and observability, what exactly do you look at? Well, first of all, you want to be making sure that you're measuring system metrics. You want to make sure that you're measuring what your resources are doing, how they're handling the load of users, how they're actually handling these chaos engineering experiments. So make sure you're covering CPU, I/O, memory and disk. At the same time you want to measure the availability of your services and your applications. You want to keep in mind the service-specific KPIs that your company has put in place. And this also includes thinking about the error budgets on the SLAs that those services actually have. You don't want to be breaching the error budgets when you're doing chaos engineering. You want to make sure that you're within those error budgets and that you've worked together with the team that is in charge of them when you're running chaos engineering. You also want to have a way to know about customer complaints, specifically if you're running this in production. How can a customer let you know that they're not having the best experience, or that they're actually not able to check out if you're an ecommerce website, or not able to log into their account if you have anything like that?
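As a worked example of the error budget arithmetic, assuming the budget is expressed as allowed downtime against an availability target over a fixed window (the 99.9% target, the 30-day window and the downtime figures are illustrative assumptions):

```python
def error_budget_minutes(availability_target, window_days=30):
    """Allowed downtime in minutes for an availability target over a window."""
    if not 0 < availability_target < 1:
        raise ValueError("availability_target must be between 0 and 1")
    return (1 - availability_target) * window_days * 24 * 60


def budget_remaining(availability_target, downtime_minutes, window_days=30):
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(availability_target, window_days) - downtime_minutes


# Hypothetical service with a 99.9% target over 30 days.
budget = error_budget_minutes(0.999)
print(f"Budget: {budget:.1f} minutes")

# If 30 minutes were already lost to real incidents this window:
left = budget_remaining(0.999, downtime_minutes=30)
print(f"Left for experiments and incidents: {left:.1f} minutes")
```

With so little budget left in this example, a production experiment that could cost more than a few minutes of availability would be better postponed or run with a smaller blast radius.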

So now I actually wanted to do a demo. The demo gods were not in my favor today, so I ended up having to record the ends of it. So it's actually a little bit weird, but we'll make it work. The first one that I want to run through, is kind of like how I mentioned, you want to battle test infrastructure. So the scenario is that a company or a user is evaluating a cloud provider that manages Kubernetes. But which one is more reliable? Well, my hypothesis is that by shutting down a container where there's only one container that's running the application, well, that should only give a small delay before the application is reachable again. So now the experiment that I'm going to be running is that I'm going to shut down the Kubernetes dashboard container and actually see what happens. My abort condition is that if my application is not reachable after 60 seconds, well, I want to make sure to abort that chaos engineering experiment.
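The shape of this experiment (shut something down, watch for recovery, abort at 60 seconds) can be sketched as a polling loop. This is not the tooling used in the demo; `is_reachable` is a hypothetical probe you would supply, for example an HTTP check against the dashboard URL, and the clock and sleep functions are injectable only so the loop can be exercised without waiting.

```python
import time


def wait_for_recovery(is_reachable, abort_after_s=60.0, poll_every_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the application is reachable or the abort condition is hit.

    Returns (recovered, seconds_elapsed).
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_reachable():
            return True, elapsed
        if elapsed >= abort_after_s:
            return False, elapsed
        sleep(poll_every_s)


# Trivial probe that reports the app as already reachable.
recovered, elapsed = wait_for_recovery(lambda: True)
print("recovered" if recovered else "abort the experiment", f"after {elapsed:.1f}s")
```

When the function returns `False`, the abort condition has been met: stop the attack and restore the container manually.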

And that's one of the number one things to keep in mind with chaos engineering: security and safety. You don't want to do chaos engineering to cause more outages, to break production or to piss off the rest of the engineers at your company. When you do chaos engineering, you actually need to inform all your dependencies that you're running chaos engineering, so that if they see a spike in their monitors, they know that it's a chaos engineering experiment coming from one of the services that depend on them, versus them thinking that it's a DDoS attack or something like that.

With that, now we're going to test two cloud providers' Kubernetes infrastructures and see how that goes. We have deployed the Kubernetes dashboard containers straight out of the box and we'll see how that works. So on the left side we have one provider. On the right we have another provider. They're both running Datadog in order to catch some monitoring. So we can see that it's just showing a little bit of how the Kubernetes dashboard is going, then we move over to Datadog and we can see how the containers are doing on both of them. They're both running and we're going to go ahead and run two chaos engineering experiments, one on each.

It takes a little bit to load. So now we go back to our dashboard. I kind of see that it's having some issues and we kind of want to know when was the last time it was restarted. So we see both of them still have the old timestamps of 21 minutes and 24 minutes. The attack has not reached them yet. So for one provider we see that there are some issues connecting to the container that just came back up. And then we check Datadog. We're going to see that the container just came back up a minute ago and things are working as expected. That was probably less than the 60 seconds that we had for abort conditions. And if we look over at the left provider, we see that the Kubernetes container should also have come back up. But it's still giving us some issues. So when we pull up the terminal that I realized I just blocked off, we're basically going to manually restart the container in order to bring it back up. So when we do that, we can now see that the Kubernetes dashboard comes back up and things are running smoothly. But there was that part of it that you're actually able to see, that with one of the providers you actually had to go in and fix the container manually, versus it automatically healing itself, which is the promise that these Kubernetes providers actually give you.

Now we're going to move on to doing some slightly different ones. I'm going to be using a microservices demo app put together by Weaveworks, running it on Docker. So these are going to be the containers that we actually have. There's the catalog, the shipping, the queue master, the carts, the front end, the container for the orders, the database for the catalog, the user, the edge router, the payment and the user database as well. So this is a little bit of the architecture of how it all comes together. It's all put together with Mongo, MySQL, Go and just some other technologies as well. But this is where you start thinking about, where can things go wrong? So what actually happens if my database goes down or my calls to my database are extremely slow? How do my users actually feel? What is the pain that they're having, or what actually happens if something goes wrong with the user profile, so that a user's actually not able to create an account or log into their account? Or maybe it actually gets even worse where the payment is not even able to work.

So you're now able to think about, “Hey, I actually outsourced my payment to dependency company A, well what happens if I lose connection to dependency A?” So you run a chaos engineering experiment on that dependency and you'll see that things will fail. So once you start looking that all these failures are happening affecting the customer experience, you want to think about, “Hey, what can I do to actually make my users or customers not feel this pain?” And that's when you start thinking about bringing a second tier of a provider that you want to have as a backup plan.

One of the big examples that we use for this is the AWS S3 outage that affected a lot of our favorite companies. Basically, things were completely down. A lot of the images were not loading, the experiences were not good. And it all just happened because S3 buckets in US-East-1 actually went down. But this happened to be one of the largest outages that has happened, and it affected many companies. So looking at that outage, how do you actually prepare for something like that not to happen to you? Well, if you actually do an experiment where you're cutting off all your traffic to S3, you're able to replicate that exact scenario and make sure that your company is ready in case another S3 outage happens.

With this one, this is a perfect experience. There are no experiments running on this user experience of the Weaveworks Sock Shop app. So you can see, you're able to view it. You're able to go look at the socks, browse the images, go back. Add it to cart because I thought those were really cool. Then those really expensive ones that have holes on them, I definitely want them in my cart. So now I can actually see that I needed to have shipping and payment information in order to check out. So it was good that that code has that part of it that it didn't let me check out beforehand.

But that was the experience in the sense that there are no experiments that I'm currently running on top of that. The dependencies are all currently up and running. So what actually happens if some of that starts failing? So we're now going to go ahead and actually shut down our container. So the scenario is that we're, as a company, thinking about moving over to containers, but I want to test if they're as reliable as promised. So my hypothesis is that if I shut down one of the containers, it comes back up immediately as is promised. And if not, well, maybe I don't want to do containers. So my experiment is I'm going to shut down this container, wait a few seconds, and check if it is up. I'm going to have the same abort condition: if my app is unreachable after 60 seconds, well, I want to stop this chaos engineering experiment immediately.

So I'm going to shut down the front end of the Weave Sock Shop app and see if it comes back up immediately. So I just did it, I can see it’s actually not working. So now I'm going to reload it and it actually came back up. Basically, I think it was three, four seconds of the amount of time it actually took for this container to come back up, which is actually not what I was expecting. I was actually expecting for it to take at least 30 seconds for the container to come back up, specifically because of the experiments that I had run on the Kubernetes dashboard beforehand. So it was actually pretty cool to see like, “Okay, cool that automatically got brought back up.” So the promise of the containers being as reliable in terms of shutting down [inaudible 00:30:40] containers and them being able to auto heal themselves was covered.

But now I actually want to run another experiment. I actually want to blackhole all the traffic to my catalog and the catalog database. So the scenario is that now I'm actually working with my UI team. And we're thinking about ways that failure happens in the company in general. But we want to think about, “Hey, how can I actually still provide my users a good UI experience when there's issues going on in the systems or in the infrastructure?” Netflix is a perfect example of this. They actually have different ways that the UI behaves in case one of their systems or services is going down. So you have the entire box of shows under "Continue Watching." And in case the service that actually maintains that goes down, or that database goes down and your user profile can't be loaded, well, Netflix makes sure that you don't just have a gray box there that shows "Can't load what you currently are watching. Sorry." It actually just shows, "These are the top shows or movies that people are watching."
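The UI behavior described here is a fallback: when the personalized data can't be loaded, show something generic instead of an error. The sketch below illustrates that pattern only. It is not Netflix's implementation; the function names, headings and titles are all hypothetical.

```python
def continue_watching_row(user_id, load_history, top_titles):
    """Build a UI row, degrading to popular titles if history can't be loaded.

    load_history is a hypothetical callable that may raise ConnectionError
    or TimeoutError when the backing service or database is down.
    """
    try:
        titles = load_history(user_id)
    except (ConnectionError, TimeoutError):
        titles = None
    if titles:
        return {"heading": "Continue Watching", "titles": titles}
    # Fallback: generic content instead of a gray error box.
    return {"heading": "Popular Now", "titles": list(top_titles)}


def broken_history_service(user_id):
    """Simulates the history service being unreachable."""
    raise ConnectionError("history service unreachable")


row = continue_watching_row("user-1", broken_history_service, ["Show A", "Movie B"])
print(row["heading"], row["titles"])
```

A chaos engineering experiment that blackholes the history service is a direct way to confirm the fallback branch actually runs in the deployed UI.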

So that makes the user have a good experience and not realize that the company's actually even having any type of incident or any type of errors behind the scenes. So now my hypothesis for this is that my images will not load, but I will still actually be able to see the product listings and be able to know a little bit about them. So now in terms of the experiment, I'm going to go ahead and blackhole all the traffic from the front end to the REST API and the database. So with Docker I just managed to get the ports for that and just block all the traffic from the front end container to those ports. The abort condition is the same: if my application is unreachable for more than 60 seconds, I'll go ahead and abort this.

So I ran the attack and I can see that basically the entire website does not load if anything happens to the catalog. And that makes a lot of sense considering it's the entire REST API for the catalog and the entire database of the catalog. So everything that the website is actually pulling from, we've actually blackholed the traffic to it, therefore it can't reach it. So this is also the scenario that I kind of gave of what happens if S3 was to go down. How does the application actually handle it? So now maybe I actually want to think about how to make this experience better for my users as I implement this into the company. So I might actually want to implement some form of caching, so that even though there's errors in my database, there's a caching layer that is able to have the latest products and the information and images, and the user's still able to have some information, versus seeing a completely broken website where you can't do anything or see anything or even know what this company is about.
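
One way to picture that caching idea is a thin wrapper that remembers the last good catalog response and serves it when the catalog cannot be reached. The sketch below is an assumption-laden illustration, not the Sock Shop code: `CatalogWithFallback`, the `fetch` callable, and the `stale` flag are all hypothetical names, and it assumes `fetch` uses a timeout so that blackholed traffic surfaces as an error rather than hanging.

```python
from typing import Callable, Dict, Optional


class CatalogWithFallback:
    """Serves the last known good product data when the catalog is unreachable.

    Hypothetical wrapper for illustration only.
    """

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch                      # real catalog lookup (assumed to time out)
        self._last_good: Dict[str, dict] = {}    # product id -> last successful response

    def get(self, product_id: str) -> Optional[dict]:
        try:
            product = self._fetch(product_id)
        except (ConnectionError, TimeoutError):
            # Catalog is down or blackholed: fall back to the stale copy, if any.
            cached = self._last_good.get(product_id)
            if cached is None:
                return None
            return {**cached, "stale": True}
        self._last_good[product_id] = product
        return {**product, "stale": False}
```

The `stale` flag lets the UI decide how to present old data, for example by hiding prices or stock levels while still showing the product name and image.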

Case Study

For the case study, I was actually hoping to get the clear from our success team to share a little bit more of what Gremlin is doing with the customers. But due to my non-time management, I wasn't able to get that all put together. But I get to talk a little bit about some of the customers that we have. So this slide is actually just companies in general that are running chaos engineering. There's a few on there that are just running Gremlin for doing chaos engineering experiments. And we've been talking to a lot of companies too, as pre-sales or actually running game days with some of them.

So we started thinking about what type of scenarios they can have. And one of the ones that we actually saw one of our companies do was actually completely bring down a zone. And they actually thought they weren't reliable, that all of their services would break because of that. But because they ran it prior to production, they actually managed to see that they had two availability zones. So the experience was actually still able to go really well for the customers. So you can think about different ways that you think that your systems are resilient or different ways you think your systems are not resilient, and then actually put them to the test to actually see how they behave.

So if you're interested in picking up chaos engineering after you leave this talk, which I hope at least some of you do, there's a few tools out there that you can actually use to get started. The company I work for has a commercial product for chaos engineering. You can easily find more about us online or go to the booth. There's also a few open source tools out there that you're actually able to use for doing chaos engineering specific to containers. One of them is Chaos Toolkit, put together by a company called ChaosIQ, and they've actually done a little bit on just how to perform chaos engineering on containers or infrastructure in general. And they've started to integrate now with other open source projects. I was looking at their Slack channel recently and they're actually looking to integrate with Uber's open source project, Jaeger, to do OpenTracing on the chaos engineering experiments that they're doing. So there are actually other cool ways that they're going to be building observability into the experiments that they run.

Litmus and PowerfulSeal are two other powerful open source projects. PowerfulSeal is actually put together by Bloomberg. They're both CLI tools, and they mostly focus on Kubernetes reliability. Some of them actually still support Docker, but not necessarily always out of the box. And I thought I had put the GitHub links to these. I totally forgot them, but I'll make sure to fix this and put the links in before I give the slides over or upload them on Twitter.

So if you want to continue your chaos engineering journey and you want to know more about what other companies are doing in order to do chaos engineering or how to get started, feel free to join the chaos engineering Slack channel. There's over 2000 members around the world performing chaos engineering experiments. They get to share their failures, their learnings, or just share different open source tools that they're building on to do chaos engineering and give back to that space.

So that's all I had for you all today. Feel free to reach out to me via email or Twitter if you have any questions or want to talk anything chaos engineering related. And I'll now open it up for questions.

Questions and Answers

Moderator: I actually sat for 40 minutes of your talk and was like, “What question could I ask?” I didn't come up with one. So anyone in here want to ask a question? Or actually no, I do have one. So you mentioned the Slack, you mentioned the apps but I've only really seen like one book. It was the O'Reilly book, it's like this thick. Is there anything else that we can read about either chaos engineering or testing in production? Because some of us we learn through words rather than action.

Medina: So there is one book on chaos engineering. It's actually put together by Netflix engineers such as Nora Jones and Casey Rosenthal. They help build chaos engineering teams at Netflix and they have a lot to say about it. Apart from that, there's a website called Principles of Chaos, which is based off the chaos engineering book by O'Reilly and talks a little bit about the principles and how you want to run chaos engineering experiments by doing it in the scientific way. And in that Slack channel there's a whole bunch of more links on what to read up. There was actually a cool case study of Adobe running chaos engineering experiments in a game day format. So they share a little bit of their learnings from it. And there's also an entire community that Gremlin is putting together, where we're putting tutorials on how to do chaos engineering on specific technologies such as like Memcached, Prometheus, Elasticsearch. And just kind of make it easy out of the box to start running on chaos engineering.

Participant 1: I recently had a dev outage escalate to prod. Before you start a test, how do you recommend figuring out what the blast radius is going to be in terms of dependencies?

Medina: So I always say start with one host. Even if you have more than one host, just start with one. You want to run with the smallest blast radius possible. And there's a few resources online on the Gremlin community that we have put together, where you're able to actually map out where you want the experiment to go, where you actually think about, “Hey, I want to start running my chaos engineering experiment on one host and I want to just start with 200 milliseconds of latency.” So you mark it there and then you kind of think about like, “Okay, now if I run it on two to three hosts, how much latency do I want to run?” So you kind of plan it beforehand, forecasting where you want to go, to the point that you end up running it on 100% of your hosts with a lot more latency. But a lot of it just comes with planning.
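
As a rough illustration of that kind of forward planning, the sketch below generates a staged plan that begins with one host at 200 milliseconds and ends with every host. The function name `plan_stages`, the host-doubling rule, and the 100-millisecond step are assumptions chosen for the example; the talk only gives the starting point and the end goal.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    hosts: int        # number of hosts targeted in this stage
    latency_ms: int   # latency injected on each targeted host


def plan_stages(total_hosts: int, start_latency_ms: int = 200,
                latency_step_ms: int = 100) -> List[Stage]:
    """Build a plan that widens the blast radius one stage at a time.

    Assumed policy for illustration: double the host count and add a fixed
    latency step per stage until the whole fleet is covered.
    """
    if total_hosts < 1:
        raise ValueError("total_hosts must be at least 1")
    stages: List[Stage] = []
    hosts, latency = 1, start_latency_ms
    while True:
        stages.append(Stage(hosts, latency))
        if hosts == total_hosts:
            return stages
        hosts = min(hosts * 2, total_hosts)
        latency += latency_step_ms
```

For a ten-host fleet this yields stages of 1, 2, 4, 8, and 10 hosts. Each stage would be run only after the previous one confirmed its hypothesis.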

Participant 2: My question is similar to the other one. Do you recommend building this testing into your CI/CD pipeline, or do you recommend doing it once every two months, three months? What's the typical benchmark you have seen?

Medina: You definitely want to get to a point that you've automated your chaos engineering experiments. So implementing it into your CI/CD pipeline is the end goal at the end of the day. But it takes a long time to get there. So at least the companies that we're talking to, they haven't gotten to the point that they're implementing it into their CI/CD. But getting to that end point is the ideal goal: you basically want to make some fixes, deploy, and the CI/CD pipeline will actually catch it and be like, "Okay, cool. I ran these experiments and you actually have a memory leak in your application, so please go handle that." And then you can actually deploy the fixes into your production. Good question.

Participant 3: One question. You mentioned the game days already and multiple of the other people that were talking about SRE also did. From your experience with game days, if people want to get started with this, is there anything that you would say, this is something that you have to think of when you organize that?

Medina: Yes. Game days go by a lot of different names depending who you talk to or what companies you're talking to. I've actually liked the part of calling it a chaos day and thinking of the sense of you come together as a company and instead of focusing on building more products, more tools, you actually think about how to build more resiliency into your company. So with chaos day, we've actually put together a 90-day tutorial on how to set up a chaos day in your company. And it actually comes together of like, who should you actually invite to a chaos day? And that actually includes some form of upper management, such as a director, an engineering manager, senior engineers that actually know the architecture and the dependencies of their services. And you also kind of want to bring in like junior engineers, interns, so that way they actually get to see how everything works. They actually might have a lot to say. But yes, we've put together this 90-day quick start on how to get started because it's definitely kind of hard to think of like, "Cool. I want to think about resiliency, but where do I get started?" So if you ping me, I can give you the link or if you go to it's all there.

Participant 4: Thank you so much, that was a really informative presentation. I have kind of a two-part question. So the first one is, where does like FMEA testing and chaos engineering differ in? To me they sound kind of similar, but I just want to hear where they differ. And then two, how did you get your teams, or the teams you've been on, to actually adopt this successfully and 100%? Because it does seem like it's very intensive and it doesn't seem something that SRE should be the only one focusing on.

Medina: In terms of comparing to testing, we always get those questions of how it's different from penetration testing or just unit testing, and all these other forms of testing. And it's just a different practice in the sense that, because you have to think about it as a scientific experiment, it kind of becomes a little bit more engineering. And when you're doing failure injection, you're not necessarily even thinking about the cases that you know have gone wrong. Because when you're building tests, you kind of think, "Okay, this failed, I'm going to go build a test for it." With chaos engineering, you're actually preparing for the failures that you don't know about yet. And then in terms of adoption, well, I definitely saw it at Uber where it was extremely hard to get adoption on the chaos engineering service.

So there were talks on kind of making it mandatory that if you wanted to be a critical service, or if you were a critical service, you actually had to be running chaos engineering experiments. So a lot of this just kind of comes from the folks in charge of reliability at the company, such as the SRE team, that they come together and they put those practices in place, where they're like, "Okay, we're going active update. We need all tier zero, tier one services to do chaos engineering and actually run these automated every two weeks," or something like that. But definitely, you don't necessarily want it to just fall on SRE, that they're the ones reaching out to teams, like, "Hey, you haven't run chaos engineering. Please go do it." You want to kind of think about doing chaos engineering in the sense that you want to evangelize it to the company. You want to evangelize it to other engineers so that they get to see why this is a good idea, why this is something they should actually start practicing, so they actually have fewer outages and sleep better at night without getting paged at 2:00 in the morning.

Participant 5: What are some good principles for coming up with good hypotheses that you can test as opposed to, you know, bad hypotheses that will lead you down the wrong road?

Medina: We have put together some examples of hypotheses that you can kind of look at as a skeleton when you prepare some hypotheses. But a lot of it just kind of comes up to like, “If I do this, I expect this.” And you want to think of it in the sense that you're doing it in a controlled and in a thoughtful way, that you know precisely what you're doing. So that's kind of like why I always give the example of latency, where you inject the 200 milliseconds of latency and then you can go up to like 300 milliseconds, and you're always kind of like redoing the hypothesis as you go through it.

And kind of like the same thing goes for thinking about what type of experiments you want to do. The hello world of chaos engineering, I didn't get a chance to show just because it actually doesn't do anything to containers. The hello world experiment is actually maximizing a resource such as CPU. So it's a really cool experiment that you can do, and you get to see it in the monitoring and you see that the CPU has maxed out. And then once you stop that experiment, the CPU goes down. But maximizing all CPUs for the Kubernetes containers did nothing. But a lot of it just comes back to what exactly is your end goal? What are you basing your experiments on? Are you looking at an outage that you had or another company had, and reproducing that and going slowly with the hypothesis?

Moderator: I asked this earlier, but I think maybe it's time for chaos engineering in general to have a book on the same lines as the Phoenix project, and what that did for DevOps and explaining first of all why it's needed and how it's implemented. Maybe we're there.

Participant 6: Thank you for your talk. Is there a way to find out the purposes of your systems, so run similar kinds of experiments automatically somehow? And for example, start shutting down nodes of your database and see how performance degrades, how much latency it introduces and learn actually how bad certain scenarios can be, right? Or it's just basically more like a manual process when you're actually writing down the hypothesis? Because a lot of the architectures are there, right? So with microservices you can, what if one third of all my microservices went down for a certain LB, how bad is it actually going to be, right?

Medina: Yes, it definitely goes back to the automation point of things. You can get to a point that you are using open source tools and just have to put together a whole bunch of scripts that scrape your microservices architecture and know exactly what the containers or services are called, and then just perform the experiments on a regular cadence. If you actually use Gremlin to do chaos engineering, there's an API for it too. So you're actually able to just feed in the information of what attacks to run and also just pass on the credentials of services.

Participant 7: I assume this is most effective in production, especially based on all the slides. But I was just wondering, how does it work for enterprise level software, where you have SLAs and things that are based in contracts that you need to be uptime so the numbers appear at time?

Medina: I actually don't think that the best value is always in production. I think that there's a huge value in doing it prior to getting to production. So you know that if you were to just run these experiments on production, you'll probably break stuff, and you don't ever want to run a chaos engineering experiment if you know that your systems are brittle or things are going to break. So there's that important level of work that you need to do beforehand, to build the resiliency so that you can take it to production.

And then in terms of like SLAs, it goes also into the practice of error budgets and kind of estimating on that, automating in the sense that you also are testing your thresholds. But let's say that your monitoring is showing that you're breaching your SLA: you automate the monitoring to actually trigger canceling that chaos engineering experiment. So you don't actually get to a point that you breach the SLA to your customers or to your internal customers.
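
The error budget idea can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: the function names, the request-count inputs, and the 25% reserve are all hypothetical, and a real setup would read these numbers from whatever monitoring system is already in place.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched; 0.0 or below means the objective has been breached.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1, e.g. 0.999")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_halt(slo: float, total_requests: int, failed_requests: int,
                reserve: float = 0.25) -> bool:
    """Halt the experiment while some budget is still left.

    The 25% reserve is an assumed safety margin, not a figure from the talk.
    """
    return error_budget_remaining(slo, total_requests, failed_requests) < reserve
```

With a 99.9% objective and 100,000 requests in the window, roughly 100 failures are allowed, so 50 failures leaves about half the budget and the experiment continues, while 80 failures drops below the reserve and the experiment is cancelled before the SLA itself is breached.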




Recorded at:

Feb 09, 2019