
Building Confidence in Healthcare Systems through Chaos Engineering



Carl Chesser covers how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. He focuses on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker.


Carl Chesser is a principal engineer supporting the service platform at Cerner Corporation, a global leader in healthcare information technology. The majority of his career has been focused on evolving and scaling the service infrastructure for Cerner's core electronic medical record platform called Millennium.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Chesser: I'm Carl Chesser. This talk is going to go over a story of what we did at Cerner, introducing chaos engineering and chaos engineering principles, with a strong focus on how we can build confidence in our systems. As Michelle mentioned earlier, a lot of this could be referred to as continuous resilience, but we'll be talking about chaos engineering in this talk.

Before I begin, I'll give you a little bit about myself. I usually like to explain to people where I'm from. One, a lot of people do not know what Cerner is. Cerner is a large healthcare IT company, one of the leading healthcare IT companies in the world. We have a little under 30,000 people, so we have a pretty large engineering workforce. We work on a lot of things: electronic medical records, device integrations - so smart pumps, smart rooms - as well as more aggregate levels for population health and assessing populations across all that data.

As part of that, we're in Kansas City, Missouri. Usually when I bring up that I'm from Kansas City, people say, "I've heard of Kansas City. That's in Kansas." That's true, but there's also Kansas City, Missouri. It's right on the border, in the center of the continental United States. Cerner is a very large company in Kansas City. We have campuses throughout the city. I'm on the Missouri side, but there are also campuses on the Kansas side. We'll actually drive across the state border just going between campuses.

Another interesting fact about Kansas City is that it's known as the City of Fountains. There are about 200 operating fountains in Kansas City. If you ever come visit, you might think, "They have a lot of problems in this town." No, these are intended. They're called fountains. Another interesting point about Kansas City, as probably most of you have heard, is that they have really good barbecue. Everyone from Kansas City normally has a very strong opinion about what good barbecue is. There's a large event that happens every year in the fall called the American Royal. It's the world's largest barbecue competition. Everyone who wants to be the best at barbecue comes to Kansas City for that event.

My opinion about barbecue, if you were to visit Kansas City, is to go to Joe's Kansas City Barbecue. It used to be called Oklahoma Joe's, but of course that was confusing for a place in Kansas City, so it's Joe's Kansas City. If you go there, order the Z-Man and go to their gas station location. I guarantee you'll be happy with that choice. There's one new thing you didn't think you were going to get out of this talk.

This talk will follow a bit of a lineage. First, I'm going to start off with a story. I'm going to talk through the things we worked through at Cerner, and it starts with a lot of the problems we were facing. I'll briefly talk through certain technologies we were using and trying to change, but treat those as examples: you can substitute technology X for technology Y. It's really about how we were trying to evolve these very complex systems. As I talk through this story, I'm going to briefly cut away to a concept we'll refer to as traffic management patterns. Then we'll conclude the story of the things we went to change, and I'm going to share the lessons we learned as we started introducing chaos engineering. One set of lessons is about the challenges of doing it in a large corporation; the other is about doing it in an industry that's very sensitive about these types of critical workloads. Hopefully, at the end of this, you'll have several lessons that we learned that you can apply back at your own companies.

Our Story

First, about our story. This story, I don't think, is very unique, because a lot of you can relate to the challenges of technology: technology keeps changing over time. Cerner is a very large company; we just hit our 40th anniversary. We've been around for a while, we have lots of different technologies that exist, and we have a lot of systems that are adding value. Since they're adding value, they keep living. Some of those technologies have been around for quite some time and have grown quite complex. My role is in a group called platform engineering. We work on a lot of the cross-cutting technologies at Cerner, so you get exposed to a lot of these different kinds of things you have to work on.

The first case was a challenge. The challenge was that we had a lot of these service deployments in a very complex environment, and they were serving highly critical workloads. It's like if you were to open up a door, see the inner workings of a clock, and someone says, "Go change that gear." Everyone's reaction is, "I don't know. That seems like it would cause a massive compounding issue." That was because you knew all these systems were working quite well for what they supported, but there was a lot of embedded knowledge in how they came to be and what they were supporting.

In this example, we had a lot of different ways we had deployed services over time, and we were unifying those to make sure they evolved into the future we wanted. One example: we had a lot of service deployments, these Java service workloads using IBM WebSphere Application Server. There's a lot of complexity in that technology, and we had a lot of very experienced operators, because we hosted all of this.

One interesting fact about Cerner is that we host a lot of our clients. In Kansas City, we have a large fiber network, so all these hospitals remote into our data centers to access that software. We know how to operate that software really well, but when you go look at what's running in production, there's a lot of tribal knowledge built into how to optimally run that workload. We wanted to change that workload, but we realized there was a lot more there than just the code that's running; a lot of it is about that environment.

With that same problem of what we wanted to change, we wanted to essentially build something incrementally while allowing for change. We wanted a simple way to understand the system, because we were going to keep discovering new things about this very complex existing environment, and we wanted to carry that environment forward. Instead of coming in and saying, "Yes, we want to do Kubernetes; forget all the rest of those old deployments, we're not doing them anymore," we said, "No, we have to think about how we transition that existing technology to carry it forward." We figured out ways to phase it, because we didn't want multiple ways of doing deployments.

One way we started going about this, which I found to be very valuable, is we created our own little declaration of what a service workload is. Like most things now, it had to be in YAML: a declarative description that we kept in a central repository. It was just a collection of all the facts about a service. A service is declared in this way, so we know all the facts about it. We knew about the humans associated with that service, so we called them service owners. We had this declaration of what something served as a service, and we had a way of regenerating or rebuilding the system based off of it.

If we changed some other technology, we were still using our own specification of what a service was and generating those things. These service profiles served communication in many different ways. We also used them to communicate out to teams with our scorecarding of a service, so we could assess services and communicate that to the service owners tied to them.
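To make this concrete, a service profile along these lines might look like the following YAML. This is a hypothetical sketch - the field names and schema here are invented for illustration, not Cerner's actual format:

```yaml
# Hypothetical service profile: a central, declarative record of the
# facts about a service, including the humans who own it.
service:
  name: allergy-service          # illustrative service name
  owners:                        # "service owners" tied to the service
    - team: clinical-platform
      contact: oncall@example.com
  runtime: dropwizard
  deployTargets:                 # deployments regenerated from this profile
    - dcos
    - kubernetes
  healthCheck:
    path: /health
    intervalSeconds: 30
```

Because the profile is the source of truth, changing a target technology means regenerating deployments from the same declaration rather than editing each deployment by hand.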

As we were building these systems, we realized we were seeking more and more availability functions, and that kept growing the complexity of the system. Given the earlier example - we were already dealing with IBM WebSphere Application Server deployments - we wanted to introduce container deployments for the newer services we were building, and we were trying to merge those together. All that technology composed together was obviously more than what one person could hold in their head, and it became difficult to understand how far we could take this and how safely we could manage that larger workload.

As we were going through it, we realized we were entering a newer mode where we didn't understand all the boundaries of safety, because we were hitting lots of complexity at once, but we wanted to make sure we could actually change the system. How do we live in this world where we know we're getting into more complex states, because we want higher availability functions, more resilience, and lots of cross-site deployment type work?

As we were going through this - Cerner, as I mentioned earlier, is a large company - we have lots of different organizations. When we look at who all operates certain services, there are different organizations that care about that thing. When we develop something, we have operators of it and we have infrastructure folks, and they're all from different parts of the organization. At this time, we were really closely aligned from our platform engineering team to our operations organization, to where they would all sit in the same area. They were all part of the same view.

We viewed ourselves as the same team, but we also recognized that we had different reporting structures. When that happens, you know you can get conflicting incentives or initiatives. If we think we should go do this, another organization says, "No, this is more important, this is what we're going after." We thought we had pretty good alignment there, but we also had another group we wanted to build alignment with, and that was our infrastructure team. We have pretty large data centers that we operate; these were the data center folks who would be racking and configuring infrastructure as we scaled out the deployment. We wanted to get really close alignment there because, as we were expanding and building this system, we realized we didn't understand all the other things we were actually influencing with our technology choices in that space.

As it happened, our infrastructure group was also introducing OpenStack. OpenStack on on-premise infrastructure is quite complex. I don't know if many of you have used OpenStack before, but it takes a very strong talent set to operate and manage it. As we were going through it, there were choices being made in building and configuring this infrastructure that we knew we wanted to be very closely aligned on. We didn't want them to make a lot of optimizations and then say, "Here's OpenStack, now go use it." We wanted to be really closely aligned on how we were configuring and testing that infrastructure because, in that case, we were finding out about things like availability zones within OpenStack.

We wanted to test OpenStack availability zones, so that when we actually killed something in an availability zone, there wasn't some hidden, underlying shared resource that would affect something else in another availability zone. We realized that since we were introducing more and more complexity into the system, we were going to discover new things that we thought we were safe from, based on what it looked like on paper, in our design, or even in what we built out in a lab. We had many cases where we had built infrastructure - what we'd call a lab environment - where we would segment the infrastructure and say, "Here's what we'll test to validate that it works, and then we'll go apply something else in production."

We said, "No, let's get test and production as close as we can, because we want to validate the system as it actually operates." As part of this story, as we started going through this challenge, we realized one thing we wanted to do early on was to unify a team. We called this the Tiger Team, because we wanted to unify a group of people across organizations as one team. This required a bit of grassroots communication up to our leadership, to say, "We're not causing a re-org. We'd just like this other team, our infrastructure group, to be sitting with us." We had to start walking through how we were building out our infrastructure in these data center locations so that it was close to what we wanted to test and try, and work through all the normal day-to-day things, so we weren't communicating through a ticketing system, separated from who and what was working on that system. This was the first step we took to organically grow as a team. Again, we still had different organizational structures, but we were operating very closely together. This was a very important first step in realizing how to optimally start working through the problem.

At the same time, we were also starting to use DC/OS - in this case, Mesos - for our deployments. Much of this work started occurring around 2016 and 2017, as we were introducing DC/OS to start doing our container deployments. If you're not familiar with DC/OS, think of Mesos plus other additive pieces that give you a full, inclusive set of what you'd need for container deployments. The way we were building out our deployments using DC/OS exposed us to yet more complexity. We had to really know how Mesos and all the components of DC/OS operated, because if there was any issue, we were the front line for fixing and resolving those types of issues.

We were also organizing parts of the DC/OS infrastructure on the primitive of availability zones in OpenStack. To do that, we wanted to make sure that if something went out in an availability zone, our control plane - and what we understood of the data plane side of DC/OS - was still going to function as we expected, or fail in the right way, rather than surprise us. Both of these technologies, OpenStack and DC/OS, were new to our teams, and their effects combined. We said: we really need to understand these two composed technologies, because those by themselves are enough to go beyond what we thought we could understand from these existing systems.

This is where we took the first step of doing chaos engineering through game days. I'll come back later to recommending this kind of stepped, phased approach. We wanted to test this infrastructure, which was part of our production system but had no live traffic on it; it was a system being built in production. The infrastructure team essentially said, "It's ready for you to use. Go ahead and use it." Since we were unified as a team, we said, "Hold on, it's not production ready yet. We're going to start putting workloads on this and start doing these game day exercises."

One of the first things we started doing was saying, "Let's start building out some experiments that we really want to run." A lot of it was around our availability zone tests: let's test shared resources, let's actually kill a full hypervisor, let's actually take out infrastructure in an organized way, and have everyone be part of it to see what the effects were. It was in our production area, but it really wasn't on a production workload. It was very safe and understandable to the team. Just by having the team all work together on it, we knew how to approach it and start stepping through it.

That was a very valuable first step, because when we talked to the teams, especially our infrastructure team, they said, "This is the first time we've ever purposely shut down infrastructure. We went to the data center and powered things down." And it's, "No, it's ok. We're actually testing that this works." Whereas before, once something was set to be used for production, there was a lot more rigor and control and process around that infrastructure. By being really well aligned, we could say, "No, this is production that's being prepped for that use." We wanted to do these tests early, to find out what we were going to experience in terms of failures.

At this point, we had introduced OpenStack and DC/OS and started doing these types of tests, learning quite a bit. This was in that 2016-'17 range; probably around 2017 is when we were getting more traction on what we were building. As with most technology, something else changed: Kubernetes started gaining traction. You had a lot of people saying, "Look, there on the horizon, there's Kubernetes." We're, "We're not even done with this. We're just transitioning." You would naturally say, "No, no, no, let's not move on to another technology. We're still transitioning a bunch of workloads onto this first one."

But as technology and time changed, we realized this was becoming the standard, the thing we wanted to reference in our infrastructure. We said, "We actually have to start moving toward that, but we're facing another competing challenge." The team that was managing all of our container workload deployments, focused on DC/OS, also wanted to focus on Kubernetes. That team wasn't going to get double staffed to do that work. They wanted to move to the other, but we knew we were going to keep growing the infrastructure.

That is something I've found to be repeatable as we go through technology: we're doing some type of transition from one technology to the other, and once you're almost there, you find you want to go to the next one. Then how are you going to balance that time and work? Because you don't want it to be a large tax on your overall team's delivery.

In this case, we had competing time. Our team wanted to get to Kubernetes faster, but we were now running production workloads on DC/OS. We were growing those deployments, which kept surfacing new challenges: how we were scaling things, how we were running more applications within Marathon for scheduling those workloads. This was the first case where we said, "How are we actually going to get to Kubernetes when we're spending so much mental power and cognitive effort understanding how to scale DC/OS?" We had to do something to minimize how much we kept investing in the old one, and start applying it to the newer technology, Kubernetes, at a low enough tax to the team that it was actually achievable.

We had to rethink how we were going to keep scaling out our deployments - now scaling out with newer technologies, doing assessments of those, and applying what we were just getting started with as a company: these game day exercises on our infrastructure. This is where we found we had a good primitive in place early on. Back in 2017, the team doing a lot of our container deployments was investing in Spinnaker, and in one of the 2017 releases we had contributed support for container deployments on DC/OS to Spinnaker.

That was a great first primitive to have in place, because as we did deployments - in this case to DC/OS via Spinnaker - the team built up a good understanding of how to actually do those deployments and of the infrastructure within Spinnaker, because that's quite a complex system. Then they realized, as we were going through it: wow, a lot of the capabilities we're implementing for DC/OS are already supported for Kubernetes. If we get this set up, it should be a nice way to start bridging deployments over to the other.

That's what we ended up doing with our Spinnaker deployments. Think about what I mentioned earlier: we had these service profiles, this declarative set of services we would build, and we would generate another target for the deployments. Now we would deploy to both Kubernetes and DC/OS in all these environments, cross-deploying both, all the time. The Kubernetes one was just spinning up and running; it wasn't doing anything. The DC/OS one was still the live thing, but we were continuously getting a converged state of the system on the newer set of technology. That was the first goal we had: lower the overall tax on the team to do this type of development.

Then they said, "Now the plumbing is somewhat easy. We're getting the deployments out to the infrastructure, but now we want to start assessing how this is going to work against Kubernetes." One of the first issues: we were now running with live traffic in our systems, and we were asking, "How are we going to start applying these chaos engineering principles in an environment with live critical workloads on it?" We're not going to power down infrastructure that could literally affect patient care and cause financial impact to some of the healthcare institutions we were supporting.

The typical response from a lot of people was, "We're not Netflix. Our production system is really important." The truth is, every company's production is extremely important to them, for different reasons. The spirit of what we wanted to do was still true, but we realized we had a lot of challenges to work around to figure out how to take what we were doing earlier on that isolated set of production infrastructure and start doing it on actual production workloads, because we knew there was a lot of risk built into this. We did not want this to end in a terrible story where, by introducing this type of testing to build confidence, we also negligently affected the clients who use our software.

We were hitting a point where we were saying it's too risky to run these tests against live traffic workloads. We wanted to figure out a feasible way of doing it that would be low cost and wouldn't raise bigger fears about what the system was supporting. This is where we came up with one of our traffic management patterns, which we'll talk about in a little bit: shadow traffic. I'll talk shortly about how we do traffic management in our system, but we were identifying, "Wait a minute. We want to learn from production." Production really helps validate what is actually going to happen in a system, even though we can do synthetic tests and re-simulate things we've done before.

We want production. We want that workload, but if it fails, we don't want it to affect anybody. We definitely do not want it to be the reason the system tips over. This is where we introduced shadowing, which is essentially a replay of traffic. That was an important piece, because early on in our systems we had made a choice to have a façade - an abstraction point for how we managed traffic - as the place where we could make changes to enable these types of traffic management patterns.

I'm going to cut away from our story a little bit, and we're going to talk about what these traffic management patterns are.

Traffic Management Patterns

When I talk about traffic management patterns, it's really just about directing web traffic, in this case. We have a lot of web traffic hitting the web services we currently run. We currently use an API gateway, the open source Netflix Zuul; several talks from Netflix have referenced it. This has worked quite well for us, because as we've built this out, there are different types of filters you can run within it. We have these routing filters where we can make choices about how to direct traffic within the API gateway. We also had things designed so that when traffic came into the API gateway, there was enough data on the request to make choices about where that traffic should go.
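As a rough sketch of that routing-filter idea: the gateway looks at data on the incoming request and either uses a statically configured route or falls back to dynamic discovery. This is plain Java illustrating the pattern, not the actual Zuul filter API; the names and the `discovered://` convention are invented for this example.

```java
import java.util.Map;

// Sketch of a gateway routing decision: a statically configured route
// wins if present; otherwise the service is resolved through dynamic
// discovery (Eureka plays that role in the real system).
public class GatewayRouting {

    public static String chooseRoute(Map<String, String> request,
                                     Map<String, String> staticRoutes) {
        String service = request.getOrDefault("service", "unknown");
        String staticTarget = staticRoutes.get(service);
        if (staticTarget != null) {
            return staticTarget; // explicit override to another destination
        }
        // Placeholder for a dynamically discovered instance.
        return "discovered://" + service;
    }
}
```

The key point is that the decision is made from request data alone, so routing behavior can change without consumers knowing.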

If you look at Zuul, it's a web application itself - a Java web application - and we wanted a consistent way of building and running that service. If you think about it, what is the application server this is going to run in? As we did our container deployments, we built what we called our runtime, which uses Dropwizard, a very minimal implementation. It supports a JAX-RS web application, and it became very helpful for us because of what we'll talk about a little later.

There are a lot of basic shared libraries and infrastructure components that we wanted to manage cross-cutting across all the services. When we rebuilt the system and all the containers we deploy, we could manage that runtime consistently, so there wasn't a high coordination cost of saying, "You had this old metrics library. You need to upgrade to get this next attribute to show up." We could control that from the runtime. Our API gateway uses that same runtime.

What we found helpful about this is the dogfooding effect, or sipping your own champagne. We would run all of our workloads on our own runtime, which all of our service consumers used as well. All these clinical product teams building something like an allergy system would build a JAX-RS web service that runs in that same runtime. As we found things we wanted to measure in terms of traffic, we would adapt the metrics or telemetry data in that service, and all other services would get the same benefit when we rebuilt.

Lastly, this served as an important abstraction in our system. We started thinking about how we would manage traffic and what we would have to change. As I mentioned earlier, we had this existing DC/OS system we were routing traffic to, and we were going to start routing to other systems, but we did not want our consumers to have to understand anything about these technology changes. They should just keep coming to our front door, and we could deal with it on the back side, so we had ways of moving between different technologies without consumers being aware of it.

The first thing we worked on was how we would transition traffic. If you're familiar with the Netflix open source projects: we used Eureka, as another example, for service discovery. When a request came into our API gateway, we had things set up to know what type of service it had to route to, and it would dynamically discover that service and route to it. We wanted to build a different kind of routing filter that would support what we called chaining. That's what we refer to as static routing.

When a request came in, it would say, "Actually, for that type of request, I'm just going to call this other thing. I'm not going to call my normal, dynamically identified service, I'm just going to call this other thing" - which happened to be another API gateway. The reason we called it chaining was that we wanted some basic safety mechanisms in place. We knew we were humans and would be configuring this as we built the system, and we didn't want to cause a loop: we route to this other system, and it's accidentally configured to call back to itself. Then you have this massive recursion problem, and of course the system can't take it.

With chaining, we have headers that communicate to the other service that it is part of a chain, so there's a very fast breaking point if it finds the chain coming back to itself. Chaining was the first part, so we could understand how we composed this type of traffic. The next part was canary. As we started routing traffic to these services so we could understand how to manage that workload, we said: now, built on chaining, we're going to have gradual control - think of a rheostat versus a light switch - for migrating traffic to this other system.
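A minimal sketch of that loop guard, assuming a chain header that accumulates gateway ids (the header name and ids here are hypothetical): each gateway appends itself before forwarding, and refuses fast if it is already in the chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the chain-loop guard: forwarders accumulate their ids in a
// header; seeing your own id again means the route loops back to you.
public class ChainGuard {
    static final String CHAIN_HEADER = "X-Gateway-Chain"; // illustrative name

    // Returns the updated header value, or throws if this gateway is
    // already part of the chain (i.e. the route loops back to itself).
    public static String extendChain(String chainHeader, String gatewayId) {
        List<String> hops = (chainHeader == null || chainHeader.isEmpty())
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(chainHeader.split(",")));
        if (hops.contains(gatewayId)) {
            // Fast breaking point: fail here instead of recursing forever.
            throw new IllegalStateException("Routing loop detected at " + gatewayId);
        }
        hops.add(gatewayId);
        return String.join(",", hops);
    }
}
```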

That was extremely helpful, because we could then say, "Ok, for the requests coming into our system, for that service and for that tenant, we're going to route this percentage of the traffic over to the other one." If there's an issue, we can turn it off. We essentially have a way to push config updates to these services, so the API gateway can say, "Ok, I'm going to fall back."
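The rheostat-style control can be sketched as a deterministic percentage check: hash the request id into a bucket from 0-99 and compare it against the configured canary percentage for that service and tenant. This is an illustrative sketch, not the actual implementation:

```java
// Sketch of gradual canary routing: canaryPercent of 0 sends everything
// to the existing system; 100 sends everything to the new one. Hashing
// the request id keeps the decision stable for a given request.
public class CanaryRouter {

    public static boolean routeToCanary(String requestId, int canaryPercent) {
        int bucket = Math.floorMod(requestId.hashCode(), 100); // 0..99
        return bucket < canaryPercent;
    }
}
```

Because the percentage is just configuration, pushing an update that sets it back to 0 is the fall-back switch described above.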

This was helpful, but we're still dealing with live traffic here. If I turn the canary on, start routing over, and we encounter a big issue during a game day exercise, I've still failed 5% of the traffic or something like that. That, to us, was still unacceptable, because in our case, again, it's going to affect patient care. There's a lot of process we have in place, and even if we could say, "Yes, we were intentionally trying to take out infrastructure to validate something," we did not want even 5% of traffic to fail as a result of this type of experiment.

This goes back to our last point, which is shadowing. Shadowing was similar to canary in that we were going to route traffic over, but it would essentially use a background thread and replay calls to the other system. This can bring a lot of benefit for starting to test your system. The challenge, though, is that you have to make sure the requests are safe to replay. In a web traffic flow, a lot of the time that means an HTTP GET. That's usually a call you can safely repeat, but you'll find that some services make choices like, "When I get a request, even though it's a read, I may additionally audit something." We didn't want duplicate auditing occurring, or anything like that.

It's good, one, to validate which services are good candidates for that, but also to instill stability patterns early on for how you deal with traffic. We essentially have retry logic and circuit breaking: if something is failing, we may stop for a bit, but we also may do connection retries, or retry a call. Calls that were already going through a retry flow had already demonstrated they were sufficient candidates for replay. Having that in place makes it clear which workloads are safe to replay, because they get replayed in the system today if they hit some type of timeout.
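The eligibility decision can be sketched as a simple predicate: only idempotent reads are shadowed, and services known to have side effects on reads (such as extra auditing) opt out. The method name and the opt-out list are hypothetical:

```java
import java.util.Set;

// Sketch of shadow eligibility: replay only requests that are safe to
// repeat, and never replay services that audit (or otherwise produce
// side effects) on reads.
public class ShadowEligibility {

    // Hypothetical opt-out list of services with side effects on reads.
    static final Set<String> OPT_OUT = Set.of("audit-sensitive-service");

    public static boolean safeToShadow(String httpMethod, String service) {
        return "GET".equals(httpMethod) && !OPT_OUT.contains(service);
    }
}
```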

The other part we wanted to be very sensitive about is that our API gateway takes a lot of traffic. If we added this new capability of replaying into another system so we could do these experiments, we did not want the shadowing effect to cause the gateway itself to tank. If a large burst of traffic came into the system, it would try to queue requests up in memory while serving them out to this other system, and maybe that new system is just not working because we're doing the experiment.

We didn't want that queuing to cause the service to tip over and fail, so we looked at applying a bulkhead pattern to minimize how many resources the shadowing change could consume. We have a very minimal queue, so it doesn't really back up in the service; it just stops attempting further requests once it fills up. That's an important piece when you think about what traffic you want to migrate or change, so you can introduce it safely without putting at risk the same critical service that's in place today.
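A minimal sketch of that bulkhead idea, assuming a standard `ThreadPoolExecutor` (this is illustrative, not the actual gateway code): shadow replays run on a tiny fixed pool with a small bounded queue, and anything beyond that capacity is dropped rather than queued, so a slow or dead shadow target can never back memory up into the live service.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical bulkhead for shadow traffic: best-effort replay, bounded
// resources, and overflow is dropped instead of blocking the caller.
public class ShadowBulkhead {
    private final AtomicInteger dropped = new AtomicInteger();
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 2,                              // tiny fixed pool: shadowing is best-effort
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(10),      // very small queue, no deep backlog
        (task, executor) -> dropped.incrementAndGet()); // on overflow: drop, don't block

    public void replay(Runnable shadowCall) {
        pool.execute(shadowCall);          // rejected tasks hit the handler above
    }

    public int droppedCount() { return dropped.get(); }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        ShadowBulkhead b = new ShadowBulkhead();
        CountDownLatch gate = new CountDownLatch(1);
        // Simulate a burst of 50 replays against a stalled shadow target:
        // 2 run, 10 queue, and the remaining 38 are dropped.
        for (int i = 0; i < 50; i++) {
            b.replay(() -> {
                try { gate.await(); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println("dropped: " + b.droppedCount());
        gate.countDown();
        b.shutdown();
    }
}
```

The design choice is in the rejection handler: dropping shadow work is harmless by definition, because no end user is waiting on it, while blocking or unbounded queuing would turn an experiment into an outage.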

As a result of those traffic management patterns, shadowing traffic was a very helpful first step, because it got us to where we could start doing chaos engineering against a production workload through the shadow effect. It was a subset of the traffic. Not all the traffic could be considered safe for replays like this, but it was enough that we could start running the system. The teams could then say, "I can actually see what's happening in comparison to production," and if that system were to go down, no end users would be impacted as a result.

This became especially interesting as we started collecting all the metrics back from these systems. One of the things we do on all of our requests, especially at the API gateway, is annotate incoming traffic with a correlation identifier. All the traffic going from that service to the next carries these correlation identifiers, and in the shadow replay you had a replayed correlation identifier. So for every transaction we collected, you could do a fine-grained comparison of that data. That's important, because aggregate comparisons between the two systems can mislead: the shadow only carries a subset of the traffic, so you can't just assume it's performing the same on that subset, and there may be different latency between the two.
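The replayed-correlation-identifier idea can be sketched in a few lines. The header name and the derivation scheme below are assumptions for illustration, not Cerner's actual convention; the point is only that a shadow request's identifier is derived from the live one, so the metrics pipeline can join live and shadow timings per transaction instead of only in aggregate.

```java
// Hypothetical sketch: tagging a replayed request so each shadow transaction
// can be matched back to its live counterpart for per-transaction comparison.
public class CorrelationTagger {
    // Assumed header name; real deployments would use their own convention.
    static final String HEADER = "X-Correlation-Id";

    // Derive the shadow request's id deterministically from the live id.
    public static String shadowIdFor(String liveCorrelationId) {
        return liveCorrelationId + ".shadow";
    }

    public static void main(String[] args) {
        String live = "req-7f3a";
        // A replay would carry this header, letting downstream metrics
        // join "req-7f3a" (live) with "req-7f3a.shadow" (replay).
        System.out.println(HEADER + ": " + shadowIdFor(live));
    }
}
```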

This gave you the ability to make really fine-grained comparisons between the two. As we did it, we were collecting a lot of metrics about both systems, and you could then have clear comparisons: "Look how well Kubernetes is doing compared to DC/OS in our current deployments." That was happening all the time. When we looked back at how we could lower the cost for the team that was trying to move to Kubernetes, one part was getting the plumbing set up with Spinnaker to do the cross deployments all the time. The other was the shadowing of traffic, so we were continuously learning from production traffic without doing anything additional to learn from it.

All the time, we were collecting this data and could keep comparing it against the system. That became very important for how we learned from and utilized that system. Then we hit a tipping point. We had worked for several months with this setup, deploying to Kubernetes, and things were looking great. We were comparing the systems and testing out that infrastructure, similar to what we did before with DC/OS. We even found different types of bugs in our container network setup. We learned a lot of really valuable things, and we were able to do it without affecting the actual end users at that time.

However, we hit a point with our current live system, DC/OS, where we had lost parts of our control plane — a critical mode we recognized from previous game days. We were thinking, "We have a pretty good idea of how we could get out of this, but it's going to be a lot of work." Meanwhile, we had done all this other work: we had shown how we were deploying into Kubernetes, and it was looking great. Full comparison.

We had already worked out how we planned to migrate traffic to it, we had tested those steps, and we knew how long they would take. The team had prepared a lot for this, and they knew the range of times each step would take. The issue came up at the end of the week, on a Friday — the worst type of thing. "The production system is hitting this. What do we want to do next?" You could see the team saying, "We should move to Kubernetes. It's ready. Look at all the comparisons we've been running against it all this time. We did the same type of tests we did before. We're ready to do this."

While we did not have a "migrate 5,000 machines" button on hand, it was the same feeling: it was simple in their minds. Simple, meaning they knew what they had to go do. They knew, as a team, what things they had to do to actually make that change occur. This was a really important first step to highlight: look at the changes we were now able to make.

We then applied the same thing further. As we continued doing deployments, we started building out infrastructure across different data center sites — more OpenStack in different areas doing the same thing, with Spinnaker dual-deploying and always shadowing to the other site. When we wanted to do a test, we would actually fail over from a whole other data center site, and having that shadow effect validated a whole bunch of concerns. When we did it, it actually worked. Whereas before, we'd be, "What's going on? We've tested this once a year, or every quarter, hopefully it's still up to snuff." We kept the plumbing in place to keep both those deployments going, and the shadowing effect helped us validate that the system was still operating as we expected and that all the network connectivity that should exist actually did.

That was the story piece of this, and a little segment about traffic management.

Introducing Chaos Experiments

I'm now going to go into some of the lessons we learned from all these chaos experiments. One was aligning the introduction of chaos with organized experiments. The point here is that as we worked as a unified team, a lot of the value was in the learning. If you think about the word experiment, you want to build up the scenario: what did you need to test? Plan time for that and have the team focus on it.

Using Spinnaker, there's a capability we can use, Chaos Monkey, and there are other tools out there to introduce chaos in your Kubernetes cluster. What we found was that every time we considered introducing something that caused chaos without the focus of an experiment, it became a scapegoat for mysterious problems. If something was going wrong, it was, "Yes, it's probably that Chaos Monkey," without really finding out whether that was the case. We said, "We don't want that to be the mentality. We want a very structured focus on what the problem is and get the learning out of it, in that rich time window."

Another part of this is preparing for the experiment. Introducing this does take some upfront work. With the thought of an experiment, or what we would normally label a game day, there's a plan going into it. You want to know what you're actually going to test on the infrastructure, which team members need to be involved, what you expect to apply, and what the expected effect is. You're not typically doing this experiment with no clue what's going to happen; you at least have some expectation based on what you've already tested in other scenarios, and you want to validate it up front. By doing that preparation, you also start probing: what do we have to do to measure that, to actually get the data? You'll find, in many cases, you won't have everything you want to measure, and you have to add that telemetry data, or whatever you're going to capture, before the test, so you're ready to do the comparison.

You've got the data up front, you apply the test, and then you can see if that metric changes as you expected. That preparation really helps to make sure you know what you expect, and what may actually occur as a result.

Observability here is extremely critical. What you want is to have a lot of these different types of telemetry data, and ways to rebuild your system when you find a gap, so all the systems can pick up that change. You want to keep it very simple and low cost for your teams to introduce a new type of metric you want to collect.

A big part of it, too, is that as you're doing the experiment, you're going to ask questions of the system you may not have known to ask before, because a metric will surprise you. You're, "Wait a minute. That doesn't make sense." You'll ask a different question, and you want a lot of data at hand so you can put those questions to the system and get answers without thinking, "I need to make a change to the system to get that added metric." Really focus on making it easy to instrument and change. We do quite a bit of that just through configuration at our service tier, so all the services get the benefit and it's very easy to run the test again.

You'll find that when you discover gaps, those same changes get built into your production system for the longer term. When it comes to doing the actual experiment, get a dedicated space. If you're all physically located in the same area, get an actual conference room or some dedicated area. If you're virtual, have a dedicated space where you're discussing all of it. This is important to, one, maintain focus, but also to make sure you're all present when you're doing the experiment. It's very valuable, because you don't want simple logistics to become a challenge on your game day.

You should also plan for adequate time. When you do this type of setup, plan more time than you expect, because you're going to be surprised. You'll want to run the experiment multiple times: "What went wrong? Let's try that again, because that couldn't have been the case." You try it again, and you'll find there are multiple iterations. The team would block off the whole afternoon to do it. Whether they used the whole afternoon or not, you want there to be adequate time to go through the experiment. You're making a planned investment in learning in that cycle.

Another part of this is that, with all the healthcare workloads, we had a lot of critical systems to manage and a lot of compliance rules in place. Much of what we face is around risk, so we had risk being managed through our own internal processes, which map to external compliance rules. We had both development processes we had to follow and change control processes for these production environments.

When you start going through this, you have to understand what those mean. Usually the first barrier is, "I don't understand why we have to have all these change control rules in place." It's very helpful to find the right authority who can answer those questions. I got some good advice early on: get the actual names of the experts in your company who can answer that, so when it comes up, it doesn't become this mythical thing you can't change. Usually you have internal processes mapped to external processes or compliance rules. Find out how you can get clarity: what do we actually have to have in place? This is how we map to it today. Is there another way we can do that and still be compliant and safe? Don't disregard these, and don't feel you just have to work around them. Understand what they are, and find out who the right authorities in your company are to help answer them.

Something I mentioned earlier: plan to be surprised. I don't think we had any experiments that didn't result in some surprising effect. Even if the result matched what we expected, that was also a surprise. You'd find things like, "Wait a minute. That worked out really well. Why was that?" and then investigate to make sure you understood it. Since you're going to have surprises, you want to add time to digest and understand: why was that surprising? What else did we learn from it? What follow-ups do we want to apply? Sometimes even mid-experiment you want to apply some changes so you can run the experiment again.

The other part is, as you're doing this prep, use whatever you have internally as an open, searchable repository — whether it's OneNote or something similar — where it's very easy to find all these results. You'll find that not all team members can contribute to these chaos experiments, and you want other people to learn from them. It also makes it easy to look back: what did we find in the last one that was really weird that we could carry forward into the next one?

Cross-functional involvement was really important, because this is where we got a lot of that rich learning. I feel this is a lesson learned from incident analysis as well: understanding everything that could happen. Having insight into what infrastructure we were seeing at the data center location, what we were seeing within our container orchestration, and what we were seeing from the service perspective — those are all really valuable things to look at to understand the bigger impact on the system.

All those experiences usually help demystify things you find in your environment. Having these experts in the room saying, "No, that's not how it works. This is what's really happening," matters because you may build up a theory really quickly: "Our DNS typically has this TTL configuration," or something. "No, it doesn't. Here's where that config is, and here's what it maps to." It helps you get to a faster answer in these experiments.

Lastly, this really helps prepare your team for production. When we went through a lot of this — and I mentioned earlier the transition to Kubernetes from the other environment — it really helped the team get to production. You could see they were much more confident in what the system could and couldn't do. They knew what the limits were, not just from what they'd read or from limits other people had published. By trying it themselves, they understood what signals came from that system when it was suffering.

They understood what else to expect when something failed. They'd built really strong relationships with other people in the company, finding out who to talk to when a particular issue occurs. They even found different sources of metrics in our systems that might contradict a metric seen in another system. It really built up that confidence level and prepared the team for anything else in production, beyond just that single experiment.


To wrap up, a bit of summary. One: make sure you plan your experiments. That was very valuable up front — make sure you plan adequate time for doing those types of experiments. Make sure you can identify how to make easy observability improvements in your system — how you can improve the metrics so other people can observe and get those benefits easily. If that's hard to do, you'll find these experiments don't yield much benefit in the next step, because you're so taxed: "That'll take a long time to go in and reassemble this library," or something.

Understand how you can use traffic management patterns to minimize risk so you can actually introduce this. This helped us tremendously: understanding how to manage the live traffic workflows hitting our system meant we could introduce these types of experiments early on, especially with the shadow effect. That alone gave us a very strong benefit of doing these types of tests early. Work to build those cross-functional relationships to maximize your learning. Even incident-analysis-style assessment gives you that context early: here's what's going wrong in these systems. Who knows that system best? Who can debunk that other issue? These are very valuable to have on hand.

The last one I'd like to bring up: make sure to remind your teams and leadership of the measurable improvements from this practice. This is an extremely strong benefit to have. You'll find as you're going through this, it's almost like saying, "Did you hear about that massive incident we went through today?" They're, "What? What happened?" You're, "Actually, it didn't happen. We found out how we would avoid that incident, because of the learning we had as a result of this."

Share all these big benefits you get out of that process, and advertise all the wins that come out of it. You want people in your organization to see how this is beneficial, and not something that's scary. This is helping you discover all those safety boundaries in your system, and you want other people to say, "Yes, of course we're going to do this type of test. How best can we actually introduce it safely in that environment?"


If you're curious, some of the technologies I referenced here are probably familiar. Again, we're using Dropwizard for the runtimes, and we also use their metrics library. That's been very helpful for lower-level measurements within the services. We also use Vegeta. That's what we use for synthetic tests, so when we see something in production, we can replay it and try it out. It's a simple HTTP load testing tool that gives you a simple visualization you can reference. In a lot of our experience, people would just screenshot what the client side was experiencing, whether it was connection timeouts or long response times. That made it easy, so they weren't having to rebuild a graph from some basic data.

All the notes I've shared here have been built into an 8.5x11 sheet that you can print out on your own. From the link off this QR code, you can get a printable version. I also brought some copies on card stock that I put on that back table. It's a harder piece of paper, and I made the backside a template for folding it into a paper airplane. If you've never been confident in how you build a paper airplane, you can follow this to build one confidently, and then you can do experiments on how far it goes.


Recorded at:

Feb 19, 2020