
Did the Chaos Test Pass?



Christina Yakomin discusses how to run chaos experiments with Vanguard technologies.


Christina Yakomin is a Senior SRE Specialist in Vanguard's Chief Technology Office. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering. She has earned several Amazon Web Services certifications, including the Solutions Architect - Professional.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Yakomin: Welcome to "Did the Chaos Test Pass?" My name is Christina Yakomin. In this presentation, I'll be talking to you about a new stage of maturity with chaos experimentation that my team has reached at Vanguard, where now, for the first time, we have an appetite to determine the success or failure of a certain subset of the chaos experiments that we're running. I'll also talk a little bit about what it takes to be a part of that subset, and why we aren't determining a pass or a failure for every single experiment that we run.

I am a senior architect at Vanguard in my current role. I've been specializing in site reliability engineering since about 2018, starting out with a role as a site reliability engineer supporting one of our most prominent cloud platforms for our public cloud application that we were deploying. After that, I went on to our, at the time, very new chaos engineering team, where I helped design and develop a homegrown tool for chaos engineering that I've spoken about a lot in the past. After that, I helped guide Vanguard through the adoption of a formal SRE practice through our SRE coaching team. This really helped us to enable our IT organization to adopt a DevOps model for product support. Through the enrichment and education that we've provided for the SRE roles that we've deployed, we've been able to make it easier for product teams to balance the workload of delivering features for their business, and also delivering on the non-functional requirements, including those related to resilience, through activities like performance engineering, and chaos engineering.

I've been speaking at a variety of conferences over the past couple of years, including QCon London, this year, where I spoke about our process for hazard analysis, and how we decide which chaos experiments we're going to run. In my current role, I'm an architect supporting our engineering enablement and advancement department, which encompasses the SRE practice, and chaos engineering, and performance engineering that I spoke previously about. Also, our continuous integration and continuous delivery pipeline, our developer experience function, as well as several other key functions, all focused around making sure that engineers at Vanguard are having a positive experience and are able to efficiently deliver both features and non-functional requirements for the products that they support.

About Vanguard

Now, a little bit about Vanguard, and the importance of technology at a company like Vanguard. We're a finance company first. We are one of the largest global providers of mutual fund products. We have trillions of dollars in assets under management. We're talking really big numbers, really large numbers of clients. All of the work that we do, all of the interactions that we have with those clients of Vanguard, is virtual. Unlike other asset management companies, Vanguard does not have brick-and-mortar locations where you can go in person and talk to an advisor or talk to someone about setting up your account. The interactions we have with our clients either happen via our call center, or ideally, for both us and our clients, through the web without needing to interact with an individual. That means it's really important to us to not only make our web experience positive for clients to use, but to keep it available, because the vast majority of the time when clients are trying to interact with their portfolios at Vanguard, they're doing so through the web. If we don't keep our website available, they won't be able to make key transactions on their portfolios, which can be a very time-sensitive activity, especially with the market volatility that we've seen over the past several years. Technology is at the forefront of our focus as a company, so much so that IT is the largest division at our company.

Chaos Engineering and Experimentation

The role that chaos engineering and experimentation plays. I want to spend one slide setting some context for everyone. Fundamentally, chaos engineering refers to the practice of injecting a fault into a running system, and then observing the resulting system behavior. There are a few different types of testing you can do. I like to categorize chaos engineering a few different ways. The first category is exploratory testing. This refers to going into your system and injecting a fault when you're not quite sure how the system will respond. You do this for the purpose of learning about your system, when you know how you've designed it but this particular stimulus might be more of a mystery to you. You typically do experiments like this under careful observation in a controlled environment that is isolated from potential client impact. On the flip side, you might be testing with really well-formed hypotheses about your expectations of system behavior. For these types of tests, maybe you did a hazard analysis activity upfront and developed an "if we inject this fault, then the system will respond in these ways" sort of hypothesis. By injecting the fault during the test, you're either validating or disproving the hypothesis that you came in with. Just because you have a hypothesis doesn't mean that that's the actual behavior that you're going to end up seeing from your system. You're usually coming in with some level of confidence that you know how the system is going to behave.

Another way to categorize chaos experiments is testing the sociotechnical system versus testing just the technology. What this means is, when you're testing the technology, you're injecting the fault with the intention of observing how the system independently will handle it through automation. Whether you've built in some automated resilience mechanism through failover, autoscaling, or some other type of automated recovery through a fallback mechanism or something else, you want to make sure that without any human intervention, the technology is able to handle the fault that you are injecting, without any adverse effect or with minimal adverse impact to the client experience. Testing the sociotechnical system involves bringing the human element into the test. When we test the sociotechnical system, we're talking about injecting a fault that in some way we expect the humans that support our systems will need to get involved in. Maybe you inject a certain fault, and the plan for recovering from that is failover, but it's failover that isn't automated. You run this test so that your engineers can practice the incident response procedure that they'd have for this situation and ensure that you have the appropriate access to the toggles, you know where to go to find them, and that the failover happens the way that you expect it to. A combination of human and system behavior being observed.

At this point at Vanguard, we've done tests that fall into all of those categories, and somewhere in between. When I say we're reaching a new level of maturity with chaos experimentation, what I mean is that our highly motivated engineers are finding that when they sit down for their game days, whether that's quarterly, monthly, even more frequently than that, oftentimes the list of chaos experiments that they've tried in the past and that they want to run again, and the new chaos experiments they want to run, is growing pretty quickly, at a rate that they're having trouble keeping up with, and they're having trouble fitting it all in. They're noticing that a lot of the time, they're just repeating tests they've run before expecting the same results they've seen before. The only reason they're repeating these tests is because they've continued to iterate on their system. They understand that the underlying system that they are testing has been changing, so the test, just like a suite of unit tests, needs to be repeated in order to ensure that the resilience we expected and observed in the past is still present. If this is something we've done before and we've seen the results, we know what to expect. In other aspects of software engineering, we'd call that an opportunity for automation. We want to reduce the cognitive load that the development teams are encountering with chaos experimentation by taking those tests that are repetitive, and automating them, so that we can confirm that as teams have been iterating on both the code bases and the system configurations, they haven't introduced any adverse effect on the resilience that we've previously observed in our fault injection tests.

How Do We Know if a Chaos Test Has Passed?

That brings us to, how do we actually know particularly from a programmatic standpoint, if a chaos test has passed? First, you have to determine, is the app even healthy to begin with, before you do anything? Especially if, like Vanguard, you're running your tests in a non-production environment, where the expectation for stability isn't quite the same as your production environment. Then after you've determined you have a healthy application, you have to make sure you can programmatically assert what you actually expect will happen. Just in general, this needs to be the kind of test you come into with a hypothesis. You need to have an expectation of what will happen. It doesn't always mean the expectation is the app will stay healthy. We'll talk a little bit more about that. You do need an expectation. Then we need to, again, programmatically determine, does the actual observed behavior in response to the fault align with the expectation that we have, with the hypothesis that we came in with? Observed behavior in this case, like I said, doesn't always mean a totally stable system. We might be ok with observing a failover with no client impact, but something happening behind the scenes. We might want to make sure that it's ok if we have a dip in our availability for a minute or two. Once we start to talk about five or more minutes of downtime, we're not ok with that. We can start to make some assertions around the speed of recovery, given a certain fault. Lots of different complex expectations we may have that we're now trying to programmatically assert.

Chaos Test Assertions at Vanguard

At Vanguard, we did have some tailwinds working in our favor to help us accomplish this. I want to acknowledge some of those upfront before we continue. At Vanguard, the vast majority of chaos experiments that we are executing are being orchestrated through this homegrown tool that I've talked about in the past called the Climate of Chaos. You can search for my past talks (Cloudy with a Chance of Chaos is the title I used for a few different talks on that topic) to get more information about why we built that tool, how we built it, and what it does. Then later in this presentation I'll show some architecture specific to the chaos test assertions piece. Another tailwind we had is, in most applications at Vanguard we're using CloudWatch alarms. There was some standardization there for the notification of an application becoming unhealthy. Several of our core cloud platform teams had centralized on CloudWatch alarms. The types of teams that were leveraging chaos experimentation, and were ready to automate, were also the types of teams that were leveraging that standardized platform. Or at least, were comfortable enough using CloudWatch alarms to determine application health statuses. We figured, since we built the product ourselves, we have the ability to customize it: we can integrate it with existing CloudWatch alarms, or create new ones, and build assertions that we can apply in the context of our experiments. That's exactly what our chaos engineering team has done.

Chaos Test Assertions, Dynamo Table

Here, this is the same information displayed twice on the screen, once in the form of a JSON object, and a second time at the bottom, which is what I'll run through: an excerpt similar to what our Dynamo Table looks like. You can follow along with whichever one is more palatable to your eye. The core of the chaos test assertions is this Dynamo Table that we built in AWS. We have a unique name that we give every assertion so that it can be uniquely identified. Then we tie it to some other metadata. We associate each assertion with an application identifier, three-letter code, whatever you want to call your unique identifier for an application. This is the same key you can use in our CMDB to pull up metadata about the application, including the application owner, the department, the cost center, the business purpose. Next, we tie it to a CloudWatch alarm. Every assertion is a one-to-one relationship with a CloudWatch alarm, so we get the AlarmName, and the region of AWS where that alarm is stored.

Next up, we have a Cooldown period. When we're talking about chaos test assertions, we want to talk about the app's state, both before the experiment, during the experiment, and after the experiment. In order to get that after, we need to configure, how long do we wait before we consider this after? That's what the Cooldown does. This is in terms of seconds here. You wait 60 seconds, 45 seconds. Some teams will wait 5 minutes or more before they actually go ahead and check the after state. It's all depending on what your recovery time objective is going to be. Next, we have a description. This is optional. If you have maybe multiple assertions, more certain experiments that you want to keep track of, you can add the description there, so you know what the assertion is doing beyond just the uniquely identifying name that you gave it. Then we have a bunch of Boolean values. Some of these I'll talk about now, some I'll save for later. It's basically that, if we see the alarm go into alarm state before, during, and after the experiment, at any of those points, if it's in an alarm state, do we consider the experiment a failure? Most teams are definitely going to say, we consider it a failure if the alarm is in an alarm state after the experiment. Some teams will want to say it's a failure before we even get started. You don't want to run an experiment on an unhealthy app, so we can abort. Others it's, we want stability throughout, so during the experiment, even if the experiment is ongoing, we shouldn't see the alarm state. It depends on the team. ShouldPoll and DidAlarm, we're going to save those for the next slide.

Chaos Test Assertion Checker

There's two key workflows that I'm going to walk us through. This is where we're going to start to get into the technical implementation details of this, and all the icons that you'll see here are going to be AWS specific. There are ways that you could implement this in other clouds or even in an on-premise environment, just using different underlying technologies. Since a lot of folks are going to be familiar with the technologies at AWS, I figured I would include the AWS services that we are leveraging. The Chaos Test Assertion Checker is something you can think of as the background process. The main purpose of the Chaos Test Assertion Checker is to help us figure out if we go into an alarm state during an experiment, not before, not after, only in that during period. The way that we do this is with a CloudWatch Events Rule that fires up every single minute on a Cron schedule. That rule's entire purpose is to trigger our lambda function to get going. The first thing that the lambda function does is reach out to a Dynamo Table, the one that we talked about on the previous slide, with the Chaos Test Assertions. It's going to look for that ShouldPoll column that we skipped over. If ShouldPoll is set to true, what that means is that we are currently in the process of running an experiment for that particular assertion. We want to poll the status of that alarm associated with the assertion.

If we don't find any, that's a pretty normal case. Usually, the majority of the time, we may not see that an experiment is ongoing with an assertion, especially overnight outside the context of the business day. If we don't see any item in the table with ShouldPoll set to true, then we complete our execution here without proceeding. If we do find any with ShouldPoll set to true, let's say we found a couple of them, then the lambda function is going to make a request to the CloudWatch API and retrieve the current status of all the alarms that had ShouldPoll set to true. We'll send the name and regions of those two alarms. We'll interact with the CloudWatch API, and get back the current status of those alarms. We only want to do this for those alarms where ShouldPoll is set to true, because working in a large enterprise, there are times when you run up against your API interaction limits for AWS. While once per minute isn't going to end up with us getting throttled by the CloudWatch API, other teams are also using the CloudWatch API in the same AWS accounts. We want to be as mindful as possible about our consumption of those APIs. Anything that we can do to limit both lambda execution time for cost purposes and API consumption on the AWS side, we're going to try to do.

We're going to get those alarm statuses back from the CloudWatch API. If any of the alarms were in the alarm state, at that point, we make another request to the Dynamo Table. This is where that final table column comes in, the DidAlarm. We will set that to true if we found an alarm that's in an alarm state. Because what that means is, during the context of the experiment, as indicated by ShouldPoll currently being true, the CloudWatch alarm was alarming, and so we want to reflect in the table that it DidAlarm. We have to store this in the table, because by the time we get to the end of the experiment, the alarm might no longer be in an alarm state. We have to capture this during status. There's nothing going on in our actual execution of the experiment that allows us to constantly watch the CloudWatch alarm status. A lot of information thrown at you there about this background process. Its primary purpose is to update the DidAlarm column in the table. Now as we move forward, I'll talk to you about the actual experiment execution workflow and how we set ShouldPoll to true and back to false, and how we consider the results of this background process.
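The decision logic at the heart of the checker can be sketched as a pure function. This is an illustrative reconstruction, not Vanguard's actual code, and the attribute names are assumptions; the real implementation would wrap this with DynamoDB reads, a CloudWatch DescribeAlarms call, and DynamoDB writes.

```python
def assertions_to_mark(assertion_items, alarm_states):
    """Return the names of assertions whose DidAlarm should flip to true.

    assertion_items: all items from the assertions table.
    alarm_states: mapping of AlarmName -> current CloudWatch state
                  ("OK", "ALARM", "INSUFFICIENT_DATA").

    Only assertions with ShouldPoll set are considered, mirroring the
    talk's point about limiting CloudWatch API calls (and lambda time)
    to experiments that are actually in progress.
    """
    active = [a for a in assertion_items if a.get("ShouldPoll")]
    return [
        a["AssertionName"]
        for a in active
        if alarm_states.get(a["AlarmName"]) == "ALARM"
    ]
```

Recording DidAlarm in the table, rather than just observing it, is the key design point: it persists the transient "during" state so the final verdict step can read it long after the alarm has recovered.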

Chaos Test Assertions Flow

This is that Chaos Test Assertions Flow. It's really the broader chaos experiment workflow for an experiment with an assertion associated with it. There's 10 steps here. I'm going to go through them one by one. I know that there's a lot on this slide to take in, but I'll highlight the relevant portions of this diagram as we go through, and then try to give you time at the very end to digest everything that we have talked about. The first step in the process is simply, a chaos experiment was initiated. You're going to kick off an experiment any number of ways. You may use CloudWatch Events Rules to schedule your Chaos experiment to run weekly, daily, hourly, any Cron schedule will do. A lot of teams for something like this do have their experiments running on a schedule on a regular basis, so there is no interaction from the team required whatsoever. You may also have a programmatic execution via an API, whether you're kicking that off from a script that you have scheduled to run some other way, or you are hitting a button and executing your script. We give our engineers an API for this tool so that they have some customizability in the way that they choose to invoke their experimentation. Or, if that's not your speed, if you're not necessarily even an engineer, maybe you're a project manager who wants to run an experiment, that's ok, too. You don't need to understand how to stand up AWS infrastructure, or how to write code. We've got a UI as well, where you might be able to come in and insert just a few key details about the experiment you want to run, and kick off the execution manually that way. Regardless, one way or another, an experiment is initiated. There's an assertion associated with the experiment for the purposes of this particular explanation.

Step two I include because it does exist, but it's not particularly relevant to the assertions. I'll go through it quickly. Note that for the purposes of keeping this slide as simple as possible, some dependencies have been excluded. This is one of those steps where dependencies are not shown. We do a variety of pre-experiment actions and checks here. We check the global kill-switch state because we do have one of those, in the event that we maybe want to halt all experimentation in an environment. Maybe you've got a really important performance test going on that's very large scale, or maybe you have a production issue that you're trying to fix forward, and you want to ensure stability in the non-production environment for testing, we do have that kill-switch that we, the chaos engineering team can flip at any time. At this point, we also write a record of test history, so that we make sure we have a record of the fact that the experiment was initiated. We'll update that later if we get to later steps with the result of the experimentation.

Step three is the first time that we're really doing anything related to this experiment having an assertion associated to it. This lambda function is going to first retrieve all the information we need about the assertion that's attached to this experiment from the Dynamo Table. We're going to get the AlarmName for CloudWatch. Should we abort if we're in alarm before the experiment? Should we fail if we're already in alarm without proceeding? We're going to get all that information out of the table, and then retrieve the current status of the alarm from the CloudWatch alarms service through the CloudWatch API. Armed with that information, we now reach decision state. If the CloudWatch alarm is in an alarm state, and the assertion configuration indicates that we should not proceed because the experiment is already a failure if we're in alarm before the experiment, we don't have to do anything else. We can follow that dotted line on the screen and skip to the experiment result lambda at the very end, and record the experiment as failed. Because I want to explain the rest of the steps to you, we're going to assume that we reached out to CloudWatch alarms and the alarm is in an ok state, so we're able to proceed to the next lambda in the Step Functions workflow.

Next up, this lambda is where we're going to start the during phase of the experiment. We're going to update the item in the Chaos Assertions Dynamo Table, setting that ShouldPoll attribute to true. Now that means the Chaos Test Assertions Checker separately in the background, every single minute, when that CloudWatch Events Rule wakes up, it's going to be polling the status in CloudWatch alarms to see if we enter an alarm state. At this point, we're setting that ShouldPoll to true. The one other thing that we do is we update that DidAlarm attribute and make sure it's set to false, because we're just now entering the during state anew, and we don't want to accidentally get the result from a previous experiment associated with this assertion. We want to make sure that that DidAlarm is set to false, so we've got a clean slate to start the experiment. Step six, as indicated by our little storm icon here (a play off of our Climate of Chaos tool's theme), is fault injection time. It's also represented by this nonstandard icon because this can look like a lot of different things. It could be an individual lambda function. In most cases, it's actually a nested Step Functions workflow with a whole bunch of other activities inside, a combination of lambdas, wait states, decision states, and maybe even reaching out to other resources not pictured. Through a series of actions and wait states, we inject some fault into the system and experience a period of time known as our during state, which can be several minutes or even longer, which is why we have that background process constantly, every minute, checking the status of the alarm outside of the context of the Step Functions workflow.
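The during-phase setup in step five amounts to two attribute writes on the assertion item. As a hypothetical sketch (attribute names assumed, and in practice this would be a DynamoDB UpdateItem call rather than an in-memory mutation):

```python
def begin_during_phase(item):
    """Mark an assertion item as actively under experiment (step five).

    ShouldPoll tells the background checker to start watching the alarm;
    DidAlarm is reset so a previous experiment's result can't leak into
    this one. Names are illustrative, not Vanguard's actual schema.
    """
    item["ShouldPoll"] = True
    item["DidAlarm"] = False  # clean slate for the new experiment
    return item
```

Resetting DidAlarm here, rather than at the end of the previous run, means a crashed or aborted earlier experiment can't leave a stale true value that would wrongly fail this one.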

Once we finish the fault injection portion, we use that lambda that we just interacted with in step five to go back to the Chaos Assertions Dynamo Table, this time setting ShouldPoll back to false, bringing us to the end of the during state of the experiment. Back in the Dynamo Table, we had an attribute called Cooldown. In between step seven and step eight, not pictured on the screen, we're going to have a wait state. That's going to say, ok, we've completed the during, how long do we wait before we check the after status? Is it 45 seconds, 60 seconds? Is it 2 minutes, 4 minutes, 10 minutes? However long you want to wait. Then, in a lambda function, we're going to retrieve the current status of the alarm via yet another CloudWatch API call. This is where we find out: we're finally after the experiment, and past our configured recovery time objective. We're going to determine what the current state of the alarm is. At this point in time, we also want to check the Chaos Assertions Dynamo Table. What we're looking for there is the DidAlarm attribute. Because the Chaos Test Assertions Checker, that background process, has been looking all along through the during state to see if it should update that attribute. We want to check: at any point during the experiment, did that flip to true, indicating that we DidAlarm during the experiment?

In step nine, we have all the information we need to make the determination of whether or not the experiment overall was a success. We know if we were in alarm state before because of step three. We know if we were in alarm state during the experiment because of the reach out to the Dynamo Table in step eight, which tells us the results that the Chaos Test Assertions Checker observed. We also know the alarm state after, again because of step eight. All the way back in the beginning, in step three, we got the information from the Dynamo Table about when we should consider the experiment a failure: if we were in alarm state before, during, or after. Coalescing all of that information, we finally assign a status of success or failure to the experiment. All that's left to do after that is record the results, and end the state machine execution. This is that 10-step workflow, all the way from beginning to end without any of those highlights. You probably don't need the numbers because it is fairly linear on this slide. Due to the decision states and the nested Step Functions, you're welcome for my not including an actual screenshot of the state machine, because it does get pretty complex when you look at it in the AWS console. AWS Step Functions is one of my favorite services in AWS for its ability to orchestrate complex workflows like this one, with a nice visualization that follows the execution along, giving you a really nice trace of where the failure occurred if you encounter an error during the process.
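The step-nine verdict boils down to combining the three observed alarm states with the three configured Booleans from the assertion. A minimal sketch, with attribute names assumed rather than taken from Vanguard's actual schema:

```python
def experiment_passed(config, alarmed_before, alarmed_during, alarmed_after):
    """Return True if the experiment passes, per the step-nine logic.

    config holds the assertion's FailIfAlarm{Before,During,After} Booleans
    (names are illustrative); the three flags are the observed alarm states
    gathered in steps three and eight.
    """
    if config.get("FailIfAlarmBefore") and alarmed_before:
        return False  # app was unhealthy before we even started
    if config.get("FailIfAlarmDuring") and alarmed_during:
        return False  # team demanded stability throughout the fault
    if config.get("FailIfAlarmAfter") and alarmed_after:
        return False  # app failed to recover within the cooldown
    return True
```

For example, a team that tolerates a dip during the fault but requires recovery afterward would set only FailIfAlarmBefore and FailIfAlarmAfter, so an alarm during the experiment alone would not fail the test.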


I do want to give one quick reminder that you don't actually always want this, "this" referring to the assertions. Not every chaos test needs one. Personally, I don't think there's any world in which 100% of chaos experimentation can or should be automated like this, because I think that you really shouldn't be automating exploratory testing. Of course, you can't be automating the testing that requires human intervention, the testing of the sociotechnical system. If you could, then the humans wouldn't be getting involved, and that's a different type of experiment. The goal with something like this is to make sure that we're optimizing and maximizing the time spent by engineers on chaos experimentation. If we're able to take those repeatable tests that we've done before, where we've confirmed our expectations in the past, then we free up the engineers when they do sit down for their game days to perform new and interesting exploratory tests or participate in those fire drill activities where they're trying out incident response, preparing for on-call, ensuring that our incident response playbooks are up to date, and that their access is where it needs to be. On top of all that, if they're spending less time manually executing chaos experiments and observing the results, they've got more time to actually build things, whether that's increased system resilience to enable you to run some more tests, or maybe, once in a while, we actually build some new features.




Recorded at:

Sep 26, 2023