
The Scientific Method for Testing System Resilience



Christina Yakomin discusses the Scientific Method, and how Vanguard draws inspiration from it in their resilience testing efforts, covering the "Failure Modes and Effects Analysis" technique.


Christina Yakomin is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Yakomin: My name is Christina Yakomin. I'm a senior site reliability engineering specialist at Vanguard, one of the largest investment management companies in the world. The vast majority of our interactions with our clients happens through the web. The availability of our sites is absolutely critical to our success, and the success of our clients, the investors. This talk is called the scientific method for resilience. My goal is to teach all of you the technique that we use at Vanguard for identifying which chaos experiments to use to gain confidence in the overall resilience of our systems, and the cyclical step by step process that we follow to ensure continuous learning. This process is based on the scientific method that we learned probably all the way back in elementary school. To start things off, let's review what that process is. It's six steps that happen in a cycle, starting with observation. Based on what we observe, we ask questions. Then in answering those questions, we're able to derive hypotheses. We either prove or disprove those hypotheses through experimentation, determine the results through analysis. Finally, we draw conclusions. Of course, it doesn't stop there, the cycle repeats, because based on the conclusions that we've drawn, we may make new observations that lead to new questions, and so on and so forth.

How Does This Apply to Resilience?

How exactly does this apply to resilience, and more specifically, the resilience of our IT systems? At Vanguard, we're using this adapted three-step cycle to test our systems to ensure that they are resilient. The first step in this process is the one that you're likely least familiar with, and that's the failure modes and effects analysis. We borrow this from the physical engineering disciplines like mechanical and hardware engineering. The idea of this meeting is to get the members of a technical team together to discuss all of the ways that a system might fail, and what the effects would be in those scenarios. The output of that meeting is a set of expectations in the form of hypotheses. Those hypotheses are what we test with our chaos experimentation. After we analyze the results of our chaos experiments, the next step is of course everyone's favorite, documentation and planning. While this isn't the most fun, I will explain when we get there, just how critical this step is to the success of the cycle.

Scenario: Our Schedule for the Hypothetical Week

Let's talk about the schedule for a hypothetical week in the life of these engineers. Maybe on Monday at 2 p.m., we're going to have the failure modes and effects analysis meeting. We hope that it'll only take about an hour. By the end of the day Tuesday, we will produce a list of hypotheses based on the discussion. On Wednesday, we'll do all of the preparation necessary to get ready for the chaos experimentation, which will happen on Thursday at 11 a.m. in our chaos game day activity. Finally, on Friday, by the end of the day, we'll publish our findings so that we can go into our weekend.

Step 1: Observation

I'm going to walk you through this all step by step, but instead of following the steps of the Vanguard cycle, the three-step process, we're going to look at this from the perspective of the scientific method. I'll remind you where we are in the meetings and experimentation process as we go. Step one is observation. This is either the failure modes and effects analysis meeting itself or a pre-read to the meeting. Here we reference an architecture diagram of the system. When I facilitate these meetings, I typically like to share my screen and just leave an architecture diagram up for the entire duration of the meeting, while we take notes. The goal is to understand how the system works when everything's working correctly, how it was intended to work, what it was designed to do. We want to identify what the critical components are and also consider the business process flow. It's helpful if you have an associated narrative to go with your architecture diagram to help you put it into the context of the order in which these different components are interacted with. For the duration of this presentation, there will be a sample system architecture on the bottom of the screen that we will reference. It's a really simple one, and you can see it here. It's an end user interacting with some web UI, which is then making some request to a cloud-based microservice. Finally, that microservice is either writing to or reading from a database.

In practice, what this would actually look like is probably something like this. A lot of these are going to be familiar icons to you if you're working with AWS, but none of it is specific to any cloud provider, nor does it even need to be specific to operating in the cloud at all. In this case, maybe you have a webapp running in ECS, the microservice also in ECS, both of those have the ability to autoscale. That's what that logo is on the bottom left with the dotted line around them. The load balancer is the little purple circle in the front, indicating that you are balancing load across more than one individual task for both the webapp and for the microservice. You've got some good redundancy built in there. The microservice is interacting with an RDS database, that's what that top left blue icon is next to the database. All of this, again, happening within the context of AWS in likely a single region, but not likely just one availability zone. Those are AWS specific concepts.

Step 2: Question

The next stage is the questioning. Now we are definitely in that meeting, having a discussion as a team. You want to bring together not just your most senior engineers, but every member of the technical team that works on this particular project. You can also include your project managers, as they may be the ones responsible for decomposing work and maintaining a backlog, so they might be helpful to have as well. I tend to make managers optional for this meeting, though I do find these meetings to be an incredible opportunity to learn, so no one should be off limits. Now that we're questioning, we're trying to discuss how each component might fail, and ask ourselves what the effect would be in each of the failure scenarios. I actually find the junior engineers within the teams to be some of the most effective participants during this stage, because they'll ask the questions that may seem obvious to some of the more senior members, who might assume that everyone has the same idea of how things would behave in that scenario, so the question could be skipped for the sake of efficiency. The junior engineers are an absolute asset, coming through with genuine questions from their own self-perceived lack of understanding. Don't forget to include them. For the example on the rest of the slides, we're going to go deep on one question specifically: what do you think would happen if our database became unavailable?

Step 3: Hypothesis

As soon as we're talking about the answers to these questions, we are formulating hypotheses. We're basing this just on what the team knows about the system. All of them are inherently conjectures. If you have a group consensus right off the bat, then that's great, your hypothesis has been created. Keep in mind that people may not always agree. This is my favorite when this happens. A lot of times you have your most senior members of the technical team realizing for the first time that their mental model of the system in these scenarios doesn't quite align with that of their peers. They've never been faced with that reality before, because the situation just never occurred. Now that they're being forced to answer these questions and verbalize their mental models, they're able to realize that not everyone has exactly the same idea. Also, those scenarios where not everyone agrees are a perfect signal to you that it's a really great opportunity to do some experimentation. Not so that you can prove someone right and someone else wrong, but for the benefit of the entire team to understand what the reality of the system behavior would be. For our example case, let's say if the database became unavailable, we're going to conjecture as a group that writes would fail because there's no database to talk to, and we haven't put in any effort to decouple that interaction with a queue of any sort. We think that reads would be ok. They'd be served from, now this is a new implementation detail we're becoming aware of, the microservice's in-memory cache. When we get to the experimentation stage, we're going to talk about what it would look like to test this scenario.
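To make the read hypothesis concrete, here is a minimal sketch of an in-memory cache with a time to live sitting in front of a database read. It is an illustration only: the talk does not show the microservice's code, so `CachedReader`, `DatabaseUnavailable`, and the `db_read` callable are hypothetical names, and the real cache may behave differently.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) database client when the database cannot be reached."""


class CachedReader:
    """Serve reads from an in-memory cache while entries are younger than the TTL."""

    def __init__(self, db_read, ttl_seconds, clock=time.monotonic):
        self._db_read = db_read      # callable(key) -> value; may raise DatabaseUnavailable
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the TTL can be tested without waiting
        self._cache = {}             # key -> (value, time the value was stored)

    def read(self, key):
        entry = self._cache.get(key)
        if entry is not None and self._clock() - entry[1] < self._ttl:
            return entry[0]          # fresh cache hit: the database is not called at all
        value = self._db_read(key)   # miss or expired entry: this raises if the DB is down
        self._cache[key] = (value, self._clock())
        return value
```

Under these assumptions, a key that was read before the outage keeps being served until its entry expires, after which reads start failing too. That is the "for a while" in the hypothesis, and it is the kind of detail a chaos experiment can confirm or contradict.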

Failure Modes and Effects Analysis

First, I want to show you what the output of a failure modes and effects analysis looks like, because we're coming to the conclusion of that meeting and getting to the point where I'm going to deliver my list of hypotheses. I typically like to take notes in this format, and I provide a template to the teams that are going to be conducting this analysis that looks just like this. We start with understanding the process step. First, the web UI sending the request to the microservice for a read, then the microservice trying to read. The next one is the web UI sending a request to write an update, and then the microservice trying to actually write that update to the database. Keep in mind, I have a one to one relationship here between the process steps and the failure mode, but that is not necessarily going to be the case. You may find that for an individual process step, there are as many as six or seven or more different failure modes that you can think of. If all of those have been determined to be in scope for your discussion, then by all means put them all in there. For the sake of slide space, I've just got the one.

Let's look at the second one first, the microservice trying to read info from the database. If the database is unavailable or returns an error, then the expected behavior is we send back a response with cached data from the in-memory cache. The hypothesis then is exactly the one that we talked about on the previous slide. If the database is unavailable, then reads will continue to succeed for a while due to the in-memory cached data. The hypothesis is simply a synthesis of the previous three columns all in one place. During the meeting, I'm really focused on the process step, the failure mode, and the expected behavior. Then after the meeting, I go and synthesize those hypotheses. I typically have one more column as well, I call it notes. That's where I mark down any interesting side comments from the discussion. I'll indicate, this one, we had really strong group consensus, we're really confident. This one we've actually tested very recently and confirmed in some way. This one, a lot of people disagreed, we're landing on this hypothesis, but we're very uncertain about it because someone had a different mental model that they thought it might work differently, we definitely want to test this one. These are just a few examples of the notes that I might take and leave in the documentation for us to reference later.
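The note-taking template described above can be sketched as a small record type, with the hypothesis derived from the other three columns. This is one possible shape, not the actual Vanguard template: the class name, field names, and sentence pattern are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One row of the FMEA notes: three discussion columns plus free-form notes."""

    process_step: str
    failure_mode: str
    expected_behavior: str
    notes: str = ""

    def hypothesis(self) -> str:
        """Combine the three discussion columns into one testable statement."""
        return (
            f"When {self.process_step}, if {self.failure_mode}, "
            f"then {self.expected_behavior}."
        )


row = FmeaRow(
    process_step="the microservice reads info from the database",
    failure_mode="the database is unavailable or returns an error",
    expected_behavior="reads keep succeeding for a while from the in-memory cache",
    notes="Strong group consensus.",
)
print(row.hypothesis())
```

Keeping the notes column separate from the hypothesis preserves the signal about consensus or disagreement, which matters later when deciding what to test first.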

Step 4: Experiment

We're into the experimentation phase. We're now preparing for and executing our chaos game day. This one's simple. Just run a test, any kind of test. It doesn't matter how you do it. I'll get a little bit more into the mechanisms for fault injection. For this specific scenario we're going deep on, it's, let's shut down our database in non-prod and test our assumption. I want to quickly call out here that I've said we're going to shut down the database in non-prod. A lot of people are very adamant that the best and only way to test is to test in production. It is the only production like environment there is, nothing else can truly ever match production. Those people are right. In my line of work, I can't go to a senior leadership team and tell them, "I'm going to break just the writes, and I haven't tested it, but I'm pretty sure the reads are going to keep working. Is it all right if I mess with Vanguard's production systems in that way?"

The answer is always going to be no, not even just because my senior leadership team isn't supportive of the chaos engineering because they are, but because in the financial sector we're so highly regulated that doing something like this would result in real penalties. For any scenario where I expect to be causing client impact, I'd rather test in non-production than not test at all, since production is clearly off the table. If you can test in production, if you're not anticipating client impact, and in particular, if you've tested and verified that in non-production already, then I say go for it. Just because I'm not doing it in this example and that I haven't done it much at Vanguard doesn't mean that you shouldn't or that you should assume I am suggesting non-production is best. I am going to say that for certain scenarios, non-production is the way to go.

Mechanisms for Fault Injection

Let's talk about the mechanisms for fault injection. I've got four on the screen. You may even be able to think of some more. The first being advanced chaos tools. AWS has a tool for this now called the Fault Injection Simulator. There's tools that have been on the market for a while, like Gremlin. These are very advanced, come with a lot of functionality out of the box, and they also come at a cost. If you are able to integrate them easily into your environment, which definitely is a big if, then it may be worth trying them out, especially if you have an organization that is already bought into experimentation, and is willing to foot the bill to fund this expense. If that's not the case for your organization, there are a lot of open source libraries that you can take a look at, Chaos Toolkit being one prominent example, but there are many. We definitely looked to a lot of open source libraries when we were getting started with chaos engineering. Even if we couldn't use them directly due to some issue with direct integration or access, we were able to draw inspiration from them. In some cases, straight copy portions of the code into our environment to get things working.

Custom scripting may be a really great option if there is some complexity of security integration that you're trying to do, or if you just want to do something really simple. Sometimes you don't need a whole library to do it. Sometimes, it is just as simple as running a short Python script. This is a lot of what we do at Vanguard. I would recommend, if you are writing these custom scripts, try to write them in a way that is as generic as possible so that they can be reused by other teams. Bonus points if you keep these in some repository that everyone at the organization can look at. It's not quite open source, it's more inner source. That's a good thing. Finally, it's still chaos experimentation, even if it's manual, don't think that it needs to be automated. In the case of the experiment we're talking about here, shutting down the database in non-prod, you could just go into the AWS console or wherever it is, and hit the shutdown button on the database, that counts. Don't think because you don't use a fancy tool or you haven't automated it, that it doesn't count. It absolutely does.
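As an illustration of how short such a custom script can be, the sketch below stops a single RDS instance through the AWS SDK for Python. It is not Vanguard's script. The `stop_db_instance` call is part of the boto3 RDS client, but the guard that only accepts identifiers containing `nonprod` is a naming convention assumed for this example, and whether an instance can be stopped at all depends on its engine and deployment type.

```python
def stop_nonprod_database(rds_client, instance_id, nonprod_marker="nonprod"):
    """Request a stop for one RDS instance, refusing anything not marked non-production.

    The marker check is an assumed naming convention for this sketch.
    """
    if nonprod_marker not in instance_id:
        raise ValueError(
            f"refusing to stop {instance_id!r}: identifier lacks {nonprod_marker!r}"
        )
    rds_client.stop_db_instance(DBInstanceIdentifier=instance_id)
    return instance_id


def main(argv):
    """Command-line entry point: argv[1] is the instance identifier."""
    import boto3  # third-party AWS SDK, imported here so the helper above stays dependency-free

    stopped = stop_nonprod_database(boto3.client("rds"), argv[1])
    print(f"stop requested for {stopped}")
```

Passing the client in as a parameter keeps the helper generic, so another team can reuse it with its own credentials or exercise it against a fake client before pointing it at a real environment.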

Step 5: Analysis

Now we get into the stage of analysis. You're going to use the available telemetry and observability tool stack to see the effects of the injected fault. This makes a couple of assumptions here. First, it assumes that you're instrumented for some observability, that logs, metrics, and maybe even traces from your system are being sent somewhere for you to observe. This is a pretty important prerequisite to doing this effectively. This also assumes that you have some amount of data in these systems to look at, that there's some traffic coming into the system. If you're testing in non-prod, that's a big assumption to make. You're going to want to generate some load for the duration of your test. It's great if you can do this in the context of a broader performance test with whichever system you're using to orchestrate those today. If not, any load generator, JMeter, Locust, or something else would be effective. What isn't effective is sitting at your desktop and hitting the refresh button about 15 or so times. That's not going to generate enough data, nor will it have enough variability to really provide meaningful data for your test.
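The point about variability can be sketched in a few lines. The example below builds a randomized mix of reads and writes across many record IDs and tallies the outcomes. It is a toy, sequential illustration and not a substitute for JMeter or Locust, which handle concurrency and pacing; the function names, the 20% write ratio, and the `send` callable are all assumptions.

```python
import random


def build_request_mix(count, write_ratio=0.2, distinct_ids=500, seed=None):
    """Return `count` (operation, record_id) pairs with a randomized read/write mix."""
    rng = random.Random(seed)
    mix = []
    for _ in range(count):
        operation = "write" if rng.random() < write_ratio else "read"
        mix.append((operation, rng.randrange(distinct_ids)))
    return mix


def run_load(send, requests):
    """Send every request through `send` and count outcomes per operation."""
    tally = {}
    for operation, record_id in requests:
        try:
            send(operation, record_id)
            outcome = "ok"
        except Exception:
            outcome = "error"
        key = (operation, outcome)
        tally[key] = tally.get(key, 0) + 1
    return tally
```

Splitting the tally by operation is what would let the analysis show reads succeeding while writes fail, rather than a single blended error rate.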

Assuming you have met all of those prerequisites, you have some diverse traffic coming in, you have the instrumentation and the observability tool stack available to you. You're able to compare your observations to the hypotheses that you had initially, and see, were the team's expectations met? In our example scenario, the answer is no. You could see at the bottom of the screen what I'm describing, a retry storm of write requests from the web UI took out the microservice. What does that mean? What happened here? Just as we expected, for a short period of time, the microservice was able to continue responding to the read requests successfully. The database was unavailable, but the in-memory cache was in place. The time to live on that cache had not yet expired, so we were able to respond just fine. While that was all happening, write requests from the web UI were coming in and failing, just as we thought they would. What we didn't expect was that the web UI, as it was written, was making unlimited retries to the backend, without any exponential backoff or circuit breaker functionality to cut it off. As more traffic came in, more retries were building up. Ultimately, we ended up with a denial of service on ourselves. The microservice probably became overwhelmed by the traffic, and instances would have started to crash, taking out the in-memory cache along with them. Now nothing's working. The cached data is gone. The instances are unavailable. We aren't able to gracefully degrade like we thought that we could.
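A back-of-the-envelope model shows why unlimited retries turn into a self-inflicted denial of service. The sketch assumes a simplified world that is not taken from the talk: time moves in ticks, a fixed number of new writes arrives each tick, every write fails, and each failed write is retried exactly once per tick.

```python
def requests_per_tick(ticks, new_writes_per_tick, max_retries=None):
    """Count requests reaching the microservice on each tick while all writes fail.

    `max_retries=None` models a client that retries forever.
    """
    load = []
    for tick in range(ticks):
        # Every batch of writes that arrived so far is still retrying once per tick...
        active_batches = tick + 1
        # ...unless a retry limit lets older batches give up.
        if max_retries is not None:
            active_batches = min(active_batches, max_retries + 1)
        load.append(new_writes_per_tick * active_batches)
    return load


print(requests_per_tick(5, 10))                 # [10, 20, 30, 40, 50]
print(requests_per_tick(5, 10, max_retries=2))  # [10, 20, 30, 30, 30]
```

Without a limit, the load grows for as long as the outage lasts. With a limit, it levels off at a fixed multiple of normal traffic. Backoff and circuit breaking reduce it further, but this model leaves them out.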

Step 6: Conclusion

We draw some conclusions. We're into the documentation stage now. I'm going to reiterate, document your work. You want to make sure that all of the steps you followed are written down, and that your observations have been captured. This is because this is a cycle, which means you're going to be repeating this in the future. If you don't have all the steps written down, it'll be much more difficult to repeat exactly what you did the first time, especially if the people that were involved the first time aren't available when you attempt to retry. If you don't capture the observations, either by taking screenshots of the different dashboards you were looking at, or providing links to different queries, if you are able to save the results of those queries in perpetuity. Or even just saving what queries you ran, and then a screenshot of what you saw. If you don't capture all of those, you won't have anything to compare to, aside from your memory, which will get fuzzy with time. You want to make sure that as you make system changes, depending on the end result of your test, you either maintained the good behavior you saw before without deviation, or you improved and made some progress toward remediating unexpected behaviors that you would see.

Next, you want to spend some time action planning. This is something you want to do, even if the system lived up to your expectations. The result may be, we are consciously deciding not to take action, but you still want to talk about it. Just because the system behaved the way you expected doesn't mean it behaved the way that you ultimately would want it to in the ideal case. Have that conversation no matter what. Then modify the variables by making system changes, and repeat the cycle again. Even if you walk away saying all of our expectations are met, and we're not going to make any changes as a direct result of what we've seen today, your variables are still going to change. You don't get out of repeating it for that reason. The variables that are changing are the new features that you're adding into the system. Maybe you'll migrate to a new architecture down the road, something more modernized. You'll want to repeat this periodically to ensure that as you're making normal business as usual changes to the system, the behavior that you observed previously, is still the same. In the case of our example, we'll implement a circuit breaker in the web UI and some better retry logic for the microservice, and we'll be more resilient to the database failures. Then we'll be able to retest and see how everything works the next time.

Action Plan

Here's what an action plan would look like. We have the process step, the failure mode, and now instead of the expected behavior, we have actual behavior. This is what you observed. Maybe you'll even have the expected behavior in a column here. You can imagine the expected behavior column in between the failure mode and the actual. Then you have your desired behavior. Is it different from what you saw? Finally, you have the remediation plan to address any gap between the actual observed behavior and the desired behavior. For the first process step, the web UI was sending a request to the microservice to read info, and the failure mode is that the microservice is unavailable. The actual behavior was that the web UI retries forever with no limits. The new desired behavior is that we'll use a circuit breaker to fail fast without overloading the microservice. Remediation plan: implement the circuit breaker pattern around the microservice request.
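For readers who have not met the circuit breaker pattern, here is a minimal single-threaded sketch of the idea. It is written in Python for illustration even though a web UI would more likely implement it in its own language or use an existing library. The class name, thresholds, and half-open behavior are assumptions, not the remediation Vanguard built.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the backend while the circuit is open."""


class CircuitBreaker:
    """Wrap a callable and fail fast after repeated consecutive failures."""

    def __init__(self, call, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self._call = call
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def __call__(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("failing fast; backend not called")
            # Reset window elapsed: fall through and allow one trial call (half-open).
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0      # any success closes the circuit again
        self._opened_at = None
        return result
```

While the circuit is open the backend receives no traffic from this caller at all, which is what stops the retries from piling up on a microservice that is already struggling.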

Second one, microservice reading info from the database. If it's unavailable the expectation is that we would send back a response with cached data from the in-memory cache. That's what actually happened. That's also what we want to have happen. In this case, there's no action required. When we're sending a request to write an update to the database, this is again a situation where the UI retried with no limits. We want to make sure we implement that circuit breaker and fail fast. Finally, for the microservice, writing updates to the database, if the database is unavailable, the behavior was respond to the UI with an error indicating downtime. This was what we wanted initially. Now that we're planning to implement a circuit breaker, the desired behavior may have changed. We can now put the retry logic in the microservice instead of in the UI.

The microservice might want to handle transient database failures by using limited retries, definitely with some exponential backoff as well. This is optional. You may not want this. You may still keep the retry logic coupled to the circuit breaker. In this scenario, for the sake of having a few remediation plans, that's what they decided to go with. We'll implement the retry logic around the database request as our remediation plan. You can see how you can take the remediation plan and turn that into stories for your backlog, tasks to pick up just like you would any technical debt, or maybe you have a portion of time dedicated to working on remediation of anything that is having a negative impact on the resilience of the system. That's when you'd pull this in. Or if you're really lucky, maybe you have an SRE dedicated to the team who would treat this as their business as usual work.
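The "limited retries with exponential backoff" remediation might look roughly like the sketch below. The function name, default limits, and the use of random jitter are assumptions for illustration, and `operation` stands in for whatever database call the microservice makes.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call `operation()` up to `max_attempts` times, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            # Double the delay ceiling on every failure, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Wait a random fraction of that ceiling so callers do not retry in lockstep.
            sleep(delay * rng())
```

The hard attempt limit is the important difference from the web UI behavior in the example: once the limit is reached the error goes back to the caller, so a long database outage cannot build an unbounded backlog of retries inside the microservice.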

Vanguard's Real Stories

I want to share some of Vanguard's real stories. That last story was completely fabricated for the sake of illustrating the cycle. Vanguard really does do this. Our first FMEA was several years ago at this point. I was there for that very first FMEA. I wasn't even the facilitator, I was a participant, and I was helping to evaluate the practice. At the time, we were lifting it directly from those physical disciplines that I was talking about, hardware and mechanical engineering. Honestly, it went really badly. It went really poorly, to the point where I walked out of that meeting and talked to my colleague who had been facilitating, and said, "I don't think this is going to work for us. I don't think that we should adopt this practice. It really isn't valuable." We spent upwards of four hours going back and forth, discussing everything as a team. The reason for that was we were trying to assign a quantitative score to each of the failure modes and the expected effects, a few quantitative scores, likelihood to occur, severity if it were to occur, difficulty to detect. These are valid, helpful. However, we ended up spending most of our time going back and forth on things like, is this a 7 or an 8 out of 10 on the severity scale? The majority of the learning wasn't coming from that time.

It came down to ROI. Yes, we were getting some value out of the quantitative scoring, but not nearly as much value as we were getting out of the discussion of the failure mode and its effects. We nixed the quantitative analysis entirely and focused on the discussion, the questions and the answers. Then we moved right along. This cut down the time that it took to complete an FMEA significantly. Now when I do these, they take me 45 minutes to an hour for a typical web based workload, more like 2-ish hours for those really complex workloads like underlying infrastructure platforms. This is way better than what we were trying to do before, and really maximizes the efficiency for learning, which is the goal of the entire exercise. Now, we are doing this practice all the time. We've gone from, I don't think it really makes sense for us to do this at all, to actually enforcing that our most critical workflows complete this exercise.

In Vanguard's case, we have done this for the ability for a client to view their balances, the ability for a client to log in to their account on our site, and the ability for a client to make transactions. I've gone in personally and facilitated with teams that support all of that functionality, and conducted failure modes and effects analysis for them. I took all of the notes, and we produced a report that was around 100 pages long with all of the tables included. In that report, we had a list of hypotheses, we had a test plan. Then we planned a chaos game day. We did all of the analysis. We collected data and screenshots. Some of our hypotheses were proven. Some of them were disproven. We came out of the whole thing with an action plan. That was a couple of years ago that I produced all of those reports. Since then, the entire exercise has been repeated.

Some of the chaos experiments were run again, quite recently. I've seen this work. I've seen teams learn about themselves and their mental models. I've seen them learn about the way that their observability tool stack works. I've seen them learn about the ways that the system works. Now we are doing FMEA at scale. I am not going in and facilitating every single one of the failure modes and effects analysis at Vanguard. That would not be effective. Instead, I've worked with my team of SRE coaches to put together a curriculum of material. It's full of everything you need to know to be a site reliability engineer at Vanguard, but it includes a lot of content around these practices, the failure modes and effects analysis, and the chaos experimentation. We have given all of the teams at Vanguard documentation, and even short videos and examples of reports that have already been created. They have everything that they need to go and facilitate these exercises themselves. I even have templates with the table already structured, so teams can copy the template and fill it out over the course of their meeting. For teams using similar architecture patterns at Vanguard, I have a few templates that have a few of the questions already filled in, such as what would happen if the downstream microservice timed out or returned a 500 error? What would happen if you lost a task on ECS, in the middle of processing a request? All of this has enabled FMEAs to absolutely take off at Vanguard, as well as the subsequent chaos experimentation. Now we are running hundreds of chaos experiments at Vanguard.


If you want even more details about chaos engineering specifically, you can check out a presentation that I gave at SREcon, it's called Cloudy with a Chance of Chaos. It's out there on YouTube. It tells you all of the details of how Vanguard is executing chaos experiments. Not much has changed besides the addition of new features since I first gave that talk. The underlying infrastructure for the platform we built for chaos engineering is still the same and we are using it now with hundreds of teams today.

Questions and Answers

Watt: Is there a recurring meeting or an FMEA day? What kicks this process off?

Yakomin: The FMEA, I typically recommend approximately an annual cadence for revisiting that, if the architecture isn't undergoing major changes sooner than that. If you're suddenly undergoing a cloud migration, obviously don't wait for the annual cadence, but if you're just doing normal feature delivery, you're probably going to want to do this approximately annually and then the experimentation piece, the cyclical piece where you're maybe not redoing the entire analysis but repeating the experimentation, I tell teams to do that at least quarterly, and tie it to what we have at Vanguard as a quarterly planning cycle. I'm not going out and scheduling FMEA days as a holiday at Vanguard IT for everyone to do this. I typically leave teams up to their own devices to fit it in with their quarterly planning cycles. I even know teams that do schedule FMEA and game day sorts of things, once per month. Every other technical sprint they're picking a day and it's a technical enrichment day, and they're doing the resilience investment in their systems, in one way or another. It varies.

Watt: One of the questions is around when you have limited time. You may have loads of things that you need to test, how do you prioritize? How do you guys at Vanguard go about prioritizing which things to look at and maybe what to leave?

Yakomin: The first priority is getting the initial analysis done, especially because that's something that, like I mentioned, the typical web-based workload can knock out in about 45 minutes to an hour. Once you've done that, when it comes to prioritization of the testing, I like to look first at anywhere where there was disagreement in the room so that we can very quickly address that disparity in the mental models. I also like to look at things like feasibility of the testing: tests that you have a precedent for running already, that you know how to run, and tests that have a limited blast radius. Anything where you aren't going to need to do large scale collaboration across many teams, because those are the things that you're able to knock out more quickly. They are more feasible and they are lower coordination effort. After that, you can get into those higher complexity, higher effort, higher coordination items on the list. You can even go through and think, how likely is this really to happen? How worried about this are we? The last thing I'll say is focus on the scenarios where you expect your system to be resilient. Put the ones where you're expecting that the system isn't going to be able to respond at the bottom of the list, because it's a lot more valuable to get feedback on whether or not your system can be resilient, than whether or not you are right, and it's going to cause an incident. You'd rather be surprised by a non-incident than by an incident, I would think.
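One way to turn that answer into a repeatable ordering is a simple sort key. The sketch below is only one reading of the advice: the field names are hypothetical, and ranking "expected to be resilient" above "team disagreed" above "easy to run" is an assumption about relative weight that the answer does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate chaos experiment taken from the FMEA hypotheses."""

    name: str
    team_disagreed: bool       # the mental models differed during the meeting
    has_precedent: bool        # the team already knows how to run this test
    small_blast_radius: bool   # no large cross-team coordination needed
    expect_resilient: bool     # the hypothesis says the system should cope


def prioritize(candidates):
    """Return candidates in suggested testing order, highest priority first."""
    def key(c):
        # False sorts before True, so each term is negated to put "yes" answers first.
        return (
            not c.expect_resilient,
            not c.team_disagreed,
            not (c.has_precedent and c.small_blast_radius),
        )
    return sorted(candidates, key=key)
```

Because `sorted` is stable, candidates that tie on all three criteria keep the order they had in the FMEA notes.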

Watt: At the moment, we are living in a scenario where everybody is distributed, they're all over the show. They're working remotely, in different time zones, all these types of things. Is it possible to conduct an FMEA exercise asynchronously, for example, if you've got a globally distributed team spread across different time zones? How does it practically work in that case?

Yakomin: It is possible but it is difficult. There are disadvantages to conducting this asynchronously, though there are going to be scenarios that necessitate it. If you're in one of those situations, with a globally distributed team or a team that simply isn't going to be able to get on a synchronous meeting for whatever reason, the way that I recommend doing this is designating a facilitator, in particular a third party facilitator if you can grab one. Not meaning third party external to your company, but phone a friend on a nearby team. Have them set up the template for you, maybe with some questions, and then have everyone on the team asynchronously fill out the failure modes only. Just the failure modes, none of the expected effects. Send that back to the facilitator, have them gather all the different failure modes that were contributed into a document, and send it back out to everyone individually, not collaboratively. Because what we want to avoid here is a situation where the one architect gets to it first and fills out the entire template, then everyone else says, that looks good, and confirmation bias kicks in. We want to force the conflicts to actually come up, if there is disagreement. Let everyone raise the questions, then let everyone individually answer the questions. Have someone synthesize all of that together, whether it's a project manager or a third party. Send that back out to the entire team for review after the fact. You can see why that takes a little bit longer. It's less ideal than doing it all in a synchronous meeting, but it can work if that's the situation you're in and you need it to be that way.
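
The two asynchronous rounds described in this answer can be sketched in a few lines of Python. This is a hypothetical illustration of the facilitator's bookkeeping, assuming submissions arrive as plain dictionaries keyed by team member; the function names and data shapes are invented for the example and do not come from the talk.

```python
def merge_failure_modes(submissions: dict[str, list[str]]) -> list[str]:
    """Round 1: combine each member's independently written failure modes.

    Duplicates are dropped by comparing trimmed, lower-cased text, and the
    first spelling encountered is kept.
    """
    merged: dict[str, str] = {}
    for member in sorted(submissions):
        for mode in submissions[member]:
            merged.setdefault(mode.strip().lower(), mode.strip())
    return list(merged.values())


def find_disagreements(
    answers: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Round 2: answers[member][failure_mode] is that member's expected effect.

    Returns only the failure modes where members gave different answers,
    so the conflicts surface instead of being hidden by whoever answers first.
    """
    by_mode: dict[str, dict[str, str]] = {}
    for member, effects in answers.items():
        for mode, effect in effects.items():
            by_mode.setdefault(mode, {})[member] = effect.strip().lower()
    return {
        mode: per_member
        for mode, per_member in by_mode.items()
        if len(set(per_member.values())) > 1
    }
```

Exact text comparison is a crude stand-in for the human synthesis step; in practice the facilitator would still read the answers, since two people can describe the same effect in different words.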

Watt: Failure modes and effects analysis is one way of doing things. Are there any workloads that are a bad candidate for it, where it's a bad technique and you should consider something else? What have you found?

Yakomin: I think it's pretty easy to do this with too great a scope. Make sure that you're scoping the workload appropriately. Think about the scope of a product team. Don't try to bring an entire department into a room and have 100 people trying to contribute failure modes of a broader system. Even if there are hard dependencies at every level, scope it off somewhere. Then the other thing is, if you don't already have the system built, it's not the time to do this type of analysis. You should architect the system the way that you're going to architect it, start building it out, and then complete this analysis. Because it's not about how we want the system to behave, or how we're going to build the system to behave. It's about how we think the existing system is going to behave. I could see this working in some adapted way as a pre-building exercise, as a way to support the architecture decisions around resilience patterns, but that's not how I've leveraged it, and certainly not when you then try to map it to hypotheses and chaos experiments and think about implementation. It's not going to be as valuable for those workloads that don't exist yet.



Recorded at:

Jan 06, 2023