How to Test Your Fault Isolation Boundaries in the Cloud

Summary

Jason Barto discusses fault isolation boundaries and ways to take advantage of fault isolation in AWS, demonstrating initial tests used to ensure a system has successfully isolated faults.

Bio

Jason Barto is a Principal Solutions Architect at AWS where he works with customers to design resilient system architectures and develop chaos engineering practices. Prior to joining AWS Jason was designing and building distributed systems for complex event processing and real-time telemetry analytics.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Barto: Welcome to this session about fault isolation boundaries, and some simple initial tests that you can perform to test them. My name is Jason Barto. I work with a lot of teams and organizations to analyze their systems and develop ways to improve those systems' resilience. We're going to introduce the concept of failure domains and reducing those failure domains using fault isolation boundaries. We will introduce some methods for creating fault isolation boundaries, and then try to apply these methods to a sample system running on AWS. My hope is that by the end of this session, you will have a method that you can use to analyze your own systems and begin to test their implementation of fault isolation boundaries for reducing the impact of failures on your system.

A Tale of Two Systems

I want to start by first sharing a story with you that caused me to take a very deep interest in the topic of resilience engineering. This is the story of a banking application that was running in the cloud. This system was critical to the bank as part of their regulated infrastructure. Because of this, downtime meant fines and customers who couldn't make payments. The system was designed using a lot of recommended practices. The team used highly available components and built an application cluster with three nodes to reduce the impact of any one node going down. The application cluster used a quorum technology, which meant that if one of the nodes was down, the other two would continue to function, but in a reduced capacity. If the second node went down, the system would fall back to a read-only mode of operation. While the first system was built using highly available technologies, the bank wanted additional assurance that the system would remain available in the face of failure. They deployed a secondary instance of the application in the same AWS region, but in a separate VPC. They operated these systems in an active-passive fashion, and to fail over to the secondary system was a management decision that would redirect traffic by updating a DNS record. Firewalls and so forth had already been worked out so that communication with on-premise systems was possible from both the primary and the secondary deployment.

Unfortunately, one day, there was a disruption to an AWS availability zone, which impacted the infrastructure hosting one of the application cluster nodes, causing the system to operate in a reduced capacity mode. The bank decided that they did not want to run in a reduced capacity mode, so they made the decision to fail over to the secondary system. They modified the DNS records and began to route requests to the secondary system. However, the same availability zone disruption which had impacted the primary system also affected the EC2 instances in the secondary system, because they both resided in the same availability zone and shared physical data centers. The bank then resigned itself to having to operate at a reduced capacity until the services had been restored, and the EC2 instances could be relaunched. Luckily, there was no downtime involved, and no customers were impacted. However, it highlighted for me that there is no such thing as a system which cannot fail, and that there are very real dependencies, even single points of failure, that as technologists we either knowingly acknowledge and accept, or are completely unaware of.

How Do We Confidently Create Resilient Systems?

If we adopt best practices, and even deploy entirely redundant systems, only for undesirable behavior to still creep in, how can we say with confidence that the systems we have built are resilient? Every team I've ever worked on or with, always had an architecture diagram. However, those teams did not go through those diagrams and ask themselves, what happens when one of those lines breaks, or one of the boxes is slow to respond. We don't often ask ourselves, how can the system fail? What can we do to prevent it? Then, also, how do we test it? We're going to use this architecture diagram and the example system that it represents to identify the failure domains in the system, explore what fault isolation boundaries have been incorporated into the design. Then, also, test those fault isolation boundaries using some chaos experiments.

Failure Domains

What exactly is a failure domain? A failure domain is the set of resources which are negatively impacted by an individual incident or event. Failure domains are not mutually exclusive, and they can overlap with one another when one failure domain contains another failure domain as a subset. For example, the failure domain for a top of rack switch failure is all of the resources within the rack from a network computing perspective. The failure domain for a tornado is everything within the data center. The same concept can be applied to system software. Single points of failure in a system architecture will have a failure domain for all of the resources that rely on that single point of failure to complete their task.

Let's look at a simple example to further illustrate our definition. Say there is an issue with the web server process running on a server. This is bad, but as requests to the system can flow through a secondary redundant web server, the failure does not have an impact on the system as a whole. The failure domain includes only the web server, which experienced the issue. However, if the database experiences an issue, maybe it loses its network connection or is somehow affected, it'll have an impact on the system as a whole because both of the web servers rely on the database to service requests. The next logical question then is, how can we get a list of the failure domains within an architecture? This is a process that's normally called hazard analysis. There are various techniques that can be used to create a solid understanding of the failure domains within a system. Adrian Cockcroft gave a talk at QCon San Francisco, which introduced many of these techniques, in 2019. Some examples of hazard analysis techniques include failure mode effect analysis, fault tree analysis, and system theoretic process analysis, which was described by Dr. Nancy Leveson of MIT in the STPA Handbook. All of these provide a formal method for diving into a system, identifying potential failures, and estimating their impact on a system.
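The two-web-server example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resource names and the `DEPENDS_ON` map are hypothetical, and the failure domain is approximated as the failed resource plus everything that transitively relies on it, which is far simpler than the formal hazard analysis techniques just mentioned.

```python
from collections import deque

# Hypothetical dependency map for the example: each resource lists what it relies on.
DEPENDS_ON = {
    "web_server_1": ["database"],
    "web_server_2": ["database"],
    "database": [],
}


def failure_domain(failed, depends_on):
    """Return the failed resource plus everything that transitively relies on it."""
    # Invert the map so we can walk from a resource to its dependents.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    impacted = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted
```

Calling `failure_domain("web_server_1", DEPENDS_ON)` yields only the web server itself, while `failure_domain("database", DEPENDS_ON)` yields the database and both web servers, which matches the reasoning in the example.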

In the interest of simplicity, we are going to forego these formalized methods and instead focus on any of the single points of failure in this architecture to identify the failure domains. By looking at the lines and boxes in the diagram, we can ask questions like, what happens if the database becomes unavailable, or when the system is unable to issue requests to the AWS Lambda service? We can then start to think about how to mitigate such occurrences. There are enough boxes and lines in this diagram to create a good list of what if scenarios. What if a database instance fails? What if the network connection to the database fails? What if a web server fails? For today, we're going to focus on just two failure domains. The first is that of an availability zone. What happens when an availability zone is experiencing a failure? The second is, what happens when the system is unable to send a request to an AWS service, namely, AWS Lambda?

Fault Isolation Boundaries

We've identified some potential failure domains for our system, so what can we then do to reduce the size of each failure domain? How can we mitigate the failures and their blast radius? Using conscious design decisions in the characteristics and design of our system, we can implement fault isolation boundaries to manage these failure domains. We can isolate the faults so that they have minimal impact on the system as a whole. A fault isolation boundary is going to limit the failure domain, reducing the number of resources that are impacted by any incident or event. Let's talk about a couple of the patterns that you can use in your application code to create fault isolation boundaries. We see on the right, three upstream systems, the clients, and their downstream dependency, a service. The lines between these systems are where failures can happen. These lines could be disrupted by failures with servers, networks, load balancers, software, operating systems, or even mistakes from system operators. By anticipating these potential failures, we can design systems that are capable of reducing the probability of failure. It is of course impossible to build systems that never fail, but there are a few things that, if done systematically, would really help to reduce the probability and impact of failures.

First, we can start by setting timeouts in any of the code which establishes connections to, or makes requests of, that downstream dependency. Many of the libraries and frameworks that we use in our application code will have default timeout values for things like database connections. Those defaults are often set to an infinite value, or, if not infinite, to tens of minutes. This can leave upstream resources waiting for connections to establish or requests to complete that just never will. The next pattern that we can employ is to use retries with backoff. Many software systems use retries today, but not all of them employ exponential backoff in order to prevent or reduce retry storms, which could cause a cascading failure. If you're looking for a library to provide backoff logic, ensure that it includes a jitter element so that not all upstream systems are retrying at exactly the same time. Also, limit the number of retries that you allow your code to make. You can use a static retry limit of three attempts or some other number. In most cases, a single retry is going to be enough to be resilient to intermittent errors. Often, more retries end up doing more harm than good. You can also make your retry strategy more adaptive by using a token bucket, so that if other threads have had to retry and failed, subsequent initial calls do not also retry. This is similar to a circuit breaker, which will trip open if too many errors are encountered in order to give the downstream dependency time to recover.
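As a rough illustration of retries with a static limit, exponential backoff, and jitter, here is a minimal Python sketch. The function name `call_with_retries` and its parameters are assumptions made for this example, not an API from any library mentioned in the talk, and the "full jitter" variant shown is one common choice among several.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on failure, retry with capped exponential backoff and jitter.

    The operation is expected to enforce its own timeout, so that a hung
    dependency surfaces as an exception instead of blocking forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # A real client would catch only errors known to be retryable.
            if attempt == max_attempts:
                raise  # retry limit reached, give up and surface the error
            # Backoff ceiling doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same instant.
            sleep(rng() * ceiling)
```

The timeout itself belongs inside `operation`, for example by passing an explicit `timeout` argument to whichever client call it wraps rather than relying on the library default. The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting.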

Just as we protect our upstream clients from a failing service, we should also protect the downstream service from a misbehaving client. To do this, we can use things like rate limiting to limit the number of requests that any one client can send at a time. This creates a bulkhead which prevents any one of the clients from consuming all of a service's resources. Also, when the service does begin to slow down or get overloaded, don't be afraid to simply reject new requests in order to reduce the load on your service. There are many libraries out there today that will help you to incorporate these patterns into your code. Two examples are the chi library for Golang, which implements things like load shedding and rate limiting out of the box. The Spring Framework for Java can also leverage Resilience4j which has implementations of the circuit breaker, bulkhead, and rate limiting patterns, amongst others.
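To make the rate limiting idea concrete, the following is a small Python sketch of a per-client token bucket. It is not how chi or Resilience4j implement the pattern; the class names `TokenBucket` and `PerClientLimiter` and their parameters are hypothetical, chosen for illustration only.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject the request, e.g. with HTTP 429


class PerClientLimiter:
    """One bucket per client, so a single noisy client cannot starve the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[client_id].allow()
```

Giving each client its own bucket is what creates the bulkhead: when one client exhausts its tokens, its requests are rejected while other clients continue to be served.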

At the infrastructure level, similar to redundancy, you can limit the blast radius of a failure through a cell based architecture. Cells are multiple instantiations of a service that are isolated from each other. These service structures are invisible to customers, and each customer gets assigned to a cell or to a set of cells. This is also called sharding customers. In a cell based architecture, resources and requests are partitioned into cells, which are then capped in size. The design minimizes the chance that a disruption in one cell, for example, one subset of customers, would disrupt other cells. By reducing the blast radius of a given failure within a service based on cells, overall availability increases and continuity of service remains. For a typical multi-tenant architecture, you can scale down the size of the resources to support only a handful of those tenants. This scaled down solution then becomes your cell. You then take that cell and duplicate it, and then apply a thin routing layer on top of it. Each cell can then be addressed via the routing layer using something like domain sharding. Using cells has numerous advantages, among them workload isolation, failure containment, and the capability to scale out instead of up. Most importantly, because the size of a cell is known once it's tested and understood, it becomes easier to manage and operate. Limits are known and replicated across all of the cells. The challenge then is knowing what cell size to set up, since smaller cells are easier to test and operate, while larger cells are more cost efficient and make the overall system easier to understand. A general rule of thumb is to start with larger cells, and then as you grow, slowly reduce the size of your cells. Automation is going to be key, of course, in order to operate these cells at scale.
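One way to picture the thin routing layer is as a function that maps a customer to a cell. The sketch below assumes a fixed, hypothetical list of cell names and uses a stable hash of the customer identifier; it is an illustration of the idea, not a description of how any AWS service assigns customers to cells.

```python
import hashlib

# Hypothetical cell names; each would front an independent copy of the service.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for_customer(customer_id, cells=CELLS):
    """Deterministically assign a customer to exactly one cell."""
    # Use a stable hash: Python's built-in hash() is randomized per process,
    # which would route the same customer differently after a restart.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]
```

A modulo mapping like this reshuffles most customers whenever the number of cells changes, so a routing layer that needs to add or shrink cells would more likely keep an explicit customer-to-cell table. The point here is only that a disruption in one cell touches the customers mapped to it and nobody else.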

Beyond the infrastructure level, we move to the data center level. We have to consider the fact that our sample application is running in the cloud, so it has a dependency on cloud services and resources. AWS hosts its infrastructure in regions, which are independent of one another, giving us isolated services in each region. If we have mission critical systems, there may be a need to consider how to use multiple regions so that our system can continue to operate when a regional service is impaired, creating a fault isolation boundary for our system around any regional failure. Each region is made up of availability zones. These availability zones are independent fault isolated clusters of data centers. The availability zones are far enough apart, separated by a meaningful distance, so that they don't share a fate. They're close enough together to be useful as a single logical data center. By using multiple availability zones, we can protect ourselves against the disruption to any one availability zone in a region. Our system has a dependency on these availability zones, and the data centers that operate them. Our system also has a dependency on the AWS services like AWS Lambda and Amazon DynamoDB, which are operated and deployed within our selected region.

Looking back at our sample application, let's review some of the mitigations that we have in place to create boundaries around the failure domains in our system. First, we've used redundancy to limit the failure of the database server. If the database fails, the system will switch over to the secondary database, which was kept in sync with the primary. We also have redundancy across the application servers, which operate independently across availability zones, and the application code will timeout and reconnect to the database if and/or when that database failure does occur. Also, where the application communicates with AWS Lambda services, we have used timeouts and retries with exponential backoff to send requests to the Lambda service. Then, finally, the Lambda service and the DynamoDB service operate across multiple availability zones, with rate limiting in place to improve their availability.

Initial Tests for Chaos Engineering

To test these fault isolation boundaries in our system, we're going to use chaos engineering to simulate some of the failures that we've discussed earlier, and ensure that our system behaves as it was designed. We'll introduce an initial set of tests that we can use to begin testing our fault isolation boundaries. This is not a chaos engineering talk. However, it's worth a brief recap of chaos engineering theory and process. This talk is intended for teams who are new to chaos engineering, and these tests assume that the system they are testing exists in a single AWS account by itself. As such, please do not inject these faults into a production environment until after the system has demonstrated a capacity to survive these tests in a non-production environment.

Earlier, we identified some failure domains for our system, and now, using the chaos engineering process, we will perform some initial tests to observe how our system performs when there is a failure with an availability zone, or with an AWS regional service. An initial test has two characteristics. The first is that the test is broadly applicable to most systems. The second is that it can be easily simulated or performed, having minimal complexity. For our sample system, there are a few initial tests that we can perform, which meet these criteria, and will allow us to begin to test our system and to develop our chaos engineering expertise. The set of initial tests includes whether the system can continue to function if it loses any one of the zonal resources such as a virtual machine or a single database instance. Also, whether the system can continue to function if an entire availability zone is disrupted, potentially losing roughly one-third of the compute resources required to run the system. Also, whether the system can continue to function if an entire service dependency is disrupted.

Let's talk a little bit more about each of these initial tests. The first initial test, applied to a single resource, will test whether a system can withstand a compute instance being stopped or a database falling over. These can easily be tested using AWS APIs. The command line examples that you see here use the AWS CLI tool to terminate an EC2 instance, or to cause an RDS database to fail over. Both of these simulate some disruption to the resource software, virtual resource, or physical compute resource which is hosting the targeted capability. Going beyond the zonal resource, we can now move on to the shared dependency of the availability zone. We see another set of CLI commands here that can be used to simulate disruption of the network within the availability zone. These AWS CLI commands create a stateless firewall in the form of a network access control list, which drops all incoming and outgoing network traffic. The resulting network access control list can then be applied to all of the subnets in an availability zone in order to simulate a loss of network connectivity to the resources in that availability zone. This is the first initial test that we will be demonstrating in a moment. The other initial test that we want to use today is a disruption to a regional service. We can do this at the networking layer as in the previous slide, but for variety, these CLI commands that you see here will show the creation of a permissions policy, which prevents access to the AWS Lambda service. This will cause the service to respond with an Access Denied response, which we can use to simulate the service responding with a non-200 response to our requests. We will demonstrate this disruption as a part of the second test.
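The slides with the CLI commands are not part of this transcript, so the following Python sketch is a stand-in rather than a reproduction of them. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), that the caller already knows the subnet IDs in the target availability zone, and that results fit in one page. The function names `deny_all_traffic` and `restore_traffic` are hypothetical.

```python
def deny_all_traffic(ec2, vpc_id, subnet_ids):
    """Swap each subnet's network ACL for a new, empty one.

    Returns (deny_acl_id, rollback), where rollback maps each new association
    ID to the original ACL ID so that the change can be undone afterwards.
    """
    # A newly created custom network ACL has no allow rules, so it denies all
    # inbound and outbound traffic until rules are added.
    deny_acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]

    # Find the ACL currently associated with each target subnet.
    found = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": list(subnet_ids)}]
    )
    rollback = {}
    for acl in found["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] not in subnet_ids:
                continue  # this ACL is also attached to subnets we must not touch
            replaced = ec2.replace_network_acl_association(
                AssociationId=assoc["NetworkAclAssociationId"],
                NetworkAclId=deny_acl_id,
            )
            rollback[replaced["NewAssociationId"]] = acl["NetworkAclId"]
    return deny_acl_id, rollback


def restore_traffic(ec2, deny_acl_id, rollback):
    """Re-attach the original ACLs, then delete the deny-all ACL."""
    for association_id, original_acl_id in rollback.items():
        ec2.replace_network_acl_association(
            AssociationId=association_id, NetworkAclId=original_acl_id
        )
    ec2.delete_network_acl(NetworkAclId=deny_acl_id)
```

Keeping the rollback mapping is the important design choice: the experiment is only safe to run if the original associations can be restored, including when the experiment is cancelled partway through.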

Demonstration

Let's get into the demonstration. As part of that, I'll briefly introduce the two chaos testing tools that we will use. As was mentioned earlier, there is a process to follow when performing an experiment, verifying the steady state, introducing a failure, observing the system, possibly even canceling the experiment if production is impacted. There are numerous tools available for performing chaos experiments. You can do things manually to get started using the CLI commands we showed earlier, for example, or you can go a step beyond manual and rely on shell scripts of some sort. For today's test, though, I will rely on two tools, one per experiment, which provides some assistance in terms of the scripting and execution of our experiments. The first, AWS Fault Injection Simulator, is a prescriptive framework that allows us to define reusable experiment templates, and to execute them serverlessly. The execution history of each experiment is kept by the service for record keeping purposes, and can integrate with other AWS services such as AWS Systems Manager to give you the ability to extend its failure simulation capabilities. AWS FIS is also the only chaos testing tool that I know of, which can cause the AWS EC2 control plane to respond with API level errors.

The second tool that we'll use is the Chaos Toolkit, which is an open source command line tool that executes experiments that are defined as JSON or YAML files. It supports extensions written in Python for interacting with numerous cloud providers and container orchestration engines. Like the Fault Injection Simulator, the Chaos Toolkit keeps a log of the actions it takes and the results of each. Both of these tools enforce a structure much like we described earlier in terms of defining the steady state, the methods for injecting failures, and then observing the steady state to determine if the experiment was successful. For the system we have today, there's also a dashboard that has been created. This dashboard has both the low level metrics and the high level KPIs that we will use to define our steady state. Along the bottom of this dashboard, we have counters and metrics for data such as CPU usage, and the number of API calls to the AWS Lambda service. Up towards the top, we're also tracking the client experience for those hitting the web application. Our key measure is going to be the percentage of successful requests that have been returned by the system. In addition to the metrics, we also have an alarm that is set to trigger if the percentage of HTTP 200 responses drops below 90%.
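The structure that both tools enforce can be summarized in a short, tool-agnostic Python sketch. Nothing here is the API of AWS FIS or the Chaos Toolkit; `alarm_ok` and `run_experiment` are hypothetical names, and the 90% threshold is taken from the alarm described above.

```python
import time


def alarm_ok(successful, total, threshold_percent=90.0):
    """Steady state: at least threshold_percent of requests returned HTTP 200."""
    return total > 0 and 100.0 * successful / total >= threshold_percent


def run_experiment(steady_state_ok, inject_fault, rollback, observe_seconds,
                   sleep=time.sleep):
    """Return True if the hypothesis held, False if the system deviated."""
    # 1. Never inject a fault into a system that is already unhealthy.
    if not steady_state_ok():
        raise RuntimeError("system is not in steady state; not injecting a fault")
    try:
        # 2. Introduce the failure, then give the system time to react.
        inject_fault()
        sleep(observe_seconds)
        # 3. Re-check the steady state to decide whether the hypothesis held.
        return steady_state_ok()
    finally:
        # 4. Always undo the fault, even if a step above raised an exception.
        rollback()
```

Passing the steady-state check, the fault, and the rollback in as callables keeps the runner independent of how each is implemented, whether that is a CloudWatch alarm lookup, a network ACL swap, or an IAM policy attachment.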

Demo: Simulate Availability Zone Failure

For this first demonstration, let's use the AWS Fault Injection Simulator to simulate a network disruption across an availability zone using a network access control list. This system has been deployed into a test environment, not into production. As such, we have a load generator running that is issuing requests to this system. You can see here in this dashboard a reflection of those requests as they hit the different API endpoints that are hosted by our application server. You can also see that the system is currently in its steady state, as 100% of the requests that it's receiving are being returned successfully. If we look a little bit under the covers, we'll see that there are currently three application servers running. All of them are healthy and handling traffic that is distributed by the load balancer.

If we take a look now at the AWS FIS console, we can see here a copy of the template that we've defined for today's experiment. This template is defined in order to create and orchestrate the failure of an AWS availability zone, or rather to simulate that failure in order to allow us to test our application. The experiment first starts by confirming that the system is in its steady state. It does this by checking that alarm that we've defined earlier. That alarm again looks at the percentage of successful requests and ensures that it is above 90%. After confirming the steady state, step two is to execute an automation document that is hosted by Systems Manager. AWS Systems Manager will then execute that document using the input parameters that are provided to it by the Fault Injection Simulator. You can see those input parameters being highlighted here. Some of the key parameters that are going into that document are the VPC that contains the resources that are running our application, as well as the availability zone that we want to have failed by the document.

If we take a look at the automation document itself, we see that here in the Systems Manager console. There's a number of bits of metadata associated with this document, such as any security risks that are associated with executing it, the input parameters that the document expects to receive, notably the AWS availability zone. Also, details about the support that this document has for things like cancellation and rollbacks. It's important that if there is something that goes wrong, or when the experiment is concluded, that this document undoes or removes those network access control lists that it creates. We can see here the steps that the document follows to create the network access control list, attach it to the subnets, wait a period, and then remove them. Here we see the source code of this document, which is a combination of YAML and Python code. The YAML defines all of the metadata for the document, the input parameters, for example, that we've just discussed. As well as wraps around the Python code that defines how this document will execute and how it will interact with the different AWS APIs in order to create the network access control lists and attach them to the appropriate subnets.

Let's go ahead and get the experiment started. We go back to the Fault Injection Simulator, and from actions, we click Start. As a part of this, we're creating an experiment based on the experiment template. As a part of this, we're going to add some metadata to this experiment so that we can go back later on and compare this experiment and its results with other experiments that are being run in the environment. After we've associated a couple of tags, we will then click Start experiment. The Fault Injection Simulator will confirm that we want to run this experiment. We type start and click the button, and away we go. After a few minutes, you'll see that the Fault Injection Simulator has confirmed and is now running. We can see that reflected here in the dashboard as well for the system. You'll notice that the alarm in the upper left-hand side of the dashboard has a small dip in it. We'll come back to that in a moment. We can also see that the infrastructure is starting to fail. It's reporting unhealthy nodes. Some of the application servers are not responding properly to their health checks. The autoscaling group is trying to refresh and replace those in order to achieve a healthy set of nodes.

Going back to the Fault Injection Simulator, we can see that it's now executed the automation document and has now completed the experiment. We're able to review the results or the details about that document. We can also see that the experiment itself is still finishing up, storing metadata and recording the logs. We're just going to click refresh a couple of times while we wait for that experiment to conclude. The experiment is now completed, and it's now logged in the Fault Injection Simulator. We can go back and dive deeper into the logs and the actions that were taken, and compare the behavior of that against other experiments that were performed. If we go back to the dashboard for our system, we should see that it has now resumed its steady state. Actually, technically, it never left its steady state, so our hypothesis for this experiment holds. If we take a look at that small dip, I think we can see that the percent correct, or the percentage of requests that were served, never dropped below 98% or 97%. Our hypothesis is holding true, the chaos experiment is successful. We can now move on to our second chaos experiment. If we take a quick look at the infrastructure, we can also see that we now do have three healthy nodes as well. We are fully back online, having recovered from the disruption to the network.

Demo: Simulate Region Failure

For the second demonstration, we're going to hypothesize that the system will be able to continue functioning in the event that there is a disruption to the AWS Lambda service. As you saw earlier, we have included some fault isolation patterns such as retries and timeouts. What will be the effect on this system then if it's unable to invoke the AWS Lambda service? Again, we've got our dashboard here, we're going to first start by confirming the steady state of the system. We can see that there are a number of requests that are being sent to this by our load generator. Our alarm is currently in a healthy state. Our percent request complete is above 90%, so we're good to go for the experiment.

Let's have a quick look at the underlying infrastructure. Again, just confirming, we've got three healthy application servers. We haven't looked at the console for it, but the database is also up and running and healthy, so we're good to go. As we said, the Chaos Toolkit is a command line tool that relies on experiments that are defined in YAML or JSON documents. This is our experiment for today. It's a YAML document, obviously. Some of the key sections that I want to call out include the configuration section. This experiment is going to take an IAM policy and attach it to an IAM role that is associated with our application server. This will prevent the system from communicating with the AWS Lambda service. The variables that are defined in the configuration are used later on as part of the method. The next section that we are focusing in on is the steady state hypothesis. This is a standard part of the Chaos Toolkit experiment. It'll use the AWS plugin for Chaos Toolkit to check that CloudWatch alarm and ensure that it is in an ok state.

If the system passes that test, the Chaos Toolkit will then move on to executing the method. Here, again, it's going to use the AWS plugin for Chaos Toolkit to attach that IAM policy to the role which is associated with our application servers. After it carries out that action, it'll then pause for about 900 seconds, or 15 minutes, in order to give the system a chance to respond. After it's done waiting, it'll then check that alarm again, to ensure that it's still in an ok state. Once it's done, it'll then execute its rollback procedure, which will detach that IAM policy from the AWS IAM role. Just a quick look at the IAM policy that we're talking about here. You can see it's a very simple IAM policy designed to deny access to the InvokeFunction and InvokeAsync APIs for any Lambda functions that are in the Ireland region.
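The policy shown on screen is not included in the transcript. A policy with the shape described, denying the two invoke actions for all Lambda functions in the Ireland region (`eu-west-1`), might look like the output of the Python sketch below. The function name `deny_lambda_invoke_policy` is hypothetical, and the exact resource pattern used in the demo is an assumption.

```python
import json


def deny_lambda_invoke_policy(region="eu-west-1"):
    """Build an IAM policy document that denies invoking any Lambda function in a region."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": ["lambda:InvokeFunction", "lambda:InvokeAsync"],
                # Any account, any function name, but only in the given region.
                "Resource": f"arn:aws:lambda:{region}:*:function:*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_lambda_invoke_policy(), indent=2))
```

Because an explicit deny overrides any allow in IAM, attaching a policy like this to the application servers' role is enough to make every invoke call fail with Access Denied, and detaching it restores access without touching the role's other permissions.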

Going to our command line, we're going to execute the Chaos Toolkit command line tool. We're going to give it our experiment.yaml file so it knows what experiment to execute. We're going to tell it to roll back when it's done, and to log everything to exp-fail-lambda-run-003.log. We can see here in its output that it's checked the steady state of the system, found that it's in a good state, and proceeded by executing the action so that it attaches the policy to the IAM role. It's now pausing for those 900 seconds as we saw in the experiment definition. Roll forward a few minutes, we can check our dashboard for the system and see that there is indeed quite an impact to the system. It looks like the percent correct is sitting around 60% at the moment, so it's dropping drastically below that 90% threshold that we try to keep as part of our SLA. We can see over on the right-hand side that we've got 500 or so odd requests that are successfully being serviced, while 900 in total are being issued to the system. The reason for this discrepancy is that there are actually two APIs that the system makes available. One of them relies on AWS Lambda as part of its critical path, one of them does not. We can also see that the infrastructure is starting to thrash as different application servers get replaced, as the system tries to replace them with healthy instances. Going back to the dashboard, we're now at about 9% of successful requests being serviced. This is roughly in line with the distribution of traffic that our load generator is producing, where 10% of the requests are being issued to the path that doesn't rely on AWS Lambda, and 90% of the traffic does rely on AWS Lambda. This is about what I think we might expect.
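The arithmetic behind that expectation is simple enough to write down. In this sketch the path names are hypothetical, and the 90/10 traffic split is the load generator distribution stated above.

```python
def expected_success_percentage(traffic_share, path_healthy):
    """Percentage of requests expected to succeed, given which paths still work.

    traffic_share maps each path to its share of requests (any consistent unit);
    path_healthy maps each path to True if requests on it can still succeed.
    """
    served = sum(share for path, share in traffic_share.items() if path_healthy[path])
    return 100.0 * served / sum(traffic_share.values())


# Hypothetical path names; shares are percentages of generated load.
TRAFFIC = {"/uses-lambda": 90, "/no-lambda": 10}

with_lambda_blocked = expected_success_percentage(
    TRAFFIC, {"/uses-lambda": False, "/no-lambda": True}
)
```

With Lambda blocked, only the 10% of traffic on the path that avoids Lambda can succeed, so the expected figure is about 10%, well under the 90% alarm threshold and close to the roughly 9% seen on the dashboard.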

Looking back again, the infrastructure is thrashing as instances are brought online, they're failing their health checks, so they're getting replaced. At this point, the experiment has concluded, Chaos Toolkit has waited 15 minutes, and rechecked that CloudWatch alarm. It's found that it is not in an ok state, so it rolls back to detach that IAM policy and marks the experiment as deviated or essentially as a failure. We can then go through the logs, go through the dashboards, and start to perform our learning exercise in order to understand what happened. If we go back to the dashboard, again, we want to ensure that the system has resumed its steady state. The experiment isn't over until the system has resumed its steady state. We can see that actually it has, with that IAM policy detached, it has resumed its steady state. Now that effectively the AWS Lambda service is back available, we're able to issue requests to it. We are now successfully processing all of the requests that are flowing through the system, so we have effectively recovered. Equally, if we check the infrastructure, we'll start to see that we do have some healthy instances that are now coming online. I'd imagine we'll have one in each availability zone again shortly.
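The experiment lifecycle described here (check steady state, inject the fault, wait, re-check, roll back) can be sketched in a few lines. This is not Chaos Toolkit's API; the class, the function, and the status strings other than "deviated" are hypothetical, and a toy in-memory system stands in for the CloudWatch alarm and the IAM role.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for the four parts of the experiment."""
    steady_state_ok: Callable[[], bool]  # e.g. "is the alarm in an ok state?"
    action: Callable[[], None]           # e.g. attach the deny policy
    pause: Callable[[], None]            # e.g. wait 900 seconds
    rollback: Callable[[], None]         # e.g. detach the deny policy


def run_experiment(exp: Experiment) -> str:
    # Do not inject a fault into a system that is already unhealthy.
    if not exp.steady_state_ok():
        return "not-run"
    try:
        exp.action()
        exp.pause()
        # Re-check the hypothesis while the fault is still in place.
        return "completed" if exp.steady_state_ok() else "deviated"
    finally:
        # Roll back whether or not the hypothesis held.
        exp.rollback()


# Toy system: the alarm is ok only while the deny policy is not attached.
state = {"policy_attached": False}

experiment = Experiment(
    steady_state_ok=lambda: not state["policy_attached"],
    action=lambda: state.update(policy_attached=True),
    pause=lambda: None,  # skip the real 15-minute wait in this sketch
    rollback=lambda: state.update(policy_attached=False),
)

outcome = run_experiment(experiment)
print(outcome, state)
```

In this toy run the hypothesis fails while the policy is attached, so the run is marked as deviated, and the rollback leaves the system back in its steady state, mirroring what the demo showed.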

Key Takeaways

What are some key takeaways, and what have we shown today? From our first initial test, we demonstrated that the system was unaffected by the loss of an availability zone. As you saw, this system continued to operate, with a load balancer routing requests around the affected resources. The second initial test, which prevented access to the AWS Lambda service, demonstrated that, in fact, our system was hugely affected by not being able to successfully invoke its Lambda function. No amount of retries or timeouts is going to help if the service becomes completely unavailable. What we'll need to do now is to understand and estimate the likelihood of this event, AWS Lambda not being reachable. If it's a high enough priority, we can devise a mitigation for this failure mode, perhaps caching the data locally or posting the data to a queue where it can be processed later on by a Lambda function. Or perhaps we can call a Lambda function located in a different AWS region, understanding that there will be additional network latency as a result of this. After implementing a fix, we can perform the experiment again, to observe whether the system is more resilient.
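Two of the mitigations mentioned above, calling a function in another region and posting to a queue for later processing, can be combined into one fallback chain. The following is a minimal sketch under stated assumptions: the invokers are plain Python callables standing in for real Lambda calls, an in-memory deque stands in for a durable queue, and all names are hypothetical.

```python
from collections import deque


class ServiceUnavailable(Exception):
    """Raised by an invoker when its target cannot be reached."""


def invoke_with_fallback(payload, invokers, deferred):
    """Try each invoker in order; queue the payload if all of them fail."""
    for invoke in invokers:
        try:
            return invoke(payload)
        except ServiceUnavailable:
            continue  # move on to the next option, e.g. another region
    # Nothing could serve the request now, so keep it for later processing.
    deferred.append(payload)
    return None


def primary_region(payload):
    # Simulates the failure injected in the experiment.
    raise ServiceUnavailable("primary region cannot be invoked")


def secondary_region(payload):
    # Would carry extra network latency in a real deployment.
    return f"processed {payload}"


queue = deque()
result = invoke_with_fallback("order-1", [primary_region, secondary_region], queue)
deferred_result = invoke_with_fallback("order-2", [primary_region], queue)
print(result, deferred_result, list(queue))
```

Whether deferring work is acceptable depends on the service: it suits requests that can be processed later, not ones whose caller needs the answer immediately.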

Wrap-Up

First, the initial tests presented today trade realism for simplicity. The failures simulated, availability zone network failure and regional service failure, were very binary in their nature and unrealistic as a result. With the failure simulation in place, all network traffic was dropped, and all API requests were immediately denied. This provided us a way to quickly assess the ability of the system to withstand such failures. However, in reality, it's far more likely that only some of the network traffic will be dropped, and some of the API requests will receive no response, a delayed response, or a non-200 response. Partial failures, or gray failures, are far more likely but are also harder to simulate. You can simulate them using tools like traffic control (tc) for Linux or a transparent network proxy. However, for initial testing purposes, a binary simulation is sufficient. Once the system is able to continue operating in the face of a binary failure simulation, you can consider taking on the additional complexity of simulating partial failures.
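The difference between a binary and a gray failure can be illustrated in-process, without any network tooling. The sketch below is a stand-in for tools like tc or a proxy, not a replacement for them: it wraps a callable so that a configurable fraction of calls fail. A failure rate of 1.0 reproduces the binary simulation, and a lower rate approximates a partial failure. All names are hypothetical, and delayed responses are left out for brevity.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a response to simulate a dropped request."""


def with_partial_failure(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dropped request")
        return call(*args, **kwargs)

    return wrapper


def count_failures(call, attempts):
    """Invoke `call` repeatedly and count how many attempts fail."""
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFault:
            failures += 1
    return failures


rng = random.Random(42)  # seeded so that runs are repeatable
binary = with_partial_failure(lambda: "ok", 1.0, rng)
gray = with_partial_failure(lambda: "ok", 0.3, rng)

print("binary failures out of 100:", count_failures(binary, 100))
print("gray failures out of 1000:", count_failures(gray, 1000))
```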

When you first look to apply chaos engineering to a system, it can seem like a very large problem space to have to search. Other engineering disciplines for mission critical systems have devised formalized processes for creating an exhaustive inventory of all of the failures that could affect the system. These are good processes to use and to be aware of. To get started, you could also simply look at your architecture and ask yourself, what happens when a line goes away, or when a box is disrupted? Where are the single points of failure in the architecture? This will be a good starting point to identify your failure domains and to begin to catalog your fault isolation boundaries. With that catalog to hand, you can then come up with ways to simulate these failures, and to ensure that your fault isolation boundaries are effective, and operating as designed.
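The "what happens when a box is disrupted?" question can be asked mechanically once the architecture diagram is written down as a graph. The following sketch removes each box in turn and checks whether a request can still get from the client to completion; boxes whose removal breaks the path are single points of failure. The architecture shown is a simplified, hypothetical rendering of the demo system, and it treats the load balancer as a single box even though a managed load balancer is itself spread across availability zones.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed box."""
    if removed in (start, goal):
        return False
    seen = {start}
    pending = deque([start])
    while pending:
        node = pending.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, []):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                pending.append(neighbor)
    return False


def single_points_of_failure(graph, start, goal):
    """Boxes whose removal leaves no path from start to goal."""
    return sorted(
        node
        for node in graph
        if node not in (start, goal)
        and not reachable(graph, start, goal, removed=node)
    )


# Hypothetical architecture: each edge is a line on the diagram.
architecture = {
    "client": ["load_balancer"],
    "load_balancer": ["app_az_a", "app_az_b", "app_az_c"],
    "app_az_a": ["lambda_function"],
    "app_az_b": ["lambda_function"],
    "app_az_c": ["lambda_function"],
    "lambda_function": ["request_served"],
    "request_served": [],
}

print(single_points_of_failure(architecture, "client", "request_served"))
```

In this toy graph no single application server is a single point of failure, because the other two availability zones still provide a path, while the Lambda function is, which matches the outcome of the two initial tests.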

Questions and Answers

Watt: You mentioned some formal hazard analysis methods, like failure mode effects analysis. What are the differences between these methods? How do you actually choose which one to use?

Barto: It's worth pointing out that what we're trying to do with these hazard analysis methods is identify all those things that could cause our application to have a bad day. Initially, that can be quite a large space. If we are as developers just throwing darts at the wall and trying to figure out what breaks the system, it could be a very long time before we have a complete set of those hazards. Regardless of whether we're talking about fault tree analysis, or failure mode effects analysis, or any of the other methods, the goal at the end of the day is to have that definitive list of what can cause us problems. There's been some research done to understand where they overlap. As you said, are all of these at the end of the day equivalent? If we perform a system-theoretic process analysis, are we going to come up with the same set of hazards that we would if we did a failure mode effects analysis?

The research has shown that there's roughly about a 60% or 70% overlap, but each one of those analysis methods does also find outliers that other analysis methods don't find. What I would suggest is that, as engineers, a lot of us think bottom up. We think about the lower order architectural components in our system, and less about the higher order controls and processes that we use to govern that system. In that way, something like a failure mode effects analysis is fairly recognizable to us as engineers, and I see a lot of adoption of that within different teams. All of them are perfectly valid. It's worth doing a little bit of reading, not extensive, but getting more familiarity with fault tree analysis, lineage driven fault analysis, and any of the other ones that you can come across, and just getting a feel for which one you think is going to work best for you and for your team.

Watt: Say you've got your architecture diagram, and you've got loads of possibilities, so many things that can go wrong. How do you work out which ones will give you the best bang for your buck in terms of testing them? How do you work out which ones to focus on and which ones you should maybe leave for later?

Barto: Again, as engineers, we focus a lot on the architecture, we focus a lot on the systems. It's important to recall that the reason we've got these IT systems running in our environment is to deliver different business services to our clients and to our customers. I always encourage people to start with those, with the services rather than the system itself. The reason for that is that any given system will probably host multiple services. In the demonstration today we saw a single system that had two APIs, a very simplistic example, but it illustrates the point. Depending on which API you were talking with, which service you were consuming from the system, your request either succeeded or failed based on the subsystems that that request went through in order to get serviced.

I would encourage customers, or I would encourage teams, to look at the services that their business systems provide, and to then identify the components that support each service. This allows us to make the problem space a little bit smaller, in terms of not considering all of the different permutations, but also allows us to identify the criticality or importance of any one of those. Again, a single system delivers multiple services, so what are the really important services? Because at the end of the day, once you do identify these different failure modes or hazards, it's going to cost engineering time and effort. You're not going to be implementing features while you're implementing those mitigations. You want to make sure that you're going to get a return on the investment that you're putting in.
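Starting from services rather than the system can be as simple as recording which components each service depends on. The sketch below does that for the two APIs from the demonstration and then asks which services a given component failure would take down. The service and component names are hypothetical.

```python
# Hypothetical mapping of each business service to its supporting components.
SERVICE_DEPENDENCIES = {
    "api_with_lambda": {"load_balancer", "app_servers", "lambda_function"},
    "api_without_lambda": {"load_balancer", "app_servers"},
}


def affected_services(dependencies, failed_component):
    """Services whose critical path includes the failed component."""
    return sorted(
        service
        for service, components in dependencies.items()
        if failed_component in components
    )


print(affected_services(SERVICE_DEPENDENCIES, "lambda_function"))
print(affected_services(SERVICE_DEPENDENCIES, "app_servers"))
```

Pairing a mapping like this with the importance of each service gives a first-cut ordering of which failure modes are worth engineering time to mitigate.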


Recorded at:

Jan 20, 2023

Jason Barto
