Modeling Failure Scenarios in Systems
The increasing number of integrated systems in today’s enterprise solutions necessitates dealing with dependency and environment failures in a systematic way. By modeling dependency failures at the architecture stage, system responses to failures can be communicated, tested, and implemented, reducing business risk and cost.
In this paper we introduce system failure modeling, a technique that helps anticipate dependency and environment failures and take proactive action in the early architecture stage of a solution.
We will use a four-step process to build failure models; each step gradually uncovers the information needed to create them. At the end of the paper we will explore the different tools and patterns available to architects that enable the design and implementation of desired system responses to different failure modes.
The list below gives an overview of each step:
- Step 1: We will uncover a solution’s functional dependencies at scenario level and introduce the dependency matrix model for use in later steps.
- Step 2: We will look at the environment in which the solution will operate from an integrating systems point of view and define the Service Level Agreements (SLAs).
- Step 3: In this step we will identify the key data points that our systems can collect to be aware of the current operational state of the solution. This data will be used later on to identify the failure scenario the solution is in.
- Step 4: We will look at the different ways in which the integrating services can fail. We will build solution failure models and determine how our solution should behave in failure scenarios.
Step 1. Understanding Functional Dependencies
As we start planning for dependency failures in a new solution, one critical task is to understand the solution’s functional areas and each functional area’s dependency on third-party systems. Understanding these fundamental dependencies is the first step toward architecting a solution that can survive failures in its dependencies and environment.
The first model we will use is the system dependency matrix, which outlines and tracks the dependencies of the system. In this model, system functional areas (or scenarios) are extracted and associated with the third-party systems they rely on. Figure 1.x shows an example of the system dependency matrix.
This system dependency matrix helps the solution architect understand which functional areas are dependent on which integrated services. Throughout this whitepaper this matrix will be leveraged to calculate the response time, availability, failure models and system response to dependency failures. As such, the system dependency matrix is the basis for the rest of the models in this whitepaper.
In the next section we will cover some of the solution properties that need to be incorporated into solution architecture and need to be taken into consideration when creating failure models.
Step 2. Establishing Operations Metrics
After identifying the solution scenarios and the different systems/services that each scenario relies on, we need to understand what the normal operating mode is for our system. Only then will we be able to identify an anomaly in the system and take proactive action. In this section we will go over the system resource constraints and outline the normal operating model by modeling the solution quality characteristics.
Solution characteristics can be categorized into two high-level buckets: solution features and solution operational qualities. Solution features are the actual capabilities that the solution provides. Solution operational qualities, sometimes called non-functional requirements, determine how the system should operate. These quality characteristics include system scalability limits, response time under normal and error conditions, system availability, etc. Although not visible to the end user, solution quality requirements determine the long-term value of the system to the business.
A subset of the solution qualities is usually classified as solution Service Level Agreements (SLAs). Common SLAs include, but are not limited to: response time, availability, and the number of concurrent users the system supports.
The SLAs that your system needs to satisfy are more often than not determined by the business needs and the market demand. Sometimes the SLA expectations are set prematurely and later the analysis of the dependent systems reveals the infeasibility of previously agreed to SLAs.
In this section we will look into three of the most commonly cited SLAs and how we can bring clarity to the SLAs that our solution can support by looking at its dependencies.
System Availability Model
Availability of a given solution always tops the SLA lists. Naturally all software solutions strive to achieve an availability of 100%. Unfortunately, in a complex solution where a series of systems are integrated to implement system features, it is extremely difficult to achieve 100% availability.
Due to different feature dependencies, calculating an overall system availability number is tricky. However, the formula below can be used to roughly calculate the availability of the solution under planning, assuming every dependent system must be up for the solution to function:
Total Availability = Availability(System 1) × Availability(System 2) × … × Availability(System N)
In order to realistically understand the maximum availability of a solution feature and plan for dependency system response, a model like the one below can be used to calculate the scenario availability based on the dependent service availability.
Note that in the example above, the individual availabilities of the dependent services are acquired from the groups that own each application/service. In most enterprise scenarios, getting this data directly from the operations team is the preferred approach for accuracy reasons.
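As an illustration, the availability calculation can be sketched in a few lines of Python. The service names and availability figures below are hypothetical, not taken from the figure:

```python
# Hypothetical per-service availabilities (from the owning teams).
service_availability = {
    "order_fulfillment": 0.999,
    "inventory": 0.995,
    "product_catalog": 0.999,
    "customer_db": 0.9995,
}

# Scenario -> services it depends on, taken from the dependency matrix.
scenario_dependencies = {
    "place_order": ["order_fulfillment", "inventory", "customer_db"],
    "browse_products": ["product_catalog"],
}

def scenario_availability(scenario):
    """Multiply the availabilities of every service the scenario depends on."""
    result = 1.0
    for service in scenario_dependencies[scenario]:
        result *= service_availability[service]
    return result

for name in scenario_dependencies:
    print(f"{name}: {scenario_availability(name):.4%}")
```

Note how quickly the product drops below any single dependency’s availability: a scenario chaining three “three nines” services can no longer promise three nines itself.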
System Response Time Model
As the number of dependencies grows in complex software solutions, the response times are chained together. A fluctuation in one system can negatively impact the overall response time of the solution under development.
The scenario where this is most commonly overlooked is parent-child processing. There are many cases in which a response to a request contains many child elements that need to be processed further. Although it may seem trivial, many data-dependency questions must be answered before the overall response time of each call chain is understood.
For brevity we will only concentrate on the call-chain dependency and not the data dependency among all the participating services.
To model the expected response time, we will re-use our service dependency matrix. This time, instead of simply indicating the dependency, we will count the number of calls made to each service. When coupled with the individual service response-time metrics, the spreadsheet should look as follows:
By using this model, we can identify the expected response time of a given scenario/feature, and if the monitored response time starts deteriorating the solution can take proactive architectural action such as falling back on cache layers, processing the results of calls in parallel, or using overlapped I/O. We will explore some of these approaches later in this whitepaper.
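The call-count version of the matrix translates directly into a response-time estimate. The sketch below assumes sequential calls and uses invented response times and call counts:

```python
# Hypothetical average response times per call, in seconds.
service_response_time = {
    "product_catalog": 0.120,
    "customer_review": 0.080,
    "inventory": 0.050,
}

# Scenario -> {service: number of calls made in the call chain}.
scenario_calls = {
    "search_products": {"product_catalog": 1, "customer_review": 5, "inventory": 1},
}

def expected_response_time(scenario):
    """Sum call_count * response_time across the chain (sequential calls assumed)."""
    return sum(count * service_response_time[svc]
               for svc, count in scenario_calls[scenario].items())

# 1 x 0.120 + 5 x 0.080 + 1 x 0.050 seconds
print(expected_response_time("search_products"))
```

The multiplier on `customer_review` is exactly the parent-child effect described above: five child lookups per parent response dominate the chain, which is where parallel processing pays off first.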
System Throughput Capacity Model
The final SLA we will cover in this whitepaper is the throughput rate of the solution, which roughly translates to the number of requests that can be processed per second. For simplicity we will not consider payload size or any other factors in this calculation. The throughput of the solution depends on the transactions per second (TPS) of each of the dependent systems in the call chain. We can consider a transaction to be an operation performed by any given system in the call chain.
Discovering the TPS of the dependent systems can sometimes prove to be a challenge because of lacking system documentation. In such cases, it may be necessary to write simple driver applications to observe the TPS of each system independently.
If these dependent systems are shared with multiple other applications, such as a database that is serving three different applications, only a fraction of the system TPS is available to the solution under development. The utilization percentage of the overall dependency should be factored into the calculation. Below is an example of a system capacity model for a simplified online retailer that provides functionality by orchestrating service calls across order fulfillment, inventory, product catalog and customer review services and a customer database.
After getting these numbers, it is apparent that for a given call chain of services, the maximum throughput of the solution can only be as high as the slowest effective TPS in the call chain. This critical piece of information can be used to inform scaling plans and take proactive measures in the solution to make sure this threshold is not crossed. We will look into some techniques for doing this later in this whitepaper.
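The capacity model reduces to a minimum over the effective (utilization-adjusted) TPS of each dependency. A sketch, with made-up numbers for the simplified retailer:

```python
# Raw TPS of each dependency and the fraction available to our solution
# (shared systems, such as the customer database, serve other applications too).
dependencies = {
    "order_fulfillment": {"tps": 200, "available_fraction": 1.0},
    "inventory": {"tps": 500, "available_fraction": 0.5},
    "customer_db": {"tps": 1000, "available_fraction": 0.25},
}

def max_chain_throughput(call_chain):
    """A call chain can only go as fast as its slowest effective TPS."""
    return min(dependencies[s]["tps"] * dependencies[s]["available_fraction"]
               for s in call_chain)

# min(200, 250, 250) -> order fulfillment is the bottleneck here.
print(max_chain_throughput(["order_fulfillment", "inventory", "customer_db"]))
```

A side benefit of keeping the model in code is that “what if we got 50% of the database instead of 25%?” becomes a one-line change rather than a new spreadsheet.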
In this section we covered some of the SLAs that need to be analyzed thoroughly before being committed to, along with some tools to help us estimate the realistic SLAs the solution can support. Once these are set, we turn our focus to designing and planning the system to achieve the committed SLAs and to operate proactively within these constraints.
Step 3. Instrumenting Key Data Points
After we set the expected SLAs of our solution by looking at the dependencies, necessary data points should be collected to check against the expected operating model. In this section we will cover the different data points that can be collected during system run-time.
Although each project requires different key metrics to be collected, 90% of the time the following minimum set of metrics is sufficient to adjust system behavior appropriately.
1. Response Time: this is the total measured time from the moment a request was sent to the dependent service to the moment a response was received. It takes into account all network latency as well as the true processing time of the service the system depends on.
2. Data Payload Size: this is the total size of the response that is returned from the service. Especially when dealing with coarse-grained services, a large payload can affect the entire processing pipeline, so having the data payload size on hand becomes very helpful.
3. Throughput: this can be measured either as the number of requests processed by a service per second or as the data size returned by a service per second. This key data point can be used to control the flow of requests to services.
4. Exception Type/Count: this is the total number of exceptions over a fixed period of time, grouped by exception type. This data point can be used to temporarily shut down parts of the system based on the state of integrated services.
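As a rough sketch, all four data points can be captured by wrapping each call to a dependent service. The class and method names below are illustrative, not part of any particular framework:

```python
import time
from collections import Counter

class ServiceMetrics:
    """Collects the four key data points around calls to a dependent service."""

    def __init__(self):
        self.response_times = []           # seconds per call (incl. failures)
        self.payload_sizes = []            # bytes per successful response
        self.exception_counts = Counter()  # exception type name -> count

    def record_call(self, func, *args, **kwargs):
        """Invoke a service call and record timing, payload size, and errors."""
        start = time.perf_counter()
        try:
            response = func(*args, **kwargs)
        except Exception as exc:
            self.exception_counts[type(exc).__name__] += 1
            raise
        finally:
            # Runs on success and failure, so failed calls are timed too.
            self.response_times.append(time.perf_counter() - start)
        self.payload_sizes.append(len(response))  # assumes a sized payload
        return response

    def throughput(self, window_seconds):
        """Requests handled per second over a given observation window."""
        return len(self.response_times) / window_seconds
```

In practice these counters would be flushed periodically to whatever monitoring store the solution uses; the in-memory lists here keep the sketch self-contained.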
In the next section we will use the aforementioned data points to create a surviving architecture. Namely we will look into controlling the request flow (sometimes referred to as throttling) and circuit breaker implementation.
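The circuit breaker mentioned above pairs naturally with the exception-count data point. The following is a deliberately minimal sketch (no half-open state or timeout-based recovery), not a production implementation:

```python
class CircuitBreaker:
    """Opens after a threshold of consecutive failures and short-circuits
    further calls to the dependent service until explicitly reset."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failure_count = 0
        self.open = False

    def call(self, func, *args, **kwargs):
        if self.open:
            # Fail fast instead of piling more load onto a broken dependency.
            raise RuntimeError("circuit open: dependent service disabled")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.open = True
            raise
        self.failure_count = 0  # any success resets the streak
        return result
```

A fuller implementation would add a cool-down period after which the breaker lets a trial request through; the point here is only that the exception counts collected in Step 3 are what drive the open/closed decision.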
Step 4. Creating System Failure Models
After understanding all the services with which the solution integrates and the realistic SLAs that the solution can meet, and after identifying the data points that will be instrumented to check the real-time operational quality of the system against the set SLAs, the solution architect needs to make sure that the solution can withstand failures in dependent systems and responds to those failures in a predictable manner.
Each dependent service can fail for a different reason and in completely different ways. Unfortunately, these failure scenarios will not be described in the requirements documents as eloquently as the desired features of the system.
Thus it falls on the architect’s shoulders to plan for failures in all dependencies and present the possible recovery/limited-functionality courses of action that the system can take to continue operating in an optimal way.
Although different diagramming techniques can be used to convey the system failure models, we will describe one specific way of documenting the information below. However, as long as the audience you are communicating with (the development team, business analysts and project stakeholders) can read the diagrams without room for subjective interpretation, diagramming tool/method of choice is not as important as the content it represents.
For each system the solution integrates with, there can be a series of things that can go wrong. In our whitepaper so far we have created models for availability, response-time, and throughput. Thus, we will consider only the following types of failure in our failure model:
- Increased Response Time
- Repeated Application Errors
- System Unavailable
A System Failure Model that captures the different types of failures in our e-commerce solution is created below. Please keep in mind that, for brevity, the table is not comprehensive.
In the above example, each scenario is dependent on a set of integrated services. In each row, we put in one possible way in which the integrated service can fail. A simple “x” is used to denote services that are working properly.
In a more elaborate implementation of this model, we need to consider the different combinations of failures. Thus, a more complicated model would have many rows for a given scenario with different failure combinations.
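The combinatorial growth is worth making concrete. With the three failure types above plus a healthy state, every extra dependency multiplies the row count by four (service names below are illustrative):

```python
from itertools import product

services = ["order_fulfillment", "inventory", "product_catalog"]
failure_modes = ["ok", "slow", "errors", "unavailable"]

# Every combination of per-service states; each row that is not all "ok"
# is a candidate row in the elaborated failure model for one scenario.
combinations = list(product(failure_modes, repeat=len(services)))
print(len(combinations))  # 4 ** 3 = 64 rows per scenario
```

This growth is why, in practice, the model is usually trimmed to the combinations that are plausible or business-critical rather than enumerated exhaustively.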
Finally, on the right hand side, the desired system failure response is listed for that scenario/feature. Each system failure response defined in this column will have pros and cons.
It is imperative that the architect considers the full impact of defining a system failure response. In our example, when the product catalog service has high response times, products are looked up in the cached version. Using this specific cache may end up serving the user a slightly outdated version of the product catalog. However this may be preferred over simply throwing an exception and ending the search product use case flow.
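The cache fallback for the product catalog can be sketched as follows. The function names, threshold value, and cache shape are assumptions for illustration:

```python
import time

SLOW_THRESHOLD = 0.5  # seconds; an illustrative SLA-derived limit

def lookup_product(product_id, catalog_service, cache):
    """Try the live catalog first; fall back to a possibly stale cache when
    the service fails, instead of aborting the search-product use case."""
    try:
        start = time.perf_counter()
        product = catalog_service(product_id)
        elapsed = time.perf_counter() - start
    except Exception:
        return cache.get(product_id)  # degraded but functional
    if elapsed > SLOW_THRESHOLD:
        # A real implementation would feed this signal to monitoring so the
        # solution can switch to a cache-first mode proactively.
        pass
    cache[product_id] = product  # refresh the cache on success
    return product
```

The trade-off from the table is visible in the code: the `except` branch knowingly serves stale data rather than propagating the failure to the user.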
As seen in the above example, impacts of system behavior decisions are significant to the business. Consequently it is advised to identify and confirm the system response with the key business stakeholders for better alignment with business objectives.
One approach used to speed up the decision process is to list possible system failure responses as part of the options analysis with a brief explanation of their relative rating in terms of business risk. Some examples of business risk are; operational cost, development time/cost, reduced functionality.
In the next section we will go over how solution architects can pro-actively design solutions that can implement the desired behavior specified in the system failure model and test the implementation.
Available Tools and Patterns to Define System Failure Response
In the previous sections of this paper we covered different steps that helped define the failure models and decide on desired system response to the failures in 3rd party services/systems that the solution depends on.
In this section we will look into different patterns and methodologies that can be utilized both at design time and run-time to ensure specified solution response to failures is implemented and tested correctly.
Simulating Failures with Dependency Injection
Although it may sound like a low level concern for the architect to deal with, the use of dependency injection pattern is extremely important when designing solutions that can survive service disruptions.
Dependency injection enables the plugging-in of different implementations of services into the solution. This is particularly useful when simulating dependent system behavior to test defense measures put in place.
Through dependency injection developers can inject services that are crafted to trigger known scenarios without re-compiling the code. Simple configuration changes can be used to simulate different service failure combinations. This pattern is tremendously helpful in testing the desired system behavior later on using unit tests and integration tests that we will cover shortly.
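A minimal sketch of the idea, with invented service and class names; in a real project the wiring would typically live in configuration or a DI container rather than inline:

```python
class UnavailableInventoryService:
    """Test stub simulating the 'system unavailable' failure mode."""
    def reserve(self, sku, quantity):
        raise ConnectionError("inventory service unavailable")

class InMemoryInventoryService:
    """Happy-path stub backed by a simple in-memory stock table."""
    def __init__(self, stock):
        self.stock = stock
    def reserve(self, sku, quantity):
        if self.stock.get(sku, 0) < quantity:
            raise ValueError("insufficient stock")
        self.stock[sku] -= quantity

class OrderProcessor:
    """The inventory dependency is injected, so tests (or configuration)
    can swap in a failure-simulating implementation without recompiling."""
    def __init__(self, inventory_service):
        self.inventory = inventory_service

    def place_order(self, sku, quantity):
        try:
            self.inventory.reserve(sku, quantity)
            return "confirmed"
        except ConnectionError:
            return "queued"  # the degraded response chosen in the failure model
```

Because `OrderProcessor` never names a concrete inventory class, the same code path is exercised in production and under simulated failure, which is exactly the property the failure-model tests rely on.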
However, this pattern must be enforced across the entire project; a single non-compliant developer can undermine the entire premise of dependency injection. A common measure solution architects can take to ensure the pattern is implemented correctly is the use of code reviews. Since this is a tool that is only useful when applied uniformly across the project, it may also be advisable to bake the pattern into the application development frameworks the development team is using.
Making Unit Tests and Continuous Integration work for you
It is important to repeatedly test and ensure that the solution behaves in specified ways when certain failure scenarios occur. The most efficient way to achieve this is by automating such tests.
One way of achieving this is by mapping failure modes to individual tests and creating an entire test suite that deals only with simulating specific behaviors of dependent services. Since different combinations of errors can, and should, be tested in this suite, it may take longer to execute than typical unit test suites.
As solution architects, we need to ensure that these longer-running integration tests, which rely on the benefits of dependency injection, are not executed on developer workstations.
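Such a failure-mode test might look like the following sketch, using Python’s standard `unittest` module. The payment service, its retry policy, and the failure counts are all hypothetical:

```python
import unittest

class FlakyPaymentService:
    """Stub that fails a scripted number of times before succeeding."""
    def __init__(self, failures):
        self.failures = failures
    def charge(self, amount):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("payment gateway timeout")
        return "charged"

def charge_with_retry(service, amount, attempts=3):
    """Retry transient timeouts; give up after the configured attempts."""
    for attempt in range(attempts):
        try:
            return service.charge(amount)
        except TimeoutError:
            if attempt == attempts - 1:
                raise

class FailureModeTests(unittest.TestCase):
    def test_recovers_from_transient_timeouts(self):
        self.assertEqual(charge_with_retry(FlakyPaymentService(2), 10), "charged")

    def test_gives_up_after_repeated_timeouts(self):
        with self.assertRaises(TimeoutError):
            charge_with_retry(FlakyPaymentService(5), 10)
```

Each row of the failure model maps to one such test case, and the whole suite can be run with `python -m unittest` on the build server.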
The natural candidates for this job are the build servers. Unfortunately, most development shops today do not utilize continuous integration to its full potential, taking advantage only of the build capabilities of the build server. Many build servers today come with utilities to execute test suites, and if your build server does not provide these capabilities, or you do not have a build server at all, there are many open source solutions available.
Executing this test suite in addition to the regular unit test suite on the build server will give architects peace of mind when it comes to ensuring the system behavior on single occurrences of specific scenarios.
Although a great tool in the arsenal of architects, continuous integration tests do not ensure the desired behavior of the system over a period of time; they only verify a specific behavior in the face of a single failure scenario. To see the long-term behavior of the system, we will look into the use of test harnesses in the next section.
Creating a Test Harness
In some cases, just setting up individual stubs that return pre-defined exceptions may be sufficient. However most of the time, it is important to be able to gradually change the responses of the dependencies and observe the solution behavior to the changing environment. This is especially important in testing out scenarios that involve circuit breaker implementations.
In order to test the behavior of the system under changing failure modes, it is imperative to build a controllable environment that can simulate individual or combined service failures with minimal effort. Sometimes this may require creating applications that can be scripted and controlled to return a series of canned responses over a period of time.
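A scriptable stub of that kind can be very small. The sketch below plays back a scripted sequence of behaviors, letting the harness degrade a dependency gradually over a test run (the script format is an assumption for illustration):

```python
from collections import deque

class ScriptedService:
    """Test-harness stub that plays back a scripted sequence of behaviors.
    The last step repeats once the script is exhausted, so a 'service down'
    ending simulates a persistent outage."""

    def __init__(self, script):
        # script: sequence of ("ok", payload) or ("error", exception) steps.
        self.script = deque(script)

    def call(self):
        step = self.script.popleft() if len(self.script) > 1 else self.script[0]
        kind, value = step
        if kind == "error":
            raise value
        return value

# Example: two healthy responses, then persistent unavailability.
stub = ScriptedService([
    ("ok", {"status": 200}),
    ("ok", {"status": 200}),
    ("error", ConnectionError("service down")),
])
```

Driving a circuit breaker against such a stub is how the “opens after N failures, stays open afterwards” behavior can be verified deterministically.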
With today’s virtualization capabilities, provided by operating systems and third-party solutions, it is feasible to create virtual environments for this task. These environments can be utilized for a period of time in each iteration and then taken down when not in use.
One thing to note is that creating a test harness is not a one-time activity and should be re-visited with each iteration of the solution. As more dependencies are introduced, it could require significant effort to keep the test harness relevant to the current code-base.
Wrapping it up
As solution architects we are increasingly facing the challenge of delivering solutions that integrate with existing systems in the cloud, in different client data centers, or in the same data center. No matter where these services are, there are a number of ways in which the behavior of the integrated systems can affect our solutions.
Part of our responsibility as architects is to identify, design and test against the failures or anomalies in these systems. In this white paper, we have covered how we can better plan for integrating with other services. We looked at identifying our SLAs to understand our target system performance. We then looked at creating failure models to identify how the systems can fail and what the solution response should be for different failure combinations. Finally we visited some common practices that we can incorporate into our solutions to make the implementation and testing of desired system behaviors easier.
The tools and techniques we have covered in this whitepaper are only a small subset of what we can do as solution architects to ensure the delivery of quality software. In addition to the technical side, the tremendous effort required to organize the project teams should not be underestimated. The human and team dynamics of working with all the service teams deserve a separate whitepaper.
About the Author
Mr. Cetinkaya is a Principal Solutions Architect with Netsoft USA, a strategic technology and design firm headquartered in New York City, where he designs and guides the development of innovative solutions contributing to Netsoft’s award-winning Solutions Delivery capabilities. Focused on recombinant innovations, Mr. Cetinkaya specializes in the design of scalable systems using advanced software architecture constructs including software factories, zone-based application frameworks and re-usable components. Mr. Cetinkaya holds a B.S. in Computer Science from Polytechnic University in Brooklyn, NY and an M.S. in Management of Technology from New York University.