Oliver Gould Discusses Architecting to Avoid and Recover from Failure

Jan 01, 2017

In this week’s podcast, Robert Blumen talks to Oliver Gould at QCon San Francisco 2016. Gould is the CTO of Buoyant, where he leads open source development efforts. Prior to Buoyant he was a Staff Infrastructure Engineer at Twitter, where he was technical lead on the Observability, Traffic, Configuration and Co-ordination teams.

Key Takeaways

Stratification allows applications to own their logic while libraries take care of the different mechanisms, such as service discovery and load balancing
Cascading failures can’t be tested or protected against, so having a fast time to recovery is important
Having developers own their services with on-call mechanisms improves the reliability of the service; it’s best to optimise automatic restarts so problems can be addressed during normal working hours
Post mortem analysis of failures is important to improve run books or checklists and to share learning between teams
Incremental roll out of features with feature flags or weighted routing provides agility while testing with production load, which highlights issues that aren’t noticed during limited developer testing

Show Notes

1m:35s - What is done at the architecture level to prevent failure?
1m:50s - As systems become more complex we have much larger infrastructure than before
2m:00s - There are many more components than before; instead of having everything on one large computer we now have a warehouse of computers
2m:10s - Things are going to break because there’s more opportunity for failure, in the form of connections between components or components themselves

Stratification

2m:30s - Stratification, or separating concerns, is one way of architecting to avoid failure
2m:35s - Mesos and Aurora at Twitter simplify and separate a bunch of the hardware and operational concerns
2m:50s - Finagle manages communication between systems, and a standard deploy tool automates deployment
2m:55s - The end goal is to move the standardised components out of the application developer’s concern and into standard libraries and tools
3m:15s - Stratification is about separating the layers of the stack, as opposed to isolation which may have other meanings

Resiliency

4m:05s - Each domain has different failure and operating modes, and the layered approach to resiliency means that the layer handles this automatically
4m:30s - Large systems may fail in unexpected ways
4m:35s - Twitter originally had the “Fail Whale” but this has been phased out as the system has become more stable
4m:50s - As Twitter grew, it needed to move quicker, with more engineers and less whale time
5m:10s - Automation and social tools were needed to improve the situation

Failures

5m:35s - The first deploys ran across a few hundred thousand machines, where incidents would occur
6m:00s - With that many machines, the matrix of places where problems may occur is large and hard to reason about
6m:10s - Being able to respond to failures efficiently is as important as avoiding them in the first place
6m:30s - Known unknowns are the things you test for; unknown unknowns are the things you learn about during failures
6m:40s - Capturing information is important during the failure handling process

Cascading failures

7m:30s - The really complicated failures are difficult to think through
7m:35s - Single issues, like a digger going through a fibre line, are simple, but it’s rarely that easy
8m:00s - One incident involved a power outage, and a second power outage in a nearby vicinity meant that failovers did not go well
8m:25s - Cascading failures are known but not usually considered all the way down, like for power
8m:35s - Testing for a power outage is possible but the cascade effects may not be

Teamwork

9m:15s - Teamwork is essential; a DevOps model where the service providers are on call was the only way to make the system better
9m:30s - If there isn’t an optimal feedback loop between the developers of the service and the runtime, then it’s difficult to get fixes put in place
9m:45s - The more barriers there are to communication, the more difficult things can be to fix

Production Load

10m:00s - Observability system launched on Cassandra and a time series database
10m:25s - The system was tested for a month or two, but when it went into production there were a number of problems that only became apparent because of the traffic
10m:35s - There were traffic spikes that were in excess of the capacity of the counter system
11m:00s - It took several incidents over a number of months (rather than one incident) for the problems to be fixed across the board

Root Cause Analysis

11m:25s - There are times where a single issue is the cause, but it is rare. In many cases it is the interactions between components
12m:00s - Outages are an opportunity to learn about the unknown unknowns, and turn them into known unknowns for the future
12m:10s - When you’re tackling an outage, often you don’t know what the right questions to ask are; even learning what the right questions to ask for future events is beneficial
12m:15s - As a team, getting better at asking questions reduces the time to recovery

Outages

12m:35s - Outages involving Zookeeper and Finagle (which power the service discovery): initially service discovery with Zookeeper was used to discover which service to send requests to
13m:00s - When Zookeeper went down, the clients would lose their state and fail themselves, so the client libraries were improved to cache state
13m:15s - Zookeeper went down another time but when it recovered the service state was empty, and didn’t get repopulated, which caused further problems
13m:40s - Finagle now includes a probation mechanism, which doesn’t eject service configuration immediately if Zookeeper is down, but considers the problem in a different way
13m:55s - An intermediary load balancer is used, which allows clients to transparently connect to services
14m:10s - A series of incidents led to the re-analysis of the problem and overall a better solution
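
The client-side caching described in the Zookeeper incidents above can be illustrated with a minimal sketch. It is not Finagle's implementation: `CachedResolver` and its `lookup` callable are hypothetical names, and the policy shown (keep the last known-good address set when the registry is unreachable or comes back empty) is only one plausible reading of the notes.

```python
from typing import Callable, FrozenSet


class CachedResolver:
    """Keeps the last known-good address set for a service.

    `lookup` is a hypothetical callable standing in for a registry query
    (for example, a Zookeeper read). It may raise, or return an empty set.
    """

    def __init__(self, lookup: Callable[[], FrozenSet[str]]):
        self._lookup = lookup
        self._last_good: FrozenSet[str] = frozenset()

    def resolve(self) -> FrozenSet[str]:
        try:
            fresh = self._lookup()
        except Exception:
            # Registry unreachable: serve the cached state instead of failing.
            return self._last_good
        if not fresh and self._last_good:
            # Registry answered but is empty: treat the answer as suspect
            # and keep the cached state rather than ejecting every host.
            return self._last_good
        self._last_good = fresh
        return fresh
```

A production version would probably also bound how long stale state is trusted, since a service that really has been shut down should eventually disappear from the cache.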

Post Mortems

14m:55s - Tools and processes are needed to understand outages
15m:05s - A company that doesn’t do post mortems is unlikely to be learning from its failures, or isn’t sharing the knowledge
15m:10s - Post mortems are great at socialising the changes
15m:20s - One of the things to get out of a post mortem is the unknown unknowns; what questions should be asked instead when a problem occurs?
15m:35s - There’s always a checklist of things to do, and it should be up-to-date
16m:10s - Processes aren’t perfect; they need to be refined in small ways
16m:40s - Both low cost and high cost fixes come out of a post mortem, and they need to be added to the road map with an appropriate fix schedule

Learning from Failure

17m:35s - Google has been working at large scale for years with tools like Zookeeper which are hard to get right, and now can be used by anyone
18m:05s - Not everyone needs to go through all failures individually; shared learning of failures helps everyone

Linkerd

18m:20s - Linkerd is built on Finagle
18m:30s - Most people use Finagle for its programming model, but there are operational tools built in, such as service discovery, load balancing, security, retries, timeouts etc
19m:20s - Linkerd separates out logical names from concrete names, such as the user services versus a local developer instantiation of the server, or the production service
20m:00s - Separate environments and canaries can be abstracted by knowing which services to connect to
20m:20s - Service discovery can be done based on central policy or based on local information
20m:40s - There are pluggable modules for Java that allow you to programmatically decide how to route HTTP requests, for example using a tool called Namerd
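
The split between logical and concrete names can be sketched as a small prefix-rewriting function. This is an illustration only and not Linkerd's or Namerd's configuration format: the rule list, the `/svc`, `/prod/us-east` and `/localhost/9001` names, and the "last matching rule wins" policy are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical routing rules: (logical prefix, concrete prefix).
# Later rules take precedence over earlier ones.
Rules = List[Tuple[str, str]]


def resolve(logical: str, rules: Rules) -> str:
    """Rewrite a logical name to a concrete one using the last matching rule."""
    for prefix, target in reversed(rules):
        if logical == prefix or logical.startswith(prefix + "/"):
            return target + logical[len(prefix):]
    raise LookupError(f"no rule matches {logical!r}")


# Central policy: every logical service lives in the production cluster.
PRODUCTION: Rules = [("/svc", "/prod/us-east")]

# Local override: same as production, except the users service, which
# points at a developer's own instance.
DEVELOPER: Rules = PRODUCTION + [("/svc/users", "/localhost/9001")]
```

With these rules the application always asks for `/svc/users`; under `PRODUCTION` that resolves to `/prod/us-east/users`, and under `DEVELOPER` to `/localhost/9001`, while other services such as `/svc/timeline` still resolve to production.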

Time to Recovery

21m:15s - You can’t eliminate failures completely; how much engineering should be spent on avoiding failure versus optimising the time to recovery?
21m:40s - It’s unlikely you will have a service which never breaks
21m:55s - It’s much more important to be able to fix things quickly when they break
22m:00s - Systems that are easy to debug and that are visible are easier to fix
21m:15s - Having a list of questions that you can answer quickly will give a better time to resolution
22m:40s - Having the ability to find the answer to a question, and to be able to drill further, is important
23m:10s - Logging and metrics are important when a human has to be involved
23m:15s - Optimising the human out of the loop is important for speed
23m:20s - It’s common to let a system fail and recover automatically without intervention, to avoid a 3am callout, and to resolve the failure at a more convenient time

Information

24m:15s - Things rarely go down hard, and they frequently get slower over time
24m:30s - Spotting trends, such as an ever increasing latency, is key to spotting potential problems for the future
24m:50s - Zipkin is a tracing system that allows multiple services to be traced
25m:00s - In some cases services may have hidden dependencies which aren’t known, so being able to see the service graph is an important piece of information
25m:15s - Zipkin can be invaluable at showing where the requests are going and how they are being processed
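
Spotting an ever increasing latency can be approximated by fitting a straight line to recent samples and looking at its slope. The sketch below is a minimal illustration under stated assumptions: the function names and the threshold value are hypothetical, the samples are assumed to be evenly spaced, and a real monitoring system would likely work on percentiles rather than raw values.

```python
from typing import Sequence


def latency_slope(samples_ms: Sequence[float]) -> float:
    """Least-squares slope of latency against sample index (ms per sample)."""
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_ms))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_degrading(samples_ms: Sequence[float],
                 threshold_ms_per_sample: float = 0.5) -> bool:
    """True if latency is rising faster than the (assumed) threshold."""
    return latency_slope(samples_ms) > threshold_ms_per_sample
```

For example, `latency_slope([10, 12, 14, 16])` is 2.0 ms per sample, which `is_degrading` would flag with the default threshold, while a flat series has a slope of zero.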

Testing and Feature Flags

25m:50s - Can change be managed more intelligently to reduce risk?
26m:00s - The risk of a change can be reduced by making its impact smaller, such as limiting it to a subset of users
26m:20s - By nature, change is when problems will occur
26m:50s - Getting to production as soon as possible is important, so that load is being delivered to the system in ways that can’t be anticipated during testing
27m:20s - There’s no substitute for testing with production load
27m:40s - Being incremental in deployment is a good way of testing services
27m:55s - Feature flags are important in being able to incrementally deliver features to a subset of production users
28m:05s - Linkerd has a weighting algorithm that allows a subset of users to be routed to the new functionality
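
Incremental delivery to a subset of users can be sketched with a percentage-based feature flag. This is not Linkerd's weighting algorithm, which the notes do not describe; it is a generic illustration in which the `bucket` and `is_enabled` names, the hashing scheme and the example feature name are all assumptions.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """True if the user falls inside the rollout percentage for the feature."""
    return bucket(user_id, feature) < rollout_percent
```

Hashing the user id together with the feature name keeps each user's experience stable across requests, and raising `rollout_percent` only ever adds users: anyone enabled at 10% is still enabled at 50%.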

Game Planning

29m:15s - Game planning can be good; but you need to have a plan before starting
29m:20s - The plan you start with quickly goes out the window in any case
29m:40s - Testing for failures, such as pulling the plug on racks, is important
30m:10s - Soft failures are pernicious in a distributed system

Latency

30m:30s - Linkerd offers different load balancing algorithms, such as round robin, as well as weighting rules
30m:50s - It was surprising how much better some of the features were at handling selective failures
31m:10s - Using round robin with a failed service dropped the success rate to around 90%; using a more intelligent load balancing algorithm brought reliability up to 99.9%
31m:40s - Linkerd can inject latency in the service or routing; having a networking layer outside of the application can allow for such systems to be tweaked
31m:50s - Other solutions, such as using iptables to introduce latency, are also available
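
The gap between round robin and a smarter balancer can be reproduced in a small simulation. The podcast notes do not say which algorithm was used, so the `FailureAware` class below is a stand-in: a round robin that skips a backend for a fixed number of picks after it fails. The class names, the `cooldown` parameter and the ten-backend setup are assumptions for illustration.

```python
from typing import Callable, List


class RoundRobin:
    """Cycles through backends in order, ignoring request outcomes."""

    def __init__(self, backends: List[str]):
        self._backends = backends
        self._next = 0

    def pick(self) -> str:
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend

    def report(self, backend: str, ok: bool) -> None:
        pass  # plain round robin does not learn from failures


class FailureAware(RoundRobin):
    """Round robin that skips a backend for `cooldown` picks after it fails."""

    def __init__(self, backends: List[str], cooldown: int = 100):
        super().__init__(backends)
        self._cooldown = cooldown
        self._skip_until = {b: 0 for b in backends}
        self._picks = 0

    def pick(self) -> str:
        self._picks += 1
        for _ in range(len(self._backends)):
            backend = super().pick()
            if self._picks >= self._skip_until[backend]:
                return backend
        return super().pick()  # every backend is cooling down: try one anyway

    def report(self, backend: str, ok: bool) -> None:
        if not ok:
            self._skip_until[backend] = self._picks + self._cooldown


def success_rate(balancer: RoundRobin,
                 is_healthy: Callable[[str], bool],
                 requests: int) -> float:
    """Send `requests` through the balancer and return the fraction that succeed."""
    ok_count = 0
    for _ in range(requests):
        backend = balancer.pick()
        ok = is_healthy(backend)
        balancer.report(backend, ok)
        ok_count += ok
    return ok_count / requests
```

With ten backends of which one always fails, plain round robin sends every tenth request to the broken backend, so one request in ten fails. The failure-aware variant only probes the broken backend once per cooldown window, so far fewer requests are lost; the exact figure depends on the cooldown chosen.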

Summary

32m:15s - Small components doing one thing well is a key goal to take into the distributed world
32m:30s - Microservices are part of that but are not easy; the goal is to keep the applications as simple as possible
32m:40s - One small team should understand that codebase
32m:50s - The infrastructure should also be separated so that one team can own their own responsibilities
