Oliver Gould Discusses Architecting to Avoid and Recover from Failure

In this week’s podcast, Robert Blumen talks to Oliver Gould at QCon San Francisco 2016. Gould is the CTO of Buoyant, where he leads open source development efforts. Prior to Buoyant he was a Staff Infrastructure Engineer at Twitter, where he was technical lead on the Observability, Traffic, Configuration and Co-ordination teams.

Key Takeaways

  • Stratification allows applications to own their logic while libraries take care of the different mechanisms, such as service discovery and load balancing
  • Cascading failures can’t be tested or protected against, so having a fast time to recovery is important
  • Having developers own their services with on-call mechanisms improves the reliability of the service; it’s best to optimise automatic restarts so problems can be addressed during normal working hours
  • Post mortem analysis of failures is important to improve run books or checklists and to share learning between teams
  • Incremental roll out of features with feature flags or weighted routing provides agility while testing with production load, which highlights issues that aren’t noticed during limited developer testing

Show Notes

  • 1m:35s - What is done at the architecture level to prevent failure? 
  • 1m:50s - As systems become more complex we have much larger infrastructure than before
  • 2m:00s - There are many more components than before; instead of having everything on one large computer we now have a warehouse of computers
  • 2m:10s - Things are going to break because there’s more opportunity for failure, in the form of connections between components or components themselves


  • 2m:30s - Stratification, or separating concerns, is one way of architecting to avoid failure
  • 2m:35s - Mesos and Aurora at Twitter simplify and separate a bunch of the hardware and operational concerns
  • 2m:50s - Finagle manages communication between systems, and a standard deploy tool automates the deployment
  • 2m:55s - The end goal is to move the standardised components out of the application developer’s concern and into standard libraries and tools
  • 3m:15s - Stratification is about separating the layers of the stack, as opposed to isolation which may have other meanings


  • 4m:05s - Each domain has different failure and operating modes, and the layered approach to resiliency means that the layer handles this automatically
  • 4m:30s - Large systems may fail in unexpected ways
  • 4m:35s - Twitter originally had the “Fail Whale” but this has been phased out as the system has become more stable
  • 4m:50s - As Twitter grew, it needed to move quicker, with more engineers and less whale time
  • 5m:10s - Automation and social tools were needed to improve the situation


  • 5m:35s - Deploys went out over a few hundred thousand machines, which is where incidents would first occur
  • 6m:00s - With that number of machines, there is a large and hard-to-reason-about matrix of places where problems may occur
  • 6m:10s - Being able to respond to failures efficiently is as important as avoiding them in the first place
  • 6m:30s - Known unknowns are the things you test for; unknown unknowns are the things you learn about during failures
  • 6m:40s - Capturing information is important during the failure handling process

Cascading failures

  • 7m:30s - The really complicated failures are difficult to think through
  • 7m:35s - Single issues, like a digger going through a fibre line, are simple, but it’s rarely that easy
  • 8m:00s - One incident involved a power outage, and a second power outage in a nearby vicinity meant that failovers did not go well
  • 8m:25s - Cascading failures are a known risk, but they are not usually considered all the way down the stack, for example to the power supply
  • 8m:35s - Testing for a power outage is possible, but testing for its cascade effects may not be


  • 9m:15s - Teamwork is essential; a DevOps model where the service providers are on call was the only way to make the system better
  • 9m:30s - If there isn’t an optimal feedback loop between the developers of the service and the runtime, then it’s difficult to get fixes put in place
  • 9m:45s - The more barriers there are to communication, the more difficult things can be to fix

Production Load

  • 10m:00s - Observability system launched on Cassandra and a time series database
  • 10m:25s - The system was tested for a month or two, but when it went into production there were a number of problems that only became apparent because of the traffic
  • 10m:35s - There were traffic spikes that were in excess of the capacity of the counter system
  • 11m:00s - It took several incidents over a number of months (rather than one incident) before the problems were fixed across the board

Root Cause Analysis

  • 11m:25s - There are times where a single issue is the cause, but it is rare. In many cases it is the interactions between components
  • 12m:00s - Outages are an opportunity to learn about the unknown unknowns, and turn them into known unknowns for the future
  • 12m:10s - When you’re tackling an outage, you often don’t know what the right questions to ask are; even learning what the right questions are for future events is beneficial
  • 12m:15s - As a team, getting better at asking questions reduces the time to recovery


  • 12m:35s - There were outages involving Zookeeper and Finagle (which power the service discovery): initially, service discovery with Zookeeper was used to discover which service to send requests to
  • 13m:00s - When Zookeeper went down, the clients would lose their state and fail themselves, so the client libraries were improved to cache state
  • 13m:15s - Zookeeper went down another time but when it recovered the service state was empty, and didn’t get repopulated, which caused further problems
  • 13m:40s - Finagle now includes a probation mechanism, which doesn’t eject service configuration immediately if Zookeeper is down, but considers the problem in a different way
  • 13m:55s - An intermediary load balancer is used, which allows clients to transparently connect to services
  • 14m:10s - A series of incidents led to the re-analysis of the problem and overall a better solution
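
The caching and probation ideas described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Finagle's actual implementation: the `CachedResolver` class, the `lookup` callback standing in for the discovery backend, and the `probation_seconds` parameter are all hypothetical names chosen for this example. The resolver keeps serving the last non-empty address set while the backend is unreachable or reports an empty set, and only accepts "no instances" once the probation window has passed.

```python
import time
from typing import Callable, FrozenSet, Optional


class CachedResolver:
    """Serves the last known-good address set while discovery looks unhealthy."""

    def __init__(self,
                 lookup: Callable[[], FrozenSet[str]],
                 probation_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup              # hypothetical call into the discovery backend
        self._probation = probation_seconds
        self._clock = clock
        self._last_good: FrozenSet[str] = frozenset()
        self._suspect_since: Optional[float] = None

    def resolve(self) -> FrozenSet[str]:
        try:
            fresh = self._lookup()
        except ConnectionError:
            fresh = frozenset()            # backend unreachable: treat as "no data"

        if fresh:
            # Healthy answer: remember it and end any probation period.
            self._last_good = fresh
            self._suspect_since = None
            return fresh

        # Empty or failed lookup: start (or continue) probation instead of
        # ejecting every address immediately.
        now = self._clock()
        if self._suspect_since is None:
            self._suspect_since = now
        if now - self._suspect_since <= self._probation:
            return self._last_good
        return frozenset()                 # probation expired: accept the empty set
```

A caller would invoke `resolve()` before each connection attempt (or on a timer) and keep using whatever set it returns, so a discovery outage degrades to slightly stale routing rather than to clients failing themselves.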

Post Mortems

  • 14m:55s - Tools and processes are needed to understand outages
  • 15m:05s - A company that doesn’t do post mortems is unlikely to be learning from its outages, or isn’t sharing the knowledge
  • 15m:10s - Post mortems are great at socialising the changes
  • 15m:20s - One of the things to get out of a post mortem is the set of unknown unknowns; what questions should be asked instead when a problem occurs?
  • 15m:35s - There’s always a checklist of things to do, and it should be up-to-date
  • 16m:10s - Processes aren’t perfect; they need to be refined in small ways
  • 16m:40s - Both low cost and high cost fixes come out of a post mortem, and they need to be added to the road map with an appropriate fix schedule

Learning from Failure

  • 17m:35s - Companies like Google have been working at large scale for years; tools like Zookeeper, which are hard to get right, can now be used by anyone
  • 18m:05s - Not everyone needs to go through all failures individually; shared learning of failures helps everyone


  • 18m:20s - Linkerd is built on Finagle
  • 18m:30s - Most people use Finagle for its programming model, but there are operational tools built in, such as service discovery, load balancing, security, retries, timeouts etc.
  • 19m:20s - Linkerd separates logical names from concrete names: for example, the logical user service versus a concrete local developer instance of that service, or the production service
  • 20m:00s - Separate environments and canaries can be abstracted by knowing which services to connect to
  • 20m:20s - Service discovery can be done based on central policy or based on local information
  • 20m:40s - There are pluggable modules for Java that allow you to programmatically decide how to route HTTP requests, for example using a tool called Namerd
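
The separation of logical names from concrete names can be illustrated with a small prefix-rewriting table. This is a simplified sketch, not Linkerd's or Namerd's actual syntax or API: the `resolve` function, the `Rules` type, and the example paths (`/svc`, `/env/prod`, `/host/localhost:8080`) are assumptions made up for the illustration. The application always asks for the same logical name, and the rule table decides which concrete destination that name means in a given environment.

```python
from typing import List, Tuple

# Each rule is (logical prefix, replacement prefix). Hypothetical format.
Rules = List[Tuple[str, str]]


def resolve(name: str, rules: Rules, max_steps: int = 16) -> str:
    """Rewrite a name by prefix until no rule applies; later rules take priority."""
    for _ in range(max_steps):
        for prefix, target in reversed(rules):
            if name == prefix or name.startswith(prefix + "/"):
                name = target + name[len(prefix):]
                break
        else:
            return name                    # no rule matched: the name is concrete
    raise ValueError(f"too many rewrites for {name!r}")


# Central policy: every logical service maps to the production environment.
PRODUCTION: Rules = [("/svc", "/env/prod")]

# Local override: a developer points only the users service at their own machine.
DEV_OVERRIDE: Rules = PRODUCTION + [("/svc/users", "/host/localhost:8080")]
```

With these tables, `resolve("/svc/users", PRODUCTION)` yields a production path, while the same logical name under `DEV_OVERRIDE` resolves to the local instance and every other service still goes to production. The same mechanism could express staging environments or canaries by adding or replacing rules rather than changing application code.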

Time to Recovery

  • 21m:15s - You can’t eliminate failures completely; how much engineering should be spent on avoiding failure versus optimising the time to recovery?
  • 21m:40s - It’s unlikely you will have a service which never breaks
  • 21m:55s - It’s much more important to be able to fix things quickly when they break
  • 22m:00s - Systems that are easy to debug and that are visible are easier to fix
  • 22m:15s - Having a list of questions that you can answer quickly will give a better time to resolution
  • 22m:40s - Having the ability to find the answer to a question, and to be able to drill further, is important
  • 23m:10s - Logging and metrics are important when a human has to be involved
  • 23m:15s - Optimising the human out of the loop is important for speed
  • 23m:20s - It’s common to let a system fail and recover automatically without intervention, to avoid a 3am callout, and then resolve the underlying failure at a more convenient time


  • 24m:15s - Things rarely go down hard; more frequently they get slower over time
  • 24m:30s - Spotting trends, such as an ever increasing latency, is key to spotting potential problems for the future
  • 24m:50s - Zipkin is a tracing system that allows multiple services to be traced
  • 25m:00s - In some cases services may have hidden dependencies which aren’t known, so being able to see the service graph is an important piece of information
  • 25m:15s - Zipkin can be invaluable at showing where the requests are going and how they are being processed
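
One simple way to spot an ever increasing latency is to fit a line through recent samples and alert on its slope. The sketch below is illustrative only: the function names, the choice of evenly spaced p99 samples, and the threshold value are assumptions, and a production system would more likely use the trend functions of its metrics platform.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope of samples taken at evenly spaced intervals."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def is_creeping(p99_ms: Sequence[float], threshold_ms_per_interval: float = 0.5) -> bool:
    """True when latency grows faster than the (hypothetical) threshold per interval."""
    return slope(p99_ms) > threshold_ms_per_interval
```

A check like this flags a service whose latency rises steadily across samples even though no single sample breaches an absolute limit, which is the "getting slower over time" pattern described above.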

Testing and Feature Flags

  • 25m:50s - Can change be managed more intelligently to reduce risk?
  • 26m:00s - The risk of a change can be reduced by making its impact smaller, such as limiting it to a subset of users
  • 26m:20s - By nature, change is when problems will occur
  • 26m:50s - Getting to production as soon as possible is important, so that load is being delivered to the system in ways that can’t be anticipated during testing
  • 27m:20s - There’s no substitute for testing with production load
  • 27m:40s - Being incremental in deployment is a good way of testing services
  • 27m:55s - Feature flags are important in being able to incrementally deliver features to a subset of production users
  • 28m:05s - Linkerd has a weighting algorithm that allows a subset of users to be routed to the new functionality
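
An incremental roll out to a subset of users can be sketched as a percentage-based feature flag. This is a minimal sketch under assumptions, not Linkerd's weighting algorithm: the `bucket` and `use_new_path` functions and the 10,000-bucket scheme are hypothetical. Hashing the user ID keeps each user's assignment stable between requests, and raising the percentage only ever adds users to the new path.

```python
import hashlib


def bucket(user_id: str, buckets: int = 10_000) -> int:
    """Map a user ID to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def use_new_path(user_id: str, rollout_percent: float) -> bool:
    """True when this user falls inside the rolled-out share of traffic."""
    # 10,000 buckets gives a resolution of 0.01 percent.
    return bucket(user_id) < rollout_percent * 100
```

A service might start with `rollout_percent` at 1, watch its metrics under real production load, and step the value up gradually; setting it back to 0 turns the feature off for everyone without a redeploy.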

Game Planning

  • 29m:15s - Game planning can be good, but you need to have a plan before starting
  • 29m:20s - The plan you start with quickly goes out the window in any case
  • 29m:40s - Testing for failures, such as pulling the plug on racks, is important
  • 30m:10s - Soft failures are pernicious in a distributed system


  • 30m:30s - Linkerd has different load balancing algorithms, such as round robin, and weighting rules
  • 30m:50s - It was surprising how much better some of the features were at handling selective failures
  • 31m:10s - Using round robin with a failed service dropped the successful connections to around 90%; using a more intelligent load balancing algorithm brought reliability up to 99.9%
  • 31m:40s - Linkerd can inject latency in the service or routing; having a networking layer outside of the application can allow for such systems to be tweaked
  • 31m:50s - Other solutions, such as using iptables to introduce latency, are also available
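
The gap between round robin and a smarter balancer can be reproduced with a toy simulation. The sketch below is not one of Linkerd's algorithms: `RoundRobin`, `FailureAware`, the `cooldown` parameter and the `success_rate` helper are hypothetical names for this example, and the figures it produces are properties of the toy model rather than the measurements quoted in the podcast. With ten backends and one of them failing every request, plain round robin keeps sending a tenth of the traffic to the broken backend, while a balancer that skips a backend for a while after it fails sends it only an occasional probe.

```python
from typing import Callable, List


class RoundRobin:
    """Cycles through backends in order, ignoring their health."""

    def __init__(self, n_backends: int) -> None:
        self.n = n_backends
        self.index = -1

    def pick(self) -> int:
        self.index = (self.index + 1) % self.n
        return self.index

    def report(self, backend: int, ok: bool) -> None:
        pass                               # plain round robin ignores outcomes


class FailureAware(RoundRobin):
    """Round robin that skips a backend for `cooldown` picks after it fails."""

    def __init__(self, n_backends: int, cooldown: int = 100) -> None:
        super().__init__(n_backends)
        self.cooldown = cooldown
        self.picks = 0
        self.skip_until: List[int] = [0] * n_backends

    def pick(self) -> int:
        self.picks += 1
        for _ in range(self.n):
            candidate = super().pick()
            if self.skip_until[candidate] <= self.picks:
                return candidate
        return super().pick()              # everything is marked down: plain rotation

    def report(self, backend: int, ok: bool) -> None:
        if not ok:
            self.skip_until[backend] = self.picks + self.cooldown


def success_rate(balancer: RoundRobin,
                 is_healthy: Callable[[int], bool],
                 requests: int) -> float:
    """Send `requests` through the balancer and return the fraction that succeed."""
    succeeded = 0
    for _ in range(requests):
        backend = balancer.pick()
        ok = is_healthy(backend)
        balancer.report(backend, ok)
        if ok:
            succeeded += 1
    return succeeded / requests
```

Running `success_rate(RoundRobin(10), lambda b: b != 3, 1000)` against `success_rate(FailureAware(10), lambda b: b != 3, 1000)` shows the difference. The cooldown trades a small number of failed probe requests for the ability to notice when the backend recovers.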


  • 32m:15s - Small components doing one thing well is a key goal to take into the distributed world
  • 32m:30s - Microservices are part of that but are not easy; the goal is to keep the applications as simple as possible
  • 32m:40s - One small team should understand that codebase
  • 32m:50s - The infrastructure should also be separated so that one team can own their own responsibilities
