Haley Tucker on Responding to Failures in Playback Features at Netflix

Dec 09, 2016

Podcast with

Haley Tucker

Thomas Betts

In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.

Key Takeaways

Distributed systems fail regularly, often due to unexpected reasons
Data canaries can identify invalid metadata before it can enter and corrupt the production environment
ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions
Fallbacks are an important component of system stability, but the fallbacks must be fast and light to not cause secondary failures
Distributed systems are fundamentally social systems, and require a blameless culture to be successful

Show Notes

Fun with distributed systems

1m:24s - Every outage at Netflix bears out Leslie Lamport's observation that “a distributed system is one in which the failure of a computer you didn't know exists can render your own computer unusable.”

Weird data in the catalog, solved with data canaries

2m:04s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.

2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will be using the data.

3m:29s - The access pattern used by the playback service was different from the one exercised by the checks, which led to unexpected results in production.

3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.

4m:22s - This process has been described as data canaries.
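
The consumer-side validation described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the snapshot shape, the `runtime_sec` field, and the function names `playback_check`, `run_canaries`, and `publish_if_healthy` are all assumptions made for the example. The idea shown is only that each consuming service contributes a check written against its own access pattern, and the candidate data is promoted only when every check passes.

```python
from typing import Callable, Dict, List

# Hypothetical shape of a metadata snapshot: video id -> attributes.
Snapshot = Dict[str, dict]
# A consumer check returns the list of problems it found (empty means OK).
Check = Callable[[Snapshot], List[str]]


def playback_check(snapshot: Snapshot) -> List[str]:
    """Hypothetical playback-side check: every video needs a positive runtime."""
    return [
        f"{video_id}: missing or invalid runtime"
        for video_id, attrs in snapshot.items()
        if not isinstance(attrs.get("runtime_sec"), int) or attrs["runtime_sec"] <= 0
    ]


def run_canaries(candidate: Snapshot, checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every consumer's check on the candidate; return failures by consumer."""
    failures: Dict[str, List[str]] = {}
    for consumer, check in checks.items():
        problems = check(candidate)
        if problems:
            failures[consumer] = problems
    return failures


def publish_if_healthy(candidate: Snapshot, current: Snapshot,
                       checks: Dict[str, Check]) -> Snapshot:
    """Promote the candidate only when no consumer reports a problem."""
    return candidate if not run_canaries(candidate, checks) else current
```

Keeping the checks owned by the consumers means the metadata service only has to orchestrate them; it does not need to anticipate how each consumer reads the data.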

A vanishing critical service, prevented by implementing a “kill switch”

5m:07s - A second, related incident involved a thin, lightweight logging service that ran out of memory and crashed because the entire video metadata blob was being loaded into memory.

6m:22s - Isolating the problem was tricky due to different configurations in test and prod.

6m:44s - After setting the test config to match production, the root cause was identified deep within the dependency graph, in a .jar that isn’t actually needed.

6m:59s - Pruning the dependency graph, and removing the .jar completely is still a work-in-progress.

7m:07s - Tucker's team worked with the video metadata team to implement a kill switch, which lets the logging service stop consuming the metadata service.

7m:56s - There were 14 pages of .jars in the application, which is a remnant of the service being built as part of a legacy monolith.

8m:25s - The kill switch is not a circuit breaker, but a config value to prevent loading the data.
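
A config-driven kill switch of this kind might look like the sketch below. The setting name `METADATA_LOAD_ENABLED`, the `LoggingService` class, and the injected `load_metadata` callable are hypothetical; the show notes do not say how the real switch is stored or read. Unlike a circuit breaker, nothing here reacts to observed failures: the value is simply consulted before the data is loaded.

```python
import os
from typing import Callable, Mapping, Optional


def metadata_loading_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the hypothetical kill switch; loading stays on unless set to 'false'."""
    return env.get("METADATA_LOAD_ENABLED", "true").strip().lower() != "false"


class LoggingService:
    """Toy stand-in for a service that does not actually need the metadata."""

    def __init__(self, load_metadata: Callable[[], dict],
                 env: Mapping[str, str] = os.environ) -> None:
        # Skip the expensive load entirely when the switch is off.
        self.metadata: Optional[dict] = (
            load_metadata() if metadata_loading_enabled(env) else None
        )
```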

9m:16s - When this service went down, it created a cascading failure to the proxy tier. That’s where Failure Injection Testing (FIT) comes into play.

10m:16s - FIT tests are manually run, based on a specifically defined scenario, at small or large scale.
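
To make the idea of a defined failure scenario concrete, here is a small sketch of injecting a failure into one dependency and checking that a fallback answers instead. It is not the FIT API: `call_dependency`, `InjectedFailure`, and the `failing` set are names invented for this example, and the set merely stands in for whatever scenario a real test would attach to a request.

```python
from typing import Callable, FrozenSet, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised in place of a real call when a scenario targets the dependency."""


def call_dependency(
    name: str,
    real_call: Callable[[], T],
    fallback: Callable[[], T],
    failing: FrozenSet[str] = frozenset(),
) -> T:
    """Call a dependency, simulating an outage if the scenario names it.

    Any error, injected or real, is answered by the fallback.
    """
    try:
        if name in failing:
            raise InjectedFailure(name)
        return real_call()
    except Exception:
        # Keep the fallback cheap: no network calls, no large allocations.
        return fallback()
```

Passing `frozenset({"metadata"})` as `failing` simulates an outage of the hypothetical `metadata` dependency while calls to every other dependency proceed normally.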

Throttling issues, solved by sharding the service

11m:00s - For playback, a critical feature is anything involving the play button, and video playback starting. Other services, such as customer experience improvements, are deemed noncritical.

12m:04s - Latency spiked at very regular intervals, every 40 minutes. Requests would fail, and all retries would fail. This led to upstream throttling.

13m:15s - Indiscriminate throttling affected both critical and noncritical services. Therefore, the application was sharded into two stacks, to allow different scaling and performance characteristics for critical and noncritical services.

14m:23s - The stack is now two smaller monoliths, which was a relatively simple solution to buy some time for redesigning and re-architecting. The future state will involve individual components being split out into microservices, as appropriate.
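
The effect of sharding on throttling can be illustrated with a toy model. Everything named here is an assumption for the example: the paths in `CRITICAL_PATHS`, the `Stack` class, and the concurrency limits are hypothetical and do not describe Netflix's routing. The sketch only shows that once each shard has its own limit, saturating the noncritical shard no longer throttles play requests.

```python
# Hypothetical set of request paths on the critical (play button) path.
CRITICAL_PATHS = {"/play", "/license"}


class Stack:
    """One deployment shard with its own concurrency limit."""

    def __init__(self, name: str, max_in_flight: int) -> None:
        self.name = name
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit a request if the shard has capacity; otherwise throttle it."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


def route(path: str, critical: Stack, noncritical: Stack) -> Stack:
    """Pick the shard for a request based on whether its path is critical."""
    return critical if path in CRITICAL_PATHS else noncritical
```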

Embracing failure

16m:05s - The Netflix culture embraces and accepts failure, with new team members being officially welcomed to the company only after they first break something in production.

16m:28s - This goes hand-in-hand with working to correct the root cause and not fail in the same way twice.

16m:38s - Major outages are met with additional resources and a strong focus on resolving the problem.

17m:40s - Tools such as FIT and the data canary architecture often come about as part of a Hackday project.

18m:15s - The Chaos Automation Platform (ChAP) is the next generation of FIT, and allows automated testing of failure scenarios with every code push.

Monitoring and troubleshooting

21m:06s - Tracking how requests and responses for different devices flow through the Netflix tech stack relies heavily on Elasticsearch. Previously, usage data was stored in Atlas, but the data was very coarse-grained.

21m:51s - Elasticsearch has made it much easier to drill down and find the common ground between failures.

22m:42s - Teams at Netflix are working on traceability, which aids in troubleshooting to identify bottlenecks in the system.

23m:13s - A recent fun project for Tucker’s team is a batch loader pipeline that listens to the video metadata, computes what the team needs, then caches it using Cassandra.

People Mentioned

Leslie Lamport

Languages and Platforms Mentioned

Amazon S3
Failure Injection Testing (FIT). For more on this, listen to last week’s podcast with former Netflix Chaos Engineer Kolton Andrus.
Chaos Automation Platform (ChAP)
Elasticsearch
Atlas
Cassandra
