Haley Tucker on Responding to Failures in Playback Features at Netflix

Podcast with Haley Tucker, hosted by Thomas Betts, on Dec 09, 2016

In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.

Key Takeaways

  • Distributed systems fail regularly, often due to unexpected reasons
  • Data canaries can identify invalid metadata before it can enter and corrupt the production environment
  • ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions
  • Fallbacks are an important component of system stability, but they must be fast and light so they do not cause secondary failures (see the sketch after this list)
  • Distributed systems are fundamentally social systems, and require a blameless culture to be successful
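
As a concrete illustration of the fallback takeaway, here is a minimal sketch using Netflix's open-source Hystrix library. The command name, the video ID, and the default manifest string are hypothetical; the point is that the fallback returns a precomputed local value rather than making another network call, so it stays fast and light and cannot trigger a secondary failure.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Wraps a remote call; on failure or timeout, Hystrix invokes getFallback().
    public class PlaybackManifestCommand extends HystrixCommand<String> {

        private final String videoId;

        public PlaybackManifestCommand(String videoId) {
            super(HystrixCommandGroupKey.Factory.asKey("PlaybackManifest"));
            this.videoId = videoId;
        }

        @Override
        protected String run() {
            // Stand-in for a remote call that may fail or time out.
            throw new RuntimeException("downstream unavailable for " + videoId);
        }

        @Override
        protected String getFallback() {
            // Fast and light: a precomputed local default, no further I/O,
            // so the fallback itself cannot cause a secondary failure.
            return "{\"manifest\":\"default\"}";
        }

        public static void main(String[] args) {
            // Prints the fallback, because run() failed.
            System.out.println(new PlaybackManifestCommand("abc123").execute());
        }
    }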

Show Notes

Fun with distributed systems

1m:24s - Every outage at Netflix seems to bear out Leslie Lamport’s observation that “a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”

Weird data in the catalog, solved with data canaries

2m:04s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.

2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will be using the data.

3m:29s - The access pattern used by the playback service was different than that used in the checks, and led to unexpected results in production.

3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.

4m:22s - This process has been described as “data canaries”.
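
Netflix has not published this interface, so the following is only a sketch of the idea, with invented names: before a new metadata snapshot is promoted, each consuming service replays its own production access pattern against the candidate data and can veto the rollout.

    import java.util.List;

    // Hypothetical sketch of a data canary check run by a consuming service.
    public class PlaybackDataCanary {

        // Stand-in for a candidate metadata snapshot; name and shape are invented.
        interface MetadataSnapshot {
            Object lookupPlaybackInfo(String videoId); // throws if the data is bad
        }

        // Replays this service's own access pattern against the candidate data,
        // since the producer cannot predict every consumer's usage.
        public boolean validate(MetadataSnapshot candidate, List<String> sampleVideoIds) {
            for (String videoId : sampleVideoIds) {
                try {
                    candidate.lookupPlaybackInfo(videoId); // the playback access pattern
                } catch (RuntimeException e) {
                    return false; // veto: this snapshot would break playback in prod
                }
            }
            return true; // safe to promote the snapshot to production
        }
    }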

A vanishing critical service, prevented by implementing a “kill switch”

5m:07s - A second, related incident involved a thin, lightweight logging service that ran out of memory and crashed because the entire video metadata blob was being loaded into memory.

6m:22s - Isolating the problem was tricky due to different configurations in test and prod.

6m:44s - After setting the test config to match production, the root cause was identified deep within the dependency graph, in a .jar that isn’t actually needed.

6m:59s - Pruning the dependency graph and removing the .jar completely is still a work in progress.

7m:07s - Tucker’s team worked with the video metadata team to implement a kill switch, which allows the logging service to stop consuming the metadata service.

7m:56s - There were 14 pages of .jars in the application, which is a remnant of the service being built as part of a legacy monolith.

8m:25s - The kill switch is not a circuit breaker, but a config value to prevent loading the data.
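
Netflix's open-source Archaius library provides exactly this kind of runtime-changeable config value. The property name below is invented, but the API calls are Archaius's real ones: flipping the property at runtime stops the service from loading the metadata blob, with no redeploy and no circuit-breaker semantics.

    import com.netflix.config.DynamicBooleanProperty;
    import com.netflix.config.DynamicPropertyFactory;

    public class MetadataKillSwitch {

        // A dynamic property that can be flipped at runtime without a redeploy.
        // The property name is hypothetical.
        private static final DynamicBooleanProperty LOAD_VIDEO_METADATA =
                DynamicPropertyFactory.getInstance()
                        .getBooleanProperty("loggingservice.videometadata.enabled", true);

        public void refreshMetadata() {
            if (!LOAD_VIDEO_METADATA.get()) {
                return; // kill switch thrown: skip loading the metadata blob entirely
            }
            // ... load and index the video metadata as usual ...
        }
    }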

9m:16s - When this service went down, it created a cascading failure to the proxy tier. That’s where Failure Injection Testing (FIT) comes into play.

10m:16s - FIT tests are manually run, based on a specifically defined scenario, at small or large scale.
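
FIT's implementation is not shown in the podcast, but the core idea can be sketched as follows, with entirely hypothetical names: a scenario describing which injection points should fail travels with the request, and each instrumented point consults it before doing real work.

    import java.util.Set;

    // Hypothetical sketch of the failure-injection idea behind FIT.
    public class FitSketch {

        // A scenario names the injection points that should fail for this request.
        static class FailureScenario {
            private final Set<String> failingPoints;
            FailureScenario(Set<String> failingPoints) { this.failingPoints = failingPoints; }
            boolean shouldFail(String point) { return failingPoints.contains(point); }
        }

        static class InjectedFailure extends RuntimeException {
            InjectedFailure(String point) { super("FIT-injected failure at " + point); }
        }

        // Called at each instrumented point, e.g., before a downstream call.
        static void maybeInject(String point, FailureScenario scenario) {
            if (scenario != null && scenario.shouldFail(point)) {
                throw new InjectedFailure(point);
            }
        }

        public static void main(String[] args) {
            FailureScenario scenario = new FailureScenario(Set.of("videoMetadataClient"));
            try {
                maybeInject("videoMetadataClient", scenario); // simulate the dependency failing
            } catch (InjectedFailure e) {
                System.out.println(e.getMessage() + " -> exercising the fallback path");
            }
        }
    }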

Throttling issues, solved by sharding the service

11m:00s - For playback, a critical service is anything involved in pressing the play button and getting video playback started. Other services, such as customer-experience improvements, are deemed noncritical.

12m:04s - Very regularly spaced latency spikes occurred every 40 minutes: requests would fail, all retries would fail, and this led to throttling upstream.

13m:15s - Indiscriminate throttling affected both critical and noncritical services. Therefore, the application was sharded into two stacks, to allow different scaling and performance characteristics for critical and noncritical services.

14m:23s - The stack is now two smaller monoliths, which was a relatively simple solution to buy some time for redesigning and re-architecting. The future state will involve individual components being split out into microservices, as appropriate.
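
A sketch of the routing idea behind the split, with made-up endpoint names: requests are classified as critical (anything on the path to pressing play) or noncritical, and each class is served by its own stack so throttling or scaling pressure in one cannot starve the other.

    import java.util.Set;

    // Hypothetical sketch: classify requests so each class hits its own stack.
    public class StackRouter {

        enum Stack { CRITICAL, NONCRITICAL }

        // Endpoint names are invented for illustration.
        private static final Set<String> CRITICAL_ENDPOINTS =
                Set.of("/playback/start", "/playback/manifest", "/playback/license");

        static Stack route(String endpoint) {
            return CRITICAL_ENDPOINTS.contains(endpoint)
                    ? Stack.CRITICAL      // scaled and tuned for starting playback
                    : Stack.NONCRITICAL;  // customer-experience features, etc.
        }
    }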

Embracing failure

16m:05s - The Netflix culture embraces and accepts failure, with new team members being officially welcomed to the company only after they first break something in production.

16m:28s - This goes hand-in-hand with working to correct the root cause and not fail in the same way twice.

16m:38s - Major outages are met with additional resources and a strong focus on resolving the problem.

17m:40s - Tools such as FIT and the data canary architecture often come about as part of a Hackday project.

18m:15s - The Chaos Automation Platform (ChAP) is the next generation of FIT, and allows automated testing of failure scenarios with every code push.

Monitoring and troubleshooting

21m:06s - Tracking how requests and responses for different devices flow through the Netflix tech stack relies heavily on Elasticsearch. Previously, usage data was stored in Atlas, but the data was very coarse-grained.

21m:51s - Elasticsearch has made it much easier to drill down and find the common ground between failures.
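
As an illustration of that kind of drill-down (the index and field names here are invented; the API is Elasticsearch's standard Java client of that era): a terms aggregation over failed requests quickly surfaces which device type, region, or endpoint the failures have in common.

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.aggregations.AggregationBuilders;

    public class FailureDrilldown {

        // Groups failed playback requests by device type to find common ground.
        // The index and field names are hypothetical.
        public SearchResponse failuresByDevice(Client client) {
            return client.prepareSearch("playback-requests")
                    .setQuery(QueryBuilders.termQuery("status", "failure"))
                    .addAggregation(
                            AggregationBuilders.terms("by_device").field("deviceType"))
                    .setSize(0)   // only the aggregation buckets matter here
                    .get();
        }
    }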

22m:42s - Teams at Netflix are working on traceability, which aids in troubleshooting to identify bottlenecks in the system.

23m:13s - Tucker’s team’s recent fun project is a batch loader pipeline to listen to the video metadata, compute what they need, then cache it using Cassandra.
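
A minimal sketch of the write side of such a pipeline using the DataStax Java driver; the keyspace, table, and the assumption that the computed value is a string payload are all invented for illustration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ManifestCacheWriter {

        public static void main(String[] args) {
            // Connect to a hypothetical "playback" keyspace.
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .build()) {
                Session session = cluster.connect("playback");

                // For each metadata update event, compute what the service needs
                // and cache the result (table and columns are hypothetical).
                session.execute(
                        "INSERT INTO precomputed_manifest (video_id, payload) VALUES (?, ?)",
                        "hypothetical-video-id", "precomputed-payload");
            }
        }
    }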

More about our podcasts

You can keep up to date with the podcasts via our RSS feed, and they are available via SoundCloud and iTunes. From this page you also have access to our recorded show notes, which all have clickable links that will take you directly to that part of the audio.
