The Netflix API Optimization Story
Over the past month, Netflix has been sharing stories from the API optimization effort that started a little over a year ago. The initial focus of the performance optimizations was to reduce chattiness and the payload size but over the course of the year, as a result of the redesign, API development and operations was distributed and the services layer became asynchronous. The primary redesign feature was redefining the boundary of client and server responsibilities to enable fine grained customizations such as content formatting to address the differences(memory capacity, document delivery and models, user interactions etc) among the various client devices that connect to Netflix.
InfoQ caught up with Ben Christensen, Senior Software Engineer at Netflix to clarify and gather further design details.
InfoQ: Can you share a high level architecture of the Netflix API service from a year ago when the optimization effort was commenced and point out the problem areas?
Before the redesign the Netflix API was a fairly typical RESTful server application. The image below demonstrates how the RESTful endpoints were implemented statically as "Restlet Resources" in a single codebase.
The major problem areas were:
1) Fault tolerance as described in Ben Schmaus's, Manager for Netflix API platform, blog post:
Here are some of the key principles that informed our thinking as we set out to make the API more resilient.2) Poor match with device requirements that impacted performance and rate of innovation as described in Daniel Jacobson's, Director of Engineering for Netflix API platform, blog post:
- A failure in a service dependency should not break the user experience for member.
- The API should automatically take corrective action when one of its service dependencies fail.
- The API should be able to show us what’s happening right now, in addition to what was happening 15-30 minutes ago, yesterday, last week, etc.
Netflix's streaming service is available on more than 800 different device types, almost all of which receive their content from our private APIs. In our experience, we have realized that supporting these myriad device types with an One-size-fits-all API, while successful, is not optimal for the API team, the UI teams or Netflix streaming customers.In short:
- the RESTful API was a lowest common denominator solution
- it slowed innovation on each device since each endpoint needed to be used by all clients and thus feature development was synchronized across all of them
- it also slowed innovation because the API team became a bottleneck for client requests, resulting in prioritization and deferment of requests
- it resulted in chatty network access patterns and inefficient payloads
- complicated feature flags for turning functionality and data types on and off spread through all RESTful endpoints to cater to the differences of each client
InfoQ: What is the final state for the redesign?
The re-design is depicted below:
 Dynamic Endpoints
All new web service endpoints are now dynamically defined at runtime. New endpoints can be developed, tested, canaried and deployed by each client team without coordination (unless they depend on new functionality from the underlying API Service Layer shown at item 5 in which case they would need to wait until after those changes are deployed before pushing their endpoint).
 Endpoint Code Repository and Management
Endpoint code is published to a Cassandra multi-region cluster (globally replicated) via a RESTful Endpoint Management API used by client teams to manage their endpoints.
 Dynamic Polyglot JVM Language Runtime
Any JVM language can be supported so each team can use the language best suited to them.
The Groovy JVM language was chosen as our first supported language. The existence of first-class functions (closures), list/dictionary syntax, performance and debuggability were all aspects of our decision. Moreover, Groovy provides syntax comfortable to a wide range of developers, which helps to reduce the learning curve for the first language on the platform.
[4 & 5] Asynchronous Java API + Functional Reactive Programming Model
Embracing concurrency was a key requirement to achieve performance gains but abstracting away thread-safety and parallel execution implementation details from the client developers was equally important in reducing complexity and speeding up their rate of innovation. Making the Java API fully asynchronous was the first step as it allows the underlying method implementations to control whether something is executed concurrently or not without the client code changing. We chose a functional reactive approach to handling composition and conditional flows of asynchronous callbacks. Our implementation is modeled after Rx Observables.
Functional reactive offers efficient execution and composition by providing a collection of operators capable of filtering, selecting, transforming, combining and composing Observable's. The Observable data type can be thought of as a "push" equivalent to Iterable which is "pull". With an Iterable, the consumer pulls values from the producer and the thread blocks until those values arrive. By contrast with the Observable type, the producer pushes values to the consumer whenever values are available. This approach is more flexible, because values can arrive synchronously or asynchronously.
The Observable type adds two missing semantics to the Gang of Four's Observer pattern, which are available in the Iterable type:
1. The ability for the producer to signal to the consumer that there is no more data available.
2. The ability for the producer to signal to the consumer that an error has occurred.
With these two simple additions, we have unified the Iterable and Observable types. The only difference between them is the direction in which the data flows. This is very important because now any operation we perform on an Iterable, can also be performed on an Observable.
 Hystrix Fault Tolerance
All service calls to backend systems are made via the Hystrix fault tolerance layer (which was recently open sourced, along with its dashboard) that isolates the dynamic endpoints and the API Service Layer from the inevitable failures that occur while executing billions of network calls each day from the API to backend systems. The Hystrix layer is inherently mutlti-threaded due to its use of threads for isolating dependencies and thus is leveraged for concurrent execution of blocking calls to backend systems. These asynchronous requests are then composed together via the functional reactive framework.
We chose to implement a solution that uses a combination of fault tolerance approaches:
Each of these approaches to fault-tolerance has pros and cons but when combined together they provide a comprehensive protective barrier between user requests and underlying dependencies.
- network timeouts and retries
- separate threads on per-dependency thread pools
- semaphores (via a tryAcquire, not a blocking call)
- circuit breakers
 Backend Services and Dependencies
The API Service Layer abstracts away all backend services and dependencies behind facades. As a result, endpoint code accesses “functionality” rather than a “system”. This allows us to change underlying implementations and architecture with no or limited impact on the code that depends on the API. For example, if a backend system is split into 2 different services, or 3 are combined into one, or a remote network call is optimized into an in-memory cache, none of these changes should affect endpoint code and thus the API Service Layer ensures that object models and other such tight-couplings are abstracted and not allowed to “leak” into the endpoint code.
InfoQ: Can you share any details around the choice of design patterns for the Client Adapter layer?
The client adapter layer was implemented as an HTTP request/response loop.
A client team implements an endpoint and receives the following:
- APIServiceGateway instance for accessing functionality
- APIUser representing the currently authenticated user
- APIRequestContext with environmental variables per request such as device type and geo location
Since the endpoint is given the HTTP request/response it can implement whatever style behavior it wishes:
- Document Model + Path Language
- Progressive or document response
It also frees the implementation to use request/response headers, URI templating or request arguments and return data in whatever format suits them best such as JSON, XML, PLIST or even a binary format.
The image below shows the front controller flow of an incoming request to lookup the correct endpoint and execute it.
InfoQ: Why the choice of Cassandra as a code repository for endpoint code over other versioning tools?
First, because we needed global replication (across different AWS regions) and Cassandra offers that. Second, because it is an infrastructure component that Netflix has experience with and supports.
It does not replace standard version control systems (Perforce, Git, etc) for actual development and versioning of code in that sense. Cassandra is the repository for deployed code used at runtime and we have several different environments such as dev, test, integration and prod each with different Cassandra clusters hosting the code deployed to that environment. The revisions of an endpoint stored within Cassandra allow for multi-variate testing, canary testing, gradual rollouts and rapid rollbacks.
InfoQ: Does the dynamic language runtime host the client adapter endpoint code and split the request into granular Java API requests which are now asynchronous calls?
Yes, the dynamic language runtime hosts the client adaptor code. The client adaptor (endpoint) is given an instance of an APIServiceGateway that provides access to asynchronous service calls.
The Netflix API takes advantage of Reactive Extensions (Rx) by making the entire service layer asynchronous (or at least appear so) - all "service" methods return an Observable<T>. Making all return types Observable combined with a functional programming model frees up the service layer implementation to safely use concurrency. It also enables the service layer implementation to:
• conditionally return immediately from a cache
• block instead of using threads if resources are constrained
• use multiple threads
• use non-blocking IO
• migrate an underlying implementation from network based to in-memory cache
This can all happen without ever changing how client code interacts with or composes responses.In short, client code treats all interactions with the API as asynchronous but the implementation chooses if something is blocking or non-blocking.
InfoQ: What is a circuit breaker? How is it used to address fault tolerance?
Ben Schmaus, Manager of Netflix API platform, summarized the circuit breaker pattern in a blogpost:
We’ve restructured the API to enable graceful fallback mechanisms to kick in when a service dependency fails. We decorate calls to service dependencies with code that tracks the result of each call. When we detect that a service is failing too often we stop calling it and serve fallback responses while giving the failing service time to recover. We then periodically let some calls to the service go through and if they succeed then we open traffic for all calls.
If this pattern sounds familiar to you, you're probably thinking of the CircuitBreaker pattern from Michael Nygard’s book "Release It! Design and Deploy Production-Ready Software", which influenced the implementation of our service dependency decorator code. Our implementation goes a little further than the basic CircuitBreaker pattern in that fallbacks can be triggered in a few ways:
1. A request to the remote service times out
2. The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
3. The client library used to interact with a service dependency throws an exception
These buckets of failures factor into a service's overall error rate and when the error rate exceeds a defined threshold then we "trip" the circuit for that service and immediately serve fallbacks without even attempting to communicate with the remote service.
Each service that’s wrapped by a circuit breaker implements a fallback using one of the following three approaches:
1. Custom fallback - in some cases a service’s client library provides a fallback method we can invoke, or in other cases we can use locally available data on an API server (eg, a cookie or local JVM cache) to generate a fallback response
2. Fail silent - in this case the fallback method simply returns a null value, which is useful if the data provided by the service being invoked is optional for the response that will be sent back to the requesting client
3. Fail fast - used in cases where the data is required or there’s no good fallback and results in a client getting a 5xx response. This can negatively affect the device UX, which is not ideal, but it keeps API servers healthy and allows the system to recover quickly when the failing service becomes available again.
References:  http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
Re: Broken article
For a far more informative article on service performance optimization: cacm.acm.org/magazines/2013/2/160173-the-tail-a...
And by the way all of this that is being poorly reinvented was done in many CORBA implementations a long long time ago. More importantly it was extensible, both client and server endpoints, and not hardwired to one particular implementation.
Re: Broken article
Has this been open sourced on GitHub?
IO state is bound to Business logic
The only way to resolve this is to abstract IO functionality from business logic to a communication layer and then to move IO data to a shared object in a cache.
See the SpringOne presentation on Api Abstraction and API Chaining and (more recently) IO State in API Architecture.
Mike Amundsen May 29, 2015
Ben Linders May 28, 2015