Daniel Jacobson on Ephemeral APIs and Continuous Innovation at Netflix
Following his talk at the recent “I Love APIs” conference, InfoQ had the opportunity to interview Daniel Jacobson about ephemeral APIs, their relationship to experience-based APIs and when to consider them in your organization.
Daniel leads development of critical systems that are the front door of Netflix, servicing 1,000+ different device types and billions of requests per day. He also manages the Netflix playback experience which accounts for approximately one-third of Internet downstream traffic in North America during peak hours.
InfoQ: What is your current role at Netflix and your day-to-day responsibilities?
Daniel Jacobson: I run the edge engineering team which is responsible for handling all traffic for all devices around the world for signup, discovery and playback. On the playback side we are responsible for the functionality that supports the playback experience. The API side is responsible for handling the traffic directly from devices, fetching data from a broad set of mid-tier data services and then we broker the data back. Both teams are critical to success of Netflix because nobody can stream if playback is not available and nobody can stream if the API is not available.
InfoQ: Can you explain what Ephemeral APIs are all about and how different they are from the Experience APIs that you have proposed before?
DJ: Experience APIs are trying to handle an optimized response for a given requesting agent. That's orthogonal to the ephemeral APIs. The experience API is more about the requesting pattern and the payload. Ephemeral API is more about the process of iterating and evolving the experience APIs.
Traditionally, APIs get set up to make it easier for the API provider to support, which results in one-size-fits-all APIs. The problem with that approach is that the API ends up being harder to use for a wide array of consumers. In other words, the optimization in that model is to make things easier for the few but harder for the many. For experience APIs, the goal is to focus on the needs of the individual requesters and optimize the APIs for each of them. It means that you are essentially running a wide array of different APIs. This results in a more challenging environment for the API provider to support because the variability is higher, but it allows the API consumers to develop what is best for them and for the performance of their clients. Ultimately, this should translate into a better customer experience.
Ephemerality is part of our story in how we develop our APIs, but not essential for the experience API model. Ephemeral APIs mean that the endpoints and payloads should be able to be terminated and created with ease and flexibility with the expectation that this can happen at any moment and potentially very frequently. If we can support ephemerality, then we can innovate faster and continuously to support the product needs without being a bottleneck.
To give an example, if we are running an A/B test to evaluate a new feature in our SmartTV experience, the UI team working on that feature can iterate on the client code and the APIs without the API team’s involvement. As they develop the test, they may realize that the data needs change or can be optimized, which would result in them killing the endpoints and create new ones. This can happen dozens of times over the course of the project and without the API team getting involved (as long as all of the data elements already exist in the pipeline).
InfoQ: What is the best way to find the right granularity for experience-based APIs? Is it mostly based on the device capabilities or on team organization?
DJ: I've written a detailed blog post on this topic in the past, which includes the recipe for when experience-based APIs might be a good choice. Basically, it is likely many companies don't need to go this route because it's a scale question.
So, if you have a wide array of different interaction models that are diverging and a close relationship with those who are consuming the APIs, those are good indicators that you might want to optimize for this. The proximity to the consumer of the API is key because you have a tighter feedback loop and more understanding of what their individual needs are.
The difference with generic resource-based APIs is that you don't know who is going to consume the APIs and how they will be consumed. If the consumers are in your organization, and if you understand those nuances, you can create an architecture that is optimized for them all.
Within Netflix, we have created the architecture as a set of Java APIs and all these different device teams can build their own experience-based web APIs that are optimized for their clients. We like to call our system a platform for API development, more than an traditional API.
InfoQ: Do you have a separate API for Netflix mobile app on Android and on iOS?
DJ: In the construct of the platform, we have base Java APIs that are method calls within a JVM. Then, we have an adapter layer that sits on top of that where web APIs can be developed in a device-specific way. So, we have mobile teams developing their corresponding adapters, those are different endpoints, request patterns, payloads and maybe different protocols.
There used to be more overlap between iOS and Android, but now these experiences are indeed different. There are shared functions across all of this so we built a set of tools to allow for the shareability.
InfoQ: Do you rely on an API language to describe Netflix APIs?
DJ: Not at this point. This is something we discuss periodically, but have not pursued yet because of the challenges and costs in maintaining them. Most of the time, if you have language descriptors it means that you are trying to fix things in place, make them consistent for the API consumers. Because our web APIs are ephemeral, the descriptor would also need to be ephemeral, so using one would cost more and not be as helpful.
But another thing is you have many teams building these web APIs with different needs and those teams are iterating on their consumption of the web APIs. This iteration is happening continuously because we are always running A/B tests that require changes to the data being delivered. As the teams iterate, the same person or group is writing and consuming the web API and they are doing the development of both at the same time, which means they already know the nature of the interface, so there is no value.
Most of the discussion for description languages have been at the Java API level, but again, those APIs are changing frequently as well. If we can find a way to describe those APIs consistently at very low cost, we would like to add that to the system, but so far it seems as though the costs of maintenance exceed the benefit.
InfoQ: Do you rely on API tooling to accelerate the development of APIs by device teams?
DJ: We develop a suite of tools to allow people to manage, deploy, and view the health of their API scripts, and to determine which endpoints are active and not. We also have tools to support shareability of code around these scripts and we have tools to inspect the payloads. Also, there are tools that we still need to develop. For example, the difficulty in this world is debuggability and we need to improve in this area.
DJ: The architecture and API for the web site team is different than most devices because they have a separate tier fronting their API calls. For typical devices, they call directly into the web API but for the web site, they call into their own cluster where they handle the traffic directly and then call into our API cluster to get the data. What's happening in their cluster and above it is currently outside our view but they are still writing scripts in our adapter layer.
What's interesting is that we are investigating now if we should apply similar constructs across the breadth of devices or some subsets, and evaluating the cost of doing this more broadly. Some things that we might gain in this approach would be process isolation and an easier path towards debuggability.
InfoQ: What is the place of Groovy and other scripting languages in the Netflix API platform?
DJ: Groovy is the only language in our API environment that people are writing adapter scripts with, but we are looking at other languages. The next one is likely going to be Node.js. Going to another JVM language would be easier, but there hasn’t been enough interest so far. If device teams want to use Scala or other languages, we would need to do more investigation and work to make it happen.
Node.js is not going to run integrated in the JVM so it’s an additional benefit of isolating that into another layer like we’ve done for the main web site.
InfoQ: How were the device teams able to adapt to such changes in their development flows?
DJ: The cultural change to the company was a lot harder than the technology changes. Even with teams willing to go to this route, there were some challenges in getting people to think and operate differently in the new environment. For example, it took some time for them to adapt to writing Groovy and to the functional programming paradigm. But looking back it is definitely a net win.
InfoQ: In your talk, you mentioned an ongoing project to introduce containers at the API adapter layer. Will that effort have impact on the Nicobar open source project?
DJ: As we are investigating containers for the web site layer, we are thinking about how it could be applied to other devices as well. For the container-based model, Nicobar would not be a central player for us. In fact, when we designed Nicobar and the scriptability, it was in part to deploy the scripts in an isolated way. Containers take our original intent to the next level and obviates away the need for Nicobar. That said, our system will continue to support the scripting and Nicobar for years to come, so we expect to continue to develop and evolve Nicobar for a while. As Nicobar evolves, it is likely that such changes will be made in the open source project as well.
DJ: It helps us represent remote data sources as a single domain model through a virtual JSON graph. You code the same way no matter where the data is, whether in memory on the client or over the network on the server. Falcor also handles network communications between devices and servers, and can batch and deduplicate requests to make them more efficient.
Because Falcor is a more efficient data fetching mechanism between devices and servers, it’s going to continue to play a significant role in our platform even as our system evolves into a different architecture.
The main benefits we get out of Falcor are developer efficiency and improved application performance. We get the developer efficiency because the access patterns for the engineers writing the adapters is more consistent. That said, there is a steeper learning curve to use Falcor and it is a more challenging environment to debug.
InfoQ: What are the limitations that you found with AWS Auto Scaling Groups and how does Netflix Scryer help? Will it become open source?
DJ: AWS autoscaling is used widely at Netflix. It's very useful and powerful. Amazon is responding to metrics like load average, determining that it’s time to add new servers when those metrics pass a certain threshold. Meanwhile, it can take 10 to 20 minutes to bring a new set of servers online. A lot of bad things can happen in a manner of minutes, so that adds risk to our availability profile.
That’s what prompted us to develop Scryer. What Scryer does is it looks at the historical data and incorporates a feedback loop of real-time data, evaluates what the needs will be in the near future for capacity, and then it adds servers in advance of that need. What we see is that response times and latencies are much more leveled with Scryer because load averages are not spiking and because the cluster can handle the traffic more effectively.
While we announced it via a blog post a couple of years ago, there is no plan right now to open source it.
InfoQ: Netflix Engineering is well known for its Chaos Monkey service. Can you tell more about other services that are part of your Simian Army?
DJ:There is a suite of monkeys that do different things. Here are some of these services:
- Latency Monkey has various degrees of utility and was designed to inject errors and latencies into a service to see how the failure would cascade. That has since evolved into FIT (Failure Injection Testing).
- Chaos Gorilla is similar to Chaos Monkey but instead of killing individual instances, it is killing AWS availability zones. The idea here is to test high availability across zones by redirecting traffic from a failed zone to a healthy one.
- Conformity Monkey and Security Monkey make sure that builds conform to certain operational and security guidelines and shuts down those that are not confirming.
- Janitor Monkey which will cleanup unhealthy or dead instances.
- Chaos Kong is a recent addition to the army, which simulates and outage in an entire AWS region and pushes traffic to a different region.
InfoQ: Over the years, Netflix has launched many open source projects. What is the best way to know what is available and actively maintained, to take advantage of these contributions?
DJ: As our OSS strategy has evolved, we've released around 60 projects in total across a diverse set of categories including UI, cloud and tools. Some of them are more actively managed than others and we try to partition them in our developer website. Supporting the APIs directly, there are a range of tools including Zuul, Nicobar, Histrix and RxJava.
InfoQ: Should a company new to APIs start with a one-size-fits-all API approach and progressively evolve like Netflix did, or start immediately with finer-grained ephemeral experience APIs?
DJ: If you are brand new to APIs, start with OSFA (one size fits all). There is a question of whether you will ever get to the scale needs that Netflix has. Experience APIs are more of a challenge. I believe that ephemerality should be part of the mindset of each company, regardless.
Going the experience based API route is a function of opportunity and cost. You are adding more overall cost, but the efficiency and the optimization gains might be worth it. If you only have a few devices or very small development team or if you have a wide range of external parties that consume APIs, the cost of operating this more variable environment would likely not be recovered.
You really need to have a tipping point where the development efficiency of the API consumers is hindered by the fact that they are fighting against the rigid API. In other words, if you have different device teams, that have to make inefficient API calls that are different from each other and they have to compensate by doing additional parsing, error handling, etc. then the cost of all of that added energy can potentially be obfuscated by creating an optimized interaction model. This benefit is only worth it if you have enough developers doing these inefficient activities.
InfoQ: In addition to developer efficiency, are there other benefits that you might be looking for with Experience APIs?
DJ: With an optimized set of APIs, you are building a solution to provide a better experience for the customer, such as improved system performance and improved velocity in getting changes into the product.
If you want to have this kind of ephemerality and optimization, you can’t set it up for public APIs. The experience APIs are excellent tactics but are geared towards private APIs because having a close relationship with a small set of developers allows you to have much more latitude in solving the needs of the API consumers.
InfoQ: What excites you the most right now about the API space?
DJ: We are most excited about things like containers, streaming data, HTTP 2.0, websocket and persistence connections, tooling and analytics behind supporting a massive scale API. So we are investigating in those kind of things and experimenting.
Other things are emerging in this space like microservices, continuous integration, continuous deployment, and we are already doing them. At Netflix, we have a distributed architecture with specific functions for each microservice. But successful microservices inevitably grow in scope, potentially causing it to become more of a monolith over time. At that point, it makes sense to start breaking things down again.
InfoQ: Finally, how does continuous deployment relate to ephemeral APIs?
DJ:I often describe my team as being the skinny part of the hourglass that's pushing data back and forth between the two fat parts. In one of the fat parts is all of the API consumers, the UI and devices teams. On the other fat part we have all the distributed server-side microservices. Both of the fat parts are constantly changing (A/B testing, new features, new devices, etc.).
As those change, we need to ensure that data is flowing through the skinny part to support the product and any test that is being performed on the product. We need to change at a faster rate than the rest of the company because we need to handle the changes that many other teams make.
Several years ago we decided the only way to do this was to develop a fully automated deployment pipeline. From a continuous deployment perspective, it was important for us to be able to deploy rapidly, frequently, at low risk and with the high ability to quickly rollback. The goal behind all of that is that we should not be the bottleneck to getting product change to the customer.
Like other things my team does, continuous deployment is a means to an end. And the end is continuous innovation. Having an environment that can rapidly and constantly change to the need of the business and the customer ties back to our ephemerality mindset.