Netflix Zuul Gets a Makeover to an Asynchronous and Non-Blocking Architecture
Netflix announced the re-architecture of Zuul, its general-purpose API gateway. Zuul was originally a blocking, synchronous solution; the new effort, called Zuul 2, is non-blocking and asynchronous. The major architectural difference between the two is that Zuul 2 runs on Netty, an asynchronous and non-blocking framework. Where Zuul 1 relies on multiple threads to increase throughput, Zuul 2 relies on Netty's event loop and callbacks to do the same.
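The event-loop-and-callback model can be illustrated with a minimal sketch that uses only the Java standard library. It is not Zuul or Netty code: the class `EventLoopSketch`, the methods `callOrigin` and `serve`, and the timer that stands in for network I/O are all assumptions made for illustration. A single "event-loop" thread registers a callback for each request instead of parking one thread per request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class EventLoopSketch {

    // Hypothetical non-blocking origin call: returns immediately and completes
    // the future later. The timer thread stands in for the network.
    static CompletableFuture<String> callOrigin(ScheduledExecutorService timer, int id) {
        CompletableFuture<String> response = new CompletableFuture<>();
        timer.schedule(() -> { response.complete("response-" + id); }, 20, TimeUnit.MILLISECONDS);
        return response;
    }

    // Handles all requests on one event-loop thread and returns the name of the
    // thread that ran each response callback.
    static List<String> serve(int requests) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        List<String> handlerThreads = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(requests);
        try {
            for (int i = 0; i < requests; i++) {
                final int id = i;
                // The event loop starts the call and registers a callback; it never
                // waits for the response, so it is free to accept the next request.
                eventLoop.execute(() ->
                    callOrigin(timer, id).thenAcceptAsync(response -> {
                        handlerThreads.add(Thread.currentThread().getName());
                        done.countDown();
                    }, eventLoop));
            }
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for responses");
            }
            return new ArrayList<>(handlerThreads);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdown();
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> threads = serve(100);
        System.out.println(threads.size() + " requests handled by threads " + new HashSet<>(threads));
    }
}
```

In a thread-per-request design, the same 100 in-flight requests would each hold a thread while waiting on the origin; here the waiting costs only a registered callback.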
InfoQ caught up with Mikey Cohen, engineering manager at Netflix responsible for the effort.
InfoQ: You describe the Zuul 2 effort as a journey. It seems like a long journey. Can you describe the motivation, give an overview of the journey, and explain why it took so long?
Mikey Cohen: Our primary motivation for Zuul 2 was to build a system that could scale for persistent connections to devices. With more than 83 million members, each with multiple connected devices, you can see how this can add up to a massive scale challenge. With persistent connections, we can enable push and bi-directional communication between devices and cloud-based control systems. With it, we can replace polling requests from devices with push notifications. Polling requests account for a significant portion of requests from devices to our microservices. This will both reduce costs and provide a better user experience. In addition, there are a lot of interesting features that can be built with these capabilities, from richer user experiences to much better debugging capabilities. Finally, we can support more protocols, such as WebSockets and HTTP/2, at large scale.
Indeed, this was a journey. If we were to have built Zuul 2 in a vacuum, it wouldn't have been a notable excursion. The complexity of a change like this within Netflix is enormous. Our cloud gateway, which runs on Zuul, is the front door to Netflix's cloud services. We support over 1000 types of devices. The nuances, assumptions, and constraints that many of these devices have vary a lot. All of these eccentricities need to be supported and normalized in Zuul. For example, some devices care about the ordering of headers, some have limits on the sizes of headers, some do not support chunked encoding, and some idiosyncrasies are handled differently in Tomcat vs Netty. Identifying these differences and making them compatible was a large challenge of this project. A lot of these issues can only be found using real traffic. Detecting these problems without impacting customers in our production environment involves walking the fine line between taking enormous risk and applying the engineering skill to mitigate the impact of that risk. A failure here means that some or all of our 83 million members won’t be able to stream Netflix content.
Going deeper into our stack, all of the features of our gateway, as well as the platform infrastructure that we have, need to run in an asynchronous environment. All of Netflix's platform infrastructure as well as Zuul Filters were built with a deep assumption of running in a blocking environment. Thread variables that are assumed to be scoped to a request are ubiquitously used in our supporting libraries and platform code. Blocking I/O is second nature within the platform as well. These concepts do not work in an async and non-blocking architecture. Identifying these areas, as well as building compatible solutions that would work with this architecture, was time-consuming and difficult at times.
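The thread-variable problem can be shown with a minimal sketch that uses only the Java standard library. It is not Netflix platform code: `REQUEST_ID`, `RequestContext`, and the method names are hypothetical. A value stored in a `ThreadLocal` is visible while one thread handles the whole request, but a callback that runs on a different thread does not see it; passing the context explicitly is one common workaround.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RequestContextSketch {

    // Hypothetical request-scoped thread variable, as used in blocking code.
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    // Hypothetical explicit context object that travels with the request.
    static final class RequestContext {
        final String requestId;
        RequestContext(String requestId) { this.requestId = requestId; }
    }

    // Blocking style: one thread handles the request from start to finish,
    // so the thread variable is still set when it is read.
    static String blockingStyle(String requestId) {
        REQUEST_ID.set(requestId);
        try {
            return REQUEST_ID.get();
        } finally {
            REQUEST_ID.remove();
        }
    }

    // Async style: the continuation runs on another thread, where the
    // thread variable was never set, so it reads null.
    static String asyncWithThreadLocal(String requestId) {
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        REQUEST_ID.set(requestId);
        try {
            return CompletableFuture.supplyAsync(() -> REQUEST_ID.get(), callbackThread).join();
        } finally {
            REQUEST_ID.remove();
            callbackThread.shutdown();
        }
    }

    // Async style with the context passed explicitly into the callback.
    static String asyncWithExplicitContext(String requestId) {
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        RequestContext context = new RequestContext(requestId);
        try {
            return CompletableFuture.supplyAsync(() -> context.requestId, callbackThread).join();
        } finally {
            callbackThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking:         " + blockingStyle("req-1"));
        System.out.println("async ThreadLocal: " + asyncWithThreadLocal("req-1"));
        System.out.println("async explicit:    " + asyncWithExplicitContext("req-1"));
    }
}
```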
Part of this journey was also understanding what we see as the proper role of the gateway at Netflix. Briefly, this role is to principally handle routing, insights, and normalization of requests coming into Netflix. A detour along the way to Zuul 2 was removing a lot of the business logic that we had in Zuul and moving this logic to origin systems. Besides giving us clarity on the role of the gateway, it also simplified the migration of blocking code because we removed a lot of it.
Making our Zuul Filters and filter parsing asynchronous was another large project along the way. As mentioned in the blog post, we did this before building Zuul 2 so that we could have the same set of Zuul Filters to run in Zuul 1 and Zuul 2 (async code can run in a sync environment, but the other way around does not work). This allowed us to continue to develop gateway features in a single set of filters. There were a few iterations of Zuul Filter interface changes as well as a few iterations on chaining filters together. We ended up using RxJava to chain Zuul Filters together. Migrating the more than 100 Zuul Filters we have proved to be a detail-oriented, time-consuming task.
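The point that async filters can run in a sync environment can be sketched as follows. This is an illustration only: it substitutes the standard library's `CompletableFuture` for RxJava, and the `AsyncFilter` interface and the two sample filters are hypothetical, not the real Zuul Filter API. The same chain is consumed asynchronously by `runChain` and synchronously by `runChainBlocking`, which simply waits for the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class AsyncFilterChainSketch {

    // Hypothetical async filter: takes a request and returns a future of the
    // (possibly modified) request. Not the actual Zuul Filter interface.
    interface AsyncFilter {
        CompletableFuture<String> apply(String request);
    }

    // Chains filters so each one starts when the previous one completes.
    // No thread is blocked while a filter is waiting on I/O.
    static CompletableFuture<String> runChain(List<AsyncFilter> filters, String request) {
        CompletableFuture<String> result = CompletableFuture.completedFuture(request);
        for (AsyncFilter filter : filters) {
            result = result.thenCompose(filter::apply);
        }
        return result;
    }

    // A blocking environment can reuse the same filters by waiting on the result.
    static String runChainBlocking(List<AsyncFilter> filters, String request) {
        return runChain(filters, request).join();
    }

    // Two sample filters: one completes immediately, one completes on another thread.
    static List<AsyncFilter> sampleFilters() {
        AsyncFilter normalize = request -> CompletableFuture.completedFuture(request.trim());
        AsyncFilter route = request -> CompletableFuture.supplyAsync(() -> request + " -> origin-a");
        return Arrays.asList(normalize, route);
    }

    public static void main(String[] args) {
        // Async consumption: register a callback and move on.
        CompletableFuture<Void> printed =
            runChain(sampleFilters(), "  GET /titles ").thenAccept(System.out::println);
        printed.join(); // only so the demo does not exit before the callback runs

        // Sync consumption of the very same filters.
        System.out.println(runChainBlocking(sampleFilters(), "  GET /titles "));
    }
}
```

The reverse does not hold: a filter that blocks its calling thread would stall an event loop, which is why the filters had to become asynchronous first.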
And finally, building the framework to run Zuul 2 went through several iterations. We initially moved from a prototype written in an early version of Netty. From there we moved to using the RxNetty project. At some point in building Zuul 2, we realized that we needed the core Netty functionality that RxNetty was abstracting, so we switched out RxNetty in favor of Netty. By going down these different paths, and correcting and learning as we went, we have built a much stronger product.
InfoQ: As described in your blog, a multi-threaded system works for the most part except when things go wrong, such as backend latencies, retries, etc. Are there other reasons that motivated you to move to the new architecture?
Cohen: We did have a belief that we would see large CPU efficiency improvements as well as resiliency improvements. We certainly see some improvement in efficiency (10-25%) in some of our clusters (not all of them), but were underwhelmed in this area. Our miscalculation seems to be due to the amount of CPU work that we do within our gateway and figuring out the point at which CPU work trumps the architectural efficiencies of async NIO. I suspect that others using the open sourced version of Zuul 2 (when it’s released) will see large efficiency gains, mostly because most of the work we do within our Zuul 2 gateway at Netflix is specifically related to the Netflix stack. A lot of people will see numbers like a 10-25% performance improvement and think it to be a huge win. When factoring in the time and energy to realize the win, plus the other challenges in operations and debugging that async NIO systems introduce, the performance considerations are not enough to justify the work. That said, the other benefits, such as connection management and push notifications, as well as possible resiliency gains, are much more important to us.
We also expected to see some resiliency improvement. As I mentioned in the blog, I do think we will see this materialize, but it is not for free. We are actively looking into things like reducing instantiation and logging of exceptions when errors happen, changing throttling mechanisms, and tweaking of connection reuse and load balancing algorithms in order to realize more resiliency improvements.
InfoQ: Zuul Filters are a critical part of Zuul. You started the re-architecture efforts with filters, correct? Can you describe the re-architecture of filters and provide some recommendation to developers using a similar approach in the re-architecture of a large system?
Cohen: The recognition that changing the business logic portion of the gateway to run asynchronously would be the right course was quite insightful, and in the long run saved us a lot of time. This saved us from having two sets of filters (one for Zuul 1 and the other for Zuul 2) that needed to be kept in sync to maintain parity. This was quite a large upfront investment of about six months or more that was just a prerequisite for building Zuul 2. I could imagine that a lot of engineering teams would see this as putting the cart before the horse, but this proved to be tremendously helpful for many reasons. First, as I mentioned, filter development could now happen in one codebase, guaranteeing business logic parity. Second, because we knew that the business logic was solid, we could eliminate that as a cause for issues. Finally, this gave us a tremendous tool to see how Zuul 1 and Zuul 2 compare from a systems and operational point of view.
InfoQ: Do you have any war stories that would help developers and architects in moving from a synchronous and blocking solution to an asynchronous and non-blocking solution? What is RxNetty and how critical was this package in your efforts?
Cohen: There were many battles in the move to async systems. So many of them blend into each other, but certainly there is a general theme and pattern. It starts off with noting that there is some sort of leak: ByteBuf, semaphore, file descriptor, etc. These are often infrequent edge cases that happen at scale. Debugging ensues. This could take up to a week to find the cause, usually on the order of days. Solutions can be varied but often involve not propagating events when errors occur. With all of the experience we've gained, we are trying to contribute back some of the tools we've built. For example, we contributed a pluggable resource leak detector to Netty. This allows us to monitor for resource leaks with our monitoring tools. We are also looking into other instrumentations that we hope to contribute, like checkpointing, that will help resolve issues that follow these patterns more quickly.
These battles can’t all be fought in isolation. The test frameworks we built to test this only go so far because so many issues only show up in the real production environment, at high scale and concurrency with actual members and real origin systems. The risk of change is high; if the gateway is broken, nobody can stream. Moving quickly means learning more, but has a higher risk of affecting customers, and being more conservative means incremental learnings, but less risk to members. So there is a delicate balance and struggle in rolling out Zuul 2 between wanting to take risks and move quickly and taking a more conservative approach.
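One way the leak pattern described here can arise is sketched below with a semaphore, using only the Java standard library. This is an assumed, simplified scenario, not the actual Zuul bugs or their fixes: `originCall`, `leakyPermits`, and `safePermits` are hypothetical names. A permit is acquired per request and released in a callback; if the release is registered only for the success path, every failed call leaks one permit.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

class PermitLeakSketch {

    // Hypothetical origin call that is already completed, either normally
    // or with an error, so the callbacks below run immediately.
    static CompletableFuture<String> originCall(boolean fail) {
        CompletableFuture<String> response = new CompletableFuture<>();
        if (fail) {
            response.completeExceptionally(new RuntimeException("origin timeout"));
        } else {
            response.complete("ok");
        }
        return response;
    }

    // Leaky: the permit is released only when the call succeeds.
    // thenRun callbacks are skipped when the future completes with an error.
    static int leakyPermits(int calls) {
        Semaphore permits = new Semaphore(calls);
        for (int i = 0; i < calls; i++) {
            permits.acquireUninterruptibly();
            originCall(i % 2 == 0).thenRun(permits::release);
        }
        return permits.availablePermits();
    }

    // Safe: whenComplete runs on both success and failure, so every
    // acquired permit is returned.
    static int safePermits(int calls) {
        Semaphore permits = new Semaphore(calls);
        for (int i = 0; i < calls; i++) {
            permits.acquireUninterruptibly();
            originCall(i % 2 == 0).whenComplete((response, error) -> permits.release());
        }
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("permits left, success-only release: " + leakyPermits(10));
        System.out.println("permits left, release on any outcome: " + safePermits(10));
    }
}
```

With 10 calls of which every other one fails, the success-only version ends with half of its permits missing. At production scale, where errors are rare, the same defect drains the pool slowly, which matches the description of infrequent edge cases that only surface at scale.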
InfoQ: You describe that you consciously stayed away from benchmarks. Given that asynchronous and non-blocking architectures are generally associated with performance, can you share some details of the overall performance improvement efforts?
Cohen: I think the general performance story for us comparing Zuul 1 and Zuul 2 (blocking vs non-blocking) in our gateway is that indeed we got a great improvement in connection scaling; however, throughput improvements are hindered by the CPU-intensive work we do on our gateway. Most of this work is doing things like metric gathering, logging, analytics, decryption, and compression. As I mentioned, after we open source Zuul 2, I think other implementations that don't do as much work will probably see significant throughput gains.
InfoQ: Can you elaborate on the open source plan for Zuul 2 and roadmap for Zuul in general?
Cohen: We are actively working on open sourcing Zuul 2, targeting before the end of the year. We also plan to include some of the new functionality such as WebSocket and HTTP/2 support. From a roadmap perspective, we would like to start opening some of the filters and routing tools and concepts that we've developed. Most of this work is teasing apart the Netflix-specific logic from the generic logic. As we develop WebSocket and push functionality, we would like to pass these learnings and infrastructure into open source as well.
Zuul 2 documentation still seems to be a work in progress. The Zuul wiki provides more information on Zuul in general and how to get started.