Ben Christensen on Resilience at Netflix with Hystrix, Reactive Programming for the JVM with RxJava
Bio Ben Christensen works on the Netflix API Platform team responsible for fault tolerance, performance, and scale so millions of customers can access Netflix. Specializing in Java, Ben gained an interest in and skill at building maintainable, performant, high-volume systems through web and server-side development. Before Netflix, Ben was at Apple in the iTunes division making iOS apps and media available.
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
I’m a software engineer on the Netflix API platform team. The API team is responsible for the layer that all the Netflix devices talk to as part of the discovery process, which means it drives the user interfaces of any device used to stream Netflix content. The platform team, within the larger API team, is focused on infrastructure, resiliency, fault tolerance, performance, those types of things, and our sister teams are more responsible for the business logic and other aspects.
A variety of different things were done in different layers of the stack. The most obvious is that Netflix has a Service Oriented Architecture that takes advantage of Amazon AWS, and we distribute all our infrastructure across multiple zones, so that as zones or different infrastructure elements fail, there is redundancy across everything. So surviving outages of individual machines or load balancers or other types of things is a fairly normal thing, and that is a problem that is more or less solved in our architecture. In most aspects of the architecture, systems are stateless, so they can drop out of service without any impact on behavior. More at the application level, I’m not going to speak much about the data systems because that is not really where I work; that is a whole different realm of discussion. But the API layer is the area where relying upon infrastructure stops really solving problems for you. Instance failure of a machine is pretty easy to deal with because it’s a fast fail: load balancers just take it out and it disappears. Latency, on the other hand, is a much more challenging thing to deal with. The Netflix API by its nature talks to virtually every other backend system within Netflix, and if you don’t pay attention to the relationships between these systems, latency can occur. Let’s say that something that normally takes 15 milliseconds all of a sudden spikes to even 2 or 3 seconds; at high velocity that will very quickly back up all the resources, the queues and threads in the API tier, and within seconds it can saturate all the resources and the entire thing will fall over. What that means is that one of many pieces of functionality can cascade across and take down everything.
So what we did is we applied the Bulkhead Isolation pattern, Circuit Breakers, Timeouts and a variety of other things, but primarily the bulkheading, so that if latency occurs on any one backend system, we isolate it to a subset of the overall resources within the application. That way we can be resilient to that occurring, stop sending traffic to that service until it recovers, and the rest of the application can continue. In as many cases as we can, when that service is not available we return fallbacks that allow the functionality to continue, or we basically return nothing in that case, which turns the functionality off and which we consider failing silently. The third option we have is that we just fail and blow up, which means that if a backend service doesn’t have a legitimate fallback, that particular functionality just breaks and the user experience is broken. But the theory is that by isolating one of dozens, the rest of them can keep functioning, and we have had pretty good success with that. The embodiment of that ended up being open-sourced: it’s the Hystrix library [Editor's note: https://github.com/Netflix/Hystrix ], named after a porcupine due to its protective abilities, and that is what we use across the API tier to isolate around 150 different backend pieces of functionality we work with.
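Editor's note: the isolation idea described above can be sketched with nothing but the JDK. This is a minimal illustration, not Hystrix's actual API: the class name `BulkheadSketch`, its constructor parameters and the `call` method are all hypothetical, and it uses Java 8 lambdas for brevity. Each dependency gets its own small thread pool and permit count, a call that exceeds the timeout is abandoned, and a fallback is returned when the bulkhead is full, the call fails, or it times out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical bulkhead around ONE backend dependency (not the Hystrix API). */
class BulkheadSketch {
    private final Semaphore permits;     // caps concurrent calls to this dependency
    private final ExecutorService pool;  // threads reserved for this dependency only
    private final long timeoutMillis;

    BulkheadSketch(int maxConcurrent, long timeoutMillis) {
        this.permits = new Semaphore(maxConcurrent);
        this.pool = Executors.newFixedThreadPool(maxConcurrent);
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the dependency call in isolation; returns the fallback on rejection, failure or timeout. */
    <T> T call(Callable<T> dependency, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead is full: reject immediately instead of queueing
        }
        Future<T> future = null;
        try {
            future = pool.submit(dependency);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) { // timeout, failure in the dependency, or interruption
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            }
            if (future != null) {
                future.cancel(true); // best effort: interrupt the slow call
            }
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        BulkheadSketch recommendations = new BulkheadSketch(2, 500);
        String fast = recommendations.call(() -> "personalized", () -> "generic");
        String slow = recommendations.call(() -> {
            Thread.sleep(5_000); // simulated latency spike in the backend
            return "personalized";
        }, () -> "generic");
        System.out.println(fast + " / " + slow);
        recommendations.shutdown();
    }
}
```

The fallback supplier covers the first two options mentioned in the answer (a real fallback, or an empty value to fail silently); throwing from the fallback would correspond to the third option of simply failing. A circuit breaker, which stops sending traffic after repeated failures, is deliberately left out of this sketch.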
Basically the way we look at it is that, within the Amazon infrastructure, a cluster of services is split across three availability zones, and we plan on the basis that at any given time one of them could disappear. That means we want each of the other two to be running at around two thirds capacity, so that the two of them by themselves could absorb the traffic from the third. At the same time we also rely upon auto scaling, so if we were to all of a sudden lose a significant portion of our infrastructure, there may be a short time where we receive more traffic than we can handle. At that point we have throttling limits that kick in, so we shed load for anything above what we can handle but are still able to successfully serve the traffic up to that limit while the auto scaling kicks in and we scale up. The idea is that even if we were to lose more infrastructure than we had overprovisioned for, it would only be for a short period of time while new infrastructure comes online, and we could still weather that gracefully.
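Editor's note: a small sketch of the two ideas in this answer, under stated assumptions. The class `LoadShedder` and both of its methods are hypothetical names, the throttle is a simple in-flight request counter rather than whatever Netflix actually uses, and the capacity arithmetic assumes traffic is spread evenly across zones.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical throttle: serve up to a limit, shed everything above it. */
class LoadShedder {
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    LoadShedder(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Runs the request if we are under the limit; otherwise returns the shed response at once. */
    <T> T handle(Supplier<T> request, T shedResponse) {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet();
            return shedResponse; // over capacity: reject rather than queue
        }
        try {
            return request.get();
        } finally {
            inFlight.decrementAndGet();
        }
    }

    /** Utilization of each surviving zone after one of `zones` evenly loaded zones is lost. */
    static double utilizationAfterZoneLoss(int zones, double utilizationPerZone) {
        return utilizationPerZone * zones / (zones - 1);
    }

    public static void main(String[] args) {
        // Three zones at two thirds each: the two survivors end up at about full capacity.
        System.out.println(utilizationAfterZoneLoss(3, 2.0 / 3.0));

        LoadShedder shedder = new LoadShedder(1);
        // The nested call arrives while the outer one is still in flight, so it is shed.
        String result = shedder.handle(
                () -> "outer:" + shedder.handle(() -> "inner", "shed"), "shed");
        System.out.println(result);
    }
}
```

The arithmetic shows why two thirds is the ceiling for three zones: losing one zone multiplies the load on each survivor by 3/2, and 2/3 times 3/2 is 1. Anything beyond that is what the throttle has to shed until auto scaling catches up.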
4. I was glad to see that you recently adopted a new technology in your stack, the Rx Framework, or the idea behind the Rx Framework. Could you first of all explain what attracted you to Rx, or maybe start off with what it is?
Rx was developed by Microsoft as part of their .NET languages, primarily for C#, by Erik Meijer and his team. It's functional programming, higher-order functions, but applied to asynchronous sequences of data. Basically, through his research Erik Meijer evolved the idea of the Gang of Four Observer pattern, which was from a time of very imperative programming models. Now that we're doing more and more asynchronous work, primarily because of the concurrency requirements that we have, the idea is that an Observer can be extended to also handle asynchronous callbacks but know when the sequence has completed and know how to do error handling. The duality that was found is between the Iterable and the Observable: the Iterator and the Observer can actually be treated the same way. This is done by adding two new hooks onto the Observer: instead of just having onNext for when a callback occurs, it also has onError and onCompleted hooks. With those three you are able to send any number of values back as callbacks and then signal if an error occurs and terminate that sequence, or signal when it completes successfully with onCompleted. With just those three methods, what you would do with an Iterator that is pulling data imperatively you can now do in the same way with data that is being pushed to you asynchronously.
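Editor's note: the Iterator/Observer duality described above can be sketched in plain Java. The `Observer` interface below is declared locally for illustration; only the three method names (onNext, onError, onCompleted) come from the interview, and everything else (`DualitySketch`, `pull`, `push`, `pushLog`) is a hypothetical name. The producer here pushes on the calling thread, but nothing in the observer would change if it pushed from another thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DualitySketch {
    /** Push-side counterpart of Iterator, declared here only for illustration. */
    interface Observer<T> {
        void onNext(T value);          // mirrors Iterator.next() returning a value
        void onError(Throwable error); // mirrors Iterator.next() throwing
        void onCompleted();            // mirrors Iterator.hasNext() returning false
    }

    /** Pull model: the consumer drives and asks for each value. */
    static List<String> pull(Iterable<String> source) {
        List<String> log = new ArrayList<>();
        try {
            Iterator<String> it = source.iterator();
            while (it.hasNext()) {
                log.add("next:" + it.next());
            }
            log.add("completed");
        } catch (RuntimeException e) {
            log.add("error:" + e.getMessage());
        }
        return log;
    }

    /** Push model: the producer drives and calls back into the observer. */
    static void push(Iterable<String> source, Observer<String> observer) {
        try {
            for (String value : source) {
                observer.onNext(value);
            }
        } catch (RuntimeException e) {
            observer.onError(e); // terminal signal: no further callbacks follow
            return;
        }
        observer.onCompleted();  // terminal signal: the sequence ended normally
    }

    /** Records the push callbacks in the same format as pull() for comparison. */
    static List<String> pushLog(Iterable<String> source) {
        final List<String> log = new ArrayList<>();
        push(source, new Observer<String>() {
            @Override public void onNext(String value) { log.add("next:" + value); }
            @Override public void onError(Throwable error) { log.add("error:" + error.getMessage()); }
            @Override public void onCompleted() { log.add("completed"); }
        });
        return log;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("a", "b");
        System.out.println(pull(titles));
        System.out.println(pushLog(titles));
    }
}
```

Both methods record the same three kinds of event, which is the point of the duality: the information exchanged is identical and only the direction of control differs.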
Yes, so that is the basic model. That in and of itself would be interesting, but that is not really where the real power comes from; the real power happens when you take that basic component and apply higher-order functions on top of it. This is where you get things like map (or select), filter, take and zip, and you have the ability to combine and manipulate sequences of data. It’s a very clean abstraction away from the source of the data, whether it's coming from the calling thread, a separate thread, an event loop, or an NIO callback, a non-blocking I/O event. All these different sources of concurrency or parallelism or asynchronicity are abstracted behind the same simple interface, and then you can compose these multiple sequences together in a very logical, functional manner that takes mutability out. You stop reasoning about state and mutation and where the data is coming from, and instead it can be done in a very declarative manner; once you’ve declared it and you subscribe to it and execute it, the implementation worries about where the data is coming from, how, and what resources it’s using. That really interested us as we were exploring a new architecture for the Netflix API. We were trying to make it so that our devices could push more of the work to the server, instead of client devices making potentially a dozen network calls to render a single user experience using very granular RESTful APIs, the typical model where each network call does one thing only. Instead we wanted something more coarse-grained but customized to each device, so that the device could define exactly a web service endpoint that delivered all the data it needed for its use case.
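Editor's note: to make "higher-order functions on top of the basic component" concrete, here is a deliberately tiny, synchronous sketch. `MiniObservable` and its nested interfaces are hypothetical and are not RxJava's classes; only `map` and `filter` are shown, because `take` and `zip` need unsubscription and buffering that would obscure the idea. It uses Java 8 lambdas for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical, minimal push-based sequence with two composable operators. */
class MiniObservable<T> {
    interface Observer<V> {
        void onNext(V value);
        void onError(Throwable error);
        void onCompleted();
    }

    /** What to do when someone subscribes: the only place that knows the data source. */
    interface OnSubscribe<V> {
        void call(Observer<V> observer);
    }

    private final OnSubscribe<T> source;

    private MiniObservable(OnSubscribe<T> source) {
        this.source = source;
    }

    static <T> MiniObservable<T> from(List<T> values) {
        return new MiniObservable<T>(observer -> {
            for (T value : values) {
                observer.onNext(value);
            }
            observer.onCompleted();
        });
    }

    void subscribe(Observer<T> observer) {
        source.call(observer);
    }

    /** Returns a new sequence whose values are fn applied to this sequence's values. */
    <R> MiniObservable<R> map(Function<T, R> fn) {
        return new MiniObservable<R>(downstream -> subscribe(new Observer<T>() {
            @Override public void onNext(T value) { downstream.onNext(fn.apply(value)); }
            @Override public void onError(Throwable error) { downstream.onError(error); }
            @Override public void onCompleted() { downstream.onCompleted(); }
        }));
    }

    /** Returns a new sequence that only passes on values accepted by the predicate. */
    MiniObservable<T> filter(Predicate<T> keep) {
        return new MiniObservable<T>(downstream -> subscribe(new Observer<T>() {
            @Override public void onNext(T value) { if (keep.test(value)) downstream.onNext(value); }
            @Override public void onError(Throwable error) { downstream.onError(error); }
            @Override public void onCompleted() { downstream.onCompleted(); }
        }));
    }

    /** Convenience for the demo: subscribes and collects values (works because this sketch is synchronous). */
    List<T> toList() {
        List<T> out = new ArrayList<>();
        subscribe(new Observer<T>() {
            @Override public void onNext(T value) { out.add(value); }
            @Override public void onError(Throwable error) { throw new IllegalStateException(error); }
            @Override public void onCompleted() { }
        });
        return out;
    }

    public static void main(String[] args) {
        List<String> result = MiniObservable.from(Arrays.asList(1, 2, 3, 4, 5, 6))
                .filter(n -> n % 2 == 0)
                .map(n -> "title-" + n)
                .toList();
        System.out.println(result);
    }
}
```

Nothing runs until `subscribe` is called, and the pipeline built from `filter` and `map` never mentions where the values come from. Swapping `from(List)` for a source that calls `onNext` from another thread would leave the composed pipeline unchanged, which is the abstraction the answer describes.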
The typical approach to this in an imperative Java world, or any imperative language (we use Java), is that you are going to have callbacks or Futures, and we found both of them left things to be desired. Callbacks are pretty easy to use if you are only doing one level of callbacks, but once you start nesting them multiple levels deep and try to combine them together, it starts to be an awkward programming model. It’s also not very elegant in dealing with errors. Futures are very basic and easy to use, again when there is one level of things, but Java Futures become difficult to compose. When you are doing something nested, you’ll often have a thread with a Future, and when you want to move on to something else but receive that result back at some point and then do something with it, you actually end up having another thread waiting on a thread. You end up with this nesting of threads, which is inefficient and, again, awkward and not clear to reason through. There are other approaches, such as Guava's ListenableFuture and Akka/Scala Futures, which are composable and monadic and those types of things. Those are both really excellent solutions, and if we hadn't come across Rx we'd probably be using one of those two approaches. But once we were exposed to Rx, it’s such a pure and clean abstraction, one that allows not just the asynchronous response of a scalar value but operating over a sequence of asynchronous values, that it was very attractive. It allows us to have an API that is treated purely asynchronously: we basically create an observable API, and you make all these requests, compose them together, and you can go n levels deep with nesting and it is all fairly straightforward to reason through and compose in a declarative manner.
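Editor's note: the "thread waiting on a thread" problem with plain `java.util.concurrent.Future` can be shown in a few lines. For contrast, the composed version below uses `CompletableFuture`, which arrived in Java 8 and so postdates this interview; it stands in here for the composable futures mentioned in the answer and is not what Netflix used. The class and method names and the string values are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureComposition {
    /** Plain Futures: the dependent task occupies a pool thread just to wait on user.get(). */
    static String nestedBlocking(ExecutorService pool) throws Exception {
        Future<String> user = pool.submit(() -> "user-42");
        Future<String> ratings = pool.submit(() -> "ratings-for-" + user.get()); // blocks a thread
        return ratings.get(); // and the caller blocks as well
    }

    /** Composable futures: the dependent call is declared up front and runs when its input is ready. */
    static CompletableFuture<String> composed(ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "user-42", pool)
                .thenCompose(user -> CompletableFuture.supplyAsync(() -> "ratings-for-" + user, pool));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println(nestedBlocking(pool));
            System.out.println(composed(pool).get()); // blocking here only to print the demo result
        } finally {
            pool.shutdown();
        }
    }
}
```

Both versions produce the same value, but in the first a pool thread sits idle inside `user.get()`, and each extra level of nesting ties up another thread. Note that either kind of future still carries a single value, whereas the Observable discussed in the answer carries a sequence of them.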
Not so much the actors. The source of the concurrency was never really the big decision for us; we don’t want to pin ourselves to any one source of concurrency. Actors are just another way of asynchronously doing something, and typically that just means you have a message queue with a thread or thread pool or something under the covers, whatever its implementation is. So whether it's an actual thread pool like an ExecutorService in Java, or an actor or a pool of actors, or it's NIO, we consider that all separate from how we consume the asynchronous responses. So it’s not so much the actor pattern that was of interest to us, because that can be done without Akka; it was more, if we look at Futures and how they work, how do we want to deal with those? And the truth is we actually came across Rx first, before we really spent a lot of time looking at Akka. That was just down to the fact that within Netflix we’ve got a lot of folks from all different backgrounds and disciplines: because we target so many different devices, we’ve got a lot of people with a lot of different programming languages under their belts, and that allows for a little bit of cross-pollination of ideas. This cross-pollination came from someone who had worked at Microsoft, because folks within the Java community don’t really know of Rx. When it was first exposed to me, my first reaction was actually very much like: “Why do we need something new, and if it is so great why does it only exist in that community?” So we argued over it for several weeks, as to whether or not it was worth pursuing, and as I came to understand the theories behind it and the elegance of it, it basically wasn’t very appealing to continue researching other options, because it became so clear that this was an elegant solution, and also because it is not tied to a given source of concurrency, which is nice.
It doesn’t say: “If we are going to use this approach, you should use actors or thread pools or whatever”; that is completely left to the implementation details of an Observable.
Today it’s still in a very precise location within the API layer, within Netflix as a whole: the very top layer of the Netflix API stack. At that point we have an observable, asynchronous API, a Java API, a service layer, and on top of that we then have the Groovy runtime. We wanted a runtime other than plain Java because we want the ability to use closures, lambdas, and since Java 8 isn’t quite out yet, we didn’t want to be stuck with anonymous inner classes everywhere, which is not very elegant when you are doing functional programming. So we looked at a few different languages, and at the time that we were making the decision, Groovy was a good one: there were enough things for the team to learn without also learning a more esoteric language, and Groovy is a very nice mixture, in that you can write just Java with some of these other niceties and choose how far you want to go into it. So it worked well at that layer. It’s at that point that the API team and any of our device teams create their own custom endpoints and dynamically deploy them into our production environment; they consume the Java service layer, which is all asynchronous, and use Rx to compose all of it and generate the responses, whether it's JSON, XML, whatever. It’s typically JSON responses. So that is the layer that it's being used at. Based upon the team's experience there, they grew to like that approach so much that the desire basically grew into: “Why aren't we using this deeper in our stack?” So we are now starting to evaluate pushing it below that top layer and into the actual implementation of the service layer itself, which is still all just imperative Java. And again, because we don’t have Java 8 yet, we are evaluating Clojure and Scala for that part.
So Scala is used in quite a few places within Netflix, and Clojure is being played with in some, though not as heavily. Groovy could do it as well, but for a variety of reasons we are looking at Scala and Clojure for that layer, and we expect to see Rx used in more and more code over time.
8. Obviously Rx has a very functional way of composing things, you can select and map over streams and your developers have been exposed to these sort of concepts, so are they now more open to functional languages or other concepts, or has that been a barrier to adoption?
Most of the teams, especially the Java-based teams, had never done functional programming. I myself had not done any functional programming prior to my time at Netflix, and that was part of why it took me several weeks of arguing back and forth with this idea, because it wasn’t just a new library approach, it’s a whole mentality shift in how you program, going from imperative to functional and declarative. Another one of the reasons why we chose Groovy is that we didn’t force the functional model across the board; people can still mix the two together to some degree. One of the biggest hurdles to Rx is the initial learning curve, but what we found is that with a week or two of effort an engineer would come up to speed with it, and thus far everyone who has gone over that hurdle prefers the approach at this layer once they are through it. That doesn’t mean it's right for every aspect of what we do, but for the layer at which we’ve applied it, where you are basically composing sequences of data all coming from different places, it just naturally makes sense and actually makes it easy to deal with. So the reaction is positive and people have now started to think functionally in that way, which lends itself to them now wanting to push it deeper into our codebase where we have similar patterns of code. As for whether they are now more open to more functionally oriented languages, I would say “Yes”, in the sense that they are already doing functional, they are doing it in Groovy. So the choice of language isn’t so much what they think about, but it has made it so that Clojure has now become something that isn’t as foreign an idea to consider, whereas a year ago Clojure was just way too big of a change to try to make at that point.
Now it’s a legitimate option for us. The biggest thing that we are looking at right now between Clojure and Scala, just for the size of team and codebase that we have, is the age-old debate between dynamic and static typing. I don’t have enough experience with dynamic typing in a large codebase to have a valid opinion there, but that is definitely the discussion. Everyone reacts to Lisp syntax the first time a little bit like they react to functional the first time: this is foreign to me. But in my experience it doesn’t take very many days of working with it before it just starts to be code, so that is definitely open, and people are interested in these languages that are better suited to the programming approach now.
Kind of. I don’t know that it was necessarily Rx itself rather than our use case; it was a use case that drew us to functional. Remember, none of us had done functional programming. We had a problem that we realized wasn’t being elegantly solved, we found a tool for it, and then we all learned functional programming so we could use the more elegant approach and basically simplify it, make it so we can reason through it better, and abstract away the thread safety concerns at a layer where it didn’t make sense for us to be worrying about them constantly, putting that stuff in a place where it's taken care of. I love dealing with concurrency type things, but there comes a point where it's just like, this is not the layer where we should be worrying about that stuff. But you are right to some degree, in the sense that it’s been a gateway into us thinking more about where else this makes sense to be applied. That doesn’t necessarily mean that for all of our infrastructure code I’m going to go and say: “We should use functional everywhere here”. In some places standard Java is just the right solution and works great, but in a lot of places, especially where you are just pulling together a lot of data, transforming it and composing it into a response, it's just a natural fit for Rx and functional.
10. It’s interesting to hear because usually you hear that kind of thing from functional advocates but to hear people discover these concepts and say: “Well that is the right way to go and imperative is much harder here”, it’s an interesting discovery I think?
Yes, actually it is interesting to look back a year and a half ago. Like I said, I’d never done functional, but as I was exposed to it and learned enough to actually implement a library in RxJava, yes, it definitely does have its place. It’s been interesting to watch a full Java team, all with years and years of ingrained thinking in the imperative model, all very quickly come up to speed with it; they are now capable of flipping back and forth between the two comfortably and applying it where it makes sense.
No, not really, I would say that it’s been a very engineering-focused, not computer-science-focused, approach. Even in the presentation that I gave on it here, I mentioned the word Monads at the beginning because it made sense in one aspect, but I promised I wouldn’t go into that stuff. I’m still not very comfortable discussing all of that theory, because I came in from a very applied engineering approach of “I have a problem to solve, this works well with it”, and we found that the theory just wasn’t helpful at all in the discussions of “Why would we do this?” It’s very interesting academically but it doesn’t actually help me get the job done all that much, so no, we don’t discuss Monads very often.
We open-sourced it about a month ago, a month before QCon London. What we did basically is we realized that what we had was working well for us. At the start it was part of the API layer, and other teams were starting to ask about it, so we knew that we needed to extract it into its own library anyway, which wasn’t that hard. As we started to do that, we were also finding, just through interaction with other people externally, that they were learning what it means to be functionally reactive and were interested in the library, and part of the Netflix approach is to open-source anything that makes sense. So we took a little bit of extra time to clean it up and polish it, and we also spent a little more time making it conform as closely as possible to the Rx .NET APIs, while recognizing that it's in Java so certain things are different. We took that time and open-sourced it, and we released it as version 0.5, basically to recognize the fact that it is not fully feature complete yet. It’s totally production-worthy for what is there, but it’s a subset of the overall Rx. On the GitHub project itself, right out of the gate we had about a hundred issues filed that basically just go to the entire... we audited against the MSDN docs for the Rx project, with all the operators and functionality that we didn’t yet have. So I’d say that we're maybe about a third feature complete, give or take, and now we are just working through adding operators that we need and also inviting contributions externally. We’ve already had some folks contributing quite a bit to the project to help us flesh it out, and our plan is to continue doing so; at the point where we hit feature complete we'll call it 1.0 and we'll finally hit that milestone.
We don’t have any specific timeline that we're attempting to hit, because the interesting thing about it has been that even with just the top dozen operators you can get a phenomenal amount done, and we’ve been using it in production for a year only needing that top section of them. Part of that too is that we are not using it client side; for GUI development there is a whole other set of operators that you would need that we never use, because we are using it server side.
It’s irrelevant for the implementation of RxJava itself. There are a few little things I think we would do differently if Java 8 was out and we felt comfortable pinning to that as the minimum requirement. But even once Java 8 is out, I have no intention of doing that, because there is a lot of value in working from Java 6 onward, since that is going to be around for a while still, if for no other reason than that we continue to support Android, which is more or less Java 6 specs. We’ve successfully played with using it on Android, and I know some others with the open-source project have too, based upon feedback I have received. However, with the pre-release versions of Java 8 that I’ve played with, the RxJava library works without any changes whatsoever, because the nice thing about Java 8 is that the lambda support basically just uses any interface that has only a single method on it; it treats it as a functional interface, so it doesn’t even need the language adaptors that we use for other languages. So thus far with the pre-releases it looks like Java 8 will be able to use it without any issue, and it’s very nice because you’ve just got Java, but in a functional manner.
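Editor's note: the point about single-method interfaces can be illustrated without RxJava. In this sketch `Func1` is a stand-in interface declared locally in Java 6 style, and `mapAll` is a hypothetical helper; the same method accepts an anonymous inner class and a Java 8 lambda without any change to the library side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LambdaInterop {
    /** Stand-in single-method interface, written without any Java 8 features or annotations. */
    interface Func1<T, R> {
        R call(T value);
    }

    /** Hypothetical library method compiled against the plain interface. */
    static <T, R> List<R> mapAll(List<T> input, Func1<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T value : input) {
            out.add(fn.call(value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("alpha", "beta");

        // Java 6/7 caller: anonymous inner class.
        List<Integer> innerClassStyle = mapAll(titles, new Func1<String, Integer>() {
            @Override public Integer call(String title) { return title.length(); }
        });

        // Java 8 caller: the compiler treats Func1 as a functional interface automatically.
        List<Integer> lambdaStyle = mapAll(titles, title -> title.length());

        System.out.println(innerClassStyle.equals(lambdaStyle));
    }
}
```

Because the lambda is converted at the call site, a library built for Java 6 needs no adaptor layer for Java 8 callers, which matches what the answer reports from the pre-release builds.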
I don’t know that I have a favorite, but I will say that map and mapMany seem to be the most commonly used; without those basically nothing would ever get done. I don’t get attached to them.
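Editor's note: mapMany was later renamed flatMap in RxJava. The difference between the two operators can be shown by analogy with `java.util.stream`, which is a synchronous, pull-based API and not Rx; the analogy only covers the shape of the result. The class, the `episodesFor` helper and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVersusMapMany {
    /** Hypothetical lookup that returns several values for one input. */
    static List<String> episodesFor(String show) {
        return Arrays.asList(show + "-e1", show + "-e2");
    }

    /** map: one output per input, so a function returning a list yields a list of lists. */
    static List<List<String>> withMap(List<String> shows) {
        return shows.stream()
                .map(MapVersusMapMany::episodesFor)
                .collect(Collectors.toList());
    }

    /** flatMap (mapMany): each input expands into a sequence, and the sequences are merged into one. */
    static List<String> withFlatMap(List<String> shows) {
        return shows.stream()
                .flatMap(show -> episodesFor(show).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> shows = Arrays.asList("a", "b");
        System.out.println(withMap(shows));
        System.out.println(withFlatMap(shows));
    }
}
```

In Rx the inner sequences are themselves asynchronous Observables, so mapMany is what lets one request fan out into further requests while still handing the subscriber a single flat sequence.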
Werner: It’s always good not to get attached to your abstract concepts. Thank you very much!