Bio Adrian Cockcroft is the director of architecture for the Cloud Systems team at Netflix. He is focused on availability, resilience, performance, and measurement of the Netflix cloud platform, and is the author of several books while a Distinguished Engineer at Sun Microsystems: Sun Performance and Tuning; Resource Management; and Capacity Planning for Web Services.
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
1. Hi, I’m Michael Floyd here with InfoQ and I’m here with Adrian Cockcroft. We’re here at QCon San Francisco 2012 and Adrian is speaking about architecture for the cloud. So first, Adrian, can you introduce yourself to our audience?
My name is Adrian Cockcroft, I’ve been at Netflix for about five years and I am currently the architect for the cloud systems team basically and before that I spent some time at EBay and then it was Sun Microsystems for a long time working on performance related things mostly. Ok, so that’s my background.
I am the overall architect, so the way Netflix does architecture, we implement patterns we hope our developers will follow. So you can do it by having a big stick and an architecture review board forcing everyone to do something a certain way, we do it by creating patterns and tooling that encourage people to have an easy life to follow the architecture. So it’s much more of an emergent system, the architecture emerges from everything that we do, but we are building out lots of tooling that gives us highly available patterns. So the talk I am going to give is about the patterns we use to give us highly available services running on a cloud infrastructure which is built out of commodity components that come and go, talk about the kind of outages we get and the way we deal with it and the way we replicate our data in services.
3. We’ll get to that in just a second. But first I kind of want to set the stage. As I mentioned you’re here talking about these patterns that you’ve just described. So, can you just give us an idea of the scope of the problem that Netflix is trying to solve? I’m sorry, for instance API requests, how many API requests would you serve in a day?
So, the numbers we’ve shared, we’re doing a few billion API requests a day, I think we’ve published the numbers recently, we do more than a billion hours a month of streaming, there were some numbers published this morning that basically put Netflix at 33% of the peak time internet bandwidth streams to homes in the US. So we’re consuming the lion’s share of the actual bandwidth that people use in practice streaming videos to them. So that generates a lot of traffic, it’s terabytes of traffic, many terabytes and the requests that are supporting that come from currently about 30 million customers, of our streaming product on a global basis. So that’s large sum of customers. You can calculate they’re watching an hour or two a night each; so we have quite a lot of people at any one time watching. So lots of traffic, we run something in the order of 10,000 machines, that’s the sort of scale we’re at for handling that traffic.
Yes, it’s all running on Amazon, it’s AWS server services, we have a fine grain model where basically every engineer or group of engineers develops just one little service with its own REST interface. They all talk to different backends that they have their data in. And we’ve got maybe 300 of those services, I don’t actually keep track of them that well because people keep inventing new ones. So you hit our website and it flows through into this big large number of very discrete services. Each of those services is scaled horizontally so we have at least three of anything so that we can lose one and still have two left, and then we have up to 500 or 1000 of our biggest services that some of the big front ends get to that kind of scale and the total number of machines we have there is of the order of, it varies between five and ten thousand. It order scales so at peak traffic it’s bigger then when it shrinks and it depends what you are counting how the numbers work out.
Sure. If you care about high availability then you’ve got to start replicating. And there’s two basic ways of replicating. One way is to have a master and then slaves. The problem with master- slave architecture is if you can’t reach the master or it goes down, the slave isn’t quite sure whether it’s ok to take an update or serve something. The other alternative is a quorum based architecture where we basically take two out of three have to be working, so everything we do is triple replicated, we’ve got three copies of every service, three copies of all the data in the back end and when you write, you write three copies and as long as two of the copies are still there you can run everything. So it’s triple redundant system. That’s the basic pattern. We use the availability zone as our basic, sort of unit of deployment. I think of it as a data center, so all our data is written three times, but not to three machines, those machines are in three different data centers, so they are a mile or so apart, millisecond or mile or something like that, so that we can lose an entire data center, an entire zone, and everything keeps running. That’s the basis of the architecture, a lot of details about how you distribute traffic across that, how you rout traffic within them but that’s the basic architecture.
Continuous deployment is really more about the agility and the productivity of the developers. You can either gather all your code into a sort of train model and then every two weeks or whatever you say “ok, here’s a new train”, you give it to QA and then they test it for a week and then you launch it and maybe they can figure out what’s wrong with it in time before the launch. And as the team gets bigger and bigger that train model stops working. It works for a small team, it’s the way we used to work five years ago. The way we work now is completely decoupled. So any developer that owns a service can rev that service as often as they want and whenever they want.
Generally, we try not to change things on Fridays because we like to have quiet weekends, other than sort of general guidelines like that, if you’ve got a brand new thing you’ve just deployed, you might deploy ten times in a day just because you are iterating to work through issues that you are finding and you are running a small amount of traffic into it. If it’s a service that’s mature and it’s been sitting around for a long time it may not be changed for months. Because we’ve got these fine grained services everything is on REST interface and it does one single function, piece of business logic, you can change those as fast as you want. You still have to manage the dependencies and the interfaces versioning and things like that but eventually we figured out how to solve those problems too.
How the availability side works? So, we do something a little different to most people which is, since we’re running in the cloud, if I have 500 machines running a service, when I want to change it, I don’t update those 500 machines. I create 500 new machines running the new code. First of all I create one and see if it works, smoke test it, we call it the canary pattern which is fairly well known, you have this thing and if it dies you keep putting it up until you get a good one. But once you decided you want to deploy it in full you basically tell the Amazon auto scaler “please, make 500 of these machine” and the machine image, which contains everything, it’s like the root disk sort of clone, say make 500 of them, that takes eight minutes to deploy 500 machines on Amazon.
It’s a number I’ve measured recently and then it takes a while for our code to start up, so in a matter of minutes you’ve got 500 new machines running alongside your old machines. You switch the traffic over to it, watch it carefully, see if everything worked fine and then you typically leave the old machines running for a few hours, typically through peak traffic because sometimes systems look ok when you first launch them and then maybe find a memory leak a few hours later or they fall over under high traffic or something like that. We run side by side and we actually automatically clear out the old machines eight hours later or something like that. So if there’s an issue with anything, we can in seconds switch all the traffic back to the old machines which are warmed up, running, they’re just sitting there ready to take all that traffic. So it means that we can rapidly push new code with a very low impact if we push something that is broken, we can tell immediately if something is broken and switch it back. So it’s about not trying to sort of legislate against everything that could go wrong, it’s about being very good at detecting and responding extremely quickly when something isn’t quite right.
Yes. We build our own PAS specifically to make our developers more productive. When we were building it, it was just our platform and the things we needed to build. As people started talking more about PAS as a thing, we looked at what we built and we decided it’s a specific kind of PAS, it’s a large scale global PAS design for a large development team. A lot of the PAS out there, like Heroku or Cloud Foundry are really designed for a small team and a small number of developers, you can run it on a VM on your laptop. Our system doesn’t make sense until you’ve got several hundred machines running on Amazon, that would be the smallest deployment that would make sense for us, but it supports scale to a thousand or tens of thousands of machines, which is the problem we were trying to solve.
So I think what we built is an enterprise scale PAS and as we were working through it, we started open sourcing various pieces of it. Our foundation data store is Cassandra which is already an Apache project, we didn’t like the Java client library that’s called Hector, who is Cassandra’s brother I think in Greek mythology, my Greek mythology keeps getting mixed up, so we rewrote our own client library which is called Astyanax, which is Hector’s son, and we basically at that point, we have a client library, we should just share it, we started sharing more and more of our code and the code we used for managing Cassandra. So, those were the first pieces. We are now in the process of releasing basically the whole platform. All of our platform infrastructure code is either things we consume externally, that everyone has already got access to or it’s things we’ve put out there and things that we are in process of putting out there. We launched a service yesterday called Edda, which is something to do with Norse gods stories or something like that, e-d-d-a. It’s an interesting data store in that it collects the complete state of your cloud infrastructure, so you can go to Amazon and say “give me all my instances” with one API call it gives you a list of instances, then you can iterate over those instances saying “describe, describe, describe, describe”, you get back a JSON object.
What this does is it continuously does that on a one minute update and stores the complete history, a versioned history of all those objects. So what you have is for all of the ten or more different entities that we track on AWS, we have a versioned history going back as far as we want of the state of everything. So I can say exactly what state was everything in three days ago when we had this problem and I can basically interrogate all the instances that are no longer there or all the structure, I can see the entire structure, everything. And we actually have some tools which clean up things, we sort of do garbage collection on instances, Amazon entities, and things like auto scale groups. We have a Janitor Monkey which talks to this data base and figures out what is old and not being used, in order to figure out what’s old, it has to have a historical view of the world. So that’s a new service, we just put it up on AWS. It’s lots of JSON objects, it’s written in Scala which is the first for us, most of our code up to this point it’s been written in Java, so it’s the first thing we’ve done in Scala and has a MongoDB back end because that JSON object manipulation is just the ideal use case for that tool. Most of our high scale customer facing applications are running on Cassandra because that is the highly available, highly scalable side of things we use Cassandra. So that’s an example of the kind of thing we’re releasing over time.
9. Making the move to Scala is kind of a big deal. What were the preparatory steps that you took, were there governance things involved, how did this get pushed through to Netflix basically, was there somebody championing Scala?
We don’t have governance. We have an engineer that decided to do something in Scala. Basically there is a series of patterns that are well supported by tooling. If you run something that runs in a JVM, Scala is similar enough that you run it in a JVM, if you got a language that runs in the JVM that can call Java, Java files basically sit on top of our platforms, that’s a relatively small divergence from a platform. And we have a few people that are playing around with Scala, we built some other internal tools with it. So it’s one of those emergent things, we don’t stop anybody from trying anything new, if you want to write something in Clojure or anything, you’re welcome to try. The problem is that you then end up supporting the platform and the tooling yourself and most engineers will figure out eventually that they don’t want to support all the platform and tooling required to support a radically new language.
For example, I think the Go language is really interesting. We have really no way to support it in a very JVM based environment, so that is going to sit on side until it becomes a significant enough tool that enough developers are interested in that we decide to figure out how perhaps to increment and add it in, but that right now I personally think it’s an interesting language. We have other people think that Scala is interesting. There’s no big rule, there’s no big process, there are just well beaten paths with tooling and support and then there are other things that you can do on the side. The other main language we support is Python, mostly for our ops automation infrastructure. I don’t think we’ve released any open source Python projects yet, but we will be doing some. But even that, the back end platform interfaces that are PAS, basically, implements they are not particularly slow moving. They are moving forward with functionality really fast. So the challenge then is you’ve got two language bases and Python is playing catch up and trying to keep up with all the Java based languages, Java based platform pieces. So it’s actually quite difficult to support more than one very distinct language in a platform, in particular if it’s evolving really rapidly. The point of everything is to evolve things rapidly.
So far, I think probably a happy engineer.
Michael: So, there was no language features or anything?
I haven’t actually had that discussion. We have a few people have written things in Scala and they seem happy to do it. We’re trying to figure what is it good for, I mean obviously every language has its own thing it’s particularly good at. So there are some people that have figured out this is a good thing to write in Scala. Then, most of our developers, 99% of our code is probably in Java. A little bit of it in Python. So it’s just, at this point I regard it as experimental but anything that will run in the JVM is basically going to fit into the architecture relatively easily.
Yes, so github.netflix.com is where we have all our code, we have, I don’t know, I’ve lost count, we have ten or fifteen projects up there now, we have at least ten more that are in flight, we had a big push to try and get everything done for the big Amazon event that is happening at the end of this month, AWS re: Invent, which is the week after Thanksgiving in Vegas. It’s a huge show, we have fifteen, I think, Netflix presentations there all together. Reed Hastings is giving a key note, so it’s going to be a massive show for Amazon and Netflix is very involved in this. So what we are trying to do is get as much of our platform out as possible before that show and then at the show gather anybody that’s interested to sort of have a little “are you interested in the Netflix platform?” user group meeting.
We are still feeling our way, we don’t know how that’s going to work out. We know we have quite a few people, the nice thing about Github is you can see how many people have taken copies of your code and are playing around with it, we’ve got contributions from people. But the interesting thing about having open sourcing as a strategy is that it really helps clean code up. So if you take a developer, this Edda thing was written ages ago and the guy that originally wrote it, I think he left or is working on something else, we hired a new guy and we said “well, we need to open source this thing, can you figure out how to clean it up for open source?” and he looked at it and said “neah, I’m going to rewrite it from scratch” and that’s why it’s got rewritten in Scala. It’s a version 2 of this thing, the version 1 was something that was thrown together fairly rapidly, iterated over some period of time, so we took that to be the prototype, did a nice clean version of that and then put that out into the open source project.
So if you’re a manager trying to get your engineers to clean up and document their code probably the best thing you can do is “hey, would you like to open source that?”, because then there’s a lot of peer pressure “my code is going to be in public, anyone can read it”, you know that sort of Github is sort of your future resume if you’re a developer, you want your name associated with good projects and good quality code. So that’s acted as an extremely beneficial sort of side effect, and then internally we found some of the things that we’ve open sourced we have other teams using them for other projects that are not part of the core architecture of what we are building for supporting the customer-facing streaming product. We have a team that’s working on movie encoding, it’s a completely separate system, pulls in all the main movies form the studios and encodes them. They are now using a project that we open sourced whereas they could not use the internal version of it because it was too tied in to dependencies into our architecture. So by open sourcing it as separate projects, you end up componentizing the platform in a way that makes it more consumable benefits internally, as well as externally.
Well, the user interface is the one most people see. You can talk abstractly about the platform but when you actually see a GUI with an enormous number of options and features on it, you actually realize that there is something there. And that product is called Asgard. Asgard is the home of the gods, I think, control the clouds, it’s got a big hammer.
Yes. So we have that group that’s building those tools that seem to have gone on a Norse mythology naming scheme, we have several names, every engineer gets to pick a project name, so Edda is related to that, it’s the stories that describe what the gods did or something like that, so the story is about the cloud. So that’s how they came up with that name. And then there’s an Odin, which is a workflow manager system for something. But Asgard is probably the best place to start if you want to see what this looks like, it’s a replacement for the Amazon console, so it’s a Groovy Grails app, very sophisticated, it’s been developed over a long period of time, with some talented engineers, a lot of engineering efforts go into it. It’s what everyone at Netflix uses to deploy code, to see what state everything’s in, to configure things.
So that’s the core front end, we have a configuration services, we have a dependency injection system, based on Guice, the Google Guice project which is called Governator. We have some things based off of ZooKeeper, Curator and Exhibitor. We have things for managing Cassandra, Priam is Cassandra’s dad, so that’s what we used for managing Cassandra. I have to have a slide to keep track of all these things, I have too many, too many projects. But there’s more stuff, about once a week we put out new projects, we always do a Tech Blog post, techblog.netflix.com is where we put posts up that describe what we’re doing in this space. And the other things that are on Tech Blog, we talk about not only open source projects but technical things, like the personalization algorithms were published on Tech Blog, and if there’s been an outage, significant outage, then we talk about what we learned from the outage and responses and things like that. So, I encourage people to look at techblog.netflix.com and if you are looking for a whole lot of Java code, at Github.
Michael: Well, Adrian, thank you for coming by today. We enjoyed having you.
Thanks very much.