Bio Adrian Cockcroft has had a long career working at the leading edge of technology. He's currently working for VC firm Battery, advising the firm and its portfolio companies about technology issues. Before joining Battery, Adrian helped lead Netflix’s migration to a large scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform.
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
Hi there. So, I’m generally well known for the time I spent at Netflix for the last 7 years. I recently left and joined Battery Ventures, a VC firm. Before that I was known mostly for the time I spent at Sun Microsystems working on Solaris performance, and things like that. And other than that I’m originally a physics graduate rather than a CS graduate so I’m a bit more on the Systems Engineering side than the typical Computer Science sort of side of things.
The key thing that Netflix optimized for was speed, and as a fairly small company operating in a world where there are a lot of very large competitors, really the only thing you can do is outrun them and be very, very Agile. So Agility for Netflix was a way of surviving and winning all these different races they were trying to win in the building personalized video delivery for people and streaming, and all of the different things. So to get speed everything basically is subservient to the need to be able to make decisions, and build things, faster than anyone else could - any of the competitors could - and the competitors are nice big companies so it is relatively easy to go faster than them. So as Netflix got bigger it gets a little harder to be fast, but what we figured out was a way to decouple everything so that we could still stay very Agile. That was what I really learned. That leads to a sort of freedom of responsibility for developers and taking out all of the handoffs, and organizing in a sort of DevOps way of doing things. It wasn’t that we saw the DevOps Manifesto and decided to do it: we looked at that and said “Oh we're kind of doing that already”. It was an outcome that we had already arrived at.
3. So did you find some resistance from developers to DevOps - I mean the idea of having to take care of your own deployment and all of that kind of thing - was there resistance to that when you started?
Yes certainly, there were some people that liked the way it was previously. I joined Netflix in ‘07 and the architecture then was a monolithic development; a two week Agile sort of train model sprint if you like. And every two weeks the code would be given to QA for a few days and then Operations would try to make it work, and eventually ... every two weeks we would do that again; go through that cycle. And that worked fine for small teams and that is the way most people should start off. I mean if you've got a hand full of people who are building a monolith, you don’t know what you are doing, you are trying to find your business model, and so it’s the ability to just keep rapidly throwing code at the customer base is really important. Once you figure out how... Once you’ve got a large number of customers, and assuming that you are building some Web-based, SasS-based kind of service, you start to get a bigger team, you start to need more availability.
And as Netflix did its transition from a DVD business, where the site was only accessed mostly once a week by customers to set up what DVDs they were going to get for the rest of the week, to a streaming business where it has to be up all the time otherwise streaming stops; and you have millions of people concurrently streaming, and they have a total now of about 48 million customers globally using the product. So the availability requirements suddenly shot up, the size of the team shot up, and the need to innovate was also increasing, and those things sort of collided, and the monolith can’t take it. So there is a point at which you have too many developers trying to change the movie object or whatever, the customer object. Everyone wants to change that object for some reason and they can’t all change it at once; so you get these collisions in the system. So that lead to a move to split everything up, which ended up with what’s now called “A Microservices Architecture” but at the time it was just, "We have to just separate these concerns" or bounded contexts if you like, so that different teams could update their own things independently.
That’s how it all basically ended up. Started talking about it publically about... 4 years ago was the transition; 2010 was the main transition year where we started going out publically and talking about what we were doing and got mostly baffled responses like “Why would you do that, it’s crazy,” and just kept going and eventually people decided it was a good idea, but it was too big a jump; too big a gap. So in 2012 – 2013 lots and lots of open source projects. 44 projects later people are going, “Ok, well I can download the code and that helps bridge the gap" but it is still difficult that there are quite a few people now running architectures that are really very similar. There is a reasonably large group, probably 10 -20 large web companies, who have a roughly similar architecture using a number of different technologies.
I think the Conventional SOA world was sort of built around the whole sort of WS* WSDL/SOAP kind of stuff and that sort of ground to a halt under its own, sort of weight, at some point. So moving on from that what we are talking about here is much more like programming the Internet. So if you build a typical mobile app now, that's reasonably sophisticated, you might use Facebook for sign in and it may let you send stuff to Twitter, and maybe there is a Google Map interface, and it’s probably got some sort of crash analytics thing, and it's got some add interfaces. So your little mobile app on your Android or your iPhone is talking to 10-20 different web services and you are programming in that environment; and you are building your little piece - all these other things are around you. So that’s basically, each of these things is a single function that does something; you are decoupled from them.
You don’t go and have a deep discussion with the Google Maps team just to use their Maps API: it's a reasonably stable API, you are isolated, it's sort of versioned, occasionally it changes and you may want to do things. So basically you build your own service, you build a bounded context around the thing that your team, your 2 or 3 engineers, are building and you build a service or a group of services that interface with all the other things that your company is doing, as if they were separate companies. It’s a different bounded context. So you talk to them but you are not tightly coupled. So the way that the Microservices stuff is different is I think the lack of coupling, and that you are trying to build a single verb kind of model where you’ve got... the bounded context idea from Domain Driven Design is quite a good way of thinking about it. If you make the bounds too small then you’ve got too much coupling; there is not enough information within the context and you end up with very high rates of coupling outside.
If you make it too big you’ve got too many concepts sort of juggled, sort of a bag of things. So it’s quite hard sometimes to figure out what the right chunk size is, but it turns out that it naturally comes down from what I call the Inverse Application of - is it Conways law?... what’s the organization...? Basically the law that says the communication structure of an organization will be reflected in the code. So you set up the ... you decide what structure you want your code to have and you make the organization look like that; and an engineer or a team will create the Microservice. So if you look at the Netflix architecture and you look at all the services - if I look at it - I can look at the name of the machine and think, "I know who wrote, I know who wrote that, I know who wrote that," And if it breaks the system automatically calls a person when a service breaks; or it calls a team. or there is a pager-duty escalation if you like.
So going back to your question, "Was there resistance to doing this?" Some engineers weren’t happy with the idea of being on call, and the time it would take, and the responsibility; and it turns out there are a couple of things which are maybe counterintuitive. One is that the systems don’t break as often if you own them in production. And the analogy for this is like if every car instead of having an airbag had a 6 inch steel spike pointing at the driver, there would actually be fewer accidents on the road because everybody would be driving along very, very slowly and carefully, and they would never bump into anything; because there is peril built into the system and the peril is very closely looped back to your actions. So the fact that if you build a system where you can do anything and there is no consequence to your actions, you are creating externalities: you just externalize all of the risk and all the problems.
You know, "Some guy in Ops will fix my code and make it work and by that time I’m working on a totally different project; I don’t care". So you want to tighten that loop so that done for a developer isn’t, "I gave it to QA", or even "I gave it to Ops", or "It’s in production", which is what most people say done is - "It’s in production". In this model done is, "It’s no longer running in production; I've removed it". If there’s any code of mine running in production I’m not done, and if it breaks I get called. So it turns out that given that feedback loop people right really, really resilient code; and they figure out how to make it scale. And the ones that don’t generally get woken up enough times that they leave or you push them out gradually because they haven’t figured it out. So you find out who’s able to deal with that and it’s not that hard. Another way of saying it is it’s easier to teach developers to operate their code than is to teach the operators to develop code maybe, or to look after the developers' code; it’s actually a simpler problem.
You actually end up having fewer meetings with Ops, and you end up spending less time on meetings on managing your code in production, which is also counterintuitive. It’s one of these problems that if you want to give some work to somebody else you have to explain them in detail how to do that job and that can take longer than actually doing the job; it's that case. And once you get everything to a fairly mature state, that’s the general case; so there’s not really any push back now to owning stuff in production. It also encourages pair programming because you don’t want to be the only person that knows how the service works, because then you're always going to be on pager duty and you are always going to be called. So you want to code review it with somebody else, typically another member of your team or several other people, and then you take turns being on call; so you can go on vacation, sleep at night, and things like that.
Yes. There are two levels of the API. So if you have a REST service, most of these services are just... not everything because you have MySQL or Cassandra or something like Thrift, but most things are an HTTP call to a service to do something. At one level there is that protocol on the wire, it’s sort of a contract if you like or a stable API, but the interface that you end up using is actually one level above that; there is a client library that talks to that service, and there is actually usually two layers of those libraries if you do it right, and it’s problematic if you don’t split it into two. So think of having a client library which does nothing but talk to one service: it knows how to serialize and deserialize requests to that service and it knows how to handle errors from that service.
And that provides maybe, if it’s a Java Library a POJO that is just constrained to that, you know, Plain Old Java Object, there is no higher... you are not tied into the Type System of the entire infrastructure. And the idea of that is it should be a very single function thing and that interface library should be built by the people who built the service. So you build a new service, you define the protocol, you build a client library for that protocol for the different languages that you want to support - a JAR will support all the JVM languages, but you might need a Python Library or you might need whatever, Go or something, Ruby; whatever you might need. And you publish those libraries in Artifactory or whatever. And so if somebody wants to consume your service they go to the repository and they pull out that library and build against it and call against it, and that’s the most basic way to call a service.
Then what you find is that there are these layers of platform. Let's say you're are a part of a service within the platform and then there are several different teams of people consuming your service. Each of them might want to consume it in different ways, so each team builds another layer, another library, which effectively talks to multiple underlying libraries but gives them their object model. So the object model is optimized for this use case and there might be a few, a group of services, that use your service in a certain way and they share a higher level library that talks to your low level library; but that library is built by the team that consumes it, with their object model, so the binding is upwards; and then there is this interface between the two which doesn’t change very often, and which you can patch things up in, because all the code is written to this higher level library and the protocol is written to the lower level library; and you can fudge things: you can route traffic, you can basically say, “This is only implemented in the old version of the service" and you can route it to the old library; and you can have side by side old and new and things like that.
So it gets a little more... it's maybe more complicated to think about, because the user behavior depends on the routing of traffic to all these different microservices. But once you figure out the power of being able to route traffic, to control behavior, and go to old and do versioning and things like that, it’s quite a powerful technique.
You can do, if you do it naively. If you send a MB of XML over a link it will take a long time; so don’t do that; you learn not to do that. So the serialization matters. Most of the latency is not on the wire because the calling between machines within the Data Center is sub millisecond; that’s not the problem. The problem is serializing and deserializing the request. One of the things we did fairly early on was benchmark a bunch of different things. And I had somewhere a little graph where we had to truncate the graph for how long it would take to do something in XML - half a second or something - and then if you did it in JSON - I forget the numbers but if was fifty milliseconds or something - this was to encode a fairly large object, And then in sort of Google protobufs [Protocol Buffers], and Thrift and Avro it was getting better and better; down into a much more reasonable time.
So if you are doing something complicated between services that’s fairly high speed, then use one of those. And there is a mixture of all three of those in use at Netflix - Thrift, protobufs and Avro. Avro is the one we picked as the main recommendation because it turns out it has a good compression in it, so the objects are very small; and these objects tend to get stored in memcached as well - you tend to build memcached tiers between layers in the system and stuff gets collected in there; and you want to store it in a very compact way because if you use half the space you get twice as much stuff in your memcached. And is certainly compared to storing raw JSON kind of things in memcache; that would use probably 5-10 times the space. So it's worth using reasonably well optimized protocols.
Yes; well there are two challenges with monitoring: one is that microservices are a challenge to visualize. I have a diagram in my deck which I call the Death Star Diagram, which is you take all your microservices, you render the icons in a circle, the icons then all overlap each other, then you draw the connections, all the calls, between them; and you get lines going from everywhere to everywhere and you just get this round blob of undecipherable mess. That’s if you do it really without any hierarchy. And I've found diagrams like Gilt and Groupon and Twitter and Netflix, all have a diagram that looks very similar; it’s just a blob in a circle with stuff between them. The way that you really need to visualize this is with a lot of hierarchy in this; you need to group related services into a sub thing and then sort of go out, so that you can see at the high level what’s going on, and drill in and see more and more detail; so again you’ve got layers of bounded contexts.
The trouble is if you need to... trying to understand the whole system is a problem; the system should be too complicated; at this point the system is too complicated to understand, so you should only need to know about the parts you interface to and how the bit you built works. So microservices on their own cause problems, just because you’ve got lots of layers of calling and you need to track requests across them. So the Netflix libraries, Ribbon is the client library and Carry On is the server side, they tag every request, every HTTP request. They throw a header field that’s got a GUID in it and that GUID tags the transaction and that flows down; and when you do the next call you send it on down. But if you fork and make an additional one there is a sub transaction in there too.
So you can actually track the parent transaction, and the child, and you build the tree of these GUIDS; so an incoming request hits the website and it explodes into this tree of requests that probably hits 50 to 100 different services, to gather everything that it needs to just render the homepage for a single web request from a browser; 50 to 100 services quite commonly; some of them are memcaches, some of them are calculating things, whatever. But that’s maybe half a second or a second worth of processing time; it’s pretty heavy homepage, there's lots and lots of stuff on it. So given that, you’ve got to track all of those and render them; display them somehow. And tooling for that is kind of primitive, that’s kind of an advanced topic at this point.
The other thing that breaks it is Cloud and being dynamic; so machines have a lifetime of one or two days, average; and you’ll never see that IP address again, or actually you may see that IP address again immediately, but it’s a different machine at that point. So you can’t key off of IP address or MAC address as a machine identifier as you could in the Data Center; so just that reaping dead machines kind of problems: a code push might create 500 or 1000 machines in a few minutes, and if your system doesn’t alter table every time it adds a new machine in your back-end of your monitoring system, it tends to fall over; those kinds of problems.
A lot of monitoring tools assume that adding new machines is an infrequent operation, and it isn’t in this model, ... if you build a really dynamic system. So a lot of the..., And then the other problem is just scale: Netflix ended up with 10, 20, 30,000 machines, all running at once and you are trying to monitor and make sense of that, and the tools generally die in horrible ways. If you can find Roy Rapoport from Netflix - he's at the conference - if you interview him, he did a presentation on what the monitoring system has to do. It’s collecting billions of metrics and it’s processing huge numbers of data; it’s a monstrous system just for keeping track of everything.
8. Great. Do you have any specific recommendations, given the highly distributed nature of a Microservice System? Do you have any specific recommendations for things like circuit breakers to avoid cascading failures and that sort of thing?
Yes, that’s a big problem. So if you build something that breaks, and it ripples out, and everything is synchronous calls, you can take out an entire network of systems. So you have to contain the failure somewhere. So there are Circuit Breakers, Bulk Heads, those kinds of patterns. What happened about... After the first few naive attempts to build this thing using futures and threads and whatever, I distributed Michael Nygard's book “Release It” to a bunch of people that read it, and they went, “Oh, this is cool” and that’s where they got the patterns from. And that was kind of ... As the sort of a overall architect of this stuff, that was kind of some of the stuff I was doing - sort of looking for good ideas from some people and spreading the ideas to other people: and some of them came from books, some of them came internally.
But you act as this sort of cross switch for spotting the good ideas and trying to stamp out the bad ideas and those kinds of things; and that was kind of my role. I invented very little of what Netflix did, but I carefully selected and curated that stuff into a thing that made sense that you could explain to people; so when we on-boarded people or when I went and talked about it publically, I could gather together a sort of coherent view of what was going on. So going back to the Circuit Breakers, there is a couple of very useful software packages that have come out of this: there is one called Hystrix which was developed by the API Team; and this came out of the observation that every time something broke it would appear as though ... the API would be the thing that looked like it failed, because its dependencies had broken; and the central API that is supporting TV-based devices is this concentration point and it has 30, or 50 or 100 dependencies coming into it, and when one of them broke, it would break.
So they'd get the call and they’d have to rummage around in their logs and say, “No, it’s actually this one that broke. Our code is fine but one of our dependencies has broken”. And after a while of being on every single call for anything that ever went wrong, they, being developers, developed a defence mechanism, which is that they built this circuit breaker layer into the API Server which wrapped every dependency in a circuit breaker that basically said, “If this thing breaks, I’m going to stop calling it and I’m going to flag it and I’m going to point at it and I’m going to say 'that thing is broken'”. And then when the people that manage the alert system, the site reliability engineers, see an error they go and they see the circuit breaker and they say, "Don’t call the API team because the circuit breaker is pointing at dependency number 3"; so they call whoever owns dependency number 3. So that was one of those things that you get the incentive right and they build the system that automatically says "That's what's wrong". So that system then got ... it was embedded in the architecture of the API server, but eventually the other teams at Netflix wanted to use it; and so this is another interesting sort of case, that the platform team owns the stuff that is shared across all the other teams, but the API team is one of the teams that keeps building new stuff; they are very advanced in what they do; and they talk to the platform team, but they build new bits of platform that then get adopted by the platform team, and the platform team kind of typically adopts these.
They reserve the right to call your baby ugly and not adopt something, because they don’t like it, and that's happened a few times, but mostly particularly from the API team... Ben Christensen who - there’s a whole pile of InfoQ stuff on, go see Ben Christensen's talks. The stuff he was coming up was really good, so in order to adopt it into the platform it went through a cleanup and a rewrite and a modularization; and that was also what was needed to make it open source. So you tend to get this thing where something gets developed as sort of version one and the version two is the cleaned-up, modularized, platformized version and that also was the open source version. So Hystrix is a sort of version 2 cleanup of the original idea that went through a cycle of development. There’s several examples where that kind of thing happened at Netflix; it’s a good approach for doing it.
So, Hystrix is there; the other key thing is that when you’ve got these networks of systems you need to figure out what is dependent on what, and the only way you know, other than if you have lots of outages you find out, but you have to induce errors in order to make sure that it is working right. So Chaos Monkey is well known, and that’s where you kill a machine to prove that it replaces itself and everything keeps working; and typically there are arrays of machines, arrays of instances, so 50 or 100 or 500 machines, so killing one of them makes no difference; make sure you don’t store any state on it, that’s the main thing, or the state is replicated somewhere. The other thing..., but that's actually not a particularly hard way of killing something... There is a version of Chaos Monkey that was invented as part of the Cloud prize that I ran last year. It tortures a machine in 14 different ways; it was like the barrel of Chaos Monkeys. It will do things like getting into the network stack and stopping all DNS requests from going in and out, and it will cause DNS to fail. So what happens if DNS fails in your infrastructure? Who knows? Or you find out one day in a very nasty way. So you can simulate DNS failing or a disk filling up, or the CPU getting busy or un-mounting file systems; it basically tortures the machine in lots of different ways, so that’s on a particular instance.
The other thing that’s interesting is Latency Monkey, where the client and server side - the libraries I mentioned before called Ribbon and Carry On - they have entry points where you can insert... you can get into it from the outside. through the status page, and basically tell it to start misbehaving. So you can say, “I want you to whenever you get a request return a 503 error" or something, or a 9739 blah - make up a number. You can return any arbitrary thing if you wanted to confuse the system, or just return fine but wait 5 seconds first or wait some random time. So you can basically inject errors and latency into a perfectly working system to see what happens upstream; to make sure the errors are contained by the next level up.
And that works well for checking everything in production, but when you are running in test, you typically want to do it at the other end: you’ve got a new service and you want to see if your service is resilient to the failure of everything else. So you actually want to do it in the client side; you want to say, “I know that there is a really good service out there that is really working, but I’m going to pretend it's stopped working”. So you can inject errors in the client or the server side, from the Latency Monkey environment. Once you do that regularly you really find out what... you really do end up building a resilient system. Because it takes all these things and says, "Yup, that failed but it was contained and rest of the site still works". There is a whole load more Monkeys for taking out whole zones and regions and things, but it’s not so much of a software architecture; that’s more like operational disaster recovery testing.
Charles: That’s very interesting. Thank you very much indeed.
Adrian: Thank you.
It would be great to have his spreadsheet comparing the different ways of calling and tracking applications. That XML time compared to the other ways of sharing information between application calls.
I loved the insight about DevOps and support calls.
Interview Notes and topics to research
Micro Services Architecture - Different than SOA.
Inter-Service communication very important to understand and know the timings.
XML - 500 ms; JSON - 50 ms; Google Protocol Buffers ; Thrift; AVRO;
What is AVRO? Is the main process used by NetFlix for communication. Very fast.
Use of MemCache to reduce time delays for frequently required data.
Roy Rapoport talks on Netflix Monitoring.
Circuit Breakers in code - Prevent false alerts about down systems by pointing to the proper service that failed.
Michael Nygard's book ' Release It' has concepts about these points.
Hystix - now released
Ben Christensen - InfoQ has talked about the concepts to have stable code.
Latency Monkey - introduce errors into the system on demand.
- Library for Ribbon & Carry On
- Like Chaos Monkey that causes breaks to see how Prod Recovers.