
Microservices Retrospective – What We Learned (and Didn’t Learn) from Netflix



Adrian Cockcroft does a retrospective on microservices, what they set out to do at Netflix, how it worked out, and how things have subsequently permeated across the industry.


Adrian Cockcroft has had a long career working at the leading edge of technology. He joined Amazon as their VP of Cloud Architecture Strategy in 2016. He was previously a Technology Fellow at Battery Ventures and he helped lead Netflix’s migration to a large scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Cockcroft: I'm Adrian Cockcroft. I'm going to talk to you about a microservices retrospective: what we learned and what we didn't learn from Netflix. I was at Netflix from 2007 to the end of 2013. We're going to look a bit at that, and some of the early slide decks that I ran through at the time. It's a retrospective. I don't really know that much about retrospectives, but a good friend of mine does. I read some of Aino's book, and figured that there's a whole lot of these agile rituals being mentioned in this book, along with retrospectives. It turns out, Netflix was extremely agile, but was not Extreme, and was not Agile. We did extreme and agile with a lowercase e and a lowercase a; we did not have the rituals of full Extreme or full Agile. I don't remember anyone being a scrum master or any of those kinds of things. We're going to talk a fair amount about the Netflix culture. The Netflix culture is nicely documented in this book, "Powerful: Building a Culture of Freedom and Responsibility," by Patty McCord, who ran the HR processes and talent. Basically, she was the CTO for Netflix, the Chief Talent Officer. Amazing woman, you can see some of her talks. I figured that I should adopt some of the terminology anyway. I've got some story points. I'm going to talk about some Netflix culture, pick up some of the slide decks from those days, go over some of the things that were mentioned, and then comment on them: what we did, what we didn't do, what seemed to work, what got left out along the way. I'll talk a bit about why microservices don't work for some people. Then a little bit at the end, just talking about systems thinking and innovation.

Netflix Culture - Seven Aspects of Netflix Culture

Netflix culture, these seven points. The first point is that the culture really matters. They place a high value on the fact that values are what they value. Then, high performance. This is a high-performance culture. It's not trying to build a family, they're trying to build an Olympic winning team, or a league winning team. Pick your favorite sport, how do you build the best team ever for that sport? You go find the best players, and you stack everything in their direction. Those players are the best in the world, so you get out of their way. You do what they tell you to do. There's a bit of coaching, but fundamentally, they have the freedom, but they're also responsible for being the best in the world. What that means is, as a manager, you're giving them context, not control, setting everything up to be successful. Then the teams are highly aligned. We have a single goal to win whatever we're trying to win. Launch in a market or take over something. The teams are loosely coupled, they were individually working on these things, and they have clear APIs where they touch.

If you're building the best team in the world, with the best people in the world, you have to pay top of market. This is something Netflix does. It's always been one of the smaller companies in the Bay Area, and it is very savvy, so it's pretty much the highest paying engineering company in the Bay Area. Then, promotions and development. Netflix management don't have to worry too much about promotions. It's up to you to develop yourself. Part of the adult freedom and responsibility thing is to figure out how to develop yourself. There was no official mentoring program when we were there. You're supposed to be at the top of your game already. You already know how to maintain that. You get to run your own exercise regime, effectively. By having very wide salary bands and very wide career levels, grade levels effectively, you don't need very many promotions. You can move up financially while you're still just a senior engineer. When I was there, there were senior engineers, managers, directors, and VPs. Those were the only titles you could have. We didn't have any junior engineers, and there weren't any grade scales in between. The managers didn't have to worry about it. There's a whole lot of things that are just unusual here. This is part of the system that lets Netflix do things really quickly and effectively.

Netflix Systems Thinking

Netflix is a systems thinking organization. Lots of interesting feedback loops, lots of subtleties in the way they put things together. Optimize for agility. Minimize processes. This ability to evolve extremely rapidly. Unafraid and happy to be the first to do something, pioneers. Netflix was the first company ever to use NGINX. The guy who built NGINX had to found a company so that we could pay him for support. One of the first customers for JFrog's Artifactory. One of the first customers for AppDynamics. One of the first big enterprise customers for AWS and probably some other companies that I've forgotten along the way. By being a pioneer customer, for an interesting startup, you get to really leverage them, they will do all kinds of things for you. You get a great discount, because they love the feedback. It works out. There's a system here, but you have to be good at doing it.

Then, you have to be comfortable with ambiguity; it's a very complex system. There's a lot going on. Things aren't done one way. There's a lot of creativity, but you have to have self-discipline and combine the two together. Then, there's this interesting point that flexibility is more important than efficiency. If you crunch something down and make it super-efficient, it becomes inflexible. You don't want to waste things, but you don't want to squeeze them to be super-efficient either, because you're taking away the flexibility that lets you innovate and try things. If everyone is crazy busy all the time, because you're running a super-efficient system, they don't have time to learn new things. They don't have time to do their best work, basically. Then the final principle: steer pain to where it can be helpful. One of the things that we did, certainly while I was there, was we put engineers on-call, developers on-call. If you were writing code and you shipped code that day, you were on-call to fix it if it broke that night. That caused developers to be much more careful about the code they deployed. They learned a lot of good practices.

Netflix in the Cloud

Here's the first deck. This was from November 2010. It's the deck that I presented at QCon San Francisco. I did do a little meetup before that, where I tried out some of this content. This was really the first time that we presented Netflix to the world. There's a whole bunch of things here. Undifferentiated heavy lifting, wanted to stop doing that. We went through what we were doing. These goals would work for just about any transition that you might want to make, to the cloud or to some new architecture. You want to be faster, more scalable, more available, and more productive.

Old Data Center vs. New Cloud Arch

In the old data center, we had a central SQL database, so it was Oracle. What we wanted to do was move to a distributed key-value NoSQL store and denormalize our system so that everyone could have their own backend. We weren't trying to do queries and all kinds of things that wouldn't scale. In the old system, we had a sticky in-memory session. A customer would connect, and the session cookie would route that customer to the same thread again for the next part of the request. That caused all kinds of issues, and it's not very scalable. What we did was store all of that session information outside the instances in Memcached, so that we could basically have a stateless system. Every time you came in, you could talk to a totally different frontend. It would look in Memcached, find the session information, and respond. We just built it to be that way. In the data center, we had pretty chatty protocols. The machines were all sitting next to each other; there was no reason not to. In the cloud, we actually set up everything so that it was a long way away. We were working in the East Coast region, US-East, and Netflix's engineers were on the West Coast, so everything was long latency anyway. That actually helped people build latency-tolerant protocols, because as they were testing things, they had 100 milliseconds of latency to deal with.
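The stateless-frontend idea above can be sketched in a few lines. This is a minimal illustration, with a plain dict standing in for Memcached; all names are invented for the example, not Netflix's actual code.

```python
# Externalized session state: any frontend can serve any request,
# because session data lives in a shared store, not in instance memory.

SESSION_STORE = {}  # stands in for Memcached: session_id -> session data

def handle_request(session_id, frontend_name):
    """Look up (or create) the session in the shared store and serve it."""
    session = SESSION_STORE.setdefault(session_id, {"requests": 0})
    session["requests"] += 1
    return {"served_by": frontend_name, "requests_so_far": session["requests"]}

# The same customer bounces between frontends with no sticky routing,
# and the session state carries over.
r1 = handle_request("abc123", "frontend-1")
r2 = handle_request("abc123", "frontend-2")
```

The point is that nothing in the frontend depends on seeing the same customer twice, which is what makes aggressive autoscaling and instance replacement safe.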

We had very tangled service interfaces, and we moved to layered ones. In the data center we instrumented code; in the cloud we instrumented service patterns, so when you built a new service, you would just write your code against the templates. All of the monitoring and so on would already be there once you got it running, because we had that in the service pattern. We had very fat, complex objects. I'll talk a bit more about what that looks like. We wanted to move to lightweight serializable objects. Typically, the work of an engineer or developer in the data center was: you built a JAR file and checked it in, and a bit later somebody in QA would check out all the latest JAR files, try to make a build of the monolith, and see if they could get it to work. We did that every two weeks. In the new architecture, you take that JAR file and you wrap a service around it. You give it an API on the front. You wrap it into a machine image and you fire it up. You deploy a whole bunch of them. It's up to you to do your own QA, to develop it, to get it out there, see if it works, and deploy it whenever you feel like it. That's what I mean by components or services.

Tangled Service Interfaces

Looking at these tangled interfaces in the data center: we had SQL queries mixed with business logic, really deep dependencies, and these patterns where you call an object to get to something else through it, trampolining through objects, all kinds of bad stuff. Then, when you called a data provider, we had sideways dependencies. If you pulled in a library that you'd use to do something, the whole universe would come in with it. There's this sense that you add one more dependency to your repo, and you go over the event horizon of a black hole: eventually, the entire repo would turn up in your build. We were trying to cut that dependency.

Untangled Service Interfaces

We untangled them. We built an interface. We sometimes used Spring runtime binding, and a lot of the Netflix patterns ended up in Spring Cloud a bit later on. The service interface is the service. This is something that I think people don't quite get. I'm doing a movie service, and the way that I expose that movie service is I build a movie service JAR file and I put it in Artifactory. Somebody else wants to use the movie service, so they pull that JAR, it's got an API, and they call that API. They don't actually know, or mostly don't care, what's going on behind that. There may be two or three actual microservices behind that library. The only thing you actually need to know is that the first time you deploy it, you have to go figure out the security connections, because otherwise you can't connect to services. You have to modify the security groups to add you as a customer of those services. From a code point of view, the implementation is hidden, and you can mess around locally or remotely. You can iterate. You can evolve the underlying calls on the wire independently of the API that you're giving to the customer that's trying to use this thing. I think this is something that got lost along the way; it's a lesson that wasn't learned. The wire protocol shouldn't be the interface between microservices. There's a good example of this. If you ever use AWS, most of the time you use an AWS SDK for the language you're using. To its users, the AWS SDK effectively is AWS. The fact that there are API calls underneath and stuff on the wire is hidden from you.
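The "interface is the service" idea can be sketched like this. The class and function names here are hypothetical, chosen only to illustrate the shape of the pattern: callers depend on the published client API, and the wire call behind it can be swapped without touching the callers.

```python
# The client library *is* the service interface. Callers program against
# MovieServiceClient; the transport behind it is a hidden detail.

class MovieServiceClient:
    """Published in the artifact repository; this API is the contract."""

    def __init__(self, transport):
        # transport is any callable doing the actual remote work: it can
        # evolve (different wire protocol, extra backend services) freely.
        self._transport = transport

    def get_title(self, movie_id: int) -> str:
        return self._transport(movie_id)

def fake_wire_call(movie_id):
    """Stands in for whatever actually goes over the wire."""
    return {7: "Seven Samurai"}.get(movie_id, "unknown")

client = MovieServiceClient(fake_wire_call)
```

A caller who writes `client.get_title(7)` never learns whether one service or three answered the request, which is exactly the decoupling the talk describes.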

If you do that as one big library, it pulls in everything. You pull in the AWS SDK, and all of a sudden you've got a big pile of stuff. What we actually wanted to do was split that. The first thing was a basic layer, usually built by the people that built the underlying service behind it. We called it the service access library. It's like a device driver. It does serialization and error handling, and it minimizes its sideways dependencies. Logically, it's like a SCSI driver talking to a disk. You never talk directly to a disk, you use the SCSI driver to talk to a disk, or NVMe, whatever the latest SSD interfaces are. You don't actually talk directly to that very often. Most of the time, you use a file system, and file systems are weird and complicated. There are lots of different ones, but they sit on top of disk drivers that are used by all of them. What you've got is a disk driver, which is the service access library, and then you have an extended service library (ESL) above that.

Service Interaction Pattern (Sample Swimlane Diagram)

This is a neat pattern. Here's a relatively complex way of doing something like that. You read this top to bottom and left to right. An application calls the movie library and says, "I have a movie. I want to know something about this movie, because I'm going to present it. I want to get back the boxshot, the synopsis, a bunch of information that I'm going to build into a response to the web page so that I can render that movie." That ESL looks in a local cache and doesn't find it. Then it calls the driver for a cache, which goes to a cache service and says, no, we haven't got it in the cache. Then it calls the actual movie service. This happens transparently: within this ESL, it calls the movie driver, which goes across and calls the movie service, which returns something. Then it saves that for next time, and actually pokes it into the cache from behind, so that the next time somebody looks for it in the cache, they'll find it. There's a working set of common movies which are currently popular, which the movie site is trying to render right now, and this is just a way to maintain that. Then you put it in the local cache and return the result, and the application is done. This is one way of building something, showing the different layering that can be done. Notice the cache access layer is used by the movie service and by the presentation service there, but they have different things above it. You've just got the minimum things you need.
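The swimlane flow above is essentially a cache-aside pattern with two cache tiers. A minimal sketch, with dicts standing in for the local cache, the shared cache service, and the movie service (all names and the record format are invented for illustration):

```python
# Cache-aside with two tiers: local cache -> cache service -> movie service,
# populating the caches on the way back so the next lookup hits.

local_cache = {}
cache_service = {}   # shared cache tier (e.g. a Memcached-like service)
movie_service = {101: {"boxshot": "box101.jpg", "synopsis": "A classic."}}

def get_movie(movie_id):
    if movie_id in local_cache:            # 1. check the local cache
        return local_cache[movie_id]
    if movie_id in cache_service:          # 2. check the shared cache service
        data = cache_service[movie_id]
    else:
        data = movie_service[movie_id]     # 3. fall through to the real service
        cache_service[movie_id] = data     #    poke it into the cache from behind
    local_cache[movie_id] = data           # remember it locally and return
    return data

first = get_movie(101)    # misses both caches, hits the movie service
second = get_movie(101)   # served from the local cache
```

After the first call, the working set of popular movies accumulates in the shared cache, which is the behavior the diagram is maintaining.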

Boundary Interfaces

One of the things this let us do was move forward, when external dependencies didn't exist. The first time we tried to move to cloud, there was no identity system. We hadn't figured out how to do identity yet, so we built a fake driver that just had in it the actual developers that were writing code, so about 20 people that were working on this at the time. We built an identity service, which would only return those 20 people. We were the only 20 people that were going to log into the cloud and try and do anything. Then we were able to build all the other layers above it. You can unblock and decouple things by basically stubbing these services and then putting them in later. This let us build this much more interesting interface on top early.
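Stubbing a not-yet-built dependency behind the same interface can be sketched like this. The names and the roughly 20-user list are illustrative, standing in for the fake identity driver described above:

```python
# A stub implementation behind the real interface: layers above code
# against IdentityService, and the real implementation slots in later
# with no caller changes.

class IdentityService:
    """The interface the eventual real service will implement."""
    def lookup(self, user):
        raise NotImplementedError

class StubIdentityService(IdentityService):
    # In the story above this was the ~20 developers working on the system.
    KNOWN_USERS = {"adrian", "dev2", "dev3"}

    def lookup(self, user):
        return {"user": user, "valid": user in self.KNOWN_USERS}

identity = StubIdentityService()
```

This is what unblocks the teams building the layers above: they depend on `IdentityService`, not on the stub.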

One Object That Does Everything, & An Interface for Each Component

In the data center, we had the movie object and the customer object, and they did everything; you used one or the other, or both, to do anything. They were horribly complex. As they'd been developed, they got super tangled. A lot of the problems we had with our 100-person development team were that just about anything you did touched those objects, and then you'd be trying to resolve all the different people that had touched the movie object in this biweekly build we were putting together, really problematic. We wanted to do something different. What we did was basically say we're going to have video and visitor, so your code will not work if you ship the code that talks to movies and customers in the data center. You turn up in the cloud environment: no, sorry, we have a totally different object model. The video is just an ID. It just contains a number, a large integer. The visitor is just an integer. If you want to serialize 100 movies and send it to somebody, it's a list of 100 integers. Its type is video, so it's an array of 100 videos. When you get to the other end, you say, ok, I have a video, but I want to present this. You take that video and just say, as a PresentationVideo, and the system would basically fill in all of that facet: all of the information you'd need at the presentation layer, boxshot, synopsis, and whatever. Then the MerchableVideo has a whole bunch of facets to do with the personalization algorithm; it needs a whole bunch of information that presentation doesn't need. You end up with multiple different views of the video, but when you're passing it around the system, you're just using this integer. It was fully type safe and managed and built out in Java. A lot of work went into that. I think that got lost along the way as well. I'm not sure if the Netflix code uses it still. It was a pretty neat idea for making it clean. The response to this talk in 2010 was a mixture of incomprehension and confusion.
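The "video is just an integer" idea can be sketched as a lightweight typed ID with facet views filled in on demand. The original was type-safe Java; this is a loose Python analogue, and every class and field name here is invented for illustration:

```python
# Lightweight IDs travel through the system; facet views (presentation,
# merchandising, ...) are populated only where they're needed.

class Video(int):
    """A video is just a number; a list of 100 videos is 100 integers."""

class PresentationVideo:
    """The presentation facet: what a web page needs to render the video."""
    def __init__(self, video: Video, catalog):
        facet = catalog[int(video)]["presentation"]
        self.boxshot = facet["boxshot"]
        self.synopsis = facet["synopsis"]

catalog = {
    42: {"presentation": {"boxshot": "box42.jpg", "synopsis": "A classic."}},
}

v = Video(42)                        # what gets serialized and passed around
pv = PresentationVideo(v, catalog)   # facet filled in at the point of use
```

Serializing a hundred `Video`s is serializing a hundred integers; the fat, tangled shared object never has to cross the wire.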
Most people thought we were crazy. Somebody actually said, you're going to be back in your data centers when this fails and blows up in your face.

Replacing Data Center Oracle with Global Apache Cassandra on AWS

Less than a year later, I did another talk. This was at the Cassandra Summit, "Replacing Data Center Oracle with Cassandra on AWS." This is one of my favorite slides, all the things we didn't do. We didn't wait. We didn't get stuck with the wrong config. We didn't file tickets. We didn't wait to ask permission. We didn't run out of space and power. We didn't do capacity planning. We didn't have meetings with IT. All these things that other people were waiting for, we just didn't do, because everything was API driven. Developers were just doing it self-service. The problem we had was data centers: we couldn't build them fast enough for what we were doing. We were transitioning from Netflix's DVD shipping model, which had a fairly small footprint, to a streaming model, which was an exponentially growing footprint. This is what it looked like: nice exponential growth over just about one year. At the beginning of January 2010, we said we needed to be out of the data center before the end of that year. We weren't going to build a new data center, so we had to do it. We predicted that Netflix would have had massive outages at the end of the year if we hadn't done the move to cloud. This was the API call rate: 37x growth. This is what happens when customers, instead of having a DVD shipped and poking at the API once or twice a week, are streaming continuously. And the proportion of customers that were streaming was rapidly increasing.

We got high availability with Cassandra by using availability zones, which are actually 10 to 100 kilometers apart. I've got three copies of my data that are separated by a good number of kilometers. It all works well. Then we also used the same model to go around the world. We had this nice model, which basically said, instead of being stuck in one data center, we've unblocked ourselves with this NoSQL, eventually consistent Cassandra system, so we can be anywhere in the world. This is the Australian map of the world, which puts Hawaii and Australia in the center, because what you really need to know is how far you are away from Hawaii so you can go there on vacation.

The other thing we were doing was Chaos Monkey. This was partly to make sure systems were resilient, but the real thing it was doing was acting as a design control. Because we were autoscaling everything, you want to be able to scale down. You can always scale up by just adding instances, but if you're not careful when you scale down or release a new version, you can fail requests. We didn't want to do that. We wanted to be able to scale down super aggressively. We wanted to also be able to roll out new versions of something, switch traffic to it, turn off the old one, and just have the system retry and keep going and not notice. Your customers wouldn't notice. Maybe an individual request would be half a second slower because it was retrying something. That's not going to be a big deal. If you really want to autoscale down, you should be running a Chaos Monkey, because that's really just proving that you can. The response to this was that Netflix was a unicorn, and it might work for us. They admitted it was working for us, but said it wasn't relevant to anybody else. No one else was going to do anything like this; we were just some kind of weird thing.
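The reason killing an instance is invisible to customers is the retry path: callers just move on to a surviving instance. A toy sketch of that idea (the instance names and routing pool are invented, and real clients would add timeouts and backoff):

```python
# Retry across a pool of instances: if Chaos Monkey (or a scale-down)
# has killed one, the caller falls through to the next alive instance.

def call_with_retry(pool, alive):
    """Try each instance in routing order until one answers."""
    for target in pool:
        if target in alive:
            return f"served by {target}"
    raise RuntimeError("no instance answered")

pool = ["i-2", "i-1", "i-3"]   # routing order for this request
alive = {"i-1", "i-3"}          # Chaos Monkey terminated i-2

result = call_with_retry(pool, alive)
```

The first attempt fails and the retry succeeds; the only customer-visible effect is a slightly slower request, which is the property Chaos Monkey continuously proves.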

Cloud Native Applications: What Changed?

About 18 months later, at the beginning of 2013, I did another talk. By this time we'd launched our open source program. This is one of the talks from one of the meetups where we launched the open source program and a whole bunch of other stuff. This is also where I started using the words cloud native and microservices, one of the early uses of those terms. Cloud native applications, what changed? February 2013. In 2006, EC2 launched. What people did with it was forklift old code to the cloud, because it was faster than asking somebody in IT for an instance. What we developed were these new anti-fragile patterns: microservices, chaos engines, and composing highly available systems from ephemeral components. Then we used open source platforms, so these techniques spread rapidly. This was a totally new set of development patterns and tools. Enterprises were just baffled by it. The startups and the scaled cloud native companies, the Ubers, Lyfts, Airbnbs, and Netflixes, were all doing this. The enterprises weren't at this point. Capital One was probably one of the first enterprises to really dive in and try to figure this out. Nike was starting to try to figure it out. A few people were, but it was pretty slow in terms of enterprise adoption.

Typically, the people using cloud just stuck their app servers in the cloud and kept the data in a data center somewhere else. This was the way you were supposed to use cloud, because people didn't trust putting data in the cloud. The way we did it with Cassandra was to spread everything over three zones. If we lost an availability zone, everything just kept working. We tested this and proved that it just kept working. That was a cloud native architecture. This was a way of running a system which you never saw in the data center: master copies of the data are cloud resident, dynamically provisioned, ephemeral. For me, cloud native meant that the application was built to only run in the cloud and was optimized for that. I think the CNCF co-opted the term, so cloud native currently means the Cloud Native Computing Foundation; it means Kubernetes or something nowadays. I think that's much more about operations. It isn't about the way you built your application; it's more about how you're operating it.

Looking at availability, one of the things with availability is not just, does it stay up. First, I need a machine: a few minutes later, I've got it in the cloud, whereas in the data center you might spend months waiting for it. The first thing about availability is, I got it. Then I can have it running in all different places, so it's available. I can have machines in Brazil or South Africa. I think AWS is about to launch in New Zealand and Israel and elsewhere, lots of different places around the world. That's another point of availability: I am available in lots of different places. Those places can be far enough apart that if something breaks in one of them, we can use another one. That was an interesting point. Then, from a development point of view, we had these nice anti-fragile development patterns, and the Hystrix circuit breaker was a pretty influential, successful project. This is the whole diagram of how it works. Basically, you call in to the frontend, and it's sitting there building this nice isolation layer in front of backend services, with a bunch of observability. You can see what's happening. You can monitor it. You can actually see, this backend service has stopped working, and we've stopped trying to call it, and you should do something about it. In the meantime, we've got a fallback, which just returns some static content, or that part of the response, part of a web page, doesn't appear for a while, because that backend is not behaving itself. Maybe the history service has gone down, so your rental history isn't appearing for now. You can do everything else. You just can't see that one part of the system.
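The circuit breaker behavior described above can be sketched minimally: after enough consecutive failures, stop calling the backend and serve the fallback instead. This is a toy with invented names, not the real Hystrix semantics (Hystrix adds thread pools, rolling windows, and half-open probing):

```python
# Minimal Hystrix-style circuit breaker: consecutive failures trip the
# circuit, after which the fallback is served without touching the backend.

class CircuitBreaker:
    def __init__(self, backend, fallback, threshold=3):
        self.backend, self.fallback = backend, fallback
        self.threshold, self.failures = threshold, 0

    def call(self, *args):
        if self.failures >= self.threshold:   # circuit open: don't even try
            return self.fallback(*args)
        try:
            result = self.backend(*args)
            self.failures = 0                 # a success closes the circuit
            return result
        except Exception:
            self.failures += 1                # count the failure...
            return self.fallback(*args)       # ...but still answer the caller

def broken_history_service(user):
    raise ConnectionError("backend down")

def static_fallback(user):
    return []   # rental history just doesn't appear for a while

history = CircuitBreaker(broken_history_service, static_fallback)
responses = [history.call("user-1") for _ in range(5)]
```

Every caller gets an answer (the empty history) instead of an error, and once the circuit opens, the broken backend stops being hammered.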

We had this nice cartoon we came up with: the strange world of the future, stranded without video, no way to fill their empty hours, victims of the cloud of broken streams. One of the things with Netflix was, yes, it's not life-critical, but if you're using it as a digital babysitter, people expect their TV sets to just work. They're supposed to just work. They're not supposed to only work if some random thing on the other side of the world happens to be up right now. We wanted to make it feel like it just worked. You should just be watching a TV show and not be thinking about the fact that it's rebuffering because somewhere on the other side of the world something's happening. You should just enjoy it.

We talked a bit about configuration management. Data center configuration management databases are woeful. They're an example of a never consistent database. For anything in a CMDB, you can guarantee that the actual state of the data center is different to what's currently in the CMDB. That's about the only statement you can make. Cloud is a model driven architecture. If you query the cloud, you're actually finding what is really there. It's dependably complete. You can actually do things properly now that you could not do in a data center. This is a really significant thing. We built something called Edda, named for the stories of the gods; we had a lot of Norse god naming schemes going on at that time. We didn't just put all of the cloud information that we scraped out of AWS in there, we put in all the metadata about our services: when it came up, what it was doing, what version it was, and when it went away, instance by instance. Also, stuff about the request flow going back and forth between things, which we were collecting with AppDynamics, we put that in there. The monkeys could look at this, but fundamentally, you had the ability to query the entire history of your infrastructure. You could say, something broke, and here's an IP address and a log file. Which instance was that? I can go and find every instance that's ever had that IP address. I can go and find out which one it was, and when it was here, all those kinds of things. Or changes to security groups: if you're doing an audit log for security purposes, you can go see all these kinds of things. This is super powerful. I'm not sure if people really build this into their architectures nowadays. We did all kinds of things sitting on top of this. This is a lesson I'm not sure really got learned, but it's super powerful.
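The Edda idea reduces to an append-only history of infrastructure metadata that you can query after the fact. A tiny sketch, with an invented record format, of the "which instances ever had this IP?" query described above:

```python
# Append-only history of instance metadata, queryable after the fact,
# in the spirit of Edda (record format invented for illustration).

records = []

def observe(instance_id, ip, version, timestamp):
    """Record a snapshot of an instance's metadata; never overwrite."""
    records.append({"id": instance_id, "ip": ip,
                    "version": version, "seen": timestamp})

def instances_with_ip(ip):
    """Every instance that has ever held this IP address, in order."""
    return sorted({r["id"] for r in records if r["ip"] == ip})

observe("i-aaa", "10.0.0.5", "v1", 100)
observe("i-aaa", "10.0.0.5", "v2", 200)
observe("i-bbb", "10.0.0.5", "v1", 300)  # IP reused by a later instance
```

Because nothing is ever overwritten, the IP-reuse case that defeats a conventional CMDB (two instances, same address, different times) is answerable here.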

There's another thing we did: we spent about an hour running some benchmark configurations. We had Cassandra running at 48 instances, and we were trying to figure out, is there a limit? What happens? We just kept doubling it up. After about an hour, we gave up and said, we hit a million requests a second, more than a million, that's fine. That's plenty. It's way more than we're going to need, and it looks like it's linear. Then we shut it down again. This benchmark config only existed for about an hour. This is something that I find really interesting, because this is a frictionless infrastructure experiment; you can go and try stuff out. I keep running into people whose cloud environment is so locked down that they can't just go and try something. That's a sad thing. This benchmark was used by DataStax. It was advertised all over the internet. At the time it just kept popping up when I was browsing around web pages, because they were targeting me with it.

Netflix Open Source Program

Netflix open source. This is interesting because open source has really become a common way to build a technology brand and influence the industry. Netflix wasn't the first company to do this, but we did it very intentionally, and I think quite successfully. It was a project I kicked off, got management support for, and drove. We had contests. We had logos. We had meetups, some of the best attended meetups: hundreds of people would turn up at Netflix to hear about the open source program. It worked extremely well for us. I think other people then copied that. Capital One has a very good open source program; lots of other people do. This is an interesting approach. I then moved to Amazon later on, and reused some of the things I'd learned from running this program there as well. Open source, why were we doing it? As the platform evolved, it went from bleeding edge innovation to common patterns. Then, what we really wanted to do was get to shared patterns, because if you get out ahead of the market, and then the market does something different, you've got to keep moving back to the mainstream. Netflix did end up off the mainstream a little bit, I think, in a few cases. This was a serious attempt to keep Netflix aligned with the rest of the industry by sharing what we were doing. Instead of running our wagon train across the prairie, just trying to dodge between different hillocks, you lay down a railway line. This is why California came into being: there was a railway line from Chicago to Los Angeles, and they grew fruit in Los Angeles and shipped it back to Chicago. Lots of people in Chicago said, it's warm there, and I can go work on a fruit farm. That was what happened.

These are the goals for the Netflix open source program. Try to establish what we were doing as best practice and standards. Hire, retain, and engage top engineers; that worked pretty well. Build a technology brand, which has worked stupendously well; it's why I'm doing this talk today. Benefit from a shared ecosystem, and get contributions from outside that helped Netflix. There are some pieces of ZooKeeper tooling that Netflix built, a thing called Curator, that is now a top-level Apache project. Netflix originally built it. The engineer eventually left Netflix and has carried on working on it, and Netflix just gets to use it. They don't have to build this code anymore, because everyone else decided it was good and maintains it for them.

What we were trying to do was add more use cases and more features, and fix a few things. The thing that really didn't work out here, that we learned quite a lot from, was that it was too hard to deploy. There was a key internal component, which was the machine image. What we were running internally was a really hacked up version of CentOS. It was so hacked up, we didn't want to share it. We were trying to move to a standardized version of Ubuntu at the same time I was trying to get everything out. Because we didn't have the final version, and we didn't have engineering resources internally to just build a quick version that we could export, this whole thing was much too hard to deploy, for much too long. Eventually, this got fixed after about a year, but it was too difficult. If you're going to build something and open source it, think about the deployment, think about supporting it, and you really need a little bit of engineering resources to get good adoption going. I just love this acronym at the end: mean time between idea and making stuff happen. You want to have a low MTBIAMSH.

Docker Containers, and Microservices Patterns

I left Netflix in January 2014 and joined Battery Ventures. One of the reasons I joined them was because Netflix had been using so much startup stuff that I got to know all the VCs, so I got to go hang out with the VCs. What actually happened in 2014 was that Docker containers started to take off as a pattern, one which implemented a lot of the ideas that Netflix had built. To give a sense of the timing: the deck I just showed you was using the word microservices back when nobody else was. Then it just became more popular over time. At the end of my time at Battery, just before I joined AWS, I did a mega deck with over 300 slides, a daylong workshop of everything I knew about microservices, all carefully arranged into different chunks. This is excerpts from that deck. As I mentioned earlier, they thought we were crazy in '09. They said it wouldn't work in 2010. We were unicorns in 2011. In 2012, they said, ok, we would like to do it, but we can't figure out how to get there. In 2013 we'd shared enough OSS code that quite a few people were actually trying to use it and trying to copy us. I think that was pretty influential.

Lessons Learned at Netflix, and the Innovation Loop

What did I learn at Netflix? Speed wins. Take friction out of product development. High trust, low process. No hand-offs between teams; APIs between teams. The culture. Undifferentiated heavy lifting. Simple patterns with lots of tooling. Then, like I said, for benchmarking, self-service cloud makes impossible things instant, like a million writes per second benchmark that was done in an hour. This is the innovation loop that I was pushing at the time: you observe, orient, decide, act. You see a competitor do something, or a customer pain point, and you do some analysis over big data. You plan a response. You just do it. You share your plans, you don't ask for permission. You share your plans so that other people can comment on them, so you get feedback on them, but you don't ask permission. You decide what you want to do. You build it out as incremental features. You do an automatic deploy and launch an A/B test. Then you measure whether the customers liked it, and you keep going around that loop, maybe once a week, or once every few weeks, for a new feature or a new idea.

Non-Destructive Production Updates

One of the things here is non-destructive production updates. You can always launch a new version of the code in production safely, and you can test it in production safely. People couldn't figure out how we did that, so I'm going to explain it. It's immutable code. The old code is still in service; you launch the new version alongside it as a new service group. The production traffic keeps talking to the old service group until you tell it to route some of the traffic to the new one. That means I can launch code without actually sending production customer traffic to it. I can do overrides at a very detailed level, so that I can create a test cell, an environment where only the people in that test cell get the new experience through the thing I just deployed. The first people in that test cell are developers and QA people. We'd just go use Netflix, get the new experience, and see if it works. When we finally decide it works well enough, we add a cohort of users to it, make sure that works, and eventually add everybody else. That's why you don't have to wait for anyone else. You don't need permission. You're testing in production, you're just putting it out, and it's completely safe to do that. This is really critical. At Netflix, we were accused of ruining the cloud, because instead of using Chef and Puppet to update our code in place, we were starting these new instances. We had a little fun with that. Ultimately, this is the pattern that Docker and containers went with. This idea of version-aware routing between services is another idea that seemed to get lost along the way. There are lots of microservice libraries for calling between services that don't have this version-aware routing built in.
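That routing-override mechanism can be sketched roughly like this. This is a minimal illustration, not Netflix's actual code: the class and group names are hypothetical, and a real router would pin each user to a cohort with a sticky, consistent assignment rather than a random draw per request.

```python
import random

# Sketch of version-aware routing: two immutable service groups are
# registered side by side, and per-user overrides plus a traffic
# fraction decide which group receives a given request.
class VersionAwareRouter:
    def __init__(self, old_group, new_group):
        self.old_group = old_group        # current production code
        self.new_group = new_group        # newly launched code, no traffic yet
        self.overrides = set()            # users forced onto the new group
        self.new_traffic_fraction = 0.0   # gradual cohort ramp-up

    def add_override(self, user_id):
        """Put a developer/QA user into the test cell for the new version."""
        self.overrides.add(user_id)

    def ramp(self, fraction):
        """Send a cohort of production traffic to the new group."""
        self.new_traffic_fraction = fraction

    def route(self, user_id):
        if user_id in self.overrides:
            return self.new_group
        if random.random() < self.new_traffic_fraction:
            return self.new_group
        return self.old_group

router = VersionAwareRouter("api-v41", "api-v42")
router.add_override("qa-alice")
assert router.route("qa-alice") == "api-v42"    # test cell sees new code
assert router.route("customer-1") == "api-v41"  # production still on old code
```

Because the old group stays running and untouched, rolling back is just setting the traffic fraction back to zero.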


One of the things we were doing was cutting the cost and the size and the risk of change, and that sped everything up. Like I just said, I can push stuff to production and test in production completely safely, so I can go faster. Chaos engineering means I can be sure that this is going to work, so I can go faster because of chaos engineering. This is the definition I had then of microservices: a loosely coupled, service-oriented architecture with bounded contexts. If it isn't loosely coupled, then you can't independently deploy. If you don't have bounded contexts, then you have to know too much about what's around you. Part of the productivity here is that I only need to understand this one service and what it does, and I can successfully contribute code to the entire system. If you're in a monolith, you need to understand pretty much the whole monolith to safely do anything in it.

If you're finding that you're overcoupled, it's because you didn't create those bounded contexts. We used inverted Conway's Law: we set up the architecture we wanted by creating groups that were that shape. We typically tried to have a microservice do one thing, one verb, so that you could say, this service does 100 hits per second, or 1000 hits per second, of that one thing. It wasn't doing a bunch of different things. That makes it much easier to do capacity planning and testing. It's like, I've got a new version of the thing that does more hits per second than the old one. That's it, much easier to test.

Mostly there was one developer producing each service. There'd be another developer buddied with them, code reviewing, and then either of them could support it in terms of being on-call. In order to get off-call, you have to have somebody else that understands your code. You'd code review with some other people in your team, so that you could go off-call and they could support whatever you were currently doing. Every microservice was a build. We deployed in a container, which was initially Tomcat in a machine image, or Python or whatever in a machine image. Docker came along, and that's a fine pattern; Docker copied this model.

It's all cattle, not pets. The data layer is replicated ephemeral instances. The Cassandra instances were also ephemeral, but there were enough of them that you never lost any data. There were always at least three copies of all the data. In fact, when we went multi-region, there were nine copies of the data. If you lost a machine, it would rebuild itself from all the others.

A whole bunch of books here. I want to point out "Irresistible APIs." Kirsten Hunter was one of the API developers at Netflix; I worked with her and I wrote the foreword for that book. Think about writing APIs that are nice to use. A bunch of other interesting books here.

What's Missing? (Timeouts and Retries)

What's missing? A whole bunch of advanced topics here. I'm going to talk about timeouts and retries. This is something else that got lost along the way. First of all, we're running on TCP/IP, most of the time. TCP is connection oriented. The first time you try to make a call to someone, it sets up a connection, and then it sends the request. If it's a secure connection, it does an extra handshake first. Most of the time, at the higher level, we're just trying to call another service, so this gets lost in translation somewhere. Be really careful when thinking about connection timeout, which is one hop and relatively short, versus request timeout, which is end-to-end through multiple hops. These are usually set up incorrectly. They're usually global defaults. A lot of the time systems collapse with retry storms, just because the timeouts aren't set up properly. Most people set timeouts that are far too long, and have far too many retries. I like to cut this back a lot. What you end up with otherwise is services doing work that can never be used, and retry storms.
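To illustrate that these really are two separate knobs, here is a sketch using Python's standard library. The timeout values are made up, and a real service would use an HTTP client rather than raw sockets, but the distinction is the same:

```python
import socket

# Connection timeout (one hop, short) vs. request timeout (the whole
# end-to-end exchange) are distinct settings; a common mistake is one
# long global default for both. Values here are illustrative only.
def call_service(host, port, payload,
                 connect_timeout=0.25,   # one network hop: keep it short
                 request_timeout=2.0):   # full request/response exchange
    # Connection setup gets its own short timeout...
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # ...then a separate budget applies to the actual request.
        sock.settimeout(request_timeout)
        sock.sendall(payload)
        return sock.recv(65536)
    finally:
        sock.close()
```

With one global default, a healthy-but-distant service and an unreachable one look the same; splitting the two lets connection failures surface in a quarter of a second instead of two.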

I'm going to try to illustrate this with some clunky diagrams that I built. Here's a system that's working: it makes a call, it gets a response. Now suppose a service deep in the system breaks, and everything has 2 retries and 2-second timeouts. The edge service is not going to respond for 6 seconds, because it makes 3 attempts of 2 seconds each. The service in the middle is going to be managing 9 outbound requests: 9 threads running there instead of one. It's going to run out of memory. It's going to run out of threads. That service in the middle, which was perfectly fine, is going to keel over just because a downstream service failed, and then the edge service is going to keel over. You end up with this cascading problem where one failed service deep in the system causes the entire system to do work amplification, and to collapse. The only thing you need to do to get this to happen is have the same timeout everywhere and more than one retry. Even if you only partially fail, failing the first time and responding the second time, you're still causing extra work in that service in the middle that is never actually going to be used, because the edge has already given up and called you again.
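The amplification in that scenario is just exponentiation. A tiny sketch of the arithmetic (my own illustration, not from the original deck):

```python
# Retry-storm arithmetic: with the same retry policy at every tier,
# each hop multiplies the number of attempts seen downstream.
def attempts_at_depth(retries_per_tier: int, depth: int) -> int:
    """How many attempts arrive `depth` hops down for one edge request."""
    return (retries_per_tier + 1) ** depth

def edge_stall_seconds(retries: int, timeout_s: float) -> float:
    """How long the edge blocks before finally giving up."""
    return (retries + 1) * timeout_s

# 2 retries everywhere means 3 attempts per call:
assert attempts_at_depth(2, 1) == 3       # the middle tier receives 3 calls
assert attempts_at_depth(2, 2) == 9       # and makes 9 outbound requests
assert attempts_at_depth(2, 3) == 27      # amplification compounds with depth
assert edge_stall_seconds(2, 2.0) == 6.0  # the edge is silent for 6 seconds
```

With three or four tiers, a single broken leaf service can multiply one user request into dozens of doomed attempts.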

How do you fix this? You want a timeout budget that cascades. You want a large timeout at the edge, and you want the timeouts to telescope: all of the downstream timeouts fit inside it, like the pieces of a telescope, all the way down, however deep it goes. If you have a fairly simple architecture and you know what the call sequence is, you can do it statically, but mostly you need a dynamic budget or deadline that you pass along with the request. Like, ok, I'm going to give you a 10-second timeout from the edge; that means 10 seconds in the future is my give-up time. I'm going to give up at this point in the future. Then somewhere deep in the system you can say: I'm never going to be able to respond in time, we're past the deadline, let's just throw this work away. That kind of approach. The other thing is: don't ask the same instance the same thing, retry on a different connection. Here's a better way of doing it. My edge service has a 3-second timeout. I've got a service in the middle with 1 second and one retry. If there's a failed service, it'll try twice and respond in 2 seconds. That's before the edge times out. You've got, effectively, a nice fast-fail system, and it's behaving itself.
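A deadline that cascades with the request can be sketched like this. This is a hypothetical minimal version, not any particular framework's API; real RPC systems carry the deadline in request metadata:

```python
import time

class DeadlineExceeded(Exception):
    """Raised when a request's give-up time has already passed."""

def make_deadline(budget_s):
    # The edge converts its timeout budget into an absolute give-up
    # time that travels with the request through every hop.
    return time.monotonic() + budget_s

def remaining(deadline):
    return deadline - time.monotonic()

def handle(deadline, do_work, max_hop_timeout=1.0):
    # Deep in the system: if we're already past the deadline, shed the
    # request instead of doing work whose result can never be used.
    budget = remaining(deadline)
    if budget <= 0:
        raise DeadlineExceeded("past the give-up time, shedding request")
    # Telescope: this hop may use at most the smaller of its own limit
    # and whatever budget is left from the edge.
    return do_work(min(budget, max_hop_timeout))
```

An edge calling `handle(make_deadline(10.0), ...)` gives the whole chain 10 seconds; a tier that receives an already-expired deadline fails fast instead of piling on more retries.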

Why shouldn't I retry the same instance? Because that service is probably doing a garbage collection or something, and it's just stopped talking to everybody for 10 seconds, or a minute. There's a whole bunch of other failure modes like that. What I want to do is call on a different connection that goes to a different instance. My first request will fail, but my second one will tend to succeed. We built this in at Netflix. Again, I don't know why everybody doesn't do this. It sounds so obvious, and it's so resilient. I think this is another lesson that wasn't learned somewhere along the way. I still see lots of frameworks, lots of environments, having retry storms. We fixed it at Netflix more than a decade ago, and I wrote these slides in 2015 or so. Come on, get with the program.

There's a good book on systems thinking by Jamshid Gharajedaghi: "We see the world as increasingly more complex and chaotic, because we use inadequate concepts to explain it. When you understand something, you no longer see it as chaotic or complex." The best example of this is driving a car through a city. If you've never seen a car, you've never seen a road, you have no idea what's going on: it's chaotic, it's complex, it'll freak you out completely. But pretty much any functioning adult is able to do it, so it's not seen as chaotic or complex.
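Going back to the retry advice for a moment, the retry-on-a-different-instance idea can be sketched like this. It's a toy illustration with made-up node names and a made-up `send` callback, not production code:

```python
# Retry against a different instance rather than hammering the one
# that just failed (it may be stuck in a long GC pause).
def call_with_failover(instances, request, send):
    """Try each instance at most once; the first success wins."""
    last_error = None
    for instance in instances:
        try:
            return send(instance, request)
        except ConnectionError as err:
            last_error = err   # this instance looks unhealthy, move on
    raise last_error

# One stuck instance out of two: the first attempt fails, and the
# retry on a different connection succeeds.
def fake_send(instance, request):
    if instance == "node-a":   # pretend node-a is mid garbage collection
        raise ConnectionError("node-a unresponsive")
    return request + " handled by " + instance

assert call_with_failover(["node-a", "node-b"], "req", fake_send) == \
    "req handled by node-b"
```

Retrying the same stuck instance would have burned the whole timeout budget for nothing; trying a sibling turns a stall into a fast recovery.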

Microservices Is Many Tactics

We tried microservices, and it didn't work. There are all kinds of people complaining that maybe we should stop doing microservices because it's not working for us. Why isn't it working for you? Maybe you aren't a unicorn; maybe you're just a donkey with a carrot tied to its forehead, a wannabe unicorn. Microservices is one of many tactics that reinforce each other and together form a system that is fast, resilient, and scalable. It isn't just microservices, it's many tactics. Let's just look at a few of these. You're too slow? You're probably using chatty, inefficient protocols, or your system boundaries aren't right. If everything needs to happen in milliseconds, you build one big service that does it. That's how ad servers work; they have to respond super quick. Yes, a monolith, because you're responding in milliseconds and you have to. Or you're doing high frequency trading: sure, build that monolithically, and be really careful how you build it. If you're building a web page for somebody on the other side of the world, you don't need to put everything in one instance. You're more interested there in fast development practices and the ability to innovate more quickly.

If you're not resilient, you probably haven't fixed retry storms. Or maybe your developers are writing unreliable code, and you should put them on-call. Or you're not using chaos engineering, or you're not leveraging AZs properly. You're not scalable? Maybe you didn't denormalize your data model, and you've got some SQL queries sitting in the backend that keep keeling over. Maybe you haven't tuned the code for vertical scaling, and your code just is not running efficiently on the instances at all. That leads into overall efficiency, cost efficiency in particular. You should be autoscaling so deeply that the average utilization across all of your instances is 40% to 50%. What I see is single digit averages in almost all the cloud environments I look at. The difference between 10% and 40% is 4x. That means you're spending four times as much as you should. Forget all the buying of reservations and savings plans and things: just get your utilization to double and your cloud bill will halve. You get there by autoscaling super deeply, turning things off, maybe using serverless so you're 100% utilized for the serverless functions. There's lots you can do. This is, I think, the biggest cost lever that I don't see people using nowadays. If you left out most of the system, it's not surprising it's not going to work for you.
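The utilization arithmetic is worth spelling out. For a fixed amount of useful work, spend scales inversely with average utilization:

```python
# Cost-lever arithmetic: for the same useful work, cloud spend is
# inversely proportional to average instance utilization.
def relative_spend(current_util, target_util):
    """Fraction of today's bill you'd pay at the target utilization."""
    return current_util / target_util

assert relative_spend(0.10, 0.40) == 0.25  # 10% -> 40% cuts the bill 4x
assert relative_spend(0.20, 0.40) == 0.5   # double utilization, halve the bill
```

This is a first-order model (it ignores instance-size granularity and baseline capacity you must keep warm), but it captures why doubling utilization halves the bill.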

Systems Thinking at Netflix and Beyond

Systems thinking at Netflix and beyond. This is from a talk I gave at a French conference that was held at the Carrousel du Louvre, looking from a systems thinking approach at what Netflix was doing. It was driving higher talent density, with a culture optimized for fully formed adults. Netflix had no graduate hires and no interns. You're building an Olympic team. You wouldn't have high school athletes on an Olympic team unless they were absolutely operating at that Olympic level. You've got to go through the leagues and get better if you want to play for the top team in the top league. That's the attitude. Netflix was, yes, definitely and explicitly skimming the top talent off everybody else. You spend 5 or 10 years at Google or Facebook or somewhere, and then come work at Netflix. It's not a family. This became a pretty famous quote: "We can't copy Netflix, because it has all these superstar engineers, those Olympic athletes; we don't have those people." The guy that told me this, I looked at his badge, and we had just hired somebody from them. I said, "We hired them from you, but we got out of their way." We got vastly more out of the engineers that we hired than the people that previously employed them. You can ask, is it a 10x engineer, or whatever? A lot of the 10x engineer is getting out of the way of somebody that was being systematically slowed down and made unproductive in their previous place. Whereas you get to Netflix, and, Ben Christensen said this, he started doing something and we were just egging him on. We just pushed him, and he grew wings and flew. This is the guy that built Hystrix. He actually felt that instead of being pushed back, he was being pulled forward. That was probably the most distilled quote I've seen of that.

It's a little complicated here. Purposeful systems, representing the systems view of development: we assume plurality in all three dimensions, function, structure, and process. What he was really saying here is that there are multiple ways to do things. There are multiple ways to look at functions, multiple structures, multiple processes. If you lock everything down to one function, one structure, one process, you're not actually representing a useful system that is fit for purpose. There's another way of looking at this, which is W. Ross Ashby's Law of Requisite Variety: if you're trying to manage something, the thing that's managing it has to have at least as much variety as the thing it's trying to manage. One of the problems here is that people tend to simplify, and they do efficiency drives because everything looks too complicated. You get a new CIO in, and they try to simplify everything, and they kill off everything, and they make it so it doesn't work anymore, because the complexity of the business was being managed by that complexity. The ability to evolve means that things are ambiguous and unstable. It's chaos. It's complexity. If you can manage it, then you can go super-fast.

How does that work? This is another quote from the Netflix culture deck: "If you want to build a ship, don't drum up the people to gather wood, divide the work, and give orders. Instead, teach them to yearn for the vast and endless sea." The point here is: we want to build TV for the world, we're building the first global TV station. Get inspired by that, and go figure out how to do it, rather than being given detailed, step-by-step orders. Then, this is like the fourth version of Conway's Law, from 1968. This talk had excerpts from all these decks. You can go find them; they're either on SlideShare or on my GitHub site. I'm going to take a PDF of the final version of this entire deck and put that on GitHub as well. You can also find videos of me giving a lot of these talks on YouTube, and I try to post some links there.


I started off with the retrospectives book by Aino. One of the antipatterns that she highlights in that book is the prime directive ignorance. You ignore the prime directive, because you didn't protect unprepared organizations from the dangers of introducing advanced technology, knowledge, and values before they were ready. Maybe at Netflix, we came up with stuff and we shared it with people and a few people took it home and tried to implement it before their organizations were really ready for it. I think we had a bit of that happening, as well as all the cool companies that actually did pick it up and take it to new heights, and we all learn from each other. Just a little worry about the dangers of coming home from a conference and saying, I saw a new thing and let's go do it.




Recorded at:

Jul 28, 2023
