Interview: Adrian Cockcroft on High Availability, Best Practices, and Lessons Learned in the Cloud
Netflix is a widely referenced case study for how to effectively operate a cloud application at scale. While their hyper-resilient approach may not be necessary at most organizations – and the jury is out on that assumption – Netflix has advanced the conversation about what it means to build modern systems. In this interview, InfoQ spoke with Adrian Cockcroft who is the Cloud Architect for the Netflix platform.
InfoQ: What does “high availability 101” look like for a new Netflix engineer? How do they learn best practices and what are the main areas of focus?
Cockcroft: We run an internal "boot camp" every few months for new engineers. The most recent version is a mixture of presentations about how everything works and some hands on work making sure that everyone knows how to build code that runs in the cloud. We use a version of the NetflixOSS RSS Reader as a demo application.
InfoQ: Are there “traditional” web development techniques or patterns that you often ask engineers to “forget” when working with cloud-scale distributed systems?
Cockcroft: Sticky session based programming doesn't work well so we make everything request scoped, and any cross request information must be stored in memcached using our EVcache mechanism (which replicates the data across zones).
InfoQ: You and others at Netflix have spoken at length about expecting failures in distributed systems. How specifically do you recommend that architects build out circuit breakers and employ other techniques for preventing cascading failures in systems?
Cockcroft: The problem with dependencies between services is that it rapidly gets complicated to keep track of them, and it's important to multi-thread calls to different dependencies, which gets tricky when managing nested calls and responses. Our solution to this is based on the functional reactive pattern that we've implemented using RxJava, with a backend circuit breaker pattern wrapped around each dependency using Hystrix. To test that everything works properly under stress, we use Latency Monkey to inject failures and high latency into dependent service calls. This makes sure we have the timeouts and circuit breakers calibrated properly, and uncovers any "unsafe" dependencies that are being called directly, since those can still cause cascading failures.
InfoQ: Netflix OSS projects cover a wide range of services including application deployment, billing, and more. Which of these projects do you consider MOST indispensable to your team at Netflix, and why?
Cockcroft: One of our most powerful mechanisms and somewhat overlooked NetflixOSS projects is the Zuul gateway service. This acts as an intelligent routing layer which we can use for many purposes, handling authentication, geographic and content aware routing, scatter/gather of underlying services into a consistent external API etc. It's dynamically programmable, and can be reconfigured in seconds. In order to route traffic to our Zuul gateways we need to be able to manage a large number of DNS endpoints with ensemble operations. We've built the Denominator library to abstract away multiple DNS vendor interfaces to provide the same high levels of functionality. We have found many bugs and architectural problems in the commonly used DNS vendor specific APIs, so as a side effect we have been helping fix DNS management in general.
InfoQ: Frameworks often provide a useful abstractions on top of complex technology. However, are there cases where an abstraction shields developers from truly understanding something more complex, but useful?
Cockcroft: Garbage collection lets developers forget about how much memory they are using and consuming. While it helps code get written quickly, the sheer volume of garbage being generated and number of times data is copied from one memory location to another is not usually well understood. While there are some tools to help (we open sourced our JVM GCviz tool) it's a common blind spot. The tuning parameters for setting up heaps and garbage collection options are confusing and are often set poorly.
InfoQ: Netflix is a big user of Cassandra, but is there any aspect of the public-facing Netflix system that uses a relational database? How do you think that modern applications should decide between NoSQL and relational databases?
Cockcroft: The old Netflix DVD shipping service still runs on the old code base on top of a few large Oracle databases. The streaming service has all its customer request facing services running on Cassandra, but we do use MySQL for some of our internal tools and non-customer-facing systems such as the processes that we use to ingest metadata about new content. If you want to scale and be highly available use NoSQL. If you are doing rapid continuous delivery of functionality, you will eventually want to denormalize your data model and give each team its own data store so they can iterate their data models independently. At that point most of the value of a unified relational schema is gone anyway.
InfoQ: Can you give us an example of something at Netflix that didn’t work because it was TOO sophisticated and you eventually went with a simpler approach?
Cockcroft: There have been cases where teams decided that they wanted to maintain strong consistency, so they invented complex schemes that they think will also keep their services available, but this tends to end up with a lot of downtime, and eventually a much simpler and more highly available model takes over. There is less consistency guarantee with the replacement, and perhaps we had to build a data checking process to fix things up after the event if anything goes wrong. A lot of Netflix outages around two years ago were due to an attempt to keep a datacenter system consistent with the cloud, and cutting the tie to the datacenter so that Cassandra in the cloud became the master copy made a big difference.
InfoQ: How about something you built at Netflix that failed because it was TOO simple and you eventually realized that a more sophisticated solution was required?
Cockcroft: Some groups use Linux load-average as a metric to tell if their instances are overloaded. They then want to use this as an input to autoscaling. I don't like this because load-average is time decay weighted, so it's slow to respond, and it's non-linear, so it tends to make autoscaler rules over-react. As a simple rule total (user+system) CPU utilization is a much better metric, but it can still react too slowly. We're experimenting with more sophisticated algorithms that have a lot more inputs, and hope to have a Netflix Tech Blog post on this issue fairly soon (keep watching http://techblog.netflix.com for technology discussion and open source project announcements).
InfoQ: How you recommend that developers (at Netflix and other places) set up appropriate sandboxes to test their solutions at scale? Do you use the same production-level deployment tools to push to developer environments? Should each developer get their own?
Cockcroft: Our build system delivers into a test AWS account that contains a complete running set of Netflix services. We automatically refresh test databases from production backups every weekend (overwrite the old test data). We have multiple "stacks" or tagged versions of specific services that are being worked on, and ways to direct traffic by tags is built into Asgard, our deployment tool. There's a complete integration stack that is intended to be close to production availability but reflect the next version of all the services. Each developer has their own tagged stack of things they are working on, that others will ignore by default, but they share the common versions. We re-bake AMIs from test account to push to production account with a few environment variable changes. There is no tooling support to build an AMI directly for the production account without launching it in test first.
InfoQ: Given the size and breadth of the Netflix cloud deployment, how (and when!) do you handle tuning and selection of the ideal AWS instance size for a given service? Do you run basic performance profiling on a service to see if it's memory-bound, I/O bound or CPU bound, and then choose the right type of instance? At what stage of the service's lifecycle are these assessments made?
Cockcroft: Instance type is chosen primarily based on memory need. We're gradually transitioning where possible from the m2 family of instances to the m3 family, which have a more modern CPU base (Intel E5 Sandybridge) that runs Java code better. We then run enough instances to get the CPU we need. The only instances that are IO intensive are Cassandra, and we use the hi1.4xlarge for most of them. We've built a tool to measure how efficiently we use instances, and it points out the cases where a team is running more instances than they need.
About the Interviewee
Adrian Cockcroft is the director of architecture for the Cloud Systems team at Netflix. He is focused on availability, resilience, performance, and measurement of the Netflix cloud platform, and is the author of several books while a Distinguished Engineer at Sun Microsystems: Sun Performance and Tuning; Resource Management; and Capacity Planning for Web Services.