Netflix: Dystopia as a Service
"As engineers we strive for perfection; to obtain perfect code running on perfect hardware, all perfectly operated." So began Adrian Cockcroft, Cloud Architect at Netflix, in his recent keynote address at Collaboration Summit 2013. Cockcroft went on to say that the Utopia of perfection takes too long: time to market trumps quality, and so we compromise. But rather than lamenting the loss of "static, better, cheaper" perfection, Cockcroft described how the Netflix "Cloud Native" architecture embraces the Dystopia of "broken and inefficient" to deliver "sooner" and "dynamic". In his words, "the new engineering challenge is not to construct perfection but to construct highly agile and highly available services from ephemeral and often broken components."
Hosted almost entirely on Amazon Web Services (AWS) infrastructure, the Netflix architecture utilises "antifragile patterns" inspired by Nassim Taleb's 2012 book "Antifragile: Things That Gain From Disorder" and Michael Nygard's book "Release It!". These patterns include stateless auto-scaled microservices, reactive APIs, circuit breakers, bulkheads and graceful degradation. Antifragility also includes the curious concept of Chaos Monkeys: services that randomly terminate virtual machines in production to test and prove the availability and reliability of the whole system. Cockcroft describes how Netflix recently deployed its "Chaos Gorilla", which took down one entire zone out of the three AWS zones Netflix runs in, with no perceptible outage for Netflix customers.
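To make two of these patterns concrete, here is a minimal Python sketch of a circuit breaker combined with graceful degradation. It is an illustration only, not Netflix's implementation: the class name, parameters and thresholds are all assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical names, not Netflix code).

    After `max_failures` consecutive failures the circuit opens and calls go
    straight to the fallback. Once `reset_after` seconds have passed, a single
    trial call is allowed through to see whether the dependency has recovered.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: degrade gracefully without touching the dependency.
                return fallback()
            # Cooldown expired: close provisionally and let one trial call through.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result
```

A caller would wrap each remote dependency in its own breaker and supply a cheaper fallback, for example a generic list of popular titles when a personalised recommendation service is unavailable. The user sees a degraded page rather than an error.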
Cockcroft describes the architecture as "Cloud Native". "Master copies of data are cloud resident, everything is dynamically provisioned, all services are ephemeral," says Cockcroft. The largest services are auto-scaled, with an average instance lifetime of 36 hours. At the physical level the Netflix infrastructure is composed of thousands of AWS instances supporting auto-scaled microservices. Some microservices provide stateless middleware functions while others provide database storage via a quorum of Cassandra NoSQL database instances.
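The quorum idea is what lets a store like this ride out the loss of a zone. The arithmetic can be sketched in a few lines of Python. The replication factor of 3, with one replica per zone, is an assumption made for illustration and is not a figure from the talk; the function names are likewise hypothetical.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum read or write: a strict majority."""
    return replication_factor // 2 + 1


def survives(replication_factor, replicas_lost):
    """True if quorum operations can still succeed after losing some replicas."""
    return replication_factor - replicas_lost >= quorum(replication_factor)


# Assumed layout: 3 replicas, one in each of three zones.
print(quorum(3))       # replicas needed for a quorum
print(survives(3, 1))  # one zone lost
print(survives(3, 2))  # two zones lost
```

Under that assumption a quorum is two replicas, so losing a single zone still leaves enough replicas to serve quorum reads and writes, while losing two zones does not.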
The final part of Cockcroft's address holds ideas and opportunities for the Linux Foundation, sponsors of the Collaboration Summit. What keeps Cockcroft "awake at night" are the problems associated with a monoculture: replicating "the best" as patterns and reducing interaction complexity leads to an epidemic single point of failure, where one flaw is copied everywhere at once. Cockcroft cites the example of the leap-second bug that caused major problems for online services last year. He suggests that the only way to address these "pattern failures" is automated diversity management, together with an understanding of the trade-off between efficiency and fragility.
According to a recent report by networking equipment company Sandvine, Netflix is responsible for one third of all downstream internet traffic in the US. Netflix is very open about its architecture and progressive in releasing much of its code as open source software in the form of NetflixOSS. Adrian Cockcroft's slides from this talk are posted on SlideShare.
Stephanie Davis (nee Stewart) Dec 21, 2014