How Netflix Deploys Code

Netflix, the popular movie streaming site, deploys a hundred times per day without Chef or Puppet, without a quality assurance department, and without release engineers. To do this, Netflix built an advanced in-house PaaS (Platform as a Service) that lets each team deploy its own part of the infrastructure whenever it wants, as many times as it requires. At QCon New York 2013, Jeremy Edberg gave a talk about the infrastructure Netflix built on top of Amazon Web Services (AWS) to support this rapid pace of iteration.

Netflix uses a service-oriented architecture to implement its API, which handles most of the site's requests (2 billion requests per day). Behind the scenes, the API is split into many services, each managed by its own team, so teams can work relatively autonomously and decide for themselves when and how often to deploy new software.

Netflix is heavily invested in DevOps. Developers build, deploy and operate their own server clusters and are accountable when things go wrong. After a failure, a session is organized in which the root cause is investigated and ways to prevent similar issues are discussed, much like the five whys technique.

Deployment at Netflix is completely automated. When a service needs to be deployed, the developer first pushes the code to a source code repository. Jenkins picks up the push and performs a build that produces an application package. Then a fresh VM image (AMI) is built from a base image, which contains a Linux distribution and the software that all Netflix servers run, including a JVM and Tomcat, and which the team may have customized further. The application package is installed on top of this base install, and the resulting AMI is registered with the system.
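The image-building step described above can be sketched in a few lines of Python. This is only an illustration of the layering idea, not Netflix's tooling: the `Image` type, the `bake_image` and `register` functions, the in-memory `registry`, and the package name `api-service-1.4.2` are all hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Image:
    """A VM image modeled as an ordered stack of software layers (hypothetical)."""
    name: str
    layers: Tuple[str, ...]


# Assumed contents of the shared base image: OS plus the common runtime.
BASE_IMAGE = Image("base", ("linux", "jvm", "tomcat"))

# Stand-in for "the system" that images are registered with.
registry: Dict[str, Image] = {}


def bake_image(base: Image, app_package: str, team_extras: Tuple[str, ...] = ()) -> Image:
    """Build a fresh image: base layers, optional team customizations, then the app."""
    layers = base.layers + tuple(team_extras) + (app_package,)
    return Image(name=f"{app_package}-image", layers=layers)


def register(image: Image) -> str:
    """Record the image so a deployment tool can later look it up by name."""
    registry[image.name] = image
    return image.name


image = bake_image(BASE_IMAGE, "api-service-1.4.2", team_extras=("team-agent",))
image_id = register(image)
print(image_id, image.layers)
```

Because every build produces a new, immutable image rather than modifying running servers, the same artifact that was built is the one that gets deployed.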

To deploy the VM images to its infrastructure, Netflix built Asgard. Via the Asgard web interface, VM images can be instantiated to create new EC2 clusters. Every cluster consists of at least 3 EC2 instances for redundancy, spread over multiple availability zones. When a new version is deployed, the cluster running the previous version is kept running while the new version is instantiated. Once the new version has booted and registered itself with Eureka, the Netflix service registry, the load balancer is switched to direct all traffic to the new cluster. The new cluster is monitored carefully and kept running overnight. If everything runs OK, the old cluster is destroyed. If something goes wrong, the load balancer is switched back to the old cluster.
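The switch-and-rollback flow can be sketched as follows. This is a simplified model under stated assumptions, not Asgard's or Eureka's actual API: `Cluster`, `LoadBalancer`, `deploy`, and the `is_healthy` callback are hypothetical, and the overnight monitoring period is reduced to a single health check.

```python
from dataclasses import dataclass
from typing import Callable

MIN_INSTANCES = 3  # the article's minimum cluster size for redundancy


@dataclass
class Cluster:
    version: str
    instances: int = MIN_INSTANCES
    registered: bool = False  # True once the cluster has announced itself to the registry


class LoadBalancer:
    """Routes all traffic to exactly one active cluster (hypothetical model)."""

    def __init__(self, active: Cluster) -> None:
        self.active = active

    def switch_to(self, cluster: Cluster) -> None:
        self.active = cluster


def deploy(lb: LoadBalancer, new: Cluster, is_healthy: Callable[[Cluster], bool]) -> str:
    """Send traffic to the new cluster, keeping the old one available for rollback."""
    old = lb.active
    if new.instances < MIN_INSTANCES:
        raise ValueError("cluster needs at least 3 instances")
    if not new.registered:
        raise RuntimeError("new cluster has not registered with the service registry")
    lb.switch_to(new)
    if is_healthy(new):
        return "promoted"  # the old cluster can now be destroyed
    lb.switch_to(old)  # something went wrong: route traffic back
    return "rolled_back"


v1 = Cluster("v1", registered=True)
lb = LoadBalancer(v1)

good = deploy(lb, Cluster("v2", registered=True), is_healthy=lambda c: True)
after_good = lb.active.version

bad = deploy(lb, Cluster("v3", registered=True), is_healthy=lambda c: False)
after_bad = lb.active.version
print(good, after_good, bad, after_bad)
```

Keeping the old cluster running until the new one has proven itself is what makes the rollback a routing change rather than a redeployment.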

Failure happens continuously in the Netflix infrastructure. Software needs to be able to deal with failing hardware, failing network connectivity and many other types of failure. Even when failure doesn't occur naturally, it is induced deliberately using the Simian Army, a set of software "monkeys" that randomly introduce failure. For instance, the Chaos Monkey randomly brings servers down and the Latency Monkey randomly introduces latency in the network. Ensuring that failure happens constantly makes it impossible for a team to ignore the problem and creates a culture that has failure resilience as a top priority.
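The idea behind the Chaos Monkey can be illustrated with a toy model. The sketch below is not the Simian Army code: `chaos_monkey`, `can_serve`, and the instance names are hypothetical, and "bringing a server down" is modeled as removing a name from a set.

```python
import random
from typing import FrozenSet, Tuple


def chaos_monkey(instances: FrozenSet[str], rng: random.Random) -> Tuple[str, FrozenSet[str]]:
    """Pick one running instance at random and return it with the survivors."""
    victim = rng.choice(sorted(instances))  # sort so a seeded run is reproducible
    return victim, instances - {victim}


def can_serve(instances: FrozenSet[str]) -> bool:
    """Toy resilience check: the service is up while any instance remains."""
    return len(instances) > 0


# Assumed cluster of three instances, named after availability zones.
cluster = frozenset({"zone-a-1", "zone-b-1", "zone-c-1"})
rng = random.Random(2013)  # fixed seed only to make the example repeatable

victim, survivors = chaos_monkey(cluster, rng)
print(f"terminated {victim}; still serving: {can_serve(survivors)}")
```

A cluster of at least three instances survives the loss of any single one, which is the property this kind of random termination is meant to exercise continuously.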

Many parts of the Netflix infrastructure are already open source and available on GitHub. Netflix's goal is to eventually release all of its infrastructure for other companies to benefit from.
