How Etsy Deploys More Than 50 Times a Day
Daniel Schauenberg described at the last QCon London how Etsy, renowned for its DevOps and Continuous Delivery practices, does more than 50 deploys a day. A fully automated deployment pipeline, thorough application monitoring and IRC-based collaboration are all important for achieving this rate of change while keeping risk to a minimum.
Etsy's development approach revolves around making many small, continuous changes. A direct consequence is the need to do many deployments a day. In the words of Daniel Schauenberg, at any given time every Etsy developer needs to know the answer to the question "how comfortable am I with deploying a change right now?" To be comfortable at all times, Etsy adopted a range of tools and practices: mandatory IRC-based communication; developer virtual machines; continuous integration; one-click deployments; thorough application and system monitoring; blameless post-mortems; and on-call policies for both dev and ops teams.
Every developer has their own KVM (Kernel-based Virtual Machine) instance, configured by Chef. The same cookbooks used in production are also used on the developers' virtual machines, which means that each developer has their own full Etsy stack. Anyone can provision a virtual machine through Virtual Madness, a web application that automates the whole process.
On the continuous integration front, Daniel explained how Try is central to their process. Try is a tool that allows developers to test their changes in Jenkins, the CI tool used at Etsy, without having to commit to trunk. Try helps to keep the trunk clean and thus deployable, while at the same time allowing developers to test their changes quickly and reliably. The CI cluster must be powerful enough to support 150 engineers and more than 14,000 test suite runs per day. LXC (Linux Containers) is used to parallelize the workload. The containers also provide the isolation needed to keep the executors from colliding with one another.
The deployment pipeline passes through the princess, or staging, environment before going into production. Princess is, for all intents and purposes, the production environment, but only Etsy's employees have access to it. Deployinator, the deployment tool built and used by Etsy, offers one-click deployments.
Config flags, also known as feature flags, are an integral part of the deployment process. Through its feature API, Etsy can run A/B tests and completely enable or disable a feature, or individual variants of a given feature.
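To make the idea concrete, here is a minimal sketch of how a feature flag with A/B variants could work. It is not Etsy's feature API: the `Feature` class, the `bucket` and `variant_for` functions, the feature names and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Feature:
    """Hypothetical flag definition, not Etsy's actual config format."""
    enabled: bool = True
    # Variant name -> percentage of users (0-100). Users who fall outside
    # the configured percentages do not get the feature at all.
    variants: Dict[str, float] = field(default_factory=dict)


def bucket(feature_name: str, user_id: str) -> float:
    """Map a (feature, user) pair to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{feature_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0


def variant_for(features: Dict[str, Feature], name: str, user_id: str) -> Optional[str]:
    """Return the variant a user sees, "on" for a plain flag, or None if off."""
    feature = features.get(name)
    if feature is None or not feature.enabled:
        return None
    if not feature.variants:
        return "on"
    point = bucket(name, user_id)
    upper = 0.0
    for variant, percentage in feature.variants.items():
        upper += percentage
        if point < upper:
            return variant
    return None


# Hypothetical flags: one A/B test, one feature switched off entirely.
FEATURES = {
    "new_checkout": Feature(variants={"control": 50, "redesign": 50}),
    "legacy_search": Feature(enabled=False),
}
```

Because the bucket is derived from a hash of the feature name and user ID, a given user keeps seeing the same variant across requests, and flipping `enabled` to `False` turns the feature off without a code deployment.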
Monitoring is key to the way Etsy's team builds the confidence to do Continuous Delivery. Developers do their own feature monitoring and everyone has access to all the graphs through dashboards. Etsy has a policy where, by default, everything that can be graphed is graphed. Over time, the number of metrics has increased steadily, so Etsy built Kale to help detect anomalous patterns. All logs are available through Supergrep, a web-based log streamer that increases the logs' signal-to-noise ratio.
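As an illustration of what automated anomaly detection on a metric can look like, the sketch below flags points that deviate strongly from a trailing window of recent values. This is a generic rolling mean and standard deviation check, not a description of how Kale works; the `find_anomalies` function, its parameters and the sample data are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 10,
                   threshold: float = 3.0) -> List[int]:
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all stands out.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical metric samples (e.g. errors per minute) with one spike.
samples = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 50, 10, 11]
print(find_anomalies(samples))
```

A check like this is cheap enough to run over many metrics, which matters once the number of graphs grows beyond what people can watch by eye.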
IRC is the main communication tool throughout Etsy and is key to the company's collaboration culture. There are lots of different chat rooms, each with a specific purpose. For instance, there is a #warroom where only outage-related conversations are allowed. The room is used to coordinate the investigation, discuss countermeasures and monitor the resolution. New engineers are encouraged to lurk in #warroom and other chat rooms, as these are considered good places to learn.
After each outage, or near outage, everybody is invited to a post-mortem. Post-mortems are such a significant cultural event that even finance and support can attend if they want to. Post-mortems are meant to be a learning opportunity and so they are blameless. All the information related to a post-mortem is recorded in Morgue: dates, severity, IRC logs, graphs and remediation actions. Morgue is another tool built by Etsy for the specific purpose of post-mortem record-keeping.
There are on-call policies for operations, developers, payments and support. Developers are usually on-call one week every four weeks, on a rotation basis. The policy aims to keep everyone aware of the day-to-day issues that face the site so that they can be taken into account when developing new features or improving existing processes.
Etsy has about 60 million monthly visits and 1.5 billion page views per month.