Heroku's engineering team wrote about their journey from manual deployments to automated continuous deployments for Heroku Runtime, their managed environment for applications. They achieved this using Heroku primitives and a custom deployer tool.
The Heroku Runtime team builds and operates both single-tenant (Private Space/PS) and multi-tenant (Common Runtime/CR) environments. This includes container orchestration, routing and logging. Previously, the team followed a deployment routine consisting of multiple manual steps - including signoff from the primary on-call engineer, allowing a sufficient buffer for monitoring post-deployment, and watching multiple dashboards. This also had the overhead of waiting for things to stabilize before proceeding to other regions and services. The routine worked when the team was smaller and deployments were limited to two production environments, one in the US and one in the EU. With a growing number of team members and a long-term project to re-architect the CR into multiple services, the team embarked on an automation exercise and built an automated deployer tool.
InfoQ reached out to Bernerd Schaefer, principal member of the technical staff at Heroku, to learn more about the challenges they faced and the details of the solution.
The previous process depended on team bandwidth and careful manual planning of the expected impact. Direwolf - a test platform - reported the status across regions. The growth of the team to 30+ members made this process cumbersome. Combined with the challenge of managing an architecture revamp that would split the monolithic Ruby app behind CR into multiple services, this pushed the team toward complete automation. The app ran in two production environments, and the manual steps led to higher coordination costs.
The team's solution was to use existing Heroku primitives and a custom tool called cedar-service-deployer. Each service became part of a Pipeline, and sharded services were deployed across multiple staging and prod environments as part of the long-term re-architecture project. The cedar-service-deployer tool, written in Go, scans pipelines for differences between stages. If it finds any, it runs checks to see if it can promote the code to the next stage. These checks include release age, sufficient time for integration tests, ongoing incidents, alerts that are firing, promoting only from the master branch, etc. Adding new checks requires a code change, says Schaefer, as the list is fixed. At the same time, he explains that teams can configure their own alerts:
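To make the promotion flow concrete, here is a minimal sketch in Go of how such a check-and-promote loop could look. It is illustrative only: the Stage and Check types, the promote function, and the minimum-age check are assumptions, not the actual cedar-service-deployer code.

```go
package main

import (
	"fmt"
	"time"
)

// Stage describes one step of a pipeline and the release currently deployed there.
// These types are hypothetical, for illustration only.
type Stage struct {
	Name       string
	Release    string    // identifier of the deployed release
	ReleasedAt time.Time // when that release went out
}

// Check is one gate a promotion must pass, e.g. "release is old enough",
// "no open incidents", "no firing alerts", "built from master".
type Check func(from, to Stage) error

// promote compares two stages and, if they differ, runs every check
// before promoting the upstream release to the downstream stage.
func promote(from, to Stage, checks []Check) error {
	if from.Release == to.Release {
		return nil // stages already match, nothing to do
	}
	for _, check := range checks {
		if err := check(from, to); err != nil {
			return fmt.Errorf("holding %s -> %s: %w", from.Name, to.Name, err)
		}
	}
	// The real tool would call the Heroku Pipelines promotion API here.
	fmt.Printf("promoting %s from %s to %s\n", from.Release, from.Name, to.Name)
	return nil
}

func main() {
	staging := Stage{Name: "staging", Release: "v2", ReleasedAt: time.Now().Add(-2 * time.Hour)}
	production := Stage{Name: "production", Release: "v1"}

	// Example check: the release must have baked in staging for at least an hour.
	minAge := func(from, to Stage) error {
		if time.Since(from.ReleasedAt) < time.Hour {
			return fmt.Errorf("release %s too new", from.Release)
		}
		return nil
	}

	if err := promote(staging, production, []Check{minAge}); err != nil {
		fmt.Println(err)
	}
}
```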
Teams are able to configure which things are checked for individual services, particularly which alerts to monitor to determine the health of a service. For example, a service might have one alert checking that the service is up, one checking that its HTTP success rate is over 99%, and teams adding those services to the deployer would configure those alerts in a JSON file for the deployer service to monitor during releases.
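A per-service entry along those lines might resemble the following sketch; the JSON field names and the ServiceConfig struct are illustrative guesses, not the deployer's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceConfig is a hypothetical shape for a per-service deployer entry:
// which alerts gate releases of that service.
type ServiceConfig struct {
	Service string   `json:"service"`
	Alerts  []string `json:"alerts"`
}

func main() {
	// Hypothetical JSON a team might commit for its service.
	raw := `{
		"service": "example-api",
		"alerts": [
			"example-api.up",
			"example-api.http-success-rate-above-99"
		]
	}`

	var cfg ServiceConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("monitoring %d alerts for %s during releases\n", len(cfg.Alerts), cfg.Service)
}
```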
Monitoring and alerting form an important part of the deployment, as they can indicate possible issues. Heroku uses Librato for collecting metrics and alerting. There are other monitoring systems too, but so far all of the services controlled by the deployer use Librato, says Schaefer.
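As a rough illustration of how a deployer check against Librato could work, the snippet below polls an alert-status endpoint over HTTP. The URL, authentication scheme, and response handling are assumptions based on Librato's public HTTP API, not details given by the team.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// firingAlerts fetches the currently triggered alerts from Librato.
// The endpoint below is an assumption for illustration, not a confirmed detail.
func firingAlerts(user, token string) (string, error) {
	req, err := http.NewRequest("GET", "https://metrics-api.librato.com/v1/alerts/status", nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(user, token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := firingAlerts(os.Getenv("LIBRATO_USER"), os.Getenv("LIBRATO_TOKEN"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A deployer check could block promotion if this payload lists any of the
	// alerts configured for the service being released.
	fmt.Println(body)
}
```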
Schaefer further elaborates on their philosophy of monitoring:
One of the things we've been pushing forward is baking monitoring into our standard service toolkit, so that every service has useful instrumentation by default. As services go into production and mature, they'll probably need some custom instrumentation, of course. But the goal is that service developers can focus on and be experts in the features their service offers -- without also needing to be experts in metrics collection, or tracing, or whatever else we want to know about how systems are operating.
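One way a service toolkit can provide useful instrumentation by default is to wrap every HTTP handler so request and success counts are recorded automatically. The sketch below uses only the Go standard library (expvar) and is an assumption about what such a toolkit might do, not Heroku's actual toolkit.

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
)

// Counters every service gets "for free" from the toolkit in this sketch;
// expvar exposes them automatically at /debug/vars.
var (
	requests  = expvar.NewInt("http_requests_total")
	successes = expvar.NewInt("http_responses_2xx_total")
)

// statusRecorder captures the status code written by a handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// instrumented wraps a handler so metrics are recorded without the service
// author writing any instrumentation code themselves.
func instrumented(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		requests.Add(1)
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		if rec.status >= 200 && rec.status < 300 {
			successes.Add(1)
		}
	}
}

func main() {
	http.HandleFunc("/hello", instrumented(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	}))
	http.ListenAndServe(":8080", nil)
}
```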
Although in most cases the deployer can automatically decide whether or not to push, there is a provision for manual overrides. Schaefer explains:
The system always allows operators to use the existing manual tooling to push out releases. We might do that during an incident to get a critical hotfix patch rolled out to the impacted environment. It's rare to push out other changes during an incident, since we try to minimize the changes to production while one is open, and folks are rarely in such a rush to get things out, but that capability is there if it's needed.
The deployer is stateless: it works by trying to promote between any two stages in the pipeline and is not tied to a single "release" version. This allows multiple commits to be in flight concurrently at different stages of the deployment pipeline.
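One way to picture that statelessness: each pass over the pipeline considers every consecutive pair of stages independently, so different commits can be moving through different parts of the pipeline at the same time. The stage names and types below are hypothetical, for illustration only.

```go
package main

import "fmt"

// Stage holds the release currently deployed at one step of the pipeline.
type Stage struct {
	Name    string
	Release string
}

// sweep inspects every consecutive pair of stages independently; there is no
// global "current release", so several commits can be in flight at once.
func sweep(stages []Stage) {
	for i := 0; i+1 < len(stages); i++ {
		from, to := stages[i], stages[i+1]
		if from.Release != to.Release {
			// In the real deployer, promotion happens only after all
			// configured checks pass for this pair of stages.
			fmt.Printf("candidate promotion: %s (%s -> %s)\n", from.Release, from.Name, to.Name)
		}
	}
}

func main() {
	sweep([]Stage{
		{Name: "staging", Release: "commit-c"},
		{Name: "production-us", Release: "commit-b"},
		{Name: "production-eu", Release: "commit-a"},
	})
}
```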