How Checkly Achieves Zero Downtime Deployments with Terraform

Checkly, a monitoring tool that validates the correctness of API endpoints and browser click flows, has shared its experience of using Terraform to perform zero downtime deployments for its Docker-based infrastructure on AWS.

Checkly uses "workers" to run jobs submitted by users. Each worker runs in a Docker container, and each EC2 instance hosts five such containers. Checkly's challenge was to deploy to AWS without affecting the user experience, to support multiple versions of its code concurrently, and to upgrade the worker code independently. It achieved this with Terraform modules, rolling updates and custom remote-exec provisioner code.

Checkly uses the Puppeteer framework to automate browser actions. Puppeteer is a Node library that provides an API for controlling headless Chrome. Each Checkly worker is a stateless Node process that accepts parameters and runs its tests, which makes it easy to scale horizontally in response to request traffic. A cron job pushes user requests into an AWS SQS queue; workers pick them up from there and push the results into another queue. A worker calls the SQS API to delete a message only after its job succeeds, so a failed job leaves the message in the queue and is retried. New versions reach AWS through a Docker-based build lifecycle, followed by a rolling update that uses Terraform primitives. The code passes through three environments: dev, test and production. InfoQ got in touch with Tim Nolet, founder at Checkly, to find out more:

The unit tested code is built into a Docker container, with the build, tag and push Docker commands in the package.json as scripts. We push the container (tagged with a version and with "test") to our private Docker repository and then cycle the test EC2 instances which pull the latest "test" container using the Terraform "taint" command.
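To make the sequence Nolet describes more concrete, here is a minimal Python sketch that only assembles the commands for one build, push and cycle round, without running them. The image name, version number and resource address are hypothetical placeholders rather than Checkly's actual values, and in Checkly's setup the build, tag and push steps live in package.json scripts, not in Python.

```python
from typing import List

# Hypothetical names for illustration; not taken from Checkly's setup.
IMAGE = "registry.example.com/worker"
TEST_INSTANCES = ["aws_instance.worker_test"]


def release_commands(version: str, stage_tag: str, instances: List[str]) -> List[List[str]]:
    """Return the commands for one build-push-cycle round, in order."""
    versioned = f"{IMAGE}:{version}"
    staged = f"{IMAGE}:{stage_tag}"
    commands = [
        ["docker", "build", "-t", versioned, "."],
        ["docker", "tag", versioned, staged],
        ["docker", "push", versioned],
        ["docker", "push", staged],
    ]
    # A tainted resource is destroyed and recreated on the next apply, so the
    # fresh instance pulls whichever image currently carries the stage tag.
    # Taint and apply one instance at a time so only one is replaced per step.
    for address in instances:
        commands.append(["terraform", "taint", address])
        commands.append(["terraform", "apply", "-auto-approve"])
    return commands


if __name__ == "__main__":
    for command in release_commands("1.4.2", "test", TEST_INSTANCES):
        print(" ".join(command))
```

Newer Terraform releases deprecate the taint command in favour of terraform apply -replace=ADDRESS, which would reduce the loop to one command per instance.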

The "taint" command in Terraform forces a resource (EC2 instances in this case) to be destroyed and recreated. Checkly’s team lets the test instance run for a couple of days. If all goes well, the Docker image is re-tagged with "latest" and the "taint" is repeated for all production EC2 instances, which completes the rolling update. One of Checkly’s goals is to allow multiple versions of the app to co-exist, which can require additional handling in the code, the data stores or the message queues. For example, if the JSON format used in the SQS messages changes, both formats have to be handled for a short period while the old infrastructure goes down and the new one comes up. Nolet elaborates on their approach:

As we are quite young, we have not had huge changes yet in the overall data transfer objects or messaging schemes. But I would always solve that in the code. The queuing bus, the storage and all other middleware are just not the right place for it. So if that means a bunch of extra "if" statements or case switches to handle two message types, so be it. We use Postgres as our main datastore, so the JSON fields are very welcome to handle small tweaks to the data model without too much hassle.
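The "extra if statements" approach can be illustrated with a small sketch. The field names, the "version" marker and the default timeout below are all invented for this example; the article does not describe Checkly's actual message schema.

```python
import json
from typing import Any, Dict


def normalize_job(body: str) -> Dict[str, Any]:
    """Accept either message shape and return one internal representation."""
    message = json.loads(body)
    # Assumption: messages without a "version" field use the old format.
    version = message.get("version", 1)
    if version == 1:
        # Old (hypothetical) shape: {"checkId": "...", "url": "..."}
        return {
            "check_id": message["checkId"],
            "target": message["url"],
            "timeout_s": 30,
        }
    if version == 2:
        # New (hypothetical) shape nests the request and adds a timeout.
        request = message["request"]
        return {
            "check_id": message["checkId"],
            "target": request["url"],
            "timeout_s": request.get("timeoutSeconds", 30),
        }
    raise ValueError(f"unsupported message version: {version}")
```

Because both shapes are converted at the edge, the rest of the worker only sees one format, and the version 1 branch can be deleted once no old instances are left to produce such messages.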

Checkly relies on Terraform primitives such as create_before_destroy and the remote-exec provisioner. The create_before_destroy lifecycle setting is available on all Terraform-managed resources and ensures that a replacement resource is created before the old one is removed. While Terraform creates an instance through the AWS provider, a remote-exec provisioner runs a script that keeps checking whether the Node process is running in the container and returns once it is, signalling to Terraform that the resource is ready. The check is a simple grep command. Checkly's Terraform code is organized into modules, with one module per AWS region.
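The readiness check boils down to a polling loop. Checkly implements it as a shell command with grep inside the provisioner; the Python below is only a sketch of the same logic, and the container name, attempt count and delay are assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable


def node_is_running(container: str) -> bool:
    """True when a 'node' process shows up in the container's process list."""
    # `docker top` prints the processes running inside a container.
    result = subprocess.run(
        ["docker", "top", container], capture_output=True, text=True
    )
    return result.returncode == 0 and "node" in result.stdout


def wait_until_ready(check: Callable[[], bool], attempts: int = 60, delay_s: float = 5.0) -> bool:
    """Poll `check` until it passes; give up after `attempts` tries."""
    for attempt in range(attempts):
        if check():
            return True
        # No need to sleep after the final failed attempt.
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A caller would use it as wait_until_ready(lambda: node_is_running("worker")), where "worker" is a hypothetical container name. Bounding the number of attempts matters: a script that exits with an error makes the provisioner fail, so Terraform reports the problem instead of waiting forever on a container that never starts.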

Terraform code can be tested with toolkits like Terratest, which validate infrastructure managed by Terraform. However, Checkly does not use any test frameworks for this, depending instead on the fact that "the test and the production environments are identical, and any major issues will be caught in the former", says Nolet.

Checkly’s base Docker image is Ubuntu-based and includes all the packages necessary to run Puppeteer and headless Chrome, which means some extra libraries and fonts. The Docker container runs a PM2 process, which launches a Node process. This part of the Docker strategy is stable, and errors that might lead to a deployment rollback are usually in the actual product code, according to Nolet. Checkly uses both AWS CloudWatch and AppOptics for monitoring. CloudWatch alerts on AWS queue sizes and delays as well as basic instance health. AppOptics is more application-specific, checking metrics like the number of runs in a given region in the last 10 minutes, or the run times in a given region. Checkly's status dashboard is publicly available.
