
Deploying Microservices to AWS at Gilt: Introducing ION-Roller

May 08, 2015 14 min read

Over a period of seven years, gilt.com has grown from a Ruby-on-Rails start-up to a major e-commerce player built on a Scala microservices architecture. Gilt's flash-sale business model is predicated on driving a 'pulse' load of customer traffic to compete for limited quantities of luxury product; adopting microservices has provided a blend of scalability, performance, and reliability for our services. It has also enabled autonomy, empowerment, and flexibility for our development teams. Teams are free to use their choice of language, framework, database and build system to create services for core site functionality, mobile applications, personalisation algorithms, real-time data feeds and notifications.

With a Cambrian explosion of software services comes the question of how to deploy and run the artifacts we create. Gilt began its life running software on 'bare metal', over first one and then two traditional data centres. Teams would request that hardware be provisioned, and then deploy using a blend of custom scripts and Capistrano. It worked, but was prone to difficulty. Initial experimentation with hypervisor technologies dampened performance during our daily pulse load. Many services ended up co-located on the same physical box, leading to performance anomalies and, at times, outages due to the lack of resource isolation: one rogue deployment of a service consuming 'multiple-threads-per-request' during pulse traffic could take out the entire site.

We adopted a number of techniques and created tooling to alleviate these problems, using performance testing in production and continuous deployment. However, our research into ways to allocate and auto-provision services onto the estate of hardware in our data centres proved time-consuming and, ultimately, too difficult.

This led us to consider moving our microservice infrastructure wholesale to the cloud, in particular Amazon Web Services (AWS), and homing in on the concept of Immutable Deployment, a concept inspired by our experience with immutable variables in functional programming. We adapted our data-centre based tooling, 'ION-Cannon', to a new soon-to-be-open-sourced tool, 'ION-Roller'. Our motivations for ION-Roller were to:

Allow our teams to declaratively specify how their services should be deployed in AWS, e.g. min/max nodes available, instance size, etc.
Provide a pipeline for deploying services and applications as immutable Docker images to production environments, with phased roll-out and movement of traffic from old to new versions of the service.
Support rapid rollback in the event of a rogue release, potentially to an already 'hot' set of instances running the previous version.

In this article we’ll talk about the core concepts used in ION-Roller, the technologies and approaches involved, and provide an overview of its use.

The ION-Roller Cloud-based Kingdom of Microservices

We envision the world of microservices as HTTP endpoints built on top of independent immutable environments, communicating via REST APIs.

HTTP Endpoints

Gilt uses HTTP as the transport wrapper for communication between systems. The HTTP endpoints are represented as a hostname/port combination to the user (with a discovery layer on top of this). Our idea is to use this endpoint as an organizational concept in the evolution of software and configuration over time.

While the concept is simple, many useful goals can be accomplished. Through the use of appropriate proxies and/or networking, we can offer such features as:

gradual rollout of new software versions
safe rollback to previous versions of software/configuration
detection of anomalies through error rate and latency monitoring
understanding of what software serves an endpoint
information about changes in software/configuration over time

Configuration over code for describing environments

Unrestrained flexibility in deployments may seem appealing when the scale of an operation is extremely small, but it becomes extremely difficult to manage, monitor and understand as the scale increases.

Declarative configuration is a useful tool for describing and managing deployments in a structured fashion. As configuration, it is more limited than code as to what is possible, but makes it possible to analyse and report on what has been requested.
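To illustrate why configuration is easier to analyse and report on than code, here is a minimal sketch that models a deployment request as plain data. The class and field names (`DeploymentSpec`, `min_instances`, `max_instances`, `instance_type`, `image`) are assumptions made for this example; they are not ION-Roller's actual configuration schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentSpec:
    """Hypothetical declarative description of one service deployment."""
    service: str
    image: str          # Docker image reference
    instance_type: str  # EC2 instance size
    min_instances: int
    max_instances: int


def validate(spec: DeploymentSpec) -> list[str]:
    """Return human-readable problems; an empty list means the spec is acceptable."""
    problems = []
    if spec.min_instances < 1:
        problems.append("min_instances must be at least 1")
    if spec.max_instances < spec.min_instances:
        problems.append("max_instances must be >= min_instances")
    return problems


def summarize(specs: list[DeploymentSpec]) -> dict[str, int]:
    """Report the maximum number of instances requested per instance type."""
    totals: dict[str, int] = {}
    for spec in specs:
        totals[spec.instance_type] = totals.get(spec.instance_type, 0) + spec.max_instances
    return totals
```

Because the request is data rather than a script, checks such as `validate` and reports such as `summarize` can run before anything is deployed.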

Risk Management in Software Deployments

Releasing software involves dealing with multiple tradeoffs: flexibility/simplicity, and velocity/safety.

Most production environments are moving towards deployment processes which allow shorter amounts of time between feature/bug completion, and getting those changes into the hands of the users. This implies a much higher rate of change in the production environment. Given that every release involves risk, there are downsides to this policy. To offset this, teams need to adopt release processes which mitigate the risks in individual releases, and make a release a routine, safe operation.

There are multiple aspects to risk: number of issues, severity of issues, number/percentage of users affected, and failure recovery time.

The release process does not affect the number or severity of issues as much as the in-development process does, though frequent releases usually make it possible to understand the reasons for an issue more quickly (due to fewer changes happening per release).

What the release process can help with is the number of users affected and the time to fix. Testing in production, partial rollouts and canary releases fit here. Testing the new release without any production traffic has minimal (or zero) impact on users, but finds the fewest issues. As many issues only appear under normal traffic, a canary (a single instance running the new software) is a valuable tool for judging whether a new release is suitable for use. At this point, the impact on users may still be small, depending on how heavily your service is sharded. Continuing the rollout slowly then keeps the impact of any bug low by limiting the percentage of users affected.
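The relationship between rollout stage and affected users can be sketched with a little arithmetic. The example below assumes that the load balancer spreads requests evenly across instances, which is a simplification; the function names and the fixed-step schedule are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def exposure(new_instances: int, old_instances: int) -> float:
    """Fraction of requests served by the new version, assuming even load balancing."""
    total = new_instances + old_instances
    if total == 0:
        raise ValueError("no instances are serving traffic")
    return new_instances / total


def rollout_stages(fleet_size: int, step: int) -> list[tuple[int, int, float]]:
    """List (new, old, exposure) for a canary stage followed by fixed-size steps.

    The first stage adds a single canary alongside the full old fleet; later
    stages replace old instances with new ones, `step` at a time.
    """
    if fleet_size < 1 or step < 1:
        raise ValueError("fleet_size and step must be at least 1")
    stages = [(1, fleet_size, exposure(1, fleet_size))]
    new = 0
    while new < fleet_size:
        new = min(new + step, fleet_size)
        old = fleet_size - new
        stages.append((new, old, exposure(new, old)))
    return stages
```

For a fleet of four instances, the canary stage exposes one request in five to the new release, so a bug caught at that point affects roughly 20% of traffic rather than all of it.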

Failure recovery time can be a major advantage of the immutable software approach, as the old software is still running “hot”. If an issue is identified, this can allow minimal risk and time in going back to the older version. We will discuss immutable deployments further later.

Desired versus Live State of Running Software

We want to continually monitor the difference between what is supposed to be running behind an endpoint, and what is actually running; if there is a difference, we should (over time) move the live state of the system towards the desired state. We continually perform a number of tests against the environments (running software, load balancer settings, DNS, etc.), and may update any of these if they do not match what is required.

For example, if we know that traffic should be served from a specific version of our software, and we discover that no servers running this version are available (for any reason, including that somebody accidentally removed them), we should request that it be deployed.
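The comparison of desired and live state can be sketched as a small planning function. This is an illustrative simplification, not ION-Roller's implementation: state is reduced to a hypothetical mapping from version to instance count, and the action strings are invented for the example.

```python
from __future__ import annotations


def plan_actions(desired: dict[str, int], live: dict[str, int]) -> list[str]:
    """Compare desired and live instance counts per version and list corrective actions."""
    actions = []
    for version in sorted(set(desired) | set(live)):
        want = desired.get(version, 0)
        have = live.get(version, 0)
        if have < want:
            actions.append(f"deploy {want - have} x {version}")
        elif have > want:
            actions.append(f"remove {have - want} x {version}")
    return actions
```

Running such a function periodically, and applying whatever actions it returns, moves the live state towards the desired state over time. If somebody accidentally removes every server for a version, the next pass simply asks for it to be deployed again.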

Immutable Deploys

The idea of “immutable environments” is to deploy software with a defined, understandable configuration that does not change over time. The software and configuration are both “baked” into the release, and you change the software not by updating the data in place and restarting, but by setting up a second copy of the software with the new details.

This idea provides a more predictable environment, and also interesting possibilities around rolling back to older versions of your software in case a new release proves to be “bad”, without requiring a full restart. (Even if the old software is available, sometimes it is costly and time-consuming to start it up fully, if it has to load caches, etc.)

In a smaller-scale environment, where there may be only a small number of running servers, issues with the new release may only be discovered after the rollout has completed. With traditional non-immutable releases, all the old versions of the software would already have been terminated.

There is one disadvantage to the immutable deploy approach, as currently implemented in our service. As each environment is set up as needed, the setup time depends on such factors as the time to spin up new EC2 instances and to download the appropriate Docker image. In comparison, other deployment tools, which require an existing pool of machines, do not need machines to be set up before the software can be installed. So we have traded lower latency for rollbacks against slightly higher latency for the initial software rollout.

Docker Images

For a microservices environment, it can be a useful concept to abstract the choice of tooling and language choice from the deployment system. This allows individual teams to have more autonomy when building software. While there are a number of options which make this possible — from OS-specific packaging platforms like RPM, to cloud provider-specific like Packer — we decided to use Docker to fill this purpose. Docker provides a lot of flexibility when choosing deployment environment and strategy, and fits our needs nicely.

Reinventing the wheel?

When we looked at the current options available for easily managing immutable web-server based infrastructure, the options seemed limited. Most software is currently optimised for updating software in-place, and not for the kind of updates we were interested in. We were also interested in managing the complete lifecycle for web server traffic, and no option we found seemed optimized for this task.

Another consideration was whether there was more value in requiring developers to learn and understand an existing deployment toolset, or whether they would be more productive when given a friction-free, simplified tool that could fulfil the request to deploy a particular version of software to production. We believe developers should be able to release software easily without knowing the intricacies of the underlying mechanisms (while keeping the possibility of customization, when required for advanced use cases).

Why not Puppet/Chef/etc.?

Many tools take a machine-centric view of the world, and configuration is done on a per-machine basis. Instead, we take a broader view of system state (e.g. we want four copies of this software serving traffic), and do not really care much about individual machines. We leverage higher-level concepts provided by AWS, such as replacement of machines which continually fail health checks, and addition of extra capacity dynamically when extra traffic comes to the HTTP endpoint.

Why not CodeDeploy?

CodeDeploy is an Amazon-provided software rollout management system; however, it does not meet our needs in supporting immutable infrastructure. Although it is very easy to put together scripts which release your software using CodeDeploy, you do need to set up your environment in advance. Also, CodeDeploy doesn’t provide out-of-the-box support for deploying Docker images.

Why not Plain-old Elastic Beanstalk?

Elastic Beanstalk offers the ability to create environments which allow the running of Docker images, and sets up a number of supporting systems, such as EC2 instances, a load balancer, and an auto-scaling group (which changes the number of servers based on load). It also allows access to log files, and a certain amount of management of these systems.

However, it has extremely limited support for the concept of “immutable deploys”, where multiple releases of a piece of software can provide traffic to the same user-visible endpoint, with traffic being moved gradually over time. The only support it has is for the ability to “swap CNAMEs”, which is a very coarse method for moving traffic — all the traffic moves to the new environment at once. Also, it has issues with reliability due to the nature of DNS; lookup results can be cached for a period of time after the DNS has changed, and also bad clients can ignore DNS TTL values, causing traffic to be sent to the old environment long after it should have moved.

Moreover, Elastic Beanstalk has no higher-level structure for understanding what is running in production behind an endpoint. The question “what is running, and what is its configuration” is not easy to answer, and requires some sort of system on top of it.

We decided to leverage Elastic Beanstalk as a useful method for deploying Docker software, but layer appropriate management and control layers on top of this to provide a complete workflow for our users.

Introducing ION-Roller

ION-Roller is a service (API, web app and CLI tool) that leverages Amazon’s Elastic Beanstalk and underlying CloudFormation framework capabilities to deploy Docker images to EC2 instances.

Getting started with ION-Roller - the little things you need to do

All you need to start deploying software is:

AWS account with permissions granted to ION-Roller to spin up instances and access resources on your behalf (if your organization runs software across multiple AWS accounts, ION-Roller can offer a single view of deployments across all those accounts, given sufficient permissions).
Docker image in a Docker registry (hub.docker.com or private Docker registry of your choice)
Deployment specification for your software, including HTTP endpoint, number and type of EC2 instances, runtime arguments for your Docker image, environment variables, security settings etc.; these are supplied as part of your service configuration, and posted to ION-Roller through its REST API.

For details on the installation of ION-Roller into your AWS account and full documentation, stay tuned for open sourcing of https://github.com/gilt/ionroller.

Deploying Software with ION-Roller

You can use the supplied command line tool to easily trigger your deployments:

ionroller release <SERVICE_NAME> <VERSION>

The tool provides ongoing feedback on the progress of the release:

[INFO] NewRollout(ReleaseVersion(0.0.17))
[INFO] Deployment started.
[INFO] Added environment: e-k3bybwxy2f
[INFO] createEnvironment is starting.
[INFO] Using elasticbeanstalk-us-east-1-830967614603 as Amazon S3 storage bucket for environment data.
[INFO] Waiting for environment to become healthy.
[INFO] Created security group named: sg-682b430c
[INFO] Created load balancer named: awseb-e-k-AWSEBLoa-A4GOD7JFELTF
 …

Alternatively, you can trigger a deployment programmatically, since ION-Roller supplies a REST API for full control of your configurations and releases.

Under the hood, ION-Roller triggers an Elastic Beanstalk deployment process with all its goodness — including the creation of a load balancer, security and autoscaling groups, setting up CloudWatch monitoring and pulling the specified Docker image from a Docker registry.

Traffic Redirection

Once it detects a successful deployment, ION-Roller safely moves traffic from the old version to the new version of your service over time.

The traffic redirection happens by altering the set of EC2 instances which provide responses to HTTP requests through a load balancer. As the rollout proceeds, instances from the new deployment are added to the load balancer configuration, and instances from the old deployment are removed. The time taken for this to happen is configurable.
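The membership changes described above can be sketched as a sequence of load balancer instance lists. This is a simplified illustration under assumptions: the instance identifiers are placeholders, `batch` is a hypothetical parameter controlling how many instances change per step, and the real pacing and health checks are not modelled.

```python
from __future__ import annotations


def shift_plan(old: list[str], new: list[str], batch: int) -> list[list[str]]:
    """Return successive load balancer memberships while moving from old to new.

    Each step registers up to `batch` new instances and deregisters up to
    `batch` old ones.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    members = list(old)
    remaining_old = list(old)
    pending_new = list(new)
    steps = []
    while pending_new or remaining_old:
        added, pending_new = pending_new[:batch], pending_new[batch:]
        removed, remaining_old = remaining_old[:batch], remaining_old[batch:]
        members = [m for m in members if m not in removed] + added
        steps.append(list(members))
    return steps
```

With two old instances `a` and `b`, two new instances `c` and `d`, and a batch size of one, the memberships are `[b, c]` and then `[c, d]`. Spacing the steps out in time is what makes the redirection gradual.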

Gradual traffic redirection allows you to monitor the latest release, detect failures quickly and roll back when necessary.

Software Rollback

As the old environment is still available during the rollout and the software is still running, we can safely roll back to the older version of the software at any point. Unused old instances are removed after a configurable period of time. Due to this delay, we can still roll back for a period of time after the rollout has completed.
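The rollback window that this delay creates can be expressed as a simple time comparison. The sketch below is illustrative only; `retention` stands in for the configurable removal delay and is a hypothetical name, not an ION-Roller setting.

```python
from datetime import datetime, timedelta


def hot_rollback_available(rollout_completed_at: datetime,
                           retention: timedelta,
                           now: datetime) -> bool:
    """True while the old instances are still expected to be running."""
    return now < rollout_completed_at + retention
```

With a one-hour retention, a rollback requested 30 minutes after the rollout completed can reuse the old instances; one requested after the hour has passed would need a fresh deployment of the previous version.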

Our goal is to continuously monitor the health of the endpoint, and automatically revert to the old version if issues are detected; we will use Amazon’s CloudWatch alarms to signal that a rollback should be performed.

Performing a manual rollback is as easy as running:

ionroller release <SERVICE_NAME> <PREVIOUS_VERSION>

If the old instance is still available, it only takes a few seconds to move the traffic back to the previous version. This assumes, of course, that you have not done anything to make the old software break (like updating datastore schemas, etc.). This is important to bear in mind if you depend on the ability to roll back your software, no matter what the deployment system is.

Canary releases and testing in production

ION-Roller supports the concept of canary releases via configuration of the traffic migration process. After a new version is deployed to the initial set of instances, the process stops, allowing for release testing against production traffic. Rollout will continue after a configurable period of time.

Testing in Production

For use cases where you wish to test (or demo) your new software without sending production traffic to it, we want ION-Roller to allow a separate HTTP endpoint to be set up; this would then be available to handle requests before the production endpoint is updated.

Keeping an eye on things - ION-Trail

ION-Roller tracks the environment and software behind an endpoint, together with all their changes over time. As it monitors or makes changes in the environment, it records events related to the environment lifecycle or deployment activities. These can be used for auditing, monitoring, and reporting purposes. ION-Trail is a supporting service that provides an event feed for all recorded deployment activities.
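As a rough picture of what an event feed of this kind involves, here is a minimal in-memory sketch. The class names, fields and event kinds are all hypothetical; ION-Trail's actual data model and API are not described in this article.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeploymentEvent:
    """Hypothetical record of one deployment-related change."""
    timestamp: datetime
    service: str
    version: str
    kind: str  # e.g. "rollout_started" (illustrative name only)


class EventLog:
    """Append-only collection of deployment events."""

    def __init__(self):
        self._events = []

    def record(self, service, version, kind, timestamp=None):
        """Store an event, stamping it with the current UTC time if none is given."""
        event = DeploymentEvent(timestamp or datetime.now(timezone.utc), service, version, kind)
        self._events.append(event)
        return event

    def feed(self, service):
        """Return the events for one service, oldest first."""
        return sorted((e for e in self._events if e.service == service),
                      key=lambda e: e.timestamp)
```

Keeping events immutable and ordered by time is what makes such a log usable for auditing: the history of an endpoint can be replayed without depending on the current state of the environment.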

Conclusion - Decentralizing DevOps

We hope this article has shared a little of how we think about deploying a microservices architecture to AWS. ION-Roller lets us decentralize our DevOps organisation, and puts declarative deployment in the hands of the engineer. ION-Roller gives us phased rollout and hot rollback, and supports features like 'canary' and 'testing in production'. To learn more, keep an eye on the Gilt Tech blog (http://tech.gilt.com) for the upcoming announcement of the public release of ION-Roller as open source.

About the Authors

Natalia Bartol worked on improving developer tools as Eclipse Support Engineer at IBM and Eclipse Developer/Team Lead at Zend Technologies. As a software engineer at Gilt she focuses on deployments of microservices and increasing developer productivity. Natalia holds an MSc in Software Engineering from Poznan University of Technology.

Gary Coady is a senior software engineer with Gilt Groupe, where he automates deployments, writes build tools, and evangelises Scala. He is also a former Google Site Reliability Engineer, where he spent significant time in the trenches, managing and troubleshooting operational issues at scale. Gary holds a BA (Mod) Computer Science from Trinity College, Dublin.

Adrian Trenaman works as SVP Engineering at gilt.com in Dublin, Ireland. He holds a Ph.D. in Computer Science from the National University of Ireland, Maynooth, a Diploma in Business Development from the Irish Management Institute, and a BA (Mod. Hons) Computer Science from Trinity College, Dublin.
