Scaling Microservices at Gilt with Scala, Docker and AWS
At Craft Conference 2015, Adrian Trenaman discussed the evolution of the Gilt.com architecture from a monolithic Ruby on Rails application to a cloud-based microservice 'lots of small applications' platform utilising Scala, Docker and AWS. Trenaman shared both technical and organisational lessons learnt from the past eight years, as Gilt has grown from a startup to a $1B company.
Trenaman, VP of engineering at Gilt.com, began the talk by introducing the core business and corresponding technology deployed within Gilt Groupe. Gilt.com is an online shopping website based in the United States, which specialises in flash-sales of luxury brands and lifestyle goods. The nature of a flash-sale means that traffic to the website spikes massively fifteen minutes before the sale starts, and then rapidly reduces over the next two hours before returning to a low baseline. As a result of this traffic pattern, the cost of application failure depends greatly on the time of day at which a problem occurs.
Our customers are like a herd of bison that basically stampede the site every day at 12pm. It's our own self-imposed denial of service attack, every day...
The Gilt.com website was originally built in 2007 as a monolithic Ruby on Rails application with a PostgreSQL database. As traffic increased, a memcached caching layer was added, and certain business functionality conducted within the site was moved to a series of batch-processed jobs. Over the next four years the increasing traffic began stressing the original architecture, and due to the monolithic nature of the application any crash caused a complete failure of the website and supporting business applications.
In 2011 the Java programming language and Java Virtual Machine (JVM) were introduced into the application stack, and services based around business functionality began to be extracted from the original monolith. Trenaman stated that the reliance on the original single database was not removed during this time, as there was always something else with a higher return on investment to work on. However, many of the small services maintained a local read-only copy of data from the primary database, and a 'cart' service was created with its own Voldemort-based data store.
Trenaman described the architecture at Gilt during 2011 as consisting of 'large, loosely-typed JSON/HTTP services', with data being exchanged across service boundaries as a coarse-grained key/value map. As the company was innovating at an incredible pace, the development team also unintentionally created a new Java-based monolith in the 'Swift' view service, which became a bottleneck for innovation. The architecture resulted in a codebase in which 'some parts people cared about, and some they did not'.
Trenaman discussed how the Gilt technical leadership decided in 2011 to re-arrange teams around strategic initiatives (the so-called inverse Conway manoeuvre), with the primary goal of making it fast and easy to get code into production. Although there was no explicit architect role, a microservice-based 'lots of small applications (LOSA)' architecture emerged, primarily driven by Gilt's engineering culture and values. Goals and key performance indicators (KPIs) were set for each team working on an initiative, and many initiatives were started (resulting in the creation of ~156 microservices by 2015).
The growth in the number of microservices accelerated when Scala, running on the JVM, was introduced into the technical stack. Trenaman stated that an average service at Gilt consists of 2000 lines of code and 5 source files, and is run on 3 instances in production. Between 2011 and 2015, Gilt also decided to 'lift and shift' the legacy application stack to the Amazon Web Services (AWS) cloud, and began deploying new microservices to this platform. Trenaman noted that the vast majority of the services running at Gilt are currently executing on AWS EC2 t2.micro instances, which offer relatively little compute power, but do provide 'burstable performance'.
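To illustrate what 'burstable performance' can mean for a flash-sale traffic pattern, the following is a minimal sketch of the T2 CPU credit model. It is not taken from the talk: the figures for t2.micro (6 credits earned per hour, a 144-credit cap, a 10% baseline) are assumed from AWS's published T2 documentation, and the function name and traffic profile are hypothetical.

```python
# Assumed t2.micro figures from AWS's published T2 documentation.
CREDITS_PER_HOUR = 6   # CPU credits earned per hour
MAX_BALANCE = 144      # cap on accrued credits
BASELINE_UTIL = 0.10   # baseline share of one vCPU (equals CREDITS_PER_HOUR / 60)


def simulate_credits(utilisation_per_minute, start_balance=0.0):
    """Return the CPU credit balance after each minute.

    One CPU credit is one vCPU at 100% for one minute, so a minute at
    utilisation u spends u credits while CREDITS_PER_HOUR / 60 is earned.
    This simplified model only clamps the balance; it does not model the
    throttling to baseline that applies once credits are exhausted.
    """
    earn_rate = CREDITS_PER_HOUR / 60.0
    balance = start_balance
    history = []
    for u in utilisation_per_minute:
        balance = balance + earn_rate - u
        balance = max(0.0, min(MAX_BALANCE, balance))
        history.append(balance)
    return history


# Hypothetical day: ten quiet hours at 2% CPU, then a two-hour spike at 80%.
quiet = [0.02] * 600
spike = [0.80] * 120
balances = simulate_credits(quiet + spike)
print(f"credits before spike: {balances[599]:.1f}")
print(f"credits after spike:  {balances[-1]:.1f}")
```

Under these assumptions an instance banks credits during the long quiet period and spends them during the spike, which is why a small instance type can suit a service that idles for most of the day.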
Trenaman stated that Gilt are very positive about the microservice architecture, as it has given their organisation the following benefits:
- Lessens dependencies between teams - resulting in faster code to production
- Allows lots of initiatives to run in parallel
- Supports multiple technologies/languages/frameworks
- Enables graceful degradation of service
- Promotes ease of innovation through 'disposable code' - it is easy to fail and move on
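Graceful degradation, as listed above, can be sketched as a page that still renders when a non-critical downstream service fails. This is an illustrative example only and not Gilt's implementation; the service, function names and product identifiers are hypothetical.

```python
def with_fallback(call, fallback):
    """Invoke a dependency and return `fallback` if it raises."""
    try:
        return call()
    except Exception:
        return fallback


def fetch_recommendations(user_id):
    # Hypothetical non-critical service, simulated here as being down.
    raise TimeoutError("recommendation service unavailable")


def render_sale_page(user_id):
    # The core sale listing does not depend on the recommendation call,
    # so a failure there degrades the page instead of breaking it.
    recommendations = with_fallback(
        lambda: fetch_recommendations(user_id), fallback=[]
    )
    return {
        "user": user_id,
        "products": ["sale-item-1", "sale-item-2"],
        "recommendations": recommendations,
    }


page = render_sale_page("user-42")
print(page["products"], page["recommendations"])
```

In a monolith the same failure could take down the whole request; with separate services the caller can choose a sensible default per dependency.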
Trenaman was also keen to state that there have been a series of challenges with implementing the microservice-based LOSA architecture:
- Maintaining multiple staging environments across multiple teams and services is hard - Gilt believe that testing in production is the best solution, for example, using 'dark canaries'
- Defining ownership of services is difficult - Gilt have chosen for teams and departments to own and maintain their services
- Deployment should be automated - Gilt are building tooling using Docker and AWS (some of which will be open sourced soon)
- Lightweight APIs must be defined - Gilt have standardised on REST-style APIs, and are developing 'apidoc', which they are labelling as 'an AVRO for REST'
- Staying compliant while giving engineers full autonomy in production is challenging - Gilt have developed 'really smart alerting' within their 'continuous audit vault enterprise (CAVE)' application
- Managing the I/O explosion requires effort - some inter-service calls may be redundant, and this is still a concern for the Gilt technical team. For example, loops are not currently automatically detected.
- Reporting over multiple service databases is difficult - Gilt are working on using real-time event queues to feed events into a data lake. This is currently implemented using Amazon's Kinesis and S3 services.
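The point about call loops not being detected automatically can be made concrete with a small sketch. Assuming a call graph could be assembled (for example from request tracing, which the talk does not describe), a depth-first search finds a loop between services. Apart from 'cart', the service names are hypothetical, and this is not a description of any Gilt tooling.

```python
def find_call_loop(calls):
    """Return one loop in a service call graph as a list of names, or None.

    `calls` maps each service name to the services it calls directly.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # services on the current depth-first search path

    def visit(service):
        state[service] = IN_PROGRESS
        path.append(service)
        for callee in calls.get(service, ()):
            callee_state = state.get(callee, UNSEEN)
            if callee_state == IN_PROGRESS:
                # The callee is already on the current path: a loop.
                return path[path.index(callee):] + [callee]
            if callee_state == UNSEEN:
                loop = visit(callee)
                if loop:
                    return loop
        path.pop()
        state[service] = DONE
        return None

    for service in list(calls):
        if state.get(service, UNSEEN) == UNSEEN:
            loop = visit(service)
            if loop:
                return loop
    return None


# Hypothetical call graph in which three services call each other in a ring.
call_graph = {
    "web": ["cart", "product"],
    "cart": ["inventory"],
    "inventory": ["pricing"],
    "pricing": ["cart"],
    "product": [],
}
print(find_call_loop(call_graph))
```

The hard part in practice is likely to be collecting an accurate call graph across many services rather than the search itself.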
Additional information on Adrian Trenaman's talk 'Scaling Microservices at Gilt' can be found on the CraftConf website. Further details of many of the Gilt technologies mentioned above can be found on the Gilt Tech blog.