At QCon New York, Niko Kurtti presented "Forced Evolution: Shopify's Journey to Kubernetes", and described the Shopify engineering team's journey to building their own PaaS with Kubernetes as the foundation. Key takeaways for other teams looking to build their own PaaS and associated developer workflow included: target hitting 80% of deployment and operational use cases; create patterns and hide the underlying platform complexity; educate and get people excited about the project; and be conscious of vendor lock-in.
Kurtti, a production engineer at Shopify, began the talk by describing Shopify as a rapidly growing Canadian e-commerce company that offers a proprietary e-commerce platform for online stores and retail point-of-sale systems. Shopify currently has 3000+ employees, and the company processed $26 billion in transactions in 2017. The underlying e-commerce software platform sees 80k+ requests per second during peak demand.
At the start of 2016 the engineering team was "running services everywhere", including within their own data centers (using Chef and Docker), on AWS (using Chef) and Heroku. Developers liked the developer experience of Heroku, and Kurtti commented that this platform actually scales quite well, with "simple UI sliders" to increase the number of instances and associated CPU and RAM. Although the platform team had defined service tiers and appropriate Service Level Objectives (SLOs) based on criticality to the business, there were many processes that were not scalable, and accordingly these presented challenges as the company grew.
Kurtti continued by stating that manual or artisanal processes clearly did not scale well, and neither did slow processes that make people wait. Challenges were encountered with "rusty knobs that don't work when needed" within the platform and deployment operations, and also with processes that did not work the first time or reliably. Accordingly, the Shopify team recognised that they needed to increase their focus on tested infrastructure, and automation that works as expected, every time. Also critical to the ability to scale was giving developers the ability to safely self-serve in a consistent manner across the infrastructure/platform, and providing comprehensive training to enable them to become experts in the systems they operate. Alongside these new initiatives the organisation had also decided to embrace cloud computing, and was keen to promote migration to its chosen cloud vendor, Google Cloud Platform (GCP).
The Shopify engineering team recognised that they were effectively building an internal Platform-as-a-Service (PaaS), and so decided that three principles were key to its success: providing a "paved road" -- operating a platform that would by default meet a high percentage of use cases within Shopify, but also allow customisation if required (the Netflix engineering team discussed a similar concept at last year's QCon New York); complexity should be hidden -- there are advantages to knowing about the underlying platform, but many developers do not want to be exposed to all of the details of the internals; and self-service is a priority -- developers should not be bottlenecked by waiting for centralised operations or platform teams.
After analysis and experimentation the Shopify team chose to build their PaaS on top of the Kubernetes container scheduler and orchestrator. Kubernetes had the best traction of the open source projects within this space, it was platform agnostic, it could be extended via the APIs exposed, and it was also offered as a service in GCP -- Google Kubernetes Engine (GKE) -- which allowed the team to focus on the value-adding components they could provide on top of this "strong foundation".
Kurtti stated that the four building blocks of running an application on the Shopify PaaS were: how to specify an application's runtime; how to build an application; how to deploy an application; and how to set up dependencies. Accordingly, the engineering team created the "Services DB" and "Groundcontrol" tools. The services database provided an interactive web UI for developers that included a catalogue of existing applications and a mechanism for the automated generation of associated Kubernetes manifests, alongside build and continuous integration configuration. The Groundcontrol system was a Golang-based application, deployed on the Kubernetes clusters, that created namespaces and encryption keys and managed service accounts. An example of the web UI is shown below, which demonstrates the flow from a developer initialising a project to it automatically being deployed within a test namespace on a Kubernetes cluster.
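To make the idea of automated manifest generation more concrete, the following is a minimal Python sketch. It is not Shopify's implementation, which was not shown in detail: the function name `build_manifests`, its parameters, the default values, and the choice of one namespace per application are all assumptions made for illustration. Only the manifest fields themselves (`apiVersion`, `kind`, `metadata`, `spec`) follow the standard Kubernetes Namespace and Deployment schemas.

```python
import json


def build_manifests(name, image, replicas=2, port=8080):
    """Return a Namespace and a Deployment manifest as plain dicts.

    Hypothetical helper: a developer supplies only a service name and an
    image, and the platform fills in the rest with defaults.
    """
    labels = {"app": name}
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name},
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }
    return [namespace, deployment]


if __name__ == "__main__":
    # JSON is accepted by the Kubernetes API in place of YAML.
    for manifest in build_manifests("checkout", "registry.example.com/checkout:1.0"):
        print(json.dumps(manifest, indent=2))
```

Keeping the generator as a pure function from a small service description to manifests is one way to hide platform complexity: developers see two or three inputs, while the platform team controls every default in one place.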
Additional tooling was created, including PIPA, an agent that builds Docker images, and Buildkite, which acts as a coordinator for PIPA. The tools combine to provide a "Herokuish" workflow by default, and can also be used to specify bespoke Dockerfiles or create custom pipelines. Kubernetes-deploy was also created (and released as open source), which is a "command line tool that helps you ship changes to a Kubernetes namespace and understand the result". The tool is pluggable and provides a simple pass/fail result on deploys. It also configures ConfigMaps and Secrets, and protects Kubernetes namespaces. All of these tools integrated with Shopify's open source "Shipit" deployment tool that is used extensively internally within the company.
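The "simple pass/fail result" idea can be illustrated with a short Python sketch. This is not the logic of Kubernetes-deploy itself; it is an assumed, simplified check that a Deployment has finished rolling out, based on standard Deployment fields (`metadata.generation`, `spec.replicas`, `status.observedGeneration`, `status.updatedReplicas`, `status.availableReplicas`). The function names `rollout_complete` and `deploy_result` are hypothetical.

```python
def rollout_complete(deployment):
    """Return True if the Deployment object reports a finished rollout.

    `deployment` is a dict shaped like the JSON the Kubernetes API
    returns for a Deployment.
    """
    wanted = deployment["spec"].get("replicas", 1)
    generation = deployment["metadata"].get("generation", 0)
    status = deployment.get("status", {})
    return (
        # The controller has seen the latest spec...
        status.get("observedGeneration", 0) >= generation
        # ...every replica runs the new pod template...
        and status.get("updatedReplicas", 0) == wanted
        # ...and every replica is available to serve traffic.
        and status.get("availableReplicas", 0) == wanted
    )


def deploy_result(deployments):
    """Collapse per-resource status into one pass/fail verdict."""
    failed = [d["metadata"]["name"] for d in deployments if not rollout_complete(d)]
    return ("fail", failed) if failed else ("pass", [])


if __name__ == "__main__":
    web = {
        "metadata": {"name": "web", "generation": 4},
        "spec": {"replicas": 3},
        "status": {"observedGeneration": 4, "updatedReplicas": 3, "availableReplicas": 3},
    }
    worker = {
        "metadata": {"name": "worker", "generation": 2},
        "spec": {"replicas": 2},
        "status": {"observedGeneration": 2, "updatedReplicas": 2, "availableReplicas": 1},
    }
    print(deploy_result([web, worker]))
```

A real tool would poll these fields until a timeout expires and would also inspect other resource types; the sketch covers only the final verdict for a single snapshot.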
The platform team has invested heavily in the creation of "cloudbuddies", which are effectively custom extensions on Kubernetes in much the same style as CoreOS's Operator pattern. Cloudbuddies extend the Kubernetes API and manage processes such as creating DNS records, configuring cluster/user quotas, and setting security rules. The cloudbuddies have been highly influential on the success of the new platform, and Kurtti discussed how extending Kubernetes has been generally a good experience: the Kubernetes APIs are well-documented ("if not super stable") and the Golang client libraries are high-quality; current concepts (like Deployments, Endpoints) can be extended, and custom entities can be added (using Custom Resource Definitions); distributed systems primitives are provided; and as the extensions are written in pure Golang they can be unit tested, run, and deployed as normal applications.
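The core of an Operator-style extension is a reconcile loop: compare the desired state declared in the cluster with the actual state of an external system, then issue the actions needed to close the gap. The sketch below illustrates that loop for DNS records in plain Python with in-memory dicts. It is an assumption-laden illustration, not cloudbuddy code: the real extensions are written in Golang against the Kubernetes API, and the names `reconcile` and `apply_actions`, the hostnames, and the addresses are all hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions that would make `actual` match `desired`.

    Both arguments map hostname -> target address. `desired` stands in
    for records declared via custom resources; `actual` stands in for
    what the DNS provider currently serves.
    """
    actions = []
    for host, target in desired.items():
        if host not in actual:
            actions.append(("create", host, target))
        elif actual[host] != target:
            actions.append(("update", host, target))
    for host in actual:
        if host not in desired:
            actions.append(("delete", host, None))
    return actions


def apply_actions(actions, actual):
    """Apply the actions to the in-memory stand-in for the DNS provider."""
    for verb, host, target in actions:
        if verb == "delete":
            actual.pop(host, None)
        else:
            actual[host] = target
    return actual


if __name__ == "__main__":
    desired = {"shop.example.com": "10.0.0.1", "api.example.com": "10.0.0.2"}
    actual = {"shop.example.com": "10.0.0.9", "old.example.com": "10.0.0.3"}
    plan = reconcile(desired, actual)
    print(plan)
    apply_actions(plan, actual)
    # A second pass finds nothing to do: reconciliation is idempotent.
    print(reconcile(desired, actual))
```

Because the loop is a pure comparison of two states, it can be unit tested without a cluster, which matches the testability benefit Kurtti described for the Golang extensions.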
The platform team does not currently expose the Kubernetes control plane to developers; instead of using tooling like kubectl, developers deploy and operate applications via the provided web UI. This UI provides the majority of the functionality that kubectl does, and in the future kubectl itself may be exposed to "power users" within the development team. Extensive documentation has been created, which focuses on "how to drive the car" rather than "how to build the car" -- meaning that the developer experience and operation have taken priority over explaining the technical foundations of the platform. Kurtti also praised the efforts of his Shopify teammate Jenna Black, who provides on-call cloud help during the working day, and stated that working alongside people who have specialised expertise in (and value the importance of) the support function is extremely beneficial.
The talk concluded with a discussion of the new platform's "report card". The development team has extensively praised the platform, in particular focusing on the ease of getting an application running on the cloud platform and the small amount of time required to do so. However, challenges still remain. The platform team is currently focusing on providing insight into how everything works for the developers, and also addressing common development issues such as scaling and debugging. The platform Site Reliability Engineering (SRE) team itself has had to embrace giving up control of the underlying infrastructure (with regard to the use of the fully-managed GKE), and providing a single platform to meet all common use cases has been challenging.
For engineers looking to build their own, Kurtti and the Shopify team provided several key takeaways: target hitting 80% of use cases; create patterns and hide complexity; educate and get people excited about the project; and be conscious of vendor lock-in.
Additional details about the talk can be found on the QCon NY website, and the slides (PDF) can be found via the schedule page. The videos for the majority of the QCon NY talks will be made available via InfoQ over the coming months.