Q&A with Stuart Davidson on Scaling Continuous Delivery at Skyscanner

Stuart Davidson, an engineering manager at Skyscanner, spoke at QCon London 2018 about his organisation's journey from a reactive operations model to providing teams with an empowering developer experience. Davidson told the story of how, with support and a lofty goal from CTO Bryan Dove, Skyscanner set out to build the capacity to continuously deliver applications ten thousand times a day, across an organisation which already had 600 technologists worldwide and was continuing to grow. This took the form of an iterative path with small increments and cultural change, ultimately resulting in a containerised platform owned by empowered teams.

Davidson, who heads up Skyscanner's Development Mechanics and Deployment & Orchestration teams, shared that he was originally part of a small group fully occupied with reactively delivering minor increments, such as specific CI enhancements in response to requests from development teams. Recognising the strategic benefit that a robust deployment pipeline would bring, the company's leadership set a goal to create a platform capable of deploying 10,000 changes a day. Davidson described this as an awakening, as they realised that the platform they were designing would never have scaled to this capacity.

Davidson explained that his team discovered they were "a strategic enabler" for the business by recognising they were, in fact, a "strategic roadblock". The team gated provisioning of the CI pipeline "which all products had to go through." Davidson calls this the "Jenkins Paradox," which he described as the tension between providing repeatable builds, infrastructure and robustness while at the same time encouraging exploration and innovation with new and unfamiliar tools.

Their solution to these challenges was to move their CI infrastructure to Drone, a container-native tool, which managed the layering of build and CI environments provided by development squads. This gave the squads greater ownership of their build, test and runtime infrastructure. He explained that making teams responsible for their own repeatable pipelines proved popular with the squads and resulted in upskilling:

The adoption was insane because the engineers saw the benefits as well. These semi-autonomous squads could build the pipelines they wanted. They loved it...and we'd inadvertently trained every squad in containers. There was at least one person in every squad who knew how to manage their Dockerfile, as they needed it to control their build environment. So we thought, let's take it to production.

Skyscanner started their production journey by using AWS's ECS container service as it was the "cheapest, easiest, most accessible container scheduler." While they are now migrating to managed Kubernetes, Davidson doubted that choosing Kubernetes at the outset would have been as successful, given that they were already an AWS shop. He points out that technical problems can be deferred and that teams should make "solutions as simple as possible if taking an iterative step." Speaking on how they iteratively incremented, he said:

Don't invest too much. It's a hypothesis. Don't get into a position where you are investing six months worth of effort. Try and find something that's quick and easy. Learn from that and then try and find what's important to you in the next step.

Davidson shared that experimenting with tools should be balanced against the cultural shifts required to achieve CD, saying:

If you want to try some of these tools, look for one that is robust and will run, and you won't have to worry about the operations of it. Because you will have way, way more problems. You'll have the cultural shift.

Skyscanner's deployment solution evolved rapidly to one where teams were able to do blue/green deploys with integrated monitoring and observability. Davidson told the audience that they were able to "take an idea and put it into production, and have it monitored and alerted in 30 minutes." He pointed out that the steps they took to get there were "just small iterations on the idea that we already had."

Davidson also spoke about how performing small deploys gave teams safety by reducing risk, even though they had concerns about their testing:

Every change going into GitLab was being deployed continuously in a blue/green fashion. This was scary, as our testing was good, but maybe not that good. With continuous deployment we did find, if you look at the risk equation, the change which was being deployed was very, very, very small...So if there was a problem it was easy to find where the problem stemmed from.

Davidson spoke of how they added a canary-like pause to their blue/green deploys, using StackOverflow's open source Bosun monitoring tool to execute queries against a time series database holding both system and business metrics in order to assess the health of a release. Davidson explained that squads could define programmatic pauses of the rollout at specific percentages to analyse the metrics against acceptance patterns; if this analysis failed, the deployment was automatically rolled back. Davidson said:

We can also query (OpenTSDB) to see if our sales of flights have gone down. In fact, we would prefer to rollback if that suddenly takes a plummet. It could be that something good came on the television, people stopped looking at Skyscanner and we rolled-back accidentally, but that's OK. We want to be safe with this sort of thing.
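
The pause-and-check rollout described above can be illustrated with a minimal Python sketch. This is not Skyscanner's Slingshot implementation: the names `staged_rollout`, `sales_drop_check`, `shift_traffic` and the 20% drop threshold are all hypothetical, and the metric source is a stand-in for whatever query a squad would run against its time series database.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool        # True if every pause point passed its check
    reached_percent: int   # last traffic percentage that passed its check


def staged_rollout(
    pause_points: Sequence[int],
    shift_traffic: Callable[[int], None],  # hypothetical hook into the load balancer
    is_healthy: Callable[[], bool],        # hypothetical metric check run at each pause
) -> RolloutResult:
    """Shift traffic to the new version in steps; roll back on the first failed check."""
    reached = 0
    for percent in pause_points:
        shift_traffic(percent)
        if not is_healthy():
            shift_traffic(0)  # roll back: send all traffic to the old version
            return RolloutResult(completed=False, reached_percent=reached)
        reached = percent
    return RolloutResult(completed=True, reached_percent=reached)


def sales_drop_check(
    baseline: float,
    current: Callable[[], float],  # stand-in for a business-metric query
    max_drop: float = 0.2,         # assumed threshold, not from the talk
) -> Callable[[], bool]:
    """Build a check that fails when the metric falls more than max_drop below baseline."""
    def check() -> bool:
        return current() >= baseline * (1 - max_drop)
    return check


if __name__ == "__main__":
    traffic_log = []
    samples = iter([100.0, 98.0, 60.0])  # simulated sales readings per pause
    check = sales_drop_check(baseline=100.0, current=lambda: next(samples))
    print(staged_rollout([10, 50, 100], traffic_log.append, check), traffic_log)
```

In this simulated run the third reading falls below the threshold, so traffic is shifted back to 0% even though the cause might be unrelated to the release, which mirrors the "better safe" trade-off Davidson describes.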

At the time of presenting, Davidson showed that in the previous month Skyscanner had deployed 456 distinct services a total of 3,733 times across multiple regions. He described his teams' goal as empowering squads to focus on delivering value to the traveller:

We aim to be a force multiplier. For every engineer who works in Skyscanner we try to enable them to do more in a day. We try and do as much of the heavy lifting as we can, so an engineer can get their source into production as quickly and reliably as possible. They can then focus on the product and features we give to the traveller.

InfoQ spoke with Davidson to learn more about this journey to scaled CD.

InfoQ: What form did management support take after the initial challenge to be able to scale to 10K deploys a day?

Stuart Davidson: Bryan lived up to his word and mentioned [our deployment tool] Slingshot as often as he could - we even got to present it to the board.

My manager (Ryan Crawford) and my Engineering Lead (Paul Gillespie) did a ton of work going round the engineering leadership and getting feedback, challenging perceptions and gathering a group of influencers that would help sell it to the rest of the company.

But it wasn't just top-down, we had tremendous support from an ambitious Tribe Engineering Lead called Dave Garcia who could see the benefit of what we were doing and was adamant his Direct Booking team became early adopters of the product. They were always giving us feedback and helping us shape what we worked on next.

We also ended up with a really busy Slack channel where engineers would help each other out with questions and trade examples of how to work with the system - it became almost viral as squads adopted the system for new pieces of work then porting their existing systems. The positive attitude and tenacity of engineers across the business meant it was a real success.

InfoQ: How has Skyscanner's product development and business agility responded to being able to rapidly deliver small changes?

Davidson: It's interesting actually, there's been small changes certainly and a real focus towards data-driven experimentation in the business. Those small changes, however, have given us confidence to make much larger changes in how Skyscanner operates.

We've quickly spun up new verticals in specific regions to see what sort of uptake we'd have. We've created content pages or pages focused on specific events such as the Champions League or the cricket - these no longer take 6 months of planning and requisitioning of hardware.

A first draft of an idea can genuinely be in production within an afternoon then teams can quickly and safely iterate on what they've created.

InfoQ: One of your key lessons was realising that empowering teams with capabilities involved empowering them to help themselves. How did this realisation come about in practice?

Davidson: This was actually easier than it sounds within Skyscanner as our engineering organisation promotes the autonomy of squads - it really removed blockers for inquisitive early adopters to give it a try and provide us with feedback. This does have a side effect of making some squads difficult to move at the end, but by then the user experience of migration is normally well documented or potentially even automated.

For example, we did do quite a bit of work at the start on a tutorial that explained some concepts then went from start-to-finish leaving the engineer with a fully working production service at the end. That experience tended to get people on-board really quickly and it had a bit of a "wow" factor that got people engaged and talking to others in the business.

Slides and a video recording of Davidson's talk will be made available on InfoQ over the coming months.
