
DOES London: ITV Autoscaling for Love Island


Tom Clark from ITV, a UK-based commercial producer and broadcaster, gave his fourth talk at DevOps Enterprise Summit London, titled 'Better, Faster, Cheaper, Happier', building on the evolutionary story of the common infrastructure platform for which he is accountable.

ITV produces ten thousand hours of content per annum, generates three billion pounds in revenue, and employs six thousand staff, of whom three hundred are in the technology teams. The common platform's genesis was in 2015, when the organisation was tackling a modernisation programme. Clark said:

A load of the things that made ITV, ITV, our lifeblood, were stuck, they were calcified: airtime sales was impossible and dangerous to change, changes to content sales took eight weeks if you were lucky, and talent payments was running on a virtualised ICL mainframe running Cobol and availability of both parts and skills were dwindling.

The modernisation programme drove a number of transitions: from project to product, waterfall to agile, on premise infrastructure to cloud, ETL to API, manual to automated, bespoke to standard, monolith to microservice, shared to common, proprietary to open source, slow to fast, big to small, CAPEX to OPEX, Dev & Ops to DevOps, centralised to devolved, command and control to trust and measure, MTBF to MTTR, outsourced to insourced, and infrequent to frequent. Clark explained that in creating lots of new product teams who would be forming and storming, they were concerned that they would have lots of decisions to make:

We worried that it would be the wild west and worried they would unnecessarily reinvent the wheel, and reinvent boring bits of plumbing that they shouldn't really have to think about: logging, monitoring, CI and CD. If everything was to be unique and artisanal and bespoke, then the complex, multiple team interactions would add massive overhead to the organisation.

The idea behind the common platform was to take care of the plumbing so the teams didn't have to; infrastructure, logging, metrics, audit and security compliance were to be handled separately so that the teams had more time to deliver business value for the organisation. Clark explained that the people on the common platform team have to be smart (high IQ) to keep up with the rate at which technology changes, and also kind (high EQ); in his view, platform engineers sit at the intersection of these two qualities.

There are two types of engineers in the team. The first type, the core platform engineers, curate and develop the standard patterns for the other teams in tools such as Puppet and Terraform, incubate new hires, and act as "professional rubber ducks" providing "second opinion-as-a-service". They can be parachuted in if there is an emergency resource constraint in another area, and provide research and development, making sure the platform is 'evergreen'.

The field platform engineers are embedded in the product development teams along with a product owner, a delivery manager and the developers. They have an operations responsibility, coaching the teams; a lot of the developers hadn't seen production before and were nervous about it. They provide "force multiplication", and are responsible for quality influence in terms of non-functional tests.

There were several problems with the initial common platform team. Primarily, there were lonely platform engineers out on their own with the product development teams. By selecting best-of-breed products and putting them together with ITV glue, the result was a bespoke platform with a steep learning curve, even though it was initially intended to be self-service. The developers found it hard to do what they wanted to do, and the platform engineers were too kind and too helpful: the dynamic changed from "I can do that" to "The platform engineer can do that", a situation Clark described as Platform Engineer as a Service (PEaaS), and force multiplication and quality influence suffered. Developer experience was poor, with fifteen-minute CI/CD cycle times considered too long, even though this was an improvement on the initial eight weeks. And with twenty product teams but only four core engineers, research and development suffered. Clark said:

ITV like to think of themselves as fast followers in technology adoption, but it takes constant effort to stay still. If you stop, if you take your foot off the gas, you slip backwards and we did. We were running yesterday's technology. We learned to do something when you can afford to, not when you can't afford not to; if you are reacting, it's probably almost too late. We realised that the common platform was what we needed but not what we wanted, because ultimately the platform we wanted was no platform at all.

Along with this realisation came another: that nobody was happy -- not the developers, nor the platform engineers. And so an intervention was staged. The common platform team went on a roadshow visiting all their sites, with three elements: a classic Stop-Start-Keep ("StoStaKee") retrospective, a Net Promoter Score (NPS) question ("Would you recommend this platform to a friend?"), and requests for features. People asked for more transparency and visibility, and more empowerment and self-service (which was ironic, since this had been the original intention). The NPS was 3.1: neither recommend nor not recommend. People also wanted faster deployments and autoscaling. At the time they were running on Amazon EC2 instances with Puppet deployed via an AMI, which took fifteen minutes to initialise, so they had to pre-warm VMs in order to handle spikes. Autoscaling was therefore not a possibility.
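As an aside on that 3.1 figure: classic NPS buckets 0-10 responses into promoters (9-10), passives (7-8) and detractors (0-6), and the resulting score ranges from -100 to +100, so 3.1 sits at essentially neutral, matching "neither recommend nor not recommend". A minimal sketch of the standard calculation:

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6),
    giving a value between -100 and +100."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Three promoters, one passive, one detractor out of five responses:
print(nps([10, 10, 9, 8, 4]))  # -> 40.0
```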

One team in particular had a strong interest in autoscaling and managing traffic spikes. ITV's reality show, Love Island, ranks third in popularity on the channel, behind only Football World Cup games, with audience figures of over three million viewers for the opening episode this year. Nearly a million people watch it on Simulcast (watching live on devices, in parallel with the on-air broadcast). Unlike football, where the audience tends to tune in over a thirty-to-sixty-minute period before kick-off, the Love Island audience's behaviour means the system sees twenty times the load within ten minutes as the programme begins. Clark explained:

This was the new normal for us at ITV. We wanted Love Island to be business as usual. We didn't want to have to do any pre-warming or pre-planning. We wanted to be able to handle this load whenever it happened. And so our partners online set an objective and key result (OKR): one infinitely autoscaling service running in production by March (this was in January). So I aligned the Common Platform Version 2 (CPV2) Minimum Viable Product (MVP) to this OKR.
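The talk doesn't detail the scaling mechanism itself, but on Kubernetes (which the team later adopted via EKS) the standard tool is the Horizontal Pod Autoscaler. Its core formula, per the Kubernetes documentation, is desiredReplicas = ceil(currentReplicas x currentMetricValue / targetMetricValue); a sketch:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Core Horizontal Pod Autoscaler formula from the Kubernetes docs:
    desired = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# A sudden 20x spike in aggregate CPU against a 70% utilisation target:
print(desired_replicas(5, 1400, 70))  # -> 100
```

In practice the HPA also applies stabilisation windows and scaling policies to damp flapping; the formula above is only the core calculation.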

Clark's team created a roadmap backlog (one already existed; it just hadn't been shared), a Slack channel, and a contributors group who set the vision: "Provide a brilliant hosting and development platform." Since time was of the essence, with a delivery date three months out, the team focused on evolution over revolution, and upgrade over replatform. They considered the Minimum Viable Change (MVC) in order to limit risk, and decided their biggest opportunity for winning was to move the runtime/scheduler from EC2 instances to containers. Since ITV's cloud platform is AWS, they identified two options, Fargate and Elastic Kubernetes Service (EKS), developed a weighted scorecard, and EKS emerged as the leader. Using a phased approach, they added teams in increments. Clark said:
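The talk doesn't share ITV's actual scorecard, so the criteria, weights and scores below are hypothetical; the sketch only illustrates the weighted-scorecard mechanics behind a decision like EKS versus Fargate:

```python
# Hypothetical criteria with 1-5 weights and scores -- not ITV's real scorecard.
weights = {"ops_overhead": 3, "flexibility": 5, "ecosystem": 4, "portability": 2}
scores = {
    "Fargate": {"ops_overhead": 5, "flexibility": 2, "ecosystem": 3, "portability": 1},
    "EKS":     {"ops_overhead": 3, "flexibility": 5, "ecosystem": 5, "portability": 4},
}

def weighted_total(option):
    """Sum of weight x score across all criteria for one option."""
    return sum(weights[c] * scores[option][c] for c in weights)

# Rank the options from highest weighted total to lowest:
for option in sorted(scores, key=weighted_total, reverse=True):
    print(option, weighted_total(option))  # EKS 62, then Fargate 39
```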

We also followed a new rule: "Optimise for the common case" and didn't try to accommodate every edge and corner case. We aimed for convention over configuration because every time you add new configuration parameters to a file that incurs new cognitive load on anyone that has to look at that file.

The team came up with developer personas and discovered that most developers didn't care whether the platform ran Kubernetes or not; they just wanted outcomes, to "get stuff done". Clark described this majority as the "80% easy mode". The other 20% did care about Helm charts and YAML, and were given a different developer experience. The team used the Greek word for simple, "Aplo", to describe the easy mode, where all the YAML happens in the background.
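Aplo's internals weren't shown in the talk, so the function and defaults below are hypothetical; they sketch the convention-over-configuration idea behind an "easy mode": the developer supplies only a service name and image, and conventional defaults fill in the Kubernetes Deployment boilerplate that would otherwise be hand-written YAML.

```python
def easy_mode_manifest(name, image, port=8080):
    """Hypothetical 'easy mode' helper: conventional defaults produce a
    complete Kubernetes Deployment object from a name and image alone."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": 2,  # conventional default, hidden from the developer
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                }]},
            },
        },
    }

manifest = easy_mode_manifest("hello", "registry.example/hello:1.0")
print(manifest["kind"], manifest["spec"]["replicas"])  # -> Deployment 2
```

Serialising this dictionary to YAML (e.g. with PyYAML) would yield the kind of manifest the 20% "hard mode" developers edit by hand.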

The team succeeded in delivering their MVP (the autoscaled service running in production) by the end of March, and were rewarded with mission patches (stickers) for their laptops. Clark had the mission patch put on a celebratory cake ("baked motivation"). ITV saw a number of benefits as a result of this effort: developers can self-serve more, cycle time is ten times faster at ninety seconds, there are fewer failed deploys, and the platform is thirty percent more efficient to run, which is viewed as additional injected capacity. And they can autoscale in milliseconds. A principal developer, Dave Smith, sent an email to the development team saying this about CPV2:

The performance is better, it's cheaper to run, the config is nicer, the deployment times are delightful and the scaling is sublime!

An hour before the broadcast of the Love Island premiere was due to begin, a stream of 500 errors was reported on the Slack channel. A change had been made which recategorised a 400 error as a 500 error, so there was no real user impact, but it was causing unwanted noise in a channel that they wanted clear at that time. The team discussed it in Slack and agreed to make the change and roll forward, minutes before their biggest event of the year. The fix was successfully deployed and complete within thirty minutes of the errors being spotted. Clark said:

It was a trial by fire for the platform, but really it's an example of the engineering maturity we have in our teams now when that was the obvious thing to do. It wasn't scary, there was no change freeze. It's very boring now and actually people aren't watching the stats on Slack anymore as it is BAU.
