
How the Financial Times Approaches Engineering Enablement

Key Takeaways

  • Continuous delivery and decoupled architectures such as microservices can massively speed up delivery of value, with code going live in minutes rather than months
  • The key is for product-focused teams to be as autonomous as possible; they should not normally have to wait on other teams for something to happen, whether that is a decision or task execution
  • You still need teams focused on infrastructure, tooling and platforms, but they need to see themselves as enablers of other teams: the focus is on anything that makes things safer, simpler and quicker
  • Wherever possible, common engineering capabilities should be self-service, automated, and well-documented
  • Product-focused teams should be able to rely on those capabilities; they need to be maintained for the long term, and kept secure and compliant

 

With the move to the cloud, the widespread adoption of continuous delivery and the rise of microservices, it is possible for software development teams to move a lot faster than they could five years ago. But this can only happen if these teams are able to move at their own speed, without having to wait for someone to sign off on a release or process a ticket to spin up a server.

Companies still need teams working on infrastructure, tooling and platforms, but the way they work has to change so that they do not become a bottleneck. These teams need to see their true function as being about enabling product teams to deliver business value. Investment in this area pays off as it speeds up many other teams and allows product-team engineers to focus on solving business problems that provide value to the organisation.

I spent four years as a tech director at the Financial Times, leading teams focused on engineering enablement. In this article I’ll talk about how our teams are set up and the things we have found important in really enabling other teams.

The Engineering Enablement group

The Product & Technology teams at the Financial Times are aligned into several groups. Each group has a clear area of focus. The Engineering Enablement group is made up of all the teams that build tools to support engineers at the FT.

This includes a number of teams that provide a layer of tooling and documentation around key vendor functionality, whether that's AWS or tools like Fastly. That work also involves collaborating closely with those vendors and building a deep knowledge of what they offer and how to make the best use of it.

Then there are teams that have expertise in particular areas, for example, cyber security or web component design.

There are also teams that do operational and platform management. We expect engineering teams to be best placed to fix a big problem with their own systems, but these central teams do triage, help with incident management, patch our supported operating systems for things like EC2 instances, and make recommendations such as changing instance type or downsizing.

And finally, we have a team that works on engineering insights: building a graph of information about all our systems, teams, hosts, source repositories and so on, and giving our engineers insight into areas where they need to improve, for example missing documentation or gaps in scanning coverage.
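
To make that concrete, here is a highly simplified sketch, in TypeScript, of the kind of data such a graph might expose. The field names, example systems and the "missing runbook" check are illustrative assumptions, not the FT's actual schema.

    interface System {
      code: string;            // unique system code
      team: string;            // owning team
      repository?: string;     // source repository, if known
      runbookUrl?: string;     // operational documentation, if it exists
    }

    // Example data; in practice this would come from the graph itself.
    const systems: System[] = [
      { code: "content-api", team: "content-platform", repository: "org/content-api",
        runbookUrl: "https://runbooks.example.com/content-api" },
      { code: "image-resizer", team: "content-platform", repository: "org/image-resizer" },
    ];

    // Which systems should each team be nudged about?
    function systemsMissingRunbooks(all: System[]): Map<string, string[]> {
      const byTeam = new Map<string, string[]>();
      for (const s of all) {
        if (!s.runbookUrl) {
          byTeam.set(s.team, [...(byTeam.get(s.team) ?? []), s.code]);
        }
      }
      return byTeam;
    }

Running systemsMissingRunbooks(systems) over this example data would flag image-resizer for the content-platform team - exactly the kind of targeted nudge the insights team aims to provide.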

Principles

We have defined a set of principles to help give us confidence that we are building engineering capabilities that are valuable to our customers. They naturally group around particular concerns.

The most important in my view are:

Can people use this without costly co-ordination?

In general, engineers should be able to solve their own problems.

That means if someone wants to, for example, set up a DNS entry, they should first of all easily be able to find out what tooling is available to help with that. That tooling should be discoverable.

Then, the common things an engineer needs to do should be documented. There will always be people who need to do something complicated, and our teams have Slack channels and people on support for those channels - but for 80% of what people need to do, documentation should be enough. Recently, we consolidated our documentation for engineers into a single "Tech Hub". If you go and search for DNS there, you should find what you need.

[Screenshot: the current high-level Tech Topics on Tech Hub, with the sub-topics for DNS expanded.]

And we want things to be self-service. There should be nothing stopping an engineer from using a capability immediately; no waiting on a ticket or a PR to be signed off.

A recent change made by the team that owns DNS was to add a bot to our GitHub repository for DNS changes, so that straightforward changes could be automatically approved.
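
As an illustration only (not the FT's actual bot), an approval bot along those lines could be sketched with the GitHub API via Octokit. The organisation, repository name and the definition of "straightforward" below are assumptions.

    import { Octokit } from "@octokit/rest";

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
    const owner = "example-org";   // hypothetical organisation
    const repo = "dns-config";     // hypothetical DNS-as-code repository

    async function maybeAutoApprove(pullNumber: number): Promise<void> {
      // Look at what the pull request changes.
      const { data: files } = await octokit.rest.pulls.listFiles({
        owner, repo, pull_number: pullNumber,
      });

      // "Straightforward" here means: only record files touched, nothing deleted.
      const straightforward = files.every(
        (f) => f.filename.startsWith("records/") && f.status !== "removed"
      );

      if (straightforward) {
        await octokit.rest.pulls.createReview({
          owner, repo, pull_number: pullNumber, event: "APPROVE",
          body: "Auto-approved: only standard DNS record files changed.",
        });
      }
    }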

You have to balance risk here, of course - there will still be areas where we need to have some controls in place, for example where getting something wrong would expose the FT to cost, risk, or security issues - which leads on to another group of principles.

Do we guide people on doing the right thing?

Capabilities should be safe to use; the default configuration should be secure, and we should protect engineers from making easy-to-avoid mistakes.

They should also be secure and compliant - we will make sure these shared capabilities are kept up-to-date and comply with our own policies.

And finally, we need to consider what will make people feel safe adopting the capability.

Is there a risk for teams using this?

The capability needs to be owned and supported, minimising the impact of upgrades, migrations and maintenance, with a commitment of multiple years.

We should be able to provide transparent usage and cost insights, so teams understand the cost impacts of their design and architecture choices. We have found that sharing this information makes it a lot more likely that teams will take action to keep their costs down.
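
One way to surface that kind of cost insight, sketched here under the assumption that cloud resources are tagged with an owning team, is to query AWS Cost Explorer grouped by that tag. The "team" tag key is hypothetical, not necessarily what the FT uses.

    import { CostExplorerClient, GetCostAndUsageCommand } from "@aws-sdk/client-cost-explorer";

    const client = new CostExplorerClient({ region: "us-east-1" });

    // Monthly unblended cost, grouped by a hypothetical "team" tag.
    async function costByTeam(start: string, end: string): Promise<void> {
      const result = await client.send(new GetCostAndUsageCommand({
        TimePeriod: { Start: start, End: end },   // e.g. "2024-05-01" to "2024-06-01"
        Granularity: "MONTHLY",
        Metrics: ["UnblendedCost"],
        GroupBy: [{ Type: "TAG", Key: "team" }],
      }));

      for (const group of result.ResultsByTime?.[0]?.Groups ?? []) {
        console.log(group.Keys?.[0], group.Metrics?.UnblendedCost?.Amount);
      }
    }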

Autonomy of teams

Teams at the FT have a lot of autonomy, within certain boundaries. The boundaries are generally where a change has an impact outside your team: for example, when you want to introduce a new tool but something similar is already available, or where we get a lot of benefit as a department from having a single approach.

For example, if you want to use a different datastore from AWS, that’s not going to be a problem, but if you want to bring in a new cloud provider, you’d have to make a case for the additional cost and complexity. Similarly, if you want to start shipping logs somewhere different, that has an impact on people’s ability to look at all the logs for one event in a single location, which can be important during an incident.

Sometimes, teams need something for which there isn’t a current solution, and then they can generally try something out. For a completely new vendor, teams need to go through a multi-step procurement process - but teams can go through a shorter process while they are doing evaluation, provided they aren’t planning to do something risky like send PII data to the vendor.

Teams do use their autonomy. They make decisions about their own architecture, their own libraries and frameworks. About five years ago, as the FT adopted microservices, moved to public cloud and began to have a "you build it, you run it" culture, there was a big increase in the number of different ways teams did things at the FT.

Since then, there has been a bit more consolidation, as teams realise they have patterns they can copy which will save them time and effort. For example, once one team was successfully using CircleCI to build services, it was easy for other teams to adopt it too. After a while, the central engineering enablement teams took over the relationship with CircleCI. This kind of process for finding a new tool can be very powerful - you already know that people want to adopt it!

Moving the boundary

Sometimes, we do need to move the boundary on what teams can do.

A few years ago we introduced a fairly lightweight process for technology proposals. Anything that impacts more than one group or introduces a significant change should get written up in a document, using a template that covers the need, the impact, the cost, and the alternatives. One documented alternative should always be "do nothing", so we understand what would happen then. As an example, a few years ago we needed to move to a new DNS provider as our existing provider, Dyn, was going end of life. In this case, "do nothing" wasn't feasible - and having that stated in the doc was still useful.

These documents are circulated for comment, and then brought to a meeting called the Tech Governance Group to seek endorsement. This meeting is open to all, and we encourage people to attend to listen and learn. Once the meeting went virtual, we found we often got 40 to 50 people at these sessions.

If you are going to contribute, you should have read the proposal beforehand and given feedback. The aim is for the meeting to be more about the final details and communication; all the work on consensus takes place before it. For the DNS proposal, that meant the DNS team evaluated several alternatives, focusing on how to reduce the impact of the change on other teams. They were also able to document how the new approach would allow teams to move to a much more streamlined process for making DNS changes, via infrastructure as code in a GitHub repository. They spoke to every development group ahead of the meeting, so by the time the review happened, they had a commitment on approach and timeline.
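
As a sketch of what that infrastructure-as-code workflow might look like (illustrative only: the record names, file layout and validation rules are all assumptions), teams would edit typed record declarations in the repository and CI would validate them before a human, or the bot, approves the change.

    // In practice the record data and the validation would live in separate
    // files in the DNS repository; they are combined here for brevity.
    type RecordType = "A" | "CNAME" | "TXT";

    interface DnsRecord {
      name: string;     // e.g. "reviews.example.com"
      type: RecordType;
      value: string;
      ttl: number;      // seconds
    }

    const records: DnsRecord[] = [
      { name: "reviews.example.com", type: "CNAME",
        value: "reviews.eu-west-1.example.net", ttl: 300 },
    ];

    // Basic CI checks run on every pull request.
    function validate(rs: DnsRecord[]): string[] {
      const errors: string[] = [];
      for (const r of rs) {
        if (!r.name || !r.value) errors.push(`${r.name || "<unnamed>"}: name and value are required`);
        if (r.ttl < 60) errors.push(`${r.name}: TTL below 60 seconds`);
      }
      return errors;
    }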

I don’t think a proposal is yet automatically brought to the Tech Governance Group for every significant change, but a lot does get discussed this way, and you can go back a year later and understand why we took a particular decision, because the documents are all linked in a GitHub repository.

The challenge of measuring our impact on engineering productivity

I’ve always found it hard to measure engineering productivity by looking at things like tickets completed. For example, when we measured velocity from sprint to sprint, I never found any clear trends! And it’s too easy to measure work done rather than value provided; building the wrong thing quickly isn’t good.

The DORA or Accelerate metrics are in my view a very good initial measure for companies. If you are a low or medium performer on these, then building CI/CD pipelines and optimising for releasing small changes quickly will give you a massive benefit.

The FT made a lot of changes over the last five to ten years - moving to the cloud, adopting DevOps, moving to a microservice architecture - that impacted our score on these DORA metrics. Important things that changed during that period included the time to spin up a new server, and the time between writing code and it going live. Both of these went from months to minutes.
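
For anyone wanting to track the same thing, here is a minimal sketch (not the FT's tooling) of computing two of the DORA metrics - lead time for changes and deployment frequency - from a list of deployment records. The field names are assumptions.

    interface Deployment {
      service: string;
      commitTimestamp: Date;   // when the change was first committed
      deployTimestamp: Date;   // when it went live in production
    }

    // Median hours from commit to production (lead time for changes).
    function medianLeadTimeHours(deploys: Deployment[]): number {
      const hours = deploys
        .map((d) => (d.deployTimestamp.getTime() - d.commitTimestamp.getTime()) / 3_600_000)
        .sort((a, b) => a - b);
      const mid = Math.floor(hours.length / 2);
      return hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
    }

    // Deployments per day over a reporting window (deployment frequency).
    function deploymentFrequency(deploys: Deployment[], windowDays: number): number {
      return deploys.length / windowDays;
    }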

What that means is that you can test the value of the work you are doing quickly. You can experiment. That means we don’t spend months building something and then discover our customers hate it or that it doesn’t have the impact we expected. An example from a few years ago - we wanted to increase engagement with the FT’s film reviews, meaning we wanted people to read more of them or spend more time reading them. We could very quickly add the review score to the page that listed all the reviews - and just as quickly, remove that when we found it resulted in fewer reads of the actual review.

Once you are high performing on these metrics - which the FT has been for years now - you have to work out other things to measure to see the impact of engineering enablement. This is a bit of a challenge for the teams now: working out the new metrics that best show we are having an impact. We do use qualitative ones, like developer surveys and interviews, and we have been trying to measure system health in various ways too.

Key lessons I’ve learned

Firstly, the importance of communication.

The more you can talk to people about what you are doing and why, the better. That means explaining why you are putting any constraints in place: whether it’s because of cost, complexity or risk.

Secondly, the value of providing teams with insight.

You want to focus on making information visible to teams, and nudging them to do things. We found that once we could show teams where they needed to improve their operational runbooks, they were very likely to go and do that. We created a dashboard with a service operability score, so teams could see where to focus their efforts.
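
A minimal sketch of that kind of scoring, assuming a handful of illustrative runbook checks (not the FT's actual criteria), might look like this:

    interface Runbook {
      service: string;
      description?: string;
      troubleshooting?: string;
      monitoringDashboard?: string;
      onCallTeam?: string;
    }

    // Each check has a human-readable name so the dashboard can say exactly what is missing.
    const checks: Array<[string, (r: Runbook) => boolean]> = [
      ["has a description", (r) => Boolean(r.description)],
      ["has troubleshooting steps", (r) => Boolean(r.troubleshooting)],
      ["links to a monitoring dashboard", (r) => Boolean(r.monitoringDashboard)],
      ["names an on-call team", (r) => Boolean(r.onCallTeam)],
    ];

    function operabilityScore(runbook: Runbook): { score: number; missing: string[] } {
      const missing = checks.filter(([, ok]) => !ok(runbook)).map(([name]) => name);
      const score = Math.round(100 * (checks.length - missing.length) / checks.length);
      return { score, missing };
    }

The point is less the arithmetic than the "missing" list: it tells a team exactly what to fix, which is what makes the nudge effective.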

We all have a lot of demands on our time. If your email or dashboard has a link to exactly where I need to go, and explains exactly what I need to do and why, I will be much more likely to do it.

But the final and, I think, most important thing to focus on is decoupling - remove the need for people to wait. Be responsive, be helpful. But also invest in decoupling by automating and documenting, and you will spend less time on boring tasks as well!

My advice to others considering how to structure their organisation

I think every organisation has to provide at least some centralised engineering services. The question is how far you should take it.

For me, that depends on where you are starting from. So, for the Financial Times, it would be a lot of work to move everyone onto one standardised golden path. A good plan, given where the FT is, would be to build several golden paths for the main technologies that are used - for example, one for deploying Node.js apps to Heroku, another for serverless that uses AWS Lambda and standard AWS resources like DynamoDB, S3 and SQS. Many of the steps within these golden paths would be common across the paths: using the same source control, the same DNS provider, and the same CDN, for instance.
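
To give a flavour of what one of those golden paths could provide, here is a minimal sketch, assuming AWS CDK in TypeScript, of a serverless template wiring together the standard building blocks mentioned above. The names and settings are illustrative, not the FT's actual template.

    import { App, Stack, Duration } from "aws-cdk-lib";
    import * as lambda from "aws-cdk-lib/aws-lambda";
    import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
    import * as sqs from "aws-cdk-lib/aws-sqs";
    import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

    const app = new App();
    const stack = new Stack(app, "GoldenPathServiceStack");

    // Standard building blocks: a table, a work queue, and a Lambda worker.
    const table = new dynamodb.Table(stack, "Events", {
      partitionKey: { name: "id", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    const queue = new sqs.Queue(stack, "WorkQueue", {
      visibilityTimeout: Duration.seconds(60),
    });

    const handler = new lambda.Function(stack, "Worker", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist"),
      environment: { TABLE_NAME: table.tableName },
    });

    handler.addEventSource(new SqsEventSource(queue));
    table.grantReadWriteData(handler);

    app.synth();

The value of a golden path like this is that a product team only writes the handler code; the rest of the plumbing, along with shared choices such as source control, DNS and CDN, comes pre-wired.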

For an organisation that has a more standard set of technologies in place, I would look at where the pain points are: ask engineers what is painful, what holds them up. I would bet that a central team working on those problems can have a bigger impact than the same engineers would have working on feature teams instead.

But the most important thing is to make sure the teams providing these centralised services understand that their role is to enable other teams: to be responsive, to make things self-service and automated, to produce really good documentation, and to talk to other engineers so that the things they build have the most positive impact for those engineers and for the organisation.
