Enabling Engineering Productivity at the Financial Times

Summary

Sarah Wells discusses how they ended up moving fast with over 30,000 releases in a year from a development team of around 250.

Bio

Sarah Wells has been a developer for 20 years, leading delivery teams across consultancy, financial services, and media. Building the FT's content and metadata publishing platform using a microservices-based architecture led her to develop a deep interest in operability, observability, and DevOps, and at the beginning of 2018 she took over responsibility for Operations and Reliability at the FT.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Wells: Today's track is all about optimizing your organization for speed. I want to spend a bit of time talking about that, because the key element of speed that I think matters the most is how long it takes between starting work on a change and having it go live in production. When I first joined the FT, we released code for our publishing platform website once a month, because we couldn't do it without significant downtime. What that meant was that lots of changes all went live at the same time. If something went wrong, the first challenge was working out which change had failed. Also, it could be up to six weeks since you'd done the work, and you no longer had the context in your head. We do a lot more than 12 releases a year now. This shows the changes in production that we've made over the last month at the Financial Times. There are a lot of them, around 2500, or more than 100 a day. Each change is small, and teams release them at will. That makes it easier to understand the change, to measure the impact, and to roll it back if something has gone wrong.

I think the main benefit of making lots of small changes and having them go live quickly is that there isn't as huge an investment in trying something out. Linda Rising says that too many organizations say experiment when they mean try. It isn't an experiment if you can't easily measure the impact or roll it back. When we were only doing 12 releases a year, so many changes went live at once that you couldn't really work out which change had the impact on any of your metrics. In any case, it was very hard to roll that feature back out when there was at least a month of other changes made on top of it. I think the ability to experiment is crucial to being able to deliver true value to your organization.

High Performing Technology Organizations

Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim looks at what links high performing technology organizations. What that means is software development organizations that positively impact the performance of your organization measured in terms of profitability, market share, and productivity. They identify several key metrics that correlate with high performance. A short lead time from committing code to going live. Frequent releases, which means each one is small. A low change failure rate, which I think is linked to doing those small changes, because you can easily see what that change is meant to be doing. A quick fix if something goes wrong, which I think is linked to having the ability to go from code commit to live in a quick and automated fashion, because that's what you do to fix things too.
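To make the four measures concrete, here is a minimal sketch of how they could be computed from a log of changes. The `Change` record and its field names are assumptions made for illustration; they are not the FT's change data or the exact definitions used in Accelerate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Change:
    """One production release (hypothetical record shape)."""
    committed_at: datetime
    released_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only once a failure was fixed


def four_key_metrics(changes: list[Change], period_days: int) -> dict:
    """Summarise lead time, release frequency, failure rate and time to restore."""
    if not changes:
        raise ValueError("need at least one change")
    lead_times = [c.released_at - c.committed_at for c in changes]
    failures = [c for c in changes if c.failed]
    restore_times = [c.restored_at - c.released_at for c in failures if c.restored_at]
    return {
        "median_lead_time": median(lead_times),
        "releases_per_day": len(changes) / period_days,
        "change_failure_rate": len(failures) / len(changes),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that one unusually slow release does not dominate the picture; that is a design choice of this sketch, not something the talk prescribes.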

What Accelerate also says is that teams that manage to do this are largely autonomous. The architectures are loosely coupled. Teams don't have to collaborate with other teams to get things done. For example, you can release your change when it's ready, you don't have to wait for a scheduled release. You can spin up servers without having to open a ticket for another team. We often talk about these teams as cross-functional teams because they have all the skills that you need within them: design, UX, engineers. We often expect the engineers to be full stack. Full stack has to stop somewhere. We can't expect engineers to know everything down to the kernel. Most engineers in our product teams are T shaped, meaning they have depth in one part of the stack and can work in adjacent parts. However, a five person team can't have expertise in all the things that matter, things like accessibility, observability, security, without impacting their ability to deliver value to the business. They shouldn't all have to solve the same problems, to have to build their own tools and abstractions, to spin up servers and deploy code, to sort out their own observability tooling and write things to manage costs and compliance.

The Engineering Enablement Team

This is where the engineering enablement team comes in at the Financial Times. Our aim is to standardize, simplify, advise, to support the FT's product development team so that they can deliver value quickly, securely, and scalably. We provide common platforms and services, and often that means having a relationship with a vendor. Maybe we put the thinnest possible layer of tooling in place to make it easy to use a capability. That could be something like a log forwarder that makes it easy for people to ship logs to the log aggregation system, or tooling around cloud provider accounts. We also provide expertise and support in areas like observability, cybersecurity, and incident management. We provide insight into what exists. We maintain a central system registry, which allows us to know what systems exist and who owns them. We also can show how well systems comply with expectations around things like cost, documentation, and quality.

In the Team Topologies book, the kinds of teams we have in engineering enablement are described as platform teams and enabling teams. Team Topologies has a lot of great stuff to say about how you structure an organization for flow. At the Financial Times most of our teams are stream aligned. They're aligned to a single valuable stream of work that might be building and maintaining the FT's paywall, for example. Platform teams focus on enabling other teams to deliver work with substantial autonomy. They provide internal services that reduce the cognitive load on those stream aligned teams that are working on business problems. Enabling teams are specialists in particular domains. They have the time to research, to dig deep into a domain, to get ahead of where things are going, and they share their expertise.

This gives an overview of some of the things the engineering enablement team look after. If we do our job, stream aligned teams shouldn't have to develop in-depth knowledge within their team about each of these. They should be able to do what's needed without having to coordinate with anyone else. We adopted microservices and DevOps about five years ago at the Financial Times. We've learned a lot about what works and what doesn't when you're building complicated distributed systems and aiming to maximize the flow for your engineers.

A Cautionary Tale

I want to talk about two things that I think have helped us. First is the idea of building guardrails, not fences. We don't want to block the path for people. We want to guide them in the right direction. We want to add value, so we should look at paving the roads that people want to take. Before I talk about those in detail, I want to tell a story. Five years ago, I was a tech lead for the FT's content publishing platform, where a journalist would publish a piece of content. It would go into the platform and be made available through APIs that could be used by our website, our apps, and by close third parties. We were building a new system using microservices and we were using FT platform, which was the FT's first cloud platform. FT platform was what let us build a microservices architecture in the first place, because of the impact it had on the time to provision a server.

When I joined the FT in 2011, you had to order hardware in for a new project unless you were very lucky. It took six months to get the production server for the first project I worked on. You can see it in this graph. By 2014, you could spin up your own server in minutes. This is a huge improvement. It's a fraction of a percent of the time. However, you needed sudo access to deploy code for the first time to a newly provisioned server, and sudo access was only given to integration engineers. As a developer on a client team, I could spin up thousands of servers, but I couldn't put code on any of them unless I got someone to help me. I understand completely why things were built this way. We were new to working in a DevOps way, and the infrastructure team didn't quite trust our developers with sudo access. In my view, this was way too cautious. You have to balance risk against reward. I think the risk and the cost, in this case, is in spinning up the machines, not in deploying code to them.

Autonomous teams can choose something else. This is a great blog post from one of my colleagues, Matt Chadburn. It's six years old, but it still holds true. If you are really empowering your teams, if they are truly autonomous, and if you're focused on speed, then you have to allow those teams to choose something else if it gets them there quicker. As an internal team, you should be able to build something that's better tailored to their needs, because you aren't trying to build something that works for everybody. If you can't, you should think about why you're even trying to build something.

FT platform was good, but it didn't get ahead of the changes happening in the organization. Apart from the sudo issue, which got fixed, FT platform was built expecting one application per server. With the adoption of microservices at the FT, we had tiny applications, and even the smallest VMs had a lot of capacity left over. Teams looked for other solutions. One team opted to deploy to Heroku. Using a PaaS takes a lot of complexity away, but they still had to build some tooling themselves. The team I was on at the time adopted containers. Being able to deploy multiple applications onto the same server meant we saved around 40% on hosting costs. That's not the whole story around cost, of course: we took on additional complexity in maintaining our own container platform. There's always a temptation to focus on concrete spend with third parties over the opportunity costs of having your team working on a platform rather than on new features. That is really a different story. The thing to take away is that internal teams are service providers now. That means that if you are an internal team, you need to offer good documentation, support, excellent communication. You need to maintain backwards compatibility and to be very careful about how often you ask people to upgrade stuff and do work. If you are building a platform, it is a good thing to focus on removing the places where people have to wait for you.

Guardrails Not Fences

Let's start by talking about guardrails, not fences. Our guardrails cover the things we expect a team to consider, so that we build things in the right way. Also, so that we build the right things. They're there to ensure the things we build are safe, secure, and can be operated. They start from a checklist, which is loosely in the order you need to tackle things if you're building a new product or feature. When you dig into the details shown here, there are links out to principles, policy, and standards. I have to be honest, very few engineers at the FT will have read these, and they need some work to bring them up to date. That's because it's easy to write guardrails, it's harder to maintain them and to get people to care about them. Actually, that's fine, because people shouldn't have to read the guardrails. If they are building a system at FT, the tools they use should help them to comply with the guardrails. They don't need to know why something is the way it is. It's available if they want to indulge their curiosity.

Adding to Biz Ops

For example, Biz Ops is our internal system registry. We want to know about new systems and about who owns them. In our guardrails, we say you should create at least an initial record for a system. You will be nudged to do that, in any case, because when you create that initial record, you define a unique system code. You'll find it really hard to get a service to production without a system code. For example, resources in AWS should be tagged with it. That's a pretty strong nudge to developers.
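A nudge like this can be automated. The sketch below checks a list of cloud resources for a system-code tag that matches a record in the registry. The tag key `systemCode`, the resource shape, and the function name are all hypothetical; the talk does not describe how the FT implements the check.

```python
def missing_system_code(resources: list[dict], known_codes: set[str],
                        tag_key: str = "systemCode") -> list[str]:
    """Return IDs of resources whose system-code tag is absent or not registered.

    `resources` is a list of {"id": ..., "tags": {...}} dicts (assumed shape);
    `known_codes` stands in for the system codes held in the registry.
    """
    offenders = []
    for resource in resources:
        code = resource.get("tags", {}).get(tag_key)
        if code not in known_codes:
            offenders.append(resource["id"])
    return offenders
```

Run as part of a pipeline or a scheduled audit, a check like this turns the guardrail into something teams meet on the way to production rather than something they have to read about.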

Runbooks

Another approach is to make it easy for people to see how they're doing. How are you doing compared to these guardrails? We want our critical systems, the ones that need to work 24/7 to have good runbooks, so that anyone who's trying to fix the stuff when something goes wrong has the information that they need. We score them. We have a set of fields where we know we need information, and we combine that all together to get a score of how good the runbook is for this service. This has worked really well for us to help improve our runbooks particularly where we didn't have any information for some systems. We gamified it a bit, so groups and teams would work to get above their colleagues in this table. These are just a few examples. In general, you should always be looking to automate your guardrails and to guide people to where they can improve in their coverage.
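As an illustration of the scoring idea, here is a minimal sketch that scores a runbook by how many expected fields are filled in and ranks teams by their average. The field names and the equal weighting are assumptions; the talk does not list the real fields or the formula.

```python
# Hypothetical set of fields a runbook is expected to fill in.
REQUIRED_FIELDS = ("description", "owner", "architecture", "troubleshooting", "contacts")


def runbook_score(runbook: dict, required=REQUIRED_FIELDS) -> int:
    """Percentage of required fields that have non-blank content."""
    filled = sum(1 for field in required if str(runbook.get(field) or "").strip())
    return round(100 * filled / len(required))


def league_table(runbooks_by_team: dict) -> list:
    """Average runbook score per team, best first (teams with no runbooks are skipped)."""
    averages = {
        team: sum(runbook_score(r) for r in runbooks) / len(runbooks)
        for team, runbooks in runbooks_by_team.items()
        if runbooks
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

A score like this only measures whether fields are filled in, not whether what they say is still true.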

Evolving the Guardrails

You should expect your guardrails to change as your tech stack and the way that you build systems changes, and your focus changes too. Four years ago, we were just finishing greenfield builds in three big areas of our estate: the website, the content publishing platform, and our subscriptions and paywall code. They were all using microservices, and we were adopting a DevOps approach. You build it, you run it. Lots of our work focused on working out who owns a service and getting enough documentation to be able to fix things. Now that's no longer a focus for us in the same way. In fact, the scoring of runbooks is not really working the same way for us now, because the problem we have isn't that runbooks are missing content, it's that the content that's there is not up to date. That's a lot harder to identify programmatically.

As you change your guardrails, you need to think about how you share those changes. They're one of the things that gets brought to our Tech Governance Group. This is how we talk about changes that have a wide impact. We meet as a group. The meeting is open to every engineer. In practice, we tend to have a core group of people who are engaged in discussing a proposal, and many more who are there to listen and learn. This is where going remote made a difference. We now quite often have 40 to 50 people on those calls. Proposals are sent around well in advance, and we ask people to read and comment on the document before the meeting, if they're going to say anything. The meeting is largely there to share the information and to formally endorse the proposal. All the work to get a consensus is generally done in advance. The meeting is very good for explaining to people what's changed.

Pave the Road

The second thing I want to talk about is paving the road. You should expect some diversity in your software estate. If you don't have that, you probably have people forced to use something that isn't a good match for their needs. For some organizations, there's likely a quite standard set of tech, which teams can use. Even when there's a lot of standardization, you can expect to have different databases and libraries, just things that suit different needs. The FT has a very diverse software estate for historical reasons. We gave teams a lot of freedom to choose. Paving the road is about making it easy to do the things that engineers need to do a lot of the time for the stuff that's pretty standard.

There's this concept of a desire path. For example, in a park, you often see a path develop because it reflects where people actually want to go. Ideally, you should look out for these desire paths, and then you can pave them. At the Financial Times, one thing that happens quite often is a team or a group start using something and they like it, and other teams adopt it. If we as a central engineering enablement team can take that and make it available to others, we're choosing something that's already been proven, it's been tested. People like it. Sometimes it's as simple as taking over the relationship with the vendor. Other times we do more. For example, Fastly, our current CDN provider, came into the FT when a key team started to use it. We had another CDN provider, so that meant assessing, do we want two, particularly two providers who are there for different parts of our estate? We're not managing to get extra resilience, it's just extra costs and complexity. Our central team did the migration. They built up expertise in Fastly to support the needs of all of our teams.

The Golden Path

Internally, we don't talk about paving the road, we talk about the golden path, an opinionated and supported way of doing things. Opinionated means we make recommendations. Sometimes there will be options to choose between, sometimes it will be a bit stronger than that. Then if you go off the path, you need to make the same guarantees as that recommended approach. For example, that means you have to keep things patched, you need to support them out of hours, and they need to be documented. Supported means someone owns the capability and will maintain and improve it. We should be clear on how long it will be around.

If we move to something else, we will do whatever we can to make that migration simple and painless. I'm talking about migrations specifically because they happen a lot. If you don't handle them effectively, it can destroy the relationship you have with other teams. Because we have microservices, and expect teams to manage large parts of their infrastructure, migration is a real challenge. There are more things that might need to be upgraded. For example, when you had a single relational database, migrations were a pain, but they happened fairly irregularly. Now you probably have several different types of database: graph, document store, key value. You have lots of different instances, so each migration hits in lots of places, and they're happening more often. Also, when you have lots of microservices, you tend to have to do changes multiple times. If you need to upgrade a library that's used widely in your code base, you may have to create PRs and do releases for hundreds of services. That's time consuming without some level of automation, or even with some level of automation.

The Principles for Building the Golden Path

Recently, we've been documenting our principles for building parts of the golden path. This is to help us decide how we approach solving problems. For me, each principle answers a question that we should be asking ourselves.

Should We Provide This Capability?

The first question, and it is important, is, should we provide this capability? There are many things we could do, what makes this thing worth doing? We should look to buy things where we can rather than build from scratch. That can be a struggle, because engineers like to solve difficult problems, and because it's sometimes easier to get an internal team to build something than it is to get the cost approved. You should buy stuff first and build if the thing doesn't exist, and if it gives you enough value. If you are going to build a capability, make sure it has obvious value for engineers. It needs to be more attractive than other homegrown or external offerings. Make sure of that by talking to your customers to find out their needs. The real proof that you're building something someone wants is if you can get them agreeing to use it as soon as you have got it ready. If you can't get a volunteer to work with you on this, maybe this is not something people really need.

Can People Rely On It?

The second question is, can people rely on this? That's really around people believing this capability will be around for the long term, and won't suddenly sting them with a massive increase in costs. I expect anyone who's worked in tech for a while has been hit by something unexpected going end-of-life, or changing the charging model, or a company being acquired and sunsetting their product. That applies internally too. If you're building tools for other teams, you do not want to get a reputation for pulling the plug. If you do have to do that for some reason, you need to provide an alternative. Tell people about the change well in advance, and help them move. We want things to be owned and supported. We also want to understand how they're used and get insights into the cost. At the moment, that's mostly important for the engineering enablement group, because we don't have an internal market. We don't charge costs for our platform back to individual teams. We do want to understand two things. Are people actually using it? Are they using it in a way that's costing us an unexpected amount of money? Often, teams don't realize this themselves. If you can show them that, they can make changes.

This is a good example of what it means to support a capability. We used to manage DNS via Dyn, and they were acquired by Oracle in 2016. In June 2019, Oracle announced that Dyn would be shut down the following May. This was a proposal brought to our Tech Governance Group in October with a plan for migration. We found out in June, and by October we had a plan and were giving teams seven months' notice. We took the opportunity, as part of this change, to move to an infrastructure as code approach. This meant the migration actually made things better for our customers: they understand opening a pull request on a repository. It's easier than the previous approach. Finally, as much as possible, migration work was done for the teams.

Can People Use This Without Costly Coordination?

The next question is, will teams be able to get on with things, or will they be stuck waiting for us to do something for them? We want teams to be able to find out a capability exists and to be able to find the documentation for the most common use cases. Sometimes teams want to do something complicated. It's fine to cover the 80% of pretty standard stuff automatically and work with them to solve the other things. Finally, we want there to be nothing stopping them from going ahead. No need for an email from their manager, or a ticket on the Jira board. Again, there will always be some things that need a little process around them, but the aim should be to keep that as lightweight as possible and be clear on why. We want things to be discoverable, documented, and self-service.

For discovery and documentation, we've recently consolidated documentation from a lot of different places into one single place, our Tech Hub. We want developers to come here first when they have a question about how to do something and search for it. It's where the guardrails live. It's where the information about our technical capabilities lives. The level of self-service you can offer is going to vary, but there's always a case for stepping back and seeing what more you can do. The Edo team that supports DNS at the FT had already introduced infrastructure as code: DNS configuration is held in a GitHub repository, and changes are made via pull requests. When the team recently ran a survey to see what the pain points were, they got feedback about people having to wait for those PRs to be approved. They had to think about this. They did some analysis. They looked at the last 500 changes. They found that many changes looked to be pretty low risk to approve. You've added a set of lines, you've deleted a set of lines, or you've modified lines, and there were no comments made by anyone as part of approving the pull requests. Not controversial.

They developed a set of rules to run over any PR to speed things up whenever possible. Some of the rules are, you automatically approve simple stuff, something where there's only an addition. There are certain changes that you send to the cybersecurity team for them to approve. You can look for common mistakes. One example would be poorly chosen time to live options. You can look for modifications that are really benign. You can find someone who's closer to the business need of a change so that they can review rather than it being done by a separate team. I love this because it's focused on what our customers were asking for. It has a good and granular appreciation of the risk of automation. Some things we are still being careful about.
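Rules like these can be sketched as a small triage function run against each pull request. Everything specific below is an assumption for illustration: the `RecordChange` shape, the TTL bounds, and the record types routed to security review are not taken from the FT's tooling.

```python
from dataclasses import dataclass


@dataclass
class RecordChange:
    """One DNS record touched by a pull request (hypothetical shape)."""
    action: str       # "add", "modify" or "delete"
    name: str
    record_type: str
    ttl: int          # seconds


MIN_TTL, MAX_TTL = 60, 86_400          # assumed sensible bounds, in seconds
SECURITY_TYPES = {"NS", "MX", "TXT"}   # assumed types that need security sign-off


def triage(changes: list[RecordChange]) -> tuple[str, list[str]]:
    """Return a decision for the pull request plus the reasons behind it."""
    if not changes:
        return "human-review", ["no record changes detected"]
    # Common mistake: a poorly chosen time to live on a new or modified record.
    ttl_problems = [
        f"{c.name}: TTL {c.ttl}s is outside {MIN_TTL}-{MAX_TTL}s"
        for c in changes
        if c.action != "delete" and not MIN_TTL <= c.ttl <= MAX_TTL
    ]
    if ttl_problems:
        return "request-changes", ttl_problems
    if any(c.record_type in SECURITY_TYPES for c in changes):
        return "security-review", ["touches record types that need security sign-off"]
    if all(c.action == "add" for c in changes):
        return "auto-approve", ["additions only"]
    return "human-review", ["modifies or deletes existing records"]
```

The ordering matters in this sketch: mistakes are reported first, riskier changes are routed to a person next, and only what is left over is approved automatically.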

Will People Get Stuck?

The next question is, how likely is it that people will get stuck and have to come and ask questions? We expect that when people are doing something unique, they'll get stuck. If it's something engineers do all the time, they should be able to follow the documentation and get it done. We want to make things easy to use, and we should aim the documentation at our newest developers, junior developers, and new starters. Wherever possible, we should be consistent with what already exists. That means there's no surprise when people are trying to do something. This is the documentation of how to set up those DNS records. It's clearly divided up into sections. We want to invest more in how to write good developer documentation. We've done some training in tech writing. We're looking to come up with some templates for structuring our tutorials and our reference docs.

Some companies have a consistent approach to building new services. The FT doesn't, and we have a lot of variety. As a result, we want people to be able to use our capabilities in ways we didn't expect. We want to have a Unix style philosophy. Unix is built to have simple, short, clear, modular, extendable code. You can compose things together, pipe the results of one command into another. It's immensely powerful. We want to supply APIs and other ways for people to build on top of our tools. One way we're approaching composable capabilities is our serverless blueprints. We use AWS. Lots of teams use the Serverless Framework to build things like Lambdas. Our blueprints have common patterns like reading an item from DynamoDB, and you can compose those together.

Does It Guide People To Do The Right Thing?

The final question I want us to answer is, does what we build help people to do the right thing? Defaults should be sensible. We should protect engineers from making mistakes. We should also protect them from accidentally spending a lot of money. If you use our capabilities, you should be confident they'll be kept up to date and vulnerabilities will be patched. Finally, they should be suitably reliable. Suitably is the interesting word here. It depends what relies on this thing. It means we need to understand our users. There's room for nuance. For example, our Biz Ops system registry is built on a graph of data. Some of the data in this graph is our runbook information, which means we need it to be highly available. Ordinarily, we'd expect highly available to mean we need it to be multi-region, but we don't want to have to support a multi-region graph database. What we do is extract the runbook information periodically and store it in highly available S3 in multiple regions.

We discovered there's a flaw here too, and one we found via a production incident. We have Fastly in front of the S3 buckets, which implements authentication via single sign-on. When we had a problem with single sign-on, we couldn't access Biz Ops and we couldn't access the runbooks, so now we also have a ZIP file of the same information in Google Drive. That level of backup is something you learn from experience, and having central teams that maintain capabilities means we all benefit from that experience.

Resources

Here's the full list. Your list might be different, but it's good to think about what principles you do want to follow. Personally, I really enjoy working in this platform and enabling team area. You have the chance to become experts in particular things, and the pleasure of building stuff for customers who are right there to get feedback from.

I want to finish by sharing a couple of really good reads on golden paths, how to approach them and how to focus on the bits that matter. It's also worth being aware that people are starting to build platforms for some of these things. Spotify have a thing called Backstage. If something like that fits with your stack and the way you work, it'll save you time and effort over building your own platform and enabling team.

Questions and Answers

Shoup: The very first question was about how you measure the Accelerate metrics.

Wells: We can generally tell how many changes we're doing, you saw the graph that I had in the talk. Actually, I'm pretty sure we have some parts of our estate where people aren't calling our change API. We have a very simple API that you can call from continuous integration pipelines, but it won't capture everything. We're probably undercounting. Once you get to 100 changes a day, and they're not happening at the weekend, I'm pretty happy. Then there's the time from committing code to it going live, which I prefer as a measure to ones that look over the whole lead time of a story, because there are always other reasons why that might get held up. It feels like it should be quite small generally, but it's a big challenge for companies to get to the point where you're able to commit code and have it go live quickly. I don't know that I could tell you what that measure is for the FT. I would expect for almost every team it's minutes to hours. I know that our website team basically do continuous deployment, so once it's merged to master, it goes live. Some of our other teams that have more state, which I think is probably the bit where it's dangerous, do a bit more before they actually press the button. It's generally pretty quick.
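For readers wondering what calling a change API from a pipeline might look like, here is a minimal sketch. The URL, the payload fields, and the function name are placeholders invented for illustration; the talk does not describe the real API's shape.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Placeholder address, not a real endpoint.
CHANGE_API_URL = "https://change-api.example.com/changes"


def build_change_request(system_code: str, commit_sha: str,
                         environment: str = "production",
                         now: datetime = None) -> urllib.request.Request:
    """Build (but do not send) a POST request recording one release."""
    payload = {
        "systemCode": system_code,   # hypothetical field names throughout
        "commit": commit_sha,
        "environment": environment,
        "releasedAt": (now or datetime.now(timezone.utc)).isoformat(),
    }
    return urllib.request.Request(
        CHANGE_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A pipeline step would pass the result to `urllib.request.urlopen` after a successful deploy. Any pipeline that skips that step is invisible to the count, which is the undercounting described above.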

For change fail rate, if I was calculating it, when we did 12 releases a year there would be one or two where you'd be waiting around and it would go wrong when you try and roll it back. That's like about a 16%, 20% failure rate, something like that. When I was leading the content platform at the FT, I worked out that we were probably at 1% or 2%. It's very hard to say because generally speaking, some of the things that fail you don't even know about, because the developer merges it, it goes live. They look at it and they go, no, and then they fix it. Unless you put in place a way for people to say, this particular change didn't work and then I wrote something else, you don't necessarily know. You know it's generally fairly low for that. I think also for fixing stuff is similar, we're normally going to release stuff pretty quickly.

What is the change failure versus experiment failure when you roll it back?

Change failure will be, we release some code, and suddenly the website has a list that's all wonky. We think, no, and we roll that back. An experiment failure would be, we're releasing code behind a feature flag. We turn the feature on, we see whether it has an impact on the metric that we care about. We realize it doesn't. In fact, it makes engagement worse, and we turn off the flag and remove the code.
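The feature-flag side of that distinction can be sketched in a few lines. The flag name, the hashing scheme, and the layout names below are assumptions for illustration, not how the FT's flags work.

```python
import hashlib


def in_experiment(user_id: str, flag: str, rollout_percent: int) -> bool:
    """Deterministically place a user in or out of an experiment bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rollout_percent


def article_list_layout(user_id: str, flag_enabled: bool, rollout_percent: int = 50) -> str:
    """Choose which layout to serve; the new code path sits behind the flag."""
    if flag_enabled and in_experiment(user_id, "new-article-list", rollout_percent):
        return "new-layout"
    return "current-layout"
```

Turning the flag off sends every user back to the current layout without a release, which is what makes a failed experiment cheap. Removing the dead code path afterwards is still a normal change.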

The FT's got everything, honestly. You name it, we'll have more than one of it. I think it's probably too far. Once you're in that position, how do you work your way back? We're just trying to standardize on obvious things that people like and pick a few things that work really well. I do really admire companies like Spotify, or Monzo, or Skyscanner where there's a lot of commonality. You can join the company and you can literally just go, I'm building a Go app, and it's got all the stuff that you need. If you do that in the right way so that developers find it genuinely helpful, then I think it's amazing. It's really hard to get there from when you're not there.

In terms of trying to work out what we would do at the FT, we would say if it's significant we'd like you to bring it to the TGG. Then the question is, what does significant mean? It will be, are you introducing a new programming language? Are you deciding you want to use an alternative graph database? We already have a graph database that we use in loads of places at the FT, you'd have to have a good reason to say we're going to introduce the additional complexity of a different one. You could win that discussion, but you need to explain why the current one doesn't do it. There's some element of, ok, you want to use a different hosting platform. We use Heroku. We use AWS. We have a small amount of Google stuff. You want to add something else in, you need to explain why. Actually, the central team aren't going to support that necessarily, so you're going to have to patch it and maintain it and be responsible for being called at 2:00 in the morning, if it breaks. Some of it is about, yes, you can have that but it's your problem. Some of it is we expect you to meet the guardrails for whatever it is that you're using.

Shoup: Various questions about executive support. If I remember correctly, you founded this team, or you were asked to help found it or lead it. Maybe you can say a bit about how this team was formed and the level of ongoing support.

Wells: We obviously had platform and infrastructure teams in the past, before I took over. About five years ago, when the FT platform was being built, there was still very much a sense that there was the operations and infrastructure team, who were building this platform, and then there were developers, and they weren't really talking about what was needed. That's how you could get something that was really good, with a couple of things that just weren't quite right. The platform was excellent, but it did things like abstracting away from AWS, which is annoying if you want to add autoscaling and the platform doesn't support it. That kind of thing. We'd adopted DevOps within development teams, and we totally understood that we were going to deploy our own applications and fix things if they went wrong.

What I did about three or four years ago at the FT was move into a role as director for operations and reliability. Now I was in charge of the operations team and things like monitoring, and just operational tooling in general. Then recently, this year, I took over other teams that do platforms. You start doing things, and it demonstrates the impact you can have on teams. You do need to have executive support, which generally means they have to believe what you're saying, and they have to have a vision that this feels like the right thing. We want our teams to be empowered. It's a bit scary. One way that it happens sometimes is you find a team that just has the ability to try these things out, probably because they're under the radar in some way. Maybe you're working on a system that's not critical, so you can try it out. Then everyone goes, that team's really good at moving fast or whatever.

We had a third party who built our mobile app, many years ago. We would always get this thing where our editorial stakeholders would say, how come they can release things in a week? It would be like, they don't follow any of the safety and quality processes that we're following. Then you have the conversation with editorial to say, would you rather that we get things out in a week, and sometimes we'll release something and you'll have to tell us, this is wrong, and we roll it back? And they would. It's essential for your technical leadership to back you. Something Cait O'Riordan, who was the former CIO for the FT, did was she actually got us to talk to the board about things. When I was their principal engineer, I went to a board meeting and did a five-minute talk about moving to the cloud, and how servers had changed from being pets to being cattle, which just really amused them, but it's sending the message, this is what we do.

Shoup: How do people at the FT do quality testing and performance testing? Because there are a lot of questions about, does your team do that? How does the FT do it?

Wells: In a lot of different ways. I'd say tools for code quality, or coverage, stuff like that, I think they're on the list of things we would want to do. We're trying to develop this paved road, this golden path. We probably want to be able to advise on some of those. It's just lower down the list than some other things in terms of where I think it impacts people. Maybe it's a bit more about maturity in each of the individual teams, because every development team has a view about what testing framework they're going to use. We have got some QAs for things like our mobile apps, but in general, we expect quality to be part of what every engineer does. If you're releasing hundreds of times a day, you have to be really careful that you don't have some manual QA process that just slows you down. We were doing a lot when we had people doing that testing. You have to look at how much it's catching versus how much it's slowing you down. I think personally a lot of it is around testing in production, monitoring as tests. Can you have something running that's constantly checking that it works? We would probably have an opinion, but I'd expect to get a lot more pushback on that from individual teams.
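To make "monitoring as tests" concrete, here is a minimal Python sketch of a synthetic check: a small assertion about production that runs on a schedule rather than once before release. The function name `check_page`, the URL, and the expected text are hypothetical, and the fetch function is passed in so the example runs without network access.

```python
from typing import Callable, Tuple

# A fetcher takes a URL and returns (HTTP status code, response body).
Fetcher = Callable[[str], Tuple[int, str]]


def check_page(fetch: Fetcher, url: str, expected_text: str) -> bool:
    """Return True if the page responds with 200 and contains the text."""
    try:
        status, body = fetch(url)
    except Exception:
        # A timeout or connection error counts as a failed check.
        return False
    return status == 200 and expected_text in body


def healthy_fetch(url: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP call, used only for this demonstration."""
    return 200, "<html><h1>Latest headlines</h1></html>"


def broken_fetch(url: str) -> Tuple[int, str]:
    """Simulates the site returning an error page."""
    return 500, "Internal Server Error"


url = "https://example.com/"  # hypothetical endpoint
print(check_page(healthy_fetch, url, "Latest headlines"))
print(check_page(broken_fetch, url, "Latest headlines"))
```

In practice a scheduler would call a check like this every minute or so against the live site and raise an alert on repeated failures, which is what lets it stand in for a manual QA pass when releases happen many times a day.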

Recorded at:

Apr 07, 2022

Sarah Wells
