Mature Microservices and How to Operate Them



Sarah Wells discusses some of the challenges for building stable, resilient services and ultimately what worked at the Financial Times.


Sarah Wells has been a developer for 15 years, leading delivery teams across consultancy, financial services and media. Over the last few years she has developed a deep interest in operability, observability and DevOps, and this has recently led to her taking over responsibility for Operations and Reliability at the Financial Times.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Wells: I'm going to talk today about microservice architectures: why we've adopted them at the "Financial Times," the impact that's had on the way we work, and the things we've learned about operating them that are basically essential if you're going to do it successfully. We've got five years' experience of building microservices. I think I worked on the first system at the FT that was built using microservices, which was the content platform, the delivery and publishing platform, and I was the principal engineer on that. But we now use them pretty much everywhere.

Things That Can Go Wrong When You're Operating Microservices

I want to start by just talking about some of the things that can go wrong when you're operating microservices. In December last year, we got reports that a lot of people were visiting the FT and seeing the sorry page, a 404. And we thought, "Oh, this isn't good." So we started looking into it. And very quickly we realized that the problem was with one of the redirects that we'd set up. We have a lot of redirects on the FT website, and that's because we have a particular format for our pages. So you have pages that are streams of articles.

We have stream pages that basically have articles all on a particular topic, and they have a particular UUID for the topic, but this isn't very friendly. It's not easy to remember. So you need to have something that's a bit more memorable, and we normally set up a vanity URL that is much easier to remember. All this is managed by a URL management tool, which is a microservice that we wrote. But the problem was that we had set up a redirect to a page that didn't exist.

And what was worse was the page that it came from had many redirects going to it as well. And what that meant was people were being funneled into our 404 page. So we started thinking, "Oh, we need to fix it." And we started looking into how we would fix the data via our URL management tool, but it turns out that we actually couldn't work out how to do that. We tried a few things, we got errors. It just wasn't working. And then we realized, "Oh, maybe we want to do a restore from backup," and found out that, although we had some very experienced developers working on this, none of them knew how to do a restore from backup on this tool because no one had touched it in two years. We weren't even sure how we did the backup. And the interesting thing that then came up is once you're not very sure about things, you become very hesitant. So someone suggested an approach which turned out to be the thing that fixed it, but it took us 20 minutes to get to the point of deciding that we were confident that this was the right way to approach it.

So we did get this fixed, but we really realized that when you have a microservice, it's quite possible that you haven't touched it for several years. No one helping us with this incident had any idea how it worked. And when that happens, you're really relying on two things. Does the monitoring tell you there's a problem and does the documentation give you what you need? Monitoring told us, basically, that we had a problem, but our documentation didn't include anything about restoring from backup.

The interesting thing is with microservices, you quite easily have a polyglot architecture. You can use lots of different data stores, different programming languages, and that's great. But what it means is when you need to work out exactly how this data store gets backed up, you may have a problem. The interesting thing about this particular issue is we were able to identify very quickly which microservice was involved. That's not my usual experience. Normally when things go wrong, even working out where the problem is, is a big part of solving the problem.

So another issue we had: we got told that editors were publishing updates to the lists on our homepage, and they weren't getting updated. This is a very important part of the "Financial Times" site because it's how people find the news, it's how we break new stories. And between the editor clicking a button to say, "I'm publishing the list," and it appearing on the website, there's probably 10 or more microservices. So our first problem is, where is the problem? What's gone wrong here? Luckily, we had some monitoring. This is monitoring an API endpoint, and we were able to see that the problem was with one of our content APIs. So at least we knew which team needed to look at it, because there were three different teams involved in this. You can see that we were starting to get increased latency and errors. So we started looking at this. And this happened at 5:45. You're just thinking about going home and then suddenly everything is wrong and you have to stay. And we quickly realized the problem was that our graph database, Neo4j, was using a lot of CPU, and we were running our microservices in containers, on a cluster, and we didn't have a lot of limits on CPU or memory, so it was basically starving everything of resources. It wasn't just the List API that was broken.

We worked quite hard and we managed to find a mitigation for this, but we still didn't really understand what the problem was. It took 24 hours for us to work out what had gone wrong. This graph is not easy to read, but it's basically showing that there were a couple of 503s, 504s, maybe 10 in the space of 15 minutes. What we'd managed to do was an update to the graph that had caused certain queries to be incredibly inefficient and brought down our entire cluster. And the thing about a graph database is, it's not a schema update, it's just loading data. So even though we knew that we'd made some changes, we actually didn't know what impact they would have.

Microservice architectures are more complicated to operate and maintain than a monolith. So why bother doing them? I mean, if they're more difficult, why not just stick with the monolith? And the answer for that is, business reasons. And actually, all of your technical decisions really should come down to business reasons. And to explain that, I need to tell you a little bit about the business of "Financial Times."

The Business of "Financial Times"

This is the picture I took of our paper being printed at our print site in Bow. But nowadays, we don't describe ourselves as a newspaper. We describe ourselves as one of the world's leading business and financial news organizations, and that's because most people don't read the FT in print, they read it online, on the phone, on a tablet, on a computer. We have a paywall, and that's where we make most of our money. If you know anything about print advertising or digital advertising, you'll realize this is a good thing because both of them are tanking. So news is in an interesting place right now. Local news organizations are really struggling and any news organization that doesn't have a paywall is starting to think, "How do I actually make money?" What's clear is that in the near future it's going to be really important for us to be able to experiment and try things out and see what's going to work for us.

And I saw a really interesting talk from Linda Rising last year that said, "It isn't an experiment if there's no hypothesis and if you can't fail." What she said is that experiment for most organizations actually just means try. They're going to try something out. And this is because if an organization invests quite a bit of money in a product or feature, it is really unlikely that they're going to decide that it wasn't worth doing and get rid of it. And the only way you get a culture of experimentation is if you basically can do experiments quickly and cheaply, because then you have a chance that someone will say, "This is what constitutes success," and you'll be able to say, "Oh, we didn't meet it, we're not going to roll it out."

A/B testing is built into the FT website and we do hundreds of experiments a year. And we start by saying, "What are we measuring? And what would be the criteria for its success?" And there are plenty of things that don't go live because we've decided that they didn't give us enough benefit. We have this A/B testing built in, but you still need to release code to do any kind of test. To show the blurb on the right and not on the left, you need to have some conditional statement in your code. So we need to be able to release code. We need to be able to do it very quickly. And we do: we release code thousands of times a year. And this means that we do have a culture of experimentation. But releasing changes that frequently doesn't just happen. You have to do a lot of work, and it had a massive impact on the way we work and the culture of our company. Microservices have been a thing that has enabled it for us: a combination of continuous delivery, microservices and DevOps. Together, they've been the foundation for being able to release things quickly. What they mean, though, is that the team that builds the system has to operate it too.

If it took the developers who built the List API 24 hours to work out what went wrong, there's no chance that some other team is going to be able to do that more quickly. So you have to operate it. But we've now been doing microservices for five years, so teams are starting to get smaller, they're moving on to new projects. The people who currently work on the website or on the content platform aren't the same people who made a lot of these decisions. What do you do when teams move on? How do you make sure that systems don't become unmaintained, unloved and risky? Because your next legacy system is going to be made of microservices not a monolith. And the one thing we know about every bit of software is eventually it becomes legacy.

I'm going to talk about three things. I'm going to talk about optimizing for speed, why we've done it, how we've done it, then the things that we've found essential to build in to operate microservices, and finally some of the ways we're attempting to make sure that when people move on we still have systems that work.

Optimizing for Speed

Optimizing for speed. I've just said it's really essential for us because it's part of what lets us experiment. And this is what we've believed for quite a long time. I was really happy when this book came out, "Accelerate" by Nicole Forsgren, Jez Humble, and Gene Kim. If you haven't read it you really ought to; I expect it's going to get mentioned in a lot of talks at QCon this year. It is about how you build high-performing software development organizations. And high-performing software development organizations have an impact on organization performance. So effectively they can increase the market share, the productivity, the profitability of an organization. And there is evidence for this. They've done surveys, done a lot of research. What they found is that there are four measures that are strongly correlated together and correlated with high-performing teams, the ones who have the most impact on the business.

The first one is about delivery lead time. How long does it take you to get from committing code to getting it live? And often when we talk about cycle time, we talk about from idea to live, but there's a lot of variability in that. Actually, what the "Accelerate" authors found is that commit to live gives you a very good accurate measure of performance. And generally, high-performers do it in less than an hour. That is going through the whole build and deployment cycle. So you can't be waiting for someone to sign it off in a staging environment if you're going to do it in an hour.

Second measure, deployment frequency. And really, this is a proxy for batch size because if you're deploying things frequently, you're doing things in small batches. So the hypothesis is if it's small, it's going to be easy to understand. High performers deploy on demand. You're not waiting to release once a week, once a month. You're just doing it when it's ready.

The third one is time to restore service. Is all of this moving fast actually just leaving you something that's really unstable? How quickly can you restore service if something goes wrong? Generally, again, less than one hour. And actually, that's linked because often you're restoring service by putting things through the same pipeline as when you release new features.

And finally, change fail rate. Are you basically compromising on quality to move this quickly? Well, high-performing organizations do changes with a failure rate of 0% to 15%. By contrast, low and medium-performance organizations have up to a 45% failure rate, which is pretty shocking actually to release that often and have to actually do a patch immediately afterwards. So what this tells me is that high-performing organizations release changes frequently, they're small changes, they fix things quickly, and they have a low change fail rate.

Continuous Delivery Is the Foundation

Continuous delivery is the foundation of this. The thing I take from the "Continuous Delivery" book by Jez Humble and David Farley is: if it hurts, do it more frequently. Basically, the whole point of continuous delivery is that releases used to be really painful. And as an example, our old build and deployment process was very manual. We could only release a maximum of 12 times a year because we had to freeze the website and we couldn't publish new stories while we were doing a release, so we had to negotiate with the editorial department to say, "Can you stop publishing news for a bit?" So we'd do it on a Saturday once a month. And it was a manual process. This is an extract from an Excel spreadsheet where it was all documented. This just shows 6 lines out of 54. They were never correct because you clearly aren't going to have 54 lines in a spreadsheet without some kind of typo, and they often went wrong. This was really painful for us, so we totally went, "Oh, yes. This is painful. We should fix it," and we adopted continuous delivery.

Also, you can't experiment when you're only doing 12 releases a year, because by the time you get feedback, you do some changes, you put it out, it's six weeks before you even find out whether it's at all worth doing. And even then it's difficult because maybe you've put so many changes together, you can't tell the impact of one thing or another.

So we went to continuous delivery, and the first thing we did was an automated build and release pipeline. It needs to be stored as code, it needs to be version controlled. You need to be able to recreate your pipeline from scratch, if you have to. And effectively, your aim is to make it so incredibly easy to release code that anyone can do it at any time. In our previous process, people were scared to do releases. Now, no one at the FT is scared to do a release. It doesn't mean you don't evaluate the risk. We don't necessarily release code at 5:00 on Friday. We probably won't do a lot of releases on a day where there's a Brexit vote happening because we really want to be aware of our risk, but you know that the release is going to work and you know that you can fall back to the previous version easily if it goes wrong, and that is not something we used to have.

The second thing is you can't stop to do manual regression testing if you want to move fast. It just takes too long. You can't regression test against your entire system, so you need automated testing built into the pipeline. And when you do do manual testing, you need to target it so that it's focused on the change you've just made. So keep that scope really local to that change. It should be fairly independent.

And finally, continuous integration. And I think this is the thing that people sometimes miss with continuous deployment. You need to be putting changes out regularly. It's good to have an automated pipeline, but if you're still only releasing once a week, you're not benefiting from it. And continuous integration means that your master branch is always releasable and you release from it. Now, we use GitHub so it kind of pushes you towards branching and pull requests, but we don't let them live for very long. A branch is there for less than a day and then it is merged. And I think if you aren't releasing multiple times a day, you need to work out what's stopping you and you need to decide whether it's worth adopting a microservice architecture, because it's going to cost you, so you need to make sure you get what you need.

Often, the architecture is the thing that's stopping you. For us, the 12 releases a year were because we couldn't do zero downtime deployments. We had a SQL database that had schema changes, and they could take quite a long time to apply. So from the beginning our new systems were built for zero downtime deployments. That was mostly done through sequential deployment. So we'd have multiple instances and we'd deploy to each of them in turn. We also tended to move to schema-less databases. So in microservices you can have a lot of different data stores, and most of ours are now schema-less: document stores, graph databases. And that tends to mean that you don't have to stop and do some big upgrade. And if we do, we can fail over to one region, upgrade the other region and fail back.

The great benefit of zero downtime releases is that in-hours releases become the normal thing that you do. And when something goes wrong, anyone that you need to help you is going to be there.

If you want to move fast, you need to be able to test and deploy your changes independently. You don't want to have to rely on anyone else. So what this means is you can't have to wait for another team to do some code changes and you can't queue up for an integration testing environment. You need to be able to control your own release cycle, otherwise you get slowed down. And what that means, and this is borne out by the research in "Accelerate," is that your systems and your teams need to be loosely coupled. In "Accelerate," they sort of talk about architecture, and they say, "It actually doesn't really matter what your architectures are as long as they are loosely coupled." So for us, microservices are loosely coupled. You can keep your monolith loosely coupled. It's much harder work because it's much easier to get things dependent on each other without you realizing.

"Process Theater"

It isn't enough to change your architecture and to build a continuous delivery pipeline. You need to look at your other processes as well. When I joined the FT, we had a change approval board, and you would go to the board on a Tuesday to release to test on a Thursday, for example. And we also created change requests. We'd fill out a form to say, "We're going to do this change." And you'd have to get someone senior to say, "Yep, I have approved these changes." I mean, clearly, that's not actually valid because you're putting four weeks' worth of work out, no one is going to have reviewed every single part of that. So it's really about process theater. It's theater because it's pretending to be making things less risky, but really, it's not. It's not having that impact.

And you can remove this stuff. So change approval boards don't reduce the risk of failure. There's research in "Accelerate" on this which says that they actually have no impact on the change failure rate, but what they do is slow down everything else. Your change failure rate is similar, but you deploy less frequently, it takes you longer to fix stuff and generally you tend to do bigger releases. So it's not a good thing and we don't have change approval boards anymore. Our theory is, if you're doing small changes, the best person to make the decision about whether this change should go, is the person who just made that development change. And if you're filling out a form for every change and you're doing 10 releases a day, that just takes too long. It can double the time for you to do a release. So we've changed this process as well. We rely on the peer review, the pull request, as the approval, and we just log that we've made a change. So we call an API to say, "We released this system. This git commit, this person." And that's all we need, the ability to go, "What has changed?" when something goes wrong.
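The lightweight change log described above can be sketched as a small record built at release time. This is only an illustration: the talk says the record names the system, the git commit and the person, but the field names, the timestamp and the JSON encoding here are assumptions, and the call to the (unnamed) change API is left out.

```python
import json
from datetime import datetime, timezone


def change_record(system, commit, user, when=None):
    """Build the JSON body for a 'we released this' log entry.

    Field names are hypothetical; the talk only says the record covers
    the system, the git commit and the person who released.
    """
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "system": system,
            "gitCommit": commit,
            "user": user,
            "releasedAt": when.isoformat(),  # assumed extra field
        },
        sort_keys=True,
    )
```

A deployment pipeline would post this body to the change API as its last step, so that "What has changed?" can be answered from one place during an incident.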

Speed and Benefits

How fast are we moving at the FT? Well, this is from the content delivery platform, so this is just one group within the FT. In 2017 we did 2,500 releases, and it works out at about 10 releases per working day. And just for fun I created an equivalent scale graph for the monolith. And it does have data on it. We're releasing about 250 times as often, which is a massive difference. But what about the failure rate? Are we seeing a difference in the change failure rate? Well, our changes are small, which means they're easy to understand. They're independent, so the blast radius is greatly reduced. When something does go wrong it tends to be only affecting a small part of our website, a small part of our APIs. And they're easy to reverse because we've got this automated pipeline.

When we did 12 releases a year, one or two of them would fail and when they failed they were incredibly painful. It's about a 16% failure rate. I'm pretty confident in 2017 we had less than 25 releases fail. So that's less than a 1% failure rate. And the kinds of failures we got were much less significant because when you fail with one of those big bang releases, you've lost any chance of trying any of the new functionality. You have to roll the whole thing back.
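The failure-rate figures above can be checked with a few lines of arithmetic. This is a sketch using the upper-bound numbers quoted in the talk (two failures out of 12, and 25 out of 2,500); the figure of 250 working days a year is an assumption used only to reproduce the "10 releases per working day" estimate.

```python
def change_fail_rate(failed, total):
    """Return failed releases as a percentage of all releases."""
    if total <= 0:
        raise ValueError("total releases must be positive")
    return 100.0 * failed / total


# Monolith: "one or two" failures in 12 releases (upper bound used here).
monolith_rate = change_fail_rate(failed=2, total=12)

# Microservices in 2017: fewer than 25 failures in 2,500 releases.
microservices_rate = change_fail_rate(failed=25, total=2500)

# Assumes roughly 250 working days in a year.
releases_per_working_day = 2500 / 250
```

With these inputs the monolith rate comes out just under 17% and the microservices rate at 1%, which matches the "about 16%" and "less than 1%" quoted in the talk once the upper bounds are taken into account.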

The Cost of Operating Microservices

So that was about speed and the benefits, but there is a cost and operating microservices is harder. They are an efficient device for transforming business problems into distributed transaction problems. This is actually quite an old tweet now, but I think it's just totally true. Everything's over network traffic now and things are unreliable. And all your transactions being distributed means that things can partially fail or partially succeed. You're hoping for eventual consistency, but quite often you end up with just inconsistency that you have to fix up somehow. But luckily there are patterns and approaches that can help.

I think DevOps is a good thing to do regardless of your architecture because you want your team to all have the same goal. You want all your team to be focused on the value you can give your customers. If you have separate development and operations, you've got one team that wants to keep things stable and one team that wants to get changes out there. So DevOps is great, but you absolutely have to do it for microservices. The example I gave with the lists earlier: that team had to be the one that did the investigation. And you can't hand things off to another team if they're changing multiple times a day. It's that whole thing again: as soon as you are coupled to another team, you slow down.

Decisions about Tools and Technology

High-performing teams get to make their own decisions about tools and technology. And there's two things behind that. The first thing is, actually, empowering your teams makes people happy. That's not a bad thing. But the second thing about it is it speeds you up again. You can't be waiting for the result of an architecture review board to decide what queue it is that you're going to use. You need to make that decision. But the effect of that is that teams will make different decisions and it actually makes it extremely hard for any central team to support everything. So you have to have those teams supporting the decisions that they've made, paying the price. If you choose a flaky queue, you're the one that's going to have to operate it.

You can mitigate this a bit by making things someone else's problem. So you shouldn't spend time installing and operating Kafka if you could get away with using the queue that your cloud provider offers, similarly with databases. Why install and run a database if you can just get someone else to do that for you? And ideally, you choose something where backups and everything else is done for you by someone else. So lots of the FT runs on Heroku because that makes things very easy. Where we run on AWS, we want to move away from installing our own software onto EC2 instances. We want to use the things that we're offered from AWS. So we want to use Kinesis, we want to use Aurora, all the things that just make it someone else's problem.

Simon Wardley is talking later in this conference. I really recommend seeing him. He's an IT strategist. One of the things he talks about is how successful technologies go through this cycle. It starts off when it's something for experts, and then after a while everyone is doing it, but they have to build their own solutions. Finally someone will convert it into a product and eventually it becomes ubiquitous and it's made into a commodity. And an example of this is electricity. Nowadays you wouldn't build your own power station. Another example is compute. You should just be using AWS or equivalents rather than building your own data center. So the next step from that is to think, "Well, actually, buy rather than build." If a commodity is available, use it, unless it's critical to your business. The only time where you want to be doing stuff yourself is where it is the differentiator for you and the customization that you can do is critical for your business.

The Level of Risk

Another thing about operating microservices, and moving fast generally, is you need to work out what level of risk you're comfortable with. And it's going to be different for different organizations. The FT are not a hospital or a power station. We've had this conversation within the FT. No one dies if we make a mistake when we release something to our website. And this is not to say that we're cavalier about it. Now, clearly, we really care about security, we care about personal data, all our business critical functionality, we think carefully before we do a release, but we are not as worried about releasing something that maybe moves things around on the website. As long as we can fix it quickly, identify and fix it quickly because we value releasing often so we can experiment. That's the decision that we're making.

Grey Failure

With microservices, distributed architectures generally, you have to accept that you'll generally be in a state of "grey failure." Something's not working. The only thing that matters is, is it having an impact on the business functionality? So if you've got a VM that's down and pods are being moved around but you've got multiple instances, that's probably okay. Charity [Majors] is great. She works at Honeycomb, she's got so much to say about observability. I really like that she says, "You have to embrace failure and lean hard into resiliency." So things are going to fail; build resilience in, so that your system will recover without you having to do anything.

Retry on failure is pretty essential for this because you might be calling an instance that's being patched or is moving to another VM, but you need to be careful about it because if you just retry you can have a thundering herd problem, where a service is just getting back on its feet and you send thousands of requests to it and you knock it over again. So back off and retry. An exponential backoff is quite good. You wait maybe half a second, then a second, then two seconds. But this can quickly add up to quite a lot of time, so it's quite a good idea to set some kind of time budget. Set a header saying, "If you as a service get this request and it's already two seconds since I sent it, just stop because there's no point." But you need everything to handle that. But it's a good idea, the time budget.
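The backoff-with-a-budget idea can be sketched as follows. This is a minimal illustration, not the FT's implementation: the function name, the `BudgetExceeded` exception, the set of retried exception types and the default five-second budget are all assumptions. The delays follow the half a second, one second, two seconds pattern from the talk.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the next retry would overrun the caller's time budget."""


def retry_with_backoff(call, *, first_delay=0.5, factor=2.0, budget=5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, clock=time.monotonic):
    """Call `call()` until it succeeds, doubling the wait after each failure.

    Gives up as soon as waiting for the next attempt would go past the
    overall time budget, rather than retrying forever.
    """
    deadline = clock() + budget
    delay = first_delay
    while True:
        try:
            return call()
        except retry_on as exc:
            if clock() + delay > deadline:
                raise BudgetExceeded("gave up: time budget exhausted") from exc
            sleep(delay)    # back off so a recovering service is not swamped
            delay *= factor  # 0.5s, 1s, 2s, ...
```

The `sleep` and `clock` parameters exist only so the function can be exercised without real waiting. To honour a budget across services, as the talk suggests, the remaining time (`deadline - clock()`) would also be passed downstream in a request header so each callee can stop early.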

Mitigating first, rather than fixing, is something developers find quite difficult because I think we naturally want to fix things. We want to work out what went wrong and then fix it. Mitigation is normally quicker. So this is probably something that will be familiar to a lot of people. The discussions where you say, "Well, this thing is broken." "Well, there was a release that just went out." "Yes, but the release can't possibly have caused it." And after maybe 20 minutes, 30 minutes of debate, you roll back the release, and it fixes the problem. So basically roll it back. See if it fixes the problem. Work out why later. If you've got two regions, fail over. That's basic. And then indulge in your excitement at investigating the problem.


You need to make sure you know when something's wrong. It is extremely easy with microservice architecture to get inundated with alerts because you just start monitoring for all your systems. But what you want to know is what's actually broken? What's the impact on your customers? So concentrate on the business capabilities. For us, that's something like, are we able to publish content right now? We do something that we call synthetic monitoring. So it's monitoring, it's happening in production all the time and it's synthetic because it's not a real event. What we do is publish an article every minute to check that publishing is working. And actually, we take an old article that doesn't normally change and we just publish it. So it looks a bit like this. We have a service, and it sends in a publish event using exactly the same mechanism as all our content management systems do. The box at the bottom is our system; it's a bit simplified. And then we just read the same APIs that our website reads and check that the update made it through. And we monitor it the way we monitor anything else. It's got a health check endpoint, and basically it's healthy as long as publishing is working.

The cool thing about this is it actually doesn't matter what changes in the box below because you're only poking at one end and reading at the other end. It's very resilient to changes in your architecture, and also there are no data fixtures required. We're running this in production on real data, and because, for the content platform, we don't have any personal data, we can just copy all of the data to staging and run it there as well. And this is also useful because it lets us know whether publishing is working, even if no one is actually currently publishing real content. And news publishing can have peaks and troughs. Before we had something like this, if we had an alert in place to say, "Has anything not been published for a couple of hours?", the chances are that alert is going to fire on Christmas Day because no one publishes anything on Christmas Day. You really don't want alerts that are most likely to bother you on a bank holiday.
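A synthetic publish check of this shape can be sketched as below. It is an illustration under stated assumptions: the `publish` and `read_marker` callables stand in for the real publish mechanism and read APIs, and tagging each synthetic publish with a unique marker is one possible way to tell that the update made it through; the talk does not say how the FT detects it.

```python
import time
import uuid


def synthetic_publish_check(publish, read_marker, *, timeout=60.0,
                            poll_interval=1.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Publish a synthetic update, then poll the read side until it appears.

    `publish(marker)` pushes a republish of a known article through the
    normal publish entry point; `read_marker()` returns the marker currently
    visible through the same API the website reads. Both are hypothetical
    hooks supplied by the caller.
    Returns True if the update became visible before the timeout.
    """
    marker = uuid.uuid4().hex  # unique per run, so stale reads don't pass
    publish(marker)
    deadline = clock() + timeout
    while True:
        if read_marker() == marker:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A scheduler would run this every minute and expose the latest result through the service's health check endpoint. Because the check only touches the two ends of the pipeline, it keeps working when the services in between are rearranged.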

Synthetic monitoring is great but it's not enough because you also need to make sure that you know whether real things are working in production. Our editorial team is very inventive. They've had to work around the limitations of systems for so long that they will try something out and see if it manages to get published successfully. So that means that we need to check what they're doing when they publish, but it also means we need to decide, "What do we mean by publish being successful?" We have two regions, and when you publish an article it goes into MongoDB, Neo4j, Elasticsearch, and S3 in each of them. So actually, a successful publish has to get to eight different destinations. And that's complicated to do with normal monitoring, so we wrote a service to do that monitoring. It listens to notifications of publish events, checks every place where that content should be, and waits for two minutes. And if it doesn't happen in that time, effectively it says, "I'm unhealthy. You need to republish this bit of content." And because publishing an article is idempotent, we can just republish it. And then basically, we end up with eventual consistency, even if sometimes we have to publish it a few times.
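The "eight destinations" check can be sketched like this. The region names, the `exists` lookup and the `republish` hook are all hypothetical placeholders; the only facts taken from the talk are the two regions, the four stores, and the idea of republishing when something is missing after the waiting period.

```python
from itertools import product

REGIONS = ("region-a", "region-b")                 # hypothetical region names
STORES = ("mongodb", "neo4j", "elasticsearch", "s3")


def missing_destinations(content_id, exists):
    """List the (region, store) pairs that do not yet hold the content.

    `exists(region, store, content_id)` is a caller-supplied lookup.
    """
    return [
        (region, store)
        for region, store in product(REGIONS, STORES)
        if not exists(region, store, content_id)
    ]


def check_publish(content_id, exists, republish):
    """Run once the waiting period (two minutes in the talk) has elapsed.

    Republishes when any destination is missing, which is safe only
    because publishing is idempotent. Returns the missing destinations.
    """
    missing = missing_destinations(content_id, exists)
    if missing:
        republish(content_id)
    return missing
```

An empty result means the publish reached all eight destinations; a non-empty one is what would flip the monitoring service to unhealthy and trigger the republish.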


You need to build observability into your system. And the reason you need to do this is because the way that you debug microservices is completely different from how you do it in a monolith. So I remember working on the monolith, something would go wrong, you'd work out how to replicate it, you'd attach a debugger and you'd step through. And your logs were probably pretty shocking because you never really had to rely on them. With microservices, you probably can't run your entire stack locally. You need to be reliant on the stuff that you've got there in production to work out what was going on.

Observability is, can you infer what was going on in the system by looking at its external outputs? So generally, with our systems that's logs, metrics, monitoring. And you want those to be good enough that you can basically work out what went wrong.

Log Aggregation

Log aggregation is important because the events that you've got in a microservices architecture are all over the place. They're in different services, they're on different VMs. You need to aggregate them all together somehow. You might need to do sampling, you might need to get rid of some of them. You can do things like keep all the error logs, but sample the successful ones. What we do is we remove all the logs that are just from our health checks from the monitoring. But you still need the ability to find the logs for a particular event because there'll be lots of logs, whatever you're doing. And we do that using a transaction ID. And we wrote this ourselves because the service meshes and tracing services weren't around when we were doing it. So we expect there to be a header on a request that has a unique identifier. And every service has to output that in logs and pass it on to any other services. And if it doesn't get a transaction ID, it will generate one. And we have a library for doing this. So it's pretty easy to add it to any new service, and it's essential. When you're trying to debug something, you find the transaction ID, you can find all the logs.
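The transaction ID pattern can be sketched in a few lines of Python. This is an illustration only, not the FT library: the header name `X-Request-Id`, the `tid_` prefix and the function names are assumptions, since the talk does not name them.

```python
import logging
import uuid
from typing import Dict, Mapping

# Assumed header name; the talk does not say which header is used.
TRANSACTION_HEADER = "X-Request-Id"


def get_or_create_transaction_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's transaction ID, or generate one if it is missing."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TRANSACTION_HEADER.lower() and value:
            return value
    # Assumed format: a fixed prefix plus a random UUID in hex.
    return "tid_" + uuid.uuid4().hex


def outbound_headers(transaction_id: str) -> Dict[str, str]:
    """Headers to attach to any downstream call so the ID is passed on."""
    return {TRANSACTION_HEADER: transaction_id}


def log_event(logger: logging.Logger, transaction_id: str, message: str) -> None:
    """Include the transaction ID in every log line so it can be searched on."""
    logger.info("transaction_id=%s %s", transaction_id, message)
```

With something like this in every service, searching the aggregated logs for one transaction ID returns the whole path of a request.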


Metrics are measuring something over a period of time. It's really easy when you start with microservices to measure a lot of metrics and then get completely overwhelmed by how much you've got. I would say that you need to look at the metrics for services closest to your user because any problem that's happening on a read between the user and the database is going to be shown in that service that's closest to them. And keep it simple. You want to know how many requests you're getting, how long it's taking for those requests to respond and what the error rate is.
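Those three numbers (request count, response time, error rate) are small enough to sketch directly. The Python below is a hypothetical in-memory recorder for one time window, not a real metrics client; the class name, the nearest-rank percentile and the choice to count only 5xx responses as errors are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestMetrics:
    """Tracks request count, latency and error rate for one time window."""

    durations_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, duration_ms: float, status_code: int) -> None:
        self.durations_ms.append(duration_ms)
        # Assumption: only 5xx responses count as errors.
        if status_code >= 500:
            self.errors += 1

    @property
    def request_count(self) -> int:
        return len(self.durations_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.request_count if self.request_count else 0.0

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile of response time, in milliseconds."""
        if not self.durations_ms:
            return 0.0
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```

In practice you would record these in the service closest to the user and ship them to whatever metrics system you already run.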

With any system, you're always doing migrations and upgrades. So it might be that, as we did, you built your own cluster orchestration, and now Kubernetes is ready and you want to move to it. It might be that you need to just upgrade a database version. It could be that you actually made a decision and you've realized it wasn't the right decision. So you're always doing these changes. But we have 150 microservices in the content platform. You don't want to have to release 150 services to do an upgrade. You want to try and make it so that you can have some centralized way of doing it. So for example, with deployment pipelines, we have templates. So we can easily create new deployment pipelines, but also we can easily update them. So if our security team says, "Oh, we want to add some security scanning into all your deployment pipelines," we add it in one place and all of the pipelines get updated.

Service Mesh

We don't use a service mesh because they just didn't exist when we started with this. I would totally use that if I was building microservices now, because the service mesh can take on lots of things like back off and retry, routing, load balancing, log aggregation, all the things that we've built into our systems with individual libraries. The problem for us is if we have to update a library, we have to release a lot of services. A service mesh can do that for you. And it means that your microservices focus on your business functionality, not all these other things that need to be there just to have a network of systems.

Any code base has bits that don't change that much. The difference between a monolith and microservices is that can mean that there's a service that you haven't actually released for years. We've got some microservices that we probably haven't released in two years. And what's interesting with that is, if you haven't done it, are you sure you could when you needed to do it? You don't want to find that out at the point where there's a security vulnerability that you need to patch urgently. And we've had a couple recently. We had to upgrade Node for all our Node services, we had to upgrade Kubernetes. You really don't want to be working out that nothing is working at that point. And we have one team at the FT that builds all of their services overnight. They don't deploy them, they just build them. So at least they know when the build is not working and they can do small fixes along the way.

What Happens When People Move On?

The final thing I want to talk about is what happens when people move on, when the team gets smaller, when people move on to new challenges and the people you've got maybe weren't involved in a lot of the decisions. I think it's critical to make sure that every system is owned. And I think the system needs to be owned by a team. It's not enough to say, "Oh, yes, Luke knows about this system," because what happens if Luke leaves the FT or he's busy working on something else where he can't make the time to fix it? You need to know that if there is a security vulnerability, this team is on the hook to fix it. This is a difficult discussion that we struggle with, with our product people, which is, if you won't invest enough to keep the system running, you should shut it down. Some of the people at the FT in product would like to believe that systems just run with no effort for years, but it isn't true. Particularly with microservices, you need to maintain a level of knowledge of the system, the ability to fix it.

Operational documentation is important, but when you've got thousands of systems, it's really hard to keep it up to date. We started with a searchable runbook library. So literally just template, fill out a runbook. And it was interesting. As we moved to microservices we started to struggle with it a bit. And Liz [Fong-Jones] is talking in a couple days' time, I'm really looking forward to her talk. She basically says, "Don't over-invest in runbooks." I'm going to slightly disagree and say, "Work out what's important in your runbook." I agree that the troubleshooting steps are probably a record of past failures and they're not going to help you this time because everything that you do is probably different. Every failure is probably different. But what you do want to know is, who owns this system? Where is the code? How do I look at the logs? How do I find the metrics? All the things that are going to help you. How do I do a restore? That stuff is going to be useful. It's going to be useful for me if I look at it six months after I last worked on it.

One thing we did get right at this point was system codes. Everything has a unique system code and we use it in our logs, we use it in alerts, we tag our AWS resources with it. But we needed to represent all this information as a graph with microservices in particular. When I moved from the content platform team, we had to do 150 updates of individual runbooks to stop me being the tech lead. You don't want to do that. It's a graph. It's like there is a relationship with a team with a tech lead. So we have a graph, we have people, teams, systems - and this is the start. We want to represent all kinds of system information in a single store, and then something like a runbook view is just one view of our systems. And then if I move team, all you have to do is say the tech lead is basically someone else and that's one update. It's still a challenge because there's a lot of information here and no one likes updating documentation, so it helps if you can give people something in return.
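To make the "one update" point concrete, here is a minimal Python sketch of systems pointing at a shared team record instead of each runbook holding its own copy of the tech lead. The class names, field names and example values are hypothetical; the talk does not describe the actual store or schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Team:
    name: str
    tech_lead: str


@dataclass
class System:
    system_code: str
    team: Team  # an edge to the owning team, not a copy of its details


def runbook_view(system: System) -> Dict[str, str]:
    """One view over the graph: ownership is looked up, never duplicated."""
    return {
        "system_code": system.system_code,
        "team": system.team.name,
        "tech_lead": system.team.tech_lead,
    }


# Hypothetical data: 150 systems owned by one team.
content_team = Team(name="content-platform", tech_lead="alice")
systems = [System(f"system-{i:03d}", content_team) for i in range(150)]

# The tech lead moves on: one update on the team node, not 150 runbook edits.
content_team.tech_lead = "bob"
views = [runbook_view(s) for s in systems]
```

Every runbook view now reports the new tech lead because each one follows the relationship at read time.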

And one thing we're working on at the moment is a new monitoring dashboard system in the FT that we're building and we're calling it Heimdall, because Heimdall is the god that can see everything everywhere. Apparently his hearing is so good that he can hear wool growing on sheep. What we've done with this is we look at that graph of system data, and provided you've created everything correctly, you just get a dashboard for free. This is a new thing at the FT. You used to have to maintain your own dashboards before. And we found that as we started putting information on to this, so for example, we started showing whether you said that this system was decommissioned, we found people went, "Oh, actually, it's not decommissioned," and they'd fix up the data. So this is about basically giving something to people that will encourage them to make sure the data is correct.

You need to practice stuff. You do not want to be trying something at 2:00 in the morning when you've just been woken up and discover that the documentation is wrong or you don't have access rights to do this thing. So practice is important.

It's back to that Continuous Delivery thing. Continuous Delivery means that releasing code is not scary, you do it all the time. You want all your operational stuff to be really similar, something that is totally not scary and you just know how to do it.

Failovers, database restores. These are the basics, they're the starters. Basically, do you know how to fail over to another region? Do you know how to restore the database and have you practiced it? We practice failing over weekly and different people do it every time. So we know that we're testing those steps, we know that the documentation is correct.

Chaos engineering is, obviously, cool. The engineers love the sound of it. The business are absolutely terrified of the thought. And I know some people are changing the wording because of that, but it's not about unleashing chaos. And there's a whole load of talks about this that I know are coming up and they will be really interesting. It's not about unleashing chaos. You need to, first of all, understand what your steady state is. What does your system actually look like when it's working successfully? What kind of noise do we expect? Is it alert-free? And then you look at what you can change. So what can I do? And you want to minimize your blast radius. You do not want to be breaking things in production deliberately. It does not make you popular.

And then you basically think, what are you expecting to see happen? Are you expecting alerts to fire? Do you think that latency will go up? Will there be an increased rate of errors? And then you run the experiment, and you see if you were right. And what we found is that often we were wrong. In particular, what we normally find is no alert has fired even though we took down one region and we would have expected an alert to fire. And this is useful.
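The loop described here (state what you expect, run the experiment, compare) can be written down as a small harness. The Python below is a sketch under assumptions: `inject`, `observe` and `rollback` are hypothetical callbacks you would supply, and the two checks (alert fired, error rate) are just the examples mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Observation:
    alert_fired: bool
    error_rate: float


def run_experiment(
    expected_alert: bool,
    max_error_rate: float,
    inject: Callable[[], None],
    observe: Callable[[], Observation],
    rollback: Callable[[], None],
) -> List[str]:
    """Inject a failure, observe the system, and list where the hypothesis was wrong."""
    inject()
    try:
        seen = observe()
    finally:
        # Always undo the change, even if observing fails.
        rollback()

    surprises: List[str] = []
    if seen.alert_fired != expected_alert:
        surprises.append(
            f"expected alert_fired={expected_alert}, observed {seen.alert_fired}"
        )
    if seen.error_rate > max_error_rate:
        surprises.append(
            f"error rate {seen.error_rate} exceeded expected maximum {max_error_rate}"
        )
    return surprises
```

An empty list means the system behaved as predicted; a non-empty one, such as an alert that never fired when a region went down, is the finding you act on.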


Wrapping up - building and operating microservices is hard work. It's much harder than the monolith, but I think it is worth it. You have to prepare for legacy by maintaining knowledge of the services that are live. You have to invest in it. It's basically that you have to garden, you have to get rid of the weeds and keep things clear. And you need to plan now. If you're building microservices, think about the case in three or four years' time where you've got a skeleton team trying to maintain 150 microservices. That could be extremely difficult. Basically, plan for that, build things in. And remember that it's about the business value, so keep an eye on whether you're still able to move fast. If you're stopping moving fast, you're losing the benefits, but you're still paying the cost.



Recorded at:

Mar 28, 2019