Panel: the Correct Number of Microservices for a System Is 489



New research from the University of Sledgham-on-the-World has revealed that the correct number of microservices for any software system is 489. Given that, why have so many organizations used a different number of microservices? The panelists discuss the architecture of their various systems, what trade-offs they have made in the design of their systems, and how their systems have evolved over time.


Suhail Patel is a Backend Engineer at Monzo focused on working on the core Platform. Jason Maude works at Starling Bank as one of their lead engineers and host of the Starling podcast. Nicky Wrightson works on the data platform at Skyscanner. Sarah Wells is the Technical Director for Operations and Reliability at the Financial Times.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Maude: This is a QCon exclusive, where we can reveal some exciting new research. Research from the University of Sledgham-on-the-World has revealed that the correct number of microservices that you should be running is 489. This number was calculated, "By taking the total distinct code concepts that can be written in melted cheese on the number of pizzas a two-pizza team eats during an average Sprint, and dividing it by the mean average of your desired unit test coverage and the coefficient of the number you first thought of." That obviously arrives at 489. I'll let you work out the math for yourself. This would be of great and propitious import to the entire software engineering industry, if I hadn't just made it up, which I did. That's completely fake. Chucking that out of the window.

We can now start to ask the question, why 489? Why did we come up with this concept of stating a number of microservices that should be run in production? This came out from a tweet from Monzo about the number of microservices they run in production and what their software architecture looked like. They got a load of tweets saying, "That's completely rubbish. You're running far too many microservices in production. How can you possibly know what's going on?" That raises the question, what do you mean? What is too many microservices? How do you know that you're running too many microservices or too few?

What we wanted to do with this session was explore that concept from a number of different perspectives. I'm going to slowly invite up on stage, three guests from very different organizations who have taken very different approaches to the number of microservices in their systems landscape. How they've designed that landscape or how that landscape has evolved over time, whether by design or not. I'm going to call them up on stage. I'm going to quiz them about what they have in production. Then at the end, I'm going to turn the floor over to you to quiz them for me. I would like to invite the first one of my guests up, who is Sarah Wells, the Technical Director for Operations and Reliability at the "Financial Times."

The Systems Landscape at the Financial Times

Could you describe for us what your systems landscape is like at the "Financial Times?"

Wells: Pretty much anything you could have probably exists somewhere at the FT. We have teams that load apps to Heroku. We have teams that are building Lambdas probably in Node, maybe in Java. We have teams running containers. We're gradually moving towards EKS. We've probably also got people on ECS, Beanstalk. We have a whole ton of stuff. We've still got systems that sit on a VM. We've probably still got some Java apps running on Tomcat. We opened up people's choices maybe four or five years ago. Rather than saying this is the message queue you should use, or this is the database, we basically said, if you can make a case for why this is a good thing for you, you can do it. We have a lot of different databases. I think we have three different ticket tracking systems. Basically, it's a bit of a free-for-all.

Maude: You really have taken the concept of developer autonomy quite far there.

Wells: We have. What's interesting is sometimes you find people do see something and go, this seems really good. You do get people coalescing around particular technologies, where they've seen another team use it and they think, "That would work for me." I don't think we've got anyone using a graph database that isn't Neo4j. We've got a lot of people using CircleCI. You do find that people move towards the things that are clearly easy to use and provide value.

Maude: In reference to our talk title, how many microservices would you say are at the FT?

Wells: It's quite difficult to say. I think once you have a lot of microservices, you probably don't know how many. I can say how many systems we have, because we track them in a business operations store, but which of those are microservices, I don't know. What I would say is it's been pretty much the standard architectural pattern. Some of our bigger groups, for example, building the website, our content platform and APIs, have around 150. Then we have other teams that maybe have 10. They're doing smaller chunks of work. Obviously, if people are writing Lambdas, they could have quite a lot of individual Lambdas.

Mitigating Teams Going Off the Rails in a Diverse Microservices Landscape

Maude: One thing that someone might say if they came along and saw this vast landscape with many different approaches taken across the landscape as a whole, is, how can we be sure that this is resilient? Because then you've got to measure the risk of each individual team and whether they can run their system in production effectively. How do you manage to mitigate the risk of one individual team going off the rails, as it were, and doing something crazy?

Wells: You couldn't do the approach we've got unless it really was: you build it, you run it. We provide standards and guardrails for people. If this is a critical system, we expect you to be in two regions. If it's maybe less critical, you've got to be in two availability zones. We've got security standards. We have other standards that say, here's how you should represent the health of your system. If you're going to empower your teams, you have to actually trust them as well. The best thing you can do is trust them but make sure you can pick up where they've got it wrong. A simple example: we have information in our business operations store that says, does this system handle personal information? If you've said yes, it does, and you have an S3 bucket that is tagged with that system code and it's open, we're going to flag that to you. Did you mean to make that bucket open? Providing that checking is a sensible approach.
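The bucket check Wells describes can be sketched in a few lines. This is a minimal illustration, not the FT's tooling: the `SystemRecord` and `Bucket` types, the system codes, and the bucket names are all hypothetical, and a real check would read the bucket tags and access settings from the cloud provider rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRecord:
    """Hypothetical entry in a business operations store."""
    system_code: str
    handles_personal_data: bool


@dataclass(frozen=True)
class Bucket:
    """Hypothetical view of a storage bucket and its system-code tag."""
    name: str
    system_code: str
    publicly_readable: bool


def buckets_to_flag(systems, buckets):
    """Return names of open buckets owned by systems that handle personal data."""
    sensitive = {s.system_code for s in systems if s.handles_personal_data}
    return [
        b.name
        for b in buckets
        if b.publicly_readable and b.system_code in sensitive
    ]


# Made-up inventory for illustration only.
systems = [
    SystemRecord("subscriptions", handles_personal_data=True),
    SystemRecord("image-resizer", handles_personal_data=False),
]
buckets = [
    Bucket("subs-exports", "subscriptions", publicly_readable=True),
    Bucket("subs-archive", "subscriptions", publicly_readable=False),
    Bucket("public-thumbnails", "image-resizer", publicly_readable=True),
]

flagged = buckets_to_flag(systems, buckets)
print(flagged)
```

The check only flags; it does not close the bucket. That matches the "trust them but pick up where they've got it wrong" stance: the owning team still decides whether the open bucket was intended.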

Maude: It's interesting how you would decide when you would impose those rules across the landscape, and when you would just say, actually we'll let people go free in this particular instance.

Wells: It is about risk. We probably don't always get it right. We definitely have this categorization of this is a brand critical, or this is a business critical system and it has a higher standard required. For example, we will phone people up out of hours for that. Actually, as soon as you have a system where you could be woken up, out of hours, you take that much more seriously in terms of resilience. You will set yourself up so that you don't get phoned up when something could be automated.

Maude: With great power comes great responsibility.

Wells: Yes.

Approaches to How Big a Microservice Should Be

Maude: Having lots of different approaches to microservices means that you might see different sizes of microservices, almost. I suppose that the question of how big a microservice should be, is one that has been constantly with us since microservices came along as an idea. What approaches have you seen to that question?

Wells: All the things people say about how big a microservice should be don't make sense. If it's a thing owned by a team, then it's also not something that fits in someone's head. We have maybe 5 teams that between them have 150 microservices. What we've found useful is being prepared to recognize where you've got it wrong. If you find that there are four microservices that you always end up releasing at the same time, then they're not separate. Can you combine them? We've particularly done that around services that write into a graph database. Do we have one service that does reading and writing of a particular type of concept, or do we separate those out, because when we read it's a different thing we're trying to achieve? We've moved between the two. Recognizing that there's some pain here, and that therefore we probably ought to change our approach, is quite effective.
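One way to look for services that "always end up releasing at the same time" is to scan a release log for pairs that never ship apart. The sketch below assumes a hypothetical log where each entry is the set of services deployed in one release window; the service names and the `min_releases` threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def always_released_together(releases, min_releases=3):
    """Find pairs of services that never ship without each other.

    `releases` is a list of sets, one per release window. A pair is reported
    only if both services appear in at least `min_releases` windows and
    appear together in every one of them.
    """
    per_service = Counter()
    per_pair = Counter()
    for services in releases:
        per_service.update(services)
        per_pair.update(combinations(sorted(services), 2))

    return sorted(
        pair
        for pair, together in per_pair.items()
        if together >= min_releases
        and together == per_service[pair[0]] == per_service[pair[1]]
    )


# Hypothetical release history.
releases = [
    {"concept-reader", "concept-writer"},
    {"concept-reader", "concept-writer", "search"},
    {"search"},
    {"concept-reader", "concept-writer"},
    {"search", "notifications"},
]

coupled = always_released_together(releases)
print(coupled)
```

A pair showing up here is a prompt for a conversation about merging or untangling the services, not proof that they belong together.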

Maude: That's a great metric. The metric of, are you deploying this at once?

Wells: It's the, have you actually got microservices or do you have a distributed monolith? Unless you can genuinely independently deploy something, then it isn't a microservice. The other metric I think is useful to think about this is, are you actually managing to release things all the time? If you have got a system made up of microservices and you're not releasing multiple times a day, I would wonder whether you're really truly getting the benefit for it. Because there's a lot of cost for having a microservice architecture. The website was 12 releases a year, 5 years ago, and across the content platform and the frontend, it's probably 4000 releases. Recently, we've added a change API, and I think we're doing about 40,000 changes a year.

Maude: There might be costs in terms of complexity, and in terms of making sure your code is split up and so on. The benefit is this rapid release cycle that you can't do with a monolith.

Wells: I don't know how long people have been coding, but do you remember Git Merges? I just don't remember the last time I did a Git Merge, because basically you don't have a long-lived branch, and you're doing small changes. It's not the pain that it used to be. When things go wrong, you're not trying to work out which of the many changes that you bundled together was the cause.

Maude: That brings lots of benefits in terms of risk and resilience in itself, that ability to do that.

Responses to Monzo's Microservices Tweet

I'd now like to invite up our second speaker, who is Suhail Patel. He is a Backend Engineer at Monzo. A lot of the inspiration for this talk was the tweet that Monzo sent out about the number of microservices in their architecture. What responses did you get from that tweet?

Patel: It was really quite interesting because a lot of people have this sticker shock reaction. They say, 1500 microservices? Then they ask, how many engineers do you have? We had about 150 engineers at the time. People did the division: "That means 10 services per engineer." We wanted to run with that as a joke. You join Monzo. You get your 10 services. You treat them like pets. You give each of them a name. You care for them. Then they grow up. That's not really the reality. If you look at Monzo internally as a company, there are a lot of different domains that we are involved in. For example, we've built an in-house chat system. We've got a lot of business tooling. We've got an entire financial crime department. We've integrated with nearly 20 payment networks, and lots more ongoing. All of these are really complex pieces of software and architecture that we've had to build in-house. If these were spun out into different companies, we would have tens of companies for the things that we have built. What we've been able to do is build an architecture where we have some shared common abstractions but lots of different and separate entities and teams built around them. That coalesces into one big 1500 microservice number, but without context, it just is a sticker shock number.

Maude: How do you know where the boundaries should lie between these abstractions? How do you know if you've split something down into two smaller pieces?

Patel: I think the release analogy that Sarah had was absolutely correct. If you're releasing these things in tandem, or if you've got some coordinator service, or if they're sharing data across microservice boundaries, and storing these pieces of data in lots of different places, then you might have not got it quite correct. Ultimately, the way we see microservices is it's a bounded context. If the thing has its own data entities, its own data abstractions, it fits within its isolated environment, can be scaled independently, deployed independently, I think that's the sweet spot of having a microservice.

Maude: When you've got a new engineer coming on, and they're not being given their 10 microservices individually, but they're asked to do a certain task, how do they know what's there already? How do they know if there is already something within the 1500 microservices that does what they need to do?

Patel: One of the things that's not commonly talked about is the amount of tooling that is almost required when you get to any amount of microservice scale. When you get to the hundreds of microservices, you need a set of common tooling that engineers are familiar with. Right from the get-go, engineers join with a backend engineering 101 to understand what abstractions are provided in the architecture, and where they essentially insert their code, where the business logic is. Every single microservice at Monzo follows the same common set of patterns, even to the same file structure so that you can jump into any microservice and understand and jump into the code, whether you're on-call, whether you're a veteran engineer, or whether you're a completely new engineer.

Good Documentation Culture

Aside from that, having good documentation and READMEs is also really important. That's veering more into the cultural side of things, having good knowledge management. Sharing that context. Having a forum for questions of whether a microservice exists, whether it's the right domain, whether it's the right scope to add this additional piece of functionality or to break it up further. What is the history behind it? It's all really important. Those are more cultural issues. If you're talking about services themselves, having documented API boundaries. We use a lot of proto style API boundaries so that it's well documented. All the fields are there. They're marked as required or optional. That stuff really does help.

Maude: What's the rate of change we're talking about here? How many microservices are you adding a month?

Patel: I think it gets to about 100 per month, at the absolute peak. The thing is, a lot of people are thinking of this as a velocity. Are we moving faster because we're adding 100 microservices this month, or are we moving slower because we're adding only 30 this month? I don't think that's the right way of looking at it. When we have a new abstraction being built, if we're going to embark on a completely new project to add a new payment network, it's going to be pretty common that we're going to add 10 to 20 microservices to deal with all of the abstraction layers and all of the bounded contexts that are involved. For example, you open the MasterCard spec and it's thousands of pages of documentation. You want to break those up into small individual components. I don't think the number of microservices is a function of time. I think it's an evolution: when you're adding more complexity to your system, how do you break it down into individual chunks that are manageable and able to be retained in some group of engineers' heads?

Documentation Complexity As More Things Are Added

Maude: Of course, as you add more things, this knowledge management of making sure that you've got all of your landscape documented, and you can pass that knowledge on, that becomes a difficult challenge in some ways.

Patel: Absolutely. I don't think it's a solved problem. If anyone's looking at Monzo for the solution, unfortunately, I don't think we've even quite found the perfect solution. What we can do is we can make incremental steps. We've made a lot of advances in our tooling, being able to figure out where a microservice is, what region it's deployed on, and all of that stuff. Deploying changes is really easy. There is a code owner for every service, at least you can go to members of that team. At least you know, what is your starting point to go find out information? If a service is just sitting there in production and everyone's afraid to touch it, then we've got a problem. I think tooling and some amount of documentation helps. For example, we run in a single repository. We have all of our microservices in a single repository. One of the benefits that brings is that you can do essentially a grep across the entire repository and see what microservices are there that fit your patterns. For example, if you do a grep for MasterCard, all of the MasterCard services and many more services as well will pop up because they have MasterCard named as variables, or in comments, and stuff like that, or even in the README. These are all viable starting points. Whether you're going to find perfect documentation to get to your endpoint, without speaking to another human being, I don't think that's a problem that anyone has quite cracked.
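The "grep across the entire repository" idea can be sketched as follows. This assumes a hypothetical monorepo layout in which every top-level directory is one service; the transcript does not describe Monzo's actual layout, and the directory and file names below are invented. The demo builds a throwaway repository in a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def services_mentioning(repo_root, term):
    """Return top-level service directories with any file containing `term`.

    The match is case-insensitive and covers code, comments, and READMEs alike.
    """
    root = Path(repo_root)
    needle = term.lower()
    hits = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) < 2:
            continue  # a file at the repository root belongs to no service
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if needle in text.lower():
            hits.add(parts[0])
    return sorted(hits)


# Build a tiny, made-up monorepo to search.
with tempfile.TemporaryDirectory() as repo:
    for service, filename, body in [
        ("service.mastercard-auth", "README.md", "Handles Mastercard authorisations."),
        ("service.ledger", "main.go", "// posts entries; called by the mastercard processor"),
        ("service.chat", "main.go", "// in-house chat"),
    ]:
        directory = Path(repo) / service
        directory.mkdir()
        (directory / filename).write_text(body, encoding="utf-8")

    matches = services_mentioning(repo, "MasterCard")

print(matches)
```

As in the transcript, the result includes services that merely mention the term in a comment, so it is a list of starting points rather than a precise ownership map.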

I think this applies even if you have a monolith. If you're completely on the other end of the spectrum, and you have a monolithic, one single binary that you deploy, one single application even running on a single server. If you have lots of classes, functions, or even variables, or one big function, all of these things are complexity. All of these things you're going to have to speak to other humans and gain context about, why was this designed in this way? What were you taking into consideration? What was going through people's minds? That stuff is not really explicitly documented at every step.

Does Keeping Everything in One Repo Lead To Inappropriate Coupling Between Microservices?

Maude: We still haven't managed to get rid of humans yet. Keeping everything in one repo, does that lead to any anti-patterns developing in terms of coupling, or coupling that is inappropriate between microservices?

Patel: Yes. One of the things that really comes up is, at what point should something graduate to a library? Say an abstraction is used in one place over here in an isolated microservice, and then a similar abstraction is built for another microservice. At that point, you come to a logical conclusion: this should be a library so we can share some code. For us, for something to graduate into a library means that we are adding another layer of support, and analytics, and metrics, and expertise around it. One of the themes I've seen going around at Monzo, and one that I really agree with, is that you don't really want something to graduate into a library until you've written it a few times, so that you can understand what is the best interface to provide to engineers to make the most effective use of it going forward. Have you captured the whole problem space? If you write something once and say, "I'm going to graduate this immediately to a library," you might solve your specific problem, but then you might get the interface wrong for other people. Then changing that is going to be a pain. It's something that we've experienced as well and something that we've had to do. Being in a single repository and using a statically typed language like Go, we can do checks to make sure that things continue to build. That stuff helps. It's still painful. We have to now educate engineers to make sure they're doing it in this new way and not the old way. Where's the new documentation? That migration path is just painful.

Taking Into Account Latency That Appears From Calls across the Network

Maude: One thing that does differ between microservices and a monolith is that in a monolith, if a class is calling another class, or a function is calling another function, the only time you've got to take into account is the processing time within the various classes. Whereas, if a microservice is calling another microservice, there is additional time taken for that call to go across the network. With 1500 microservices, something simple like making a payment, say I'm trying to pay for something on my card at a supermarket, is going to have to go across many services. How do you take into account the latency that comes not from the processing, but from the call-to-call hops?

Patel: I'm going to invert it ever so slightly. In a monolith style system, and it's something that I've worked with in the past as well, there's a lot of things performance-wise or latency-wise that engineers don't really take into account. For example, what if your program has a GC pause right in the middle? That's going to ruin your latency budget. What if your disk gets slow? If you're running on Amazon, you're running on EBS, and your EBS Burst Balance depletes, your disk access is going to get a lot slower than what you usually expect. If anything, surfacing the fact that you're going over a network means that there's some amount of resiliency and reliability that you need to take into account. I do agree that your reliability might go from near enough 100% to 99.9999%. We don't try and hide that fact. We tell engineers that crossing an availability zone boundary is probably going to add 1 millisecond of hop latency to your application, but you can't rely on that. That's not a guarantee that anyone gives. We don't control the network, Amazon controls the network. Even they don't get it 100% perfect. If anything, don't try and hide the fact that you're going over the network. In your application, make it explicit that this is now a network call, and take into account: do you have a hard real-time deadline? Ideally, in the 99th percentile, you want to try and get it within this amount of time budget. Occasionally, it might go over. That's the reality with most applications. If you have a GC pause, you can't give a 100% guarantee. The same thing applies whenever you're touching anything that goes outside of CPU or memory context. Even memory can get slow. RAM and disks fail all the time. What happens? Computers are fickle.
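Making a network call explicit, with its own time budget, might look like the sketch below. It is an illustration under assumptions rather than Monzo's approach (their services are written in Go): `call_with_deadline`, `DeadlineExceeded`, and the two lookup functions are hypothetical names, and the "remote" calls are simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class DeadlineExceeded(Exception):
    """Raised when a remote call does not answer within its time budget."""


def call_with_deadline(remote_call, budget_seconds, executor):
    """Run `remote_call` on the executor and wait at most `budget_seconds`.

    The caller stops waiting at the deadline; the worker thread itself is not
    cancelled and finishes in the background.
    """
    future = executor.submit(remote_call)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        raise DeadlineExceeded(f"no reply within {budget_seconds}s") from None


def fast_lookup():
    """Stand-in for a healthy downstream service."""
    return "balance=42"


def slow_lookup():
    """Stand-in for a downstream service hit by a pause or a slow disk."""
    time.sleep(0.3)
    return "balance=42"


with ThreadPoolExecutor(max_workers=2) as pool:
    fast_result = call_with_deadline(fast_lookup, 1.0, pool)
    try:
        call_with_deadline(slow_lookup, 0.05, pool)
        slow_outcome = "ok"
    except DeadlineExceeded:
        slow_outcome = "deadline exceeded"

print(fast_result, slow_outcome)
```

The point of the wrapper is that the caller has to choose a budget and handle the miss, instead of treating the call as if it were a local function.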

The Architecture at Skyscanner and the Unique Challenges They Face

Maude: Much like humans. I want to introduce my third panelist, who is Nicky Wrightson, the Principal Engineer at Skyscanner and the track host of the microservices track. Could you describe the architecture of Skyscanner and some of the unique challenges that Skyscanner faces?

Wrightson: I can really only talk to the data platform as that's my area. We deal with 2 million messages per second entering. Every single thing about our world is scale. Every paradigm that you've learned before and bring into your next job breaks. You have to innovate at that scale.

Maude: You're facing this fairly unique constraint of processing that many messages per second. You have to take a different approach. What are some of the rules that you often hear or principles you hear about your microservice architecture that you have to go, that's just not going to work with this many messages.

Wrightson: It's less about the microservice architecture and more about the standard, typical way of doing several hops via a queue. We want to flatten that. We make that as short as possible because, of course, you get that many messages. When I talk about scaling, just to give one example, we have an ECS cluster, and this is just for writing to our route stack, that we had to ask AWS to add more nodes to because we had 4000 containers running in it. That's just for our logs. If you don't horizontally scale and keep that vertical movement short, you just end up with backlogs and you just can't process that amount of data.

Maude: What you're dealing with there is you're dealing with a lot of problems where you don't actually see each individual problem, you're seeing volumes of problems, almost. The probability of a problem happening.

Wrightson: Exactly that. I was previously at the FT where we could have really great observability in all of our microservices, but it doesn't mean anything at that scale because one message really doesn't mean much. You're definitely looking for trends. You're looking for anomaly detection, thus it's all about the metrics at this point.

Maude: With that monitoring, you end up being another step removed from the underlying systems that you're running.

Wrightson: Definitely. You can be opinionated, but you can't delve into it and reason about what you're actually transporting from A to B. The monitoring has to evolve too. Traditionally, we would check that something enters the system and then comes out of the system just by comparing an RDS, anything like this. You can't do it at that scale. It just breaks. We're using probabilistic data structures to give us an indication of when there might be an issue, which was quite a funky solution.

How Probabilistic Data Monitoring Works

Maude: Could you give us an example of that? How would probabilistic data monitoring work?

Wrightson: I'm not a huge expert on this actual part of it. Herman was definitely the person. Basically, it's just saying that given a set, we've got the probability that we're always going to get a definite answer of when it is there. You can compare and contrast if you get the same answer from both sides. It's more of like set logic than actual comparisons.
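The transcript does not say which probabilistic data structure Skyscanner uses. One common choice for set-membership questions is a Bloom filter, so the sketch below is an assumption for illustration only. A Bloom filter answers "definitely not in the set" or "probably in the set": it never misses an item that was added, but it can occasionally report an item that was not. The sizes and message IDs are invented.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical pipeline: 100 messages enter, one is lost before the exit.
entered = [f"msg-{i}" for i in range(100)]
delivered = [m for m in entered if m != "msg-50"]

seen_at_exit = BloomFilter()
for message_id in delivered:
    seen_at_exit.add(message_id)

# "Definitely missing" is reliable; "probably delivered" is only probable.
definitely_missing = [m for m in entered if not seen_at_exit.might_contain(m)]
print(definitely_missing)
```

The filter holds a fixed number of bits regardless of how many messages pass through, which is why this family of structures is attractive when comparing both ends of a pipeline record by record is no longer feasible.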

Investigating Production Issues and Debugging

Maude: I imagine that debugging problems and investigating production issues must be a whole different ballgame in that circumstance.

Wrightson: It is. The trouble is that we end up breaking things that you don't find on the internet. We regularly break AWS. AWS use us. They're quite proud of the fact they use us for testing.

Maude: You are essentially a Chaos Monkey for AWS.

Wrightson: Exactly that. Yes.

Maude: That's a fantastic side business to run, I suppose.

Wrightson: Yes. It does make life interesting.

How the Issues Are Fixed To Lower the Error Rate

Maude: How do you approach it then trying to fix the problem? Are you just throwing stuff at the wall to see what lowers your error rate?

Wrightson: We can definitely reason about things. We're in the massive process of trying to separate concerns of our system. As we isolate at least the different types of data, we can isolate the problem a little bit better. Our metrics are good as well. We can often see where traffic is stopping or slowing down. We've got quite advanced monitoring of latency through the system, which is definitely the indicator that something's gone a bit wrong.

Maude: In some ways I imagine you can abandon stack traces at this point.

Wrightson: Yes. You don't look at logs.

Maude: When you've got that many log clusters, looking through the logs is just pointless.

Wrightson: It does become pointless.

How to Properly Scale Up and Down Automatically To Control Latency

Maude: When you're trying to control this system, and you're sitting one step back and having this system run before you, how do you scale up and scale down automatically? Or, not even manually do that, but automatically create the system to scale up and scale down in order to control the latency without overshooting, overprovisioning or provisioning in such a way that another system ends up getting too much traffic and slowed down?

Wrightson: We've got a couple of different autoscaling methods. The one that I'm most familiar with is this Go application that routes all of our traffic. This is the most critical application in our world. If that goes down, Skyscanner is in trouble. We scale that via CPU. There's an interesting reason why we scale it via CPU, and it's not a good one. We haven't actually put a limit on how many processes it can spawn. We found out the hard way that if we didn't scale on CPU, it would just run out of memory, because it wouldn't scale. It would just go, one, and then bang. We do overprovision. Sometimes we do it manually when we see things are going a little bit wonky. It will do some autoscaling, but there have certainly been times where it doesn't. For example, in Japan, there was a talk show. They showed Skyscanner on their phone. This sent an unbelievable amount of traffic immediately to the site. We're running on ECS. It wasn't scaling quickly enough. Then we were failing over to another region, which of course was knocking out the next region, and so on.
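To make "scale via CPU" concrete, here is a generic proportional rule: size the fleet so that average CPU returns to a target. This is a sketch under assumptions, not Skyscanner's configuration or the exact algorithm of any AWS scaling policy; the function name, the 60% target, and the task limits are hypothetical.

```python
import math


def desired_task_count(current_tasks, avg_cpu_percent, target_cpu_percent,
                       min_tasks, max_tasks):
    """Proportional rule: scale the fleet so average CPU moves back to target."""
    raw = math.ceil(current_tasks * avg_cpu_percent / target_cpu_percent)
    return max(min_tasks, min(max_tasks, raw))


# Busy but not saturated: 10 tasks at 90% CPU with a 60% target.
print(desired_task_count(10, 90, 60, min_tasks=2, max_tasks=50))

# Quiet period: the floor keeps some overprovisioned headroom.
print(desired_task_count(10, 30, 60, min_tasks=8, max_tasks=50))

# Sudden spike: CPU cannot read above 100%, so one step adds at most 100/60.
print(desired_task_count(10, 100, 60, min_tasks=2, max_tasks=50))
```

The last case shows one reason a rule like this can lag behind a sudden burst such as the talk show example: a saturated fleet reports 100% CPU whether demand is double or twenty times capacity, so each scaling step can only grow the fleet by a bounded factor.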

Dealing with Cascading Problems

Maude: I imagine these cascading problems must be a worry that you have to deal with.

Wrightson: Yes. We've now deployed that to 4 regions, we're hoping to go to 10.

Maude: Worldwide deployment, yes. From Starling we can empathize a little bit with being unexpectedly displayed on a talk show and then having your system suddenly overloaded, although not to the same extent, I imagine.

Breaking Apart a Monolith and the Permissions and Authentication between Services

Participant 1: We have a monolith in our platform. I'm interested in looking at how we break it apart. One of the things that I am curious about is how you think about the permissions and authentication between those services. Going off of Sam Newman's talk, talking about exposing an API in the monolith that a microservice could then consume from. How do you think about authentication? Do you just assume that because it's internal, anything goes, or do you use something else?

Maude: Traffic between your different services, how do you deal with authentication, or do you deal with authentication? Do you just go, we've received a request from an internal service. It must be good.

Patel: Assuming that everything internal is safe, initially, that might be ok. I think the moment you get some scale, I don't think that's an assumption that can be made universally. It only takes, not even something malicious, but something accidental to knock another part of the system out. It's something that's pretty common. One microservice that shouldn't be talking to another microservice, and maybe it's got a stale DNS entry, and it's just flooding it with a bunch of traffic, and it's not provisioned correctly, accidental failure happens all the time. For authentication specifically, we've embarked on a bunch of projects. We've got a centralized authentication service, which is also responsible for internal authentication. Just part of our zero trust policy and defense-in-depth, being a bank, these sorts of things are really important, and also regulated. It's mandated but also really good practice.

Also, we've embarked on projects doing network-level isolation, where a service deployed on our shared multi-tenant architecture is physically blocked, using firewall rules, from talking to a service that it shouldn't be talking to. That means if you do get into one particular entry point, like one of our API entry points, through a bug or a security vulnerability that hasn't quite been patched yet, maybe a zero-day vulnerability, you won't have the ability to talk to a critical part of the system like our transaction service or our ledger. Not only are you unauthenticated, you can't even get through the firewall. Your packet, literally, will not be routed to that. Even in the accidental case and in the exploitation case, it just won't get there.
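The isolation rule described here amounts to a default-deny allowlist of which service may talk to which. The sketch below models only the decision, not the enforcement, which in practice sits in firewall or network policy rules; the service names and the `ALLOWED_CALLS` table are hypothetical.

```python
# Hypothetical allowlist: caller -> set of services it may call.
ALLOWED_CALLS = {
    "service.api-gateway": {"service.account", "service.feed"},
    "service.account": {"service.ledger"},
    "service.payments": {"service.ledger", "service.transaction"},
}


def is_call_allowed(caller, callee, allowed=ALLOWED_CALLS):
    """Default deny: a call passes only if it is explicitly listed."""
    return callee in allowed.get(caller, set())


# A compromised edge service cannot reach the ledger directly.
print(is_call_allowed("service.api-gateway", "service.ledger"))

# A listed path is allowed.
print(is_call_allowed("service.payments", "service.ledger"))

# An unknown caller gets nothing, which also covers accidental traffic.
print(is_call_allowed("service.stale-dns-client", "service.account"))
```

Because anything unlisted is dropped, the same table covers both the exploitation case and the accidental one, such as a service with a stale DNS entry flooding the wrong target.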

Wells: We distinguish between the microservices that make up a system, and the relationships between those microservices and those that are part of another system operated by another team. What we've found quite useful is to have an API gateway that sits between that. There's a couple of reasons for that. The first thing is we require you to have an API key. We've recently made sure that we can tie that to a team rather than a person, and that's linked to a system code. We know what system is making that call. Then you can add throttling. This is a good idea, because you as an owner of an API are responsible for making sure that someone else doesn't take your API down. When we were early on doing some microservices, we had a developer working on our website who managed to make 26,000 calls in parallel to one of our APIs and took down one of our clusters. I treated that as our fault, because we didn't have the throttle set correctly. It was an unscheduled resilience test, and it failed over to the other region, and we were fine. It's our fault. That combination of you can control at the gateway, and you know who it is, is really useful.
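The combination of identifying the caller and throttling at the gateway can be sketched as follows. This is a minimal illustration in Python, not the FT's gateway: the `Gateway` and `TokenBucket` classes, the key-to-system-code mapping, and the limits are all assumptions, and the token bucket is one common throttling technique rather than the one the panel necessarily used. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allows up to `burst` calls at once, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now):
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gateway:
    """Maps API keys to system codes and throttles each system separately."""

    def __init__(self, key_to_system, rate_per_sec, burst):
        self.key_to_system = key_to_system  # hypothetical: API key -> system code
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}

    def handle(self, api_key, now):
        system = self.key_to_system.get(api_key)
        if system is None:
            return 401, None  # unknown key: we cannot tell who is calling
        if system not in self.buckets:
            self.buckets[system] = TokenBucket(self.rate, self.burst, now)
        if not self.buckets[system].try_acquire(now):
            return 429, system  # throttled, and we know which system to talk to
        return 200, system


# Hypothetical key and system code.
gateway = Gateway({"key-abc": "website-frontend"}, rate_per_sec=1, burst=2)
statuses = [gateway.handle("key-abc", now=0.0)[0] for _ in range(3)]
print(statuses)  # [200, 200, 429]
```

Because the bucket is keyed by system code rather than by individual, a burst of parallel calls from one system is rejected at the gateway and can be traced to an owning team.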

Maude: I would also suggest that if anyone does introduce a bug into production, use the phrase, unscheduled resilience test to describe how it goes.

Wrightson: It's a little bit of a mixed bag, actually, because the data platform needs to be quite open. You've got apps, and we've got to have all of their data sent to us. The only way you can do that is through public open APIs. Once it's in our world, we're doing more service-to-service communication rather than API calls. We're hitting an API, but it's then talking immediately to Kafka. We do different styles of security and authentication on that side of things. We've recently had Monzo's blog post about their zero trust network sent around. Part of the work we're doing in our Kubernetes world is definitely going down that route and trying to leverage Istio as much as possible for that.

How to Design a (Bank) Database to Cater For Microservices When You Can't Get Away With Eventual Consistency and Caches

Participant 2: When you have a lot of microservices like you said, 1500 microservices, how do you design the database to cater for those microservices especially with a bank when you can't get away with eventual consistency where you can cache and stuff like that? I just wanted to know how it works.

Patel: This is pretty specific to the Monzo case. One of the key tenets of our microservices architecture is that you share data by communicating. You don't want different microservices accessing the same underlying data in your database. Even reads are a bit troublesome. When you have mutations going on from different microservices, that's where you're really in problem land. When you've got things touching the same underlying data, they might not have the same guarantees. You might not have the same locking primitives, or you might not have the same safety. You might introduce race conditions. You might not have the same deployment strategy, or a bug can completely ruin your data and then you need to resort to backups, which is never ideal. You might even have silent data corruption because you've got multiple things touching the same piece of underlying data.

Regarding the eventual consistency point and caches, I think that really depends on the design and the primitives that you have around writing data. For example, one thing that we are pretty vocal about is that we use Cassandra. We use Cassandra for all of our microservices. All of our microservices that need to store data are storing it in a Cassandra cluster. That's something that we support. We've added a lot of abstractions and tooling on top to make sure that we have the consistency guarantees that we need for our data. Of course, we want to make sure that we have locking primitives and stuff like that, but that goes beyond just storage of data. If you're coordinating storage of data as well as calling a third party, for example, we integrate with lots of different payment networks, all of that might need to happen within the same transaction. That might be more than your database alone can provide.

Maude: Nicky, availability of database connections is a concern when you're dealing with 2 million requests a second.

Wrightson: Yes. We have a Kafka cluster of 77 nodes.

Maude: That's quite a few.

Wrightson: Yes. It's pretty difficult to reason about and exceptionally hard to roll out any changes. Really difficult. It took us two-and-a-half months to change the AMI on every single one of those. That's our biggest pain point. Otherwise, we're completely in that segregation mode as well. We don't have two different services owning the same data. That's the critical point.

How to Get Around Ownership and Updating Services

Participant 3: I have a question with regards to ownership as well as keeping things up to date. You mentioned code owners, that kind of stuff. Obviously, as you get a lot of services, maybe it's not reasonable for a single person to own them. Obviously, holidays can get in the way, how do you work around that?

Wells: I'm Tech Director for Operations and Reliability, so I care deeply about knowing who owns systems. We've got a policy that everything should be owned by a team. You can't have something owned by a person, because they might be working on something else. They might be on holiday. It's got to be a team, and everything has to be owned. It's not easy to make that happen when you've got thousands of services. You have to try and make sure that there's something that says you are the team that owns this, and expect people to maintain good documentation and to do testing, backups, and restores. We've got a central system where we track information about all our systems and we actually score it. We score the operational information. We've gamified it. If you're a group or a team, you can look and see how you're doing compared to other teams in terms of having filled in the critical information. That works enormously well. There are some very competitive developers at the FT, and you get comments in Slack of, "We've just overtaken you again." I think my team's OKR is we will be top at all points, because we wrote it.
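A scoring scheme of the kind described above could look something like the following minimal Python sketch. The field names in `REQUIRED_FIELDS`, the equal weighting, and the sample systems are all hypothetical; the transcript does not say which fields the FT tracks or how they are weighted.

```python
# Hypothetical set of operational fields every system record should fill in.
REQUIRED_FIELDS = ("owning_team", "runbook", "backup_restore_tested", "on_call_contact")


def completeness(system):
    """Fraction of required operational fields that are filled in for one system."""
    filled = sum(1 for name in REQUIRED_FIELDS if system.get(name))
    return filled / len(REQUIRED_FIELDS)


def team_leaderboard(systems):
    """Average completeness per owning team, best first. Unowned systems are grouped."""
    by_team = {}
    for system in systems:
        team = system.get("owning_team") or "UNOWNED"
        by_team.setdefault(team, []).append(completeness(system))
    scores = {team: sum(vals) / len(vals) for team, vals in by_team.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data, for illustration only.
systems = [
    {"name": "monitoring", "owning_team": "ops", "runbook": "link",
     "backup_restore_tested": True, "on_call_contact": "ops-rota"},
    {"name": "article-api", "owning_team": "content", "runbook": "link"},
    {"name": "legacy-cron"},
]

for team, score in team_leaderboard(systems):
    print(f"{team}: {score:.0%}")
```

Grouping unowned systems under their own label makes the gap visible on the same leaderboard, which fits the policy that everything has to be owned by a team.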

Running the System in Production

Maude: When you've got something owned by a team, what do they have to take care of? Obviously, the running of the system in production, the documentation, is there anything else that they have to be able to guarantee?

Wells: If they're using a platform that isn't centrally supported, they need to make sure that things are patched. If you've chosen to deploy it on something different, you've got to make sure that you follow our patching policies, such as high severity issues being patched quickly. For example, we had to upgrade all of the code that uses Node because there was a security vulnerability, and that was across huge numbers of teams. They are responsible for doing that. If it's a system where we need support overnight, they're responsible for being able to do that. You still have some systems where, honestly, you know that no one actively knows this, but we do know who we'd phone. Everyone has got some engineers who can find out what things are happening operationally, and you just rely on those.

Hiring for an Obscure Language

Maude: How about hiring? If you've picked a really obscure language for your system, do you have to be able to guarantee that you can hire people who could maintain it?

Wells: Language is probably the thing we are least diverse in. That is partly because you have a team who go, "I really like this testing framework in Scala." Six months later, you find that no one on that team is actively interested in doing that, because actually, they're all Java developers. We've got fewer languages, and that's not so bad. Generally, I'd say that it's a mistake to hire for the technologies you're using now anyway, because you don't know what you're going to be using. If we'd hired in 2013 for the technologies we were using then, there'd be almost no overlap. We were doing Java on Tomcat with Apaches in front of it, running on VMs, and now we're on Node and Go in containers.

Maude: Hire people not technologies.

How to Evolve Microservices Standards in Response to Emerging Issues

Participant 4: It was mentioned that with such a large number of microservices that standards are important. How do you go about evolving those standards or introducing new standards, maybe in response to some security issue or when you've got so much estate using those standards at any time?

Wrightson: There's several aspects to this. We heavily invested in our tooling, which means that you can cookie cut a microservice that works with all of our health checks, all the logging, and all the metrics, deploy it to production probably within about five minutes. That's all it would take to actually create a useful service in production. When you have such an easy way of producing new services, through consistency, you can roll out changes through that. A lot of our tooling comes into play. We do have a group that is sometimes effective, sometimes not, at trying to document our principles. Generally, engineers don't like reading Confluence pages. They just like doing stuff. If you can bake it into the doing, then you're in a lot better place.

Patel: For engineers themselves evolving standards, we also have the same cookie cutter service generator approach. You can deploy something very quickly. If you want to evolve that, for example, you want to change the structure that engineers will go towards, you propose a migration plan, get others on board, and showcase why this is a better approach. If you're permanently stuck in one static way, then you won't evolve and essentially you will fall behind. The layers of complexity will catch up to you. Or, you'll have a bunch of people who go off the beaten path and just go do their own thing. Then you've ended up with a completely unmaintainable mess because people are frustrated with the current controls and the current restrictions. There is an element of satisfying both sides: on one side, making sure that we have structure and harmony. On the other side, making sure that people can propose change, see it through, and explain why this is a better approach and why it might be more flexible going forward. Then once people have committed, going all in. For example, if a migration plan is needed, getting all the people on board, even those who were not entirely convinced, and doing the migration and following the migration plan. Making sure that everyone is happy on the other side, so you don't have a halfway house lingering around for nine months while you've got some old services and some new services, and no one knows which is which, and what should be done now.

Wells: There's a really interesting blog post that I think Slack recently published about how you can change the engineering tooling, the things that you have. They said, anyone's free to go off and investigate something because they think it sounds interesting. At some point, you decide, this is good, at which point you have to convince everybody else. You can make that easy. You can build tools. You can do communication. You should make it as easy as possible but you're always going to have a long tail. What you have to do is say, if we're committing to it, everyone is going to have to move, because otherwise you end up three years in with most things on GitHub, but some things in Bitbucket. Then it's constant: every time that you hit a problem, it's like, I've got to do it three times. I totally agree. There are two things. You really need to make it so that people are forced to follow the standard, or at least strongly encouraged. The other thing that's quite useful is visualization, if you can show people whether they are complying with the standard. I think that's quite good.




Recorded at:

Jul 31, 2020