Microservices from the Trenches: Lessons, Benefits, Challenges, and Mistakes

Key Takeaways

  • You can solve some of the tradeoffs that come with moving to microservices without actually moving to microservices.
  • Working with microservices might help you to speed up delivery, allowing the teams to work quite independently.
  • On the other hand, the operational overhead that comes with microservices might be too great, so tooling is crucial when adopting them.
  • “Team boundaries” is a key phrase for microservices. Following Conway's law, service boundaries and team boundaries must be aligned to support fast delivery and fast response to problems.

Nicky Wrightson, principal engineer at Skyscanner, hosted a panel at QCon London with participants who have moved from the monolith to microservices and in some cases back again. They have strong opinions on monorepos, on operating distributed systems, and on the best way to structure an organization to make a success of this architecture.

The panelists were Luke Blaney (principal engineer at the Financial Times), Matt Heath (back-end engineer at Monzo), Alex Noonan (software engineer at Segment), and Manuel Pais (IT organizational consultant and trainer).

Nicky Wrightson: How long have you been working with microservices and what's the one thing you wish you’d known before starting your journey in microservices?

Matt Heath: I've been working on microservices for about seven years now. The one thing I wish I'd known? How things can spiral, not out of control, but rapidly get more complex as your business develops.

Alex Noonan: I've been working with microservices for about three years now. The thing I wish I'd known in the beginning is that you can solve some of the tradeoffs that come with moving to microservices without actually moving to microservices. Sometimes, people get a little bit blind. They see that they don't have good fault isolation, and microservices could give them that. But you can address that without making the switch to microservices. I wish we'd known that it was possible to address a lot of the issues we were seeing without going to microservices.

Manuel Pais: Of the people here, I'm probably the one who has the least direct experience with microservices. As a consultant, especially in continuous delivery, I've felt a lot of the pain from clients who went for microservices too early, trying to optimize for something that was not really the problem they had and not thinking about problems in terms of their engineering practices or dependencies between teams. They just thought it’s all about the architecture when there were other problems. As Sam Newman says, first identify what problem you're trying to solve and then seek out the solution, which might be microservices or not.

Luke Blaney: I think the one thing I wish I'd known is how important tooling is when it comes to microservices. We already knew to some extent that for things like orchestration, you needed some level of tooling. Stuff that is useful without microservices becomes essential when you do have microservices.

Wrightson: Luke, can you see real benefit to your organization or your customers from this microservice approach at the Financial Times?

Blaney: We can definitely see real benefit to our organization and our customers in terms of speeding up delivery and, like Alex was talking about, in isolating things. It's really beneficial for us that way. It's brought us a long way forward. It does have its pitfalls, but I think on the whole it has been beneficial.

Wrightson: Matt, how’s Monzo benefited from using microservices?

Heath: Monzo's benefited, I'd say, in quite a similar way. By using microservices, our teams can work quite independently. Monzo is a bank. It's quite easy to say that banks are complicated, but we have lots of different systems. Using microservices means that we can work on those very independently. It's helped us deal with the complexity.

Wrightson: Alex, your company has done a bit of a U-turn and gone back to a more monolithic approach. What was the key benefit of that U-turn?

Noonan: The driving force behind that U-turn to a more monolithic approach, for us, was two things. One, the operational overhead that came with microservices was too great and we didn't have good tooling around testing and deploying and maintaining these microservices. The other main driving factor was that we finally came to understand the root issue in our system that we were trying to address with microservices, and it hadn't actually addressed that. Once we finally had this understanding of the core issue, we were able to address it, and moving back to a monolith was what helped us address it more easily.

Wrightson: Who's seen the architectural diagram for the microservices at Monzo on Twitter? We're all quite aware of the pure scale of the microservices. There is a lot going on there. Are you seeing cracks at this monumental scale? Is it the right approach?

Heath: It's quite hard to represent that many moving parts on a single diagram. I don't think that'd be any different if our organization had 50 applications. We have more than 1,600. That means that there's a lot of moving parts, but it doesn't mean that everything is quite small. 

The main problems we have are making sure people have the right information. People don't have to understand how everything works. They can just know how their cluster of things work. Tooling is the main thing. We've had to build out a lot of fairly complex tooling. Those are the things we've had to focus our efforts on.

Wrightson: Manuel, you look at microservices in organizations from a team angle. Surely, there are benefits that microservices can provide for team organization? 

Pais: Yes and no. I think it's all about the right moment at which to think about using microservices to fix the problem you have. Often, it's premature to try to optimize something with microservices but because there’s a lot of talk all around about microservices, we think that it's going to fix our problems. Organizations often don't realize that to actually benefit from microservices, they need to do a lot more change. It's not just the technical architecture. If you go one step back, you need to think about Conway's law and how the team structures are mapped with the services that you have. They should be aligned to a large extent — not one to one, but they should be aligned. Otherwise, you're fighting that.

Also, you have to think about decoupling teams, decoupling environments (if you have static environments for testing), and decoupling the release process. You need to consider all these other aspects to take full advantage of microservices. Once you have done that, then microservices are really good at making sure you're aligning the teams and the services with the business areas or the streams of value in your business. That can be very powerful.

If you want to promote a faster cadence of change, microservices can help you with a separation at the data-store level and at the deployability level. It's a journey that you need to go through until it actually brings the benefits that you expect. It's often done too early, when we should be thinking about areas that we might be overlooking, such as Conway's law. Also, cognitive load is something we talk about in Team Topologies: in almost every organization, some teams feel they're overloaded. If you ask them to own their services and be able to build, test, operate, support, and understand them enough to do that, they probably will have too much on their plates. We need to think first about the right size of software for teams, and the next question is whether microservices will help or not.

Wrightson: Coming back to Conway's law, how do you recommend changes in that organization? Business functionality evolves, and so does the team structure. How do you marry that with the technology and the ownership?

Pais: What we see happening, which is probably not what we want, is sometimes decisions are being made, especially around the team structures in the organization, by HR or senior management. And because of the mirroring effect of Conway's law, they are deciding on our software architecture, to some extent. We don't want that. What we're saying is we actually need to bring those two worlds together. We need to make technology and design decisions together with the decisions on team structures. Otherwise, one is restricting what the other is able to do.

Wrightson: Matt, you must have come across a very similar thing because Monzo's growth has been stratospheric. You must have learned to change your organization, but at the same time you don't want to chuck out all your tech.

Heath: Monzo's growth has been stratospheric for quite a while. Our company's probably changed enough to become a new company every six to 12 months for the last five years. That means that we have some areas that are a bit more stable, but other areas that are quite rapidly in flux. I think that's one of the reasons that such a granular set of services works for us. All of our services are very similar. We have a box that you put the code in and you get lots of nice properties. Netflix call that “the paved road”. That means that because the differences are not big, it's quite easy for people to pick up a new application. When we form teams or when we refocus areas, we can divide the boundaries a lot more granularly. Otherwise, we would have essentially a whole ton of payment systems and we wouldn't be able to move people between them because they’d have specific expertise in each. Yes, we can change that quite a lot.

Wrightson: How do you keep continuity of knowledge when people are moving around?

Heath: That's something where you have two problems. The real question is what you're documenting. In the case of Monzo, we have debit cards. We used to use an outsourced card provider. We brought that in house. We've built a MasterCard processor. There's quite a lot of domain knowledge there. That's one of 10, 20 different equivalent scale projects.

As a result, we have a load of people who have read 10,000 or 15,000 pages of MasterCard documentation to understand how those systems work and have deep domain knowledge. That's not something you can easily transfer. I think we have a mixture of people who want to spend time in an area and become domain experts there. That works really well for them. But because each service is quite consistent in how it works, people who move can quickly pick up how our system models a different domain and build on the domain knowledge there.

Wrightson: Alex, as you've moved more back towards a monolithic approach, what is your biggest operational challenge?

Noonan: I think the biggest operational challenge that we deal with now is that we're running about 2,000 to 4,000 workers at any point. This means in-memory caching is not very efficient across these thousands of workers.

We've introduced Redis to help us with this. Redis has now become a point that we have to consider every time we need to scale. We've had issues in the past where Redis becomes unavailable and the workers are crashing. That has added an additional point of scaling for us. It's something that's been consistently coming up that we have to deal with. We've done different things like moving to Redis Cluster and sharding it, but it's still something in the back of our minds that we wish wasn't there. It's not enough for us to move back to microservices just to get the benefit of caching. It's a tradeoff that we were willing and are comfortable to take.
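The caching tradeoff Noonan describes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Segment's system: the function name, the random routing of requests to workers, and the numbers are all hypothetical. It counts how often a value has to be fetched from the backing store when each worker keeps its own in-memory cache, compared with one shared cache of the kind Redis provides.

```python
import random


def count_backend_fetches(num_workers, keys, num_requests, shared, seed=0):
    """Count cache misses (backend fetches) for a stream of random requests.

    Assumption: each request lands on a random worker and asks for a random key.
    With shared=False every worker has its own cache; with shared=True all
    workers read and write a single cache.
    """
    rng = random.Random(seed)
    shared_cache = set()
    local_caches = [set() for _ in range(num_workers)]
    fetches = 0
    for _ in range(num_requests):
        key = rng.choice(keys)
        worker = rng.randrange(num_workers)
        cache = shared_cache if shared else local_caches[worker]
        if key not in cache:
            fetches += 1  # miss: the value must come from the backing store
            cache.add(key)
    return fetches


if __name__ == "__main__":
    keys = list(range(50))
    print("per-worker caches:", count_backend_fetches(200, keys, 5000, shared=False))
    print("shared cache:     ", count_backend_fetches(200, keys, 5000, shared=True))
```

With a shared cache, each key is fetched at most once; with per-worker caches, the same key can be fetched once per worker, so the miss count grows with the number of workers. The price of the shared cache is the one Noonan names: it becomes another component that has to scale and stay available.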

Wrightson: Luke, at the FT on the opposite end of the scale, what do you see? Is a widely adopted microservices architecture your biggest operational challenge?

Blaney: I find it’s often the hardest where there are team boundaries. If there's an incident, it's tracking it across multiple teams. It's understanding where the fault lies. Teams are very good at understanding their own little pocket of things.

We have a publishing pipeline. It goes from the tech team that works with their editorial staff. Then there's a publishing platform team. Then there's a team that are working on the front-end website for the users. Each of these teams is very mature in their thinking, but whenever it goes from one to the next bit, things get missed. It's hard to trace exact problems.

Wrightson: Manuel, do you have any advice for this?

Pais: Yes. I think that there's a key phrase there: team boundaries. With Conway's law, we want the service boundaries and team boundaries to be aligned. Then we can look at how teams interact when there are problems and we need to diagnose what's going on.

We should determine if we have an adequate platform. I get a feeling that Monzo has a pretty good platform team to allow that scale and that repeatability between different services. Having a good platform team provides you the tools and the mechanisms to do that fast diagnosing on the tooling side.

You can do things like swarming, where people from different teams work together on a rotation, so they are part of a swarm to diagnose problems across microservices. You can then address that kind of problem more quickly than in a hierarchical support system in which someone identified a problem the customer has, and then we have to try to sort out who is responsible. Having that swarm can be very helpful for fast response.

Wrightson: Matt, does Monzo address it in a similar way because it must be very hard to diagnose where one capability is being affected by a single service?

Heath: Sometimes, but a lot of the time, it's fairly straightforward. We have lots of different services and we have very strong consistency between the ways they work. That means we have standardized dashboards for every single service. We can use tools like tracing to track that backwards.

What we usually see is some particular business metric isn't working: say, cards are declining, payments of a particular type, or a particular API. In that case, it's relatively straightforward to pin that through to the particular root cause. Each of those is owned by a different team. Until recently, we had one group on call, on rotation. We have a specialist on call, which is being pushed out into teams. At some point, we'll probably switch that around. The initial on-call team will be the owning team, which can then escalate it to a global team. 

The boundaries between those teams are definitely the hardest things. But because everything is so similar — in the same language, literally the same service structure — many of the people who are on call can look at a particular thing and work out why, while we're waiting for other people to get online and go through that problem.

Wrightson: Alex, does going back towards a monolith implicitly bring benefits because you're not going to have a monolith made up of five languages?

Noonan: We implicitly get the benefits of going back to a monolith. Our monolith is in one language and that was one of the benefits that naturally came out of moving to a monolith. Luckily, when we did move into microservices for a bit, we kept everything in one language. We were good about defining certain standards for how these microservices were going to work because we knew the complexity that was coming with it. A benefit of moving back to a monolith was getting that stuff for free, and forcing us all to use some of the same things.

Audience question: Monzo has more than 1,500 microservices. How do you define the bounded context when you build so many?

Heath: Look at it as if we had a relatively simple product. It would be very difficult to take an image website, for example, and slice and dice that into 1,500 pieces. That would be pure insanity; don't do that. But then you add and integrate 10 or 15 different payment networks, and start operating in multiple countries, and have various product features that are quite different. I'm not sure how many different partners we have, for a variety of different things. We launched a credit scoring product a week or so ago, which means adding a whole set of services that interact with particular third parties. That means allowing the user to give consent, and tracking that. Those things are very isolated from how your card works.

The bounded context in those things is pretty much clear. For credit scoring, we have something that needs to interact with a third party. There's probably some bounded thing there, their API. We also have our API to the app. By the time you divide that up, you probably have five or six different services. They all have their own tightly defined responsibility. They all have their own API. We use HTTP internally. 

One of the things that we're considering is that there are a lot of APIs for someone to think about. Some of those aren't really used outside of a team, so they're maybe more of a private API versus a public API. We essentially have lots of different teams who are providing a service, almost like an S3 API or some other API, to a number of other teams, and those things add up to a lot of services.

Audience question: Speaking of microservices and team boundaries, how do you suggest adopting ownership of services and changing of teams? We have a growing organization and team responsibilities change. Suddenly, we have a bunch of microservices from the organizational structure of six months ago that become tech debt. How would you cope with this?

Pais: I think one of the keys is to have stable teams. The other key is that all the services need to be owned by a team. With those two things, the organization can evolve and you can have people changing teams. But you don't create a team just for a new microservice. Once you have stable teams and you have ownership, the alignment between team structures and services is much clearer.

Also, you can try to promote that most of the teams might be working on some new service but also retain ownership of an older one that doesn't change much. Perhaps with microservices it's not so clear yet because it's newer, but for older types of architecture, if you have a team responsible for an older service and also responsible for a new service then you bring in good tooling, good telemetry, centralized logging, or whatever might be needed, into the old service even if it's not changing that often. You keep parity and you evolve both at the same time.

Wrightson: Luke, do you have anything to add, because this is definitely an area that you've suffered with for good and bad?

Blaney: In the FT, it's quite hard because different teams use completely different programming languages, different tech stacks — if there's something that could be done differently, someone will be doing it differently. It can be really hard to move things from one team to another. 

That said, one benefit of microservices is that it becomes easier to rewrite small bits of your stack. If a service is moving to a radically different team that isn't comfortable with or skilled in its programming language, it's a smaller undertaking for them to rewrite one or two microservices than to replace an entire monolith.

It's always about making that judgment call: is it worth the rewrite if a team feels much more ownership of things? I've seen many cases where a perfectly good system is sitting there and a team just doesn't feel comfortable with it. There's not really a good technical reason to rewrite the system but for the sake of team ownership it pays dividends in the long term to have them build their own replacement. Then everyone feels comfortable with it.

Audience question: I've seen some scenarios where teams create multiple services but did not, for example, structure the code within the monolithic code base to make it more modular. When should you decide to take the step to microservices instead of structuring modular code within the same unit of deployment?

Wrightson: Matt, what's your opinion on this one?

Heath: I think it depends. I'm not going to be as extreme as you might think I'd be. There needs to be a good reason to pull things out of an application and put them into another application, whether that is because you've added functionality and now it's too complicated or because it's potentially sharing responsibilities of two teams. 

We talked about features a second ago. Is that feature being maintained and is it in this code base? What is its life cycle? There are certain points where it's useful to pull those out if you already have an architecture that supports that and you already have lots of tooling in place. In our case, we do. We have a service and maybe we've added something that's made it a bit more complex, and there may be a point that we refactor that to pull that out. Normally, we'll make it backwards compatible. We'll move that service out and proxy it through to this one, and then, at some point, update the clients to use this. I think it only makes sense if you have that tooling already. Otherwise, you'd be much better tidying up the modules, getting consistency within your code base so that it's simpler and easier to understand within an application.
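The extraction steps Heath describes (move the code out, keep the old entry point backwards compatible by proxying, then update clients later) can be sketched roughly as follows. This is an illustrative in-process sketch, not Monzo's code: the class and method names are hypothetical, and a real setup would forward over the network rather than through a Python object reference.

```python
class StatementService:
    """Hypothetical new service that now owns the extracted functionality."""

    def get_statement(self, account_id):
        return {"account_id": account_id, "served_by": "statement-service"}


class AccountService:
    """Hypothetical original service.

    Its public method is unchanged, so existing clients keep working, but the
    logic now lives in the extracted service and calls are proxied through.
    """

    def __init__(self, statement_service):
        self._statement_service = statement_service

    def get_statement(self, account_id):
        # Backwards-compatible shim: forward the call to the new owner.
        return self._statement_service.get_statement(account_id)


if __name__ == "__main__":
    statements = StatementService()
    accounts = AccountService(statements)

    # An old client still calls the original service and gets the same answer
    # that a migrated client gets from the new service directly.
    print(accounts.get_statement("acc_123"))
    print(statements.get_statement("acc_123"))
```

Once every client calls the new service directly, the forwarding method on the original service can be deleted. As Heath notes, this only pays off when the tooling for running another service is already in place.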

Wrightson: Alex, you're the natural opposite to this one.

Noonan: We did break apart our repos at one point. It didn't turn out to be as much of an advantage as we thought it was going to be. Our primary motivation for breaking it out was that failing tests were impacting different things. You'd go in to change one thing and then tests were breaking somewhere else, and now you have to spend time fixing those tests. Breaking them out into separate repos only made that worse because now you go in and touch something that hasn't been touched in six months. Those tests are completely broken because nothing forced you to spend time fixing them along the way. One of the only cases where I've seen stuff successfully broken out into separate repos and not services is when we have a chunk of code that is shared among multiple services and we want to make it a shared library. Other than that, I think we've found even when we've moved to microservices, we still prefer stuff in a monorepo.

Audience question: You said earlier that you were facing some issues so you took a microservices approach. Later on, you realized that this was not working and you came back to a monolithic one. What problem did you think could be solved by microservices? What did you solve using the monolithic approach?

Noonan: The original problems that we were facing with our monolithic setup were with a queue that the monolithic worker consumed. You could only consume what was first in line in the queue. If there were issues consuming from this queue, it would back up, and that would impact everything in our monolithic worker. We wanted to be able to separate them from one another. We didn't want one piece of the worker able to cause issues with the rest of the stuff that was in it. 

Microservices provide the benefit of environmental isolation. We thought if we break out into separate queues and separate workers that isolate them nicely from one another, they won’t impact each other. What we learned was that it did isolate them, but a worker having an issue impacted all customers using it. It isolated the issue and decreased the blast radius but it didn't solve the root problem with our queuing system, which was really what was crippling us with its first-in, first-out way of consuming.

We solved that by getting rid of that queuing system and building something internally to isolate them better from one another. I could see how when we decided to move to microservices, we thought we’d get great environmental isolation out of the box to solve our problem. We didn't understand that it wasn't actually everything in the monolith that was the issue, so we made that decision that potentially wasn't the best decision, but you never know. Once we finally understood and realized what the problem was, we were able to fix it. A monolith worked better in that situation.
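The head-of-line blocking Noonan describes can be reproduced in a few lines. The sketch below is a toy model under stated assumptions, not Segment's queuing system or its replacement: the event shape `(destination, customer)`, the `drain` function, and the retry limit are all hypothetical. Each queue is strictly first-in, first-out, so a failing event at the head holds up everything behind it in the same queue; the only thing that changes between runs is how events are partitioned into queues.

```python
from collections import defaultdict, deque


def drain(events, partition_key, is_failing, max_attempts=5):
    """Deliver events from FIFO queues, one queue per partition key.

    Only the head of a queue can be consumed. A failing head is retried in
    place, so everything behind it in that queue waits until attempts run out.
    """
    queues = defaultdict(deque)
    for event in events:
        queues[partition_key(event)].append(event)

    delivered = []
    for queue in queues.values():
        attempts = 0
        while queue and attempts < max_attempts:
            attempts += 1
            if is_failing(queue[0]):
                continue  # head stays put; the rest of this queue is blocked
            delivered.append(queue.popleft())
    return delivered


if __name__ == "__main__":
    # Hypothetical events: (destination, customer)
    events = [("webhook", "acme"), ("webhook", "globex"), ("email", "globex")]

    def failing(event):
        return event == ("webhook", "acme")

    print(drain(events, lambda e: "all", failing))  # one shared queue
    print(drain(events, lambda e: e[0], failing))   # one queue per destination
    print(drain(events, lambda e: e, failing))      # per destination and customer
```

With one shared queue nothing is delivered. With one queue per destination, which is roughly the isolation the microservice split provided, the email event gets through but the healthy webhook customer is still stuck behind the failing one. Only the finer partitioning removes the blocking, which matches the point that the FIFO consumption, not the monolith, was the root problem.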

Wrightson: If you had one bit of advice for the audience, either on adoption or maturing their microservices architecture, what would it be?

Blaney: It's about keeping track of what you've got. It can easily run away from you. You start with a couple of services and you think it’s easy, but you have to link data, monitor all these things, keep track of everything. Do that from the get-go. Don't wait until you've got 100 microservices, and you wonder which was the first one you built. Start at the start. Make sure you're documenting these things as you go because once you've got a lot of them, it's really easy to lose one in the mix.

Pais: First, start by assessing your team's cognitive load. This is a common problem. Hopefully, we have end-to-end ownership in the teams, but whatever their responsibilities are, make sure the size of the software and other responsibilities they have match their cognitive capacity. There is a specific definition of what cognitive load is, but you can look it up. Essentially, ask the teams — because there's a psychological aspect here, too. If people don't understand their part of the system well enough, whether it's microservices or something else, they feel anxious. You need to address the case when people on call really hope nothing happens in some part of the software because, otherwise, they’re going to be in trouble.

The second thing is Conway's law. Print it out and put it up in your office so everyone remembers that we can't design software in isolation from thinking about the team structures.

The third thing is that microservices are really beneficial if they're aligned with the business stream of value. This happens all the time; go on Twitter and you'll see people complaining that they have microservices, but for any type of feature or business change, they need coordination between three, four, or five different microservices teams. That's not what you wanted to achieve with independent deployability. Also, independent business streams are the goal. It's not just the technical side by itself.

Noonan: When you're considering a move to microservices, make sure you're actually addressing the problems that you're having. For example, we thought we were going to get this great environmental isolation but that didn't actually fix the problem. Make sure you're addressing the root problems with microservices. From there, make sure you fully understand the tradeoffs that are going to come with it. I think what really bit us was the operational overhead when we moved to microservices. We only had 10 engineers at the time, and you almost need a dedicated DevOps team to help build tooling to maintain this infrastructure. We didn't have that. That's what ended up biting us and why we moved back. We just didn't have the resources to build out all the tooling and automation that microservices require. Make sure that you're extremely familiar with the tradeoffs, you have a good story around each of them, and you're comfortable with those tradeoffs when you make the switch.

Heath: It might sound weird, but keep things simple. The simpler and more consistent your services are, the more consistent your monitoring, your tooling, and your development processes will be. It will be easier for people to pick stuff up and move around. The amount of code in one of those units is smaller, so it's easier to understand. That's changed my mental model from what's happening in the box to how the boxes tie together. Keeping those things simple pushes a lot of the complexity out, essentially into the platform. You need lots of tooling and potentially investment in teams to work on that tooling. Most of that tooling is a lot better than it was five years ago, and it will continue to improve.

About the Authors

Luke Blaney has worked for the Financial Times since 2012 as a developer and then platform architect. He is now a principal engineer on its Reliability Engineering team, tasked with improving operational resilience and reducing duplication of tech effort across the company.

Matt Heath is an engineer at Monzo, a new kind of digital bank, where he works on Monzo's microservice platform and payment services. Having previously worked as the technical lead of Hailo's global platform, Heath has an unhealthy obsession for scaling fault-tolerant, high-volume, distributed systems, and in his spare time organises the Go London User Group.

Alex Noonan is a back-end engineer who spends most of her time building reliable, scalable systems. She's been working at Segment for the past four years, focused on distributed systems and scaling the core data pipeline.

Manuel Pais is an independent IT organizational consultant and trainer, focused on team interactions, delivery practices, and accelerating flow. Manuel is co-author of the book Team Topologies: Organizing Business and Technology Teams for Fast Flow (IT Revolution Press, 2019). 