Panel: Microservices - Are They Still Worth It?

Summary

The panelists have moved from the monolith to microservices and in some cases back again. They have strong opinions on monorepos, on operating distributed systems and on the best way to structure an organization to make a success of this architecture.

Bio

Luke Blaney is a Principal Engineer at the Financial Times. Alexandra Noonan is a Software Engineer at Segment. Manuel Pais is an independent IT organizational consultant and trainer, focused on team interactions, delivery practices, and accelerating flow. Matt Heath is an engineer at Monzo.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Moderator: First, I'll get the panelists to say who they are, what they do. What's the one thing they wish they had known before starting their journey in microservices? Matt, we'll start with you.

Introduction and Lessons Learned before Microservices

Heath: My name is Matt Heath. I work at Monzo. I work within our platform team as a back-end engineer. I've been working on microservices for about seven years now. The one thing I wish I'd known? How things can spiral, not out of control, but how things can rapidly get more complex as your business develops.

Noonan: I'm Alex. I'm a Software Engineer for Segment. I've been working with microservices for about three years now. I think the thing that I wish I'd known in the beginning was that you can solve some of the trade-offs that come with moving to microservices without actually moving to microservices. Sometimes people get a little bit blinded: they see that they don't have good fault isolation, and microservices give you that. You can actually address that without making the switch to microservices. I wish we'd known that it was possible to address a lot of the issues we were seeing without going to microservices.

Pais: I'm Manuel. I'm a consultant and I'm co-author of the book, "Team Topologies." Of the people here, I'm probably the one who has the least direct experience with microservices. As a consultant, especially in continuous delivery, what I've felt is a lot of the pain from the clients that went for microservices too early, trying to optimize for something that was not really the problem they had and not thinking about other problems in terms of their engineering practices or dependencies between teams. They were just thinking it's all about the architecture when there were other problems. As Sam Newman was saying this morning, identify first what problem you're trying to solve and then seek out the solution, which might be microservices or not.

Blaney: I'm Luke. I'm a Principal Engineer at the "Financial Times." I think the one thing I wish I'd known is how important tooling is when it comes to microservices. We already knew to some extent things like orchestration and stuff, you needed some level of tooling. Stuff that is useful without microservices becomes essential when you do have microservices.

Moderator: Luke, as we're talking, this panel is supposed to be, Microservices - Are they still worth it? Can you see real benefit to your organization or your customers from this microservice approach at the FT?

Blaney: We can see real benefit to our organization and our customers from this microservice approach, definitely, in terms of speeding up delivery. Like Alex was talking about, isolating things. It's really beneficial for that way. It's brought us a long way forward. It does have its pitfalls and stuff, but I think on the whole it has been beneficial.

Moderator: Matt, how about Monzo? How's Monzo benefited from using microservices?

Heath: Monzo's benefitted from using microservices, I'd say, in quite a similar way. By using microservices, our teams can work quite independently. For context, Monzo is a bank. It's quite easy to say that banks are complicated, but we have lots of different systems. That means that we can work on those very independently. It's helped us deal with the complexity.

Moderator: Alex, your company has done a bit of a U turn, and gone back to a more monolithic approach. What do you think the key benefit for that U turn was?

Noonan: The driving force behind that U turn to a more monolithic approach, for us, was two things. One, the operational overhead that came with microservices was too great and we didn't have good tooling around testing and deploying and maintaining these microservices. Then one of the other driving factors was we finally understood what the root issue was in our system that we were trying to address with microservices, and it didn't actually address that. Once we finally had this understanding of what the core issue was, we were actually able to address it, and moving back to a monolith was actually what helped us address it more easily.

Moderator: Who's seen the architectural diagram for the microservices at Monzo on Twitter? We're all quite aware of the pure scale of the microservices. There is a lot going on there. Are you seeing cracks at this monumental scale? Is it the right approach, though?

Heath: Are we seeing the cracks at this monumental scale of microservices? It's quite easy when you look at a diagram. With that many moving parts, it's quite hard to represent it on a single diagram. I don't think that'd be any different if our organization had 50 applications. We have 1600-and-something. That means that there's a lot of moving parts, but it doesn't mean that everything's quite small. The main problems we have are making sure people have the right information. People don't have to understand how everything works. They can just know how that cluster of things work. Tooling is the main thing. We've had to build out a lot of fairly complex tooling. Those are the things we've had to focus our efforts on.

Moderator: Manuel, you look at microservices in organizations from a more team angle. Surely, there are a lot of beneficial things from team organization that this technology can provide.

Pais: Are there beneficial things from team organization that microservices can provide? Yes and no. I think it's all about when is the right time to think about microservices to fix the problem you have. Often, it's premature optimization, because microservices are all around and there's a lot of talk about them. We think that's going to fix our problems. Organizations often don't realize that, to actually benefit from microservices, you need to make a lot more changes. It's not just the technical architecture. If you go one step back, you need to think about Conway's Law, and how the team structures are mapped to the services that you have. They should be aligned to a large extent, not one-to-one, but they should be aligned. Otherwise, you're fighting against that.

Then also, have you thought about things like decoupling teams, but also decoupling environments if you have static environments for testing, and decoupling the release process? All these other aspects you need to consider to actually take full benefit of microservices are not always thought out. Once you have that done, then I think microservices are really good at making sure we're aligning the teams and the services with the business areas, or the streams of value in our business. That can be very powerful.

Promoting Faster Change

Then if you want to promote a faster cadence of change, microservices can help you with that through separation at multiple levels: the data store level, and the releasability and deployability level. It's a journey that you need to go through until it's actually going to bring the benefits that you expect. It's often done too early. We should then be thinking about other areas that we might be overlooking, for example, Conway's Law. Also, cognitive load is something we talk about in "Team Topologies": in almost every organization, some teams feel they're overloaded. If you ask them to own the services and be able to build, test, operate, support, and understand enough to do that, they probably have too much on their plate. We need to think about the right size of software for teams. Whether that's microservices or not is the next question.

Moderator: Coming back to Conway's Law. How do you recommend changes in that organization? Business functionality evolves, and so does the team structure. How do you marry that with the technology and the ownership alone?

Pais: How do we marry business evolution and team structure with the technology and the ownership? What we see happening, which is probably not what we want, is sometimes decisions are being made, especially around the team structures in the organization, by HR people or senior management. To some extent, because of the mirroring effect of Conway's Law, they are deciding on our software architecture to some extent. We don't want that. What we're saying is we actually need to bring those two worlds together. We need to make technology and design decisions together with the team structure decisions. Otherwise, one is restricting what the other is able to do.

Moderator: Matt, you must have come across a very similar thing because Monzo's growth has been stratospheric for quite a while. You must have learned to change your organization, but you don't want to chuck out all the tech at the same time.

Heath: Monzo's growth has been stratospheric for quite a while. Our company's probably changed pretty much to a new company every 6 to 12 months for the last 5 years. That means that we have some areas that are a bit more stable, but other areas that are quite rapidly in flux. I think that's one of the reasons that such a granular set of services works for us. All of our services are very similar. We have a box that you put the code in and you get lots of nice properties. Netflix call that The Paved Road. Because the differences are not big, it's quite easy for people to pick up a new application. When we form teams, or when we refocus areas, we can divide the boundaries a lot more granularly. Otherwise, we would have essentially, a whole ton of payment systems and we wouldn't be able to move people. People wouldn't be able to move off if they wanted to go and work on a different area because they have specific expertise there. Yes, we can change that quite a lot.

Continuity of Knowledge

Moderator: How do you keep continuity of knowledge when people are moving around?

Heath: How do we keep continuity of knowledge when people are moving around? That's something where, honestly, you have two problems. I think documentation. The real question is what you're documenting. In the case of Monzo, we have debit cards. We used to use an outsourced card provider. We brought that in-house. We've built a MasterCard processor. There's quite a lot of domain knowledge there. That's one of 10, 20 different equivalent scale projects.

You have a load of people who have gone and read 10,000 or 15,000 pages of MasterCard documentation to understand how those systems work and have deep domain knowledge. That's not something you can easily transfer. I think we have a mixture of people who want to spend time in an area and become domain experts there. That works really well for them. Then in other areas, each service being quite consistent in how it works means people can pick up how our system models that quite quickly, and then build on the domain knowledge there.

Moderator: Alex, as you've moved more back towards a monolithic approach, what now is your biggest operational challenge?

Noonan: As we've moved back towards a monolithic approach, I think the biggest operational challenge that we deal with now is, at any point in time in our monolithic worker, we're running about 2000 to 4000 workers. This means in-memory caching is not very efficient across these thousands of workers.

Something that we've introduced to help us with this is Redis. Redis has now become a point that we have to consider every time we need to scale. We've had issues in the past where the Redis becomes unavailable and the workers are crashing. That has now added an additional point of scaling for us. It's something that's been consistently coming up that we have to deal with. We've done different things like moving to Redis Cluster, sharding it, but it's still something in the back of our minds that we wish wasn't there. It's not enough for us to move back to microservices just to get the benefit of caching. It's a trade-off that we were willing and are comfortable to take.
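The caching trade-off Noonan describes can be illustrated with a small simulation. This is only a sketch under stated assumptions: requests are dispatched round-robin across workers, the function name `count_misses` is hypothetical, and a plain Python set stands in for a shared store such as Redis. It is not Segment's implementation.

```python
def count_misses(num_workers, requests, shared):
    """Count cache misses for a stream of requested keys.

    shared=False: each worker keeps its own in-memory cache.
    shared=True:  all workers consult one shared cache
                  (a stand-in for an external store such as Redis).
    Assumption: requests are dispatched round-robin across workers.
    """
    caches = [set()] if shared else [set() for _ in range(num_workers)]
    misses = 0
    for i, key in enumerate(requests):
        cache = caches[0] if shared else caches[i % num_workers]
        if key not in cache:
            misses += 1          # this worker has to fetch the value itself
            cache.add(key)
    return misses


# The same key requested 4000 times across 2000 workers.
requests = ["customer-settings"] * 4000
per_worker_misses = count_misses(2000, requests, shared=False)
shared_misses = count_misses(2000, requests, shared=True)
```

With per-worker caches every worker misses once on the key, so misses grow with the number of workers. With a shared cache only the first request misses, at the cost of the shared store becoming something that has to scale and stay available.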

Microservices Architecture

Moderator: Luke, on the opposite end of the scale at the FT, what would you see? Is a widely adopted microservices architecture your biggest operational challenge?

Blaney: Is a widely adopted microservices architecture our biggest operational challenge? I find that where there are team boundaries is often the hardest. If there's an incident, it's tracking it across multiple teams. It's understanding where the fault lies. Teams are very good at understanding their own little pocket of things.

We have a publishing pipeline. It goes from the tech team that works with the editorial staff. Then there's a publishing platform team. Then there's a team that is working on the front-end website for the users. Each of these teams is very mature in its thinking, but whenever it goes from one to the next bit, things get missed. It's hard to trace exact problems.

Moderator: Manuel, do you have any advice for this?

Pais: Do I have any advice for this? Yes. On one hand, I think that there's a key term there, team boundaries. With Conway's Law, we want the service boundaries, team boundaries to be aligned. Then, how do they interact when there are problems and we need to diagnose what's going on?

A couple of things. One is having good platform teams. I get a feeling that at Monzo, you have a pretty good platform team to allow that scale and that repeatability, in a way, between different services. A good platform team provides you the tools and the mechanisms to do that fast diagnosing. That's the tool side. The other is things like swarming. Different teams work on different services, so people from those different teams can be on a rotation, where this person is going to be part of the swarm to diagnose the problem: where exactly is it coming from? Possibly there are multiple reasons. Then you address that more quickly than with hierarchical support, where someone identified a problem the customer has, and now we have to try to sort out who is responsible. Having that swarm can be helpful.

Moderator: Matt, does Monzo address it in a similar way because it must be very hard to diagnose where one capability is actually being affected by a single service?

Heath: Does Monzo address it in a similar way where one capability is actually being affected by a single service? Sometimes. A lot of the time, it's fairly straightforward. I think the way that we think about that, we have lots of different services and we have very strong consistency between the ways they work. That means we have standardized dashboards for every single service. We can use tools like tracing to track that back.

The On-call Team

What we usually see is some particular business metrics, say, cards are declining, or payments of a particular type aren't working, or a particular API doesn't work. In that case, it's relatively straightforward to pin that through into the particular root cause. Each of those are then owned by different teams. Up until relatively recently, we had one group for on-call, who had a rotation. We have a specialist on-call. That's being pushed out into teams. At some point, we'll probably switch that around. The initial on-call team will be the owning team, which can then escalate it to a global team. I think, yes. The boundaries between those teams are definitely the hardest things. It's because everything is so similar in the same language, literally, the same service structure. Many of the people who are on-call can look at a particular thing and work out why, while we're waiting for other people to get online and go through that problem.

Moderator: Alex, you must implicitly get the benefits of that going back towards a monolith because you're not going to have a monolith made up of five languages?

Noonan: We implicitly get the benefits of going back to a monolith. Our monolith is in one language. That was one of the benefits that just came naturally of moving to a monolith. Luckily, when we did move into microservices for a bit, we still kept everything in one language. We were good about defining certain standards of how these microservices were going to work because we knew the complexity that was coming with it. It was a benefit of moving back to a monolith was getting that stuff for free, and forcing us all to use some of the same things.

Participant 1: Monzo has more than 1500 microservices, so how do you define the bounded context when you build so many microservices?

Heath: How do we define bounded context when we build so many microservices? I think if we had a relatively simple product, it would be very difficult to take an image website, for example, and slice and dice that into 1500 pieces. That would be pure insanity. Don't do that. Then add 10 or 15 different payment networks, start operating in multiple countries, and have various product features that are quite different. We integrate with I'm honestly not sure how many different partners, for a variety of different things. We launched a credit scoring product a week or so ago, which means that's a whole set of services that interact with a couple of particular third parties. That allows the user to give consent, and we can track that. Those things are very isolated from how your card works.

Bounded Context

The bounded context in those things is pretty much, take a problem. For credit scoring, we clearly have something that needs to interact with a third party. There's probably some bounded thing there, their API. We also have our API to the app. By the time you divide that up, you probably got five or six different services. They all have a very tightly defined responsibility. They all have their own API. We use HTTP internally. One of the things that we're thinking about is, that's a lot of APIs for someone to think about. Some of those aren't really used outside of teams. Maybe they're more of a private API versus a public API. That's something we're thinking a bit about. We essentially have lots of different teams who are providing a service, almost like an S3 API, or some other API to a number of other teams. By the time you add those things up, there's a lot of services.

Participant 2: Speaking of microservices and team boundaries, how do you suggest adopting ownership of services, and changing of teams? You have a growing organization, team responsibilities change, and then suddenly we have a bunch of microservices from the organizational structure 6 months ago, and they become tech debt. How would you cope with all this stuff then?

Adopting Service Ownership and Team Changes

Pais: I think one of the keys is to have stable teams. The other key is that all the services need to be owned by a team. Once you have those two things, the organization can evolve. You can have people changing teams. You don't recreate teams or create a team just for this new microservice, and then, sometime later, there's the team that's doing another microservice. Once you have stable teams and you have ownership, the alignment between team structures and services is much clearer.

Also, you can try to promote that all the teams, or most of the teams, might be working on some new service but also retain ownership of an older one that doesn't change much. Perhaps with microservices it's not so clear yet because it's newer. For an older type of architecture, if you have an older service and you're also responsible for the new service, then you bring the good tooling, the good telemetry, centralized logging, whatever it might be, into the old service even if it's not changing that often. You keep parity and you evolve both at the same time. To me, that's useful.

Moderator: Luke, do you have anything to add, because this is definitely an area that you've suffered with, good and bad?

Team Ownership

Blaney: In the FT, it's quite hard because different teams use completely different programming languages, different tech stacks. If there's something that could be different, someone will be doing it differently. It can be really hard to move things from one team to another. Although, one benefit I do think of microservices is it becomes easier to rewrite small bits of your stack. If it is moving from one team to another that is radically different, and they're like, "That's a new programming language we're not comfortable with. We don't have the skills." It's a lot smaller undertaking to go, "That's one or two microservices that need to change with this org structure." It's a smaller undertaking for them to go, "We're going to rewrite two microservices," than, "We need to replace this entire monolith." It's always about making that judgment call: is this worth the rewrite if a team feels much more ownership of things? I've seen many cases where a perfectly good system is sitting there and a team just doesn't feel comfortable with it. There's not really a good technical reason to rewrite it. For the sake of team ownership, and that stuff, it actually pays dividends in the long term to just be like, "We're going to replace it. We'll build a new one." Then everyone feels comfortable with it.

Participant 3: I've seen some scenarios where, within teams, multiple services are created. These teams did not consider, for example, structuring the code within the same code base to make it more modular. I'd like to hear from the panel their opinion about that. When should you decide to take the step to microservices instead of just structuring your code to make it more modular within the same unit of deployment?

Moderator: Matt, what's your opinion on this one?

Steps to Microservices

Heath: I think it depends. I'm not going to be as extreme as you might think I'd be. There needs to be a good reason to pull things out of an application into another application. Whether that is because you've added functionality, and now it's too complicated, or it's sharing responsibilities that now potentially two teams are involved. We talked about features a second ago. Is that feature being maintained and is it in this code base? What's the life cycle of that thing? There are certain points where it's useful to pull those out if you already have an architecture that supports that, and you already have lots of tooling in place. Then in our case, we do. We have a service and maybe we've added something that's made a bit more complex, and there may be a point that we refactor that to pull that out. Normally, we'll make it backwards compatible. We'll move the service out, proxy it through to this one. Then at some point, update the clients to use this. I think it only makes sense if you have that tooling already. Otherwise, you'd be much better tidying up the modules, getting consistency within your code base so that it's simpler and easier to understand within an application.
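The backwards-compatible extraction Heath outlines (move the functionality out, proxy the old entry point through to it, then update clients later) can be sketched as follows. This is a minimal illustration, not Monzo's code: the class names `TransactionService` and `ReceiptService` and the method `get_receipt` are hypothetical, and in-process method calls stand in for network calls between services.

```python
class ReceiptService:
    """Hypothetical new service that now owns the extracted functionality."""

    def get_receipt(self, txn_id):
        return {"txn_id": txn_id, "served_by": "receipt-service"}


class TransactionService:
    """Hypothetical original service.

    It keeps its old get_receipt entry point so existing clients keep
    working, but the call is proxied through to the new service.
    """

    def __init__(self, receipts):
        self._receipts = receipts

    def get_receipt(self, txn_id):
        # Backwards-compatible shim: same signature as before, no local logic.
        return self._receipts.get_receipt(txn_id)


receipts = ReceiptService()
transactions = TransactionService(receipts)

# Step 1: an old client still calls the original service.
old_client_result = transactions.get_receipt("txn_123")

# Step 2: an updated client calls the new service directly.
new_client_result = receipts.get_receipt("txn_123")
```

Both paths return the same response, so clients can be migrated one at a time and the shim removed once nothing calls it.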

Moderator: Alex, you're the natural opposite to this one.

Breaking out Repos

Noonan: I think that we did actually break apart our repos at one point. It didn't turn out to be as much of an advantage as we thought it was going to be. Our primary motivation for breaking it out was that failing tests were impacting different things. You'd go in to change one thing and then tests were breaking somewhere else, and now you have to spend time fixing those tests. Breaking them out into separate repos only made that worse because now I go in and touch something that hasn't been touched in six months. Those tests are completely broken because we aren't forced to spend time fixing that. I think one of the only cases that I've really seen to break stuff out into separate repos and not services is maybe if you have a chunk of code that is shared between multiple services, and you want to make it a shared library. Other than that, I think we've found even when we've moved to microservices, we still prefer stuff in a monorepo.

Participant 4: Basically, you said earlier that you were facing some issues so you took a microservices approach. Then later on, you realized that this is not working, and you again come back to monolithic one. What problem did you think could be solved by microservices? Then later on, you realized that this is not a good thing and you should move to the monolithic one?

What were the problems for which you think that microservices is a better approach? Then, after developing it using a microservices approach, you then again changed back to the monolithic one. What were those problems and how you solved them using the monolithic approach?

Problems Solved by Monolithic Architecture

Noonan: The original problems that we were facing were with our monolithic setup, and we were using a queue that the monolithic worker consumed from. Something that we were facing was, if there were issues consuming from this queue, it would back up, and that would impact everything in our monolithic worker. We wanted to be able to separate them from one another. We didn't want one piece of the worker able to cause issues with the rest of the stuff that was in the worker. It was really the queue that we were using: you could only consume what was first in line. One of the benefits that microservices provide, though, is environmental isolation. We thought, "If we break out into separate queues and separate workers, that isolates them nicely from one another, so now they're not impacting each other." What we learned with that over time, actually, was that it did isolate them from one another, but now if this worker was having an issue, all customers using that were impacted, even if only one customer was the cause. It isolated the issue and decreased the blast radius, but it didn't solve the root problem with our queuing system, which was really what was crippling us because it was this first-in, first-out way of consuming.

The Queuing System

How we solved it was we actually got rid of all of that queuing system and we built something internally to isolate them better from one another. That's how. I could see how in the time when we decided to move to microservices, we were like, "You'll get great environmental isolation out-of-the-box. It solves our problem." Since we didn't have a good understanding that it wasn't actually everything in a monolith that was the issue and it was the queue, we made that decision potentially. Financially, it wasn't the best decision but you never know. I think then once we finally had that understanding and this moment of realizing what the problem actually was, and then we were able to fix it. A monolith worked better in that situation. That was the main reason for this.
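The first-in, first-out problem Noonan describes is often called head-of-line blocking. The sketch below contrasts one shared FIFO queue with per-customer queues served round-robin. It is an illustration under assumptions, not the system Segment built: the function names are hypothetical, each job is represented only by its customer name, and round-robin is just one possible isolation policy.

```python
from collections import deque


def fifo_wait(jobs, target):
    """Jobs processed before `target`'s first job in one shared FIFO queue."""
    queue = deque(jobs)
    processed = 0
    while queue:
        customer = queue.popleft()   # can only take what is first in line
        if customer == target:
            return processed
        processed += 1
    return None


def round_robin_wait(jobs, target):
    """Same measure with one queue per customer, served round-robin."""
    queues = {}
    for customer in jobs:
        queues.setdefault(customer, deque()).append(customer)
    processed = 0
    while queues:
        for customer in list(queues):
            queues[customer].popleft()
            if customer == target:
                return processed
            processed += 1
            if not queues[customer]:
                del queues[customer]   # drop queues that have drained
    return None


# One noisy customer enqueues 1000 jobs ahead of a single job from a quiet one.
jobs = ["noisy"] * 1000 + ["quiet"]
shared_queue_wait = fifo_wait(jobs, "quiet")
isolated_wait = round_robin_wait(jobs, "quiet")
```

In the shared queue the quiet customer waits behind all 1000 noisy jobs. With per-customer queues it waits behind one, which is the kind of isolation that splitting workers by service alone does not provide.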

Moderator: If you had one bit of advice for the audience, either on adoption or maturing their microservices architecture, what would it be?

Blaney: One bit of advice for the audience on adoption or maturing their microservices architecture? I think a lot of it's about keeping track of what you've got. It can easily run away from you. You start with a couple and you're like, "These are easy," but having to link data, monitoring all these things, keeping track of everything. Do it from the get-go. Don't wait until you've got 100 microservices, and you then go, "What was the first one we built?" Start at the start. Make sure you're documenting these things as you go along. Because once you've got a lot of them, it's really easy to lose one in the mix.

Pais: One bit of advice for the audience on adoption or maturing their microservices architecture? First, start assessing your team's cognitive load. This is a common problem. Hopefully, we have end-to-end ownership in the teams, but whatever their responsibilities are, make sure the size of the software and other responsibilities they have matches their cognitive capacity. There is a specific definition of what cognitive load is, and you can look it up. Essentially, ask the teams, because there's a psychological aspect here, too, if people don't understand their part of the system well enough, whether it's microservices or something else. If they feel anxious, thinking, "If I have to be on-call, I really hope nothing happens in this part of the software, otherwise I'm going to be in trouble," then you need to address that.

The second thing is, remember Conway's Law. I would suggest, print it out, put it up in your office, just so everyone remembers that we can't design software in isolation from thinking about the team structures.

Align Microservices with Business Value Stream

The third thing is, microservices are really beneficial if they're aligned with business stream of value. This happens all the time, just go on Twitter and you'll see people complaining, "I have microservices, but for any type of feature or business change, I need to have coordination between three, four, or five different microservices teams." That's not what you wanted to achieve in the first place, this independent deployability. It's also, independent business streams is the goal. It's not just the technical side by itself.

Noonan: One bit of advice for the audience on adoption or maturing their microservices architecture? When you're considering a move to microservices, make sure they're actually addressing the problems that you're having. For example, we thought we were going to get this great environmental isolation, but it didn't actually fix that problem. I would say, make sure you're addressing the root problems with microservices. Then from there, make sure you fully understand the trade-offs that are going to come with it. I think what we got really bitten by was the operational overhead when we moved to microservices. We only had 10 engineers at the time, and you really almost need a dedicated DevOps team to help build tooling to maintain this infrastructure. We didn't have that. That's what ended up biting us in the end, and why we moved back. Because we just didn't have resources to build out all this tooling and automation that come with microservices. I would say, just make sure you're extremely familiar with the trade-offs and you have a good story around each of them, and that you're comfortable with those trade-offs when you make the switch.

Heath: One bit of advice for the audience on adoption or maturing their microservices architecture? It might sound weird, but keeping things simple. The simpler and more consistent your services, your monitoring tools, all of your tools, and your development processes are, the easier it is for people to pick stuff up and move around. The amount of code in one of those units is smaller, so it's easier to understand. For us that's changed my mental model from what's happening in the box to how the boxes tie together. Keeping those things simple pushes a lot of the complexity out into essentially the platform. You need lots of tooling and potentially investments in teams to work on that tooling. Although, I don't know if we'd started in five years, most of that exists a lot better than it did five years ago and will continue to.

Recorded at:

Apr 27, 2020
