Seven Ways to Fail at Microservices



Holly Cummins discusses a number of anti-patterns in building microservices: The murky goal, Microservices envy, Cloud native spaghetti, The enterprise hairball, The someday automation, and others.


Holly Cummins is a Senior Technical Staff Member at IBM and a Java Champion. Holly started her career as a performance engineer, making the J9 JVM go faster. She then led delivery for WebSphere Liberty. As a consultant in the IBM Garage, she worked on understanding climate risks, counting fish, helping a blind athlete run ultra-marathons in the desert solo, and using AI to invent stories.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Cummins: I'm Holly Cummins. I want to talk about some of the ways microservices can go wrong. I work for IBM, and these are all based on my experiences as a consultant with the IBM Garage. These are all problems that I see in the field over and over again. The first problem that I see is that sometimes we don't even know what the problem is. We feel we should be doing microservices, but we haven't really spent enough time trying to define, why are we doing microservices? What problem are we trying to solve? What's hurting us now? What's going to be better after we've done microservices? I think this is quite a natural human tendency, especially for us as techies. We want to jump to the solution. We want to play with the new shiny. Sometimes the work to figure out what problem we are trying to solve is actually much less fun than solutioning. I think this happens with containers in particular, because containers are almost a magical technology. They're so lightweight. They're so portable. They make so many things so much better. We say "just because I've got these containers, it would be a waste of the container capability to run my application in just one container. I should run it in as many containers as I can". Actually, not having enough containers is not a valid problem.

CV Driven Development

Another problem that we see is CV driven development. We look at our CV and there's a big blank spot where it should say microservices, and we say, I can fix this by rearchitecting my company's stack. They're like, that's the solution. That's not really a problem. You may say actually, "No, that's just too cynical, Holly. Surely, no one would actually make architectural decisions based on their CV?" Turns out the evidence is that they would.

Red Hat recently did a survey and they looked at the main drivers for container based development. The number one driver was career progression, which is basically CV driven development. I think the main thing that really drives microservices adoption at the moment is that they're almost a new orthodoxy. Even if we're not taking part in the great resignation. Even if we're not looking for a new job. We look around and everyone else is doing microservices. We say, if they're doing microservices, what's wrong with me if I'm not doing microservices. It becomes this fashion thing. Really, a lot of us have succumbed to microservices envy, which is, we feel we should be doing them.

Microservices Are Not the Goal

One of our consultants has a rule that when he talks to a client, if they just say Netflix, which does happen a lot, he thinks, I think we're in trouble here. I don't think you're doing this for the right reason. If they have a conversation that's a bit deeper, and they talk about things like coupling and cohesion, then he thinks, ok, we're in the right space. Ultimately, we can't start our conversation by saying I need to move to a microservices architecture, because microservices are not the goal. Microservices are the means to achieve that goal of business agility, or whatever it is we're trying to achieve. They're not even the only means, they're a means. One of the challenges that we see is that microservices on their own often don't help us get to what we're actually trying to do.

Distributed Monolith

I love this quote, "Do you have microservices, or do you have a monolith spread over hundreds of Git repos?" Which, unfortunately, is what we often see. At that point, what we have is a distributed monolith. A distributed monolith is a really bad thing. It's hard to reason about. It's actually really prone to errors, because with a conventional monolith where it's all running in a single process, you get things like compile time checking when you're developing. Because your IDE can say, no, that method name has changed, and because you're always in the same process, you get guaranteed function execution. You don't have to worry about discovery, and then whether the thing that you're trying to call actually exists. When we take those away, but leave the coupling, what we end up with is cloud native spaghetti.

Distributed is not Equivalent to Decoupled

I was called in to a project, and it was a troubled project. Pretty much when I landed, the first thing they said to me was, every time we change one microservice, another one breaks. If you've been paying any attention to the dream of microservices, the promise of microservices, that's exactly the opposite of what is supposed to happen. They're supposed to be independent of each other. They're supposed to be decoupled. Decoupling doesn't happen for free just because you distribute your system. Just because they both start with D, they're not the same thing. It is very possible to have a highly distributed system with all of the pain that comes from being distributed, while still being completely entangled and coupled. This was what had happened in this case. When I looked into the code, I was just exploring, and I kept seeing the same code over and over again in each repo. I thought, this is strange. The object model, in this case, was pretty elaborate. There were about 20 classes, and some of them had 70 fields. It was this really complex schema. Each microservice had to basically duplicate that schema in its code, because they didn't want a coupling to a common object library, which makes sense. Instead, they copied and pasted the code. Of course, that doesn't eliminate the coupling. If a field name changes, it still breaks everybody. You just don't get that warning.

This comes back to some of the ideas about domain driven design that we really need to be thinking about when we're doing microservices. Because the goal or the ideal is that you have a whole bunch of microservices, and each microservice maps really neatly to a domain. This allows the interfaces to your microservices to be really small. What happens in the bad case, and I think what happens if we're not really careful, is that we fragment quite small, perhaps along technical boundaries, rather than domain boundaries. Then we end up in that spaghetti case. The errors that we see from this can be quite serious and quite difficult to catch as well.

The Mars Climate Orbiter

This is the Mars Climate Orbiter. Way before the most recent very successful Mars mission, NASA did a mission where it was just going to go and it was just going to orbit around Mars. Sadly, it did not orbit around Mars, instead it crashed into Mars, and that was the end of the Mars Climate Orbiter. When they did the postmortem, which really was a postmortem, in this case, at least for the ship, what they found was that there were two control systems. Most of the time, the control was done by a control system on the Orbiter itself. Then every now and then, there'd be course corrections and supervision from a control system on Earth. We have a very distributed system, it couldn't have been more distributed, part of it was in space, and one that seems to be quite modular. Of course, the domain is actually quite similar between these two systems. It turns out the root of the problem was they hadn't been quite clear enough in their communication about what the interface looked like. It was passing numbers back and forth. The part in space used metric units, the part on Earth used imperial units, and so disaster occurred. I think we can safely say in this case, the system was very distributed, and being distributed did not help.
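
The unit mismatch described above can be made harder to repeat by letting the unit travel with the number across the interface. The following is a minimal sketch under stated assumptions, not a description of the actual flight software: the `Impulse` type, the `encode_message` and `decode_message` helpers, and the dictionary wire format are all hypothetical names invented for illustration. Only the conversion factor between pound-force seconds and newton-seconds is a fixed fact.

```python
from dataclasses import dataclass

# Conversion factor: 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical value type that always stores newton-seconds internally."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(newton_seconds=value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the edge, and is visible in the name.
        return cls(newton_seconds=value * LBF_S_TO_N_S)


def encode_message(impulse: Impulse) -> dict:
    """Hypothetical wire format: the unit is sent alongside the value."""
    return {"value": impulse.newton_seconds, "unit": "N*s"}


def decode_message(message: dict) -> Impulse:
    """Reject any message that does not declare the unit the receiver expects."""
    if message.get("unit") != "N*s":
        raise ValueError(f"unexpected unit: {message.get('unit')!r}")
    return Impulse.from_newton_seconds(float(message["value"]))
```

With this shape, a sender working in imperial units has to call `Impulse.from_pound_force_seconds` explicitly, and a bare number labelled with the wrong unit is rejected at the boundary rather than silently misread.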

Consumer Driven Contract Testing

There is a workaround here. There is a thing that we can do that we should all be doing, which is, use consumer driven contract testing. When we have this system where the IDE isn't helping us out, we need to contract test our integrations. This is really important because integration testing, we're going to do some of the time. The whole point of microservices is that we shouldn't be having to do expensive integration tests with the whole system all of the time, or else we're not decoupled. Mocks on their own have a problem. The problem with mocks is that we have a conversation at the beginning about what our interface looks like. We come to an agreement. Then we go away, and we try and write a mock that looks like our understanding of what they said their code looked like. In the ideal case, we get it right. The problem is we bake our assumptions into the mock, because we write the mock, and we're maybe not the best person to know what the other code looks like, because it's not our code. We're certainly not the best person to know what the other code does.

In the happy case, we get it right. Our tests all pass. Everything is good. Unfortunately, that's not always what happens. Sometimes, their actual implementation ends up being different than what we understood, either because they changed their mind, or because we made an assumption that was incorrect. In this case, the tests will still pass. When we actually try it out, it's going to fail. The solution is to not rely just on mocks that we write that are never validated or never even looked at by the other side. Instead, what we want to do is we want to have a consumer driven contract test as an intermediate. We have our code. We have their code, and we have a contract test. The beauty of a contract test, and why it's different from a mock, is that both sides interact with the contract test. For the consumer, the contract test acts as a really handy mock. It saves us time because we don't have to be trying to write mocks. For the provider, the contract test acts as a really handy functional test. It's a deeper test than just something like a Swagger validation for the syntax. It will actually be checking the semantics as well. If I pass Bob in on this interface, I expect to get Fred back. It means that it saves the providing team time writing functional tests.

The good thing is that if everything works, our tests pass. They're cheap and light to run. Their tests pass. They're cheap and light to run. Then things work. If something goes wrong, in this case, their implementation has gone pear shaped. Our tests are still going to pass, but they get an alert of the problem because their tests are failing. That's good, because if we'd actually deployed things like this, it would have failed. Similarly, if we make a change to how we call things, we get an alert really early on saying, the interface doesn't look like that, you need to change something. There's a few different contract testing systems out there. Spring Cloud Contract is one which works really well if you're in the Spring ecosystem. If you're a bit more polyglot, then I really like Pact. It's got bindings for almost every language that you might be using. It's a great thing to get in.
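
The two-sided nature of a contract test can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Pact or Spring Cloud Contract work internally: the `CONTRACT` dictionary format and every function name below are assumptions made up for this example, reusing the Bob and Fred case from the talk.

```python
# A contract is plain data that both teams can read (hypothetical format).
CONTRACT = {
    "description": "looking up Bob returns Fred",
    "request": {"name": "Bob"},
    "response": {"friend": "Fred"},
}


def stub_from_contract(contract):
    """Consumer side: build a stub provider that replays the contract."""
    def stub(request):
        if request != contract["request"]:
            raise AssertionError(f"request not covered by contract: {request}")
        return contract["response"]
    return stub


def verify_provider(provider, contract):
    """Provider side: replay the contract's request against the real code."""
    return provider(contract["request"]) == contract["response"]


def greeting_for(name, lookup):
    """Consumer code under test; 'lookup' stands in for the provider call."""
    response = lookup({"name": name})
    return "Say hi to " + response["friend"]


def real_provider(request):
    """The provider team's implementation (heavily simplified)."""
    friends = {"Bob": "Fred"}
    return {"friend": friends[request["name"]]}


def drifted_provider(request):
    """A provider that renamed a field without telling the consumer."""
    return {"buddy": "Fred"}
```

The consumer tests `greeting_for` against `stub_from_contract(CONTRACT)` instead of a hand-written mock, and the provider build runs `verify_provider` against the same data. With `drifted_provider`, the consumer's own tests would still pass, but the provider-side check fails, which is the early alert described above.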

The Enterprise Hairball

Of course, even if we sort that out, even if at our business logic microservice layer, we're pretty safe and we're pretty decoupled. There's probably quite a lot of other things in our system that we maybe haven't thought about when we made our really clean microservices architecture. Sometimes I think we get really excited about the business logic and we forget the front and the back, and then all the glue. This is of course especially likely in an enterprise system. One of our architects calls this the enterprise hairball. We all know, a hairball, it's not good, and you need to deal with it. What happens is, we start out and we have a monolithic business layer. We say, I can do better than this. I'm going to get the dream of decoupling. I'm going to divide it up. Everything's working quite well. We forgot the frontend layer. There's that. Then of course, there's the database layer. Most often, those are still pretty monolithic.

Then, of course, particularly in an enterprise context, we're going to have an integration layer. We're going to have messaging or something like that, that's pulling this quite complex system together. This is where we start to see problems. What most people have worked out is that every change to our business logic is going to need a change to the frontend as well. I think we're getting much better than we used to do at doing modular micro-frontends. That's getting there. Of course, as well, the integration layer often is still really monolithic. The integration team will usually be having demands on them from all sides. I heard a colleague, and they described the integration team as a panicked sandwich. They're squeezed between all these things, and they have so many demands on their time. They have this monolithic thing, so that they have to really schedule their changes really carefully, which blocks everybody else. Of course, it's well known. We see the same thing in the database as well. Every time we want to make a change, we need to make a change to the database if we kept our database monolithic.

This can cause a lot of frustration, particularly for the integration team. They can seem really unresponsive and slow, even though they're working really hard. We need to be sorting out the coupling at those layers as well. We really need to be slicing the integration layer, going to a more modular integration. There's really good patterns out there for that now. Going to a more modular database, where we have these vertical silos rather than having horizontal silos, because otherwise our microservices just aren't going to achieve what we want. The main thing that we won't be achieving is we will not be achieving continuous deployment. It will be impossible because we have so many dependencies elsewhere in the hairball.

Drags That Hinder Releases

How many of you recognize this? You work really hard, you've created something amazing. Then its value is just sitting on the shelf. It can't be released, even though you have a microservices architecture, because there's a release board. Then all the other microservices need to be released at the same time, even though independent deployability was the whole point of microservices. Then, because we're so scared of releasing, and the business is really scared of releasing because it's been burned in the past, there's a release checklist. Then there's incantations, and then there was a deadline. Then we still are doing this death march, even though we should be doing continuous deployment. Then we have to sacrifice a goat because we've been burned in the past when we didn't sacrifice a goat. Then someone somewhere is tracking a spreadsheet with all the dependencies. Of course, the moon has to be in the right phase. This wasn't what we signed up for when we did microservices. We see these drags that prevent the releases. Sometimes that actually gets enforced at a technical level as well as at a process level. What we sometimes see is pipelines that have been carefully crafted so that every microservice has to go through the same release pipeline, which ensures that we don't have independent deployability.

Test Automation

That wasn't the point. That wasn't why we were trying to do this. Usually, the reason that we're so scared of releasing, the reason why releasing is this horrifying prospect is because there's a ton of manual work around it. In particular, the tests aren't automated. When I visit a client and I hear our tests aren't automated, what I actually hear is, we have no idea if our code works at the moment. It might work. It worked last time we did manual QA, we hope it still works. Really, you've got to have that automation of quality, especially, if you're going to be the spaghetti architecture, which is, frankly, quite hard to resolve. It's easy to say, don't be spaghetti. It's hard in practice. If you're going to be spaghetti, you've at least got to be tested spaghetti, so that you have that confidence.

Then we see as well, that even if there's automated tests, to do the actual release, there's a whole bunch of extra stuff, which is manual. Then if we have concerns about compliance, there's a whole bunch of manual compliance work. Of course, especially if we're in a regulated industry, if we care about compliance, anything we care about, we should automate it. Then, of course, we see that even once we actually get it out the door, to try and figure out if things are going wrong, tends to be a fairly manual process as well. With all these manual processes, and with all these slowdowns, what that really means is that even though we're deploying to the cloud, we're not getting the promise of the cloud. We're using the cloud as though it isn't a cloud. Fundamentally, with the cloud, things that we used to do, that used to be a good idea, that used to keep us safer, are actually hurting us. Old style governance in the cloud, doesn't work. It doesn't achieve the business outcomes that we were hoping for. It loses a lot of the business benefits of the cloud.

The Release Cycle

A colleague of mine spoke to a large bank. This was in East Asia, and they were a legacy bank. Their lunch was getting eaten by all of the fintechs and all of these upstart banks. They could see that they just couldn't move quickly enough to keep up, and that was partly why they were really losing. They came to us and they said, "We're going too slowly. We've got this big COBOL estate, we think that's what's slowing us down". That was quite possibly true. "We need to get rid of this COBOL. We need to make microservices because everybody else is doing microservices". We said, "yes, we can help you with this". Then they added, "but our release board only meets twice a year". If your release board only meets every six months, you know your release cadence is going to be every six months. It doesn't matter how many independently deployable microservices you have. You're not going to get the agility.

At that point, we thought, yes, we can help you, but the help you need isn't technical help. We need to sort this out first. We need to sort out the automation. We need to sort out some of that continuous delivery, because that is what's holding you back, not the COBOL, even though the COBOL is sat there looking antique. We say, I want to be decomposed, but decomposed has more than one meaning. Wanting a decomposed application doesn't guarantee modularity. Sometimes it just means that the mess is spread more widely. If there are these other external constraints that are holding us back, then until we fix those, it doesn't matter how decomposed we are.

Questions and Answers

Richardson: I want to talk about the testing side of it. How widespread do you think automated testing is? Then we can get to contract testing in particular after that.

Cummins: Even the automated testing, it's one of those things, everybody loves the idea. Then you go to do it, and it's really expensive. If you're trying to move with a lot of speed, then sometimes that ends up being the thing that falls by the wayside. Or, sometimes if we're changing really fast, then we know that every time we change something, we have to rewrite all our automated tests. We say, "Ok, no, we'll just do it manually." Manual testing has its place. I think we all have good intentions and slightly less good execution with automated testing. Unless you do TDD, of course, which you should do.

Richardson: I've certainly seen with organizations, it's like, yes, the developers write a few tests. Then we just give it to QA to do whatever they do, which is not in the spirit of microservices.

What about consumer driven contract testing?

Cummins: Those, I think, aren't very widely adopted. If I had all the time in the world, I'd just go around and talk about consumer driven contract tests, and then talk to people to see why they're not doing it. It seems so natural to me that this is a problem that we all have, and we're all trying to solve this problem in lots of different ways, like making our pipeline release everything at the same time, or having the goat sacrificing in the 2-week QA phase. The correct solution to this problem is contract tests. There's good reasons not to do contract tests as well, or there's barriers to them.

One of the questions was, as a producer, is it my fault if I break you as a consumer? Different teams are going to have different answers to that question. The way a contract test will work is that usually the consuming team will check their tests or their contracts into the producer's build. Then that means that if they change something, the producer's build can break. Often producing teams will be like, "No, it is not my fault that this build broke. You just did something, and I was doing everything perfectly right." Then the question is, if we can't have a good negotiation, and a good conversation about what our API should be at the build level, and at the test level, are we really confident that we can have a grownup conversation about our API and whose fault it is that something breaks when it's actually out in the field? I think it's really hard to get right. I think that should be a red flag that if you can't do the negotiation about the contract testing, and figure out whose fault it is at the contract testing level, you're definitely not going to be able to do it at the runtime level.

The other problem with contract testing is it's just really hard. This, I think, seems like a solvable problem. Often, when I'm working with teams, and I'm explaining contract testing, when we actually get down to doing it, we end up doing things like testing the mock, because we really care what the other side does, and we have this mock that's coming in. We do a lot of testing with the mock. Then we have to step back and say, "I just spent two hours testing a mock that I defined. Let me stop doing that".

Richardson: Then maybe it also plays into the fact that organizations are used to end-to-end testing of the entire system, so it feels more comfortable to do that, rather than a true independent deployment of microservices, which requires contract testing, I think.

Cummins: I think contract tests are almost the worst of both worlds, because they don't give you that snuggly confidence that you would get from doing end-to-end testing, because it's in all these little bits. At the same time, it really exposes that there's coupling, so my tests can break you. We had this idea with microservices that we're completely decoupled, and we can do whatever we want, and any change I make doesn't affect anybody else. Of course, that's not how it works. It exposes that gritty, icky truth that we were hoping to ignore.

Richardson: I often like to use the analogy that, imagine you're responsible for a plane. You change the light bulb over someone's seat because, let's say, that part is broken. You don't need to take the whole thing for a test flight just to verify that that fix works.

Cummins: If you look at the Pact website, they have this great analogy that I sometimes use, which is about fire alarm testing. Because we all have our fire alarms, and there's a little button on it that you push and it says test, and that tells you it makes some noise. It doesn't necessarily give you any confidence that you haven't painted over all the things that let the smoke in. That actually if your house was on fire, it would make a noise. It just tells you, if I push the button it makes a noise. That's like a unit test. Then at the other end you have the integration test of, I put some matches to my house and it goes up in flames. Through the inferno, I can hear, beep, beep. I have a lot of confidence, but that was a really expensive test to do. I hope someone does it once, but I'm not going to do it every time I want to get a bit of confidence. Then, what they have, sometimes in institutional buildings, you can see them, it's like this long stick, and it's got a cup on the end. They put the cup around the fire alarm, and then it just hisses in a little bit of smoke, so that you can see, ok, actually, smoke causes a beep rather than button causes a beep. I was able to learn that without burning my house down.

Richardson: That sounds like chaos engineering, doesn't it? Directly injecting faults into your production environment?

Cummins: Yes, chaos testing might be letting children with matches into the mall, and making sure that you've got the fire brigade quite close by as well.

Richardson: Should API versioning be an alternative to contract testing?

Cummins: I think API versioning is definitely a great thing. It's also really hard. In an ideal world, you would do both. Most of the contract testing providers, they've got some quite nice versioning support as well. Then you can have a contract for each version. You can say, ok, I know my provider has now moved up, and they've got version 5. I don't want to work against version 5, because that's too scary, I'm still going to be against version 4. Then you have the best of both worlds that they can move forward without breaking you, but you still have the confidence that you're ok.
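
Combining versioning with contract tests amounts to keeping one contract per supported version and having the provider check all of them on every build. The sketch below assumes a made-up contract format and a toy `handle` function; the version numbers 4 and 5 come from the answer above, and everything else is hypothetical.

```python
# Hypothetical contracts keyed by API version; the format is an assumption.
CONTRACTS = {
    4: {"request": {"id": 1}, "response": {"name": "Bob"}},
    5: {"request": {"id": 1}, "response": {"first_name": "Bob"}},
}


def handle(version, request):
    """Toy provider entry point that still serves both versions."""
    record = {"first_name": "Bob"}
    if version == 4:
        return {"name": record["first_name"]}  # old response shape
    if version == 5:
        return {"first_name": record["first_name"]}  # new response shape
    raise ValueError(f"unsupported version: {version}")


def failing_versions(handler, contracts):
    """Return the versions whose contract the provider no longer honours."""
    return [
        version
        for version, contract in sorted(contracts.items())
        if handler(version, contract["request"]) != contract["response"]
    ]
```

A consumer that is not ready for version 5 keeps its version 4 contract in place. If the provider drops the old response shape by accident, `failing_versions` reports 4 in the provider's build, before the consumer is broken at runtime.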

Richardson: What factors should an organization consider when thinking about microservices, or, taking a step back, when choosing between a monolith and microservices as the architecture for their application? What criteria?

Cummins: Yes, the first thing is to have that really hard conversation about, why are we doing this? What problem are we really trying to solve? Is it just that we're feeling a bit left out, because we can see all the cool kids are doing microservices? Then, if we get past that, if we resolve that, then hopefully we come up with an answer, something like, we need more business agility, and we're really willing to do the releasing, and that kind of thing as well. Then you want to look at things like, what size team am I? If there's only four people in this team, you don't need to be decoupled because there's only four of you. Probably microservices would be an overhead that you don't need. If you don't really ever imagine that you're going to be releasing these things independently, because your organization's appetite for release is that you release every six months, then it's probably not a good fit either.

Richardson: At some level, I just think, yes, if you need to move fast, that's a big motivation, and your application is complex, and/or there's a lot of people working on it. That's the sweet spot.

Cummins: One of the other things to think about is going back to that example I gave where the domain model just really didn't split nicely. Say I've just spent a day arguing in an architecture workshop about the divisions in my domain model, and we still have all this stuff that's just spread across every microservice. At this point the pain of this is going to be too much, so maybe let's not do it until we have a crisper set of domain boundaries.

Richardson: I think you do have to have a good understanding of the domain and its boundaries, which pushes you toward the monolith-first approach, and then you split later.

Cummins: Yes, because they can change as well, of course. Maybe you had this really neat boundary and then half of your data leaked across to the neighboring domain, and so then you should maybe be slicing at a different point.

Richardson: What do you do about entities that are pervasive throughout the application, like, you could say, a user? That's everywhere. Or in eCommerce, it's all about products, and orders, and customers, or in banking, it's accounts and customers. It sounds like you should only ever have three services.

Cummins: I think it is hard. One thing that you can do is you can try and say, I'm calling all of these products, but actually, I'm using them in such different ways that they're not really the same thing. There's maybe a foreign key between them, but that's the extent of the relationship. There's one field, that's the ID, that's the same, but the way I use them is so different that I could actually treat these as different things. Either give them a different name, or just know, ok, I've got my warehousing product, and then I've got my banking product, and they're different.

The other thing that you can do is you could do the upfront work and say, we're going to accept that these are really pervasive, and so they do not change, because we know if they change that will be a world of pain. Then you just have to make that really explicit and say, we're confident enough in these that we're willing to accept the coupling on these. The other thing you can do as well is you can have a little translation layer. You can say, we thought they were going to be the same, but actually, they've changed over here. In order to allow that microservice to continue to evolve, we'll just put a little buffer between it.
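
The "little translation layer" idea can be pictured as one small function that is the only place aware of both models. This is a sketch under assumptions: `CatalogProduct`, `WarehouseProduct`, their fields, and `to_warehouse_product` are hypothetical names chosen for illustration, not anything described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogProduct:
    """Product as a hypothetical catalog service publishes it."""
    product_id: str
    title: str
    price_cents: int


@dataclass(frozen=True)
class WarehouseProduct:
    """Product as a hypothetical warehouse service thinks about it."""
    product_id: str
    label: str


def to_warehouse_product(product: CatalogProduct) -> WarehouseProduct:
    """The buffer: the only code that knows about both models.

    Only the shared ID and the one field the warehouse needs cross the
    boundary, so other catalog changes stay on the catalog side.
    """
    return WarehouseProduct(product_id=product.product_id, label=product.title)
```

If the catalog side later renames `title` or adds fields, only `to_warehouse_product` needs to change, and the warehouse service can keep evolving its own model.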




Recorded at:

Mar 17, 2022