9 Ways to Fail at Cloud Native



Holly Cummins shares stories of what happens when things go wrong in a cloud native migration.


Holly Cummins is a Senior Technical Staff Member and Innovation Leader at IBM. Holly is also an Oracle Java Champion, IBM Q Ambassador, and JavaOne Rock Star. Before joining the IBM Garage, she was Delivery Lead for the WebSphere Liberty Profile (now Open Liberty). Holly co-authored Manning’s Enterprise OSGi in Action and is a regular keynote speaker.


Cummins: I'm Holly Cummins. I work for IBM. What I do with IBM is, I'm a consultant. I help organizations really get the benefits of the cloud, because there's a lot of things that you can do on the cloud that are white, and fluffy, but that don't really end up giving you the benefits of the cloud. In my work as a consultant, I've seen a lot of those things.

Cloud Native and Container Driven Development

One of the challenges with cloud native and with container driven development in general, is trying to figure out, why are we even doing this? What problem are we trying to solve? What we sometimes see is a little bit of everybody else is doing this, so I need to do it too. A bit of CV driven development where we say, yes, I think maybe I need to update my CV by changing my organization's architecture, whether it's the right thing to do or not. You may be thinking, I know we joke about CV driven development. That's really cynical. Not many people are actually doing that. It turns out, quite a lot of us are. Red Hat recently did a survey, and they looked at the drivers for container based development. What they found was 40% said the main driver was career progression in-house or externally, so that is basically CV driven development. Again, maybe that's a good thing. Maybe that's pushing us to better our skills. We need to be really clear on what the benefits are. What are we hoping to achieve?

It's not enough to just put things into containers, and then keep doing exactly the same as we were before. That's maybe ok for a CV, but not for anything else. With cloud native, things are made extra hard, because cloud native means so many different things to different people. We could all have an outcome in mind, we could all have a goal, but unless we talk to one another and have that honest conversation about, is your problem the same as my problem, things can go wrong. For example, we could end up with some people thinking that we're going to cloud to save costs, and other people thinking that we're going to do it for agility. That could be an awkward conversation at the end of the project. Or maybe some people think that because microservices are such an orthodoxy now, in order to be cloud native, we have to be microservices, whereas other people may have a different view. Again, a potentially awkward conversation.

Microservices Envy

Microservices are particularly an area where I think there's a lot of misalignment, because microservices have become such a standard architecture now that although not every organization is doing microservices, pretty much every organization feels they should be doing microservices. They're looking at those really microservice-heavy architectures, seeing some of the benefits of that architecture, and thinking, I need to do microservices. It's important to remember, microservices are not a goal. No client is going to look at your website and say, look at all those microservices underneath. They care about their user experience. They care about how well you're able to meet their needs. Microservices can be one means of meeting that goal. They're not the only means. They're only a means, they're not an end in themselves. I think part of this comes because containers are such a compelling technology. They have so many great characteristics in terms of how lightweight they are, how portable they are, and how easy they make it to share images, knowledge, and scripts, and to get reproducibility. We say, these containers, they're so good. I should have as many as possible. Then you end up with things like the Netflix Death Star diagram. Again, in some contexts, that might be the right thing to do. You need to have a lot of operational maturity to do that, and it's maybe not the right thing for your organization.

Distributed Monolith

What we sometimes see is this assumption that if we bring in microservices, everything else will just follow automatically. We spoke to a bank in Asia Pacific a while ago. They came to us and they said, IBM, we need your help, our lunch is getting eaten by all these challenger banks. We've got this big, unwieldy COBOL estate. We need to get rid of it and go to microservices. We said, we can help you. Then they added, our release board meets twice a year. If you have that governance structure, it doesn't matter how many microservices you have, they get released at the same time. You're not going to get that agility. No matter how many containers you have, what you're looking at isn't microservices, it's just a distributed monolith. A distributed monolith is like a normal monolith but even more frightening, really, because you don't get some of those benefits that a monolith did give you, like compile-time checking of your types, or that guaranteed function execution where you know, because you're all running in the same process, that if you call something it's actually there. You have to deal with all of these challenges but without getting the agility, so why bother. If you don't do that de-spaghettification, you can end up really easily with cloud native spaghetti.

Distributed Is Not Equivalent to Decoupled

I was called in to help a troubled project. When I arrived, pretty much the first thing they said to me was, we've got microservices. We're doing everything right, but every time we change one microservice, another breaks. Of course, that's definitely not the promise of microservices. The promise of microservices is that you have this wonderful decoupled architecture, and you can change them independently. Of course, it's important to remember that just because you've distributed things doesn't mean you've decoupled them. Distributed and decoupled are not synonyms. You can have a lot of coupling even if you're distributed. That was exactly what was happening in this case.

When I looked into the code, I kept seeing the same code that was copy and pasted all through the code base. It turns out that there was this domain model, this quite complicated object model, and it had 20 classes and 70 fields. In order to avoid coupling, there wasn't a common library for this domain model. Instead, it was reproduced across each code base. The coupling was still there. If one name changed in one microservice, that cascaded and broke everything else. Of course, we see this coupling, not just in code. We see this coupling in lots of places, and it breaks things. With coupling, I don't think we can ever really get to a state where we don't have coupling. What we have to do is we have to manage coupling. We have to know where our coupling points are, minimize them, and then watch them like a hawk to make sure we don't have problems at those coupling points.

The Mars Climate Orbiter

This is one of my favorite distributed stories. This is the Mars Climate Orbiter. If you look at the actual photo, it looks a little bit more professional than my drawing, but only slightly. I like how it looks like it's made out of a bin liner. Then there's a tumble dryer tube in there somewhere. I'm sure that's duct tape that's holding everything together. It had a rather sad end, which wasn't because of being made out of tumble dryer tubes and sellotape. It had a different problem. This was a long time ago, and so it wasn't supposed to land on Mars. That was too ambitious. It was just supposed to go around Mars. What ended up happening instead is that it crashed, and that was the end of the Mars Climate Orbiter.

Obviously, this was tremendously sad, and an investigation was done to figure out how this happened. It turned out that there were two control systems. One was on earth and one was on the Orbiter itself in space, built by two different teams, two different companies. Very much living the microservices dream of independent teams, autonomous, but unfortunately, there was a little bit too much autonomy. One team used metric units for everything. The other team used imperial units, and nobody noticed that on that communication protocol they hadn't agreed about what units they were using. The units were similar enough in magnitude that it wasn't really obvious that there was a problem until it crashed into Mars. You could not have a more distributed system than this one, half of it was literally in space. Distributing did not help, it still had that coupling. The coupling wasn't managed properly, so it ended in disaster.
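To make the unit mismatch concrete, here is a minimal Python sketch, not taken from the talk, of a boundary where every value has to carry its unit. The `Impulse` type, the message shape, and the choice of impulse units (newton-seconds versus pound-force seconds) are assumptions made for illustration; the talk only says that one side used metric and the other imperial.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always carries its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" (metric) or "lbf*s" (imperial)

    def to_newton_seconds(self) -> float:
        """Convert to one agreed unit before anything downstream uses the value."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def parse_message(message: dict) -> Impulse:
    """Reject any message that does not state its unit explicitly."""
    if "unit" not in message:
        raise ValueError("message must declare its unit")
    return Impulse(value=float(message["value"]), unit=message["unit"])
```

With something like this at the coupling point, a bare number with no unit fails loudly at the boundary instead of being silently misread by the other team's code.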

Managing Coupling in Microservices

The way to manage coupling in microservices, is, you're going to need some integration tests. Really, you don't want too many integration tests, because they're really expensive. If you do too many, you start to lose that benefit of microservices. You may as well be doing everything in one process. Instead, what you want to do is consumer-driven contract tests, which just allow all of the teams to maintain the autonomy, but give us confidence about those boundaries and confidence that we're actually communicating correctly. Of course, contract tests need to be automated. Automation turns out to be another problem for a lot of organizations. Automation is expensive. What often happens is we all say, automation is a really good idea. Someday, we're definitely going to automate. We just can't afford to automate right now because we've got these deadlines. Whenever I talk to a client, and I hear our tests aren't automated, my heart just sinks, because I know what that really means is, we have no idea if our code works. That's scary. It means that we can't do things like releasing without a huge battery of manual tests, because at any particular moment, the code may or may not work. If we don't have those automated tests, if we don't have enough automated tests to have really solid confidence in the quality, to have really solid confidence that our code changes aren't going to break things, then we end up with what I call the not actually continuous, continuous integration and continuous deployment.
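As a rough illustration of the consumer-driven contract idea, the sketch below has each consumer declare only the fields it actually reads, and the provider's test suite checks its response against every declared contract. This is a hand-rolled simplification, not the API of any contract-testing tool, and all names (`BILLING_CONTRACT`, `provider_response`, and so on) are hypothetical.

```python
# Each consumer publishes the fields (and types) it actually relies on.
BILLING_CONTRACT = {"account_id": str, "balance_cents": int}
REPORTING_CONTRACT = {"account_id": str, "currency": str}


def provider_response() -> dict:
    """Stand-in for the provider's real handler."""
    return {
        "account_id": "acc-42",
        "balance_cents": 1999,
        "currency": "EUR",
        "internal_note": "no consumer relies on this field",
    }


def contract_violations(response: dict, contract: dict) -> list[str]:
    """Return human-readable violations; an empty list means compatible."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems


def test_provider_honours_all_consumer_contracts():
    """Runs in the provider's pipeline, so a breaking rename fails the build."""
    for name, contract in [("billing", BILLING_CONTRACT),
                           ("reporting", REPORTING_CONTRACT)]:
        assert contract_violations(provider_response(), contract) == [], name
```

The provider stays free to add or change fields nobody consumes, such as `internal_note` here, which is where the autonomy comes from. Renaming `balance_cents` would fail the billing contract before release.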

Continuous Integration, Continuous Deployment

Another thing that makes my heart sink when I hear it is, we have a CI/CD, because CI/CD is not a noun. It's not something that you buy, and you just put it on the shelf, and you say, there's the CI/CD. CI/CD stands for continuous integration and continuous delivery or deployment. It's something that you do. It's not a tool that you buy. Often, I hear things like, I'll merge my branch into our CI next week. If it's only going in once a week, it's not continuous integration. Or I'll hear, again, all this talk of CI/CD, and then thrown in there is, we release every six months. I think, remember that the C is continuous, and if it's only once a week, if it's only once every six months, you keep saying that word, but I don't think it means what you think it means. Often, the reason that we have the CI/CD tooling without actually doing the CI/CD is that organizations are really reluctant and really nervous about releasing things. Again, part of that comes back to those automated tests.

If you can't be sure that someone pushing code into your system isn't going to break everything, if you can't be sure that what you release isn't totally broken, of course, you don't want to release it. Then you end up with CI/CD patterns like this one. Instead of having an independent pipeline for each microservice, because they're supposed to be independently releasable, we know that we can't release them without doing a lot of manual UAT, that kind of thing. In order to enforce that pattern and prevent the microservices from being released, we put a lock on the pipelines so that they have to go out in lockstep. Of course, this is really missing a lot of the benefits of microservices, and it's missing a lot of the agility that cloud native is supposed to give you. Even if not everything is perfect the second it leaves the developer's desk, which realistically it probably isn't going to be, and even if things take time to build, which of course some things will, there are technical patterns that we can use to continue getting that flow of release so often that it's boring, without introducing a horrible user experience and lots of technical risk. In fact, releasing really regularly, getting good at it, reduces risk.

We can do things like deferred wiring. We have this microservice out there, nothing's talking to it. It's safe to release it as often as we like. Or if things are more entangled, we can use things like feature flags. Something like LaunchDarkly or home-rolled feature flags to make sure that that function's there, but it's disabled. Or maybe we do want to be releasing because we do want to be getting that feedback. Of course, we don't want to release to everybody until we've had some feedback. We can do things like A/B testing. We can do friends and family testing. We can do canary deploys, just something so that we can continue maintaining that continuous deployment cadence and continue getting that feedback, which is one of the really big benefits of cloud native.
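A home-rolled feature flag of the kind mentioned above can be very small. The sketch below is one possible shape, assuming a simple percentage-based rollout keyed on a user ID; the flag names and the `FLAGS` configuration are hypothetical, and it does not reflect how LaunchDarkly or any other product works internally.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, flags: dict) -> bool:
    """Flags missing from the config count as 0%, so unfinished code stays dark."""
    percent = flags.get(flag, 0)
    return rollout_bucket(flag, user_id) < percent


# Hypothetical configuration: percentage of users who see each feature.
FLAGS = {
    "new_checkout": 10,       # canary-style: roughly one user in ten
    "dark_launch_search": 0,  # deployed but disabled for everyone
    "redesigned_home": 100,   # fully released
}
```

Because the bucket is derived from a hash rather than a random draw, a given user gets the same answer on every request. Raising the percentage widens the audience without a redeploy.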

Cloud Governance

Another thing that we see is we don't keep up with the governance. We see that a little bit with the releases and the release boards. We see it elsewhere as well. The joy of the cloud is how easy, how self-service it is. That can make organizations uncomfortable. I was told this story by colleagues. A while ago, we got called in with a complaint from a client, and they said, "IBM, you sold us this provisioning software, and it's broken." This was back in the days when virtual machines were the extent of the cloud. We'd sold them this system that would allow them to provision virtual machines in 10 minutes, self-service. They went, this is amazing. Yes, we definitely want this. Then when they started using it, instead of getting the 10-minute provisions, it was taking them three months to provision things. They said, IBM, your software is broken.

We investigated. What we found was that they'd put in an 84-step approval process. Before anybody could get a virtual machine, they had to fill in 84 forms. When you look at it like that, actually, it's amazing that they were managing to provision things in three months. The sad thing about this is this story was years ago, but I still hear the same thing. Developers will come up to me and they'll say, at my organization, it still takes three months to get a cloud instance. It makes me so sad, because the beauty of the cloud, the joy of the cloud, is that really frictionless developer experience. If you try and apply the old-style governance, where everything is manual and everything relies on handoffs, it's not going to work. You're taking the cloud, you're wrapping it in paperwork, and you're putting it in a cage. That may as well not be the cloud anymore.

Of course, there's a reason that organizations do this. There's a reason that organizations are nervous about letting anybody run anything, any time. Part of it has to do with security. Really, when we talk about CI/CD and DevOps, we need to be talking about DevSecOps as well. Automate as much as possible. That includes security. Get it in the pipelines, get it so that it's impossible to deploy something unless we've done the validation, that it's not doing horrible things in terms of organizational security. Another thing that organizations are quite understandably nervous about, is the financial implications of letting anybody provision anything at any time. I think a lot of organizations see this. A lot of organizations see that their cloud bills are quite high. What's more troubling as well, is that when you look into it, it can be really challenging to figure out, what's the money being spent on? How much of this money is for services that are core to my business, so of course, I want to keep spending money on it. How much of it is developer experiments that just got abandoned. Organizations have a lot of patterns for trying to deal with this, and none of them are very good.
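One way to picture "get it in the pipelines" is a policy gate that runs as a pipeline step and stops the deploy when a rule is broken. The following is a minimal sketch under stated assumptions: the deployment dictionary, its keys, and the four rules are invented for illustration and do not correspond to any particular policy tool or manifest format.

```python
def policy_violations(deployment: dict) -> list[str]:
    """Return the policy rules a deployment breaks; empty means it may ship."""
    violations = []
    if deployment.get("run_as_root", False):
        violations.append("containers must not run as root")
    if deployment.get("image", "").endswith(":latest"):
        violations.append("image tag must be pinned, not ':latest'")
    if not deployment.get("owner"):
        violations.append("an owner label is required")
    for name in deployment.get("env", {}):
        if "PASSWORD" in name.upper() or "SECRET" in name.upper():
            violations.append(f"env var {name} looks like an inline secret")
    return violations


def gate(deployment: dict) -> int:
    """Exit-code style result: 0 lets the pipeline continue, 1 stops it."""
    problems = policy_violations(deployment)
    for problem in problems:
        print(f"POLICY: {problem}")
    return 1 if problems else 0
```

A check like this replaces a manual approval form with something that runs in seconds on every change, and the rules themselves are visible and versioned alongside the code.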

It seems like that governance pattern of shutting the barn door before the horse leaves, that seems like it would be a good one. Actually, it doesn't really work because it prevents some really valuable experiments. It prevents some really valuable innovation. It makes everything slower, developers leave. Then you still actually end up with this problem of zombie workloads, abandoned workloads. In order to try and manage this, I think, what we have to do is we have to apply those same automation and self-service principles that we do to everything else, to managing our cloud costs.


I'm really excited by some of the FinOps conversations that are coming along, because a lot of what they're trying to do is bring in that automation, bring in that real-time visibility. I think in the best cases, as well, make it visible to developers. Get that feedback loop complete. Instead of having information go to a CFO or a manager somewhere, and then them have to try and trace back through to figure out who is wasting money. We can see our own financial impact and we can optimize that, if it makes sense. That's a whole bunch of failures, disaster stories, sad stories, and what not to do.
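The feedback loop described here could start with something as simple as attributing spend to owners and flagging workloads that look abandoned. The sketch below assumes a hypothetical inventory where each resource is a dictionary with `name`, `owner`, `monthly_cost`, and `last_active` fields; real cloud billing exports look different and would need mapping into this shape.

```python
from collections import defaultdict
from datetime import date, timedelta


def cost_by_owner(resources: list[dict]) -> dict[str, float]:
    """Sum monthly cost per owner tag; untagged spend is reported separately."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        totals[resource.get("owner") or "untagged"] += resource["monthly_cost"]
    return dict(totals)


def likely_zombies(resources: list[dict], today: date,
                   idle_days: int = 30) -> list[str]:
    """Names of resources with no recorded activity for at least idle_days."""
    cutoff = today - timedelta(days=idle_days)
    return [r["name"] for r in resources if r["last_active"] <= cutoff]
```

Showing the per-owner totals to the teams themselves, rather than only to finance, is the part that closes the loop. The 30-day idle threshold is an arbitrary default that each organization would tune.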

Ways to Succeed at Cloud Native

I want to end on a more positive note, because, of course, cloud native is brilliant. Cloud native has really changed how we develop. It's made things so much more frictionless and enjoyable for developers. It's enabled us to get new user experiences, new technology, out faster than we ever could before, and with less toil than we ever could before. There's a lot to love. We just need to make sure that we're doing it in the right way.

I often talk about culture, and how we really need to be embracing the cloud native culture, not just the architecture, the microservices. Another way of thinking about it, if culture seems too fluffy as a concept, is that we definitely, as well as doing the architecture, we need to be bringing in those cloud native operations. Kelsey Hightower talks about doing cloud like it's 1999. We need to bring our architecture up to the modern day. We need to bring our operations up to the modern day as well. We need to make sure that we're collaborating. If cloud native is something that just happens on the IT side, it's not going to work. Similarly, if it just happens on the business side, then it's not going to work either. We need to be having that conversation together, getting aligned, changing the things that need to change to take advantage of cloud native. Then going forward. Really, with cloud native, one of the reasons it's so effective as an architecture is it allows us to optimize for feedback. It allows individual teams to get feedback much faster than they could before. It allows the organization as a whole to get that feedback. Really, embrace that. Everything we do, whether it's our unit testing, whether it's our automation, whether it's our release strategy, we should be optimizing for feedback.

Ultimately, we need to be clear on what we're trying to achieve. Are we just trying to save cost, trying to get out faster? Are we trying to retain developers by having a great developer environment? All of those are perfectly fair things to do. If half the organization thinks we're doing one and half the organization thinks we're doing another, we're going to have problems.

Questions and Answers

Betts: One of your common themes was that we need a certain level of maturity across the organization, from operations, to testing, to IT, to all parts of the business as well. What does it take for a company to know they're ready to go cloud native? If there was a theoretical checklist of here's all the things you need to be successful, it doesn't seem reasonable to say we have to have all of this in place before we can get started. Neither should you have none of that in place, and just figure it out all of the way. What's the minimum checklist that you see that people should have before they say, we're ready to start the journey?

Cummins: I think it's actually helpful to think about maturity models for this because it's not just a binary. There's definitely a spectrum of cloud nativeness. At one end, you have organizations like Netflix, and then at the other end, you have others. I think what sometimes happens is we push really far ahead on one piece, and we try and get to that next level, while the rest lags behind. If you think of it as a matrix, we're trying to go to 9 out of 10 on one row, but we're still at 1 out of 10 on the others. For example, we go, I'm going to have 500 containers, but I'm going to have one CI/CD pipeline and one release board. Then there is that mismatch that really causes problems. What you want to be doing is trying to inch along on each row of the matrix.

Container Solutions have a really nice maturity model for cloud native. I know some people are quite down on maturity models. I find them really helpful just as a way of saying, it's ok to not do everything. Here's a bit of a roadmap to show me the route from where I am or where my horrible colleagues are, to where I want us all to be. Container Solutions have a good one. Some of my colleagues have done a really nice one as well. There's about seven things. You want to look at your governance. You want to look at your architecture. You want to look at how much automation you have, and then bring each of those along roughly in line with each other, or you're going to get friction.

Betts: I think there's some aspects of cloud native that people don't think of as maybe as important or at least they're not cool. Like containers are cool. Automated tests are not cool. Dozens of deployments a day sounds really cool, but figuring out why your cloud bill is astronomical, is not cool. How do you get people to eat their vegetables and work on the not cool stuff?

Cummins: I think with that, what we sometimes need is a bit more empathy. I say this directed firmly at myself, because I'm such a huge fan of automation. Then I look at other organizations, and I see this lack of automation, and I think, why didn't you do this? Then I talk to people, I'm like, you don't actually really get excited by CI/CD pipelines in the way I do, and you don't really get excited by automation. There's two things that we can do there. One is we can change what we value. We have to say, this bread and butter foundation work is really valuable, because it's what makes everything else work. We should reward it in terms of what we measure and what we reward in individual contributors. Then as well, those of us who really love automation, I think sometimes need to recognize that other people don't necessarily love automation. That's ok. We just need to figure out how to work in a world where not everybody does want to do automation.

Betts: It was, how do you get people to do the stuff that's not cool? Automation is certainly one of those. Have you seen examples of companies saying, we've got this level of maturity over here. We're blazing ahead and we're making 500 microservices, but someone has to work on the CI/CD pipeline. How have you been able to redirect people to say, we need to level up on all of the things before we go any faster over here.

Cummins: I think a lot of that often comes down to how things are managed and those unintended consequences of management where we reward certain things, and then that creates effects that we didn't want. I heard a story from a colleague, and I think this happens all the time, where there was a couple of teams working on a large project. One team worked really hard, stayed up all night, because we had the stakeholder playback on the Friday. They got this frontend that looks spectacular. It was just a hollow shell. There was nothing underneath it.

The other team, first of all worked at a sustainable pace, which again, management tend not to reward, but management should reward because it ends up actually probably being more productive in the long run, than having people do these peaks and troughs of productivity. They worked at a sustainable pace. They started with the foundation. They started with the automation. They started integrating these components together that weren't talking to each other. Then they had something that was fairly unflashy on the frontend, but it worked all the way. It was this full vertical slice of function. I think you can imagine, which the stakeholder was more wowed by and they said, "That's amazing. You worked so hard. Look, it's beautiful."

If you had looked a bit deeper, you would have realized that the more meaningful work was actually done by the team that you didn't reward. I think sometimes that's not done with any malice. It's just quite easy to be dazzled by things. We need to be a bit more thoughtful. What's actually going on under the covers? This team delivered this amazing new feature and this team put in a really solid CI/CD pipeline. Let's think about which we reward.

Betts: There's things that the customer sees, but then, if you're able to make all the developers 10% more efficient, that pays dividends, that just keep compounding. You need to spend some time doing that, otherwise, everyone eventually burns out by working too hard, but not efficiently.

Cummins: That enabling work is so vital. Sometimes it's just a case of asking the question. At IBM, we're measured quite strongly on mentoring. If you want to go for a promotion, or at your end of year, it's like, who are you mentoring? Also, not just names, but what did your mentees achieve? I think we can do the exact same thing. We're starting to do the same thing with asset creation in a lot of organizations, and we can do it with enabling as well. Who did you enable to work better because of the work that you did? Their achievements. If you've supported their achievements, let's talk about their achievements as well.

Betts: One of the questions is somewhat related. It goes to, I think, that slide of we had a 10-minute delivery process that took 3 months. What have you seen for ways to reduce those onerous processes, whether it's getting the engineering team more efficient, or getting executive buy-in to say, we need to fix the process?

Cummins: The first step is just visibility. In that story, they were really shocked by what was going on. They themselves didn't have visibility of how many steps. Each step I think probably seemed really reasonable when it was put in. Each little stakeholder said, but I need to know this, so I'll just put the step in, and then it just, in aggregate, turned into this monster. Doing some of that value stream mapping, and that kind of thing can help.

Then the next thing is really having those alignment conversations to say, I want to achieve this, you want to achieve that, how can we be pulling in the same direction rather than the opposite direction? Again, a lot of that has to do with measurement as well. There's the classic DevOps conversation, where we measure Ops on stability, and we measure Dev on features released. They're in conflict, just from the measuring system before we've even done anything. I think there was a similar thing with that system, where each person who put a check in place, did it, because if that check wasn't there, they risked getting in trouble. It's having that more holistic view to say, this is overall, what we want to achieve. What are your goals within that bigger thing? Let's measure you on the right things rather than measuring you on things that end up hurting the organization.

Betts: What are your top three pieces of advice for companies starting out? It sounds like visibility, like you just said, is clearly one of those things. You need to be able to observe what you're doing. What are two others that you would put in that top category of, we've got to focus on this first?

Cummins: This is such a cliché, but talking to each other. Particularly talking to each other beyond teams, because I think we naturally tend to form little, quite tribal structures. We know that everybody in our team is quite cool, and they get it, but the people in that other team, what a bunch of cowboys. Why did they go in that way? Someone said, your Ops person, they do eat lunch. They do drink tea. You can do those things with them, treat them as a human being, and have those conversations. You can have those informal conversations, which is going over and talking to them. Then we can do more formal things as well. We do a lot of design thinking workshops. What we almost always find when we do the workshops is if we get business and IT in the same room, it is a revolution for the client. Each individual thinks they know what the problem is, but they hadn't actually realized that their peers across the hall had a completely different set of priorities. Just facilitating that communication is something someone external can do, like we're doing, but you can do it yourself as well. Talking to each other is the second one. Automate everything is the third. Especially if you care about it, automate it. Try and make sure that the governance is encoded at the automation layer, and the security is encoded at the automation layer, because then it's more robust and it's visible.

Betts: I think your slide was, if you don't have automated tests, you don't know if your code even works. That scales up. Then you don't know if your entire cloud infrastructure works. You don't know if everything works, you're just hoping it doesn't break and how to fix it.

Cummins: Years ago, I finished a user story, and I reported proudly at scrum, "I finished my user story." Then I realized it didn't work at all. Someone said something to me, and they said, yes, but you didn't put a test on it. It doesn't work until you've got a test on it. Because it worked when I finished it but the next day it had regressed.

Betts: Is cloud native for everyone? What about sensitive data, intellectual property, stuff like that, maybe you shouldn't put it in the cloud?

Cummins: I think public cloud is good for a lot of things, not necessarily for everything. I think a hybrid cloud really makes sense, because it does admit both of those use cases, so you get that elasticity and agility of the public cloud. Again, some things probably you do want on-prem for historical reasons, or for sensitivity reasons. Those cloud native ways of working, they're challenging. A lot of things need to change in an organization to really make it work, as you see it in the movies. If you can do that, the benefits are so big. I think that automation, the emphasis on collaboration, the emphasis on optimizing for feedback. That's just good engineering. That's what we should do, no matter what the context, whether we call it cloud native, or DevOps, or something else.

Betts: That's the behavior. One of the nice things I've always thought of with cloud native is, you don't care about the infrastructure, you don't care that I'm running to the server room with a floppy disk or a USB key, and updating the website. You just publish to the cloud. If you're so removed from the infrastructure, it doesn't matter if it's on-prem or in the public cloud, it's still, you just work on your piece. You're able to focus on it. You have to have all of the patterns and practices in place that the pipeline takes care of the actual deploy. Maybe it uses the public cloud for testing, because it can scale out faster but there's no real data, and we keep the real data in. The hybrid cloud is a good option for handling all of those test cases.

You mentioned maturity models, do you have a link or something you can share?

Cummins: Container Solutions do have a really good one. Some IBM colleagues have a good one as well.



Recorded at:

Jan 13, 2022