
Cloud Native Is about Culture, Not Containers


Summary

Holly Cummins shares stories of customers struggling to get cloud native and all the ways things can go wrong.

Bio

Holly Cummins is the worldwide development practice lead for the IBM Garage. As part of the Garage, she delivers technology-enabled innovation to clients across a range of industries, from banking to catering to retail to NGOs. She is an Oracle Java Champion, IBM Q Ambassador, and JavaOne Rock Star.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Connect, see, and speak with like-minded people. Join us to accelerate your learning, be better informed, and drive innovation.

Transcript

Cummins: I'm Holly Cummins. I'm a consultant in the IBM Garage. About a year ago, I was reading Twitter, as one does. I saw a tweet from Daniel Bryant, and he quoted an article from my Red Hat colleague, Bilgin. He loved this picture, in Bilgin's article. I looked at the picture. I did not love this picture. I was so surprised - well, not surprised, but I was so sad to see cloud native being defined just as an architectural pattern that was like microservices, but more. That it was all about how you arrange your smart proxy in your microservices and what you did with Kubernetes. It's not wrong, but it doesn't match up with my mental model for what makes something cloud native at all. I've written so many applications that I think are cloud native, and relatively few of them were microservices, because I wasn't Netflix. For the problem I was solving, I didn't need microservices. Yet, I was sure those applications were cloud native. They were born on the cloud. They did a lot of things that I think cloud native applications do.

I reached out to Daniel, and I said I think I might write a talk about this tweet, because I don't really like this definition of cloud native. He said, yes, a lot of people had a problem with the diagram: that communication, they felt maybe it should be gRPC, or maybe it shouldn't be gRPC. I had to go back to Daniel and say, "My problem was a little bit deeper than that. It wasn't just whether something was gRPC. It was, I don't even think defining microservices as a form of cloud native, or cloud native as a form of microservices, I just don't think that makes sense."

I start by saying I don't like this picture, but of course, this is a really good article. Even though what made me unhappy was it talking so much about microservices, it was an article about microservices. Microservices was the title of the article. Of course, Bilgin is allowed to talk about microservices in this article. Of course, it's not just Bilgin, a lot of people put microservices and cloud native as almost the same thing. Even the Cloud Native Computing Foundation, if you look at their website, I went away and I looked at their website, and they say cloud native computing uses a software stack to deploy applications as microservices. That's their definition of cloud native: microservices with some containers, and then you dynamically orchestrate it.

Then that puts me in this really awkward position, where I'm saying, "I'm Holly. My colleague, Bilgin, who is super smart, who has written a book about this stuff, he doesn't know much. He's wrong." InfoQ, that article, wrong. The Cloud Native Computing Foundation, what do they know about cloud native? They're wrong about cloud native. Then at that point I start to look around and say, why does everybody but me have this opinion? Also, I've just insulted InfoQ who are my host, so at some point this video is going to end. I'm going to be hastily taken away to the side, because I'm just so at odds with the consensus in our industry.

Actually, I think I'm maybe not quite that much at odds. Because if you look at the definition on the CNCF page this year, now, they don't actually define cloud native on their front page. If you go to the FAQ, they don't define it there either. They refer you to another page. That other page is on Git, which tells you that maybe it's something that they think about a lot, and maybe it gets changed a lot. Sure enough, you can see there's 18 contributors to this page. If you look at the history, you can see those 18 contributors have been busy. They've been changing it. Some of those updates are things like translations, but a lot of them are changes to the definition as well. If you look at the original commit to this definition, even that original commit to Git was based on 11 drafts that the TOC had done in a Google document. They were thinking about this really hard, and I imagine having some slightly tense meetings about, what does it really mean?

That's actually pretty consistent with the rest of us. If you ask 10 people to define cloud native, they'll have a definition, but they'll all be different definitions. Some people like the definition that it was born on the cloud. That was the original definition back in 2011. Some people really define it just as microservices. Some people define it even more tightly as Kubernetes is cloud native and cloud native is Kubernetes. Some people talk about it as DevOps. Of course, DevOps is so much older than cloud native as a term. A lot of those principles end up being quite similar between the two. Sometimes, we actually just say cloud native, when we're just talking about software that was written in 2020. It's modern. It's nice. It's got good quality, so it must be cloud native.

Sometimes I think we get so used to saying cloud native that we forget we're actually allowed to say just cloud, and so we just always say cloud native when we mean cloud. Sometimes what we mean is idempotent. We don't use that definition very often, because when you say idempotent to people they go, "Idem what?" What we really mean by idempotent is that something is rerunnable. It may be moving from host to host in the cloud, and so it better behave well when it goes down and goes up and moves around, which old big software just did not do at all. It didn't cope with that.
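The rerunnable behavior described above can be sketched in a few lines. This is a minimal, hypothetical illustration (none of these names come from the talk): the handler records which requests it has already applied, so the cloud can kill, restart, and redeliver without double-applying the effect. In a real system the bookkeeping would live in a durable store, not in process memory.

```python
# A rerunnable (idempotent) handler: invoking it again after a crash or a
# redelivery does not apply the effect twice. All names are hypothetical;
# in practice 'seen' would be a durable store, not an in-memory set.
seen = set()
balances = {"acct-1": 100}

def apply_credit(request_id, account, amount):
    """Credit an account exactly once per request_id, however often called."""
    if request_id in seen:        # already applied: safe to return as-is
        return balances[account]
    balances[account] += amount
    seen.add(request_id)
    return balances[account]

# The platform may restart the process and redeliver the same request:
apply_credit("req-42", "acct-1", 10)
apply_credit("req-42", "acct-1", 10)  # no double credit: balance stays 110
```

Keying the operation on a request identifier, rather than hoping it runs exactly once, is what lets software behave well when it "goes down and goes up and moves around."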

CNCF Cloud Native Definition v1.0

When you've got all these different definitions of cloud native, then the Cloud Native Computing Foundation have to sift between them and come up with something. In 2019, their definition was really about the software stack. It was about microservices and containers. Now I think it's changed to be a bit less technologically oriented. It's talking about microservices. It adds a few other things in as well like immutable infrastructure, service meshes. What it says about microservices is not that microservices is cloud native. It's that microservices exemplifies cloud native, which I like, because I think it gives a bit more flexibility for having just one service, but it's still cloud native. That makes me a bit less uneasy.

Why?

I think with the modern definition, but also the older definition, credit to the Cloud Native Computing Foundation, because they really did also focus on the why. With the older definition, they said we do cloud native because we want to build great products faster, which is just, yes, totally. In the newer one, they've qualified it a bit more. It's not just about going fast, no matter what. It's allowing developers to make high-impact changes frequently and predictably, with minimal toil, which is totally a definition. I think we all want that, don't we? A lot of practices that we do are driving towards that. Cloud native can really help with that.

What Problem Are We Trying To Solve?

With that definition, what I really like is it gets back to not ‘what technology should we be using’, but ‘what problem are we trying to solve?’ With why cloud native and what problem are we trying to solve, I find it really helpful to actually step back and think, why were we even using the cloud in the first place? Now I'm so used to cloud that I don't even really remember why we went to the cloud. If you look back, you can think, what were things like when we didn't have the cloud? There were really big cost problems, because we always had to provision everything for the maximum capacity we might need. If there were Black Friday rushes or anything like that, then we had to have this hardware sat around doing nothing for the rest of the year. The cloud allowed us to avoid that cost. The way it allowed us to avoid it was because it could be elastic, so we could scale up, and then we could scale down again. That gave us really big cost savings.

We noticed pretty quickly, that the cloud, it didn't just give us the cost savings. It gave us an organizational benefit, which is that we could deliver software way faster on the cloud, than we could normally, because we didn't have to be doing things like sending our software to Ireland to be printed onto CDs and then mailed out to our customers. We could just keep pushing stuff out to our customers as fast as we could build it, which was so good and made such a difference to our industry. We're seeing some new things as well, new benefits of the cloud around some hardware. You just wouldn't want it in your data center. If you're doing AI, for example, and you want a whole bunch of GPUs, you could buy them, but they're really expensive, so you'd probably rather rent them. If you want to do something even more exotic like quantum computing, you could buy a quantum computer and put it in your data center, but there's not many of them out there in the world. IBM was the first to make quantum computers available on the cloud. You can still just rock up to the website, and you run computations on the real quantum computer on the cloud. That's cool, because a quantum computer has to be kept at about 0.4 Kelvin, which is colder than the space between the stars. You really don't want that in your data center unless you have a passion for refrigeration.

Of course, the reason that we went to cloud was not microservices. I think we sometimes forget that, and we imagine that that was the problem that we were trying to solve, but no. I think with microservices and with the cloud as well, there is a little bit of the reason we do it is because we can't imagine not doing it. We look around at real people who are doing a great job and they're doing it, so we just assume that we need to do it too. It's this wishful mimicry in our choice of architecture. I think that this causes problems, because if we don't really know whether we're trying to get a quantum computer, or whether we're trying to get microservices for their own sake, or if we're trying to save costs, then that means we can't actually get a good plan, and we might end up with the wrong thing. We were hoping to deliver faster, but then we only ended up with cost savings because both us and our stakeholders said cloud native, but we meant totally different things. Or, we might have something that delivers really fast, but our stakeholders were hoping for microservices because of the wishful mimicry, and when there are no microservices, they're unhappy. We just need to get that clarification.

Microservices Envy

Of course, with microservices in particular, I feel like there's this microservices envy, where a lot of companies that we talk to want microservices. Microservices aren't a goal in themselves. They're a means to get something else, and so we have to keep resetting to, what problem are we really trying to solve? I talked to a bank a while ago, and they started the conversation by saying that they were really concerned with their speed of development. They couldn't get new features out to their customers fast enough. They had this big COBOL estate that was dragging them down, so they needed to go to microservices. Then they added that their release board only meets twice a year. If your release board is only meeting twice a year, it doesn't matter how many microservices you have, your release cadence is going to be twice a year. You may as well stay with the COBOL, because the COBOL isn't the problem.

There are good reasons not to go to microservices if you're not going to get that loose coupling and that speed, because a microservices architecture, if you get it wrong, is a distributed monolith. A distributed monolith is a thing of fear. You really don't want that. Because with a distributed monolith, you get all that coupling, but you don't have the compile-time checking that you get with a classic monolith. When one function calls another one, you don't even get that guaranteed function execution. There's all sorts of problems that you have to solve. For what? Because of that lack of compile-time checking, and because the microservices don't actually guarantee the decoupling, you get this pattern, which is just cloud native spaghetti. It's even worse than normal spaghetti.

I visited a client and they called me in because they were using microservices. Every time they changed one microservice, another one broke. The problem was that they were distributed, but they were not decoupled. They had a real problem with their domain modeling. Each microservice was working in the same domain and so it had the same object model. It was a big object model. They didn't want to be coupled by having a dependency on common code. Instead, they cut and pasted the common code across all the microservices, and that meant any time anybody changed a field name, everything broke. The problem really was about the domain. It should be that each microservice has its own domain. What they had was a common domain across all the microservices, so they didn't get the loose coupling.
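The fix described above, each microservice owning its own narrow domain model rather than a copy-pasted shared one, can be sketched as follows. This is an illustrative example with hypothetical service and field names: each service translates incoming messages into its own small model at the boundary, so renaming a field it does not care about breaks nothing.

```python
# Sketch: each service keeps its own narrow view of "customer" and
# translates at the boundary, instead of sharing one big object model.
# Service and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class BillingCustomer:
    """The billing service's own model: only the fields billing needs."""
    customer_id: str
    payment_method: str

@dataclass
class ShippingCustomer:
    """The shipping service's own, independent model."""
    customer_id: str
    address: str

def to_billing(payload: dict) -> BillingCustomer:
    """Translate an incoming message into billing's model. Fields billing
    ignores (address, loyalty_tier, ...) can change without breaking it."""
    return BillingCustomer(customer_id=payload["customer_id"],
                           payment_method=payload["payment_method"])

msg = {"customer_id": "c-1", "payment_method": "card",
       "address": "10 High Street", "loyalty_tier": "gold"}
customer = to_billing(msg)
```

With this shape, a change to a field in one service's domain stays local, which is the loose coupling the copy-pasted common model failed to deliver.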

The Mars Explorer

I heard about this story, and as soon as I heard about it, I thought, "It is microservices." This is the Mars Explorer. Here's what it looks like. Its mission was to go to Mars and orbit around it. It had a rather sad fate, so instead of orbiting around Mars, it crashed into Mars. The problem was that it was distributed, and there was a coupling. There was a little bit of a misunderstanding about that coupling. It had a component that was in space, and it had a component on the ground. The component in space used metric units. The component on the ground used imperial units. Those two aren't the same. They looked close enough that nobody noticed the problem. This distributing meant that the type safety and everything was lost, and they had a crash, and they lost the explorer.
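The metric/imperial mismatch above is exactly the kind of error that explicit unit types guard against. A minimal sketch, with hypothetical names: numbers are converted into a single unit type at the boundary, and the rest of the system refuses to accept a bare, unitless number.

```python
# Sketch: make the unit part of the type, so a metric component and an
# imperial component cannot silently exchange bare numbers.
from dataclasses import dataclass

LBF_TO_NEWTONS = 4.448222  # 1 pound-force in newtons

@dataclass(frozen=True)
class Newtons:
    """A force value that carries its unit in the type."""
    value: float

def from_pound_force(lbf: float) -> Newtons:
    """The imperial-side component converts once, at the boundary."""
    return Newtons(lbf * LBF_TO_NEWTONS)

def apply_thruster_impulse(force: Newtons) -> float:
    # The metric-side component only ever accepts Newtons, never a float.
    if not isinstance(force, Newtons):
        raise TypeError("force must carry a unit (Newtons)")
    return force.value

apply_thruster_impulse(from_pound_force(10.0))  # fine: converted at the edge
# apply_thruster_impulse(10.0)  # rejected: a bare number has no unit
```

When two components are distributed, this kind of explicit contract has to stand in for the compile-time checking that a single codebase would have given for free.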

Even if you can get over that coupling, there's another problem with microservices, which comes at ops time, because you have a lot of containers and you have to manage them all. You need to then start looking into things like your SRE and your site reliability engineering, and all of those practices to be able to cope with it. Because to take advantage of what the microservices are giving you, you need to release often. To be able to cope with releasing often, your releases need to be very boring.

How to Brick a Space Probe

Releases can be quite bad if you don't get things right. There's a great story about the Russian Space Agency in the '80s. They had this space probe called Phobos. They had a deployment process to it. Of course, they took the code and beamed it to the space probe. A hapless engineer had an update, and the automated check system had a functional problem in it. He said, "I don't need these automated checks. I'll just bypass them and push straight to production." Then they had a problem, which is that the space probe has got these fins, and they rotate, and he stopped the software that had the fins rotating. The purpose of the fins was to catch solar energy and recharge the probe, which meant that it looked like it was working until two days later it ran out of battery. Then they couldn't fix it at all.

This operational problem, it's potentially quite bad unless you've got the automation in place. When you have containers, instead of having one space probe to manage, you've got 200 space probes to manage. I think there's this mark of pride that we say, "I've got so many microservices. I've got so many containers to manage." That does show a lot of skill, but you may not be spending your time on the best things. It's not a competition to see how many containers you have. In particular, if you have a lot of containers, but you don't have the automation, you have really bad problems. If your tests aren’t automated, then what you're basically saying is we don't know if our code works. We see a lot of problems as well because if you don't know if your code works, then even if you have a continuous integration and deployment system, there's no way you are going to be actually doing those.

CI/CD

A lot of companies, they'll say to me, we have a CI/CD. I'm like, a CI/CD? That's a verb. It's something you do. You don't just buy the tool. If you have a CI system, but then you're only merging next week, it's not continuous integration. If you have a continuous deployment system, but you only release every six months, that's not continuous deployment. That's not continuous delivery. I think, you keep saying this word continuous, but I do not think it means what you think it means, because we're not actually continuous. We just have the tools. The reason we're not continuous usually is because releasing is way too scary. Then I ask, why can't we release things? What's the barrier to these more frequent deploys? Often, even though we have microservices, and they should be independently deployable, it is way too scary to release them independently because we have these test challenges. We're afraid that if we change one, everything will break. Then we have a huge pipeline for all of the microservices to enforce that they deploy at the same time, which again puts us back to, it's just a distributed monolith.

Feedback Is Good Engineering and Good Business

We see other problems with releases, which have to do with half-baked features. Of course, you don't want to be exposing your customers to garbage code, but you can still have that skeleton that allows you to go faster and have that continuous deployment. If you have this microservices architecture, you want to be getting stuff out. If you're driving a car like this, that's no way to drive a car. You need that feedback from the field. You need that feedback from your deploy pipelines. That's just good engineering. It's good business too. There are things that you can do to be able to release safely even if the feature isn't complete. You can leave it not wired in. You can use feature flags. You can expose it, but only in a really safe way. You can have the canary deploys. You can have the A/B testing. All of that is helping you get that actual continuousness in your CI/CD.
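A feature flag, one of the techniques above, looks roughly like this in miniature. The flag name and lookup mechanism here are illustrative: the half-baked code ships with every release but stays dark until someone flips the flag, so releasing stays boring.

```python
# Sketch of a feature flag: incomplete code is deployed but not exercised.
# Flag names are hypothetical; in practice flags live in a config service
# or a flag-management system, not an in-process dict.
flags = {"new-checkout": False}

def old_checkout(cart):
    return {"total": sum(cart), "flow": "old"}

def new_checkout(cart):
    # Still being built: deployed to production, but dark behind the flag.
    return {"total": sum(cart), "flow": "new"}

def checkout(cart):
    if flags.get("new-checkout"):
        return new_checkout(cart)  # only runs once the flag is flipped
    return old_checkout(cart)      # customers keep getting this path
```

The same switch point is where a canary rollout or an A/B test hooks in: flip the flag for a small slice of traffic first, watch the feedback, then widen it.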

Cloud Governance

I do see problems around the governance of the cloud. One of them is that reluctance to release, and that heavy governance. Often, we just don't even know what's going on in the cloud. The cloud makes it so easy to provision hardware. You can just click a button and you've got the hardware. There's still a computer running it, and that hardware still has a cost. If there's nothing actually running on that hardware, that's bad. I think a symptom of cloud native often is that we lose control of what's going on. We just don't have the structures in place to help us understand what our organization is doing. That's an individual thing as well. When I was first learning Kubernetes, I created a cluster. Then I got sidetracked by other things. Then I went on holiday, so I forgot this cluster for 2 months. It was £1000 a month, this cluster. I was there having a holiday and money was just being burned. That's not so good.

Because of that problem, then we get another problem. I think it relates to the reluctance to release as well, which is that the cloud gives us so much flexibility, and it gives us so much speed. We lock it down and we make sure that nothing cloudy is going to happen in it. A customer came to IBM a while ago with a complaint that we'd sold them this provisioning software that would give them 10-minute provision-time. This was a few years ago. They were, "Yes, we want that." Then what they were experiencing was that it was taking them three months to provision things. IBM, we investigated to try and figure out what was going on. When we looked at it, we realized that they'd wrapped this beautiful 10-minute provision software in a governance process. It had 84 steps in that pre-approval process. No wonder it took three months to provision things. Because of the organizational structures that we put in place around it, we just completely lost the benefit of the cloud. No matter how many containers we had, we weren't behaving like we were cloud native. I see that a lot, that you have the cloud, and it's so beautiful. It gives you so much flexibility and so much speed, but then we put it in a cage, and we just put all this old style governance. That's just not going to work. It's not going to give the benefits of cloud native that we're hoping for.


Recorded at:

Feb 05, 2021


Community comments

  • Cloud Native Is about Culture, Not Containers

    by Ilya Lapitan,


    Great talk! Thank you so much for sharing!
