
Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them


46:28

Summary

Holly Cummins explains how utilization and elasticity relate to sustainability. She also introduces a range of practical techniques, including absurdly-simple-automation, LightSwitchOps, and FinOps.

Bio

Holly Cummins is a Senior Principal Software Engineer on the Red Hat Quarkus team and a Java Champion. Over her career, Holly has been a full-stack JavaScript developer, a WebSphere Liberty build architect, a client-facing consultant, a JVM performance engineer, and an innovation leader.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cummins: I'm Holly Cummins. I'm a Senior Principal Engineer at Red Hat, where I helped build Quarkus. Quarkus is a super sustainable runtime. I spend a lot of time measuring and optimizing the carbon footprint of Quarkus. What I want to talk about is something I've seen a lot of in my time as a consultant. To be honest, I still see it a depressing amount today when I speak to clients. What is this thing? It's the zombie menace. These aren't Last of Us people zombies, these are computer zombies, but they're still quite problematic. It's something that once you start to look for them, you see everywhere. As an example, I saw a Twitter thread a while ago, and someone was saying he gets charged $2 a month from AWS and he's too scared to turn it off. He's too lazy to figure out what it is that's causing the bandwidth. He gets emails from an 8-year-old client WordPress install, he's pretty sure that client isn't using it anymore, that they've gone off and they've gone to something on-prem. He has no idea how to access this thing to disable it, but it still reliably emails him once a month. In this case, he did go on a bit of a tidy-up. He turned off a Digital Ocean droplet that he'd created in 2013, which was 9 years before. He deleted a snapshot from 2014, 8 years ago. He'd been paying 74 cents a month for this snapshot for 7 years. These things, they all add up and they are absolutely everywhere.

I saw another story. It was a job interview, and he was touring the lab, and he saw some servers, and they looked quite old. He was intrigued. He said, what do these ancient looking servers do? He got a confident answer, one is a backup for the other. Yes, but what do they do? At that point, it got a little bit more awkward, because it turned out, no one had known for decades what these servers did. This was a very large company, and these things fall through the cracks in those kinds of companies. It happens everywhere and at every scale. It's easy to judge, but I've completely done this as well. When I learned Kubernetes, as one does, I went and I created a cluster to play about on. Then, unfortunately, I had too much work in progress, and so I forgot this cluster for a couple of months. When I remembered it and went back to it, I realized that I'd done a fairly well spec'd cluster, and it was $1,000 a month for this cluster. The worst bit is, that was a while ago, before I really started thinking about sustainability. Now I think about sustainability all the time. Yet, guess what happened? While I was literally preparing this talk about zombies, I was working on one of the other things on my to-do list, but I have too much work in progress. For Quarkus, we run CI/CD on various hardware, but Mac M1 is less well supported by GitHub Actions, so I was setting up a self-hosted runner, and it was hard to get working, and so it's disabled. It's not actually running any builds. It's just got an instance out there running, costing money. This thing, it's not a huge amount of money. It's $159 a month, but it is unused, which is so sad, and so wasteful.

How Big a Problem Are Zombie Instances?

How big a problem is this? It's one thing to see anecdotes, but really the same as any performance optimization, you have to be guided by the data. For zombies, this is tricky, because, by definition, zombies are invisible, which is the whole reason that they're a problem. The Anthesis Group have done a lot of research on this, so we can use their results, which is great. In 2015, they did a survey and they looked at around 4000 servers, and they found, depressingly, of those 4000 servers around 30% were doing no useful work, not a sausage, no traffic in and out, nothing. They repeated the study in 2017, and they found pretty similar results. In 2017, they found 25% of the 16,000 servers they looked at, were doing no useful work. How useless is useless here? Useless is very useless. Their definition of a zombie was something that hadn't delivered any information, or any computer services at all, for 6 months or more. These are what they call comatose servers. They are not doing anything for anybody. They haven't for a very long time.

There's a separate category, which is underutilized servers. These are zombies that are showing some signs of life, they're limping along, and maybe active for less than 5% of the time, which is still incredibly inactive. If you think about you or me, if I sat on the sofa for 95% of the time doing nothing, everybody around me including my employer would be very mad. They would try and optimize that. Yet we tolerate it in our servers. The NRDC pointed out that much of the energy consumed by U.S. data centers, is used to power more than 12 million servers that do little or no work most of the time. When you think about it in these terms, in the context of the climate crises, and also now increasingly financial crises, this just seems absolutely bonkers. It seems so wasteful and so unnecessary. When you slice the stats another way, if you look at the average server, not the zombie outliers, but the average server is running at around 12% to 18% of its capacity, and it's using maybe 30% to 60% of its maximum power. This ratio is not good. When a server is underutilized, it does much less work, and it uses just a little bit less power. These underutilized servers, again, make up a really big proportion. If we look at the ones who were active less than 5% of the time, that was 29%. That's about the same number as aren't active at all. It means when you add those two numbers together, two-thirds of the servers, almost, basically have very little reason to exist.
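To make the utilization-to-power ratio above concrete, here is a minimal Python sketch of the arithmetic. It uses the ranges quoted in the talk (12% to 18% of capacity, 30% to 60% of maximum power). The function name and the baseline are assumptions made for illustration: a fully loaded server is taken to do 100% of its possible work at 100% of its maximum power.

```python
def relative_energy_per_unit_of_work(utilization: float, power_fraction: float) -> float:
    """Energy spent per unit of work, relative to the same server at full load.

    Assumption for illustration: a fully loaded server does 100% of its
    possible work while drawing 100% of its maximum power (ratio 1.0).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 < power_fraction <= 1:
        raise ValueError("power_fraction must be in (0, 1]")
    return power_fraction / utilization


# Ranges quoted in the talk: 12% to 18% of capacity, 30% to 60% of max power.
best_case = relative_energy_per_unit_of_work(utilization=0.18, power_fraction=0.30)
worst_case = relative_energy_per_unit_of_work(utilization=0.12, power_fraction=0.60)

print(f"best case:  {best_case:.1f}x the energy per unit of work")
print(f"worst case: {worst_case:.1f}x the energy per unit of work")
```

Under that simplifying baseline, the average server described above spends somewhere between roughly 1.7 and 5 times as much energy per unit of work as a fully used one.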

I mentioned the financial consequences of this, and the financial consequences are significant. There was a study that was done in 2021, by a different set of authors. They were looking just at public cloud. What they found was that there was $26.6 billion, that is a phenomenal sum of money, that was wasted just in one year, by always-on cloud instances. These are things like my GitHub M1 runner, where I power it up, and I don't shut it down, even though it's not actually doing anything useful. This is bad. We're looking a lot now at the efficiency of the data center. We're looking a lot at where we get our electricity from. In some cases, if you're running in a very green region, this money and this electricity might be green electricity. No electricity is truly green. Even if it was, it's not just about the runtime costs, that hardware has what's called embodied carbon. Embodied carbon is the carbon that was emitted in manufacturing the server. It can be quite significant. For a laptop or a phone, embodied carbon is most of the impact of the device. For a server, proportionally it's less, because servers use a lot of power just in their normal life. Depending where it runs, the embodied carbon can still be quite significant. What this means is that using greener electricity maybe doesn't help as much as you'd hope. This is still something that we need to fix.

Complexity of Managing Machines

Before we can fix it, we need to really understand why is this even happening. I think it's part of a larger problem. We're getting better at this. In 2023, with all of the technologies we have available to us, managing machines is still pretty hard. I love this story. It's not actually a zombie story, it's the opposite of a zombie story. Which is that a few years ago, a missing Novell server was discovered after four years missing. This server, it's not a zombie, because it was working really hard. It was doing good work. Everybody loved this server. Everybody valued this server. It's just that nobody could physically find this server. Eventually, what they found was maintenance workers had sealed the server into a wall. This server had been bricked up, which is why nobody could physically find it. That's an extreme case of losing servers. In our heads, we lose servers all the time. In those earlier studies, where they found a quarter of servers were zombie servers, they were scratching their heads going, this is terrible. How does this happen? Why does it happen? The best answer that they could come up with was, perhaps someone forgot to turn them off.

If we dig into that a little bit in terms of the kinds of things that happen in institutions to cause this institutional forgetfulness, there's a few things. One is, sometimes a project ends, but the servers that were used for that project don't get decommissioned. Sometimes the business processes change, and so the servers that were supporting the old process aren't needed. A very common cause of this kind of waste is over-provisioning. Nobody wants to be the one who was responsible for under-provisioning the service, and so we're cautious. We provision too much. Another technical reason why this happens is that nobody wants to be the one who exposed the system to a vulnerability through insufficient isolation. Some technologies support multitenancy better than others. Even in something like a Kubernetes cluster, CRDs are shared. That means that you only have limited multitenancy. You do end up with the cluster as the unit of deployment in order to ensure proper isolation. Ultimately, what a lot of this is about is risk aversion, which is a very natural organizational phenomenon. I saw a story where they had 20 servers that they suspected were zombies. They then were able to pretty much confirm that they were zombies. Even after they got to that point, it took nearly 9 months of bureaucracy to finally get these systems shut down, because there was so much paperwork to say, are we sure it's ok to turn it off?

In the case of the underutilized servers, it can be a little bit different and often it's still risk aversion. It's a different kind of risk aversion. I visited a client a while ago, and they were telling me about how they manage their servers, and they had a whole category of jobs that were run as batch jobs on the weekend, but those servers stayed up all week. They had a whole other category of processes and servers that were only used in UK working hours, they stayed up 24/7. This is the thing that it seems like autoscaling should be able to sort out. The problem is a lot of autoscaling algorithms, it's that same risk aversion. They're optimized for availability, rather than lowering resource usage. They're eager to scale up to make sure nobody gets a request denied. They are reluctant to scale down because nobody wants to be the person who writes the autoscaling algorithm that causes the service to be unavailable.

The Green Computing Model

We can put these factors into a model. There's a couple of different green compute models out there. There's the principles of green software engineering. The Green Software Foundation have an excellent model. This is the one that I find I'm able to hold in my head, so it works for me. Then I can map it to the Green Software Foundation or the green principles. I like to think of green computing in terms of four vowels. The first two are elasticity and utilization. The next one is efficiency. Then the last one is utility. Let's look first at elasticity and utilization. Utilization is how much of a system's capacity is being used. In the ideal case, you want most of the system's capacity to be used, because that shows that you're making good use of the system. You're making good use of the embodied carbon. You haven't got a comatose server. You haven't got an underutilized server. The good case is high utilization. What happens if the load goes up? In that case, you have overutilization, which is the very bad case. That's the one that everybody desperately wants to avoid. Because we're so keen to avoid overutilization, where we can't actually service the requests coming into a system because we just don't have enough capacity, what we often end up with instead is the opposite situation where we are underutilized. We have a server that's doing barely anything. That gives a good user experience but it is very wasteful. How do we fix this? How do we get consistently high, good utilization? This is where elasticity comes in. Elasticity is a measure of how easy it is to scale a system up or down. In this case, where we have high utilization, and the load goes up, what we want to be able to do is when the load goes up, we expand the system as well. Then that means that we stay at that good utilization, rather than being overwhelmed by the demand. Then, if that load goes down, what we can do is not just watch the load go down, but shrink the system, so that we're still at that good utilization.
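The relationship between load, utilization, and elasticity described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any particular autoscaler's algorithm: the function names, the 70% target, the per-replica capacity of 100 requests per second, and the load curve are all hypothetical.

```python
import math


def replicas_needed(load: float, capacity_per_replica: float,
                    target_utilization: float = 0.7, min_replicas: int = 0) -> int:
    """Smallest replica count that keeps utilization at or below the target."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if load <= 0:
        return min_replicas  # nothing to serve: shrink all the way down
    needed = math.ceil(load / (capacity_per_replica * target_utilization))
    return max(needed, min_replicas)


def utilization(load: float, capacity_per_replica: float, replicas: int) -> float:
    """Fraction of total capacity in use; 0.0 when nothing is running."""
    if replicas == 0:
        return 0.0
    return load / (capacity_per_replica * replicas)


# Hypothetical load curve (requests per second) over a day.
for load in [0, 50, 400, 900, 400, 50, 0]:
    n = replicas_needed(load, capacity_per_replica=100)
    print(f"load={load:>4} rps -> {n} replicas, "
          f"utilization={utilization(load, 100, n):.0%}")
```

The point of the sketch is the symmetry: the same calculation that adds replicas when load rises also removes them when load falls, including down to zero when `min_replicas` allows it.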

What about efficiency? Efficiency is very important. I think it may become obvious why when we start to look at utility. I love this quote from Peter Drucker. He was a management consultant. He wasn't thinking about zombie servers, he was thinking about processes. What he said is, "There is nothing so useless as doing efficiently that which should not be done at all." Efficiency is very good. Thinking about Quarkus efficiency is part of my job. Before you make a process efficient, before you make a system efficient, there's a question you have to ask first, which is, should we even be doing this? In general, in computers, we have a utility problem, which is that a lot of the things that we do aren't necessarily valuable. That's things like running a CI job when we don't actually care about the results. Another thing that happens a lot is, if someone wants to listen to a song, a really convenient way to find that song and listen to it is to find the video on YouTube, play the video. Not watch the video, have it in the background or be off elsewhere in the house doing something else, so that video is streaming but it's not being used. You can see this is quite different. This is a problem that we're not even getting to yet. We're still looking at the zombie servers, where it's not that traffic is being generated to them but it doesn't really make people happy. It's that there's not even traffic being generated to them.

How Do We Solve the Zombie Problem?

When we think about zombies, and we think how do we want to tackle this problem, "I want to make my zombies more efficient" is not really what we want to be thinking. We don't want to make the zombies use slightly less electricity, because they're so optimized, we just want to get rid of the zombies. How do we do that? How do we solve this problem in a way that's not just nose cone polishing of slightly improving the efficiency of our zombies? The way to solve the zombie problem has two parts: detection and destruction. This sounds fun. You may be imagining a scene from The Walking Dead, and we've got our flame thrower. As you might expect, when you think about it a little bit more, it's actually not all that entertaining. There are no flame throwers involved. Instead, often what it ends up being is this process of system archaeology, where you have to do the graft. You have to sift through the systems and dig through the layers to try and understand what's valuable, and what's not.

One way that this is often done, it's a little bit of a shortcut, is what's called the scream test. The scream test is a technique where you have a server, you don't really know what it's used for. You take the plug and you pull out the plug, or the digital equivalent, and you see who screams. If nobody screams, then you know you made a good decision. This can be quite efficient. It does have some hazards. When you do the scream test, the scream is real. We're all scared of making a mistake. I heard a story recently. The CIO of a strategic outsourcing company told me this story. They did an exercise to find unused internal applications. In the internal section of their estate, they found the server and they couldn't figure out what it was being used for. No reason to keep it on, we don't want to have zombies, let's turn it off. Unfortunately, they then got a very loud scream. Even though this was the internal section, the scream came from well outside the company, it came from one of the clients because the backbone of their network had just vanished. It turned out, for some reason the backbone of their network was being hosted on this server that was labeled internal. That was a slightly embarrassing moment. There is a bright side of this. This sounds a little bit like Chaos Monkey, of we can test the resiliency of our systems by turning things off. It is a little bit like that, except instead of being Chaos Monkey, it's eco-monkey, where we turn off instances that we think are unused, and hopefully things stay up.

If, you know, you don't want to YOLO it quite that much, there are other techniques. One of them is meetings. I had a client session a while ago, and it was a UK bank, and the CIO was there and trying to figure out, trying to get a handle on his cloud estate and his cloud costs. What he did was he assembled all of the stakeholders in this meeting. He said, let's go through our estate line by line and try and figure out what these things are that I'm paying for. I don't necessarily recommend this as a technique. I think it was one of the most boring meetings I've ever been in, in my life, and everybody in the room felt the same. When we don't want to do meetings, an alternative technique that I sometimes see is the emails, "Please turn off your cloud instances." "Does anybody know what this is?" "We're using a lot of cloud, could we turn it off?" Again, it doesn't seem like the most effective technique, but it is what I see a lot in the field. Another technique I see that feels like it should be more effective than it is, is tagging. We tag our instances, but it does rely on quite a few things. It relies on people remembering to tag. Then it relies on someone manually going through and finding the things with the tag, understanding the meaning of the tag, and what can be deleted and what can't be deleted. Then, managing it. Even though there's the tags, it is still quite manual.

We don't want to be manual. We want to be automated. In the spirit of DevOps, for any problem, there is an ops for it, DesignOps, DevOps, you name it. This is a pretty hard problem, so we have a lot of opses for it. In particular, we have five ops. The first ops to help solve the zombie problem is GreenOps. GreenOps is relatively new. The Googleability of it isn't really great yet. If you try and Google GreenOps, the first thing you get is a company that doesn't really seem to be looking in this area. The second thing that you get is the Wikipedia page, and you think you're on the right track until you read the Wikipedia page. It turns out a GreenOps is a midsize trilobite. Often, when we're thinking about these GreenOps ideas, what we actually will be better off looking at for now is FinOps. FinOps is trying to make financial information automated in real time. I like to think of it as figuring out who in your company forgot to turn off their cloud. Because when you have that information, then all of a sudden, you have the tools to go hunt the zombies. There's a lot of FinOps tools out there. This is a growing area. One that I am excited by is Backstage. Backstage has a bunch of plugins. One of them that's really nice in this area is the cost insights plugin. It pushes those cost concerns to the engineering level where as engineers we have the information to make informed optimizations. In Backstage, there is a cloud carbon footprint plugin, which again, will give you some information. A lot of it's about zombies, but also about things like finding the right electricity source.

Another thing that can be really useful is AIOps. I mentioned that tags aren't great, because they're set manually, and then someone has to manually go through it. What we really want to be doing is automating those kinds of processes. There's a bunch of tools in this area. There's Densify. There's Granulate. There's Turbonomic. There's TSO Logic. One that I know is Turbonomic, because IBM bought Turbonomic a few years ago. Recently, I was talking to the CIO for the CTO of IBM. He was telling me that they had recently installed Turbonomic in their estate. You may say, IBM bought Turbonomic a few years ago, why have they only just installed it? You know how it is. There's always more things to do than you can do. He was absolutely wowed, because they just basically installed Turbonomic and didn't really do any other optimization, and they got a 21% reduction in their cost, in their footprint, in all of that. This was just reducing that bloat basically for free. Another thing that you can do, again, to be smart about it, and to try and not do these things manually, is you can look at traffic monitoring. You can find the things where there is no information going in and no information going out. You have a pretty good idea there are zombies and then you can use some tooling to manage them out of existence.
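As a rough illustration of the traffic-monitoring idea, the sketch below classifies servers using the thresholds quoted earlier in the talk: no activity for 6 months or more counts as comatose, and activity less than 5% of the time counts as underutilized. The `ServerActivity` record, the server names, and the 182-day approximation of "6 months" are assumptions made for the example; real monitoring data will look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

COMATOSE_AFTER = timedelta(days=182)   # approximates "6 months or more" of silence
UNDERUTILIZED_BELOW = 0.05             # active less than 5% of the time


@dataclass
class ServerActivity:
    """Hypothetical record built from whatever traffic monitoring is available."""
    name: str
    last_traffic_at: Optional[datetime]  # None means no traffic was ever observed
    active_fraction: float               # share of the observation window with traffic


def classify(server: ServerActivity, now: datetime) -> str:
    """Return 'comatose', 'underutilized', or 'active'."""
    if server.last_traffic_at is None or now - server.last_traffic_at >= COMATOSE_AFTER:
        return "comatose"
    if server.active_fraction < UNDERUTILIZED_BELOW:
        return "underutilized"
    return "active"


now = datetime(2023, 12, 13)
fleet = [
    ServerActivity("forgotten-wordpress", datetime(2022, 1, 5), 0.0),
    ServerActivity("weekend-batch", datetime(2023, 12, 10), 0.03),
    ServerActivity("checkout-api", datetime(2023, 12, 13), 0.85),
    ServerActivity("never-used", None, 0.0),
]
for server in fleet:
    print(f"{server.name}: {classify(server, now)}")
```

A report like this only identifies candidates. As the scream test story shows, a server with no visible traffic on one interface may still matter, so the output is a starting point for investigation rather than a deletion list.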

Knowing Is Only Half the Battle

Even with all these tools, I think identifying the zombies is only half the battle. There's a phenomenon called the IKEA effect. The IKEA effect came when they noticed that people who had built their IKEA furniture, really were very fond of their IKEA furniture, much more so than furniture that they just bought from the shop. What they realized was that labor leads to love. When you work on something, you get attached to it. That absolutely happens with our servers. Going back to my MacStadium instance, that I was really reluctant to shut down, I had to work quite hard to configure that. I was only doing it once so it wasn't worth automating. I didn't want to shut it down. We see this everywhere. People are reluctant to shut down clusters in case they need them later, in case their effort will have to be duplicated.

I started thinking about this when I was looking at Quarkus, because Quarkus can run as a natively compiled binary or on the JVM. As a natively compiled binary, it is ridiculously fast. I benchmarked it against light bulbs, it is faster than a modern LED light bulb. That's not on its own enough for elasticity. That startup time helps a lot but it's not enough. When you think about a light switch, that really is ultimate elasticity: we turn it off, we turn it on. We never have any fear of like, I'm not going to turn this light off, because what if I need it later. If I need it later, I just go and I push the switch on the wall, and I get it back. We would never ever say, I'm not going to switch this light off because I'm not sure if it's going to come back on. With our servers, it happens all the time, that we don't switch the server off, because we don't know if it's going to come back up. Really, we should be aiming for this experience of a light switch. When you design your systems, it should be that turning it off and on again, it has to be fast, obviously. The native solves that. It has to actually work. Your system has to behave the same way after the restart, as it did before the restart. That's idempotence. Also, it just has to come up, which is the resiliency.

I've started to think about what I like to call LightSwitchOps. LightSwitchOps is moving towards that light switch like experience, but for servers, that we can automate it, or we can do it manually. Either way, we can flick it on and off without fear, for really small reasons, like I'm leaving the room. These can be quite simple scripts, and it can be quite effective. I heard a story from Jeff Smith. He was saying that they used to leave their applications running all the time, same as most of them. They just scripted turning them off at night, and they reduced their cloud bill by 30%. That is an amazing savings. I heard another story from someone who was working in IT at a school, and she saved her school €12,000 by writing a script to turn the computers off overnight. These are huge savings for relatively simple things.
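The kind of simple script described above might look something like the following Python sketch. It only decides whether a system should be on at a given moment. Actually stopping and starting instances depends on the provider and is left out. The working-hours schedule (Monday to Friday, 08:00 to 18:00) and the function names are assumptions made for illustration.

```python
from datetime import datetime, time

WORK_DAYS = {0, 1, 2, 3, 4}   # Monday to Friday, as returned by datetime.weekday()
WORK_START = time(8, 0)       # hypothetical working hours
WORK_END = time(18, 0)


def should_be_on(now: datetime) -> bool:
    """True if `now` falls inside the working-hours schedule."""
    return now.weekday() in WORK_DAYS and WORK_START <= now.time() < WORK_END


def weekly_on_fraction() -> float:
    """Share of the week's hours during which the system stays on."""
    hours_on = len(WORK_DAYS) * (WORK_END.hour - WORK_START.hour)
    return hours_on / (7 * 24)


print(should_be_on(datetime(2023, 12, 13, 10, 0)))   # a Wednesday morning
print(should_be_on(datetime(2023, 12, 16, 10, 0)))   # a Saturday morning
print(f"on for {weekly_on_fraction():.0%} of the week")
```

A scheduled job could call `should_be_on` every few minutes and start or stop the system to match. The actual saving on a bill will depend on what is charged while an instance is stopped, so the fraction printed here is an upper bound on compute hours saved, not a billing estimate.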

Of course, before you can turn it off and turn it on again, you do need to have the confidence that you can recreate the system. This is where GitOps comes in. By GitOps, I don't necessarily mean a particular product. Although I know Weaveworks, for example, are looking quite a lot at some of these considerations. What I mean really, is just that infrastructure as code. What infrastructure as code allows us to do is to have disposable servers. We can spin it down, and then we can spin our server back up. Then we can do it again. We can spin it down, or we can spin it back up. We can do that as many times as we like, as trivially as we like. It could be something like a kubectl apply. It could be an ansible-playbook. Any of these technologies that allow you to recreate the system fearlessly. That can have a lot of benefits beyond just turning the system off at night or turning the cluster off at night. A lot of us might turn a few pods off at night, but turning the whole cluster off at night, we feel a bit more nervous about. The more we can reduce snowflakes, that allows us to reduce the redundancy in our systems.

A lot of us, the kinds of workloads that we do, we need to run them in multiple regions, or we think we do, because we need to have that failover. This is a really expensive way of achieving redundancy. I was thinking about that because I was talking to a client, and it was an application that really didn't need that 24/7 availability. He wanted to make sure that he didn't have embarrassing outages. He was spec'ing his system with two regions. I thought, this is going to cost a lot. It's really not necessary, because the way we have this system architected is we have the infrastructure as code. In the event of an outage, we can spin you up a new cluster in a new region really quite quickly. Obviously, before you make these kinds of things you need to test them. If you've tested them, and if you know you can get it back up very quickly, you may be able to really reduce your redundancy and therefore reduce your zombies. One of the interesting things about this as well is a lot of these techniques they've been getting fancier. We've got our GitOps. We've got our infrastructure as code. We've got our AIOps. Sometimes zombie reduction can be really quite simple.

I heard another story, this was just in 2013, the earlier days of the cloud. It was a bank, and they had a provisioning system. What they found was, on top of their provisioning system, they implemented a lease system. When you provision something, by default, it lived for two weeks. At the end of the two weeks, it would get shut off. Obviously, it would email you first and that kind of thing. You had the option to renew your lease. Just having that default behavior, be for short-lived servers, meant they reduced their CPU usage by 50%, which is absolutely huge. This system, I think, is something that we should be looking to implement much more often, because just having that limited time and that deadline, focuses the mind, and it prevents things being forgotten. That's all stuff that does help.
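A lease system along these lines can be sketched in Python as follows. The two-week default comes from the story above. Everything else is hypothetical, including the `Lease` record, the two-day warning window, and the `sweep` function, and the code does not represent how the bank's system actually worked.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

DEFAULT_LEASE = timedelta(weeks=2)    # default lifetime described in the talk
WARNING_WINDOW = timedelta(days=2)    # hypothetical: how early to email the owner


@dataclass
class Lease:
    instance_id: str
    owner_email: str
    expires_at: datetime

    def renew(self, now: datetime, duration: timedelta = DEFAULT_LEASE) -> None:
        """Extend the lease so that it runs for `duration` from `now`."""
        self.expires_at = now + duration


def new_lease(instance_id: str, owner_email: str, now: datetime) -> Lease:
    """Every newly provisioned instance starts with the default lease."""
    return Lease(instance_id, owner_email, now + DEFAULT_LEASE)


def sweep(leases: List[Lease], now: datetime) -> Tuple[List[Lease], List[Lease]]:
    """Split leases into (owners to warn, instances to shut down) as of `now`."""
    to_warn = [l for l in leases if now < l.expires_at <= now + WARNING_WINDOW]
    to_shut_down = [l for l in leases if l.expires_at <= now]
    return to_warn, to_shut_down


start = datetime(2023, 12, 1)
leases = [
    new_lease("vm-a", "a@example.com", start),
    new_lease("vm-b", "b@example.com", start),
]
leases[1].renew(now=start + timedelta(days=13))      # vm-b's owner still needs it

warn, shut_down = sweep(leases, now=start + timedelta(days=13))
print([l.instance_id for l in warn], [l.instance_id for l in shut_down])
```

The design choice that matters is the default: doing nothing leads to shutdown, so keeping a server alive requires someone to actively say it is still needed.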

Things That Don't Help

I want to talk a little bit about things that maybe don't help. Maybe they do. Some of these, we're still deciding. The cloud moves us much more towards disposable infrastructure. In that way, it's good. It moves us away from snowflakes. One problem the cloud has that physical servers don't have is out of sight, out of mind. In general, like a lot of these other autoscaling algorithms, and like a lot of these elasticity algorithms, the cloud makes it delightfully easy to provision more hardware, but it doesn't necessarily give you that much support for getting rid of that hardware. Going back a technology layer, another thing that maybe doesn't help is virtualization. Again, virtualization intuitively seems like it should help because we're increasing the multitenancy of our servers. We're increasing the density. We're sharing these expensive physical resources. In a 2019 study, they found very similar results with virtual servers and with physical servers. In 2019, they were looking back at the 2017 data, and they were slicing it a different way. You remember in that study that 25% of the servers were doing no work. When they looked at it by virtual servers instead, 30% of the virtual servers were doing no useful work, and 50% of them were active less than 5% of the time. If you take those two together, for the virtual servers, only 20% of them were really utilized in any meaningful way. Because virtual servers are expensive, they're expensive in terms of license costs, and they're expensive as well in terms of their overhead. That does suggest that virtualization again makes it maybe easy to stack these things up, but with virtualization, we still need to remember to turn the virtual machine off. Just because it's virtual doesn't mean we can leave it running forever. Part of the problem with virtual servers is that they do have this high overhead, so they support much more multitenancy. There's an issue there.

What about serverless? Serverless I have some questions about, because modernizing to serverless is a big lift, so not all of us can get to serverless. Serverless may not suit latency-sensitive workloads. There's another problem, which is, with serverless, certainly, historically, when you ran a serverless workload, in order to solve the cold start problem, what your provider was doing behind the scenes, was keeping the instance running but just not billing you. Another problem with serverless is that these systems can have pretty high overhead, so there's a control plane or the equivalent for your architecture. If you're running a low workload, that control plane is perhaps proportionally more of the system. We need to be aware of this. There's some quite interesting research about serverless that if you're interested in it, I encourage you to go read up on. It's definitely not a black and white thing. What this paper here found was that the virtualization overheads meant that compared to a plain old just HTTP server request, each function request used 30 times more energy. Compared to Docker, Docker was still 10 times more energy than a plain HTTP server. This is traded off against the fact that we can support much more density, that we have really wonderful scaling qualities. Another thing to think about with serverless is, are all parts of the system elastic? If we're scaling our instance down, but it's running in a serverless container that is quite fixed in size, maybe we're not achieving what we hoped.

There's a couple of things that definitely don't help. One of these, totally counterintuitively, is prevention. It seems like a lot of organizations will try and prevent these zombie servers by, you know, shutting the barn door before the horse has left. That seems like it has to be a good idea. The problem is the way we do the prevention. Usually, the prevention is heavy governance. You have to fill in six forms. You have to get five levels of management to bless your instance requisition. Then this comes back to the IKEA effect. If a server was hard to get, people will not surrender it. We need to make sure that we have a frictionless acquisition process, so that we also have a frictionless de-acquisition process.

Internet Background Noise

In what I've been talking about, I've been focusing on zombies and servers. Of course, zombies aren't just servers, or certainly the ghost in the machine, the phantom, the waste, is not just servers, we also have to think about data. There is an enormous amount of completely unused and useless data in the world. The challenge is finding it. There is also an enormous amount of completely useless and unused network traffic in the world. By this I don't mean things like watching the video, I mean things where it actually never gets to a destination, it just shuffles around the internet. These are zombie packets. They're like servers, but they're just a little packet. Together, they make up what's called internet background noise, which is an absolutely fascinating phenomenon. The best numbers I could find were from 2010. I can imagine it's gone up quite a lot since then. There's 5.5 gigabits per second of traffic going around the internet, and it's data packets which are addressed to IP addresses or to ports, where there's no network device set up to receive them so they just can never reach their destination. With all of these, there's a lot of problems. I think that's a good thing, because these unsolved problems, these are interesting technological challenges, which means it's opportunities for us. These opportunities are potentially very good opportunities, because turning things off, as well as being ecologically responsible, it can save a lot of money. There's this double win of the financial benefit and the climate benefit.

Key Takeaways

If you're a user, really try to up the utilization of your systems, aim for elasticity, try to limit your sprawl, your kubesprawl, all your other sprawls, know what you're using. Get rid of those zombies by turning things off. For us as users to do that, we need the help from the tool creators. If you're making tools, really try and support that better utilization. Try and build in multitenancy. Try and build in elasticity. Then as well, build in some of that disposability through the infrastructure as code, building that visibility of where the things are actually useful. Then, with these together, then the GreenOps, the FinOps, the AIOps, the GitOps, the LightSwitchOps, that's a lot of ops, but they have the power with us to make a really big difference.


Recorded at:

Dec 13, 2023

Holly Cummins
