Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them

Wait, zombies? Really? Zombies are servers that aren’t doing useful work. They’re everywhere, costing money, eating electricity, and belching carbon. And they’re useless! So how do we get rid of them?

At QCon London, Holly Cummins, Quarkus senior principal software engineer at Red Hat, talked about how utilization and elasticity relate to sustainability. In addition, she introduced a range of practical zombie-hunting techniques, including absurdly simple automation, LightSwitchOps, and FinOps.

Cummins began by explaining zombies through a few stories of running servers that had been forgotten. They ranged from a Kubernetes cluster left running for two months, which led to over a thousand dollars in costs, to a running instance that cost 160 dollars. The problem of zombies, servers that have been forgotten about, is hard to measure because you don't see them. However, researchers are looking into the issue. Their definition of a zombie was:

"They haven't delivered any information or computing service for six months or more."

These are what the researchers call comatose servers, doing nothing for no one. According to Cummins, there is also another category: "underutilized" servers. As an example, she noted that much of the energy consumed by U.S. data centers is used to power over 12 million servers that do little or no work most of the time. Furthermore, she mentioned that the average server uses 12 to 18% of its capacity while drawing 30 to 60% of its maximum power. Finally, she cited a few other studies, including one from 2021 which found that 26.6 billion dollars was wasted in the public cloud in one year on always-on cloud resources.
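The two categories can be expressed as a small classification rule. The following is a minimal sketch, not something shown in the talk: the `ServerRecord` type, the 182-day approximation of "six months", and the 20% utilization cut-off are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "six months or more" is approximated as 182 days.
COMATOSE_AFTER = timedelta(days=182)
# Assumption: illustrative cut-off for "underutilized", not a figure from the talk.
UNDERUTILIZED_BELOW = 0.20


@dataclass
class ServerRecord:
    """Hypothetical inventory entry for one server."""
    name: str
    last_useful_activity: datetime  # last time the server delivered any service
    average_utilization: float      # fraction of capacity used, 0.0 to 1.0


def classify(server: ServerRecord, now: datetime) -> str:
    """Return 'comatose', 'underutilized', or 'active' for one server."""
    if now - server.last_useful_activity >= COMATOSE_AFTER:
        return "comatose"
    if server.average_utilization < UNDERUTILIZED_BELOW:
        return "underutilized"
    return "active"
```

In practice, the hard part is obtaining a trustworthy `last_useful_activity` value, since a forgotten server is by definition one that nobody is watching.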

According to the Antithesis Institute, the problem with zombies is that people forget to turn off servers. Cummins attributed this to over-provisioned services, projects ending, business processes changing, or isolation requirements. With underutilized servers, the problem can be caused by batch jobs that run only on the weekend or systems that are used only during business hours but run 24/7. Auto-scaling, for instance, could solve that; however, scaling down doesn't happen that easily.
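A quick calculation shows how much of a week an always-on, business-hours-only system sits idle. This is a rough sketch; the 50-hour working week and the hourly rate used in the usage note are assumed values, not figures from the talk.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def idle_fraction(hours_needed_per_week: float) -> float:
    """Fraction of the week an always-on server runs without being needed."""
    if not 0 <= hours_needed_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_needed_per_week must be between 0 and 168")
    return 1 - hours_needed_per_week / HOURS_PER_WEEK


def weekly_idle_cost(hourly_rate: float, hours_needed_per_week: float) -> float:
    """Money spent per week on hours in which the server is not needed."""
    return hourly_rate * (HOURS_PER_WEEK - hours_needed_per_week)
```

For a system needed ten hours a day on five days a week, `idle_fraction(50)` is 118/168, or roughly 70% of the week. At an assumed rate of 0.50 dollars per hour, `weekly_idle_cost(0.5, 50)` comes to 59 dollars per week for a single instance.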

One way to solve the problem with zombies is to leverage the green computing model.

Cummins explained the model's three vowels (elasticity, utilization, and utility). The efficiency of algorithms is, according to Cummins, not relevant for zombies.

The ultimate goal is to eliminate the zombies through detection and destruction. One technique is system archaeology combined with a scream test, and Cummins told some stories about this test. Another method she mentioned is tagging, which provides metadata for future review but is a manual process and not always clear. Finally, she listed several "-Ops" practices that can be useful in eliminating zombies, such as GreenOps, FinOps, AIOps, and LightSwitchOps. Cummins described the latter as operations that turn a server off and bring it back up like a light switch. This needs to be fast and always work (resilient and idempotent).
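A tag review of the kind described above can be partly automated. The sketch below is an illustration only: the tag names `owner`, `purpose`, and `review-by` are hypothetical conventions, and how the tags are fetched from a given cloud provider is left out.

```python
from datetime import date
from typing import Dict, List

# Hypothetical tagging convention; real organizations will choose their own names.
REQUIRED_TAGS = ("owner", "purpose", "review-by")


def needs_review(tags: Dict[str, str], today: date) -> List[str]:
    """Return the reasons a resource should be looked at; an empty list means it passes."""
    reasons = [f"missing tag: {name}" for name in REQUIRED_TAGS if not tags.get(name)]
    review_by = tags.get("review-by")
    if review_by:
        try:
            if date.fromisoformat(review_by) < today:
                reasons.append("review date has passed")
        except ValueError:
            # The tag exists but cannot be parsed as YYYY-MM-DD.
            reasons.append("review-by is not an ISO date")
    return reasons
```

A resource that fails this check is only a candidate zombie. A human, or a scream test, still has to confirm that nobody needs it.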

Cummins provided examples of using scripts to turn servers on and off, thus saving costs. Furthermore, according to Cummins, this can also be accomplished by leveraging GitOps (infrastructure as code) to spin servers up and down. Moreover, she explained that GitOps could be an alternative to systems that require redundancy.
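Such a script might look like the following. This is a minimal sketch and not one of the scripts Cummins showed: `InMemoryCloud` is a hypothetical stand-in for a real provider client, and the weekday 08:00 to 18:00 schedule is an assumption. The point it illustrates is idempotency: running the script repeatedly never issues a redundant start or stop.

```python
from datetime import datetime
from typing import Dict


class InMemoryCloud:
    """Hypothetical stand-in for a real provider client, for illustration only."""

    def __init__(self, states: Dict[str, str]) -> None:
        self.states = dict(states)
        self.calls = 0  # number of state-changing calls issued

    def is_running(self, server_id: str) -> bool:
        return self.states[server_id] == "running"

    def start(self, server_id: str) -> None:
        self.calls += 1
        self.states[server_id] = "running"

    def stop(self, server_id: str) -> None:
        self.calls += 1
        self.states[server_id] = "stopped"


def should_be_on(now: datetime) -> bool:
    """Assumed schedule: Monday to Friday, 08:00 to 18:00."""
    return now.weekday() < 5 and 8 <= now.hour < 18


def flip_switch(cloud: InMemoryCloud, server_id: str, now: datetime) -> str:
    """Move the server to the desired state; do nothing if it is already there."""
    wanted = should_be_on(now)
    running = cloud.is_running(server_id)
    if wanted and not running:
        cloud.start(server_id)
        return "started"
    if not wanted and running:
        cloud.stop(server_id)
        return "stopped"
    return "unchanged"
```

Because `flip_switch` compares the desired state with the actual state before acting, it can be run from a scheduler every few minutes without side effects, which is what makes the "light switch" safe to automate.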

Cummins also covered what does not help eliminate zombies: the cloud (out of sight, out of mind), virtualization, and serverless, since instances may sometimes still be running even when there is no cost. In addition, she mentioned that prevention through heavy governance does not help.

Lastly, Cummins mentioned that zombies are not just servers but also data and network traffic (packets that never reach their destination). Together, they make up what is called internet background noise.

She concluded that there is a double win for finances and the environment. In addition, she stated that we need to try to eliminate zombies, use or create tools that help with better utilization (elasticity, multi-tenancy) and de-zombification (visibility, disposability), and leverage GreenOps, FinOps, AIOps, GitOps, and LightSwitchOps.

InfoQ interviewed Holly Cummins about Cloud Zombies and how you can stop them.

InfoQ: When did you get the idea of the LightSwitchOps?

Holly Cummins: I’d been thinking about zombie servers and utilization for a while, but the LightSwitchOps realization came to me when I looked at Quarkus startup times. Quarkus, when natively compiled, starts up faster than an LED light bulb. That's incredible, technically, and it solves half the "why don't we stop our servers when we're not using them and start them when we need them?" problem. But it only solves half the problem because the other half of the problem isn’t technology-to-start fast; all the surrounding processes and infrastructure make organizations reluctant to stop and start servers. So it seemed that getting to the point where we could treat servers like light switches was (a) a non-trivial goal and (b) would resolve a lot of waste.

InfoQ: What's the best way to get to the best utilization of a server? Are there any frameworks or tools to help you with that?

Holly Cummins: Although imperfect, auto-scaling tools are a good start for optimizing server utilization. They’re usually tuned to specific platforms, such as Kubernetes. At a data center level, tools like Turbonomic Application Resource Management (which my colleagues have had good results with) or Densify (which I haven’t personally used) can help.

About the Author

Steef-Jan Wiggers
