Cloud computing is more than just fast self-service of virtual infrastructure. Developers and admins are looking for ways to provision and manage at scale. This InfoQ article is part of a series focused on automation tools and ideas for maintaining dynamic pools of compute resources. You can subscribe to notifications about new articles in the series here.
Early cloud computing deployments typically involve small-scale adoption of a handful of servers by just one or two employees for a specific use case. However, today we are seeing increasing adoption of public cloud, with multiple employees from across the enterprise using a vast array of capabilities across all cloud service models (IaaS, PaaS, SaaS).
As more organizations expand their use of public cloud services, from early-stage startups to the biggest businesses and governments in the world, the problems of cloud computing at scale start to arise.
Potential issues with public cloud at scale
While there is no doubt that public cloud adoption offers phenomenal results for businesses of all shapes and sizes, large-scale adoption of public cloud can also create many new challenges and risks. Among the most important of these are the following:
Cost
As you start using public cloud by allowing limited access for a handful of individuals, it is relatively easy to track your costs. However, as more individuals from multiple (often siloed) departments gain access, you will likely experience function overlaps, over-provisioning, unauthorized purchases, unused 'zombie' instances, excess bandwidth and storage bills, and other factors unnecessarily eating into expected cost savings.
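The kind of 'zombie' detection described above can be automated with a simple scheduled check. The sketch below is purely illustrative — the instance records, CPU threshold, and `owner` tag convention are assumptions for the example, not any provider's real API:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    avg_cpu_percent: float  # average CPU over the look-back window
    days_running: int
    tags: dict

def find_zombies(instances, cpu_threshold=2.0, min_days=14):
    """Flag long-running instances with near-idle CPU, or with no owner tag."""
    zombies = []
    for inst in instances:
        idle = inst.avg_cpu_percent < cpu_threshold and inst.days_running >= min_days
        unowned = "owner" not in inst.tags
        if idle or unowned:
            zombies.append(inst.instance_id)
    return zombies

fleet = [
    Instance("i-001", avg_cpu_percent=45.0, days_running=30, tags={"owner": "alice"}),
    Instance("i-002", avg_cpu_percent=0.5, days_running=60, tags={"owner": "bob"}),
    Instance("i-003", avg_cpu_percent=30.0, days_running=5, tags={}),
]
print(find_zombies(fleet))  # -> ['i-002', 'i-003']
```

In practice the instance data would come from your provider's inventory and metrics APIs, and flagged instances would be reported to their owners before any automated shutdown.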
Unauthorized access
It is easy to manage small-scale access to public cloud services, but as adoption grows, you can quickly lose control. Former employees may retain access after leaving, access rights are not updated as roles change, and new employees can struggle to get the resources they need. With many cloud providers failing to offer enterprise-grade access controls, you can quickly fall victim to unauthorized access as your public cloud adoption grows.
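One basic control here is a periodic sweep for accounts that should no longer have access. This hypothetical sketch assumes user records with a last-login date and an optional termination date; the field names and the 90-day inactivity window are illustrative choices, not a real directory schema:

```python
from datetime import date, timedelta

def stale_accounts(users, today, max_inactive_days=90):
    """Return user IDs whose access should be revoked: terminated
    employees, or accounts unused beyond the inactivity window."""
    to_revoke = []
    for user in users:
        terminated = user.get("termination_date") is not None
        inactive = (today - user["last_login"]) > timedelta(days=max_inactive_days)
        if terminated or inactive:
            to_revoke.append(user["id"])
    return to_revoke

users = [
    {"id": "u1", "last_login": date(2014, 6, 1), "termination_date": None},
    {"id": "u2", "last_login": date(2014, 1, 1), "termination_date": date(2014, 2, 1)},
    {"id": "u3", "last_login": date(2013, 12, 1), "termination_date": None},
]
print(stale_accounts(users, today=date(2014, 6, 15)))  # -> ['u2', 'u3']
```

A real implementation would feed this from your HR system and identity provider, and revoke credentials automatically rather than just reporting them.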
Penetration
Even worse than having employees with access control issues is allowing malicious external actors to penetrate your cloud services. Password loss, shared user IDs, data leakage, simple passwords, social engineering, phishing, and malware can all expose a public cloud service to data loss, manipulation, attack, denial of service, and other impacts of malicious penetration.
Human error
Individuals can easily manage cloud services when they are small, but as they expand and scale you cannot keep adding human resources to maintain manageability. That means fewer people have more work to do, and the law of averages means that eventually someone will make a mistake. This in turn can cause massive failures, although such problems are certainly not unique to cloud.
Visibility
When you have just a handful of carefully managed services, just one or two people can see where they are deployed, how they are configured, what they cost, how they are being used, who owns what, what is causing problems, how to fix them, when to shut them down, how to recover, etc. As with any large-scale system, however, as you scale public cloud deployment and open access to more use cases, cloud usage can become increasingly opaque.
Triage
As a consequence of poor visibility, problem triage also becomes significantly harder. If you cannot see where a system is running or how it connects with other services, for example, it is almost impossible to nail down a slowdown somewhere in the transaction flow. As the maxim commonly attributed to W. Edwards Deming, a leading intellect on systems thinking, puts it, "you can't manage what you can't measure" — but it is perhaps even more apropos that you cannot manage what you cannot see.
Auditability
In another side effect of poor visibility, as more systems and services are abstracted by cloud it becomes harder to track who is accessing what, when, how, and why, creating critical issues with auditability. Being able to track, record, and review access, change, failure, exposure, utilization and more in a large-scale environment is incredibly hard without tools to automate the process.
Recoverability
While severe outages are not unique to cloud, every week we seem to hear new dramatic stories of public cloud failures. Yet with many cloud providers, especially commodity services, there is no recoverability built in; and even more robust services may not provide timely recovery or prioritize your business needs. Outages can be truly disastrous if you do not have systems in place for backup, failover, and recovery.
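A small piece of the backup automation described here is a retention policy: deciding which old snapshots can safely be pruned. The sketch below is a minimal illustration; the seven-day window and minimum-keep count are arbitrary example values:

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshot_times, now, keep_days=7, keep_min=3):
    """Return snapshots older than the retention window, but always
    preserve at least `keep_min` of the most recent ones."""
    ordered = sorted(snapshot_times, reverse=True)  # newest first
    survivors = set(ordered[:keep_min])
    cutoff = now - timedelta(days=keep_days)
    return [t for t in ordered if t < cutoff and t not in survivors]

snaps = [datetime(2014, 6, 14), datetime(2014, 6, 10), datetime(2014, 6, 1),
         datetime(2014, 5, 1), datetime(2014, 4, 1)]
# The May and April snapshots fall outside the window and are pruned;
# the June 1 snapshot survives only because of the keep_min floor.
print(snapshots_to_delete(snaps, now=datetime(2014, 6, 15)))
```

Real backup tooling would of course also verify that surviving snapshots are restorable before deleting anything — a retention policy is only one piece of recoverability.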
Automation addresses all these and more
One answer to these problems (and more) is IT automation. Of course, automation is not a silver bullet; and automating bad processes merely leads to bad things being done faster and without control. However, implemented properly, automation tools in their various forms can allow you to scale your public cloud deployment while avoiding many of these problems.
For example:
- Process automation can execute and integrate existing tasks and workflows faster, at greater scale, across broader geographies, at lower cost, and with greater audit and control than any human could ever hope to.
- Provisioning automation can control who, when, what, why, and how your employees can create and release cloud services, reducing errors, eliminating zombies, tracking costs and enabling granular audit and control.
- Configuration automation can help to ensure systems are patched, unused ports are closed, vulnerabilities are eliminated, overruns are controlled, systems are repeatable, and errors are minimized.
- Event monitoring can watch even the largest cloud deployments to surface errors and make sure trigger events are made visible, root cause can be established, alerts are escalated, and problems are detected and resolved before they become critical.
- Containerization can provide a higher layer of abstraction from the specifics of an individual cloud infrastructure or platform, allowing rapid and low-touch migration from one service to another for better disaster recovery and cost control.
- Performance monitoring with automatic detection, notification, escalation, and triage of problems provides essential visibility, helps avoid poor experience, and prevents cost overruns caused by throwing expensive capacity at poorly diagnosed problems.
- Backup and recovery automation can make failures transparent to end users, especially if they are connected to event and performance monitoring tools, or used to build tolerance for failure and disaster recovery directly into your cloud applications.
- Release automation can take new applications and updates from dev to production in the cloud without human intervention, accelerating innovation even across large deployments while reducing human error, ensuring audit, and eliminating 'rogue' code.
- Identity and access management can provide the right access to cloud services when it is needed and revoke access when it is not, to prevent malicious access, eliminate data loss, enable audit and control, improve visibility, and manage usage costs.
- Capacity management enables cloud consumers to predict more accurately their service growth and peak requirements, as well as when to release resources, reducing the potential for service problems while helping to manage cloud resource costs.
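To make the provisioning controls above a little more concrete, here is a hypothetical pre-flight policy check that a provisioning workflow might run before creating anything. The request fields, size catalogue, and budget figure are all illustrative assumptions:

```python
ALLOWED_SIZES = {"small", "medium", "large"}  # hypothetical approved catalogue
MONTHLY_BUDGET = 1000.0                       # illustrative per-team budget

def validate_request(request, current_spend):
    """Return a list of policy violations; an empty list means approved."""
    violations = []
    if not request.get("owner"):
        violations.append("missing owner tag")
    if request.get("size") not in ALLOWED_SIZES:
        violations.append("instance size not in approved catalogue")
    if current_spend + request.get("est_monthly_cost", 0.0) > MONTHLY_BUDGET:
        violations.append("request would exceed team budget")
    return violations

req = {"owner": "", "size": "xlarge", "est_monthly_cost": 400.0}
print(validate_request(req, current_spend=700.0))  # all three rules fail
```

Gating every request through a check like this is what gives provisioning automation its audit trail: every approval, rejection, and reason is recorded rather than living in one administrator's head.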
Moreover, automation starts to enable new capabilities with public cloud that would be essentially impossible with traditional manual efforts. For example, accelerating application delivery at scale with new approaches like DevOps is arguably only feasible with solutions for self-service provisioning, configuration management, test automation, and release automation. Similarly, leveraging at scale the incredible opportunities in the emerging cloud API economy is at best risky, and at worst disastrous, without solutions to automate API access, identity management, resource utilization, and cost control.
What are the top automation tools?
All of these automation tools and disciplines play a role in a best-practice public cloud deployment. It is not entirely reasonable to name the top tools without understanding any given deployment's goals and constraints. Still, some are certainly more critical to more deployments than others, and if I had to name my top three, I would choose:
- Identity and Access Management - if you cannot ensure only the right people get access at the right time to the right resources, then really, all other bets are off. If your biggest concerns are around protecting your cloud-based data and services, this is a must-have automated solution.
- Provisioning Automation - this is the basis for many cloud services, but it is critical to have granularity in this function especially for audit and control. Manual provisioning is also probably the biggest cause of human error and cost overruns with public cloud deployments.
- Performance and Availability Monitoring - this may be the ultimate tool for all deployments, ensuring you know if and when problems occur, why they are happening, and how to effectively fix them, even across the largest high-scale and high-performance cloud deployments.
Summary
Automation capabilities are all but essential for public cloud to function at all. Certainly some basic automation will be included in any decent cloud service, such as self-service provisioning, utilization measurement, or chargeback.
However, as I have written before, there is a strong chance that you will not get sophisticated automation capabilities from your cloud provider, especially with commodity cloud services.
It is therefore up to you to understand both the opportunities and the risks associated with public cloud adoption, to choose the right service providers for your workloads and goals, and to supplement them with appropriate automation tools.
Only by integrating the right automation solutions will you truly unleash the full potential of public cloud, by providing and enhancing confidence, safety, performance, speed, and control.
About the Author
Andi Mann – Vice President in the office of the CTO at CA Technologies – is an accomplished digital business executive with extensive global expertise as a strategist, technologist, innovator, marketer, and communicator. With over 25 years' experience across five continents, Andi is a sought-after advisor, commentator, and speaker. Andi has authored two books, blogs at 'Andi Mann – Übergeek', and tweets as @AndiMann.