
From Cloud-Hosted to Cloud-Native



Rosemary Wang discusses the patterns and practices that help one move from cloud-hosted to cloud-native architecture and maximize the benefit and use of the cloud.


Rosemary Wang works to bridge the technical and cultural barriers between infrastructure, security, and application development. She is the author of Infrastructure as Code, Patterns and Practices. She has a fascination for solving intractable problems as a contributor, public speaker, writer, and advocate of open source infrastructure tools.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Wang: We're going to talk about going from cloud hosted to cloud native. It all starts when you say, I want to put an application on the cloud. It seems really simple. Step one, build the application. Step two, figure out which cloud you want to put it on. Step three, run in production. After this process, you say to yourself, that's great. I've now built a cloud native application. Let's look at the definition. Cloud native is building and running scalable applications in modern dynamic environments, such as public, private, and hybrid clouds. This is a really great definition from the Cloud Native Computing Foundation. When you think about the scenario that I just outlined, let's answer the question, is it scalable? Are we really in a modern dynamic environment? The answer is, kind of. The reality is that when you put an application on the cloud, there are a lot of obstacles that come up in the process, the first being, what operating system should I run it on? Should I even run it on an operating system in the first place? Next, you think about how you should package it. Should it be a function? Should it be a container? What should it be? Next, you think about its configuration. How do I configure it to run on certain infrastructure? How do I make sure it's routed correctly? The next thing that we think about is how to secure it. Is it something that is running in a private environment? Is it something that's running potentially publicly? Are there database passwords and credentials that we should be aware of while it runs?

Then, we come to the CI framework. We need to deploy it to the cloud somehow, and deploying to the cloud is complicated, especially when you have network routing considerations in place, and you have to think about it really carefully. Next, we say, ok, it's on cloud. We've done this process. We've built our CI framework so that it can deploy to cloud. We've taken all of these steps. We've thought carefully about these requirements. It must be cloud native. Then you come back to your cloud bill, and your cloud bill shows you that it's pretty expensive to run this application. Then you go back to the drawing board. You rearchitect the application, thinking to yourself, maybe now we consider it cloud native, because, after all, it's taking advantage of all of these service offerings. We've done all of our research. We've done the engineering work to make sure that we've optimized it. The reality is, it's probably not cloud native. The application that you think about putting on cloud isn't going to be cloud native. Instead, it's cloud hosted. You've built and run an application in an environment like a public, private, or hybrid cloud. It's not really scalable. It's perhaps not really the most modern or dynamic application in the first place.

Here, I am going to answer this question: what does it mean to go from cloud hosted to cloud native? Over the years, I've realized that cloud native architecture is very complicated. It has a changing definition. While I think that the CNCF's definition is really useful, and it's actually probably the more thorough one, it's also not as nuanced as the actual implementation. We're going to talk about the practices and patterns that let you identify certain architectures as cloud native, without pointing to a specific technology and saying, that technology is going to give me a cloud native architecture. You can boil some of these practices down to foundational pieces, and use those to build up a cloud native architecture. I have a couple of cloud native considerations that I think about. These are some architectural capabilities that I consider important for a cloud native architecture. First is adaptability. Second is observability. Third is immutability. Fourth is elasticity. Fifth is changeability. We're going to go through each of these, and I'll give an example of how they are important to a cloud native architecture.

Adaptability - Adjust Between Environments

The first is adaptability. Adaptability is the ability to adjust between environments. This isn't just environments as in development, staging, and production. This is also environments as in a public cloud, private cloud, hybrid environment, or multi-cloud environment. The idea is that you need to be able to adapt your dependencies between different kinds of environments. Let's take this example. Imagine that I have a data center and I run Kubernetes in that data center; it could be on OpenShift or something else. I also have a Kubernetes cluster that I run on a managed service. This Kubernetes cluster is managed by a public cloud. They're both Kubernetes, which makes it pretty easy for me. I could take one application running in my data center, and as long as it's on Kubernetes, I can just bring it to the public cloud. Both are equivalent, in theory. This principle is adapting by abstraction. The idea is that a cloud native architecture often relies on abstractions to improve adaptability. If you need to adjust or change applications between environments, you're going to use an abstraction to do that.

There is a bit of a caveat to this. Just because you have Kubernetes in your data center and Kubernetes in a public cloud does not mean that it is an easy path to adapt between environments. This is where I think there are some foundational practices that you need in place in order for the adaptability to exist, and thus for you to have a cloud native architecture. The first is that when you move an application, or some application manifest, from one Kubernetes to another, you have to be concerned about versions. Not all Kubernetes resources are available in every version. Second is images. If you've ever worked across multiple clouds with Kubernetes clusters, managed Kubernetes offerings often have different container registry or image pull options than others.

The other important obstacle in the path of moving from one Kubernetes cluster to another often involves customization. Many times, when you've worked in a data center environment and you have your own Kubernetes, you have customized certain workflows so that they are aligned with your organization's workflows. This could be custom resource definitions or other resources. They don't map perfectly to a public cloud, so then you have to adapt them as well. Persistence becomes an obstacle. If you expect certain persistent volumes or certain resources to exist in the data center Kubernetes, but they don't exist in the public cloud Kubernetes, then you have to readjust your application to work with a different persistence layer. There are abstractions in Kubernetes that do help with this. For example, a Kubernetes persistent volume claim will help you attach to a specific persistent volume type that differs across multiple clouds. You still have to make the effort to use that abstraction, as well as build the underlying different persistent volumes. There are actually significant differences, and it's not quite that easy moving from one Kubernetes cluster to another.

This is an example, but there are many other scenarios in which this problem exists. What are some foundational practices to keep in mind if you're designing with an abstraction and you need a way to adjust between environments? My big tip is to aim for just enough abstraction. Kubernetes is one example of a just enough abstraction. There could be other open source standards that allow you to do just enough abstraction, and take away some of the pain of needing to adjust between environments. Just enough abstractions tend to exist in open source standards, but they may also exist in your organization's implementation of an abstraction. Say, for example, you build one API layer to query information for security specs or security metadata. That is just enough abstraction to make sure that you're not tied to querying a specific tool.

Here are some foundational practices that help you achieve a cloud native architecture from an adaptability perspective. First, decouple configuration and secrets. This is something that you might encounter as part of the twelve-factor app perspective for microservices. Even if you're not necessarily doing those kinds of software or application architectures, you need to consider decoupling configuration and secrets in a cloud native architecture. The reason why is that as you adapt across environments, you're going to have different configurations, whether for development, staging, or production, or differing configurations across clouds, and you need to be able to decouple the configuration as well as the credentials you use to access all of those target clouds. Decoupling them will allow you to scale your application, but it will also minimize the effort that you need to take to adapt it to specific clouds. Decoupling the configuration and secrets away from the application or away from infrastructure is one way that you can ensure that you have some consistency in how you are going to adjust a dependency across clouds.
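As a minimal Python sketch of this decoupling: the application reads its settings and secrets from the environment rather than baking them into the artifact, so the same build runs unchanged in any cloud or stage. The variable names (DB_HOST and so on) are illustrative assumptions, not from the talk.

```python
import os

def load_config(env=None):
    """Pull connection settings and secrets from the environment
    instead of hard-coding them, so the same artifact adapts to
    development, staging, production, or a different cloud."""
    env = os.environ if env is None else env
    return {
        "db_host": env.get("DB_HOST", "localhost"),
        "db_port": int(env.get("DB_PORT", "5432")),
        # Secrets come from the environment or a secrets manager,
        # never from the packaged application itself.
        "db_password": env.get("DB_PASSWORD", ""),
    }

# The same code adapts per environment through its inputs alone.
prod = load_config({"DB_HOST": "db.prod.internal", "DB_PASSWORD": "s3cret"})
dev = load_config({})  # falls back to local defaults
```

In practice the `env` mapping would be the real process environment; passing a dict here just keeps the sketch testable.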

The next is to use dependency injection. The important thing about dependency injection is to apply the abstraction in a way that lets you change the downstream resource as well as the upstream resource. For example, if you have an application that runs on a server, and that application needs the server IP address for some reason, you don't want the application to query the server command line, or query the network interface, just for the IP address. The reason is that querying the underlying interface differs by environment. You can't do it the same way in a container. It will be different on a virtual machine. It might be different on something else. You want a layer of abstraction there so that the application can query the server IP without depending on the server's underlying operating system. The way to think about this is to instead use a metadata service for that machine. You call the metadata endpoint for that machine and you retrieve the IP address from an API endpoint. That's an example of using dependency injection to decouple the dependencies. The reason why this is particularly important, especially from an infrastructure standpoint in cloud native, is that you're often going to change the upstream resources, basically, the ones that depend on the underlying infrastructure. The upstream resources are going to change much faster than the lower-level resources. It's not going to be easy to adapt a network across multiple clouds. You can't use the same network declaration from Azure to AWS to Google Cloud, but what you can do, generally speaking, is describe a server in a similar manner across Azure, Google Cloud, and AWS. The idea is that with dependency injection, you can make changes to those upstream resources and adapt them across clouds without putting additional effort into the lower-level resources.
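A small Python sketch of the IP-address example: the application receives its lookup function as an injected dependency, so a VM can inject a metadata-service call while a container or a test injects something else. The function names here are illustrative, not from the talk.

```python
from typing import Callable

def make_app(get_host_ip: Callable[[], str]):
    """The application receives its IP lookup as an injected
    dependency rather than shelling out to the operating system,
    so each environment supplies its own implementation."""
    def describe() -> str:
        return f"serving on {get_host_ip()}"
    return describe

# On a cloud VM you might inject a call to the provider's metadata
# service (the endpoint differs per cloud); in a test or container,
# inject a stub instead:
app = make_app(lambda: "10.0.0.7")
result = app()
```

The application code never changes between environments; only the injected lookup does.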

The next thing that, foundationally, you need to implement in order to get closer to cloud native from an adaptability standpoint is to write abstract tests. We don't like writing tests; it's hard to justify the effort of testing. When you have to work across multiple clouds, and you also have a cloud native architecture, meaning one that's fairly dynamic, the thing that will help you is knowing that functionally everything is working as expected. What I usually do is write an abstract test to test the functionality of my application when it's on a cloud. I call these end-to-end tests. Why are end-to-end tests an important place to abstract? They're pretty much going to be the same across any cloud. If I know my application is going to submit a payment, it doesn't matter which cloud it's running on, it should just submit the payment. In that case, the test itself should have a level of abstraction, meaning it should test the endpoint, and the endpoint should return the correct information. It should not matter what the underlying cloud is; it shouldn't matter what the underlying technologies are. Investing in writing abstract tests is very useful.
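The payment example might look like this in Python: the assertion is about behavior, not about which cloud serves it, and only the injected client changes per environment. The endpoint path, client interface, and response shape are illustrative assumptions.

```python
def check_payment_flow(client) -> None:
    """Abstract end-to-end check: a payment is accepted. The test
    knows nothing about the underlying cloud or technology stack;
    only the injected client differs per environment."""
    response = client.post("/payments", {"amount": 10})
    assert response["status"] == "accepted"

# A stub stands in for a real HTTP client here; in CI you would
# build one pointed at the target environment's base URL.
class StubClient:
    def post(self, path, body):
        return {"status": "accepted", "path": path}

check_payment_flow(StubClient())
```

The same `check_payment_flow` runs unmodified against the data center, a public cloud, or a local stub.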

Finally, this is probably one of the more disruptive practices that I'm going to include on the list, and one you're going to encounter. If you do not have this in place, it's going to be incredibly difficult to improve the adaptability of your system. That is to update to the stable or stable minus one version. This is because most clouds that offer a managed service tend to offer different versions. One cloud may offer up to Kubernetes 1.23; another might only offer 1.17 as its stable version, for some reason or another. If you don't run stable or stable minus one versions, what tends to happen is that it becomes difficult to adjust across clouds, and even across environments. You can imagine that sometimes dev might be at 1.21, but then production might be at 1.17. The reality is that when you have all of these different versions, it makes it incredibly difficult to adapt upstream dependencies across all of these different environments. Updating to stable or stable minus one is a great way to ensure that an upstream dependency can move across all of these different environments comfortably.
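One way to enforce this is a simple version gate in your pipeline. A sketch, assuming semantic major.minor version strings like Kubernetes uses:

```python
def within_support_window(current: str, stable: str) -> bool:
    """True when `current` is at the stable minor version or stable
    minus one, e.g. with stable 1.23 both 1.23 and 1.22 pass."""
    cur = tuple(map(int, current.split(".")[:2]))
    st_major, st_minor = map(int, stable.split(".")[:2])
    return cur in {(st_major, st_minor), (st_major, st_minor - 1)}

# An environment far behind stable gets flagged before it blocks
# a migration, instead of being discovered mid-move.
ok = within_support_window("1.22", "1.23")       # stable minus one
too_old = within_support_window("1.17", "1.23")  # several versions behind
```

Running a check like this in CI turns the version drift between dev at 1.21 and production at 1.17 into a visible failure rather than a surprise.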

If all else fails, and you have these practices in place but you're finding that it's still really difficult to adapt an application across all these different platforms and clouds, then as a last resort, refactor the abstraction. What I mean is that the abstraction may not be working for you. For example, if you're finding that it's still incredibly difficult to port this application from one Kubernetes to another, that's usually an indication that it's not quite the right abstraction. Maybe the application itself does not lend itself well to Kubernetes, or maybe it's just not made to run in a container. In that case, figure out what the right abstraction is and identify how to best improve that application to suit a better abstraction, and make it a lower effort for you to adapt across environments. This is a last resort. This is oftentimes the way we move toward cloud native: we'll start with these foundational practices, and then, eventually, we'll resort to the last one, which tends to be a larger refactor effort.

Observability - Navigate Cloud Cover

Next, we have observability. Observability is the way that you can understand how you're using your cloud as well as how your applications run on it. This is incredibly important. When we talk about being cloud hosted, we have an understanding of usage as well as performance, but we don't have a really deep understanding of how it interacts as a larger system. As part of cloud native architecture, you need to understand how everything interacts together. For example, imagine that I have a monitoring system in my data center. That monitoring system in my data center is now responsible for not only retrieving the information from the data center, which has a more traditional monitoring server approach, it also has to retrieve information across various syslogs in different machines. It has to get AWS access logs, Google Cloud access logs, Azure access logs, Azure Active Directory access logs, Kubernetes logging and metrics. It needs to aggregate some of the logs from Spring Cloud, some of the logs from .NET applications. Then, any services that we run on top of the infrastructure, so this could be a service mesh, this could be a secrets manager. All of this information now needs to get aggregated somewhere.

The best approach is to set some standards. Notice I don't say set one standard; set some standards. With this heterogeneous set of workloads, services, and platforms, it's really difficult to have one standard. You'll spend way too much time trying to organize everything and format it into one uniform data format with the correct fields and the correct values. While there is some value to that, the amount of effort you spend in doing that does not necessarily give you the additional benefit. What is the alternative? When you set some standards, identify standards that you can adopt from an organizational standpoint, and that fit just enough of your workload footprint. For example, Prometheus gives you an open source standard for metrics and their formatting. That serves its purpose for a number of metric servers; a lot of metric servers will pull from different endpoints that expose metrics in the Prometheus format. You can also look at OpenTelemetry to add instrumentation to the application itself. It works across a variety of programming languages and frameworks. There's also Fluentd. Fluentd will help you extract logs from a machine, and then send them to a target in a more structured format. There are little ways that you can do this, and a number of them are open source standards. Again, the thought is that abstraction will help you adapt.
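To make the Prometheus example concrete, here is a tiny Python sketch that renders one sample in the Prometheus text exposition format, the open standard many metric servers can already scrape. Real exporters use a client library; this hand-rolled renderer is just to show what the standard wire format looks like.

```python
def to_prometheus(name: str, value, labels: dict) -> str:
    """Render one sample in the Prometheus text exposition format:
    metric_name{label="value",...} sample_value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = to_prometheus("http_requests_total", 1027,
                     {"method": "post", "env": "prod"})
```

Because so many tools speak this one format, any scraper can collect from any workload that exposes it, which is exactly the "some standards" abstraction being described.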

However, there is a point to meeting resources where they're at. Sometimes, it's just not possible. You can't add the OpenTelemetry library in; instrumentation is just too difficult. You already have metrics set up for an application, and you really don't want to refactor and add a new library in when you don't have to. There is a balance between meeting resources where they're at and setting some standards. If there are resources that absolutely cannot be refactored, or the level of effort is just too high and there's no real value from it, meet them where they are right now and just take the information. When we talk about taking that information in, there are some foundational practices. The first is tagging and adding metadata. If you are using an existing metrics library in an application, make sure you have consistent and standardized metadata. This metadata should be fairly uniform across infrastructure, applications, and other managed service offerings. Identifying and architecting the proper metadata for identification will help you identify resources and get a better end-to-end picture of what's going on in your environment. It's very difficult to justify tagging and adding metadata after the fact, but it is worth the effort, especially from a billing and compliance standpoint.
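A lightweight way to keep tags uniform is to validate them at provisioning time. A sketch, where the required tag names are an organizational choice and purely illustrative:

```python
# Which tags are mandatory is an organizational decision;
# these names are examples, not a standard.
REQUIRED_TAGS = {"team", "service", "environment", "cost-center"}

def missing_tags(resource_tags: dict) -> set:
    """Return the required tags a resource lacks, so untagged
    resources are caught at provisioning time rather than showing
    up unattributed on the cloud bill."""
    return REQUIRED_TAGS - resource_tags.keys()

missing = missing_tags({"team": "payments", "service": "checkout"})
```

Wiring a check like this into the pipeline that creates resources avoids the much harder job of back-filling tags after the fact.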

Enable audit and access logs. This seems pretty intuitive, yet most people don't do it until after everything has happened. They decide, ok, we've built and engineered this, now we can enable audit and access logs. The reality is, audit and access logs are pretty powerful. They not only give you a security view into who's accessing and looking at what in your environment, they're also a very useful way to track transactions and interactions. This is important in a very dynamic environment. When we talk about cloud native, it's often in a very dynamic, ever-changing environment. A container is coming up, a container is coming down. It becomes really difficult from a development perspective to understand what is happening in that system, when a container is only available for perhaps 10 or 15 minutes, and then suddenly it's been destroyed for some reason. Enabling audit and access logs is incredibly important.

On top of that, in a cloud native architecture, we tend to skew toward continuously delivering the resources. The reason why is, again, we want to take advantage of the dynamic environment. When we think about continuous delivery, it becomes really easy to say, I'm going to just use a CI framework, and let it have access to deploy anything it needs to. That is a lot of access. In order for you to understand the access that your CI framework has, and to properly audit its automation, you do need to have logging available for that. You'll need to do a lot of automation from a cloud native standpoint. It's critical to understand that any automation you do needs to have access to something. Even if you're really great at least privilege, meaning minimizing the amount of access that an automation piece does have, you still need a way to log and audit it. It's not just from a security perspective; if you're a developer or an engineer who's working on it, you need it to reconstruct, oftentimes, interactions within a system. Enable those audit and access logs.

The other thing that you'll need to do is aggregate telemetry. This is a little bit more difficult. Sometimes, it's not so easy to aggregate telemetry without finding a new tool or technology and installing it in the process. There are ways that you can aggregate the telemetry into a couple of different targets and make sure that that information exists. Making sure you aggregate the telemetry will help add, again, a level of abstraction for you to adapt across different environments. If you can aggregate telemetry across other clouds, that gives you the ability to understand how applications are interacting across different clouds versus within a cloud. Standardizing and indexing telemetry comes after you've tagged and added the metadata. Standardizing and indexing telemetry does allow you to trace the transactions within an environment. It traces transactions as well as interactions from the application level to the infrastructure level. Having some telemetry that you can search on, and specific fields that you know will exist, will help you identify later on what resources are important and what resources are not.

Finally, if you've done all of these foundational practices, and you find yourself struggling to make changes to your application, it's probably still mostly a cloud hosted application; it's not really cloud native. In that case, assess a push versus pull model. Most traditional monitoring systems use a push model, meaning there's an agent that collects the information and pushes it out to the metric server, or you bundle an agent with the application and it pushes those metrics or telemetry out to a server somewhere. In more recent years, what we consider more of the distributed approach is to pull, so you have an agent that sits either in the environment or at the host level, pulls from multiple endpoints, and then sends the data out to the server. Assessing a pull-based approach is one way that you can look at getting closer to cloud native, and this will help you scale in the future. It doesn't mean you have to redo your entire systems. It might just mean changing one or two configurations on the monitoring agent side to say, we only really need you to exist on one host, and you can scrape all the hosts in this region, for example. Assessing the push or pull model will help scale specifically the observability piece of this. That way you don't have to rearchitect or rebuild the entire monitoring system just so that you have more visibility across your cloud environment. As a last resort, rebuild your monitoring system. Sometimes this is not something you can avoid. If you have a lot of older systems in place and you have a lot of bespoke monitoring, sometimes it's better just to rebuild the monitoring system and standardize with these practices in mind.
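The pull model above can be sketched in a few lines of Python: one collector scrapes many endpoints, so adding a host means exposing one more endpoint rather than shipping one more agent. The endpoint names and `fetch` function are illustrative stand-ins for real HTTP scraping.

```python
def pull_metrics(endpoints, fetch):
    """Pull model: a single collector scrapes every registered
    endpoint. New hosts only need to expose an endpoint; no agent
    configuration is pushed out to them."""
    return {url: fetch(url) for url in endpoints}

# `fetch` is injected; in production it would be an HTTP GET against
# each host's metrics endpoint. A stub keeps the sketch runnable.
samples = pull_metrics(
    ["host-a:9100/metrics", "host-b:9100/metrics"],
    lambda url: {"up": 1},
)
```

This is the shape of what scrapers like Prometheus do: the scrape target list changes, not the hosts themselves.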

Immutability - Keep Up with Dynamic Environments

Immutability is the next capability that I think about when it comes to cloud native architecture. Immutability helps you keep up with dynamic environments. To describe immutability, I'm just going to go through this example. Imagine that you want to update Java for an application. One way that you could do it is to log into the server that houses the application, and then update the Java package on there. You run into the danger of the application itself potentially breaking. Maybe other dependencies on the machine rely on Java, and now you've broken all the other dependencies. Now you have no server running and no application running. That actually affects your systems as well as your customers'. In recent years, we've moved more toward the immutability approach, where we deploy a new application binary with a new underlying system with the updated Java. Rather than log in and update Java, what we're doing is we're taking a new application, as well as a new instance of the application's environment, deploying it out, and it has the updated Java. If it does not work, you can always revert to the old version. If it does work, you can simply delete the old version. This helps with the overall reliability of the system.
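The rollout described above can be sketched as a small Python function: build the replacement, verify it, and only then retire the old version. The environment dict and health check are illustrative stand-ins for real infrastructure.

```python
def immutable_deploy(environment: dict, build_new, healthy) -> bool:
    """Roll out by creating a fresh instance with the updated
    runtime, verifying it, and only then retiring the old one.
    An unhealthy new instance never takes traffic, so the old
    version keeps serving and nothing is changed in place."""
    old = environment.get("active")
    new = build_new()
    if not healthy(new):
        return False               # revert: the old version still serves
    environment["active"] = new
    environment["previous"] = old  # kept briefly in case of rollback
    return True

env = {"active": "app-java11"}
deployed = immutable_deploy(env, lambda: "app-java17", lambda v: True)
```

Contrast this with logging into the server and updating Java in place: here the old, working version exists until the new one has proven itself.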

Immutability helps you roll out new resources for most changes. There's a caveat to this. Sometimes there are changes that are related to configuration or secrets. Imagine you need to change the database secret, and you don't really want to do it in the application. You don't want to just deploy a new application with the new database secret, because maybe it's been compromised, and now you're just adding more fuel to the fire. What you'll do instead is signal the application: please reload, because there's some configuration change or some password change. Then it will actually reload. This is not immutable, this is mutable. Some changes are mutable. You have to keep that in mind. There are some changes that can be mutable because you've added a level of abstraction. In this case, because you've shifted configuration and secrets into a different management store, the changes can be mutable on the application side.

From a foundational practice standpoint, it's important to do all of the things you can to be immutable, but know that there are some things that are mutable. Let's think about that. The first is automating infrastructure as code. Infrastructure as code tends to assume that the infrastructure is immutable. There are very few places in which you're updating infrastructure in place when you have infrastructure as code. If you have the automation in place to do that, you get the principle of immutability out of the box, which is nice. Decoupling state and configuration will help you separate the pieces that require mutable changes from the ones that can be handled immutably. What I mean is that if you have data that your application is writing to, or data that your application needs, decouple that from the application that is running. Decoupling state and configuration becomes an important part of cloud native, mostly because your applications will have to be able to adapt and run anywhere, but your data may not. In which case, you might find yourself needing to further decouple the data as well as the state from the application. Decoupling state and configuration is an important foundational step.

The next is reloading and restarting for changes. Not all changes are done immutably. Some of them are mutable; I just covered some of them. Reloading or restarting the application is important. It helps you make a mutable change without necessarily redeploying the application itself. One good example of this is that if you change a database password and you're using Spring Boot, you could use the Actuator API endpoint and basically tell the application: reload everything, reload the database connection string. What this will do is gracefully make sure that the existing connections are shut down before retrieving the new database password, reloading, and reconnecting to the database.
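The pattern generalizes beyond Spring Boot. Here is a minimal Python sketch of a reloadable configuration holder: the running process re-reads its secrets on a reload signal instead of being redeployed. The class and store names are illustrative.

```python
class ReloadableConfig:
    """Mutable-change pattern: re-read configuration on demand,
    similar in spirit to refreshing a Spring Boot service through
    its Actuator endpoint instead of redeploying it."""
    def __init__(self, loader):
        self._loader = loader
        self.values = loader()

    def reload(self):
        """Pull fresh values from the backing store in place."""
        self.values = self._loader()

secret_store = {"db_password": "old-password"}
cfg = ReloadableConfig(lambda: dict(secret_store))
secret_store["db_password"] = "rotated-password"  # rotated out of band
cfg.reload()  # in-place refresh: no new deployment, no new artifact
```

The secret rotation happens in the external store; the application only has to know how to reload, which is what makes this a safe mutable change.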

Finally, optimize provisioning. I can't emphasize this enough. Immutability only works if you can create resources quickly. Sometimes, you will have to wait. For example, if you have large clusters that you're provisioning, it doesn't necessarily make sense to recreate them immutably every time. You sometimes need to get these resources really quickly, from a functional standpoint. You also want to make sure that you're not affecting the system because you're spending a long time waiting for new resources to come up. Make sure you're optimizing provisioning. It's important, at least from an architectural standpoint, to ensure that anything you provision, you can provision repeatedly, but also fairly quickly. Because if something is broken, you'll want to take advantage of a new environment; you'll want to use immutability to create new environments and restore the system. Optimizing provisioning becomes an important piece and practice of that.

Finally, distributing data. Distributing data is a little complicated, but at some point, you'll realize that when you have data in a cloud native environment, you need to figure out what to do with it. That's when people start to move toward different kinds of datastores. They move away from traditional databases toward some kind of distributed datastore or distributed database. What this will do is help treat data immutably without treating the content of the data immutably. You still preserve the data itself, but the way that the data is being handled and distributed is treated in an immutable fashion across your cloud. As the last resort, you then refactor for immutability. This is when you just don't have any other options. Say you don't have infrastructure as code and you didn't treat your infrastructure immutably before; you have existing infrastructure, and now you need to try to manage it a little bit better. In this case, you may have to undergo a significant refactor so that you have new resources that are managed by your infrastructure as code deployment.

Elasticity - Make the Most of Resources

The fourth capability, which is a little bit more complicated to talk about, is elasticity. Elasticity is the ability for you to make the most of your cloud resources. I think this is the hallmark of being cloud native. Is your application elastic? What I mean by that is that most of the time when we talk about moving an application to cloud, we think it's straightforward: ok, I run it in a data center, it's been updated to all its latest versions, now I'm just going to pick it up, move it, and run it in a virtual machine in the cloud. That works, except then you realize that it's quite expensive to run that application on a virtual machine, because what you've done is you've spec'd out the virtual machine to be the same size as the one that's already in your data center. That doesn't necessarily improve your cost. Elasticity is actually about the cost of time. In the cloud, you're getting charged per hour, or per unit, or per run, or by how long that run is taking. Elasticity is about taking advantage of the time that you have for that resource. It is all about time.

What do we mean by trying to optimize the cost of time? We traditionally thought about optimizing cost, as well as taking advantage of elasticity, as the difference between vertical and horizontal scaling. Vertical scaling is a focus on resource efficiency, meaning, if you have an application, you'll give it x number of CPUs and x amount of memory. It works, but we found out quickly that most of the time, we weren't taking advantage of all the CPU or memory. So we thought about going to horizontal scaling. The idea is that we increase the workload density. We have smaller instances scheduled on a lot of different machines, and these smaller instances can do parallel processing so they can better serve requests. It's not necessarily an either/or, but in most cloud native architecture approaches, the general assumption is to do horizontal scaling with increased workload density. Cloud native doesn't always mean horizontal scaling. It's actually pretty complicated, because not everything can be horizontally scaled, but also, you're not going to get the benefit of elasticity just by horizontal scaling.

What do we mean by this? There are a couple of important practices to keep in mind. The first is to evaluate idle versus active resources. It's not necessarily about horizontal versus vertical scaling, many small instances versus fewer large instances. The question is: what is idle versus active? The reason horizontal scaling is so appealing from a cloud native architecture standpoint is that it takes advantage of active resources. It uses as many active resources as possible while minimizing idle ones. There are situations in which you absolutely cannot maximize all of those active resources. In the case of data processing, for example, it might not make any sense to do horizontal scaling. It might just make sense to do one process: spin up a VM for an hour, and then shut it down. That's it. Evaluating idle versus active resources becomes important. For a simpler win from an elasticity standpoint, when you're looking at your cloud environment, understand which resources are truly being used. If things are not being used, you can shut them down. A good example: if you have a development environment and you're only using it on weekdays, maybe shut it down over the weekend. That will save you some money. Evaluating idle versus active resources actually becomes more important than immediately refactoring your applications to handle horizontal scaling.
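The weekend-shutdown idea can be sketched as a tiny scheduling check. This is a hedged illustration, not a specific tool's API: the weekday-only schedule is an assumption, and a real setup would wire this decision into a cron job or cloud scheduler that stops and starts the tagged instances.

```python
from datetime import datetime

# Hypothetical helper: should a weekday-only dev environment be
# running right now? (Schedule is an assumed example.)

def should_be_running(now: datetime) -> bool:
    # weekday(): Monday=0 ... Sunday=6; stop on Saturday and Sunday.
    return now.weekday() < 5

print(should_be_running(datetime(2023, 9, 22)))  # a Friday → True
print(should_be_running(datetime(2023, 9, 23)))  # a Saturday → False
```

A scheduler that runs this check hourly and stops any dev-tagged instance when it returns `False` turns idle weekend capacity directly into savings, with no application refactoring at all.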

The next is optimizing warm-up and runtime. When you have jobs that get processed, or resources that get scaled up and down, they might be virtual machine instances. All of the public clouds now have autoscaling capability, so if you scale up, you can scale down. Optimizing that warm-up becomes incredibly important, because you are, again, charged per hour. If it takes a very long time to warm up those resources, and by the time you run them you're already being charged by the hour, that's not going to be helpful. Instead, you're going to find that you have idle resources; I'd consider them idle because they're not ready for use yet. Optimizing the warm-up and runtime becomes important, and you can use immutability to help with that. For example, if you're currently using user data to configure your virtual machine, it might take, optimistically, 15 minutes, and that 15 minutes might include time to install packages. Instead of spending those 15 minutes at boot, maybe you use an immutable image: you build the virtual machine image with all of those packages already in place, and all you do at provisioning time is add configuration.
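The cost of that warm-up is easy to estimate. The arithmetic below uses assumed numbers (15 minutes of user-data setup versus roughly 1 minute booting a prebaked immutable image, and an assumed 200 scale-up events per month) purely to show the shape of the saving:

```python
# Rough sketch: billed time spent warming up is paid-for idle time.

def wasted_hours(warmup_minutes, scale_events):
    """Total billed hours spent warming up across all scale events."""
    return warmup_minutes / 60 * scale_events

events_per_month = 200                           # assumed autoscaling activity
user_data = wasted_hours(15, events_per_month)   # install packages at boot
immutable = wasted_hours(1, events_per_month)    # prebaked immutable image
print(round(user_data - immutable, 1))           # → 46.7 hours saved/month
```

Even with modest autoscaling activity, moving package installation out of boot time and into image build time recovers dozens of billed hours a month.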

The next is assessing volume versus frequency. The way I best describe this is that in the data space, over the past couple of years, I've noticed folks using AWS Lambda functions to process data. That works fairly well. What they're doing is saying, I need to create x number of Lambda functions whenever something comes into the queue, and it's good. The functions do process the data, and for the most part, the price point is ok. Then there comes a point when you have a lot of jobs, or a lot of data to process. The volume of that data no longer justifies using Lambda. It becomes more expensive, especially if the functions need private IP addresses in the virtual network, because it takes time to free up that interface. While you're not necessarily getting charged directly for it, you may find that Lambda is waiting just to get allocated those IP addresses in a private network. By that point, you might as well switch to some other formalized data processing tool, whether it be EMR or something else. The idea is that you want to assess whether the volume and the frequency at which you are creating these resources outweigh just having those resources exist and be in place. This is the same reason for maybe having a pool of Kubernetes cluster workers available. Sometimes you just need that on hand, and it does justify the cost. Assessing volume versus frequency becomes an important part of optimizing for elasticity and making sure you make the most of your cloud resources.

Then, as a last resort, you rebuild to mitigate cost. Refactoring to mitigate cost is a very costly effort, and it takes time. It also often involves adapting to new abstractions. If you're working with a system that was originally not meant to run on the cloud, and now you're trying to mitigate its cost because you did a lift and shift, you will have to replatform. You will ultimately have to rebuild and rearchitect in order to fully mitigate the cost of the lift and shift, and find a technology that works for your architecture and the functions that you want. That is a last resort.

Changeability - Use the Latest Technologies

Finally, changeability. We all want to use the latest technologies, but it's not that easy to change an application to use them, so how do we do it? Let's imagine this example; I've been hearing a lot about this recently. You have a CI framework, and you've started looking into the cloud native technology of GitOps. I think it's a really fascinating space, and it's really helpful for certain use cases. You decide, this is really useful for me, I want to continuously deploy. I want to take advantage of blue-green deployments, and I want them to be completely automated: my canary will increase traffic automatically, and I don't have to worry about it. That's great. The only problem is that the latest often involves paradigm shifts. Telling your security team, or your management team, or anybody, that they're going to remove a manual quality gate to production and automatically do this wishy-washy blue-green thing is a really hard sell. I think it's really cool, but what I've noticed is that it's not so easy to describe. Most people wonder, what happens if it fails? The answer is that most of these technologies, especially in the GitOps space, will roll back for you. On the other hand, it's not so easy to just trust that assumption. The latest often involves paradigm shifts.

Instead, what you might think about doing is changing not the tool, but the paradigm first. Rather than saying, I have a CI framework and I'm doing continuous delivery now, what you might consider instead is: ok, in order to take advantage of all of these cloud native benefits, I'm going to take an intermediate step. I'm going to do some modified continuous deployment on my CI framework. That could be, perhaps, Spinnaker. You might say, I'm just going to use Spinnaker first, and I'll do a manual deployment and a manual check. Once I'm comfortable with the canary and the blue-green approach, then maybe I'll shift to a more continuous deployment approach, or even to the GitOps approach. There are a lot of intermediate options, and these intermediate steps exist for change. It's not to say that you should immediately adopt a tool and that it will immediately help you. The reality is that most of the time when you're looking to go from cloud hosted to cloud native, you'll need to take an intermediate step, especially if you plan on such a drastic change that it alters the underlying assumptions of how your application behaves and how it's being deployed.

Of course, the first thing you're going to do is assess the benefit, but from a more nuanced perspective: you want to assess the benefit of changing the paradigm, of changing the assumption itself. The other thing you'll want to do is review all the previous patterns I talked about, all the foundational practices: immutability, adaptability, and so on. Verify that you have those foundations in place; if you don't, it will be incredibly difficult to change the tool afterward. Then, choose an intermediate step. In the case of the CI framework, maybe we modify it to do a manual blue-green. Once we're comfortable, then we can move toward an automated blue-green deployment, or an automated canary deployment.

Finally, refactor your application or infrastructure to accommodate that intermediate step. In the case of the CI framework I was talking about, moving toward a continuous deployment approach, you will need to refactor the application to expose metrics. If your application does not have metrics, continuous deployment will not work, because an automated deployment needs somewhere to retrieve the application metrics. You need either an error rate or a composite metric to understand how the application is behaving, and whether it's failing or succeeding. Without that metric, you cannot do an automated deployment, and for the most part, it's also hard to determine whether you should increase traffic to that new application. You do have to refactor the application or infrastructure to provide that metric; otherwise, it will stop you from adopting the next latest and greatest approach.
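The decision an automated rollout makes from that metric can be sketched very simply. This is an illustrative example, not any specific tool's API; the error-rate threshold is an assumption:

```python
# Hypothetical sketch: an automated canary/blue-green rollout reads an
# application metric and decides whether to shift traffic or roll back.

def canary_decision(error_rate: float, threshold: float = 0.01) -> str:
    """Promote the canary only if its error rate stays under threshold."""
    return "promote" if error_rate < threshold else "rollback"

print(canary_decision(0.002))  # healthy canary → promote
print(canary_decision(0.05))   # failing canary → rollback
```

This is the whole reason the metric is a prerequisite: without an error rate (or a composite health metric) to feed a check like this, the tool has no basis for promoting or rolling back, and the deployment cannot be automated.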

After that, if you find that you're having a really difficult time with that intermediate step, that it's not sufficient to graduate you to the latest and greatest technology, build a greenfield environment. It's not easy; it's going to be a very high level of effort, so you do have to assess whether it's high value as well. Most of the time, if you've done these basic steps, you've taken an intermediate step, it's still not quite working, and you still find that you need to go to this new technology because it has a significant business benefit, then you build a greenfield environment.


Returning to the definition of cloud native architecture: I've gone through a lot of foundational practices that move you from cloud hosted to cloud native. The focus of these practices is not necessarily to say that you will be cloud native. The focus is that you're going to have a more scalable application, and you're going to take advantage of the dynamic environment in place. As these technologies change, year over year, even month over month, you have the ability to change with them. You can adapt with them. In summary, here are the considerations. Again, remember that if you want to get to a cloud native architecture, consider adaptability, observability, immutability, elasticity, and changeability. All of these will contribute to a more cloud native approach and help your application better adapt to changes in your architecture.
Recorded at:

Sep 22, 2023