Key Takeaways
- Cloud Native Architectures enhance our ability to practice DevOps and Continuous Delivery, and they exploit the characteristics of Cloud Infrastructure.
- Instead of trying to "get better at predicting the future", we want to build a *system* that "gets better at responding to change and unpredictability."
- This cloud-native thing is NOT a big-bang approach and doing so will backfire incredibly hard.
- If you've not made it economical to work in small batches, get feedback, etc., you've probably increased your cost of making changes.
Can you describe what it means to be cloud-native? Does it require that your apps and infrastructure are born in the cloud? What’s the balance between technology and methods? In this panel, InfoQ reached out to three industry experts to learn more about the reasons to become cloud-native, and the approach to follow. Our panelists include:
- Christian Posta – Chief Architect at Red Hat and author of the book Microservices for Java Developers.
- Kevin Hoffman – Engineer at Capital One and author of the new book Building Microservices with ASP.NET Core.
- Matt Stine – Global CTO, Architect at Pivotal and host of the Software Architecture Radio podcast.
InfoQ: Let's set the table. How do you define cloud-native, and why should anyone care about creating apps this way versus however they've been doing it up until now?
Stine: We’re trying to bring a perceived conflict into balance: software-driven business agility vs. software system resiliency. We want to move fast and yet not break things. In order to do this, we're going to change how we build software, not necessarily where we build software. The key changes to how are found in DevOps and Continuous Delivery, which I defined specifically in my book.
DevOps represents the idea of tearing down organizational silos and building shared toolsets, vocabularies, and communication structures in service of a culture focused on a single goal: delivering value rapidly and safely. This is the cultural aspect of moving fast and not breaking things. Continuous Delivery represents the idea of technically supporting the concept to cash lifecycle by proving every source code commit to be deployable to production in an automated fashion. This is the supporting engineering practice of moving fast and not breaking things.
So, by creating a DevOps culture and employing Continuous Delivery engineering practices, we’re bringing Agility and Resiliency into balance. But, obviously, we need a place to do this: "cloud." I define cloud as any computing environment in which computing, networking, and storage resources can be provisioned and released elastically in an on-demand, self-service manner. In the Pivotal ecosystem, Pivotal Cloud Foundry and its surrounding ecosystem of services (like Spring Cloud Services and Pivotal Cloud Cache) become the substrate on which we practice software engineering. Its API-driven capabilities, which drive fast and elastic orchestration of workloads, are the technical toolkit enabling DevOps and Continuous Delivery.
Which brings us to architecture. Architecture is the process by which we make software engineering decisions that have system-wide impact and are often expensive to reverse. Our architectural decision making directly impacts our success or failure. We can express that impact three ways:
- Architectural decision making can enhance or detract from our ability to practice DevOps
- Architectural decision making can enhance or detract from our ability to practice Continuous Delivery
- Architectural decision making can exploit or waste the characteristics of Cloud Infrastructure
Cloud Native Architectures enhance our ability to practice DevOps and Continuous Delivery, and they exploit the characteristics of Cloud Infrastructure. I define Cloud Native Architectures as having the following six qualities:
- Modularity (via Microservices)
- Observability
- Deployability
- Testability
- Disposability
- Replaceability
Obviously, I could go on from here, but I think I've probably already exceeded your word limit. ;-)
Posta: Before answering what is "cloud-native", let me explain why anyone should care about creating apps "this way" vs "past approaches" by setting some context. I think "past approaches" (vague, I know) can be summed up as "get really good at predicting the future". This is the crux of the discussion. In the past we spent incredible time, energy, and money trying to "predict the future":
- predict future business models
- predict what customers want
- predict the "best way" to design/architect a system
- predict how to keep applications from failing
etc, etc.
Instead of trying to "get better at predicting the future", we want to build a *system* that "gets better at responding to change and unpredictability". This system is made up of *people* and is delivered through technology.
How do we predict what customers want? We don't; we run cheap experiments as quickly as we can by putting ideas/products/services in front of customers and measuring their impact. How do we predict the best way to architect a system? We don't; we experiment within our organization and determine what fits best for its employees/capabilities/expectations by observation. How do we predict how to keep applications from failing? We don't; we expect failure and architect for observability so we can quickly triage failures, restore service and leverage chaos testing to continuously prove this, and on and on.
"Cloud native" is an adjective that describes the applications, architectures, platforms/infrastructure, and processes, that together make it *economical* to work in a way that allows us to improve our ability to quickly respond to change and reduce unpredictability. This includes things like services architectures, self-service infrastructure, automation, continuous integration/delivery pipelines, observability tools, freedom/responsibility to experiment, teams held to outcomes not output, etc.
Hoffman: I don't really think I could've said it any better than Christian. That's a fantastic answer.
InfoQ: Are containers an essential part of a cloud-native approach? Why or why not?
Posta: Not sure I agree with the framing of the question to be honest. I'd prefer to think about "what's essential" in terms of capabilities and practices in a "cloud-native approach". For example, what are the capabilities and practices needed to make it economical to work in small batches, test hypotheses and learn? Capabilities like cheap application builds, safe deployments, automated management of deployments (start/stop/scale/health checking, etc.), security, distributed configuration, traffic routing, and others. Containers play a foundational role in implementing these capabilities and form the basis of technology/platforms that provide these capabilities, but the capabilities are what's important.
Stine: I will piggyback on Christian's answer a bit here, as I think he's going down exactly the right path. If you look at how we've both framed the notion of cloud-native, neither of us has focused so much on technology as we have on enabling hypothesis-driven business models and rapid feedback/response cycles. What we need is the appropriate level of platform abstraction to do that economically.
As we've transitioned from physical hardware to virtual machines to containers, we've reduced the cost per unit value of deployments, and that's enabled us to create software platforms with increasing capability to deliver on the promise of rapid evolution of software. Containers have certainly played a huge role in the current evolution of platform building.
But containers, in many ways, should be an implementation detail of the software platform. As a developer, my focus should be as much as possible on fulfilling business needs in a resilient and flexible way. Containers have nothing to do with this. Developers rightfully ought to be largely unaware of them, even if every program they write runs within one.
And we will continue to climb this abstraction ladder. The poorly named serverless ecosystem shifts our attention to individual functions as a unit of deployment. Other than being aware of the language runtime your function targets, you are almost oblivious to what the underlying compute hosting looks like. Workloads that "don't need to know" shouldn't have to care. And so, part of our responsibility then is to negotiate the tradeoffs of constraints vs. promises of living at potentially multiple levels of abstraction: infrastructure, containers, applications, and functions.
Hoffman: I get asked this question all the time, and my answer usually starts with "you're asking the wrong question(s)". When we look at some of the technology-independent requirements to achieve the benefits usually associated with cloud native, we see things like: rapid and automated deployment, ability to seamlessly scale horizontally, fault tolerance, fast startup and shutdown times, and environment parity.
Digging deeper into that list of requirements, we see that many of those rely on a lower-level requirement - the use of an immutable release artifact. It is this single, portable, immutable artifact that can be shipped between VMs dynamically and can run in all of our environments without change; that can be automatically scheduled and launched wherever we want, whenever we want; that facilitates many of our cloud native requirements.
So, I would say that reliance on immutable release artifacts forms the foundation for the building blocks that support cloud-native as we think of it today. The fact that containers (Docker or others) are what we use for our immutable release artifacts is an implementation detail, and merely a means to an end, and not the end goal itself.
InfoQ: What are some examples of clear anti-patterns if you're trying to develop in a cloud-native way? What are things that might not be anti-patterns, but something to be wary of?
Posta: So, this is an area I can rant and rave for pages, but I'll just stick with the handful that popped into my head at this moment. Not sure which count as anti-patterns, but they're all definitely things to be wary of. First is the desire to "build a platform". We're all excited about technology like Docker or Kubernetes, etc. and we want to be infrastructure rockstars (I guess?). But taking technology like Docker and Kubernetes and spending months operationalizing them yourself to build a platform is undifferentiated heavy lifting. Even if you do operationalize a platform, the communities and underlying technologies are changing so rapidly you have to ask yourself: "am I an infrastructure platform company? or am I a retailer/finance/insurance/etc company? Am I willing to invest multiple head counts into each respective open-source community, engineering teams to integrate these things, QA teams to test and harden, and product management, architect resources to do this?" Or, can you rely on companies who specialize in bringing this cloud-native infrastructure to the enterprise (like Red Hat, or Pivotal, etc.) and use that relationship to accelerate your cloud-native strategy? I've run into quite a few teams who went the DIY approach, incinerated tons of money/resources only to end up on a completely unsustainable path going forward and no closer to their business/strategic objectives.
Second is the desire to adopt new technology but not change anything about how they work. A cloud-native approach marries technology capabilities with changes in how your teams work with each other. For example: if you're just going to adopt a cloud-native platform and continue to work in the "dev toss code over to ops", what have you really changed about your capabilities to scale and go faster? What have you really changed about your ability to learn and experiment? In fact, if you've not made it economical to work in small batches, get feedback, etc, you've probably increased your cost of making changes. We need to evolve our practices toward a high-trust, high responsibility environment where we expect failure and learning. This starts at the top.
Another thing I'm seeing is complete and utter fear about a perceived big-bang adoption of cloud-native approaches: "containers, Kubernetes, PaaS, Spring Boot, ci/cd, transactions/compensations/sagas, Domain Driven Design, data replication, event-driven architecture, Kafka, rolling deployment, blue/green, messaging, integration, APIs, circuit breaker, monitoring, tracing, logging, metrics, observability, old tools/new tools and crap! I have to be in production in 4 months! Aaaaaaaaahhhhh". Yah. Let's just stop. This cloud-native thing is NOT a big-bang approach and doing so will backfire incredibly hard.
Other patterns to be wary of: Things like canonical data models!? I thought we learned our lesson about this. Or things like expecting our bespoke infrastructure to provide things like distributed transactions across boundaries, exactly once delivery, etc. Go check out the end-to-end argument paper and let's chat in the morning. And this thing about "sure, let's just have nice long chains of non-directed cyclic graph call structures between our services because microservices". That's sadness all around. And lastly, not diligently aiming to remove any/all manual steps when it comes to our deployment pipeline.
Stine: Ok, I'll play. Christian hit some of the more obvious ones. I love his take on what I like to call "starship building." And a deep consideration of changes to a company's system of work is also crucial.
I'll throw a couple of my more frequent anti-patterns or maybe cautionary tales into the mix.
The first is a lack of consideration of decomposition strategies. I wrote on this topic. Once an organization agrees with the value proposition behind microservices architecture, and they start making that move, they very often do so with either a confusing strategy or a lack of strategy at all.
Probably the worst adoption strategy I've seen is over-decomposition within single teams/applications. This manifests as a single, often large team that takes what would have originally been a monolithic application and develops it as a distributed monolith. The team creates ten or more microservices and then proceeds to orchestrate the testing and deployment of the entire fleet every time a single microservice changes. This team is paying all of what I call "the distributed system tax" while reaping none of the value. I've asked such teams what value they're getting from their adoption of microservice architecture, and they can rarely provide a concrete answer. When I follow up with "then, why are you doing it?" the most common response is "well, we were told we needed to write microservices!"
I wrote about this problem after my one viral tweet. It's a very real problem, and it can quickly get out of control. Once I was speaking at a large company's internal developer symposium. During the Q&A, I received this question: "We have around 50 microservices inside of our application, and new ones are appearing every day. It's getting really difficult to manage. What do you think we should do?" My quick response was "well, first thing is you should probably write fewer microservices!" After the symposium ended, one of the event's sponsors filled in the rest of the story: "As it turns out, that person's org has been incentivized to create as many microservices as possible so that their leader can demonstrate how on board they are with our modernization initiative!"
The best advice I can offer here is to consider why you're adopting microservices in the first place. As Christian has said, it's all about your ability to go faster, innovate, and experiment. Adopting microservices in the way I've described is usually going to create the opposite experience. One useful way to look at this is that organizations create microservices, while teams create well-factored monoliths. The primary reason we want to decompose into microservices is to support individual teams, business owners, and value streams evolving autonomously. While this doesn't necessarily mean one team per microservice, the numbers shouldn't be radically out of proportion.
One more thing I'd like to throw out there is proper consideration of quality requirements. I see a lot of teams that are micro focused on availability concerns. Many times, an organization that has had availability challenges in the past will overcorrect. They will overengineer architectures, including far too much redundancy and geographical distribution, and far too many fault tolerance patterns. I'll describe this as "availability by sledgehammer." On the other end of the spectrum, you'll see teams that observe how easy it is to layer in a circuit breaker with Spring Cloud, and they'll think adding the @HystrixCommand annotation makes them immediately fault tolerant.
In both situations, you tend to have a focus on solutions without a clear understanding of the problem. When I ask such teams what their availability requirements are, I get vague answers like "the system can never go down" or "all of our services require five nines." What they've failed to understand is that achieving availability like that is incredibly difficult and costly, and they almost certainly haven't done it. What's worse, they almost certainly don't need it.
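To make the cost of a target like "five nines" concrete, the downtime it permits can be worked out with simple arithmetic. The sketch below is illustrative only and is not from the panel: the function name `downtime_budget_minutes` and the 365-day year are assumptions.

```python
# Illustrative arithmetic only: how much downtime an availability target allows.
# Assumes a 365-day year; the function name is hypothetical.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by the target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, while two nines leaves more than three and a half days. Putting the numbers side by side is one way to start the conversation about which services actually need which target.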
At Pivotal we talk a lot about building product management discipline into the business, and part of that is a proper consideration of business needs and priority. Part of that consideration is quality requirements. As we work with the business to write things like user stories, we should be writing stories around things like availability, scalability, security, and performance. We should ask the business "what should the behavior be when business process component X experiences a failure?" We should then properly prioritize all of these stories in context with the business logic stories. At that point, we start to understand how to architect the system with appropriate tradeoffs considered.
Hoffman: I agree with all the anti-patterns discussed so far. One of the ones that I see most often is what I call "buzzword shopping". The fallacy at the core of this is the idea that a given technology is cloud native or not. It isn't the technology that is cloud native, it's how you use it. Architects, developers, and even C-level executives will take their shopping cart through the technology grocery store and throw everything in it that claims to be cloud native. When they get home, they'll throw it all in a blender and hope that it turns into a beautiful cloud native soufflé ... but that's not how it works. That's not how any of it works.
The goal is to start by looking at what you're doing, why you're doing it, and what you hope to gain from doing it. Without knowing these as your foundation, you can't end up with a truly cloud native solution... at best you'll end up just hosting a poorly made "technology stew" in the cloud.
InfoQ: Do cloud-native apps change the nature of operations and systems management? If so, how?
Stine: Sure they do. We've all called out the antipattern of the "throw it over the wall" approach to development and operations. The cloud-native approach is far more about "how" than it is about "where." A key part of the new "how" is how we approach operations and systems management.
There are a couple of different ways I like to look at this. One is a horizontal spectrum of operational approaches. On one end of that spectrum is what I'll call traditional ops, which is closest to the idea of throwing apps over the wall. On the other end of that spectrum is the "you write it you run it approach." For a few years now I've skewed my advice toward you write it you run it, if for no other reason than that it forces the development team to focus on operational reality. You're going to write different (and hopefully better!) code if you have to eat lunch across from the person that got paged this morning at 3 AM! But more importantly, if you are in charge of the operational lifecycle of your application, the ability to move faster and conduct experiments is entirely in your hands.
In the last year, I've become very interested in an approach that skews toward the center of that spectrum, and that is what Google calls Site Reliability Engineering (SRE). They recently published a book by the same name that's freely available on the web. In the SRE world, there is a separate team of software engineers that also have strong operational skills. These SREs take charge of operating applications as long as they meet an ongoing set of quality requirements, including a 50% time cap on operational toil. Toil is defined as manual, repetitive, automatable, and tactical work that has no enduring value and scales linearly with service growth. Ongoing engineering work should always be dedicated to driving the percentage of toil down. By driving toil down, SREs are able to focus at least 50% of their time on engineering work mostly related to quality requirements.
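As a rough illustration of the toil cap, a team could total its logged hours and check the toil share against the 50% threshold. This is a minimal sketch under stated assumptions: the category names, the time-log format, and the helper functions are all hypothetical and not taken from the SRE book.

```python
# Hypothetical sketch: checking a team's time log against a 50% toil cap.
# The categories and the log format are assumptions for illustration.
from __future__ import annotations

TOIL_CAP = 0.50  # the cap described in the text


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours spent on work classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


def within_cap(fraction: float, cap: float = TOIL_CAP) -> bool:
    """True when toil stays at or below the cap."""
    return fraction <= cap


if __name__ == "__main__":
    week = {
        "manual restarts": 10,
        "ticket triage": 8,
        "automation work": 14,
        "capacity planning": 8,
    }
    share = toil_fraction(week, {"manual restarts", "ticket triage"})
    print(f"toil share: {share:.0%}, within cap: {within_cap(share)}")
```

The hard part in practice is deciding which work counts as toil, not the division. The sketch only shows that once the work is classified, the cap becomes a number a team can track over time.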
There's also a vertical layered approach to operations. We've spoken at length on the necessity of developing and operating applications on a platform. That platform is properly consumed by application developers and operators via an explicit contract, usually exposed as an API and service level objectives (SLOs). These application developers and operators should not have to be aware of or care about anything that goes on below that API and set of SLOs as long as the contract continues to be met. That said, development and operations definitely occur below that API; it just happens to be the purview of the platform developers and platform operators.
These four different role sets:
- application developer
- application operator (or SRE)
- platform developer
- platform operator (or PRE)
when combined with product ownership and management of each vertical layer, comprise what we've started calling "the balanced platform product team."
Posta: Yes. To quote the wonderful Charity Majors, "what got you here won't get you there". Cloud-native architectures and applications do change the nature of operations and systems management. Matt hits some great points in his answer. The one that I always keep coming back to is the idea of rapidly making changes and measuring their impact.
We should be able to deploy as quickly as the developers/feature teams want and have some controlled way of releasing the software to our users. (Note the explicit difference between a deployment and a release: a deployment gets a new version of code into the production environment; a release actually brings live traffic to the deployment. The awesome folks at Turbine Labs have a blog that goes into this a bit more.) This decoupling of deployment and release is very different from the way most teams work. Typically, a deployment == a release and this becomes very risky, and indeed is why there is probably a lot of red tape, approval process, etc. that goes into getting code into production.
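The distinction between deploying and releasing can be sketched in a few lines of code. This is a minimal illustration, not any real product's API: the `Router` class and its `deploy`, `release`, and `route` methods are hypothetical names, and a real platform would do this at the proxy or load-balancer layer.

```python
# Minimal sketch of separating "deployed" from "released".
# Router, deploy, release, and route are hypothetical names for illustration.
from __future__ import annotations

import random


class Router:
    def __init__(self, live_version: str) -> None:
        self.live = live_version           # version serving normal traffic
        self.candidate: str | None = None  # deployed, but not necessarily released
        self.candidate_share = 0.0         # fraction of live traffic sent to it

    def deploy(self, version: str) -> None:
        """Put a new version into production with no live traffic."""
        self.candidate = version
        self.candidate_share = 0.0

    def release(self, share: float) -> None:
        """Send `share` (0.0 to 1.0) of live traffic to the deployed candidate."""
        if self.candidate is None:
            raise ValueError("nothing has been deployed")
        if not 0.0 <= share <= 1.0:
            raise ValueError("share must be between 0.0 and 1.0")
        self.candidate_share = share

    def route(self, rng: random.Random) -> str:
        """Choose which version handles a single request."""
        if self.candidate is not None and rng.random() < self.candidate_share:
            return self.candidate
        return self.live


if __name__ == "__main__":
    router = Router("v1")
    router.deploy("v2")   # v2 is in production, but users never see it
    router.release(0.05)  # now roughly 5% of requests reach v2
    rng = random.Random()
    print([router.route(rng) for _ in range(10)])
```

In this sketch, rolling back a bad release is just `release(0.0)`. The code stays deployed while users stop seeing it, which is what makes frequent deployment less risky.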
But more importantly, how do we measure the impact of the deployment and/or release? There are a handful of things we want to measure: SLA/SLOs like Matt mentions. We also want to measure the intended business effect of a change. Did adding recommendations based on our new super-duper prediction algorithms lead to more products added to shopping carts? We want to position our changes as hypotheses and then be able to ask questions about their effects. Just as important as the business-level impacts, we want to observe what happens with the distributed systems interactions and health of our customers' experiences. Did this new release increase latency in our checkout process? Are some customers' shopping carts not showing correctly while some are? Is a particular topic in our Kafka cluster not dispatching messages quickly enough? We should be able to ask questions about our infrastructure, services, and user experience to get a deeper understanding of it and compare to what we're expecting/hoping. In the past, we've tried to predict what proxy values to "monitor" and paste those up on a dashboard. But that's, again, part of the difference between the "old way" and the "cloud-native" way: we cannot possibly predict certainty of any of these dimensions -- we need to get better at operating our services and systems in the inevitable environment of confusion, contradictions, and uncertainty.
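A question such as "did this release increase latency in our checkout process?" can be phrased as a check against observed data. The sketch below is one hypothetical way to do that: the function names, the nearest-rank percentile method, and the 10% tolerance are assumptions for illustration, and real observability tooling offers far richer ways to slice the data.

```python
# Hypothetical sketch: stating a release's expected effect as a checkable hypothesis.
# Function names, the percentile method, and the tolerance are illustrative choices.
from __future__ import annotations

import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_regressed(
    baseline_ms: list[float],
    candidate_ms: list[float],
    pct: float = 95.0,
    tolerance: float = 0.10,
) -> bool:
    """True when the candidate's percentile latency exceeds the baseline's by more than the tolerance."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(candidate_ms, pct) > limit


if __name__ == "__main__":
    before = [120.0, 130.0, 125.0, 140.0, 135.0]
    after = [122.0, 128.0, 190.0, 210.0, 131.0]
    print("checkout latency regressed:", latency_regressed(before, after))
```

The comparison is framed around a question asked of the data after the change, not a dashboard threshold fixed in advance.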
InfoQ: It's a reasonable assumption that most enterprises have data centers full of traditional, non cloud-native apps. You've all written about decomposition strategies, but what do you suggest companies do if they want to upgrade some of these apps to cloud-native?
Posta: Each company is different and will have different tolerances and appetites for how they wish to approach modernizing applications. Just copying what we see one company do will not get us to where we want to be.
The discussion should involve taking a step back and categorizing initiatives into buckets that align with the IT portfolio strategy. The three buckets that I use are the MVP/Innovation/Exploration applications, the applications/services with high growth expectations, and finally the slow(er)-growth, existing flagship application/services. Each bucket requires different solutions, different measurement heuristics, and different management approaches. In the rush to "cloud-native" or "microservices" everything, we should take a strategic approach. I write about this in more detail in a blog post "About when not to do microservices."
After having a strategy/shared understanding about the initiatives, we need to explore the value stream that is part of each initiative's ability to deliver. Typically having a good, mapped-out idea of the value stream (the steps needed to deliver an initiative) can highlight areas that need to be improved (or whether a particular initiative is even applicable for improvement/modernization). For example, in our value stream, if our application architecture is not the bottleneck for going faster (deliver faster iterations), then no matter how much we optimize for that (microservices, cloud-native, etc) it's not going to improve our ability to deliver. Similarly, if we map out the value stream for a particular initiative we may find that certain modernization efforts may not have as big of an impact as, say, a process improvement, etc. This is crucial in my opinion -- because otherwise, as excited developers/architects armed with new technology, we want to re-write/re-architect the world. This may not be the most effective approach.
Lastly, depending on which bucket the application/initiative falls in, we may wish to 1) explore improvements to the existing architecture, 2) add to/complement the existing architecture, or 3) do a complete/partial re-write to a more optimal architecture. This is where a combination of methodologies like Improvement Kata, Impact Mapping, Event Storming, Domain Driven Design, etc. come into play. Just like Kevin pointed out toward the end of his answer, just as important (if not more?) as the methodology is the measurement of these impacts. Douglas Hubbard wrote an incredibly amazing book, "How to Measure Anything," whose strategies should be at the forefront of everyone's mind as we embark on this cloud-native dance.
Hoffman: During my tenure with Pivotal as well as with several other companies and on my own time, I've routinely encountered this scenario. Again, we need to take a step back and ask what the reasoning behind the migration is. Some enterprises have legacy applications that they simply need to move out of a data center and into the cloud for financial reasons. These apps might be feature frozen and are merely on "life support". In cases like this, it might not make any sense to make the application "cloud native" when simply getting it to run in a suite of containers might be enough to get the job done. This is often referred to as "lift and shift".
There's a middle ground between "lift and shift" and true cloud-native. Here companies started out with the idea that they would simply lift and shift and get it running in the cloud, but in the process of stuffing everything into a stack of containers, they invariably discover some things that just don't work properly in the cloud. The older the software is (and the more host system requirements it assumes), the more likely it is to fall into this category. In situations like this, we often have to decompose the vulnerable parts of the monolith just to get the application to run in the cloud at all. Companies need to make a value judgment to determine whether it’s in their interest to go fully cloud native here, or "get it running" and set a migration plan to continue the decomposition over time.
As for specific strategies for migrating an application, there are countless and there is no single strategy to rule them all. Some that people have a lot of success with include techniques like event storming and identifying bounded contexts to decompose into microservices, but the real questions that need to be answered are - what do you hope to gain from going cloud native? How will you measure those gains? What features do you think you can add to your app once it's cloud native that you couldn't before?
Focusing on specific milestones on your iterative journey to cloud native will help keep things agile and ensure that you're not locking up all of your development resources for years at a time, crippling your ability to continue to build new products while you migrate legacy ones.
About the Panelists
Christian Posta (@christianposta) is a Chief Architect at Red Hat with 15+ years of experience building and designing highly scalable, resilient, distributed systems. He recently authored the book "Microservices for Java Developers" for O’Reilly and Red Hat.
Matt Stine is a 17-year veteran of the enterprise IT industry, with eight of those years spent as a consulting solutions architect for multiple Fortune 500 companies, as well as the not-for-profit St. Jude Children’s Research Hospital. He is the author of Migrating to Cloud-Native Application Architectures from O’Reilly, and the host of the Software Architecture Radio podcast.
Kevin Hoffman is a Digital Innovation Lead Engineer for Capital One where he is responsible for designing, implementing, and socializing modern, cloud-native systems. This includes everything from the choice of programming language to high-level patterns and CI/CD practices. Prior to that he worked for Pivotal where he would embed with customers, teaching through pairing and implementation to migrate legacy monoliths to the cloud. He has been programming for [REDACTED] years and is a distributed systems zealot, polyglot, author of fantasy books, and builder of video game worlds.