Transcript
Today I'm going to talk about what happens when your microservice applications evolve, get connected with other systems, and create what we call a "service landscape." It's driven by organic, federated growth, which means each team has its own business objectives and deploys new systems fairly organically. If you need something, you deploy it. Eighty percent of whatever you need is already in the enterprise; you're just talking to it. There isn't so much a blueprint, or a full-on architecture. It has a couple of interesting characteristics that I'm going to discuss. I want to emphasize upfront that organic, federated growth is an important, good thing. That's what gives you the speed to build on top of existing things. It also requires a totally different approach to runtime control in operations.
To illustrate that, I want to start off with an illuminating story from old-school operations in the airline industry. In July 2016, Southwest melted down. What happened is this: one router out of 2,000 in a Dallas network operations center just failed. It failed in a weird way, of course. It was heavily monitored and all the lights were green. Except that billions of packets were going in and zero packets were coming out.
What happened is that operations scrambled, of course. It took them 30 minutes to discover what was going on, followed by 12 further hours of rebooting adjacent systems that had gone out of sync, couldn't catch up with what had happened, had stale data, these kinds of things. Meanwhile, all the other systems that relied on those systems were down. Flight crews couldn't board because the data that tracks how long they had been in the air was not available, for example. It was a remarkable chain of events with a pretty devastating impact: one router, five days of downtime, hundreds of cancelled flights, and $80 million of direct losses. Almost $3.5 billion was wiped off the market cap.
Lessons Learned
There are two important learnings here. Number one is, Southwest was monitoring at the wrong level. Their visibility was at the wrong level. All these green lights didn't mean anything. The second learning is that they needed to have control over system interactions. If they could have seen the incident earlier with visibility at the correct level, seen all these packets coming in and nothing coming out, they could have pushed back on the requesting systems until that was resolved. None of that would have happened. Both things are important: being able to detect really quickly and also having the ability to push back.
That's what I'm going to talk about: the criticality of mission control in complex service landscapes, complex, composite architectures. I'll explain the characteristics of service landscapes and how their stability and security concerns differ from those of normal applications, why we are facing this new reality, and what some of the common strategies to cope with complexity in service architectures are. I'll explain why those strategies don't really work for composite architectures, how you can remediate reactively with patterns, and how you can proactively create new use cases and help yourself move forward.
Background
A little bit about myself: I am the CEO and Co-founder of Glasnostic. My previous company was a platform-as-a-service and became Red Hat OpenShift. At Red Hat we were fully focused on building applications and how to ideally support that. I learned two interesting things in that time. First, technically, all these applications were not applications at all. They were all systems of applications with a tremendous amount of complexity. Second, the systems of applications that succeeded did so not because they were well-engineered, but because they were well-operated. They had a great operations team behind them, the right levers, the right knobs. At Glasnostic, our vision is to bring runtime control to enterprises with large service landscapes.
The Agile Operating Model
Why is this a new reality? How did this service landscape develop? I'll start with the new agile operating model. We are all working in small, self-managing, autonomous teams. We have rapid decision and learning cycles and parallel deployments. At the same time, we can really benefit from a vast cloud ecosystem with hundreds of Lego blocks that we can readily use. Then there's cloud-native technology. A lot of forward movement is happening. That has a profound effect on the architecture side, which has evolved from microservices, to shared services, to organic, federated growth, and finally, the service landscape.
This has also driven changes on the control layer that governs how we connect those systems. That's evolved from old-school enterprise integration to free-flowing APIs, to middleware gateways, and most recently, service mesh. The same goes for the operations side, where we have evolved from old-school pushing of boxes to DevOps, then SRE, and then what we call mission control operations. This is not just a set of individual evolutions. It is truly a new operating model, where all these layers work together to support the agility of the enterprise. That's what we're seeing more and more. These architectures compound, they become composites. They grow organically, relentlessly.
I want to focus now on the service landscape, which is any architecture that evolves through organic, federated growth. Here's an example. If you take a microservice application that becomes useful to other teams, it gets other things bolted onto it: maybe a mobile gateway, another partner integration, and other applications that use its services. More applications are built on top of this application. New services get added, too, because now each service has a number of dependencies and a number of other services that rely on it. You start building out services, and different versions of services, next to each other. That is really what organic, federated growth is.
Security: Evolving Topologies, Ephemeral Actors
That has three really interesting characteristics. Number one, on the security side, there's a total loss of perimeter. The architecture changes all the time. There's a loss of a blueprint. I can't necessarily base my security policies on what I know about the architecture, because that knowledge may no longer hold tomorrow or next week. That's a fundamentally new challenge. Then, of course, these architectures have complex and very disruptive emergent behaviors. We know those as gray failures. What they all have in common is that they are large scale. There are a lot of complex systems involved and coming together. The number of systems is staggering. Each system behaves a little bit differently. The chain of events, once you trace it, is very nonlinear. You get an effect that's blue on one side and green on the other. That makes it very unpredictable, which is, again, a fundamentally new challenge. The question is, how do you stabilize it?
Perhaps the most important characteristic is that you can't engineer your way out of it. We can't put parameters that are important for runtime behavior in a YAML file, because tomorrow some other team deploys something else and those values are wrong. Anything by way of resource limits or scaling behaviors, like "I want to have four copies of this." Request behaviors, like how many retries to make and how to back off. Even connection pool sizing becomes stale very quickly. Typically, there's zero process around updating these things. You don't even know, while you're coding a service, that next week some other business unit will be talking to it. It's very difficult, and we can't engineer ourselves out of it. In these rapidly evolving service landscapes, we can't simply define structure and set policies in configuration files and forget about them.
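To make that concrete, here is a minimal sketch, with entirely hypothetical names and values, of what "engineering your way out of it" looks like: runtime behavior frozen into code and configuration that nobody revisits once the landscape around it changes.

```python
import time

# All names and numbers below are hypothetical; they illustrate runtime
# behavior frozen at build time, not any particular system from the talk.
REPLICAS = 4               # scaling behavior: "I want four copies of this"
MAX_RETRIES = 3            # request behavior: how many retries
BACKOFF_SECONDS = 0.5      # and how to back off between them
CONNECTION_POOL_SIZE = 20  # sized for today's callers, not next week's

def call_dependency(client, request):
    """Call a downstream service with the retry policy baked in at build time.
    The moment another team changes the landscape, these numbers are wrong."""
    for attempt in range(MAX_RETRIES):
        try:
            return client.send(request)
        except TimeoutError:
            time.sleep(BACKOFF_SECONDS * (2 ** attempt))
    raise RuntimeError("dependency unavailable")
```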
Ironically, the agility that enterprises crave ends up producing an architecture that we can't control. If we can't control it, of course, we can't operate it. If we can't operate it, then innovation dies. This operational crisis is the defining problem in the industry today, because service landscapes and their behaviors are the new reality that we're all facing. In all these environments, failure happens overwhelmingly due to environmental factors, not due to a code defect in an individual thread of execution. The code is now subject to factors that are entirely unrelated to it. Those may be gray failures, other ripple effects, or resource contention. All these things can affect your code. This is the new reality.
We need to step back and change operations. We need to be able to create structure, govern, detect, and react at runtime. That way we can give the service landscape the environment it's supposed to run in, the resilience that it needs. We need to be able to control disruptive behaviors, prevent systemic failures, and avert security breaches. In other words, like an air traffic controller, we need to operate with a mission control mindset. We need to care about the stability and security of the entire airspace, not about isolated ground operations. In order to do that, we need to be able to detect and react at runtime, in near real-time. That means we need to base our detection and reaction on golden signals, metrics that apply to every single flight. We need mission control operations.
This is what I mean when I say that all successful systems that actually run are not so much coded as operated: the environment is what determines stability and security, and environmental factors are very unpredictable. We need a mission control operations system that allows us to remediate in real time. I'm not saying diagnose and fix; that's an entirely different process. This is all about real-time remediation, stabilizing the situation, like a triage nurse would do in an emergency room.
Coping Strategies
I want to look at some of the strategies that we typically use to cope with complexity. One of them is, of course, the do-nothing strategy, where we say, "Let's just continue as we did before!" Or you may hear a developer say, "Netflix does it this way, that's how we should do it, too!" Or a VP of engineering comes in and says, "I'm going to get my team to not write any bugs anymore." Well, this may work for someone like Netflix, which, at the architectural level, is, no offense, a fairly simple application. Their problem is scale! For everybody here and for everyone else, the problem is the other way around: we have layers of complexity, generations of systems. Scale is probably not a top-five concern! It's much more about stability and how we can combine all these systems. So, when does doing nothing work? Building distributed applications works nicely if we have single, standalone applications like Twitter or Netflix, conceptually simple applications with a single blueprint. It does not really work for decentralized, organically growing service landscapes.
The other strategy is, "I've got Datadog. I've got excellent monitoring." That works if you operate at the lower levels of the stack, where host metrics, node metrics, or whatever else is part of that package are important. But it doesn't really work for a decentralized service landscape, which operates on a much higher level, where I need to look at the interactions between services. It doesn't help me to know what the heap size in a JVM is if I have a large-scale gray failure happening.
The third one is newer: how about I trace into things? It's the promise of perfect visibility plus all the context that I need around it. Yes, it's true: as long as I own these services, tracing makes a lot of sense. But if I have 20 other dependencies that I don't even care about, that are just services I consume, then tracing stops at that point. That's what makes it a very local solution.
Then, of course, there's service mesh. Service mesh really promises to deliver intelligent routing, metrics, policies, and encryption, "security." Yes, it gets you to the other side. But it's a very heavy solution that requires a very stable environment. It's also very complex and tends to be slow and very invasive. Once you have it in place, unless somebody else manages it for you, it becomes very difficult to change. That's especially true if you have many teams trying to inject YAML into the different Envoys, because an actual service landscape evolves much faster than that baroque YAML.
Operating Service Landscapes
How should we operate these service landscapes? This answer has two parts. One, because environmental factors are the determining factors today, and because they are so unpredictable, success hinges on real-time remediation. We need to be able to see very quickly and react very quickly. The quickly-seeing part needs to rely on metrics that are very easy to understand. Again, like the triage nurse: if you come in with a pulse of 160 and a raging fever, the nurse is going to give you some medication right there. That is exactly what we need to do. Those metrics need to be holistic and universally applicable. In air traffic control, those metrics are position, altitude, direction, and speed, which can be applied to any aircraft.
For cloud traffic, we need metrics we can apply to any interaction. That includes the number of requests: how many requests are being made between two sets of services? Latency: how long do they take? Concurrency: how many requests are in flight at any given time? And, of course, bandwidth. By correlating and examining these, you can find anomalies and react very quickly.
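As a rough sketch of what those golden signals could look like in code (the `Call` record and its field names are my own invention, not a real telemetry format), you might aggregate observed calls between two sets of services like this:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Call:
    """One observed request between two services (a hypothetical record format)."""
    start: float            # seconds since the epoch
    end: Optional[float]    # None while the request is still in flight
    bytes_moved: int

def golden_signals(calls: list, window_seconds: float) -> dict:
    """Summarize a window of traffic into the four signals: requests,
    latency, concurrency, and bandwidth."""
    now = time.time()
    finished = [c for c in calls if c.end is not None and c.end >= now - window_seconds]
    in_flight = [c for c in calls if c.end is None]
    return {
        "requests_per_second": len(finished) / window_seconds,
        "avg_latency_ms": 1000 * sum(c.end - c.start for c in finished) / max(len(finished), 1),
        "concurrency": len(in_flight),
        "bandwidth_bytes_per_second": sum(c.bytes_moved for c in finished) / window_seconds,
    }
```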
The reaction part relies on operational patterns. Operational patterns are really encapsulations of best-practice remediations. One example is the bulkhead. Let's say I have several availability zones. As an operator, I want to make sure that whatever happens in one zone doesn't affect any other. But at the same time, maybe a critical service should be able to fail over. We all know accidental cross-zone traffic happens all the time, because misconfigurations happen. Suddenly something talks from this zone or region to the other zone or region and nobody knows for another month, until the bill arrives. It's important to be able to remediate this.
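A toy sketch of the bulkhead idea, with invented zone and service names: keep traffic inside its zone and admit cross-zone calls only for services that are explicitly allowed to fail over.

```python
# A toy bulkhead: traffic stays inside its availability zone unless the
# calling service is explicitly allowed to fail over. Names are made up.
FAILOVER_ALLOWED = {"payment-service"}  # critical services that may cross zones

def bulkhead_allows(source_zone: str, dest_zone: str, service: str) -> bool:
    """Return True if a call from source_zone to dest_zone should be admitted."""
    if source_zone == dest_zone:
        return True                      # normal, in-zone traffic
    return service in FAILOVER_ALLOWED   # cross-zone only for critical failover

# An accidental cross-zone call from a misconfigured service gets rejected:
assert bulkhead_allows("us-east-1a", "us-east-1a", "catalog") is True
assert bulkhead_allows("us-east-1a", "us-east-1b", "catalog") is False
assert bulkhead_allows("us-east-1a", "us-east-1b", "payment-service") is True
```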
Another example is backpressure. If you have very spiky workloads, or just too much demand at a given time, the ability to push back against it for a certain amount of time relieves stress on the systems under attack. Whether it's malicious or not doesn't matter. It's just a very quick remediation and a very important operational pattern.
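One simple way to sketch backpressure is a token bucket that delays callers once they exceed an operator-chosen rate; the class and its numbers here are illustrative, not a real implementation.

```python
import time

class Backpressure:
    """A toy token bucket that pushes back on spiky callers by holding their
    requests once they exceed an operator-set rate."""

    def __init__(self, requests_per_second: float, burst: int):
        self.rate = requests_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self) -> float:
        """Reserve one token; return how long to hold the request before forwarding."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1
        if self.tokens >= 0:
            return 0.0                      # under the limit: forward immediately
        return -self.tokens / self.rate     # over the limit: push back by this much

# Usage sketch: delay = limiter.admit(); time.sleep(delay) before forwarding.
```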
The circuit breaker is interesting because we typically think of it as a developer pattern. For example: I'm running an e-commerce site and there's a recommendation engine. If that's gone, I don't care that much; I can circuit-break around it. That's true. From an operational perspective, it's a slightly different use case: I may have a Hadoop cluster and I just realized it's going a little slower than it used to. I'm now going to circuit-break the tier-3 services that are not that important, and only the long-running queries from those. That's an operational concern.
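Here is a minimal circuit-breaker sketch with made-up thresholds: after repeated failures, calls are skipped for a cool-down period instead of piling onto a struggling system.

```python
import time

class CircuitBreaker:
    """A minimal circuit breaker: after too many failures, stop sending
    traffic for a cool-down period. Thresholds are illustrative."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None            # half-open: let one call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```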
There are a couple more of these. Segmentation, keeping services from talking to each other, is a security tool that's also an eminently useful pattern for partitioning request clients. Another interesting one is the quarantine, because it's such a risk mitigator when it comes to deploying new code.
Remediation Examples
Let me turn to examples. How does this actually look in real life? Going back to the example from earlier, the story that actually happened is this: a new deployment, a new piece of organic, federated growth, was added. That affected the upstream dependency map. The developers swapped two calls, because one took longer, so it would be sent off earlier. That changed the fan-out pattern of this upstream server in such a way that a shared, centralized cache started thrashing, not immediately, but eventually at a significant rate. That caused widespread, very unspecific slowness on the other side of the landscape.
What does the remediation look like here? First, we need to be able to see that slowness very quickly. Then, we need to identify where the downstream bottlenecks might be coming from and correlate with deployment history. Then, of course, quarantine that deployment until the issue is diagnosed and fixed.
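A quarantine can be as simple as a routing override that sends traffic for a suspect deployment back to the previous version until it's diagnosed; all names here are invented for illustration.

```python
# A toy quarantine: traffic to a suspect deployment is diverted to the
# previous version (or refused) until the issue is diagnosed.
QUARANTINED = {"pricing-service:v42"}
PREVIOUS_VERSION = {"pricing-service:v42": "pricing-service:v41"}

def resolve_target(deployment: str) -> str:
    """Return the deployment a request should actually be sent to."""
    if deployment in QUARANTINED:
        fallback = PREVIOUS_VERSION.get(deployment)
        if fallback is None:
            raise RuntimeError(f"{deployment} is quarantined and has no fallback")
        return fallback
    return deployment

assert resolve_target("pricing-service:v42") == "pricing-service:v41"
assert resolve_target("checkout-service:v7") == "checkout-service:v7"
```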
Another really interesting example was published about a year ago: a cascading failure at Target. What happened? There were two environments, one VM-based, running on OpenStack, and another consisting of a massive number of Kubernetes clusters. Historically, a couple of these clusters were really big, and all the other ones were typically very small. Each workload on Kubernetes had a sidecar injected that did logging to Kafka systems that were running on OpenStack.
The OpenStack guys came and said, "We need to do a Neutron upgrade. It's going to be 30 seconds, or a minute, or whatever, of downtime." Of course, it lasted more than a couple of hours. That caused the Kafka systems to be intermittently available. All these workloads on Kubernetes tried to continue to log. They couldn't, so they would wait until Kafka came back. When Kafka came back intermittently, everybody would log at the same time, because, of course, the sidecars all do the right thing and flush at the same time. That didn't overwhelm the network; it caused a CPU spike on those nodes. That CPU spike squeezed the Docker daemon. Kubernetes does the right thing and says, "This node is unhealthy. I'm going to migrate this off to another node." The migration patterns are not uniform, so some nodes now had the same thing happen again, and those pods had to be moved somewhere else.
The outward behavior was, "Why is my Kubernetes flip-flopping?" The remediation should have been to very quickly identify the logging spikes and exert backpressure against those loggers, to smooth them out and prevent them all from logging at the same time: delay individual requests by a second or so, and maybe even circuit-break some of the long-running, hanging requests.
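A sketch of what that smoothing could look like at the sidecar end (the `send` function is a stand-in for whatever ships a log batch to Kafka): exponential backoff with random jitter, so thousands of sidecars don't all flush at the same instant when the broker comes back.

```python
import random
import time

def send_logs_with_jitter(send, batch, max_attempts=8):
    """Retry a log shipment with exponential backoff plus random jitter so
    that many sidecars don't all flush at the same moment when the broker
    returns. `send` is a hypothetical function that ships one batch."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except ConnectionError:
            # cap the backoff, and spread retries out randomly
            delay = min(60, 2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    raise RuntimeError("dropping batch: broker unavailable too long")
```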
A third example of remediation is something we did for a high-security video conferencing environment. The architecture is many organizations under the same umbrella, with each organization in its own video conferencing rooms. Participants would come in through a gateway and hit a bunch of relays, a different relay for each media type, and then the media would be relayed out. It was very important that participants only talk to one video conferencing room at a time.
Looking at the golden signals, you can see the conferencing rooms in the middle and a bunch of participants with red lines that reach all over the place. Clearly, two things are going on here. One, there's some sort of DoS happening. The other is a segmentation violation: they shouldn't be able to talk across these rooms. It could be a misconfiguration, a vulnerability, an exploit, anything. The remediation is to identify the sources very quickly, based on golden signals, and then apply an operational pattern like segmentation to prohibit these clients from continuing.
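A toy version of that segmentation policy, with invented identifiers: each participant has exactly one assigned room, and any traffic outside that assignment is dropped.

```python
# A toy segmentation policy: each participant may talk to exactly one
# conference room, and everything else is dropped. Identifiers are made up.
ASSIGNED_ROOM = {
    "alice": "room-legal",
    "bob": "room-engineering",
}

def segmentation_allows(participant: str, room: str) -> bool:
    """Admit traffic only along the participant's assigned edge."""
    return ASSIGNED_ROOM.get(participant) == room

# A client reaching across rooms (misconfiguration, exploit, or DoS) is cut off:
assert segmentation_allows("alice", "room-legal") is True
assert segmentation_allows("alice", "room-engineering") is False
```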
Runtime Control Examples
Those are all operations that you can do to stop the bleeding. Being able to operate and control the runtime is also very useful if you want to move forward and accelerate development. One of the ways you can do that is by deploying straight to production. The reason most people cannot deploy to production and need to stage services is that once something is deployed, there's nothing they can do. They don't even know what's running. So we build these staging environments that are very difficult to build and to actually make meaningful. Then there are systems around them for storing real user traffic and playing it back in staging the next day; I resort to tricks like that. Then it's too expensive, so my staging environment is only a third of the size, and it very quickly becomes meaningless. Because real-time, runtime control is such a massive risk mitigator, we helped an online travel company completely eliminate their staging environment.
My favorite use case is that you can architect in real time. We did this for a connected car manufacturer. Their problem was hundreds of applications trying to call into millions of cars to get data from them, depending on the functionality. This is an oversimplification of the situation. The number of applications is growing about 200% a year. It covers everything from managing the brakes, to entertainment, to autonomous driving, to all the monitoring systems on the car. There are way more systems than you probably think.
Their problem was, of course, that they couldn't let these applications talk to the car directly, for security and other reasons. They needed to intermediate. But whatever they architect in between is going to be wrong next week. If they put a system in the middle, then the next application comes in, it's super important to support it, and they need to touch all these services. Then they need to recertify everything, because it's talking to cars. They took the plunge and decided they needed to avoid upfront architecture entirely. That's hard to do. They said, "For every new requirement that comes in, we're just going to deploy a new service next to the other ones." There's an API gateway in the middle that then routes each application to the right version of each service. It's really interesting, because it allowed the manufacturer to massively accelerate the deployment of end-user applications, which would not have been possible if they had designed this upfront. It's very expensive to design architecture upfront, and most of the time you don't know at that point what it needs to do afterwards, at runtime.
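A sketch of that gateway-in-the-middle idea, with entirely invented application and service names: new requirements get new service versions deployed next to the old ones, and only a routing table, changeable at runtime, decides who talks to what.

```python
# A toy routing table for the gateway-in-the-middle approach. Each new
# application requirement gets a new service version deployed next to the
# old ones, and only this table changes. All names are invented.
ROUTES = {
    ("entertainment-app", "telemetry"): "telemetry-v3",
    ("brake-monitoring", "telemetry"): "telemetry-v1",   # certified, untouched
    ("autonomous-driving", "map-data"): "map-data-v7",
}

def route(application: str, capability: str) -> str:
    """Resolve which service version should serve this application, at runtime."""
    try:
        return ROUTES[(application, capability)]
    except KeyError:
        raise LookupError(f"no route for {application} -> {capability}")

# Supporting a new application means adding a route, not touching existing services:
ROUTES[("fleet-analytics", "telemetry")] = "telemetry-v4"
```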
Another interesting case was a cloud provider, where the ability to define structure was important, because all these architectures were growing and sprawling. They reached into different cloud services; what used to be a VM is now Kubernetes and containers, with serverless attached to it. Then all these applications come together. Defining and segmenting the individual clients became very difficult. They used to do this with plain SDN, but of course, at some point you configure the SDN into a corner, and you run into multi-data-center issues and different service issues. They used segmentation to do quite a bit in that area.
Summary
In summary, we now live in a completely new agile operating model. We work in small teams with fast cycles. We have these immense amounts of cloud services at our fingertips. Everything happens in parallel. That is really the key driver. It creates organic, federated growth, the result of which is this new reality of the service landscape. The new reality is that in service landscapes, everything depends on the environment. It's very easy to fix code somewhere; the individual thread of execution in whatever service you're looking at is not that important anymore. Your failures are really determined by the health of the environment: how many gray failures and how many discontinuities you have.
I also discussed some solutions that people typically try to apply to complexity. "We're just going to wing it" really only works for simple applications. Investing heavily in monitoring really only works at the lower levels of the stack. Tracing is another popular solution, but it really only works in the local context, for the services that you personally own or are responsible for, the services that are in your repository. It doesn't work for any other dependencies.
There's also service mesh, which turns out to be very complex and very slow. It's invasive and leads to brittle architecture. The best solution is to aggressively invest in rapid MTTR, rapid remediation. You need to be able to detect quickly when something happens, before it becomes a real failure, by looking at gray failures. Base that on golden signals, and then apply operational patterns to quickly remediate. Remediation does not mean fixing the "root cause"; it means remediating the situation to restore some form of normality. I gave a couple of examples of how to apply patterns reactively, to remediate existing issues, and how to be proactive and think about what else we can do with runtime control and operational patterns.
Takeaways - Developers
The takeaway for those of you who are developers and may not face this new reality yet is that absolutely everybody should avoid building distributed systems or solving distributed-systems issues themselves. They're very difficult to solve, most of them are solved in some piece of infrastructure already, and they tend to become very expensive. Most importantly, they force a waterfall way of thinking on you: those things need to be designed, and there's a long ramp before you can deploy, which slows everything down. It's almost like serialization in a multiprocessor system. Instead, build resilient federations of services based on standard domain-driven design. Then, for anything that might happen, build a compensation strategy. Heavily invest in compensation strategies in code. That's the best thing you can do. If something doesn't work quite right, make it so that the service degrades instead of failing.
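As a small sketch of what such a compensation strategy can look like in code (the catalog and recommender service objects are hypothetical), degrade to an empty recommendation list instead of failing the whole page:

```python
def product_page(product_id, catalog, recommender):
    """Build a product page that degrades instead of failing: if the
    recommendation service is down or slow, compensate with an empty list
    rather than propagating the error. Service objects are hypothetical."""
    product = catalog.get(product_id)          # core data: let failures surface
    try:
        recommendations = recommender.for_product(product_id, timeout=0.2)
    except Exception:
        recommendations = []                   # compensation: degrade, don't fail
    return {"product": product, "recommendations": recommendations}
```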
A typical issue that comes up here is shops that have really great infrastructure for using mocks and stubbing out dependencies. They typically have the worst compensation strategies in their code, because it's so easy to mock things up that everything is always there. If you actually need "five nines" of certainty that a certain result comes back, you need to do it with redundancy and by checking several return values, like airplane software does, where any important decision is made by three systems at the same time.
Debugging and tracing should ideally be kept entirely to the unit level. It's very easy and cheap to debug code at the unit level; it becomes very hard and very expensive later on, in production. Most important of all: by all means, defer design decisions to runtime. Don't try to solve runtime concerns of code and services in your own code.
Takeaways - Operators
For operators, the most important thing to remember is to focus on the environment. Stop debugging individual nodes. Focus on gray failures. What can you see? How can you rapidly detect and react to things? Investing in that capability has massive returns. Also, decide which signals you want to look at. What is the set of signals that, for your company, is the most important? I'm not talking about customer-facing KPIs, but about signals from a stability and security perspective. Which patterns do you want to apply? How quickly do you want to do it? The quicker you can apply a pattern, a remediation, the easier it is to deploy something, and the easier it is for your company to make forward progress.
Then, one of my favorites: everybody talks about root causes. Stop! They don't exist; there's no root cause. If there's a root cause, it's a trivial bug. Like with dysfunctional families, there's not a single person at fault. You unit test all your stuff, it all works; you put it together, and it doesn't work. Forget about root causes; it's always a confluence of factors. Chasing a root cause is intellectual laziness. As engineers, we're trained to jump on the first thing we see. Typically, it's a rabbit hole we love to jump into, and then we spend days chasing a bug that is not even that important. Instead, focus on remediation, because these systems are so complex that you can't chase everything anyway. Things happen. If you think your system runs clean, you're not seeing the gray failures. You're not seeing the discontinuities that happen all the time. Also, don't do any process debugging. If there's a suspected bug in some code, say a file handle that is not being closed, it's not your job. It's the developer's job.
All of this together allows you to truly architect at runtime. Take expensive design decisions and resolve them at runtime, when you have the data and can see how the system is behaving. This, of course, applies to all decentralized architectures, not just microservices: any combination of serverless, VMs, mainframes, or bare-metal systems, in different data centers and different regions. The fact that these systems all tend to come together now is really what the new reality is all about.
Proper Mission Control – Apollo 13 vs. Southwest Airlines
To drive this home: Apollo 13 didn't make it back to Earth because it was well engineered. It came back to Earth because it was mission-controlled properly. Operations is the key driver here. Southwest Airlines could have completely avoided their meltdown if they had had the same type of mission-control operations: a way to quickly see, at the right level, what was going on. Not deep observability, but high-cardinality, high-scalability, high-dimensionality visibility at the right granularity. Then they would have seen billions of packets going in and zero packets coming out. And if they had also had the ability to push back on traffic, to slow everybody down until the failed system, in this case the router, had been swapped out, none of it would have happened. Those five days of outage would not have happened. But they didn't have runtime control.
If you are an architect or developer who wants to move some of those decisions to runtime, or if you're an operator who suffers from being completely ill-equipped to deal with runtime issues and wants to regain a semblance of control, you can talk to me any time.