Controlled Chaos: Taming Organic, Federated Growth of Microservices

Summary

Tobias Kunze focuses on the challenges that result from organic, federated growth as well as the patterns that can be applied to monitor and control these dynamic systems, like bulkheads, backpressure, and quarantines, from both an operational and security perspective.

Bio

Tobias Kunze is the co-founder and CEO of Glasnostic where he is on a mission to help enterprises manage their rapidly evolving microservice architectures. Prior to Glasnostic, he was the co-founder of Makara, an enterprise PaaS that became Red Hat OpenShift.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kunze: Who is more on the architecture side of the house, architects here? Who is on the development side, in actual coding? Any operators here: DevOps, SRE, whatever the title is? I'm not going to talk about coding. I'm actually not talking about microservices that much per se; I'm going to talk about what happens when your microservice applications continue to evolve and get connected with other systems, once you start composing things. You work in many teams, everybody deploys to production all the time, and you get what we call a service landscape. That's really driven by organic, federated growth. Organic, federated growth really means each team has business objectives and you deploy new systems fairly organically. If you need something, you deploy it. Eighty percent of whatever you need is already in the enterprise; you're just talking to it. There isn't so much a blueprint, or a full-on architecture. It has a couple of really interesting characteristics that I'm going to get into. I want to point out upfront that organic, federated growth is super important. It's a good thing. That's what gives you speed to build on top of existing things. It also requires a totally different way of runtime control in operations.

To illustrate that, I want to start off with an illuminating story from old-school operations. This is one of these airline things. In July 2016, Southwest melted down. What happened is that one router out of 2,000 in a Dallas NOC just failed. It failed in a weird way. Of course, it was monitored up the wazoo with agents and monitors, and all the lights were green. Except: zero packets out, billions of packets in, nothing coming out.

What happened is that operations scrambled, of course. It took them 30 minutes to discover what was going on, then 12 further hours of rebooting adjacent systems that had gone out of sync, couldn't catch up with what had happened, had stale data, these things. Meanwhile, all other systems that relied on those systems were down. Flight crews couldn't board because the data on how long they had been in the air was not available, and things like that. A really remarkable chain of events and a pretty spectacular outcome: one router, 5 days of downtime, $80 million of direct losses from canceled flights, hotels, and all that. Almost 3-and-a-half billion dollars were wiped off the market cap.

Lessons Learned

There are two important learnings here, at least two. Number one is monitoring at the wrong level, visibility at the wrong level: all these green lights didn't mean anything. The second learning is that you need to have control over system interactions. If they had had the correct visibility, they could have seen the incident earlier, seen all these packets coming in and nothing coming out, and then pushed back on the requesting systems until that was resolved. None of that would have happened. Both things are important: detecting really quickly, and then also having an ability to push back.

That's what I'm going to talk about: the criticality of mission control in complex service landscapes, complex, composite architectures. There are three key terms. Service landscape, you're going to hear all over this talk. Then the characteristics of these service landscapes, which is really about how their stability and security are different from normal applications; I'm going to talk about why this is a new reality that we're facing. Then, what can you do about it: operational patterns. That's going to come up. I'm also going to talk about common strategies to cope with complexity in service architectures, and why they don't really work for these composite architectures. Then I'll bring examples of how you can remediate reactively with patterns, and how you can proactively create new use cases and help yourself move forward.

Background

A little bit about myself: CEO and co-founder of Glasnostic. My previous company was a platform-as-a-service that became Red Hat OpenShift, so I spent a couple of years at Red Hat, fully focused, of course, on how we build applications and how we can ideally support building applications. Two things I learned there are interesting to me. The first, from a technical perspective, is that all the applications we saw that were meaningful were not applications at all. They were all systems of applications. There's a tremendous amount of complexity going on. The second was that successful systems of applications were not successful because they were well engineered; they were successful because they were operated well. There was a great operations team behind them that had the right levers, the right knobs. That's what I'm going to talk about. That's why our vision is really to bring runtime control to enterprises, to large service landscapes in enterprises.

The Agile Operating Model

I want to zoom out. Why is this a new reality? How do these things, the service landscapes, come about? I'm going to start with the new agile operating model that we all live in now. It starts with all of us working in small, self-managing, autonomous teams, with rapid decision and learning cycles. At the same time, we can really benefit from a vast cloud ecosystem with hundreds of Lego blocks that we can readily use. Then there's cloud native technology, and we can put this all together. Of course, all these two-pizza teams that we have now can deploy in parallel. A lot of forward movement is happening in the space. That has a profound effect on the architecture side of things, which has evolved from microservices, to shared services, with many microservice applications based on shared services, and then, through what we're talking about here, organic, federated growth, finally to the service landscape.

There are corresponding changes on the control layer that governs how we connect those systems. That's evolved from old-school enterprise integration to free-flowing APIs, to middleware and gateways, and most recently, service mesh. Also, of course, on the operations side, where we have evolved from old-school pushing of boxes to DevOps, then SRE, and then what we call mission control operations. This is not just a set of individual evolutions. It is truly a new operating model, where all these layers work together to support the agility of the enterprise. That's what we're seeing more and more: these architectures compound, they get composited. They grow organically, relentlessly.

I want to double-click a little bit now on what this service landscape is. Here's an illustration. It's really any architecture that evolves through organic, federated growth. Here's an example. If you take the first panel, it's a microservice application that becomes useful to other teams. It gets other things bolted on, maybe a mobile gateway, another partner integration; other applications use its services. More applications are built on top of this application. Also new services, because now each service has a number of dependencies, or is a dependency of a number of other services. Now you start building out services, different versions of services, next to each other. That is really what organic, federated growth is.

Security: Evolving Topologies, Ephemeral Actors

That has three really interesting characteristics. Number one, of course, on the security side, as we can all imagine, there's a total loss of perimeter. The architecture changes all the time. There's a loss of blueprint. I can't necessarily base my security policies on what I know about the architecture, because it may not hold tomorrow or next week. That's a fundamentally new challenge. Then, these architectures have complex and very disruptive emergent behaviors. We know those as gray failures. What they all have in common is that they are large scale: there are a lot of systems involved and coming together. They're complex: the number of systems is staggering, and each system behaves a little bit differently. The chain of events, once you trace it, once you narrate it, is very nonlinear. You get an effect that's blue on one side, and on the other side, it's green. That makes it very unpredictable. Again, a fundamentally new challenge. The question is, how do you stabilize it?

Almost the most important characteristic is that you can't engineer your way out of it. We can't put parameters that are important for runtime configuration in a YAML file, because tomorrow some other team deploying something else makes them wrong. Anything by way of resource limits or scaling behaviors ("I want to have four copies of that"), request behaviors ("that many retries, and that's how I'm going to back off"), or even connection pool sizing becomes stale very quickly. Typically, there's zero process around updating these things. Besides, you don't even know: you're coding a service, and next week some other business unit talks to it. It's very difficult; we can't engineer ourselves out of this. In these rapidly evolving service landscapes, we can't simply define structure and set policies in configuration files, and then set them and forget about them.

The key is really that, ironically, the agility that enterprises crave ends up producing an architecture that we can't control. If we can't control it, of course, we can't operate it. If we can't operate it, then innovation dies. This operational crisis is the defining problem in the industry today. It's the defining problem because service landscapes and their behaviors are the new reality that we're all facing. The new reality is that in all these environments, failure happens overwhelmingly due to environmental factors, not due to a code defect in an individual thread of execution. The code is now subject to factors that are entirely unrelated to it. Those may be the gray failures I mentioned, other ripple effects, resource contention: all things that very much affect your code. This is the new reality.

Because of that, we need to step back and change operations. We need to be able to create structure, govern, detect, and react at runtime, so we can give the service landscape, where the code is supposed to run, the systemic resilience that it needs. We need to be able to control disruptive behaviors, prevent systemic failures, and avert security breaches. In other words, like an air traffic controller, we need to operate with a mission control mindset. We need to care about the stability and security of the entire airspace, not about some ground operation. In order to do that, we need to be able to detect and react at runtime, in near real-time. That means we need to base our detection and reaction on golden signals, on key signals, on metrics that apply to every single flight. We need mission control operations.

This is what I mean when I say that all successful systems are run well: not so much coded well as operated well, because the environment is what determines stability and security, and because environmental factors are very unpredictable. We need to have this mission control operations capability that allows us to remediate in real time. I'm saying remediate; I'm not saying diagnose and fix. That's an entirely different process. This is all about real-time remediation, about stabilizing the situation, what a triage nurse would do in an emergency room.

Coping Strategies

I want to look at some of the strategies that we typically bring to cope with complexity. The first is to just continue as we did before. That's typically when you hear a developer say, "I want to do this. Netflix does it this way. That's how we should do it too." Or a VP of engineering comes in and says, "I'm going to get my team to not write any bugs anymore." Clearly, Netflix is, at the architectural level, a fairly simple application; the challenge there is scale. For everybody else here, the problem is the other way around. We have generations of complexity, and scale is probably not a top-five concern. It's much more about stability, and how we can combine all these systems. Doing nothing, just building distributed applications, works nicely if you have a single, standalone application like a Twitter: again, a conceptually simple application, a single blueprint. It does not really work for decentralized service landscapes.

The other strategy is, "I've got Datadog. I've got excellent monitoring." That's true if you work at the lower levels of the stack; then a lot of these host metrics, node metrics, whatever is part of that package, are important. It doesn't really work for a decentralized service landscape, which really operates at a much higher level, where I need to look at the interactions between services. It doesn't help me to know what the heap size in a JVM is if, really, I have a large-scale gray failure going on.

The third one is newer: how about I trace into things? It's the promise of perfect visibility plus all the context that I need around it. Yes, it's true: as long as I own these services, tracing makes a lot of sense. But if I have 20 other dependencies where I don't even know who coded them, and I don't really care, they're just services I consume, tracing stops at that boundary. Tracing is a very local solution.

Then, of course, there's service mesh. I love this image because, to me, it really shows how service mesh relates to a service landscape. Service mesh promises to deliver intelligent routing, metrics, policies, and encryption security. It is an ornate solution. Yes, it gets you to the other side, like this baroque garden here. But it's a very heavy solution. It requires a very stable environment, it is very complex, it tends to be slow, and it's very invasive. Once you have it in place, unless somebody else manages it for you, it becomes very difficult to change, in particular if you have many teams trying to inject YAML into the different Envoys, because a natural service landscape evolves much faster than that baroque YAML.

Operating Service Landscapes

How should we operate these service landscapes? The answer has two parts. Number one: because environmental factors are the determining factors today, and because they're so unpredictable, it hinges on real-time remediation. That means we need to be able to see very quickly and react very quickly. The quickly-seeing part needs to rely on metrics that are very easy to understand. Again, like the triage nurse: if you come in with a pulse of 160 and a high temperature, the nurse is going to put you in some other room and give you some medication right there. That is exactly what we need to do. Like in air traffic control, those metrics need to be universally applicable and very holistic. In air traffic control, it's very clear: they apply to any aircraft. That's really its position, its altitude, its direction, its speed.

For cloud traffic, we need to apply them to any interaction. That includes the number of requests: how many requests are on the interaction? The latency: how long does it take to be fulfilled? Concurrency: how many are in flight at any given time? And, of course, bandwidth. By correlating these signals and examining them, you can find anomalies, and those allow you to react very quickly.
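
To make the correlation idea concrete, here is a minimal Python sketch (not from the talk, and not any particular product's API) of tracking the four golden signals for one caller-to-callee interaction over a rolling window and flagging an edge whose latency or concurrency jumps well past a baseline; all class and field names are illustrative.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    """One observation of a service-to-service call."""
    timestamp: float      # seconds
    latency_ms: float     # time to fulfill the request
    in_flight: int        # concurrency observed at the time of the call
    bytes_moved: int      # request plus response payload size

class GoldenSignals:
    """Rolling view of requests, latency, concurrency, and bandwidth
    for a single interaction (one caller -> callee edge)."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples: deque[Sample] = deque()

    def record(self, sample: Sample) -> None:
        self.samples.append(sample)
        cutoff = sample.timestamp - self.window_s
        while self.samples and self.samples[0].timestamp < cutoff:
            self.samples.popleft()

    def snapshot(self) -> dict:
        n = len(self.samples)
        if n == 0:
            return {"requests": 0, "latency_ms": 0.0, "concurrency": 0, "bandwidth_bps": 0.0}
        return {
            "requests": n,  # request count over the window
            "latency_ms": sum(s.latency_ms for s in self.samples) / n,
            "concurrency": max(s.in_flight for s in self.samples),
            "bandwidth_bps": sum(s.bytes_moved for s in self.samples) / self.window_s,
        }

def looks_anomalous(current: dict, baseline: dict, factor: float = 3.0) -> bool:
    """Flag an edge whose latency or concurrency has jumped well past its baseline."""
    return (current["latency_ms"] > factor * max(baseline["latency_ms"], 1.0)
            or current["concurrency"] > factor * max(baseline["concurrency"], 1))
```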

The reaction part relies on operational patterns. Operational patterns are really encapsulations of best-practice remediations. Here are some of them. An example is the bulkhead. I have several availability zones, and as an operator I want to make sure that whatever happens on this side shouldn't slosh over to here. Maybe, at the same time, a critical service should still be able to fail over. We all know this happens all the time, because misconfigurations happen: something talks from this zone or region to the other region and nobody knows for another month. It's a very important pattern to be able to apply.
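
As a rough sketch of the bulkhead idea (my illustration, not the speaker's implementation), the snippet below caps how many cross-zone calls may be in flight with a semaphore, while letting calls explicitly marked critical fail over anyway; Bulkhead and fetch_inventory are invented names.

```python
import threading

class Bulkhead:
    """Caps how many cross-zone calls may be in flight at once, so a backlog
    in one availability zone cannot drain the capacity of another."""

    def __init__(self, max_concurrent_cross_zone: int):
        self._slots = threading.Semaphore(max_concurrent_cross_zone)

    def call(self, caller_zone: str, callee_zone: str, fn, *, critical: bool = False):
        # Same-zone traffic is never throttled by the bulkhead.
        if caller_zone == callee_zone:
            return fn()
        # Critical services are allowed to fail over across zones regardless.
        if critical:
            return fn()
        # Everything else only crosses the zone boundary if a slot is free.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead: cross-zone call {caller_zone}->{callee_zone} rejected")
        try:
            return fn()
        finally:
            self._slots.release()

# Hypothetical usage:
# bulkhead = Bulkhead(max_concurrent_cross_zone=20)
# bulkhead.call("us-east", "us-west", lambda: fetch_inventory(), critical=False)
```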

Another one is backpressure. If you have very spiky workloads, or just too much demand at a given time, the ability to push back against it for a certain amount of time relieves stress from the systems under attack. Whether it's malicious or not doesn't matter at this point. It's just a very quick remediation, a very important pattern.
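
Backpressure can be pictured as a token bucket that delays spiky callers and sheds them once the wait would grow too long. This is a simplified, hypothetical sketch, not a production rate limiter; the rate, burst, and max_wait_s parameters are assumptions.

```python
import threading
import time

class BackpressureGate:
    """Token bucket that pushes back on spiky callers: requests beyond the
    sustained rate are delayed, or rejected once the wait gets too long."""

    def __init__(self, rate_per_s: float, burst: int, max_wait_s: float = 2.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.max_wait_s = max_wait_s
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self._lock = threading.Lock()

    def admit(self) -> bool:
        """Return True once the caller may proceed; False if it should be shed."""
        with self._lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            wait = (1.0 - self.tokens) / self.rate
            if wait > self.max_wait_s:
                return False      # too far behind: shed instead of queueing forever
            self.tokens -= 1.0    # go into debt for the token we are about to wait for
        time.sleep(wait)          # the delay itself is the backpressure
        return True
```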

Segmentation, of course. We think of segmentation as a security tool: these services can't talk to this service. That's true. It's also a pattern that's eminently useful to partition request clients.
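
A segmentation policy can be as small as a runtime allow-list over (caller, callee) segments that an operator can tighten on the fly; the sketch below is illustrative only, and the segment names in the usage comment are made up.

```python
class SegmentationPolicy:
    """Runtime allow-list of which callers may reach which services.
    Anything not explicitly allowed is denied, and the policy can be
    tightened on the fly when a violation is spotted."""

    def __init__(self):
        # (caller segment, callee segment) pairs that are permitted
        self._allowed: set[tuple[str, str]] = set()

    def allow(self, caller_segment: str, callee_segment: str) -> None:
        self._allowed.add((caller_segment, callee_segment))

    def revoke(self, caller_segment: str, callee_segment: str) -> None:
        self._allowed.discard((caller_segment, callee_segment))

    def permits(self, caller_segment: str, callee_segment: str) -> bool:
        return (caller_segment, callee_segment) in self._allowed

# Hypothetical usage: participants of one conference room may only reach that room's relays.
# policy = SegmentationPolicy()
# policy.allow("org-a-participants", "org-a-relays")
# policy.permits("org-a-participants", "org-b-relays")  # -> False
```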

The circuit breaker is an interesting pattern because we typically think of it as a developer pattern: I'm running an e-commerce site, there's a recommendation engine, and if that's gone, I don't care that much, I can circuit break around it. That's true. From an operational perspective, it's a slightly different use case: I may have a Hadoop cluster and I just realized it's going a little slower than it used to, so I'm now going to circuit break all the tier-3 services that are not that important, and only their long-running queries. That's an operational concern.
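
The operational flavor might look like the toy breaker below, which sheds only the long-running queries of tier-3 services while an operator has marked the shared cluster as degraded; the tier numbering and the 5-second threshold are assumptions for the example, not anything from the talk.

```python
class OperationalCircuitBreaker:
    """Operator-driven breaker: when the shared cluster is degraded, shed only
    the long-running queries of low-priority (tier-3) services, instead of
    breaking around a single dependency in application code."""

    def __init__(self, slow_query_threshold_s: float = 5.0):
        self.slow_query_threshold_s = slow_query_threshold_s
        self.degraded = False  # toggled by the operator or an alert hook

    def set_degraded(self, degraded: bool) -> None:
        self.degraded = degraded

    def should_shed(self, service_tier: int, estimated_runtime_s: float) -> bool:
        return (self.degraded
                and service_tier >= 3
                and estimated_runtime_s >= self.slow_query_threshold_s)

# Hypothetical usage:
# breaker = OperationalCircuitBreaker()
# breaker.set_degraded(True)                                      # cluster is running slow
# breaker.should_shed(service_tier=3, estimated_runtime_s=30.0)   # -> True
# breaker.should_shed(service_tier=1, estimated_runtime_s=30.0)   # -> False
```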

There are a couple more of these. Another interesting one is, of course, the quarantine, because it's such a risk mitigator when it comes to deploying new code. I'm not going to run through all of those.
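
For completeness, a quarantine can be sketched as a traffic gate that gives a new deployment only a small slice of traffic and cuts it off entirely once it is flagged, until someone diagnoses and releases it; again, a hypothetical sketch with invented names.

```python
class Quarantine:
    """Keeps a newly deployed service version isolated: it receives only a
    small slice of traffic, and is cut off entirely if it is flagged, until
    someone diagnoses the issue and releases it."""

    def __init__(self, canary_fraction: float = 0.05):
        self.canary_fraction = canary_fraction
        self._quarantined: set[str] = set()

    def quarantine(self, deployment_id: str) -> None:
        self._quarantined.add(deployment_id)

    def release(self, deployment_id: str) -> None:
        self._quarantined.discard(deployment_id)

    def admissible_fraction(self, deployment_id: str, is_new: bool) -> float:
        """How much of the normal traffic this deployment may receive."""
        if deployment_id in self._quarantined:
            return 0.0  # fully isolated until diagnosed and fixed
        return self.canary_fraction if is_new else 1.0
```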

Remediation Examples

Let me turn to examples. How does this actually look in real life? Going back to the example from earlier, the story that actually happened is that a new deployment, a new piece of organic, federated growth, was added. That affected the upstream dependency map. In fact, what the developers did was swap two calls, because one took longer, so it would be sent off earlier. That changed the fan-out pattern of this upstream server in such a way that the cache behind it, a shared, centralized cache, started thrashing: not immediately, but its load went significantly higher. That caused widespread, very unspecific slowness on the other side of the landscape.

What does the remediation look like here? First, we need to be able to see that slowness very quickly. Then we need to identify the downstream bottlenecks: where might it come from? Correlate with deployment history. Then, of course, quarantine that deployment until the issue is diagnosed and fixed.

Another really interesting example, which I love, was published about a year ago: a cascading failure at Target. It's eminently important that these things get published. What happened? There were two environments, one VM-based, running on OpenStack, the other a massive number of Kubernetes clusters. Historically, a couple of these clusters were really big; all the other ones were very small. Each workload on Kubernetes had a sidecar injected that logged to Kafka systems, and the Kafka systems were running on OpenStack, obviously.

The OpenStack guys came and said, "We need to do a Neutron upgrade. It's going to be 30 seconds, or a minute, or whatever, of downtime." Of course, that lasted more than a couple of hours, in various ways, very detailed apparently. That caused the Kafka systems to be intermittently available. All these workloads on Kubernetes tried to continue to log; they couldn't, so they would wait until Kafka came back. When Kafka came back intermittently, everybody would log at the same time, because, of course, the sidecars do the right thing: they all log at the same time. That didn't overwhelm the network, but it caused a CPU spike on those nodes. That CPU spike squeezed the Docker daemon. Of course, Kubernetes does the right thing and says, "This node is unhealthy. I'm going to have to migrate this off to another node." Of course, the migration patterns are not uniform, so some nodes now had the same thing happen again, and those pods had to be moved somewhere else.

The outward behavior was, "Kubernetes, why is my Kubernetes flip-flopping?" The remediation should have been to very quickly identify the logging spikes, and then exert backpressure against those loggers, which really means just smoothing it out: don't let them all log at the same time, delay individual calls by a second or so, and maybe even circuit break long-running, hanging requests.
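
The "smooth it out" part of that remediation amounts to de-synchronizing the sidecars, for example by adding random jitter before each flush once the sink comes back. In the sketch below, send_batch and pending_records are placeholders, not Kafka client APIs.

```python
import random
import time

def flush_backlog_with_jitter(send_batch, backlog, max_jitter_s: float = 1.0,
                              batch_size: int = 100):
    """When the log sink comes back, do NOT flush the whole backlog at once.
    Spread the flush out with a random delay per batch so that thousands of
    sidecars do not all log at the same instant and spike CPU together."""
    while backlog:
        batch, backlog = backlog[:batch_size], backlog[batch_size:]
        time.sleep(random.uniform(0.0, max_jitter_s))  # de-synchronize the herd
        send_batch(batch)

# Hypothetical usage (send_batch and pending_records stand in for the real sink client):
# flush_backlog_with_jitter(send_batch, pending_records)
```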

The third example of remediation is something we did for a high-security videoconferencing environment. The architecture is roughly on the left side: many organizations under the same umbrella, each organization with its own video conferencing rooms. Typically, participants would come in through a gateway and hit a bunch of relays, a different relay for each media type, and then it would be relayed out. You see a node diagram on the right side of how it looked. Very importantly, participants should only talk to one video conferencing room at a time.

Looking at golden signals, you can see, there in the middle, a bunch of clients, participants, with red lines that reach all over the place. Clearly, two things are going on here. One is that there's a DoS happening, some kind of DoS. The other is that there's a segmentation violation: they shouldn't be able to talk across these boundaries. It could be a misconfiguration, it could be anything, a vulnerability, an exploit, whatever. The remediation is to identify the sources very quickly, based on golden signals, and then apply an operational pattern like segmentation to prohibit these clients from continuing.

Runtime Control Examples

Those are all operations that you can do to stop the bleeding. Being able to operate and control the runtime is also very useful if you want to move forward and accelerate development. One of the ways you can do that is by deploying straight to production. The reason most people cannot deploy to production and need to stage services is that once something is deployed, there's nothing I can do; I don't even know what's running. So we build these staging environments that are very difficult to build and to actually make meaningful. Then there are systems around capturing real user traffic and playing it back in staging the next day; I resort to tricks like that. Then it's too expensive, so my staging environment is only a third of the size, and it very quickly becomes meaningless. Because real-time runtime control is such a massive risk mitigator, we helped an online travel company completely eliminate the staging environment.

Another use case, my favorite and a really interesting one, is that you can architect in real time. We did this for a connected car manufacturer. Their problem was hundreds of applications, on the left side here, trying to call into millions of cars, or get data from the cars, depending on the functionality. This is an oversimplification of the situation. The number of applications is growing about 200% a year. It affects everything from managing the brakes, to entertainment, to autonomous driving, to all the monitoring systems on the car. There are way more systems than most of you probably think.

Their problem was, of course: we can't let these applications talk to the cars directly, for security reasons, for all kinds of reasons. We need to intermediate. But whatever we put in the middle, whatever you architect here, is going to be wrong next week. I put a system in the middle, then the next application comes in, it's super important to support this application, and now we need to touch all these services and recertify all this stuff, because it's talking to cars. So they took the plunge and decided to entirely avoid upfront architecture. That's hard to do: if you look at this diagram, there are a lot of concerns that need to be taken care of. All these boxes here represent about 200 services, growing over time. They said, "Every new requirement that comes in, we're just going to deploy a new service next to the other ones." There's an API gateway in the middle that then routes each application to its new version of that service. It's really interesting, because it allowed the manufacturer to massively accelerate the deployment of end-user applications, which would not have been possible if they had designed this upfront. Upfront design and architecture is very expensive, and most of the time you don't know at that point what it needs to do afterwards at runtime.
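
That "deploy a new version next to the old ones and route to it" approach can be pictured as a per-application routing table in the gateway. The sketch below is a simplification with invented application and service names, not the manufacturer's actual system.

```python
class VersionRouter:
    """Routing table for an API gateway that pins each calling application to
    a specific version of each backend service, so new requirements are met by
    deploying a new version next to the old ones and flipping a route."""

    def __init__(self):
        # (application, service) -> version
        self._routes: dict[tuple[str, str], str] = {}

    def pin(self, application: str, service: str, version: str) -> None:
        self._routes[(application, service)] = version

    def resolve(self, application: str, service: str, default: str = "v1") -> str:
        return self._routes.get((application, service), default)

# Hypothetical usage:
# router = VersionRouter()
# router.pin("charging-app", "vehicle-telemetry", "v7")   # new app gets the new service version
# router.resolve("infotainment", "vehicle-telemetry")     # everyone else stays on v1
```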

Another interesting case was a cloud provider, where the ability to define structure was important, because all these architectures keep growing and sprawling. They reach into different cloud services; now there is serverless attached. What used to be a VM is now Kubernetes, containers, with serverless attached to it, and then all these applications come together. How they could define and segment the individual clients became very difficult. They used to do this with plain SDN, but of course, at some point you configure the SDN into a corner, and you run into multi-data-center issues, different service issues. They used segmentation to do quite a bit in that area, which was again very important.

Summary

In summary, what I talked about is how we now live in a completely new agile operating model. We work in small teams, with fast cycles. We have these immense amounts of cloud services at our fingertips. Everything happens in parallel. That is really the key driver. That creates organic, federated growth, the result of which is this new reality of the service landscape. The new reality is that in service landscapes, everything depends on the environment. It's very easy to fix code somewhere; the individual thread of execution in whatever service you're looking at is not that important anymore, because your failures really are determined by the health of the environment, by how many gray failures and how many discontinuities you have in it.

Then I talked about some of the solutions that we typically apply to complexity. There's, of course, "We're just going to wing it." There's monitoring: invest heavily in monitoring, which really only works at the lower levels of the stack, while winging it really only works for simple applications. Then tracing is another popular solution, but it really only works in the local context, for the services that you personally own or are responsible for, the services that are in your repo. It doesn't work for any other dependencies.

Then service mesh, which turns out to be very complex and very slow; it also leads to brittle architectures because of how invasive it is. The real solution is to aggressively invest in rapid MTTR, rapid remediation: being able to detect quickly that something is happening, before it becomes a real failure, looking at gray failures, basing that on golden signals, and then applying operational patterns to quickly remediate. Remediation is not a root cause fix; remediation is remediating the situation, getting it back to some form of normality. Then I gave a couple of examples: reactively, how to apply patterns to remediate existing issues; and proactively, what else we can do with runtime control and operational patterns.

Takeaways - Developers

The takeaway for those of you who are developers and may not face this new reality yet is that absolutely everybody should avoid building distributed systems themselves. Distributed systems issues are very difficult to solve, most of them are already solved in some piece of infrastructure, and solving them yourself becomes very expensive. Most importantly, they force a waterfall way of thinking on you: those things need to be designed, there's a long ramp before you can deploy, and that slows everything down, almost like serialization in a multiprocessor system. Instead, build resilient federations of services based on standard domain-driven design. Then, for anything that might happen, build compensation strategies; heavily invest in compensation strategies in code. That's the best thing you can do: if something doesn't work quite right, make it so that the service degrades instead of failing.
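
As a minimal example of such a compensation strategy, the sketch below degrades a recommendation call to a best-seller list, and finally to nothing, rather than failing the page; the callables passed in are hypothetical.

```python
def recommendations_with_fallback(fetch_recommendations, fetch_bestsellers, user_id):
    """Compensation strategy: if the recommendation engine is slow or down,
    degrade to a static best-seller list instead of failing the whole page."""
    try:
        return fetch_recommendations(user_id)
    except Exception:
        try:
            return fetch_bestsellers()  # degrade: less personalization, no failure
        except Exception:
            return []                   # last resort: render the page without the widget
```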

A typical issue that comes up here is that shops with really great infrastructure for using mocks and stubbing out dependencies typically have the worst compensation strategies in the code, because it's so easy to mock things up that everything is always there. If you actually need five nines of certainty that a certain result comes back, that needs to be done with redundancy and by checking several return values, like airplanes do: any important decision is made by three systems at the same time.
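
That kind of cross-checking can be pictured as a simple majority vote over the results of independent implementations; a minimal sketch, assuming the readings are directly comparable values.

```python
from collections import Counter

def vote(readings):
    """Accept a result only if a majority of the independent systems that
    computed it agree, the way avionics cross-checks important decisions."""
    counts = Counter(readings)
    value, count = counts.most_common(1)[0]
    if count * 2 > len(readings):
        return value
    raise ValueError(f"no majority among readings: {readings}")

# Hypothetical usage, with three independent implementations of the same decision:
# vote([compute_a(), compute_b(), compute_c()])
```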

Debugging and tracing should ideally be kept entirely at the unit level. It's very easy and very cheap to debug code at the unit level; it becomes very hard and very expensive later on in production. Most important of all: by all means, defer design decisions to runtime. Don't try to solve the runtime concerns of other code and services in your own code.

Takeaways - Operators

For operators, the most important thing to remember is to focus on the environment. Stop debugging individual nodes. Focus on gray failures: what can you see, and how can you rapidly detect and react to things? Investing in that capability has massive returns. Also, decide which signals you want to look at. What is the set of signals that, for your company, is the most important? I'm not talking about customer-facing KPIs, but about signals from a stability and security perspective. Which patterns do you want to apply, and how quickly do you want to do it? The quicker you can apply a pattern, a remediation, the easier it is to deploy something, and the easier it is for your company to make forward progress.

Then this is one of my favorites. Everybody talks about root causes. Stop: they don't exist; there's no root cause. If there's a single root cause, it's a trivial bug. Like with dysfunctional families, there's not a single person at fault. You unit test all your stuff, it all works, you put it together, and it doesn't work. Forget about root causes; it's always a confluence of factors. Chasing a single root cause is intellectual laziness. Frankly, I think it's just really lazy. As engineers, we're trained to jump on the first thing we see. Typically, it's a rabbit hole we love to jump into, and then we spend days chasing a bug that isn't even that important. Instead, focus on remediation, because these systems are so complex that you can't chase everything anyway. Things happen. If you think your system runs clean, you're not seeing the gray failures; you're not seeing the discontinuities that happen all the time. Don't do process debugging. If there's a suspected bug in some code, say some file handle is not being closed, it's not your job; it's the developer's job.

All of this together allows you to truly architect at runtime: make the expensive design decisions, and resolve them at runtime, when you have the data and can see how the system is behaving. This of course applies to all decentralized architectures, not just microservices: any combination of serverless, VMs, mainframes, or bare-metal systems, in different data centers, in different regions. The fact that these systems all tend to come together now is really what the new reality is all about.

Proper Mission Control - Apollo 13 vs. Southwest Airlines

To drive this home: Apollo 13 didn't make it back to Earth because it was well engineered. It came back to Earth because it was mission-controlled properly. Operations is the key driver here. Southwest Airlines could have completely avoided this disaster, this meltdown, if they had had the same type of mission control operations. If they had had a way to quickly see, at the right level, what was going on, not deep observability with high cardinality and high dimensionality, but visibility at the right level of granularity, they could have seen billions of packets in and zero packets coming out. And if they had also had the ability to push back on traffic, to slow everybody down until the system, this router in this case, had been replaced, none of that would have happened. Those five days of outage would not have happened. They didn't have runtime control.

If you are an architect or developer who wants to move some of those decisions to runtime, or if you're an operator who suffers from being completely ill-equipped to deal with runtime issues and wants to get some semblance of control, talk to me any time.

 

Recorded at: May 07, 2020

Community comments

  • Resilience is great, but what about correctness

    by Jaime Metcher

    Some great lessons here about operating around complexity, but to me the elephant in the room is the correctness of these non-deterministic systems of systems. We suffer from two flavours of magical thinking:

    1. It's possible to completely decouple microservices. No, coupling is like friction - too much is bad, but in actuality it's what lets you get work done. So now we have stochastic herds of microservices that are coupled and have to be coupled, but we can't predict how they're coupled in aggregate.
    2. We somehow don't need to worry about correctness because our heroes are companies that have essentially no notion of correctness, or at least prioritize availability much higher than correctness. What's the correct answer to a Google search? Is your Facebook feed correct? What SLA did Twitter commit to?

    If we care about correctness, then we have to ask at what point does increasing resilience just allow a partially broken system to keep digging the hole deeper. We can push that limit out, but we're still going to hit it. At a whole-system level, we're trading failure frequency for failure severity. How to calculate that curve?

  • Emergence

    by Mike Peters

    A superb talk, Tobias. Embracing emergence and treating it as something to be tuned is my strategy. The Covert Lab at Stanford has made a computer model of a mycoplasma cell where, by tuning, they got accurate emergent behaviours including cell division.
