Architectures That Scale Deep - Regaining Control in Deep Systems



Ben Sigelman talks about "Deep Systems", their common properties and re-introduces the fundamentals of control theory from the 1960s, including the original conceptualizations of Observability & Controllability. He uses examples from Google & other companies to illustrate how deep systems have damaged people's ability to observe software, and what needs to be done in order to regain control.


Ben Sigelman is a co-founder and the CEO at LightStep, a co-creator of Dapper (Google’s distributed tracing system), and co-creator of the OpenTracing and OpenTelemetry projects (both part of the CNCF). His work and interests gravitate towards observability, especially where microservices, high transaction volumes, and large engineering organizations are involved.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Sigelman: I am here to talk about architectures and not to talk about products or anything like that. Particularly, I want to talk about the idea of depth in systems and the implications of depth and specifically how we can think about that in the context of a couple of problems that I've been working on for a long time, just because those are the ones I'm most familiar with.

Scaling & Deep Systems

The talk is divided into three parts. The first part is about scaling in general and deep systems. We often talk about scale when we're at conferences or at our jobs, and we refer to it as if it's a linear concept and a one-dimensional thing. I think if any of us were asked whether scale is linear or one-dimensional, of course, we'd say no, but we still keep on talking about it that way. I'm trying to find a better word to describe the type of scaling that I think is really problematic architecturally.

One type of scaling is scaling wide or scaling broadly. Here's a puddle. We're all familiar with those. A puddle is a bunch of water on the ground. If you add more water, you get a bigger puddle, and if you add even more water, you get a really big puddle. This is an example of something that scales wide. There are many of them in the world, and I'll let you think about it, but it won't be hard to come up with examples. Sand dunes, satellite arrays, solar panel arrays, things like that.

There's also a different type of scaling that I would call scaling deep. Here's a small village. It's a place where some people live, hopefully semi-permanently. If we scale up a village by adding more people, it starts to look more like this, which is a medium-sized city. If you add more people and find places for them to live, it looks more like this. This is an example of something scaling deep. In this case, the thing I want to emphasize is that the deep system, in this case, the deep city, is not just a large village. It's a different thing altogether. When things scale deep, they change fundamentally, and we probably need to have totally different infrastructure to support them. Certainly, the infrastructure to support a village would not work to support a city like this.

My favorite thing in Paris, aside from the food, is the Sewer Museum, where you can go into an old Parisian sewer and actually experience the engineering marvel that is the Parisian sewer system. The entire city was literally grinding to a halt from cholera and diphtheria and all sorts of horrible things that happen when you don't have good sewers. Then they fixed the infrastructure, and it actually allowed the city to scale, literally, from a population standpoint. It's really fascinating. Moving beyond sewers, there's also things like hardware. This is an example of a deep system in hardware. I think, even just visually, you can look at this and be, "That looks really complicated," and it's not just like a lake of silicon. This is really complex and has a lot of nuance and depth.

The question is, what does it look like for software, especially software built around services? The microservices and serverless movements are really based on a managerial need to have small groups of engineers and developers operating independently. If you have thousands of engineers, then you have hundreds of services and you have to arrange them. When we started LightStep, I actually asked myself that question since it was important for our product. I saw it when I was at Google. I was at Google for about nine years, and I know how their system was built. How does it look for the average microservice architecture? Is it going to look like this?

This is an artificial system diagram of software scaling wide architecturally. Each of the circles in my mind is a service. There are examples of this sort of behavior, for instance, cache services, MapReduces. Things like that do have aspects that scale wide. It's not like it's unseen in the realm of software, but I would say, for microservices, it's the exception to the rule. There's also software that scales deep. This is a diagram based entirely on dragging and dropping of little circles in this slide, so you shouldn't take it very seriously, but this is what software might look like if it scaled deep. We can actually ask ourselves the question, what do real software systems look like?

I present to you some totally blurry, intentionally anonymized diagrams of real architectures just so we can have some grounding. These are chosen at random from a bunch of architectures where we have data for this at my company, which is irrelevant to the pitch but just lets us think about it. This is about a dozen or so services. There are actually two distinct sets, it turns out. One, I think, has to do with producing ML type models, and the other serves them. You have an indexing pipeline and a serving pipeline. This is more like 50 services. This is more like 100 services, but I couldn't fit it on the screen. This is more like 1,000 services, and I really couldn't fit it on the screen. This thing goes like way out to the next building. It's this crazy.

The answer is that microservices at scale are deep systems. I feel really confident about this having worked with a lot of folks and thought about this for a long time. I'm absolutely positive about this. By deep system, I mean the architectures have greater than or equal to four layers of independently operated services. Some of those will be in-house, some of them will be managed services in the cloud, whether they are just literally a managed open-source service, like a managed Kafka instance, or if it's totally proprietary stuff, like Cloud Spanner or DynamoDB or whatever. They're independently operated services, and with each layer of the service architecture, you introduce a new opportunity for miscommunication, multitenancy, unexpected releases and unexpected side effects to those releases. These layers are really the thing that produce problems for operators and developers in these systems.
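The four-layer rule of thumb can be checked mechanically if you have a dependency map. What follows is a minimal sketch, not something shown in the talk: the service names and the `calls` mapping are hypothetical, and it assumes the call graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it calls.
# Names are illustrative only; a real map might come from trace data or a service registry.
calls = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["index"],
    "payments": ["ledger-db"],
    "inventory": ["managed-kafka"],
    "index": [],
    "ledger-db": [],
    "managed-kafka": [],
}


def depth_below(service, graph):
    """Number of layers in the longest call chain starting at `service` (inclusive).

    Assumes the graph is acyclic.
    """
    @lru_cache(maxsize=None)
    def layers(name):
        children = graph.get(name, [])
        if not children:
            return 1
        return 1 + max(layers(child) for child in children)

    return layers(service)


def is_deep(service, graph, threshold=4):
    """Apply the rule of thumb from the talk: four or more layers counts as deep."""
    return depth_below(service, graph) >= threshold


print(depth_below("web", calls), is_deep("web", calls))
```

In this toy graph the longest chain is web, checkout, payments, ledger-db, so the system is deep from the point of view of `web` but not from the point of view of `search`.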

What does that sound like? It might sound like this. This is a classic thing debated on Twitter. That is a great indicator of a deep system. This is another one "Where's Chris? Things are totally on fire. He or she is the only person who actually knows how to debug this problem." I think people here probably can resonate with that. I certainly have been there as the person looking for Chris, not, unfortunately, as the Chris. "It can't be our fault. Our dashboard says we're healthy." I saw this all the time at Google. You've got two services, there's definitely a problem. Everyone's agreed on that.

The consumer says, "It's the service," and then the service says, "No, it's the consumer, something in the client." Sometimes it's the network, and they're both right. This is a huge problem and it produces a lot of friction in an on-call cycle, but also reduces trust within organizations. This is the kind of thing that people don't forget, and it's bad to have relationships within work where you don't trust people.

"Kafka is on fire," also something everyone here is probably quite familiar with. This is usually a symptom of multitenancy. It's not really Kafka's fault. it's tough if you had this great idea, the platform team is going to run one giant Kafka instance for our entire organization, which does make sense, but it turns out that a single bad actor can run rampant in that Kafka instance, and that has a lot of unintended consequences. This is another symptom of a deep system.

This is a real story from myself. I worked on Google Monarch, which is Google's high availability time series system. I think they've since publicly talked about it, and they've quoted that it's running on 200,000 VMs all the time. It's a giant system. I was in a review with the SVP at Google, and he said he needed 100% availability from Monarch, which I thought was a preposterous statement, because that doesn't exist. I said, "How many nines do you actually need?" and he was just, "100%." To me, this is evidence of politics. Politics are often a part of deep systems, where he, of course, wasn't responsible for budgeting Monarch, so he could ask for ridiculous things like 100% availability. This doesn't make any sense to anybody, including him, I think, but he didn't care. It's like, politics are a symptom.

Oftentimes, you don't know you depended on something, and then it goes down and you do realize you depended on it, but this is the downside of abstraction. We create abstraction and then we forget that no one knows what's underneath abstraction anymore. This happens all the time. You've seen a dashboard, and you can remember seeing the dashboard. You want it, but you can't find it. You can't string search for it, and so it's basically not useful.

The themes here are a lot of people-management issues, security issues, multitenancy, half the stuff. I would specifically talk about big customers. We have seen this a lot in our travels at LightStep. You have a pretty reliable available system with decent latency, and you have 100,000 customers, but 5 of those customers generate 70% of your revenue and 70% of your workload. Those customers end up having a very bad experience despite the fact that the average or even the P99 metrics look pretty good. This is another form of deep systems and multitenancy causing trouble. Performance, and then, finally, observability. I put that one in blue, mainly because I actually have more to say about that than other parts, just because I've just spent too long thinking about it. We'll focus a little bit more on observability in the rest of this talk, but I really want to make sure people understand this basic concept of deep systems and what that does to all these different aspects of running a business and having an application. Hopefully, that part is clear.

Control Theory

Part two, control theory. This is not my area of expertise, to be clear, but I'm going to try and step way back. Observability is such a long word. I think it's six syllables, I can't count, it's too many, but it really comes back to control theory and I think in the '60s. Why do we care about observability? It's an awfully long word. It definitely gets a lot of people talking about it, but why is it important? It sounds very passive to me. It actually sounds like you're looking at the system through this pane of glass, and it's this blurry thing on the other side, and it feels very disempowered to me as a word, but we must care about it for some reason.

Let's go back to the definition. In control theory, you have this idea of a system. The system has a state vector, which mathematically is just an internal implementation detail of everything happening in that system. It has a number of inputs, where operators can control the inputs to the system, although they can't directly control the state vector, and a number of outputs. Observability is concerned with the outputs and the state vector. It's saying, observability is a measure of how well you can infer the internal state just using the output. You don't get to cheat. You can't actually look at the internal state unless you make the internal state an output of some sort. You just get to look at the outputs, you have to figure out what's going on inside.

The reason why I think people have used that term a lot lately is that it's pretty rare to be able to run a real honest-to-goodness debugger, like GDB or something like that. It's really hard to run something like that in production these days for a variety of reasons. All we've got are the outputs, which in modern days are basically the telemetry. You have the telemetry, you have to infer the state of the system. Controllability, on the other hand, is very similar. You have the inputs, and the question is, just by controlling the inputs, how well can you manipulate the internal state of the system? That's controllability. In control theory, controllability was the thing that everyone was talking about and observability was this little sideshow and it wasn't the point. I actually kind of wish that we could get our industry back over to the controllability side of things, because that's really the one that matters. Having control of your system is a very empowered position to be in. Why do we talk so much about observability?

There are a number of reasons for that, but the most important one is that, mathematically, controllability and observability are duals. This means that colloquially, if you make a change to improve one, you're also probably making a change to improve the other. To me, this is profound, like, "Whoa, this is really deep." The point here is that if you make a change to improve one, you're probably improving the other.
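For readers who want the formal version, the standard textbook formulation for a linear time-invariant system is sketched here. It is not from the slides: the symbols A, B, C, x, u, y and n are the conventional ones and are introduced only for this note. Formally, the duality relates a system to its transposed (dual) system; the reading given in the talk is a looser, practical one.

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t) && \text{(state vector } x \in \mathbb{R}^{n}\text{, inputs } u\text{)}\\
y(t) &= C\,x(t) && \text{(outputs } y\text{)}\\[4pt]
\text{controllable} &\iff \operatorname{rank}\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n\\
\text{observable} &\iff \operatorname{rank}\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n\\[4pt]
(A, B)\ \text{controllable} &\iff (A^{\top}, B^{\top})\ \text{observable}
\end{aligned}
```

The last line holds because the observability matrix of the pair built from the transposes is exactly the transpose of the controllability matrix of the original pair, so the two ranks are always equal.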

A good example of this would be the craze around service mesh. I don't think people necessarily are thinking this themselves, but my gut feeling is the reason it's so exciting is we instinctually realize that we're getting both. Service mesh gives you a single point of control for the internal state, i.e. the RPCs in your system, and it also gives you a single point to observe the telemetry. That's a very appealing thing. A lot of changes that we make in one place are effective in the other as well. These two things are closely linked.

What Deep Systems Mean for Observability

That brings us to part three. What do deep systems mean for observability and for controllability as well? Here's a simple diagram. On the horizontal axis, we're looking at the total number of services, and on the vertical axis, there's the number of developers per service. The transformation in our industry right now is really going from pure monoliths, where you have, basically, one service with a lot of developers, and then you kind of quickly try to spread that out into smaller services around that, and then you develop layers beneath it. Somewhere in the midst of that transformation, you have a deep system once you get four or more layers, and you're confronted with all these sorts of problems that we were talking about earlier.

This is the overall arc in our industry. If we want to think about why this is so painful, it's probably easiest to talk about stress. Stress can be defined as responsibility without control. If you are in a deep system, and let's say you're responsible for the service at the top of that triangle, very literally, the only thing you can control is your own service. From a controllability standpoint, that is what you control. If anything beneath you is having a bad day, you are actually responsible for that, and it bubbles up to you. The relationship between what you control and what you are responsible for as your system gets deep is completely out of hand. I want to emphasize that the thing at the top is a dot and the thing on the bottom is actually an area. I'm going to do some pseudoscience in just another slide or two, and we have to think about that area when we're doing it.

Here's the pseudoscience. My contention is that, as your system gets deeper, you don't really get to control anything more than the service you started with, but your scope of responsibility is actually proportional to the square of the depth of your system. This is why things get so totally out of hand once you're in a deep system if you're using conventional tools. Everything about observability is about tightening that gap, both by giving you more leverage and more control, so bending that bottom curve up, and by helping you cut through what you're responsible for as quickly as possible so that you don't feel like you have to do a linear search through it, because that is not going to work anymore.
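To put rough numbers on that claim, here is a toy model. It is an assumption made for illustration, not a measurement and not from the slides: layer k of the triangle is taken to contain k services, so the area under you grows with the square of the depth while what you control stays at one service.

```python
def services_in_scope(depth):
    """Services you are responsible for under the toy triangle model.

    Assumption (not from the talk): layer k contains k services, so the total is
    1 + 2 + ... + depth = depth * (depth + 1) / 2, which grows like depth squared.
    """
    return depth * (depth + 1) // 2


def services_you_control(depth):
    """You control only your own service, however deep the system gets."""
    return 1


for depth in (1, 2, 4, 8):
    print(depth, services_you_control(depth), services_in_scope(depth))
```

Doubling the depth from 4 to 8 takes the scope of responsibility from 10 services to 36 in this model, while the control column never moves off 1.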

This is how I think of deep systems affecting observability, conceptually. We start with a super abstract thing. We also have this way too concrete thing. In the middle, I'm going to use this pyramid structure for the remainder of this discussion just to help ground this in something that's a little bit more concrete, but not too concrete.

Managing these systems - this is something I feel really strongly about. This guy, Alex Hidalgo, is writing a book about SLOs, which I'm about to talk a lot about, and I would encourage people to read that, but this is my one-slide version of that book basically. If you want to manage deep systems, I think you can model almost everything in terms of service-level objectives. Not always, but typically, SLOs are objectives around latency, errors, and throughput. Sometimes there are some other things that sneak in there, but that's the lion's share of them. The idea is that if you want to understand the relationships between these services, the best way to do it is to think about what your consumer needs. You should always be thinking if you're a service owner, "What does my consumer actually care about?" Those are what you structure your objectives around and everything you do should be about basically one of three things.

The first one is obvious. It's releasing new service functionality. We're all, hopefully, as developers, at some level, responsible for pushing new functionality. If it wasn't for that, the easiest way to maintain your SLOs would be just to walk away from the keyboard and never push another change. We do need to do this, it's number zero. Of the other two things we need to do, one is to gradually improve our SLOs, which looks like this: "This quarter, I have an OKR to improve my P99 performance by X milliseconds," something like that, or "I want to tighten my error budget even further, so our consumers see fewer 500s," that sort of thing.

The other thing we need to do with SLOs is to rapidly restore them. This is basically to say, at 2:00 in the morning, you can get woken up and you have to get your SLO back into compliance as quickly as possible. Typically, this involves rolling something back, although unfortunately, it's often not your service that needs to be rolled back. Something downstream or upstream can change, and that can affect your SLOs. I think it's really important to focus on SLOs because they both make this problem feel less overwhelming. I think the total number of things happening if we're talking about that diagram with the internal state, the internal state of a deep system is literally too much for a human being to comprehend, so we need something that's a little bit more manageable. It's also conveniently the thing our consumers care about. SLOs are super important.
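As a concrete illustration of what "in compliance" can mean, here is a minimal sketch that checks a latency objective and an error objective against a window of requests. The `Slo` class, the field names and the thresholds are all hypothetical, and throughput is left out to keep it short.

```python
import math
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical objectives; the thresholds are illustrative, not recommendations."""
    p99_latency_ms: float
    max_error_rate: float


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(latencies_ms, status_codes, slo):
    """Return the measured indicators and whether each objective is currently met."""
    p99 = percentile(latencies_ms, 99)
    errors = sum(1 for code in status_codes if code >= 500)
    error_rate = errors / len(status_codes)
    return {
        "p99_latency_ms": p99,
        "error_rate": error_rate,
        "latency_ok": p99 <= slo.p99_latency_ms,
        "errors_ok": error_rate <= slo.max_error_rate,
    }
```

A usage example: `check_slo([100] * 98 + [900, 950], [200] * 99 + [503], Slo(p99_latency_ms=500, max_error_rate=0.02))` reports the latency objective as missed and the error objective as met, which is the kind of compact signal an on-call engineer can act on at 2:00 in the morning.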

In a deep system we have to control the entire triangle below us in order to maintain our SLOs. If we're not able to account for and react to the things we depend on, we will not be able to have control over SLOs and, thus we'll disappoint our bosses and our consumers and everyone else, and it's a bad day. This is what it's all about. I spent a lot of time in this slide just to make sure it sunk in. I hope this makes sense to people.

There's that word again - control, controllability, observability. These things are all very tightly coupled. The conventional wisdom about observability is that it's difficult, that Google and Facebook solved this - P.S., they did not - and they use metrics, logging, and tracing, so we should too. Obviously, I'm putting this up as a straw man, this doesn't make any sense. It's an illogical argument, just at a literal level, and it's also wrong, which I'll try to explain.

Three pillars, three experiences, that's one way of thinking about it. This is often referred to as the three pillars. You have a whole lot of metrics that come out of your services, you have a whole lot of logs, and maybe if you're into the new stuff, you also have some traces. You develop a product strategy or observability strategy around three products or at least three different SKUs that you buy separately. Then from a workflow standpoint, you need to go back and either improve your SLOs or resolve them quickly using these three different products.

This is really crazy-making for me. This is absolutely not a pitch for any particular vendor, but I think this is a terrible way to think about observability, and it's not an honest representation of how things work at Google. I can't really speak for Facebook. We did have some of these things, but I would say very frankly that what we had at Google for observability was pretty bad, all things considered, at least when I left in 2012. It was better than what other people had at the time, but we spent a lot of time struggling to understand our own deep system and didn't really have the right technology to do it. This is nothing to mimic, I guess, is what I'm trying to say. I think it's upside down, in fact.

Let's think about a different way of modeling it. Let's talk about two giant pipes of data. I definitely think all that stuff you saw is super important. I'm not good enough with this to go back to the slide, but we are talking about observability being based on the outputs. Those are the three fundamental outputs. It's the telemetry. They're the three pillars of telemetry, not the three pillars of observability. You can think of each pillar of telemetry as a pipe. One pipe is a bunch of metrics, another pipe is a bunch of logs. Then if you don't have any traces, the cognitive load of understanding anything below you is basically tantamount to doing a linear search through metrics and logs and dashboards around those metrics and logs. If I haven't made it clear enough, as your system depth increases, what's underneath you may be dozens or hundreds of services, so that's fundamentally infeasible, I think, for a human being to do that kind of information retrieval exercise. It's too much to take on.

The answer is to sprinkle traces on it, which doesn't make any sense either. If there's maybe one thing you can take away from this talk, it would be that deep systems are a thing. The second thing is that traces are not sprinkles. If you put traces on top of everything else, it will solve some problems, but they're not very interesting problems. You'll be able to look at individual transactions, but the more important task is trying to reduce the cognitive overhead of everything else that you're dealing with. You have too many metrics and too many logs. How can we use traces in some fundamentally new way to reduce the overhead of the metrics and logs? I'll focus on this piece.

The idea with traces, I think everyone here got the basic pitch, but it's really like fancy logging. You have a transactional log that goes all the way through your stack and follows individual requests as they propagate from service to service. By their very nature, they span the entirety of your architecture, or at least they should. One aside here is that, at Google, we had a little bit of a leg up in that we were able to trace through our storage layer. For most people, who are sane anyway, you don't try to write your own storage layer from scratch, and you depend on someone else, like a cloud provider or a database or something like that. I would love to see the cloud providers find some way to share a little bit more about what's happening within their managed services with their consumers that would allow us to understand our dependencies, or at least understand why our queries are slow, without just sort of grasping at straws or filing support tickets.

Anyway, you got traces that go through a lot of your system, and they provide context. In this case, we have a dependency on a particular service that's slow or having errors or what have you. The important thing is not that, but everything else. What the traces can allow you to do is to eliminate everything else as interesting. You can rule out any hypothesis that's not having to do with that particular dependency chain. In this particular diagram are 7 services, but imagine that there are 50 or 100 or 1,000 services. This is a much bigger win. You're basically taking a geometric problem and making it linear in terms of how far you need to look. The most valuable thing that traces can do is not to explain what happened but to explain what did not happen. That's what we should be using them for.

Then if we go back to this diagram, the point of traces should be to use the context from traces to provide a filter, and reduce our consideration. I don't want to make it sound like traces are going to solve your problems. They're not. What they will allow you to do is to focus on the subset of your telemetry that actually is related to your problems. This is a pretty powerful difference in a deep system. I hope that that makes sense to people. By the way, no one has done this yet. I think we have a lot of people who are hacking away at it, but this is where I think things are going and I think where we should be headed and the way we should think about these problems. You can get a glimpse of it in certain places, but this is kind of aspirational in my mind.
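The filtering idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not a description of any existing product: the span tuples, the log tuples and the service names are all hypothetical, and real telemetry would carry far more structure.

```python
# Hypothetical spans from one slow request: (service, parent_service, duration_ms).
trace = [
    ("frontend", None, 1200),
    ("cart", "frontend", 1150),
    ("pricing", "cart", 1100),
    ("pricing-db", "pricing", 1050),
]

# Hypothetical log lines from the whole fleet: (service, message).
logs = [
    ("frontend", "request took 1200ms"),
    ("search", "reindex finished"),
    ("cart", "slow call to pricing"),
    ("recommendations", "cache warmed"),
    ("pricing-db", "lock wait timeout"),
    ("email", "queue drained"),
]


def services_on_path(spans):
    """Services the traced request actually touched."""
    return {service for service, _parent, _duration in spans}


def relevant_logs(all_logs, spans):
    """Keep only log lines from services on the traced path; everything else is ruled out."""
    on_path = services_on_path(spans)
    return [(service, message) for service, message in all_logs if service in on_path]


for service, message in relevant_logs(logs, trace):
    print(service, message)
```

The trace does not say what went wrong here. It says that search, recommendations and email are not worth looking at for this request, which is the part that keeps the search from growing with the whole fleet.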

The point of traces is that they can reduce your cognitive load from the square of the depth of your system to something closer to the depth of your system. A deep system is always going to be harder than a shallow system to understand, but hopefully, it's not so much harder that we're in paralysis.

The idea is that observability will be able to shrink this gap between the scope of our responsibility and the scope of our control, and traces can provide something pretty fundamental. They can provide the backbone for a much more rigorous and automated process of observing and understanding our systems.

I may have made a mistake by doing this, but I've intentionally tried to leave extra time for questions, and so on and so forth. I'm going to go through what I've talked about so far. We're going to have plenty of time for an actual discussion or questions, but let me just summarize what we've talked about so far.

Microservices don't just scale wide, they scale deep, and we have to recognize when we're in a deep system. The stress of operating a deep system is that you have far more responsibility than you have control, and we need to find some way to minimize that. The controllability of your SLOs depends on high-quality observability. You can't divorce the two from each other. They're very tightly related. The three pillars of observability is a terrible metaphor, and traces are not sprinkles. Traces need to be more like the backbone, because they can reduce cognitive load from the square of the depth to just the depth of your system. As I said, tracing should be the backbone for simple observability in deep systems.

I'm definitely not here to talk about my product. You can play with it just because it touches on some of this stuff, and you can also provide any kind of feedback. With that, I'm hopeful that some people will stick around. We can have some Q&A, or even, if someone wants to grab the mic and just say whether or not this resonates with their system, I'm really interested to get some conversation going, too.

Questions and Answers

Participant 1: All that I got from the entire conversation is the traceability part. When I architect or when I design something, I do think about traceability, telemetry, and how my microservice will get into the deeper. Apart from the traceability, is there any pattern which the architect has to follow so that you know the best way it can be implemented in-depth? Do you suggest any way of approaching the architecture itself taking traceability?

Sigelman: I think the most important thing for architecting around depth is probably some sort of rigorous standardization of the way that services are deployed and operated. I think things like SLOs, from a managerial standpoint, are incredibly important. I'd also say that it's not like your entire system has to use one language, but allowing a total proliferation of 12 different languages or something like that makes it extremely difficult to have any leverage at the platform level. You can't have a security team roll out a fleet-wide thing that sits at the application layer if you have too many different languages, for instance. I think service mesh is very helpful for this. Having the platform team create a paved path is, I think, the most important single thing for ensuring controllability and observability.

Participant 2: Your three pillars was metrics, logs, and traces. I don't really know what you mean by trace. Can you define trace and how it compares to metrics versus logs?

Sigelman: I have a rant that I haven't written down yet about logging and tracing basically being the same thing. I was talking to a customer, and he was saying that he thinks of logging as selfish tracing, where the idea is that logging is about your own service, but the logs that you're creating are difficult for other people to understand or consume because they're not connected to transactions. The only difference between tracing and logging is that traces keep track of where the request came from. It's like a correlation ID in a logging system. If you use Splunk or something like that, a correlation ID is often a way of getting a version of tracing going with a logging system.
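The correlation-ID idea is easy to see in miniature. In this sketch the two handlers stand in for separate services, the ID is passed as a plain argument where a real system would use a request header, and every name is hypothetical.

```python
import uuid


def new_correlation_id():
    """Generate an ID at the edge of the system; uuid4 is just one convenient choice."""
    return uuid.uuid4().hex


def log_line(service, message, correlation_id):
    """Format a log line that carries the ID of the request it belongs to."""
    return f"correlation_id={correlation_id} service={service} msg={message}"


def handle_checkout(correlation_id, sink):
    sink.append(log_line("checkout", "charging card", correlation_id))
    # The ID travels with the call to the next service (in practice, in a request header).
    handle_payments(correlation_id, sink)


def handle_payments(correlation_id, sink):
    sink.append(log_line("payments", "card declined", correlation_id))


def lines_for_request(sink, correlation_id):
    """Stitch one request's story back together from logs written by different services."""
    return [line for line in sink if f"correlation_id={correlation_id} " in line]
```

Without the shared ID, the two log lines belong to two unrelated services. With it, `lines_for_request` returns both lines for a given request, which is a small version of what a trace gives you.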

The reason why tracing ends up being its own category of things is that the task of propagating the context along is very difficult, and so it required a bunch of additional work in the open-source world. OpenTracing, OpenCensus, OpenTelemetry are all tasked with that. Then, also, tracing generates so much data that, for a high throughput production system, no one has figured out how to get all that data recorded durably for a long time. Tracing often immediately introduces new problems, like how do you deal with sampling and how do you deal with summarization, that sort of stuff. Tracing is like logging but for entire transactions and that level of detail where you can't afford to store all the data, but they're very similar conceptually. It's just that those engineering constraints force a new set of solutions. That's how we ended up with a new term, I think.
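One common way to approach the sampling problem is to decide per trace, up front, whether to keep it. The sketch below shows a hash-based version of that idea. It is an illustration under stated assumptions, not how any particular tracing system is known to implement it, and the function name and the choice of SHA-256 are arbitrary.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Decide once per trace whether to record it.

    Hashing the trace ID (rather than rolling a random number in each service) means
    every service that sees the same ID makes the same decision, so kept traces stay whole.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


kept = sum(keep_trace(str(i), 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

The trade-off is the one described above: a 10% sample rate cuts storage by roughly 90%, but the one trace you wanted may be in the part that was dropped.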

Participant 3: You mentioned microservices and service mesh. How does any of this change, if at all, with eventually consistent event-driven, event-sourced systems?

Sigelman: I really like the idea of things being eventually consistent and event-driven, and so on. I think that the appeal of the systems that precede those sorts of event-driven eventually consistent thing is their simplicity in terms of understanding. It's not difficult forensically to figure out what happened if you have the trace. Whereas, for these eventually consistent systems, that's like a fancy way of saying, in some ways, that it's going to be really hard to look at a single transaction and understand everything that happened, because interference between transactions becomes more of the norm. Eventually consistent systems, I think, present a lot of diagnostic problems, both from a correctness standpoint and when there's a choke point or some kind of concurrency issue. I think, also, it can be very difficult to diagnose the "Kafka is on fire" thing. I know Kafka is not a perfect example, but that's the sort of thing I'm talking about. Those sorts of problems, where no one actor is necessarily doing anything wrong but the sum total puts something over the scaling limit, are just so nasty. With those types of architectures, I think, it's easier to get into that kind of pickle.

A lot of the best engineers at Google, in the sense of being the most senior and decorated, certainly much smarter than I am, would turn out these very simple designs, and the simplicity seemed like a design constraint. It was just debuggability, because it was so hard to understand what's happening. Jeff Dean has a great talk he gave at Berkeley many years ago about how the way that he was recommending to deal with tail latency was just to make two requests for every one, just literally make multiple requests and take the one that comes back first. It's so gross. You're literally halving the efficiency of your system from a dollar standpoint in order to deal with this tail latency issue. Those sorts of inelegant things can often be very usable.
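The "make two requests and take the one that comes back first" idea can be sketched in a few lines. This is a minimal illustration, not the design from the talk being referenced: `fake_backend`, the replica names, and the latencies are all hypothetical stand-ins for real RPCs.

```python
import concurrent.futures
import time


def fake_backend(replica: str, latency_s: float) -> str:
    """Hypothetical stand-in for an RPC; sleeps to simulate latency."""
    time.sleep(latency_s)
    return f"response from {replica}"


def hedged_call(calls):
    """Issue every call at once and return whichever result arrives first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(calls))
    futures = [pool.submit(fn, *args) for fn, args in calls]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    # Do not block on the slower request; its result is simply discarded.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


result = hedged_call([
    (fake_backend, ("replica-a", 0.30)),  # slow: stuck in the latency tail
    (fake_backend, ("replica-b", 0.01)),  # fast
])
print(result)
```

The cost is visible in the code: both backends do the full amount of work for every call, which is the efficiency trade-off described above.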

Participant 4: My question is, what if, in my microservice cluster, I already have all the metrics, logs, traces set up? What's the life look like after I have everything set up? What kind of experience do you have to share after I have metrics, logs, traces set up?

Sigelman: I think I was saying this a few minutes ago. A lot of this, in my mind, is a little bit aspirational. That is to say that I don't know of a way to have everything work perfectly right now. I think the biggest hurdle is actually the telemetry. It's less about the solutions themselves but just the quality of the data that we're getting out of our systems is so poor that the quality of the outputs that we have right now is very low.

I'm totally biased about OpenTelemetry, but there are a lot of other folks involved with that project. I think it has a lot of promise, in that the trouble with a lot of the previous efforts, although the idea made a lot of sense, was that they required developers to go and do a lot of manual work. I think most developers aren't getting paid to instrument code, so they don't really want to do that, and so you end up with low-quality telemetry for most of your system. OpenTelemetry does have an effort now to automatically install itself in the process. Just by turning it on, you'll be able to get pretty high-quality telemetry streaming out. Until that happens, until the outputs are there, none of this stuff is going to work that well.

I think the most important thing as an industry is for people to push for higher quality telemetry, where the main thing that's missing is actually tracing. There's also pretty low-quality metric instrumentation, where you have huge cardinality problems, stuff like that. Getting higher quality telemetry would be enough to unlock a lot of value in a number of different approaches.

Then the other piece of it is to design around use cases. A lot of observability tooling right now is built around these data types, which is sort of stupid. The telemetry is important to the tool, but it should not be important to the user. The user is generally trying to resolve an incident or they're trying to do a release or whatever. The use cases for observability should be built around those actual real-world scenarios, not around a database, like a database query interface for these underlying telemetry data types.

I think that transition is starting to happen, but it's a very painful thing to do, especially in open source, because the various pillars, so to speak, have built these separate databases that don't naturally integrate. If we're talking about open-source solutions, a lot of the work has to do with reimagining these workflows around use cases. For vendors, it's a little bit easier, but you have to pay for this.

Participant 4: Basically, optimizing everything you have makes life easier for debugging and operations.

Sigelman: I would encourage people to evaluate observability based on use cases. I think it's really silly to say, "Do I have traces, metrics, and logs, or do I have this particular data type?" It's much more interesting to say, "I'm doing a release. What do I want to have on the screen while I do a release to guarantee that I have confidence even at 4 p.m. on a Friday?" That's the sort of thing that we need to be asking ourselves when we're evaluating tools rather than, "Does it have this particular data type in it?"

Participant 5: As a developer, how would you introduce SLOs in your company and to other teams and trying to get everybody on board with that?

Sigelman: I think that it's easier to motivate people to establish SLOs if they get something for it. One way to do that, depending on your organization, sometimes management actually has the respect of the engineering organization, so they can just say, "We are doing this," and then that's the easy case. When that doesn't happen, I think it's also feasible to say, if you're a service in the middle of the stack or even if you're at the top of the stack, there's like a mobile or web app that depends on you, just to say to them "I'm going to establish an informal contract with you that I'm going to do these things and get to this level of performance, and then I'm going to stop."

The point of an SLO is that you're giving yourself permission to stop at some point. You're saying, "I'm within my error budget, so I'm not going to actually prioritize greater reliability. Even though you had an outage last week, I'm still within my budget." It gives you the capability to predict your velocity better than if you're just reacting to the loudest, squeakiest wheel above you. I think SLOs are a good way to establish expectations.
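The error-budget arithmetic behind "I'm still within my budget" is simple enough to show directly. The numbers here (a 99.9% target, one million requests in the window, 400 failures from an outage) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


# Assumed example: 99.9% target, 1,000,000 requests, 400 failures so far.
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 400)
print(f"budget: {budget:.0f} failed requests, remaining: {remaining:.0%}")

# While the remaining fraction is positive, the team has "permission to stop"
# prioritizing reliability work and can keep shipping features.
within_budget = remaining > 0
```

Under these assumptions an outage that burned 400 requests still leaves more than half the budget, which is the situation described in the quote above.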

Participant 6: I had two questions actually. I think this whole thing is super fascinating. First, because you talked a little bit about control theory and introduced control here. I'm curious: often, in other systems, the next thing that would come in would be feedback and actually using that feedback to generate that control. It seems, certainly, in my organization, it sounds like what you're talking about, you're creating context and observability so that we can go debug problems as opposed to actually responding. I think we have simple examples, auto scaling groups, and after that, it's like, humans. I'm curious if you have any examples or cases of where feedback has been used to actually create that automated control mechanism.

Then the second question is, you talked about the overlap of logs and tracing. I actually find the overlap of logging and metrics very interesting. There's a lot of opinions out there, and I think people are using them differently in different organizations. I'm curious if either you have a strong opinion on how you separate them or if you see some convergence there.

Sigelman: In terms of feedback loops, yes, that's really tricky. The most interesting thing that I can think of, to refer back to Google, I think they've written about this, so I don't feel like it's a problem to talk about it. They have these really large multitenant storage systems, so Bigtable, Spanner, whatever. They spend I don't know how much money, but a lot of money, many AdWords clicks are spent on their storage systems every second. They naturally want to optimize their usage, just probably not dissimilar to any of you. It's like you have a database, it's expensive, you want to make sure you use all the capacity.

The trouble is that if you have some big storage system and it's multitenant, in order to maximize your efficiency, it's the Kafka is burning scenario all over again. At Google, there are about 2,000 different product lines, and a single product line shouldn't be able to blow up the entire shared resource. Dapper, which I helped to develop, was Google's tracing system. First of all, going back to the controllability thing, Dapper was actually a hack on the controllability thing. There's something called a control flow at Google which was the way that we propagated context for totally different purposes, and Dapper was only possible because that already existed as a way to control the behavior of applications. We piggybacked on that and made a tracing system out of it. Then we went back in the other direction and took the Dapper trace context and just stuck one idea in it. The idea was what product did you start with.

At the bottom of the stack, you had no idea, but at the top of the stack, it was Gmail or web search or whatever. There's a unique ID for Gmail and web search and calendar and all the rest of them, and you kept that unique ID literally in the thread-local, all the way down the stack, into the kernel basically, where it was actually doing disk I/O. Then, this is the feedback loop piece, we would aggregate all the information. If you were doing disk writes or something, you could say, "I want to guarantee I only have 1,000 units of disk writes for this entire region. I want to allocate 7.5% to Gmail and 2.8% to calendar, whatever," and it would, in real-time, aggregate all the disk writes and then push back if someone went over their quota.

We were able to run massively distributed multitenant databases with lots of different clients. The second you went over your quota, even though you, as Gmail, are way at the top of the stack and your identity had otherwise been lost throughout the architecture, we were able to enforce that, which probably saved Google hundreds of millions of dollars a year. I'm making that up, but it's certainly more than that. That's a great example of taking an observability tool, which was the trace context, throwing one little integer into it, and using that to do resource accounting and enforcement across the entire architecture. I think a lot of organizations would benefit from that similarly from an economic standpoint.
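The pattern described above, tagging a request with its originating product at the top of the stack and enforcing a quota at the bottom, can be sketched in a single process with Python's `contextvars`. This is a toy illustration and not how Dapper or Google's storage systems work internally: the variable names, the quota table, and the single-process accounting are all assumptions.

```python
import contextvars
from collections import defaultdict

# Hypothetical stand-in for the product ID carried in the trace context.
origin_product = contextvars.ContextVar("origin_product", default="unknown")

# Assumed quotas: 7.5% and 2.8% of a regional budget of 1,000 write units.
QUOTA_UNITS = {"gmail": 75, "calendar": 28}

usage = defaultdict(int)  # running total of disk-write units per product


class QuotaExceeded(Exception):
    """Raised to push back on a product that went over its share."""


def disk_write(units: int) -> None:
    """Bottom of the stack: learns the caller only from the context."""
    product = origin_product.get()
    limit = QUOTA_UNITS.get(product, 0)
    if usage[product] + units > limit:
        raise QuotaExceeded(f"{product} is over its quota of {limit} units")
    usage[product] += units


def storage_layer(units: int) -> None:
    disk_write(units)  # the product ID is never passed as an argument


def handle_request(product: str, units: int) -> None:
    """Top of the stack: tag the request once, where the product is known."""
    token = origin_product.set(product)
    try:
        storage_layer(units)
    finally:
        origin_product.reset(token)


handle_request("gmail", 70)
try:
    handle_request("gmail", 10)  # would take gmail past 75 units
except QuotaExceeded as exc:
    print("pushed back:", exc)
```

In a real distributed system the ID would travel in request headers alongside the trace context and the usage totals would be aggregated across machines; the sketch only shows why one extra field in the context is enough to attribute work at the bottom of the stack.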

Then your second question was logging and metrics. I hate the term metrics. I didn't say that because I felt like I was already venting enough. It doesn't make any sense. A metric is just a statistic over time. That's all it is. Yes, totally, it's like you can count the number of times you log something, and that becomes a metric. I don't know why it gets to be a metric if it's emitted by the process. It doesn't make any sense. The only metrics that are really true metrics are things like gauges, like CPU load, memory usage, things like that that aren't counters. When you're counting events, which is almost all metrics, I agree, it's just a matter of when do you do the counting. If you do it within the process, we call it a metric. If you do it in the SaaS logging tool, we call it a query, it's just BS. It doesn't make any sense. Yes, I agree with you. It's very fuzzy.
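The point that a counter metric and a log query differ only in when the counting happens can be shown with a small sketch. Everything here is hypothetical: `in_process` stands in for a metrics counter, `log_store` for a logging backend.

```python
from collections import Counter

in_process = Counter()  # "metric": counted inside the process at emit time
log_store = []          # "logs": raw events kept for later querying


def log_event(level: str, message: str) -> None:
    in_process[level] += 1              # count now
    log_store.append((level, message))  # or keep the event and count later


def query_count(level: str) -> int:
    """Count at query time, the way a logging tool would."""
    return sum(1 for lvl, _ in log_store if lvl == level)


for lvl, msg in [("ERROR", "timeout"), ("INFO", "ok"), ("ERROR", "refused")]:
    log_event(lvl, msg)

# Same events, same number; only the moment of counting differs.
print(in_process["ERROR"], query_count("ERROR"))
```

A gauge such as CPU load has no equivalent in this sketch, because there is no stream of discrete events to count.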

Participant 7: I was thinking back to the question earlier about event-driven architectures, and it occurs to me that maybe what you're doing when you have an event-driven architecture is you're breaking the synchronous immediate consistency boundary between services. In the context of this talk, it seems like what you're doing is basically reducing the depth of your system by letting your service handle its own local cache building essentially of downstream services as a separate process. I guess my question to start with is, do you agree with that? If so, do you think there's a certain level of depth knowing that that's sort of a lever you can pull to control your depth, where this sort of more advanced observability becomes necessary or not?

Sigelman: That's a really good question. I don't think I have a prepared response for that one. I hadn't thought about the caching piece either. I'm not sure. I don't like BS-ing people, so I'm not going to make up an answer on the spot. That's a good question, and if you come up afterwards, I'll try and talk to you about it. I don't want to make something up and be wrong, or I'll hate myself watching the video later.

Participant 8: We're in the agile world, and what management does is just, "We need to do this," or "this team," seven people everywhere are mostly working on features. These are cross-cutting concerns, and not each one of them may be aware that they need to be aware of all this. How can you go back to management and say, "I need a complete new team which just does DevOps?" In the name of DevOps, they just give you two or three people. They sometimes throw in some jargon, throw in Kubernetes, Rancher or something like that. Fundamentally, as a software engineer, yes, I am aware that we all have to write beautiful logs in well-written English, which everyone understands, or any other language. My question to you is, how do I go back to management and say, "Ok, agile is fine," and then define you're transforming us? How do I make sure these kinds of problems are solved by an agile team? Can I say, "Give me an agile team who just does this?"

Sigelman: That's a great question and one I think about a lot. My opinion - and you should try this and then send me an email and tell me why it didn't work - is that the best answer would be to find a pocket within the organization where you have a direct consumer and producer relationship between two services where they're having a hard time agreeing about what's going on. That's the thing I described earlier, where two teams can't reach agreement even though they're talking about the same thing. Find that little pocket, have both of them adopt something that's, if not the best practices, at least good practices, show how it resolved the issue, and bring that as like a mini case study to management. The answer is definitely not asking management to go and implement this across the entire system. That will never work. I think you can create a little pocket of excellence.

The nice thing about these triangles is that they nest, so you can just find a triangle with three nodes on it, or two nodes, frankly, and show a lot of the value here as like a mini case study, and then try to build out from there. That's the way I'd approach it. I think that actually does work.




Recorded at:

Jan 14, 2020