Resources & Transactions: a Fundamental Duality in Observability

Summary

Ben Sigelman explores resources and transactions, both theoretically and through some real-world examples, to develop an intuition for how to understand a system more completely.

Bio

Ben Sigelman is co-founder & CEO at Lightstep, a company that makes complex microservice applications more transparent and reliable. He is an expert in distributed tracing and also co-founded the OpenTelemetry project.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Sigelman: My name is Ben Sigelman. I'm a co-founder and the CEO here at Lightstep. I'm here to talk about resources and transactions, a fundamental duality in observability. I've spent most of my career working on observability. I was at Google for nine years in the beginning of my career, and worked on Google's distributed tracing system called Dapper, and their high availability monitoring and metrics system called Monarch. Then, Lightstep, of course, has been focused on observability as well. It's taken me a long time to get here. I've come up with a different way of thinking about observability than I used to, and this is what the talk is about.

Transactions

What are transactions? On the right, you see a diagram of some system. We're going to be talking a bit about things from the perspective of this bank account service, which sits in the midst of some larger microservices architecture, some cloud native architecture. A transaction in this presentation is not necessarily a banking transaction. It just means any request that propagates from one part of the system to another (or doesn't), an entire request: all the work it does until it gets back and completes whatever it was trying to accomplish. Transactions are the things in your application that actually "do something for their end user," whether the end user is a human being or, in some cases, another program. If it's Twilio or something like that, maybe Twilio's end user is actually just another program somewhere running a script. Transactions are what deliver value to users or customers.

These days, especially for cloud native, transactions are distributed. They don't literally have to be, but they often are. They can be described at many different granularities. By that, I mean the same transaction could be described at a very coarse granularity, like just an entire end user workflow. If you're a ride-sharing app like Lyft or Uber, the entire flow of requesting a ride might be considered a transaction, and that's the only level of granularity you have. You might want to get a little more granular and think about the individual requests between services, like the HTTP requests. You might think of that as the granularity you want to use to describe things, or maybe you want to go further and look at some or even all local function calls. Then I suppose, theoretically, you could look at a transaction in terms of every single CPU instruction that happens across the entire system to accomplish it. Those are all valid ways of modeling transactions. Of course, as we go down the list, the cost of recording things at that level of granularity gets higher. In fact, it can get so high that there's a lot of overhead and you start affecting the transactions themselves. Theoretically, though, these are all just different granularities. The telemetry used to describe transactions is usually traces and structured logs, a structured log being a textual logging statement with clear key-value attributes. Those things are illustrated here. You can see that the bank account request has a request size attribute, some HTTP path, status code, latency, that stuff. These are theoretical attributes in this model for the pieces of the transaction.

I would argue as well that tracing will eventually replace logging. It will take time, but for transactions specifically, tracing will replace logging. I'll try to motivate that right now by showing you how much more flexible the tracing model can be than simple single-process logging. I'm not talking about 2021 here; this is more about zooming out to where observability is going. Here's a logging statement, some pseudocode on line 22. Each one of these logging statements really defines its own table. You have a series of keys that are defined by the struct here, request size, path, status, latency, reflected over here. These become the columns of this table. Then the values, which are pulled from local state or whatever, become the rows in the table.
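
To make that table analogy concrete, here is a minimal Python sketch, with illustrative field names rather than the actual code from the slide: each call to the structured logging statement emits one "row," and its keys are the columns of the implicit table.

```python
# A hedged sketch: the field names (request_size, http_path, and so on)
# are illustrative stand-ins, not the talk's real slide code.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("bank-account-service")

def handle_request(request_size: int, http_path: str) -> None:
    status, latency_ms = 200, 12.4  # stand-ins for real handler results
    # One "row" in the theoretical table this statement defines.
    log.info(json.dumps({
        "request_size": request_size,   # column 1
        "http_path": http_path,         # column 2
        "http_status": status,          # column 3
        "latency_ms": latency_ms,       # column 4
    }))

handle_request(1843, "/v1/bank-account/balance")
```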

Tracing Is Just a JOIN across Transactions

To spell that out and make it clear, you have this logging statement, and as the code runs in production it is populating this theoretical table. I'm not suggesting, of course, that we actually take this data and insert it all into MySQL or something like that, or even necessarily that we insert it into Elastic, or Splunk, or something like that. Just that there's a theoretical table being described by the logging statement itself, and that you can model things that way. In some cases, you have tools that allow you to run queries against those tables. What's cool about tracing is that in these logging systems, it's quite difficult, expensive, or outright impossible to execute flexible joins. A distributed trace is doing a join across the entire system. Again, if this is your system architecture, what we're going to do is implement a join of all of the structured events, whether you call them logs or trace spans, it doesn't really matter. You're still describing the transaction. We're going to use the tracing context to join the structured events from all these services together into one much larger table, where you have a table that has columns from these different services, color-coded here, where services A, B, and D are joined across. Then each distributed transaction represents one row in this table.
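
Here is a toy sketch of that join idea, assuming each structured event already carries a shared trace ID; it is not a real tracing backend, just the grouping that turns per-service events into one wide row per distributed transaction.

```python
# A toy illustration of "tracing is a join": structured events from
# different services share a trace_id, and grouping on that id yields
# one wide, service-prefixed row per distributed transaction.
from collections import defaultdict

events = [
    {"trace_id": "t1", "service": "api-gateway", "customer_id": "1753", "latency_ms": 310},
    {"trace_id": "t1", "service": "bank-account", "http_status": 200, "latency_ms": 290},
    {"trace_id": "t2", "service": "api-gateway", "customer_id": "2201", "latency_ms": 18},
    {"trace_id": "t2", "service": "bank-account", "http_status": 200, "latency_ms": 9},
]

rows = defaultdict(dict)
for event in events:
    service = event["service"]
    for key, value in event.items():
        if key not in ("trace_id", "service"):
            # Prefix each column with its service, as in the color-coded table.
            rows[event["trace_id"]][f"{service}.{key}"] = value

for trace_id, row in rows.items():
    print(trace_id, row)
```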

This is really powerful, because if you can think of things in this conceptual model, it's possible to run all sorts of analyses to figure out correlations between columns across service boundaries. Which in turn allows you to understand how behavior in one service might be affecting some behavior in another service. To be concrete about this, it's possible that you'd have a customer ID field in service A, at the top of the stack, and you might find that certain customers are involved a higher percentage of the time when latency is high in the bank account service. Then that gives you something to go on, like, how did that customer change their workload, or what have you? These sorts of joins are actually a really important mechanism to understand cause and effect across service boundaries. If you've been wondering what all the fuss about distributed tracing is, if you think of it in this model, distributed tracing is really a join for your structured logging statements, plus a bunch of semantics and query capabilities on top of that. That was transactions.

What Are Resources?

Next, we're going to talk about resources. What's a resource? Resources are the things that transactions consume in order to do their work. One side effect of this definition is that, by definition, a resource is finite. Your Amazon bill is a type of resource. Again, there are many different granularities. Throughput through a Kafka topic is another: a Kafka cluster can only support so much load, and when you've consumed all of it, things get really bad really quickly. You end up with a lot of pushback, very high latency, dropped requests, things like that. Similarly, CPU usage is totally fine until it's not. If you saturate CPU in a service, all sorts of things break that you took for granted. Even worse, if you exhaust the memory for the process, it just straight up crashes. You can also get really granular and talk about individual mutex locks, for instance. If there's a lot of contention on a single lock, a read lock that should take 180 nanoseconds can take a million times as long. That causes problems, too. These are all types of resources. What makes a resource a resource is that it survives across transactions and is shared across transactions. It's critical that you do share resources, because if you don't, your system will be incredibly expensive. That's the whole beauty of running in a multi-tenant, multi-request environment: you can make better use of the resources and share them across transactions. That's what resources are.

To make this a little bit more visual, I've drawn these boxes for a microservice, a Kafka cluster, and a mutex lock. This is just totally illustrative. I'm sure there's better ways to measure the health of these things. For a resource, what you want to think of is the amount that remains of that resource, at some level. It's an indicator of how much that resource has been consumed. You can see that CPU usage can spike, RAM usage can spike. You can see that consumer lag or producer lag is an indicator of health for a Kafka cluster, or the duration that you have to wait to acquire a mutex lock is an indicator of health for a mutex. Any resource has some health metric. What I'd want to emphasize here is that none of these is an indicator of the success or failure of an individual transaction. Although, certainly when things like CPU and memory usage spike, you have issues with transactions. It's meant to indicate the health of the resource. I'll talk about why that's relevant. Then the resources also have a bunch of tags. This is actually really important.

The purpose of these tags or attributes, in my mind, is manifold. Of course, you're just trying to understand and disaggregate. If you see a spike like this, you might want to group by region, or group by cluster ID, or something like that. That's fine. You should be able to do that in a time-series database. More importantly, these tags are the lingua franca used to communicate about resources and transactions. When a transaction crosses a resource, ideally the transaction is in some way annotated with that resource's tags. That can serve as a way to join from the transaction data to the resource data, or vice versa. That's a really powerful thing. I'll talk about that later, when we get into an actual example.
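
As a hedged sketch of what that annotation can look like in code, here is the OpenTelemetry Python SDK attaching resource-level tags to a span; the attribute names and values are illustrative, not a prescribed schema.

```python
# A sketch of annotating transaction telemetry with the tags of the
# resource it crosses, so traces and resource metrics can later be
# joined on those tags. Attribute values are made up for illustration.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

resource = Resource.create({
    "service.name": "bank-account",
    "service.version": "2021-11-01",
    "cloud.region": "us-east-1",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("bank-account")
with tracer.start_as_current_span("consume-topic-y") as span:
    # Tags of the Kafka resource this transaction crossed (illustrative).
    span.set_attribute("messaging.system", "kafka")
    span.set_attribute("kafka.cluster.id", "cluster-7")
    span.set_attribute("cloud.region", "us-east-1")
```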

Resources Are a Hierarchy, Too

I said that there are different granularities; there is also a hierarchy. This is true for transactions too, but I think it's more important to make the point here. You might have a Kafka cluster, which itself has many microservices. Within those, you have a bunch of VMs. Within those, you have a bunch of mutex locks. These things level up and level down. There's hierarchy in the resource landscape, as well as these health metrics.

Codependency

We've talked about transactions. They're the things that actually do work that your customers care about. We've talked about resources; they're the things that make transactions do something, and are shared across transactions. These are codependent. Here's a diagram of these resources. The green squiggly lines are meant to illustrate transactions flowing into or out of those resources and doing their work. You can see that the transactions are going, in this case, to different HTTP endpoints; in this case, to different Kafka topics; and in this case, you have readers and writers trying to acquire a lock on a mutex. There are different types of transactions, and we would like, when a transaction crosses a resource, for it to be annotated with the identifiers from that resource. If this topic Y request is coming in, then in terms of that schema of different levels of granularity, it's quite valuable, if you ever want to be able to understand how the resource and the transaction interact, for the transaction to be annotated with the region and the cluster ID of the Kafka instance. Or, for this endpoint, for the transactions to be annotated in the trace with something like the hostname or container ID, the service name, the version, and that sort of stuff. Again, you can use the tags from the resource to annotate the transactions and serve as a map between these two worlds. This is an illustration of that. The green things are basically traces. Then for the resources, you typically use metrics telemetry, time-series statistics, to represent the health of those resources. Not always, but typically.

Resources and transactions are completely, totally codependent. That's a really important point. Which is to say, if your resources are unhealthy, the transactions suffer greatly, and if the transactions become too numerous, the resources suffer greatly. One of the things that bugs me the most about continuous integration as a topic, actually, is this idea that code can even be correct or incorrect without knowing what the workload is. I think that's a total mirage. I think CI is great. Of course, we use CI at Lightstep. The only way you can know, at least for systems software or backend software, whether the code is correct or incorrect is to run it under a real workload. The workload is actually part of the semantics, and all the tuning around how resources are configured, and even how resources are designed and implemented, depends in large part on what the actual workload of transactions looks like. It's not just that you might need more of something, you might actually want a different thing. I don't like it when people hate on MySQL too much. I hated on it a little bit earlier for certain types of data, but if your data fits easily on one machine and you just want relational semantics, there's really not that much wrong with it, aside from some of the replication issues. Similarly, if you want something that's truly planet scale and cheap, you're going to have to move off of that model and do something else. Resources really can't be considered correct or incorrect, from a design standpoint or from a code standpoint, until you consider the transactional workload as well. Observability turns out to be one of the easiest ways to understand how the workload is affecting the health of the resources and vice versa.

Speaking of codependency, the customers of your application only care about transactions. By that, I mean if you have an outage, and someone's trying to write up a report, especially for a non-technical end user, they really don't care at all about the fact that you ran out of resources. That is not their problem. It's your problem, not their problem. They only care about whether their transactions were correct and came back in a reasonable time, correctness also implying that there wasn't an error. Correctness and latency are the two things that a customer or end user will care about for their transaction, and nothing else. How you accomplish that is up to you. For an operator, the only things you can control are your resources. By operator, that includes DevOps and SREs. The whole point of software is that we're not sitting there taking the facts from the customer, then filling out some stuff, then doing some work as a human being. We're writing software to automate that stuff. That software runs on resources.

Operators are mainly concerned with resources, because that is actually the point of control that they have. End users only care about transactions. The end users and the operators are codependent in this way as well. If end users change their behavior too quickly, it can create resource issues for operators. Obviously, if operators end up doing something that damages the health of a resource, then end users suffer. End users, operators, transactions, resources, they're all codependent in this way. There's this relationship between them. End users and developers or operators are also totally codependent. I think that's a really interesting, and to me, anyway, a profound thing. End users tend to think about transactions, and operators tend to think about resources, because that's what they can control and what they deal with. They're actually totally different planes that intersect in the workload itself.

What's a DevOps Eng/SRE To Do?

What do we actually do? This sounds like a problem. It's necessary to be able to pivot between these two, across telemetry types, metrics and traces, in the aggregate via tags, and automatically. This is the way that you can take some scarcity on a resource, or a health issue with a resource, and pivot to understand how the transactions have changed. Or, you can go from a transaction being really slow or returning an error to figuring out which resource has something to do with that, because high latency is almost always going to come down to some resource that's under contention, whether it's your database, or a Kafka queue, or whatever. Then trying to understand why that's under contention is going to be a matter of figuring out how some workload has changed, like someone pushed new code that increases the number of database calls by 100x, and that's why our database got slow. It's an interesting thing. You're often pivoting between transactions being slow, finding a resource that's under contention, and then realizing that someone increased their load by 100x. Again, pivoting from transactions to resources and back again, or vice versa. This is really hard, because metrics and traces right now are integrated in the frontend, but they actually need to be integrated at the data layer to make this possible. It's a very difficult data engineering problem, because the data actually looks quite different. Metrics are typically represented as time-series statistics; traces are typically structured events, whether they're spans or logs. It's basically a bunch of structured event data, not statistical data. It's hard to pivot between them. The tags are the way that we do that.
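
Conceptually, the tag-based pivot is very simple; here is a small sketch, with made-up metric and tag names, that takes the tags on an unhealthy resource metric and uses them to select the matching slice of trace data.

```python
# A hedged sketch of the pivot: a resource health metric carries tags,
# and those same tags select the slice of transaction traces to inspect.
# All field and tag names here are illustrative.
anomalous_metric = {
    "name": "kafka.consumer.lag",
    "tags": {"kafka.cluster.id": "cluster-7", "region": "us-east-1"},
}

trace_rows = [
    {"trace_id": "t1", "kafka.cluster.id": "cluster-7", "region": "us-east-1", "customer_id": "1753"},
    {"trace_id": "t2", "kafka.cluster.id": "cluster-3", "region": "us-east-1", "customer_id": "0042"},
]

def pivot_to_traces(metric, rows):
    # Keep only the traces whose tags match the unhealthy resource's tags.
    tags = metric["tags"].items()
    return [r for r in rows if all(r.get(k) == v for k, v in tags)]

print(pivot_to_traces(anomalous_metric, trace_rows))  # -> only t1 matches
```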

Then, finally, I've been doing this stuff for 15 years or something like that, and I think about it all the time, but no one should have to be an expert in this stuff to use it. It needs to be somewhat intuitive and bring these sorts of insights to people in their everyday workflows. If we're not able to do that, then we basically haven't succeeded in solving codependency.

Service-Level Objectives (SLOs)

One helpful tool is SLOs. I wanted to briefly mention something about SLOs in the context of resources and transactions. SLOs are a hot topic. I would argue that SLOs are goals: goals about a set of transactions that are scoped to a set of resources. Resources and transactions are a really nice way of thinking about SLOs. With the term service-level objective, people often think "service" means microservice. Not really. If you're old like me, you picked up a telephone and got a dial tone, and the service level was that that happened 99.99% of the time, or whatever. It doesn't have to be a microservice. The service level is just talking about how reliable the transactions are, whether that means they don't error out very often, or that they're fast, or what have you. You want to examine that service level in the context of some set of resources. This is actually where the microservices come in. You could also have an SLO for some other thing, again, Kafka queues, databases, that sort of stuff. In this way, the SLOs can represent a contract between these two dualities: transactions and resources on one side, and operators and end users on the other. I think it's an elegant way to think about why SLOs are so vital and so important in bridging these two different worlds.
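
As a rough sketch of that framing, assuming transaction events tagged with the resources they crossed, an SLO check might look something like this; the numbers, field names, and thresholds are all illustrative.

```python
# A minimal sketch: an SLO as a goal ("99.9% of transactions fast and
# error-free") evaluated over the transactions scoped to one set of
# resources (one service, one region). Data below is made up.
OBJECTIVE = 0.999          # the goal for the service level
LATENCY_BUDGET_MS = 300    # what "fast enough" means for this SLO

events = [{"service": "bank-account", "region": "us-east-1",
           "latency_ms": 50, "error": False} for _ in range(998)]
events += [{"service": "bank-account", "region": "us-east-1",
            "latency_ms": 900, "error": False},
           {"service": "bank-account", "region": "us-east-1",
            "latency_ms": 40, "error": True}]

# Scope the SLO to a set of resources via their tags.
scoped = [e for e in events
          if e["service"] == "bank-account" and e["region"] == "us-east-1"]
good = sum(1 for e in scoped
           if not e["error"] and e["latency_ms"] <= LATENCY_BUDGET_MS)
sli = good / len(scoped)

error_budget = 1 - OBJECTIVE           # allowed fraction of bad transactions
burn = (1 - sli) / error_budget        # 1.0 means the budget is exactly used up
print(f"SLI={sli:.4f}, objective={OBJECTIVE}, budget burned={burn:.0%}")
```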

What Does This Look Like In Practice?

I know what I've said so far is quite theoretical. I decided I would show a worked example of a real incident in a production system in these terms. It does show some product screenshot stuff, but this is not meant to be a demo or something like that. It's just meant to help illustrate conceptually how this plays out in practice, specifically around a Kafka outage.

Here's a picture of a chart in the dashboard showing consumer lag for a Kafka queue, actually in Lightstep's own internal systems. It wasn't an outage that required declaring an incident or that had a customer-visible effect, but it was definitely something where people were scrambling to figure out what was going on. You can see that at 10:45, things were fine. Then they became not fine. A classic situation: you have Kafka, which is a distributed system in and of itself. It's a resource, and clearly something has gone wrong, because it's getting really slow to do anything as a consumer. You're having to wait a really long time to get any data out of the Kafka queue. What do you want to do? One option would be to try to group by and filter this in various ways. What we really want to be able to do is just say: what changed in this resource? This resource has transactions going through it, a lot of them actually right here, and also a transaction going through it right here. What changed about the transactions between this period and this period? That's what I actually want to know as an operator, because that's probably going to give me a clue: the code didn't change in Kafka, the workload changed. How? You should be able to click on this thing. Here, you can see the query.

You should be able to click on this change, and say, what caused this change? Then get to a UI where we say, ok, so here's that anomaly where the queue was unhappy. Here is a baseline where it was working normally. Again, we're just zooming in to see what these two periods look like. Then what we really want to do is to understand, statistically, what's different about the transactions here from the transactions there? We're trying to understand not just that Kafka was unhappy, but what's different about the workload here versus here. The beauty is that there are tags. The Kafka queue has a name. The hosts involved have names. The traces that go through the Kafka queue also are annotated with those tags. We can actually join from this resource over to the workload and actually answer that question. Going back to those giant SQL tables I was talking about, we can say, so let's just look at the traces that went through this specific Kafka queue, because that's a column in that giant table. Let's just look at the traces that went through that specific queue in these two different time windows and understand what other things are correlated with this regression.

What we see is that in this multi-tenant system with many different customers, there's a single customer whose product ID is 1753, that went from being 0.8% of all the traces, to almost 16% of all the traces. That's about a 20x increase between this baseline and the regression. That's actually really interesting. Some customer significantly changed their workload, and that's exactly the thing I'm looking for. The trouble is that customer tag is way up the stack. It's not even available in the Kafka queue. Only by pivoting from the resources to distributed transaction traces, are we able to understand automatically that there's a specific customer ID that's involved in this regression. We get more detail on this by expanding it to say, ok, so we're again increasing by about 20x. Then, if you want to, we can view sample traces to see specifically what that customer is doing.
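
A toy version of that baseline-versus-regression comparison might look like the following, assuming the trace-derived events have already been filtered to the affected Kafka queue; the data is made up to reproduce the 0.8% to roughly 16% shift.

```python
# A toy "what changed?" comparison: compare how often each customer tag
# appears in the baseline window versus the regression window. Here the
# 0.8% -> 16% shift for one customer is roughly a 20x increase.
from collections import Counter

def tag_share(events, tag):
    counts = Counter(e[tag] for e in events)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def compare(baseline, regression, tag="customer_id"):
    before, after = tag_share(baseline, tag), tag_share(regression, tag)
    for value in sorted(set(before) | set(after)):
        b, a = before.get(value, 0.0), after.get(value, 0.0)
        ratio = a / b if b else float("inf")
        print(f"{tag}={value}: {b:.1%} -> {a:.1%} ({ratio:.1f}x)")

# Illustrative data: one customer dominates the regression window.
baseline   = [{"customer_id": "1753"}] * 1 + [{"customer_id": "other"}] * 124
regression = [{"customer_id": "1753"}] * 20 + [{"customer_id": "other"}] * 105

compare(baseline, regression)
```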

What I'm trying to illustrate is that it's possible to go from this feeling to this feeling without writing any queries. You shouldn't need to be an expert to pivot from resources to transactions. Right now, it's actually very difficult to do this in general. I think that my vision for observability is that we stop talking about metrics, logs, and tracing as the point of observability. That's just the raw data. Instead, we allow operators to understand the health of their application in terms of SLOs, and to understand the health of their resources, just as they do today. Then, to pivot between those two things naturally, without having to write queries: to understand how the transaction workload affects the resources, and how resource health affects the transaction workload, without having to lift a finger or do any real work.

Summary

Transactions traverse systems and use resources. Users don't care about your resources. Similarly, DevOps can't do anything about individual transactions, at least not manually. They can only do things about their resources. We must be able to use systems and providers that join resources and transactions naturally, to address the most important question in observability, which is: what caused that change? Whether it's a change in my transactions, or a change in my resources. That's really where I think observability is headed, within this framework of resources and transactions.

Questions and Answers

Butow: You spoke about the cost of storing data. That's always a really tricky conversation, I think, for folks. What are your general tips there for having those conversations so that you can minimize the cost but maximize observability?

Sigelman: It's definitely a problem right now. I have so much to say about that topic. The first things that come to mind are, first of all, are we talking about transactions or resources? Which is to say, are we talking about things like traces and logs, or more like statistical time-series data like metrics? Because I think the conversation is different for those two categories of telemetry. What we've found, at least in my time at Google, and then with Lightstep as well, is that it's not just a binary thing. Do you keep data or do you not keep it? It's like, do you sample it to begin with? Do you get it off the host? Do you centralize it over the wide area network? How long do you store it? At what granularity do you store it? How does it degrade over time in terms of the precision? When you see organizations that get very sophisticated about this, they actually have different answers along that entire lifecycle for the telemetry. I think it's a really challenging question to answer because it depends on what granularity you're talking about.

The main thing I would say is that you need the capacity to profile your telemetry costs for that entire lifecycle, including the network, because just sending the data is actually one of the largest components of the lifetime cost. It's like any optimization problem: you have to start there. I think, frankly, a lot of organizations aren't able to profile it. They can't say which part of their application is causing the most long-term telemetry cost, and until you can do that, there's no way to optimize it. I'd start there. Then for folks that have done that, I think the main thing is being able to control the cost of an individual developer adding a line of code that adds too much cardinality to a metric. That can cost hundreds of thousands of dollars a year if you're doing it wrong. Being able to correct for that in a central place, I think, is the next thing to focus on. Make sure that an individual instrumentation line can't balloon into some unbounded cost for the platform team.
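
As a back-of-the-envelope sketch of that cardinality problem, with made-up label counts and a placeholder per-series price, the number of time series is the product of every label's cardinality, so one high-cardinality label dominates the cost.

```python
# A hedged illustration: label cardinalities and the per-series price
# below are invented for the example, not real billing figures.
from math import prod

labels = {
    "endpoint": 50,
    "status_code": 10,
    "region": 5,
    "customer_id": 10_000,  # the innocent-looking high-cardinality label
}

series = prod(labels.values())      # one stored time series per combination
COST_PER_SERIES_PER_YEAR = 0.05     # placeholder price, not a real quote
print(f"{series:,} series, ~${series * COST_PER_SERIES_PER_YEAR:,.0f}/year")
# Dropping customer_id alone reduces both numbers by a factor of 10,000.
```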

Butow: I really liked your example at the end, where you went through an incident that had happened and shared it on the screen. Ben brought up Lightstep, so you could actually see how it works. I've used it myself as well. I thought it was really amazing how quickly you could get down to a specific customer doing a certain action and causing an incident. I know for myself, this has happened so many times over my career working on production systems, and being able to get down to that really small, granular level super-fast is very effective.

What was the web interface that was able to output the important differences given two time intervals? Did you want to talk a little bit about that? Your thoughts around that, like how different interfaces can actually help people do a better job? It's a big thing.

Sigelman: This is an editorial talk. I hesitated to show that, but I don't know of any other way to show it. I feel like it's too abstract if I don't show something. To answer the literal question, that was Lightstep's product. There's no reason why it couldn't be done somewhere else, if you can do the integration at the data layer. I think the thing that makes it difficult in practice is that the integration between the resource metrics data and the transaction tracing data has to be done at the data layer. You can't just do it via hyperlinks. What I described as a join is actually a join. You have to be able to join from tags in the metrics over to tags on groups of thousands of traces. To do that requires some data engineering at the platform level. I'd love, actually, to see more solutions in the open source world go in that direction. The shortest path for incident resolution, for sure, is being able to pivot via tags between time-series data and transaction data. It's actually a difficult problem if you have to start with a siloed metrics solution and a siloed tracing solution and try to do that from the web UI, because the integration has to be at the data layer. That's what makes it so tricky from a product standpoint.

Butow: I've seen some platforms be created internally at places I've worked, but it takes many years and a lot of iteration to get it to work right, so not at all easy. The interesting thing too, when you do have access to systems like these, say, for example, you have one customer that suddenly changed their pattern, but maybe it's great. Maybe they've onboarded all these new users and a whole bunch of amazing work is happening. It means you can then actually say to other teams, maybe the business side of the company, "Look, this customer is doing really great work. You should probably know about that. Did you know that we're even doing that?" It enables engineering teams to have better conversations with really interesting and insightful data, which I think is good, too. What are the recommended tools to get tracing? Is it always a service mesh or something similar, or is it better to do it individually inside of the system?

Sigelman: Yes. I actually did a talk, I think it was 2017, at KubeCon about service mesh and tracing. Service mesh is helpful, but it absolutely does not solve the problem. All that service mesh really does is give you telemetry about the calls between services. The hardest part of tracing has always been successfully propagating the trace ID context from ingress in a service, through the function calls, out to the egress from that service. The service mesh has nothing to do with that. The service mesh only handles the calls between services, and the hardest part of tracing is within the services, so it doesn't really help. You end up with a bunch of point-to-point data with a service mesh, but to actually successfully address the context propagation problem, the only advice I have would be to move towards OpenTelemetry, which actually is trying to address that problem in a pretty comprehensive way and make context propagation a built-in feature. OpenTelemetry will also integrate with service mesh. Part of the pitch for service mesh is that it solves the tracing problem, and it's just not true. It adds data for tracing, but it doesn't solve the context propagation problem that's actually at the core of distributed tracing.
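
To make the context propagation point concrete, here is a minimal sketch using the OpenTelemetry Python SDK; the service and span names, and the incoming traceparent value, are just illustrative.

```python
# A hedged sketch of in-process context propagation with OpenTelemetry:
# extract the caller's trace context at ingress, keep it through local
# work, and inject it into outbound headers at egress.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("bank-account")

def call_downstream() -> dict:
    outgoing_headers: dict = {}
    with tracer.start_as_current_span("egress-call"):
        inject(outgoing_headers)  # copies the current trace context into headers
    return outgoing_headers

def handle_ingress(incoming_headers: dict) -> dict:
    ctx = extract(incoming_headers)  # re-attach the caller's W3C traceparent
    with tracer.start_as_current_span("handle-request", context=ctx):
        return call_downstream()

# Example traceparent value taken from the W3C Trace Context spec examples.
print(handle_ingress(
    {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}))
```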

Butow: There's another question around OpenTelemetry. What are your thoughts there?

Sigelman: I'm extremely biased in that I'm on the governing committee and I co-created the project. It does seem, objectively, to be a very successful project from the standpoint of having many contributors. I think we had 1,000 contributors in the last month alone. Every major vendor and cloud provider has bought in and is staffing the project, and so on. The only issue with OpenTelemetry is that it has so much activity that we're having some difficulty just literally maintaining the project. I think the reason it's been so successful is that it benefits many parties. It's a win-win for major infrastructure providers, cloud providers, observability vendors, and especially for end users, because you can end up getting high quality telemetry without binding yourself to any particular vendor or provider. That portability is a very attractive thing. I think observability solutions have long been constrained by the quality of the telemetry they can collect. OpenTelemetry really is going to raise that tide, and then you'll see solutions improve as a result. OpenTelemetry, yes, it's a very exciting project. I think the only thing we really need is to be able to say no a little bit within the project, so that we can hit our milestones. It's been a victim of its own success, to a certain extent. It definitely has a bright future.

Butow: Do you have a few things to share for the roadmap of OpenTelemetry, just over this year, a few things you're excited about?

Sigelman: The three pillars of traces, metrics, and logs don't really make any sense for observability, and I'll stand by that. They definitely make sense for telemetry, though, so those are the three aspects. We started with tracing; that's basically GA at this point. Metrics is coming up soon. Then logging will come later. I think it's actually the least important of the three for OpenTelemetry to address from a standards and API integration standpoint. I'm very excited for metrics to get past that point. We're also doing a lot of work with the Prometheus community to make sure that there's interoperation between Prometheus and OpenTelemetry, so that you won't be forced to choose. I think it's also going to be really good to see that stabilize this summer.

Butow: There was another common question around the performance cost associated with distributed tracing. What are your thoughts there?

Sigelman: Distributed tracing does not need to have any overhead from a latency standpoint. There is some minimal marginal throughput overhead in the sense that it does take resources. Usually, people can adopt some sampling to address that. The Dapper paper that I helped to write at Google describes our performance measurements in detail. The overhead was imperceptible in the sense that it was in the statistical noise, and Dapper runs 100% of the time for 100% of Google services, and has for 15 years. It's definitely not a high-overhead thing if it's done correctly. That's one of the beauties of it.
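
As a small illustration of keeping that overhead bounded, here is a sketch of head-based sampling with the OpenTelemetry Python SDK; the 1-in-1,000 ratio is just an example, and Dapper itself used its own sampling machinery rather than this API.

```python
# A hedged sketch: sample a fixed ratio of traces at the root so the
# per-request cost stays small; child spans follow the parent's decision.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(1 / 1000))  # keep ~0.1% of traces
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer("bank-account")
with tracer.start_as_current_span("handle-request") as span:
    # Unsampled spans are non-recording, so attribute calls are near-free.
    span.set_attribute("http.path", "/v1/bank-account/balance")
```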

 


 

Recorded at: Nov 12, 2021
