Martin Mao on Observability, Focusing on Alerting, Triage, & RCA

Jul 13, 2021

Podcast with

Martin Mao

Wesley Reisz

Observability is a crucial aspect of operating Microservices at scale today.

Today on the InfoQ podcast, Wes Reisz speaks with Chronosphere’s CEO Martin Mao about how he thinks about observability. Specifically, the two discuss Chronosphere’s strategy for implementing a successful observability program. Starting with alerting, Martin discusses how metrics (usually things like RED metrics or Google’s Four Golden Signals) are tools to aggregate counts and let operators know when things are moving towards an incident. In stage two of this approach, operators begin to isolate and triage what’s happening in an effort to provide a quick system restoration. Finally, Martin talks about root cause analysis (RCA) in the final stage as a way of preventing what happened from happening again. Martin uses this three stage approach (and the questions that should be asked in each of these stages) as a way of focusing on what’s important (or reducing things like Mean Time to Recovery) in a modern cloud native architecture.

Observability is the ability to understand the state of a system by observing its outputs. On today’s podcast, we talk about a strategy for implementing a meaningful observability program.

Key Takeaways

While observability is a new term in software, monitoring (or the idea of observing your systems) has been around for some time. What’s changed today is the velocity of how we ship software and the idea of ownership of the monitoring solution. Much of this can be found in modern DevOps.
The danger of focusing on logs, metrics, and tracing is that it often leads to a misconception that because you have these things, you will understand when things go wrong. The focus should first and foremost be on what we’re trying to achieve (or the ability to keep things like MTTR--Mean Time to Recovery--down).
A three-step process of focusing on the questions around alerting, triage, and then RCA (Root Cause Analysis) allows teams to focus on system stability and not on the data type itself. When the focus is on the questions (and not on data types), it's an outcome-driven approach as opposed to a data or a data size driven approach.
A common anti-pattern in the area of observability is focusing so much on individual service SLAs that overall business SLAs aren't met. While having SLAs is good, it’s not the whole story.

Introduction [00:21]

Wes Reisz: There's a lot of definitions of observability out there, but they usually go something like this. Observability is the ability to understand the internal state of a system by its external outputs. Too often though, when we talk about observability, some of the very first things we talk about are the three pillars of observability: logs, metrics and tracing. Logs, metrics, and tracing (and we often hear events thrown in that group as well) are super important underpinnings of the discussion, but starting the conversation there is often a bit of an injustice because it has us focus, maybe, not on the right thing.

Today on the podcast, we're going to be talking to Martin Mao, the CEO of Chronosphere, about how he thinks about observability and some of the strategies that they're using around talking about observability and how we might solve for what's important at different phases of an application life cycle. Hi, my name is Wes Reisz, and I'm one of the co-hosts of the podcast, and chair for the QCon Plus Software Conference that's coming up this November. We're in the early planning stages of QCon Plus. Stay tuned.

You'll hear things about tracks taking shape around architecture, edge microservices, developer practices, observability, and things like AI and ML. The conference takes place in early November. Stay tuned for more information. As I mentioned, today on the podcast we're talking to Martin Mao. Martin is the CEO of Chronosphere. We'll talk a bit about Chronosphere in just a second. He's also one of the co-creators of M3. Prior to Chronosphere, he worked on M3 and Jaeger at Uber for about four years. As always, thank you for joining us on your jogs, walks and commutes. Martin, thank you for joining us on the podcast.

Martin Mao: Thanks for having me, Wes. Looking forward to the conversation today.

What is the focus and purpose of Chronosphere? [01:52]

Wes Reisz: When I think of Chronosphere, M3 is the first thing that comes to mind for me. Chronosphere is about much more than just M3, right? Do you mind taking a second and talking a bit about the mission and purpose of Chronosphere?

Martin Mao: Yes. 100%. If you look at the M3 project (the open source project there) and what we created at Uber, it's really a metric storage engine that is critical for a monitoring system or perhaps an observability system. But it's only one piece of the puzzle, right? It's the backend storage and engine of such a system. Generally, when a company looks at solving a problem that is one piece of the puzzle, but they need more than that. They also need the visualization and display of the data. They need alerting and notifications, and more things in that as well, some of which you mentioned earlier, like traces and logs.

What we're trying to do here at Chronosphere is to provide everybody with an end-to-end hosted monitoring solution, not just a storage engine, but an end-to-end solution to solve people's monitoring and observability problems. In particular for us, we're targeted at cloud native environments because we think that the problem has fundamentally changed so much in this new architecture that you need a different set of solutions to go and tackle it. But that's, in a "TL;DR" form, what we do here at Chronosphere.

Are the Three Pillars of Observability the best place to start a conversation about Observability? [03:05]

Wes Reisz: I mentioned in the introduction that when we talk about observability, a lot of times the conversation immediately kicks off about these Three Pillars of Observability (logs, metrics and tracing). Is that the right place to start a conversation about observability?

Martin Mao: I think if we take a step back a little bit, observability is a new term for sure, but the concept of monitoring has been around for decades already. It's the goals that you mentioned earlier, to understand a particular system, whether that's an infrastructure or an application, and measure its outputs and understand it. Really what you're trying to do when you're doing that is really trying to understand what the impact is to the users of that system, whether that be internal teams or end customers for a particular business.

I think from that perspective, it hasn't really changed. It's been termed monitoring or perhaps application performance monitoring or infrastructure monitoring in the past, and it's largely the same now. The term is observability and perhaps some of the data types have changed, but it's largely the same. I think what has really changed recently is two-fold. One is when you look at modern businesses and how we do development, that's changed fairly fundamentally in the sense that we are in a mode now where we are shipping updates to our customers and to other internal teams a lot quicker than we were before.

Related to that, a lot of design and infrastructure architecture has changed because of that, so that we can respond much quicker to the business need. I think that's caused or introduced a need for change there. The second one is, if you look at who was in charge of monitoring before, historically it's been an SRE team or an infrastructure team that's really responsible for monitoring, and you really depend on the tools like your APM tools to go in and collect and display all this data for you. But really what we've been seeing more recently with the DevOps movement is that developers themselves own this end to end, right?

They own the whole life cycle of the application from development to testing, to CI/CD, rolling it out multiple times a day, to observing and monitoring how it's reacting in production. Then anytime something goes bad, they own the consequences of that and resolving that. I think the ownership of this has changed over time, and the use case and the business requirements have changed a little bit. But really we're fundamentally trying to do the same thing as historically monitoring has been doing.

Tying this back around, I think because of that, the goals largely remain the same, and it's not about having logs, metrics or traces. The ultimate goal of this is to observe and understand the system and know when something has changed in a negative way and fix that and ideally fix it before there's an impact to end customer of the particular system. Finally here, I'd say that the issue perhaps with focusing on just the data types of logs, metrics, and traces is, what we find with a lot of companies out there is that they think having all three means you can achieve that goal of identifying and remediating issues really quickly, and that's often not the case.

Just because an application emits all three doesn't actually mean you have a great grasp on observing when things go wrong. We also find that producing more data doesn't directly lead to better outcomes. For example, if you emit 10 times the amount of logs or metrics as you did before, it doesn't mean you have a 10 times better mean time to resolution, or MTTR. There's definitely a little bit of a mismatch in terms of the amount of data being produced and really the return on that or the outcome of that. From my perspective, at least, there is a little bit of disconnect in focusing on the data types. It doesn't quite make sense when you look at the broader picture of why we're doing this stuff and really what the outcome of this is.

Wes Reisz: Yes. It can almost make it harder to get to what the actual problem is when you've got so much data out there that you've got to get through. I want to go back for a second. Right at the very beginning in your response, you talked a bit about APM and observability. Is observability different than APM?

Martin Mao: Yes. I'd say there are definitely some changes there. I mentioned the goals are roughly the same, but it's definitely some big differences. I'd say the first of which is if we look at the modern architecture, generally we don't run monolithic applications on VMs anymore. We're running microservices on containers. When you looked at tracing in the APM days, it was mostly tracing through a single process, whereas these days when we talk about tracing, it's more about distributed tracing and requests coming across multiple microservices. Same thing at the infrastructure level, we're looking at VMs before, but now we're looking at containers or perhaps serverless technology there.

I think the landscape in which we're solving the problem in the environment has changed so much that you do need a little bit of a different solution there. The other thing about observability is what we find is that a lot of the APM solutions were very out of the box and actually a little bit opaque in this perspective, in the sense that it is doing instrumentation for your application automatically, whereas what we see with observability these days is the control of the instrumentation is in the hands of the developers themselves a lot more.

As an end developer, you have a lot more control in your instrumentation and what you want to measure versus what you had in the APM days. I would also say that with observability, there's a lot more open source standards these days as well to ensure that that instrumentation can be consumed by any sort of tooling that's built on top of as well. There are, I would say, a decent amount of differences between APM and observability, but the same high level outcome driven goals for both projects.

How do you like to see customers framing the conversation about observability? [08:20]

Wes Reisz: It makes sense. With great power, comes great responsibility, right? You've got to be able to make sure you can use the data that's being generated. It's not just given to you necessarily. When you start to talk to customers, when you start to talk about observability, how do you talk to people about framing a structure in their mind? How do you talk to them about getting their minds around all this different data that's coming in and structuring it? I guess, a mental map, help them attack this problem of observability and the issues that they can use the data to solve.

Martin Mao: Yes. Our framework for thinking about this, and to your point, to get them away from thinking only about how much data gets produced, is to use a framework where we're focused on the outcome. If you look at it from an end-user perspective, which is the developer themselves, the ultimate goal for a developer is to be notified of and remediate an issue in their applications as quickly as possible, right? And ideally remediate it before a customer finds out. For us, if that's the ultimate goal and we're optimizing for that, it really actually comes down to answering three questions. The first of which is, can I even get notified when something is wrong?

Because if I can't, as a developer, I'm not even going to start to pay attention to the problem. I don't know anything's wrong at all and I can't even get started. That's the very first step of this. The second one is when something does go wrong, can I triage it quickly and know what the impact is? If I get woken up in the middle of the night at 2:00 AM, do I need to actually wake up and get out of bed and fix this right now, or can this wait until the morning to do? What is the impact? How many users are impacted? Et cetera. The last one is, can I figure out the underlying root cause of the issue to fix the problem?

What we have found is that answering these three questions really leads you to the outcome of being notified and ultimately remediating the issue for your customers. What we find is that these are definitely sequential steps, right? You have to get notified first, then you have to do your triaging, and then you go and figure out the root cause. The framework is sequential like this, but you don't actually need to go through all three steps. Really the goal is remediation, and if you can remediate as quickly as possible, that's the ultimate goal. If you don't need to actually run through all three steps, that's actually the ideal case. Right?

Let me give you an example here. Generally for a system, most issues are introduced when you're introducing change. When you're rolling out a new version of your service or application, or you're changing a configuration, generally that's when something breaks. Because if you just have a system that is in a stable state, normally it continues to operate that way. You can imagine, you're doing a rollout of a new version, and if you can get notified instantly that something has gone wrong, the first thing you would do is roll that back, right? Without knowing what the impact is, without figuring out what caused it, the first thing you would want to do is just get the system back to a healthy state, right?

In that example, you've remediated the issue without having to triage it, without having to understand it, and then you can take the time, when the system is in a healthy state, to figure out what was wrong with that change that caused this issue. Sometimes you can't do that. Perhaps, sometimes you do need to go to the second step. An example here could be you get notified that something's wrong, you haven't introduced a change, perhaps it's due to an issue in your infrastructure tier, and you triage it and you realize, hey, the issue is isolated to a particular cluster or the issue is only impacting a particular region or availability zone.

Again, in this particular scenario, you don't actually need to root cause the issue. You could easily do something like reroute your traffic away from that cluster, or away from that availability zone, and again, get the system back to a state where it is functioning again for your customers. Then the third step, which is actually finding out the root cause, is something that is ideally done not in the heat of an incident, not when your customers are impacted. These things are already pretty difficult to do, so you can imagine the added pressure of having to figure this out while all of your customers are impacted or while all of the other engineering teams are breathing down your neck to fix this.

It's actually not the best time to do it, but sometimes you do have to do it. If you do, you want the right tooling to be able to do that in production as well. Our framework, to sum it up I guess, is to answer these three questions, to get to an outcome of remediation as quickly as possible, going through them one at a time, but at any step of these questions really trying to get to that remediation point as quickly as possible.
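The decision flow described above can be sketched as a small function. This is only an illustrative sketch: the function name, its arguments, and the returned action strings are hypothetical and are not part of any Chronosphere product or API.

```python
def choose_remediation(alert_firing, recent_change=None, isolated_to=None):
    """Pick the quickest remediation that the available information supports.

    All argument names and returned action strings are hypothetical.
    """
    if not alert_firing:
        return "no action"
    if recent_change is not None:
        # Question 1 (notification) is enough: undo the change first,
        # investigate once the system is healthy again.
        return f"roll back {recent_change}"
    if isolated_to is not None:
        # Question 2 (triage) narrowed the impact: steer traffic around it.
        return f"route traffic away from {isolated_to}"
    # Neither shortcut applies, so question 3 (root cause) must be answered live.
    return "start root cause analysis"
```

For example, `choose_remediation(True, recent_change="checkout v42")` returns a rollback action without any triage or root cause step, which mirrors the rollout example in the conversation.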

Wes Reisz: As you were talking through it, the thing that kept coming to my mind was time. The thing that helps you figure out what you need to use here is time, right? A short, quick alert, and then when you have more time to be able to get to root cause? That's all the way out the other end of the spectrum, root cause. Is that accurate?

Martin Mao: 100%, that is accurate. If you think about why we want to observe our systems is because we want to reduce the negative impact to the business and to the end users, right? If we can shorten the time of that, that generally directly correlates to the impact of the particular issue. Optimizing for reducing the time to remediation, 100% is a main point of this framework.

Wes Reisz: Now, is this the right time now to come back and start talking about logs, metrics, tracing, events, spans, things like that? Once we know what our focus is, once we know our interval to be able to respond to things, now do we need to start using these things? Is this when we start bringing up these conversations?

Martin Mao: It definitely is. There's a reason why observability is directly related to logs, metrics, and traces because just having the outcomes is good and knowing what you're optimizing for is good, but you can't do that without the underlying data itself. Definitely logs, metrics, and traces play a big part in answering these questions. In fact, it is the underlying data to answer all of these questions we want to optimize for. It's just that when the focus is on the questions and the outcome, and not on the data types, it's an outcome-driven approach as opposed to a data or a data size driven approach.

How do you begin to apply logs, metrics, and tracing in the context of these three phases you spoke about? [14:07]

Wes Reisz: I watched a video of you. You were talking a bit about the life cycle of an application when you first deploy it and go through things. You can use logs, metrics and tracing at different points throughout this life cycle. Can you talk a little bit about what data structure, I guess, is best at different parts of these phases that you've been talking about?

Martin Mao: If you look at the three phases, so the first one is notification. When you think about notification, generally, it's done through alerting. If you think about alerting, generally, the underlying data type that's best for alerting is metric data because metric data is an aggregate view of each of your perhaps individual requests. If you think about what is the best underlying, most optimal underlying data type for getting notified as quickly as possible, it would be the metric data type. That doesn't necessarily mean you have to go and have metrics instrumented in your application.

A lot of applications, especially the legacy ones, may not have metric data instrumented. They may only have logs, which is perfectly fine. But if you want to do real time alerting off of that, you can imagine that the step that would be to consume the log messages, convert them into metrics, because you're generally trying to count how many requests, how many errors, what the latencies are, you don't necessarily need the details of the individual log message itself to do the alerting. You want to know the aggregate data. You're really generating metric data out of your logs perhaps if you don't have the metrics instrumented and using that to do this first step, which is getting notified.
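As a minimal sketch of turning logs into metrics for alerting, the snippet below counts requests and errors and averages durations per service. The log line format, the field order, and the rule that status codes of 500 and above count as errors are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical access-log format, assumed for illustration only:
#   "<ISO timestamp> <service> <status code> <duration in ms>"
LOG_LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\d{3})\s+(\d+(?:\.\d+)?)$")


def logs_to_red_metrics(lines):
    """Aggregate raw log lines into per-service request, error, and duration figures."""
    requests = Counter()
    errors = Counter()
    duration_ms = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        _timestamp, service, status, millis = match.groups()
        requests[service] += 1
        duration_ms[service] += float(millis)
        if int(status) >= 500:  # assumption: 5xx responses are errors
            errors[service] += 1
    return {
        service: {
            "requests": requests[service],
            "errors": errors[service],
            "mean_duration_ms": duration_ms[service] / requests[service],
        }
        for service in requests
    }


sample = [
    "2021-07-13T02:00:00Z checkout 200 120",
    "2021-07-13T02:00:01Z checkout 500 900",
    "2021-07-13T02:00:02Z search 200 40",
    "not a log line",
]
metrics = logs_to_red_metrics(sample)
```

The individual messages are discarded after counting, which reflects the point that alerting needs the aggregate view rather than the details of each log line.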

Wes Reisz: You mentioned Rates/Errors/Duration, like RED method, or Google's Four Golden Signals. Are those the types of things that you're recommending here?

Martin Mao: Those are definitely the things to be alerted on, because if you think about alerting, it's going to be pretty hard for us to create alerts for a very particular edge case or combination of scenarios together. What you're really trying to do, and if you look at an application or service, the leading indicators are those Four Golden Signals or perhaps the RED Metrics that you're talking about, which is just how many requests come in, what the error rates are, and what the latency or duration of the requests are.
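A rough sketch of alerting on these leading indicators might look like the following. The `RedSnapshot` type, the alert names, and the default thresholds are hypothetical and only illustrate the idea; real thresholds depend on the service.

```python
from dataclasses import dataclass


@dataclass
class RedSnapshot:
    """RED metrics for one service over one evaluation window (hypothetical type)."""
    requests: int
    errors: int
    p99_latency_ms: float


def evaluate_alerts(snapshot, max_error_rate=0.01, max_p99_ms=500.0):
    """Return the names of the alerts that fire for this window.

    The thresholds are illustrative defaults, not recommendations.
    """
    if snapshot.requests == 0:
        return ["no_traffic"]  # the rate signal itself has dropped to zero
    alerts = []
    if snapshot.errors / snapshot.requests > max_error_rate:
        alerts.append("high_error_rate")
    if snapshot.p99_latency_ms > max_p99_ms:
        alerts.append("high_latency")
    return alerts
```

Three generic checks cover the service without anyone having to predict specific edge cases in advance.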

Wes Reisz: Okay, then off to triage?

Martin Mao: When you look at triage, you've got the aggregate details that you can use to get the notification. Then when you think about triage, you really want to dig into that data a little bit more, right? You know your P99 latency is high. Now, in the triage phase, you want to dig into that a little bit more. You want to start looking into, okay, can I break this P99 down between cluster A versus cluster B, AZ A versus B versus C, or region one versus two, or perhaps a subset of my customers, like it's only impacting North America and not Europe or something like that?

Triage is really about digging in one level deeper and understanding more about where the issue is really coming from. Is it across the board for all of these requests, or is it just a subset of them? Generally, there's perhaps a couple of data types you can use here. Again, metrics are actually pretty good at this because at this level, you're still not trying to look at each individual request. You don't care about a particular customer. You're still not caring about one particular request.

You are still trying to care about groups and look for particular patterns. Because of that, metrics are still great for this, and you can imagine with metrics you would add dimensionality, or add additional labels or tags to your metrics, so that you can use them to differentiate group A versus group B. Metrics, what we find, are certainly still a really good data type for this, but you can start to use logs here as well if your logs are tagged appropriately, in the same way.
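Breaking a P99 down by label can be sketched as follows. The sample structure, the label names, and the nearest-rank percentile method are assumptions for illustration; real metric systems typically compute percentiles from histograms rather than raw samples.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_label(samples, label):
    """Group latency samples by one label and compute the P99 of each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["labels"][label]].append(sample["latency_ms"])
    return {value: percentile(latencies, 99) for value, latencies in groups.items()}


# Hypothetical data: cluster "b" has a handful of very slow requests.
samples = (
    [{"labels": {"cluster": "a", "region": "us"}, "latency_ms": 100}] * 50
    + [{"labels": {"cluster": "b", "region": "us"}, "latency_ms": 100}] * 45
    + [{"labels": {"cluster": "b", "region": "us"}, "latency_ms": 2000}] * 5
)
breakdown = p99_by_label(samples, "cluster")
```

Here the breakdown by `cluster` separates the healthy group from the slow one, while a breakdown by `region` would not, since every sample shares the same region. That is the kind of narrowing the triage phase is after.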

Wes Reisz: I was going to say things like high cardinality so that you can ask questions. You don't quite necessarily know what questions you're going to need to be able to answer, so you need that high cardinality here to be able to get to the problem.

Martin Mao: Yes, 100%. When we think about high cardinality, in my mind at least, it's two categories. Think about some of these aggregate ways you want to group the data, like region one versus region two, or availability zone one versus two versus three. There's not an infinite amount of values there, right? It is a limited set. It does increase the cardinality. You can imagine if you put three different label values on it, you're tripling the amount of metric data, but it's still an aggregate view.
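The arithmetic behind this can be sketched in a few lines. The label names and values are made up, and the product is a worst-case bound that assumes every combination of label values actually occurs.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series for one metric: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


base = {"service": ["checkout"]}
with_region = {**base, "region": ["us-east", "us-west", "eu-west"]}
with_zone = {**with_region, "zone": ["a", "b", "c"]}
# A per-request label is effectively unbounded; one million IDs stand in for that here.
with_request_id = {**with_zone, "request_id": range(1_000_000)}
```

Adding the three-value `region` label triples the series count, and adding `zone` triples it again, but both stay bounded. A per-request identifier multiplies the count by the number of requests, which is why that level of detail is usually better served by logs and traces.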

Martin Mao: In my mind and what we have with M3 is a technology that can support this level of cardinality. When you get to more about what logs and traces are about, they're really looking at the individual request. They want to know the exact customer ID or the exact request ID, and you can imagine that really blows up cardinality, right? You're going to have an almost infinite amount of different dimensions there. When you need to get to that level, and generally, you need to get to that level when you're looking at the final phase, which is root cause analysis, right?

Because if triaging isn't enough to tell you what the issue is, then at that point, you want to go even one level deeper and actually look at a particular individual customer or a particular individual request, and then it's really high cardinality, and that's when logs, and in particular distributed traces, come in really handy because they're better data structures for handling unique data points at the individual request level, whereas metrics are great for groupings of sets of data.
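To illustrate why a trace suits this level of detail, the sketch below takes the spans of one request and finds the span with the most self time (its own duration minus the time spent in its direct children). The span fields are hypothetical, loosely modeled on common tracing data, and the calculation assumes child spans do not overlap in time.

```python
from collections import defaultdict


def slowest_span_by_self_time(spans):
    """Return the span that spent the most time in its own work.

    Assumes the direct children of a span run one after another, not concurrently.
    """
    child_time = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_time[span["parent_id"]] += span["end_ms"] - span["start_ms"]

    def self_time(span):
        return (span["end_ms"] - span["start_ms"]) - child_time[span["span_id"]]

    return max(spans, key=self_time)


# Hypothetical trace for a single slow checkout request.
trace = [
    {"span_id": "1", "parent_id": None, "service": "gateway", "start_ms": 0, "end_ms": 1000},
    {"span_id": "2", "parent_id": "1", "service": "checkout", "start_ms": 10, "end_ms": 990},
    {"span_id": "3", "parent_id": "2", "service": "payment", "start_ms": 20, "end_ms": 900},
    {"span_id": "4", "parent_id": "2", "service": "inventory", "start_ms": 900, "end_ms": 950},
]
culprit = slowest_span_by_self_time(trace)
```

In this made-up trace, the gateway and checkout spans are long only because they wait on the payment span, which is where the time is actually spent.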

When you talk about these three phases (alerting, triage, and RCA) can you help me understand what good looks like? [18:56]

Wes Reisz: Yes, that's great. Makes sense. Makes sense. I wanted to talk a little bit about what good looks like. We were having a conversation before that a lot of times when I'm talking to people on their incident, I like to know what the business driver is, about is something being impacted? You made some comments about that's great, but there's some concerns that because of our environment today, we really need to be able to look at the individual service and be able to get things at a more finite scope. Can you talk a little bit about what good looks like and the impact, I guess, of business versus individual service drivers?

Martin Mao: Yes. If you look at the business, those are definitely the most important drivers, right? If you're optimizing an application and it doesn't actually move the bottom line for the business, it's almost like what is the point of doing that? Right? Those are definitely important, and these are generally high-level goals, either revenue or number of checkouts or other very high-level business goals that you're measuring, and a business needs to have an accurate measure of that and know when that's impacted.

I think what ends up happening is if you think about modern architecture, an individual request through one of those flows that results in that top level business metric isn't fulfilled by a single application or service these days. There's tens or hundreds of them, each of which is fulfilling a tiny part of that request. In this modern architecture, because of that, and if you're thinking about the ownership model, a developer only owns one or perhaps two of those services in that chain of 100.

I think what ends up happening is there's a little bit of a disconnect that the business is measuring what is important to the business, and that makes sense, and then each developer needs to first and foremost care about their application and their microservice, and that is perhaps serving one, but perhaps having multiple business use cases, but it's only one part of many steps. It's generally easy for a developer to focus only on, you can imagine the Golden Signals or the RED Metrics of their own individual microservice.

But then there becomes a little bit of a disconnect between, well, if my individual microservice can stand within a two-second SLA, does that mean I can fulfill and I have no impact on the business? That question is really hard to answer because it's not just that one microservice fulfilling that business use case. It's a combination of a lot of these. We've seen many different approaches to this, and I think it's an extremely hard problem to solve because it's also not the case, you can imagine, that it's just the sum of the parts. It's not like you can give each service a small sliver of the ultimate latency that you have to promise your customer, because each service is doing a different thing.

They will have variability in the performance requirements that they can give. There's intermittent network blips as the request travels between services. There's retries. There's a bunch of things that I think make it really hard to, from an individual service level, guarantee that the business use case can be fulfilled for sure. When we talked to companies about this, we actually generally find that there's even a lower level of best practices that isn't generally implemented that is almost a requirement before you even start thinking about filling that gap.

Some of this is everybody knows what the RED Metrics are. It's the request count error rate and the latencies. But even how each individual developer measures this stuff and thinks about this, what we find often in an organization is different. If that's different, you're measuring two different things in two different ways. That's the starting point you can even start to solve the problem with. What we find with the best practices there is generally we would recommend a central SRE team or perhaps an observability team to start introducing best practices for, hey, across the company, when we say latency, we mean either a percentile or a histogram calculation and over this particular time window, and that's how we're measuring it, right?

Are we counting the number of requests over that threshold over a one day period, or one hour period or one second period? How are we generating and coming to these values? The very first step is just even getting standardization there on how people think about these metrics and how they do the measuring of these metrics. Then having that central team go the next step further and perhaps automatically generating some of these for developers as well, right? It's actually a pretty nice property for a developer to roll up their application, and all of a sudden they can actually have these metrics generated for them already.
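A small sketch shows why the definition needs to be pinned down. The function below computes the fraction of requests slower than a threshold within a time window; its name, arguments, and the sample data are hypothetical.

```python
def slow_fraction(samples, threshold_ms, window_start_s, window_end_s):
    """Fraction of requests in [window_start_s, window_end_s) slower than threshold_ms.

    `samples` is a list of (timestamp_s, latency_ms) pairs.
    """
    in_window = [
        latency for ts, latency in samples if window_start_s <= ts < window_end_s
    ]
    if not in_window:
        return 0.0
    return sum(latency > threshold_ms for latency in in_window) / len(in_window)


# Hypothetical data: one request per second, the last ten seconds are slow.
samples = [(t, 3000.0 if t >= 90 else 100.0) for t in range(100)]

whole_period = slow_fraction(samples, 1000.0, 0, 100)
last_ten_seconds = slow_fraction(samples, 1000.0, 90, 100)
```

The same data yields 10% slow requests over the whole period and 100% over the last ten seconds, so two teams using different windows would report very different numbers for the "same" latency metric.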

I think in modern architecture, that's quite easy to achieve actually because most microservices talk to each other through a proxy or a reverse proxy, or they talk over some sort of standard RPC protocol, or perhaps there's a network mesh that they are discovering themselves and talking to themselves over. It's actually these RPC protocols and proxies that can actually generate a lot of these top line measurements for developers and for services themselves, and then the central team can standardize how we generate dashboards and alerts based off of this dataset.
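The idea of generating these measurements at a shared layer rather than inside each service can be approximated in application code with a wrapper. The decorator below is only an analogy for what a proxy or RPC layer might do; the `RED` store, the `instrumented` decorator, and the handler are all hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend, keyed by service name.
RED = defaultdict(lambda: {"requests": 0, "errors": 0, "duration_s": 0.0})


def instrumented(service):
    """Record request count, error count, and total duration around any handler."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            RED[service]["requests"] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                RED[service]["errors"] += 1
                raise  # the caller still sees the original failure
            finally:
                RED[service]["duration_s"] += time.perf_counter() - started
        return wrapper
    return decorator


@instrumented("checkout")
def handle_checkout(cart):
    """Hypothetical handler that contains no metrics code of its own."""
    if not cart:
        raise ValueError("empty cart")
    return {"status": "ok", "items": len(cart)}
```

Because the measurement lives in one place, every wrapped handler is measured the same way, which is the standardization property described above.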

What about bad? Are there any antipatterns to watch for? [23:40]

Wes Reisz: That's a good pattern that you talked about. Well, what about some anti-patterns that people may run into that you think need to be addressed?

Martin Mao: I think the anti-pattern here, and we see this often, is that there'll be some initiative or directive from the top that says, "Hey, every service needs to pick an SLA and stand within that, and that's what everybody has to optimize for," without thinking about how the sum of all of those parts impact the business. What often you'll find is that companies think that they're in a good shape because every service in their environment has an SLA that they're measuring against. Yet you can often find cases where the business objectives and SLAs are not met, even though each microservice is keeping their contract, right?

I think that having a very individual services-based view and setting high level requirements there without having the standardization that I mentioned earlier, and also without thinking about, well, there is a bit of a gap in between, and it isn't a problem where it's just the sum of the parts, is generally the pattern that I see a lot. I think it lures you into a false sense of security a little bit, in the sense that I have all of this and everybody is measuring against their SLAs, yet the business metrics are still impacted quite often because there isn't a direct correlation there.
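A back-of-the-envelope calculation illustrates the gap. It assumes, purely for illustration, that a request passes through every service in sequence and that failures are independent. As noted earlier in the conversation, real systems also have retries and network blips, so this is a simplification and not a model of any particular system.

```python
def end_to_end_success(per_service_success, hops):
    """Probability that one request succeeds at every hop, assuming independent failures."""
    return per_service_success ** hops


# Hypothetical numbers: each of 100 services meets a 99.9% success SLA.
chain = end_to_end_success(0.999, 100)
```

Under these assumptions, a chain of 100 services that each meet 99.9% succeeds end to end only about 90% of the time, so every team can be within its own SLA while the business-level objective is missed.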

What’s next for Chronosphere? [24:54]

Wes Reisz: Yes, totally makes sense. All right. We're coming up here on the end, Martin. What's next for Chronosphere?

Martin Mao: If you reflect on our conversation and our framework here, because our view of observability is this outcomes-based view, you can imagine what we're doing at Chronosphere is to provide the best product that can allow you to achieve these particular outcomes and answer these questions in the best way. We did start the product focusing on the first phase of this, which was all about notification. In fact, the first two phases are notification and triage, and we do that based on the metric data type. The next focus for us would be to move into the third phase, which again, hopefully, ideally a lot of folks can avoid if possible, and to provide the tooling to also do the root cause analysis in production, if required. So completing the phases and really providing customers with the product that best optimizes for the outcome is what we're focused on right now.

Wes Reisz: Very nice. Well, I appreciate you joining us on the InfoQ Podcast.

Martin Mao: Thanks so much, Wes. It was great chatting with you today.
