Solving Mysteries Faster with Observability

Summary

Elizabeth Carretto discusses observability at Netflix and where and how their internal tool, Edgar, comes into play.

Bio

Elizabeth Carretto is a Senior Software Engineer at Netflix in Productivity Engineering, where she builds UIs for the observability space. Her work focuses on delivering value from observability data to service operators through products like Edgar, a troubleshooting tool built on top of distributed tracing.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Carretto: One of my favorite genres of movies, TV shows, and podcasts is "True Crime." With "True Crime," there's a lot of hypothesizing about motivations, opportunity, and bigger questions about human nature, like why people do the things they do. Often, it's the mystery that hooks us, the unknown. In "Tiger King," for example, we were all wondering, did Carole Baskin kill her husband? At other times, it's the thrill of the chase. Like in "Mindhunter," where we get an inside view of tracking down a serial killer. In stories like that, you watch detectives looking at bits and pieces, trying to get to the bottom of it before time runs out. They take things like eyewitness reports, pieces of evidence left behind, or a behavioral profile to try to answer the question of whodunit, and to understand what really happened. This is the draw of mysteries, the desire for the unknown to become known.

When it comes to operating our services, mysteries hold a lot less appeal. Unfortunately, when something goes wrong, troubleshooting an issue can make you feel like a detective as you dig through log stores and peer into dashboards, struggling to reproduce an issue. The end goal is to understand what really happened and what factors led you into this issue. With all the tooling we have, from metrics to log stores to exceptions, you have a plethora of evidence to work with, but finding your issue and then resolving it is not always an easy feat.

Background

I'm Elizabeth. I'm a Senior Software Engineer in productivity engineering at Netflix. I like to think of myself as someone who solves mysteries, or who at least builds tools for people who do. Primarily, I work on a distributed troubleshooting platform called Edgar, where we try to aggregate a multitude of sources and provide analysis to help engineers and our other users solve problems faster.

Outline

I'd like to start by looking at troubleshooting and talking about what engineers often do to figure out issues. What tools they leverage, what does the process look like? What are the pain points? Next, I'd like to share how we tried to tackle this problem at Netflix, with Edgar. I'll address things that went well, as well as some of the challenges we had along the way. Lastly, I'd like to talk about how you might leverage some of these same concepts in your own organization to use your own data similarly.

A Degraded Experience

Often, it starts right here. You have an end user or end users who start to see a degraded experience. From here, once an issue is discovered, whether that's through customer service, or maybe through an alert, it eventually gets put into the hands of an engineering team. That engineering team would start to open up dashboards and dig through logs, doing a lot of manual footwork to uncover each step. All of this happens while watching the clock. Outages are expensive for all businesses. When your service isn't available or if your experience isn't what people have come to expect, you're potentially losing revenue or customer loyalty, neither of which we want.

At Netflix, our users should be watching their favorite TV shows, not staring at an error or a loading screen. When I think about troubleshooting, this is the first use case that I think about. Production issues, with live users being impacted. We don't only troubleshoot issues in production, there are plenty of things to troubleshoot before we ever get to this point. I'd argue that production troubleshooting takes up the smallest amount of our time. Similar problems plague us through all phases of development. Maybe you're seeing slow responses from a service, or a new integration isn't behaving as you expected, or there could be environmental differences between test and prod. These are all issues that you need to troubleshoot. Troubleshooting them before production is really the best case scenario. Debugging issues is a key part of our day-to-day lives as engineers. When it's our own service, it might be straightforward, but that's not always the case. The larger your system, the more you might have to dig.

Observability

Let's start by looking at the sources that we turn to when we look for clues. We often start with observability tooling. Logs, metrics, and traces are the three pillars of observability. Logs give a richly detailed view into an individual service. Logs give the service a chance to speak its own piece about what went right or what went wrong as it tried to execute its given task. Then metrics. Metrics indicate how the system, or subsets of the system like services, are performing at a macro scale. Are you seeing a high error rate somewhere, like in a particular service or region? Metrics give you a bird's eye view. Then we have traces. Traces follow individual requests through a system, illustrating the holistic ecosystem that our request passes through.

In addition to observability tooling, we also turn to metadata. By metadata, I mean supplemental data that helps us build context. For us at Netflix, this might be, what movie or what show was a user trying to watch? What type of device were they using? Or details about the build number, their account preferences, or even what country they're watching from. Metadata helps add more color to the picture that we're trying to draw.

At Netflix, with our microservices architecture, and with just one request hitting 10 services, there could be a log store and a dashboard per service, which might leave you with many log stores and many metrics dashboards. What if your request hit 20 services or 100? You could see how this would quickly become challenging. That's an inherent difficulty of debugging a larger system. Each microservice might be easy to understand and debug individually, but if all you know is an error occurred during this request, in one of these 20 microservices, searching for key evidence becomes like digging for a needle in a group of haystacks.

How Can Observability Teams Make Troubleshooting Easier?

For teams building observability tools, the question is, how do we make understanding a system's behavior fast and digestible, quick to parse, and easy to pinpoint where something went wrong? Even if you aren't deeply familiar with the inner workings and the intricacies of that system. We honed in on distributed tracing as the glue to tie it all together. While there might be many dashboards or log stores, there should only be one distributed tracing UI. Since a distributed trace captures the ecosystem of a request, it gives us a clear answer of which services were involved, allowing us to use the trace ID as a common thread to tie all these services together.

Request Tracing

Let's quickly level set on request tracing. Request tracing is the process of capturing incoming and outgoing network calls, as well as the internal activity of a service, details about the requests and responses, including latency, and then storing that data. A trace can tell you the path of a request, as well as the timing. You'll often see traces depicted as a waterfall graph like this. The makeup of a trace is something like this. To follow request activity through a system, a trace ID is generated at the onset, typically by the first service that the request hits. That trace ID is then passed along as a header value as the request proceeds, so that each subsequent service knows the ID of the request. As that request goes along, it generates spans. A span is a unit of work. It can represent a network call from one service to another, indicating a client-server relationship, or it could indicate a purely internal action like starting and finishing a method. Spans contain a set of key-value pairs called tags, which is where service owners can attach helpful values. Those could be URLs, version numbers, regions, corresponding IDs, errors, really anything that the service owner determines to be important to this set of data.
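The propagation just described can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the header name `X-Trace-Id`, the `Span` fields, and the helper functions are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch


@dataclass
class Span:
    """One unit of work: a network call or a purely internal action."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: Optional[float] = None
    tags: dict = field(default_factory=dict)


def ensure_trace_id(headers: dict) -> str:
    """Reuse the incoming trace ID, or mint one if this is the first service hit."""
    trace_id = headers.get(TRACE_HEADER)
    if trace_id is None:
        trace_id = uuid.uuid4().hex
        headers[TRACE_HEADER] = trace_id
    return trace_id


def start_span(headers: dict, name: str, parent_id: Optional[str] = None, **tags) -> Span:
    """Open a span that belongs to the trace carried in the request headers."""
    return Span(
        trace_id=ensure_trace_id(headers),
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent_id,
        name=name,
        start=time.time(),
        tags=dict(tags),
    )


def finish_span(span: Span) -> Span:
    """Record the end timestamp so the span's duration can be computed later."""
    span.end = time.time()
    return span


# The first service has no incoming trace ID, so it generates one.
headers = {}
edge = start_span(headers, "edge-gateway", region="us-east-1")

# The downstream service receives the same headers and therefore the same trace ID.
downstream_headers = dict(headers)
playback = start_span(downstream_headers, "playback-service",
                      parent_id=edge.span_id, status="200")
finish_span(playback)
finish_span(edge)
```

Both spans end up with the same `trace_id`, and the `parent_id` on the second span records which call produced it.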

Then, an individual trace is just made up of many spans, which are grouped together using a trace ID to form a single end-to-end umbrella representing each step that a request went through as it passed through the system. Traces with their underlying spans offer us a lot for visualizing as well. Spans relate to other spans through a parent-child relationship, which allows us to create a directed call graph depicting the trace. Spans have a start time and an end time. Thanks to these timestamps, a user can quickly see how long the operation took. We're getting a lot from trace data. We have timing. What services were hit? We can have rich tags with statuses and details, and so we're starting to draw a decent picture. What are we missing? Right now we don't have any detail from the logs or any additional context about what environment this trace happened in. Logs live elsewhere. If you want that level of detail, you still have to go find them. We're missing metadata. What movie was the user trying to watch? What device were they using? By using the trace ID as the thread to tie it all together, we can consolidate this information into one place.
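To make the parent-child relationship concrete, here is a small sketch that turns a flat list of spans into an indented, waterfall-style outline with durations. The span records, field names, and service names are hypothetical; a real tracing backend would supply its own schema.

```python
from collections import defaultdict

# Hypothetical flat span records for one trace, as a tracing store might return them.
spans = [
    {"span_id": "a", "parent_id": None, "service": "edge-gateway", "start_ms": 0, "end_ms": 120},
    {"span_id": "b", "parent_id": "a", "service": "playback", "start_ms": 10, "end_ms": 90},
    {"span_id": "c", "parent_id": "b", "service": "license", "start_ms": 20, "end_ms": 70},
    {"span_id": "d", "parent_id": "a", "service": "metadata", "start_ms": 95, "end_ms": 110},
]


def build_call_graph(spans):
    """Map each parent span ID to its child spans, ordered by start time.

    Root spans (no parent) are stored under the key None.
    """
    children = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        children[span["parent_id"]].append(span)
    return children


def render_waterfall(spans):
    """Return one line per span, indented by depth, with its duration."""
    children = build_call_graph(spans)
    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            duration = span["end_ms"] - span["start_ms"]
            lines.append(f"{'  ' * depth}{span['service']} ({duration} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


print("\n".join(render_waterfall(spans)))
# edge-gateway (120 ms)
#   playback (80 ms)
#     license (50 ms)
#   metadata (15 ms)
```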

Edgar - A Distributed Troubleshooting Platform

Edgar is a distributed troubleshooting platform for distributed systems, built on a foundation of request tracing, with additional context layered in by aggregating correlated logs and supplemental data. That aggregation is a big selling point of Edgar. Edgar is able to pull in logs and metadata to show a more complete picture of what's going on, and to cut down on the manual footwork necessary. We do more than aggregation. We partner with teams to provide analysis for common use cases, and then we summarize that data to help users identify root causes more quickly with less effort.

Let's look at how Edgar is used, and how we got to where we are today. When Edgar started, our tool was built for engineers, and specifically for streaming engineers. We built an experience around trace data and logs on the streaming side to represent an abstraction of a streaming session. There's a start event and a stop event. A license needs to be acquired. The right kind of content needs to be delivered based on the device and the user's preferences. We have log data and tracing that can represent all of those steps. We started to shape this data into something immediately consumable. We built a view of a playback session. This view presents summaries and analysis of the logs and the traces involved in the viewing session. By building out this abstraction, users are able to see what went wrong at a glance. With some supplemental configuration, we can even provide a team or a person's contact information for issues with a known resolution. This saves a lot of human time and the work of manually stepping through these logs and traces to try to assemble this view in your own mind.

Soon, the Edgar team realized that these abstractions could serve other user groups as well. By offering our tool to customer service operations to help them quickly understand member issues, we could actually take some weight off of our engineering teams, and we could help resolve member issues faster to get them back to enjoying Netflix. Building an abstraction around a Netflix streaming session allowed us to answer some of the most common questions that engineers and support had to address. We found that 20% of issues or issue types often led to around 80% of the support burden. By focusing on those frequently raised issues, we were able to get high leverage from the abstractions that we built. Using logs and trace data, Edgar could answer why a member didn't get a 4K stream on their device, or why a license wasn't able to be acquired for a particular asset. We found that Edgar could be expanded to more than just streaming engineers. It had huge value for customer service operations who had to answer these types of questions all the time.

In order to deliver value to engineers and our customer service, whenever they confront an issue, we had to make a commitment to have data about those issues. We made the decision to capture 100% of playback related data, and we built in some knobs to control how fine grained that data would be. If our users are troubleshooting a particular issue, they can turn on an increased level of tracing for a certain device and capture full headers and payloads for those requests and responses. For non-playback services, Edgar's users can turn on tracing on demand at a few different levels. They can say they want trace data only, or full headers and payloads if they're digging into something. This helped our tool become a reliable source of information that could provide greater detail in times of need.
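One way to picture these knobs is a default tracing level plus per-device overrides that can be switched on while someone is digging into an issue. The sketch below is purely illustrative: `TraceLevel`, `TracingControls`, and `record_request` are hypothetical names, and nothing here reflects how Edgar actually stores or applies its settings.

```python
from enum import IntEnum


class TraceLevel(IntEnum):
    """Hypothetical levels of detail, from nothing to full headers and payloads."""
    OFF = 0
    TRACE_ONLY = 1
    FULL_PAYLOADS = 2


class TracingControls:
    """A default level with per-device overrides that users can toggle on demand."""

    def __init__(self, default_level=TraceLevel.TRACE_ONLY):
        self.default_level = default_level
        self._overrides = {}

    def set_override(self, device_id, level):
        self._overrides[device_id] = level

    def clear_override(self, device_id):
        self._overrides.pop(device_id, None)

    def level_for(self, device_id):
        return self._overrides.get(device_id, self.default_level)


def record_request(controls, device_id, headers, payload):
    """Return what would be stored for this request at the device's current level."""
    level = controls.level_for(device_id)
    if level == TraceLevel.OFF:
        return None
    record = {"device_id": device_id, "level": level.name}
    if level >= TraceLevel.FULL_PAYLOADS:
        # Only keep the expensive details when someone has asked for them.
        record["headers"] = dict(headers)
        record["payload"] = payload
    return record


controls = TracingControls(default_level=TraceLevel.TRACE_ONLY)
controls.set_override("device-123", TraceLevel.FULL_PAYLOADS)
detailed = record_request(controls, "device-123", {"Accept": "*/*"}, '{"title": 42}')
regular = record_request(controls, "device-456", {"Accept": "*/*"}, '{"title": 42}')
```

A service that is traced only on demand would use `TraceLevel.OFF` as its default and rely entirely on overrides.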

Supporting Studio and Content Production

Over time, we found this was just the tip of the iceberg. As the studio side of Netflix grew, we found a whole new set of users. On this side of the business, it's a very different world. Instead of one streaming video service consumed by millions, we have many apps doing a wide variety of tasks with a much smaller set of users. Yet, all of these users are essential to the production of our amazing movies and shows. We have applications built for every step, from an initial pitch, all the way to playing on the Netflix service. Often, there are several tools for every step. The engineers building these applications found themselves troubleshooting just as much as our engineers on the streaming video side. With so many applications and teams building tools with so many uses, we needed to make our product more self-service. We built an admin page where we allowed users to add the details and the field mappings of their log stores, so that we could automatically correlate the application's logs with their traces without any additional work. That saved us legwork, and made it so that we on the Edgar team weren't the bottleneck.
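A field mapping of this kind can be sketched as a small per-application config that says where the trace ID, message, and level live in each team's log records. Everything below is an assumption for illustration: the application names, the config keys, and the record shapes are invented, and Edgar's actual admin page and schema are not described in this talk.

```python
# Hypothetical per-application mappings, as a team might enter them on an admin page.
LOG_STORE_CONFIGS = {
    "pitch-tracker": {"trace_id_field": "traceId", "message_field": "msg", "level_field": "severity"},
    "asset-manager": {"trace_id_field": "request.trace", "message_field": "message", "level_field": "level"},
}


def get_field(record, dotted_path):
    """Look up a possibly nested field such as 'request.trace'; None if absent."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def normalize(app, record):
    """Translate one application's log record into a common shape."""
    config = LOG_STORE_CONFIGS[app]
    return {
        "app": app,
        "trace_id": get_field(record, config["trace_id_field"]),
        "message": get_field(record, config["message_field"]),
        "level": get_field(record, config["level_field"]),
    }


def logs_for_trace(trace_id, raw_logs_by_app):
    """Collect the normalized log lines from every configured app for one trace."""
    matches = []
    for app, records in raw_logs_by_app.items():
        if app not in LOG_STORE_CONFIGS:
            continue  # no mapping registered, so we cannot correlate this app's logs
        for record in records:
            entry = normalize(app, record)
            if entry["trace_id"] == trace_id:
                matches.append(entry)
    return matches


raw_logs = {
    "pitch-tracker": [
        {"traceId": "t-1", "msg": "pitch saved", "severity": "INFO"},
        {"traceId": "t-2", "msg": "unrelated", "severity": "INFO"},
    ],
    "asset-manager": [
        {"request": {"trace": "t-1"}, "message": "asset lookup failed", "level": "ERROR"},
    ],
}
correlated = logs_for_trace("t-1", raw_logs)
```

Because each team only supplies a mapping, adding a new application does not require any change to the correlation code itself.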

Next, we collaborated with other teams to make Edgar implementation and access easier. Our developer experience team worked to integrate Edgar into their paved-path Java GraphQL framework, so that detailed tracing could be done at the framework level with little implementation cost for our users. This lowered the barrier to entry and meant that more engineers could be onboarded. Then other teams worked to expose Edgar details in the tooling that was closer to users. For example, there's a GraphQL service for request detail so that users can easily add the Edgar link or a trace ID to their query. Integration with existing tooling increased the adoption of Edgar and helped users get more value out of it. All of which meant that when a production issue arose, Edgar was a more familiar tool and could help users get to resolution faster.

On the studio side, with minimal configuration, engineers can access a powerful experience that combines traces and logs. Edgar then sifts through them to determine what information is both valuable and relevant, so that we can visually highlight in Edgar where an error occurred and provide detailed context, cutting down on any manual work on the user's behalf. By incorporating logs, Edgar is able to get service level detail, and by pulling in metadata, Edgar is able to provide context about the environment, where the trace happened, and what behavior the user was trying to achieve.

With this base level experience available, we worked with our users to find what issues they had to solve most commonly. We reached beyond studio engineers to include production support, who provide technical support for the talented artists, animators, editors, and all of the production staff who take part in creating Netflix's movies and shows. For some of their most commonly confronted issues, they found themselves digging into log stores too. We worked with them to understand those problems and to build out a solution inside Edgar, so that they could access that information in one place with only one search.

Edgar provides our production support the ability to search for a given contractor, vendor, or member of production staff by their name or email. After finding the individual, Edgar reaches into numerous log stores for that user ID. We pull together their login history, their role access changelog, and recent traces emitted from production related applications. We scan through all this data for errors and warnings, and we present that to the user right at the front. Perhaps the vendor tried to log in with the wrong password too many times, or they were assigned an incorrect role on a production. These are some common issues that Edgar can help unpack.

That's Edgar today. Edgar is a crucial tool for operating and maintaining a production service at Netflix. Edgar has a self-service offering as well as more curated workflows for high leverage use cases. We serve a wide portion of Netflix, from streaming to studio, and from engineering to customer support. In all of these cases, Edgar is solving the same multi-dashboard problem by tying together information and pointing its users toward the next step of resolution.

Things That Paid Off

Let's talk about some things that paid off. Our users were trying to answer domain-specific questions. For example, customer operations for streaming might need to answer why a member did or didn't receive 4K. Production support might need to answer why a sound editor couldn't access their materials for a certain project. In both of these cases, the granularity of trace data might not be helpful in understanding their issue. Seeing a list of traces or a list of logs would still require manual digging to get to what really happened. The point here is that combining trace data with logs, and building that into a representation of behavior, can show you where a failure occurred inside of a logical group.

Next, we needed our users to trust that Edgar would have data about an issue in order for them to use it. By providing 100% tracing around a business-critical subset of traffic, our users learned that they could rely on us. Over time, we were able to refine our sampling approach to capture 100% of interesting traces rather than 100% of all traces. We also approached each section of our business differently. For enterprise applications like on our studio side, we were dealing with much smaller scale, and we could afford to capture all activity. Whereas on the streaming side, we needed to be more judicious about our approach to sampling. By putting fine-grained control into our users' hands, users were able to get a level of detail that would be unsustainable for all traces, but that they could turn on when needed.

By working with peer teams, we were able to make Edgar easier to use with a lower barrier to entry. Our Java GraphQL framework has tracing built in, and it also outputs the trace ID into its logs. Our GraphQL registry will automatically add a service's log store details to Edgar for them. These steps were huge for adoption, as users could get a lot for free right out of the box. Our initial use case was very targeted. We really focused on streaming video engineers. As we tested out our experience with this group, we learned that we could expand out. We expanded to multiple user groups, leveraging our same data set at first with customer service, and then eventually expanding out more broadly, taking the same concept and applying it to a new domain. By starting small, and extending out, we had the chance to test and become more robust along the way.

Self-Service Configuration

Initially, in the streaming world, we took a very white glove approach. We worked closely with our users to make sure that we understood their logs, and we worked together to shape and display them in Edgar's UI. This was probably the right approach on the streaming side, where there was such a high return on the investment of our work. On the studio side, though, with a much higher number of applications, that same white glove approach was unsustainable. Having a self-service config allows us to offer a lower maintenance experience. With tracing and log correlation, our users get a useful experience immediately, with no busy work on our side or theirs. This frees up our team to invest in building curated experiences around those higher burden support cases.

Takeaways

Next, let's talk about you. What are your takeaways from this? As with all things, observability has a cost. Implementation and storage are two that come to my mind immediately. One of the first questions I would have is, how can I get the best return on my investment? Here are a few things that can make a huge difference in how usable and leverageable your observability data is. One relatively simple thing you can do to maximize your ROI is to tie your traces and logs together by writing your trace ID into your logs. Exactly how you do this depends a lot on your implementation details, but the benefit is well worth the effort. By tying together your logs and your traces, you combine that ecosystem overview with the inner service level details of a log. For a failing request, the combination of traces and logs can point you to exactly what went wrong and where. This is a powerful combination.
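As one possible way to write the trace ID into your logs, here is a sketch using Python's standard `logging` and `contextvars` modules. The logger name, log format, and the `current_trace_id` variable are choices made for this example; your framework or tracing client may already offer its own mechanism.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" when there is none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(name, stream):
    """Create a logger whose output lines always carry a trace_id field."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger("playback", stream)

# While handling a request, set the trace ID taken from the incoming headers.
token = current_trace_id.set("4bf92f3577b34da6")
log.error("license acquisition failed")
current_trace_id.reset(token)

log.info("idle")
print(stream.getvalue(), end="")
# ERROR trace_id=4bf92f3577b34da6 license acquisition failed
# INFO trace_id=- idle
```

With the ID on every line, a log store can be queried by trace ID, which is what makes automatic correlation with the trace possible.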

Combining traces and logs offers very detailed insight into requests, but the traces and logs have to exist, which can get tricky. Many tracing implementations are focused on sampling a small fixed percentage of traffic. A fixed sampling rate of say 5% might be reasonable cost-wise, but it makes it much less likely that you'll find data on a particular failing request. Accordingly, this makes it less likely for developers to rely on that trace data to solve their problems. If users can't rely on it, they won't use it. One place to start is to capture 100% of a reasonable subset of business-critical traffic. Maybe that's all calls to a particular endpoint that's business critical. Either way, try to isolate some small subsection of traffic that you could trace at 100%, and experiment with that for your organization.
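A sampling decision along these lines might look like the sketch below: always trace a business-critical subset, and fall back to a fixed rate for everything else. The `/playback/` prefix, the 5% fallback, and the function name are assumptions for illustration. Hashing the trace ID, rather than drawing a random number, is one common way to make every service reach the same decision for the same request.

```python
import hashlib

CRITICAL_PREFIXES = ("/playback/",)  # hypothetical business-critical endpoints
FALLBACK_RATE = 0.05                 # fixed sampling rate for all other traffic


def should_trace(path, trace_id, critical_prefixes=CRITICAL_PREFIXES, fallback_rate=FALLBACK_RATE):
    """Trace 100% of critical traffic and a deterministic sample of the rest."""
    if path.startswith(tuple(critical_prefixes)):
        return True
    # Map the trace ID to a stable bucket in [0, 2**64) and compare it to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fallback_rate * 2**64)


# Critical traffic is always captured, regardless of the trace ID.
always = should_trace("/playback/start", "any-trace-id")

# Other traffic is sampled at roughly the fallback rate.
sampled = sum(should_trace("/browse", f"trace-{i}") for i in range(10_000))
```

Starting with a narrow prefix list keeps the storage cost bounded while you find out whether guaranteed coverage changes how often people reach for the data.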

Finally, a tool is only worth as much as the value it delivers to its users. You can only deliver value if people use your tool. The key to making observability tooling a success is to make it valuable and make it easy. Strive to make your trace data as accessible as possible. Make sure your tooling can be used throughout development and testing as well. Consider a wider spread of users if you can. Like any tool, the more often it's used, the more comfortable users will be with it. Commit to a tracing client that works with as many services as you have, and build abstractions that minimize work for developers. We have found particularly strong value in the abstractions around automating log correlation by trace ID. It's not one size fits all, but by maintaining a focus on developer experience, you can ensure that your tools actually get used.

Conclusion

When you tie together traces, logs, and metadata, you tell a detailed story about what is happening in your services. The trace gives you the timeline and all the participating services, as well as some clues through tags, like the status and the build number. Combining trace data with the logs gives you all the great details. The combination gives engineers the ability to see into a complex system and understand its inner workings, without having to pretend to be Sherlock Holmes. We can leave the mysteries to our Netflix watching.

Resources

If you're interested in learning more about the evolution of Netflix's API and how we use GraphQL today, check out Jennifer and Stephen's talk. Then if you're interested in the work done by the developer experience team to provide a great developer experience around GraphQL Federation, check out Kavitha and Paul's talk. If you're interested in learning more about what we're doing with observability at Netflix, and what we've done with Edgar, check out the Netflix tech blog.

Recorded at:

Apr 23, 2021

Elizabeth Carretto