Solving Mysteries Faster with Observability

Key Takeaways

  • Observability gives you the clues you need to solve production outages.
  • Listen to your users and make observability choices that give them tools to answer their day-to-day problems.
  • Give your users context by connecting traces with logs (writing your trace IDs into your logs is a start).
  • The cost of investing in observability is offset by time saved -- both reducing time to resolution of outages and cutting down on engineering resources devoted to debugging day-to-day.

At QCon Plus, a virtual conference for senior software engineers and architects covering the trends, best practices, and solutions leveraged by the world's most innovative software organizations, Elizabeth Carretto discussed observability at Netflix and how their internal tool, Edgar, comes into play.

While we love unpacking mysteries in our TV shows, when it comes to operating our services, mysteries hold a lot less appeal. Unfortunately, when something goes wrong, troubleshooting an issue can make you feel like a detective as you dig through log stores and peer into dashboards, struggling to reproduce an issue. The end goal is to understand what really happened and what factors led you into this issue. With all the tooling we have, from metrics to log stores to exceptions, there is a plethora of evidence to work with, but finding your issue and resolving it is not always an easy feat.

Outline

I want to start by looking at troubleshooting and talking about what engineers often do to figure out issues. What tools do they leverage, and what does the process look like? What are the pain points? Next, I'd like to share how we tried to tackle this problem at Netflix with Edgar. I'll address things that went well and some of the challenges we had along the way. Lastly, I'd like to discuss how you might leverage some of these same concepts in your organization to use your own data similarly.

A Degraded Experience

Often, it starts with a loading icon that never seems to end. You have one or more end-users who begin to see a degraded experience. Once an issue is discovered, whether through customer service or through an alert, it eventually gets put into the hands of an engineering team. That engineering team will start to open up dashboards and dig through logs, doing a lot of manual footwork to uncover each step. All of this happens while watching the clock. Outages are expensive for all businesses. When your service isn't available, or if your experience isn't what people have come to expect, you're potentially losing revenue or customer loyalty, neither of which is an outcome we want.

At Netflix, our users should be watching their favorite TV shows, not staring at an error or a loading screen. When I think about troubleshooting, this is the first use case that I think about: production issues, with live users being impacted. However, we don't only troubleshoot issues in production; there are plenty of things to troubleshoot before we ever get to this point. I'd argue that production troubleshooting takes up the smallest amount of our time. Similar problems plague us through all phases of development. Maybe you're seeing slow responses from a service, or a new integration isn't behaving as you expected, or there could be environmental differences between test and prod. These are all issues you need to troubleshoot — troubleshooting them before production is the best-case scenario. Debugging issues is a crucial part of our day-to-day lives as engineers. When it's our own service, debugging might be straightforward, but that's not always the case. The larger your system, the more you might have to dig.

Observability

Let's start by looking at the sources that we turn to when we look for clues. We often begin with observability tooling. Logs, metrics, and traces are the three pillars of observability. Logs give a richly detailed view of an individual service and provide the service a chance to speak its own piece about what went right or what went wrong as it tried to execute its given task. Next, we have metrics. Metrics indicate how the system or subsets of the system, like services, are performing at a macro scale. Do you see a high error rate somewhere, perhaps in a particular service or region? Metrics give you a bird's eye view. Then we have traces, which follow individual requests through a system, illustrating the holistic ecosystem that our request passes through.

In addition to observability tooling, we also turn to metadata. By metadata, I mean supplemental data that helps us build context. For us at Netflix, this might be, what movie or what show was a user trying to watch? What type of device were they using? Or details about the build number, their account preferences, or even what country they're watching from. Metadata helps add more color to the picture that we're trying to draw.

At Netflix, with our microservices architecture, and with just one request hitting ten services, there could be a log store and a dashboard per service, which might leave you with many log stores and many metrics dashboards. What if your request hit 20 services or 100? You could see how this would quickly become challenging. That's an inherent difficulty of debugging a larger system. Each microservice might be easy to understand and debug individually, but if all you know is an error occurred during this request, in one of these 20 microservices, searching for key evidence becomes like digging for a needle in a group of haystacks.

 

How Can Observability Teams Make Troubleshooting Easier?

For teams building observability tools, the question is, how do we make understanding a system's behavior fast and consumable, quick to parse, and easy to pinpoint where something went wrong? Ideally, this should hold even if you aren't deeply familiar with the inner workings and intricacies of that system. When we built our own tool, Edgar, to solve this problem, we relied on distributed tracing as the glue to tie it all together. While there might be many dashboards or log stores, there should only be one distributed tracing UI. Since a distributed trace captures the ecosystem of a request, it gives us a clear answer about which services were involved, allowing us to use the trace ID as a common thread to tie all of these services and their data together.

Request Tracing

Let's quickly level set on request tracing. Request tracing is the process of capturing incoming and outgoing network calls, as well as the internal activity of a service, recording details about the requests and responses, including latency, and then storing that data. A trace can tell you the path of a request as well as its timing. You'll often see traces depicted as a waterfall graph showing each step of the request and how long it took. To follow request activity through a system, a trace ID is generated at the onset, typically by the first service hit by the request. That trace ID is then passed along as a header value as the request proceeds, so that each subsequent service knows the request's ID. As the request goes along, it generates spans. A span is a unit of work. It can represent a network call from one service to another, indicating a client-server relationship, or it could indicate a purely internal action like starting and finishing a method. Spans contain a set of key-value pairs called tags. Service owners can attach helpful values to this bucket of tags. Tags could be URLs, version numbers, regions, corresponding IDs, errors, really anything that the service owner determines to be important to this set of data.
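
As a rough, self-contained sketch (not Netflix's actual tracing client; the field and header names are illustrative), a span can be modeled as a small record carrying the trace ID, timing, and a bag of tags, with the trace ID minted by the first service and forwarded as a header so downstream services can attach their own spans:

```java
import java.time.Instant;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of a span; field names are illustrative, not a real tracing API.
record Span(
        String traceId,      // shared by every span in the request
        String spanId,       // unique to this unit of work
        String parentSpanId, // null for the root span
        String operation,    // e.g. a network call or an internal method
        Instant start,
        Instant end,
        Map<String, String> tags // key-value pairs: URL, region, status, build number, etc.
) {}

class TracePropagation {
    // Hypothetical header name; real systems use e.g. W3C "traceparent" or B3 headers.
    static final String TRACE_HEADER = "X-Trace-Id";

    static String traceIdFor(String incomingHeaderValue) {
        // The first service in the chain mints the trace ID; every service after it reuses it.
        return incomingHeaderValue != null ? incomingHeaderValue : UUID.randomUUID().toString();
    }
}
```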

What we call a trace is really just a group of all spans with a given trace ID. This forms a single end-to-end umbrella representing each step that a request went through as it passed through the system. Traces (and their underlying spans) offer us a lot for visualizing as well. Spans record their parent-child relationships with other spans, which allows us to create a directed call graph depicting the trace. Spans have a start time and an end time, and thanks to these timestamps, a user can quickly see how long each operation took. Trace data gives us a lot to start with. We have timing. We know what services were hit. We have rich tags with statuses and details, so we're starting to draw a decent picture. What are we missing? Right now, we don't have any detail from the logs or any additional context about what environment this trace happened in. Looking for this additional context will take time for our users, since logs live elsewhere. If you want that level of detail, you still have to go find it. We're also missing metadata. What movie was the user trying to watch? What device were they using? Using the trace ID as the thread to tie it all together, we can consolidate this information into one place, saving our users precious time. That's what we did with Edgar.
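
Continuing the illustrative Span record above, assembling a trace is conceptually just grouping a trace ID's spans by parent and walking the resulting call graph; a minimal sketch of the waterfall view might look like this:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds a parent -> children call graph from one trace's spans
// and prints an indented "waterfall" with per-span durations.
class TraceAssembler {
    static Map<String, List<Span>> childrenByParent(List<Span> spansForOneTraceId) {
        Map<String, List<Span>> children = new HashMap<>();
        for (Span s : spansForOneTraceId) {
            children.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
        }
        return children;
    }

    static void printWaterfall(Map<String, List<Span>> children, String parentId, int depth) {
        for (Span s : children.getOrDefault(parentId, List.of())) {
            long ms = Duration.between(s.start(), s.end()).toMillis();
            System.out.println("  ".repeat(depth) + s.operation() + " (" + ms + " ms)");
            printWaterfall(children, s.spanId(), depth + 1);  // recurse into child spans
        }
    }
}
```

Calling printWaterfall with a null parent ID starts from the root span and walks the whole request.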

Edgar – A Distributed Troubleshooting Platform

Edgar is a distributed troubleshooting platform for distributed systems, built on a foundation of request tracing, with additional context layered in by aggregating correlated logs and supplemental data. That aggregation is a big selling point of Edgar. Edgar can pull in logs and metadata to show a more complete picture of what's going on and cut down on the manual footwork necessary. Plus, we do more than aggregation. We partner with teams to provide analysis for common use cases, and then we summarize that data to help users identify root causes more quickly with less effort.
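
To make the aggregation idea concrete, here is a hedged sketch (again reusing the illustrative Span record, with made-up interfaces rather than Edgar's real, Netflix-internal integrations) of joining a request's spans with the log lines correlated by its trace ID into a single view:

```java
import java.util.List;

// Illustrative interfaces only; the real trace and log backends differ.
interface TraceStore {
    List<Span> spansFor(String traceId);
}

interface LogStore {
    // Assumes each log store can be queried by the trace ID written into its entries.
    List<String> logLinesFor(String traceId);
}

class TroubleshootingView {
    // Joins the request's spans with every correlated log line in one place,
    // instead of one dashboard and one log store per service.
    static String assemble(String traceId, TraceStore traces, List<LogStore> logStores) {
        StringBuilder view = new StringBuilder("Trace " + traceId + "\n");
        traces.spansFor(traceId).forEach(s ->
                view.append("  span: ").append(s.operation()).append('\n'));
        logStores.forEach(store ->
                store.logLinesFor(traceId).forEach(line ->
                        view.append("  log:  ").append(line).append('\n')));
        return view.toString();
    }
}
```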

Let's look at how Edgar is used and how we got to where we are today. When Edgar started, our tool was built for engineers, and specifically for streaming engineers. We created an experience around trace data and logs on the streaming side to represent an abstraction of a streaming session. Streaming sessions have a predictable flow. There's a start event and a stop event. A license needs to be acquired. The right kind of content needs to be delivered based on the device and the user's preferences. We have log data and tracing that can represent all of those steps. We started to shape this data into something immediately consumable: a view of a playback session. This view presents a summary of the session built from analysis of the logs and the traces involved in the viewing session. By building out this abstraction, users can see what went wrong at a glance. With some supplemental configuration, we can even provide a team or a person's contact information for issues with a known resolution - which saves a lot of the human time and work of manually stepping through these logs and traces to assemble this view in an engineer's own mind.

Soon enough, the Edgar team realized that these abstractions could serve other user groups as well. By offering our tool to customer service operations to help them quickly understand member issues, we could actually take some weight off our engineering teams. We could help resolve member issues faster and get members back to enjoying Netflix. Building an abstraction around a Netflix streaming session allowed us to answer some of the most common questions engineers and support had to address. We found that 20% of issues or issue types often led to around 80% of the support burden. By focusing on those frequently asked, higher-leverage issues, we delivered considerable value with the abstractions that we built. Using logs and trace data, Edgar could answer why a member didn't get a 4K stream on their device or why a license couldn't be acquired for a particular asset. We found that Edgar could be expanded to more than just streaming engineers. It had huge value for customer service operations, who had to answer these types of questions all the time.

To deliver value to engineers and our customer service whenever they encountered an issue, we had to make a commitment to have data about those issues. We decided to capture 100% of playback-related data, and we built in some knobs to control how fine-grained that data would be. Suppose our users are troubleshooting a particular issue affecting a known device. In that case, they can turn on an increased level of tracing for that specific device and capture full headers and payloads for those requests and responses. For non-playback services, Edgar's users can invoke tracing on demand at a few different levels. They can say they want trace data only, or full headers and payloads if they're digging into something that requires a high level of detail. This configurability and our commitment to data storage helped our tool become a reliable source of information that could provide greater detail in times of need.
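
The exact knobs are internal to Edgar, but one way to picture an on-demand tracing rule is as a small, expiring filter plus a capture level; the shape below is purely hypothetical:

```java
// Hypothetical shape of an on-demand tracing rule; Edgar's real configuration differs.
enum CaptureLevel { TRACE_ONLY, HEADERS, HEADERS_AND_PAYLOADS }

record TracingRule(
        String deviceTypeFilter,  // e.g. only the specific device under investigation
        CaptureLevel level,       // how much detail to capture for matching requests
        long expiresAtEpochMs     // fine-grained capture is temporary by design
) {
    boolean matches(String deviceType, long nowMs) {
        return nowMs < expiresAtEpochMs && deviceTypeFilter.equals(deviceType);
    }
}
```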

Supporting Studio and Content Production

Over time, we found this was just the tip of the iceberg. As the studio side of Netflix grew, we found a whole new set of users. On this side of the business, it's a very different world. Instead of one streaming video service consumed by millions, we have many apps doing a wide variety of tasks with a much smaller set of users. Yet all of these users are essential to the production of our incredible movies and shows. We have applications built for every step, from an initial pitch all the way to playing on the Netflix service. Often, there are several tools for every step. The engineers building these applications found themselves troubleshooting just as much as our engineers on the streaming video side. With so many applications and so many teams building tools, we needed to make our product more self-service. We built an admin page that allows users to add the details and the field mappings of their log stores, so that application logs are automatically correlated with their traces without any additional work. That saved us legwork and meant that we on the Edgar team weren't the bottleneck.
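
As an illustration only (the fields are invented, not Edgar's actual admin schema), a self-service log store registration might capture little more than where the logs live, which field holds the trace ID, and how the store's fields map onto the tool's display:

```java
import java.util.Map;

// Illustrative only: the kind of self-service registration an admin page could collect
// so application logs can be correlated with traces without tool-team involvement.
record LogStoreRegistration(
        String application,            // owning app, e.g. "studio-casting-tool" (made up)
        String logStoreEndpoint,       // where to query the team's logs
        String traceIdField,           // which log field holds the trace ID
        Map<String, String> fieldMap   // maps the store's field names to the tool's display fields
) {}
```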

Next, we collaborated with other teams to make Edgar easier to implement and access. Our developer experience team worked to integrate Edgar into their paved-path Java GraphQL framework so that detailed tracing could be done at the framework level with little implementation cost for our users. This out-of-the-box implementation lowered the barrier to entry and meant that more engineers could be onboarded. Then, other teams worked to expose Edgar's details within tooling that was close to user workflows. For example, there's a GraphQL service for request detail so that users can easily add the Edgar link or a trace ID to their query. Integration with existing tooling increased the adoption of Edgar and helped users get more value during every phase of the SDLC. All of this meant that Edgar was a more familiar tool when a production issue arose and could help users resolve it faster.

With minimal configuration, engineers on the studio side can access a powerful experience that combines traces and logs. Edgar then sifts through the data to determine what information is valuable and relevant, visually highlighting where an error occurred and providing detailed context, cutting down on the manual work required of the user. By incorporating logs, Edgar can get service-level detail, and by pulling in metadata, Edgar can provide context about the environment, where the trace happened, and what behavior the user was trying to achieve.

With this base level experience available, we worked with our users to find what issues they had to solve most commonly. We reached beyond studio engineers to include production support, who provide technical support for the talented artists, animators, editors, and all of the production staff who take part in creating Netflix's movies and shows. For some of their most commonly confronted issues, they found themselves digging into log stores too. We worked with them to understand those problems and build out a solution inside Edgar to access that information in one place with only one search.

Edgar provides our production support the ability to search for a given contractor, vendor, or member of production staff by their name or email. After finding the individual, Edgar reaches into numerous log stores for that user ID. We pull together their login history, their role access changelog, and recent traces emitted from production-related applications. We scan through all this data for errors and warnings, and we present that to the user, highlighting anything that went wrong. Perhaps the vendor tried to log in with the wrong password too many times, or they were assigned an incorrect role on a production. These are some common issues that Edgar can help unpack.
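
A rough sketch of that workflow, with made-up types rather than Edgar's real integrations, is simply fanning one user search out across several sources and flagging anything that looks like an error or warning:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea only: one search across several sources for a user,
// surfacing anything that looks like an error or warning.
class SupportSearch {
    interface UserLogSource {
        List<String> entriesFor(String userId);  // e.g. login history, role access changelog
    }

    static List<String> findProblems(String userId, List<UserLogSource> sources) {
        List<String> flagged = new ArrayList<>();
        for (UserLogSource source : sources) {
            for (String entry : source.entriesFor(userId)) {
                String lower = entry.toLowerCase();
                if (lower.contains("error") || lower.contains("warn")) {
                    flagged.add(entry);  // e.g. repeated failed logins, missing role grant
                }
            }
        }
        return flagged;
    }
}
```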

That's Edgar today. Edgar is a crucial tool for operating and maintaining a production service at Netflix. Edgar has a self-service offering as well as more curated workflows for high-leverage use cases. We serve a comprehensive portion of Netflix, from streaming to the studio and from engineering to customer support. In all of these cases, Edgar is solving the same multi-dashboard problem by tying together information and pointing its users toward the next step of resolution.

Things that paid off

Let's talk about some things that paid off. Our users were trying to answer domain-specific questions. For example, customer operations for streaming might need to answer why a member did or didn't receive 4K. Production support might need to answer why a sound editor couldn't access their materials for a specific project. In both cases, the granularity of trace data might not help them understand their issue. Seeing a list of traces or a list of logs would require manual digging to get to what really happened. The point here is that combining trace data with logs and building it into a representation of behavior can show you where a failure occurred inside a logical group.

Next, we needed our users to trust that Edgar would have data about an issue for them to use it. By providing 100% tracing around a business-critical subset of traffic, our users learned that they could rely on us. Over time, we could refine our sampling approach to capture 100% of interesting traces rather than 100% of all traces. We also approached each section of our business differently. For enterprise applications, like those on our studio side, we dealt with a much smaller scale and could afford to capture all activity, whereas on the streaming side, we needed to be more judicious about our sampling approach. By putting fine-grained control into our users' hands, users were able to get a level of detail that would be unsustainable for all traces but that they could turn on when needed.

By working with peer teams, we were able to make Edgar easier to use with a lower barrier to entry. Our Java GraphQL framework has tracing built in and writes the trace ID into its logs. Our GraphQL registry will automatically add a service's log store details to Edgar for them. These steps were huge for adoption, as users could get a lot for free right out of the box. Our initial use case was very targeted. We really focused on streaming video engineers. As we tested out our experience with this group, we learned that we could expand out. We expanded to multiple user groups, leveraging the same data set at first with customer service, and then eventually expanding out more broadly, taking the same concept and applying it to a new domain. By starting small and extending out over time, we had the chance to test and become more robust along the way.

Self-service Configuration

Initially, in the streaming world, we took a very white-glove approach. We worked closely with our users to make sure that we understood their logs, and we worked together to shape and display them in Edgar's UI. This was probably the right approach on the streaming side of Netflix, where there was such a high return on the investment of our work. On the studio side, though, with a much higher number of applications, that same white-glove approach was unsustainable. Having a self-service config allows us to offer a lower-maintenance experience. With tracing and log correlation, our users get a powerful experience immediately, with no busywork on our side or theirs - freeing up our team to invest in building curated experiences around those higher-burden support cases and freeing up our users to do what they do best.

Takeaways

Next, let's talk about you. What are your takeaways from this? As with all things, observability has a cost. Implementation and storage are two that come to my mind immediately. One of the first questions I would have is, how can I get the best return on my investment? Here are a few things that can make a huge difference in how usable and leverageable your observability data is. One relatively simple thing you can do to maximize your ROI is to tie your traces and logs together by writing your trace ID into your logs. Exactly how you do this depends a lot on your implementation details, but the benefit is well worth the effort. By tying together your logs and your traces, you combine that ecosystem overview with the inner service level details of a log. For a failing request, the combination of traces and logs can point you to precisely what went wrong and where - a powerful combination.
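
How you write the trace ID into your logs depends on your stack; on the JVM, one common approach (assuming SLF4J with a backend such as Logback) is to put the trace ID into the logging context so that every log line emitted while handling the request carries it:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Puts the trace ID into the mapped diagnostic context for the duration of a request,
// so a log pattern containing %X{traceId} stamps it onto every line.
class RequestHandler {
    private static final Logger log = LoggerFactory.getLogger(RequestHandler.class);

    void handle(String traceId) {
        MDC.put("traceId", traceId);
        try {
            log.info("starting playback lookup");  // now searchable by trace ID
            // ... actual work ...
        } finally {
            MDC.remove("traceId");  // avoid leaking the ID onto the next request handled by this thread
        }
    }
}
```

With the trace ID in every line, a log store query for a single failing request becomes a lookup rather than a hunt.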

Combining traces and logs offers detailed insight into requests, but the traces and logs have to exist, which can get tricky. Many tracing implementations focus on sampling a small fixed percentage of traffic. A fixed sampling rate of, say, 5% might be reasonable cost-wise, but it makes it much less likely that you'll find data on a particular failing request. Accordingly, this makes developers less likely to rely on that trace data to solve their problems. If users can't depend on it, they won't use it. One place to start is to capture 100% of a reasonable subset of business-critical traffic. Maybe that's all calls to a particular endpoint that's business-critical. Either way, try to isolate some small subsection of traffic that you could trace at 100% and experiment with that for your organization.
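
A sampling policy like that can be as simple as a head-based decision that always traces the business-critical path and falls back to a small fixed rate elsewhere; the endpoint name and rate below are placeholders, not a recommendation:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sampling policy: trace 100% of one business-critical endpoint,
// and a small fixed percentage of everything else.
class SamplingPolicy {
    private static final String CRITICAL_PATH = "/playback/start";  // placeholder path
    private static final double DEFAULT_RATE = 0.05;                // 5% elsewhere

    static boolean shouldTrace(String path) {
        if (CRITICAL_PATH.equals(path)) {
            return true;  // the subset users must always be able to rely on
        }
        return ThreadLocalRandom.current().nextDouble() < DEFAULT_RATE;
    }
}
```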

Finally, a tool is only worth as much as the value it delivers to its users. You can only provide value if people use your tool. The key to making observability tooling a success is to make it valuable and make it easy. Strive to make your trace data as accessible as possible. Make sure your tooling can be used throughout development and testing as well. Consider a wider spread of users if you can. Like any tool, the more often it is used, the more comfortable users will be with it. Commit to a tracing client that works with as many of your services as possible, and build abstractions that minimize work for developers. We have found particularly strong value in the abstractions around automating log correlation by trace ID and analyzing the combined dataset for errors. It's not one size fits all, but you can ensure that your tools actually get used by maintaining a focus on developer experience.

Conclusion

When you tie together traces, logs, and metadata, you tell a detailed story about what is happening in your services. The trace gives you the timeline and all the participating services and some clues through tags, like the status and the build number. Combining trace data with the logs gives you all the nitty-gritty details. The combination provides engineers the ability to see into a complex system and understand its inner workings without pretending to be Sherlock Holmes. We can leave the mysteries to our Netflix watching.

About the Author

Elizabeth Carretto is a Senior Software Engineer at Netflix in Productivity Engineering, where she builds UIs for the observability space. Her work focuses on delivering value from observability data to service operators through products like Edgar, a troubleshooting tool built on top of distributed tracing.
