InfoQ Homepage Presentations An Observable Service with No Logs

An Observable Service with No Logs

View Presentation

Speed:

43:18

Summary

Glen Mailer discusses building and using event tracing to monitor a system.

Bio

Glen Mailer worked across a variety of roles at all levels of the stack: from infrastructure to frontend with a detour via databases. Most recently he’s worked on building infrastructure at CircleCI, and is now integrating with every SaaS under the sun at Geckoboard. Although capable of deriving meaning by staring at metrics graphs or grepping logs, he prefers the power of observability.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Mailer: My name is Glen. I work at a little company called Geckoboard. I'm going to be talking about some work that me and my team did at CircleCI. CircleCI is mostly made up of microservices, and they're written in Clojure. They all share some common tooling using a shared library. That library is maintained by a separate team that tries to help act as an accelerator for all the other teams, and keeps that library up to date. Meanwhile, each CI build that runs on our platform defines its own CI environments. We run the infrastructure, but they get to decide what dependencies and so on are installed on the system. In order to power those builds, we needed to implement a build agent, a bit of code that can run anywhere on those customer defined machines. We implemented that in Go. My team was responsible for this agent written in Go, that build infrastructure that it runs on, and some of the microservices that it communicates with. We decided we needed to build a new service for an upcoming build agent related feature, and we decided that rather than Clojure, we were going to make a bit of a departure. We were going to also build that in Go. There were various tradeoffs involved in this decision. Overall, we were confident that this was a good choice.

I'm going to claim that this resulted in a fairly unusual situation, or certainly unusual to me. We had quite a large team of mostly experienced senior engineers. We had experience working together building software in Go, to build this build agent. We also had experience of building web services together in Clojure. Because of the choice to build this new service in Go, we'd left ourselves unable to use that shared library. All of the tooling and the patterns we had with writing a CLI tool, we didn't have any of that scaffolding, that standard library stuff for building a web service. I think this was a relatively unusual situation. In particular, this situation is a real recipe for bikeshedding, for bickering and debating over people's personal preferences for things that aren't really that important. As well as the bikeshedding, I think it's also a great opportunity for some experimentation, for some chances to try some things that we haven't really tried before, try and mix things up a bit.

CircleCI's Observability (2019)

A little bit more relevant background knowledge. I think this is quite relevant to the story is what CircleCI's observability was like at the time. This story takes place sometime around summer 2019, possibly back then this was before the term observability had really started to get really heavy use. CircleCI had a two-pronged approach. Both prongs were really expensive. StatsD was the primary tool we used. We used metrics. We had lots of metrics, and we had lots of tags. The pricing model for StatsD and actually the underlying cost of the system itself is based around the unique combinations of tags that you use. The more tags you use, the more it's going to cost you. We also had a rather expensive, self-hosted Elasticsearch system. That's where all our structured logs went. That's generally the thing we're comfortable doing, adding logs to our system with occasionally some fields that we can query.

Honeycomb

Around this time, we'd started working with a new kid on the block, and that was Honeycomb. We'd had a period of database problems in the spring of 2019, and to help us uncover what was really going on there and get a bit more visibility into our system, we'd started exploring Honeycomb. It really opened our eyes to how our systems were behaving in production. We'd start to spot things that we hadn't really appreciated before. I was hooked. I'd started looking at Honeycomb before even checking metrics or logs, even when we had only instrumented quite a small proportion of our overall stack at this point. We were also starting to see tweets like this floating around. This is Charity, one of the founders of Honeycomb. She's railing against this concept of three pillars of observability: metrics, logs, and traces. She's claiming they aren't really a thing. I was starting to buy into this. I proposed an experiment. Given the situation, we've got a little bit of a blank slate. We know what we're doing. We understand each other. We know how to work in a programming language. We're not completely inventing everything from scratch. I proposed this experiment.

What If We Didn't Have Any Logs?

I said, what if we didn't have any logs? What does it mean to not have logs? We need a definition of logs in order to not have them I think. This is the definition I'm going with. Logs are text-first telemetry. Telemetry meaning the information our system emits in order to understand what's going on. Text-first, meaning when you're writing a logline into your code, you're thinking about it in terms of the text it produces. Logs are intended to be human readable. You're usually writing a log as a message to your future self, or some other operator. A little message in a bottle, as it were. If you're lucky, the logs will have some structured data attached. Some key-value pairs to provide a bit more context about what's going on that are queryable. If you're less lucky, then that extra contextual information will be part of the human readable bit of a log, which will make it much harder to query against.

Logs are generally scattered across a code base. There isn't a common pattern for exactly where a log should or shouldn't go. You probably are trying to debug an issue and you realize there's something you'd like to see that you can't see, so you go back into the code, you add that logline, and then you ship it. Then now you've got a new logline and you've got that additional visibility. Logs are pretty much always visualized and explored as a stream. Sometimes there's just text with grep, sometimes in visual logger creation tools, but even then they're still usually thought of as a stream of text going past that you have some way of picking the lines that are most interesting to you within them. Logs also have this concept of a log level, so debug, info, warn, error. Some logs are more important than others. When we configure a system in production, we often disable debug logging. I'm sure many of you have been in a situation where you've had a production issue, you'd love to see what's going on, and you've looked in the code. There is a logline that would tell you exactly what you need to know, but it's disabled in production, because it's just way too noisy. That's logs.

An Event-First Approach

If we don't have logs, what do we have instead? We'd have what I'd like to call an event-first approach. Not the catchiest name, but I think it's a reasonable term. Again, we need a definition, what is an event-first approach? An event-first approach, as opposed to text-first telemetry, is data-first telemetry. We're primarily thinking about describing what's going on in our system as data. We do that via events. Each event has a unique name that identifies it within our code base, and describes the action that's going on. All of the properties of that event are structured data. This is partially by virtue of it being new. Generally, now, if I'm writing a new logline, it will also all be structured data. When I'm creating an event now, the event name is just an identifier, and all of the data about that event goes into properties as structured data. We're also going to wrap these events around units of work and mark the duration of that unit of work. When I say a unit of work, I mean something important that our system is doing. Rather than having a logline for like, yes, a thing started, yes, it's still going. Ok, it's finished now. With an event, I'm going to wrap that unit of work and I'm just going to emit an event when a piece of work is completed. Then it will have all the properties I require and it'll know the duration of that chunk of work.

Events, we're going to explore and visualize in multiple ways. That includes the stream that we get for logs. It will also allow us to do time-series queries and other types of ways of slicing and dicing that same bit of raw information. We're not going to have log levels or any equivalent. Where in a logs based system we say certain events are more important than others, in an event system, we're going to say all events are important in terms of the code level. If we're struggling with volume, and we need to throw some away, we're going to use dynamic sampling techniques.

These are quite similar lists of bullet points. Events and logs aren't actually that different. Events are just good logs. There's something I've been saying to myself at least for a number of years is I'm going to stop using the word just. There's a lot of hidden meaning behind the word just. It implies a certain level of triviality, which I think is rarely the case. Even if events are just good logs, I think there's an important philosophical difference there. In particular, I think new words can encourage new ways of thinking. Many of us take our logs for granted. We don't put a lot of thought into what we're writing in our loglines. We've been using logs for years, so we know how to do logs. Suddenly I say, we're doing events now, and you go, that's different. How do I do that? I think using new words, new terminologies, new APIs is an important tool in our toolbox. Because, ultimately, we're building these systems for people, and so finding ways to change the way people think is an effective way to bring about change.

Example

Let's see what these look like in practice. I'm going to show a bit of code now. Hopefully, you're familiar enough to follow some Go. Even if not, I hope you can still appreciate how minimal these interfaces really are. We've defined our package with the trendy o11y name. We've got a StartSpan function. This is going to create a span with the span representing a unit of work, and it's going to store it in the context. Context is Go's equivalent of thread-local storage, but it's explicit rather than implicit. We're going to create our span. We're going to give it a name. Then once we've got our span, there's only two core operations we really need on a span. The first one is to add a field. This has a key and a value, where the key is the name of the data, and the value is anything we like. Then after a span has run for some time, we'll call this end function on the span. That will record a duration from the start to the end and send it off to the backend. Fairly straightforward, I hope.

Then, when we write our application code, we're going to sprinkle a little bit of the span code in throughout it to produce the telemetry. In this case here, we're going to start our span and we're going to name it DeleteEntity, to match up with the name of the unit of work that we're performing. We're going to use Go's defer construct so that when this function ends, it captures the error and the span ends when the function ends. Then throughout the body of this function, we're going to use a series of these AddField calls to add important context about what's going on. Generally, domain specific information about what's happening within this function, what work we're performing right now.

Then when we do this throughout our code, we can send it off to Honeycomb, and we get to see these nice little trace waterfalls like this. If any of you have used some tracing tool, I don't think this will be a revolutionary concept to you. You'll have seen a lot of stuff like this. As we browse around the Honeycomb UI, and we click on these rows, all of that extra field data that we attach to our spans will be viewable in the little sidebar.

Some Patterns Emerged

We started from first principles. We had this really super minimal interface. Then as we worked on it over time, and built up this new system, we started to develop and extract some patterns. One of the first patterns that emerged is having a consistent way to capture errors when a span ends. There's a little bit of sleight of hand here with named return values and pointers, and so on. The general idea here is that whenever this function ends with an early return, or whatever situation, the defer block will run, and we're going to call the standard helper, this end function to end our span. This makes sure that every time this function ends, it goes through our helper. There's another way to do the same concept, which will be to use a slightly more aspect oriented style, where we wrap our code with an observability layer that respects the same interface as the actual code. Of the two approaches, we much preferred this inline approach where we actually clutter our code with the observability code. The reason for that is generally when you're coming to make a change to the code, because it's right there in your face and in your way, then you're much more likely to remember to update the observability code to match. Whereas if it's off in a separate layer, it's very easy for the two to fall out of sync.

Inside the body of this helper function, we've just got some really basic logic to ensure that we've got consistent storage fields. Every time there's an error, we're going to set the result field of the span to have a value error, so we can very easily find spans containing errors. We'll also populate the error field with some details about what went wrong. In the real version of this, there's a few more cases that handled for like timeouts, and so on. This is the core concept. When we send this off to Honeycomb, it's able to highlight all of those errored spans and pull them out so we can see them clearly. If you've got a much longer trace and only one or two of those have errored, this is way more visually helpful. It also means we can do a time-series query on those same fields. Here, I've taken a single one of those names, and I've said, all the spans with a certain name, show me whether or not they errored. You can see there are some spikes here where clearly something went wrong and those error rates spiked up, and maybe that's something worth looking into.

Another pattern that emerged is what I like to call medium cardinality span names. Cardinality is the number of unique values of a particular field. For example, earlier that results field has only got two or three possible values, that's low cardinality. Whereas something like a customer ID is high cardinality. Hopefully, you got a lot of customers and there's lots of possible values of a field like customer ID. Low cardinality values are really great for grouping events together. You can say, show me how many events failed versus passed. Really useful. Whereas if you say, show me group by customer ID, you just got loads of separate rows, not quite as useful for grouping. When you're looking at single events, knowing exactly which customer affected is really effective. Medium cardinality is somewhere in between. I've often found that auto-instrumentation will default to quite low cardinality span names. For instance, something like, it might just say, HTTP request, or select query, or insert query. When you're looking at a trace waterfall, and it says something like HTTP request, HTTP request, select query, insert query, it's not entirely clear what's going on. Whereas what I found was by using medium cardinality names, so in this case, the pattern we did was generally package method. Occasionally, we'd also prefix with the type. The examples I've got here are like a gRPC call, or a database query, just the extra little bit of context means that when you're looking at a trace waterfall, or when you're grouping by span name, you can immediately see what action is going on, rather than having to drill down into the individual spans and look at their fields. Generally just gives you a much quicker at-a-glance view of a trace.

Spans Are Cheap but They're Not Free

Another thing we start to notice fairly early on was cost. We already had two very expensive telemetry backends, we didn't really want a third expensive telemetry backend. I'm sure you can imagine, CircleCI is quite a high volume system, millions of builds a day, lots of RPC calls as part of those builds. Anything that multiplies per call gets very expensive very quickly. If you take a naive application of the approach I described above, where each layer gets wrapped with some spans to denote the unit of work, you can very easily end up with a trace like this one. I've got four different spans each representing a different layer of our code. Ultimately, there's no other code happening in that request handler. All of these spans are mostly recording redundant information. When we start to see traces like this, we got a little bit more selective about which layers really benefited from being considered units of work, versus just being delegated down to the other layer. In this case, the top layer there is the HTTP handler, and the second layer down is the domain layer, which is the core of our application logic that's separate from HTTP. We made some tweaks to how many of these spans were produced. That did save us actually, quite a bit.

Adding Lots of Domain-Specific Span Fields

When we're throwing away spans like that, or reducing spans, we are losing a bit of information. We then developed a few patterns to try and get that information back, so generally, adding lots of domain specific fields onto spans. In Honeycomb, it doesn't cost anything additional to add more fields to make a span wider. It generally tends to encourage you to do that. For example, in the previous trace there, there was an HTTP client that did retries. That was creating quite a deep trace. Whereas, instead of having a span for each retry attempt, you can just have a counter on the final one to say how many attempts were made. That way, you're losing a bit of information, but you're still overall capturing what's going on. We've also found that in a microservice system, you've got loads of different storage backends, lots of different identifiers. It's very helpful to add as many of those onto your spans as you can get away with. Then when you do have an issue, you can see exactly which customers it's affecting, or even in many cases, which customer is causing that issue.

Isn't This Just Instrumenting with Traces?

Anyone who spent some time instrumenting their code with modern tracing tools is probably mostly shrugging at this point. You'll have come across many of these patterns in your own work. At this point, I'm just instrumenting with traces. There's that word just again. So far, yes, I haven't really done anything new apart from not introducing logs. I'm instrumenting with just traces. Actually, when I said no logs in the title, I was lying a little bit, or bending the truth, perhaps. When I first proposed these experiments, some of my colleagues were a little bit wary. They had quite reasonable questions like, what will we do if Honeycomb is down? How will we see what's going on? We had this logging system that there were things we didn't like about it, but we understood its operational constraints, we knew how to operate it. Switching entirely to a brand new approach can be a little bit scary. What we ended up implementing was a little tee inside the o11y library. As well as sending events to Honeycomb, we also converted them to JSON, and wrote them to stdout. That way, after sending to stdout, we then pumped off to our standard log aggregation system. This way, we've got a fallback. If Honeycomb is not working, we can just see our logs normally. We could also send these off to S3 or some other long term storage system if we wanted.

We've got some JSON logs out of our traces. These aren't super usable. They're not particularly readable. There's just a big wall of text. Actually, we found ourselves using them. In particular, we found ourselves using them in local developments. Because we all have this habit, this pattern of when you're working locally, you expect to see output from your process right there in your face. When you do things, you want to be able to see that immediate feedback loop. Because we started using these JSON logs, even though they were difficult to read, we made them easier to read. With a little bit of parsing and a little bit of color, we can immediately add a lot more information architecture. We can pull out the span name in color here. If there are any errors in span, we can tag those and make them really pop out. We can abbreviate things. We don't need the 36-character trace ID in a stdout dev output. We can just take a few characters, enough to make it visually distinct and see what's going on.

Actually, we found this was really most effective in tests. The Go's test runner can capture stdout, and it will display it next to a failing test. You get test failure, you can see all of the traces, or all of the spans that happened during that test, in color, in text, right there next to the test failure. This was almost accidental, really, but this turned out to be one of the most effective changes. Because now if you've got a failing test, and you can't tell why it's failed, then that means your code is not observable enough. The solution to fixing that is not to go fire up your debugger and figure out what's going on in this particular case. Because this same issue might also happen in production. The solution instead is to extend your tracing code to add new fields, to add new span so you can see what's going on. Then the reason for the failure of your test is revealed, but also you've enriched your observability, you've enriched your telemetry, so that when this code makes it to production, you can see what's going on. That turned out to be really, quite surprisingly effective.

How About Some Metrics?

We've got some logs back, what about metrics? What about our third pillar? It's become a bit fashionable lately to really hate on metrics dashboards. I've used metrics dashboards for a lot of my career now. I know how to read them. I'm familiar with them. I think I'm still a big fan. What I especially like is a dashboard like this. For each and every service that we run, I like to have, it's like headline dashboard. Is that service looking all right? How is it doing? What's it up to? When we develop new features for that service, we also add them to the dashboard. We co-evolve them over time. It's not just a static artifact. When I'm then deploying that service, I've got something that I build into my process as something to look at. I'll always open up the dashboard. I'll hit the deploy button. I'll merge my pull request, or whatever the process is, and check that things look ok. I still think I'm slightly better at spotting patterns during a deployment than an automated regression system. Especially because that system is averaging everything over time, whereas I know what change I've just made, I know what might be likely to be affected. Currently, Honeycomb doesn't really have a way to do something equivalent to this. It doesn't have this, let me just have an overview. At CircleCI, we're very familiar with the system we had. We didn't necessarily want to throw that all away just because we liked this newer way of instrumenting our code.

I asked myself, could we derive our metrics from our spans? We've said there aren't three pillars, metrics are just another way of looking at events. I really liked how simple and minimal the telemetry code we'd already added was. I didn't want to clutter it up with an extra layer of metrics code over the top. I had a lot of play routes, and I found an approach that I quite liked. We take our simple span interface, and we add one additional method, the RecordMetric, and that takes a description of a metric to record. Then we have a series of factories that can create those metrics descriptions. Here, we've got a timing function that creates a timing metric, an increment function creates an increment metric, and so on.

Then when we want to use that, we take our existing span, and we tell it to record a metric. Then under the hood, the code will then derive the metric from the spans data. When we say a timing metric, the library knows to take the timing metric using the duration of the span. Then these extra fields on the end here, tell us which tags we want to attach to our metrics. We have to be selective about which fields get recorded as metrics tags, and that's down to cost again. It's really expensive to send a high cardinality field to the metric backend. While we want those fields in our spans, we have to pick a few of those as being safe for sending off to our metric. In this case, for an HTTP request, all we can really afford to keep is the status code and the results. I think maybe we'd take the path or something as well. We make the author explicitly choose which fields they want to associate with their metrics. Other than that, all of the values for these, the value of result, the value of status, and the value of duration, all come from the properties of the span. Much like logs, our observability abstraction is then able to tee these off to a StatsD daemon, and from there it goes along to the metrics visualization tool of your choice. By ensuring that the metrics and events both have the same source in the code, there shouldn't be any discrepancies in the values between them.

We've got all of those three observability pillars, and only by adding really one set of instrumentation into our code, or one and a bit. If you've been researching observability over the last couple years, you'll have seen many variations of this slide, the three pillars of observability. In fact, I've realized we've been looking at this pillars diagram all wrong. If you look at it from above, the pillars are hollow, and they're all just windows into the events happening under the hood. If you think about this in terms of the execution model of what our code is even doing, it makes sense to me. Because ultimately, our code is executing a series of instructions with some arguments. The act of instrumenting code is deciding which of those instructions and arguments we think are worth recording. Those are events. Anything else we derive from that point is all from that same source.

Was This a Success?

At the start of this, I described this situation as an experiment. Was it a success? I think so. It wasn't a definite success. Yes, there are some pluses or minuses. On the whole, yes, I think this was a success. I definitely learned a lot about what was really important to me when adding telemetry into my code. One thing that I think was a definite success was having only one source of telemetry in the code. Not having a logline here and a span here and a metric here, just being like, this is the telemetry code, and separately, I worry about how that gets off to the backend. That bit I can say, I'm definitely happy with. The trace-first mindset, I also thought worked really well. Training us all to think in a sense of, here is a unit of work that I'm instrumenting, here's a unit of work that I'm wrapping with a span. I think that worked very well rather than having a logline at the start of something and a logline halfway down, and another log. Then a log at the end. It's wrapped. It's a unit of work. That part I think works great.

The bit that I think didn't necessarily work as well or was arguably not as worth it, was the whole like, let's burn everything to the ground and start again from scratch. The rethinking everything from first principles, a bit of a mixed bag. In the process of rediscovering everything, we spent a bunch of time bikeshedding, arguing over details, and it turned out to not be important, in the end, reinventing a bunch of wheels. Perhaps that process was still valuable, because in the process of having those discussions we built a much better understanding as a team of what was important to us. We also discovered what we needed to include and what we didn't need to include in this library we built up.

The other thing that was a bit of a mixed bag was not having any auto-instrumentation available. Because we've just invented this new approach, it's not like we can just grab something off the shelf that will automatically create all our spans for us when we make requests or database calls. The reason I call this one a mixed bag, and not just a straightforward negative is because I've had very mixed results with auto-instrumentation itself. I often find that the things that the person who wrote the library considers important might not align with the things that I consider important. Auto-instrumentation can be very chatty. I could be creating a lot of spans with details that I don't actually care about. The cost-benefit might be off there. Effectively, by not having this available to us, we were forced to instrument everything ourselves. In 2022, all the instrumentation code out there is open source. If you want to instrument code in a new way, you can fork it, you can refer to the way it's done elsewhere. It wasn't a huge undertaking, but it did give us the opportunity to reevaluate, did we want this field? Did we care about this bit? How do we want things to work? Yes, a bit of a mixed bag there.

What Now?

What next? What should you do with this story? My recommendations for what to take away from this would be, first off, try a trace-first approach in your own code, regardless of the interfaces or how you're actually plugging things into your code. When you're writing new code, try and think about units of work and traces as the primary way of getting telemetry out of your code. Start from that first and then think about logs and metrics afterwards, maybe if you need them. I think that was very effective for us. I think, just that way of thinking even just helped us structure our code a little bit nicer.

I tentatively recommend this concept of a unified telemetry abstraction, having this one place to produce logs, metrics, and events. The reason I'm tentative on this is because people still found it a little bit weird. It wasn't as intuitive. Maybe this isn't an area you really need to innovate in. I think it was effective, so I would still say it's a good approach, but it's not enough better than the current approach maybe to consider ripping things up, but worth a try. Some of you right there might be hearing the word unified telemetry abstraction, and you may think about OpenTelemetry. OpenTelemetry is an open source project. It's community led. It's got a lot of vendors signed up to it. Its goal is to be a unified telemetry platform, so when you grab things off the shelf they come with instrumentation out of the box, they've got all that detail in there. It means that when you're writing instrumentation in any language, you've got a standard way of doing it. It's a really great project.

One thing that's really different to OpenTelemetry, compared to what I've described, is that they still have this very strong distinction between traces, metrics, and logs as different APIs. I don't know if they're interested in trying to consolidate these. I haven't asked them yet. OpenTelemetry, you take traces, metrics, and logs separately, and you put them into the OpenTelemetry box. Then, based on your application configuration, it sends them off to one or more backends, as required. I think if I was doing this experiment with OpenTelemetry, I think I'd still want that extra layer over the top, so that in my application code, at least, I only had that one bit of telemetry code that then teed off to the various different types.

Questions and Answers

Bangser: The first couple questions were about maybe the specifics of experience with logs, and how does that translate to this world with event-first. A really good example I often hear come up is what do you do with the exception handling? What did you all do with that? Were those also sent to Honeycomb?

Mailer: Partially, this story takes place in Go. Anyone who's worked with Go will know it doesn't have exceptions. Actually, out of the box, the traditional pattern with Go errors is it just passes them up, and they don't get any context. There's no stack with them. That was actually just really annoying is errors, you just get an error that says, file not found. Which file? Where from? Then, a few years ago, Go added this error wrapping feature, so each layer of your stack, you put a bit more context in the error as it travels up, and you end up building your own stack traces, which is not the greatest system in the world, but it works. We do include all the layers of the error, or as much information as we've got about any error in the Honeycomb event. The convention, like the field we had there, like the error field, that would be the string of an error which effectively included all that stack information. It did mean that in a single trace, if you get an error midway down, you would get this span with error, and the parent with error, with a bit more information in the parent, that with error with a bit more information and so on. Generally, if that error wasn't handled in a way that turned into a non-error when it got all the way back up to the top, then that error would be the low error with all the extra detail and the full stack.

I have also done this approach in languages which do have stack traces. In those cases, I would have the error field be like the string version of the error, which is a bit shorter, and you can consume it. I would generally also add this full stack trace in another field. At my current job we use tools like Rollbar, Airbrake, and Sentry, and so on, like exception tracking tools. You can often pass the full stack of an error over to that and that handles grouping and those sorts of things. Then, basically make sure you include the trace ID when you send it to those tools, then you can link everything up and match things together.

Bangser: Of course, stack traces can get quite large, and no matter what tool you're getting to use, there probably will be a limit of how many characters, how large those fields can get. They can often handle quite a lot. There's some practicality there. Otherwise, yes, throw it all in, get it all included.

Mailer: When you're exploring the data, as I showed in those examples, there's multiple different modes of exploring the data. If you're plotting a time-series, and you want to group by, you probably don't want to group by a stack trace, because there's enough difference. Then once you've clicked into a trace, and you've found a single error, you probably do want to see that stack trace. Having those both things available in different fields, I found has worked quite well.

Bangser: Different fields, but same UI, same tool.

Mailer: Exactly. You don't have to go off and open up another tool and log in again, and go through your two factor auth code just to see a bit more detail of your error.

Bangser: You spoke a little bit about telemetry at scale, and so how you'll use logging levels with logs, and you might use sampling with this event-first. There's a question here around whether or not you send all service invocations to this event telemetry, or if you don't trace all of them? How did you handle that?

Mailer: In the story here, we were building a service from scratch. Every time we added a new service or a new dependency to it, we would add the tracing straightaway. In other scenarios we've had an existing service that has a lot of dependencies already. Generally, we'd be driven a bit more by need, so we'd go back and add telemetry when we're discovering a problem with that service. Actually, what I found when you do that piecemeal adoption, is that when you then look at a trace you go, there's a gap there. Why is there a gap? That gives you that nudge to go and add that extra tracing. Once you get used to seeing these traces and everything linking together, as soon as you get a trace that's got a bit missing, or a trace that like, "Somebody should have called me, where's my parents?" You start to go around and add all these correlation headers and so on, so everything all links together nicely.

The way tracing decisions work in most modern tracing tools is you either keep a whole trace, or none of it. The more of stuff you link together, the more expensive it is to say, yes, I'm keeping one. When you start to do dynamic tracing, then the cost of keeping each individual trace increases. That cost, hopefully, is proportional to the value of the trace. Something I've been really wanting to experiment with for a while is what I like to think of as like horizontal sampling. Can I keep a slimmed down version of some traces? Maybe I'll just keep the top level span from each service, but not the detailed spans on 9 out of 10. Then 1 out of 10, I'll keep the rest, or always on errors. That's not something I've had either enough of a need or the opportunity to explore yet. There's definitely some stuff you can do that's similar to log levels without quite being the same as log levels.

Bangser: That's exactly what I was thinking of when you were describing, it's almost like you can add a level to your sub spans that then you decide which one gets kept.

Mailer: Trying to have that dynamic tail sampling thing where you say, if it's an error, capture more. I've worked on systems in the past, where there was a special header that was undocumented, but if you set an HTTP header in your browser, then it would turn on debug logging for that request. Those tools are really helpful for like, ok, if I can reproduce an issue, then I can run it in production with more logging.

Bangser: Some more questions around the usability of this, and developer experience. There's a question here around the annoyance or level of friction in readability when you have to add those span and error calls across your stack. How does that work?

Mailer: Again, in our case, this was in Go, and all of the libraries and code we were using were quite modern in Go, so they all have this context parameter. Basically, the convention is to pass this around, and that's where you stuff all your data in. That's worked really well. I've had a different experience doing this in JavaScript and TypeScript, where the way the event loop works and the asynchronicity, sometimes the task can get lost. There are helper functions certainly in the Honeycomb instrumentation libraries, and I believe in the OpenTelemetry libraries as well, that basically let you give it hints when you lose that information. Most of the time, it just works and the spans and errors, everything links together.

Sometimes even in process, you have to give it a bit of a hint, to cross process boundaries. If you're doing common RPC things like HTTP or gRPC, then usually out of the box, you get effectively headers, which match up the IDs and pass them across process boundaries. If you're putting messages on RabbitMQ, or Kafka, or SQS, or queue based systems, then generally, at the moment, at least I've seen, you have to roll your own. You take the same headers that you would put in an HTTP request, and then you add them to the metadata of your queue system. Then you unpack them at the end if you want to link them up. There's various tradeoffs about whether that should be the same trace or different traces. I think there is a bit to think about there. Each individual choice is actually quite cheap to make. They're not really complicated things. Which is like, do I want to link this together? If so, I'll do it. Ultimately, it's just passing a couple of strings around. It's not that difficult.

Bangser: I think that also touched on that next question around correlating logs and traces and metadata, and how you're able to pass that through with things. Is there any other way in which you connect your event data with anything else? Viewing it in, as you said, the Elasticsearch that you had, and things like that, other than just trace IDs and things?

Mailer: Yes, not really. Some of the systems I've seen where we've added trace information to, already have a request ID concept within them. We add that to all of the various different backends. Generally, when I'm working on newer systems, they've never had that before, so the trace ID and the span IDs, those are treated as the things that tie everything together. When we send our errors off to Rollbar, we have a section for like, here's the trace ID, and so we can match those things back up again, and so on. Likewise, even in the logs, we make sure it's got the trace ID, it's got the span ID. If, say you're on a system where you're keeping all of your logs, maybe sending them to S3, but you're dynamically sampling your traces, you can still say, here's the dynamic desampled ID. You can match that up to the user ID and so on.

Bangser: I think you went into great detail about how you are using this third party SaaS product, but also having a bit of realistic awareness that things go down, and what you need to do in those cases, and sending that to a second location? Have you pragmatically had to deal with that? Did you ever try and send traces out when Honeycomb was down or was experiencing issues? What happened with that? Do you lose those traces from that view, or was it just delayed? How did that work?

Mailer: For Honeycomb, I think they have had a handful of outages but nothing major, so far, touch wood. In those cases, we're just a bit blind there, or we can log on to the box and have a look, or we can refer to our other backend. If Honeycomb is down, there's a little background queue in the process which tries to send. If that errors, then it just drops them. I've had this same experience with [inaudible 00:45:19] log backends, when sometimes the logging backend goes down, and then you just lose logs with it, or worse. One time, the application was writing to syslog, and then syslog was forwarding, and then to a central that ended up on a box somewhere. Then that box ran out of disk. I don't know if they'd thought about it, or this was just the out of the box defaults, but it had been configured with backpressure. What actually happened was the applications all crashed because one box filled up with logs. Of the two choices, I would always rather discard logs than take an outage when telemetry is down.

Bangser: Have to pick your priority sometimes. Hard decisions, but I think those make sense. Absolutely.

You ended that talk on a bit of an interesting balancing question of, what would you suggest around unified telemetry approach? You hedged a bit and said, I think I'd suggest it but maybe not. Often with these talks, we gain some more experience between when we deliver them and when we're actually here for the Q&A. Has anything changed around that? Would you be more bullish in suggesting that unified telemetry approach?

Mailer: I don't think things have hugely changed. The reasons I gave in the talk there, it was just a bit weird. Everyone looked at me like I was crazy, like, what are you doing? I've spoken to a few people that, "Yes, we started doing this. Actually, yes, it was great." Once you start doing it, it's like, yes, this feels fine. Just having to convince people over again, especially if you're a growing organization, when someone comes in and says, "How do we log?" "We don't do logging here." They're like, "What? Who are you? What are you doing?" Effectively, that's the biggest cost, is getting people on board, getting people to believe that actually it's fine. Don't worry about it, this works. I think, effectively, there's a chicken and egg problem there. Whereas if it becomes a more widespread practice, then the cost of convincing people goes down by quite a lot.

See more presentations with transcripts

Recorded at:

Dec 16, 2022

Glen Mailer

InfoQ Software Architects' Newsletter