Ben Sigelman is the CEO of Lightstep and the author of the Dapper paper that spawned distributed tracing discussions in the software industry. On the podcast today, Ben discusses observability with Wes, along with his thoughts on logging, metrics, and tracing. The two discuss detection and refinement as the real problems when it comes to diagnosing and troubleshooting incidents with data. The podcast is full of useful tips on building and implementing an effective observability strategy.
Key Takeaways
- If you’re getting woken up for an alert, it should actually be an emergency. When that happens, things to think about include: when did this happen, how quickly is it changing, how did it change, and what things in my entire system are correlated with that change.
- A reality that seems to be happening in our industry is that we’re coupling the move to microservices with a move to allowing teams to fully self-determine technology stacks. This is dangerous because we’re not at the stage where all languages/tools/frameworks are equivalent.
- While a service mesh offers great potential for integrations at layer 7, many people have unrealistic expectations of how much observability a service mesh will enable. The service mesh does a great job of showing you the communication between services, but the details often get lost in the work that’s being done inside a service. Service owners still need to do much more work to instrument applications.
- Too many people focus on the 3 Pillars of Observability. While logs, metrics, and tracing are important, observability strategy ought to be more focused on the core workflows and needs around detection and refinement.
- Logging about individual transactions is better done with tracing. It’s unaffordable at scale to do otherwise.
- Just as with logging, metrics about individual transactions are better handled by the tracing instrumentation. Application-level metrics, such as queue depth, are the ones that truly need dedicated metric instrumentation.
- The problem with metrics is that the only tool you have in a metrics system to explain the variations you’re seeing is grouping by tags. The tags you want to group by have high cardinality, so you can’t group by them. You end up in a catch-22.
- Tracing is about taking traces and doing something useful with them. If you look at hundreds or thousands of traces, you can answer really important questions about what’s changing in terms of workloads and dependencies in a system, with evidence.
- When it comes to serverless, tracing is more important than ever because everything is so ephemeral. Node is one of the most popular serverless languages/frameworks and, unfortunately, also one of the hardest of all to trace.
- The most important thing is to make sure that you choose something portable for the actual instrumentation piece of a distributed tracing system. You don’t want to go back and rip out the instrumentation because you want to switch vendors. This is becoming conventional wisdom.
Show Notes
What should an on-call system do when being paged?
- 01:50 First, my condolences to whomever is on call …
- 02:00 The first question is: what are you alerting on? There are different ways of setting up alerting.
- 02:05 At Google we saw a great diversity of approaches to when you want to be notified and woken up about something not going well.
- 02:15 There were two schools of thought: one was where, every time something bad happens, as part of the post-mortem you set up an alert that would detect the root cause to prevent its recurrence next time.
- 02:30 The other approach was to alert on the things your consumers depend on - for example, if you were in the middle of the stack, you would alert on the reliability of what they are calling.
- 02:50 That’s the approach of alerting on SLIs (Service Level Indicators), although I do see a lot of people alerting on what are potential root causes (see the sketch below).
- 03:00 It’s possible, where you are alerted on root causes, that you’ll be woken up for something that is irrelevant to your consumers.
- 03:10 For example, you might have set up an alert a year ago because you had an outage on some dependency on your service, and so you set up alerting around that.
- 03:15 It then turns out that you get woken up due to an alert that is no longer relevant due to subsequent code change in the interim.
- 03:25 The natural thing to do is to silence the alert until the morning, and go back to sleep.
- 03:35 If you’re getting woken up, it had better be an emergency.
- 03:40 There are a lot of times where you get woken up for things that used to be an emergency but which aren’t an emergency any more.
- 03:50 That creates a boy-who-cried-wolf [https://en.wikipedia.org/wiki/The_Boy_Who_Cried_Wolf] situation.
- 03:55 So if you get woken up, you should know it’s an emergency, and you don’t have to wonder whether you should have been woken up.
- 04:05 In a real emergency, you know you don’t have much time and are feeling anxious about it.
- 04:10 The first thing I would want to understand in that situation is knowing when the event happened, how quickly it happened or had been changing prior to the alert, and what things in my system are highly correlated with that change.
- 04:25 Those things may be in your own service, which will be relatively easy: a common situation in a CI/CD pipeline is when something gets through that shouldn’t have, and is rolled back.
- 04:40 The other relatively easy situation is that someone else did a release - which in a micro-services architecture is really hard to detect - but you have to detect it and get it rolled back.
- 04:50 The worst-case scenario is when you actually have to write some code, integrate and test it in order to fix this.
- 05:00 Detecting which change is responsible for the SLI regression that woke you up in the first place can be a very easy thing with the right tools, and can take hours or days if not.
- 05:15 The challenge with moving away from monolithic architectures is that you cannot just look at one thing in order to find out what has changed.
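As a rough sketch of the distinction above (not from the podcast; the thresholds and the request-record shape are assumptions of mine), paging on the SLI your consumers experience rather than on a remembered internal cause might look like this:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    ok: bool
    latency_ms: float

def should_page(window: List[Request]) -> bool:
    """window: the most recent requests as seen by this service's consumers."""
    if not window:
        return False
    error_rate = sum(1 for r in window if not r.ok) / len(window)
    slow_rate = sum(1 for r in window if r.latency_ms > 500) / len(window)
    # Page only when the SLI consumers depend on is out of bounds, not when an
    # internal condition like "dependency X restarted" recurs - that condition
    # may no longer matter after a year of code changes.
    return error_rate > 0.01 or slow_rate > 0.05
```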
We started with logging, then more logging, then alerting on everything - and that brings alert fatigue.
- 05:50 I won’t say which, but there is a major Google consumer product whose configuration file for reporting was auto-generated and is 18k lines long.
- 06:10 On the other side, you had services that were pared down to 12 things which were vital, and they had per-second resolution on those changes.
- 06:20 If there was any indication that any change was wrong, they would hop on it - and that was a much saner on-call shift.
How high is the bar for micro-services?
- 07:10 It’s a matter of opinion, but I think that there’s a lot that is necessary but not sufficient.
- 07:30 The pattern is agile one-pizza or two-pizza teams - small teams running services - that are left with complete freedom of implementation as long as their APIs do what they should.
- 08:05 Everyone agrees that APIs should be stable, but the teams are left to their own devices about technology choices.
- 08:15 I think that is an incredibly dangerous road to walk: it works quite well at the beginning, and you don’t find out what the problems are until you’re quite far down the road.
- 08:30 Once you go from 15 to 50 micro-services, you find you can’t amortise the cost of multiple platforms being developed simultaneously in your own org - it gets very difficult.
- 08:40 There are a number of cross-cutting concerns: orchestration, service discovery, monitoring, security - that sort of thing - that have some sort of dependency on the stack that your engineers have decided upon.
- 08:50 What I see is that people will adopt devops, and have a “hippie” ethos where everyone gets to make their own decisions, and there’s no global view of anything because there’s no application level hook.
- 09:15 I like more the idea of individual teams making their own decisions about the way they build their software, once they’ve done a multiple-choice test about the basic stack they are using.
- 09:30 You don’t need to mandate that everyone at the company uses a single language or platform, but maybe you can narrow it down to three or four.
With the rise of service meshes, what are your views on libraries versus frameworks versus service meshes?
- 10:05 I’m certainly enthusiastic about the move to service meshes, so that applications have a standard layer that provides the hard parts: circuit breaking, service discovery, monitoring, etc.
- 10:30 People’s expectations about what service meshes will deliver are often unrealistic or inflated.
- 10:35 I may be biased, because I’m always thinking about tracing, but there is an assumption that if you deploy a service mesh, because it’s able to see all the calls between services, you can do distributed tracing.
- 10:55 Unfortunately that’s not true - you are able to understand the individual calls, but the most difficult part of getting tracing right from an integration standpoint is not between the services but rather within the services.
- 11:15 If you’re in a service and you receive a request, do some work, and then delegate off to another service with another request, the hard part is not losing the trace in the middle of that work.
- 11:30 There may be internal queues and batching, and that’s the hard thing - service meshes don’t help at all there, because they only see the traffic between services (see the sketch below).
- 11:40 I do see people deploying service meshes and then wondering why they don’t automatically have a bunch of things.
- 11:50 Getting back to the libraries and frameworks thing: I have a hypothesis that service meshes and containers will go a similar way.
- 12:05 When containers first became popular (they aren’t a new idea), we all understood that we had to get to a world where it was easy to deploy software, and what we were working with was a VM and a kernel.
- 12:30 Containers were a way of taking what we were using and making it much more agile and gave higher velocity.
- 12:50 In as much as a container is similar to a VM, the VM is still the wrong programming model.
- 12:55 You can see the attraction of things like Lambda or serverless frameworks or micro-service frameworks: you can just think about your API and business logic and not worry about what Linux kernel version you’re using.
- 13:10 I expect that kernel containers will continue to be a vital part of our dependency chain for many years to come, but we won’t think about them at all.
- 13:25 The conversation will be about the application layer - the fact that it is too hard to program.
- 13:30 The fact that we’re still talking about containers means we aren’t there yet.
- 13:35 Service meshes are a vital transitional technology, but as everyone moves to a mature RPC library (like gRPC), someone’s going to wonder why we’re making four extra copies of the bytes being sent over the network.
- 14:05 I suspect that service meshes will continue to exist for a long time for some workloads, but I expect the service mesh to be a transitional technology that moves us to user-space libraries that do a lot more work than they do now.
- 14:40 Service meshes are a more convenient place for this to happen, because you don’t want to have to keep redeploying your application, but once it’s over, I expect technology will move up the stack and into the process itself.
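To make the in-process point concrete, here is a minimal sketch using the OpenTracing Python API (my own illustration; the queue, handler and downstream call are hypothetical) of the part a service mesh cannot do for you - carrying the trace context through internal work such as a queue:

```python
import queue

import opentracing
from opentracing.ext import tags
from opentracing.propagation import Format

work_queue = queue.Queue()

def handle_request(http_headers):
    tracer = opentracing.global_tracer()
    # Join the trace started upstream: the caller (or its sidecar) injected
    # these headers, and that is where the mesh's visibility ends.
    upstream_ctx = tracer.extract(Format.HTTP_HEADERS, http_headers)
    with tracer.start_active_span("handle_request", child_of=upstream_ctx) as scope:
        scope.span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_SERVER)
        # Hand the span context to the internal queue explicitly; the mesh
        # has no visibility into this in-process hop.
        work_queue.put((scope.span.context, {"payload": "..."}))

def worker():
    tracer = opentracing.global_tracer()
    parent_ctx, item = work_queue.get()
    # Continue the same trace from the context carried through the queue.
    with tracer.start_active_span("process_item", child_of=parent_ctx) as scope:
        scope.span.log_kv({"event": "dequeued"})
        call_downstream(scope.span)

def call_downstream(parent_span):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("call_downstream", child_of=parent_span) as scope:
        headers = {}
        # Re-inject the context so the next service (and its sidecar) can join.
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
        # ... make the outbound request with `headers` attached ...
```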
Should you be polyglot, or does having a limited number of languages make sense in a service mesh world?
- 16:15 It hinges on the maturity of certain things that don’t exist yet.
- 16:20 Today I don’t think that’s sufficient, because you are still dependent upon non-standard application layer libraries.
- 16:30 Let’s be more concrete: say you were working in Java, and depending upon Dropwizard to have a reasonable sense of what’s going on in the application.
- 16:50 Things like OpenTracing actually help a great deal here, at least for observability, but there are similar concerns around security and service discovery, and containers alone and a service mesh alone don’t expose enough surface area to do those well.
- 17:05 I think that we either need to see people choose a short list of frameworks to build their applications on, or wait for the industry to build standards so that, no matter what language you use to build an application, you can get your cross-cutting concerns checked off.
- 17:30 At the moment, we’re not in that world, so if you’re developing in 12-15 different languages, it’s a very difficult thing to develop a coherent strategy around monitoring, observability, security, when you have a diversity of languages and frameworks.
“Observability is not about the three pillars; that’s just data - it’s about detection and refinement” - can you talk a bit more about that?
- 18:30 I’m not a big fan of the term “the three pillars of observability” - my concern is pretty simple: I’ve seen a lot of customers where they went into their observability strategy with the pillars of logs, metrics and traces.
- 19:05 They check the three boxes, and then still get woken up in the middle of the night despite having checked the boxes - I think that’s a very frustrating experience for them.
- 19:20 It’s not that I have any issue with logs, metrics or tracing - they’re all important - but in my mind, it’s better to think about them as pipes of data that you need rather than pillars.
- 19:40 Your observability strategy ought to be more organised around the workflows and needs.
- 19:50 Detection is about having incredible visibility into your SLIs.
- 20:05 Once you have precise control of your SLIs, you need to take the set of tens or hundreds or thousands of things and refine that search space down to a shortlist of credible hypotheses about what is going on.
- 20:35 Any vendor (including Lightstep) who tells you that you will automatically root cause all of your issues in a distributed system is lying to you - it’s not possible.
- 20:45 The most we can hope to do is eliminate hypotheses en masse and get you to something that a human being can actually get through in a few seconds.
- 21:05 It’s really dangerous when you’re on call to rely on institutional knowledge - it’s difficult to keep that knowledge up-to-date.
- 21:20 The tools have to be able to do that for you - that’s the bit of observability that’s really difficult.
- 21:25 I do think that metrics are really valuable for modelling SLIs and can be useful for refining them.
- 21:30 Logs, I think, are just starting to converge with tracing, but they are about events at different frequencies - and they are a good way of explaining what’s going on in a system.
- 21:55 It’s not sufficient to say that we have logs, metrics and tracing and move on - that’s not enough.
What recommendations do you have for logging done right?
- 22:40 I was at a talk by Lyft a couple of years ago on their use of logging and micro-services.
- 23:00 I asked him what the summary was (since I missed it) and his advice was: micro-services and logging: don’t do it.
- 23:10 He was being glib, but you can think of logging about individual (distributed) transactions, and everything else.
- 23:30 The “everything else” - process startup, general information - is all good: keep doing that.
- 23:35 If you log the individual transactions, then at scale you won’t be able to afford it - the ROI isn’t there.
- 23:50 You’re taking a logging system like ELK which was designed in the monolithic era, and as you multiply the number of micro-services, you’re almost linearly multiplying the amount of logging data you’re generating.
- 24:00 If you move to a world where you have 100 micro-services, you have added orders of magnitude to your logging bill.
- 24:10 At the same time, your logging is actually less useful (even ignoring the cost of it) because you need to understand how events in one process affect events in another process, and logging systems weren’t designed for that.
- 24:25 Distributed tracing is really logging redone for distributed systems (see the sketch below).
- 24:30 It is a log of activity for a transaction, but to make it work we have to do event sampling, because of the cost issue, and we have to capture the causality between events in different processes.
- 24:45 It’s really just a specialised form of logging.
- 24:50 I think of distributed tracing as being the successor to transaction logging.
- 24:55 Logging about other things is totally fine, but probably not what you’re interested in if you’re doing difficult investigations of operational issues.
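As a minimal sketch of “tracing as the successor to transaction logging” (my own example, using the OpenTracing Python API; the checkout scenario is hypothetical), the events you would previously have logged per transaction become span logs, so they carry the trace identity, causality and sampling decision with them:

```python
import opentracing

def checkout(cart):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("checkout") as scope:
        span = scope.span
        span.set_tag("cart.size", len(cart))
        # Previously a free-floating log line, e.g.
        # logging.info("charging card for %d items", len(cart)):
        span.log_kv({"event": "charging_card", "items": len(cart)})
        # ... call the payment service, propagating span.context so its
        # events land in the same trace ...
        span.log_kv({"event": "charge_succeeded"})
```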
What about metrics?
- 25:20 That is interesting - I would refer to what I said earlier with regards to logging transactions vs everything else: metrics is the same.
- 25:30 From an instrumentation standpoint you shouldn’t have to instrument separately for metrics and for transactions - if you’re trying to measure how long it took to service a request, that should come from the tracing implementation.
- 25:50 There are also metrics about other things: CPU load, memory usage, as well as application-level things like queue depth.
- 26:00 Those things do need plain metric instrumentation - I call those gauge metrics.
- 26:05 So gauge metrics require specialised metric instrumentation - everything else, if it’s to do with a transaction, can come from the same instrumentation you’re using for tracing (see the sketch below).
- 26:15 We talked about OpenTracing earlier; OpenTracing is actually merging with OpenCensus - our shared goal is to find a way to prevent developers having to tightly couple themselves to downstream observability choices when they’re adding instrumentation.
- 26:30 It will also provide an ecosystem of instrumentation for common libraries that will be ready made for use in production.
- 26:40 Both of these projects share these goals, so it is a good outcome to merge both the projects.
- 26:50 One of the things I like about OpenCensus is that they have had metrics in their charter from the start; at OpenTracing, we made the mistake of scoping too narrowly - it would have been better to think about instrumentation in general.
- 27:10 The tracing implementation should also be used for metrics, so this merged project is going to be a constellation of separate concerns - metrics, logging, tracing, etc. - all in one well-reasoned, well-factored, decomposable library.
- 27:30 We announced some of that a few weeks ago [https://www.infoq.com/news/2019/04/opencensus-opentracing-merge] and we’ll be announcing more later in May.
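Here is a minimal sketch of the instrumentation split described above (my own; the record_latency and record_gauge sinks are hypothetical stand-ins for a metrics system, not part of OpenTracing or OpenCensus): transaction latency falls out of the span you already create for tracing, while gauge metrics such as queue depth need their own instrumentation:

```python
import time

import opentracing

def record_latency(name, seconds, tags=None):
    """Hypothetical sink: feed a latency histogram in your metrics system."""

def record_gauge(name, value):
    """Hypothetical sink: report a point-in-time gauge value."""

def handle_request(request, work_queue):
    tracer = opentracing.global_tracer()
    start = time.monotonic()
    with tracer.start_active_span("handle_request") as scope:
        scope.span.set_tag("endpoint", request["endpoint"])
        # ... do the actual work ...
    # One instrumentation point, two outputs: the span goes to the tracer,
    # and the same measurement becomes a request-latency metric.
    record_latency("request_duration_seconds", time.monotonic() - start,
                   tags={"endpoint": request["endpoint"]})
    # Gauge metrics describe the process rather than any one transaction,
    # so they are instrumented separately.
    record_gauge("work_queue_depth", work_queue.qsize())
```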
What characteristics should you build in for metrics?
- 28:10 The trouble with metrics in particular is that you’ll be looking at a squiggly line in a graph somewhere, and when it squiggles in the wrong direction you need to understand why that happened.
- 28:20 The only way you can do that in a metric system is by grouping by some tag, and hoping that the reason something spiked is to do with one tag.
- 28:30 As a concrete example, let’s say that you’re grouping by hostname, and you might be able to see that a single host is responsible for all of the latency, so you can diagnose further.
- 28:50 The trouble is that a lot of those things have so many values that the cardinality of the metric grows to be very high.
- 29:00 The cardinality of the metric is directly proportional to the cost of the metrics - and any metrics vendor will charge you for that.
- 29:20 You end up in a catch-22 where the only tool you have to explain the variations is grouping by tag, and the tags you really want to group by have high cardinality, so you can’t (see the sketch below).
- 29:40 I’m a little bit focused on tracing, but I do think it’s incredibly helpful in this situation: because metrics and traces are mostly about transactions, you can take an intelligent sample of traces and do the kind of analysis that you want.
- 30:10 When it’s all done, you can take your investigation into individual transaction traces (which you can’t do with metrics anyway), so it ends up being a much more flexible way to understand particular failure modes in a distributed system.
- 30:25 The only caveat is that you need to have distributed tracing deployed, so there’s an integration process to get there.
- 30:30 I think it’s a much more powerful way to understand systems than grouping by tags, which ends up being too expensive or too coarse-grained.
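A back-of-the-envelope sketch of the catch-22 (the numbers are purely illustrative): a metric’s series count is roughly the product of its tag cardinalities, and that count is what you pay for, so the high-cardinality tag you most want to group by is exactly the one you can’t afford:

```python
# Tag cardinalities (illustrative numbers only).
hosts = 200
endpoints = 50
status_codes = 5
customers = 100_000

# A metric's worst-case series count is the product of its tag cardinalities,
# and most metrics vendors bill per series.
series_without_customer = hosts * endpoints * status_codes
series_with_customer = series_without_customer * customers

print(series_without_customer)  # 50,000 series: affordable
print(series_with_customer)     # 5,000,000,000 series: not affordable
# On a trace, customer ID is just one tag on one sampled span, so the same
# question ("which customer is driving the latency?") stays answerable.
```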
What are some of the mistakes you see people make about tracing?
- 30:55 The one I alluded to earlier: tracing is not a checkbox.
- 31:00 Tracing is a data type, and most traces (in terms of population) are boring on their own.
- 31:10 They are often not very large - we might imagine a trace spanning hundreds or thousands of services - but you might also have ones that are pretty small, touching three or four services.
- 31:25 Taken on their own, they aren’t very interesting - but what’s interesting is if you take hundreds of thousands of them then patterns emerge that are not visible from any other data source.
- 31:40 To me, tracing is about taking traces and doing something useful with them - going back to the detection and refinement piece from earlier, you can answer some really important questions about workloads and dependencies, with evidence.
- 32:05 That’s only possible if you can consider hundreds or thousands of them in the context of a particular investigation.
- 32:10 That’s what tracing should be, but for most people tracing is a lousy parametric search on stale data followed by manually investigating individual services, which is a time-consuming and not very fruitful effort.
How can you look at the traces statistically?
- 32:40 The cool thing about traces is that (especially if you’re looking at latency or errors) they do understand how individual transactions propagate through an entire system.
- 32:50 So if A depends on B depends on C depends on D and D is having a bad day, that will certainly affect A, B, and C - and the trace can see that on an individual basis.
- 33:00 If you look at them in aggregate, you can see that and provide evidence.
- 33:05 It’s easy to look programmatically at a single trace and understand those relationships, and where the latency or error originated from.
- 33:20 Looking at the statistics of that data can provide very high confidence about what’s going on - going back to earlier, it lets you eliminate huge numbers of hypotheses quickly.
- 33:30 Let’s say that D is having a bad day: we’re going to see terrible performance in A, B, and C - especially if you are the person responsible for A, who then wakes up the person responsible for B.
- 34:00 A tracing system can determine that B and C are innocent here, and that they are just the messengers for a failure that’s downstream (see the sketch below).
- 34:10 There’s no way of getting that information reliably without a lot of context and knowledge, which is a really dangerous thing to rely on when you’re on call.
- 34:15 That’s why I like tracing.
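As one sketch of what looking at traces statistically can mean (my own illustration, not Lightstep’s algorithm; it assumes child spans don’t overlap), aggregating each span’s self time by service across many traces attributes the latency to D and shows B and C to be innocent messengers:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float

def self_time_by_service(traces: List[List[Span]]) -> Dict[str, float]:
    """Sum each span's duration minus its children's, grouped by service."""
    totals = defaultdict(float)
    for spans in traces:
        child_time = defaultdict(float)
        for span in spans:
            if span.parent_id is not None:
                child_time[span.parent_id] += span.duration_ms
        for span in spans:
            totals[span.service] += max(span.duration_ms - child_time[span.span_id], 0.0)
    return dict(totals)

# A -> B -> C -> D, where D is having a bad day.
trace = [
    Span("a", None, "A", 1050.0),
    Span("b", "a", "B", 1030.0),
    Span("c", "b", "C", 1010.0),
    Span("d", "c", "D", 1000.0),
]
print(self_time_by_service([trace]))
# {'A': 20.0, 'B': 20.0, 'C': 10.0, 'D': 1000.0} - D owns the latency.
```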
Can you paint the picture of the tracing landscape?
- 34:50 I am incredibly biased (being the CEO and co-founder of a vendor) - I will try to be concise about the landscape.
- 35:05 I see a number of projects that are concerned with integration from an observability standpoint.
- 35:10 A service mesh can help with that, and OpenTracing can help with that (as can the OpenCensus project, which it is merging with).
- 35:30 There is no downside to adopting those things, because they are incredible value adds to get data out of your system.
- 35:40 Then there’s the matter of doing something with that data.
- 35:45 Going back to my three pillars critique; they will give you the data, but you still need to process it.
- 35:50 Here’s where things get more interesting and more complicated.
- 36:00 There are a number of vendors who over-promised, and a number of open-source projects that are somewhat underdeveloped from a product standpoint.
- 36:15 What I would suggest is to think hard about where you are in terms of your own scale.
- 36:20 The main thing the vendors don’t advertise is what scale they are appropriate for.
- 36:25 I have a slide which I’ve used in the past “Design your own observability system” - high cardinality, long retention window, no sampling and ad-hoc queries: choose three.
- 36:45 You can’t have all four of those things, and I have seen vendors who don’t do one of those things well.
- 37:00 If you’re operating at high scale, beware of vendors who don’t have existing customers at scale.
- 37:15 To a certain extent, if you see vendors who operate at really high scale, there are probably some things that you want to have at your scale because it’s a law of nature thing.
- 37:30 I wish that the landscape was a bit more segmented - less around feature lists and more around the scale of the things being monitored - since that’s missing in most of the analyses that I see.
What is the state of serverless and distributed tracing today?
- 38:00 Serverless as a category is a bit confusing, because it’s defined as a negative.
- 38:10 If we’re talking functions as a service (like Lambda) - tracing is more important there than anywhere.
- 38:20 It’s a little difficult to deploy in some environments, but it does depend on the language being used.
- 38:25 Node is very popular to deploy in serverless, and unfortunately node is probably the hardest popular framework/language of all to trace.
- 38:35 I would say for serverless, it is totally necessary.
- 38:40 For a small number of micro-services you can probably just cheat, but for serverless it’s a nightmare because everything is ephemeral.
What about insights to cold start problems for services?
- 39:00 With distributed traces, they can be tagged with certain information that can be cross-correlated with other information that you have.
- 39:10 It goes back to my feeling that distributed tracing is a data source, not a be-all and end-all.
- 39:15 The observability solution that you’re using should allow you to take traces and cross-correlate them with other pieces of information.
- 39:20 For example, if you see high latency in a service, it should be able to correlate that with something that is spinning up.
Any final thoughts or recommendations?
- 39:50 The first and most important thing: choose something portable for the instrumentation integration piece.
- 40:00 That’s getting to be an easier and easier decision as OpenTracing becomes more mature.
- 40:05 You don’t want to have to go back and rip out your instrumentation because you’ve decided to switch vendors (see the sketch at the end of these notes).
- 40:20 The other thing I would emphasise is to think through the on-call scenario that we started with, and also a “you have a month to make an improvement” scenario.
- 40:30 You need to think about the workflows that you want, and to choose a stack for observability that provides for those workflows.
- 40:40 Don’t fall into the trap of thinking that any distributed tracing solution will just work.
- 40:50 There are many tools out there that display distributed traces but don’t help to solve problems.
- 40:55 Try to think back from workflows and use cases and go from there.
- 41:00 Lightstep recently released a guide about observability and I gave a talk at KubeCon and QCon about thinking about observability and traces.
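Finally, a minimal sketch of the portability recommendation (my own; Jaeger is shown purely as one example OpenTracing-compatible backend, and the sampler settings are illustrative): application code depends only on the vendor-neutral API, and the concrete tracer is bound once at startup, so switching vendors doesn’t mean ripping out instrumentation:

```python
import opentracing

def init_tracing(service_name):
    # Bound once at startup; swap this body to change vendors. Jaeger is just
    # one example of an OpenTracing-compatible tracer.
    from jaeger_client import Config
    tracer = Config(
        config={"sampler": {"type": "const", "param": 1}},
        service_name=service_name,
    ).initialize_tracer()
    opentracing.set_global_tracer(tracer)

# Application code elsewhere never imports the vendor library:
def do_work():
    with opentracing.global_tracer().start_active_span("do_work") as scope:
        scope.span.set_tag("component", "worker")
```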