Charity Majors on Honeycomb.io, the Social Side of Debugging and Testing in Production

Oct 07, 2017

In this podcast, recorded live at Strange Loop 2017, Wes talks to Charity, cofounder and CEO of honeycomb.io. They discuss the social side of debugging and her Strange Loop talk “Observability for Emerging Infra: What got you Here Won't get you There”. Other topics include advice for testing in production, shadowing and splitting traffic, and sampling and aggregation.

Key Takeaways

Statistical sampling allows for collecting more detailed information while storing less data, and can be tuned for different event types.
Testing in production is possible with canaries, shadowing requests, and feature switches.
Pulling data out of systems is just noise - it becomes valuable once someone has looked at it and indicates the meaning behind it.
Instrumenting isn’t just about problem detection - it can be used to ask business questions later.
You can get 80% of the benefit from 20% of the work in instrumenting the systems.

Show Notes

What does honeycomb.io do?

2:30 We started on January 1st 2016, and started off writing a storage engine, but we’ve had customers for a little more than a year now.
2:40 It’s only in the last five or six months that we’ve become more popular.
2:50 We’re trying to help people understand the world of very complex systems.
2:55 This was born out of my experience with Parse; my co-founder Christine Yen was at Parse too.
3:05 Around the time we got acquired by Facebook we were hosting about 60,000 apps - and the system was effectively undebuggable.
3:20 It wasn’t that it was built poorly - we had Graphite, Ganglia - all the best metrics open source had to offer.
3:30 A few times a week we would have a customer coming in and saying “Parse is down” or there’s some problem.
3:40 It would take hours, days, sometimes weeks to fix the long tail of these problems.
3:45 When we were acquired by Facebook and we started using their Scuba tool we cut the debugging time down from days to minutes.

How did Scuba do that?

4:00 Scuba is an in-memory system that takes raw events, keeps them in a columnar store, and lets you explore.
4:15 So it’s not a dashboard of “here are some results” - you start with a question, and then iterate.
4:20 So many of the problems are hard to debug because they happen so rarely.
4:30 It takes a combination of specific conditions to cause a problem - this version of iOS, that version of the database.
4:40 What you’re doing is trying to answer the question - what do all of these complaints have in common?
4:50 If you ask engineers who have worked at Facebook what they thought of Scuba, they get all misty-eyed.
5:00 However, it wasn’t without problems - forget good defaults, it didn’t even have any defaults. Engineers would pass round text files of useful links they had constructed.
5:10 If you invested the time to get good at it, you became a 10x engineer.

So is honeycomb.io a product of what you learned by using Scuba?

5:15 Very much so. The first six months we were building something similar to what we remembered.
5:30 After that, it was focussed on building something that people want to use.
5:50 Having the best back-end alone isn’t going to make us succeed; the ability to bring teams up to the level of the best debugger in every situation will.
6:00 Think about your bash history; you have a record of all the things that you have run, repeatedly. You want access to them.
6:15 It would be instructive to view a SQL wizard’s bash history, so you could see what things they have done.
6:25 We built this history into honeycomb, and also your team’s history, so that if you’re on-call and you get paged, you can see what the tribal knowledge is to solve the problem.
7:15 I’ve been the debugger of last resort before, and it’s fun the first time, but you don’t want to get called out on your honeymoon.

How do you expose these things in a social history?

7:55 If you’re using Slack while investigating an issue, and you paste links to graphing systems identifying what the problem is, then you can let others click through and see the problems for themselves.
8:30 We’re thinking of these like Spotify playlists; unlike a dashboard, which is presented as a solution, we think of these as a starting point rather than a stopping point.

Are they exposed through Slack?

8:50 You can go to the honeycomb.io console - you can see what your team has been running recently.
9:00 We’re thinking of exposing them as heatmaps; you can see them as an emergent system run-book - there’s a gap between doing it and writing it down.
9:10 You have a view into the questions that other people are asking.
9:25 Pulling data out of systems is just noise - it becomes valuable once someone has looked at it and indicates the meaning behind it.
10:05 When you join a team that’s been around for a while, it feels like you can never catch up because there’s so much tribal knowledge.
10:30 It’s a smell if the system can’t be explored.

How do the events get into the system?

11:05 We take the events from arbitrary sources; you can send them into the API or via SDKs, and you can run an agent that tails a structured log writer.
11:30 You can ingest Apache logs by converting the unstructured text log data to structured data on the fly.
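
As an illustration of converting an unstructured log line into a structured event, here is a minimal sketch that parses one line in Apache's Common Log Format into a dictionary. The function name `parse_clf_line` and the field names are assumptions made for this example; it is not Honeycomb's agent or API.

```python
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf_line(line):
    """Turn one unstructured access-log line into a structured event (dict)."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None  # the line is not in the expected format
    event = match.groupdict()
    event["status"] = int(event["status"])
    # Apache writes "-" when no response body was sent
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_clf_line(sample)
print(event)
```

Once each line is a dictionary of named fields, it can be filtered and grouped by any of those fields rather than searched as text.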

How do you get there?

11:50 If you don’t have a big distributed system, you don’t need this - if your boring technology is working for you that’s great.
12:50 If you have hundreds of micro-services and many databases, the hardest problem can be working out where the problem is.
12:15 If you have that category of problem, then the number of engineering hours spent trying to locate the problem goes up.
12:40 In that case, you have to bite the bullet and acknowledge that it’s a new category of problems.
12:50 One of the solutions is distributed tracing, and the other is honeycomb.
13:00 Distributed tracing is very much depth first; you pick something, and it shows you the history of everything.
13:05 The hard thing is finding what to trace - that’s something that honeycomb does very well.

What does honeycomb do differently?

13:30 You will need to wrap every call-out to another service with a header and you can embed timing information in there.
13:45 This is good because information will percolate to the edge; if you’re running HA proxy then you can put all that timing information through the stack.
13:55 It gives you a quick glance to find out where the problem lies.
14:10 We deal with raw requests - we don’t deal with aggregation.
14:15 We do read-time aggregation for display purposes, but we don’t do write aggregation.
14:20 Finding problems is often finding needles in haystacks - you need to go back to raw data to find it.
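
To illustrate the difference between read-time and write-time aggregation, here is a minimal sketch that stores raw events unchanged and only aggregates when a question is asked. The event fields and the helper `aggregate_at_read` are hypothetical and chosen for this example only.

```python
from statistics import mean

# Hypothetical raw events, one dict per request, stored exactly as received.
raw_events = [
    {"endpoint": "/login", "duration_ms": 20},
    {"endpoint": "/login", "duration_ms": 25},
    {"endpoint": "/login", "duration_ms": 900},
    {"endpoint": "/files", "duration_ms": 40},
]


def aggregate_at_read(events, group_key, value_key, fn):
    """Group raw events and apply an aggregate function chosen at query time."""
    groups = {}
    for event in events:
        groups.setdefault(event[group_key], []).append(event[value_key])
    return {key: fn(values) for key, values in groups.items()}


# Different questions can be asked of the same raw data after the fact.
avg_by_endpoint = aggregate_at_read(raw_events, "endpoint", "duration_ms", mean)
max_by_endpoint = aggregate_at_read(raw_events, "endpoint", "duration_ms", max)

# The needle is still in the haystack: the individual slow request survives.
slow_requests = [e for e in raw_events if e["duration_ms"] > 500]

print(avg_by_endpoint, max_by_endpoint, slow_requests)
```

Had only the per-endpoint average been written at ingest time, the single 900 ms request could not be recovered later.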

You said: monitoring is for operating systems software; instrumentation is for writing software; observability is for understanding systems. What did you mean by that?

14:45 There’s been a recent push for monitoring for everyone, not just for ops.
14:50 Monitoring is an act of operating software - it’s problem oriented, and leans towards failures.

Where do you draw the line with what to monitor?

15:20 I think monitoring could be for operating your systems and services - looking for results and making sure they are still within bounds.
15:30 There are known unknowns - these come out of debugging and you can subsequently monitor for it.
15:45 Sometimes it includes instrumenting your code - it’s all about white box vs black box monitoring.
15:50 Fundamentally it’s about the operators’ perspective; what they are doing.
16:00 Instrumenting is the developers’ perspective; how am I going to understand this once I have problems?
16:05 Monitoring used to be about making sure there were predictable results; instrumentation is slightly different since you can take advantage of these open ended questions.
16:15 You want to instrument your code so that you can get the data so that you can ask exploratory questions later.
16:30 It’s not just about problems - it’s about understanding it, for example, from a business perspective.
17:00 Good engineers observe their system, so they know what it should look like when it’s running normally.
17:20 If you don’t observe your system, how do you know when it’s behaving abnormally?
17:25 Engineers don’t realise how many things are going wrong all the time, because no-one is looking at the errors.

What do you mean by high cardinality?

17:45 By high cardinality I mean what happens if you have ten million userids - every single one is different.
17:50 Existing metrics and dashboards do not handle that at all.
18:00 Many categories of things become massively trivial once you can handle those.
18:10 Lots of people don’t understand what it is or how much easier their life would be if they had it - because they haven’t experienced it.
18:25 It is transformative.
18:30 If you have a user who contacts you to let you know that Parse is down, you then start looking through logs looking for their userid.
18:45 If you have a tool that can operate on a wide range of users, you can break it down by looking for a single application ID.
18:55 If you have tens of millions of requests per second, and that app has only a few, then it’s never going to show up above the background noise.
19:00 If you have other queries - for a single shopping basket, for everyone with a last name of ‘Smith’ - if you can slice and dice the data then you can get information.
19:30 At every context you capture, look for the userid or the group id or any kind of co-ordination ID.
19:45 The act of slicing and dicing is dependent on the information collected at the point of instrumentation.
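
The effect of breaking down by a high-cardinality field can be sketched as follows. The data is synthetic and the names (`app_id`, `error_rate_by`) are assumptions for illustration: one small app is failing completely, yet the overall error rate barely moves.

```python
from collections import Counter

# Synthetic traffic: 1000 healthy requests spread over 50 apps,
# plus 3 failing requests from one small app.
events = [{"app_id": f"app-{i % 50}", "status": 200} for i in range(1000)]
events += [{"app_id": "app-rare", "status": 500} for _ in range(3)]


def error_rate_by(events, field):
    """Compute the error rate for each distinct value of a field."""
    totals = Counter(e[field] for e in events)
    errors = Counter(e[field] for e in events if e["status"] >= 500)
    return {value: errors[value] / totals[value] for value in totals}


# Aggregated over everything, the problem is lost in the noise.
overall = sum(e["status"] >= 500 for e in events) / len(events)

# Broken down by app ID, the failing app stands out immediately.
by_app = error_rate_by(events, "app_id")
worst_app = max(by_app, key=by_app.get)

print(f"overall error rate: {overall:.4f}")
print(f"worst app: {worst_app} ({by_app[worst_app]:.0%} errors)")
```

This only works because the app ID was captured on every event at instrumentation time; a field that was never recorded cannot be sliced on later.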

Why did you call the talk: “What got you here won’t get you there?”

20:40 We’ve built up these best practices and approaches for understanding our systems very well.
20:50 If you look at the LAMP era, we’ve got tools like Splunk and New Relic for handling known unknowns.
21:15 They’re not as good at finding out what is going wrong when you have a handful of reports.
21:20 It’s almost like: the answer is easy, but the question is hard.
21:25 Honeycomb is about discovering the question.
21:35 It’s not as hard or open-ended as it sounds; you start at the edge and work your way down.
21:45 If it’s a data problem, you start at the bottom and work your way up.
21:50 For example: if you get a report that latency is rising across the board, what do you do?
21:55 You start at the edge or endpoint; what’s the error rate? Where are the slow ones?
22:05 Let’s look at the data sources that those slow endpoints are writing to.
22:10 If they’re writing to MySQL, is it all of them, or just one of them?
22:20 If it’s just the MySQL instances that are in this availability zone, then you can investigate further.
22:25 You follow the breadcrumbs - you don’t know exactly where you are going to end up, but you can follow the trail.
22:30 If you were using dashboards, you’d have to go to a specific dashboard to find that problem - and for degradation but not failure, they might all be green.

Why did you use the electrical grid as your analogy?

22:45 I wrote an article called “test in production” and got some feedback suggesting the use of staging environments.
22:50 The problem is it’s difficult to have a staging environment the size of Facebook - at some point, it becomes uneconomic to build and run one.
23:15 Staging environments can drift from production; it can become more difficult to maintain, for combinations of unique deployments.
23:35 So I said: “Try building the electrical grid as staging” as an example.
23:40 We only have so much engineering time, and the more time we invest in fooling ourselves with testing in staging, the more we’re starving production systems of the work that would make them more resilient or harden them.
24:05 I don’t know many companies who have invested in their production tooling, canary releases - but that’s what you have to do at scale.

What tools and tips do you have with testing in production?

24:30 It’s a waste of time to do it until you need it - and then once you have it, you can’t live without it.
24:40 Any time you have the ability to deploy a version to a system that you use in-house, do it (dogfooding).
24:50 If you’re all using it all the time, you’ll notice the problems, so that your users don’t have to see them.
24:50 Automated checks on canaries are amazing.
25:05 If you deploy to a small subset of canaries, and have automated checks, you can see if it looks the same, and roll back automatically if there are problems.
25:10 They’re table stakes, and yet most of us don’t have them.
25:20 It forces you to follow a lot of good data practices, like versioning your schemas and having the database run fine with multiple versions, before and after.
25:40 One of my favourites is: have your load tests running all the time, say on 20% of your hardware.
26:00 It means you’ll always have headroom - if you have scalability bursts, then you can just turn it off.
26:15 You always have to test before production as well, of course.
26:30 Everyone who has access to master should know how to deploy and how to roll back.
26:40 A reasonable time to deploy is about five minutes, start to finish.
26:50 Make sure that the easiest, fastest path is the default one. You don’t want the rollback to take much longer than the deploy, for example.
27:10 Splitters and shadowers are important as well.
27:20 We did a rewrite at Parse from Ruby to Go.
27:40 We had a one thread per request model; and we had a pool of requests that timed out after a few seconds if the database was slow.
27:55 We did a re-write endpoint by endpoint and it was painful.
28:05 We could not break the mobile clients, because they are in the appstore and it’s not within our control to redeploy all of them.
28:10 We had a shadow of the API using the new code, and it would fork the request and send it to both the old and new code paths.
28:25 We would return the result from the old code, but we could compare the result that was generated with the result that would have been if the new code had been used.
28:30 We could then update the code to match the expected results when they diverged.
28:40 That’s the most paranoid form of testing in production, but there are times when you have to do this.
29:05 Another great thing is to use feature flags, where you selectively enable codepaths based on whether a flag is set.
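
The shadowing approach described above for the Parse rewrite can be sketched as a wrapper that sends each request to both code paths, returns only the old result, and records any divergence. This is a minimal single-process illustration, not the Parse implementation; `shadow`, the two handlers and the mismatch log are all hypothetical names, and it assumes the handlers have no side effects (real shadowing has to isolate or suppress writes from the new path).

```python
def shadow(old_handler, new_handler, on_mismatch):
    """Wrap two handlers so the old one serves traffic and the new one is compared."""
    def handle(request):
        old_result = old_handler(request)
        try:
            new_result = new_handler(request)
            if new_result != old_result:
                on_mismatch(request, old_result, new_result)
        except Exception as exc:
            # A crash in the new code must never affect the client.
            on_mismatch(request, old_result, exc)
        return old_result  # clients only ever see the old code's answer
    return handle


def old_total(request):
    return {"total": sum(request["items"])}


def new_total(request):
    # Rewritten handler with a deliberate divergence on empty input.
    if not request["items"]:
        return {"total": None}
    return {"total": sum(request["items"])}


mismatches = []
handler = shadow(old_total, new_total, lambda req, old, new: mismatches.append((req, old, new)))

first = handler({"items": [1, 2]})
second = handler({"items": []})
print(first, second, mismatches)
```

Each recorded mismatch is a case where the new code needs to change before it can take over that endpoint.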

Why do you prefer sampling over aggregation?

29:20 Metrics are very cheap to collect and store.
29:35 The problem is that you lose context; and once you’ve lost it, you can’t get it back.
29:45 When the priority is the health of the system, that’s fine.
29:50 Increasingly, the priority is the health of the whole experience - and you need to be able to track that request all through the stack.
30:00 We can’t pay to store 200 times as many events.
30:20 If you don’t capture the data at all, you can’t investigate ever.
30:30 So the statistical way to handle this is through sampling.
30:35 We do dynamic sampling at honeycomb; the sample rate is stored in every key-value event, so it can be drawn appropriately in the UI.
30:55 So you can say that your database reads for this type are going slow - keep 1 in 100.
31:00 For deletes, we want to keep all of the events.
31:10 For web traffic - you don’t need all of the 200s, but you probably do need the 500s.
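
The idea of keeping a different fraction of each event type, while recording the sample rate on every kept event, can be sketched like this. The rates, field names and helper functions are assumptions for illustration and do not describe Honeycomb's actual sampler.

```python
import random

# Hypothetical rates: keep 1 in N events for each status code.
SAMPLE_RATES = {200: 100, 500: 1}


def sample(events, rates, rng):
    """Keep each event with probability 1/rate and record the rate on it."""
    kept = []
    for event in events:
        rate = rates.get(event["status"], 1)  # unknown types are always kept
        if rng.random() < 1.0 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_count(kept, status):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(e["sample_rate"] for e in kept if e["status"] == status)


rng = random.Random(42)  # seeded so that runs are repeatable
traffic = [{"status": 200} for _ in range(10000)] + [{"status": 500} for _ in range(7)]
kept = sample(traffic, SAMPLE_RATES, rng)

print("events stored:", len(kept), "of", len(traffic))
print("estimated 200s:", estimated_count(kept, 200))
print("estimated 500s:", estimated_count(kept, 500))
```

Because the rate travels with each event, counts can be re-weighted at read time: the estimate for the heavily sampled 200s is approximate, while the 500s, kept at a rate of 1, are counted exactly.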

How do you tune these?

31:25 In a world where we can page a lot less but do a lot of instrumentation - that’s what you can do.
31:30 If you get problems about a particular request, you can increase the sampling rate to get more visibility when it’s needed.

What else would you have liked to say in your talk?

32:05 I would have liked to say: if what you are doing at the moment works, then it’s OK.
32:15 You should never rip and replace something that’s already working for you with an unknown.
32:50 It’s not hard to get started - you can run a dashboard for some things and then use honeycomb for others.
33:20 If you start at the edge, developers will start instrumenting the pain points - and at Parse it took six months from starting to having everything instrumented.
33:40 You don’t have to start out with an end goal of instrumenting everything - it can grow organically.
33:50 You aren’t ever going to finish, but you will get to a pause point where engineers can answer what they need to in order to get the job done.
34:10 You get 80% of the gain for 20% of the effort.
34:20 We write about a lot of these topics on our blog.

Mentioned

honeycomb.io
Facebook
Strange Loop talk “Observability for Emerging Infra: What got you Here Won't get you There”.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.