Jessica Kerr on Observability and Honeycomb's Use of AWS Lambda for Retriever

In this podcast Charles Humble talks to Jessica Kerr about Honeycomb's architecture and use of Serverless, specifically AWS Lambda, as part of their custom column database system called Retriever. They also explore key differences between Retriever and Facebook’s Scuba, and how Honeycomb differs from traditional APM tools.

Key Takeaways

  • Observability allows you to explore a system in real-time without pre-defining what you need to watch.  
  • The core of Honeycomb is a custom database called Retriever, inspired by Facebook’s Scuba. Retriever has no fixed schema, no pre-defined indexes apart from timestamp, and is multi-tenanted. 
  • Retriever ingests data from Kafka topics, storing it on local SSD or in Amazon S3. This use of local SSD and S3 is a key difference from Facebook’s Scuba, which is performant in part by dint of storing everything in memory.
  • Honeycomb uses AWS Lambda for Retriever, which is useful for on-demand compute, but Honeycomb’s application of it is challenging, and they hit limits such as the burst limit and the cap on how much data can be returned by a function.

Transcript

Introduction [00:36]

Charles Humble: Hello and welcome to the InfoQ podcast. I'm Charles Humble, one of the co-hosts of the show and editor-in-chief for cloud native consultancy firm Container Solutions. Every year, Cindy Sridharan publishes a list of her favorite conference talks from the last 12 months. Her 2021 list, which I'll link to in the show notes, includes a talk from this week's guest on the InfoQ podcast, Jessica Kerr. In it Jessica describes how Honeycomb, where she works, is using Serverless, specifically AWS Lambda, as part of their custom column database system, which they call Retriever. Retriever is inspired by Facebook's Scuba and we'll talk a little bit about the differences between the two in the podcast as well. Jessica has over 20 years in software and is a regular conference speaker, having talked about languages from Scala to Elm and Ruby to Clojure, and movements including DevOps. She was last on the InfoQ podcast back in 2017, where she talked to my friend Wes Reisz about productivity, Slack bots, yak shaving and diversity. Today, it's going to be all about Honeycomb and Serverless. Jessica, welcome back to the InfoQ podcast.

Jessica Kerr: Thank you, Charles. It's great to be here with you.

In your Strange Loop talk you said “observability turns monitoring on its head.” Can you unpack that a bit for us? [01:43]

Charles Humble: So in that fantastic Strange Loop talk, which again I'll link to in the show notes so people can watch it themselves, you gave what I think is the best definition of observability that I've ever heard, which was that observability turns monitoring on its head. Can you unpack that a bit for us?

Jessica Kerr: Right? Observability is a new thing. It's different from monitoring, but let's start with monitoring. Monitoring is "I need to know what my system is doing so that I can react if it changes". And typically you have lots of things that you want to count. I want to know how many requests we're receiving, every minute, tell me. So you start counting those requests and you've got this little counter and every time a request comes in, you increment that counter. Then you probably store that in a time series database, whatever, you keep a counter of requests that came in, you keep a counter of requests that errored. You keep a counter of requests from Canada and a counter of requests from the US. Yeah. And everything else you might need to divide them by to see what's going on. And then those counters very naturally make graphs.

The problem is, they pretty much only make graphs. And then you're like, oh, look at all these new errors. Show me one of them. How do I find out more? You don't. You don't find out more because all you have is a count. Observability turns that on its head in the sense that we're not going to optimize for graphs by counting when something happens. We're just going to store that something happened. This takes more storage.

We have to store, okay, the request came in. It was from the US, it errored with this error message, for instance. And then the thing is when you store that, you can make all the counts. You can count them, you can graph them and you can find out everything about a particular individual request that failed. The cost is you have to store events and you have to count them when you make the graph, which is a lot more work than graphing a bunch of counts that you've already summarized and pre-aggregated. So observability says, we're going to get all these graphs, but not by counting what we think we need to count, but by remembering everything important that happened.
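
To make the contrast concrete, here is a minimal Go sketch (illustrative only, not Honeycomb's code) of the two approaches: incrementing pre-aggregated counters versus storing the whole wide event and deriving counts later.

```go
package main

import (
	"fmt"
	"time"
)

// The monitoring approach: pre-aggregated counters. Cheap to store and
// graph, but once incremented, the individual request is gone.
var requestsTotal, requestsErrored int

// The observability approach: store the whole (wide) event. Counts and
// graphs are derived later, at query time, and any attribute can still be
// inspected for a single failing request.
type Event struct {
	Timestamp  time.Time
	Country    string
	UserAgent  string
	DurationMS float64
	Error      string // empty if the request succeeded
}

var events []Event

func record(country, userAgent string, durationMS float64, errMsg string) {
	requestsTotal++
	if errMsg != "" {
		requestsErrored++
	}
	events = append(events, Event{
		Timestamp: time.Now(), Country: country,
		UserAgent: userAgent, DurationMS: durationMS, Error: errMsg,
	})
}

func main() {
	record("US", "curl/8.0", 12.5, "")
	record("CA", "Mozilla/5.0", 950.0, "timeout")
	fmt.Println(requestsTotal, requestsErrored)     // counters: 2 1, nothing more to ask
	fmt.Println(events[1].Error, events[1].Country) // the stored event still answers "show me one"
}
```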

Charles Humble: And it's very different from monitoring. So I worked for a traditional monitoring company in the late 90s, early 2000s around then. And typically the way it would work, a customer would buy the product, you would go in to help them set it up and the first thing you would ask them was basically, "what is it that you want to monitor?" The point being that they would then tell you the things that had gone wrong in the past. And so typically what you would end up doing is monitoring things that were historically problematic, but that might not be problematic now.

Jessica Kerr: Or your CPU spikes and it doesn't actually matter because customers aren't affected. So monitoring can tell you when something is different, but it can't tell you what to do about it, observability can.

And in order to do that, what Honeycomb is doing is, it's storing basically all of the events that a system sends to you. [04:35]

Charles Humble: Right? And in order to do that, what Honeycomb is doing is, it's storing basically all of the events that a system sends to you. Is that right?

Jessica Kerr: Right. We store the event and in particular, we talk about wide events, meaning events with a lot of attributes on them. So when you use Honeycomb, you get charged per event that you send, and it doesn't matter how much stuff you put on there. Okay. I got a request from the US, it came in on this region, on this user agent and it was for this user ID and this customer. And it was this particular error message. And you can put all kinds of things on there. You don't need to reduce them to a few values that you're counting. It's just a lot of stuff because you don't know what you're going to need.

Charles Humble: What do you do if the numbers are too high, you have too high a volume of data?

Jessica Kerr: If the numbers are too high, what you'll wind up doing is sampling. And just keeping 1 out of 10, except you'll keep every error and you'll keep all of the slower ones and you can do smart sampling. So you always have full information on the events that really matter.
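
A rough sketch of that kind of smart sampling, assuming a keep-everything rule for errors and slow requests and a made-up 1-in-10 rate for the rest; recording the sample rate on each kept event lets counts be re-weighted at query time.

```go
package main

import (
	"fmt"
	"math/rand"
)

type Event struct {
	DurationMS float64
	Error      string
	SampleRate int // how many real events this stored event represents
}

// keep decides whether to store an event. Errors and slow requests are
// always kept; everything else is sampled at 1-in-10.
func keep(e Event) (Event, bool) {
	if e.Error != "" || e.DurationMS > 500 {
		e.SampleRate = 1
		return e, true
	}
	if rand.Intn(10) == 0 {
		e.SampleRate = 10
		return e, true
	}
	return e, false
}

func main() {
	kept := 0
	for i := 0; i < 1000; i++ {
		if _, ok := keep(Event{DurationMS: 20}); ok {
			kept++
		}
	}
	fmt.Println("stored roughly", kept, "of 1000 fast, successful events")
}
```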

How is this different from just chucking everything into a log file? [05:39]

Charles Humble: So we've talked about what makes observability distinct from monitoring. How is this different from, I don't know, just chucking everything into a log file?

Jessica Kerr: Ah, logging is clearly better than nothing, but a lot of things are better than logging. Better than logging is structured logs, because you can search those and then you can count and make graphs. So the events that come to Honeycomb are structured. Better than logging is tracing, because tracing shows context and connectedness. When you have a bunch of logs in a log file, they're all jumbled up between requests and you can't even tell what's happening. And you can be like, oh, okay, well we'll put the request ID on every log. Okay, now you have a bunch of jumbled entries from a particular request. Maybe, is it the one you want? You don't know, maybe. But which of those caused the others, which were going on at the same time? How long did each one take? Because typically you'll have a, oh it started now, oh it ended now.

Now we've got to tie this, anyway. Logs don't give you the structure that lets you pull out the causality. Traces give you: this request came in, it took this long. Meanwhile, this happened inside it and that triggered this, which triggered this. And then you can see what called what, kind of like a stack trace, only across the request, and only for the units of work that you have decided are important enough to record, and you can really figure out what happened. You can totally figure out what took so long and it gives you the power of seeing into your system.
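
To make that concrete, here is a small sketch of trace instrumentation using the OpenTelemetry Go SDK (which comes up again below); the function names and attributes are made up, and exporter setup, pointing the SDK at a backend such as Honeycomb, is assumed to happen elsewhere at startup.

```go
package app

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleRequest starts a span for the incoming request; lookupUser starts a
// child span inside it. The parent/child links are what give a trace its
// causal structure, unlike flat log lines.
func handleRequest(ctx context.Context, userID string) {
	tracer := otel.Tracer("example")
	ctx, span := tracer.Start(ctx, "handleRequest")
	defer span.End()

	// Wide-event style: attach anything that might matter later.
	span.SetAttributes(
		attribute.String("app.user_id", userID),
		attribute.String("http.request.header.user_agent", "curl/8.0"),
	)

	lookupUser(ctx, userID)
}

func lookupUser(ctx context.Context, userID string) {
	// Starting a span from the parent's ctx records "handleRequest called
	// lookupUser", plus how long each unit of work took.
	_, span := otel.Tracer("example").Start(ctx, "lookupUser")
	defer span.End()

	_ = userID // real work: database call, etc.
}
```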

Could you give us an overview of what the Honeycomb architecture looks like? [07:07]

Charles Humble: So at a high level, could you give us a bit of an overview of what the Honeycomb architecture looks like? And also where does Lambda fit into that initial picture? So maybe start with, what's the goal of Honeycomb and go from there.

Jessica Kerr: Our goal in Honeycomb is for developers to understand what's going on in their app. In order to understand what's going on in their app, they need to run queries and see traces. And for that we need telemetry data and we need to be able to query over it really fast. Telemetry means data that your app sends about what it's doing, as opposed to useful data that it's sending to users and stuff. This is just for you. Here's what's going on, telemetry. Telemetry data comes from instrumentation code. The word instrumentation means code that sends telemetry. We typically use OpenTelemetry for that, but this resides in your application. You add code, usually just a little bit of configuration and some libraries that you import, to make your application send this telemetry data about itself to Honeycomb, okay. That comes into Honeycomb and it is immediately accepted.

Here's your 200. And then we throw it on Kafka topics. So we throw it into Kafka and then it's picked up by our database. Our database is called Retriever. And we use Retriever for both the ingestion and the query engine because they're totally coupled. So Retriever picks your events up off the Kafka topics and chops them up into columns, into attributes, because it's a columnar data store. So every attribute that you send gets its own file. Typically, those are pretty sparse files, because the incoming request has a user agent, but all your internal spans and your database calls and stuff like that do not have a user agent. This is fine. So Retriever chops it up, puts it in many files, one per column. The only index that we have in Honeycomb is timestamp. These files are chopped up and limited in length based on quantity of records or timestamp.

And we record the beginning and end timestamp in each file, so we know which ones to check. All of our queries are timestamp based. And then you can add a whole bunch of filters on top of that. Okay. So Retriever puts all this stuff on a local disc. We have two Retrievers running off each topic at any time, so there's always a backup. And then whenever those files get full and the local disc is like, "oh, I'm feeling it, man," Retriever moves this stuff into S3. So the data lives on local disc and in S3. And the funny thing about S3 is that Lambdas can get to it really quickly. We first started putting data in S3 so that we could give customers a longer retention period. Typically, we keep data for 60 days now. And of course S3 is effectively infinite so that scales, but what doesn't scale is reading S3.

And in particular, we only read at query time, we don't do any aggregation or indexing; even calculated columns are all done at query time. So we wind up being compute bound. It's not even the network reads of incoming data; it takes a lot of CPU to read all this data and run your query over it. So the Retrievers, the EC2 nodes that are running Retriever, that's both ingesting data and writing it to disk and reading the local files. Retriever can do really well on what it has in local files. But when you start asking it to do local files and an infinite amount of data that might be on S3, that's too much.
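
A loose sketch of the layout just described: one file per column, segments bounded by timestamps, and a query that only ever touches the segments overlapping its time window. The names and structures are illustrative, not Retriever's actual format.

```go
package main

import "fmt"

// A Segment is one chunk of one dataset: one file per column (attribute),
// plus the time range the segment covers. Timestamp is the only "index".
type Segment struct {
	StartTS, EndTS int64
	ColumnFiles    map[string]string // attribute name -> file path (local SSD or S3 key)
}

// segmentsForQuery picks only the segments whose time range overlaps the
// query window; everything else is never read at all.
func segmentsForQuery(segments []Segment, queryStart, queryEnd int64) []Segment {
	var hit []Segment
	for _, s := range segments {
		if s.EndTS >= queryStart && s.StartTS <= queryEnd {
			hit = append(hit, s)
		}
	}
	return hit
}

func main() {
	segs := []Segment{
		{StartTS: 0, EndTS: 999, ColumnFiles: map[string]string{"duration_ms": "seg0/duration_ms"}},
		{StartTS: 1000, EndTS: 1999, ColumnFiles: map[string]string{"duration_ms": "seg1/duration_ms"}},
	}
	fmt.Println(len(segmentsForQuery(segs, 1500, 1800))) // 1: only the second segment is scanned
}
```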

So you are CPU bound and you need really fast query response times. So in order to do that, what do you need? [10:39]

Charles Humble: Right. So you are CPU bound and you need really fast query response times. So in order to do that, what do you need?

Jessica Kerr: We need this really sudden amount of compute, and our queries are fast. Oh, I forgot to tell you, it's really important that all of these queries over unindexed data, with calculated columns, complete in 1 to 10 seconds. A minute is acceptable only if you're querying like a ridiculous amount of data over 60 days. But normally you're querying over two hours and that should come back in a second or two. So suddenly we need a bunch more compute, maybe for a second or two, maybe for 30 seconds in a really big query case. And we need it now.

So we can't spin up more Retriever nodes, and we can't run enough Retrievers all the time, because that's really expensive. So we started spinning up Lambdas, lots of Lambdas. In each Lambda, it's: okay, your job is to read these files on S3, this little timestamp range for this data set. And it does it. And it sends its information back to Retriever. Depending on how much data you're querying, any query you send to Honeycomb may execute on our Retriever nodes, or in Serverless in a whole lot of Lambdas plus all the Retriever nodes.
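
A simplified sketch of that fan-out in Go: one worker per slice of data run in parallel, with the partial results merged at the end. The invokeWorker function is a hypothetical stand-in for invoking a Lambda over a timestamp range of S3 files (or scanning local files on the Retriever node).

```go
package main

import (
	"fmt"
	"sync"
)

// PartialResult is whatever one worker computes over its slice of segments,
// here just a count; a real engine would return per-group aggregates.
type PartialResult struct{ Count int64 }

// invokeWorker is a hypothetical stand-in for one Lambda invocation.
func invokeWorker(segment string) PartialResult {
	return PartialResult{Count: 1}
}

// fanOut runs one worker per segment in parallel and merges the partials,
// which is the shape of a query that bursts out to many Lambdas at once.
func fanOut(segments []string) PartialResult {
	results := make(chan PartialResult, len(segments))
	var wg sync.WaitGroup
	for _, seg := range segments {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			results <- invokeWorker(s)
		}(seg)
	}
	wg.Wait()
	close(results)

	var total PartialResult
	for r := range results {
		total.Count += r.Count
	}
	return total
}

func main() {
	fmt.Println(fanOut([]string{"seg0", "seg1", "seg2"})) // {3}
}
```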

Charles Humble: And it makes sense because your storage servers, as I understand it, would typically be spending a lot of their time basically relatively idle: they're ingesting data and they're waiting for a big query. And then when that query comes in, you suddenly get a huge burst of CPU demand.

Jessica Kerr: Yeah. It's so bursty.

Charles Humble: You said you were CPU bound, so you get a sudden burst of requirement for CPU, when suddenly all that idle CPU you had isn't really enough to get your return fast enough.

Jessica Kerr: Yeah. We store a lot of data. We make it accessible within a sip of coffee, the time it takes you to take a sip of coffee, and we do this efficiently. We're super efficient at our use of AWS. We're always tweaking how we use AWS. We work with them; the new Graviton processors, we were one of the first people to test those out and write about it. We're really efficient with that. And part of that is: when do we run more servers? When do we spin up Lambdas? When do we use spot instances? There's a bunch.

I'm imagining that you must have hit all kinds of sort of weird and unusual and interesting edge cases trying to do this with Lambda. Is that right? [12:55]

Charles Humble: Right? And it's one of the things that's really interesting about this, because it is a very unusual use case. I interviewed Matthew Clark, who is head of architecture for the BBC's digital products. And the BBC moved from what was basically a LAMP stack to running in the cloud on AWS and Lambda. But their use case, I mean, there were some interesting aspects to it, but their use case is perhaps a bit more familiar. What you are doing, writing effectively a data retrieval layer in Lambda, is quite an unusual thing to do. And so I'm imagining that you must have hit all kinds of sort of weird and unusual and interesting edge cases trying to do this with Lambda. Is that right? Could you talk about some of those? I'm imagining for instance, that you might be hitting burst limits because there's a limit to how many Lambdas you can start up.

Jessica Kerr: Right. So there's a limit to how many Lambdas you can start up at one time, the burst limit. And I don't know, maybe it's a thousand to start with. You'd have to look it up to see what it is today, for instance. And then if you continue to run Lambdas, if you keep that thousand full for a whole minute, then Amazon will increase your limit a little bit until it gets to the absolute concurrency limit. That doesn't help us at all!

Jessica Kerr: When we run thousands of Lambdas, it's within a second; we need it now. And we need it for maybe 10 seconds, probably less. They're expecting web apps, or maybe streaming; streams probably go on for more than a minute. So I mean, we talk to them, we got this limit raised somewhat. There is some degree of raising that they can do. But also it does make our queries slower. Some of those queries that work over the full 60 days of data, for instance, we wind up having to wait to start up more Lambdas. We have talked about maybe provisioned Lambdas. I don't think we've done that yet, but it is something we might do in the future. It winds up being more expensive because again, the whole point was to not pay for the compute when we're not using it.
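
For illustration only, a sketch of what hitting the burst limit can look like from the caller's side, with throttled invocations retried after a backoff; the error type and retry policy here are hypothetical, not Honeycomb's actual handling, but they show why throttling surfaces as slower queries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errThrottled = errors.New("throttled: burst/concurrency limit hit")

// invoke stands in for an AWS Lambda Invoke call that may be rejected with
// a throttling error when too many functions start at once. Here the first
// two attempts are throttled to simulate a burst.
func invoke(attempt int) error {
	if attempt < 2 {
		return errThrottled
	}
	return nil
}

// invokeWithRetry backs off and retries throttled invocations. The cost is
// visible in query latency: the work still happens, just later.
func invokeWithRetry(maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := invoke(attempt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errThrottled) {
			return err
		}
		time.Sleep(time.Duration(50*(1<<attempt)) * time.Millisecond)
	}
	return fmt.Errorf("gave up after %d attempts", maxAttempts)
}

func main() {
	fmt.Println(invokeWithRetry(5)) // <nil> once capacity frees up
}
```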

How do you find the startup time for Lambda? [14:52]

Charles Humble: Given that, how do you find the startup time for Lambda? It does have a bit of a reputation for being quite slow, having a bit of a cold-start penalty. And given you need your compute really quickly, I was wondering if that might be a problem.

Jessica Kerr: We find that it starts up at about 50 milliseconds. That's our median startup time. Lambda does take a long time to start up the first time you load a piece of code; it's longer, but we're running the same code that we've been running over and over, and there are idle versions of your code loaded up that it eventually kills. But when we use our Lambda integration that sends data to Honeycomb, we can see - it's kind of fun. We can see each execution on the Lambda. I'm running your function now, it's doing this thing. Here are all the activities in it. Okay. Now I'm done. And then it has to make sure it sends that back to us before, well actually it goes through the logs. So it sends that to the logs before it exits. Because as soon as it exits, as soon as it returns the value, you can't do any cleanup, because Amazon immediately puts that Lambda to sleep.

But we can also see with our Lambda integration in Honeycomb, we can see the stretch of time that the Lambda has your code loaded up and ready. You can see it waking up. You can see it running something and then it goes to sleep, and then it runs and then goes to sleep. And all this goes into the logs and comes through to Honeycomb. So you can see how many of your Lambdas were loaded up at a given time.

We have this concurrency operator in Honeycomb now, where you can ask how many of these were happening at any given point in time. And so you can see, oh now there's a thousand of these loaded up at a time. Now there's 2000, oh, now it's dropped off because they've gotten idle. We built this concurrency operator specifically for this case of how many Lambdas are we running at once, which is important for answering: how much is this query costing us? It's all the same to the customers. But those long queries over really big data sets, we wind up paying for in Lambda. But what matters to us is that you don't have to know ahead of time what you're going to look at, it's all fast. And so we go to the extremes of spinning up thousands of Lambdas to make everything fast.
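
As a rough illustration of the kind of computation a concurrency operator performs, here is a naive sweep over span start and end edges in Go; it is not Honeycomb's implementation, just the shape of the idea.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a simplified event with a start and end time, like one Lambda
// execution reported to Honeycomb.
type Span struct{ Start, End int64 }

// maxConcurrency computes the peak number of spans in flight at once by
// sorting the start/end edges and sweeping through them.
func maxConcurrency(spans []Span) int {
	type edge struct {
		t     int64
		delta int
	}
	var edges []edge
	for _, s := range spans {
		edges = append(edges, edge{s.Start, +1}, edge{s.End, -1})
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].t == edges[j].t {
			return edges[i].delta < edges[j].delta // close before open at the same instant
		}
		return edges[i].t < edges[j].t
	})
	cur, peak := 0, 0
	for _, e := range edges {
		cur += e.delta
		if cur > peak {
			peak = cur
		}
	}
	return peak
}

func main() {
	spans := []Span{{0, 10}, {5, 15}, {12, 20}}
	fmt.Println(maxConcurrency(spans)) // 2
}
```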

How do you get around the limit that a function can only return 6MB of data? [17:03]

Charles Humble: Something else you mentioned in your Strange Loop talk, which I'll be honest, since I haven't hit it myself, wasn't something I'd come across before. But you said that a function can only return 6MB of data.

Jessica Kerr: Oh yeah. That-

Charles Humble: Which is like, okay. So again, how on earth do you get around that? Because presumably you're returning an awful lot more than 6MB fairly routinely.

Jessica Kerr: Oh yeah. That is kind of the point. Well, I mean, where does all data go? It goes to S3. It's probably still there. Yeah, so passing data back and forth to Lambdas, if it's any significant amount, oh, throw it in S3. Lambda and S3 talk to each other pretty efficiently. Oh, another one was requests had to be in JSON. There's some JSON cop in AWS that requires it to resemble JSON according to some rules. Well, if you want more control over that, throw it in S3.
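
A minimal sketch of that S3 workaround using the AWS SDK for Go v2: the function uploads its large result to S3 and returns only the object key, which easily fits under the roughly 6MB response limit. The bucket and key names are invented for the example.

```go
package main

import (
	"bytes"
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// writeResult uploads a (potentially large) query result to S3 and returns
// only the object key; the Lambda's actual return value is just that key.
func writeResult(ctx context.Context, client *s3.Client, result []byte) (string, error) {
	key := "query-results/example-result.bin" // hypothetical key
	_, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-results-bucket"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(result),
	})
	return key, err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	key, err := writeResult(ctx, s3.NewFromConfig(cfg), []byte("lots of rows..."))
	fmt.Println(key, err)
}
```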

And then presumably do you compress the data that you're storing in S3?

Charles Humble: And then presumably do you compress the data that you're storing in S3?

Jessica Kerr: Yes. I don't have the details of this. I'd have to ask the database engineers. But what I do know is that one of the major improvements that we had a couple years ago was using a better compression algorithm than GZip. It turns out that GZip is widely compatible, but it's not the most efficient algorithm for anything anymore.

Charles Humble: Right, yeah. Because it's just so old; it's 27 years old or something now, GZip, I think. So yeah, it's not really a surprise. Do you know what you're using now? Is it LZ4?

Jessica Kerr: I forget which one we use now, but it's something intelligent that's suited to our data format in particular.

Why build your own database? [18:35]

Charles Humble: I want to talk a little bit more about this decision to build your own database. There's actually a great talk from Sam Stokes, who was one of your database engineers. It's from another Strange Loop actually back in 2017. And he talks about it in more detail. I came across it when I was researching for this podcast. And I'll link to that in the show notes as well. But given that building your own database is one of those things. It's like doing your own cache or rolling your own security. It's just, why would you do that? So why would you do that? Why build your own database?

Jessica Kerr: Yeah, we did. We had to build our own database. Charity and Christine who founded Honeycomb, looked at the existing databases that are out there. And there's time series databases that hold counts, which are useful for monitoring. That's great if you need to track the weather, but we actually care about causality and determining it. So they didn't work. Charity and Christine came from Facebook where they used a tool called Scuba, which managed to do a lot of fast queries. I think you might know more about Scuba than I do, Charles.

How does Retriever differ from Facebook’s Scuba? [19:37]

Charles Humble: Oh, I doubt that. But I have read the Scuba paper, and again, I'll link to that in the show notes as well. One of the things about Scuba is, because it's Facebook, obviously they have, to all intents and purposes, limitless money. So one of the reasons Scuba is able to do such fast querying is because basically it's holding everything in memory. Whereas what you're doing with Retriever is you're needing to send stuff to disk, whether that's local SSD or S3. So I think that's one of the distinguishing features between the two systems. That said, I'm imagining there are other major differences, because otherwise you'd just implement Scuba again. So what are some of those other major differences between the two?

Jessica Kerr: A big one is that we have usability designers. I mean, we have UX, we have product. Honeycomb is designed to be a joy to use. Our back end is unique and it is necessary to make these queries fast, which is necessary for observability to be responsive and let you drill down into the problem and keep your attention. So you don't have to go get a cup of coffee and get distracted while you're waiting for your Splunk query to come back. But in my opinion, it's the front end, the usability of the platform, that makes your experience. It really brings your data to you. And Scuba was designed by engineers to do whatever the heck that engineer wanted to do that day. And Honeycomb is designed to really help engineers figure out what's going on in their system. It helps with onboarding. It helps with sharing information with the rest of the team, between SRE and dev, for instance.

There's a lot that's different. And a lot of that filters back to the query engine: the operators that we support, the part where we don't index, the fact that you can define derived columns. And it did require its own database to make these queries ridiculously fast without pre-aggregation. Most databases, if you want something to be fast, you build an index on it, which slows down writes. We cannot slow down writes. Oh, another thing besides the queries being fast, you also have to be able to see your data within five seconds. Usually it's more like one. So when you send something into Honeycomb, you see it immediately. You don't need to wait 10 minutes to find out if requests are really dropping or if the ingestion is just slow. Right, so we can't slow down writes and we have to make everything fast, including new derived columns that you just came up with.

Columns that we didn't even know anything about 10 minutes ago, but we're calculating that all on the fly. And then the things that we can use to make it fast are stuff like partitioning by timestamp and the columnar data store. Because usually when you're graphing anyway, you're looking at a few columns to filter by and a few columns to group by, and then whatever aggregation we're doing in real time.
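
As an illustration of that query-time evaluation, here is a rough Go sketch of a derived column computed over a columnar batch during a scan; the column names and the expression are invented for the example, and nothing like this is stored or indexed.

```go
package main

import "fmt"

// A Batch of rows read column-by-column, as a columnar engine would see it.
type Batch struct {
	DurationMS []float64
	StatusCode []int64
}

// derivedIsSlowError is a made-up derived column: computed row-by-row from
// existing columns at query time, never written to disk or pre-aggregated.
func derivedIsSlowError(b Batch) []bool {
	out := make([]bool, len(b.DurationMS))
	for i := range out {
		out[i] = b.DurationMS[i] > 500 && b.StatusCode[i] >= 500
	}
	return out
}

func main() {
	b := Batch{DurationMS: []float64{12, 900, 650}, StatusCode: []int64{200, 503, 200}}
	fmt.Println(derivedIsSlowError(b)) // [false true false]
}
```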

So the columnar data store is another optimization. We have a very specific use case in some ways, every query has a timestamp, and a completely general use case in other ways: you have to be able to graph anything, filter by anything, group by multiple things. I think that's unique to Honeycomb, and it's its own problem space. And then we have to do it with like a ridiculous quantity of data. And we do it in AWS and we somehow make this efficient with just a lot of attention. So it's a custom database. It is not easy to run, but you know what? It's central to our business. And we do have an entire team of engineers that keep it running and keep it efficient on AWS and are always looking at it and always improving it. And that's what it takes. That's why you don't want to write your own database, because then you don't just write your own database once and then use it. You write your own database forever.

Charles Humble: Right, yes.

Jessica Kerr: Because it has to keep getting better and keep staying running and keep staying efficient and getting more efficient. As Honeycomb scales, we keep making those queries faster and we keep accommodating more data. And yeah, it's an ongoing work all the time to keep this database best in class.

Charles Humble: And it is a really interesting problem. You've got no predefined schema. You are multi-tenanted. You don't precalculate indices.

Jessica Kerr: Yeah. When we see an attribute, we're like, oh, guess that's in the schema now.

Charles Humble: Absolutely. Which is kind of bizarre when you think about it. You're basically storing time series and raw event data. And then you've got very distinct operations that you're needing to perform. So histograms, presumably percentiles, count distinct, those kinds of things.

Jessica Kerr: Right. But we do all that based on your events. Yeah. We store the events in time-stamped chunks, and then we create histograms based on them as you ask. Oh, and we haven't even talked about BubbleUp.
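
Before getting to BubbleUp, a toy illustration of that query-time aggregation: a percentile computed straight from stored event durations rather than from a pre-built histogram. This naive version is exact and unscalable; a real engine would likely merge per-worker summaries instead.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// p99 computes a percentile directly from raw event durations at query time.
func p99(durationsMS []float64) float64 {
	if len(durationsMS) == 0 {
		return 0
	}
	sorted := append([]float64(nil), durationsMS...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(0.99*float64(len(sorted)))) - 1
	return sorted[rank]
}

func main() {
	fmt.Println(p99([]float64{10, 12, 15, 900, 11, 13})) // 900: the slow outlier sets the p99
}
```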

Charles Humble: Well, I was just about to, so tell me about BubbleUp.

Jessica Kerr: BubbleUp is another thing that I find really unique to Honeycomb. It's where you're looking at your heat map of latency, for instance. And you're like, ah, look at these slow requests. Why are some of them slow? And you draw a box around them. And then Honeycomb does a statistical analysis on all the attributes in there and shows you which ones are different. Oh, these ones were from Canada. Maybe Canada's slow today. Or these ones were all from the T-Mobile network. So that's something unique that Honeycomb does. And it can only do that because it's not just creating a histogram of duration for every 10 second interval. It knows which events were slow and it knows everything else attached to those slow ones.
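
A crude Go sketch of the idea behind BubbleUp, comparing how often each attribute value occurs inside the selected events versus across the baseline; the real feature performs a proper statistical comparison over all attributes, this just shows the shape.

```go
package main

import "fmt"

// Event carries string attributes; Selected marks whether it falls inside
// the box the user drew on the heat map (e.g. the slow requests).
type Event struct {
	Attrs    map[string]string
	Selected bool
}

// bubbleUp returns, for one attribute, how much more often each value
// appears inside the selection than in the baseline.
func bubbleUp(events []Event, attr string) map[string]float64 {
	inside, baseline := map[string]float64{}, map[string]float64{}
	var nIn, nAll float64
	for _, e := range events {
		v := e.Attrs[attr]
		baseline[v]++
		nAll++
		if e.Selected {
			inside[v]++
			nIn++
		}
	}
	diff := map[string]float64{}
	for v := range baseline {
		inFrac := 0.0
		if nIn > 0 {
			inFrac = inside[v] / nIn
		}
		diff[v] = inFrac - baseline[v]/nAll
	}
	return diff
}

func main() {
	events := []Event{
		{Attrs: map[string]string{"country": "CA"}, Selected: true},
		{Attrs: map[string]string{"country": "CA"}, Selected: true},
		{Attrs: map[string]string{"country": "US"}},
		{Attrs: map[string]string{"country": "US"}},
	}
	fmt.Println(bubbleUp(events, "country")) // "CA" stands out in the selection
}
```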

Is this primarily a tool for SREs or does it have wider use? [25:00]

Charles Humble: Now we're coming towards the end of our time. And what I'd like to do is just get you to step up sort of several levels and think about where observability is useful. Is this primarily a tool for SREs or does it have wider use?

Jessica Kerr: We talk a lot about observability for SRE, observability in times of crisis when things are slow or expensive in our case, and when things are erroring and what's causing the failure and how do I drill down to that? But I'm a developer by background. I'm not ops, I'm not SRE. And I really like observability for knowing what's going on in my code in real life. So something that I'm focused on now that I'm at Honeycomb is how do we use observability in the development process while we're coding.

Jessica Kerr: And a big goal of Honeycomb is to get developers not afraid of production. Production is your friend. If you want to know, hey, are people using this concurrency operator, you can go to production and find out. And that makes me really happy because as a dev, I'm not satisfied with writing code. I'm not satisfied with writing code that's correct, or code that's elegant somehow in its magical abstractions. I want to write code that's useful to people. And when I stick a span or an attribute into my traces, and later I go back and I can query by that and I can see who, specifically which customers, are using the feature that I worked on, it feels so good.

Charles Humble: I completely understand that. I mean, I'm sort of more of an ex-developer, but I completely get that sentiment. And it's interesting. I mentioned at the beginning of the podcast that I worked a long time ago for a sadly now defunct monitoring vendor, an APM vendor, and there you were selling the product to the ops team. One of the things that's different now, I think, is that the world has hopefully shifted to something which is more DevOps oriented, and the tooling needs to change to reflect that. And I think it's just really interesting that Honeycomb, as a product, has a more developer-centric view of the world than a traditional monitoring tool has. It's not just the capabilities. The whole approach is a little different.

Jessica Kerr: True, monitoring can tell you when to panic, but we want more than that. And yeah, Honeycomb is not a tool built for ops. It's also not a tool built for devs exclusively. It's definitely a DevOps tool though, because the opsy types and SREs are looking at the same graphs that the developers can see, and you can send them back and forth to each other. Another thing that I like is we don't charge by user. So give access to everyone, please. That's how you get more value. And it's also kind of democratizing in that people who are new to the team can immediately start learning things about their system, and they can look at what queries other people are running, and it spreads information and it particularly spreads the same feedback loop between opsy people and devy people. So yeah, it's very DevOps in the original sense.

Charles Humble: So yes, very DevOps in the original sense. And also I think a really interesting application of AWS Lambda for a problem space that we might not typically think about. My thanks to Jessica Kerr for joining me this week. And also to you for listening to this edition of the InfoQ podcast.
