
Honeycomb: How We Used Serverless to Speed up Our Servers


Summary

Jessica Kerr reviews the benefits (user experience on demand!) and constraints (everything in AWS has a limit!) of serverless-as-accelerator, and gives practical advice.

Bio

Jessica Kerr is fascinated by how software doesn’t get easier as she gets better at it: it gets harder, and also more valuable. Jessitron has developed software in Java, C, TypeScript, Clojure, Scala, and Ruby. She keynoted software conferences in Europe, the US, and Australia. She ran workshops on Systems Thinking (with Kent Beck) and Domain Driven Design (with Eric Evans).

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Kerr: One Tuesday morning a developer sits down at their desk, opens up their laptop. They look at some dashboards, maybe, what is this blip in the error rate? Could that possibly be related to the change I pushed yesterday? They open up their log aggregator and they type in a query, take a sip of coffee. Might as well go get another cup. That's it. That's the whole story. They never get back to that question. Maybe their kid walked in and distracted them. Maybe they got an email. Who knows? Their attention is gone. I'm Jessica Kerr. I'm here to talk to you about how at Honeycomb, we use serverless functions to speed up our database servers.

Overview

I'm going to talk about how serverless is useful to us at Honeycomb, not for answering web requests, but for on-demand compute. Then I'll talk about some of the ways that it was tricky. Some of the obstacles that we overcame to get this working smoothly. Finally, how you might use serverless. Some things to watch out for, what workloads you might use this for.

What Is Lambda For?

First, I need to tell you why we use serverless at all. We use Lambda functions on AWS Lambda to supplement our custom datastore, whose name is Retriever. Your first question there should definitely be, why do you have a custom datastore? Because the answer to "let's write our own database" is no. In our case, our founders tried that. It turned out that we are really specialized. Retriever is a special-purpose datastore for real-time event aggregation, for interactive querying over traces, over telemetry data for observability. Why would we want to do that? Because Honeycomb's vision of observability is highly interactive. People should be able to find out what's going on in their software system when they need to know: not just learn that something's wrong, but be able to ask, what's wrong? How is this different? What does normal look like? It's a repeated question structure, always getting to new questions. The difference for us between monitoring and observability is that in monitoring, you decide upfront what you want to watch for. You maybe watch for everything that has been a problem in the past. Then when you want to graph over that, you can graph it over any period of time, and it's really fast because you've stored it in a time series database. You've done all the aggregating already. In Honeycomb, you don't yet know what you're going to need to ask about production data. Dump it all into events. We'll put it into Retriever. Then we'll make every graph fast: each different field that you might want to group by or aggregate over, each different aggregation you might want to do, from a simple count to a p50, or a p90, or a p99, or a heatmap over the whole distribution. Our goal is to make all of these graphs fast, so that you can get the information you need and immediately start querying it for more information.
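
To make that concrete, here is a minimal sketch of what "dump it all into events" looks like from the sending side, using Honeycomb's Go SDK, libhoney-go. The field names and values are made up for illustration; the point is that an event is just a bag of arbitrary fields, and nothing about them is decided until query time.

```go
package main

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Illustrative config values, not real credentials.
	libhoney.Init(libhoney.Config{
		WriteKey: "YOUR_API_KEY",
		Dataset:  "retriever-demo",
	})
	defer libhoney.Close()

	// One wide event: send whatever fields you have. Nothing here is
	// pre-registered or indexed; any field can be grouped or aggregated later.
	ev := libhoney.NewEvent()
	ev.Add(map[string]interface{}{
		"name":          "invoke",
		"service.name":  "lambda",
		"duration_ms":   412.7,
		"customer_id":   "team-1234",
		"segment_count": 8,
	})
	ev.Send()
}
```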

It goes like this. Say I want to know how long our Lambda functions take to execute. What's the average execution time? Not always the most useful metric, but we can use it today. I know I need to look in Retriever's dataset for something in Lambda, but I don't remember the name of the spans I'm looking for. I'll just ask it for group by name: give me all the different names of the spans. I recognize this one, invoke. I'm looking for invoke. Next question: run this query, but show me only the invoke spans. Ok, got that back. Next query: show me the average of their durations. I can scroll down and I can see what that is. Then I get curious. This is important. I'm like, why is this so spiky? What is going on over here where it's super jumpy, and the count is way higher and the average duration is bouncy? Look at that spike in the p50, the median duration, down there. Let's see, I'll heatmap over that. Doesn't look like they're particularly slower in the distribution. Let's ask, what is different about these spans compared to everything else in the graph?
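
That sequence of questions maps onto a query specification: a time range, filters, an optional group-by, and one or more aggregations. Here is a hedged sketch of the "show me only the invoke spans and the average of their durations" step; the structs are loosely modeled on Honeycomb's query JSON and are illustrative, not Retriever's internal format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative query-spec shapes, loosely modeled on Honeycomb's query JSON.
// The field names are assumptions for this sketch, not Retriever's internals.
type Filter struct {
	Column string      `json:"column"`
	Op     string      `json:"op"`
	Value  interface{} `json:"value"`
}

type Calculation struct {
	Op     string `json:"op"`               // COUNT, AVG, P50, P99, HEATMAP, ...
	Column string `json:"column,omitempty"` // empty for COUNT
}

type QuerySpec struct {
	TimeRangeSeconds int64         `json:"time_range"`
	Filters          []Filter      `json:"filters"`
	Breakdowns       []string      `json:"breakdowns,omitempty"` // group-by columns
	Calculations     []Calculation `json:"calculations"`
}

func main() {
	// "Show me only the invoke spans, and give me the average of their durations."
	avgInvokeDuration := QuerySpec{
		TimeRangeSeconds: 2 * 60 * 60,
		Filters:          []Filter{{Column: "name", Op: "=", Value: "invoke"}},
		Calculations:     []Calculation{{Op: "AVG", Column: "duration_ms"}},
	}
	out, _ := json.MarshalIndent(avgInvokeDuration, "", "  ")
	fmt.Println(string(out))
}
```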

Honeycomb does a statistical analysis of what's going on. Then we can scroll down and we can see what's different. It looks like for the spans in this box I drew they're mostly from queries, and they have a single trace ID so they're from this particular query. Ok, so now, next aggregation. Give me only the spans inside this trace. Now I'm seeing all the invocations in this one query, but now I want to know what query were they running that made it take so long? Instead of looking for the invocations, let's look through this trace, but let's find something with a query spec in it. Now I'm going to get back probably just one span, a couple spans, of Retriever client fetch. I recognize that name. That's the one that's going to tell me what this particular customer was trying to do. If I flip over to raw data, then I can see all of the fields that we sent in Retriever client fetch. Look, there's the query spec right there. I'm not sure exactly what that is but it looks hard. Some queries that customers run are definitely harder than others.

Interactive Investigation of Production Behavior

The point is to get this interactive feel, this back and forth, this dialogue going with your production data, so that you can continue to ask new questions over and over. For that, it has to be really fast. If I hit run query, and then I take a sip of coffee, now I should have my answer. If I have to go get another cup, complete failure. We've lost that developer or that SRE. That's not good enough. The emphasis here is on the interactivity. Ten seconds is a little slow. One second is great. A minute, right out.

Retriever Architecture

How do we do this? Here's the architecture of Retriever. Customers send us events. We put them in the database. Then developers, and SREs, and product, and whoever, run queries from our web app. Of course, the events come into Kafka. This is not weird. Naturally, we partition them. Retriever is a distributed datastore. There are two Retrievers reading off of each topic, so that we have redundancy there. Each one reads all the events, and then writes them to local disk. Because local disk is fast: in-memory is too expensive, and anywhere else is slower. It writes all these things to local disk. That's quick. The more Retrievers we have, the more local disks we have. Then, when a query comes in, it comes into one Retriever. That Retriever says, ok, this dataset has data in these other partitions, and sends off inner queries to all of those Retrievers so that they can access their local disks. Then there's a big MapReduce operation going on; it comes back to the Retriever you asked, and it responds to the UI. That's the distributed part.
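
The fan-out and merge described here is a classic scatter-gather. Below is a generic sketch of that shape in Go; the types and helpers (SubQuery, PartialResult, queryPeer, merge) are hypothetical stand-ins, not Honeycomb's actual code.

```go
package retriever

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Hypothetical stand-ins for Retriever's real types.
type SubQuery struct{ StartUnix, EndUnix int64 }
type PartialResult struct {
	Count int64
	Sum   float64
}

// queryPeer would send the inner query to another Retriever, which scans its
// own local disk and returns a partial aggregate.
func queryPeer(ctx context.Context, peer string, q SubQuery) (PartialResult, error) {
	// ... RPC to the peer ...
	return PartialResult{}, nil
}

// merge is the reduce step: fold two partial aggregates into one.
func merge(a, b PartialResult) PartialResult {
	return PartialResult{Count: a.Count + b.Count, Sum: a.Sum + b.Sum}
}

// fanOut scatters one query to every peer that owns a partition of the
// dataset, then merges their partial results before answering the UI.
func fanOut(ctx context.Context, peers []string, q SubQuery) (PartialResult, error) {
	results := make(chan PartialResult, len(peers))
	g, ctx := errgroup.WithContext(ctx)
	for _, peer := range peers {
		peer := peer // capture loop variable (pre-Go 1.22 idiom)
		g.Go(func() error {
			r, err := queryPeer(ctx, peer, q)
			if err != nil {
				return err
			}
			results <- r
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return PartialResult{}, err
	}
	close(results)
	total := PartialResult{}
	for r := range results {
		total = merge(total, r)
	}
	return total, nil
}
```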

The next trick to making this really fast is that Retriever is a column store. It's been a column store since before these were super cool, but it's still super cool. Every field that comes in with an event goes in a separate file. That's fine. This is how we scale with quantity of fields on the event. Because at Honeycomb, we want you to send all kinds of fields and they can have all different values. We don't care, because we're only going to access the ones we need. When a query comes in, if we're looking for service name equals Lambda, and name of the span is invoke, and we're aggregating over the duration, all Retriever is going to look at is the service name, the name, and the duration columns, and the timestamp. There's always a timestamp associated with every query. That's the next trick: in order to segment this data, we use the timestamp. At Honeycomb, I like to say we don't index on anything, but that's not quite true, we index on timestamp. The data is broken into segments based on, I think, at most 12 hours, or a million events, or a certain number of megabytes in a file. Then we'll roll over to the next segment. Then we record the earliest and latest timestamps in each segment. That way, when a query comes in, we're like, ok, the query has this time range, we're going to get all the segments that overlap that time range. We're going to look through the timestamp file to find out which events qualify. That's how Retriever achieves dynamic aggregation of any fields across any time range at that interactive query speed.
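
The "index on timestamp" part is just interval overlap: keep the earliest and latest timestamp per segment, and only open segments whose interval overlaps the query's time range. A minimal sketch, with made-up types:

```go
package retriever

// Segment metadata: Retriever records the earliest and latest timestamp seen
// in each segment. The shape here is a made-up illustration.
type Segment struct {
	Key                      string
	EarliestUnix, LatestUnix int64
}

// segmentsForQuery returns only the segments whose time range overlaps the
// query's time range; everything else never gets opened.
func segmentsForQuery(all []Segment, queryStart, queryEnd int64) []Segment {
	var hits []Segment
	for _, s := range all {
		// Two intervals overlap unless one ends before the other begins.
		if s.LatestUnix >= queryStart && s.EarliestUnix <= queryEnd {
			hits = append(hits, s)
		}
	}
	return hits
}
```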

More Data and Big Datasets

Then we have the problem of success: we've got bigger customers with more data coming in, and datasets are getting bigger. The thing is, our strategy used to be, whenever we run out of space for a particular dataset, a new segment starts and older segments get deleted. That was fine when the oldest segment was like a week old. The point is, your current production data is what's most important. We got datasets that were big enough that, at our maximum allocation for a dataset, we were throwing away data from like 10 minutes ago. That's not ok. You need more than a 10-minute window into your production system. We did what everybody does when there's too much data: we started putting it in S3. This time, instead of deleting the oldest segment, we were shipping it up to S3. Each Retriever still takes responsibility for all of the segments in its partition, it's just that now we're not limited in storage. We can store it up to 60 days. That's a much better time window than "until we run out of space", and much more predictable. But those queries are going to be slower. They're not as fast as local disk. It's the most recent stuff that you query the most often, and that's what you want to be really fast. It's also the stuff that's the most urgent.
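
A sketch of that tiering step: instead of deleting the oldest segment, upload its column files to S3 and then free the local disk. The bucket name, key layout, and helper shape are assumptions for illustration; the S3 calls are the standard aws-sdk-go-v2 ones.

```go
package retriever

import (
	"context"
	"os"
	"path"
	"path/filepath"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// tierSegment uploads every column file in a segment directory to S3, then
// removes the local copy. The bucket and segmentKey naming are illustrative.
func tierSegment(ctx context.Context, s3c *s3.Client, bucket, segmentKey, dir string) error {
	files, err := os.ReadDir(dir) // one file per column, plus the timestamp file
	if err != nil {
		return err
	}
	for _, f := range files {
		body, err := os.Open(filepath.Join(dir, f.Name()))
		if err != nil {
			return err
		}
		_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(path.Join(segmentKey, f.Name())),
			Body:   body,
		})
		body.Close()
		if err != nil {
			return err
		}
	}
	// Only free the local disk once S3 has every file.
	return os.RemoveAll(dir)
}
```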

We're like, ok, so each Retriever, when it needs some data that's older, will go download those files from S3 and include them in the query. It won't be quite as fast, but it'll be a lot more flexible, because you have more data. That's good. Now people can run queries over 60 days' worth of data. But 60 days is a lot. How much longer is that going to take? When you're reading from local disk, it's really fast, but as soon as you hit S3, the query time grows at least linearly with the number of segments that it has to download and query. If you query for the last few minutes, yes, you can take a sip of coffee. If you query for the last few days, you might have to take a couple of sips, and at 60 days, we had to change our maximum query timeout to an hour. That's way beyond a cup of coffee. That's roast the beans and brew the pot. I hear roasting beans doesn't actually take that long, but this took too long.

That was not ok. What are we going to do? Retriever is like, I need more compute. The network wasn't the bottleneck here. It was actually the compute, because we're doing all those reads and the aggregations, and group bys, and filters, and all that stuff in memory. At query time, compute was the limitation. We could just spin up more Retrievers. We could get more EC2 instances. You can buy compute. Except we really don't need it all the time. The Retriever dog doesn't always want to play. This is when we need the compute. This is the concurrency of how many Lambdas we are running at any one time, and it's super spiky. Often, pretty much none. Sometimes, we need thousands. This is very different from the compute profile of EC2, because we don't need the compute 30 seconds from now. Even if an instance spun up that fast, which they don't all, that's too long. We need sudden access to compute while you're lifting your cup. That is exactly what serverless provides. Also, Lambdas are right next door to S3. Retriever, you get some minions. Now, when a Retriever needs to access its segments in S3, it spins up a Lambda for each eight or so segments. That Lambda reads the data from S3, decrypts it, looks at just the files it needs, does the aggregations, and sends the intermediate result to Retriever, and the MapReduce operation flows upward. This is much better.
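
Here is a sketch of that fan-out to Lambda, using the aws-sdk-go-v2 Invoke call. The function name, payload shape, and batch size of eight are illustrative, and where the real system would invoke these concurrently, this sketch goes one batch at a time for brevity.

```go
package retriever

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

const segmentsPerLambda = 8 // illustrative batch size

// lambdaRequest is a made-up payload shape: which segments to read and a
// pointer to the query to run over them.
type lambdaRequest struct {
	SegmentKeys []string `json:"segment_keys"`
	QuerySpecS3 string   `json:"query_spec_s3"`
}

// invokeMinions hands each batch of S3-resident segments to one Lambda and
// collects the intermediate (map-side) results to reduce later.
func invokeMinions(ctx context.Context, lc *lambda.Client, segmentKeys []string, specKey string) ([][]byte, error) {
	var partials [][]byte
	for start := 0; start < len(segmentKeys); start += segmentsPerLambda {
		end := start + segmentsPerLambda
		if end > len(segmentKeys) {
			end = len(segmentKeys)
		}
		payload, err := json.Marshal(lambdaRequest{SegmentKeys: segmentKeys[start:end], QuerySpecS3: specKey})
		if err != nil {
			return nil, err
		}
		out, err := lc.Invoke(ctx, &lambda.InvokeInput{
			FunctionName: aws.String("retriever-segment-worker"), // hypothetical name
			Payload:      payload,
		})
		if err != nil {
			return nil, err
		}
		partials = append(partials, out.Payload)
	}
	return partials, nil
}
```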

See, our query time still goes up with the number of segments queried. That's not weird. But it's very sublinear. If you're running a 60-day query, and it's a hard one, you might get more than one sip in, but you're not going to have to go get another cup. Win. It turns out you can buy compute in 1-millisecond increments now (it used to be 100-millisecond increments). This is us scaling the compute, so that the time of the query doesn't scale with how much it's doing. We're throwing money at the problem, but very precisely, only when we need to.

Lambda Scales up Our Compute

We use Lambda to scale up compute in our database. We found that it's fast enough. Our median start time is like 50 milliseconds. My cup doesn't get very far in that amount of time. It's ok. We don't see much of a difference between hot and cold startups. They tend to return within two and a half seconds, which is acceptable. They are 3 or 4 times more expensive, but we run them 100 times less, at least, than we would an EC2 instance, for the same amount of compute, so this works out. There are caveats to all of these, or at least caveats that we overcame. Watch out.

We started doing this a little over a year ago, and at AWS, this was a new use case at the time for serverless. Because they designed it for web apps, they designed it as an on-demand backend. The scaling isn't exactly what we expected. The scaling for Lambda is, it'll go up to what is called the burst limit, which in US-East-1 is 500. In US-West-2 I think it's 3000. It varies by region. That burst limit is like 500 Lambdas. Then they stop scaling. Then, AWS says, if you have continuous load, over the next minute they will scale up, I think it might be linearly, I've drawn it as steps, to the concurrency limit, which is like 1000. The rest of the requests will get a 429 response, which means throttled, retry later. We hit this. Spending a minute scaling up by 500 more Lambdas is not helpful, because our usage pattern looks like this. We don't have a minute of sustained load. That doesn't help us at all, so we really needed our burst limit raised. We talked to AWS and they raised our burst limit. You can talk to your rep and you can get your burst limit raised into the tens of thousands now, and your concurrency limit raised as well. The trick is to not surprise your cloud provider. We were able to measure how many Lambdas we needed to run at a given time, or were running. In fact, we added this concurrency operator to count how many of a thing are happening at once, just for this purpose. Now that's available to everyone.
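
When you do hit the burst limit, the failure mode is a throttle (HTTP 429, surfaced by the Go SDK as a TooManyRequestsException). A small sketch of backing off and retrying only on throttles, with the caveat that the SDK has its own retry behavior and retries are no substitute for getting the limits raised for a genuinely spiky workload:

```go
package retriever

import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/lambda"
	"github.com/aws/aws-sdk-go-v2/service/lambda/types"
)

// invokeWithRetry retries only on throttles, with a short exponential backoff.
func invokeWithRetry(ctx context.Context, lc *lambda.Client, in *lambda.InvokeInput) (*lambda.InvokeOutput, error) {
	backoff := 50 * time.Millisecond
	for attempt := 0; ; attempt++ {
		out, err := lc.Invoke(ctx, in)
		var throttled *types.TooManyRequestsException
		if err == nil || !errors.As(err, &throttled) || attempt >= 4 {
			return out, err // success, a non-throttle error, or out of attempts
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```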

Startup: we need this to be fast. People talk about cold starts, warm starts. Is that a problem for us? It hasn't been. When you invoke a Lambda function, AWS may or may not have one of these processes already started up and ready. If not, it'll start up a new one and then invoke it. Then that one will hang out a little while, waiting to see if it gets some more invocations. You only get charged while it's running the code. You can see the difference between these. We can make a trace, and we do. We make a trace not only of our invocations, but of that wider Lambda process, because we emit a span when it wakes up and we emit a span right before the function goes to sleep. We can see run, sleep, run, sleep, run, sleep. You can actually follow what's going on in that process, even though during those sleeps, it's not actively doing anything. I think that's fun.

Generally, our startup is within 50 milliseconds, like you saw. This is in Go, so that helps. Here it goes. Here's the Lambda function process; you can see that this one hung out for a while. We can count the number currently running. We can use concurrency to count the number currently sleeping, and you can see that those are wider. That's just neat. What matters is that, when we invoke them, they start up quickly and they do their processing. They return within two-and-a-half seconds most of the time, 90% of the time, but definitely not 100%. You can see at the 30,000 millisecond, the 30-second, line in the middle of this graph there's a cluster: that's the S3 timeout. Lambda may be right next door to S3, but S3 does not always answer its knock. The trick to this is just don't wait that long. Start up another one with the same parameters, and hope you get a little luckier on the timing this time and S3 does respond. Watch out, because the default timeout in the Lambda SDK is like 30 seconds or longer; it's way too long. You do not want to use the default timeout. Make sure you give up before the data becomes irrelevant.
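
The "don't wait that long" advice translates to putting your own deadline on each invocation, far shorter than the SDK default, and re-issuing the same request if it expires. A sketch, with the 3-second deadline as a placeholder you would tune:

```go
package retriever

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

// invokeWithDeadline gives each attempt a short deadline and retries once
// with the same parameters if the first attempt times out.
func invokeWithDeadline(ctx context.Context, lc *lambda.Client, in *lambda.InvokeInput) (*lambda.InvokeOutput, error) {
	for attempt := 0; attempt < 2; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 3*time.Second) // placeholder; well under the SDK default
		out, err := lc.Invoke(attemptCtx, in)
		timedOut := attemptCtx.Err() == context.DeadlineExceeded
		cancel()
		if err == nil {
			return out, nil
		}
		if !timedOut {
			return nil, err // a real failure, not just our deadline expiring
		}
		// Timed out: fire again with the same parameters and hope S3 answers this time.
	}
	return nil, fmt.Errorf("lambda invocation timed out on every attempt")
}
```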

We did also find a peculiar restriction that the functions can't return more than 6 megabytes of data. Put the return value in S3 and respond with a link. Amazon has a limit for everything. That's healthy. They have boundaries. They will surprise you. You will find them. Also, when you try to send the functions data: we would like to send them binary data, but they only want JSON. There's weird stuff. JSON is not that efficient, and it's not exactly JSON, it's whatever AWS's Lambda JSON cop has decided is JSON. Don't deal with it. Put the input in S3 and send a link. This is fine.
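
A sketch of both workarounds from the function's side: accept a small request that points at the input in S3, and write the potentially large result to S3, returning only a small key. The bucket, key scheme, and runAggregations helper are made up; awslambda.Start (aws-lambda-go) and the S3 client calls (aws-sdk-go-v2) are real APIs.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"time"

	awslambda "github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type request struct {
	SegmentKeys []string `json:"segment_keys"`
	QuerySpecS3 string   `json:"query_spec_s3"` // input arrives as a link, not a big payload
}

type response struct {
	ResultS3Key string `json:"result_s3_key"` // output leaves as a link, well under 6 MB
}

var s3Client *s3.Client

// runAggregations is a stand-in for the real map-side work over the segments.
func runAggregations(ctx context.Context, req request) []byte { return []byte("{}") }

func handle(ctx context.Context, req request) (response, error) {
	result := runAggregations(ctx, req)
	key := fmt.Sprintf("intermediate/%d", time.Now().UnixNano())
	_, err := s3Client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("retriever-results"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(result),
	})
	if err != nil {
		return response{}, err
	}
	return response{ResultS3Key: key}, nil
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	s3Client = s3.NewFromConfig(cfg)
	awslambda.Start(handle)
}
```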

Finally, everyone knows that serverless is expensive. Per CPU second, it costs like three to four times what an EC2 instance would cost. Given that we're running it less than a hundredth of the time, that seems like a win. What can we do to keep that down? First of all, what really worries me about Lambda costs is that you don't know what they're going to be, because how many of these is your software going to invoke and suddenly spin up? What are the costs associated with that? Are you going to get surprised by a bill that's like a quarter of your AWS bill? Sometimes. This is where observability is also really important. Because we have spans that measure that invocation time, we can multiply that duration by how much we pay per second of Lambda invocation. We can add that up by customer, because all of our spans include customer ID as a dimension. Then we can get notified, and we do, whenever a particular customer uses more than $1,000 of Lambda in a day or an hour. Sometimes we get the account reps to reach out and talk to that customer and say, what are you doing? Here's a more effective way to accomplish what you're looking for. We throttle our API, and things like that. Really, the best you can do is find out quickly if you're going to get a big bill.
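
Honeycomb does this as queries and triggers over its own span data; the underlying arithmetic is simple enough to sketch in Go. The per-GB-second price below is an illustrative x86 figure, not a quote; check your own region and architecture.

```go
package costs

// usdPerGBSecond is illustrative; real Lambda pricing varies by region and
// by architecture (ARM is cheaper than x86).
const usdPerGBSecond = 0.0000166667

// estimateCostUSD rolls up estimated Lambda spend per customer from the
// invocation durations (in milliseconds) recorded on spans.
func estimateCostUSD(durationsMSByCustomer map[string][]float64, memoryMB float64) map[string]float64 {
	gb := memoryMB / 1024.0
	out := make(map[string]float64)
	for customer, durations := range durationsMSByCustomer {
		for _, ms := range durations {
			out[customer] += (ms / 1000.0) * gb * usdPerGBSecond
		}
	}
	return out
}
```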

Also, we do a ton of optimization. We do so much optimization of our Lambda execution, really of all of our major database processes, to get that speed. One way that we optimize is that we've moved from x86 to ARM, the Graviton2 processors, both for our Retrievers and our ingest and most of our other servers, but also for our Lambdas. Liz Fong-Jones, who's our field CTO now, has written several articles about how the ARM processors are both faster, in the sense that it takes less CPU to run the same work, and cheaper per CPU second. We get lower costs in two different ways. We can measure that. We started building our Lambda functions in Go for both x86 and ARM. The first time, we tried a 50-50 split, and we ran into some problems. Initially, the ARM64 Lambdas were a lot more varied in their performance, and overall slower. So we changed that feature flag and rolled this back, so we were running 1% on ARM processors and 99% on x86. You can see our ARM percentage: you can barely see the orange line at the end, after the feature flag was deployed.

Why So Slow?

Then we started investigating: why was it so slow? One reason was capacity. Even though we had our Lambda execution limits raised, there were only so many ARM processors available to run them. The total capacity in AWS for these is still lower than for x86 Lambdas. We had to work with AWS directly and create a capacity plan for when we would be able to shift more of our Lambdas to ARM. The next thing we noticed was that these were running slower because, at the time, the current Go was 1.17, and 1.17 had a particular optimization, passing function parameters in registers instead of having to put them in memory, that made calling functions faster, but only on x86. Because we're doing all these super complicated queries, and which filter are we doing, and which group by are we doing, and there's a lot of branching in what our aggregators are doing, there were a lot of function calls. A little bit of overhead on a function call went a long way. Go 1.18 also has this optimization on ARM, so we started using 1.18 a little bit early, just for our Lambdas, and that made a difference. Now Go is 1.19, it's fine. At the time, that was a significant discovery. We figured that out with profiling.

Also, through profiling, we noticed that the compression was taking a lot longer on ARM than on x86. It turned out that the LZ4 compression library had a native assembly implementation on x86, but one had not yet been released for ARM64. Liz spent a couple of afternoons porting the ARM32 assembly version of the LZ4 compression library to ARM64, got that out, and brought the performance more in line. These three considerations fixed the performance problems that we saw at the time. As for the capacity one, that's a gradual fix over time. Since then, since a year ago, we've been able to bump it up to 30% ARM. Then AWS called and said, "Try. Go for it." We bumped it up to like 99%, but then there were some regressions, so we dropped it down to 50, and that was ok. Then we got those fixed, bumped it up to 90, and gradually worked it up to 99%. Now we're there. We keep 1% on x86 just so we don't break it without noticing.

The performance is good. There's a little more variation in the purple x86 lines here, but that's just because they're 1%. The orange lines are ARM. Yes, the performance is the same. We also figured out, through profiling and observability, that on ARM, with the same CPU size as x86, it was fast enough that we'd actually hit network limitations. We scaled back the CPU by 20%. On fewer CPUs, we're getting the same performance. Also, those CPUs are 20% cheaper. This continued optimization is how we manage to spend money very strategically on our database CPU, so that people can get that interactive query timing, even over 60 days.

Try This at Home?

We scaled up our compute with Lambda; should you? Think about it. If you do, be sure to study your limits. Be sure to change the SDK retry parameters; don't wait 30 seconds for it to come back. Deployment is its own thing. We stub that out for automated tests. The only real test is production, so also test in production, with good observability. Observability is also really important for knowing how much you're spending, because you can really only find that out, again, in production, from minute to minute. Always talk to your cloud provider. Don't surprise them. Work this out with them. Talk about your capacity limits. A lot of them are adjustable, but not without warning. The question is, what should you do on serverless, and what should you not? Real-time bulk workloads. That's what we're doing. We're doing a lot of work while someone is waiting, in our database. It needs to be a lot of work, or don't bother; just run it on whatever computer you're already on. It needs to be urgent, like a human is waiting for it, or else there's no point spending the two to four times extra on serverless, unless you just really want to look cool or something. Just run a Kubernetes job, or run it on EC2, something like that, if it's not urgent.

Once you've got someone waiting on a whole lot of work, then what you're going to need to do is move the input to object storage. You've got to get all of the input that these functions need off of local disk, and somewhere in the cloud where they can access it. If they have to call back to Retriever to get the data, that wouldn't help. Then you've got to shard it. You've got to divide that up into work that can be done in parallel. It takes a lot of parallelism. The MapReduce algorithms that our Lambdas are using have this. Then you'll want to bring that data back together. You could do this in Lambda, but this can also be a bottleneck. We choose to do it outside of Lambda, on our persistent Retriever instances, which are also running on ARM for added savings.
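
A sketch of that last "bring it back together" step running on the long-lived instance rather than in a function: pull each worker's partial result out of object storage and fold them into one aggregate. The PartialResult shape and merge function are the same hypothetical stand-ins as in the earlier scatter-gather sketch.

```go
package retriever

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// PartialResult and merge are hypothetical, matching the earlier sketch.
type PartialResult struct {
	Count int64   `json:"count"`
	Sum   float64 `json:"sum"`
}

func merge(a, b PartialResult) PartialResult {
	return PartialResult{Count: a.Count + b.Count, Sum: a.Sum + b.Sum}
}

// reduceFromS3 fetches each worker's partial result and folds them together.
func reduceFromS3(ctx context.Context, s3c *s3.Client, bucket string, keys []string) (PartialResult, error) {
	total := PartialResult{}
	for _, key := range keys {
		obj, err := s3c.GetObject(ctx, &s3.GetObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
		})
		if err != nil {
			return PartialResult{}, err
		}
		var part PartialResult
		err = json.NewDecoder(obj.Body).Decode(&part)
		obj.Body.Close()
		if err != nil {
			return PartialResult{}, err
		}
		total = merge(total, part)
	}
	return total, nil
}
```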

Then you're going to have to do a lot of work. You're spending money on the serverless compute, so use it carefully. You're going to need to tune the parameters, like how many segments per invocation. What's the right amount of work for each Lambda execution? How many CPUs do you need on Lambda at a time? I think memory is connected to that. Watch out for things like being blocked on the network, where more CPU is not going to help you. You'll need to optimize properly, and that means performance-optimizing your code where it's needed. You'll need profiling. You definitely need observability. There's an OpenTelemetry layer, and it will wrap around your function and create the spans at the start and end. It's important to use a layer for this, because your function can't send anything after it returns. As soon as it returns, it's in sleep mode until it starts up again. The Lambda layer allows something to happen to report on the return of your function. Be sure to measure it really carefully, because that's how you're going to find out how much you're spending.
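
For a Go function, the in-process equivalent of that layer is wrapping the handler with the OpenTelemetry Lambda instrumentation, which starts a span per invocation and gives you a hook to flush telemetry before the process is frozen again. A minimal sketch; the tracer provider and exporter setup (pointed at Honeycomb or any OTLP endpoint) is elided, and the request and response types are made up.

```go
package main

import (
	"context"

	awslambda "github.com/aws/aws-lambda-go/lambda"
	"go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-lambda-go/otellambda"
)

type request struct{ SegmentKeys []string }
type response struct{ ResultS3Key string }

func handle(ctx context.Context, req request) (response, error) {
	// ... read segments from S3, aggregate, write the result, return a link ...
	return response{}, nil
}

func main() {
	// otellambda wraps the handler so a span covers each invocation; configure
	// a flusher/exporter so spans leave the process before it goes to sleep.
	awslambda.Start(otellambda.InstrumentHandler(handle))
}
```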

In the end, technology doesn't matter. It's not about using the latest hotness. The architecture doesn't matter. It's not about how cool a distributed column store is. What matters is that this gives something valuable to the people who use Honeycomb. We spend a ton of thought, a ton of development effort, a ton of optimization, a ton of observability, we put all of our brainpower and a lot of money into our serverless functions, all to preserve that one most precious resource, developer attention.

Resources

If you want to learn more, you can find me at honeycomb.io/office-hours, or on Twitter as jessitron, or you can read our book. Liz, and George, and Charity, all from Honeycomb, have written about how we do this, how we do observability and how we make it fast, in the "Observability Engineering" book. You can learn a lot more about Retriever in there.

Questions and Answers

Anand: I was wondering how much data we're talking about, when we say 60 days for large clients?

Kerr: I think it's in terabytes, but tens of terabytes, not petabytes.

Anand: What's the normal workflow for your customer using the Retriever function? What's their normal method? Let's say you have a customer, and they build a dashboard with charts. Do they basically say, this chart, I want it to be faster or more real time?

Kerr: We just work to make everything fast. You don't pick custom indexes. You don't pick which graphs to make fast. We aim to make all of them fast. Because we don't want you to be stuck with your dashboards. Yes, you can build a dashboard. That is a functionality that Honeycomb has. It's not what we're optimizing for, we're really optimizing for the interactive experience. You might start at your dashboard, but then we expect you to click on the graph, maybe change the time range, maybe compare it to last week, more likely group by a field, or several fields. Get little tables of the results as well as many lines on the graph. Then maybe click on, make a heatmap, or click on it and say, what's different about these? We're going to go on a series of queries to tell you that.

Anand: It's completely done on demand, in real time as the user is doing his or her analysis. It's not about optimizing the speed of a chart in a dashboard. It's all about the interactive part.

Kerr: Yes. We could go look in the database, what you have in your dashboards, but your dashboard queries are not any different from a live query.

Anand: Do you also speed those up with Retriever, the canned set of charts that people have?

Kerr: Yes. If you make a dashboard that's for a long period of time, and access it a lot, we're probably going to notice and maybe talk to you. If you're like updating that every 30 seconds, we're going to cache it for you. Because those are expensive queries.

When to use Lambda functions and when not to comes down to whether the data is in S3. If it's in S3, we're going to use a Lambda. If it's on local disk, then we're not. That's entirely determined by time. The time isn't the same for every dataset, though. If you have a smaller dataset, maybe all of the data is on local disk. As it gets bigger, a larger percentage of that data is in S3.
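
That routing rule is a simple check per segment. A tiny sketch with made-up types, just to pin down the logic being described:

```go
package retriever

// Segment metadata, as in the earlier sketches (hypothetical shape).
type Segment struct {
	Key  string
	InS3 bool
}

// planSegments splits a query's segments into the ones read from local disk
// in-process and the ones handed to Lambda workers because they live in S3.
func planSegments(segments []Segment) (local, viaLambda []Segment) {
	for _, s := range segments {
		if s.InS3 {
			viaLambda = append(viaLambda, s)
		} else {
			local = append(local, s)
		}
	}
	return local, viaLambda
}
```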

Anand: It's based on the dataset size. Then, do you move the data to S3, like behind the scenes?

Kerr: Retriever does.

Anand: You get to decide how much data you hold? Do I decide if I want six months of data?

Kerr: You can negotiate that a little bit with your contract. I think we have a few exceptions where customers keep more than 60 days of data in particular datasets. At Honeycomb, we keep things pretty simple. Pretty much everybody had 60 days of data. How much of that is in local disk is like a fixed amount per dataset, roughly. Some datasets have more partitions than others, and so they'd have correspondingly more data on local disk, but it's all invisible to the customer. You don't know when we're using Lambda.

Anand: Can you elaborate on what makes Lambdas hard to test?

Kerr: You can test the code inside the Lambda. You can unit test that. It's just fine. Actually testing whether it works once it's uploaded to AWS, like integration testing Lambdas, is really hard. You can't do that locally. You can't do that in a test environment. You can do that in a test version that you uploaded to AWS, but that's really slow. Honeycomb is all about test in production. Not that we only test in production, but we also test in production, and we notice when things break, and then we roll back quickly. The other thing we do is we deploy to our internal environment first. Our internal environment is not test, is not staging; it's a completely separate environment of Honeycomb that we're the only customer of. There's production Honeycomb, which monitors everybody else's stuff, all of our customers' observability data. Then there's our version of Honeycomb that just receives data from production at Honeycomb. We call it dog food, because we use it to eat our own dog food. The dog food Honeycomb, we deploy to that first. Technically, there's another one that monitors dog food, but close enough. If we broke the interface between Retriever, on EC2, and the Lambdas, or anything else about the Lambdas that we couldn't unit test, we'll notice it very quickly. We even have deployment gates: normally, deployment to production would just happen 20 minutes later, but if our SLOs don't match, if we get too many errors in dog food, we'll automatically stop the deploy to production. We test in prod, but it's a smaller version of prod. It's not all of prod. It's limited rollout.

Anand: How do you compare Lambdas to Knative?

Kerr: I've never tried anything in Knative. We do use Kubernetes in Honeycomb.

Anand: Are you using Kubernetes over EKS, the Elastic Kubernetes Service, for the control plane?

Kerr: EKS, yes.

Anand: Does it sometimes make sense to use a platform agnostic language like Java that may help avoid issues with suboptimal libraries that are not yet ported to native CPU architecture?

Kerr: Sometimes, absolutely. It depends on your business case. We're doing something really specialized in this custom database. In general, don't write your own database. In general, don't optimize your code this much for business software. This is the secret sauce that makes Honeycomb special. This is what makes it possible for you to not have to decide which queries you want fast; they're just all fast. It's a dynamically generated schema: we don't even know what fields you're going to send us, just that it's going to be fast. It's super specialized. At the scale that we do these particular operations, it's expensive. That's where a significant portion of our costs are in AWS, and a significant chunk of that is Lambda. We are constantly optimizing to run really lean in AWS, and we watch that closely. Liz Fong-Jones is always noticing something that could be faster and could save us tens of thousands of dollars a month, which is significant at our size.

Anand: Is your entire platform written in Go?

Kerr: Pretty much. The frontend is TypeScript.

Anand: What are your timeouts? A user types in a query in the UI, how long will they wait? Will they wait as long as it takes to get a result, but you try to be as fast as possible?

Kerr: It'll time out after 5 minutes. If it takes that long, there's a bug. More likely something went down. The user waits until there's a little spinny thing. Then, all the queries will populate all at once, when the results have been aggregated and sent back. Usually, it's like 5 seconds on a long one, and 2 seconds on a typical query.

Anand: This is the Holy Grail, like call cancellation: someone closes the window, you have to cancel the workload. Everyone wants to do it, and never gets around to it.

Kerr: It'll finish what it's doing; there's really no stopping it, because it's doing everything at once already. We will cache the results if somebody runs that exact query with that exact timespan again. Those results are actually stored in S3. This makes permalinks work. Those results are stored forever, so the queries that you've already run and put into your incident review notes will always work. You just won't be able to drill further in once the data has timed out.

Anand: What's the time granularity of your buckets, like your timestamp?

Kerr: You can set that on the graph within a range. It'll usually start at a second but you can make it 30 minutes, you can make it 5 milliseconds, depending on your time range. You're not going to do 5 milliseconds for a 60-day query, no, but appropriately.

Anand: They could group by second. You support group bys where they have 1-second buckets.

Kerr: That's just bucketing for the heatmaps. Group by is more like a SQL group by where you can group by account ID, you can group by region, any of the attributes. It'll divide those out separately, and show you a heatmap that you can hover over.

 


 

Recorded at:

Jul 07, 2023
