Honeycomb - A Tool for Debugging Complex Systems

Honeycomb is a tool for observing and correlating events in distributed systems. It provides a different approach from existing tools like Zipkin in that it moves away from the single-request-tracing model to a more free-form model of collecting and querying data across layers and dimensions.

How does Honeycomb differ from software like Zipkin, a distributed tracing system based on the Google Dapper paper, written and open-sourced by Twitter? InfoQ got in touch with Charity Majors, co-founder at Honeycomb, to learn more about the product. Instead of working with the globally unique UUIDs that are used to trace requests, Majors says that “what's generally useful for everyone is some kind of user ID or application ID, plus other types of IDs for grouping families of requests that share characteristics you may want to calculate or aggregate on.”

What does this mean in practice? Request-based tracing tools like Zipkin assume that there is a unique ID attached to each request. From the time the request enters the system, the ID is propagated through the various subsystem calls (which may be calls to microservices) triggered as a result of the initial call. If this ID is logged at every step, and there is a central place where the logs are aggregated and indexed, it becomes easy to search for and trace a particular request as it moves through the system, given its request ID. One example of such a log aggregator is ELK (Elasticsearch/Logstash/Kibana).
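This ID-propagation model can be illustrated with a minimal sketch; the service names and the in-memory "aggregator" below are invented for illustration (in a real system the logs would go to something like Elasticsearch):

```python
import uuid

# A stand-in for a central log aggregator such as ELK.
LOGS = []

def log(request_id, service, message):
    LOGS.append({"request_id": request_id, "service": service, "message": message})

def billing_service(request_id):
    log(request_id, "billing", "charged card")

def order_service(request_id):
    log(request_id, "orders", "order created")
    billing_service(request_id)  # the ID is propagated to downstream calls

def handle_request():
    request_id = str(uuid.uuid4())  # assigned when the request enters the system
    log(request_id, "gateway", "request received")
    order_service(request_id)
    return request_id

# Tracing a single request is then just filtering the logs by its ID.
rid = handle_request()
trace = [entry for entry in LOGS if entry["request_id"] == rid]
```

The trace recovers the full path of one request through all three "services", which is exactly what tools like Zipkin make searchable at scale.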

Honeycomb breaks away from this model by attempting to capture data at every level, such as the load balancer, microservices and databases, tagging the data, and letting the user mix and match it and run ad-hoc queries on it later. Majors explains that Honeycomb takes this approach because tracing by itself “leaves you with the looming question of which requests are representative and thus worth looking into in the first place.” Once the data is present in Honeycomb, the user can tie together data from different layers and systems, aggregate it, and apply functions to it across a period of time to understand performance. For example, an increase in response times for a request spanning multiple systems might be due to a collective effect from more than one factor, including time. This is not easy to do with request tracing, since a request represents a single related thread of events in a given span of time.
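The mix-and-match querying can be sketched as grouping free-form events on any field and aggregating; the events, field names and aggregation below are invented for illustration, not Honeycomb's implementation:

```python
from collections import defaultdict

# Hypothetical events: each is a free-form set of fields, not a trace.
events = [
    {"user_id": "u1", "endpoint": "/docs/", "region": "us-east", "latency_ms": 13.1},
    {"user_id": "u2", "endpoint": "/docs/", "region": "eu-west", "latency_ms": 48.0},
    {"user_id": "u1", "endpoint": "/api/",  "region": "us-east", "latency_ms": 7.2},
    {"user_id": "u3", "endpoint": "/docs/", "region": "eu-west", "latency_ms": 52.5},
]

def avg_latency_by(events, field):
    """Ad-hoc query: group events on any dimension and aggregate."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e["latency_ms"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# The same events can be sliced along any dimension after the fact.
by_endpoint = avg_latency_by(events, "endpoint")
by_region = avg_latency_by(events, "region")
```

Because every field is queryable, there is no need to decide up front (as with a trace ID) which dimension the data will later be sliced on.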

Data can be sent to Honeycomb via API calls, with events posted to a per-dataset endpoint. An example of an API call to log data about a web request looks like

curl -X POST https://api.honeycomb.io/1/events/Your-Dataset \
  -H "X-Honeycomb-Team: YOUR_WRITE_KEY" \
  -d '{"status":200,"path":"/docs/","latency_ms":13.1,"cached":false}'

The “-d” parameter can take a JSON object with any app-specific information that can be used later in queries. Data is collected as a series of events, where each event is something that should be tracked. Events can be tied together under a single entity called a ‘Dataset’.
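The same event can be built programmatically; the sketch below constructs the JSON body from the curl example, where the extra user_id field and the dataset name are invented placeholders:

```python
import json

# The event is just an app-specific dict; any fields added here
# become dimensions that can be queried on later.
event = {
    "status": 200,
    "path": "/docs/",
    "latency_ms": 13.1,
    "cached": False,
    "user_id": "u42",  # extra app-specific field, invented for illustration
}
body = json.dumps(event)

# Events that belong together are grouped under a Dataset, which in the
# HTTP API appears in the URL (dataset name here is a placeholder).
url = "https://api.honeycomb.io/1/events/Your-Dataset"
```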

Honeycomb can be integrated with applications via what it calls ‘Connectors’. A connector is an adapter that pulls data from a specific piece of software and pushes it to Honeycomb. Integration can also be done with SDKs, as well as with a tool called honeytail that can push data from existing logs into Honeycomb.
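At its core, what honeytail and connectors do is turn existing log lines into structured events. A minimal sketch of that idea, with an invented log format and field names:

```python
import re

# A hypothetical access-log line: timestamp, method, path, status, latency.
LINE = "2017-05-04T10:00:00Z GET /docs/ 200 13.1"

PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<method>\S+) (?P<path>\S+) "
    r"(?P<status>\d+) (?P<latency_ms>[\d.]+)"
)

def parse_line(line):
    """Turn one flat log line into a structured, typed event."""
    m = PATTERN.match(line)
    event = m.groupdict()
    event["status"] = int(event["status"])
    event["latency_ms"] = float(event["latency_ms"])
    return event

event = parse_line(LINE)
```

Once parsed, each event could be posted to the API as in the earlier curl example.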

To add to the context of the data being collected, Honeycomb also marks events that are triggered by a human operator or by something like cron: a deployment, a script, or a one-off action. These actions show up as vertical lines with additional information attached to them, such as who ran the script and a link to the deployed code. This is similar to what Etsy’s operations team did with Graphite, though those markers lacked contextual information.

Since Honeycomb collects a lot of data, how do they deal with querying it at scale? Majors says that right now they're "focusing on recent debugging scenarios, which allows us to employ some powerful tricks around sampling retention when close to 100% of the queries people issue are for the past week or two."
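Majors does not detail those tricks, but one common form of sampling in event pipelines (not necessarily Honeycomb's) is deterministic hash-based sampling, sketched below: every collector makes the same keep/drop decision for a given event ID, so a consistent 1-in-N subset survives.

```python
import hashlib

def keep(event_id, sample_rate):
    """Deterministically keep roughly 1 in sample_rate events.

    Hashing the event ID (rather than using random.random()) means the
    decision is reproducible: all collectors agree on the same events.
    """
    h = int(hashlib.sha256(event_id.encode()).hexdigest(), 16)
    return h % sample_rate == 0

# Out of 1000 hypothetical events at a sample rate of 10,
# roughly 100 are retained.
sampled = [i for i in range(1000) if keep(f"event-{i}", 10)]
```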

To handle the massive amounts of data, Honeycomb uses their own column store:

We looked into a number of existing solutions when we started building Honeycomb, but none of them were quite right. We ultimately found that most of the pre-built solutions in the space made tradeoffs for functionality that we didn't need (e.g. transactions) and sacrificed functionality that we thought was essential (e.g. being able to quickly access a raw input event).

Honeycomb does not yet support integration with other alerting systems like Nagios/Zabbix/PagerDuty. Signups to the service are currently by invite only.