Chaperone - A Kafka Auditing Tool from the Uber Engineering Team

by Hrishikesh Barua on Jan 26, 2017. Estimated reading time: 2 minutes

The Uber Engineering team released a Kafka auditing tool called Chaperone as an open-source project. Chaperone audits for and detects data loss, latency, and duplication of messages in Uber's high-volume, multi-datacenter Kafka setup.

Uber’s Kafka data pipeline traverses multiple datacenters and is optimized for high throughput. Uber’s systems generate a large volume of log messages from service calls and events, and the services that produce them run in active-active mode across datacenters. The data flowing through the pipeline is used both for batch processing and for real-time analysis.

Kafka acts as a message bus connecting different parts of Uber's systems, together with a tool called uReplicator. uReplicator is a Kafka replicator based on the design of Kafka’s MirrorMaker, which can be used to create copies of existing Kafka clusters. Log messages are pushed to Kafka proxies, which aggregate them and push them to a “regional” Kafka cluster local to that datacenter. Consumers read messages both from the regional Kafka clusters and from a combined Kafka setup that holds data from multiple datacenters. Chaperone was built to fulfill the need to audit these messages in real time.

Chaperone's primary job is to detect anomalies like data loss, lag or duplication as the data flows through the pipeline. It has four components:

  • The AuditLibrary, which collects, aggregates, and outputs statistics as audit messages from each application. The library uses “tumbling windows” to aggregate the information into audit messages. A tumbling window, as used in stream processing systems like Apache Flink, groups streaming data into fixed-size, non-overlapping segments.
  • The ChaperoneService, which consumes every message from Kafka, records its timestamp, and produces audit messages that are pushed to a dedicated Kafka topic.
  • The ChaperoneCollector, which takes the audit messages produced by the ChaperoneService, persists them to a database, and displays them on user-facing dashboards. The dashboards make it easy to detect and pinpoint message drops as well as latencies.
  • The WebService, which exposes REST APIs that can be queried to fetch and process the audit data.
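The tumbling-window aggregation described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Chaperone's actual code: message timestamps are aligned to fixed, non-overlapping windows (a 10-minute size is assumed here), and each bucket becomes one audit record.

```python
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # assumed 10-minute tumbling window


def window_start(event_ts_ms):
    """Align a message timestamp to the start of its tumbling window."""
    return event_ts_ms - (event_ts_ms % WINDOW_MS)


def aggregate(messages):
    """Group (topic, timestamp_ms) pairs into per-window audit records."""
    buckets = defaultdict(int)
    for topic, ts in messages:
        buckets[(topic, window_start(ts))] += 1
    return [
        {"topic": topic, "window_start_ms": start, "count": count}
        for (topic, start), count in sorted(buckets.items())
    ]


audits = aggregate([
    ("rider-events", 1_000_000),  # window starting at 600_000
    ("rider-events", 1_100_000),  # same window
    ("rider-events", 1_300_000),  # next window, starting at 1_200_000
])
```

Because the windows never overlap, each message is counted in exactly one audit record, which is what makes counts from different pipeline stages comparable.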

In implementing Chaperone, the team had to guarantee the accuracy of the auditing data itself. One strategy was to ensure that each message is audited exactly once, using a write-ahead logging (WAL) technique. The WAL records an audit message before the ChaperoneService sends it to Kafka, allowing the service to replay any unsent messages if it crashes. This is a common technique used in databases such as PostgreSQL.
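A minimal WAL sketch, assuming a local append-only file (this is not Chaperone's actual implementation): each audit message is appended and fsynced before any attempt to send it, so entries can be replayed after a crash.

```python
import json
import os
import tempfile


class AuditWAL:
    """Toy write-ahead log: persist first, send later, replay on restart."""

    def __init__(self, path):
        self.path = path

    def log(self, audit_msg):
        """Durably record the audit message *before* it is sent to Kafka."""
        with open(self.path, "a") as f:
            f.write(json.dumps(audit_msg) + "\n")
            f.flush()
            os.fsync(f.fileno())  # survive a process crash

    def replay(self):
        """Return every logged audit message, e.g. for recovery after a crash."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]


wal = AuditWAL(os.path.join(tempfile.mkdtemp(), "audit.wal"))
wal.log({"topic": "rider-events", "window_start_ms": 600_000, "count": 2})
recovered = wal.replay()
```

A production WAL would also track which entries were successfully acknowledged by Kafka so that replay does not resend them, preserving the exactly-once property.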

The other strategy was to have a consistent timestamp irrespective of where an audit message is processed in the various stages. This is not a completely solved problem in Chaperone yet. A combination of techniques is used based on the message encoding. For Avro-schema encoded messages, the timestamp can be extracted in constant time. For JSON messages, the Chaperone team wrote a stream-based JSON parser that reads just the timestamp without decoding the entire JSON message. In the proxy client and server tiers, the processing timestamp is used.
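The idea behind the stream-based JSON parser can be illustrated with a compiled regex over the raw bytes; Chaperone's actual parser is stream-based rather than regex-based, but the principle is the same: find the timestamp field and stop, without building the full object tree. The field name `timestamp` is assumed here.

```python
import re

# Matches an integer "timestamp" field in a raw JSON payload. Note: a regex
# can be fooled by the literal string "timestamp" appearing inside another
# value; a real stream parser tracks string/object context to avoid this.
TS_PATTERN = re.compile(rb'"timestamp"\s*:\s*(\d+)')


def extract_timestamp(raw: bytes):
    """Return the integer timestamp, or None if the field is absent."""
    m = TS_PATTERN.search(raw)
    return int(m.group(1)) if m else None


ts = extract_timestamp(b'{"timestamp": 1485388800000, "body": {"msg": "hi"}}')
```

The benefit is that the scan stops at the first match instead of decoding every field, which matters when the service must touch every message flowing through the pipeline.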

Apart from detecting data loss, Chaperone also allows reading data from Kafka based on timestamps instead of offsets. This lets users read messages in any time range, regardless of whether they have already been consumed. Chaperone can therefore also function as a debugging tool, since it allows the user to dump the messages read for further analysis.
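The mechanics of a timestamp-based read can be sketched with a binary search over a timestamp-to-offset index (assumed mechanics for illustration, not Chaperone's code): given (timestamp, offset) pairs in timestamp order, the first offset inside a time range is found in logarithmic time, and the consumer seeks there.

```python
import bisect

# Hypothetical per-partition index of (timestamp_ms, offset) pairs,
# sorted by timestamp.
index = [(1_000, 0), (2_000, 1), (3_500, 2), (7_000, 3)]
timestamps = [ts for ts, _ in index]


def offsets_for_range(start_ms, end_ms):
    """Return offsets of messages with start_ms <= timestamp < end_ms."""
    lo = bisect.bisect_left(timestamps, start_ms)
    hi = bisect.bisect_left(timestamps, end_ms)
    return [off for _, off in index[lo:hi]]


offsets = offsets_for_range(2_000, 7_000)
```

Kafka itself later exposed a similar capability through the consumer's `offsetsForTimes` API, which maps a timestamp to the earliest offset with an equal or greater timestamp.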

The Chaperone source code is available on GitHub.

Project without documentation by Rainer Guessner

Documentation is 1 page?

Re: Project without documentation by Hrishikesh Barua

The README in the GitHub project is just a writeup of how to get a basic setup of Chaperone up and running. I guess creating an issue requesting more specific documentation might help.
