The Uber Engineering team released a Kafka auditing tool called Chaperone as an open-source project. Chaperone audits Uber's high-volume, multi-datacenter Kafka setup to detect data loss, latency, and duplication of messages.
Uber’s Kafka data pipeline traverses multiple datacenters. Uber’s systems generate large volumes of log messages from service call logs and events, so the pipeline is optimized for high throughput. These services run in active-active mode across datacenters. The data flowing through Uber’s Kafka pipeline is used both for batch processing and for real-time analysis.
Kafka acts as a message bus connecting different parts of Uber’s systems, together with a tool called uReplicator. uReplicator is a Kafka replicator based on the design of Kafka’s MirrorMaker, which mirrors data between existing Kafka clusters. Log messages are pushed to Kafka proxies, which aggregate them and push them to a “regional” Kafka cluster local to that datacenter. Consumers read messages from both the regional Kafka clusters and a combined Kafka setup that holds data from multiple datacenters. Chaperone was built to audit these messages in real time.
Chaperone's primary job is to detect anomalies such as data loss, lag, or duplication as the data flows through the pipeline. It has four components:
- The AuditLibrary, which collects and aggregates statistics from each application and outputs them as audit messages. The library uses “tumbling windows” to aggregate the information into audit messages and send them to a dedicated Kafka topic. A tumbling window, as used in stream processing systems like Apache Flink, groups streaming data into non-overlapping, fixed-size time segments (see the sketch after this list).
- The ChaperoneService, which consumes every message from Kafka, records its timestamp, and produces audit messages that are pushed to a dedicated Kafka topic.
- The ChaperoneCollector, which takes the audit messages produced by the ChaperoneService and persists them to a database, from which they are shown on user-facing dashboards. The dashboards make it easy to detect and pinpoint message drops as well as latencies.
- The WebService, which exposes REST APIs that can be queried to fetch and process the data.
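To illustrate the tumbling-window concept mentioned above, here is a minimal sketch of how per-window aggregation can work: each message falls into exactly one fixed-size, non-overlapping window based on its timestamp, and the per-window counts would eventually be emitted as audit messages. The class and field names are illustrative, not Chaperone's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of tumbling-window aggregation: each message is assigned
// to exactly one fixed-size, non-overlapping window based on its timestamp.
public class TumblingWindowAggregator {
    private final long windowSizeMs;
    // window start timestamp -> number of messages observed in that window
    private final Map<Long, Long> countsByWindow = new HashMap<>();

    public TumblingWindowAggregator(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    public void record(long eventTimestampMs) {
        // Truncate the timestamp to the start of its window; windows never overlap.
        long windowStart = (eventTimestampMs / windowSizeMs) * windowSizeMs;
        countsByWindow.merge(windowStart, 1L, Long::sum);
    }

    // A completed window's counts would be emitted as an audit message
    // to a dedicated Kafka topic and then cleared from memory.
    public Map<Long, Long> snapshot() {
        return new HashMap<>(countsByWindow);
    }
}
```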
In implementing Chaperone, the team had to guarantee the accuracy of the auditing data. One strategy to achieve this was to ensure that each message is audited exactly once. A write-ahead logging (WAL) technique was used for this: the WAL records an audit message before it is sent to Kafka from the ChaperoneService, allowing the service to replay any unacknowledged audit messages after a crash. This is a common technique in databases such as PostgreSQL.
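The following is a minimal sketch of the write-ahead ordering described above, assuming a hypothetical sendToKafka() producer call and a local WAL file: the audit message is persisted locally first, sent to Kafka second, and any entries still pending are replayed on restart. Chaperone's real WAL additionally needs acknowledgement tracking, deduplication on replay, and log truncation, all of which are omitted here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Simplified write-ahead-log flow: persist locally first, send second,
// replay anything still in the log after a restart.
public class AuditWal {
    private final Path walFile = Paths.get("/var/tmp/chaperone-audit.wal"); // placeholder path

    public void audit(String auditMessageJson) throws IOException {
        // 1. Record the audit message durably before attempting the Kafka send.
        Files.write(walFile, (auditMessageJson + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
        // 2. Send to the audit topic; if the process crashes here, the WAL entry survives.
        sendToKafka(auditMessageJson);
        // 3. On acknowledgement, a real implementation would mark the entry as sent.
    }

    public void recover() throws IOException {
        // On restart, resend every entry that was logged but never acknowledged,
        // so no audit message is lost; deduplication is needed for exactly-once.
        if (!Files.exists(walFile)) {
            return;
        }
        List<String> pending = Files.readAllLines(walFile, StandardCharsets.UTF_8);
        for (String entry : pending) {
            sendToKafka(entry);
        }
    }

    private void sendToKafka(String message) {
        // Placeholder for a KafkaProducer.send() call to the audit topic.
    }
}
```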
The other strategy was to use a consistent timestamp regardless of where an audit message is processed across the various stages. This is not a completely solved problem in Chaperone yet; a combination of techniques is used depending on the message encoding. For Avro-schema-encoded messages, the timestamp can be extracted in constant time. For JSON messages, the Chaperone team wrote a stream-based JSON parser that reads just the timestamp without decoding the entire JSON message. In the proxy client and server tiers, the processing timestamp is used.
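As a rough illustration of stream-based timestamp extraction, the sketch below uses Jackson's streaming parser to scan top-level fields and stop as soon as the timestamp is read. The field name "ts" and the choice of Jackson are assumptions for this example, not details of Chaperone's own parser.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;

// Stream-based timestamp extraction: scan tokens until the timestamp field is found,
// then stop, instead of materializing the whole JSON document.
public class JsonTimestampExtractor {
    private static final JsonFactory FACTORY = new JsonFactory();

    public static long extractTimestamp(byte[] jsonMessage) throws IOException {
        try (JsonParser parser = FACTORY.createParser(jsonMessage)) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                return -1L; // not a JSON object
            }
            while (parser.nextToken() == JsonToken.FIELD_NAME) {
                String fieldName = parser.getCurrentName();
                parser.nextToken(); // advance to this field's value
                if ("ts".equals(fieldName)) {
                    return parser.getLongValue(); // stop here; the rest is never parsed
                }
                parser.skipChildren(); // skip nested objects/arrays; no-op for scalar values
            }
        }
        return -1L; // timestamp field not found
    }
}
```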
Apart from detecting data loss, Chaperone also allows reading data from Kafka based on timestamps instead of offsets. This lets users read messages in any time range irrespective of whether they have been consumed or not. Thus, Chaperone can also function as a debugging tool since it allows the user to dump the read messages for further analysis.
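Chaperone provides timestamp-based reads through its own service; purely as a general illustration of the "seek to a point in time, then consume" idea, the sketch below uses the standard Kafka consumer's offsetsForTimes API (available since Kafka 0.10.1). The topic name, partition, and configuration are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

// Reads messages starting from a point in time rather than a known offset.
public class TimeRangeReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "time-range-reader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        long startOfRangeMs = System.currentTimeMillis() - Duration.ofHours(1).toMillis();
        TopicPartition partition = new TopicPartition("service-logs", 0);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            // Translate the timestamp into the earliest offset at or after that time.
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(partition, startOfRangeMs));
            OffsetAndTimestamp result = offsets.get(partition);
            if (result != null) {
                consumer.seek(partition, result.offset());
                // Messages from the requested time range are returned regardless of
                // whether they were consumed before.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                System.out.println("Fetched " + records.count() + " messages from the time range");
            }
        }
    }
}
```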
The Chaperone source code is available on GitHub.