Uber Reduces Logging Costs by 169x Using Compressed Log Processor (CLP)

Uber recently described how it dramatically reduced its logging costs using Compressed Log Processor (CLP), a tool that losslessly compresses text logs and can search them without decompression. CLP achieved a 169x compression ratio on Uber's log data, saving storage, memory, and disk/network bandwidth.

Uber runs 250,000 Spark analytics jobs per day, generating up to 200TB of logs daily. These logs are critical to the platform engineers and data scientists who use Spark: analysing them helps improve application quality, troubleshoot failures or slowdowns, analyse trends, and monitor anomalies. As a result, Spark users at Uber frequently asked for the log retention period to be increased from three days to a month. However, doing so would have increased Uber's HDFS storage costs from $180K to $1.8M per year.
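The jump from $180K to $1.8M follows from simple proportional scaling. The sketch below reproduces the figure under two assumptions that the article does not state explicitly: that storage cost grows linearly with the retention period, and that "a month" means 30 days. The function name is hypothetical.

```python
def scaled_annual_cost(current_cost_usd, current_days, target_days):
    """Scale an annual storage cost linearly with the retention period (assumption)."""
    return current_cost_usd * target_days / current_days


# Three days of retention at $180K per year, extended to 30 days.
month_cost = scaled_annual_cost(180_000, current_days=3, target_days=30)
print(f"${month_cost:,.0f} per year")
```

Under these assumptions, a tenfold longer retention period means a tenfold higher bill, which matches the numbers quoted above.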

Instead, by partially implementing CLP, Uber reduced the storage cost to $10K a year even with the retention period extended to a month. The authors, Jack (Yu) Luo and Devesh Agrawal, explain their work:

"We found CLP, a tool with unprecedented compression (2x of general-purpose compressors) while preserving the ability to search the logs without full decompression. CLP required some customisation since it was designed to compress batches of files at a time, whereas our logging library writes a single log file at a time. Specifically, we split CLP's algorithm into two phases: Phase 1 is suitable for compressing a single log file at a time while achieving modest compression; Phase 2 aggregates these compressed files into CLP's final format."

Splitting CLP Compression deployment into two phases at Uber
Source: https://www.uber.com/en-DE/blog/reducing-logging-cost-by-two-orders-of-magnitude-using-clp/

Luo and the other researchers behind CLP, Kirk Rodrigues and Ding Yuan from the University of Toronto and YScope, describe CLP in their research paper:

"Widely used log-search tools like Elasticsearch and Splunk Enterprise index the logs to provide fast search performance, yet the size of the index is within the same order of magnitude as the raw log size. Commonly used log archival and compression tools like Gzip provide high compression ratio, yet searching archived logs is a slow and painful process as it first requires decompressing the logs. In contrast, CLP achieves significantly higher compression ratio than all commonly used compressors, yet delivers fast search performance that is comparable or even better than Elasticsearch and Splunk Enterprise. [...] CLP's gains come from using a tuned, domain-specific compression and search algorithm that exploits the significant amount of repetition in text logs. Hence, CLP enables efficient search and analytics on archived logs, something that was impossible without it."

How CLP compresses a log message in four steps
Source: https://www.uber.com/en-DE/blog/reducing-logging-cost-by-two-orders-of-magnitude-using-clp/

The above figure shows how CLP compresses a log message in four steps. In the first step, CLP deterministically parses the message into a timestamp, a list of variable values, and the log type. Next, CLP encodes the timestamp and non-dictionary variables. Then, CLP builds a dictionary to deduplicate repetitive variables. Finally, CLP converts the log messages into a table of encoded messages consisting of the timestamp, a list of variable values (either the variable dictionary IDs or encoded non-dictionary values), and the log type ID. Once many log messages are buffered, each column is compressed (in column-oriented order) using Zstandard.
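The four steps can be illustrated with a small Python sketch. It is a simplified illustration and not CLP's actual implementation: the timestamp layout, the placeholder characters, the rule that any token containing a digit is a variable, and all names (`Dictionary`, `encode`, `decode`) are assumptions made for the example. The final Zstandard compression of each column is omitted.

```python
import re
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
TS_LENGTH = 19                   # length of a timestamp in that layout
DICT_VAR = "\x11"                # placeholder for a dictionary variable
NUM_VAR = "\x12"                 # placeholder for a directly encoded integer


class Dictionary:
    """Maps each distinct string to a small integer ID."""

    def __init__(self):
        self.ids = {}
        self.values = []

    def add(self, value):
        if value not in self.ids:
            self.ids[value] = len(self.values)
            self.values.append(value)
        return self.ids[value]


def encode(line, log_types, var_dict):
    """Turn one log line into (timestamp, variable values, log type ID)."""
    parsed = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    timestamp = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    template, variables = [], []
    for token in line[TS_LENGTH + 1:].split(" "):
        if re.fullmatch(r"-?\d+", token) and str(int(token)) == token:
            variables.append(int(token))           # non-dictionary variable
            template.append(NUM_VAR)
        elif any(ch.isdigit() for ch in token):
            variables.append(var_dict.add(token))  # deduplicated via dictionary
            template.append(DICT_VAR)
        else:
            template.append(token)                 # static text of the log type
    return timestamp, variables, log_types.add(" ".join(template))


def decode(row, log_types, var_dict):
    """Rebuild the original line, showing that the encoding is lossless."""
    timestamp, variables, log_type_id = row
    values = iter(variables)
    tokens = []
    for token in log_types.values[log_type_id].split(" "):
        if token == NUM_VAR:
            tokens.append(str(next(values)))
        elif token == DICT_VAR:
            tokens.append(var_dict.values[next(values)])
        else:
            tokens.append(token)
    text = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime(TS_FORMAT)
    return text + " " + " ".join(tokens)


lines = [
    "2023-01-15 10:00:01 Task task_12 assigned to container container_15 in 250 ms",
    "2023-01-15 10:00:02 Task task_13 assigned to container container_15 in 300 ms",
]
log_types, var_dict = Dictionary(), Dictionary()
rows = [encode(line, log_types, var_dict) for line in lines]
columns = list(zip(*rows))  # timestamps, variable lists, log type IDs
```

In this sketch, both example lines share a single log type and the repeated value `container_15` is stored once in the variable dictionary, which is the kind of repetition the approach exploits. Calling `decode` on each row returns the original line.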

In the future, Uber engineers plan to deploy CLP's Phase 2 compression, which they expect to reduce storage costs by a further 2x. In addition, they plan to store the compressed logs in a columnar storage format such as Parquet, potentially integrating with Presto to analyse logs interactively with SQL queries.

About the Author

Eran Stiller
