Amazon Switched Compression from Gzip to Zstd for Own Service Data

A tweet from Adrian Cockcroft, former VP at Amazon, recently highlighted the benefits of switching from gzip to Zstandard compression at Amazon and triggered discussions in the community about the compression algorithm. Other large companies, including Twitter and Honeycomb, reported significant gains from using zstd.

Analyzing the savings at Twitter, Dan Luu started the conversation by tweeting:

I wonder how much waste has been eliminated by Yann Collet creating zstd. When I ran the numbers at Twitter, which is tiny compared to the huge tech companies, switching from HDFS to zstd was ~ mid 8 figs/yr. Across the world (not annualized), it seems like it must be >= 9 figs?

Cockcroft replied:

A lot was saved at AWS switching from gzip to zstd - about 30% reduction in compressed S3 storage, exabyte scale.

Zstandard, better known by its C implementation zstd, is a lossless data compression algorithm developed by Yann Collet at Facebook that provides a high compression ratio with very good performance across diverse datasets. Distributed as open source software under a BSD license, the reference library offers a wide range of speed / compression trade-offs with an extremely fast decoder.

Cockcroft’s statement initially raised doubts in the community, with some developers questioning how AWS was compressing customer data on S3. An AWS employee clarified:

Adrian misspoke, or everyone is misunderstanding what he meant. What he meant wasn't that S3 changed how it stores zipped customer data. What he meant was that AWS changed how it stores its own service data (mostly logs) in S3 - by switching (as a client of S3 themselves) from gzipping logs to zstd logs, we were able to reduce our S3 storage costs by 30%.

Liz Fong-Jones, principal developer advocate at Honeycomb, also endorses switching to zstd:

We don't use it for column files because it's too slow, but we do use it for Kafka (...) Honeycomb is seeing 25% bandwidth savings after switching snappy to zstd in prod. (...) It's not just the storage and compute. to us, it's the NETWORK. AWS inter-AZ data transfer is absurdly expensive.

In a popular Reddit thread, user noirknight is one of many sharing positive feedback:

My company did something similar a few years ago and saw similar benefits. We are throwing zstandard everywhere we can, not just storage, but other things like internal HTTP traffic.

User treffer on Hacker News comments:

Especially fast compression algorithms (zstd, lz4, snappy, lzo, ...) are worth the CPU cost with virtually no downsides. The problem is finding the right sweet spot that reduces the current bottleneck without creating a CPU bottleneck, but zstd offers the greatest flexibility there, too.
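One way to look for that sweet spot on your own data is to measure ratio and compression time side by side. The following is a minimal sketch, not a rigorous benchmark: the function names `available_codecs` and `measure` and the sample log line are hypothetical, and it assumes the third-party `zstandard` binding is installed (`pip install zstandard`). If the binding is missing, only the gzip baseline is reported.

```python
import gzip
import time
from typing import Callable, Dict, Tuple

# A codec is a (compress, decompress) pair operating on bytes.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


def available_codecs() -> Dict[str, Codec]:
    """Return the codecs to compare; zstd levels are added only if the binding imports."""
    codecs: Dict[str, Codec] = {
        "gzip-6": (lambda d: gzip.compress(d, compresslevel=6), gzip.decompress),
    }
    try:
        import zstandard  # third-party package, assumed to be installed
    except ImportError:
        return codecs
    for level in (1, 3, 19):  # fast, default, and high-ratio settings
        compressor = zstandard.ZstdCompressor(level=level)
        decompressor = zstandard.ZstdDecompressor()
        codecs[f"zstd-{level}"] = (compressor.compress, decompressor.decompress)
    return codecs


def measure(data: bytes) -> Dict[str, Tuple[float, float]]:
    """Map codec name to (compression ratio, seconds spent compressing)."""
    results: Dict[str, Tuple[float, float]] = {}
    for name, (compress, decompress) in available_codecs().items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must be lossless on this input.
        if decompress(packed) != data:
            raise ValueError(f"{name} failed to round-trip the input")
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    # Hypothetical, highly repetitive log data; real logs will compress less well.
    sample = b'{"level":"INFO","service":"example","msg":"request served"}\n' * 50_000
    for name, (ratio, seconds) in measure(sample).items():
        print(f"{name:8s} ratio={ratio:7.1f}x  compress_time={seconds * 1000:8.2f} ms")
```

Results depend heavily on the input, so the numbers are only meaningful when `sample` is replaced with a representative slice of the data you actually store or send over the network.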

AWS exposes Zstandard and other compression algorithms in the APIs of some managed services. For example, after introducing Zstandard support for Amazon Redshift, the cloud provider developed its own algorithm, AZ64, for the cloud data warehouse. According to the cloud provider, the proprietary encoding consumes 5–10% less storage and is 70% faster than zstd encoding.

Amazon has not officially commented on the compression technology used for its own internal data or on the S3 storage savings involved.

About the Author

Renato Losio
