Infinite Storage & Retention for Apache Kafka in Confluent Cloud

Confluent, Inc., the company behind the event streaming platform powered by Apache Kafka, recently announced the Infinite Storage option for its standard and dedicated clusters. This offering is part of the Project Metamorphosis initiative, which is focused on bringing modern cloud properties to Kafka. With infinite retention rolling out in July, organizations can use a centralized platform for all event data, with limitless storage and retention, for both real-time action and historical analysis.

Jun Rao, the co-founder of Confluent, Inc., explained in his blog that Apache Kafka has become the standard for event streaming. As data retention requirements grow, however, hardware costs grow accordingly. To help Kafka operators deal with this trade-off, Confluent Cloud now offers "near infinite" scaling of storage, giving customers the flexibility to retain data indefinitely. Speaking to Datanami, Dan Rosanova, a group product manager at Confluent, said, "This is the only fully managed event streaming service with no limits on the amount that is stored or time it is retained."
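In open-source Kafka, retention is governed per topic, and a value of -1 removes the limit. As a point of reference (Confluent Cloud manages this on the customer's behalf; the standard Kafka settings below are shown only for illustration), unlimited retention at the topic-config level looks like this:

```properties
# Topic-level retention settings in Apache Kafka.
# A value of -1 removes the limit; the default retention.ms
# corresponds to seven days (604800000 ms).
retention.ms=-1
retention.bytes=-1
```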

Turning to the need for historical data, Rao elaborated that Kafka acts as the single source of truth for all types of systems storing digitized data, with source systems and applications fetching their data from Kafka incrementally and in real time. To deliver better insights and a better user experience, context-rich applications need historical data alongside real-time (current) data. Because Kafka's default retention period is only seven days, historical data has typically been maintained in separate systems, forcing developers to work with two sets of APIs and to keep the two copies of the data consistent. With this latest offering, organizations such as financial institutions can keep data in Kafka for years, and machine learning models can be trained on that historical data to make real-time predictions.
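The "one system, one API" idea Rao describes can be illustrated with a toy append-only log. This is a deliberately simplified sketch, not Kafka's client API: the point is that a single read interface with offsets serves both full historical replay and real-time tailing, so no second system is needed.

```python
# Toy model (not Kafka's actual API): one append-only log with offsets
# can serve both historical replay and real-time consumption through
# the same read interface.

class EventLog:
    """A minimal in-memory append-only log with Kafka-like offsets."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, from_offset):
        """Return all records at or after from_offset.

        The same call covers both cases: from_offset=0 replays all
        history, while from_offset=end_offset tails only new events.
        """
        return self._records[from_offset:]

    @property
    def end_offset(self):
        """Offset where the next record will be written."""
        return len(self._records)


log = EventLog()
for event in ["signup", "purchase", "refund"]:
    log.append(event)

history = log.read(0)        # full replay, like consuming from earliest
tail_from = log.end_offset   # remember the tail before new events arrive
log.append("login")
recent = log.read(tail_from)  # only what arrived since, like latest

print(history)  # ['signup', 'purchase', 'refund']
print(recent)   # ['login']
```

With short retention, the `read(0)` case would be impossible once old records expire; infinite retention is what keeps the full-replay path available from the same system.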

It appears that Confluent is positioning Kafka as a database of record: just as relational OLTP (online transaction processing) platforms do, Kafka can now act as a permanent store for transactional data. As an example, The New York Times uses Kafka as its source of truth, storing 160 years of journalism going back to the 1850s. Rao continues:

Imagine if the data in Kafka could be stored for months, years, or infinitely. The above problem can be solved in a much simpler way. All applications just need to get data from one system—Kafka—for both recent and historical data.

Expanding on this idea, Confluent CEO Jay Kreps said:

Without the context of historical data, it’s difficult to take action on real-time events in an accurate, meaningful way. We’ve removed the limitations of storing and retaining events in Apache Kafka with infinite retention in Confluent Cloud. With event streaming as a business’s central nervous system, applications can pull from an unlimited source of past and present data to quickly become smarter, faster, and more precise.

Infinite Storage builds on Tiered Storage, which was released earlier this year as a preview in Confluent Platform. Tiered Storage provides what Confluent calls the "nuts and bolts of Infinite Storage": under the hood, it moves Kafka data between storage implementations with different performance and cost characteristics. Although it currently supports only Amazon S3 as a backend, Confluent claims that Tiered Storage "delivers on cost-effectiveness, ease of use, and performance isolation."
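For reference, the Tiered Storage preview in Confluent Platform is enabled through broker settings along these lines. The property names below are taken from the preview documentation and may change in later releases; the bucket and region values are placeholders:

```properties
# Broker settings for the Tiered Storage preview in Confluent Platform.
# Enable the tiering feature and tier data for new topics by default.
confluent.tier.feature=true
confluent.tier.enable=true
# Currently only Amazon S3 is supported as a backend.
confluent.tier.backend=S3
# Placeholder bucket and region:
confluent.tier.s3.bucket=my-tiered-storage-bucket
confluent.tier.s3.region=us-east-1
```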
