DuckDB Labs recently released DuckLake 1.0, a data lake format that stores table metadata in a SQL database rather than across many files in object storage. The first implementation is available as a DuckDB extension and includes data inlining for small updates stored directly in the catalog, improved sorting and partitioning options, and compatibility with Iceberg-style data features.
According to the DuckDB team, file-based metadata in lake formats leads to complex coordination, slow metadata operations, and many small files in object storage. While Apache Iceberg, Delta Lake, and Apache Hudi store metadata mainly as files in object storage, sometimes adding catalog services on top, DuckLake stores it directly in a SQL database.
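In practice, this means a DuckLake catalog can be attached like any other database. The sketch below, based on the published DuckLake examples, uses a local DuckDB file as the catalog database; the path names are illustrative, and in production the catalog could equally be a PostgreSQL or MySQL database with data files in object storage:

```sql
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog: table metadata lives in the DuckDB file
-- 'metadata.ducklake', while Parquet data files are written to DATA_PATH
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data_files/');
USE lake;

CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'hello');
SELECT * FROM events;
```

Because all metadata operations are plain SQL transactions against the catalog database, concurrent writers coordinate through the database rather than through optimistic file swaps in object storage.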
A year ago, DuckDB Labs published the "DuckLake manifesto", arguing that lakehouse metadata should be stored in a database rather than spread across many files in object storage. The team writes:
We are happy to announce DuckLake v1.0, almost a year after we released our first sketch of the specification. This is a production-ready release with guaranteed backward-compatibility. DuckLake v1.0 ships a stable specification, a feature-rich and fast reference implementation (the DuckDB ducklake extension), as well as a roadmap for future development.
DuckLake 1.0 adds several features to improve lakehouse operations and performance. These include data inlining to handle small inserts, updates, and deletes without creating new files, sorted tables to speed up filtered queries, bucket partitioning for high-cardinality columns, improved support for geometry data types, and deletion vectors compatible with Iceberg. Discussing data inlining, the team notes:
Data inlining is one of the flagship features of DuckLake. It basically enables performing small insert, delete and update operations in the catalog database, avoiding the proliferation of "the small file problem". DuckLake v1.0 brings full inlining of updates and deletes. This feature is now on by default with a default threshold of 10 rows.
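With inlining enabled, small writes land as rows in the catalog database instead of producing a tiny Parquet file per commit; the article's stated default threshold is 10 rows. The following sketch assumes the `DATA_INLINING_ROW_LIMIT` attach option from the ducklake extension documentation; the threshold value and table are illustrative:

```sql
-- Attach with inlining explicitly configured: inserts of up to 10 rows
-- are stored in the catalog database rather than as new Parquet files
ATTACH 'ducklake:metadata.ducklake' AS lake (
    DATA_PATH 'data_files/',
    DATA_INLINING_ROW_LIMIT 10
);
USE lake;

-- A 2-row insert stays inlined in the catalog; no data file is created
INSERT INTO events VALUES (2, 'small'), (3, 'write');
```

Inlined rows remain transparently queryable alongside file-backed data, and can later be compacted into regular Parquet files in bulk.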
In a popular Reddit thread, user SutMinSnabel4 asks:
Could you add first-class support for the SMB protocol? I do not mean locally mounted filesystems, since those are OS-dependent. It should work well in traditional enterprise Windows environments too, with DFS, Kerberos, and the whole shebang, but it should also work on macOS and Linux machines (...) A lot of enterprises still rely on SMB on-premises.
On Hacker News, data platform engineer Alexander Dahl writes:
Very exciting! The numbers seem to crush Iceberg. Has anyone tried it out for "real" workloads?
DuckLake clients are available for Apache DataFusion, Apache Spark, Trino, and Pandas. MotherDuck also offers a hosted DuckLake service that manages the catalog database and storage.
DuckLake v1.1 is expected to add improvements such as variant inlining across catalogs and multi-deletion vector Puffin files. According to the roadmap, DuckLake v2.0 will introduce Git-like branching for datasets and built-in role-based permissions.
The awesome-ducklake repository provides DuckLake use cases and libraries. DuckLake 1.0 is available on GitHub under an MIT license.