Hardwood Promises High-Speed JVM Apache Parquet Processing with Zero Mandatory Dependencies

Hardwood has been released as an open-source library designed to optimise the reading of Apache Parquet files within JVM environments. Kickstarted by Gunnar Morling, it aims to provide a faster, simpler alternative to the traditional Apache Parquet Java implementation, which often introduces significant dependency overhead and whose core reader is single-threaded. Hardwood addresses these constraints by requiring no mandatory dependencies and by decoding pages on multiple threads to maximise CPU utilisation. Five months after its inception in early 2026, it has reached version 1.0, which provides read capabilities only; write support is planned for future releases.

Hardwood’s design emphasises a modular approach to data access. It provides two distinct APIs to suit different engineering requirements: a structured row reader API for general-purpose record access and a batch-oriented column reader API intended for high-throughput analytical workloads. Unlike traditional implementations that decode pages sequentially, Hardwood spreads Parquet page decoding across all available CPU cores, reducing the latency that comes from processing pages one after another.
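To make the idea of spreading page decoding across cores concrete, here is a minimal sketch using only the JDK. It is not Hardwood's implementation: the class name, the decodePage stand-in (which merely widens bytes to longs), and the pool sizing are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only: these are not Hardwood's classes or its decoding logic.
class ParallelPageDecodeSketch {

    // Stand-in for decoding one encoded page: each byte is simply widened to a long.
    static long[] decodePage(byte[] encodedPage) {
        long[] values = new long[encodedPage.length];
        for (int i = 0; i < encodedPage.length; i++) {
            values[i] = encodedPage[i];
        }
        return values;
    }

    // Submits every page to a pool sized to the available cores and
    // collects the decoded pages in their original order.
    static List<long[]> decodeAll(List<byte[]> pages) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<long[]>> futures = new ArrayList<>();
            for (byte[] page : pages) {
                futures.add(CompletableFuture.supplyAsync(() -> decodePage(page), pool));
            }
            List<long[]> decoded = new ArrayList<>();
            for (CompletableFuture<long[]> future : futures) {
                decoded.add(future.join()); // waits until this page has been decoded
            }
            return decoded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<byte[]> pages = List.of(new byte[] {1, 2, 3}, new byte[] {4, 5}, new byte[] {6});
        long sum = 0;
        for (long[] page : decodeAll(pages)) {
            for (long value : page) {
                sum += value;
            }
        }
        System.out.println("decoded sum = " + sum);
    }
}
```

Because pages are independent units of work, the only coordination needed in this sketch is collecting the results in page order; a real reader would also have to handle decompression, dictionaries, and back-pressure.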

The following snippet shows the row reader API in use:

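import java.time.Instant;
import java.time.LocalDate;
// Hardwood's own imports (ParquetFileReader, InputFile, RowReader) are omitted here;
// path is the location of the Parquet file, defined elsewhere.

// Both readers are closed automatically by try-with-resources.
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     RowReader rowReader = fileReader.rowReader()) {

    while (rowReader.hasNext()) {
        rowReader.next(); // advance to the next row

        // Typed accessors look up each column by name.
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
    }
}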

The library is designed with a zero-mandatory-dependency profile to minimise the risks of supply chain attacks and classpath conflicts. To achieve this, it utilises the minimal logging abstraction built into Java since version 9 (the System.Logger API), avoiding external logging dependencies. Additional functionality, such as support for specific compression algorithms like LZ4 and GZip or object storage services like S3, is provided through optional dependencies that users can pull in as needed.
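For readers unfamiliar with the JDK's built-in logging facade, the sketch below shows java.lang.System.Logger in isolation. The logger name and message are made up for illustration, and nothing here is taken from Hardwood's code base.

```java
// Minimal sketch of java.lang.System.Logger (JDK 9+); no third-party logging jar is required.
class SystemLoggerSketch {

    // The logger name is a hypothetical example, not one used by Hardwood.
    static final System.Logger LOG = System.getLogger("example.parquet.reader");

    public static void main(String[] args) {
        // Parameters are substituted using java.text.MessageFormat placeholders.
        LOG.log(System.Logger.Level.INFO, "Opened file with {0} row groups", 3);

        // Guarding expensive log statements works as with other logging facades.
        if (LOG.isLoggable(System.Logger.Level.DEBUG)) {
            LOG.log(System.Logger.Level.DEBUG, "Debug logging is enabled");
        }
    }
}
```

By default the JDK routes these calls to java.util.logging when that module is present; an application can plug in another backend by providing a System.LoggerFinder service, so a library that logs this way does not force a logging framework onto its users.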

It also implements optimised predicate evaluation. By employing branchless, batch-at-a-time evaluation during filtered scans, the system minimises CPU branch mispredictions, which is a critical factor for performance in modern analytical data processing.
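As a rough illustration of what branchless, batch-at-a-time filtering can look like, the sketch below evaluates a "greater than" predicate over a whole array without an if statement in the loop body. It is an assumption-laden example rather than Hardwood's implementation: the class and method names are hypothetical, the comparison trick assumes that threshold minus value does not overflow, and whether the JIT compiler actually emits branch-free machine code is not guaranteed.

```java
// Hypothetical illustration of branchless batch filtering; not Hardwood's code.
class BranchlessFilterSketch {

    // Writes the indices of all values greater than threshold into selection
    // and returns how many matched. selection must be at least as long as values.
    // Assumes (threshold - value) does not overflow a long.
    static int selectGreaterThan(long[] values, long threshold, int[] selection) {
        int count = 0;
        for (int i = 0; i < values.length; i++) {
            // Always write the candidate index; it is overwritten later if it did not match.
            selection[count] = i;
            // The sign bit of (threshold - value) is 1 exactly when value > threshold.
            count += (int) ((threshold - values[i]) >>> 63);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] batch = {5, 12, 7, 30, 10};
        int[] selection = new int[batch.length];
        int matches = selectGreaterThan(batch, 9, selection);
        for (int k = 0; k < matches; k++) {
            System.out.println("match at index " + selection[k]);
        }
    }
}
```

The loop does the same work for every element regardless of the outcome, so its cost does not depend on how predictable the data is; the conventional alternative, an if inside the loop, tends to suffer when matches and non-matches are interleaved unpredictably.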

Beyond the library itself, the project includes a command-line interface (CLI) tool designed for developers and data engineers. This CLI features an interactive text-based user interface (TUI) that allows users to inspect Parquet file schemas and metadata without writing boilerplate code or involving heavy data processing frameworks. This utility serves as a diagnostic tool for verifying file integrity and structure during the development lifecycle.

Benchmark results indicate that Hardwood achieves significant throughput improvements over standard implementations. In scans of flat datasets with 8 vCPUs, the reader achieved a throughput of 16.5 million rows per second. The performance advantage is largely attributed to the library's ability to scale with available hardware. In a single-threaded configuration, performance is constrained by sequential decoding; the multi-threaded approach, by contrast, allows the system to saturate the host machine's I/O and CPU bandwidth more effectively.

Besides its initiator, Gunnar Morling, the project has already attracted 20 open-source contributors, including veterans of the Java space such as Andres Almiray and Bruno Borges. Feedback from the broader community has been mostly positive, with potential users also asking for Parquet writing capabilities. This enhancement is already on the roadmap and is expected to be available soon.

Hardwood 1.0 marks a significant milestone in high-performance JVM data processing, having progressed from inception to its first stable release in just five months. The project utilised AI-assisted coding during development, though the design and code review processes remained under human ownership. By combining a zero-mandatory-dependency architecture with a multi-threaded decoding engine, the project provides a lightweight yet powerful alternative to traditional Parquet implementations. With its modular design and a clear roadmap for future write support, Hardwood is positioned to become a foundational tool for data engineers seeking to maximise resource efficiency in analytical workloads.

About the Author

Olimpiu Pop
