Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage News Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow

Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow

This item in japanese


Julien Le Dem, co-author of Apache Parquet and a PMC member of the Apache Arrow project, presented on Data Eng Conf NY on the future of column-oriented data processing.

Apache Arrow is an open-source standard for columnar in-memory execution that originated from Apache Drill’s own in-memory columnar data structures. It aims to become the de-facto way of efficiently holding data in-memory and exchange data between different execution engines, thereby avoiding serialisation. Apache Arrow is backed by key developers of 13 open source projects, mainly from Apache, including Calcite, Drill, Pandas, HBase, Spark, and Storm.

InfoQ interviewed Le Dem to find out the differences between Arrow and Parquet, a columnar on-disk storage format, and how both can enable a more efficient computation across execution engines.

InfoQ: Do you believe Apache Arrow, like Parquet, will be commoditized across execution engines like Apache Spark? Do you think it will narrow the performance difference between engines?

Le Dem: As pioneered by MonetDB, vectorized execution is the state of the art for efficient query processing. Many open source query engines are moving to this model, and we believe it makes sense to standardize the in-memory columnar representation to provide extremely efficient interoperability. What Parquet did for columnar storage, Arrow provides for columnar in-memory processing and interchange.

These standardization efforts greatly simplify integration between storage layers, query engines, DSLs and UDFs, and provide a much more efficient communication layer by removing the need for serialization. Standardization makes it easier, cheaper, and faster for all systems to interoperate by removing common bottlenecks. However, there is still plenty of room for each execution engine to innovate by providing specialized techniques that further improve performance, such as operating on compressed vectors or having a smarter query optimizer.

InfoQ: Apache Parquet supports predicate pushdown, avoiding reading data from disk whenever a page doesn’t contain data that matches the predicate. Do Apache Arrow data structures include similar functionality?

Le Dem: The tradeoffs between reading data from disk and from memory are different. Currently it is up to the engines to implement predicate push down. Eventually Apache Arrow will provide fast vectorized operations that can be reused across engines, although this effort has not yet started.

InfoQ: One of the Arrow's goals is to provide constant time access to data in-memory and enable vectorized operations with SIMD instructions being executed . Does Arrow also provide in-memory data compression like Parquet?

Le Dem: Arrow supports dictionary encoding that can provide significant compression and faster execution for operations such as aggregations and joins. There is also an ongoing discussion to provide generic buffer compression using general purpose algorithms such as snappy or gzip.

In this initial release, Arrow does not yet support other compression techniques such as bit packing. However, we intended for execution engines to be able to define custom vectors provided that standard vectors are used in interchange. This will allow more advanced techniques like operating on compressed vectors directly. An example that comes to mind is the BitWeaving project from University of Wisconsin. In the future the set of standard vectors will be expanded.

The first release of Arrow has native C++-backed integration between the Pandas library, Arrow, and Parquet, allowing Arrow's Record Batches to be manipulated as Pandas dataframes and exposed to SQL-on-Hadoop engines such as Apache Drill.

InfoQ: Apache Arrow supports interoperability allowing data to be transferred between processes without serialisation. Can you comment on the capabilities of the IPC layer of Arrow?

Le Dem: The IPC layer is still experimental and it is a true zero-copy layer. When an Arrow’s Record Batch is finalized it becomes immutable. In this state it can be shared with other processes in read-only mode using shared memory without worrying about concurrent access. The vector representation is independent from its memory address (no absolute pointer required) and is safe to use in shared memory where each process sees a different address for the buffers.

InfoQ: Like Parquet, Apache Arrow supports nested data types. What data types are currently supported and which are on the roadmap?

Le Dem: Arrow supports all the common data types, a pretty exhaustive list so far. Some of the more recently added types include SQL’s Timestamp and Interval types.

Rate this Article


Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

  • RE: Yim Carfield: What is the difference between 2 projects?

    by Alexandre Rodrigues,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Apache Parquet is a on-disk columnar format while Apache Arrow is a in-memory data format. They serve different purposes: the first primary objective is to provide good compression and mechanisms that allow the software to open just the bits of the file that may be interesting. This is done with page header data that contains things like the min and max values of a certain column. Predicate pushdown operations leverage the information in the headers to decide if it should be scanned or skipped.

    Arrow is a standard for in-memory columnar processing, that defines CPU-cache friendly data structures, interoperable between multiple engines (which avoids CPU waste on de/serialization). Arrow as a standard enables easier and more efficient data exchange between multiple engines (such as Apache Spark, and others). Many techniques used by engines like Apache Spark, Apache Drill, borrow concepts pioneered by MonetDB such as vectorization of operations.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p