DataFu Enters Incubation Status at Apache

LinkedIn’s DataFu project, a collection of libraries for Hadoop, officially entered incubation at the Apache Software Foundation (ASF) in the first week of January.

Since its inception in January 2012, the project was centered on being a collection of User-Defined Functions (UDFs) for Pig. Compared to a more generic UDF collection like Piggybank, DataFu focuses on data-mining and statistics functions, such as quantile computation or sampling methods. In October 2013, a new library called DataFu Hourglass was added to the project. Hourglass is a library for MapReduce that allows jobs to process data incrementally. This is typically done by keeping the state of the previous job in HDFS and using it to process the new input. Both projects are now part of the incubator.
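The incremental idea described above can be illustrated with a minimal sketch. This is plain Python, not the Hourglass API: the function names, the use of an in-memory `Counter` in place of state stored in HDFS, and the sample inputs are all assumptions made for illustration.

```python
from collections import Counter


def count_events(records):
    """Aggregate one batch of input by counting occurrences of each key."""
    return Counter(records)


def incremental_update(previous_state, new_records):
    """Combine the saved result of the previous run with counts from new input only.

    previous_state stands in for the output a real job would persist (for
    example in HDFS); only new_records are scanned on this run.
    """
    updated = Counter(previous_state)
    updated.update(count_events(new_records))
    return updated


# Hypothetical daily inputs, used only for illustration.
day1 = ["alice", "bob", "alice"]
day2 = ["bob", "carol"]

# Incremental path: each run reads the previous state plus one new day.
state = incremental_update(Counter(), day1)
state = incremental_update(state, day2)

# Non-incremental path: reprocess every day from scratch.
full = count_events(day1 + day2)
```

Both paths produce the same counts, but the incremental one reads each day's input only once, which is the saving the approach aims for when the history is large.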

Entering incubation at Apache is no small feat: each project must pass rigorous scrutiny and a vote before it is accepted into the incubator. DataFu has been around since early 2012 and was accepted into the incubator only in early 2014. Graduation for a typical Apache project in incubation takes time. Once the infrastructure for the project is complete (wiki, mailing lists, tutorials, etc.), DataFu will be able to graduate either as its own top-level project in the ASF or as a sub-project of Hadoop.

With its recent entry into the Apache incubator, DataFu has many plans for growth in the near future. One of the most important goals is to provide the same set of UDFs for Hive and Crunch in order to reach wider adoption. Part of this effort includes migrating the project's build system to Gradle, which the DataFu community is currently working on. Moving from Ant to Gradle should make it easier for DataFu to consolidate its community around a simpler process for adding new functionality.

The DataFu community is still small but growing steadily. A recent contribution from Russell Jurney made the OpenNLP project available as part of DataFu 1.3.0. Discussion on the mailing list focuses on adding more UDFs and making DataFu “The WD-40 of Big Data”, as described by project contributors Matthew Hayes and Sam Shah.
