
DataFu Enters Incubation Status at Apache

by Charles Menguy on Feb 04, 2014

LinkedIn’s DataFu project, a collection of libraries for Hadoop, officially entered incubation at the Apache Software Foundation (ASF) in the first week of January.

The project was initially centered on a collection of User-Defined Functions (UDFs) for Pig when it started in January 2012. Compared to a more generic UDF collection like Piggybank, DataFu focuses on data-mining and statistics functions, such as quantile computation or sampling methods. In October 2013, a new library called DataFu Hourglass was added to the project. Hourglass is a library for MapReduce that allows jobs to process data incrementally. This is typically done by keeping the state of the previous job in HDFS and using that to process the new input. Both libraries are now part of the incubator.
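To make the incremental idea concrete, here is a minimal sketch in plain Python. It is not the Hourglass API: the function name `incremental_count` is hypothetical, and an in-memory counter stands in for the state that a real job would keep in HDFS. It only illustrates the pattern of reusing the previous output so that each run touches just the new input.

```python
from collections import Counter
from typing import Iterable, Mapping


def incremental_count(previous_state: Mapping[str, int],
                      new_records: Iterable[str]) -> Counter:
    """Merge counts from new input into the state saved by the previous run.

    Hypothetical helper for illustration; a real job would load the
    previous state from HDFS instead of receiving it as an argument.
    """
    state = Counter(previous_state)  # copy, so the saved state is not mutated
    state.update(new_records)        # count only the newly arrived records
    return state


# Day 1: there is no previous state, so the whole input is processed once.
day1 = incremental_count({}, ["alice", "bob", "alice"])

# Day 2: reuse day 1's output and process only the new day's records.
day2 = incremental_count(day1, ["bob", "carol"])
```

The second run never rereads the first day's records, which is the saving that incremental processing is after when the accumulated input is much larger than each new batch.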

Entering incubation at Apache is no small feat: each project has to go through rigorous scrutiny and a voting process before it is accepted into the incubator. DataFu had been around since early 2012 and was only accepted into the incubator in early 2014. Graduation for a typical Apache project in incubation takes time; once the infrastructure for the project is completed (wiki, mailing lists, tutorials, etc.), DataFu will be able to graduate as its own top-level project in the ASF or as a sub-project of Hadoop.

With its recent introduction to the Apache incubator, DataFu has many plans for growth in the near future. One of the most important is to provide the same set of UDFs for Hive and Crunch in order to gain wider adoption. Part of this includes migrating the project build system to Gradle, something the DataFu community is currently working on. Moving from Ant to Gradle will make it easier for DataFu to consolidate its community around a simpler process for adding new functionality.

The DataFu community is still small but growing steadily. A recent contribution from Russell Jurney made the OpenNLP project available as part of DataFu 1.3.0. The focus on the mailing list is on adding more UDFs and making DataFu “The WD-40 of Big Data”, as described by project contributors Matthew Hayes and Sam Shah.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc.