
Gobblin, LinkedIn's Unified Data Ingestion Platform


At the 2014 QCon San Francisco conference, LinkedIn's Lin Qiao gave a talk on their Gobblin project (also summarized in a blog post) that is a unified data ingestion system for their internal and external data sources. Handling up to one hundred terabytes of new data per day, Gobblin has to deal with many of the technical and semantic integration challenges companies face when implementing huge Big Data projects.

Gobblin is built around Hadoop as a central storage (via Hadoop's distributed filesystem HDFS) and analysis platform, and serves business analytics, engineering, as well as member facing insights and data products.

According to Qiao, LinkedIn already had a rather complex system in place, which had grown over the years to store all kinds of data in Hadoop. For example, streaming data like user activity events from the website were ingested via Apache Kafka and then stored in Hadoop using Camus, an adapter that persists Kafka streams in HDFS. Changes to user profile data were handled by Databus, another LinkedIn open source project.

External data came from platforms like Salesforce, Google, Facebook, and Twitter. While it made up a much smaller share of the overall volume, it posed different challenges: changes in data format were more frequent, availability and data quality were more fragile, and, in particular, the sources were not under LinkedIn's control.

In the existing system, there was a lot of duplication between the different solutions, because issues like record filtering, conversion, or quality control had been solved independently in all of these approaches. One of the main goals of Gobblin was to unify the workflow to make it much easier to integrate new data sources.

Besides providing adapters for all their existing data sources, the key part of Gobblin is the unified processing pipeline. Gobblin uses a worker framework where each record runs through the four stages of extraction, conversion, quality checking, and writing. In conversion, a sequence of operations can be applied, including filtering, projection (extracting only part of the available information), type conversion, or other structural changes. Qiao pointed out that quality checking is very important to ensure that any data produced in Gobblin is reliable, in particular for external sources. This approach allows developers to focus just on the per-record processing logic, and relieves them of concerns like job scheduling, handling streaming versus batch data, scalability, and fault tolerance.
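The four stages described above can be sketched as a simple per-record pipeline. The following is a minimal illustration only: the function names, record shapes, and conversion operations are hypothetical assumptions for this sketch, not Gobblin's actual API.

```python
# Hypothetical sketch of a four-stage per-record pipeline
# (extract -> convert -> quality check -> write).
# Names and structure are illustrative, not Gobblin's real interfaces.

def extract(source):
    """Extraction stage: pull raw records from a source
    (here, simply an in-memory list)."""
    for raw in source:
        yield raw

def convert(record, operations):
    """Conversion stage: apply a sequence of operations; an operation
    may drop the record by returning None (filtering)."""
    for op in operations:
        record = op(record)
        if record is None:
            return None
    return record

def quality_check(record):
    """Quality-checking stage: reject records that fail basic
    integrity rules, e.g. a missing or non-integer user_id."""
    return "user_id" in record and isinstance(record["user_id"], int)

def run_pipeline(source, operations, sink):
    """Run every record through all four stages."""
    for raw in extract(source):
        converted = convert(raw, operations)
        if converted is None:
            continue  # filtered out during conversion
        if not quality_check(converted):
            continue  # failed the quality check
        sink.append(converted)  # writing stage

# Example run: filter out debug records, project to two fields,
# then coerce user_id to an integer.
source = [
    {"user_id": "1", "name": "alice", "debug": True, "extra": "x"},
    {"user_id": "2", "name": "bob", "debug": False, "extra": "y"},
]
operations = [
    lambda r: None if r.get("debug") else r,                  # filtering
    lambda r: {"user_id": r["user_id"], "name": r["name"]},   # projection
    lambda r: {**r, "user_id": int(r["user_id"])},            # type conversion
]
sink = []
run_pipeline(source, operations, sink)
# sink now holds [{"user_id": 2, "name": "bob"}]
```

The design point the talk makes is visible here: the per-source logic lives entirely in the list of conversion operations, while scheduling and fault tolerance would sit outside this per-record loop.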

In terms of future development, Gobblin is planned to run on YARN, Hadoop's general resource manager and execution framework that allows many different applications to run on a Hadoop cluster in parallel. Support for streaming sources is going to be improved. An ingestion DSL is planned for easier integration of new data sources, especially for business users. Finally, Gobblin is planned to be open-sourced in the near future.

In the past, LinkedIn has contributed a number of open source projects. Examples are the already mentioned Apache Kafka and Databus projects, and White Elephant, a log aggregator and visualization system. Apache DataFu is another project that contains a number of user-defined functions for statistics, estimation, and sampling to be used in Apache Pig. DataFu entered the incubation stage at the Apache Software Foundation in February 2014.
