
LinkedIn Open Sources Cubert With an Eye To Big Data Analytics

by Alex Giamas on Dec 17, 2014

LinkedIn recently open sourced Cubert, its High Performance Computation Engine for Complex Big Data Analytics. Cubert is a framework written with analysts and data scientists in mind that delivers “all the efficiency advantages of a hand-crafted Java program yet provides the simplicity of a script-like user interface to solve a variety of statistical, analytical and graph problems”. The goal is to achieve this without exposing users to low-level details.

Cubert is designed around the need to implement better algorithms on the data processing side. When performance is a differentiating factor, Cubert can help: LinkedIn engineers claim it can outperform other engines by a factor of 5-60x, even when data sizes reach tens of terabytes and spill to disk.

Developed completely in Java and exposed as a scripting language, Cubert is designed for the complex joins and aggregations that frequently arise in the reporting world. Cubert uses the MeshJoin algorithm to process large datasets over large time windows with significant improvements in CPU and memory utilization. CUBE, a new operator defined by Cubert, can compute additive and non-additive analytics dimensions. Non-additive dimensions, such as the count of distinct users in a time window, are computationally intensive, but CUBE can speed up these calculations. It can also calculate exact percentile ranks such as the median, roll up inner dimensions on the fly, and compute multiple measures within a single job.
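To illustrate why non-additive dimensions are harder than additive ones, here is a minimal Python sketch. It is not Cubert syntax and does not reflect how the CUBE operator is implemented; the event rows and the function names cube_by_country_day and roll_up_day are hypothetical. It only shows that a sum can be rolled up by adding partial results, while a distinct count needs the underlying user sets.

```python
from collections import defaultdict
from statistics import median

# Hypothetical event rows: (country, day, user_id, page_views)
events = [
    ("US", "2014-12-01", "u1", 3),
    ("US", "2014-12-01", "u2", 1),
    ("US", "2014-12-02", "u1", 2),
    ("DE", "2014-12-01", "u3", 4),
    ("DE", "2014-12-02", "u3", 1),
    ("DE", "2014-12-02", "u4", 5),
]

def cube_by_country_day(rows):
    """Group by (country, day), computing two measures in a single pass:
    a page-view sum (additive) and a set of users (non-additive)."""
    views = defaultdict(int)
    users = defaultdict(set)
    for country, day, user, n in rows:
        views[(country, day)] += n
        users[(country, day)].add(user)
    return views, users

def roll_up_day(views, users):
    """Roll up the inner 'day' dimension to get per-country results."""
    total_views = defaultdict(int)
    total_users = defaultdict(set)
    for (country, _day), n in views.items():
        total_views[country] += n       # partial sums can simply be added
    for (country, _day), s in users.items():
        total_users[country] |= s       # distinct counts need a set union
    return dict(total_views), {c: len(s) for c, s in total_users.items()}

views, users = cube_by_country_day(events)
country_views, country_users = roll_up_day(views, users)

# Adding per-day distinct counts double-counts users active on several days.
naive_us_users = sum(len(s) for (c, _d), s in users.items() if c == "US")

# An exact median, like a distinct count, needs the raw values, not partial results.
us_median_views = median([n for c, _d, _u, n in events if c == "US"])
```

In this sketch the naive sum of daily distinct counts for "US" counts user u1 twice, which is the kind of error an engine has to avoid when rolling up non-additive results.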

Cubert is best suited for repetitive reporting workflows, where it exploits partial result caching and incremental processing techniques to speed up execution. Finally, a novel sparse matrix multiplication algorithm can help with analytics computations over large-scale graphs.
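The general idea behind partial result caching and incremental processing in a recurring report can be sketched as follows. This is a generic Python illustration under stated assumptions, not Cubert's actual mechanism: the IncrementalReport class, its methods, and the sample data are all hypothetical.

```python
from collections import Counter

class IncrementalReport:
    """Caches per-day partial aggregates so a rolling report
    only has to process days it has not seen before."""

    def __init__(self):
        self.partials = {}      # day -> Counter mapping page -> views
        self.days_computed = 0  # how many days were actually aggregated

    def add_day(self, day, rows):
        """Aggregate one day of (page, views) rows unless it is already cached."""
        if day in self.partials:
            return
        partial = Counter()
        for page, views in rows:
            partial[page] += views
        self.partials[day] = partial
        self.days_computed += 1

    def window_total(self, days):
        """Combine cached partial results for the requested days."""
        total = Counter()
        for day in days:
            total.update(self.partials[day])
        return total

report = IncrementalReport()
report.add_day("d1", [("home", 5), ("jobs", 2)])
report.add_day("d2", [("home", 3)])
first_window = report.window_total(["d1", "d2"])

# The next run slides the window: only the new day "d3" is aggregated,
# while "d2" is served from the cache.
report.add_day("d2", [("home", 3)])
report.add_day("d3", [("jobs", 4)])
second_window = report.window_total(["d2", "d3"])
```

The sketch works because the measure here is additive; as noted earlier, non-additive measures cannot be combined from cached counts alone.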

Support for Pig UDFs is already implemented, and the team plans to support UDFs and storage layers from both Pig and Hive. Cubert currently runs on MapReduce (MR) engines, but support for Tez and Spark is under way. Cubert documentation and code are available on GitHub.
