Spark Officially Graduates From Apache Incubator

by Alex Giamas on Feb 28, 2014. Estimated reading time: 2 minutes

Recently, Spark graduated from the Apache incubator. Spark claims speed improvements of up to 100x over Apache Hadoop for in-memory datasets, falling back gracefully to roughly a 10x improvement when data is processed on disk. Since it was open sourced in 2010, Spark has been one of the most active projects in the community.

Its rapid growth can be attributed to a number of factors. Spark combines its own DSL with SQL support, allowing users to leverage the well-known SQL language. Spark's primary API is a Scala DSL built around a distributed collection of items called a Resilient Distributed Dataset (RDD). An RDD supports bulk and aggregate operations such as filter, map and reduceByKey, executed in a distributed fashion. Shark provides the same speed as the native Scala API while accepting Hive SQL. Because it reuses Hive's frontend and metastore, Shark can be used alongside Hive, sharing data, queries and UDFs.
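
To make the RDD operations above concrete, here is a minimal word-count sketch in Scala; the application name, local master setting and sample data are hypothetical and only keep the example self-contained.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local configuration; a real deployment would point at a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Build an RDD from an in-memory collection and apply bulk operations.
    val lines = sc.parallelize(Seq("spark graduates", "spark is fast", "hadoop and spark"))
    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of one
      .reduceByKey(_ + _)         // aggregate counts per word across partitions

    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }
    sc.stop()
  }
}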

Machine Learning algorithms come out of the box with MLlib, which provides a range of algorithms for classification, regression, clustering and recommendation. MLlib is one component of MLBase, a distributed Machine Learning system that aims to make Machine Learning tasks more accessible to both end users and ML researchers. It is the first system that frees users from low-level algorithm choices and automatically optimizes for distributed execution. Algorithm choices are made based on ML best practices and a cost-based model, while distributed execution, as in Apache Mahout, is optimized for the data-access patterns of Machine Learning.
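
As a rough illustration of what MLlib provides out of the box, the sketch below clusters a handful of two-dimensional points with MLlib's KMeans; the sample data and parameters are invented, and an existing SparkContext named sc is assumed.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext `sc`.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Cluster the points into two groups, running at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)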

Graph algorithms can be implemented with GraphX, which combines data-parallel and graph-parallel semantics in a single system. GraphX offers comparable or better performance than Apache Giraph, the established graph-processing system used at Facebook.
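
A minimal GraphX sketch, again assuming an existing SparkContext `sc`; the vertex and edge data are invented, and PageRank stands in for any graph algorithm one might run on the resulting property graph.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc`.
val vertices: RDD[(Long, String)] = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")
))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)
))

// Build a property graph and run PageRank to a tolerance of 0.001.
val graph = Graph(vertices, edges)
graph.pageRank(0.001).vertices.collect().foreach {
  case (id, rank) => println(s"$id -> $rank")
}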

SparkR exposes the Spark API to R, allowing statisticians to submit jobs from an R function directly to an Apache Spark cluster. Aside from RDBMSs, R is the most popular tool among data scientists, but it is single-threaded and not designed for large data sets. SparkR overcomes these limitations, with the caveat that it only works well for inherently parallelizable algorithms such as gradient descent.
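
SparkR itself is driven from R, but the data-parallel pattern it depends on can be sketched in Scala: one step of batch gradient descent in which per-example gradients are computed across the cluster and then summed. The function and its parameters are hypothetical and only illustrate why such algorithms map well onto Spark.

import org.apache.spark.rdd.RDD

// One gradient-descent step for squared loss over (label, features) pairs.
// Per-example gradients are computed in parallel and summed with reduce,
// which is the property that makes the algorithm easy to distribute.
def step(data: RDD[(Double, Array[Double])], w: Array[Double], lr: Double): Array[Double] = {
  val n = data.count()
  val grad = data.map { case (y, x) =>
    val pred = x.zip(w).map { case (xi, wi) => xi * wi }.sum
    x.map(_ * (pred - y))                               // gradient for one example
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w.zip(grad).map { case (wi, gi) => wi - lr * gi / n } // averaged update
}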

Spark can be deployed on Apache YARN, providing easy integration and co-existence with heterogeneous systems. It also ships as part of the Cloudera Enterprise Data Hub edition, backed by Cloudera and Databricks, the company behind Spark's commercialization. Finally, Spark Streaming helps with quick prototyping while providing useful distributed-systems semantics. Spark's code is available on GitHub.
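
A minimal Spark Streaming sketch; the socket source on localhost:9999 is hypothetical, and the job simply counts words in five-second micro-batches using the same operations as the batch API.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical local setup with five-second micro-batches.
val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Count words arriving on a (hypothetical) text socket.
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
words.map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()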
