
Running Spark on R with SparkR

by Charles Menguy on Feb 11, 2014. Estimated reading time: 2 minutes

R is still one of the most powerful languages for data scientists, and the bar was raised even further at the end of January 2014, when UC Berkeley’s AMPLab announced a developer preview of SparkR, a new project that makes it possible to use Apache Spark natively from R.

A Big Data framework for in-memory data processing at scale, Apache Spark has been gaining a lot of traction lately as big companies like Cloudera throw their weight behind the project. Cloudera recently announced that Spark is now officially supported in its Cloudera Distribution for Hadoop (CDH) from version 4.4.0 onwards. This includes the most recent release, Spark 0.9, which came out in February and is a prerequisite for SparkR. SparkR comes at the right time: CDH is one of the most popular Hadoop distributions, so this should help drive adoption among the data science crowd, which may be more familiar with R than with Java or Scala, as shown by a recent survey of data scientists by O'Reilly.

SparkR should be seen as a lightweight frontend for using Spark from R: its API is not as extensive as the Scala or Java bindings, but it is sufficient to run Spark jobs from R and manipulate data. One of its key features is the ability to serialize closures, so that variables referenced in a computation are transparently copied to the Spark cluster when they are needed. SparkR also integrates with other R modules through a built-in function that tells the Spark cluster to load a particular module needed for a computation; unlike closure variables, this must be specified manually. More details on the technical capabilities of SparkR can be found in this summary. SparkR can also take advantage of Spark's EC2 scripts to be set up easily on EC2, and instructions for doing so can be found on GitHub.
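SparkR's own implementation is not reproduced here. As a rough illustration of what serializing a closure involves, the following sketch packages a function together with the variables it captured, turns the result into bytes that could be sent to a worker, and rebuilds the function on the other side. It is written in Python rather than R purely for illustration, it assumes Python 3.8 or later, and the helper names `serialize_closure` and `deserialize_closure` are hypothetical: they are not part of SparkR or Spark.

```python
import marshal
import pickle
import types


def serialize_closure(fn):
    """Pack a function's bytecode and its captured variables into bytes."""
    # Values of the free variables the function closed over, in order.
    captured = [cell.cell_contents for cell in (fn.__closure__ or ())]
    return pickle.dumps((marshal.dumps(fn.__code__), fn.__name__, captured))


def deserialize_closure(payload):
    """Rebuild a callable from bytes produced by serialize_closure."""
    code_bytes, name, captured = pickle.loads(payload)
    code = marshal.loads(code_bytes)
    # Recreate one closure cell per captured value (types.CellType: Python 3.8+).
    cells = tuple(types.CellType(value) for value in captured)
    return types.FunctionType(code, globals(), name, None, cells or None)


def make_scaler(factor):
    # `scale` captures `factor`; the caller never ships it explicitly.
    def scale(x):
        return x * factor
    return scale


payload = serialize_closure(make_scaler(3))   # bytes, as if sent to a worker
restored = deserialize_closure(payload)       # as if rebuilt on the worker
print([restored(x) for x in [1, 2, 3]])       # prints [3, 6, 9]
```

The point of the sketch is that `factor` travels with the function without being passed by hand, which is the kind of transparent copying described above. Anything the function needs beyond its captured variables, such as an extra library, would still have to be made available on the worker separately.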

The data science crowd has been pretty vocal about SparkR, and Twitter in particular saw many messages of support for the project. Alex Pinto, lead at MLSecProject, for example, tweeted the following:

This is very promising: SparkR by @amplab. Puts together my favorite things for data analysis.

The project is on GitHub and already has a fairly active community, with close to 100 stars. Considering that the project is barely a month old, this is significant growth. There are also several open issues, which suggests the community is actively involved in this new open-source project.

The AMPLab team has expressed interest in integrating SparkR with Spark's MLlib machine learning library in the future, so that algorithms can be parallelized seamlessly without having to specify manually which parts of an algorithm can run in parallel. MLlib is one of the components of a larger machine learning project called MLBase, which also includes higher-level abstractions and an optimizer. MLlib is one of the fastest growing machine learning libraries, with more than 137 contributors, so making it usable from R makes a lot of sense for AMPLab as a way to encourage contributions to MLlib from R users.


R Programming by Sonam Gupta

Thanks for your post! In the past, R has been criticized for delivering slow analyses when applied to large data sets, but more recent versions of the language are attempting to address this problem. Today, R is being adopted by enterprise users for big data analytics and is increasingly being seen as a challenger to more traditional statistical and advanced analytic platforms. Some vendors now support the use of R in their software or offer completely R-based packages. More at www.youtube.com/watch?v=1jMR4cHBwZE
