R is still one of the most powerful languages for data scientists, and the bar was raised even further at the end of January 2014 when UC Berkeley’s AMPLab announced a developer preview of their new project SparkR to use Apache Spark natively from R.
A big data framework for in-memory data processing at scale, Apache Spark has been gaining traction as large companies such as Cloudera throw their weight behind the project. Cloudera recently announced that Spark is now officially supported in its Cloudera Distribution for Hadoop (CDH) from version 4.4.0 onwards. This includes Spark 0.9, the most recent release, which came out in February and is a prerequisite for SparkR. SparkR comes at the right time: CDH is one of the most popular Hadoop distributions, so this should help drive adoption among data scientists, who may be more familiar with R than with Java or Scala, as shown by a recent O'Reilly survey of data scientists.
SparkR should be seen as a lightweight frontend for using Spark from R: its API will not be as extensive as the Scala or Java bindings, but it will be sufficient to run Spark jobs from R and manipulate data. One of its key features is the ability to serialize closures, which transparently copies the variables a computation needs to the Spark cluster. SparkR also integrates with other R packages via a built-in function that tells the Spark cluster to load a particular package needed for a computation; unlike closure variables, this has to be specified manually. More details on the technical capabilities of SparkR can be found in this summary. SparkR can also use Spark's EC2 scripts to be set up easily on EC2, and instructions for doing so can be found on GitHub.
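To make the closure-serialization idea concrete, here is a minimal conceptual sketch in plain Python. It is not SparkR code and not how SparkR is implemented; the helper names `pack`, `unpack`, and `make_scaler` are hypothetical and exist only for this illustration. It shows the general technique: find the variables a function captured from its enclosing scope, ship them along with the function's code, and rebuild the function on the receiving side.

```python
import builtins
import marshal
import types


def pack(fn):
    """Serialize a function's code plus the variables it captured (hypothetical helper)."""
    cells = fn.__closure__ or ()
    # Map each captured (free) variable name to its current value.
    env = {name: cell.cell_contents
           for name, cell in zip(fn.__code__.co_freevars, cells)}
    return marshal.dumps(fn.__code__), env


def unpack(payload, env):
    """Rebuild the function from its code and captured variables, as a worker might."""
    code = marshal.loads(payload)
    # Recreate one closure cell per free variable, in the order the code expects.
    closure = tuple(types.CellType(env[name]) for name in code.co_freevars)
    return types.FunctionType(code, {"__builtins__": builtins},
                              code.co_name, None, closure)


def make_scaler(factor):
    # `factor` is captured by the inner function; the caller never passes it again.
    def scale(x):
        return x * factor
    return scale


scale = make_scaler(3)
payload, env = pack(scale)           # what would be sent over the wire
remote_scale = unpack(payload, env)  # what the receiving side reconstructs
results = [remote_scale(x) for x in [1, 2, 3]]
```

The point of the sketch is that the user only writes `scale`; the captured `factor` travels with it automatically. A package dependency, by contrast, is not discoverable this way, which is why it has to be declared by hand.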
The data science community has been vocal about SparkR, and Twitter in particular saw many messages of support for the project. Alex Pinto, lead at MLSecProject, for example, tweeted the following:
This is very promising: SparkR by @amplab. Puts together my favorite things for data analysis.
The project is on GitHub and already has an active community, with close to 100 stars. Considering that the project is barely a month old, this is significant growth. There are also several open issues, a sign that the community is actively involved in this new open-source project.
The AMPLab team has expressed interest in eventually integrating SparkR with Spark's MLlib machine learning library, so that algorithms can be parallelized seamlessly without the user having to specify manually which parts can run in parallel. MLlib is one of the components of a larger machine learning project called MLBase, which also includes higher-level abstractions and an optimizer. MLlib is one of the fastest-growing machine learning libraries, with more than 137 contributors, so making it usable from R makes a lot of sense for AMPLab as a way to help ensure contributions to MLlib from R users.
Community comments
R Programming
by Sonam Gupta,
Thanks for your post! In the past, R has been criticized for delivering slow analyses when applied to large data sets, but more recent versions of the language are attempting to address this problem. Today, R is being adopted by enterprise users for big data analytics and is increasingly being seen as a challenger to more traditional statistical and advanced analytic platforms. Some vendors now support the use of R in their software or offer completely R-based packages. More at www.youtube.com/watch?v=1jMR4cHBwZE