Apache Spark 1.3 Released: DataFrames, Spark SQL, and MLlib Improvements

by Mikio Braun, Mar 16, 2015

The Apache Spark project has released version 1.3. The main improvements are the addition of the DataFrames API, a more mature Spark SQL, a number of new algorithms in the machine learning library MLlib, and better integration of Spark Streaming with Apache Kafka.

One of the main feature additions is the DataFrames API. Modeled after the data structure of the same name in R, it aims to provide better support for working with tabular data. A DataFrame is a table with typed and named columns, and it offers operations to filter, group, or compute aggregates, similar to queries in SQL.

DataFrames are tightly integrated with Spark SQL, a distributed SQL query engine. They can be constructed from the result set of a SQL query, from RDDs, or by loading files from disk, for example in the Parquet format. Until now, RDDs (resilient distributed datasets) have been the main distributed data collection type in Spark; DataFrames aim to provide better support for structured data.

Spark MLlib, the machine learning library that ships with Spark, has gained a number of new learning algorithms, for example Latent Dirichlet Allocation, a probabilistic method for identifying topics in documents and clustering documents accordingly, and multinomial logistic regression for multiclass prediction tasks. There is also initial support for distributed linear algebra, where the blocks of a matrix are stored in a distributed fashion. Such functionality is necessary for more complex data analysis tasks, such as matrix factorization, which often involve matrices too large to fit into main memory.

On top of these algorithms, Spark is adding higher-level features for data analysis, such as the import and export of learned prediction models and the Pipeline API introduced in version 1.2, which lets users specify a chain of data transformations in a high-level fashion. Such pipelines are often used in data science to extract relevant features from raw data.

Furthermore, Spark now integrates directly with Apache Kafka to ingest real-time event data.

Apache Spark was started in 2009 at the UC Berkeley AMPLab. It can run standalone or on top of an existing Hadoop installation and provides a larger set of operations than the MapReduce processing model of the original Hadoop. It holds data in memory where possible, resulting in further performance improvements over the largely disk-based MapReduce. In addition, it can process event data in near real time by collecting the data in a buffer and periodically processing these mini-batches. Similar projects include Apache Flink, which has a comparable feature set but also offers query optimization and a continuous stream processing engine, and projects like Cascading and Scalding, which provide a similar set of high-level operations but run on top of the MapReduce processing model.
