Apache Spark 1.3 Released, Data Frames, Spark SQL, and MLlib Improvements

Apache Spark has released version 1.3 of their project. The main improvements are the addition of the DataFrames API, better maturity of the Spark SQL, as well as a number of new methods added to the machine learning library MLlib, and better integration of Spark Streaming with Apache Kafka.

One of the main feature additions is the DataFrames API. Modeled after the data structures in R of the same name, the goal is to provide better support for working with tabular data. A DataFrame consists of a table with typed and named columns, and provides operations to filter, group, or compute aggregates, similar to queries in SQL.

DataFrames are tightly integrated with Spark SQL, a distributed SQL query engine. DataFrames can be constructed from SQL query result sets, or from RDDs, or loaded from files like the Parquet file format. So far, RDDs (resilient distributed datasets) were the main distributed data collection type in Spark, but DataFrames aim at providing better support for structured data.

Spark MLlib, the machine learning library that is part of Spark has seen a number of additions of learning algorithms, for example, Latent Dirichlet Allocation, a probabilistic method for identifying topics in documents and clustering documents accordingly, or multinomial logistic regression for multiclass prediction tasks. There is some beginning support for distributed linear algebra where blocks of a matrix are stored in a distributed fashion. Such functionality is necessary for many more complex data analysis tasks, including matrix factorization, which often involve matrices too large to fit into main memory.

On top of these algorithms, Spark is also adding higher level features for data analysis like import and export of learned prediction models, and the Pipeline API introduced in version 1.2 that allows users to specify a pipeline of data transformations in a high-level fashion. Such pipelines are often used in data science to extract relevant features from raw data.

Futhermore, Spark now has direct integration with Apache Kafka to ingest real-time event data.

Apache Spark was initially started in 2009 at the UC Berkeley AMPLab. It can run standalone or on top of an existing Hadoop installation and provides a larger set of operations compared the the MapReduce processing model of the original Hadoop. It holds data in memory if possible, resulting in further performance improvements compared to the largely disk based MapReduce. In addition, it allows to process event data in near real-time by collecting data in a buffer and then periodically processing these mini-batches. Similar projects are Apache Flink, which has a similar feature set but also includes query optimization and a continuous stream processing engine, or projects like Cascading and Scalding, which provide a similar set of high-level operations, but run on top of the MapReduce processing model.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Infrastructure topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter