Apache Spark 1.2.0 Supports Netty-based Implementation, High Availability and Machine Learning APIs

Apache Spark 1.2.0 was released with major performance and usability improvements in the Spark core engine. It represents the work of 172 contributors from over 60 institutions and comprises more than 1000 patches.

Spark 1.2.0 is fully binary compatible with 1.1 and 1.0 and includes a Netty-based implementation of its network transport, which significantly improves efficiency. Spark Streaming adds support for Python and for high availability via write-ahead logs (WALs). In addition, there is a new set of machine learning APIs called spark.ml.

Spark SQL, which is a relatively new project, has improved support for external data sources.

InfoQ caught up with Patrick Wendell, release manager for earlier Spark releases, a Spark committer and PMC member who works at Databricks.

InfoQ: First things first. Spark seems to usher in a new paradigm. Should a developer who is well versed in Map/Reduce even care about Apache Spark?

Spark was created initially to improve on the Map/Reduce model, so existing Map/Reduce developers should definitely give Spark a try! Compared to Map/Reduce, Spark offers a higher-level, more expressive API in addition to a rich set of built-in and community libraries. To draw an analogy, if Map/Reduce is like an assembly language, i.e. low-level and imperative, Spark is more like a modern programming language with libraries and packages. Spark also provides significant performance improvements over Map/Reduce.
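To make the analogy concrete, here is a minimal plain-Python sketch of a word count written as explicit map, shuffle and reduce phases. It uses neither Hadoop nor Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. It is meant to show the amount of plumbing the low-level model asks the developer to think about, not how either system is implemented.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Wire the three phases together by hand."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# For comparison only (not executed here): with Spark's RDD API the same job
# is usually written as one chained expression, along the lines of
#   sc.textFile(path).flatMap(lambda l: l.split())
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# where `sc` is a SparkContext and `path` is an input location you supply.

if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print(word_count(sample))
```

The chained form in the comment relies on Spark's flatMap, map and reduceByKey operations; the grouping step that is spelled out by hand above is handled by the engine.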

InfoQ: You can run Map/Reduce programs and other programs as well on YARN. What is the relationship between Apache Spark and YARN if any?

Spark can run in many different environments, ranging from co-existing with Hadoop deployments, to running in a Mesos cluster, and also in a managed service such as Databricks Cloud. In Hadoop environments, YARN is the cluster manager that helps launch and schedule the distributed components of a running Spark application. YARN can multiplex both Spark and MapReduce workloads on the same cluster hardware.

InfoQ: Do you have to be familiar with Scala to be a power Spark user?

Today there are more Java and Python users of Spark than Scala users, so no knowledge of Scala is necessary. Spark’s programmatic shell is provided in both Python and Scala (Java doesn’t have an interactive shell, so we don’t have that feature in Java). Spark’s SQL features are available from all languages. For those wanting to try something new, the Scala API is always available.

InfoQ: Spark SQL is a recent addition. Does being able to use the JDBC/ODBC APIs with Spark make it more developer friendly?

Being able to expose Spark datasets over JDBC/ODBC is one of the most popular features we’ve provided in the last year. These interfaces allow querying Spark data with traditional BI and visualization tools as well as integrating with third-party applications. With a single program, Spark allows you to ETL your data from whatever format it is currently in (JSON, Parquet, a database), transform it, and expose it for ad-hoc querying. This is one of the most powerful concepts in Spark: a unification of what used to take many separate tools.

InfoQ: Disk storage is unlimited whereas memory is ultimately limited. Does Apache Spark have data size limitations unlike Apache Hadoop? What are the types of applications that can benefit most from Apache Spark?

While memory available in modern clusters is skyrocketing, there are always cases where data just won’t fit in memory. In all modern versions of Spark, most operations that exceed available memory will spill over to disk, meaning users need not worry about memory limits. As an example, Spark’s win of the Jim Gray sort benchmark occurred on a data set many times larger than could fit in cluster memory, and even with this Spark’s efficiency was several multiples higher than other widely used systems.
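The idea of spilling to disk can be illustrated with a small external merge sort in plain Python. This is a sketch of the general technique under stated assumptions, not Spark's actual implementation: the names external_sort and max_in_memory are hypothetical, the input is assumed to be integers, and the memory limit is modelled simply as a maximum number of buffered values.

```python
import heapq
import os
import tempfile


def _spill(sorted_chunk, spill_dir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "w") as handle:
        for value in sorted_chunk:
            handle.write(f"{value}\n")
    return path


def _read_run(path):
    """Stream a spilled run back from disk one value at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, max_in_memory=1000):
    """Yield integers in sorted order while buffering at most max_in_memory of them.

    Whenever the in-memory buffer fills up, it is sorted and spilled to disk.
    At the end, all spilled runs and the remaining buffer are merged lazily.
    """
    with tempfile.TemporaryDirectory() as spill_dir:
        runs = []
        buffer = []
        for value in values:
            buffer.append(value)
            if len(buffer) >= max_in_memory:
                buffer.sort()
                runs.append(_spill(buffer, spill_dir))
                buffer = []
        buffer.sort()
        streams = [_read_run(path) for path in runs] + [iter(buffer)]
        # heapq.merge reads one value per stream at a time, so memory use
        # during the merge is proportional to the number of runs.
        yield from heapq.merge(*streams)


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8]
    print(list(external_sort(data, max_in_memory=3)))
```

The caller never sees the spill files: the result is the same as an in-memory sort, and only the speed differs. That is the property described above, where users need not worry about memory limits.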

InfoQ: Let’s talk about performance. It’s impressive that Apache Spark was a joint winner in the Sort Benchmark originally instituted by Jim Gray. Can you talk about the relevance of these results keeping in mind that developers are generally skeptical about benchmark results?

We chose to pursue the Jim Gray benchmark because it is maintained by a third-party committee. This ensures that the result is independently validated and based on a set of well-defined industry rules. Developer skepticism about benchmarks is warranted: self-reported, unverified benchmarks are often more marketing material than anything else. The beauty of open source is that users can try things out for themselves at little or no cost. I always encourage users to spin up Databricks Cloud or download Spark and evaluate it with their own data, rather than focusing too much on benchmarks.

It’s also important for users to think holistically about performance. If your data spends 6 hours in an ETL pipeline to get it into just the right format, or requires a 3-month effort to accommodate a schema change, is it really a win if the query time is marginally faster? If you need to transfer your data into another system to perform machine learning, is that worth a 10% performance improvement? Data is typically messy and complex, and end-to-end pipelines involve different computation models, such as querying, machine learning, and ETL. Spark’s goal is to make working with complex data in real-world pipelines just plain simple!

A more detailed explanation of the features in Apache Spark 1.2.0 is available on the Databricks company blog. You can download the latest version from the Apache Spark download page.
