Spark in Action Book Review & Interview
In the book Spark in Action, authors Petar Zecevic and Marko Bonaci discuss the Apache Spark framework for data processing, covering both batch and streaming use cases. They introduce the architecture of Spark and core concepts such as Resilient Distributed Datasets (RDDs) and DataFrames.
The authors discuss, with code examples, how to process data using Spark libraries like Spark SQL and Spark Streaming, and how to apply machine learning algorithms using Spark MLlib and ML. They also cover graph data analytics using the Spark GraphX library.
InfoQ spoke with Petar and Marko about the Apache Spark framework, developer tools, and the features and enhancements planned for future releases.
InfoQ: The Spark framework offers several advantages and capabilities in data processing compared to other technologies. Can you talk about some of these capabilities?
Petar Zecevic: Yes, Spark is a general-purpose data processing system that can be used for a large number of tasks. The biggest advantage of Spark, compared to Hadoop's MapReduce, is its speed. It is a well-known fact that Spark is much faster than MapReduce. That's what gave Spark its initial marketing boost. The other thing that sets it apart is its unifying nature. You can use Spark for real-time and batch processing, machine learning and performing SQL queries, using a single library. Moreover, you can use its API from four different languages (Python, Scala, Java and R). Besides its capabilities, what makes it attractive is its widespread use: it is a part of all major Hadoop distributions and many corporations are using it in their data processing pipelines. IBM's recent decision to commit 3000 developers solely to Spark is a sign that Spark has gone mainstream.
Marko Bonaci: Yes, IBM is always late to the party, but so-called “enterprise users”, who still shy away from open source software, basically wait for one of the giants to “christen” the product, which is why we can now expect an increase in enterprise users. If you’re thinking of making an investment in Spark, now’s the time to do it.
InfoQ: What are some limitations of Spark framework?
Zecevic: Spark is not suitable for OLTP applications, which require many reads and updates of small pieces of data by many users. It is better suited for analytics and ETL tasks. A limitation of Spark Streaming is that it cannot help you if you need to process incoming data with a latency of less than half a second. Spark has yet to implement many types of machine learning algorithms, for example neural networks (a work in progress).
Bonaci: There are some recent efforts to integrate Spark with TensorFlow, a deep learning (i.e. neural networks) framework that Google open sourced in December of 2015, where Spark helps by parallelizing some parts of a deep learning algorithm, mainly parameter tuning. It will be interesting to see whether Spark can be employed in this field of machine learning. By the way, just a month behind Google, Microsoft also open sourced its deep learning framework, CNTK.
InfoQ: Can you discuss the considerations in setting up Spark cluster to leverage the distributed data processing capabilities of Spark?
Zecevic: The first decision is choosing the cluster type. Some recent statistics suggest that the most popular choice is the Spark standalone cluster, closely followed by YARN. Mesos is the third choice, but a significant number of users still use it. The Spark standalone cluster is simple to set up and use and was built specifically for Spark, so it is a fine choice if you only need to run Spark and no other distributed frameworks. YARN and Mesos, on the other hand, are open execution environments, and you can use them for running other applications besides Spark. Mesos offers a fine-grained mode, not available on YARN, and it is currently the only cluster manager that allows you to run Spark in Docker images. But if you need to access Kerberos-secured HDFS, YARN is currently your only option.
The secret ingredient for all three cluster managers is to give Spark lots of memory :) Also, locality is very important for performance. That means you should have your Spark executors running as close to your data as possible (have Spark installed on the same nodes as HDFS, for example).
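To make those two recommendations concrete, settings like the following typically go in Spark's spark-defaults.conf file. The property names are genuine Spark configuration parameters; the values are placeholders you would tune for your own cluster:

```
spark.executor.memory      8g    # give executors plenty of memory
spark.default.parallelism  200   # number of partitions for shuffles; tune to the cluster
spark.locality.wait        3s    # how long a task waits for a data-local executor slot
```

Raising spark.locality.wait makes Spark try harder to schedule tasks on the nodes that hold the data, at the cost of some scheduling delay.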
Bonaci: There are reasons to think that Mesos will be the winner here going forward, since neither YARN nor, of course, Spark’s own cluster manager is general enough to run arbitrary software. Mesos, as a data center operating system, has a bright future beyond Spark, and I’m sure that we’ll see a significant rise in Spark deployments on Mesos in the near future, once companies start to leverage Mesos for running all of their workloads. Look at Airbnb, Twitter, Apple and many others who have already stopped enumerating their infrastructure by name. Cattle, not pets.
InfoQ: What are the development and testing tools that developers who are just starting to learn Spark can use in their applications?
Zecevic: Since Spark applications can be written in four languages (Scala, Java, Python and R), the answer depends on the choice of language. Eclipse and IntelliJ IDEA are both widely used development environments. We give step-by-step instructions in the book for setting up Eclipse; we chose Eclipse because there were already lots of resources on using IDEA for Spark development.
InfoQ: What should we consider in terms of performance tuning when using Spark for data processing and analytics?
Zecevic: There are lots of knobs that can be turned to tune Spark's behavior in a cluster and they differ among different cluster managers. First, you should get familiar with the different configuration options.
Out-of-memory exceptions are common when using Spark. Besides increasing memory, the remedy is to increase parallelism. But it is hard to say in advance which value should be used for the number of partitions (the parallelism), so it is often a matter of trial and error. The process is highly dependent on the particular data set and cluster manager used, so it can be tricky to set up properly right away.
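The link between partition count and memory pressure can be illustrated without Spark at all. This plain-Python sketch (the `partition` helper is our own, not a Spark API) hash-partitions records the way a shuffle roughly would, showing that more partitions means each task holds less data at once:

```python
# Illustrative sketch: why raising the number of partitions lowers
# the amount of data each individual task must hold in memory.

def partition(records, num_partitions):
    """Hash-partition records, roughly as a Spark shuffle would."""
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(r) % num_partitions].append(r)
    return parts

records = list(range(100000))

for n in (4, 400):
    parts = partition(records, n)
    biggest = max(len(p) for p in parts)
    print(f"{n} partitions -> largest partition holds {biggest} records")
```

With 4 partitions each task handles 25,000 records; with 400, only 250, which is exactly why increasing parallelism can cure an out-of-memory error.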
Bonaci: The book contains a chapter dedicated to running Spark, which includes recommendations for the most important Spark configuration parameters.
Zecevic: Locality is also important for performance, as I already said, you should have your executors as close to your data as possible.
Spark programs themselves should avoid unnecessary shuffling of data (we explain how to do that in our book) and can make use of other optimization tricks (such as broadcasting small data sets and using custom partitioners, for example).
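The “broadcast a small data set” trick mentioned above can be sketched in plain Python (no Spark; the names here are illustrative, not a real API). Instead of shuffling both sides of a join across the network, the small lookup table is shipped whole to every task, which then joins its partition locally:

```python
# Sketch of a broadcast (map-side) join: the small table travels to the
# data, so the large side never needs to be shuffled.

small_lookup = {1: "books", 2: "music"}          # small data set to "broadcast"

def map_side_join(partition, lookup):
    # Each task joins its own partition against the broadcast copy locally.
    return [(item_id, lookup.get(cat_id)) for item_id, cat_id in partition]

large_partition = [(101, 1), (102, 2), (103, 1)]  # one partition of the big table
print(map_side_join(large_partition, small_lookup))
# -> [(101, 'books'), (102, 'music'), (103, 'books')]
```

In real Spark code the equivalent idea is broadcasting the small collection to executors so joins avoid a shuffle stage entirely.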
Spark's developers made a significant performance advancement with the Tungsten project. Project Tungsten has not yet reached general availability, but we can be sure that future Spark releases will benefit even more from Tungsten's optimizations.
Bonaci: Project Tungsten is one of the efforts under the “get Spark as close to bare metal as possible” umbrella, where the goal is to remove any general-purpose software between Spark and the operating system (Tungsten allows Spark to bypass the JVM and do memory management by itself). Tungsten makes a lot of sense, mainly because it makes a large class of JVM-related problems go away, garbage collection being the main one. Since end users are not managing memory manually, there’s no risk of segmentation faults, so the full potential is there to give Spark arbitrarily large chunks of off-heap memory, with significant performance improvements and no downsides visible from the end user’s perspective.
InfoQ: NoSQL databases offer data storage solutions for managing a variety of data. How can we leverage NoSQL Databases and Big Data with Spark framework to get the best of both worlds?
Zecevic: Spark can talk to many NoSQL databases, such as MongoDB and Redis, through purpose-built connectors. I haven't tried it myself yet, but the recently developed Redis connector sounds very promising as a fast, in-memory alternative to HDFS. Cassandra and Spark complement each other nicely, particularly for storing and analyzing time-series data. Cassandra wasn't designed for certain types of queries and that is where Spark can help. Many other data stores, such as HBase, Couchbase and HyperTable are supported by community-built connectors and can be effectively leveraged alongside Spark.
Bonaci: Besides Cassandra, Basho’s Riak is another database that was inspired by the Amazon’s Dynamo paper whose community has built a Spark connector. Connectors also exist for Elasticsearch, Infinispan, SequoiaDB, Cloudant and others.
InfoQ: Real time data processing and analytics are getting lot of attention lately. Can you talk about how this data processing compares to batch processing? How is it different in terms of design patterns and considerations when developing real time data processing solutions?
Zecevic: Real-time data processing operates on much smaller input data sets than batch processing and needs to process data quickly. Batch processing, on the other hand, has much more time at its disposal. Batch jobs can take hours to complete and sometimes involve processing all the available data in a department or an organization.
Because of these factors, real-time processing is capable of giving results almost instantaneously as the data is coming in, while batch jobs give results with large delays.
The Lambda architecture is a widely adopted solution for overcoming the time gap between the moment data enters an organization and the moment it is processed, aggregated along with the organization's other data, and made available for further consumption. It does this by sending the same data to be processed both by batch jobs and by a real-time processing system. Output from the real-time processing system is regarded as temporary and can be used while waiting for the batch jobs to finish.
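The serving side of the Lambda architecture can be sketched in a few lines of plain Python. This is a toy illustration, not any real framework's API: queries are answered from the (possibly stale) batch view, patched with the real-time view covering events the batch layer has not processed yet:

```python
# Minimal sketch of Lambda-architecture serving: batch view plus speed-layer view.

batch_view = {"page_a": 1000, "page_b": 500}   # computed by batch jobs hours ago
realtime_view = {"page_a": 7, "page_c": 3}     # counts seen since the last batch run

def merged_count(page):
    # The temporary real-time result fills the gap until the next batch finishes.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(merged_count("page_a"))  # 1007: batch result plus recent events
print(merged_count("page_c"))  # 3: so far seen only by the speed layer
```

When the next batch job completes, its output replaces the batch view and the real-time view is reset, which is why the speed layer's output is regarded as temporary.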
Spark has managed to reconcile both real-time and batch processing by dividing real-time streams of data into mini-batches. In that way the same API can be used for both approaches. With careful planning and design, large parts of Spark batch job codebase can be used by real-time processing programs.
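The mini-batch idea can be shown without Spark (plain Python, illustrative names): the very same batch function is applied unchanged to the whole data set and to each small time slice of a stream, which is how Spark lets batch and streaming code share an API:

```python
# Sketch of mini-batching: one batch function serves both processing modes.

def batch_job(records):
    """Some batch computation -- here, simply a sum."""
    return sum(records)

def mini_batches(stream, batch_size):
    # Slice the incoming stream into small fixed-size batches.
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = [1, 2, 3, 4, 5, 6]
print(batch_job(stream))                                # whole batch: 21
print([batch_job(b) for b in mini_batches(stream, 2)])  # per mini-batch: [3, 7, 11]
```

In Spark Streaming the slicing is done by time interval rather than by record count, but the principle of reusing batch logic on each slice is the same.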
Bonaci: If you think about this real-time/batch dichotomy, every batch was a stream when it was little :)
We have not been processing all data in real time simply because we never had that capability, mainly due to technical challenges and cost. Nowadays we are seeing a new trend, where forward-thinking organizations are starting to process all their incoming data in real time, extracting valuable information as soon as it touches the organizational boundaries. You can think of this paradigm as having constantly updated relational database materialized views. Why not use a relational database then? Well, these amounts of data do not fit inside a relational database. At least not in any affordable one :)
This is why open source is important. It democratized the big data space. Without Hadoop, even storing large amounts of data would be reserved for the so-called web-scale companies, such as Google, Facebook and LinkedIn, that can afford the engineering effort and the needed infrastructure. Without Kafka it would be almost impossible to deal with a company-wide data flow, let alone process the data in real time with Spark hooked up at the other end.
This real-time trend was spearheaded by Apache Kafka, which was built at LinkedIn and basically turned their data pipeline upside down, from heavily batch-oriented to real-time. This idea of “real-time views” also came from LinkedIn, in the form of the Apache Samza project. I can recommend a talk called “Turning the database inside out with Apache Samza”, by LinkedIn engineer Martin Kleppmann.
InfoQ: What should developers who are just getting started with the Spark machine learning library focus on in terms of different ML algorithms and data pipelines?
Zecevic: Spark provides the means to train and use many machine learning algorithms for classification, regression, dimensionality reduction and so on, but using algorithms is just a part of the process of building a machine learning application. Data preparation, cleaning and analysis are important tasks when doing machine learning and fortunately, Spark can help in that department, too. So besides learning how to use Spark's machine learning algorithms, developers should also get acquainted with Spark Core and Spark SQL APIs. In the book we explain those preparatory steps and go through basics of regression, classification and clustering and show how to use those algorithms in Spark, so we hope the book will be of help to those developers who are just starting in the field.
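The point that training is just one stage of a pipeline can be illustrated with a Spark-free toy example in plain Python (the helper names are ours, not a Spark API): cleaning comes first, then fitting, here a one-variable least-squares regression:

```python
# Toy ML pipeline sketch: data cleaning feeds model training.

def clean(rows):
    # Drop records with missing values -- the preparation work that matters
    # as much as the choice of algorithm.
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_linear(rows):
    # One-variable least squares: y ~ a*x + b.
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    a = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return a, my - a * mx

raw = [(1, 2), (2, 4), (None, 9), (3, 6)]  # one record has a missing feature
a, b = fit_linear(clean(raw))
print(round(a, 2), round(b, 2))  # -> 2.0 0.0
```

In Spark the same shape appears at scale: Core and SQL APIs handle the `clean` step over distributed data, and MLlib provides the `fit` step.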
InfoQ: Are there any new libraries you would like to see added to Spark framework?
Bonaci: There is a site that the team from Databricks (privately owned company, founded by the creators of Spark from Berkeley) maintains, called Spark Packages, where anyone can submit any useful type of Spark-related software. There are almost 200 packages already up there, from core add-ons, data source connectors, ML modules and algorithms, streaming and graph processing helper libraries … all the way to small fun projects, such as hooking up Spark to read data from a Google Spreadsheet.
InfoQ: Can you talk about other components in Spark eco-system like BlinkDB and Tachyon and how they complement Spark framework's capabilities?
Zecevic: There are lots of Spark packages available for download from spark-packages.org site and there are still others available independently. BlinkDB is an interesting project with much potential, but it seems neglected (the last version is 2 years old). But the idea of approximate results is very interesting.
Tachyon has been used with Spark from the beginning. It provides a distributed in-memory data store, and Spark can use it to offload data into memory instead of onto disk. That makes Spark much faster.
There are also other interesting projects, such as Zeppelin (a web-based notebook and data analysis tool that lets you execute Spark programs and inspect the results from a browser), Indexed RDD (a Spark RDD holding updatable and indexed data), Sparkling Water (provides a way to run H2O clusters on Spark; H2O is an easy to use machine learning system that has implementations of deep learning algorithms, among others), and others.
InfoQ: What's coming up in Spark in terms of new features that the users should look forward to?
Bonaci: There were some problems with the Kafka connector, mainly because it implements the high-level Kafka 0.8 consumer, which does not allow you to start consuming from an arbitrary position in a Kafka partition (only from the beginning or the end). That meant that Spark needed to rely on its own checkpoint mechanism to save consumed messages in order to minimize data loss in the event of node failure. An effort is currently under way to implement the new Kafka 0.9 consumer, which unites the previous high-level and low-level consumers, keeping the best of both worlds.
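Why arbitrary-offset consumption matters can be shown with a toy model of a Kafka partition in plain Python (simplified, not the Kafka client API). If a consumer can resume only from the beginning or the end, a restart either reprocesses everything or drops messages; a saved offset lets it resume exactly where it left off:

```python
# Toy model of resuming consumption from a saved offset in a partition.

partition_log = ["m0", "m1", "m2", "m3", "m4"]  # a Kafka partition, simplified

def consume_from(log, offset):
    """Resume reading from a specific position in the partition."""
    return log[offset:]

checkpointed_offset = 3  # position saved before a node failure
print(consume_from(partition_log, 0))                    # from the beginning: all five
print(consume_from(partition_log, checkpointed_offset))  # resume: ['m3', 'm4']
```

The Kafka 0.9 consumer exposes exactly this kind of positional control, which is what removes the need for Spark's own message checkpointing.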
Zecevic: Of the larger features, I'd mention artificial neural networks, which are under development, and DataFrames, which are coming to the GraphX component and should make it faster and easier to use. The Tungsten project will bring improvements to an even larger part of Spark. IBM is adding resource monitoring capabilities to the Web UI. There are also lots of smaller features and improvements coming, too numerous to mention here. Spark is a very busy project and new things get added so fast that it's hard to keep up, so stay tuned.
InfoQ: Are there any specific features or enhancements you want to see in future Spark releases?
Zecevic: I'm hoping IndexedRDD will become a part of Spark Core soon. I already mentioned neural networks, but convolutional neural networks are also a needed (and planned) addition. Their performance would benefit greatly if they could be executed on GPUs. So another improvement would be to have a straightforward way of using GPUs from Spark.
Much can be done on the usability front. For example, graphical wizards could be built for various tasks, such as building machine learning models and transforming and analyzing data with DataFrames.
Although this might sound like science fiction, it would be great if Spark could configure itself on the fly (memory, CPU, parallelism, streaming batch size etc.), based on data, user load, cluster architecture and so on. In other words, make it more plug-and-play.
Petar and Marko also talked about the book and how the readers can bootstrap a new Spark project using the instructions provided in Chapter 3.
Zecevic: Spark has gone mainstream and if you are working with Big Data it would be a mistake to ignore it. If you haven't already done so, get our book and we'll help you on that journey.
Bonaci: Your readers might find it useful that, in the process of writing the book, we developed a Maven archetype (a Scala project template) that allows you to bootstrap a new Spark project in Eclipse or IDEA in just a few clicks. The process is explained in the book’s publicly available GitHub repository.
The book contains step-by-step instructions, starting from the very first steps: downloading and installing Spark, the initial configuration, IDE installation and setup, exploring the Spark shell, creating a new project in your IDE, and code examples. We even have a few stories to try to bring the examples alive. More complicated parts are sprinkled with images, so the book reads like one large Spark tutorial. We are very active on the book’s mailing list, so any time you stumble we promise to be there to help you get up :)
The book is currently in early access stage (MEAP) and will be available this summer.
About the Authors
Petar Zečević is the CTO at SV Group. During the last 14 years he has worked on various projects as a Java developer, team leader, consultant and software specialist. He is the founder and, with Marko, the organizer of the popular Spark@Zg meetup group.
Marko Bonaći has worked with Java for 13 years. He works at Sematext as a full-stack developer and consultant, mainly with Kafka on the back end and React on the front end. Before that, he was the team lead for SV Group's IBM Enterprise Content Management team.