
Spark GraphX in Action Book Review and Interview

Posted by Srini Penchikala on Sep 12, 2016. Estimated reading time: 12 minutes

Key takeaways

  • How graph data analytics differs from traditional data analytics
  • How to use the Apache Spark GraphX library and APIs like GraphFrames for graph data processing
  • Popular use cases for graph data analytics
  • Performance and tuning techniques to develop efficient GraphX programs
  • Upcoming trends in the graph data analytics space

The book "Spark GraphX in Action" from Manning Publications, authored by Michael Malak and Robin East, provides tutorial-based coverage of Spark GraphX, the graph processing library in the Apache Spark framework.

It teaches readers how to install and configure Spark GraphX and then use it to process graph data.

Readers learn how to use SQL with Spark graphs through the GraphFrames API. They also learn how to apply machine learning algorithms to graph data.

Chapter 5 covers built-in algorithms such as PageRank, Triangle Count, Shortest Paths, and Connected Components. The authors also discuss other useful graph algorithms such as Shortest Paths with Weights and Routing (using Minimum Spanning Trees).
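To give a feel for what an algorithm like PageRank computes, here is a minimal sketch in plain Python. It is not the GraphX API and not the book's code: it uses the textbook power-iteration formulation in which ranks sum to 1, and the function name, parameters, and sample edges are assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in vertices}
        for src in vertices:
            for dst in out_links[src]:
                # Each vertex shares its rank equally among its out-links.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by both "b" and "d".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

In this toy graph "c" ends up with the highest rank because two pages link to it, and "d" ends up with the lowest because nothing links to it.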

InfoQ spoke with the authors about their book, the Spark GraphX library, the overall Spark framework, and what’s coming up in the area of graph data processing and analytics.

InfoQ: How do you define graph data?

Michael Malak: Graphs in this context are the things with edges and vertices, not the things that look like stock charts. But that's a vague mathematical abstraction. More concretely, in the first chapter we break down real-world graphs into five categories: network, tree, RDBMS-like, sparse matrix, and kitchen sink.

Robin East: Traditional approaches to data analysis tend to focus on things - entities - such as bank transactions, asset registers and so on. With graph data we are interested not just in the things themselves but in the connections between them. For example, if I have phone call records that tell me that person A called person B, then I can link person A to B; that connection provides valuable information about both of those people that we don’t get from simple data about each individual.
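East's phone call example can be sketched in a few lines of plain Python. The records, names, and the helper `build_contacts` are hypothetical; the point is only that linking callers to callees exposes information (here, a shared contact) that no individual's record contains.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs; the names are made up.
calls = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("dave", "alice")]


def build_contacts(records):
    """Map each person to the set of people they share at least one call with."""
    contacts = defaultdict(set)
    for caller, callee in records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return dict(contacts)


contacts = build_contacts(calls)

# People connected to both bob and dave: a fact about the connections,
# not about either person's own record.
shared = contacts["bob"] & contacts["dave"]
```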

InfoQ: What is graph data analytics and how is it different from processing traditional data?

Malak: As we describe in the first chapter, RDBMSs aren't efficient at handling graph path traversal due to the large number of self-joins required. Another place where graph analytics performs well is processing sparse matrices, as described in the seventh chapter on machine learning.

East: Graph analytics is really a collection of practices that focus on drawing out the information content of these connections between data items. The ability to model connections between different data becomes very powerful when you can see the pattern of connections between different entities. Using the phone call records example again, when we can analyse the ‘web’ of different calls between many different people we can build up a picture of different types of interactions. In many cases we can isolate different behaviours (such as criminal vs. non-criminal) by the structures in the data.
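Malak's point about path traversal can be illustrated with a breadth-first search. With edges stored in a relational table, each additional hop typically needs another self-join. With the graph held as an adjacency structure, one loop follows paths of any length. This is a plain-Python sketch with a hypothetical `hops_from` helper and made-up data, not GraphX code.

```python
from collections import deque


def hops_from(adjacency, start):
    """Breadth-first search: number of hops from start to each reachable vertex."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:  # first visit is the shortest hop count
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


# Hypothetical directed graph; "e" points at "a" but cannot be reached from it.
adjacency = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["a"]}
reach = hops_from(adjacency, "a")
```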

InfoQ: How does graph data analytics help in big data and predictive analytics?

Malak: If you have Big Data, you first need to extract structured data out of it into, generally, a relational model or a graph model. Some problems such as route-finding on a map and social network analysis (especially finding the influencers of a social network graph) are naturally expressed as graph problems. All of machine learning is concerned with making predictions, and the chapter of the book on machine learning, the longest, shows several ways to use machine learning with graphs.

East: The efficacy of big data-based predictive analytics really depends on the ability to extract many different types of feature as input to predictive algorithms. One of my favourite examples in the book is a new take on the old spam detection problem. The problem is to detect web spam using logistic regression. However we take an interesting new idea of augmenting traditional input features with a new graph-based input feature, Truncated Page Rank, and show how the model can be implemented in GraphX.

InfoQ: Can you discuss how GraphFrames work and how they compare with DataFrames?

Malak: GraphX is the official graph processing system that is part of the Apache Spark distribution, even for Spark 2.0. GraphX is based on RDDs: one RDD for the edges and one RDD for the vertices. GraphFrames, which is currently available as an add-on from spark-packages.org, is instead based on DataFrames. Comparing GraphX to GraphFrames is largely a comparison of RDDs to DataFrames: with DataFrames (and thus with GraphFrames) there is a huge potential performance advantage from the Catalyst query planner, Tungsten's sun.misc.Unsafe raw memory layout, and on-the-fly bytecode generation, which in Spark 2.0 has been enhanced to whole-pipeline rather than keyhole code generation for a 10x improvement compared to Spark 1.6 DataFrames. GraphX does have its internal routing table to facilitate the construction of triplets; GraphFrames lacks this but more than makes up for it in the performance improvements it gets for free from DataFrames.

GraphFrames also provides the ability to query with a combination of a subset of Neo4j's Cypher language and the SQL that comes with DataFrames. Finally, Python lovers and those most comfortable with Java can rejoice that GraphFrames provides bindings for those languages, whereas that was a sore spot with GraphX, which officially supports only Scala (though we do show in the book how to jump through a hundred hoops to get it to work with Java). In the tenth chapter, we cover GraphFrames and present an interesting example on finding links missing in Wikipedia that probably should be there.

East: Resilient Distributed Datasets (RDDs) are the core low-level data structure provided by Spark. RDDs are used in GraphX to represent the edges and vertices of a graph. DataFrames, on the other hand, are a higher-level data interface that provides a number of useful developer-facing features like a SQL interface; they also have a number of performance optimisations. GraphFrames represent graphs using DataFrames instead of RDDs.

GraphFrames add a number of key features that are missing from GraphX, such as a query interface and proper Python and Java APIs. However, it is possible to convert from one representation to the other, and in fact this is how the standard algorithms like PageRank and Connected Components are implemented.
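Both libraries keep a graph as two collections, one of vertices and one of edges, and the triplets Malak mentions are in effect a join of each edge with its two endpoint vertices. The sketch below mimics that idea with ordinary Python containers. It uses no Spark, GraphX, or GraphFrames calls, and the data and the `triplets` helper are hypothetical.

```python
# Two collections, loosely mirroring the vertex/edge split described above.
vertices = {1: "Ann", 2: "Bill", 3: "Charles"}              # vertex id -> attribute
edges = [(1, 2, "is-friends-with"), (2, 3, "is-boss-of")]   # (src id, dst id, attribute)


def triplets(vertices, edges):
    """Join every edge with the attributes of its source and destination vertices."""
    return [(vertices[src], attr, vertices[dst]) for src, dst, attr in edges]


result = triplets(vertices, edges)
```

Here `result` holds one (source attribute, edge attribute, destination attribute) tuple per edge. In a distributed setting the cost lies in getting the right vertex attributes to the partition holding each edge, which is the job Malak attributes to GraphX's routing table.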

InfoQ: Can you talk about the following four graph data related concepts: Graph NoSQL databases, Graph data queries, Graph data analytics and Graph data visualization?

Malak: In my June 2016 Spark Summit presentation, I provide a good illustration of a "spectrum" of graph technologies. On one end are the true OLTP-like graph NoSQL databases: Neo4j, Titan, OrientDB, etc. On the opposite end are the OLAP-like graph processing/data analytics systems: GraphX, GraphLab, etc. The territory of graph querying is in the middle of this spectrum. The graph NoSQL databases can also query, and so can GraphFrames, but GraphX is quite limited in this regard. The territory of graph data visualization is orthogonal to this spectrum; graph visualization can be applied to either the OLTP-like graph databases or the OLAP-like graph processing/analytics. In the book, we talk about two specific technologies: Gephi, and the combination of Zeppelin with d3.js. It is important to note that the use case for graph visualization is quite different from the use case for relational data visualization. In relational data visualization, the goal is to gain direct insight from the data, whereas in graph visualization the goal is to debug data or algorithms.

East: As Michael has mentioned there are a number of different Graph databases out there which cater to a range of different use cases. It should be stressed that GraphX provides in-memory graph processing rather than database capabilities. You can construct in-memory graphs in GraphX from a variety of sources which could include queries on a Graph NoSQL database. Actually that latter combination has great potential.

Graph visualisation is a topic that deserves an entire book so it's great that Manning already have one in the advanced stages of writing (Visualizing Graph Data by Corey Lanum).

InfoQ: What are some popular use cases where graph data processing is a better solution to process the data?

Malak: PageRank is kind of the poster child algorithm for GraphX, and that in and of itself has a number of use cases beyond just making your own competitor to Google. It can be used to find the influencers in any network-type graph, such as in a paper-citation network (an example given in the book) or a social network. In the book we also show how to turn PageRank into another metric called Truncated Page Rank to find link farms of spam web pages. But aside from PageRank, there are the classic graph algorithms from half a century ago that we implement in the book: shortest path (such as in geospatial mapping), traveling salesman, and minimum spanning tree. Minimum spanning trees sound so academic, but in the book we show an interesting use of them to automatically, with a helping hand from the word2vec algorithm in MLlib, create a hierarchical taxonomy of concepts out of just text corpora. Then there are the machine learning algorithms in Spark that are actually implemented in GraphX: SVD++ (for recommender systems, similar to ALS), and Power Iteration Clustering (for which in the book we show an example of image segmentation for computer vision).

East: Whenever the connections between individual data are as important as the data items themselves you should think about a graph approach to data processing. Whilst it's not impossible to do this kind of processing with traditional data tools it quickly becomes tedious and you find yourself fighting to implement simple constructs. By contrast graph analytics systems like GraphX provide a very natural way to represent and interact with connected data.
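As a concrete picture of one of the classic algorithms Malak lists, here is a minimal sketch of a minimum spanning tree using Kruskal's algorithm with a small union-find. It is the standard sequential algorithm written in plain Python, not the distributed GraphX implementation from the book. The function name and the sample weights are assumptions for illustration.

```python
def minimum_spanning_tree(vertices, weighted_edges):
    """Kruskal's algorithm; weighted_edges holds (weight, u, v) tuples."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the component root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical weighted, undirected graph.
vertices = ["a", "b", "c", "d"]
weighted_edges = [(1, "a", "b"), (4, "a", "c"), (2, "b", "c"), (5, "c", "d"), (3, "b", "d")]
tree = minimum_spanning_tree(vertices, weighted_edges)
```

The three cheapest edges that do not form a cycle are kept, for a total weight of 6.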

InfoQ: Spark brings to the table a unified big data processing framework by providing libraries to process batch, streaming, and graph data. It also provides a machine learning library. Can you discuss some use cases of leveraging all these libraries together?

Malak: In the chapter on machine learning we describe ways to use GraphX with MLlib. These are all batch applications. Graphs in GraphX and GraphFrames are, like everything else in Spark, immutable. There is no way to incrementally add an edge or vertex. Spark Streaming, even though it too is based on immutable data, is feasible because its mini-batches of relational data are like tiny relational tables that are much more useful than tiny graphs. After publication of Spark GraphX in Action, Ankur Dave (creator of GraphX) did present at Spark Summit a research project called Tegra where GraphX was rewritten to accept incremental streaming updates (not related to Spark Streaming). But I don't believe the Tegra code has been made available yet.

East: One area would be online fraud detection. Because fraud attacks can evolve so quickly, you need predictive analytics that can utilise features that might have been generated in the last 5 minutes or even more recently. In addition, adding graph models to the feature mix gives the potential for more effective prediction algorithms.

InfoQ: You write about performance and monitoring in the book. Can you discuss some of the techniques to make GraphX programs efficient?

Malak: Careful control of caching and careful control of lineage. These are especially important because GraphX programs tend to be iterative. Other typical Spark techniques also apply, such as choosing the right serializer.

East: Absolutely first up is to have some detail on how Spark actually works and how to monitor those processes to see what is happening. Spark provides web-based GUIs that can show you what is happening in real-time and so understanding how to use these tools to the maximum is a must. There are a number of different tuning areas such as caching, serialisation and checkpointing but ideally these are applied once you have an understanding of how your application is performing.

InfoQ: Are there any missing features in the current implementation of Spark GraphX library?

Malak: Chapter 8 is devoted to the "missing algorithms": reading RDF files, merging graphs, filtering out isolated vertices, and computing the global clustering coefficient. Also, Ankur Dave (creator of GraphX) created a package called IndexedRDD to speed up GraphX, but it never made it into the Apache Spark distribution, so in some sense it is "missing" from GraphX. In chapter 8 we show how to incorporate IndexedRDD into a GraphX program to improve performance.

East: Support for incremental updates of graphs, possibly in conjunction with Spark Streaming, is an often requested feature. Right now GraphX data structures are immutable, so incremental updates mean creating a whole new graph, which could be a time-consuming process.
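Two of the "missing algorithms" Malak lists, filtering out isolated vertices and computing the global clustering coefficient, are small enough to sketch on a single machine. The code below is plain Python rather than the book's GraphX implementation. It takes the global clustering coefficient to be the number of closed two-edge paths divided by the number of all two-edge paths (equivalently, three times the triangle count over the number of connected triples), and all function names and data are hypothetical.

```python
from itertools import combinations


def to_adjacency(vertices, edges):
    """Undirected adjacency sets; every vertex gets an entry, even if isolated."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def drop_isolated(adjacency):
    """Keep only the vertices that have at least one neighbor."""
    return {v: nbrs for v, nbrs in adjacency.items() if nbrs}


def global_clustering_coefficient(adjacency):
    """Closed two-edge paths divided by all two-edge paths."""
    closed = total = 0
    for nbrs in adjacency.values():
        # Every pair of neighbors forms a two-edge path through this vertex.
        for a, b in combinations(sorted(nbrs), 2):
            total += 1
            if b in adjacency[a]:  # the pair is itself connected: a triangle
                closed += 1
    return closed / total if total else 0.0


# Hypothetical graph: one triangle (a, b, c), a tail (c, d), and isolated "e".
adjacency = to_adjacency(["a", "b", "c", "d", "e"],
                         [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
connected = drop_isolated(adjacency)
coefficient = global_clustering_coefficient(connected)
```

For this graph three of the five two-edge paths are closed, so the coefficient is 0.6.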

InfoQ: What's coming up in the graph data analytics space?

Malak: Graph systems that can handle both the OLAP-like and OLTP-like applications. From the spectrum diagram in my June 2016 Spark Summit presentation mentioned above, it's clear that graph NoSQL databases have the greatest lead in covering this entire spectrum. Which particular graph NoSQL database will be the winner, however, is still up for grabs: Neo4j, Turi/Dato/GraphLab, OrientDB, Titan, Oracle PGX. One of the big draws of GraphX is that there is zero additional installation and administration cost once you already have a Spark cluster in place, which in this day and age many organizations already do. So Spark integration will be key to any future dominant graph technology.

East: I think you should keep an eye on two areas. The first is tighter integration between graph databases and graph processing frameworks. This could be via seamless interoperability between a database like Neo4j and a processing system such as Spark. Alternatively, these capabilities may appear in a single offering.

The other area is graph algorithms becoming more tightly integrated with mainstream machine learning. Currently many libraries tend to focus on one or the other (even GraphX is only loosely integrated with Spark's machine learning library). In fact graphs can often be seen as an alternative representation of sparse data matrices.
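East's remark that a graph can be read as a sparse matrix can be made concrete: each weighted edge (src, dst, weight) is one nonzero cell at row src, column dst. The sketch below stores only those cells in a dictionary and multiplies the matrix by a vector. It is plain Python with hypothetical names and data, not an MLlib or GraphX API.

```python
def to_sparse_matrix(edges):
    """Coordinate-style sparse matrix: only cells that correspond to edges are stored."""
    return {(src, dst): weight for src, dst, weight in edges}


def multiply(matrix, vector):
    """Sparse matrix-vector product that touches only the stored cells."""
    result = {}
    for (row, col), value in matrix.items():
        result[row] = result.get(row, 0.0) + value * vector.get(col, 0.0)
    return result


# Hypothetical weighted edges between vertices 0, 1 and 2.
edges = [(0, 1, 2.0), (0, 2, 1.0), (2, 0, 3.0)]
matrix = to_sparse_matrix(edges)
product = multiply(matrix, {0: 1.0, 1: 1.0, 2: 1.0})
```

Multiplying by a vector of ones sums the outgoing edge weights of each vertex. Vertex 1 has no outgoing edges and so has no entry in the result.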

Robin also spoke about the graph data processing approach.

East: If you're used to traditional data processing using relational databases it can take you a little while to get your head around a graph-based approach to data modelling. If you persevere you'll soon start seeing graph structures everywhere.

About the Interviewees

Michael Malak is the lead author of Spark GraphX In Action and has been developing Spark solutions at two Fortune 200 companies since early 2013. He has been programming computers since before they could be bought pre-assembled in stores.

 

Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay.
