Facebook's Comparison of Apache Giraph and Spark GraphX for Graph Data Processing

A Facebook team has recently published a comparison of the performance of their existing Giraph-based graph processing system with the newer GraphX, which is part of the popular Spark framework. Their conclusion is that GraphX is currently neither scalable nor performant enough to support their graph processing workloads.

Large-scale graph processing is an important part of data infrastructure services at Facebook. Their social graph has 1.71 billion vertices with hundreds of billions of edges, and if they were to add in the pages people like, the graph would have over one trillion edges. They also have a wide range of use cases for graph data analytics including page and group recommendations, infrastructure optimization through intelligent data placement, and graph compression.

The team has built a graph analytics platform based on the Apache Giraph framework, which is described in a VLDB '15 paper and a corresponding blog post. Describing their motivation for looking at alternatives, the team writes:

Since its inception, Giraph has continued to evolve, allowing us to handle Facebook-scale production workloads but also making it easier for users to program. In the meantime, a number of other graph processing engines emerged. For instance, the Spark framework, which has gained adoption as a platform for general data processing, offers a graph-oriented programming model and execution engine as well — GraphX.

As our goal is to serve internal workloads in the best possible way, we decided to do a quantitative and qualitative comparison between Giraph and GraphX.

Since Facebook also has a Spark cluster supporting production workloads, the team decided to compare how the two graph processing systems handle large graphs. The tests also looked at how each system performed under different resource allocation policies and at the support each provides for aspects such as fault tolerance and user interface. Usability and ease of development were evaluated as well.

The testing methodology included three popular algorithms in graph data analytics: PageRank, Connected Components, and the more message-heavy Triangle Counting. To allow for comparison against the original GraphX paper, they started testing with the same two publicly available graph data sets: the Twitter graph, which has 1.5 billion edges, and the UK web graph, which has 3.7 billion edges. The tests also included synthetic graph data sets generated with the Darwini graph generator tool. The basic software configuration was Spark 1.6.1 and Giraph 1.2.0 with JDK 1.8 (8u60).
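
For readers unfamiliar with the three benchmark algorithms, the following is a minimal single-machine sketch in plain Python. It is not the Giraph or GraphX code used in the study. The function names, the toy edge list, and the parameter defaults (20 iterations, damping factor 0.85) are illustrative assumptions. PageRank and Connected Components are written as synchronous rounds to loosely mirror the superstep style of vertex-centric engines. In a distributed vertex-centric formulation, Triangle Counting typically has vertices exchange neighbor lists, which is one plausible reason it is the most message-heavy of the three.

```python
from collections import defaultdict

def pagerank(edges, num_iters=20, damping=0.85):
    """Round-based PageRank over a directed edge list (illustrative defaults)."""
    vertices = sorted({v for edge in edges for v in edge})
    out_neighbors = defaultdict(list)
    for src, dst in edges:
        out_neighbors[src].append(dst)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_iters):
        incoming = defaultdict(float)
        dangling = 0.0  # rank held by vertices with no out-edges
        for v in vertices:
            outs = out_neighbors.get(v, [])
            if outs:
                share = rank[v] / len(outs)
                for dst in outs:
                    incoming[dst] += share  # "message" sent along each out-edge
            else:
                dangling += rank[v]
        # Dangling rank is spread evenly so the ranks keep summing to 1.
        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return rank

def connected_components(edges):
    """Label propagation: every vertex adopts the smallest ID seen so far."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    label = {v: v for v in neighbors}
    while True:
        # Synchronous round: new labels are computed from the previous round only.
        new_label = {
            v: min([label[v]] + [label[u] for u in neighbors[v]])
            for v in neighbors
        }
        if new_label == label:
            return label
        label = new_label

def triangle_count(edges):
    """Count each undirected triangle a < b < c exactly once."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    count = 0
    for a in list(neighbors):
        for b in neighbors[a]:
            if a < b:
                common = neighbors[a] & neighbors[b]
                count += sum(1 for c in common if c > b)
    return count

# Hypothetical toy graph: one triangle (1, 2, 3), a tail vertex 4, and a separate pair (5, 6).
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]
ranks = pagerank(toy_edges)
components = connected_components(toy_edges)
triangles = triangle_count(toy_edges)
```

On the toy graph, `components` maps vertices 1 through 4 to one label and vertices 5 and 6 to another, and `triangles` counts the single triangle. At the scale discussed in the article, the same logic has to be partitioned across many machines, which is where the engines under test differ.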

In general, they found that Giraph was better able to handle production-scale workloads while Spark GraphX offered several features that made the development of graph data processing solutions easier.

The key findings from the comparison include the following:

Giraph performed better even on smaller graph data sets.
Giraph was also more memory-efficient.
GraphX allows graphs to be read from Hive using an SQL-like query, allowing arbitrary column transformations.
Using Scala from the shell environment is a convenient way for testing simple GraphX applications.

In the end, the team concluded that GraphX was not scalable or performant enough to support their graph processing workloads:

Based on a projection of the current results, we would need orders of magnitude more machines to support our existing production workloads. Aside from this, even for the graph sizes that GraphX can handle, there is a large gap in efficiency in terms of machine hours required, and its performance was not stable. However, the GraphX programming interface offers a number of features that simplify application development, such as SQL integration. These are features that we would like to add to Giraph in the future.

The team has provided details on how to reproduce the study, and the code and data are also available.