Graph Processing Using Big Data Technologies
Processing extremely large graphs has been and remains a challenge, but recent advances in Big Data technologies have made this task more practical. Tapad, a startup based in NYC focused on cross-device content delivery, has made graph processing the heart of their business model using Big Data to scale to terabytes of data.
Social networks like Facebook or Twitter contain data that naturally lends itself to a graph representation. But graphs can be used to represent less obvious data, as in the case of Tapad’s device graph. Dag Liodden, Tapad’s co-founder and CTO, describes why using a graph representation for devices makes sense:
Tapad takes a graph-oriented approach to modeling relationships between devices. Anonymous identifiers (such as cookie IDs) are represented as nodes in our Device Graph and we track marketing information to these nodes. Edges between the nodes are scored / weighted using a combination of deterministic data and probabilistic statistical models / machine learning techniques. The concept of a "device" is defined as a starting device / node (let's say the cookie ID of a browser) and the collections of nodes (let's say the cookie IDs of a Tablet and a Connected TV) that are reachable from that starting point given a customizable set of edge constraints. Having an actual graph structure, as opposed to just aggregated information into a single node, gives us the flexibility to balance accuracy and scale dynamically as well as more easily augment the graph with new edge inference models.
Using the right tool for the right job is important, and the same goes for graph processing: there is no need to use Big Data technologies for graphs that can be handled by more traditional workloads, like Dag says:
"Big Data" to me is the threshold where you no longer can use a small set of general purpose, off-the-shelf tools to store and analyze your data, but instead have to tailor different technologies to address specific use cases. These thresholds keep moving every year as software and hardware solutions evolve and mature, but so does the size of the data sets we deal with and the level of sophistication of the analysis we need to perform.
For Facebook, this threshold is in the single digit petabytes, as detailed during their submission to the 2013 ACM SIGMOD conference in NYC. For Tapad, the amount of data in the graph is smaller but would still be impossible to process using traditional methods:
The US graph currently has about 1.1 billion nodes, representing mobile phones, tablets, laptops, gaming consoles and TVs. Some of these nodes are transient; for instance, due to a browser with non-persistent cookies, and thus have little data and no edges. The non-transient nodes have about five edges on average and around 500 discrete pieces of information, such as behavioral segments, associated with them. The live graph data weighs in at multiple TB and we read / write from / to it several hundred thousand times per second across multiple data centers. Updates to the graph are geographically cross-replicated and each data center is currently serving off of servers backed by 20 TB of Flash SSD storage and 2 TB of RAM.
The recent years have seen a surge in the number of technologies used to process graphs at scale, especially 2013 which saw several new additions to the ecosystem. There are two classes of systems to consider:
- Graph databases for OLTP workloads for quick low-latency access to small portions of graph data.
- Graph processing engines for OLAP workloads allowing batch processing of large portions of a graph.
The list of graph databases is already very long, but several projects have emerged and differentiated themselves recently. Neo4j is one of the oldest and most mature graph databases, but still suffers from scalability issues since it doesn’t support sharding yet. Another database that, albeit pretty young, has been gaining a lot of popularity in 2013 is Titan. As a backend-agnostic graph database, it can leverage both HBase and Cassandra’s scalable architecture and uses an optimized vertex and edge representation internally to allow it to scale to billions of edges as reported in a blog post in 2013.
But one does not need to use graph-specific databases, and more generic scalable NoSQL databases can also be an effective solution to the problem. Apache Accumulo, a technology based on Google’s BigTable and open-sourced in 2011, is an example of a generic database that can also be a good fit to store graphs at scale because records are flexible and can be used to store graphs with typed edges and weights, and is actually being used by the NSA according to a technical report published in 2013. Cassandra or Aerospike are other examples of databases that, with a proper data model, can effectively model a graph with edges, vertexes and weights. Facebook also built their own solution using MySQL and Memcache in a system called Tao, which is being used to serve the social graph to its users. And according to Dag, Tapad used the same philosophy in the design of their device graph:
The live graph lives in a key-value store to allow for fast traversals and updates. We regularly snapshot the graph into HDFS where it can be retrieved for more advanced graph processing and augmented with other data streams. The results are later fed back into the "live graph". There are advantages to using a graph specific database, but our current setup with extremely fast, simple traversals of the graph in our key-value store and slow, but very flexible traversal and analysis on Hadoop is serving us well, at least for now.
Even with a graph stored in a database, the operations that can be performed at scale will likely be limited to lookups and small traversals. For more complex analysis on a larger portion of a graph, there is a need for batch processing distributed frameworks. For the best performance, the GraphLab framework uses the Message Passing Interface (MPI) model to scale and run complex algorithms using data in HDFS. More recent frameworks like Apache Giraph and Apache Hama are based on the Bulk Synchronous Parallel (BSP) paradigm popularized by Google’s Pregel project. And the latest additions to the ecosystem are the GraphX project running on top of Spark which was unveiled in 2013, and Faunus, which is using Hadoop to run MapReduce jobs to process graphs in a Titan database. Tapad is using these new technologies to process their offline graph data. According to Dag:
Currently, our main graph processing framework is Apache Giraph, but we are experimenting with Spark GraphX and Graphlab as well. All of these frameworks are still pretty young, the learning curve is pretty steep and all come with their own pros, cons and caveats. For instance, Giraph and GraphX are convenient as they fit nicely into our Hadoop infrastructure, but Graphlab is very appealing due to the sheer performance of it.
Some projects are attempting to provide a unified framework to answer both OLTP and OLAP queries. Dendrite from Lab41 is such a project that leverages GraphLab on top of Titan for storage and processing, and AngularJS for visualization. This is still a very young project unveiled in early 2014 so the community reaction is limited, but the fact that it attempts to cover every use case should help drive adoption.
It's a hard choice...
Many thanks for sharing your graph database knowledge.
For a new project I'm working on we started to use Neo4j as it is the most easy to use db with a fast learning curve.
We use Cypher to retrieve paths from the model but the drawback is that Cypher is neo4j centric : I can't see the same capabilities for path retrieval with Gremlin. Am I right ?
I'm looking for Titan and Giraph too as we use a Hadoop cluster and ElasticSearch. But the choice is not very straightforward.
Thanks again for this post.
Re: It's a hard choice...
Actually Gremlin supports path retrieval, you can find some information here: github.com/tinkerpop/gremlin/wiki/Path-Pattern
If you are unsure whether Gremlin supports a particular construct, I would advise taking a look at the Gremlin docs which is very well structured and one of the references: gremlindocs.com/
Hope that helps !
Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin May 21, 2015