Graph Databases - Book Review and Interview
Graph Databases book authored by Ian Robinson, Jim Webber, and Emil Eifrém, covers the Graph based NoSQL database technology and different options available for storing “Connected Data” in real world applications. Authors discuss topics like data modeling with graph based domain models and predictive analysis using Graph Theory techniques.
InfoQ spoke with co-authors Ian Robinson and Jim Webber about the book, role of Graph Databases in the NoSQL database space, and what’s coming up in the Graph Databases.
InfoQ: Congratulations on the new book. What is the current status of the book?
Ian and Jim: Thanks Srini! The book was written over the course of the last year, with the free rough cut version made available about 8 months into the process. We’re now at the final editing phase with O’Reilly and the completed book will be out on June 10th. We’ll be giving away copies to all the attendees at the GraphConnect conferences this year, as well as at many of our Neo4j community events.
InfoQ: How does a Graph database compare with a relational database and with other NoSQL databases?
Ian and Jim: Alistair Jones coined the term “relational crossroads” to describe the differences. According to Alistair, we’re at a fork in the road. One path leads to the approach adopted by most NOSQL databases, whereby data is heavily denormalized, and we then rely on the application to join it together, typically at high latency, to gain insight. The other path leads to the approach adopted by graph databases, whereby we use the expressive power of the graph to build a flexible, connected model of the problem at hand, which we then query at low latency to gain insight.
The relational database sits in the middle. Much like a graph database, the relational database has a query-centric model. But this model isn’t as powerful as that of a graph database. In particular, it lacks the ability to create arbitrarily extensive, semantically rich and variably connected structures at runtime. To create any kind of extensive structure in a relational database, we have to plan our joins up front. To allow for variation, we end up creating lots of nullable columns. The result: sparse tables, fancy (expensive) joins, and object-relational impedance problems, even when used in otherwise straightforward applications.
InfoQ: What are the use cases where Graph database is a better fit?
Ian and Jim: Like relational databases, the majority of graph databases are general purpose OLTP databases, and can be applied in a wide range of solutions. That said, they really shine when the solution to a problem depends upon us first understanding how things are connected. This is more common than you might think. And in many cases, it's not just a matter of knowing that things are connected; often, we need to know something about the different relationships in our domain - their names, qualities, weights, and so on.
In short, connectedness is key. Graphs are the best abstraction we have for modeling and querying connectedness; graph databases, in turn, give application developers and data specialists the power to apply this abstraction to their own particular problems. To that end, we’ve seen them used for social networking, recommendations engines, authorization and access control, routing and logistics, product catalogues, datacenter management, careers management, fraud detection, policing, geospatial, and even emigration applications. The key theme that binds all of these solutions together is that they depend on being able to model and query connected data. Simply having keys and values is not enough; nor is having the data sort-of connected through semantically impoverished joins. We need both connectivity and contextual richness to make these solutions work.
InfoQ: Can you discuss the design considerations developers need to keep in mind when using Graph databases?
Ian and Jim: The most important design decisions concern the data model and queries for any particular application. The key here, as we describe in the book, is to drive out your model from the questions you need to ask of your data. By undercovering the questions at the heart of an application goal or end user need, you identify both the things you’re interested in and the ways in which these things are connected. It’s a short step then to turn these questions into an expressive graph model, and thereafter into the queries you execute against that model.
The resulting model and associated queries are really just projections of the questions you want to ask of your data. With Neo4j’s Cypher query language the complementary nature of these projections becomes obvious: the paths you use to create the graph structure are the same paths you use to query it.
As a good first test of your design, you should be able to read what you have written. Most importantly, you should be able to read your questions in what you have written, without having to make any out-of-band assumptions or inferences. A structure such as ‘(Emil)-[:WROTE]->(Graph Databases)<-[:PUBLISHED]-(O’Reilly)’ reads well, and clearly helps us answer the questions, “Which books has Emil written?”, “Who published Graph Databases?” and “For which publishers has Emil written?”.
InfoQ: What are some architecture and design patterns supported by Graph databases?
InfoQ: Do Graph databases, by their nature, have any limitations in terms of scalability?
Ian and Jim: When we talk about scaling we’re talking about three different things: scaling for large datasets, scaling for read performance, and scaling for write performance.
Regarding scaling for large datasets, there are really no inherent limitations. Neo4j currently has an arbitrary upper limit on the size of the graph (on the order of 10^10, which is large enough to support most graphs we see in the real world, including a Neo4j deployment that has more than half of Facebook's social graph in one Neo4j cluster), but this limit is going away later this year. This removes the majority of concerns people have with regard to scaling graphs for “big” data.
Scaling for reads similarly presents no problem. Today, in Neo4j, this is accomplished using master-slave replication. Read requests are load balanced across the cluster, where they execute against local data whose structure has been optimized for connected queries. Aside from transactional integrity, Neo4j has historically focused on read performance. To enable fast traversals we have a graph-native stack all the way down to disk. Some other graph stores offer a graph interface over a nonnative graph store, such as a column store. While this may help with other scaling characteristics, it tends to increase query latencies, because the joins are performed in library code, rather than at the level of the data model.
Scaling for writes can be accomplished by scaling vertically, but at some point, for very heavy write loads, it requires the ability to distribute the data across multiple machines. This is the real challenge. While distributing the data may help write performance, it can harm read performance. So far, nobody has implemented a graph database that optimizes and combines fast local traversals with slower (over the network) distributed traversals.
Scaling graph data by distributing it across multiple machines is much more difficult than scaling some of the simpler data models, but it is possible. Scaling a key-value, column-family or document store is relatively easy because in each case you’re dealing with discrete aggregates that can be placed in their entirety in one location or another. Scaling a graph is more difficult because by its very nature the data is connected. When distributing a graph, we want to avoid having relationships that span machines as much as possible; this is called the minimum point-cut problem. On top of the problem of balancing the graph so that there are as few machine-spanning relationships as possible, things are made even more difficult because the graph is always changing. What looks like a good distribution one moment may no longer be optimal a few seconds later. This is known to be an NP-hard problem in the general case.
While we can’t claim to have solved the NP-hard general graph partitioning problem, we’ve found some pretty interesting ways of side-stepping it. In fact we have an R&D team busily experimenting on those ideas right now, and at some point they’ll make it into future product versions. On top of that, we have some tricks up our sleeve to increase the write scaling limits with the current technology.
So to answer the question, we’d say that the connected nature of graph data imposes a theoretical challenge to scaling writes by distributing the graph. But there are practical solutions to the distributed graph problem, which we’re exploring.
InfoQ: Graph databases are about storing and managing connected data. Does this characteristic limit these databases in supporting capabilities like Sharding and Transactions?
Ian and Jim: As we mentioned in the previous answer, sharding (or graph-partitioning) is key to making graphs scale, particularly for writes. But it’s difficult - at least in the general case. Add the fact that we employ transactions to protect the integrity of your data, and you can see that we have an interesting scalability design challenge on our hands.
That said, when dealing with many concurrent transactions, the nature of the graph data structure helps spread the transactional overhead across the graph. As the graph grows, transactional contention typically falls away. In other words, the larger the graph, the more throughput, which is quite a nice outcome!
InfoQ: How do Graph databases support data replication and failover features?
Ian and Jim: Today Neo4j supports replication and failover with a master/slave cluster arrangement. In a Neo4j cluster one machine is designated the master and the others slaves. Writes directed towards any machine will ultimately pass through the master, which in turn ships updates to the slaves when polled.
If the master fails, the cluster automatically elects a new master via our Paxos implementation. Again, our R&D team have done some interesting work on making Neo4j into a masterless database cluster, which will make its way into future versions.
InfoQ: Can you talk about some limitations of Graph databases?
Ian and Jim: Compared to something like Apache Cassandra, today’s graph databases don’t have the same level of write throughput - a consequence of master/slave clustering and proper ACID transactions. Using a graph database for a shopping cart or voting doesn’t make that much sense when there are more appropriate databases to use. But while you might use a column or KV store for your shopping cart, actually analysing customers’ purchasing history is far better done in the graph. This is a pattern we’ve seen a few times now: use a simple database to absorb load, and then feed the data into a graph database for refinement and analysis as the system of record.
InfoQ: What are the emerging trends in Graph database space? What do you see as the future of Graph databases?
Ian and Jim: At a high level, we think graph databases are becoming mainstream. Five years ago graphs were an academic and niche concern, but now they’re everywhere. Soon, graph databases will be considered ‘just’ another regular tool at our disposal. Based on how we’re seeing them used today, we believe they’ll continue to displace relational databases. As data becomes more intricately connected and as the field matures, their popularity will soar.
They also said that this is an exciting time for graphs and graph databases. They're seeing many more graph tools - both specialized and general purpose - coming onto the market. Neo4j is making strides in improvements to its data model, its query languages, support for very large datasets, and overall performance. The literature is growing, and there are now tutorials and use cases online. There's a healthy, vital user community, with meetups in cities worldwide. And, there's the GraphConnect conferences - this year in San Francisco, Chicago, Boston, New York and London - each of which hosts a mix of technical, theoretical and practical sessions, with experts from across the industry and community.
About the Interviewees
Ian Robinson is an engineer at Neo Technology, currently working on research and development for future versions of the Neo4j graph database. He is a co-author of “REST in Practice” (O'Reilly) and a contributor to “REST: From Research to Practice” (Springer) and “Service Design Patterns” (Addison-Wesley).
Dr. Jim Webber is Chief Scientist with Neo Technology, the company behind the popular open source graph database Neo4j, where he researches and develops distributed graph databases and writes open source software. His is a co-author of the book REST in Practice. Jim's blog is located at http://jimwebber.org and he tweets often @jimwebber.