Graph Databases, NOSQL and Neo4j

Introduction

Of the many different data models, the relational model has dominated since the 80s, with implementations like Oracle, MySQL and MSSQL, also known as Relational Database Management Systems (RDBMS). Lately, however, in an increasing number of cases the use of relational databases leads to problems, both because of deficits in the modeling of data and because of constraints on horizontal scalability across several servers and big amounts of data. Two trends are bringing these problems to the attention of the international software community:

  1. The exponential growth of the volume of data generated by users, systems and sensors, further accelerated by the concentration of a large part of this volume on big distributed systems like Amazon, Google and other cloud services.
  2. The increasing interdependency and complexity of data, accelerated by the Internet, Web 2.0, social networks and open and standardized access to data sources from a large number of different systems.

Relational databases have increasing difficulty coping with these trends. This has led to a number of different technologies targeting special aspects of these problems, which can be used together with or as alternatives to existing RDBMS - an approach also known as Polyglot Persistence. Alternative databases are nothing new; they have been around for a long time in the form of e.g. Object Databases (OODBMS), Hierarchical Databases (e.g. LDAP) and many more. But during the last few years a large number of new projects have been started which together are known under the name NOSQL databases.

This article aims to give an overview of the position of Graph Databases in the NOSQL-movement. The second part is an introduction to Neo4j, a Java-based Graph Database.

The NOSQL-Environment

NOSQL (Not Only SQL) really is a very wide category for a group of persistence solutions which don't follow the relational data model and don't use SQL as the query language.

In short, NOSQL databases can be categorized according to their data model into the following four categories:

  1. Key-Value-stores
  2. BigTable-implementations
  3. Document-stores
  4. Graph Databases

For key/value systems like Voldemort or Tokyo Cabinet, the smallest modeling unit is the key-value pair. For BigTable clones it is a tuple with a variable number of attributes, and for document databases like CouchDB and MongoDB it is the document. Graph databases model the whole dataset as one big, dense network structure.
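To make the different modeling units concrete, here is a small plain-Java sketch (illustrative only, no actual database involved) expressing the same fact - "Neo knows Morpheus" - in three of the styles:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ModelingUnits {
    public static void main(String[] args) {
        // Key-value store: one opaque value per key; any structure is the application's job
        Map<String, String> keyValue = new HashMap<String, String>();
        keyValue.put("person:neo:knows", "person:morpheus");

        // Document store: a self-contained, possibly nested document
        Map<String, Object> document = new HashMap<String, Object>();
        document.put("name", "Neo");
        document.put("knows", Arrays.asList("Morpheus"));

        // Graph: the typed, directed relationship between two nodes is itself a record
        String[] relationship = { "Neo", "KNOWS", "Morpheus" };

        System.out.println(keyValue + " " + document + " " + Arrays.toString(relationship));
    }
}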

Here, two interesting aspects of NOSQL databases will be examined more deeply - scalability and complexity.

1. Scalability

CAP: ACID vs. BASE

In order to guarantee the integrity of data, most of the classical database systems are based on transactions. These ensure consistency of data in all situations of data management. These transactional characteristics are also known as ACID (Atomicity, Consistency, Isolation, Durability). However, scaling out ACID-compliant systems has proven to be a problem. Conflicts arise between the different aspects of high availability in distributed systems that are not fully solvable - known as the CAP theorem:

  • Strong Consistency: all clients see the same version of the data, even on updates to the dataset - e.g. by means of the two-phase commit protocol (XA transactions) and ACID.
  • High Availability: all clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down.
  • Partition Tolerance: the total system keeps its characteristics even when being deployed on different servers, transparently to the client.

The CAP theorem postulates that only two of the three different aspects of scaling out can be achieved fully at the same time.

In order to be able to work with big distributed systems, a closer look at the different CAP characteristics is needed.

Many of the NOSQL databases above all have loosened up the requirements on consistency in order to achieve better availability and partition tolerance. This has resulted in systems known as BASE (Basically Available, Soft-state, Eventually consistent). These have no transactions in the classical sense and introduce constraints on the data model to enable better partition schemes (like the Dynamo system etc.). A more comprehensive discussion of CAP, ACID and BASE is available in this introduction.

2. Complexity

Protein Homology Network, Courtesy of Alex Adai: Institute for Cellular and Molecular Biology - University of Texas

The increasing interconnectivity of data as well as systems has led to denser data sets that cannot be scaled and auto-sharded in any obvious, simple or domain-independent way, as noted by Todd Hoff. More on visualizations of big and complex data sets can be found at Visual Complexity.

The relational model

Before dismissing the relational data model as outdated, we should not forget that one of the reasons for the success of relational database systems is the ability of the relational data model, as formulated by E.F. Codd, to model in principle any data structure without redundancy or information loss through normalization. After the modeling phase, the data can be inserted, modified and queried in very powerful ways via SQL. There are even RDBMS that implement schemas optimized for e.g. insertion speed or multidimensional queries (e.g. star schemas) for different use cases such as OLTP, OLAP, web applications or reporting.

This is the theory. In practice, however, RDBMS hit the constraints of the aforementioned CAP problems, and have implementation-caused performance problems with "deep" SQL queries that span many table joins. Other problems include scalability, schema evolution over time, and the modeling of tree structures, semi-structured data, hierarchies and networks, among many more.

Also, the relational model is poorly aligned with current approaches to software development like object orientation and dynamic languages, a gap known as the object-relational impedance mismatch. Here, ORM layers like Hibernate for Java have been developed and applied with mixed results. They certainly ease the task of mapping the object model to the relational data model, but do not deliver optimal query performance. Semi-structured data in particular is often modeled as big tables with many columns that are empty for most rows (sparse tables), which leads to poor performance. Even the alternative, modeling these structures in lots of joined tables, has problems, since joins are very performance-expensive set operations in RDBMS.
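As a rough illustration of the sparse-table problem, compare a relational-style row - which must carry every possible column - with a property map that stores only the attributes an entity actually has (a plain-Java sketch with hypothetical product attributes):

import java.util.HashMap;
import java.util.Map;

public class SparseVersusProperties {
    // Relational-style row: every column exists for every row,
    // so most fields stay null for most entities
    static class ProductRow {
        String name;
        String screenSize; // only meaningful for displays
        String engineType; // only meaningful for vehicles
        String isbn;       // only meaningful for books
    }

    public static void main(String[] args) {
        // Property-style entity: only the attributes that exist are stored
        Map<String, Object> book = new HashMap<String, Object>();
        book.put("name", "Graph Theory");
        book.put("isbn", "1-23456-789-0"); // illustrative value
        System.out.println(book);
    }
}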

The Graph as an alternative to relational normalization

Looking at the projection of domain models onto a data structure, there are two dominating schools: the relational way as used in RDBMS, and graph and network structures, used e.g. for the Semantic Web.

While graph structures are in theory normalizable even in RDBMS, this has serious query performance implications for recursive structures such as file trees and for network structures such as social graphs, due to the implementation characteristics of relational databases. Every operation over a relationship of a network results in a "join" operation in the RDBMS, implemented as a set operation between the sets of primary keys of two tables - a slow operation that does not scale with growing numbers of tuples in these tables.
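The difference can be sketched in a few lines of plain Java (a toy comparison, not a benchmark): finding a node's neighbors via a join-style scan over a relationship table costs more as the table grows, while a direct adjacency lookup depends only on the node's own degree:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JoinVersusTraversal {
    public static void main(String[] args) {
        // "friends" join table as rows of (personId, friendId)
        List<int[]> friendsTable = Arrays.asList(
                new int[] { 1, 2 }, new int[] { 2, 3 }, new int[] { 2, 4 });

        // Join-style: match over the whole table for personId == 1;
        // cost grows with the total number of tuples
        Set<Integer> viaJoin = new HashSet<Integer>();
        for (int[] row : friendsTable) {
            if (row[0] == 1) viaJoin.add(row[1]);
        }

        // Graph-style: each node holds direct references to its relationships;
        // cost depends only on the node's own degree, not on graph size
        Map<Integer, List<Integer>> adjacency = new HashMap<Integer, List<Integer>>();
        adjacency.put(1, Arrays.asList(2));
        adjacency.put(2, Arrays.asList(3, 4));
        List<Integer> viaTraversal = adjacency.get(1);

        System.out.println(viaJoin + " vs " + viaTraversal);
    }
}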

Basic terminology for labeled property graphs

There is no general consensus on terminology in the area of graphs, and many different graph models exist. However, there is an effort to create the Property Graph Model, unifying most of the different graph implementations. According to it, information in a property graph is modeled using three basic building blocks:

  • node (a.k.a. vertex)
  • relationship (a.k.a. edge) - with a direction and a type (labeled and directed)
  • property (a.k.a. attribute) on nodes and relationships

More specifically, the model is a labeled, directed, attributed multigraph. A labeled graph has a label for each edge, which is used as the type of that edge. A directed graph allows edges with a fixed direction, from the tail or source node to the head or destination node. An attributed graph allows a variable list of attributes for each node and edge, where an attribute is a value associated with a name, simplifying the graph structure. A multigraph allows multiple edges between two nodes, which means that two nodes can be connected several times by different edges, even if two edges have the same tail, head and label.
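A minimal plain-Java sketch of these three building blocks might look as follows (the class names are illustrative, not Neo4j's API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PropertyGraphSketch {
    static class Node {
        final Map<String, Object> properties = new HashMap<String, Object>();
        final List<Relationship> relationships = new ArrayList<Relationship>();
    }

    static class Relationship {
        final Node tail;     // directed: from the tail (source) node...
        final Node head;     // ...to the head (destination) node
        final String label;  // the relationship type
        final Map<String, Object> properties = new HashMap<String, Object>();

        Relationship(Node tail, String label, Node head) {
            this.tail = tail;
            this.label = label;
            this.head = head;
            tail.relationships.add(this);
            head.relationships.add(this);
        }
    }

    public static void main(String[] args) {
        Node neo = new Node();
        neo.properties.put("name", "Neo");
        Node morpheus = new Node();
        morpheus.properties.put("name", "Morpheus");
        new Relationship(neo, "KNOWS", morpheus);
        // multigraph: nothing prevents a second KNOWS edge between the same two nodes
        new Relationship(neo, "KNOWS", morpheus);
    }
}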

The figure shows a small labeled property graph.

A small graph of the people surrounding TinkerPop

Graph theory has seen great usefulness and relevance for many problems across various domains. The most frequently applied graph-theoretic algorithms include various types of shortest path calculations, geodesic paths, centrality measures like PageRank, eigenvector centrality, closeness, betweenness, HITS and many others. However, the application of these algorithms has in many cases been confined to research, since in practice no production-ready, high-performance graph database implementations were available. Fortunately, in recent years this has changed. Several projects have been developed with 24/7 production scenarios in mind.
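For reference, the simplest of the algorithms named above - unweighted shortest path - can be sketched in a few lines of plain Java as a breadth-first search over an adjacency list (illustrative, not tied to any graph database):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class ShortestPath {
    static List<Integer> shortestPath(Map<Integer, List<Integer>> graph, int from, int to) {
        Map<Integer, Integer> cameFrom = new HashMap<Integer, Integer>();
        Deque<Integer> queue = new ArrayDeque<Integer>();
        queue.add(from);
        cameFrom.put(from, from);
        while (!queue.isEmpty()) {
            int current = queue.poll();
            if (current == to) { // walk the predecessor chain back to the start
                LinkedList<Integer> path = new LinkedList<Integer>();
                for (int n = to; n != from; n = cameFrom.get(n)) path.addFirst(n);
                path.addFirst(from);
                return path;
            }
            for (int neighbor : graph.getOrDefault(current, Collections.<Integer>emptyList())) {
                if (!cameFrom.containsKey(neighbor)) {
                    cameFrom.put(neighbor, current);
                    queue.add(neighbor);
                }
            }
        }
        return Collections.emptyList(); // no path found
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = new HashMap<Integer, List<Integer>>();
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(1, 3));
        graph.put(3, Arrays.asList(2));
        System.out.println(shortestPath(graph, 1, 3)); // prints [1, 2, 3]
    }
}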

The following diagram shows a positioning of the main NOSQL categories in the context of Complexity and Scalability aspects.

For more on "Scaling to size vs. Scaling to Complexity" see Emil Eifrem's blog post.

Neo4j - Java-based Graph Database

Neo4j is a fully ACID-compliant, transactional graph database written in Java. The data is stored on disk in a data structure optimized for graph networks. The Neo4j kernel is an extremely fast graph engine with all the characteristics expected of a production database, like recovery, two-phase commit transactions, XA compliance and so on. Neo4j has been in 24/7 production use since 2003. The project has just been released as version 1.0 - a major milestone regarding stability and community testing. High availability via online backup and master-slave replication is in beta and next on the release roadmap. Neo4j can be used both as an embedded database with zero administration overhead and as a standalone server with an extensive REST interface for easy integration into PHP, .NET and JavaScript based environments. This article, however, will concentrate on the direct use of Neo4j.

The developer works directly against the graph model via a Java API that exposes the very flexible data structure. There are good community-contributed bindings to other languages like JRuby/Ruby, Scala, Python, Clojure and others.

Even "traditional" RDBMS applications tend to contain a number of challenging data sets that are best processed with graphs: folder structures, product configurations, product assemblies and categories, media metadata, semantic trading and fraud detection in the financial sector, and others.

Neo4j has a number of optional components surrounding the kernel, among them support for structuring the graph via a meta-model, a SAIL- and SPARQL-compliant RDF triple store implementation, and implementations of a number of common graph algorithms.

In case you want to run Neo4j as a separate server, there is a REST wrapper available. This is suitable in architectures built on the LAMP stack. REST also eases the scaling of bigger read loads via memcached, ETag and Apache-based caching and web layers.

High Performance?

It's hard to give definite performance benchmark numbers, since they depend heavily on the underlying hardware, the dataset used and other factors. Size-wise, Neo4j handles graphs of several billion nodes, relationships and properties out of the box. It is normal to reach a read performance of 2,000 relationship traversals per millisecond (about 1-2 million traversal steps per second), fully transactional, with warm caches, per thread. With shortest-path calculations, Neo4j is 1,000 times faster than MySQL even on small graphs of a couple of thousand nodes, the difference increasing as the size of the graph grows.

The reason for this is that in Neo4j, graph traversals are executed at constant speed, independent of the total size of the graph. There are no set operations involved that decrease performance the way join operations do in an RDBMS. Neo4j traverses the graph in a lazy fashion: nodes and relationships are only traversed and returned when the result iterator asks for them, which improves performance for big and deep traversals.
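The lazy style itself is plain Java iteration: an Iterator does its work only when the consumer asks for the next element. The following sketch is illustrative (not Neo4j code) but shows the principle the traversers apply to graph steps:

import java.util.Iterator;
import java.util.NoSuchElementException;

public class LazySequence implements Iterable<Integer> {
    private final int max;

    public LazySequence(int max) { this.max = max; }

    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = 0;

            public boolean hasNext() { return next < max; }

            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                // the "work" of producing an element happens here, one step at a time
                return next++;
            }

            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    public static void main(String[] args) {
        for (int n : new LazySequence(3)) System.out.println(n); // 0, 1, 2 - produced on demand
    }
}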

Write speed depends heavily on the seek time of the file system and the hardware. The ext3 file system and SSD disks are a good combination and result in transactional write speeds of about 100,000 operations per second.
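Since every commit implies a disk sync, practical write throughput also depends on how many operations are grouped into one transaction. A minimal sketch, assuming the transaction API shown later in this article:

// group many write operations into a single transaction to amortize the commit cost
Transaction tx = graphdb.beginTx();
try {
    for (int i = 0; i < 10000; i++) {
        Node node = graphdb.createNode();
        node.setProperty("number", i);
    }
    tx.success(); // one commit - and one disk sync - for all 10,000 operations
} finally {
    tx.finish();
}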

Example - the MATRIX

The Graph

As mentioned before, social networks represent just a tiny fraction of the applications of graph databases, but they are easy to understand for this example. To demonstrate the basic functionality of Neo4j, below is a small graph from the Matrix movie, visualized with Neoclipse, the Eclipse-RCP-based tool for Neo4j:

The graph is connected to a known reference node (id=0) for convenience, in order to find a way into the network from a known starting point. This is not necessary, but it has proven very useful in practice.

The Java implementation looks something like this:

Create a new graph database in the folder "target/neo":

EmbeddedGraphDatabase graphdb = new EmbeddedGraphDatabase("target/neo");

Relationship types can be created on-the-fly:

RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");

or via typesafe Java Enum:

enum Relationships implements RelationshipType { KNOWS, LOVES, HAS_CODED, MATRIX }

Now, create two nodes and attach a "name" property to each of them. Then, connect these nodes with a KNOWS relationship:

Node neo = graphdb.createNode();
neo.setProperty("name", "Neo");
Node morpheus = graphdb.createNode();
morpheus.setProperty("name", "Morpheus");
neo.createRelationshipTo(morpheus, KNOWS);

Any operation modifying the graph or needing isolation levels for data is wrapped in a transaction, so rollback and recovery work out of the box:

Transaction tx = graphdb.beginTx();
try {
	Node neo = graphdb.createNode();
	...
	tx.success(); // mark the transaction as successful
} catch (Exception e) {
	tx.failure(); // mark the transaction for rollback
} finally {
	tx.finish(); // commits or rolls back, depending on the marks above
}

The full code to create the Matrix graph then looks something like this:

graphdb = new EmbeddedGraphDatabase("target/neo4j");
index = new LuceneIndexService(graphdb);
Transaction tx = graphdb.beginTx();
try {
	Node root = graphdb.getReferenceNode();
	// we connect Neo with the root node, to gain an entry point into the graph;
	// not necessary, but practical.
	neo = createAndConnectNode("Neo", root, MATRIX);
	Node morpheus = createAndConnectNode("Morpheus", neo, KNOWS);
	Node cypher = createAndConnectNode("Cypher", morpheus, KNOWS);
	Node trinity = createAndConnectNode("Trinity", morpheus, KNOWS);
	Node agentSmith = createAndConnectNode("Agent Smith", cypher, KNOWS);
	architect = createAndConnectNode("The Architect", agentSmith, HAS_CODED);
	// Trinity loves Neo. But he doesn't know.
	trinity.createRelationshipTo(neo, LOVES);
	tx.success();
} catch (Exception e) {
	tx.failure();
} finally {
	tx.finish();
}

It uses the following helper method to create and index each node and connect it to the given node:

private Node createAndConnectNode(String name, Node otherNode,
        RelationshipType relationshipType) {
    Node node = graphdb.createNode();
    node.setProperty("name", name);
    node.createRelationshipTo(otherNode, relationshipType);
    index.index(node, "name", name);
    return node;
}


Who are Neo's friends?

Neo4j's API has a number of Java-Collections-oriented methods to answer simple queries. Here, a look at the relationships of the "Neo" node is enough to find his friends:

for (Relationship rel : neo.getRelationships(KNOWS)) {
    Node friend = rel.getOtherNode(neo);
    System.out.println(friend.getProperty("name"));
}
returns "Morpheus" as the only friend.

The real power of Neo4j, however, comes from the use of the Traverser API, which enables much more complex traversal descriptions and filters. It consists of a Traverser that evaluates a StopEvaluator for the stop condition and a ReturnableEvaluator for the inclusion of a node in the result. Also, the types and directions of the relationships to traverse can be specified. The traverser implements the Java Iterator interface, loading nodes and stepping through the graph lazily, only when they are requested, e.g. in a for loop. A couple of common evaluators and defaults are built in:

Traverser friends = neo.traverse(Order.BREADTH_FIRST,
    StopEvaluator.DEPTH_ONE,
    ReturnableEvaluator.ALL_BUT_START_NODE, KNOWS, Direction.BOTH);
for (Node friend : friends) {
    System.out.println(friend.getProperty("name"));
}

We first want to visit all nodes at the same depth from the start node before continuing to nodes at more distant levels (Order.BREADTH_FIRST), stop after a depth of one (StopEvaluator.DEPTH_ONE), and return all nodes except the start node "Neo" (ReturnableEvaluator.ALL_BUT_START_NODE). We traverse only relationships of type KNOWS, in both directions (Direction.BOTH). This traverser again returns Morpheus as Neo's only direct friend.

Friends of a friend?

In order to investigate who the friends of Neo's friends are, the KNOWS relationships need to be followed to a depth of 2 steps, starting at Neo and returning Trinity and Cypher. Programmatically, this can be achieved by adjusting the StopEvaluator of our Traverser to limit the traversal depth to 2:

StopEvaluator twoSteps = new StopEvaluator() {
    @Override
    public boolean isStopNode(TraversalPosition position) {
        return position.depth() == 2;
    }
};

Also, a custom ReturnableEvaluator returns only nodes found at depth 2:

ReturnableEvaluator nodesAtDepthTwo = new ReturnableEvaluator() {
    @Override
    public boolean isReturnableNode(TraversalPosition position) {
        return position.depth() == 2;
    }
};

The friend-of-a-friend traverser now becomes:

Traverser friendsOfFriends = neo.traverse(Order.BREADTH_FIRST,
    twoSteps, nodesAtDepthTwo, KNOWS, Direction.BOTH);
for (Node friend : friendsOfFriends) {
    System.out.println(friend.getProperty("name"));
}

This returns Cypher and Trinity as the result.

Who is in love?

Another interesting question could be whether there is anyone in this graph who is in love, starting from e.g. The Architect.

This time the whole graph should be examined along any relationship, starting at The Architect (assuming his node ID is known, but more on that later), and the nodes returned should be those that have an outgoing LOVES relationship. A custom ReturnableEvaluator will do this:

ReturnableEvaluator findLove = new ReturnableEvaluator() {
    @Override
    public boolean isReturnableNode(TraversalPosition position) {
        return position.currentNode().hasRelationship(LOVES, Direction.OUTGOING);
    }
};

All relationship types in the whole graph are needed in order to traverse along all relationships:

List<Object> types = new ArrayList<Object>();
// we have to consider all relationship types of the whole graph
// (in both directions)
for(RelationshipType type : graphdb.getRelationshipTypes()) {
    types.add(type);
    types.add(Direction.BOTH);
}
//let's go!
Traverser inLove = architect.traverse(Order.BREADTH_FIRST,
    StopEvaluator.END_OF_GRAPH, findLove, types.toArray());
for (Node lover : inLove) {
    System.out.println(lover.getProperty("name"));
}

This returns Trinity as the only node, since we only return nodes with outgoing LOVES relationships.

Indexing the Graph

While traversal operations along relationships are one of the sweet spots of Neo4j, set-oriented functionality over the whole graph is often needed. Full-text search over the properties of all nodes is a typical example. Here, Neo4j uses external index systems in order not to reinvent the wheel. For the common case of text-based searches, Neo4j has tight integrations with Lucene and Solr, adding the capability to index arbitrary node properties with transactional semantics in Lucene/Solr.

In the Matrix example, e.g. the "name" property could be indexed with:

GraphDatabaseService graphDb = // a GraphDatabaseService instance

IndexService index = new LuceneIndexService( graphDb );

//create a new node and index the "name" property
Node neo = graphDb.createNode();
neo.setProperty( "name", "Neo" );
index.index( neo, "name", neo.getProperty( "name" ) );

//search for the first node with "name=Neo"
Node node = index.getSingleNode( "name", "Neo" );

Lucene is one example of an index external to the graph. However, with a fast graph engine there are a vast number of strategies for building index structures in the graph itself, shortening traversal patterns for special data sets and domains. For instance, there are timelines and B-trees for one-dimensional data, R-trees and quadtrees for indexing two-dimensional data (very common in the spatial and GIS communities), and many others. Another often-useful pattern is to connect important subgraphs directly to the root node, creating shortcuts to important start nodes.
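A sketch of that last pattern, using the same API as the Matrix example (the CHARACTERS and CHARACTER relationship types are illustrative, not part of Neo4j): a "subreference" node hangs off the reference node and links directly to the important start nodes, so they are reachable in one traversal step:

// inside a transaction, as shown above
RelationshipType CHARACTERS = DynamicRelationshipType.withName("CHARACTERS"); // illustrative
RelationshipType CHARACTER = DynamicRelationshipType.withName("CHARACTER");   // illustrative

Node root = graphdb.getReferenceNode();
Node charactersSubref = graphdb.createNode();
root.createRelationshipTo(charactersSubref, CHARACTERS);

// one traversal step from the root now reaches every character
charactersSubref.createRelationshipTo(neo, CHARACTER);
charactersSubref.createRelationshipTo(morpheus, CHARACTER);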

Java sux. Anything shorter, please?

Neo4j has a number of good language bindings that ease working with the graph structure even more. This example uses the excellent Neo4j JRuby bindings, which shrink the total amount of code considerably.

After installing the neo4j gem with

>gem install neo4j

the JRuby code for the whole Matrix graph and the aforementioned queries looks like this:

require "rubygems"

require "neo4j"

class Person
  include Neo4j::NodeMixin

  #the properties on the nodes
  property :name


  #the relationship types
  has_n :friends

  # Lucene index for some properties
  index :name
end


#the players

neo       = Person.new :name => 'Neo'
morpheus  = Person.new :name => 'Morpheus'
trinity   = Person.new :name => 'Trinity'
cypher    = Person.new :name => 'Cypher'
smith     = Person.new :name => 'Agent Smith'
architect = Person.new :name => 'Architect'

#the connections

cypher.friends << morpheus
cypher.friends << smith
neo.friends << morpheus
morpheus.friends << trinity
trinity.rels.outgoing(:loves) << neo

architect.rels.outgoing(:has_coded) << smith


# Neo's friends

neo.friends.each { |n| puts n }


# Neo's friends-of-friends

neo.friends.depth(2).each { |n| puts n }


#Who is in love?
architect.traverse.both(:friends, :has_coded, :loves).depth(:all).filter do
  outgoing(:loves).to_a.size > 0
end.each do |n|
 puts 'In love: ' + n.name
end

A graph programming language - Gremlin

Until recently, there was no query language covering the large domain of graphs and graph-related projects. In the Semantic Web/RDF domain there is SPARQL, an SQL-inspired query language that focuses on the description of example graphs that are used to match triple sets. However, there are a large number of graphs that are not RDF-compatible and that take different or more pragmatic approaches to data modeling, such as the Matrix example of this article and other domain-specific data sets. Other query languages are JSON-oriented, like MQL, the query language for Freebase. These languages work only on their own data model and provide little or no support for deep graph algorithms and heuristic analysis methods, which are necessary for today's big graphs.

For more complex and interesting queries over various graph data models (including RDF), Gremlin - an XPath-oriented, Turing-complete graph programming language - is being developed by the TinkerPop team, primarily driven by Marko A. Rodriguez. Via the introduction of the Property Graph Model, it creates a superset and least common denominator for most of the existing models. Moreover, it allows for the connection of other graph frameworks (e.g. Gremlin using JUNG) and for the expression of graph traversals over different graph implementations. There are already a couple of supported implementations, from simple ones like the in-memory TinkerGraph, via an RDF-SAIL adapter for AllegroGraph, Sesame and the TinkerPop LinkedData SAIL (originally developed by Josh Shinavier for the Ripple programming language), all the way to Neo4j.

Gremlin's syntax is based on XPath in order to be able to express even deep path descriptions through the graph with simple expressions. Many simple cases look almost like normal XPath.

After installing Gremlin or trying it online, a Gremlin session on the Matrix-example graph could look like this:

peterneubauer$ ~/code/gremlin/gremlin.sh


         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> #open a new neo4j graph as the default graph ($_g)
gremlin> $_g := neo4j:open('tmp/matrix')
==>neo4jgraph[tmp/matrix]
gremlin> #the vertices
gremlin> $neo := g:add-v(g:map('name','Neo'))
==>v[1]
gremlin> $morpheus := g:add-v(g:map('name','Morpheus'))
==>v[2]
gremlin> $trinity := g:add-v(g:map('name','Trinity'))
==>v[3]
gremlin> $cypher := g:add-v(g:map('name','Cypher'))
==>v[4]
gremlin> $smith := g:add-v(g:map('name','Agent Smith'))
==>v[5]
gremlin> $architect := g:add-v(g:map('name','The Architect'))
==>v[6]
gremlin> #the edges
gremlin> g:list($cypher,$neo,$trinity)[g:add-e($morpheus,'KNOWS',.)]
==>v[4]
==>v[1]
==>v[3]
gremlin> g:add-e($cypher,'KNOWS',$smith)
==>e[3][4-KNOWS->5]
gremlin> g:add-e($trinity,'LOVES',$neo)
==>e[4][3-LOVES->1]
gremlin> g:add-e($architect,'HAS_CODED',$smith)
==>e[5][6-HAS_CODED->5]
gremlin> #go to Neo as the current root ($_) via a full-text index search
gremlin> $_ := g:key('name','Neo')
==>v[1]
gremlin> #is this Neo?
gremlin> ./@name
==>Neo
gremlin> #what links to here and from here?
gremlin> ./bothE
==>e[0][1-KNOWS->2]
==>e[4][3-LOVES->1]
gremlin> #only take the KNOWS-edges
gremlin> ./bothE[@label='KNOWS']
==>e[0][1-KNOWS->2]
gremlin> #Neo's friend's names
gremlin> ./bothE[@label='KNOWS']/inV/@name
==>Morpheus
gremlin>
gremlin> #Neo's Friends of friends, 2 steps
gremlin> repeat 2
               $_ := ./outE[@label='KNOWS']/inV
           end
==>v[4]
==>v[3]
gremlin> #What are their names?
gremlin> ./@name
==>Cypher
==>Trinity
gremlin> #every node in the whole graph with an outgoing LOVES edge
gremlin> $_g/V/outE[@label='LOVES']/../@name
==>Trinity

Deep graph algorithms - Value in Relationships

The Matrix example is a very naive use of Gremlin given its power. More interesting is the development and testing of algorithms over big graphs. Exhaustive algorithms like eigenvector centrality and Dijkstra do not scale well to such graphs, since they need to touch every vertex in the network. Heuristic approaches with concepts like grammar-based random walkers and spreading activation (deeper explanation by Marko Rodriguez here) are more appropriate for these problems. The Google PageRank algorithm is such a heuristic and can be modeled in Gremlin with the following code (an example over a graph of songs, concerts and albums by the Grateful Dead, loaded from here, with 2,500 loops and an energy loss of 15% on each repetition):

$_g := tg:open()
g:load('data/graph-example-2.xml')
$m := g:map()
$_ := g:key('type', 'song')[g:rand-nat()]
repeat 2500
  $_ := ./outE[@label='followed_by'][g:rand-nat()]/inV
  if count($_) > 0
    g:op-value('+',$m,$_[1]/@name, 1.0)
  end
  if g:rand-real() > 0.85 or count($_) = 0
    $_ := g:key('type', 'song')[g:rand-nat()]
  end
end
g:sort($m,'value',true())

This returns the following weighted list of songs:

==>DRUMS=44.0
==>PLAYING IN THE BAND=38.0
==>ME AND MY UNCLE=32.0
==>TRUCKING=31.0
==>CUMBERLAND BLUES=29.0
==>PROMISED LAND=29.0
==>THE MUSIC NEVER STOPPED=29.0
==>CASSIDY=26.0
==>DARK STAR=26.0
==>NOT FADE AWAY=26.0
==>CHINA CAT SUNFLOWER=25.0
==>JACK STRAW=25.0
==>TOUCH OF GREY=24.0
==>BEAT IT ON DOWN THE LINE=23.0
==>BERTHA=23.0

Another interesting example, where the underlying graph is the LinkedData graph live from the Internet, is a recommendation algorithm for music over LinkedData and DBpedia.

Conclusion

Graphs are not a silver bullet for all problems, any more than RDBMS or any other persistence solution is. The most important aspects are the data itself, its structure, the types of queries and operations to be performed on it, and the requirements regarding scalability and CAP.

Taking the high-scalability aspect of NOSQL databases as the only argument for the use of non-relational solutions is often neither required nor desirable. With Neo4j and today's hardware, in most cases the whole domain model, with billions of domain objects, can be held and queried in one single instance.

If that is not sufficient, there is always the possibility of introducing domain-optimized sharding concepts without the hard data modeling limitations of document or key/value systems. Whether this results in a document model, a domain-specific "object database" or something else depends on the domain context and the application scenario.

The code for this article is available here.

For the excellent feedback and helpful advice I want to thank Michael Hunger and Marko Rodriguez.

About the Author

Peter Neubauer is COO of Neo Technology. Peter is co-founder of a number of Java-and graph based Open Source projects like Neo4j, Gremlin, LinkedProcess, OPS4J, Qi4j. Peter can be contacted at peter@neotechnology.com.


Community comments

  • Congratulations on your article

    by Pere Urbón-Bayes,

    Hello,
    congratulations on your article, a very nice presentation of the graph database approach to the NoSQL ecosystem. If any of you are interested in a good overview of existing graph databases, you can check www.graph-database.org, where you will find other databases like:

    DEX, a high-performance graph database; VertexDB, a nice approach using a key-value store as storage; or DirectEdge, a powerful graph-based recommendation engine.

    BTW nice article.

    Pere

  • All very good, but?

    by Luis Trigueiros,

    Do any of these have a business-friendly license?

  • At last, a practical article on graph DBs

    by Jose Luis Rodriguez,

    I was looking for something like this for ages. Most discussions were focused on the theoretical aspects or on the differences between graph vs. relational worlds.

    It is also good to know that a map/reduce implementation is not always the solution to the giga-data traversal problem.

    Are there any tests/reports that confirm the constant traversing speed?

    Thank you Pere for the link.

  • Graph, Graph, everywhere a Graph

    by Robert Greene,

    Nicely done Peter!

    You’ve succinctly articulated the value of persistent storage technologies like Graph and Object Databases, which employ semi-static relationship binding. Semi-Static in that the relationships are bound into the storage model so that high speed traversal is enabled and yet those relations are re-assignable.

    I find it incredibly ironic that the RDBMS is called the “Relational database”, when it does not actually store relationships. Though perhaps the “R” really stood for Runtime Relationship (R-RDBMS) as they store discrete data values which are then used to runtime calculate relationship using set based logic. It’s not hard to see why that leads to performance and scalability issues when you look at the complexity and volume of data in today’s applications.

    Would be really interesting to see your graph meta models employed over application domain models in a large scale distributed object database like Versant ( this is where I come from ). It would certainly move your work further up the scalability curve shown in the article. It gets tricky when you start dealing with optimization of the network on the implementation of something like Dijkstra on a data set which doesn't fit on a single node. Then traversal algorithms need to be smart about how to load the adjacent set of vertices.

    Great Job!

  • Thank you!

    by Hermann Schmidt,

    Peter,

    thank you for this article. It will be my personal reference node for exploring this subject.

  • Nice walkthrough

    by j0hn hag3r,

    Nice look at RDBMS vs. graph. Also the Jerry Garcia photo makes it that much better :]

  • For people who use rails or ruby with neo4j

    by stark von,

    If you use neo4j with Ruby and Rails, you can visit: neo4j.tw

  • Re: Versioning and Export in useable format

    by Ritesh Shetty,

    Hi Peter,
    I liked the design and the way it works.
    My questions are:
    1. Can I version this graph?
    2. Can I define a schema, like DDL?
    3. Can I provide scripts for DML, like in an RDBMS in a production setup?
    4. Last but not least, can I export the data in something human-readable, like XML?


    Regards
    Ritesh

  • scalability

    by Svend Vanderveken,

    Thanks for this post, I’m definitely bookmarking it as a reference.

    The CAP theorem appears, however, as a surprising introduction to Neo4j. As you pointed out, most NOSQL implementations give up on atomic transactions in order to improve scalability. That does not, however, correspond to my understanding of the approach of Neo4j: on the contrary, Neo4j does provide ACID transactional support, while causing headaches when it comes to scaling horizontally (sharding). The main difficulty comes from the fact that nodes are highly connected to each other, which renders partitioning of the data set much more complicated than in the case of key-value, column or document stores.

    Jim Webber provided us some interesting posts about that here:

    jim.webber.name/2011/02/16/3b8f4b3d-c884-4fba-a...

    jim.webber.name/2011/03/22/ef4748c3-6459-40b6-b...

    Best regards,

    Svend
    svendvanderveken.wordpress.com/

  • Really unbiased article

    by Prateek Jain,

    Thanks for such an unbiased overview of NoSQL DBs. Although the article covers mainly graph-based DBs, I really liked the way everything is presented here (instead of simply cursing RDBMS).

    One small delta: I expected more details about graph DBs.

    Thanks,
    Prateek
