
Graph Databases, NOSQL and Neo4j

Of the many different data models, the relational model has dominated since the 80s, with implementations like Oracle, MySQL and MSSQL, also known as Relational Database Management Systems (RDBMS). Lately, however, in an increasing number of cases the use of relational databases leads to problems, both because of deficits in the modeling of data and because of constraints on horizontal scalability across several servers and big amounts of data. Two trends are bringing these problems to the attention of the international software community:

  1. The exponential growth of the volume of data generated by users, systems and sensors, further accelerated by the concentration of large part of this volume on big distributed systems like Amazon, Google and other cloud services.
  2. The increasing interdependency and complexity of data, accelerated by the Internet,  Web2.0, social networks and open and standardized access to data sources from a large number of different systems.

Relational databases have increasing problems coping with these trends. This has led to a number of different technologies targeting special aspects of these problems, which can be used together with or as alternatives to existing RDBMS - an approach also known as Polyglot Persistence. Alternative databases are nothing new; they have been around for a long time in the form of e.g. Object Databases (OODBMS), Hierarchical Databases (e.g. LDAP) and many more. But during the last few years a large number of new projects have been started which together are known under the name NOSQL databases.

This article aims to give an overview of the position of Graph Databases in the NOSQL-movement. The second part is an introduction to Neo4j, a Java-based Graph Database.

The NOSQL-Environment

NOSQL (Not Only SQL) is really a very wide category for a group of persistence solutions which don't follow the relational data model and don't use SQL as the query language.

In short, NOSQL databases can be categorized according to their data model into the following four categories:

  1. Key-Value-stores
  2. BigTable-implementations
  3. Document-stores
  4. Graph Databases

For key-value systems like Voldemort or Tokyo Cabinet, the smallest modeling unit is the key-value pair. With BigTable clones it is the tuple with a variable number of attributes, and with document databases like CouchDB and MongoDB it is the document. Graph databases model the whole dataset as one big, dense network structure.

Here, two interesting aspects of NOSQL databases will be examined more deeply - scalability and complexity.

1. Scalability


In order to guarantee the integrity of data, most classical database systems are based on transactions. These ensure the consistency of data in all situations of data management. These transactional characteristics are also known as ACID (Atomicity, Consistency, Isolation, Durability). However, scaling out ACID-compliant systems has proven to be a problem. Conflicts arise between the different aspects of high availability in distributed systems that are not fully solvable - known as the CAP theorem:

  • Strong Consistency: all clients see the same version of the data, even on updates to the dataset - e.g. by means of the two-phase commit protocol (XA transactions) and ACID.
  • High Availability: all clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down.
  • Partition Tolerance: the total system keeps its characteristics even when being deployed on different servers, transparently to the client.

The CAP theorem postulates that only two of the three different aspects of scaling out can be achieved fully at the same time.

In order to be able to work with big distributed systems, a closer look at the different CAP characteristics has been taken.

Many of the NOSQL databases have above all loosened the requirements on consistency in order to achieve better availability and partition tolerance. This resulted in systems known as BASE (Basically Available, Soft-state, Eventually consistent). They have no transactions in the classical sense and introduce constraints on the data model to enable better partition schemes (like the Dynamo system etc.). A more comprehensive discussion of CAP, ACID and BASE is available in this introduction.

2. Complexity

Protein Homology Network, Courtesy of Alex Adai: Institute for Cellular and Molecular Biology - University of Texas

The increasing interconnectivity of data as well as of systems has led to denser data sets that cannot be scaled and auto-sharded in any obvious, simple or domain-independent way, as noted by Todd Hoff. More on visualizations of big and complex data sets can be found at Visual Complexity.

The relational model

Before dismissing the relational data model as outdated, we should not forget that one of the reasons for the success of relational database systems is the ability of the relational data model, according to E.F. Codd, to model in principle any data structure without redundancy or information loss through normalization. After the modeling phase, the data can be inserted, modified and queried in very powerful ways via SQL. There are even RDBMS that implement schemas optimized for e.g. insertion speed or multidimensional queries (e.g. star schemas) for different use cases such as OLTP, OLAP, web applications or reporting.

This is the theory. In practice, however, RDBMS hit the constraints of the aforementioned CAP problems and have implementation-caused problems regarding the high-performance execution of "deep" SQL queries that span many table joins. Other problems include scalability, schema evolution over time, the modeling of tree structures, semi-structured data, hierarchies and networks, and many more.

Also, the relational model is poorly aligned with current approaches to software development like object orientation and dynamic languages, a problem known as the object-relational impedance mismatch. Here, ORM layers like Hibernate for Java have been developed and applied with mixed results. They certainly ease the task of mapping the object model to the relational data model, but do not deliver optimal query performance. Especially semi-structured data is often modeled as big tables with many columns that are empty for most rows (sparse tables), which leads to poor performance. Even the alternative, modeling these structures in lots of joined tables, has problems, since joins are very performance-expensive set operations in RDBMS.

The Graph as an alternative to relational normalization

Looking at the projection of domain models onto a data structure, there are two dominating schools - the relational way as used in RDBMS, and the graph and network structures used e.g. for the Semantic Web.

While graph structures are in theory normalizable even in RDBMS, this has serious query performance implications for recursive structures such as file trees and for network structures such as social graphs, due to the implementation characteristics of relational databases. Every operation over a relationship of a network results in a join operation in the RDBMS, implemented as a set operation between the sets of primary keys of two tables - a slow operation that does not scale over growing numbers of tuples in these tables.
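To make the cost difference concrete, here is a minimal, self-contained Java sketch (a toy in-memory adjacency map, not Neo4j's API or storage format): each hop follows direct references, so the cost of a traversal depends only on the number of relationships actually followed, not on the total number of tuples in any table.

```java
import java.util.*;

public class AdjacencyDemo {
    // toy social graph: name -> directly connected names
    static final Map<String, List<String>> KNOWS = Map.of(
        "Neo",         List.of("Morpheus"),
        "Morpheus",    List.of("Neo", "Trinity", "Cypher"),
        "Trinity",     List.of("Morpheus"),
        "Cypher",      List.of("Morpheus", "Agent Smith"),
        "Agent Smith", List.of("Cypher"));

    // breadth-first expansion: each hop is a constant-time map lookup,
    // whereas an RDBMS would need one self-join per depth level
    static Set<String> friendsAtDepth(String start, int depth) {
        Set<String> visited = new HashSet<>(List.of(start));
        Set<String> frontier = new HashSet<>(List.of(start));
        for (int i = 0; i < depth; i++) {
            Set<String> next = new HashSet<>();
            for (String person : frontier)
                for (String friend : KNOWS.getOrDefault(person, List.of()))
                    if (visited.add(friend)) next.add(friend);
            frontier = next;
        }
        return frontier;
    }
}
```

Each additional hop only touches the frontier's neighbors; no global set operation over primary keys is ever needed.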

Basic terminology for labeled property graphs

There is no general consensus on terminology in the area of graphs, and many different graph models exist. However, there is an effort to create the Property Graph Model, unifying most of the different graph implementations. According to it, information in a property graph is modeled using three basic building blocks:

  • node (a.k.a. vertex)
  • relationship (a.k.a. edge) - with a direction and a type (labeled and directed)
  • property (a.k.a. attribute) on nodes and relationships

More specifically, the model is a labeled, directed, attributed multigraph. A labeled graph has a label for each edge, which is used as the type of that edge. A directed graph allows edges with a fixed direction, from the tail or source node to the head or destination node. An attributed graph allows a variable list of attributes for each node and edge, where an attribute is a value associated with a name, simplifying the graph structure. A multigraph allows multiple edges between two nodes. This means that two nodes can be connected several times by different edges, even if two edges have the same tail, head and label.

The figure shows a small labeled property graph.

A small graph of the people surrounding TinkerPop
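As an illustration, these building blocks can be sketched in a few lines of plain Java (a hypothetical model for exposition only, not Neo4j's API or a reference implementation of the Property Graph Model):

```java
import java.util.*;

public class PropertyGraph {
    public static class Node {
        public final Map<String, Object> properties = new HashMap<>();
        public final List<Edge> outgoing = new ArrayList<>();
    }

    public static class Edge {
        public final Node tail, head;   // directed: tail -> head
        public final String label;      // the label serves as the edge's type
        public final Map<String, Object> properties = new HashMap<>();

        public Edge(Node tail, String label, Node head) {
            this.tail = tail;
            this.label = label;
            this.head = head;
            // outgoing is a List, so nothing prevents two edges with the
            // same tail, head and label - exactly the multigraph property
            tail.outgoing.add(this);
        }
    }
}
```

Nodes and edges each carry a free-form property map (the "attributed" part), and edges are directed and labeled, matching the definition above.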

Graph theory has proven its usefulness and relevance for many problems across various domains. The most applied graph-theoretic algorithms include various types of shortest-path calculations, geodesic paths, and centrality measures like PageRank, eigenvector centrality, closeness, betweenness, HITS and many others. However, the application of these algorithms has in many cases been confined to research, since in practice no production-ready, high-performance graph database implementations were available. Fortunately, in recent years this has changed. Several projects have been developed with 24/7 production scenarios in mind.

The following diagram shows a positioning of the main NOSQL categories in the context of Complexity and Scalability aspects.

For more on "Scaling to size vs. Scaling to Complexity" see Emil Eifrem's blog post.

Neo4j - Java-based Graph Database

Neo4j is a fully ACID-compliant, transactional graph database written in Java. The data is stored on disk as a data structure optimized for graph networks. The Neo4j kernel is an extremely fast graph engine with all the characteristics expected of a production database, like recovery, two-phase commit transactions, XA compliance etc. Neo4j has been in 24/7 production use since 2003. The project has just been released as version 1.0 - a major milestone regarding stability and community testing. High availability via online backup and master-slave replication is in beta and next on the release roadmap. Neo4j can be used both as an embedded database with zero administration overhead and as a standalone server with an extensive REST interface for easy integration into PHP, .NET and JavaScript based environments. This article will however concentrate on the direct use of Neo4j.

The developer works directly against the graph model via a Java API that exposes the very flexible data structure. There are good community-contributed bindings to other languages like JRuby/Ruby, Scala, Python, Clojure and others. Typical data sets for Neo4j are deeply connected and semi-structured.

Even "traditional" RDBMS applications tend to contain a number of challenging data sets that are best processed with graphs: folder structures, product configurations, product assemblies and categories, media metadata, semantic trading and fraud detection in the financial sector, and others.

Neo4j has a number of optional components surrounding the kernel. There is support for structuring the graph via a meta model, a SAIL- and SPARQL-compliant RDF triple store implementation, and implementations of a number of common graph algorithms, amongst others.

In case you want to run Neo4j as a separate server, there is a REST wrapper available. This is suitable for architectures that have been built using the LAMP stack. REST even eases the scaling of bigger read loads via memcached, ETag and Apache-based caching and web layers.

High Performance?

It is hard to give definite performance numbers, since benchmarks are very dependent on the underlying hardware, the dataset used and other factors. Size-wise, Neo4j handles graphs of several billion nodes, relationships and properties out of the box. It is normal to reach a read performance of 2,000 relationship traversals per millisecond (about 1-2 million traversal steps per second), fully transactional, with warm caches, per thread. In shortest-path calculations, Neo4j is even on small graphs of a couple of thousand nodes a thousand times faster than MySQL, with the difference increasing as the size of the graph increases.

The reason for this is that in Neo4j graph traversals are executed at constant speed, independent of the total size of the graph. There are no set operations involved that decrease performance, as seen with join operations in RDBMS. Neo4j traverses the graph in a lazy fashion - nodes and relationships are only traversed and returned when the result iterator asks for them, which increases performance for big and deep traversals.
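The effect of such lazy evaluation can be sketched with a plain Java Iterator (a toy model of the idea, not Neo4j's internals): work happens only when the iterator is advanced, so a consumer that stops early never pays for the rest of the graph.

```java
import java.util.*;

public class LazyTraversal {
    static int expansions = 0;  // counts how many nodes were actually expanded

    // lazily stream node ids 0..limit-1, "expanding" each one on demand
    static Iterator<Integer> traverse(int limit) {
        return new Iterator<Integer>() {
            int next = 0;
            public boolean hasNext() { return next < limit; }
            public Integer next() {
                expansions++;   // the work happens here, not up front
                return next++;
            }
        };
    }

    // consume only the first n results and stop
    static List<Integer> firstN(Iterator<Integer> it, int n) {
        List<Integer> out = new ArrayList<>();
        while (it.hasNext() && out.size() < n) out.add(it.next());
        return out;
    }
}
```

Asking a traverser over a million potential nodes for just three results expands exactly three of them; an eager implementation would have materialized all of them first.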

Write speed depends very much on the seek time of the file system and the hardware. The Ext3 file system and SSD disks are a good combination and result in transactional write speeds of ca. 100,000 operations per second.

Example - the MATRIX

The Graph

As mentioned before, social networks represent just a tiny fraction of the applications of graph databases, but they are easy to understand for this example. To demonstrate the basic functionality of Neo4j, below is a small graph from the Matrix movie, visualized with the Eclipse-RCP-based Neoclipse for Neo4j:

The graph is connected to a known reference node (id=0) for convenience, in order to find a way into the network from a known starting point. This is not necessary, but has proven very useful in practice.

The Java implementation looks something like this:

Create a new graph database in the folder "target/neo":

EmbeddedGraphDatabase graphdb = new EmbeddedGraphDatabase("target/neo");

Relationship types can be created on-the-fly:

RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");

or via typesafe Java Enum:

enum Relationships implements RelationshipType { KNOWS, INLOVE, HAS_CODED, MATRIX }

Now, create two nodes and attach a "name" property to each of them. Then, connect these nodes with a  KNOWS relationship:

Node neo = graphdb.createNode();
neo.setProperty("name", "Neo");
Node morpheus = graphdb.createNode();
morpheus.setProperty("name", "Morpheus");
neo.createRelationshipTo(morpheus, KNOWS);

Any operation that modifies the graph or needs isolation levels is wrapped in a transaction, so rollback and recovery work out of the box:

Transaction tx = graphdb.beginTx();
try {
	Node neo = graphdb.createNode();
	// ... more operations on the graph
	tx.success();
} finally {
	tx.finish();
}

The full code to create the Matrix graph then looks something like this:

graphdb = new EmbeddedGraphDatabase("target/neo4j");
index = new LuceneIndexService(graphdb);
Transaction tx = graphdb.beginTx();
try {
	Node root = graphdb.getReferenceNode();
	// we connect Neo with the root node, to gain an entry point to the graph
	// not necessary, but practical
	neo = createAndConnectNode("Neo", root, MATRIX);
	Node morpheus = createAndConnectNode("Morpheus", neo, KNOWS);
	Node cypher = createAndConnectNode("Cypher", morpheus, KNOWS);
	Node trinity = createAndConnectNode("Trinity", morpheus, KNOWS);
	Node agentSmith = createAndConnectNode("Agent Smith", cypher, KNOWS);
	architect = createAndConnectNode("The Architect", agentSmith, HAS_CODED);
	// Trinity loves Neo. But he doesn't know.
	trinity.createRelationshipTo(neo, LOVES);
	tx.success();
} finally {
	tx.finish();
}

With the member function

private Node createAndConnectNode(String name, Node otherNode,
        RelationshipType relationshipType) {
    Node node = graphdb.createNode();
    node.setProperty("name", name);
    node.createRelationshipTo(otherNode, relationshipType);
    index.index(node, "name", name);
    return node;
}
to create the nodes and relationships.

Who are Neo's friends?

Neo4j's API has a number of Java-Collections-oriented methods to answer simple queries. Here, a look at the relationships of the "Neo" node is enough to find his friends:

for (Relationship rel : neo.getRelationships(KNOWS)) {
    Node friend = rel.getOtherNode(neo);
    System.out.println(friend.getProperty("name"));
}

This returns "Morpheus" as the only friend.

The real power of Neo4j, however, comes from the use of the Traverser API, which enables much more complex traversal descriptions and filters. It consists of a Traverser, which evaluates a StopEvaluator for the stop condition and a ReturnableEvaluator for the inclusion of a node in the result. Also, the types and directions of the relationships to traverse can be specified. The Traverser implements the Java Iterator interface, loading the nodes and stepping through the graph lazily, only when they are requested in e.g. a for loop. A couple of common evaluators and defaults are built in:

Traverser friends = neo.traverse(Order.BREADTH_FIRST, StopEvaluator.DEPTH_ONE,
    ReturnableEvaluator.ALL_BUT_START_NODE, KNOWS, Direction.BOTH);
for (Node friend : friends) {
    System.out.println(friend.getProperty("name"));
}

We first want to visit all nodes at the same depth from the start node before continuing to nodes at more distant levels (Order.BREADTH_FIRST), stop after one depth of traversal  (StopEvaluator.DEPTH_ONE), and return all nodes except the start node ("Neo") (ReturnableEvaluator.ALL_BUT_START_NODE). We only traverse relationships of type KNOWS in both directions (Direction.BOTH). This traverser again returns Morpheus as the only direct friend of Neo.

Friends of a friend?

In order to investigate who the friends of Neo's friends are, the KNOWS relationships need to be followed to a depth of 2, starting at Neo and returning Trinity and Cypher. Programmatically this can be achieved by adjusting the StopEvaluator of our Traverser to limit the traversal depth to 2:

StopEvaluator twoSteps = new StopEvaluator() {
    public boolean isStopNode(TraversalPosition position) {
        return position.depth() == 2;
    }
};

Also, the custom ReturnableEvaluator returns only nodes found at depth 2:

ReturnableEvaluator nodesAtDepthTwo = new ReturnableEvaluator() {
    public boolean isReturnableNode(TraversalPosition position) {
        return position.depth() == 2;
    }
};

The friend-of-a-friend traverser now becomes:

Traverser friendsOfFriends = neo.traverse(Order.BREADTH_FIRST,
    twoSteps, nodesAtDepthTwo, KNOWS, Direction.BOTH);
for (Node friend : friendsOfFriends) {
    System.out.println(friend.getProperty("name"));
}

This returns Cypher and Trinity as the result.

Who is in love?

Another interesting question would be whether there is anyone in this graph who is in love, starting from e.g. The Architect.

This time the whole graph should be examined along any relationship type, starting at The Architect (given that his node ID is known, but more on that later), and nodes with an outgoing LOVES relationship should be returned. A custom ReturnableEvaluator does this:

ReturnableEvaluator findLove = new ReturnableEvaluator() {
    public boolean isReturnableNode(TraversalPosition position) {
        return position.currentNode().hasRelationship(LOVES, Direction.OUTGOING);
    }
};

All relationship types in the graph are needed in order to traverse along all relationships:

List<Object> types = new ArrayList<Object>();
// we have to consider all relationship types of the whole graph
// (in both directions)
for (RelationshipType type : graphdb.getRelationshipTypes()) {
    types.add(type);
    types.add(Direction.BOTH);
}
// let's go!
Traverser inLove = architect.traverse(Order.BREADTH_FIRST,
    StopEvaluator.END_OF_GRAPH, findLove, types.toArray());
for (Node lover : inLove) {
    System.out.println(lover.getProperty("name"));
}

This returns Trinity as the only node, since only nodes with outgoing LOVES relationships are returned.

Indexing the Graph

While traversal operations along relationships are one of the sweet spots of Neo4j, set-oriented functionality over the whole graph is often needed as well. Full-text search over the properties of all nodes is a typical example. Here, Neo4j uses external index systems in order not to reinvent the wheel. For the common case of text-based searches, Neo4j has tight integrations with Lucene and Solr, adding the capability to index arbitrary node properties in Lucene/Solr with transactional semantics.

In the Matrix example, the "name" property could for instance be indexed with:

GraphDatabaseService graphDb = // a GraphDatabaseService instance

IndexService index = new LuceneIndexService( graphDb );

//create a new node and index the "name" property
Node neo = graphDb.createNode();
neo.setProperty( "name", "Neo" );
index.index( neo, "name", neo.getProperty( "name" ) );

//search for the first node with "name=Neo"
Node node = index.getSingleNode( "name", "Neo" );

Lucene is an example of an index external to the graph. However, given a fast graph engine, there are a vast number of strategies to build index structures in the graph itself, shortening traversal paths for special data sets and domains. For instance, there are timelines and B-trees for one-dimensional data, R-trees and quadtrees indexing two-dimensional data (very common in the spatial and GIS communities), and many others. Another often useful pattern is to connect important subgraphs directly to the root node in order to shortcut important start nodes.
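The idea behind such a one-dimensional index can be sketched with a sorted map from timestamps to node IDs (a plain-Java illustration of the concept only; the actual timeline component mentioned above is built on the graph itself):

```java
import java.util.*;

public class TimelineIndex {
    // timestamp -> ids of events at that instant (a TreeMap keeps keys sorted)
    private final NavigableMap<Long, List<Long>> timeline = new TreeMap<>();

    public void add(long timestamp, long nodeId) {
        timeline.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(nodeId);
    }

    // all node ids with from <= timestamp <= to, in time order
    public List<Long> between(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (List<Long> ids : timeline.subMap(from, true, to, true).values())
            result.addAll(ids);
        return result;
    }
}
```

A range query then resolves to a cheap ordered scan of just the matching interval instead of a traversal of the whole graph.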

Java sux. Anything shorter, please?

Neo4j has a number of good language bindings that ease working with the graph structure even more. This example uses the excellent Neo4j-JRuby bindings, which shrink the total amount of code considerably.

After installing the neo4j gem with

>gem install neo4j

the JRuby code for the whole Matrix graph and the aforementioned queries looks like this:

require "rubygems"

require "neo4j"

class Person
  include Neo4j::NodeMixin

  #the properties on the nodes
  property :name

  #the relationship types
  has_n :friends

  # Lucene index for some properties
  index :name
end

#the players

neo       = Person.new :name => 'Neo'
morpheus  = Person.new :name => 'Morpheus'
trinity   = Person.new :name => 'Trinity'
cypher    = Person.new :name => 'Cypher'
smith     = Person.new :name => 'Agent Smith'
architect = Person.new :name => 'Architect'

#the connections

cypher.friends << morpheus
cypher.friends << smith
neo.friends << morpheus
morpheus.friends << trinity
trinity.rels.outgoing(:loves) << neo

architect.rels.outgoing(:has_coded) << smith

#Neo's friends

neo.friends.each { |n| puts n }

#Neo's friends of friends

neo.friends.depth(2).each { |n| puts n }

#Who is in love?
architect.traverse.both(:friends, :has_coded, :loves).depth(:all).filter do
  outgoing(:loves).to_a.size > 0
end.each do |n|
  puts 'In love: ' + n.name
end

A graph programming language - Gremlin

Until recently there has not been any query language covering the large domain of graphs and graph-related projects. In the Semantic Web/RDF domain there is SPARQL, an SQL-inspired query language that focuses on the description of example graphs that are used to match triple sets. However, there is a large number of graphs that are not RDF-compatible and that take different or more pragmatic approaches to data modeling, like for instance the Matrix example of this article and other domain-specific data sets. Other query languages are JSON-oriented, like MQL, the query language for Freebase. These languages work only on their own data model and provide no or very poor support for deep graph algorithms and the heuristic analysis methods that are necessary for today's big graphs.

For more complex and interesting queries over various graph data models (including RDF), Gremlin - an XPath-oriented, Turing-complete graph programming language - is being developed by the TinkerPop team, primarily driven by Marko A. Rodriguez. Via the introduction of the Property Graph Model, it creates a superset and least common denominator of most of the existing models. Moreover, it allows for the connection of other graph frameworks (e.g. Gremlin using JUNG) and for the expression of graph traversals over different graph implementations. A couple of implementations are already supported, from simple ones like the in-memory TinkerGraph, via an RDF-SAIL adapter for AllegroGraph, Sesame and the TinkerPop LinkedData SAIL (originally developed by Josh Shinavier for the Ripple programming language), all the way to Neo4j.

Gremlin's syntax is based on XPath in order to be able to express even deep path descriptions through the graph with simple expressions. Many simple cases look almost like normal XPath.

After installing Gremlin or trying it online, a Gremlin session on the Matrix-example graph could look like this:

peterneubauer$ ~/code/gremlin/

gremlin> #open a new neo4j graph as the default graph ($_g)
gremlin> $_g := neo4j:open('tmp/matrix')
gremlin> #the vertices
gremlin> $neo := g:add-v(g:map('name','Neo'))
gremlin> $morpheus := g:add-v(g:map('name','Morpheus'))
gremlin> $trinity := g:add-v(g:map('name','Trinity'))
gremlin> $cypher := g:add-v(g:map('name','Cypher'))
gremlin> $smith := g:add-v(g:map('name','Agent Smith'))
gremlin> $architect := g:add-v(g:map('name','The Architect'))
gremlin> #the edges
gremlin> g:list($cypher,$neo,$trinity)[g:add-e($morpheus,'KNOWS',.)]
gremlin> g:add-e($cypher,'KNOWS',$smith)
gremlin> g:add-e($trinity,'LOVES',$neo)
gremlin> g:add-e($architect,'HAS_CODED',$smith)
gremlin> #go to Neo as the current root ($_) via a full-text index search
gremlin> $_ := g:key('name','Neo')
gremlin> #is this Neo?
gremlin> ./@name
gremlin> #what links to here and from here?
gremlin> ./bothE
gremlin> #only take the KNOWS-edges
gremlin> ./bothE[@label='KNOWS']
gremlin> #Neo's friend's names
gremlin> ./bothE[@label='KNOWS']/inV/@name
gremlin> #Neo's Friends of friends, 2 steps
gremlin> repeat 2
               $_ := ./outE[@label='KNOWS']/inV
gremlin> #What are their names?
gremlin> ./@name
gremlin> #every node in the whole graph with an outgoing LOVES edge
gremlin> $_g/V/outE[@label='LOVES']/../@name

Deep graph algorithms - Value in Relationships

The Matrix example is a very naive use, given the power of Gremlin. More interesting is the development and testing of algorithms over big graphs. Exhaustive algorithms like eigenvector centrality and Dijkstra do not scale to such graphs, since they need to touch every vertex in the network. Heuristic approaches with concepts like grammar-based random walkers and spreading activation (a deeper explanation by Marko Rodriguez here) are more appropriate for these problems. The Google PageRank algorithm is such a heuristic, and it can be modeled in Gremlin with the following code (an example over the graph of songs, concerts and albums by the Grateful Dead loaded from here, with 2500 loops and an energy loss of 15% on each repetition):

$_g := tg:open()
$m := g:map()
$_ := g:key('type', 'song')[g:rand-nat()]
repeat 2500
  $_ := ./outE[@label='followed_by'][g:rand-nat()]/inV
  if count($_) > 0
    g:op-value('+',$m,$_[1]/@name, 1.0)
  end
  if g:rand-real() > 0.85 or count($_) = 0
    $_ := g:key('type', 'song')[g:rand-nat()]
  end
end

This returns the following weighted list of songs:

==>DARK STAR=26.0
==>JACK STRAW=25.0

Another interesting example where the underlying graph is the LinkedData graph live from the Internet is a recommendation algorithm for music over LinkedData and DBPedia.


Graphs are not a silver bullet for all problems, any more than RDBMS or any other persistence solution. The most important aspects are the data, its structure, and the types of queries and operations to be dealt with, together with the requirements regarding scalability and CAP.

Taking the high-scalability aspect of NOSQL databases as the only argument for the use of non-relational solutions is often neither required nor desirable. With Neo4j and today's hardware, in most cases a whole domain model of billions of domain objects can be held and queried in one single instance.

If that is not sufficient, there is always the possibility of introducing domain-optimized sharding concepts without the need to introduce the hard data-modeling limitations of document or key-value systems. Whether this results in a document model, a domain-specific "object database" or something else depends on the domain context and the application scenario.

The code for this article is available here.

For the excellent feedback and helpful advice I want to thank Michael Hunger and Marko Rodriguez.

About the Author

Peter Neubauer is COO of Neo Technology. Peter is co-founder of a number of Java- and graph-based open source projects like Neo4j, Gremlin, LinkedProcess, OPS4J and Qi4j. Peter can be contacted at
