
Let Me Graph That For You

This article first appeared in Objective View magazine and is brought to you by InfoQ & Objective View magazine.



In this article we’ll look at some of the challenges in the contemporary data landscape that Neo4j is designed to solve, and the ways in which it addresses them. After taking a tour of Neo4j’s underlying graph data model, we'll look at how we can apply its data model primitives when developing our own graph database-backed applications. We’ll finish by reviewing some modelling tips and strategies.

Tackling Complex Data

Why might we consider using a graph database? In short, to tackle complexity, and generate insight and end user value from complex data. More specifically, to wrest insight from the kind of complexity that arises wherever three contemporary forces meet—where an increase in the amount of data being generated and stored is accompanied by a need both to accommodate a high degree of structural variation and to understand the multiply-faceted connectedness inherent in the domain to which the data belongs.

Increased data size—big data—is perhaps the most well understood of these three forces. The volume of net new data being created each year is growing exponentially—a trend that looks set to continue for the foreseeable future. But as the volume of data increases, and we learn more about the instances in our domain, so each instance begins to look subtly different from every other instance. In other words, as data volumes grow, we trade insight for uniformity. The more data we gather about a group of entities, the more that data is likely to be variably structured.

Variably structured data is the kind of messy, real-world data that doesn't fit comfortably into a uniform, one-size-fits-all, rigid relational schema; the kind that gives rise to lots of sparse tables and null checking logic. It’s the increasing prevalence of variably structured data in today’s applications that has led many organisations to adopt schema-free alternatives to the relational model, such as key-value and document stores.

But the challenges that face us today aren’t just around having to manage increasingly large volumes of data, nor do they extend simply to us having to accommodate ever increasing degrees of structural variation in that data. The real challenge to generating significant insight is understanding connectedness. That is, to answer many of the most important questions we want to ask of our domains, we must first know which things are connected, and then, having identified these connected entities, understand in what ways, and with what strength, weight or quality, they are connected. If you've ever had to answer questions such as:

  • Which friends and colleagues do we have in common?
  • Which applications and services in my network will be affected if a particular network element—a router or switch, for example—fails? Do we have redundancy throughout the network for our most important customers?
  • What's the quickest route between two stations on the underground?
  • What do you recommend this customer should buy, view, or listen to next?
  • Which products, services and subscriptions does a user have permission to access and modify?
  • What's the cheapest or fastest means of delivering this parcel from A to B?
  • Which parties are likely working together to defraud their bank or insurer?
  • Which institutions are most at risk of poisoning the financial markets?

—then you've already encountered the need to manage and make sense of large volumes of variably-structured, densely-connected data. These are the kinds of problems for which graph databases are ideally suited. Understanding what depends on what, and how things flow; identifying and assessing risk, and analysing the impact of events on deep dependency chains: these are all connected data problems. Today, Neo4j is being used in business-critical applications in domains as diverse as social networking, recommendations, datacenter management, logistics, entitlements and authorization, route finding, telecommunications network monitoring, fraud analysis, and many others. Its widespread adoption challenges the notion that the relational database is the best tool for working with connected data. At the same time, it proposes an alternative to the simplified, aggregate-oriented data models adopted by NOSQL.

The rise of NOSQL was largely driven by a need to remedy the perceived performance and operational limitations of relational technology. But in addressing performance and scalability, NOSQL has tended to surrender the expressive and flexible modelling capabilities of its relational predecessor, particularly with regard to connected data. Graph databases, in contrast, revitalise the world of connected data, shunning the simplifications of the NOSQL models, yet outperforming relational databases by several orders of magnitude.

To understand how graphs and graph databases help tackle complexity, we need first to understand Neo4j’s graph data model.

The Labelled Property Graph Model

Neo4j uses a particular graph data model, called the labelled property graph model, to represent network structures. A labelled property graph consists of nodes, relationships, properties and labels. Here’s an example of a property graph:


Nodes represent entity instances. To capture an entity's attributes, we attach key-value pairs—properties—to a node, thereby creating a record-like structure for each individual thing in our domain. Because Neo4j is a schema-free database, no two nodes need share the same set of properties: no two nodes representing persons, for example, need have the exact same attributes.
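As a sketch of this schema-freedom, the following Cypher statement creates two Person nodes with different property sets in the same graph (the particular property names here are illustrative, not drawn from the article's example data):

```cypher
// Two Person nodes need not share the same properties
CREATE (:Person{username:'ben', twitterHandle:'@ben'}),
       (:Person{username:'sally', dateOfBirth:1982})
```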


Relationships represent the connections between entities. By connecting pairs of nodes with relationships, we introduce structure into the model. Every relationship must have a start node and an end node. Just as importantly, every relationship must have a name and a direction. A relationship's name and direction lend semantic clarity and context to the nodes attached to the relationship. This allows us—in, for example, a Twitter-like graph—to say that “Bill” (a node) “FOLLOWS” (a named and directed relationship) “Sally” (another node). Just like nodes, relationships can also contain properties. We typically use relationship properties to represent some distinguishing feature of each connection. This is particularly important when, in answering the questions we want to ask of our domain, we must not only trace the connections between things, but also take account of the strength, weight or quality of each of those connections.
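The Twitter-like example from the text might be written as the following Cypher path expression (the Person label and the since property on the relationship are illustrative assumptions, added to show a relationship property in context):

```cypher
(bill:Person{name:'Bill'})-[:FOLLOWS{since:2011}]->(sally:Person{name:'Sally'})
```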

Node Labels

Nodes, relationships and properties provide for tremendous flexibility. In effect, no two parts of the graph need have anything in common. Labels, in contrast, allow us to introduce an element of commonality that groups nodes together and indicates the roles they play within our domain. We do this by attaching one or more labels to each of the nodes we want to group: we can, for example, label a node as representing both a User and, more specifically, an Administrator. (Labels are optional: therefore, each node can have zero or more labels.) Node labels are similar to relationship names insofar as they lend additional semantic context to elements in the graph, but whereas a relationship instance must perform exactly one role, because it connects precisely two nodes, a node, by virtue of the fact it can be connected to zero or more other nodes, has the potential to fulfil several different roles: hence the ability to attach zero or more labels to each node. On top of this simple grouping capability, labels also allow us to associate indexes and constraints with nodes bearing specific labels. We can, for example, require that all nodes labelled Book are indexed by their ISBN property, and then further require that each ISBN property value is unique within the context of the graph.
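Using the Cypher syntax of the Neo4j 2.x era in which this article was written, the Book/ISBN example might be expressed as follows (the isbn property name is an assumption, and newer Neo4j versions use a slightly different constraint syntax):

```cypher
// Index Book nodes by their isbn property
CREATE INDEX ON :Book(isbn)

// Require isbn values to be unique across all Book nodes
// (a uniqueness constraint is itself backed by an index, so in
// practice the constraint alone would suffice)
CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE
```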

Representing Complexity

This graph model is probably the best abstraction we have for modelling both variable structure and connectedness. Variable structure is provided for by virtue of connections being specified at the instance level rather than the class level. Relationships join individual nodes, not classes of nodes: in consequence, no two nodes need be connected in exactly the same way to their neighbours; no two subgraphs need be structured exactly alike. Each relationship in the graph represents a specific connection between two particular things. It's this instance-level focus on things and the connections between things that makes graphs ideal for representing and navigating a variably structured domain. Relationships not only specify that two things are connected, they also describe the nature and quality of that connection. To the extent that complexity is a function of the ways in which the semantic, structural and qualitative aspects of the connections in a domain can vary, our data models require a means of expressing and exploiting this connectedness. Neo4j's labelled property graph model, wherein every relationship can not only be specified independently of every other, but also annotated with properties that describe how and in what degree, and with what weight, strength or quality, entities are connected, provides one of the most powerful means for managing complexity today.

And Doing It Fast

Join-intensive queries in a relational database are notoriously expensive, in large part because joins must be resolved at query time by way of an indirect index lookup. As an application’s dataset size grows, these join-inspired lookups slow down, causing performance to deteriorate. In Neo4j, in contrast, every relationship acts as a precomputed join, every node as an index of its associated nodes. By having each element maintain direct references to its adjacent entities in this way, a graph database avoids the performance penalty imposed by index lookups—a feature sometimes known as index-free adjacency. As a result, for complexly connected queries, Neo4j can be many thousands of times faster than a join-intensive operation in a relational database.

Index-free adjacency provides for queries whose performance characteristics are a function of the amount of the graph they choose to explore, rather than the overall size of the dataset. In other words, query performance tends to remain reasonably constant even as the dataset grows. Consider, for example, a social network in which every person has, on average, fifty friends. Given this invariant, friend-of-a-friend queries will remain reasonably constant, irrespective of whether the network has a thousand, a million, or a billion nodes.
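A friend-of-a-friend query of the kind described might be sketched like this (the FRIEND relationship name and username property are assumptions for illustration):

```cypher
// Find friends-of-friends: the traversal touches roughly 50 x 50
// relationships, regardless of the overall size of the network
MATCH (me:Person{username:'ian'})-[:FRIEND]-()-[:FRIEND]-(foaf:Person)
WHERE foaf <> me
RETURN DISTINCT foaf.username
```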

Graph Data Modelling

In this section we’ll look at how we go about designing and implementing an application’s graph data model and associated queries.

From User Story to Domain Questions

Imagine we’re building a cross-organizational skills finder: an application that allows us to find people with particular skills in a network of professional relationships.

To see how we might design a data model and associated queries for this application, we’ll follow the progress of one of our agile user stories, from analysis through to implementation in the database. Here’s the story:

As an employee I want to know which of my colleagues have similar skills to me

So that I can exchange knowledge with them or ask them for help

Given this description of an end-user goal, our first task is to identify the questions we would have to ask of our domain in order to satisfy it. Here’s the story rephrased as a question:

Which people, who work for the same company as me, have similar skills to me?

Whereas the user story describes what it is we’re trying to achieve, the questions we pose to our domain provide a clue as to how we might satisfy our users’ goals. A good application graph data model makes it easy to ask and answer such questions. Fortunately, the questions themselves contain the germ of the structure we’re looking for.

Language itself is a structuring of logical relationships. At its simplest, a sentence describes a person or thing, some action performed by this person or thing, and the target or recipient of that action, together with circumstantial detail such as when, where or how this action was accomplished. By attending closely to the language we use to describe our domain and the questions we want to ask of our domain, we can readily identify a graph structure that represents this logical structuring in terms of nodes, relationships, properties and labels.

From Domain Questions to Cypher Path Expressions

The particular question we outlined earlier names some of the significant entities in our domain: people, companies and skills. Moreover, the question tells us something about how these entities are connected to one another:

  • A person works for a company
  • A person has several skills

These simple natural-language representations of our domain can now be transformed into Neo4j’s query language, Cypher. Cypher is a declarative, SQL-like graph pattern matching language built around the concept of path expressions: declarative structures that allow us to describe to the database the kinds of graph patterns we wish either to find or to create inside our graph.

When translating our ordinary language descriptions of the domain into Cypher path expressions, the nouns become candidate node labels, the verbs relationship names:

 (:Person)-[:WORKS_FOR]->(:Company), (:Person)-[:HAS_SKILL]->(:Skill) 

Cypher uses parentheses to represent nodes, and dashes and less-than and greater-than signs (<-- and -->) to represent relationships and their directions. Node labels and relationship names are prefixed with a colon; relationship names are placed inside square brackets in the middle of the relationship.

In creating our Cypher expressions, we’ve tweaked some of the language. The labels we’ve chosen refer to entities in the singular. More importantly, we’ve used HAS_SKILL rather than HAS to denote the relationship that connects a person to a skill. The reason for this is that HAS is far too general a term. Rightsizing a graph’s relationship names is key to developing a good application graph model. If the same relationship name is used with different semantics in several different contexts, queries that traverse those relationships will tend to explore far more of the graph than is strictly necessary—something we are mindful to avoid.

The expressions we’ve derived from the questions we want to ask of our domain form a prototypical path for our data model. In fact, we can refactor the expressions to form a single path expression:

 (:Company)<-[:WORKS_FOR]-(:Person)-[:HAS_SKILL]->(:Skill) 

While there are likely many other requirements for our application, and many other data elements to be discovered as a result of analysing those requirements, for the story at hand, this path structure captures all that is needed to meet our end-users’ immediate goals. There is still some work to do to design an application that can create instances of this path structure at runtime as users add and amend their details, but insofar as this article is focussed on the design and implementation of the data model and associated queries, our next task is to implement the queries that target this structure.

A Sample Graph

To illustrate the query examples, we’ll use Cypher’s CREATE statement to build a small sample graph comprising two companies, their employees, and the skills and levels of proficiency possessed by each employee:

 // Create skills-finder network 

CREATE (p1:Person{username:'ben'}),

This statement uses Cypher path expressions to declare or describe the kind of graph structure we wish to introduce into the graph. In the first half we create all the nodes we’re interested in—in this instance, nodes representing companies, people and skills—and then in the second half we connect these nodes using appropriately named and directed relationships. The entire statement, however, executes as a single transaction.
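The fragment above preserves only the first line of the statement. A complete statement in the same shape might look like the following sketch; the company and skill names come from the results shown later in the article, but the assignments of people to companies and the individual level values are illustrative assumptions, not the article's original sample data:

```cypher
// Create skills-finder network (illustrative sample data)
CREATE (p1:Person{username:'ben'}),
       (p2:Person{username:'charlie'}),
       (p3:Person{username:'ian'}),
       (c1:Company{name:'Acme, Inc'}),
       (c2:Company{name:'Startup, Ltd'}),
       (s1:Skill{name:'Neo4j'}),
       (s2:Skill{name:'REST'}),
       (s3:Skill{name:'Java'}),
       (p1)-[:WORKS_FOR]->(c1),
       (p3)-[:WORKS_FOR]->(c1),
       (p2)-[:WORKS_FOR]->(c2),
       (p1)-[:HAS_SKILL{level:1}]->(s1),
       (p1)-[:HAS_SKILL{level:3}]->(s2),
       (p2)-[:HAS_SKILL{level:3}]->(s3),
       (p2)-[:HAS_SKILL{level:2}]->(s2),
       (p3)-[:HAS_SKILL{level:3}]->(s1),
       (p3)-[:HAS_SKILL{level:2}]->(s2)
```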

Let’s take a look at the first node definition:

 (p1:Person{username:'ben'}) 
This expression describes a node labelled Person. The node has a username property whose value is “ben”. The node definition is contained within parentheses. Inside the parentheses we specify a colon-prefixed list of the labels attached to the node (there’s just one here, Person), together with the node’s properties. Cypher uses a JSON-like syntax to define the properties belonging to a node.

Having created the node, we then assign it to an identifier, p1. This identifier allows us to refer to the newly created node elsewhere in the query. Identifiers are arbitrarily named, ephemeral, in-memory phenomena; they exist only within the scope of the query (or subquery) where they are declared. They are not considered part of the graph, and are, therefore, discarded when the data is persisted to disk.

Having created all the nodes representing people, companies and skills, we then connect them as per our prototypical path expression: each person WORKS_FOR a company; each person HAS_SKILL one or more skills. Here’s the first of the HAS_SKILL relationships:


This relationship connects the node identified by p1 to the node identified by s1. Besides specifying the relationship name, we’ve also attached a level property to this relationship using the same JSON-like syntax we used for node properties.

(We’ve used a single CREATE statement here to create an entire sample graph. This is not how we would populate a graph in a running application, where individual end-user activities trigger the creation or modification of data. For such applications, we’d use a mixture of CREATE, SET, MERGE and DELETE to create and modify portions of the graph. You can read more about these operations in the online Cypher documentation.)

The following diagram shows a portion of the sample data. Within this structure you can clearly see multiple instances of our prototypical path:

Find Colleagues With Similar Skills

Now that we’ve a sample dataset that exemplifies the path expressions we derived from our user story, we can return to the question we want to ask of our domain, and express it more formally as a Cypher query. Here’s the question again:

Which people, who work for the same company as me, have similar skills to me?

To answer this question, we’re going to have to find a particular graph pattern in our sample data. Let’s assume that somewhere in the existing data is a node labelled Person that represents me (I have the username “ian”). That node will be connected to a node labelled Company by way of an outgoing WORKS_FOR relationship. It will also be connected to one or more nodes labelled Skill by way of several outgoing HAS_SKILL relationships. To find colleagues who share my skillset, we’re going to have to find all the other nodes labelled Person that are connected to the same company node as me, and which are also connected to at least one of the skill nodes to which I’m connected. In diagrammatic form, this is the pattern we’re looking for:

Our query will look for multiple instances of this pattern inside the existing graph data. For each colleague who shares one skill with me, we’ll match the pattern once. If a person has two skills in common with me, we’ll match the pattern twice, and so on. Each match will be anchored on the node that represents me. Using Cypher path expressions, we can describe this pattern to Neo4j. Here’s the full query:

// Find colleagues with similar skills
MATCH (me:Person{username:'ian'})
      -[:WORKS_FOR]->(company:Company)
      <-[:WORKS_FOR]-(colleague:Person),
      (me)-[:HAS_SKILL]->(skill:Skill)
      <-[:HAS_SKILL]-(colleague)
RETURN colleague.username AS username,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC 

This query comprises two clauses: a MATCH clause and a RETURN clause. The MATCH clause describes the graph pattern we want to find in the existing data; the RETURN clause generates a projection of the results on behalf of the client.

The first line of the MATCH clause, (me:Person{username:'ian'}), locates the node in the existing data that represents me—a node labelled Person with a username property whose value is “ian”—and assigns it to the identifier me. If there are multiple nodes matching these criteria (unlikely, because username ought to be unique), me will be bound to a list of nodes.

The rest of the MATCH clause then describes the diamond-shaped pattern we want to find in the graph. In describing this pattern, we specify the labels that must be attached to a node for it to match (Company for companies, Skill for skills, Person for colleagues), and the names and the directions of the relationships that must be present between nodes for them to match (a Person must be connected to a Company with an outgoing WORKS_FOR relationship, and to a Skill with an outgoing HAS_SKILL relationship). Where we want to refer to a matched node later in the query, we assign it to an identifier (we’ve chosen colleague, company and skill). By being as explicit as we can about the pattern, we help ensure Cypher explores no more of the graph than is strictly necessary to answer the query.

The RETURN clause generates a tabular projection of the results. As I mentioned earlier, we’re matching multiple instances of the pattern. Colleagues with more than one skill in common with me will match multiple times. In the results, however, we only want to see one line per colleague. Using the count and collect functions, we aggregate the results on a per colleague basis. The count function counts the number of skills we’ve matched per colleague, and aliases this as their score. The collect function creates a comma-separated list of the skills that each colleague has in common with me, and aliases this as skills. Finally, we order the results, highest score first.

Executing this query against the sample dataset generates a tabular result with one row per colleague, showing their username, their score, and the list of skills they share with me (['Neo4j', 'REST'], for example).
The important point about this query, and the process that led to its formulation, is that the paths we use to search the data are very similar to the paths we use to create the data in the first place. The diamond-shaped pattern at the heart of our query has two legs, each comprising a path that joins a person to a company and a skill:


This is the very same path structure we came up with for our data model. The similarity shouldn’t surprise us: after all, both the underlying model and the query we execute against that model are derived from the question we wanted to ask of our domain.

Filter By Skill Level

In our sample graph we qualified each HAS_SKILL relationship with a level property that indicates an individual’s proficiency with regard to the skill to which the relationship points: 1 for beginner, 2 for intermediate, 3 for expert. We can use this property in our query to restrict matches to only those people who are, for example, level 2 or above in the skills they share with us:

// Find colleagues with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})
      -[:WORKS_FOR]->(company:Company)
      <-[:WORKS_FOR]-(colleague:Person),
      (me)-[:HAS_SKILL]->(skill:Skill)
      <-[r:HAS_SKILL]-(colleague)
WHERE r.level >= 2
RETURN colleague.username AS username,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC 

I’ve highlighted the changes to the original query. In the MATCH clause we now assign a colleague’s HAS_SKILL relationships to an identifier r (meaning that r will be bound to a list of such relationships). We then introduce a WHERE clause that limits the match to cases where the value of the level property on the relationships bound to r is 2 or greater.

Running this query against the sample data returns a similar table of usernames, scores and shared skills, now restricted to colleagues whose proficiency in each shared skill is level 2 or above.

Search Across Companies

As a final illustration of the flexibility of our simple data model, we’ll tweak the query again so that we no longer limit it to the company where I work, but instead search across all companies for people with skills in common with me:

// Find people with shared skills, level 2 or above
MATCH (me:Person{username:'ian'})
      -[:HAS_SKILL]->(skill:Skill)
      <-[r:HAS_SKILL]-(other:Person),
      (other)-[:WORKS_FOR]->(company:Company)
WHERE r.level >= 2
RETURN other.username AS username,
       company.name AS company,
       count(skill) AS score,
       collect(skill.name) AS skills
ORDER BY score DESC

To facilitate this search, we’ve removed the requirement that the other person must be connected to the same company node as me. We do, however, still identify the company for whom this other person works. This then allows us to add the company name to the results. The pattern described by the MATCH clause now looks like this:

Running this query against the sample data returns people from both companies: each row shows a username, their company (Acme, Inc or Startup, Ltd), their score, and the skills they share with me (['Java', 'REST'], for example).
We’ve looked at how we derive an application’s graph data model and associated queries from end-user requirements. In summary:

  • Describe the client or end-user goals that motivate our model;
  • Rewrite those goals as questions we would have to ask of our domain;
  • Identify the entities and the relationships between them that appear in these questions;
  • Translate these entities and relationships into Cypher path expressions;
  • Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.

In these last sections we’ll discuss a few strategies and tips to bear in mind as we undertake this design process.

Use Cypher to Describe Your Model

Use Cypher path expressions, rather than an intermediate modelling language such as UML, to describe your domain and its model. As we've seen, many of the noun and verb phrases in the questions we want to ask of our domain can be straightforwardly transformed into Cypher path expressions, which then become the basis of both the model itself, and the queries we want to execute against that model. In such circumstances, the use of an intermediate modeling language adds very little. This is not to say that Cypher path expressions comprehensively address all of our modelling needs. Besides capturing the structure of the graph, we also need to describe in what ways both the graph structure and the values of individual node and relationship properties ought to be constrained. Cypher does provide for some constraints today, and the number of constraints it supports will rise with each release, but there are occasions today where domain invariants must be expressed as annotations to the expressions we use to capture the core of the model.

Name Relationships Based on Use Cases

Derive your relationship names from your use cases. Doing so creates paths in your model that align easily with the patterns you want to find in your data. This ensures that queries that take advantage of these paths will ignore all other nodes and relationships.

Relationships both compose and partition the graph. In connecting nodes, they structure the whole, creating a complex composite from what would otherwise be simple islands of data. At the same time, because they can be differentiated from one another based on their name, direction and property values, relationships also serve to partition the graph, allowing us to identify specific subgraphs within a larger, more generally connected structure. By focussing our queries on certain relationship names and directions, and the paths they form, we exclude other relationships and other paths, effectively materializing a particular view of the graph dedicated to addressing a particular need.

You might think this smacks somewhat of an overly specializing approach, and indeed, in many ways it is. But it’s rarely an issue. Graphs don't exhibit the same degree of specialization tax as relational models. The relational world has an uneasy relationship with specialization, both abhorring it and yet requiring it, and then suffering when it gives in.

Consider: we apply the normal forms in order to derive a logical structure capable of supporting ad hoc queries—that is, queries we haven't yet thought of. All well and good—until we go into production. At that point, for the sake of performance, we denormalize the data, effectively specializing it on behalf of an application's specific access patterns. This denormalization helps in the near term, but poses a risk for the future, for in specializing for one access pattern, we effectively close the door on many others. Relational modellers are frequently faced with these kinds of either/or dilemmas: either stick with the normal forms and have performance suffer, or denormalize, and limit the scope for evolving the application further down the line.

Not so with graph modelling. Because the graph allows us to introduce new relationships at the level of individual node instances, we can specialize it over and over again, use case by use case, in an additive fashion—that is, by adding new routes to an existing structure. We don't need to destroy the old to accommodate the new; rather, we simply introduce the new configuration by connecting old nodes with new relationships. These new relationships effectively materialize previously unthought of graph structures to new queries. Their being introduced into the graph, however, need not upset the view enjoyed by existing queries.
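As a sketch of this additive specialization, a new use case can be served simply by laying a new relationship over existing nodes (the MENTORS relationship and the usernames here are hypothetical, chosen only to illustrate the pattern):

```cypher
// Connect two existing people with a new, use-case-specific
// relationship; nothing about the existing graph changes
MATCH (senior:Person{username:'ben'}),
      (junior:Person{username:'charlie'})
CREATE (senior)-[:MENTORS]->(junior)
```

Existing queries, which never mention MENTORS, continue to see exactly the graph they saw before.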

Pay Attention to Language

In our modelling example, we derived a couple of path expressions from the noun and verb phrases we used to describe our domain. There are a few rules of thumb when analyzing a natural language representation of a domain. Common nouns become candidates for labels: “person”, “company” and “skill” become Person, Company and Skill respectively. Verbs that take an object—”owns”, “wrote” and “bought”, for example—become candidate relationship names. Proper nouns—a person or company's name, for example—refer to an instance of a thing, which we then typically model as a node.

Things aren’t always so straightforward. Subject-verb-object constructs are easily transformed into graph structures, but a lot of the sentences we use to describe our domain are not as simple as this. Adverbial phrases, for example—those additional parts of a sentence that describe how, when or where an action was performed—result in what entity-relational modelling calls n-ary relationships; that is, complex, multi-dimensional relationships that bind together several things and concepts.

N-ary relationships would appear to require something more sophisticated than the property graph for their representation: a model that allows relationships to connect more than two nodes, or that permits one relationship to connect to, and thereby qualify, another. Such data model constructs are, however, almost always unnecessary. To express a complex interrelation of several different things, we need only introduce an intermediate node—a hub-like node that connects all the parties to an n-ary relationship.

Intermediate nodes are a common occurrence in many application graph data models. Does their widespread use imply that there is a deficiency in the property graph model? I think not. More often than not, an intermediate node makes visible one more element of the domain—a hidden or implicit concept with informational content and a meaningful domain semantic all of its own.

Intermediate nodes are usually self-evident wherever an adverbial phrase qualifies a clause. “Bill worked at Acme, from 2005-2007, as a Software Engineer” leads us to introduce an intermediate node that connects Bill, Acme and the role of Software Engineer. It quickly becomes apparent that this node represents a job, or an instance of employment, to which we can attach the date properties from and to.
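In Cypher, the Bill-at-Acme example might be modelled with an intermediate Job node along these lines (the labels and relationship names HAS_JOB, AT and AS are illustrative choices, not prescribed by the article):

```cypher
// A Job node binds together a person, a company and a role,
// and carries the from and to date properties
(bill:Person{name:'Bill'})-[:HAS_JOB]->(job:Job{from:2005, to:2007}),
(job)-[:AT]->(acme:Company{name:'Acme'}),
(job)-[:AS]->(role:Role{name:'Software Engineer'})
```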

It's not always so straightforward. Some intermediate nodes lie hidden in far more obscure locales. Verbing—the language habit whereby a noun is transformed into a verb— can often occlude the presence of an intermediate node. Technical and business jargon is particularly rife with such neologisms: we “email” one another, rather than send an email, “google” for results, rather than search Google.

The verb “email” provides a ready example of the kinds of difficulties we can encounter if we miss out on the noun origins of some verbs. The following path shows the result of us treating “email” as a relationship name:

 (:Person{name:'Alice'})-[:EMAILED]->(:Person{name:'Lucy'})

This looks straightforward enough. In fact, it's a little too straightforward, for with this construct it becomes extremely difficult to indicate that Alice also copied in Alex. But if we unpack the noun origins of “email”, we discover both an important domain concept—the electronic communication itself—and an intermediate node that connects senders and receivers:
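Unpacking the verb gives us an Email node connecting the sender to any number of recipients. A sketch of the resulting structure, with SENT, TO and CC as illustrative relationship names:

```cypher
// The email itself becomes a node, so copying in a third
// person is simply one more relationship
(alice:Person{name:'Alice'})-[:SENT]->(e:Email{id:1234}),
(e)-[:TO]->(lucy:Person{name:'Lucy'}),
(e)-[:CC]->(alex:Person{name:'Alex'})
```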


If you're struggling to come up with a graph structure that captures the complex interdependencies between several things in your domain, look for the nouns, and hence the domain concepts, hidden on the far side of some of the verb phrases you use to describe the structuring of your domain.

Conclusion

Once a niche academic topic, graphs are now a commodity technology. In this article we’ve looked at the kinds of problems graph databases are intended to solve. We’ve seen how Neo4j makes it easy to model, store and query large amounts of variably structured, densely connected data, and how we can design and implement an application graph data model by transforming user stories into graph structures and declarative graph pattern matching queries. If, having got this far, you’re now beginning to think in graphs, head over to and grab a copy of Neo4j.

About the Author

Ian Robinson works on research and development for future versions of the Neo4j graph database. Harbouring a long-held interest in connected data, he was for many years one of the foremost proponents of REST architectures, before turning his focus from the Web's global graph to the realm of graph databases. Follow him: @iansrobinson


