
Emil Eifrem on Neo4j and Graph Databases


1. Emil, can you tell us a little bit about Neo4j and what it is?

Neo4j is a graph database, which is one of the new alternative database categories emerging right now. A graph database differs from a relational database in that it uses nodes, typed relationships between nodes, and key/value properties that you can attach to both nodes and relationships as the abstractions you use to model data. So it's great for any type of data which is complex, which is very connected: bioinformatics, master data management, financial systems, and social networks, which is the obvious one, and things like that.
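
To make that data model concrete, here is a minimal sketch using Neo4j's embedded Java API as it looked around this time; the store path, property names, and values are arbitrary examples:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class DataModelSketch {
        // A typed relationship: the "verb" that connects two nodes.
        enum RelTypes implements RelationshipType { KNOWS }

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("var/graphdb");
            Transaction tx = db.beginTx();
            try {
                Node emil = db.createNode();
                emil.setProperty("name", "Emil");   // key/value property on a node
                Node peter = db.createNode();
                peter.setProperty("name", "Peter");
                // Relationships are typed and can carry properties too.
                Relationship knows = emil.createRelationshipTo(peter, RelTypes.KNOWS);
                knows.setProperty("since", 2008);
                tx.success();
            } finally {
                tx.finish();
            }
            db.shutdown();
        }
    }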


2. Neo4j is part of a fairly large category of things called "NoSQL technologies". Why do you think that NoSQL technologies have been growing in the last few years in comparison to traditional relational databases?

I'd say there are four main reasons why NoSQL is arriving right now. First off, everyone hates the name NoSQL. Everyone thinks it means "Not SQL" or "No to SQL" or "Never SQL" or something like that, whereas I try to evangelize "Not only SQL", as in: the backends of the future will use not only SQL databases, but also key/value stores and document databases and graph databases. That's my preferred expansion. There are four things happening right now that make NoSQL databases appealing. One is that we have an exponential growth of information, so we have just a shitload of data being created right now.

Exponential is kind of funny in that it means that all the data that is going to be created next year will be more than all the data that has been created in the world up until this point, combined. That's a lot of information! That's one thing. The second thing is that data is becoming more and more connected. Way back in the day we had text documents, completely isolated. Then, in the mid '90s, the web arrived, or at least was popularized, and we got hypertext: one level of connectivity between documents. Now more and more things are being connected and related to one another. The main driver here is the web and the internet, but also the fact that we model the real world, as in bioinformatics, where data is inherently connected, like protein interactions or gene sequences.

The third one is semi-structured information. Semi-structured information is data which has a few mandatory attributes, but many optional ones. For example, if you had modelled a salary list, like a payroll system, in the '70s, you might have had one column called "Title", whereas today, in 2010, it's very common that people have multiple roles and belong to multiple organizations or multiple parts of the organization. Today you might have this one guy who has three or four titles, and if you model that as a table, at least in the obvious way, you would add four columns - Title 1, 2, 3, 4 - which would be great for that one guy with four titles, but it's going to end up punishing all the other rows. That's what's called a "sparse table".

There is an explosion of semi-structured information now, which is driven or fueled by the whole user-generated content phenomenon, where you have a lot of people creating content in a decentralized manner. Whether you want to call that Web 2.0 or whatever, it leads to a situation where not everyone adheres to the same schema. Once you take that data and try to do something with it, you have very semi-structured, very irregularly shaped datasets. Those are three information trends. The fourth one, which I think is kind of overlooked, is that we also have a trend in architecture where, back in the '90s, the typical system had the one database.

There was one database instance - Larry sold it to you - it had one schema, and then you had a bunch of systems that were all connected to this one database. This database was fenced off by a couple of DBAs. As a software developer on one of these systems, if you wanted to add a new feature, you would walk up to the DBA and tell them "Hey, I need this new thing" and they would say "No", or the chief architect would say "No", because if you changed something in the one database, that would cascade throughout the entire system. That was just a very broken architecture. You had basically no separation of concerns; the underlying persistence mechanism leaked out to the rest of your system.

I don't want to be too optimistic about this, but I think at least the ambition today is to build more service-based architectures where, instead of exposing the database on the wire to all these hundreds of systems inside your corporation, you build one system exposed as a service on the wire, with something domain-oriented like "add account", not "insert into this account table" or something like that. That means you can independently swap out the persistence mechanism for that one service without breaking the rest of your architecture.

Then you have the choice to say "Hey, this one service actually deals with data that is extremely graph based. Why don't we swap in a graph database?" and you get all the advantages of that, hopefully. You can actually experiment like that and make those kinds of "the right tool for the right situation" choices.
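
As an illustration of that separation, here is a small sketch of a domain-oriented service boundary. The service and method names are hypothetical, and the in-memory map stands in for whatever store an implementation chooses; callers never see it:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Callers say "add account", not "INSERT INTO accounts ...".
    interface AccountService {
        void addAccount(String owner, String accountId);
        List<String> accountsOwnedBy(String owner);
    }

    // The backing store is an implementation detail; this in-memory map
    // could be swapped for a relational or graph-backed implementation
    // without touching any caller of the interface.
    class InMemoryAccountService implements AccountService {
        private final Map<String, List<String>> byOwner = new HashMap<String, List<String>>();

        public void addAccount(String owner, String accountId) {
            List<String> accounts = byOwner.get(owner);
            if (accounts == null) {
                accounts = new ArrayList<String>();
                byOwner.put(owner, accounts);
            }
            accounts.add(accountId);
        }

        public List<String> accountsOwnedBy(String owner) {
            List<String> accounts = byOwner.get(owner);
            return accounts == null ? Collections.<String>emptyList() : accounts;
        }
    }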


3. With the variety of non-relational data stores that are available, there are many categories or classes of these data stores. What are they and where does Neo4j fit in?

The problem with NoSQL - well, it's not the problem, it's one of the challenges with NoSQL - is that it's not clearly defined. People have varying views of this. Another challenge is that NoSQL is extremely hyped right now, so pretty much anyone wants to attach themselves to the term. And it's also defined by what it's not: it's not SQL. You could say "Hey, is this room NoSQL? It doesn't support SQL" - it's challenging like that. The way I look at NoSQL is that it's these new, alternative, emerging databases that are operational; I don't put all the analytical stuff under the NoSQL umbrella. If you look at it that way, you can squint a little bit at these databases and see that there are four main categories.

The first category is key/value stores, like Tokyo Cabinet, Project Voldemort and, in some senses, Riak. The second category is the column family stores, like HBase, Hypertable, Cassandra. The third family is document databases, like CouchDB and MongoDB. And the fourth one is graph databases, and that's obviously where Neo4j lives. Other graph databases are AllegroGraph from Franz Inc., Sones DB from the company called Sones, and InfiniteGraph.


4. What are the characteristics of a graph database and which classes of problems or applications are they best suited to?

I tend to take a data model view of NoSQL. The defining aspect of a graph database is the data model of nodes, typed relationships, and key/value pairs on both nodes and relationships. That is what constitutes a graph database. When it comes to applicability, a graph database excels at complex data. If you have data that is very complex, then a graph database is awesome, and by complex data I mean data that is very semi-structured or very connected, or both. An example here is social networks - that's the obvious one. I think Mark Zuckerberg at Facebook popularized the notion of the social graph.

For the longest time even the front page of Facebook.com said "We digitalized the social graph" or something like that. That's an obvious example: nodes are persons and relationships are whether you know that person or not. But even in geographical systems, you can have something like a node representing a city and a relationship representing a road, and you start building up a geographical system. One of the cool things with graph databases is that the model is so powerful, so generic, that you can actually squeeze a lot of different kinds of domains in there. For example, take the two I just mentioned.

If you have a social graph and a geo graph and you merge them into the same graph database, what you get is actually a social network which also connects to the geo graph, with stuff like recommendations. So one node may be a restaurant, which is in a city, which is connected to another city - so that's a graph - and then we have my social graph, and when you say "Emil likes this restaurant", all of a sudden you have basically what's now called location-based services: the Foursquares, the Gowallas, the Facebook Places of the world. I think that's a really underappreciated aspect of graph databases, the fact that they're so powerful at connecting all kinds of domains.
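
A sketch of what merging those two domains might look like with the embedded API; the relationship type names (KNOWS, LIKES, LOCATED_IN) are made up for illustration, and each restaurant is assumed to have exactly one LOCATED_IN relationship:

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;

    public class SocialGeoSketch {
        // Hypothetical relationship types spanning the social and geo domains.
        enum DomainRels implements RelationshipType { KNOWS, LIKES, LOCATED_IN }

        // Given a person node, print restaurants liked by people they know --
        // a naive "recommendations from friends" traversal across both graphs.
        static void friendsFavourites(Node person) {
            for (Relationship knows : person.getRelationships(DomainRels.KNOWS)) {
                Node friend = knows.getOtherNode(person);
                for (Relationship likes : friend.getRelationships(DomainRels.LIKES, Direction.OUTGOING)) {
                    Node restaurant = likes.getEndNode();
                    Node city = restaurant.getSingleRelationship(
                            DomainRels.LOCATED_IN, Direction.OUTGOING).getEndNode();
                    System.out.println(friend.getProperty("name") + " likes "
                            + restaurant.getProperty("name") + " in " + city.getProperty("name"));
                }
            }
        }
    }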


5. When I'm developing an application does my entire problem domain have to be representable inside of a graph or can I split different parts of the data model across different databases or different data stores depending upon what the best representation for that particular data is?

We're big believers in the whole concept of polyglot persistence. Polyglot persistence is the observation that, in the future and even today, datasets are just so complex that in order to get both convenience in programming and runtime benefits such as performance and scalability, you need to take parts of your dataset and put them into different types of databases. So you might have one part of your dataset that is very graph oriented - awesome! Let's put that in the graph database. But another part of the same dataset is very key/value oriented, like username and password. Cool! Let's put that in the key/value store. I think you can learn a lot by looking at the big websites of the world, and you can learn a lot about where the rest of the industry is going, because they're hit by these information trends early.

If you look at all the big websites - the Amazons, the Yahoo!s, the Googles, the eBays and whatnot - they've long since moved on from a world where everything is stored in a relational database. They use specialized systems for parts of their data. I think that is clearly where the rest of the world is going to end up, anywhere between now and five years out. That also ties back to what I said before about "not only SQL", which is again the observation that NoSQL stores aren't out to replace or displace SQL databases, but rather to complement them: not only SQL. That also places a burden, I think, on middleware vendors.

We're here at SpringOne, and SpringSource just announced a project called Spring Data where we've been extremely active. That all sprang out of an observation we made when we were out talking to customers: no middleware out there supports multiple types of databases in a good way. I really mean that, specifically "in a good way", because there is some work being done on using multiple databases concurrently from middleware frameworks, but not a single framework today supports the notion of taking one class, taking slices of that class, and saying "This part of this class belongs in a graph database; this part, however, belongs in a relational database." That's exactly what we're doing right now with the Spring Data project.
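
For flavor, here is a sketch of the cross-store idea being described. The annotations follow early Spring Data Graph previews, where @NodeEntity(partial = true) marked the graph slice of a JPA entity, but treat the exact names and packages as illustrative rather than a working recipe:

    import java.util.Set;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.springframework.data.graph.annotation.NodeEntity;
    import org.springframework.data.graph.annotation.RelatedTo;

    @Entity                      // this slice lives in the relational database
    @NodeEntity(partial = true)  // this slice lives in Neo4j
    public class Customer {
        @Id
        Long id;                 // relational: primary key
        String name;             // relational: plain column

        // Graph slice: a typed relationship instead of a join table.
        @RelatedTo(type = "RECOMMENDS", elementClass = Customer.class)
        Set<Customer> recommendations;
    }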


6. From an operations perspective, historically there have been DBAs who have worked with relational database systems and understand them very well. What kinds of challenges does it pose for operations when you now have polyglot persistence, with a variety of data stores?

It's definitely a challenge, and we should acknowledge that fact. We're bringing a lot of new technologies into a very critical area of the datacenter, which is where data is stored. There are maybe some areas of the datacenter where you can mess up and it's not that big a deal, but the database is not one of them. Having said that, I think there is a very big focus in the NoSQL world on ops. A lot of the stuff actually came out of the observation that you can shard MySQL, but that just ends up being a complete operations nightmare, because you have to do all the sharding yourself. Therefore we have auto-sharding types of stores, like many of the key/value stores, which do the lion's share of that automatically.

That's what's going to have to happen: we're going to have to have an even higher focus on making these different new data stores better and easier to manage than the relational database.


7. What kinds of management and monitoring tools are available for a Neo4j install?

The core Neo4j kernel is completely JMX-ified - the "j" in Neo4j is for "JVM", so we fully support JMX in that respect. Then we have a couple of tools that do management through JMX. For example, we have a web admin tool where you can see pretty charts of how many nodes you insert, how many operations are executed against the cache, and things like that. We also hook into things like JConsole and any SNMP-type tools through the JMX support.
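
Because the beans are plain JMX, you can read them with nothing but the standard java.lang.management API. A minimal sketch, assuming the kernel's beans are registered under an "org.neo4j" domain - browse them in JConsole first to confirm the actual names:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class ListNeo4jBeans {
        public static void main(String[] args) throws Exception {
            // The same in-process MBean server that JConsole attaches to.
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // "org.neo4j" is an assumed domain name; adjust to what JConsole shows.
            for (ObjectName name : server.queryNames(new ObjectName("org.neo4j:*"), null)) {
                System.out.println(name);
            }
        }
    }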


8. With Neo4j installations, what's the average number of nodes for a common Neo4j install and for a large Neo4j install?

It's very difficult to give average numbers. It's open source software, so we don't even know about most of the installations out there, but I'd say once you get to hundreds of millions of nodes, that's probably a pretty large one, and billions - that's definitely a large one. We have several installations that deal with multiple billions of nodes and relationships. But I'd say that's the range we're looking at.


9. From a data complexity perspective, how does Neo4j help remove some of the implementation complexity in storing your data?

Basically, we like to look at the whole scalability aspect of NoSQL as having two axes. One is the axis of scaling to size: how do you deal with a high volume of data which is uniform? The other one is scaling to complexity: how do you deal with data that is very semi-structured and very connected? A bunch of our brothers and sisters in the NoSQL world have chosen to focus on scaling to size, and I think that's an admirable goal. It's really cool that you can take these key/value stores, for example, and scale to thousands of machines and petabytes and hundreds of billions of key/value pairs. Having said that, I think that for the majority of applications out there the main struggle isn't getting to thousands or tens of thousands of machines - actually, you're fine with 3-5 machines.

The majority of your problems actually stem from dealing with the complexity of data. The challenge with some of the key/value stores and the column family stores is that the data model is not very rich. In the key/value space you basically have a hashmap, or a Python dictionary: key/values, that's it. You can always take your dataset and squeeze it into key/values - all these models are isomorphic; you can transform data in one model into the other. You can always capture the data, but the problem is that if your data is very complex, then you as a programmer, in your upper layers, are going to have to do a lot of work to take your complex dataset and serialize it into key/value pairs. Whereas a graph database has nodes which can contain key/values, and relationships to other nodes which can also contain key/values.

Using those primitives you can model much more complex data, which shifts the burden of dealing with that complexity down to the data model, where I think it belongs. That means that you as a software developer have to do less work to compensate for the simple data model of the key/value store or whatever; a lot of that is handled by the graph database instead. You are going to end up with a lot leaner and more maintainable code. I think that's an aspect that is very overlooked in the NoSQL discussions today.
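
A small illustration of that serialization burden, with an in-memory map standing in for a key/value store. The key scheme ("user:1:friends") and the comma-separated encoding are invented by the application, which is exactly the point:

    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueGymnastics {
        public static void main(String[] args) {
            Map<String, String> store = new HashMap<String, String>();
            store.put("user:1:name", "Emil");
            store.put("user:2:name", "Peter");
            // The relationship has no first-class representation; it is
            // flattened into a string the application must encode and decode.
            store.put("user:1:friends", "2,17,42");

            for (String friendId : store.get("user:1:friends").split(",")) {
                System.out.println("friend id: " + friendId);
            }
            // In a graph database the same fact is a typed relationship,
            // e.g. emil.createRelationshipTo(peter, FRIEND) -- no encoding step.
        }
    }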


10. If I'm looking to test out a graph database for some aspect of my application, what are some of the changes in perspective that I need to make?

I think you are very right. The industry has worked for 30 or 40 years with relational databases and has worked out many different patterns and methodologies for taking your domain model and serializing it into something that makes sense in terms of relations and sets. Having said that, I don't necessarily think that the most intuitive model for mapping most domains is actually that of tables. We have this notion that we call whiteboard friendliness, which comes from observing what happens when you sit down and have a brainstorming session with a customer. Let's say you are a consultant and you get a gig; the first thing at least I do is go into a room that is painted with whiteboards all over, and you have this initial brainstorming session with someone who is an expert at retail or pensions or something like that - definitely not a computer scientist.

Then you have this brainstorming session on the whiteboard, and very seldom do you end up drawing tables. Maybe if you are building a payroll system, a '70s-style payroll system, then even the domain expert thinks of the data model as a table. There are certainly domains like that, but in most domains today, what you end up drawing on the whiteboard is: you have this guy over here, and he's related to this one thing over here, and this shopping cart includes these items over here, which belong to a product category, and so on. What we call whiteboard friendliness is the observation that when we've taken that, built those systems using Neo4j, and put them in production, six months later we can look back at the snapshots from those brainstorming sessions, and more often than not we see a 1:1 mapping between the whiteboard and the graph in production.

I think that's super-powerful, because the cognitive model of the programmer is that of the whiteboard. We may have been trained to do first normal form and third normal form and ER modeling and all those things in order to take that domain model and serialize it into tables, but I don't think that is the mental model of most people. Having that 1:1 mapping into the graph is really powerful. Now, to get back to your question specifically: what we always encourage people to do is to just let go of their relational shackles, go into a room, and draw their domain on a whiteboard. As a first approximation, whenever you have an entity in there, that's a node; whenever something is related to something else, whenever there is an arrow in there, that's a relationship. That's a very good first stab at a data model for a graph database.


11. Once my data model is represented as a graph rather than as a relational database, which kinds of things are more performant? When I think about it from the design perspective, I look at large table joins and I immediately step back and say "That's really nasty!" How does that look in the graph database? What's the performance like?

Whenever you have join problems in a relational database, it's probably a good bet to look at graph databases, because the join operation is the way a relational database walks from one entity to another entity, and it does that through set merges. You can look at a set merge, or a join, in a relational database as like a double for loop: it has to loop through all these elements and, against all those elements, find the common thing. That's a simplification, but at the core that's what it is, and it's a very expensive operation. In a graph database, going from one entity to another is a linked-list-pointer-type lookup; it really is O(1) per hop.
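
That contrast can be sketched with plain Java collections: the nested-loop shape of a join versus the direct pointer dereference of a graph hop. This illustrates the argument only; it is not either database's actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    public class JoinVersusHop {
        static class Entity {
            final List<Entity> neighbours = new ArrayList<Entity>();  // stored adjacency
            final int foreignKey;
            Entity(int fk) { foreignKey = fk; }
        }

        // Relational-style: scan two sets for matching keys -- O(n * m).
        static int nestedLoopJoin(List<Entity> left, List<Entity> right) {
            int matches = 0;
            for (Entity l : left)
                for (Entity r : right)
                    if (l.foreignKey == r.foreignKey) matches++;
            return matches;
        }

        // Graph-style: each hop just follows pointers already stored on
        // the entity -- constant time per hop, regardless of total data size.
        static List<Entity> oneHop(Entity from) {
            return from.neighbours;
        }
    }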

That's why, when we do deep traversals - they can be even just 3-4 hops - we see amazing performance improvements over a relational database, on the order of 1,000 times faster, or a million times faster. The rule of thumb we use is that in Neo4j you can traverse around one million hops per second, which is obviously many orders of magnitude faster than a relational database. Having said that, if you have extremely well structured data and you want to do ad hoc queries, like "Give me everyone with a name that starts with 'R'" - like 'R*' - that's a crappy query for a graph database, and you shouldn't use a graph database for that.

We integrate with stuff like Lucene for queries like that, by the way, but it's not something that we natively support very well.
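
With the integrated index framework of that era, such a lookup would look roughly like this; the index name and key are assumptions to check against the docs:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.index.Index;
    import org.neo4j.graphdb.index.IndexHits;

    public class NameLookupSketch {
        static void example(GraphDatabaseService db, Node person) {
            Index<Node> people = db.index().forNodes("people");
            Transaction tx = db.beginTx();
            try {
                // Index the node's name so Lucene can answer text queries.
                people.add(person, "name", person.getProperty("name"));
                tx.success();
            } finally {
                tx.finish();
            }
            // The "everyone starting with R" query as a Lucene wildcard.
            IndexHits<Node> hits = people.query("name", "R*");
            for (Node match : hits) {
                System.out.println(match.getProperty("name"));
            }
            hits.close();
        }
    }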


12. Neo4j is open source. Can you tell us a little bit more about why you adopted that approach?

I think that adoption of software, in particular enterprise-style software, today is really best done bottom-up. We're big believers in selling to developers - and by selling I don't mean it in a trying-to-get-dollars-back kind of sense, but talking to developers and convincing them that this is a good system, a good tool, that it solves everyday practical problems for you - and then having that float up into the organization. I think that is the most efficient way, and I also think it's the most honest way to produce really good software, because you don't end up going through the CTO or the CIO who then pushes it down because you have fancy brochures and whatnot.

That's at the core of why we've chosen to go open source. We use the AGPL as our license, and it's available at Neo4j.org. There is a community site, obviously, and there are wiki pages, mailing lists - anything you'd expect from an open source project.


13. Thank you very much.

Thank you.

Dec 23, 2010
