
Ian Robinson discusses Service Evolution and Neo4J Feature Design

Interview with Ian Robinson by Jeevak Kasarkod on Jan 29, 2013
26:25

Bio Ian Robinson (@iansrobinson) is Director of Customer Success for Neo Technology, the company behind Neo4j, the world’s leading open source graph database. He is a co-author of ‘REST in Practice’ (O’Reilly) and a contributor to the forthcoming ‘REST: From Research to Practice’ (Springer) and ‘Service Design Patterns’ (Addison-Wesley). He blogs at http://iansrobinson.com.


   

1. Hi, Ian, welcome to InfoQ and can you introduce yourself?

Hello, my name is Ian Robinson. I am a software engineer with Neo Technology; Neo are the commercial sponsors of Neo4j, which is a JVM-based graph database.

   

2. The last time we spoke to you was in 2009 and we gathered some ideas around REST. I just wanted to catch up around RESTful services, especially around service evolution that you have written about frequently in the recent past.

Yes. When we spoke last time, I think it was prior to REST in Practice coming out. REST in Practice has been out for a couple of years now, and at the time it really encapsulated a lot of what we had learnt about building these things. Since then, overall, my ideas haven’t changed too much. On the one hand they’ve probably simplified a little: nowadays I just think of building RESTful apps as effectively shuffling documents around the network, which has the side effect of triggering some interesting work lower down the stack. You end up interacting with something in the domain at the end of the line, and then you generate another document that you return to the client. On the other hand, when we wrote REST in Practice we were very focused on creating hypermedia APIs that were very link driven, and we were inspired a lot by Atom and AtomPub at the time. Since then I’ve learned to differentiate between kinds of hypermedia control. First there are controls where the client interacts with the resource in a way that doesn’t change anything on the server side.

That’s the stuff I would advertise using a link. Then there are interactions that probably have some significant side effects on the server, and nowadays I tend to advertise those using a form-like dialect. I think a large part of the REST community, not all of it, is also beginning to adopt this distinction, and I think it’s a useful way of working now. With regard to the evolution of services, I’m glad you chose the word evolution rather than versioning. Our concern is always to be able to evolve the applications that we’re developing, and there are lots of different techniques we can employ to do that. I’ve always felt that one of the things that would help us evolve an application is understanding how it’s actually being used by its clients.

This is, I guess, an old idea of mine now, but I thought a long time ago about consumer-driven contracts. The idea is that, if possible (and it isn’t always possible when you’re dealing with the big bad anonymous web), our clients can give us their contract: some indication of how they expect to use our service or application. If they can do that in some programmatic fashion, for example using unit tests, then we can incorporate those into our own application build pipeline. That gives us very early feedback: as we choose or need to evolve the application, we can learn where we’re beginning to break some of those clients’ expectations. That’s not always going to help. Ultimately there are often drivers that cause you to make a breaking change to an application, and having adopted a RESTful architecture isn’t necessarily going to help with that. In the end we need to make a breaking change and help clients migrate, and at that point I think we need to introduce versioning strategies.

And internally, in terms of the representation formats and the responses that we’re generating, we can apply a lot of well-understood versioning strategies; there are very well-understood versioning strategies for things like XML. All of that continues to apply today. One of the benefits of a RESTful architecture is that, if we have sufficient resources and opportunity, we can continue to support old clients while directing new clients of our new functionality to other resources. We’ve always got the opportunity to create other resources that encapsulate additional or new functionality.
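
As an illustration of the consumer-driven contract idea described above, here is a minimal sketch in Python. It is not taken from the interview: the provider `order_service`, the two consumer contracts and the helper `broken_consumers` are all hypothetical names, and a real pipeline would run the consumers' own test suites rather than in-process functions.

```python
# Hypothetical sketch: every name here is illustrative, not a real system.

def order_service(order_id):
    """Provider: returns the current representation of an order."""
    return {
        "id": order_id,
        "status": "shipped",
        "total": 42.50,
        "currency": "GBP",  # newer field; existing consumers simply ignore it
    }


# Each consumer hands the provider a small executable expectation.
def billing_contract(representation):
    assert "id" in representation
    assert isinstance(representation["total"], float)


def shipping_contract(representation):
    assert representation["status"] in {"pending", "shipped", "delivered"}


CONSUMER_CONTRACTS = {
    "billing": billing_contract,
    "shipping": shipping_contract,
}


def broken_consumers(provider, contracts):
    """Run every consumer contract against the provider's output and
    return the names of consumers whose expectations no longer hold."""
    representation = provider(1)
    failed = []
    for name, contract in contracts.items():
        try:
            contract(representation)
        except (AssertionError, KeyError):
            failed.append(name)
    return failed


print(broken_consumers(order_service, CONSUMER_CONTRACTS))  # []
```

Adding the `currency` field breaks nobody, whereas a provider change that dropped `total` would be reported against the billing consumer before release.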

   

3. That’s great. Let’s move to your favorite topic around graph databases. So can you give a brief introduction to graph databases and especially focus on use cases where it’s better to use a graph database than a regular relational database.

So a graph database, what’s a graph database? Well, it’s a database that allows you to store, manage and query your data in the form of a graph. For computer scientists that’s in the form of edges and vertices, but nowadays, and certainly at Neo, we talk in terms of nodes and relationships. So the data model of a graph database is that of nodes and relationships, circles and lines. When you start a new project, very often what you end up doing is sitting around a whiteboard trying to describe your domain using lots of circles and lines; you’re describing a graph. A graph database allows you to model your domain, or the data of interest, in the form of a graph, persist it as a graph and query it as a graph. Typically you use nodes to represent the things that are of interest to you, the things in your domain, and then you structure the connections between those things using named and directed relationships.

Every relationship has a name and a direction, and that creates some semantic context for all of the different things in your domain. You can then attach properties both to the nodes and to the relationships, so the nodes become containers for properties, for key-value pairs, and the relationships can carry properties as well. In that way you begin to capture all of the relevant aspects of your domain: you put some of the data in the nodes, which represent the things in your domain, and then you structure it using relationships, perhaps qualifying or weighting those relationships using specific properties. This is all really useful where your domain, or the questions that you are asking of your domain, depend upon a deep understanding of the ways in which things are connected and, very often, the quality of those relationships. If the kind of questions you are asking depend upon that understanding, then the graph is going to give you some significant power in answering them.

It also helps cater to highly variable domains where no two things necessarily look alike, the kind of thing we’ve always struggled to model in the relational world, where we often end up with large sparse tables in order to capture all of that variation. In the graph, no two portions of the graph need look alike and no two nodes need contain exactly the same set of properties, so it’s very good for modeling what we call semi-structured domains, and densely connected semi-structured domains in particular. As for why on earth I would choose to use it: perhaps my domain is most naturally expressed as a graph, but more particularly the questions I want to ask depend upon those relationships. In the relational world I would probably be depending upon queries that use a lot of joins or recursive joins, and those kinds of queries suffer as the data set size grows: resource consumption is enormous and the queries can be relatively slow, because you’re trying to resolve or reify those joins at query time.

In the graph, all those relationships effectively represent pre-computed joins. So for the particular kind of join-intensive query that in the relational world would deteriorate as the data set grows, a graph database can be a couple of thousand times faster, and you can get some blistering performance for very densely connected semi-structured domains and the kinds of questions you want to ask of them.
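
To make the data model above concrete, here is a minimal in-memory sketch in Python. It is only an illustration of nodes, named and directed relationships, and properties on both; the `Graph` class and its method names are invented for this example and are not Neo4j's API.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}                    # node id -> property dict
        self.outgoing = defaultdict(list)  # node id -> [(type, props, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, source, rel_type, target, **props):
        # A relationship has a name (type), a direction and its own properties.
        self.outgoing[source].append((rel_type, props, target))

    def neighbours(self, node_id, rel_type):
        # Following a stored relationship is a lookup, not a join on keys.
        return [t for (r, _, t) in self.outgoing[node_id] if r == rel_type]


g = Graph()
g.add_node("alice", name="Alice", age=34)   # nodes need not share properties
g.add_node("bob", name="Bob")
g.add_node("carol", name="Carol", city="Oslo")
g.relate("alice", "KNOWS", "bob", since=2009)
g.relate("bob", "KNOWS", "carol", since=2011)


def friends_of_friends(graph, start):
    """Two-hop traversal that walks stored relationships directly."""
    result = set()
    for friend in graph.neighbours(start, "KNOWS"):
        for fof in graph.neighbours(friend, "KNOWS"):
            if fof != start:
                result.add(fof)
    return result


print(friends_of_friends(g, "alice"))  # {'carol'}
```

In a relational schema the same question would typically be a self-join on a friendships table, repeated once per hop; here each hop is a walk along relationships that were stored when the data was written.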

   

4. Talking of graph domains, can you exemplify it with some recent projects or places where Neo4j has been adopted, because that is always a question from adopters, about the maturity level and the adoption level in the market?

Previously we just talked about graph databases in general. As you said, I am working at Neo on Neo4j, which is our graph database, a JVM-based graph database. I’ve been there for nearly two years now, and I spent a large part of the first year and a bit working with many of our customers. This week we’ve just had a very large conference here in San Francisco, just prior to QCon, and we have a number of large customers who use Neo4j in some very large business-critical deployments. We had Accenture here talking at GraphConnect about how they are using Neo4j with one of the largest logistics companies in Europe to route parcels through their parcel network. This is a company that has 2,000 to 3,000 parcels per second coming into their network, and they need to be able to calculate a route for each of those parcels from the point of entry to its ultimate destination, two or three thousand per second. Today Neo4j is in production satisfying all of those requests. We also have Telenor here, from Norway, who are using Neo4j to model some very complex access control structures and organizational structures.

They’re in telco, and they give their larger customers a self-service application that allows those customers to manage products, subscriptions and services. Here we’re replacing an old version of the application, four, five, six years old. In the old version, which was built on relational technology, queries for some of the largest customers were taking up to 20 minutes just to resolve some of those access controls and permissions and resolve the entire object graph. In Neo4j it’s taking two seconds. So, again, that’s another production deployment. Then we’ve got several telcos who are using Neo4j to model their telecommunications infrastructure, so they’re modeling all of the elements in their telco network in Neo4j, and this allows them to do impact analysis: given this particular important client or customer, which parts of the network do they depend upon? And if they report a problem, which parts of the network are likely at fault? This is reducing their problem resolution time from days down to minutes.

They can also do this kind of analysis bottom up: given this router or this switch, which parts of the network, or which customers and clients, am I going to impact if I have to replace or repair it? So those are three of the things that I worked on over the course of the last year with some of the customers here in the US and in Europe.
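
The two directions of impact analysis described above can be sketched as plain graph traversals. The network below is made up for illustration, and `reachable` and `invert` are hypothetical helper names; nothing here reflects how the telcos mentioned actually model their networks.

```python
from collections import deque

# Made-up dependency graph: each element lists what it directly depends on.
DEPENDS_ON = {
    "customer-a": ["service-1"],
    "customer-b": ["service-2"],
    "service-1": ["router-1"],
    "service-2": ["router-1", "router-2"],
    "router-1": ["switch-1"],
    "router-2": ["switch-1"],
    "switch-1": [],
}


def reachable(edges, start):
    """Breadth-first traversal: everything reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Reverse every relationship so the graph can be walked bottom up."""
    inverted = {node: [] for node in edges}
    for source, targets in edges.items():
        for target in targets:
            inverted[target].append(source)
    return inverted


# Top down: which parts of the network does customer-a depend upon?
print(sorted(reachable(DEPENDS_ON, "customer-a")))
# ['router-1', 'service-1', 'switch-1']

# Bottom up: who is affected if router-2 has to be replaced?
print(sorted(reachable(invert(DEPENDS_ON), "router-2")))
# ['customer-b', 'service-2']
```

A graph database that stores directed relationships can usually be traversed in either direction, so the explicit `invert` step is only needed in this toy dictionary representation.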

   

5. Talking of performance, can you share some of the underlying implementation details of how the nodes and relationships are stored on disk and or on heap?

So, today all of the data is ultimately made durable on disk. We have several stores on disk: a store for nodes, another for relationships, another for properties, and then additional stores for large string values and large arrays, and they all consist of fixed-length records. That’s how we actually lay all the data out on disk, and then a query is effectively just chasing pointers, chasing offsets into each of those files. On top of that we have a couple of layers of caching. We’ve got file-based caching, using NIO memory-mapped files, and on top of that we have a heap-based object cache where we hold the reified Java objects that represent the nodes and relationships.

If you can get your entire graph into main memory, onto the heap, that’s when you get by far the best performance. Many of our clients and customers today, even if they’ve got lots and lots of RAM, are not going to be able to get their entire graph onto the heap, but in most cases they can still memory map the entirety of the files on the file system. So they very rarely have to go all the way to the file system in order to resolve most of their queries.
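
The point about fixed-length records can be illustrated with a short sketch: when every record has the same size, a record id translates directly into a byte offset, so following a pointer is a seek rather than a search. The 9-byte layout below is invented for this example and is not Neo4j's actual store format; `NODE_RECORD`, `write_node` and `read_node` are hypothetical names, and an in-memory buffer stands in for the node store file.

```python
import io
import struct

# Illustrative layout only: 1-byte in-use flag, then two 4-byte record ids.
NODE_RECORD = struct.Struct(">Bii")  # in_use, first_relationship, first_property
NO_ID = -1                           # marks "no such record"


def write_node(store, node_id, first_rel=NO_ID, first_prop=NO_ID):
    # The offset is computed from the id; no index is consulted.
    store.seek(node_id * NODE_RECORD.size)
    store.write(NODE_RECORD.pack(1, first_rel, first_prop))


def read_node(store, node_id):
    store.seek(node_id * NODE_RECORD.size)
    in_use, first_rel, first_prop = NODE_RECORD.unpack(store.read(NODE_RECORD.size))
    return {"in_use": bool(in_use), "first_rel": first_rel, "first_prop": first_prop}


store = io.BytesIO()  # stands in for a node store file on disk
write_node(store, 0, first_rel=7)
write_node(store, 1, first_rel=12, first_prop=3)

print(NODE_RECORD.size)     # 9
print(read_node(store, 1))  # {'in_use': True, 'first_rel': 12, 'first_prop': 3}
```

The ids held in a node record would in turn be offsets into the relationship and property stores, which is what makes a traversal a chain of constant-time lookups.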

   

6. One of the concerns most NoSQL adopters have is lack of ACID transactions. Neo4j decided to go the path of adopting ACID transactions. Can you explain some of the tradeoffs that occurred as a result of it and the reason why you decided to stay ACID?

Sure. I think our choice of ACID is really a function of our data model; our graph data model is very different from the data model employed by lots of other NoSQL databases. When you think about key-value stores, column-oriented databases and document-oriented databases, these all effectively have very similar data models, what Martin Fowler called aggregate-oriented data models. Those kinds of databases have made some very wise decisions to trade off some of the ACID characteristics in order to gain the benefits of high availability and horizontal scalability. They can do that because each little bit of data that you want to submit is very self-contained; it is an aggregate, hence Martin’s term. That makes it very easy to partition or shard those things, but at the same time you have to surrender some of the ACID characteristics. Whereas when you consider the graph data model, when you’re creating data you can effectively be creating an arbitrarily complex subgraph.

Now, Neo4j is an OLTP database, and it’s directed as much at the enterprise crowd as it is at startups. We want to be confident that even if you are creating an arbitrarily complex structure, when you complete that operation the structure is made durable on disk in its entirety, or we roll back. We don’t want half of that subgraph to have been made durable on disk and the other half to have been lost. So we made the decision right at the outset, eight or nine years ago, to retain all of those ACID characteristics. As a client of Neo4j, when you’re creating an arbitrarily complex subgraph in the context of a transaction, you can be confident that when that transaction commits, the entire subgraph has been made durable on disk. We also allow you to register other transaction managers, so Neo4j can participate in distributed transactions. That’s often quite important where we have clients and customers who still have a lot of data in other systems, often relational systems, but who use Neo4j to impose a graph-like fabric over the top of that.

So they are updating data in two different repositories at the same time, and they want to be confident they can do that in the context of a transaction, therefore we support distributed transactions as well.
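
The all-or-nothing property described above can be sketched with a toy in-memory store. This illustrates atomicity only: it says nothing about durability, isolation or how Neo4j implements transactions, and `TinyGraphStore` and its methods are hypothetical names invented for the example.

```python
import copy


class TinyGraphStore:
    """Toy store: a transaction's changes become visible together or not at all."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def transaction(self):
        return _Transaction(self)


class _Transaction:
    def __init__(self, store):
        self.store = store

    def __enter__(self):
        # Work on private copies so a failure leaves the store untouched.
        self.nodes = copy.deepcopy(self.store.nodes)
        self.relationships = list(self.store.relationships)
        return self

    def create_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, source, rel_type, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a relationship must exist")
        self.relationships.append((source, rel_type, target))

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:  # commit: publish the whole subgraph at once
            self.store.nodes = self.nodes
            self.store.relationships = self.relationships
        return False          # on error: discard the copies, re-raise


store = TinyGraphStore()
try:
    with store.transaction() as tx:
        tx.create_node("a")
        tx.relate("a", "KNOWS", "missing")  # fails part-way through
except KeyError:
    pass

print(store.nodes)  # {} : node "a" was rolled back along with the rest
```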

   

7. Let’s get to some of the details around ACID transactions. Can you also share if there are any performance hits, especially with CRUD operations at the node and edge level, and how it compares with CRUD operations in the RDBMS world?

Right. So, you might say that a write operation in Neo4j is potentially a bit more expensive than the equivalent write operation in a relational database. That’s partly because those relationships effectively represent pre-computed joins, which gives us a tremendous benefit at read time. In a relational database you’re reifying those joins at query time, often taking advantage of indexes in order to compute them, so there’s always that layer of indirection in a relational query, and that’s the kind of thing that can be impacted as your data set size grows. We’ve taken a very different approach, where we’re effectively pre-computing those joins and laying them out on disk at insert time. On the other hand, whilst a write might be a slightly more expensive operation, in many cases when you’re building a complex structure the number of writes that you have to do to accomplish that in Neo4j is far fewer, often an order of magnitude fewer, than in a relational database, so those things often balance out.

Then when it comes to reads or queries, again, the kind of query that in a relational world would depend upon a lot of joins, often recursive joins, and which would deteriorate as the data set size grows, can be literally thousands of times faster in a graph database, and we have good anecdotal evidence. I mentioned the Telenor example, 20 minutes down to two seconds. So again we’re getting the benefit of pre-computing the joins, so that at read time we can traverse an enormous swathe of the graph, often millions of relationships per second per thread per core, on some very modest hardware.

   

8. And how does it scale for distributed transactions, does that involve three phase locking? Can you explain the same read and write operations in terms of distributed transactions?

Today we don’t distribute the data over different instances, so when you’re committing to the graph you’re always committing to a specific instance. But if you’re registering another transaction manager, then there is all of that additional protocol overhead in order to coordinate the distributed transactions. So, yes, that is going to have a significant impact. But again, the benefit is that if you are storing data in another system in addition to having some in Neo4j, you can be confident that it has all been made durable and consistent once the transaction commits.

   

9. And then there are of course performance efficiencies from indexing, so can you throw some light on what are the indexing options within Neo4j?

Well, we do have indexes, but we don’t use them in the same way they are used in the relational world. In the relational world they are often used at query time in order to reify those joins, so you often go by way of an index in order to compute a join. Graphs in and of themselves are often their own index, and most indexes are a graph-like structure, so our preference is to start traversing the graph as soon as possible; that’s when you get by far the best performance. The way we use indexes today is effectively as a naming service. In order to run a query against Neo4j you need one or more starting points, one or more starting nodes in the graph, and that’s where we use indexes. We use indexes to look up one or more starting points in the graph, and then we encourage you to spider or traverse the graph just using those pre-computed joins. So the indexes play a smaller role; a very important role, but a smaller one. They are not typically used over the majority of the query execution, only to find those starting points.
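
The division of labour described above, one index lookup to find a starting node and then pure traversal, can be sketched as follows. The data, `USERNAME_INDEX` and `within_hops` are made up for illustration.

```python
# Adjacency: node id -> ids of the nodes it points at (made-up data).
FOLLOWS = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

# The index acts as a naming service: external key -> starting node id.
USERNAME_INDEX = {"ian": 1, "jeevak": 4}


def within_hops(start_name, max_hops):
    """Nodes reachable from the named start node in at most `max_hops` hops."""
    start = USERNAME_INDEX[start_name]  # the only index lookup in the query
    frontier, seen = {start}, {start}
    for _ in range(max_hops):
        # Every later step follows stored relationships, not the index.
        frontier = {n for node in frontier for n in FOLLOWS[node]} - seen
        seen |= frontier
    return seen - {start}


print(sorted(within_hops("ian", 2)))  # [2, 3, 4]
```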

   

10. Talking about starting points and traversals, are there some interesting out of the box graph algorithms that were included with Neo4j?

Yes, there are. We expose a number of different Java APIs that the developer can use, and there is also a query language, so if you just want to interact with the database through a query language, the kind of thing we are familiar with from the SQL world, then we have that. But we also have these really rich Java APIs. At the lowest level we have the core API, which deals purely in terms of the graph primitives, nodes, relationships and properties, and then we build a lot of stuff on top of that. We have a traversal framework that you can use to describe how you might want to traverse the graph, and we also have a graph algorithm package that includes implementations of Dijkstra, A*, shortest path and so on. That’s all built on top of our core API, and if you want to introduce implementations of your own algorithms, you can take advantage of the very same APIs that we’ve used to build our graph algorithm package.

So we expose that entire stack to you. You can start at the top and perhaps use a graph algorithm from the algo package, or use the traversal framework, or, if you really want to define your own algorithm at a very fine level, you can drop down to the core API and build it into your own class.
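
For readers unfamiliar with the algorithms named above, here is a generic Dijkstra shortest-path sketch in Python over a weighted adjacency list. It is a textbook implementation written for this transcript, not the code in Neo4j's graph algorithm package, and the depot network is made up.

```python
import heapq


def dijkstra(graph, source, target):
    """graph maps node -> list of (neighbour, cost) with non-negative costs.
    Returns (total_cost, path), or (inf, []) if target is unreachable."""
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue  # a cheaper route to this node was already expanded
        settled.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Made-up parcel network: depots joined by routes with a cost per hop.
DEPOTS = {
    "london": [("paris", 4), ("brussels", 3)],
    "brussels": [("paris", 2), ("berlin", 7)],
    "paris": [("berlin", 9)],
    "berlin": [],
}

print(dijkstra(DEPOTS, "london", "berlin"))
# (10, ['london', 'brussels', 'berlin'])
```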

   

11. The bindings to the API are through REST and the data format is JSON at this point in time. Is there any reason why you decided to go with JSON and not with a binary protocol format?

Yes. It’s kind of an accident of history. The first server interface that we created was really intended just to be a discovery mechanism and a management mechanism, and that was the REST API. For a long time Neo4j was an embedded database, that’s its heritage: you host it and run it in your own Java process. But that’s not people’s familiar experience of databases; they want something on a machine over there that they speak to over the network. So a few years ago we introduced the server mode, and our first step there was just to introduce a management and discovery API which exchanged JSON documents over HTTP. That then became the basis of our default API today. In the future I think we will likely see a binary protocol that allows you to submit Cypher queries (Cypher is our query language). We’ll have client-side drivers; you create those queries and submit them using the binary protocol. The REST API is not going to disappear, but I think it will go back to what it was always intended to be, a discovery and management interface. Today, though, it is still the primary way of accessing the database remotely.

   

12. Talking about deployment options, let’s go back to performance. What are the various high availability deployment options for Neo4j?

So, the enterprise edition of Neo4j has an HA capability. We scale horizontally for high availability and for high read throughput, and that is implemented as traditional master-slave replication. We have a master and one or more slaves; the master is responsible for coordinating all of the writes, and the slaves poll the master at frequent intervals (you can configure that interval) in order to catch up. Writes that are directed at the master are immediately consistent: when the write completes and control returns to the client, you can be confident that it has been made durable and consistent on the file system on the master.

But the overall system is eventually consistent, so there’s an eventual consistency window, though it may only be a matter of milliseconds. With Neo4j 1.8, which we released just a few weeks ago, when you write through the master you can also specify a replication factor, and we will actually push that transaction out to one or more of the replicas. So we’re increasing the durability guarantees and ensuring that the eventual consistency window is reduced.
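
The replication scheme described above, in which replicas poll the master and the master can optionally push each commit to some replicas, can be modelled in a few lines. This is a deliberately simplified sketch: the classes and the `push_factor` parameter are hypothetical names for this example, not Neo4j configuration, and it ignores failure handling and master election.

```python
class Replica:
    def __init__(self):
        self.log = []  # transactions applied so far, in commit order

    def pull_from(self, master):
        """Catch up by copying any transactions not yet applied locally."""
        self.log.extend(master.log[len(self.log):])


class Master:
    def __init__(self, replicas, push_factor=0):
        self.log = []
        self.replicas = replicas
        self.push_factor = push_factor  # how many replicas get each commit at once

    def commit(self, tx):
        self.log.append(tx)  # the write is consistent on the master right away
        for replica in self.replicas[: self.push_factor]:
            replica.pull_from(self)  # pushed straight away, no waiting for a poll


r1, r2 = Replica(), Replica()
master = Master([r1, r2], push_factor=1)

master.commit("create node A")
print(len(r1.log), len(r2.log))  # 1 0 : r2 is inside the consistency window

r2.pull_from(master)             # r2's next scheduled poll
print(len(r1.log), len(r2.log))  # 1 1
```

With `push_factor=0` every replica lags until its next poll; raising it shrinks the window for the replicas that receive the push, at the cost of extra work on each commit.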

   

13. In production can you share what is the maximum replication factor that most of your clients have used?

It varies with the size of the cluster. We have many people who have three-instance clusters of Neo4j. One of the largest deployments that I am aware of is Adobe’s deployment for Creative Cloud, which is on AWS, and that is nine instances globally distributed in AWS. There they actually restrict master re-election to one of their regions, so the replication factor will effectively be three; I think it’s three there.

   

14. And finally, just curious, what is the overall direction that Neo4j is working on, any interesting things that are coming up?

There are a lot of interesting and exciting things coming up, many of which I can’t talk about, but I’ll give you a hint as to some of the things that are happening. We have a very frequent release schedule: we release a new GA every quarter, but every two or three weeks we make milestone releases, so we are often previewing the new functionality very early. We’re focusing a lot of attention on Cypher in this release and the next; ideally by the end of the very next release, so at the end of the first quarter next year, Cypher will be pretty much feature complete. Looking even further forward, I would imagine that will then become the basis of the binary protocol.

We’ve also got very interesting ideas about our data model and about an enriched version of that data model. We’ve always prided ourselves on being schema free, and that gives you a lot of flexibility in modeling and evolving your domain. But we also have a lot of customers and clients who would like to be able to layer some additional constraints on top, and particularly to be able to label nodes, so that you can say this node represents a user whereas this node represents a product. Today you do that by way of convention. In the future we will give you the capability to label nodes and then, even further down the line, to associate different kinds of constraints with those labels; they may be schema-like constraints, security constraints and so on.

Jeevak: Thanks, Ian, thanks for joining us today, really good speaking to you.

Ian: A pleasure, thank you very much.

Jeevak: Thank you.

Ian: Bye bye!
