Bio Dr. Jim Webber is Chief Scientist with Neo Technology, the company behind the popular open source graph database Neo4j, where he researches and develops distributed graph databases and writes open source software. He is a co-author of the book REST in Practice. Jim's blog is located at http://jimwebber.org and he tweets often @jimwebber.
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
I’m doing very well Srini, thank you for taking time out with me.
Sure thing. So my name is Jim Webber, I’m Chief Scientist at Neo Technology and as you mentioned we are the commercial backers of the open source Neo4J database and my job is to build graph databases, great open source graph databases.
3. Graph databases have been getting a lot of attention lately. So tell us what is the current state of graph based NoSQL databases in general and also of Neo4J in particular, in terms of enterprise adoption, standards, vendor support?
So I think graphs have always been the kind of weird, idiosyncratic corner of NoSQL so in the standard four quadrant pattern where we have key value stores, columnar stores and document stores, that’s pretty well understood, the data models are very developer friendly and graphs have always been a bit of an oddball, because compared to the relational stores, the columnar, the documents and the key value stores have deliberately decided to have a simpler data model, a less expressive data model to gain things like operational performance, operational stability, scale and so on. So the aggregated databases are a simpler data model for good operational reasons.
The graph databases are odd, because they’ve actually decided to have a much more expressive data model compared to relational databases. So I think they are an oddity compared to the other three types of NoSQL stores, which means that when a developer first comes across them there is an awful lot of head scratching, you can see this haircut was completely caused by Neo4J. So I think compared to the other NoSQL stores, the graph database community is a little bit further behind in terms of adoption and penetration because they are a bit of an odd beast when you look at them first: “What would I use graphs for, they are those things I forgot from university, with that boring old guy doing math on the whiteboard”, on the blackboard even, I’m so old we had chalk, would you believe?
So, I think you will see people that have had a great deal of success with the kind of scale, availability and operational simplicity of some of the popular NoSQL stores, but now they are looking to start to do more sophisticated data modeling, and I think that’s really where graphs are becoming more popular, to the extent where a few years ago there was no way we would have talked about graphs at something like QCon or InfoQ. But this year, for some reason, it’s an enormous subject, right? We see enormous interest in terms of conferences: QCon has hosted graph talks, there are several graphy things out in the conference foyer over there, there is a dedicated graph conference happening in combination with QCon San Francisco later in the year, called GraphConnect, it just seems to be growing. I don’t know quite what’s happened, there has been some inflexion point and the whole graph ecosystem is growing.
[Srini's full question: It makes sense, and also graph databases are more multidimensional as well as more object oriented programming friendly, because you’ve got the relation between objects and the entities. So you mentioned briefly about the scalability aspect, can you talk about the performance and scalability of Neo4J, how is it being used in large systems, how can it scale? ]
Definitely. So Neo4J is a conventional clusterable database; where it defies convention a little is that Neo4J can be embedded inside your application or it can run standalone as a database server over the network. Now, in terms of performance, an individual instance of Neo4J on the kind of commodity hardware that you have there will allow you to do approximately one to two million traversals per second per core, where a traversal is a hop from one node across a relationship to another node. So for one instance you get a lot of ability to explore deeply through graphs very quickly. So of course, the original intent and design of Neo4J is to be an OLTP database, which is sufficiently low latency to power your web apps, normal web apps where you expect a fast response to every request.
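To make the unit of measurement concrete, here is a minimal sketch of what counting traversals could look like. It uses a plain in-memory dictionary rather than Neo4J itself, and the graph contents and the `traverse` helper are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (relationship_type, neighbour).
GRAPH = {
    "Jim": [("KNOWS", "Srini"), ("WORKS_AT", "Neo Technology")],
    "Srini": [("KNOWS", "Alice")],
    "Alice": [("KNOWS", "Bob")],
    "Neo Technology": [],
    "Bob": [],
}

def traverse(graph, start, max_depth):
    """Breadth-first exploration; returns (visited nodes, hops followed)."""
    visited = {start}
    hops = 0
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for _rel_type, neighbour in graph[node]:
            hops += 1  # one traversal: node -> relationship -> node
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited, hops

visited, hops = traverse(GRAPH, "Jim", max_depth=2)
```

Each increment of `hops` corresponds to one traversal in the sense used above; the figure quoted in the interview is how many of those a single core can follow per second.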
Now, to scale that out, Neo4J clusters in a very similar way to MySQL, that is a master slave cluster today, with full automatic failover in the case of machine crashes and so on, which means Neo4J scales very well for reads and scales asymptotically for writes. So for example, compared to Cassandra, Neo4J has less write throughput than something like Cassandra has, but it has equivalent read throughput. And that’s a tradeoff I think you make when you’re looking at Neo4J: you get so much benefit from the insights that graphs provide, so does Neo4J provide enough write throughput for you to get that benefit? Sometimes you might build a system which is so write-throughput heavy that, even though you might want graphy things, you have to pick a non-graph database. For what it’s worth, in the general case, we see large organizations, the likes of Cisco, Adobe, the telecom companies picking Neo4J and subjecting it to some pretty brutal global deployments.
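The read/write asymmetry can be illustrated with a deliberately simplified sketch of master/slave routing. This is not Neo4J's clustering protocol; the `Cluster` class and the machine names are hypothetical and only show why adding machines helps reads more than writes in this style of replication.

```python
import itertools

class Cluster:
    """Toy master/slave cluster: one writer, many readers."""

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master for reads if there are no slaves.
        self._readers = itertools.cycle(slaves or [master])

    def route(self, operation):
        # Every write goes to the single master, so write capacity
        # stays that of one machine however many slaves are added.
        if operation == "write":
            return self.master
        # Reads are spread round-robin over the slaves, so read
        # capacity grows with each machine added.
        return next(self._readers)

cluster = Cluster("master", ["slave-1", "slave-2"])
routes = [cluster.route(op) for op in ["read", "write", "read", "read", "write"]]
```

In this sketch a third slave would immediately take a share of the reads, while the write path would be unchanged.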
Adobe are of course running their Creative Cloud on top of Neo4J, which is an enormous global scale piece of software, Neo4J is sustaining that throughput, which was a real proud moment for us because we worked hard, partnered with Adobe to do this and when it went live we were like “This is going to work”, and when it worked, “Yeah”. For the general case, Neo4J is scaling well, for the future of course we want to build a version of Neo4J that scales, not just for reads but also for writes so that’s what my team in London are working on right now. We are working on a horizontally scalable database, some people call that sharding, we think sharding is an implementation detail, we just want to make a database where you add another machine you get more horse power.
[Srini's full question: Like most scalables are. Speaking of graph based use cases, there is more happening once you select the data from the database, there is more happening in the application layer because you are not writing too much back in the database, you do a lot of processing and then you write to the database. So the use case itself is probably more reads and not so much writes so that kind of falls into that whole how it compares with documents database versus a graph database. ]
For sure. When you do a query in Neo you are read heavy, because you are typically crawling the graph looking for some information goals, so you are typically doing more reads than writes on balance. Because the model in Neo is search rather than compute, you are never going to do anything like dredge everything out of the graph, map reduce and then pour it all back in. You keep data in the database where it’s safe, and of course in Neo safe means wrapped in these lovely things called ACID transactions. So it’s like a predictable safe data model that you’re used to from the mature relational stores.
Indeed. That’s our hope that we can take some of the great pedigree that the relational stores have in terms of transaction processing and provide a more expressive data model to go with it.
Right. I think that’s a good point that you raised, the mapping is easier once you get over the mental barrier of “What on earth is a graph?” And when you think “Well actually I’m quite used to my objects interacting, and that’s kind of a graph” and then once you’ve unwired from tables and you’ve figured “Oh, it’s really natural to say “Srini participated in an interview, Jim participated in an interview”” and then be able to infer that you interviewed me at this time, in this location and so on, actually once you’ve wired graphs that way I now find it hard to go back to tables, I find tables too constraining and they don’t fit my domain modeling kind of practices.
Yes, absolutely, and in a sense I was talking to some guys on the conference floor earlier and I was expressing that I am disappointed that relational databases are called relational databases, because what does a relationship mean in a relational database? You only have one of them, you can’t name it, you can’t put properties on it, whereas in a graph database like Neo4J relationships really mean something: you name them, they have directions, they can have properties, so really we are the relational database because we care about relations.
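As a rough illustration of a relationship that is named, directed and carries properties, here is a small sketch in plain Python. The `Node` and `Relationship` classes and the `role` property are hypothetical; they are not Neo4J's API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str

@dataclass
class Relationship:
    start: Node   # direction runs from start to end
    type: str     # the relationship's name
    end: Node
    properties: dict = field(default_factory=dict)

srini = Node("Srini")
interview = Node("Interview")

# A named, directed relationship carrying its own data.
rel = Relationship(srini, "PARTICIPATED_IN", interview, {"role": "interviewer"})
sentence = f"{rel.start.name} {rel.type} {rel.end.name}"
```

Reading `sentence` aloud gives "Srini PARTICIPATED_IN Interview", which is the kind of statement a foreign key column cannot express on its own.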
9. Speaking of examples for graph use cases, if you think about the real world there are so many examples, our communities, person to person, employers, companies, there are just so many examples that are kind of well fit into the graph use cases.
Absolutely, so the social use cases are the ones that people come to immediately, so I can’t thank Facebook enough for popularizing the term social graph, because that has led so many people to say “I have a similar problem and I need an engine to be able to process it”, and of course social graph is something Neo4J eats for breakfast.
10. Jim, going back to the architecture wise, what do you recommend as far as design considerations or best practices, the database and application developers should take into account when using the graph database, you want to use it in the right place, do you have any guidelines for that?
Sure. Specifically around Neo4J, I think Neo4J shines where you have connected data. If you’re in a domain where the data is disconnected, then Neo4J is not going to give you so much benefit. So first your domain has to fit, you have to have data in which relationships are important. That’s probably the majority of use cases, because there are very few domains where isolated documents or isolated key values make much sense. So the majority of domains are applicable, but even in the kind of domains where you have connected data you would then need to make sure you’re in a balanced read write kind of load pattern. So if you have, again as I said, something with ridiculous write loads, please use Cassandra, Cassandra is awesome for that. If you have something which is more sensible, more balanced rather, then Neo is a good fit.
And in terms of the actual architecture, Neo is normal; it’s a normal clusterable database, so all of the architectural patterns you know still apply. You can still create systems with an app tier and a data tier, but one thing that Neo does allow you to do if you like, because Neo is embeddable, is actually take the Neo database, put it into your JVM process and then your deployment architecture simply becomes one kind of thing, one image that you deploy everywhere, so it’s a single tier architecture. I think perhaps the more interesting challenge isn’t so much around the deployment architecture, where I think Neo is so normal it’s boring, but around data modeling. For many of us modeling in graphs is quite a new skill; we model for relational stores, we kind of know how to do that.
So we go to the whiteboard, we sketch out the domain with our business stakeholders hopefully, we understand how the domain looks and how it all joins together and then we go away and we turn that into tables, into a normalized data model, because we are all good students from the university and we normalize the living daylights out of it. And then someone who really knows databases looks at it and says ‘No’, and then they go away and de-normalize it into something that is like “Wow, ok”. And that’s fine, we know how to do that. The downside of that, of course, is that the thing we sketched on the whiteboard and the thing that we stored in the database are now very different. There isn’t a lot of affinity, whereas when you are modeling in a graph, often the thing you draw on a whiteboard is actually the thing you store in the graph. But we are still learning that; for example I was working with some folks a while ago who were doing email forensics. So they were looking for kind of corporate corruption.
They wanted to know if you and I were exchanging dubious emails about insider trading or something; it was fascinating stuff. We drew a graph that said something like “Jim emails Srini”. And that seemed fine, because we have this rule of thumb that says if your graph is readable from left to right, if you can create a sentence, then it’s probably correct. So we built this graph and it looked great, and then when we came to query the graph, it wasn’t working. What’s going on? It looks right, Jim emails Srini. The mistake we made there was that we collapsed a domain entity into a relationship: actually Jim sent an email and he sent it to Srini. Once we realized we had collapsed that wrongly, we reified this object, this node called an email, and it was sent to Srini, it was cced to Alice and bcced to Bob, and then we found that Srini was an alias of Bob, and then we started to find some interesting and devious patterns.
And I think that was a hard-learned lesson for me: even when the graph reads correctly, you have to be careful that the entities are still surfaced explicitly in the graph. So even though I’m used to graphs I made that mistake. I won’t make it again, but it did catch me out that time.
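The difference between the two models can be sketched with plain tuples. This is an illustrative toy, not the forensics system described above: the edge lists, relationship names such as `ALIAS_OF`, and both helper functions are assumptions.

```python
# Each edge is (start, relationship_type, end).
# First attempt: the email itself is collapsed into a relationship.
collapsed = [("Jim", "EMAILS", "Srini")]

# Reified model: the email becomes a node of its own.
reified = [
    ("Jim", "SENT", "email-1"),
    ("email-1", "TO", "Srini"),
    ("email-1", "CC", "Alice"),
    ("email-1", "BCC", "Bob"),
    ("Srini", "ALIAS_OF", "Bob"),
]

def recipients(edges, email):
    """Everyone who received the email, keyed by how they received it."""
    return {end: rel for start, rel, end in edges
            if start == email and rel in {"TO", "CC", "BCC"}}

def duplicate_identities(edges, email):
    """Pairs of recipients of one email where one is an alias of the other."""
    got = recipients(edges, email)
    return [(a, b) for a, rel, b in edges
            if rel == "ALIAS_OF" and a in got and b in got]

received = recipients(reified, "email-1")
suspicious = duplicate_identities(reified, "email-1")
```

With the collapsed model there is no email node to hang the CC and BCC recipients on, so `recipients(collapsed, "email-1")` has nothing to return; the reified model exposes the pattern directly.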
Absolutely. So for those people who have a semantic web background they look at it and it’s like “Yes, so”, because that’s what you do on the semantic web, you say “I have this resource and it relates to those other resources in this way”, and that’s really what Neo4J does: you have a resource, a noun, and then you declare through relationships how it relates to the rest of its world.
Sure, I mean if you think about what the semantic web people do in the big vision, when they move from “www” to “ggg”, the giant global graph, really what they want is a data structure that agents can run over to discover information goals, and they want to do that at an enormous geographic scale. Neo4J is that same model but simply shrunk into a local database. So it’s the same kind of data model where you have relationships between entities, relationships are very semantically rich, but instead of agents running across the web, I have a little robot, as I like to think of it, which runs over my graph inside Neo4J to find my information goals. So it’s a micro cousin, if you like, of the semantic web, which actually means that if you are building services for the semantic web, Neo tends to make a very good database for hosting that data, because projecting the small graph onto the big graph, which is the web, is quite a straightforward thing to do. There is very little impedance mismatch.
I can’t speak in general because I’m not really aware of the other graph databases’ tool support; I imagine they have something similar to Neo. We have tools like Neoclipse, which is like a graph database workbench, similar to the SQL studio kind of products that you get in the relational world, so a data professional can use that as their portal on to the underlying graph data. Neo itself comes out of the box with a tool called web admin which allows you to explore the graph in a nice visual zingy way and send queries into the graph and so on. There are third party graph visualization engines, and I’ve seen one from a company based in Cambridge, in the UK, where they have wonderful Quake engine style 3D graph visualization for being able to swim through big graphs, which was phenomenal.
What I think is particularly interesting about this area, particularly because most of the graph databases have a friendly open source aspect to them, is what happens with Neo4j: the community finds places where tools don’t exist, where they are needed, and they fill the gaps. So if there isn’t a tool you need, at some point you either are going to build it yourself and share it or someone’s already done that for you. So I think a lovely thing about the graph community as a whole, and this is going to sound quite hackneyed, is that they are quite connected, it’s a friendly connected community. And kind of holistically we are moving this platform along together.
Indeed, absolutely. In fact my colleagues who work in the community for Neo4J talk about it as the community graph and in fact they now have systems in place where they can traverse the graph for people looking to do interesting things and reward them with T-shirts and iPads and stuff so it’s actually working both ways.
Srini: I’m sure they would do anything for the iPads.
Jim: Indeed, they’re hard currency, aren’t they?
[Srini's full question: As you mentioned about visualization in a graph, data is more valuable than when you do that in a database and you’re looking at the relationships and value between that, so I want to bring HTML5 into that discussion here. HTML5 has a lot of visualization features coming up. So do you have any road map, feature that in the future might connect Neo4J database with HTML5, to provide visualization? ]
Srini: Yes, 3D is probably the best way to go for graph.
Jim: It’s so sexy as well when you see it as a geek it’s like “Oh, that’s cute”.
There are two aspects to this. Firstly, if you choose to embed Neo inside your application, then Neo benefits from whatever application security you have, so that’s kind of a dumb thing. The other one is when you run Neo as a standalone server; then we support privacy and integrity in terms of the REST interface, you can run https to Neo, and then we allow you to actually plug in authorization rules that suit your domain. On each web request you can validate whether you like the user or not and whether the user is allowed to access this part of the graph or any of that kind of stuff. So there is a really flexible authorization rule system there. It doesn’t plug in directly to the whole eco-system yet, but in terms of being a configurable secure database it’s there. The only other attack vector that you need to secure, if you’re being particularly security conscious, is the file system, which you encrypt. That of course can have some performance downsides, you have to encrypt and decrypt as you flush data to and from the disk, but if you are being very cautious you will also want to encrypt the file system where Neo stores its data.
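The idea of checking a rule on each web request can be sketched generically as follows. This is not Neo4J's plug-in interface; the `is_authorized` function, the rule table, the user names and the request paths are all hypothetical and exist only to show the shape of a per-request check.

```python
def is_authorized(user, path, rules):
    """Return True if any rule grants `user` access to `path`.

    `rules` maps a path prefix to the set of users allowed beneath it.
    """
    return any(path.startswith(prefix) and user in users
               for prefix, users in rules.items())

# Hypothetical rule table: admin may see everything under /graph/,
# alice may only see nodes.
RULES = {
    "/graph/": {"admin"},
    "/graph/node/": {"alice"},
}

# The check would run once per incoming request, before any graph access.
alice_node = is_authorized("alice", "/graph/node/42", RULES)
alice_rel = is_authorized("alice", "/graph/relationship/7", RULES)
admin_rel = is_authorized("admin", "/graph/relationship/7", RULES)
```

A real deployment would take the user from the authenticated https request rather than from a string argument.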
Srini: Performance, usability and security- you need to find the best balance.
Jim: It’s about the balance, right.
[Srini's full question: So let me switch to the topic of analytics or BI, which is a big thing that happens right now. The term “relationship analytics” is getting a lot of attention lately, and data mining is a big thing in enterprise apps, especially to store, process and analyze the vast amounts of data we have, both structured and unstructured. So where does Neo4J fit into this? Storing and retrieving graph data is one thing, but what about processing and making sense out of graph data results? ]
Absolutely. I’ll have to split that into a few responses if that’s ok. There are two things at play: one is the traditional, highly latent kind of OLAP style BI, and the other one is actually being able to do BI in the clickstream. So for the traditional data warehousing, Neo is absolutely fine for that. What you tend to do though, instead of having an ETL process which takes your transactional data and then dumps it into a data warehouse, is that with Neo you effectively have read slaves off your transactional database, so again you have close affinity between your real data and your data warehouse, and then you’ll run business intelligence jobs on that data. And you can run large jobs because you’re not doing it in the clickstream at this point.
You can run very large queries that may take several minutes to terminate. Of course in Neo4J that several minutes of querying is potentially billions of hops around the database. You can do really sophisticated stuff there. So that’s one thing; I think that if I was a BI vendor I would definitely be looking at “How do I get Neo4J to replace the expensive proprietary databases I’m using”. There are folks like JasperSoft who are already wrapping their reporting tools around the likes of Neo4J. The other one, which I think is more interesting, is doing BI in the clickstream. So I’ve been associated with a few proofs of concept of this in my previous life and the one I was really interested in is retail analytics. As you go through the checkout at the supermarket, particularly in the UK, we are fond of loyalty cards where we sell our data for a few vouchers for groceries. So the supermarkets love this data because they can influence our buying behavior.
What they do today is that they do all this in batches; they gather all of our data into some big data warehouse where they run queries that last several hours, and eventually some vouchers come in the mail to my home. That’s ok.
I may not need them, that’s ok, that’s an ok thing, except what they’ve noticed by studying this is that most of the time I’m going to pick it up, I’m going to say “Oh, that’s junk mail”, and it immediately goes in the recycle bin. There’s no personal touch. So what they really want to do is, as you go through the checkout and pay for your goods, they want to be able to do “in the clickstream BI” on you. Because if I hand you a voucher, human to human, and I look you in the eye, you feel obliged to read it, otherwise that’s rude, right, there’s a certain social protocol. So what they want you to do is say “Oh, my goodness, that is actually a very accurately targeted voucher for me and it’s being given to me by a human, I am going to act on it”. And that accurate targeting is really well facilitated by graphs.
So the example we always use is the case of young fathers. We think young fathers have money. A young father, in our world, is a node which has some data, first name, last name and an age, and it has some connections, bought beer and bought nappies, and many of those young fathers have also bought PS3 or bought xbox360. It’s a really classic pattern for young fathers, or people who love gaming and beer so much they can’t get up from the couch. We’re hoping it’s young fathers. What’s really interesting about young fathers is that not all of them have bought xbox360 or bought PS3, which means there is a sales opportunity there. So Srini, as you come through the checkout counter and we see you buying nappies, sorry, I’ll use the American, diapers and beer, but you haven’t bought an xbox, we can find that out about you in a very short space of time using Neo, and then as you pass through the checkout we print the voucher that says 20% off xbox360 for the next 48 hours. And then I give it to you, you read it, and that’s a really compelling way to change someone’s buying behavior.
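The young fathers pattern amounts to "has these relationships, lacks those". A minimal sketch in plain Python sets, with entirely made-up purchase data and a hypothetical `voucher_candidates` helper, might look like this:

```python
# Hypothetical purchase history: customer -> set of products bought.
purchases = {
    "Srini": {"beer", "nappies"},
    "Jim": {"beer", "nappies", "xbox360"},
    "Alice": {"beer"},
}

def voucher_candidates(purchases, pattern, offer_any_of):
    """Customers who bought everything in `pattern` but nothing in `offer_any_of`."""
    return sorted(customer for customer, bought in purchases.items()
                  if pattern <= bought and not (bought & offer_any_of))

# Bought beer and nappies, but no console yet: print a voucher for these.
candidates = voucher_candidates(
    purchases,
    pattern={"beer", "nappies"},
    offer_any_of={"xbox360", "PS3"},
)
```

In a graph database the same question is asked of the "bought" relationships around one customer node at checkout time, rather than by scanning every customer as this toy does.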
So taking that as a prototype, you can see how you’ll be able to do really highly targeted recommendations for customers, because not only do you take into account their buying history, but Neo4J is sufficiently quick that you can take the household’s buying history, their friends’ buying history, their postcode’s buying history, the buying history of people who own the same cars, people who shop at the same stores, people who bought the same products, so you can actually start to do incredibly highly targeted recommendations in the clickstream. I think that is game changing, and particularly with Neo4J being embeddable, it’s small enough that you can actually put it inside your point of sale terminal, which is like an amazing thing to do.
Srini: More prediction.
Jim: That’s right.
Srini: I’m sure young fathers could use Xboxes as soon as they can get one.
Sure. There are a couple of things that I am really thrilled about in Neo4J. One is the query language Cypher. So Cypher is an ASCII-art based query language where, using ASCII-art sticks and arrows, you can declare to Neo the kind of shapes you would like to find in your data. For example, like the young fathers, you create a sticks and arrows picture and you say “Neo, find me stuff like this”. Andres Taylor and Michael Hunger are the guys who wrote this, and it is a really nice piece of work. Now Cypher is actually becoming sufficiently mature that not only can we query with it, but we can also use Cypher to manipulate and mutate the graph, which is a relatively recent feature, but I am really thrilled that that stuff is maturing because I think over time Cypher will become the single API or the most dominant API for Neo. I mean perhaps we can deprecate some of the Java APIs and stop being so programmatically focused, and start being much more data focused in our API.
And I love that kind of stuff, I think it’s amazing. And the other thing, obviously I have to be keen about this, it’s the stuff I do for my day job: we have a team at Neo Tech, predominantly in London, who are working on horizontal write scale for graphs. Now of course, the discrete math says that’s an NP-hard problem in the general case, so it should be impossible to do, we understand that. We are not trying to solve the general problem, but we are working on ways of being able to get around the general problem and be able to statistically provide very high write throughput even for mutating graphs. So we have the loveliest work days where we’ll read some CS papers, we’ll do some coding, test some theories, and we’ll push out some products, and they pay me for it. Oh, I hope my boss isn’t watching this now.
Absolutely, this is a very innovative space generally to be in, and I think if you talk to some of the guys out here, not just the graph guys, but I love the Basho guys because they are so into the computer science stuff, and I think around the NoSQL space there is just so much innovative computer science going on and it’s just such a joy to be part of that.