00:22:13 video length
Bio Emil Eifrem is CEO of Neo Technology and co-founder of the Neo4j project. Committed to sustainable open source, he guides Neo along a balanced path between free availability and commercial reliability. Emil is a frequent conference speaker and author on NoSQL databases.
Neo4j is a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j runs a thousand times faster than relational databases.
Sure, so I’m the co-founder of Neo4j, which is an open-source project, and the CEO of Neo Technology, which is the commercial sponsor of Neo4j. As CEO I set strategic direction - mostly I take out the trash and do all of the things that other people don’t want to do.
I get some time to code, but very little. My joke is that I do write code these days, but it’s typically in PowerPoint or in Keynote, and I never contribute any code to the product anymore. I do use the product - from a user perspective - but unfortunately not as a producer.
I think it is. One of the things that we observed early on is this notion that we now call “whiteboard friendliness”: graph databases are whiteboard friendly. That is an observation we made when we first started building external projects for Neo4j. Quite a while ago now, the first thing we did as consultants - that’s basically what we were back then - when we started engaging a client was to get them into a room with lots of whiteboards and start brainstorming around the domain.
What we saw there was that very seldom did these people - who weren’t plagued with a computer science degree - know anything about first normal form or ER modeling or anything like this, right? But they were awesome at pensions or retail or whatever the domain was, and when we brainstormed, very seldom did they end up drawing tables. Maybe if you were building something like a payroll system, maybe that’s very tabular, but a retail system is like “Hey, we have a customer here and yes, a shopping cart, actually two shopping carts potentially, and those include orders which include order items which refer back to a product, and actually a product belongs to a category.”
And so they end up brainstorming like this, and the amazing thing we saw was that six months down the line, when we’d put that in production, there was a one-to-one mapping between what was written on the whiteboard and what was actually reflected in the graph. I believe that if you talked to some of the developers who were actually writing code between the whiteboarding sessions and it being up and running in production, the cognitive model of the programmer would be that of the whiteboard. They thought of the domain in that way, and having that one-to-one mapping represented in the graph is, I think, extraordinarily powerful.
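A minimal sketch of the retail whiteboard described above, written in plain Python rather than with Neo4j's own API. The `Node` class, the relationship names (`HAS_CART`, `INCLUDES`, `REFERS_TO`, `BELONGS_TO`) and the sample property values are all assumptions made for illustration; the point is only that each box and arrow on the whiteboard maps to one node or one typed relationship.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a label, key-value properties and outgoing typed relationships."""
    label: str
    props: dict = field(default_factory=dict)
    out: list = field(default_factory=list)  # list of (rel_type, target_node) pairs

    def relate(self, rel_type, target):
        """Add a typed relationship to `target` and return `target` for chaining."""
        self.out.append((rel_type, target))
        return target


# One node per box on the whiteboard, one relationship per arrow.
customer = Node("Customer", {"name": "Alice"})
cart = customer.relate("HAS_CART", Node("ShoppingCart"))
order = cart.relate("INCLUDES", Node("Order"))
item = order.relate("INCLUDES", Node("OrderItem", {"quantity": 2}))
product = item.relate("REFERS_TO", Node("Product", {"name": "Espresso machine"}))
category = product.relate("BELONGS_TO", Node("Category", {"name": "Kitchen"}))


def path_labels(start):
    """Follow the first outgoing relationship from each node and collect the labels."""
    labels = [start.label]
    node = start
    while node.out:
        _, node = node.out[0]
        labels.append(node.label)
    return labels


print(path_labels(customer))
```

Walking from the customer follows the same chain that was spoken aloud in the brainstorm: customer, shopping cart, order, order item, product, category.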
3. But aren’t there performance trade-offs with graph databases? Graphs can quickly become very complex - as they are with social graphs for instance. Surely there are some performance trade-offs that occur when you traverse these large, complex graphs.
Yes, I mean, there is always a tradeoff, right? However, I think when it comes to performance on complex and connected data, the graph database shines. So Neo4j, as an example, with warm caches can do between one and two million hops in a graph per second. I mean, that’s pretty damn fast, and you can cover a lot of ground, you know, with a couple of million hops in a graph.
However, if you have very tabular, structured data and you want the average age of all the records that have an age property, that’s not a good operation for a graph database, that’s basically “visit every single node and take an aggregate and do an average of that.” The relational database is perfect for that.
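The contrast between the two kinds of operation can be sketched in a few lines of Python. This is a toy in-memory model, not Neo4j; the names `ages`, `friends`, `within_hops` and `average_age` and the sample data are assumptions. A bounded traversal only touches the neighborhood of its start node, while the average has to read every record regardless of how the data is connected.

```python
from collections import deque

# Hypothetical sample data: an "age" property per node and a friendship graph.
ages = {"ann": 31, "bob": 45, "cid": 27, "dee": 52, "eve": 38}
friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann"],
    "cid": ["ann", "dee"],
    "dee": ["cid"],
    "eve": [],
}


def within_hops(start, max_hops):
    """Breadth-first traversal: visits only nodes reachable within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in friends[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def average_age():
    """Global aggregate: has to visit every single node."""
    return sum(ages.values()) / len(ages)


print(sorted(within_hops("ann", 2)))
print(average_age())
```

The traversal never looks at `eve`, who is not connected to `ann`; the aggregate cannot avoid her. That second access pattern is the one a tabular store is built for.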
So it all comes down to choosing the right tool for the job, which fundamentally is what NoSQL is supposed to give us, right? We are now at a point where the right answer isn’t always the relational database. It’s not always the wrong answer either, but at least now, as engineers who take our craft seriously, we should look at the data layer and choose different tools that are appropriate for the job.
So I didn’t look at it from an industry perspective, I had a very urgent need myself and that’s what I tried to solve. We were a small team building an enterprise content management system; at the time we called it a media asset management system. It was essentially a big file system on the web, which is what a content management system really is.
We were twenty engineers, and half of our engineering team spent the majority of their time just modeling this big connected dataset into the relational database. So we evolved that, and what we really did is, we said “Hey, let’s implement an abstraction layer which works just with nodes and relationships, all in the business layer of the application.” So it lived on the application server, not on the database side at all, but then we started expressing everything in our domain using this language: nodes, relationships and properties. And all of a sudden we realized “Wow, this is really powerful.”
We still didn’t solve any of the runtime problems because we still had to get all this connected data into the relational database, but at least we saw that this abstraction was really powerful. And then after a while we started iterating on that and working with that and after a while we said that “Why aren’t any databases out there that natively expose this abstraction, because it seems it works so well for at least these things that we’re doing.”
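A rough sketch of what such an application-side abstraction layer over a relational store could look like, using Python's built-in `sqlite3` module. This is not the team's original code: the table layout, the function names (`create_node`, `relate`, `neighbours`) and the sample file names are assumptions. The application speaks in nodes, relationships and properties, while the database underneath still stores rows and answers each hop with a join.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, props TEXT);
    CREATE TABLE rels (src INTEGER, type TEXT, dst INTEGER);
""")


def create_node(**props):
    """Store a node's key-value properties as JSON and return its id."""
    cur = db.execute("INSERT INTO nodes (props) VALUES (?)", (json.dumps(props),))
    return cur.lastrowid


def relate(src, rel_type, dst):
    """Store a typed relationship from src to dst."""
    db.execute("INSERT INTO rels VALUES (?, ?, ?)", (src, rel_type, dst))


def neighbours(node_id, rel_type):
    """One hop in the graph becomes one join in the relational store."""
    rows = db.execute(
        "SELECT n.props FROM rels r JOIN nodes n ON n.id = r.dst "
        "WHERE r.src = ? AND r.type = ? ORDER BY n.id",
        (node_id, rel_type),
    )
    return [json.loads(props) for (props,) in rows]


folder = create_node(name="press-kit")
logo = create_node(name="logo.png")
video = create_node(name="launch.mp4")
relate(folder, "CONTAINS", logo)
relate(folder, "CONTAINS", video)

print(neighbours(folder, "CONTAINS"))
```

The calling code reads like the domain, but every additional hop costs another join, which is the runtime problem the abstraction alone does not solve.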
We started looking around - this is the late 1990’s and early 2000’s - and we didn’t find anything, and at this point in my career I was young enough and naive enough to think that if I had a technical idea, I’d say “Screw it, let’s go for it”, and so that’s what we did. We basically set out and said “Hey, it would be fantastic to have a database which exposes these building blocks: nodes, typed relationships between nodes, and key-value pairs on both of these.” But in all other respects it’s like a relational database. It has all the good things that a database has: full ACID transactionality, which is something that we fundamentally embrace, full robustness, persistence, all the good things of a relational database, but with a model that is, I think, relevant for more people today than when the relational database was invented.
I think we’ve gone through multiple phases in terms of how we as an industry talk about graph databases. Now I’m not delusional, at least not too delusional: most of the people in the industry aren’t talking about graph databases. However, the talk, about two or three years ago, was more like “What is a graph database?”.
Today a lot of people that I meet, at least at conferences like this one, QCon, where you have people who are highly interested in our trade, know of the graph database, but we’re now in a phase where a lot of people peg graph databases as only useful for social: graph database equals social graph, right? Or social graph equals graph database, which is great, that’s fantastic for a start.
The funny thing is that when you talk to actual customers or users of Neo4j, what they will tell you is one of two things. Either “Wow, this is a completely horizontal hammer that I can use across the board; you know, I started using this for this particular thing, but I can use it across my organization.” Or they say that graph databases are really useful for this particular problem they’re solving, but not for anything else, except that when you talk to other customers, they say the same thing but about their particular problem, and their particular problem, and their particular problem. And if you sit where I sit, where I see a lot of deployments of graph databases, and you start mapping these, you see that a very horizontal map is appearing, which suggests that we’re in this evolution now where the market is becoming increasingly aware of both the existence of graph databases and their wide applicability. We see early adopters in anything software and, in terms of industries, in finance and insurance, telecom, and datacom.
That’s a good question. So when you compare a product, it sort of depends on how much you zoom out to say who you compare it with, right? And if we zoom out to include all of NoSQL, I would say that by far the high order bit, when you compare Neo4j to, say, a random NoSQL database, is that it’s a graph database, so it’s the data model, right?
The fact that we work with nodes, typed relationships between nodes and key-value properties, compared to something like a key-value store - which works with key-value pairs - means the data model is by far the high order bit when you compare Neo4j to other NoSQL products.
Another thing where we’ve chosen sort of a contrarian view, or maybe a little bit of a different path than most of the NoSQL offerings, is transactionality. We actually like really strong consistency. Neo4j is fully ACID compliant: XA protocol, two-phase commit, all that good stuff. So we have customers running Oracle and Neo4j side by side, participating in two-phase commit transactions, so if you write something to Oracle and the same record to Neo4j, or an independent record to Neo4j, you’re going to have real strong consistency semantics. If it fails in Oracle, it’s not going to commit in Neo4j, and vice versa. However, if you have a strongly consistent core you can always relax that consistency. When we do big scale-outs, not two machines or three machines but ten machines or fifty machines or whatever, then we relax that consistency, or the customer, the user, can choose to relax that consistency and have an eventually consistent setup.
So in that sense we can have, I think the best of both worlds: we can choose strong consistency when it makes sense - which honestly is most of the cases - and eventual consistency when you have high enough scale requirements that you’re forced to take the development pain of eventual consistency. I don’t think that’s unique to NoSQL but it’s definitely unusual.
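The "if it fails in one, it does not commit in the other" behavior can be illustrated with a toy two-phase commit coordinator. This is a simplified sketch under stated assumptions, not the XA protocol and not Neo4j's or Oracle's API: the `Participant` class, its `will_fail` flag and the `two_phase_commit` function are hypothetical, and a real implementation also has to handle crashes, timeouts and durable logs.

```python
class Participant:
    """A stand-in for one database taking part in a distributed transaction."""

    def __init__(self, name, will_fail=False):
        self.name = name
        self.will_fail = will_fail  # simulate a store that cannot prepare
        self.committed = []
        self._pending = None

    def prepare(self, record):
        """Phase 1: stage the write and vote yes (True) or no (False)."""
        if self.will_fail:
            return False
        self._pending = record
        return True

    def commit(self):
        """Phase 2, success path: make the staged write visible."""
        self.committed.append(self._pending)
        self._pending = None

    def rollback(self):
        """Phase 2, failure path: discard the staged write."""
        self._pending = None


def two_phase_commit(record, participants):
    """Commit on all participants only if every one of them votes yes."""
    votes = [p.prepare(record) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


relational = Participant("relational")
graph = Participant("graph")
print(two_phase_commit({"id": 1}, [relational, graph]))

broken = Participant("relational", will_fail=True)
graph2 = Participant("graph")
print(two_phase_commit({"id": 2}, [broken, graph2]))
print(graph2.committed)
```

In the second run the graph participant had staged the write, but because the other participant voted no, nothing is committed anywhere.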
So that was the first part of your question, about how we compare to NoSQL. The second part of your question is how it compares to other graph databases. So one of the wonderful things about graph databases is that when we started talking about it, five years ago, there was not a single other graph database on the planet; now there’s twenty, thirty, fifty, I don’t know, a lot of graph databases. Everything from FlockDB, a very specialised graph database built by Twitter, to InfiniteGraph, a very commercial, closed-source, general purpose graph database built by a company called Objectivity.
There are too many of them for me to be able to answer feature by feature, but the one thing that makes Neo4j stand out is that it is, I think by any kind of metric that we’ve been able to figure out, by far the most popular one. We have more production deployments and more users than all other graph databases combined. And while that doesn’t mean that we’re the right tool for any kind of job, it does mean that there is by far the biggest ecosystem around Neo4j: there are adapters to any kind of framework out there, there are cloud providers, Heroku add-ons, adapters to Spring and Rails, etc.
So I think the use cases for that are exactly the same as the use cases on premise, it’s just a matter of time. I mentioned earlier that Cisco deployed a master data management system based on Neo4j; we replaced Oracle RAC, which is the biggest and baddest of Oracle’s databases, a multimillion dollar kind of thing, right? Cisco threw that out and replaced it with Neo4j; that would not happen on the cloud. Master data management is sort of hardcore IT, very central, integral. A big Global 2000 corporation is not going to deploy that on Heroku, it’s just not going to happen, TODAY.
“Tomorrow it will”, and I don’t know when tomorrow is, by the end of the decade certainly. I’m one of the people who believe that all software, for some definition of all, will be delivered over the cloud eventually. So that’s why it’s important for us to be available on Heroku and other cloud providers.
Sure, so first of all, both of them are open source but they have a different license. So Neo4j community is available under the GPL, which is the same license that MySQL is using, so the rule of thumb is that any time you could use MySQL for free, you could use Neo4j for free, which means that it is in all situations except OEM. If you take it, take Neo4j and you embed it with your product and then ship that to a customer, you need to have a commercial license. That’s Neo4j community.
Then there’s Neo4j enterprise which is available under the AGPL and the AGPL is like the GPL but it’s also viral across the firewall, so across the web and the background of the AGPL is basically the Google use case for MySQL, where we found out after a while that Google was able to use MySQL behind the firewall to, let’s say serve Gmail using MySQL. Even though MySQL uses the GPL, they were supposed to have open sourced their software; however, since there was no shipping involved and the GPL was written in ’92, I believe, the AGPL was invented to solve that loophole, which means that even if you’re a web service - if you use something under the AGPL - you can use it for free but you need to open source your software.
And that’s the difference, in terms of open source, between Neo4j community and Neo4j enterprise. In terms of functionality, Neo4j community is a fully featured graph database, with all the bells and whistles that you want, on a single server. Neo4j enterprise, however, is the cluster edition, so you can get a highly available setup, where everything you write to one machine is going to be replicated across two, three, five, ten, twenty, however many machines you want to have, which is going to give you fault tolerance and read scale-out, so it’s basically peace of mind. This means that most people who choose to go into production with Neo4j, where it’s mission critical, choose to go with Neo4j enterprise. If they’re willing to open source their software they can do it for free, not pay us a dime. Most of the companies obviously don’t want to open source their software, so then they buy a commercial license.
We have a couple of important upcoming themes for 2013: one of them is around scale-out. Scaling out horizontally is possible today with Neo4j using a cluster approach, which means that everything you write to one machine will be propagated and replicated to all the other machines participating in the cluster.
However, ultimately what you want to be able to do is to “shard” or partition the graph. Let’s say that we have a big, you know, a thousand billion node graph or something like this, right? We want to take part of that graph and put it on this machine over here, or this set of machines, and take another part of the graph and put it over here, which is called sharding, or specifically autosharding, right? You can shard with any database, which just means that you have to do a lot of work on the client to partition the dataset. We think that it is possible to do that automatically with a graph database, and that’s one of the Holy Grails of scaling, because currently there is a bunch of databases out there that can do autosharding but they all have trivial data models. They have data models that force you as a developer to make all these contrived decisions and not be able to represent your domain in all its glory, and fundamentally I have to believe that as developers we should focus on the domain a lot more and not get too caught up in the latest technology fads.
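Why partitioning a graph is harder than partitioning key-value records can be shown with a small experiment. This sketch is not how Neo4j shards data; the sample `edges`, the two placement functions and `cut_edges` are assumptions for illustration. It counts how many relationships end up crossing machine boundaries, since each crossing turns a cheap local hop into a network call.

```python
# Hypothetical graph: two tightly connected clusters (0-3 and 4-7)
# joined by a single relationship (3, 4).
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (3, 4)]


def shard_by_modulo(node, shards=2):
    """Key-value style placement: ignores how nodes are connected."""
    return node % shards


def shard_by_range(node, shards=2, total=8):
    """Placement that happens to keep each cluster on one machine."""
    return node * shards // total


def cut_edges(edges, shard_of):
    """Count relationships whose two endpoints live on different shards."""
    return sum(1 for a, b in edges if shard_of(a) != shard_of(b))


print(cut_edges(edges, shard_by_modulo))
print(cut_edges(edges, shard_by_range))
```

With the connection-blind placement every relationship crosses shards; with the cluster-aware placement only the single bridging relationship does. Finding a good placement automatically, and keeping it good as the graph changes, is the hard part.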
Eric Evans visited GraphConnect (2012) yesterday and gave a great talk about domain modeling with graphs, and I think he’s been spot on with these thoughts. However, all these existing databases, the key-value stores and the document databases of the world, give you great scale-out scalability, but they force you to sacrifice the expressiveness of the domain.
Now a graph is the most versatile data structure that at least I’ve been able to figure out; it’s the superset of linked lists, the superset of all of these kinds of data structures. So it can very efficiently represent the world in any domain. If we can also get that to scale out horizontally, that’s going to be world-changing, amazing, and that’s one of the core things we are working on right now. We’ve worked on it for a long time, we’ve had several breakthroughs, and I think it’s going to be an amazing year when it comes to this kind of scalability.
But that’s only one part of the things we are working on. Another thing is that at this point we’ve been, I think, great at providing a very robust and mature, highly performant graph database, which can solve real world problems for many customers out there in mission critical deployments. We have not been as good as we could be, I think, at providing this amazing power, this graph model, in as easy-to-consume a product or package as I would like.
So we have a very active and thriving community. Go to Neo4j.org; there’s a bunch of links to community resources like the Google group, obviously, and of course Stack Overflow, which is also a good starting point if you have end user questions. And then just generally show up on the mailing lists. There are meetups; we just started working in a little bit more structured manner with the meetups. There were, I believe, one or a handful of meetups at the end of last year, at the end of 2011, and today I think we have twenty-five or forty meetups in the world, so it’s just exploding. There are many ways to get involved and Neo4j.org is your starting point.