Transcript
Meyer: Peter Boncz is one of the preeminent database researchers on the planet. He basically gave the modern shared-nothing column store its form. This is him being somewhat critical of graph database systems. I'm a graph database implementer, so that's a little bit nerve-wracking. This is a great talk if you're a database person. However, it's quite technical, so I thought I would unpack it a little bit for you guys. How many people out here have teenage children? A graph database is like teenage sex. They're all talking about it. Some of them are doing it. The ones who are doing it are not doing a very good job. I'm here as your own graph database teenager.
I'm going to talk about LIquid. LIquid is a relational graph database that we built at LinkedIn, runs about 2 million QPS. It's a pretty standard, large-scale Silicon Valley database system. As is usual with these things, I didn't build this by myself. I have a wonderful and talented team of probably 40 people now who make this thing a reality. I'm shamelessly going to claim credit for their work.
Browsing a Large Graph
First question is, why did LinkedIn build a graph database? The recruiter who brought me to LinkedIn explained LinkedIn to me. He said, the magic of LinkedIn is in the second degree. Your first degree, all your connections, you already know who they are, like you have your Rolodex and your email. It's not very valuable. Your second degree is all of the connections that you could have, but maybe don't yet. Maybe your next job is in your first degree, maybe you'll go work with a friend, but probably your next job is in your second degree. What we're doing fundamentally, when we're looking at LinkedIn, is we're browsing a large graph, and we're figuring out what about that graph is interesting to us. Here's some of the stuff that comes out of the graph database. You notice, these are all second degree connections. What about a graph is interesting to me? It's some sort of second degree connection, like maybe we worked at the same employer, maybe we went to the same school, maybe we know some of the same people.
Graph Workloads
This is just one of three graph workloads. What we're doing is graph serving. Database people would call this complex OLTP. What we're doing is we're browsing a graph; this is the predominant application workload. Even things like PowerPoint or Word, what they're doing is they're browsing a graph of stuff almost all the time. You do a little bit of editing. The important thing about serving is we need to work in human real time, like about a quarter of a second. Maybe half a second, but it's got to be about that fast. You can also do graph analytics. Database people would just say complex OLAP, or maybe just OLAP. There you have an analyst typing in a query, and they're happy to wait for a few minutes, or maybe they're running a dashboard, and it runs once an hour and comes up with a graph for somebody to look at. There, we're looking at something that scans a whole bunch of the graph and summarizes it. It's going to take minutes probably. Then there's a third workload, which is graph computation, things like PageRank, Bellman-Ford, where I want to run an algorithm over the whole graph. These things typically take minutes or even hours, depending on the size of the graph and the algorithm. What we're doing is just graph serving.
Background
Handy rule of thumb in software development, never invent something that's in a textbook. All of the background for this talk and this product is basically textbook computer science, literally. Specifically, the relational model and datalog. This is actually the real textbook that we use as a reference on the team.
Relational Graph Data
Relational graph data, I'm going to throw some SQL at you. I'll speak it slowly. Basically, relational graph data is two tables. I have a table of vertices. This is just mapping strings to integers, that's all it's doing. A vertex is represented by a string. Inside the database we'd like to represent it with a fixed-length integer. That's what this does. Now the exciting part: I have a table of edges. An edge is just three vertices. This is a compound primary key. The edge, you specify it by value, and it either exists or it doesn't exist. Very simple. Pause right here, this is like 90% of the content of this talk right here. It just so happens, like a lot of things in software, that the remaining 10% of the talk will take the other 90% of the time. Let's look at some graph data. What we're talking about is basically just three columns of integers like this. I've shown these guys sorted; if this were a normal SQL database, they'd be sorted like this because of the primary key constraint. Let's think a bit about what it's going to be like to work with this. Suppose you start with a bunch of things, and you want to transform them, you want to navigate through an edge, one edge for each thing, or multiple edges for each thing. You notice the things are all sorted, and the subjects are all sorted, so this is great. We do a merge join. What we get in the output is no longer sorted, it's not even a set. If we were doing a complex query, where we do multiple hops through a path, we would have to re-sort this thing in order to keep on going, or do random access. There's another aspect to this. This example is really small, but in any real database, an edge does not contain a lot of information, so you're going to have a lot of edges, think a trillion edges. I said our workload is browsing, so I have to produce a screen full of data for someone to look at. I have to do this in about a quarter of a second. What I can tell you then is, if I only have a quarter of a second, and I'm looking at a trillion things, I can't actually scan very many things. Maybe I can scan a few hundred thousand, maybe a million things, a million edges. We're very limited in the amount of data that we can have as an intermediate result, just by the nature of the domain. Now think about it: I have an intermediate result, and it's just going to be a bunch of edges. Everything is edges. It's got like 100,000 things in it. I'm doing a lookup into a table of a trillion things. It's just incredibly improbable that a page that I read for one thing would contain the answer for the next thing. Just by the law of large numbers, what you're doing when you work with graph data is random access. The key insight here is we're doing random access at scale. There's absolutely no way around this. The index structures that we're using are hash tables. The performance domain that we're working in, we're going to be counting L3 cache misses, because we're doing random access into memory. The entire graph is held in hash tables in memory.
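To make the two-table picture concrete, here is a minimal Python sketch of the idea, not LIquid's actual storage layer: a vertex table interning strings as integers, an edge table of (subject, predicate, object) triples with a compound primary key, and hash indexes so that looking up a vertex's edges is a constant-time probe rather than a scan. The class and field names are invented for illustration.

```python
from collections import defaultdict

# Illustrative sketch only: the two-table relational graph representation,
# held in hash tables to support random access.
class Graph:
    def __init__(self):
        self.vertex_ids = {}        # vertex table: string -> fixed-length integer id
        self.vertex_names = []      # integer id -> string
        self.edges = set()          # edge table: (subject, predicate, object) id triples
        self.by_subject = defaultdict(set)   # hash index: subject id -> edges
        self.by_object = defaultdict(set)    # hash index: object id -> edges

    def vertex(self, name):
        """Intern a string and return its integer id."""
        if name not in self.vertex_ids:
            self.vertex_ids[name] = len(self.vertex_names)
            self.vertex_names.append(name)
        return self.vertex_ids[name]

    def add_edge(self, subject, predicate, obj):
        """An edge is just three vertices; it either exists or it doesn't."""
        edge = (self.vertex(subject), self.vertex(predicate), self.vertex(obj))
        if edge not in self.edges:           # compound primary key: no duplicates
            self.edges.add(edge)
            self.by_subject[edge[0]].add(edge)
            self.by_object[edge[2]].add(edge)

g = Graph()
g.add_edge("herman_melville", "wrote", "moby_dick")
# One hash probe, a handful of cache misses, gets all the outbound edges.
print(g.by_subject[g.vertex("herman_melville")])
```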
Why only 3 things? Why not 2, why not 4? Seems like an arbitrary choice. The first reason: it's tractable to index everything. 3 factorial is 6, and 6x is a reasonable write amplification if you're careful. We can just index everything and say, edges are fast, there's no create index. They're just uniformly fast, they deliver random access. That's a guarantee that we can make. That's super handy. Second of all, as subject, predicate, object would lead you to believe, this is how people encode knowledge. We are building database machinery that deals with knowledge the way people like to encode it, which seems like it ought to be a good idea. There's a subset of people, computer programmers, and they like to use structs and pointers when they're building a model of the world in an application. Because the whole point of having a database is to build applications with it. Let's go look at that world a little bit. You have a triple, like Herman Melville wrote Moby Dick, just the fact. If you're an application programmer, you're probably going to model this as like, I have a struct author, and here's the one for Herman Melville. He wrote a book. He could have written many books, so there's a vector right there. Then, one of the books that he wrote is Moby Dick. Similarly, the Moby Dick structure, it's a book. Books were written by an author, and there's the author. When you're programming with structs and pointers, what you're really doing is you're working with a hand-built inverted index of the underlying relation.
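The "index everything" point can be sketched the same way. With three columns there are 3 factorial = 6 orderings, so a store can, at roughly 6x write amplification, keep a hash index for every permutation and answer any lookup with a single probe, which is essentially the hand-built inverted index an application programmer builds with structs and pointers. A toy illustration, not LIquid's data structures:

```python
from collections import defaultdict
from itertools import permutations

# Toy illustration: index each (s, p, o) edge under all 3! = 6 column orderings,
# so any two known values resolve the third with one hash lookup.
indexes = {perm: defaultdict(set) for perm in permutations("spo")}

def insert(s, p, o):
    edge = {"s": s, "p": p, "o": o}
    for perm, index in indexes.items():
        key = tuple(edge[c] for c in perm)
        index[key[:2]].add(key[2])   # probe by the first two columns, get the third

insert("herman_melville", "wrote", "moby_dick")
# "Who wrote Moby Dick?" is a probe of the (predicate, object, subject) index.
print(indexes[("p", "o", "s")][("wrote", "moby_dick")])
```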
Datalog
Some good news, this is just a relational database, we can do this with SQL. We have a much better syntax. Interestingly, it's been sitting there in a textbook for 30 years, it's datalog. This datalog that you see on the screen is the same query that the SQL was doing. You see, it has a natural decomposition into rules. It's a basic social graph type of a thing. I'm looking for employers and skills of friends of friends, employers and skills in my second degree. These are datalog rules. In datalog, anything with a period is a statement of fact. In our datalog, the only thing you can assert is an edge at the bottom. Everything is assertions with edges. Here I'm saying, user 1, their name is Scott. Pretty straightforward. Anything with a question mark is a query. Underbar is a variable, which just means, I'm different from all other variables. Every underbar is different. All you're doing here is you're asking, what is the name of user 1? Pretty simple. I can define a rule, like, here's a rule called named. The rule just evaluates the right-hand side, same sort of thing. I can ask, who has this name? Or, what is the name of this thing? Just by supplying different variable bindings. A cool thing that you can do with datalog and edges is you can say, user 1, what's up with that? Like all the inbound edges on user 1. This is a very typical first step, when you're exploring a new database with a schema that you're not familiar with. You find something that you know, typically yourself, and then you go explore just by looking at the inbound edges. It's something that's very difficult to do in SQL.
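The datalog on the slides is LIquid's own syntax, so rather than guess at it, here is a hedged Python analogue of what a rule like named does over the triple set: a rule is just a reusable subquery, and an unbound variable (the underbar) means match anything. The edge values are invented for illustration.

```python
# Hedged analogue of datalog rules and queries over a set of (s, p, o) edges.
edges = {
    ("u:1", "name", "Scott"),
    ("u:2", "name", "Ben"),
    ("u:2", "knows", "u:1"),
}

ANY = None  # plays the role of datalog's underbar: "some value, I don't care which"

def match(s, p, o):
    """Return every edge consistent with the given bindings."""
    return [(es, ep, eo) for (es, ep, eo) in edges
            if s in (ANY, es) and p in (ANY, ep) and o in (ANY, eo)]

def named(x=ANY, n=ANY):
    """Rule: named(x, n) holds when there is an edge (x, "name", n)."""
    return match(x, "name", n)

print(named(x="u:1"))          # what is the name of u:1?
print(named(n="Scott"))        # who has the name Scott?
print(match(ANY, ANY, "u:1"))  # all inbound edges on u:1, the "what's up with that?" query
```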
Predicates
Our two-table representation actually lost some valuable information. We need to add this information back. We're going to do this by formally defining predicates. Here's a rule called DefPred. You have to use DefPred to introduce a predicate. This is a predicate for title, like the title of a book, and here's what the definition says. On the subject side, a book only has one title, so it's cardinality 1. The type of a book, it's just a node. On the object side, there's a string, a UTF-8 string, so our type is a string. Of course, many books can have the same title, so it's 0, meaning unlimited cardinality. All you have to do to start using a new predicate is put this DefPred in, and then you can immediately start using it. For example, book 1's title is Moby Dick. The predicate itself is just a node in the graph like any other node, so you can add your own metadata to predicates. For example, here I'm saying there's a kind of predicate that's a name-like thing. A title is like a name. Name is like a name. I'm going to identify these as being of type Known As, just by writing edges in. Now I can ask queries like this. What we're saying is there's a thing, u:1, I don't know much else about it. It's referred to by an edge with some kind of a predicate. The kind of predicate is anything referred to by an edge with the type Known As. I'm able to define what a name-like thing is, and I'm able to ask for it in a query. If you look at what an application does, very often this is exactly what they need. I have a little card, it has a name, it has a picture, and so forth. I use this to display anything. I don't care, as long as it has a name, and a picture, and maybe some descriptive text. That's predicates.
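Because the predicate is just another node, predicate metadata is ordinary edges, and "give me any name-like value" is a two-step lookup: find the predicates typed Known As, then find edges using them. A small sketch in the same toy representation; the predicate and type names are made up for illustration.

```python
# Sketch: predicates are nodes, so predicate metadata is just more edges.
edges = {
    ("title", "type", "KnownAs"),        # title is a name-like predicate
    ("name",  "type", "KnownAs"),        # name is a name-like predicate
    ("b:1",   "title", "Moby Dick"),
    ("u:1",   "name",  "Scott"),
    ("u:1",   "employer", "LinkedIn"),
}

def name_like_values(entity):
    """Constrain through the predicate: values of any predicate typed KnownAs."""
    name_like = {s for (s, p, o) in edges if p == "type" and o == "KnownAs"}
    return [o for (s, p, o) in edges if s == entity and p in name_like]

print(name_like_values("u:1"))   # ['Scott'] -- employer is not name-like
print(name_like_values("b:1"))   # ['Moby Dick']
```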
N-Ary Relationships
We're not quite done yet, because the predicate is a simple binary relationship. Most relationships in the real world are not simple binary relationships. Let me give you some examples. Grammar is more than subject, predicate, and object. Here are some real relationships that are not subject, predicate, and object. You work for a company, that's employer, employee, and a start date. If I don't have the start date, then I can't represent a common situation in the world where you work for a company, and then you quit and you go work for another company. Then you quit and you go back and work for the first company. That's actually a super important piece of data. LinkedIn calls this a boomerang. Similarly, a marriage, spouse, spouse, start date. You have people like Elizabeth Taylor, who are married a lot, sometimes to the same people. Again, if you want to model the data, you need more than a binary relationship. Another one, actor, role, and film, endorser, endorsee, skill. N-Ary relationships happen all over the place. We need support for these guys in a graph database. One more thing, a lot of relationships have attributes that describe them that are not part of the identity, like a super common one in our world is a score. This relationship exists, but how important is it? Herman Melville wrote a book, but what's his most important book? Which one should you show first? Other things might be like a stop date, or maybe a title, something like that.
The basic intuition here is, if I have an edge, and my edge traversal is fast, then it's not really objectionable just to add an extra hop. If going across one edge is going to cost you two L3 cache misses, going across two is going to cost you four. That's not outrageous, and it honestly represents the structure here. Your big question is, what do I use for the identity of this central node? If I just make one up, now this N-Ary relationship doesn't behave like an edge, you can get duplicates. If you have duplicates, then you might add interesting data to the wrong one. It's not really an effective way to work. We can actually solve this problem. We're just going to have a function which generates a string, which is the identity of the central node. You can read the string and see how this works: I'm going to take all the predicate and object pairs, and I'm going to sort them. That's the identity of this hub node. Once we've figured that out, because of the way edges work, N-Ary relationships now work just like edges, so we get this for free. You can annotate. You can do anything you want with the central node. It's just a node, it's not going to bite anybody. You can put whatever data you want on there. The data is just edges, we don't need to come up with some fancy schema to talk about this stuff. LIquid does what we call compound predicates. This is just the syntax of that working.
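A minimal sketch of that identity function, with the exact string format invented for illustration rather than taken from LIquid: sort the identity (predicate, object) pairs and concatenate them, so the same facts always produce the same central node, and non-identity attributes hang off that node as ordinary edges.

```python
def hub_id(*pairs):
    """Identity of an n-ary relationship's central node: the sorted
    (predicate, object) pairs, rendered as a string."""
    return "(" + ",".join(f"{p}:{o}" for p, o in sorted(pairs)) + ")"

# An acted-in relationship: actor, role, and film identify the relationship.
h = hub_id(("actor", "mark_hamill"), ("role", "luke"), ("film", "star_wars"))

# Attributes that are not part of the identity, like a score, are just more edges.
edges = {
    (h, "actor", "mark_hamill"),
    (h, "role",  "luke"),
    (h, "film",  "star_wars"),
    (h, "score", "0.97"),
}
print(h)   # same string no matter what order the pairs arrive in
```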
What is this thing that we built? This has been known to relational database types for a long time. I think in the '70s, Peter Chen came up with this notion of Entity-Relationship modeling. This thing that we've come up with is just a relationship table. It says, there's a compound primary key of actor, role, and film. Then you have some attribute columns, whatever you want. This thing that we invented is actually completely normal, well understood 40-year-old SQL. There's a nice bonus that you get out of this. Suppose we wanted to do a symmetric relationship, like a mutual friend. You'd start off in SQL. It's like, I have two friends, like friend1 and friend2. Now I have Bob and Jane, they're mutual friends. Is it Bob and Jane, or is it Jane and Bob? There are two different ways to represent that. Pretty quickly, you're going to reach for your PL/SQL hammer, and you're going to say, friend1 has got to be less than friend2. That's the way we know the identity of the mutual friendship. If we're in this N-Ary relationship world, what I'd observe is, if you just had two friend edges, it's in fact structurally perfectly symmetrical. The structure expresses exactly what we want to express. Furthermore, our little algorithm for generating the primary key works exactly right. We sort, the predicates are the same, and Bob sorts ahead of Jane, so now we have a primary key. That's pretty cool. This is actually something SQL can't do. Here's a little bit of datalog. I'm showing you, I'm asserting the edge twice, in both different orders. At the end, I ask for the inbound edges on Bob. You can see there's only one inbound edge. There's just a unique mutual friendship.
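The symmetric case falls out of the same sketch: a mutual friendship is two friend edges on the hub, and sorting the pairs makes the identity order-independent, so Bob-and-Jane and Jane-and-Bob name the same relationship.

```python
def hub_id(*pairs):
    # Same illustrative identity function as the sketch above.
    return "(" + ",".join(f"{p}:{o}" for p, o in sorted(pairs)) + ")"

# Asserting the friendship in either order yields the same central node,
# so there is exactly one mutual friendship to hang data on.
assert hub_id(("friend", "bob"), ("friend", "jane")) == \
       hub_id(("friend", "jane"), ("friend", "bob"))
print(hub_id(("friend", "bob"), ("friend", "jane")))
```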
OODB/Ontologies
A question you might ask here is, what about nodes? We've only talked about edges. That's on purpose. Nodes are just immutable strings, like they're primary keys. They represent entities. That's it. Sometimes they parse. You could have a node that's like 7, and 7 has a meaning as an integer. When we give that node a type like int, it just means I want something that parses as an integer. A question is, why not more? Why don't we put a bunch more stuff inside our node? If you have something that has node properties, a question to ask is, why even have edges? I can have a node be like an author and another node be like a book, and the book would have a little vector of author identities in there. Why not just do stuff like that? We tried this back in the '90s. I was involved with this, OODBs. They don't work very well. In this world, say, we have a person. That's a base class, people have a name or something like that. Then we have a person who's an author, that's a subclass. We're just working on structure extension here, subclassing. No one could object to that. What if I have a person who's an editor? I can solve it the same way. What if I have a person who's an editor and an author? Now I have a problem, because either I figure out some multiple inheritance thing, which never worked very well. Or, I have two different identities, two different structs, one for the person as an author, and one for the person as an editor. Equality doesn't work the way you think it ought to work. It gets worse. You're making an ontological assumption that all authors are people. That's not actually the case. Back in my history, starting out with graph databases, I worked at Metaweb, and we kept discovering that the Beatles, a rock group, were getting typed as a person. We were trying to do an inheritance-based type system, and we said, an author is a subclass of a person. It turns out the Beatles authored a book, like there are books in the world that have a title, and the author's name says the Beatles. It's a fact. I'm not making this up. This is Amazon right now. The notion that you're going to have a single ontology that is going to support everything is really badly broken. There's another interesting problem with it, which is, the root of your hierarchy becomes a gigantic single point of failure. You do some clever editing up at the root, and suddenly two-thirds of your database just disappears. The nice thing about the graph structure is it doesn't care, we're just recording facts. The triples don't need to have a particular order of structure fields, they just exist. They don't do anybody any harm. Authors, editors, astronauts, all can cluster around the same identity without any conflict at all. That's the data model.
Query Evaluation
I want to talk a little bit about query evaluation, just a couple of details. The first problem is cross-products. The second problem is static planning. These are both endemic to SQL. Worst-case optimal joins are a super interesting, big topic. You can go search for that on Google Scholar or something, and read about it. It's very relevant to a social graph kind of workload. Let's look at a cross-product. I'm going to use a predicate like knows, A knows B, B knows A, so there's an edge. Now I'm going to define a rule that's bidirectional. I just say, knows x, y is either x knows y, or y knows x. I'm going to say, show me who knows who in the whole database? I get two results, A knows B, and B knows A. The simplest cross-product that I could come up with. If I look at the edges, there's this one edge. Why is this a big deal? We're doing complex queries. We're doing complex queries of many-to-many relationships. Every time you introduce a cross-product, you're multiplying the size of your result by some constant factor, 2, 3, 5, 10, something like that. It's really easy to get results that are 100 times the size of the actual edges, if you're doing a complex query. Why do we care? Inside the database, we do lots of clever stuff to avoid materializing cross-products, because it's really hard. If you have n times m things, you wind up doing n times m work, and that turns out to be slow. A key point here is, the application has exactly the same problem. They don't want to do n times m work, either.
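The doubling is easy to see in the toy triple representation: the bidirectional rule is a union of both directions, so one stored edge produces two results, and every such factor in a complex query multiplies the result size again. A sketch, not LIquid's evaluator:

```python
# One stored edge, two results: the simplest cross-product.
edges = {("a", "knows", "b")}

def knows_bidirectional():
    # knows(x, y) holds if x knows y or y knows x.
    forward  = [(s, o) for (s, p, o) in edges if p == "knows"]
    backward = [(o, s) for (s, p, o) in edges if p == "knows"]
    return forward + backward

print(knows_bidirectional())   # [('a', 'b'), ('b', 'a')] -- twice the stored edges
```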
How does this play out? What we need is a subgraph return. Typically, SQL only returns you a single table. With a subgraph return, I just want to return multiple tables. You can think of this as, I want to return one row in every relationship table that matched. I can stitch this stuff together. It's a relational algebra tree, like what we have inside a SQL engine. Really, I only need two things in the tree. I need cross-products and I need outer joins. If you're processing this data as an application developer, you're building up a graph of people and skills and employers. You're going to have some sort of a loop where you say, person 23, do I have one for that? No. I'm going to make one. Skill 7, do I have one for that? I'm going to create one. The struct for 23 is going to point at the struct for 7 at the correct offset. That thing that you're doing with the hash table lookup to see if you have one, that's an outer join. That's all it is. It's nothing mysterious.
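That stitching loop, with the hash-table "do I already have one?" check playing the role of the outer join, might look like this; the row shape is invented for illustration.

```python
# Sketch: reassembling flat result rows into linked structs in the application.
rows = [
    {"person": 23, "skill": 7},
    {"person": 23, "skill": 9},
    {"person": 41, "skill": 7},
]

people, skills = {}, {}
for row in rows:
    # The setdefault lookup is the outer join: reuse the struct if it exists.
    person = people.setdefault(row["person"], {"id": row["person"], "skills": []})
    skill = skills.setdefault(row["skill"], {"id": row["skill"]})
    person["skills"].append(skill)   # the struct for 23 points at the struct for 7

print(people[23])   # {'id': 23, 'skills': [{'id': 7}, {'id': 9}]}
```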
Dynamic planning. Let's look at a really simple query. This is graph distance 3. It's very relevant to our workload. It's just three hops. A is connected to B is connected to C. That's all it is. I want to know if Alice is 3 hops from Bob. There are only four possible plans in a sequential world. I can go left, left, left from Alice. I can go left, left, right. I can go left, right, right, or I can go right, right, right. Those are the only ways you can do this query. None of these plans are optimal for all data. We're doing self-joins, so all of our statistics look the same: each edge constraint in this query looks identical. They're all the same. What makes them different is what data is in the graph. An observation here is, unlike in a sorted storage world, in a hash table storage world, our set sizes are available in constant time. I can do a hash lookup, and I can know how big the set that I'm going to read is. If I'm storing stuff in a B-tree, I binary search to find the beginning of it, and then I have to binary search to find the end, or I can just scan it. It's far more expensive to find out how big a set is.
What does this look like? We just hand-coded the four plans. You can see them on the chart there. You can see there's this scallopy thing with four scallops. That's where each plan is optimal. The dynamic plan, you can see it's not quite as good as perfectly optimal, because it costs you a little bit to figure out which plan to use. We're going to say, how big is the fanout from Bob? How big is the fanout from Alice? Pick the smaller one, expand that. Did that get super huge? We have to do some work to figure things out. We can do that work. It's tractable to do that work in this indexing domain. You get query evaluation performance that is pretty close to optimal in all four domains.
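Here is a hedged sketch of the "pick the smaller fanout" idea for the three-hop query. Because adjacency sets live in hash tables, the frontier sizes are known cheaply, so the evaluator can grow whichever side is currently smaller instead of committing up front to one of the four static plans. This illustrates the principle, not LIquid's planner.

```python
from collections import defaultdict

# Undirected "connected" edges for the example.
adj = defaultdict(set)
for a, b in [("alice", "carol"), ("carol", "dave"), ("dave", "bob")]:
    adj[a].add(b)
    adj[b].add(a)

def within_hops(src, dst, hops=3):
    """Is dst within `hops` edges of src? Expand the cheaper frontier each step."""
    left, right = {src}, {dst}
    for _ in range(hops):
        if left & right:
            return True
        # Dynamic choice: set sizes are available, so grow the smaller side.
        if sum(len(adj[v]) for v in left) <= sum(len(adj[v]) for v in right):
            left = {n for v in left for n in adj[v]}
        else:
            right = {n for v in right for n in adj[v]}
    return bool(left & right)

print(within_hops("alice", "bob"))   # True: alice - carol - dave - bob
```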
What is a Graph Database?
What I want to leave you with is a formal definition of what a graph database is. A graph database is an implementation of the relational model with four properties. First, all relationships are equal, everything is an edge. If you think about a typical SQL database, if you're a relationship that is in one table, you're first class. It's super nice: I do a binary search, I get the row, and I have this whole relationship right there basically for free. If you're a relationship that exists across tables, where I have to do a join, you're really second class. Joining is super slow, so you're second class that way. Semantically, SQL doesn't really keep track of what joins were intended by the user. SQL will happily say, yes, you're joining depth in fathoms to degrees Celsius. Probably ok. In the graph database, everything is an edge, and the edge is exactly what the user intended. Those are the joins that the user intended. Second, if everything is going to be an edge, you're going to have a lot of edges, so it better be fast. If it's not constant time, and fast, and by that I mean three L3 cache misses, then as your database gets bigger, your performance gets slower. Ultimately, it's not going to be worthwhile having a big graph. You want a smaller graph that performs better. That turns into something that's harder to manage. Third, query results are a subgraph. If I'm going to have a big graph with a lot of edges, I want to ask complex queries of it. I need to return structured results that the application actually wants to consume. Lastly, schema change is constant time. If you think of the general model for a serving system, it's a materialized view. I did some fancy query, and then, when you ask something, I just look up a row and give it back to you. What we're saying with this is, no, you can't cheat. I need to be able to show up with a new predicate and just start using it at full speed. This can't turn into any kind of ALTER TABLE fire drill underneath the surface in the implementation. That's my proposed definition of a graph database. Clearly, I disagree with a lot of the world. We'll see how that turns out.
Graph Database vs. SQL, RDF, and Property Graphs
I want to do a brief comparison walkthrough. First of all, versus SQL. There's no denormalization. A graph is perfectly normalized. It is in fifth normal form. You can't express denormalization. There are no nulls or ternary logic. Edges either exist or they don't. If you introduce an unbound variable, we're going to force you to give it a binding. If you have a default quantity, you get to decide, is it 0? Is it min int? Is it max int? Is it 1? It depends which is appropriate. It depends on what you're doing with it. We force you to make that decision explicitly, so now there's no ternary logic, which is a big pain. We can constrain through the predicate. In SQL, a predicate is typically a column, roughly. You're welcome to use the catalog table, but it's hard to use the catalog table meaningfully in a query. Typically, to do something like this in SQL, you need to query the catalog, figure it out, and then generate another SQL query. Constraining through the predicate is like, this predicate is a name-like thing. You can explore all the incident edges on any entity. You have a ticker symbol, MSFT for Microsoft, and it's in the graph somewhere. You can go start exploring. You can do symmetric relationships. You have subgraph results.
Versus RDF: the relational model, that's pretty cool. It's been 40 years, and we've had 20 years of NoSQL. There really isn't much of a contender. We did not invent a query language. I highly recommend that. We have a much simpler edge. We have a first-class schema that is much simpler than OWL. You've seen all the moving pieces in the schema in this talk. We can do N-Ary relationships. We can do symmetric relationships. We have composable rules. Lastly, versus property graphs. Relational model. Didn't invent a query language. We have a first-class schema for everything. You can constrain through the predicate. There's no separate property schema. Everything is just edges. There are no OO node problems; we just make the decision for you. You can't put stuff in a node, just use edges. N-Ary relationships, again, super common. Composable rules.
Single-Query Applications
What's the future here? A lot of people think that the future of data is big graphs. If the future of data is big graphs, then I think the future of applications is big queries. You ought to be able to say, here's all the data that I want to put on a page, here's a description of it, go get it for me. Even if that means I'm doing thousands of joins, it happens in human real time.
Questions and Answers
Participant 1: There is a feature in [inaudible 00:45:25] where you can have a representation like, I am 80% good in Java, and we'll be doing JavaScript or something like that. How do you have that built in these edges?
Meyer: You would have a compound relationship with a skill, so you have that hop in the middle. Then you just add a score edge, whatever you want, a floating point, or an integer score.
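In the toy triple sketch from earlier, that answer would look like one more edge hanging off the hub node; the predicate names and the score value are purely illustrative.

```python
def hub_id(*pairs):
    # Same illustrative identity function used in the earlier sketches.
    return "(" + ",".join(f"{p}:{o}" for p, o in sorted(pairs)) + ")"

# A skill relationship as a compound, with a score as an attribute edge on the hub.
h = hub_id(("member", "u:1"), ("skill", "java"))
edges = {
    (h, "member", "u:1"),
    (h, "skill",  "java"),
    (h, "score",  "0.8"),   # "80% good in Java" as a non-identity attribute
}
print(h)
```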
Anand: Do you put any limits on fanout, either on the query edge part or in the representation?
Meyer: There are no limits on the fanout. Naturally, in a social graph, you're going to experience skew. I don't have very many followers, a few hundred. Bill Gates has 20 million followers. That's not a normal distribution, it'd be a Zipf distribution. That's a benefit of the graph representation: you can represent stuff like that. The implication is, your query eval better be able to deal with skew. If you have a bunch of people, and you're looking at the fanout from followers, it might be 200, 700, 900, 200, 20 million. When you get to 20 million, it's probably time for a different query plan.
Participant 2: I want to understand in what use cases you would choose a graph database. We maybe have different options or approaches there, like GraphQL or big data, which also help us to search faster with [inaudible 00:47:33] of the data joins. Each may have its pros and cons; for a graph database, basically, we're looking for the use cases where it could be the optimal approach.
Meyer: What are the use cases? What are the pros and cons? For the data model itself, for how to model data, I don't think there are pros and cons. I think the graph is just better. It's isomorphic to entity relationships. If you wanted to do normalized entity relationship data modeling in SQL, that's great. It turns right into a graph. For querying, obviously, it really just depends on the workload. The observation I'd make is serving is a new workload. People typically handle this with a cache in front of a conventional database. Really, the difference between a cache and an index is just the index is complete. It has some consistency properties that you can state, whereas the cache is just whatever the application put in there. If you have an Object Relational Mapper, it's running a cache of objects. I think that's pretty undesirable. I think indexing is much better. There's no warmup time. You know how much things are going to cost. It's immediately usable at full speed. The previous system at LinkedIn was a cache system. The current system, LIquid, is uncached. The SREs love that.