Papers in Production Lightning Talks

Summary

These lightning talks cover several papers: Towards a Solution to the Red Wedding Problem, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, and A Machine Learning Approach to Databases Indexes.

Bio

Randy Shoup is a 25-year veteran of Silicon Valley. Roland Meertens is Machine Learning Engineer at Autonomous Intelligent Driving. Gwen Shapira is a principal data architect at Confluent helping customers achieve success with their Apache Kafka implementation.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Shoup: I'm going to share very little of my personal knowledge, in fact, none of it, but I'm going to talk about a cool paper that I really like. Then Gwen [Shapira] is going to talk about another cool paper and Roland [Meertens] is going to talk about yet another cool paper. The one I want to talk about is a paper around using machine learning to do database indexing better. This is a picture of my bookshelf at home. A while ago, I bought myself a box set of "The Art of Computer Programming", which has basically all of computer science's algorithms, written or assembled by Don Knuth. He's up to volume 4A, so he's still working on completing the thing; hopefully, that will happen. Anyway, there are a lot of algorithms out there in the world and a lot of data structures.

When we're choosing a data structure, typically we're choosing it in this way, we are trying to look for time complexity, how fast is it going to run, and space complexity, how big is it going to be? We typically evaluate those things asymptotically, we're not looking as much at real-world workloads, but looking at what are the complexity characteristics of this thing at the limit when things get very large? We're also, and this is critical, looking at those things without having seen the data and without having seen typically the usage pattern.

What we're doing is asking: what is the least worst time and space complexity, given an arbitrary data distribution and an arbitrary usage pattern? It seems like we could do a little better than that, and that's what this paper is about. What we'd like to be able to ask, or to be able to answer, is how we could achieve the best time/space complexity given a specific real-world data distribution and a specific real-world usage pattern. That's one of the things we're going to talk about in this 10-minute section.

The Case for Learned Index Structures

This fantastic paper is from a bunch of people at Google and people visiting Google, and it's called The Case for Learned Index Structures. There are three things that I would like you to take away from this paper. The first is the idea that we're going to be able to replace indexes with models. That's going to be pretty interesting; there's a duality between an index or a data structure and a machine-learned model. Next, we're going to talk about how we might be able to use machine learning to synthesize a custom index that actually takes the specific characteristics of a specific data distribution and a specific usage pattern into account, and how we might be able to combine them using machine learning. Then finally, the thing that's most exciting to me is that this really opens up a whole new class of approaches to fundamental computer science algorithms.

Let's first talk about replacing indexes with models. The first key idea of the paper which is actually in the first sentence of the abstract, is an index is a model. What do they mean? Let's use a couple of examples of really common data structures, a B-tree which we use to search and store things in database indexes, you can think of that as a predictive model that given a key, finds a position in a sorted collection. We can think of that as those things are dual. You can think of a hash-map as, again, a predictive model that takes a key and finds a position of something in an unsorted collection. Then you could look at a bloom filter as simply a binary classifier, a bloom filter is a way of deciding cheaply, is something in our set or not in our set, and you can have false positives but not false negatives, etc.

The wonderful idea, though, is that there's this duality between indexes and models. How they show it in the paper is this: you might think of a B-tree index as, given a key, you go into the B-tree and you find the position of the record that you're interested in, but a learned index could replace that triangular B-tree symbol with a square model, which could be a neural net or a regression or some regressive thing or something else. That's a wonderful thing. You go, "All right, that's kind of cool, but what am I going to do with that?"
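
To make the "index is a model" duality concrete, here is a minimal sketch - my own illustration, not code from the paper or the talk - of a learned index over a sorted array of keys: a simple linear model predicts where a key should sit, we remember the worst prediction error seen while fitting, and a lookup only searches the small window around the prediction. The class and variable names are made up, and the paper uses staged models rather than a single linear fit.

```python
import numpy as np

class LearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a
    sorted array, and lookups only search the window around the
    prediction (a sketch of the idea, not the paper's implementation)."""

    def __init__(self, sorted_keys):
        self.keys = np.asarray(sorted_keys)
        positions = np.arange(len(self.keys))
        # Fit position ~ a * key + b; the paper uses neural nets here.
        self.a, self.b = np.polyfit(self.keys, positions, deg=1)
        predicted = self.a * self.keys + self.b
        # Remember the worst prediction error so lookups stay correct.
        self.max_err = int(np.ceil(np.max(np.abs(predicted - positions)))) + 1

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only inside the error window around the prediction.
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return i if i < len(self.keys) and self.keys[i] == key else None

rng = np.random.default_rng(0)
keys = np.unique(rng.integers(0, 1_000_000, size=10_000))
idx = LearnedIndex(keys)
assert idx.lookup(keys[1234]) == 1234    # finds the position of an existing key
```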

Let's go to the next step, which is automatically synthesizing custom indexes using machine learning. One of the things we could use machine learning for, which is pretty straightforward, and a bunch of other people have talked about it here, is to select which data structure we're using, and that's pretty cool. Then we might be able to think about how to use it to tune the data structure; hyperparameter tuning is very much the same idea as applied to machine-learned models. We can use machine learning to see how best to tune our other machine-learned models. Given a traditional data structure, we might be able to use machine learning to help us figure out something better. Even cooler than that, we could use machine learning to assemble a custom structure that takes advantage of a particular data distribution.

We might be able to choose different data structures for different parts of the distribution. One part of the distribution might be very sparse, another part of the distribution might be very dense, one part of the distribution might be integers, another part of the distribution might be something else. We could assemble a custom structure that put all those things together. Then we could actually use a hierarchy of models, we could think of a kind of ensemble or a mixture of experts approach, where we use a hierarchy of first choose the next model, and then the next model, and then finally, at the end, the actual structure.

Cleverly, we could default to the traditional structure if we weren't able to improve upon it with machine learning. This is what it looks like in the paper: imagine, in this example, a three-tiered situation, where a key comes into an initial model, which chooses the next model, which chooses the next model, and then finally, we find the position. Well, this seems like a lot of work to replace some really fast, highly optimized data structures that we use all the time. Is there any way this could work in a performant way? It turns out, yes. The paper uses a specific example of a highly cache-optimized, production-quality B-tree implementation versus a two-stage learned index, where there was a neural net on the top and linear models at the bottom, and what they got was actually up to a 70% increase in speed and a 10x to 100x reduction in size. That's pretty cool, that's not a bad day's work.
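
The hierarchy of models described here (the paper calls it a recursive model index) can be sketched in the same toy style, reusing the LearnedIndex class from the snippet above as the leaf models. The routing rule and the fanout below are my own simplifications, not the paper's implementation, and the sketch assumes each bucket ends up with a reasonable number of keys, which holds for the uniform demo data above.

```python
import numpy as np

class TwoStageIndex:
    """Rough recursive-model-index sketch: a root model routes each key
    to one of `fanout` leaf models, and the chosen leaf predicts the
    key's position within its bucket."""

    def __init__(self, sorted_keys, fanout=16):
        self.keys = np.asarray(sorted_keys)
        self.n = len(self.keys)
        self.fanout = fanout
        positions = np.arange(self.n)
        # Stage 1: a linear "root model" from key to rough position
        # (the paper trains a small neural net here).
        self.a, self.b = np.polyfit(self.keys, positions, deg=1)
        # Route every training key exactly as lookups will be routed,
        # then train one leaf model (LearnedIndex) per bucket.
        buckets = [[] for _ in range(fanout)]
        for pos, key in enumerate(self.keys):
            buckets[self._route(key)].append(pos)
        self.leaves = [
            (bucket[0], LearnedIndex(self.keys[bucket])) if bucket else None
            for bucket in buckets
        ]

    def _route(self, key):
        guess = self.a * key + self.b
        return min(self.fanout - 1, max(0, int(guess * self.fanout / self.n)))

    def lookup(self, key):
        leaf = self.leaves[self._route(key)]
        if leaf is None:
            return None
        offset, model = leaf
        pos = model.lookup(key)            # position within the bucket
        return None if pos is None else offset + pos

rmi = TwoStageIndex(keys)                  # `keys` from the snippet above
assert rmi.lookup(keys[4321]) == 4321
```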

The thing that I think is even more exciting is just the idea of replacing the guts of aspects of important parts of a database with machine-learned approaches, so an entirely new approach to how we might improve on performance characteristics of stuff inside a database. One of the things they point out is that what we're actually doing with that tree of models is we're learning a model for the cumulative distribution function, the CDF of the input. We can use that CDF model to build an indexing strategy like I talked about, we could also use it to build an insertion strategy, a storage layout, we could do cardinality estimation and join optimization. This is all stuff that database researchers have been doing for a long time, we could also actually compute approximate queries.

The same authors did a subsequent paper later this year, which they called SageDB, where they are using this idea of building a model in the center and using it for query optimization, for data access improvements, for sorting, for advanced analytics, all this great stuff. What's very cool about using machine-learned approaches here is that not only will this stuff improve as we improve the machine learning techniques - that's one dimension of improvement - it will also be able to take advantage of improvements in algorithms and data structures in the traditional computer science world, because we'll just have better base things to assemble. Even more importantly, this stuff scales with GPU and TPU performance, which is continuing to improve, versus Moore's law and CPUs, which are flat. I leave you with this totally awesome paper, The Case for Learned Index Structures.

Towards a Solution to the Red Wedding Problem

Shapira: I'm going to talk about a paper called "Towards a Solution to the Red Wedding Problem". The paper is almost as good as the name, but you really can't top that name. The content of the paper is around solving a unique problem around compute at the edge, so I'm going to start by giving some background about what we do at the edge at all, talk about a specific problem, talk about the proposed solution, and then talk about some things we can do in the future, all of that in 10 minutes.

What do we mean when we talk about the edge? If you think about most web architecture these days, there are the main data centers that have databases, and we call them origins. If everyone connected to that main database, we would have a lot of problems: it would be slow, because some people live very far away from the database. It would also become a big bottleneck; our entire global capacity would be limited to what one database can provide, and that's a problem. So people take different locations around the world, put a big cache there, and now people can talk to the cache. That's fantastic for reading, because you read data that is nearby.

If you're writing, on the other hand, the writes still have to end up at the origin data center for consistency. Usually, you write to the cache there and the request is kind of rerouted to the data center. All the problems that we have with one data center still exist. These days, we have a new solution and a new problem: we added compute to the edge. Most CDN providers, and AWS CloudFront, allow you to run small serverless functions on the edge, so you can do cool stuff like resize images or reroute requests for A/B testing or add some headers, etc., but it's all limited to small and, most importantly, stateless stuff, because there is no state at the edge. You have some cache, and you have no control over what is actually in there.

What is the Red Wedding Problem? The Red Wedding Problem is what happens after a particularly popular episode of "Game of Thrones" airs on TV, as happened this Sunday, I think it was Sunday, I don't actually watch it myself. Immediately after the episode airs, everyone apparently runs to the wiki and tries to update the episode page or their favorite character's page and starts editing pages. Now, there is also a read spike, but we already know how to handle a read spike; what about the write spike? How do we handle a huge spike in writes? This is the Red Wedding problem: the Red Wedding episode just aired; what do I do with all those writes?

In a nutshell, the solution is very simple: we need to allow concurrent updates at the edge. That's the way to solve it - don't take all the writes to the data center, because we said that's going to have problems - but the devil is clearly in the details. The theoretical solution is absolutely awesome, because if we can do that, it means we have low-latency writes, everyone writes locally, and we get rid of a big honking bottleneck in our database. It sounds amazing, if only we can solve it.

There are obviously some challenges, one of them is that, as we said, the edge only runs serverless, stateless stuff. How can we maintain state there? The other part is that if everyone's updating their own data center, and everyone is working on the same character page, how are we going to merge all those things together? It turns out that the second problem is actually way easier, which was incredibly unexpected when I read the paper. It turns out that there is something called CRDTs, which are basically data structures made for merging. As long as you use CRDTs for editing, you're guaranteed to be able to merge them almost conflict free at the end. I'll talk about which CRDTs to use when I get to the end of this talk. Surprisingly, the hardest part was how do we actually maintain state on the edge?
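
As a flavor of why CRDTs merge so easily, here is a minimal sketch of one of the simplest state-based CRDTs, a grow-only counter. The replica names and the wiki-edit framing are just for illustration and have nothing to do with the paper's actual implementation.

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own
    slot, and merging two states takes the element-wise maximum, so
    merges are commutative, associative, and idempotent."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                       # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

# Three edge locations count page edits independently during the spike...
us_east, eu_west, ap_south = GCounter("us-east"), GCounter("eu-west"), GCounter("ap-south")
us_east.increment(120)
eu_west.increment(45)
ap_south.increment(45)
eu_west.increment(10)

# ...and the origin can merge their states later, in any order.
origin = GCounter("origin")
for edge in (eu_west, us_east, ap_south, us_east):   # duplicate merges are harmless
    origin.merge(edge)
assert origin.value() == 120 + 45 + 45 + 10
```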

Here's what the solution kind of looks like - I'm sorry for the mess. Reads work as usual. Writes are now handled at the edge, where we're running all those Lambda functions; each of them maintains its own state, but they also communicate, so you talk to other Lambda functions in the same point of presence, in your same location, and you do it through a queue, so you still propagate your state. If we're in the same data center, I have my state, you have yours, and we communicate with each other, which means that when a new serverless function starts, it can very rapidly recover state from someone else running in the same location. Every once in a while, we take all this data we accumulated on the edge and merge it back into our data center, and that's how we maintain state in the data center. Those batches are what makes it efficient.

In a bit more detail, we use containers to deploy our application and database. In this case, and I'll show you in a second, the application and database are actually the same thing, it's embedded. Then we recover state from other replicas running in the same location, we use CRDTs for merging, and we also use anti-entropy to make sure that mistakes don't accumulate and propagate. Implementation - that's all I'm going to say about implementation - they've taken a distributed peer-to-peer key-value store written in Erlang, embedded it in Node.js, and used AWS Lambda to run it on the edge. This thing should be illegal, and probably is in many states; whatever you do, do not take this paper to production. This is not the right paper to take to production, it needs some production hardening first.

The main problems they ran into while implementing come from the fact that they're running it on AWS Lambda on the edge; half the paper is one limitation after another and how it impacts the solution. You basically have stateless functions, that's all you have. You cannot bind to a port, which means the original plan for how functions in the same location talk to each other went completely out the window, and they talk to each other using a queue. As you can see in the last point, there is also not really a queue on the edge, so they talk to each other via a queue in the nearest data center. You also don't control concurrency: ideally, you would have a lot of Lambdas running at the same time to serve the workload, but they basically had zero control over that, and they could theoretically lose their entire state accidentally.

Invocations don't last long, so they found weird ways to keep the Lambda functions running for longer so they could actually perform their anti-entropy cleanup work. If you want to hear hack after hack on top of a really good, really amazing idea, that's the paper for you. How is it actually useful? Well, for one thing, a lot of IoT and AI is moving to the edge because that's where all the data is, so the idea that you can use CRDTs to merge state on the edge and avoid this global bottleneck seems incredibly useful, something that I'm totally going to play with.

If you need CRDTs, there are JSON CRDTs, and since JSON is kind of the global data structure we all sadly converged upon, this allows us to basically use CRDTs for whatever it is we need. The JSON CRDT paper by Martin Kleppmann is dense, but amazing in its own right; he also has a good talk about it. Lambda@Edge is clearly terrible and awful, and I hope either AWS fixes it or someone else will show up and provide a service that has good serverless computing on the edge but actually allows you to maintain state, bind to ports, and do normal stuff.

As I read the paper, I was just thinking, "Hey, I have Kafka Streams, it has a local store with RocksDB, and RocksDB supports merges. We have Kafka as a broker, so I can actually recreate this in a better way than what they did with Lambda@Edge." If anyone wants to beat me to it, it sounds like a fun weekend project. CRDTs on the edge for the win, everyone. Next up is Roland [Meertens], who builds autonomous vehicles, self-driving cars, for a living; that has to be interesting.

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Meertens: I am indeed Roland Meertens and I work on self-driving vehicles; I think it's the most interesting AI problem of this century. If you agree, come talk to me; if you don't agree, also come talk to me. I'm going to present a paper called A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Let me first show you what kind of databases this can work on.

Imagine that you have this kind of data. As a human, it's pretty clear to see that these things probably belong together, but it's not like a normal distribution you expect. Same here for these kinds of pictures, where you note that these data points probably belong together, or that those data points belong together. You can't really fit it to a normal distribution; this is a case where you can see what belongs together by looking at the points around it. That's why the DBSCAN algorithm is so nice and elegant, because it's an unsupervised algorithm designed to find clusters of points that are close together and thus likely belong together. Because it's unsupervised, there's no need for any training step, you just have to select the right parameters.

In case you guys now think, "Oh man, it must be really difficult to find these parameters" - ha-ha, no. The other thing which is really good is that it is great at finding outliers or anomalies, because it is not really influenced by them, unlike, for example, k-means, where if you have an anomaly, your mean for a cluster is dragged away. The only two parameters you have here are the distance in which you want to search for neighboring points and the number of points that need to be close together to be part of the same cluster.

Let's go into that. It's clustering by simply counting; this is basically the whole DBSCAN algorithm. Points are either core points, border points, or noise (anomalies). If you have a point, like point A in that picture, that has N or more points around it, then it is a core point. Points which can be reached from a core point, but don't have enough points around themselves to reach other points, are border points, but they belong to the same cluster. Points that are not close to any core point and don't have N or more points around themselves are classified as noise or anomalies. DBSCAN, the density-based scan algorithm, recurses through the connected components to assign the same label to connected points. If you notice that a core point has N or more points around it, you can basically recurse into each of these points around it and see whether they belong to the same cluster or not. That's basically the whole algorithm.
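
As a rough sketch of that counting-and-expanding logic - written for readability, not performance, and not taken from the original paper - a brute-force DBSCAN fits in a few dozen lines of Python. The demo data, eps, and min_pts below are illustrative; in practice you would reach for scikit-learn's DBSCAN instead.

```python
import numpy as np

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal brute-force DBSCAN: label each point with a cluster id,
    or NOISE (-1). points is an (n, d) array."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise distances (fine for a sketch; use a spatial index for real data).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, NOISE)
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Expand a new cluster from this unvisited core point.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster            # border or core point
                if is_core[j]:                 # only core points keep expanding
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense blobs plus one far-away outlier.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)),
                 rng.normal(5, 0.3, (50, 2)),
                 [[20.0, 20.0]]])
labels = dbscan(pts, eps=1.0, min_pts=5)
assert labels[-1] == NOISE and len(set(labels)) == 3   # 2 clusters + noise
```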

Because I have some time left, I also wanted to go into some of the applications and how you can use it, because the cool thing about this algorithm is not only that it's really simple, it's also that it's super versatile and can be used for many applications. For example, you could use this in an outlier detector for a group of servers, where you try to group them together based on how much RAM or CPU they use, and you can find some outliers there.

You can use it for clustering cells in biology, for example. You could cluster movies for a recommendation engine; if people like this movie and there are many movies close to it, probably you like these other movies as well. You can maybe cluster crops in satellite or drone images, you can find separate objects on the road in LiDAR data. Overall, it's a very powerful tool, you can use it as a very powerful tool in your software engineering toolbox.

I wanted to go into this example of finding objects on the road. One thing you have in self-driving cars is this really cool sensor setup on the top. Those things are called LiDARs, and they consist of multiple lasers which basically spin around very fast. These give you a very large number of accurate measurements, but they are unstructured points in space, or you can treat them that way. The main challenge here for self-driving cars is, of course, what objects are on the road and where they are. The first challenge is to find what separate objects are on the road. As you probably guessed, you can use DBSCAN for that; then there might be a bonus challenge to determine what these objects are. If you just run DBSCAN - on the left you see the output, where I gave every connected cluster a different color - you can see the DBSCAN algorithm is able to find - I'm sorry, you have to have a lesson in how to read these things - that every group of points is actually one car, or a tree in this case. That's the nice thing about the DBSCAN algorithm, it's pretty versatile, so you can also use these points as input for a new algorithm.
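
In practice, that clustering step is only a few lines with scikit-learn. The point cloud below is synthetic and the parameters are only illustrative (roughly the "20 centimeters, at least a few hits" reasoning from the talk); a real pipeline would run on actual LiDAR returns, usually after removing ground points first.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for a LiDAR sweep: two "objects" plus scattered returns.
rng = np.random.default_rng(2)
car = rng.normal([10.0, 2.0, 0.5], 0.05, (300, 3))     # dense cluster of hits
tree = rng.normal([6.0, -3.0, 1.5], 0.05, (200, 3))
stray = rng.uniform(-20, 20, (30, 3))                   # isolated noise returns
cloud = np.vstack([car, tree, stray])

# Points closer than 20 cm belong to the same object; groups with too
# few hits are treated as noise.
labels = DBSCAN(eps=0.2, min_samples=4).fit_predict(cloud)

objects = set(labels) - {-1}                            # -1 marks noise
print(f"found {len(objects)} objects and {np.sum(labels == -1)} noise points")
# Each cluster of points could then be fed to a classifier that decides
# whether it is a car, a tree, a pedestrian, and so on.
```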

I asked a student, Christian, to do that, and he came up with this, where he tried to draw or paint every car green, every tree orange, and everything which is not related to any relevant object blue. He also made a video, so that's cool, let me see if I can actually play that. Here you see a car driving on the road, it's in the center actually, and you can clearly see the trees in this image, which are all colored orange, and you can see that all the cars become green. You can even use DBSCAN as input for your cooler deep learning algorithm if you want.

To summarize, I think it's a really powerful tool in a machine learning toolbox; I use it very often for a lot of the machine learning problems I encounter. It needs no training, and it's easy to reason about the arguments you want to pass to it. For example, in this LiDAR case, you could say, "Well, I want to find all the objects whose points have no more than 20 centimeters of space around them, and I want to remove all the noise, so if there are only three LiDAR points hitting an object, it's probably just noise." It's really easy to reason about that. It gives you the clusters and the outliers, making the algorithm even more versatile, and it can be used as a first step in a longer chain of processing.

Questions and Answers

Participant 1: My question is for Randy [Shoup], about the time complexity results that you showed.

Shoup: That's the table of results, that was an empirical result, yes.

Participant 1: Did they think about how to analyze time complexity for that design in a classic way? Come up with asymptotic analysis for those, so that we have tight bounds in a more theoretical way, as opposed to just, "Oh, we'll use this real data and this is how much faster it is"?

Shoup: I think I understand the question; it was real world, you should read the paper. They used real-world data, actually three separate datasets - one of them was from Google, I forget - but it's three real-world datasets and a real-world production implementation of a B-tree, and then obviously a real-world implementation of their two-stage model. I think they're not asserting a different complexity class, if that's what you're saying. They're just asserting that they were able to beat the performance of something that was good.

Participant 2: My question was more about that, because the anxiety that I have about using that approach is how can I guarantee that my running time is going to be [inaudible 00:26:05]?

Shoup: Oh, yes, we should talk about it. You're worried about the worst case, I see what you're saying. You ran it 10 times, but what if you ran it the 11th time and it was longer? Yes, they do address that a little bit in the paper. We should talk about that offline, it's a great question.

Participant 3: I'm wondering if you use a model as an index, which sounds like a fascinating idea, what about the problem of drift? As data accumulates in your database, the structure of it may change over time.

Shoup: I think there are two aspects to that. About the drift - that can actually be an advantage. In 10 minutes, I didn't summarize all the cool ideas of the paper; one of the other cool ideas is that you can imagine retraining the algorithm, retraining your mixed model or hybrid model. You could imagine, "Ok, at some point in time, I've got something that feels optimal", and then do it again the next day.

The cool thing that I implied in the abstract but didn't say explicitly is that usually the only way to get a custom data structure is to hand-write it, and that's why people don't do it, but if you can automate this process and use machine learning to do it, then you can run it however often you like. That's one part of the drift, so you can actually take advantage of that. The other thing you might be implying is that machine learning models are inexact, versus traditional data structures. They do talk about ways of using auxiliary structures to deal with that. If I didn't answer your question, then we can talk more about it.

Participant 3: Well, I was thinking more of the first one, which is the idea that in the beginning, you have this database and you put things in it, and because you maybe have beta customers, they're all in a certain geographical region, or there's some characteristic where the indexes all work for them. Then all of a sudden you have all these new users, or there's a Red Wedding and suddenly something happens, so now the indexes - I mean, the virtual indexes, the learned way of finding things quickly - are failing, because whatever you're using as your features in that model is not as relevant anymore.

Shoup: Yes, you're basically saying it's overfit, so I think all the techniques that are around being resilient to overfitting would work. There's another thing that they talk about in a subsequent paper; the examples they give here are read-only structures. What they're showing is for static data, so we'll start with that. What you're really saying is, which they talk about in the subsequent paper, what if you inserted stuff into it and changed it over time? The idea there, which I'm not sure I'm going to be able to articulate well, is that you want to model the ideal CDF, the real distribution from which the input is coming; that's what you're looking for in the insert/update case. In the read-only case, you can model the actual CDF and then you're finished. It doesn't matter that it's overfit, because the data is not changing. I don't know if that made any sense, but we can talk more if it doesn't.

Participant 4: I had a question. Are they updatable models? Like the B-tree ...?

Shoup: That's what I'm saying.

Participant 4: Like for each update.

Shoup: In the first paper, they only talk about read-only structures. Then, obviously, the orange-site people are like, "This is useless because I update my data structures". I don't think I did a good job of explaining it, but because I don't fully understand it, I'll be open. In the read-only case, you can model the very specific, exact data distribution and get a near-optimal thing. That's clear. What you're saying is, "Hey, man, that thing's going to change over time". What you're looking for then is a harder-to-train model, but ultimately the model is about predicting the theoretical or general CDF as opposed to the actual one. Does that make sense?

Participant 4: Basically, if it's read-only, the training accuracy is 100, perfect. Perfect training accuracy, right?

Shoup: You overfit like, yes, dude, I overfit. That's the goal.

Participant 5: It's a question for Gwen [Shapira]. Can you describe the mod, [inaudible 00:30:55] and talk about, I didn't catch exactly what it is?

Shapira: The CRDTs?

Participant 5: Yes.

Shapira: Yes, I love to talk about CRDTs. Do we have an hour? The general idea is that, if you think about it - we're all pretty much engineers here - I make changes to my code, you make changes to your code, and we both git commit. And 90% of the time when we run git merge, it just seamlessly, magically merges. If we edited two different files, not a problem; if we edited two parts of the same file, still not a problem, unless we really edited the same line. You can think about generalizing that: if you have an array and someone inserts something and someone else inserts something else, you can figure out that, hey, it's two inserts, and so on.

Basically, CRDTs are data structures that, in addition to the usual operations, support a merge operation. JSON CRDTs support that on generalized JSON, which is pretty cool. You can imagine that if I were the owner of a wiki and really, really worried about concurrent writes, I would deal with it that way: if someone edited two different parts of the page, not a problem at all. If two people actually edit the exact same sentence and I try to merge it, I will probably end up with a mess; no matter how I try to solve it, there is no good solution for two people editing the same sentence.

It's a wiki, so just store the mess; somebody will notice it and fix it an hour later. It's not a bank, so not a big deal, basically. Some stuff can also be smoothed over by just shrugging and saying, "Not a big deal". I want to again refer people to a talk that Martin Kleppmann did, I think both at QCon and GOTO, on the subject of JSON CRDTs and how he's basically building a distributed text editor that supports the same idea. That's a good example of how I imagine the whole wiki CRDT thing works.

Participant 6: I have a couple of questions. Regarding self-driving, is there any hope of solving self-driving without LiDARs, or is that crazy?

Meertens: That's a difficult question, because they are trying, so they obviously believe that there is. The big thing is that if you work with camera only, everything you see is an estimation; if you use LiDAR, you have a measurement. The big problem with these LiDARs is that they are super expensive. That's why Tesla wants to try to make something without them, because obviously they want to sell consumer cars. The big difference is that we, at our company, want to build robo-taxis, where you can offset the cost of a single vehicle over a way higher usage throughout the day than Tesla can, where you only use your Tesla to drive to work and back. Obviously, we as humans can do it without any exact measurements, but we use a lot of intrinsic knowledge about what things look like, how they behave, etc. I think in the short term, using LiDARs is a very good bet. In the long term, we don't know.

Participant 7: I also have a question for Gwen [Shapira], which is maybe more of a funny question. Do you think serverless functions are popular because they are the enterprise version of Ethereum?

Shapira: I think they're popular because they're kind of easy to write, they're kind of easy to run, and for 99.99% of the scale, it's actually quite cheap. It's not cheap when you reach the tail end, especially if you do what they did in the paper, where these functions continuously communicate from the edge to a data center. Those costs totally pile up easily, but when you begin, it's cheap, it's easy, and everyone loves cheap and easy.

Participant 8: I have a question for Randy [Shoup]. I think this goes back to the question he was asking. If you're building a model in order to optimize your data structure, by definition your model is statistical in nature. It seems like there is a fundamental problem here, because you cannot provide guarantees, because it's statistical. If you do try to provide guarantees, you will have to overfit your model, which means a new data point that comes in for access will not be able to get the same kind of performance, so it's not a problem where you can just get away with it. You can only have one of the two.

Shoup: Yes, I do suggest you read the paper, I'll separate these two ideas. If you're in a read-only situation, overfitting is your goal. We'll start with that, I have a static set of data and all I want to do is serve read-only stuff. Actually, overfitting is my goal there. Do we agree on that?

Participant 8: Again, we don't exactly know what the read pattern is. The key here is that the distribution of usage should not change from the distribution over which the model was trained, because if the distribution of usage changes, then...

Shoup: Then you might get something slightly less optimal, or there might be something that's better. Nobody is asserting that it's - maybe I used the word optimal - nobody's asserting that it can't be improved upon. Obviously, to your point, if anything changes, all bets are off, or a lot of the bets are off. But the point that you make - and the reason why I separated this is because this other part is dealt with in the paper - you said statistical, but I'm going to say back approximate: you're saying, and it's true, that the vast majority of machine-learned models give me approximate answers, not exact answers.

This is not my paper, but the beauty of the paper is that they describe that. In a B-tree, the prediction that we're making is to find a page; we're not finding the record, we're finding the page, and then we look in the page. Ditto for a hash table: with hash collisions, lots of keys hash to the same bucket, but then we have the linked list, or cuckoo hashing, or whatever - a million techniques to deal with those collisions. In both cases, the existing data structures are already approximate. Does what I'm saying make sense? That was another totally mind-blowing thing for me, like, "Oh my God, you're right." I had the same reaction - there's no way you could do that with a model, "Oh my God, you can't do that".

I can't encourage you enough to read the paper, because that's dealt with in there. In the Bloom filter section, a Bloom filter does need to be exact or else it's totally useless - exact in the sense that I can get false positives but I cannot get false negatives, or else I shouldn't do it at all. What they suggest there, because the only ways I could think of were approximate, is that there would be a small side Bloom filter for the outliers. Anyway, you should read the paper. All your points are well taken; I encourage you to read the paper because they're dealt with in one way or another in there. Awesome.

Participant 8: Also, I think you showed that it's like the cache optimized B-tree, but there are clustered indexes, probably passing the model. I'm assuming cluster B-tree [inaudible 00:39:45]

Shoup: Yes, the orange-site people are like, "There are compressed structures." Yes, and the machine-learned model can take care of the compressed structure. "There are succinct structures." Yes, we can take advantage of that too. Anyway, read the paper, it's a great read; the takeaway is, "Wow, what a cool paper. Let's read it."

 


 

Recorded at: Jun 27, 2019
