Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Algorithms behind Modern Storage Systems

Algorithms behind Modern Storage Systems



Alex Petrov talks about modern storage system approaches, discussing storage internals, and evaluation techniques to choose a database with the optimal read, write or memory overhead, best suitable for a certain data.


Alex Petrov is an infrastructure engineer and Apache Cassandra committer. He is interested in storage, distributed systems, and algorithms.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


It's my first time here and I know that this is an awesome conference and I always wanted to either come here and, well, have the pleasure to meet everyone or be able to present. And here I am. I'm very pleased. So, as it was already mentioned, my name is Alex [Petrov]. I'm an Apache Cassandra Committer, and today we're going to talk about the on disk storage. And if you're new to the subject, it is quite hard to find the information and kind of put the pieces together. Some systems have existed for decades, for many years, and this somewhat skews the information that is easier to find, that is more accessible, and other things are somewhat difficult to find and you stop knowing what is good and what is bad, and do not really know where to start.

And I couldn't find a good summary about what's going on in the storage world, so I thought that why not just do a talk. Actually it started out with the series of the blog posts that you can also read online, and now it's kind of turning into the book about everything databases related. And it should help you to understand and give you an introduction to this world of algorithms behind this storage.

Let's talk about the reasons to give this talk. Over the last decade or so, the amount of data processed by even an average application has grown tremendously, I would say. If you were to shop for a database in 2006, you would have just a few choices. So, it was rather obvious. I won't name names, but we all know those. And now you have all sorts of databases even here, there were several presented. And if you were to shop for a database today, what would you pick? What do they have underneath? What are they using? Is this good, from theoretical perspective? How do we compare the databases? How do we say what is good for my application?

So, because now you have all sorts of good databases, it's kind of hard to answer these questions because it's also sometimes hard to know what's inside. Of course, this doesn't make jobs of neither database developers nor application developers easier, because database folks are trying to squeeze out the last bits of performance out of their systems, and application developers require everyday new tools, potentially having more features. And at some point when the obvious optimizations are already done and we still need to increase density of the node or decrease the latency, we have to start making trade-offs, because everything else has already optimized away. And knowing these trade-offs and understanding where they're coming from can be instrumental for creating better tools or building the applications on top of them.

In this talk, I will summarize the modern practices, build vocabulary and, again, it will give you or help us all to gain some more intuition to get started with this subject and systematize it a little bit for the folks who are already familiar with this all. By the way, who has already some familiarity with the concepts such as B-Trees and LSM Trees? Okay. So, I have at least something to talk about. This is good. This is very good. So, B-Trees, obviously, most of the people know already, but at least I can hopefully shed some light on and compare them to the other ones. So, but let's not run in front of ourselves.

Given the talk format, it's quite hard to cover all the latest developments. But I'm also writing a book, as it was mentioned with Riley, and hope it will give you even more details and help to gain even better understanding. If you're curious, hit me up, let's chat about it a little bit, and stay tuned for that. This is not a simple subject and in 45 minutes of the talk, I'll only be able to scratch the surface, of course. And to learn more, you will have to dig into the books, papers, and code. I personally had to read approximately 120 papers in order to cover well the subject in the book, and approximately 6 books which will cover different aspects of each one of the things that we're going to be talking about. So, it's a vast topic, but this hopefully is going to get you started.

I'll try my best to make this talk free of empty promises and all sorts of silver bullets, and present only face facts and make it unbiased, even though I work for Apache Cassandra, I mean, I work on Apache Cassandra, not for. And, I mean, I've seen the code and I know the trade-offs that Apache Cassandra is making. However, it doesn't mean that I'm going to be saying that this is the only way to go. This would be unfair of me. And after all, there are no clear winners and we have yet to see absolutely optimal system. And actually I doubt that this is going to happen.

Access Patterns

Before we dig into the storage systems themselves, let's just start with some terminology. One of the most important things that sets different storage systems apart is Access Patterns. If we were to store data on the, let's say, vinyl records, Access Pattern would describe the needle movement. Usually we distinguish between random, sequential, and mixed Access Pattern and, well, mixed is kind of clear. So this is a mix of sequential and random. Let's take a look at what sequential and random actually means. By sequential accesses, we usually mean reading contiguous memory segments without seeks in between. This is as if you were playing your favorite album where tracks are perfectly laid down by artists and exactly in the way that you would enjoy listening to them. This happens not very often, but it does. And random access is difference in that it doesn't read contiguous memory segments. So, it has to perform seeks in order to find the locations for the next read. These seeks are hard to predict. We say it's random. It's kind of like you're listening to an album where you only like a couple of songs. So you listen to the first one and then you skip a couple, then you listen to another one, etc. So, you kind of have to seek for the songs and maybe even perform them not in the order that the artist has written them on the vinyl record.

So, what we have discussed so far is related only to the reads. Of course, the terminology doesn't change for writes at all. So, sequential and random writes have pretty much the same meaning, but semantics are slightly different. For example, sequential writes do not always result into the sequential reads. Data that is written closely is not necessarily going to be read together. So, if you have data stored in the B-Tree on disk and in order to maintain key order and allow sequential access for range scans, which is often done- like whenever you have in a SQL statement with where X larger than 1, and then you scan a large range of data- you have to perform random seeks in order to locate the data. But then you can start reading at one point and continue reading until you reach either the end of the data or the end of the range that you were searching.

In order to achieve sequential writes, however, we use either in-memory buffering or append-only storage. Of course, append-only storage due to its nature results in random reads because the order in which we are randomly writing the data into the database, is definitely, almost definitely, not going to be the order in which we would like to have it read, unless it's some sort of a log storage, like for instance in Kafka, so whenever you actually would like to do that. So to put the data that will be read together, we have to kind of prepare it for it. It's done by collecting and buffering the records in memory, then sorting them, and then writing them down on disk. You see that we can achieve sequential reads by either random writes or sequential writes. But will we always have a trade-off, locating write point or buffering when sorting the records. So it's one of the two overheads. This dichotomy is going to hunt us throughout the talk and we'll see that its implications are going to be extremely important for database and storage design.

And while SSDs have eliminated the cost of random IO reads compared to sequential, we have to keep in mind that it applies only to block reads and writes, since block is kind of the smallest atomic entry that can be transferred between the block device and your application, well, with many layers in between. Reading just a few bytes out of the block means that you have to transfer the whole block from disk anyways. So, for example, things like range scans can be only effectively implemented in the storage where data is laid out and prepared for sequential reads.

The situation with writes is that they're both bad for SSDs and hard disk drives, but for slightly different reasons. On hard disk drives random IO is bad because the arm has to seek to the correct track, in order to navigate from one position to another to perform a write. And with SSDs the reason is, well pretty much how they're built. Most of the modern SSDs are built using NAND gates and AND gates. So, due to optimizations and the cost efficiency, the smallest erasable entity is a series of blocks. And the data can be written only after the blocks are erased fully. So, this means that the write operation can only set the single bits, but erase operations can unset but not single bits, but entire units like several blocks at a time.

And this all on a very low level. Then one level above that, there is a Flash Translation Layer which is doing a job of keeping record of these abandoned pages, like the pages that were once at one place, then were read and has to be re-written now they have to be relocated, and the previous is abandoned. And now, if there are memory segments which are close to it, we have to relocate them as well in order to clear out the entire block. So, the Flash Translation Layer will have to do all this process of garbage collection.

So, eventually we run out of empty pages after doing lots of writes and have to garbage collect and nullify, basically unset those blocks, possibly relocating their neighbors. This operation is performed during writes and can increase the cost of writes. Since every write implies block relocation and subsequent garbage collection, we have to keep the write amplification as low as we can in order to reduce the amount of IO in general, which will subsequently also reduce the amount of GC we're going to have.

In order to summarize everything that we were just saying about the sequential and random IO, we say that sequential IO is generally good, since it's more predictable and it gives you just more material to work with, to optimize, to make it faster and better. But because we cannot avoid random IO altogether in life, we have to balance between what makes sense and what's best for the database.

On Disk Data Structures

And another important concept and distinction that database storage designers take into the consideration is the mutability or immutability of the disk format or of the file database structure. This also has significant implications to the display out, construction, on the maintenance process, and many other things. Mutable data structures usually pre-allocate the memory and do in-place updates. So basically, find the thing on the disk and update it in place wherever it was, in order to amortize the cost of writes for the data that is stored together. This usually results into random IO; in order to perform a write, a bunch of reads have to be done in order to locate the destination. And then we can do the actual write, and the writes which are close to each other in time, will not most likely be written together. Since updates are made in place, all the data is read from a single source and reads will not have to merge and reconcile data from different sources, because we basically have a single source of truth. So, single file we just read from it.

Because it would be unsafe to concurrently modify the data that can change anytime underneath of us, accesses and modifications have to be guarded with concurrency primitives that help with mutual exclusions. These are locks which guard B-Tree data integrity and latches which guard B-Tree structure integrity. These are two different concepts, slightly overloaded terms, and I would really recommend listening to, for instance, CMU talks by Andy Pablo; he really well explains the concepts of locks and latches in many details.

The immutable data structures on other hand, require no memory overhead for the subsequent updates, since files are pretty much written on disk just once and will never be changed anymore. However, in order to perform a read, since all the files are immutable, multiple versions of the same record of data for the same key located in the several different files. We will have to read all of the files, reconcile, merge the data together and only then we'll be able to return it to the user. Writes here are sequential. So data is batched up in memory and sorted and written out sequentially. And since files are not modified on disk and updates would effectively mean pretty much rewriting the whole file, immutable data structure require merge from different sources before returning the data to different clients.

Log-Structured Merge-Trees

Now, let's move on to the first data structure that I thought would be great to discuss today, and it would be log-structured merge-trees. We covered the basic vocabulary and we can start talking about it. LSM Trees is an immutable disk resident, write-optimized data structure. LSM Trees have been getting more attention because the insert, update, and delete operations, and even for their sorted variants, do not require random IO, which is good. So you don't have to locate the place where you're going to write on disk at all in order to use LSM Trees.

In order to allow sequential writes, LSM Trees batch up writes and updates in memory resident table. So it's somewhere in RAM. It's often implemented using some sort of sorted data structures such as a binary search tree or a skip list. So anything that would allow you quickly searching; logarithmic look-up time would be pretty good. And when the size of this table, memory-based table is reaching a certain threshold, its contents are written on disk, and this separation is called flash. After a few flashes are performed and data ends up being split between multiple on-disk tables, now, in order to retrieve the data, we have to search old disk resident parts of the tree, check in memory table and their contents before returning the result itself. So writes are only addressing the memory resident table, while reads have to reconcile the data from everywhere, from memory and disk.

Disk resident and table files are immutable. They are written only once and never modified and can only be deleted on later stage. And we're going to be talking about it later in the next few slides. On the last slide we've just seen that when the data is read, it has to be merged from the several multiple sources. Now, let's consider a specific example. Here we have two tables and they hold the data for the same key. This means that a record was first inserted and then updated. And now the former version of this data has to be discarded.

Now, let's check out how this is happening from the keys and values perspective. Record with a key Alex was written with one timestamp and then on the later stage updated with higher timestamp. So the resulting table, the merge table, will only have the updated value for this key. And record with the key John gets deleted in the later table, and will be discarded during the merge process and not show up in the merge table at all. And the other two records are just copied verbatim, so to say, because they're not shadowed, so they appear only in one of the tables and not the other.

To summarize the merge arithmetic, we have a quite simple operations going on here. From the data standpoint, merge is reconciling the records from multiple sources, so records have a key and timestamp associated with it. And if two tables hold the same value for the same key, meaning that an update has occurred, only the record with the latter or later timestamp will be picked and the previous one will be discarded. Since the read operation cannot physically remove the data from immutable data structure, we have to use something called dormant certificates, sometimes also called tombstones. They indicate that a record for a certain key has to be removed starting with a certain timestamp, and during the merge process records shadowed by these tombstones are going to be discarded.

And many modern LSM implementations such as RocksDB, Apache Cassandra, choose sorted string tables as their file format because of its simplicity. SS tables are persistent maps from keys to values ordered by the key. Structurally, SS tables are split in two parts. It's an index and the data block. The index block contains keys mapped to the block offsets, pointing to where the actual record is stored or located; and index is implemented using a format also with good look-up guarantees such as B-Tree, which is good for range scans, or if you just need point queries, you can use something like a hash table.

And SS table has several very interesting and nice properties. First of all, for the point queries, in order to find the value by the key, it can be done quickly by simply looking up the index and locating the data than following offset and reading the data from the data file. And the range scans can pretty much do the same thing, so they can first jump to the beginning of the range, and then keep reading the data from the sorted data file without performing any extra seeks or anything else. So, logically, the first SS table after a flash, represents a snapshot of all the database operations over the period in time. So, because SS table was created in this flash process from the memory base data structure as we have discussed on previous slides.

So far we've covered that SS tables are created by the flash process and from memory resident tables, but because SS tables are immutable and are written sequentially and hold no reserved space for in-place modifications, if we had a single SS table, any insert, update, or delete operation would probably mean that we have to re-write the entire file. So over time, the amount of such tables created by flash will only grow and reads will be getting more and more and more expensive because we will have to merge more and more data files. And data for the same key will be pretty much located everywhere, and records will be shadowed by the deletes, etc. So we will not only waste time, but we will also waste a bunch of space because of all these duplicates everywhere.

In order to reduce the cost of reads, reconcile disk space caused by these shadowed records and reduce the amount of these tables that have to merged, the LSM Trees propose a process that reads complete SS tables from disk, merges them, and then writes them back. This process is called compaction. As we have discussed, SS tables are sorted by the key and compaction works sort of like merge sort, which makes it very efficient. So records are read from several sources sequentially, then merged and appended to the file also in a sequential manner. And one of the advantages of this merge iteration is that it can work very efficiently also for the data sets that do not fit in memory. And resulting table preserves the order of the original SS table, so you will be able to repeat the process of compaction over and over your data set. During the process, merged SS tables are discarded and they're replaced by their compacted version. After compaction is done, the amount of SS tables is reduced and the queries are made more efficient.

To summarize what we have just learned about LSM Trees, first and foremost, they are immutable, and their contents are never changed on disk. In order to reconcile the disk space, we use the process which is called compaction, which merges things together and discards the old data and everything that we do not need anymore. And LSM Trees are write-optimized. So any non-durable insert, update, or delete does not require any disk seeks whatsoever. And, let's call it a day. So that's it.


So, now let's talk about the B-Trees, which most of you seem to know, but I will only scratch the surface. I will not go into the details that you probably already know, and try to concentrate on the things which are lesser known. So, one of the prominent papers on the B-Trees is called Ubiquitous B-Trees, which describes in great detail several variants of B-Trees and their applications. The original paper was published by Bayer and McCreight back in 1972. But this paper or this data structure remains important and useful up until today.

Researchers still come up with new ways to optimize B-Trees and come up with their different variants, and some of them are extremely useful. Before we dig into the B-Trees, let's remember their predecessor, the Binary Search Tree, which logically preceded them. The Binary Search Trees are useful as an in-memory sorted data structure, but they are not very good to be used on the disk because of the balancing. So you have to perform balanced nodes of the BSTs in order to keep them balanced or keep the height to the minimum. And because of the low fan-out, meaning that they only have two child nodes per node, which would not be very efficient given the block size of even four kilobytes. So this all doesn't work very well in disk, therefore the community came up with B-Trees and B-Trees allow storage of more than two pointers per node, and work very well with block devices, by matching node size to the page size. And today of course there are many implementations that are used, larger page sizes, but I think it's kind of irrelevant to the talk.

B-Trees have the following properties. First of all, they're sorted, which allows the sequential scans and simplifies look-ups pretty much the same as with SS tables that we just talked about. B-Trees are self-balancing, so there is no need to balance the tree during insertions and deletions. So the height is going to be uniform across the tree. And when the B-Tree node is full, it has been split in two parts. When the occupancy of the neighboring nodes is low enough, the merged are going to be merged together. This also means that leafs, the bottom layer of the tree, are going to be always equidistant from the root of the tree.

And probably the most important thing is that B-Trees are mutable. So inserts, updates, and deletes are performed on disk in place. In order to make this updates on in-place possible, B-Trees first have to locate the data, unlike LSM Trees that we just talked about. Additionally, in order to avoid resizing nodes during each update, B-Trees reserve some space for the future inserts and update operations. So these two things really set them apart from the LSM Trees.

And let's take a look at an anatomy of the B-Tree, how they're built. B-Trees consists of disk segments, usually called nodes or sometimes pages, and which is distinguished between several types of them. And we say there is a root, the one that has no parents. So, meaning that it is not child to any other node, and internal nodes which connect the top layer to the bottom layer and the bottom layer which stores, holds the data and it's usually called leaf. And B-Trees are characterized by their, first of all, branching factor. And a branching factor is an amount of child pointers, like a fan-out, how many child pointers there will be per every node in the tree.

This also means that there will be pretty much as many keys as pointers. And the B-Trees are also characterized by their occupancy, meaning that how much free space pretty much, the B-Tree node has or each node has. They're also characterized by their height, and height is an amount of B-Tree levels; how many pointers you have to follow between the levels in order to finish your look-up process. And height is a logarithm of the amount of keys over the nodes. And this also explains you how many disk transfers you will have to perform in order to finalize your read. And every non-leaf nodes in the tree holds keys and pointers to the sub trees. Leaf nodes may also hold the pointer to the previous and the next node in order to have more efficient navigation on the leaf level.

As you've just seen on the previous slide the node consists of sorted keys and pointers. The keys are sometimes called separator keys, because they separate the tree into the sub trees. And the amount of pointers is always greater than the amount of keys by one. So, separator keys also can be thought of as they were splitting the range to before the first key, between middle keys, like any pair of middle keys, and after the last key. When performing look-ups, you start with the top node, then follow recursively the pointers down to the leaf level. When the point queries perform, the search is complete after you have located the leaf node. In case of the range scan, you have to traverse the keys and values until your range scan is complete.

In terms of complexity, B-Trees guarantee logarithm of the base two look-up, because finding a key is performed using the binary search. So, in order to find a key within each node, we still have to binary search. So it would be incorrect to say that we have the logarithm of height of tree complexity here.

When performing insertions; we have to locate the target leaf for that we can use the look-up algorithm that we just discussed on the previous slide. And after the target leaf is located and keys and value appended to it, in case there is not enough space in the node we have to- the situation is called overflow and the leaf has to be split in two parts. This is done by allocating the new leaf, moving half of the elements to the new one, and pulling the first key of the newly allocated node to its parent. And if the parent doesn't have the space either, the split is performed on the parent level, and so on and so forth, until we reach the root level. This also implies that the tree always grows from the root level. So there is no other way to change the root of the B-Tree except for by splitting all the nodes recursively up to the root level and then splitting the root. Deletions are similar to insertions, but they cause node mergers, not splits. And that's pretty much it from the algorithmic perspective.

In order to summarize, the B-Trees are first and foremost mutable. So they allow in-place updates, but they have to introduce some space overhead. And in order to amortize the subsequent writes for the data that sorts closely, writes have to locate a point on disk where the data has to be written, unlike LSM Trees where the writes are just going to memory and the disk seek doesn't have to be performed at all. B-Trees can also be used in LSM storage as an index, but then you have the B-Trees with 100% occupancy, and B-Trees are optimized for reads, so they do not require reading from, and subsequently merging the data from multiple sources in order to satisfy the query. Writes and deletes might trigger a cascade of splits and mergers as we just talked on the last slide, making some of the operations, but not all of them, more expensive.

There are more techniques and implementations that people are using in order to optimize it, so there is much more to splits balancing and merging in the B-Tree subject, but it's outside of the scope of this talk. They're optimized for paged environments and help to manage the sorted data which is located in pages on disk. Over the time, after a bunch of subsequent updates and deletes, the B-Tree nodes will get fragmented. This means that they also require some maintenance and block rewrites, but this cost is usually amortized or absorbed by the fact that B-Tree implementations are using something called buffer pool, which kind of also helps to buffer the update operations before the data is actually reaching the disk. And lastly, the concurrent access to B-Trees require reader, writer, isolation, and involves the chains of locks and latches.

Looks like we are on time, we still have 10 minutes. When developing a storage system, we're always confronted with the same challenges and have to consider the same factors. Making a decision about what to optimize for is extremely difficult. One can spend more time during the write in order to lay out the data structure in a more efficient way, reserve extra space for in-place updates to facilitate faster writes, buffer the data in-memory in order to ensure sequential access. You can do anything, but you cannot do all of these things at the same time.

An ideal storage system would have the lowest read cost and lowest write cost, and have no overhead. In practice, the data structure is compromised between multiple factors and it's important to understand these compromises. And the researchers from our DBLab summarized the three key parameters the database systems are optimized for. They say it's optimized for either reads, updates or memory overhead. Understanding which one of these three parameters is most important for your use case can influence your choice of data structures, access methods, and even suitability of certain tool for certain workloads, as some algorithms are tailored specifically having one use case in mind.

Read, Update and Memory Overhead

The RAM conjecture that our DBLab folks came up with states that setting up an upper bound for the two of dimension overheads also sets a lower bound for the third one. In other words, the three parameters form a competing triangle, an improvement on one side which means compromises on the other two.

For example, the B-Trees are read-optimized, but they have to reserve empty space for updates. So, one against the other, therefore resulting in the memory overhead, and have higher write costs because they have to locate the target place where to write the node on disk during the write time. The B-Tree is optimized for read performance, so index is laid out in the way that minimizes disk accesses required to traverse the single index in order to locate the data. This is achieved by keeping the index files mutable, but B-Trees also imply write amplification resulting from node splits and mergers, and also subsequent updates of the data which is written closely together in the same page, but written during the different periods in time.

In order to amortize the update costs, B-Trees reserve extra space on nodes at all the levels. And that helps to avoid to change the node size during each write, otherwise you would have to resize it constantly, and maintain a tree in fixed size pages which simplifies the code and understanding the layout. In short, B-Trees trade update and memory overhead for better write performance. Excuse me, for better read performance, of course. On the other hand, LSM Trees have less space overhead at a cost of read overhead coming from having to access several files, several tables on the disk. LSM Trees optimize for write performance. Both updates and deletes do not require locating node on the disk or where you have to write on the disk, and guarantee sequential writes by buffering everything in-memory and only then writing it down. This comes at a higher cost of maintenance, which is done by doing compaction that we just discussed in several slides before.

Reads are getting more expensive as the data has to be read from multiple sources and merged, and compaction is helping us to mitigate this problem and reduce the amount of access tables. At the same time LSM Trees eliminate memory overhead by not reserving the empty space and allowing - well, they can also allow the block compression but I think that's also not super relevant. So, in short, LSM Trees trade read performance and maintenance for better write performance and lower memory overhead. So, once again, the same triangle all over again.

So, there are data structures optimized for each desired properties, so you can use LSM Trees or B-Trees, but you cannot really put their properties together and have a data structure which will be equally good for every single direction and every single overhead out there. There are pretty much three tunables that we have discussed today: read, update, and memory overheads. And they can help you to evaluate the database and deeper understand the workloads that are best suitable for them; or the other way around, to pick the database which is best for the workload that you are having at the moment.

And, of course, there are many other factors to consider in the storage system, or especially in the database. Like there is a maintenance overhead, operational simplicities, system requirements, and so on and so forth. But this RAM conjecture and understanding, well, at least the basic algorithms could give you a good rule of thumb in order to develop an intuition and maybe know how to even test the database for the workloads that you would like to test it for.

And many things vary from implementation to implementation. So, even the two databases with a similar storage engine, which seemingly should implement even the same paper and have same design principles, may end up performing extremely differently. Databases are complex systems with many moving parts and, unfortunately they are, or maybe fortunately, an integral part of many applications. And I hope that this information will help you to peek under the hood of the database and knowing the difference between the underlying data structures, and this simple rule of thumb will help you at your work to decide what's best for you and your application.

I would really advise to take a look at least at these four books, because they really help to get you some understanding of basic data structures and what was going on in the database world over the last couple of decades. For the, so to say, "newer stuff" you can just follow me on Twitter and I'm trying to post the papers, and soon- hopefully maybe sometime next year, let's see about that- there will be also a book coming out that will help you to develop even further knowledge. But generally even if you hunt for papers, you can always hit me up on Twitter.

Questions and Answers

Participant 1: Hi, from the top of your head, do you know any metrics, like how faster B-Trees than Binary Trees in the writing …?

Petrov: Well, from theoretical perspective, B-Trees are not faster than Binary Trees, and I would say because they are mostly used in absolutely different workloads, so B-Trees are mostly for in-memory or, excuse me, for on-disk data structures and Binary Trees are mostly for the in-memory stuff, they are so different that I wouldn't even be able to compare them. But comparing the LSM Trees and B-Trees, there are plenty of comparisons, and probably the most prominent of them that I can name is done by WiredTiger folks. This is a storage engine currently underneath MongoDB, and WiredTiger actually has an implementation of both LSM Trees and B-Trees, and it is solid material, really recommended to check out. And they also have a big Wiki page which compares B-Trees versus LSM Trees for a bunch of different use cases. It's really awesome with very deep analysis. So definitely they're recommended. But sorry, I cannot really respond to B-Trees versus Binary Trees.

Participant 2: So, both those data structures are created in the age where we have only had a hard drive, like with the disks, and today when we've got SSD disk. So how are they applicable right now when random access is not much faster on SSD. I'm wondering, there is maybe a third way?

Petrov: There are. There is a third way. Unfortunately, I tried my best to jam it in. So, there is also something called unordered LSM Trees, which is basically LSM Tree which allows you random access. One of the most prominent implementations of such thing is called Bitcask. It used to be in used in Riak, by a bunch of folks, open source by bet365 just a year ago or so, or maybe it was always open source, it's just like they continued to keep it alive. But yes, Bitcask is definitely one of the implementations. What they do is basically they have this write ahead log, and sequentially write all the data in the log, and the keys, they keep them in-memory, and they have a hash map that points to pretty much the last point in the log where the data is located.

There is a paper. So I haven't seen the implementation. The paper is called WiscKey, so I think it's Wisconsin University or something. So WiscKey database, it's a very nice paper from, I think it was Sigma, very recommended to read as well. So they are talking about pretty much the same concept, meaning that you write the data into log only right away, so without any thing in between. But you still maintain the LSM Tree but only of the keys.

So, effectively your compaction costs drops significantly at the … Okay. Now I kind of start seeing why everything I was saying doesn't make sense. Now I try to put it back, so, I apologize. So why this makes sense or didn't make sense previously and now it does, is because you can search for keys very quickly and retrieve them in blocks and basically read them out, and WiscKey for instance even can utilize the SSD parallelism. So they can submit several operations and then get the data quicker, and they claim that they pretty much can be absolutely optimized like data structure in terms of reads and writes and space overhead.

But a slight pitfall is that when you are trying to do the range scan, you are still going to incur the cost of retrieving the data. For instance, you need to have one record from the block over here, one record from the block over there. And it doesn't matter how quickly you can read out those blocks because it's not byte addressable storage. So, if we take all of that and wait three years until something like NVMe, really gets implemented and we have something like byte addressable storage without - I mean, there is NVMe right now, but there is no good way to really not transfer blocks. I think that the minimum addressable should be something like a cache line or something, but not like entire block. But until we get there, I would say that we will have to stick probably to one of the two which are proven to work. And the third one is a good and great area of research. Definitely read up on that. So, WiscKey and Bitcask, and Bitcask is pretty solid but only for hash map. So it's like no range scans. WiscKey adds range scans but at a cost. So, hope that answers.


See more presentations with transcripts


Recorded at:

Jan 15, 2019

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

  • Can't say this was a good presentation

    by Ranjix ranjix,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Interesting subject, somewhat basic approach to subject (not bad), unfortunately not that focused and kind of hard to follow (for how basic it was).

  • Great effort.

    by abdo abdo,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    It was really informative. Thanks

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p