Transcript
Greef: Why design a new database? There was a time when you could have any database as long as it was MySQL or Postgres. These systems took 30 years to develop. They were tried and tested, and people thought twice before replacing them. Then something happened. A wave of discoveries in the research around durability, efficiency, and testing grew more difficult for existing database designs to retrofit. At least, this was our experience of the impact of this research on our design decisions for TigerBeetle, and why we decided to start fresh. TigerBeetle is a new, open source distributed database, where some databases are designed for analytics, others for streaming, and still others for time series, TigerBeetle is designed from the ground up for tracking balances, to track the movement of value from one personal place to another. For example, to track financial transactions, in-app purchases, economies, or to switch payments, record trades, or arbitrage commodities like energy, and to do those with mission critical safety and performance. You can use balance tracking to model any business event. That's because the way you track balances is really double entry accounting, which is the schema TigerBeetle provides as a first-class primitive out of the box. You can spin up the replicas of a TigerBeetle cluster with a single binary, and then use a TigerBeetle client to connect to the cluster to create accounts and execute double entry transactions between accounts with strict serializability.
TigerBeetle is designed for high availability with automated failover if the leader of the cluster fails, so that everything just works. We wanted to make it easy for others to build and operate the next generation of financial services and applications without having to cobble together a ledger database from scratch, or to execute manual database failover at 2 a.m. With a tightly scoped domain, we have gone deep on the technology to do new things with the whole design of TigerBeetle. Our global consensus protocol, local storage engine, the way we work with a network disk and memory, the testing techniques we use, and the guarantees that TigerBeetle gives the operator, first and foremost of which is durability. What is durability? Durability means that once a database transaction has been acknowledged as committed to the user, it will remain committed even in the event of a crash. It's fine to lose a transaction if the transaction has not yet been acknowledged to the user, but once it's been committed, it can't be lost.
To achieve durability, a database must first write the transaction to stable storage on disk before acknowledging the transaction, so that after a crash, the transaction is still there. However, writing data to disk is an art or a science. Blog person papers have been written about how hard it is to get right. It's tempting to use a file system and not a database, but writing to a file from an application when the plug can be pulled at any time is simply too hard, at least if you want consistency. We might think to ourselves, surely we know by now how to use rename to do atomic file updates. Then you hear this fresh report from three weeks ago that renameat is not linearizable on Windows Subsystem for Linux's ext4. This is why our applications trust the database with durability to get this right at least in the face of sudden power loss, not to mention gradual disk corruption.
Mmap
There are at least three ways that a database can be designed to write data to disk. The first of these is mmap. Where you map the contents of a file into your program's address space, to give the illusion that you're working with pure memory instead of disk. However, Andy Pavlo and his students at Carnegie Mellon wrote an excellent warning on the pitfalls of mmap to motivate why mmap is not acceptable for database durability. Some databases do still use mmap. Since we're going to cover many of the same pitfalls, as we go along, as we look at other designs, we won't go into mmap further, except to shine a spotlight on the paper.
O_DIRECT and Direct I/O
Next after mmap. There's O_DIRECT or direct I/O. This is where the database takes responsibility for working with a disk directly, so that writes and reads go directly from user memory to disk and back again, and they just bypass the kernels page cache. It's much more work, but the database has full control over durability as well as caching. However, this is what Linus has to say. The right way is to just not use O_DIRECT. There is no valid reason for ever using O_DIRECT. You need a buffer whatever I/O you do, and it might as well be the page cache, so don't use O_DIRECT. If we go with Andy Pavlo, and we steer clear of mmap, and if we don't rebel against the PDF file on O_DIRECT, what else is left? Here we come to what many databases do such as Postgres. This is to outsource durability to the kernel with buffered I/O. You write from the database's memory to the kernel's page cache, and then you allow the kernel to flush or sync the page cache to disk. Whenever the database wants to write to disk, it issues a write system call to the kernel, passing the buffer to be written, where on disk it should be written to.
The interesting thing is that this write system call does nothing, at least nothing durable at this point. That's because when the kernel receives the write system call from the database, the kernel simply copies the buffer across to some pages in the page cache. Marks the pages as dirty. In other words, they're en route to disk, but they're not there yet. Finally, because there's no real disk hardware involved, these write system calls don't typically fail. The kernel usually returns success almost immediately back to the database. Then at some point, the kernel is going to start writing the data out to disk. As each page is written, that's marked clean. If the database does nothing more than issue writes to the kernel in this way, then it can't know when these writes are safely on disk. That's not a problem so far because if the machine were to crash right now, then the data would be lost. It wouldn't matter, because the database would not yet have acknowledged the transaction to the user.
Fsync
When the database does want to commit the transaction, when it wants all the dirty pages relating to the transaction to be flushed to disk, again, for durability, then it issues another system call. This time fsync. Fsync tells the kernel to finish all writes to disk before returning. This ensures that all the data can be retrieved, even if the system crashes or restarts. In other words, where a database does make the big design decision to outsource durability to the kernel, then fsync is crucial. This is because most database update protocols such as write ahead logging or copy on write designs, they rely on forcing data to disk in the right order for correctness. However, fsync is not easy to get right because under the buffered I/O design, fsync is where the rubber hits the road, and the data actually hits the disk. Because disks are physical, they can fail in all kinds of ways, either permanently or temporarily. Some sectors might fail, others might not. If you're lucky, the disk will tell you when an I/O fails. If you're unlucky, it won't. We'll come back to this.
The takeaway for now is that since the kernel may know that some of the buffered writes may have hit I/O errors on their way to the physical disk, when the database does eventually call fsync, then it might receive an error back from the kernel indicating that some of the writes in the batch didn't make it. If fsync does return an I/O error, there are three choices that the database can make. Option 1 is just ignore any error from fsync, pretend that the writes didn't fail, which is what some databases used to do in the past. Option 2, retry the fsync in the hope that the kernel will retry all the buffered writes to disk and keep retrying until you're durable, and until you don't get an error back from fsync. Option 3 is just crash the database, and then you restart and you recover from a checkpoint.
The Cracks of Buffered I/O
If you were in MySQL, or Postgres issues, what would you choose? How confident would you be that your choice guarantees durability? If you've heard the story before, and you know the answer, then I promise, there's a twist in the tale. There's something new which I think you will find surprising. Before we get to the answer, I want to warm up with a look at some of the cracks in buffered I/O. None of these cracks on their own are enough to end an era of database design or begin another. I think they point to something shaky in the foundations. First, writing to cache instead of disk means that you lose congestion control over the disk with no mechanism for backpressure. This can result in significant latency spikes, when the system has to write out gigabytes of dirty pages. Second, it's difficult to prioritize foreground and background I/O. You can't schedule your fsyncs apart from other application data in the page cache. Everything is just sharing this one page cache. There are ways around this like sync_file_range, but the man page for that has this friendly warning, "This system call is extremely dangerous and should not be used in portable programs." What does that mean?
Third, buffered I/O is all or nothing. You can't handle errors related to specific writes. If something goes wrong, you don't know what it was or where. Finally, disks have now become so fast on the order of 3 gigabytes a second, they're starting to approach per core memory bandwidth on the order of 6 to 20 gigabytes per second, maybe if you've got an M1. This means that if you're still using buffered I/O, and you're doing memcopies to the kernel page cache for every write, then you're not only thrashing your L1 through L3 CPU caches, you're using up CPU to do the copies, but you're also potentially halving memory bandwidth. This is assuming that the copy to the page cache is the only copy in your data plane. If you need a second copy, maybe for networking with deserialization, then that can be almost all your memory bandwidth gone. When it comes to buffered I/O, there's a crack in everything. That's how the light gets in.
Fsync Returns EIO
Let's return to our foundational fsync question. Again, the question is, if fsync fails with an I/O error, what should the database do? How do you handle fsync failure? Do you ignore the error and pretend that your buffered writes are durable? Do you keep retrying the fsync in the hope that the kernel will retry all the buffered writes that failed? Do you crash and restart? Do you recover from a checkpoint? Of course, ignoring the error is not correct. It's not an option. What about retrying? Indeed, for many years, most databases would retry, for example, Postgres would keep retrying a failed fsync until it succeeded under the assumption that the kernel would retry any failed buffered writes, at least this was the assumption for 20 years. Then something happened. Five years ago, Craig Ringer, posted this post on the Postgres mailing list to report that he had run into real data loss with Postgres. The critical database guarantee of durability, the D in ACID had been violated. What was more stunning, I think, and perhaps the reason that the incident became known as Fsyncgate is that this wasn't due to a bug in Postgres per se. Postgres had followed Linus's advice, and relied on the design of the page cache. Even though Postgres was careful to retry after an I/O error, it was still losing data. This was because in the kernel, when a buffered write failed due to disk error, the kernel was in fact simply marking the relevant pages as clean, even though the dirty pages had not been written properly to disk. This means that Postgres might get the first fsync error the first time around, assuming that another process didn't consume it first. Then the second time around when Postgres retries the fsync, the fsync would succeed, and Postgres would proceed as if the data was committed to disk, despite the relevant pages still not having been made durable. Again, the dirty pages would simply be marked clean in the kernel, and the kernel developers maintained that this mark clean behavior was necessary, for example, to avoid out of memory situations, if a USB stick was pulled out, so if the page cache wouldn't fill up with dirty pages that could never be flushed. It rocked the database world.
After the discovery of Fsyncgate, Postgres, MySQL, other affected databases, they decided to fix the issue by just changing their answer to the foundational fsync question. Instead of attempting to retry fsync after failures before, they would crash, and then recover from the data file on disk. Jonathan Corbet wrote about Postgres's fsync surprise. Then a year later, in 2019, with a fix in place, Tomas Vondra gave a fascinating talk about this, called, "How is it that Postgres used fsync incorrectly for 20 years, and what we'll do about it." Up to now, if I've told the story properly, then I think you'll agree that this was the correct fix for Fsyncgate. This is where the story ends. Perhaps you're still wondering, where are all the seismic shifts in database design that you promised at the beginning? Because it looks like the fix for Fsyncgate was not a major design change, not even a minor design change, but just a simple panic, to restart and recover. It's a clever fix, for sure, but a simple fix.
Can Applications Recover from Fsync Failures?
Here we come to the part of the story that I find is less often told. The story picks up two years later, in 2020, when University of Wisconsin-Madison asked the question, can applications recover from fsync failures? You can probably guess what the answer is when some of the leading storage researchers, Remzi, and Andrea Arpaci-Dusseau, and his students, when they decide to ask the question in this way. If what the first fsyncgate had found was stunning, then I think that what this paper found was even more so. Because while databases such as SQLite, and Postgres would now crash after an fsync EIO error, after restarting their recovery, their recovery would still read from the page cache, not the actual on-disk state. They're potentially making recovery decisions based on non-durable pages that were marked cleaned through the fsync failure. The paper also raised other ways that fsync failure was not being handled correctly by these databases. For example, where EXT4 in data mode would suppress an EIO error, and only return it to the next fsync call. You want to get an fsync error, you have to call fsync twice. Users were still at risk of data loss and corruption.
The most important finding, at least for me, was just this little sentence at the end of the abstract. Sometimes in a paper, you read it carefully, and there's these little things that you almost gloss over and you read again, and there's the sentence that just changes your understanding of things. For me, this sentence at the end of the abstract was one of those. Our findings have strong implications for the design of file systems and applications, that is, databases that intend to provide strong durability guarantees, strong implementation implications for the design of databases. UW-Madison had shown that the classic buffered I/O design for databases of the past 30 years was now fundamentally broken. There was in fact no correct answer to the question of how to handle fsync failure under the buffered I/O design. The answers were all wrong, and not actually sufficient for a database to guarantee durability. Instead, database design would need to move from buffered I/O to direct I/O. Databases would need to take direct responsibility for durability. They would need to be able to read and write to the disk directly, to be sure that they always made decisions and acknowledgments of durable data, instead of basing decisions only on the contents of the kernel page cache.
I think the implications of this design change are enormous. It's one thing to design a database from scratch, like we did with TigerBeetle to take advantage of direct I/O. It's something else entirely to try and retrofit an existing design for direct I/O. The reason is, you need to align all your memory allocations and I/O operations to advanced format: 4 kilobyte sector size, you need a new buffer pool, a new user space page cache. Because the latencies of your writes to disk are now realistic, that is, disk speed rather than memory speed, you can't afford to block asynchronous I/O anymore. You could have taken those shortcuts in your design. Now you actually need to implement proper asynchronous I/O in your write path. If you can do this, there's some incredible performance gains. It's a complete overhaul of your whole database design. Or as Andres Freund on the Postgres team said back in 2018, when this happened, efficient DIO usage is a metric ton of work.
Andres has since done a phenomenal amount of work around direct I/O for Postgres. Andres shared this with me that Thomas Munro is planning to merge some actual DIO support soon. For all these reasons, for all this history, I believe that this event in 28th, March 2018, drew the first line in the sand. This is what marked the end of an era for database design. This was the first line I think, and it almost passed us by. The design of new databases that intend to provide strong durability guarantees would have to change. Not only because of Fsyncgate, so something else was about to happen in 2018. Researchers at UW-Madison were again about to discover something just as momentous, this time not in the buffered I/O design, but in the write ahead log design of almost every database you know.
Crash Consistency Through Power Loss
We started out by saying that, for a database to guarantee durability, it must protect committed transactions through power loss. One aspect of this is the write path, which we've looked at, to compare buffered I/O with direct I/O in how a database writes transactions to disk and then reads them back in at recovery. How does a database actually recover these transactions after a crash? The idea is pretty simple and common to most databases. It's called the write ahead log. I first learned the technique when I was diving into Redis, a decade ago. Redis has what is called the AOF, which stands for append only file. This is a log on disk. When Redis wants to commit a transaction, it first appends the transaction to the end of this log, and then it calls fsync. For example, if you want to execute a transaction in Redis to set the QCon key to London, then Redis would append this to the log.
I've simplified it a little here, but this is an elegant format. First, you've got the number of arguments, in this case, 3. Then you've got the number of bytes in each argument, so 3 bytes for the command, which is SET, 4 bytes for QCon. Then finally, you've got 6 bytes for London. If we then want to update QCon to New York, we would append another transaction to the log like this. The trick is that Redis never updates anything in place in the log, it always appends so that data is never overwritten. This means that if the power goes, then existing data is not destroyed. There might be some garbage that at the end of the log if a partial transaction was being appended, and then the power went, for example, if we tried to set QCon to San Francisco, then the power goes, might end up with this like partial transaction. At startup, Redis can figure this out and discard the partial transaction. This is safe to do because Redis would not yet have acknowledged the San Francisco transaction till it was durably on the log.
This write ahead log design is the foundation for most databases. It's the crucial building block for ensuring atomic changes to database state. It's also the foundation for distributed databases. This is where each node in the cluster will keep its own write ahead log, which is then appended to by the global consensus protocols such as viewstamp replication, Raft or Paxos. However, for distributed databases, the write ahead log is doubly critical. Because if anything goes wrong, if you have a bug that undermines durability, so that you lose a transaction that you've acknowledged to the rest of the cluster, then you've not only lost a copy of user data, which you have. You've also undermined the quorum voting that's going on in the consensus protocol. You mess with the quorum votes. This can easily lead to like split brain, which can in turn cause global cluster data loss. The write ahead log is foundational to the design of a database with a single node or distributed. There are many variants on this design. For example, I've shown you a simplified version of Redis as text format, to give you the big idea, but most databases use a binary format. Then they prefix each transaction in the log with a transaction header. There's also a checksum. If at startup, the database sees that the checksum doesn't match, then the database truncates the log from that point to discard the partial transaction. However, it's critical that only the last partial transaction, a transaction that was being appended as the power went out, is truncated. The database must never truncate any other transactions in the write ahead log, because obviously, these would have been acknowledged to the user as committed. To truncate them would violate durability. At a high level, this is how most write ahead log designs all work today.
If we apply what we've learned from Fsyncgate, how does this design interact with real storage hardware? Can you spot the problem? We've seen that QCon London is already committed as a transaction through a write ahead log. What if while the machine is off, the disk sector containing our QCon London transaction in the write ahead log, what if that disk sector is corrupted, and experiences just a single flip bit, so if the length prefix for London changes from 6 to 4. When the database starts up and reads the log, the QCon key now is going to look like it was set to Lond. Then the New York transaction after that, which was committed, now that's going to look like it was being appended when the power went out, because the log is now no longer going to have the proper format. It's going to have garbage data at the end from O and N onwards. As far as I'm aware, in this situation, most databases would truncate the log before New York, and the committed New York transaction as well as every transaction after it would be lost, violating durability. For databases that checksum the transaction in the write ahead log designs, the London key would also be truncated because of the checksum mismatch, even though it was in fact committed. In other words, a single disk sector failure is enough to break the write ahead log design of most databases. They incorrectly truncate the committed log, potentially truncating tens to hundreds of committed transactions. You can see the big problem, they're all trying to solve the problem of power loss and torn writes after a crash. They're being fooled because now you get a little bit of bit rot in the middle of the committed log, and they conflate that with a system crash, and just rewind everything, which is not correct.
As far as I know, that's every database out there with a write ahead log. If yours handles this, please let me know. With these designs, users will experience silent data loss. Whereas for a single node database, the correct behavior would be to fail fast, then you notify the user of the corruption, because at least then they've got a chance to restore from backup. For distributed databases, it's worse. This single disk sector fault would also have the potential to do more damage. Again, like we said, it can mess with the quorum votes, cause split brain, and that can cascade into global cluster data loss. There, you're not just losing one piece of user data in the transactions, you're actually corrupting everything. The whole data file of every replica just gets messed up. Split brain cascades into cluster data loss. At least this was the finding of the research team that discovered this, as they analyzed the write ahead log designs of distributed systems using consensus protocols such as Raft. They wanted to see, if they can just put in one single disk sector fault on one node on a single machine, what could that do to the cluster? Basically, it could do the worst.
The team was again, Remzi, Andrea Arpaci-Dusseau, students at UW-Madison as well. This time, the students were Ramnatthan Alagappan, Aishwarya Ganesan, who won best paper at FAST for their research on this, which was called Protocol-Aware Recovery for Consensus-Based Storage. It was in 2018. This was in fact, a month before Fsyncgate. It totally passed us by. We heard about Fsyncgate. I don't think many of us have heard about this. Like Fsyncgate, again, it's something we still haven't dealt with fully as an industry, I think. Some cloud providers I know, they've patched their proprietary databases for these findings. I'm not aware of other open source databases that handle this correctly. In the case of Fsyncgate, it took users to experience data loss, then they had to be dedicated enough to report the data loss, and to figure out what had happened. This was not just disk corruption, but a design flaw. The database could have handled Fsyncgate safely, but it didn't, instead it accelerated the storage fault into unnecessary data loss. In the case of the write ahead log design flaw protocol-aware recovery, here, the research community, they've done the favor. They've discovered this proactively, but it's yet to trickle down to inform new design. How long until the write ahead log powering a major Kubernetes deployment is truncated in a way that leads to split brain, loss of the control plane and public outage? I think this is the paper you need to read three times to let the impact sink in.
How Do you Distinguish between Power Loss and Bit Rot in the WAL?
How do you fix a design flaw like this? How do you distinguish between power loss and bit rot, to know whether to truncate it to unwrite the power loss, or to raise an error, or do distributed recovery in the case of bit rot. The strategy given by the paper, what we do in TigerBeetle is to supplement the write ahead log with the second write ahead log. We've actually got two write ahead logs in TigerBeetle. We love them so much. The second write ahead log, it's very small. It's lightweight. It just contains a copy of the headers from the first. First write ahead log has got your messages and your headers, or your transactions and transaction headers. The second write ahead log just has the transaction headers. This turns the write ahead log append process into a two-step process. First, you append to the log as you normally would. Then after you append the transaction, you also append the small transaction header to a second log, so you got another copy of it. This enables you to distinguish between corruption in the middle of the committed log caused by bit rot from corruption at the end of the log caused by power loss. Again, like the fix for Fsyncgate, it's a major design change for a database to have two write ahead logs, not just one. It's not always easy to retrofit.
There were also other findings in the protocol-aware recovery paper from UW-Madison, they impact how we design distributed databases, showing that for protocols like Raft, if the output is lucky, and a single disk sector fault doesn't lead to global cluster data loss, then the fault can still cause the cluster to become prematurely unavailable, because of how storage faults are handled in the write ahead log. You're paying for this durability, but you're not getting the availability. Fixing this is also going to require design changes. For example, in the past, global consensus protocol, local storage engine would be separate modules or components. We used to think of these as completely decoupled or not integrated. If you want your distributed database to maximize availability, how your local storage engine recovers from storage faults in the write ahead log needs to be properly integrated with the global consensus protocol. If you want to optimize for high availability, you want to tolerate up to the theoretical limit of faults that your consensus protocol should be able to tolerate, then it's no longer enough just to cobble together off-the-shelf Raft with off-the-shelf LSM-tree like Rocks or LevelDB. Instead, you need to take both of these, according to the paper, and they show you how to make them storage fault aware. How to talk to each other, so they can both recover. It sounds complicated, but it's really simple. It's an obvious idea. Use your replicated redundancy to heal your local write ahead log. This is like a design change. It's fundamental enough, I think, to signal a new era for how we design distributed databases. You have to use a consensus protocol that implements this, and the storage engine that implements it. Both of them must talk to each other with these new interfaces. It's a big design change.
Summary: Fsyncgate and Protocol-Aware Recovery
To summarize what both Fsyncgate and protocol-aware recovery from UW-Madison have in common, is that I think storage faults force us to reconsider how we design our databases to write to disk, or even just to append to write ahead log and how they recover at startup. If we can extract a principle from this, then the principle is that if we want to guarantee durability, we need to move beyond a crash safety model. This is the model that we've had up until now, let's just survive durability through power loss. We need to move beyond that. This idea that the plug can be pulled any second. Sure, we need to support that, but we need more. We need to actually adopt an explicit storage fault model, so that we can test our designs against the full range of storage faults that we expect, not only crash safety, but storage fault safety. In the security world, this would be the equivalent of a threat model. Just as it can make for better security design to be upfront about your threat model, so I think it can make for better durability or just durability, if we expect the disk to be not perfectly pristine. This is formal proofs for Paxos or Raft. This is what they assume, perfect storage. Rather, what I think we need to do is think of disks as near-Byzantine. That's what we call it at TigerBeeetle. It's our own made-up term. It's somewhere between non-Byzantine fault tolerance and Byzantine fault tolerance. Disks are somewhere in between, they're near-Byzantine. You do well if you expect the disk to be almost an active adversary.
For TigerBeetle, we saw both these events of 2018 as an opportunity. We could take the growing body of storage fault research. There's so much out there. All the other papers coming also from UW-Madison, the Vanguard. We could take that, design TigerBeetle to not only be crash tolerant, but also storage fault tolerant, to start to see disks as distributed systems. One disk, there's so much stuff going on there in terms of failures. It's like the network fault model, like network partitions, you can get cut off from disk sectors. You start to see just a single disk as a distributed system. It's a whole faulty microcosm where you can have bugs in the disk firmware, in the device drivers, even in the file systems. You can have latent sector errors, at least where you get an explicit EIO from the kernel as we've seen. Then you can also get silent corruption where you don't. This could be caused by bit rot, even lost or misdirected reads or writes. This is just where the disk sends the I/O to the wrong sector for some reason.
For example, here's how we designed TigerBeetle to read our write ahead log at startup, we've got these two write ahead logs, even in the presence of storage faults. What we really liked about this was we enumerated all the possible faults we expected according to the fault model, while we're recovering from both write ahead logs. You've got the transaction log, the header log. Then we enumerated all these combinations in a matrix. My co-founder and I, DJ and myself, we worked these out together, the countless cores, worked them out by hand. Then we generated all the failure handling branches in the code dynamically. This way, we can ensure that we wouldn't miss a fault combination, we just enumerated them all, worked out what the proper response should be. Then we generated it in the code so that all the branches are there. Of course, these storage faults are rare compared to all machine failures. In large scale deployments, even rare failures become prevalent.
Taking Advantage of Direct I/O From a Performance Perspective
At the same time, while we had the opportunity to focus on safety, to design for an explicit storage fault model, and for direct I/O from scratch, we also took the opportunity to ask, how can we take advantage of direct I/O from a performance perspective? In the past, when working with direct I/O, you would use asynchronous I/O, like we've said, so that your system calls don't block your main process. However, historically on Linux, because the I/O API for async I/O wasn't great, you'd have to implement the solution of asynchronous I/O by means of a user space thread pool. Think of libuv. Your database control plane would submit I/O to a user space thread pool, which would then use a blocking syscall to the kernel to do the I/O. Problem with this is that your control plane context switched to your thread pool, then your thread pool must pay the cost of the blocking syscall to the kernel. Then you must context switch back to your control plane. In July 2020, as we're designing TigerBeetle, we're asking the performance questions, what if we could have first class asynchronous I/O but without the complexity of the user space thread pool? With the cost of a context switch in the range of a microsecond, context switches approaching the same order as just doing the I/O itself. What if you also didn't need a context switch? We love to ask these questions. Let's just get rid of the thread pool, just get rid of the context switch. Even better with the mitigations after Spectre and Meltdown, this was making syscalls slower. We asked, what if you just didn't need to syscall into the kernel?
We wanted to embrace direct I/O almost to the point of bypassing the kernel completely. We're already bypassing the page cache and now we're bypassing the context switches and the user space thread pool and the syscalls. Except, obviously, we didn't want the complexity of bypassing the kernel completely. For example, we didn't want to use kernel bypass techniques like SPDK, or DPDK, because, firstly, we're just not that good to do that. That's what Redpanda do, which is amazing. One of my favorite databases. That's high performance but also more complex, I think. Finally, we wanted a unified asynchronous I/O interface with a simple API for networking and storage. Here, again, we're lucky with the timing, because just as we were thinking about these things, Jens Axboe comes along with his new io_uring API for Linux. This is arriving on the scene and starting to make waves. The reason is that, like Jens just singlehandedly fixed Linux's asynchronous I/O problem. He landed a series of magnificent patches to the kernel. He actually almost started the end of 2018, just touching AIO a little bit. Then, Jens, there comes the first io_uring patch, early 2019. Here he gives user space a ring buffer, to submit I/O to the kernel without blocking. Then he gives the kernel another ring buffer to send I/O completion callbacks back to user space without blocking. No syscalls could be amortized. No need for user space thread pool. You can now have a single threaded event loop control plane. Then you can use the kernel's own thread pool as your data plane. No more thread pool. You get to use the kernel's own thread pool as your data plane. It's much more efficient. Also significantly simpler, which I really love.
If you've enjoyed Martin Thompson's brilliant talks at QCon, over the years, then you'll recognize the design of io_uring has what Martin calls mechanical sympathy. If you don't know this word, I know you didn't watch Martin's talks, but please go watch them. They're fantastic. They've made a huge impact on TigerBeetle's design. I think again, io_uring is a beautiful design. What I love most of all about io_uring is how it not only gives the perfect interface for file and disk I/O, but also for network I/O. Historically, storage and network were different APIs. Now, with io_uring, you've got a unified interface for both. We really couldn't ask for a better I/O API. In fact, it's so good, I think io_uring alone is a perfect excuse to redesign the way that our databases do I/O in 2023.
New Databases for the Future
Given these major design changes already, at what point do we start to consider the next 30 years of databases? If we're going to write new databases for the future, we've made that decision, we're going to do this, if we're going to write them with new designs, what languages are we going to write them in? Are we going to use the systems languages of the last 30 years, C, C++? Are we going to use the systems languages of the next 30 years? These questions were on our mind when we were deciding whether to write TigerBeetle in C, or else in a new systems language called Zig that we've been following for two years by this point. Barbara Liskov said that if you want to teach programmers new ideas, you need to give them new languages to think those ideas in. What we came to appreciate with Zig, is that it fixed all the issues we had with C, also made it easier to write correct code. It also resonated with our thinking on how you work with memory efficiently as you do systems programming. Memory is probably the most important aspect of systems programming. How good are you working with memory? How sharp are your tools to do that?
For example, it was refreshing how easily we could implement direct I/O in Zig, where all the memory that you pass to the disk is sector aligned, where you can enforce this in a type system. Even the allocators, you don't need special calls anymore just to do an aligned allocation. With Zig there's also no second-class macro language. Instead, you simply program in Zig at compile time as your binary is being compiled. Your meta language is also Zig. We considered Rust, but io_uring and the ability to use the kernel thread pool without context switches meant we had less need or desire for fearless multi-threading in user space. We didn't actually want to do that. We were trying to get away from that. The borrow checker would still have been useful for a single threaded event loop, for logical concurrency bugs, but we didn't want to do multi-threading in the first place. We also wanted TigerBeetle to follow NASA's power of 10 rules for safety critical code. You have to handle memory allocation failure for your database. We also wanted to enforce explicit limits on all resources with general designs. Even for loops, there's no while True, every loop must terminate. There's an expected balance on how long that loop can possibly run for. Another big example of this is just TigerBeetle's use of static memory allocation. This means that all the memory that TigerBeetle needs is calculated and allocated at startup. After that, there's no dynamic allocation at runtime. This is almost a lost art, but we wanted to bring it back to enable TigerBeetle to decouple database performance from memory allocation for extreme memory efficiency, and predictable operating experience. This means that after you start TigerBeetle again, there's no more malloc or free, no risk of fragmentation. It's just pure predictable performance. We've made resource usage explicit like this throughout the design, with memory, with other resources as well.
It's like cgroups in your database. This is striking, I think, because compared to designs where the limits have not been thought through, or made explicit. We really enjoy this model of building the database. Because TigerBeetle makes everything explicit from storage fault model to resource usage, this means that we can test everything and then actually, like test it to the limit. We've got all the limits, so we can now test them. For example, to simulate disk corruption on the read or write path, in the double-digit percentages on every node in the cluster, even the primary, we inject faults up to the theoretical limit, and then we check that TigerBeetle doesn't truncate committed transactions. Is our write ahead log design working. Can the write ahead log use replicated redundancy to heal itself? Can we preserve durability? Can we remain available?
What's powerful about this testing also is that everyone on our team can run this simulator on their local laptop, and every bug that the simulator finds can be reproduced deterministically, replayed again and again for incredible developer velocity. As you're building a new feature, you just run the simulator. This is the final major design decision in TigerBeetle because the whole database has been designed from the ground up as a deterministic distributed database. This means that all the abstractions are deterministic. For example, again, the control plane is single threaded. There's no nondeterminism from the operating system's thread scheduler that could be a source of randomness. Even the source of time, the clock in each node in the database is deterministic. It's an abstraction that you can tick. It's got a tick method, just like you would tick the second hand of a clock. Then we can shim the source of time, or we can shim the disk, or the message bus, we can give a network to the message bus that's actually backed by a packet simulator. Then we can run a whole cluster of TigerBeetle replicas in a single process. This is the second order effect that if we want, because we control the whole simulation, we can literally speed up time. Instead of ticking time every 10 milliseconds as we would normally do, if we want 10 millisecond granularity in our timeouts, instead of doing that, we can tick time in a tight while True loop so that every iteration of the while True loop is just a hot loop. We've simulated 10 milliseconds of real-world time in a fraction of the time.
We actually worked out the time dilation numbers for this last week, and an average run of the TigerBeetle simulator takes just 3.3 seconds to run a pretty interesting simulation, with 64 committed operations, all kinds of stuff, just 3.3 seconds. That executes 235,000 clock ticks, each of those represent 10 milliseconds of time in the real world. In other words, 39 minutes in total. You can run the simulator for 3.3 seconds on your laptop, and you've achieved the equivalent of 39 minutes of simulated test time, full of all kinds of network latencies, packet drops, partitions, crashes, disk corruptions, disk slowdowns, every possible fault. If you want to debug something that would normally take 39 minutes to manifest, you can now do that in just 3.3 seconds. It's a speedup factor of 712 times. With existing test harnesses, like Jepsen, they're fantastic, but they're not deterministic. If you find a bug, you might not find it again. Also, they run in real time, so if you want 39 minutes of test time, you have to give up 39 minutes of real-world time on your local laptop. You don't get the same developer velocity. Being able to speed up time like this feels like magic, like a silver bullet. What I'm most excited about is that because everything in TigerBeetle is abstracted, even the state machine logic, you can just take this accounting state machine out, put another state machine in, and you've got a whole new distributed database, but you benefit from all the fault tolerance testing of TigerBeetle.
For example, we've done this internally to create our own control plane database in a week, before might have taken years. That's why I believe we're really in a new era. The rate at which new databases can be created is going to just accelerate and they're going to operate and be tested at much tighter tolerances than anything we've seen before. Each of these advances, direct I/O, protocol-aware recovery for consensus-based storage, explicit storage fault model, io_uring, an efficient replacement for C in Zig, deterministic simulation testing. Each of these on their own, I think makes for a whole new dimension in database design, hard to retrofit. Taking it together, these advances in database design are going to unlock an abundance of new, correct, and high-performance open source database management systems, tailored to the domain. It's a new era for database design, and I think it's a good day to do this.
See more presentations with transcripts