Tokutek Announces New Consensus Algorithm for MongoDB
Tokutek has announced work on a new consensus algorithm with the goal of replacing the existing leader election algorithm in MongoDB. Ark, as the algorithm is named, is being developed for use in TokuMX, Tokutek’s fork of MongoDB, and addresses a number of issues with the existing MongoDB algorithm.
The design is heavily influenced by the Raft and Paxos algorithms and aims to provide the same provably strong consistency guarantees. It differs from Raft to enable it to support the MongoDB architecture and programming model, implementing an asynchronous, pull-based replication model. This, the developers claim,
…supports a wider range of client semantics that allows application developers to choose points along a trade- off between safety and latency. In addition, Ark supports different replication topologies like chained replication and multi-data center replication with more flexibility than Raft does with its synchronous push model.
Tokutek explained the need for this new algorithm by pointing out two issues with the existing MongoDB leader election algorithm. The primary issue is one of correctness. In the blog post announcing Ark, Zardosht Kasheff points out that it is possible for updates that succeed with the
majority write concern to roll back.
Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.
The secondary issue that Tokutek draws attention to is one of availability. In the accompanying tech report Zardosht and coauthor Leif Walsh explain that it is possible for a MongoDB replica set to be unavailable for 30 seconds or more during failover.
MongoDB’s election protocol requires that a member may not vote “yes” in more than one election in any 30-second period. … [T]his 30 second threshold can be problematic in practice, especially if an election fails: this necessarily makes the set unavailable for at least 30 seconds, maybe more if successive elections fail.
Ark addresses these flaws by exploiting the structure of the TokutekDB global transaction identifier (GTID). The GTID consists of a pair of 64-bit integers, (term, opid), where opid is incremented each time an operation commits on the primary, and the term is incremented each time a new primary is elected, and at this point the opid is reset to 0. The term in the GTID serves the same purpose as the term concept in the Raft protocol and that similarity allows Ark to employ many of the same solutions that Raft uses to provide its strong consistency guarantees.
While Ark is an implementation of a consensus protocol that works in a real database system, it is also evidence of the flexibility in the Raft consensus algorithm. It was relatively straightforward to tweak Raft in safe ways to make it fit the MongoDB architecture and programming model, and we think this is an important feature of Raft.
There is an Ark development branch available and Tokutek is actively soliciting feedback on both the design and the implementation.