Tokutek Announces New Consensus Algorithm for MongoDB

Tokutek has announced work on a new consensus algorithm with the goal of replacing the existing leader election algorithm in MongoDB. Ark, as the algorithm is named, is being developed for use in TokuMX, Tokutek’s fork of MongoDB, and addresses a number of issues with the existing MongoDB algorithm.

The design is heavily influenced by the Raft and Paxos algorithms and aims to provide the same provably strong consistency guarantees. It differs from Raft to enable it to support the MongoDB architecture and programming model, implementing an asynchronous, pull-based replication model. This, the developers claim,

…supports a wider range of client semantics that allows application developers to choose points along a trade- off between safety and latency. In addition, Ark supports different replication topologies like chained replication and multi-data center replication with more flexibility than Raft does with its synchronous push model.

Tokutek explained the need for this new algorithm by pointing out two issues with the existing MongoDB leader election algorithm. The primary issue is one of correctness. In the blog post announcing Ark, Zardosht Kasheff points out that it is possible for updates that succeed with the majority write concern to roll back.

Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.

The secondary issue that Tokutek draws attention to is one of availability. In the accompanying tech report Zardosht and coauthor Leif Walsh explain that it is possible for a MongoDB replica set to be unavailable for 30 seconds or more during failover.

MongoDB’s election protocol requires that a member may not vote “yes” in more than one election in any 30-second period. … [T]his 30 second threshold can be problematic in practice, especially if an election fails: this necessarily makes the set unavailable for at least 30 seconds, maybe more if successive elections fail.

Ark addresses these flaws by exploiting the structure of the TokutekDB global transaction identifier (GTID). The GTID consists of a pair of 64-bit integers, (term, opid), where opid is incremented each time an operation commits on the primary, and the term is incremented each time a new primary is elected, and at this point the opid is reset to 0. The term in the GTID serves the same purpose as the term concept in the Raft protocol and that similarity allows Ark to employ many of the same solutions that Raft uses to provide its strong consistency guarantees.

While Ark is an implementation of a consensus protocol that works in a real database system, it is also evidence of the flexibility in the Raft consensus algorithm. It was relatively straightforward to tweak Raft in safe ways to make it fit the MongoDB architecture and programming model, and we think this is an important feature of Raft.

There is an Ark development branch available and Tokutek is actively soliciting feedback on both the design and the implementation.

Topics

Pitfalls of Unified Memory Models in GPUs

Evolving Trainline Architecture for Scale, Reliability and Productivity

Generally AI - Season 2 - Episode 3: Surviving the AI Winter

Mastering Observability: Unlocking Customer Insights with Gojko Adzic

Proactive Approaches to Securing Linux Systems and Engineering Applications

Helpful links

Choose your language

Write for InfoQ

Rate this Article

This content is in the Distributed Document Oriented Database topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

Microsoft Introduces Drasi: Open-Source System for Real-Time Event Processing and Automation

How Cell-Based Architecture Enhances Modern Distributed Systems

Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems

Orchestrating a Path to Success - a Conversation with Bernd Ruecker

OpenAI Releases Swarm, an Experimental Open-Source Framework for Multi-Agent Orchestration

Generally AI - Season 2 - Episode 3: Surviving the AI Winter

Challenges and Lessons Porting Code from C to Rust

Copilot Now Available in OneDrive: AI-Powered Features for Streamlined Document Management

Ephemeral IDs: Cloudflare's Latest Tool for Fraud Detection

Evolving Trainline Architecture for Scale, Reliability and Productivity

Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

No EC2 or Kubernetes Allowed: Insights from Building Serverless-Only Architecture at PostNL

Mastering Observability: Unlocking Customer Insights with Gojko Adzic

How a Sustainable Mindset in Software Engineering Can Increase Team Performance and Prevent Burnout

The Ongoing Challenges of DevSecOps Transformation and Improving Developer Experience

University Researchers Publish Analysis of Chain-of-Thought Reasoning in LLMs

Microsoft and Tsinghua University Present DIFF Transformer for LLMs

OpenAI Releases Swarm, an Experimental Open-Source Framework for Multi-Agent Orchestration

Google Cloud Adds Scalable Vector Search to Memorystore for Valkey & Redis Cluster

Podman Desktop 1.13 Launches with Hyper-V Support and Additional Enhancements

Uber Completes Major MySQL Fleet Upgrade, Boosting Performance and Security

QCon San Francisco

QCon London

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?