
Azure Cosmos DB: Low Latency and High Availability at Planet Scale

Mei-Chin Tsai and Vinod Sridharan spoke at QCon San Francisco on Azure Cosmos DB: Low Latency and High Availability at Planet Scale. The talk was part of the "Architectures You've Always Wondered About" track.

After introductions, Tsai started the talk by explaining what Azure Cosmos DB is, beginning with its history. A founding father of the service is Dharma Shukla, who in 2010 set out to design a highly scalable, globally distributed database. In the years that followed, he and his team built the Azure DocumentDB service, and in 2017 the product was renamed Azure Cosmos DB to reflect its planet-scale ambitions. Initially, it supported the document (NoSQL) and MongoDB APIs, later followed by the Gremlin (graph) and Cassandra (wide-column) APIs.

At its core, Azure Cosmos DB supports various features for scalability, performance, and enterprise readiness.

After providing the Azure Cosmos DB feature-set overview, Tsai discussed customers using the database service, such as GEICO, LinkedIn, and Walmart. In particular, without naming the customer, she gave the example of a single production instance serving 100 million requests per second over petabytes of storage, globally distributed across 41 regions.

Before handing over to Sridharan, Tsai explained the overall architecture of Azure Cosmos DB. From a high-level perspective, a request (such as a CRUD operation) goes through one of the supported APIs and reaches the API gateway before being routed to the storage backend. The operation result is then routed back through the API gateway to the client.

Next, Sridharan continued with the storage engine. Before diving deeper into the backend, he explained the resource hierarchy of Cosmos DB. The database service has "Databases," which contain "Containers." The latter's throughput is provisioned in so-called "Request Units" (RUs). Containers support several partitioning schemes, such as single partition keys, hierarchical partition keys, and composite partition keys (Cassandra API).
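
The partition-key idea can be sketched in a few lines: items are hashed by their key (or key hierarchy) to a physical partition. This is illustrative only — function names and the modulo mapping are hypothetical, not Cosmos DB's actual range-based scheme:

```python
import hashlib

def partition_for(key_parts, physical_partitions):
    """Map a (possibly hierarchical) partition key to a physical partition.

    Hypothetical sketch: Cosmos DB really hashes keys into ranges of a
    hash space and assigns ranges to partitions; here we just bucket.
    """
    # Hash all levels of the hierarchical key together, so items sharing
    # the full key (e.g. the same tenant and user) land on one partition.
    digest = hashlib.sha256("/".join(key_parts).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % physical_partitions

# Single partition key vs. hierarchical (tenant, user) key:
p1 = partition_for(["tenant-42"], 16)
p2 = partition_for(["tenant-42", "user-7"], 16)
```

Because the hash is deterministic, routing a request for a given key always reaches the same partition, which is what makes single-partition reads cheap.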

The Cosmos DB backend is a multi-tenant, shared-nothing, distributed service. Each region is divided into multiple logical fault-tolerant groups of servers called clusters. Every cluster spans multiple fault-domain boundaries, with its servers spread across those domains, so the cluster as a whole can survive a failure in any subset of them. Within each cluster, customers' physical partitions for different containers are spread across servers in different fault domains, and a single partition has four replicas placed across fault domains for availability and durability. Furthermore, data is encrypted at rest with a service-managed key or, optionally, a customer-provided key.
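
The placement constraint — four replicas spread across fault domains so no single domain failure takes out a quorum — can be sketched as a simple round-robin assignment. Names and the round-robin policy are hypothetical, not Cosmos DB internals:

```python
from itertools import cycle

def place_replicas(fault_domains, replica_count=4):
    """Spread one partition's replicas across fault domains round-robin.

    Sketch of the constraint described in the talk: with four replicas
    and at least four domains, each replica lands in a distinct domain.
    """
    placement = []
    domains = cycle(fault_domains)
    for replica in range(replica_count):
        # Pair each replica index with the next fault domain in turn.
        placement.append((replica, next(domains)))
    return placement

layout = place_replicas(["fd-0", "fd-1", "fd-2", "fd-3"])
# Four replicas, each in a different fault domain.
```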

Sridharan subsequently explained the logical request flow for any given partition and how it relates to availability and reliability, followed by a deep-dive explanation of the multi-region writes capability, tunable read consistency, high availability, and elasticity (offered through auto-scale, serverless, or elastic scale-out).
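
The availability math behind a four-replica partition follows classic quorum arithmetic: reads and writes are guaranteed to overlap when R + W > N. This is a simplified sketch — Cosmos DB's five consistency levels are richer than a single R/W knob, so treat the numbers as illustrative:

```python
def quorums(n_replicas=4):
    """Compute overlapping read/write quorum sizes for N replicas.

    Simplified model: a majority write quorum plus the smallest read
    quorum that still intersects every possible write quorum.
    """
    write_quorum = n_replicas // 2 + 1            # majority of replicas
    read_quorum = n_replicas - write_quorum + 1   # smallest overlapping read
    # Overlap guarantee: any read set intersects any committed write set.
    assert read_quorum + write_quorum > n_replicas
    return write_quorum, read_quorum

w, r = quorums(4)  # -> (3, 2)
```

With four replicas, a write acknowledged by three of them can still be read consistently if a reader contacts any two, since those two must include at least one replica from the write set.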

Finally, Sridharan stated that Cosmos DB offers multiple APIs on a shared infrastructure, providing the same scalability, availability, consistency, and elasticity across all of them.

Next, Tsai took over to discuss the API gateway in Cosmos DB. The API gateway is organized as a fleet of microservices, each specialized in one API (with multiple microservices per API). The gateway was designed for various APIs (Cassandra, NoSQL, MongoDB, Gremlin, and the latest addition, PostgreSQL – not mentioned in the talk), multi-tenancy, throughput, and availability. The choices behind the design were:

  • Flexibility in scaling
  • Shared platform layer
    • Optimize once for all APIs
    • Faster standing up for new scenarios
  • Efficient use of nodes to reduce COGS (cost of goods sold)
  • Resource governance to reduce the blast radius and contain noisy neighbors
  • The platform is deeply integrated with the underlying hardware
    • NUMA node boundaries
    • Fixed-size processes

She continued with how Cosmos DB utilizes a VM: instead of one giant process, it runs many small processes, one per API, each of fixed size for a better working set. Furthermore, she explained:

Understanding the hardware we depend on, we never span our process across Non-Uniform Memory Access (NUMA) nodes. When a process crosses a NUMA node boundary, memory access latency can increase on cache misses. Not crossing the NUMA node gives our process more predictable performance.
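
On Linux, the affinity idea can be sketched by pinning a process to the CPU set of one NUMA node. The CPU list here is hypothetical — real code would read the topology from `/sys` or libnuma — and this is not Cosmos DB's actual mechanism:

```python
import os

def pin_to_numa_node(cpus_in_node):
    """Pin the current process to one NUMA node's CPUs (Linux only).

    Sketch of the talk's point: keeping a process inside one NUMA node
    avoids remote-memory latency on cache misses.
    """
    if hasattr(os, "sched_setaffinity"):
        allowed = os.sched_getaffinity(0)
        # Only pin to CPUs the process is actually allowed to run on.
        target = set(cpus_in_node) & allowed
        if target:
            os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else None
```

On platforms without `sched_setaffinity` (e.g. macOS), the function is a no-op and returns `None`, leaving scheduling to the OS.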

Another aspect of the API gateway is load balancing at all levels (VM and cluster).

Tsai handed over to Sridharan to talk about performance. Cosmos DB applies multiple strategies to ensure performance; in summary, these involve:

  • Rigorous validation (continuous monitoring of latency and throughput; changes need to meet a performance bar)
  • Storage engine
    • Lock-free B+ tree for indexing with log-structured storage
    • Local SSDs (not remote-attached storage)
    • Batching to reduce network/disk IO
    • Custom allocators to optimize for request patterns
    • Custom async scheduler with coroutines
  • API gateway
    • Low-allocation APIs (Span/Memory)
    • Minimal parsing/transforms/deserialization
    • NUMA affinity & fixed-size processes
    • Proximity Placement Groups
    • Optimization with high-performance .NET APIs
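The batching point above can be sketched with a small write coalescer: individual writes accumulate and are flushed as one IO. The class, callback, and batch size are hypothetical, not the Cosmos DB storage engine:

```python
import asyncio

class WriteBatcher:
    """Coalesce individual writes into batches to cut per-IO overhead."""

    def __init__(self, flush, max_batch=8):
        self.flush, self.max_batch, self.pending = flush, max_batch, []

    async def write(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.max_batch:
            await self._flush()

    async def _flush(self):
        # Swap the buffer first so new writes can accumulate meanwhile.
        batch, self.pending = self.pending, []
        if batch:
            await self.flush(batch)

async def demo():
    flushed = []

    async def flush(batch):
        flushed.append(list(batch))  # one IO per batch, not per write

    batcher = WriteBatcher(flush, max_batch=4)
    for i in range(8):
        await batcher.write(i)
    return flushed

# asyncio.run(demo()) -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Eight writes produce two IOs instead of eight; a production version would also flush on a timer so small trickles of writes are not delayed indefinitely.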

Tsai ended the talk with a couple of takeaways from building Azure Cosmos DB:

  • Design in reliability and performance features (core value propositions)
  • Test, monitor, and improve reliability and performance
  • Leverage customer insight
  • Stay ahead of business growth and trends
