Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Azure Cosmos DB: Low Latency and High Availability at Planet Scale

Azure Cosmos DB: Low Latency and High Availability at Planet Scale



Mei-Chin Tsai and Vinod Sridharan discuss the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability.


Mei-Chin Tsai is Partner Director of Software Eng Manager @Microsoft. Vinod Sridharan is Principal Software Engineering Architect @Microsoft.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Tsai: My name is Mei-Chin. I'm a Partner Director at Microsoft.

Sridharan: I'm Vinod. I'm a Principal Architect at Azure Cosmos DB.

Tsai: Today's topic is Azure Cosmos DB: Low Latency and High Availability at Planet Scale. First of all, we will start with introduction and overview. Then Vinod will take us deep dive into the storage engine. I will be covering API gateway. We will conclude with learning and take away.

History of Azure Cosmos DB

First of all, let's start with a little bit about Azure Cosmos DB's history. Dharma Shukla is a Microsoft technical fellow. In 2010, he'd observed many internal Microsoft applications trying to build highly scalable and globally distributed systems. He started a project, he hoped to build a service that is cloud native, multi-tenant, and shared nothing from the ground up. In 2010, while Dharma Shukla is having a vacation in Florence, his proof of concept started to work. What is Azure Cosmos DB's codename? Project Florence.

The product name was named as DocumentDB as we started with document API, by 2014 there are plenty sets of critical Azure services and Microsoft developers using Cosmos DB. With continuous evolution, DocumentDB no longer describe the capability of the product. Upon public offer in 2017, the product was renamed as Azure Cosmos DB to reflect the capability more than just document API. Also, to better describe aspiration for a scalability beyond a planet. In the initial offering, we only have document API such as document SQL and Mongo API. By 2018, both Gremlin API and Cassandra API were introduced.

What Does Azure Cosmos DB Offer?

Managing a single tenant fleet is hard, it is usually not cost efficient. To do sharding of any database is also a complex task. What does Azure Cosmos DB offer? These sets of features and characteristics are built into Azure Cosmos DB to support a journey from the box product to the cloud native. On the left-hand side, it is our core feature sets. Azure Cosmos DB is Microsoft's schema free, NoSQL offering. Document SQL API is Azure Cosmos DB's native API. Our query engine supports rich query semantics, such as sub-query, aggregation, join, and more, that make it unique and also complex. We also support multi-APIs. There is a huge base of existing Mongo, Cassandra developers and solutions. Providing these OSes APIs, smooths out the migration to Azure Cosmos DB. We will see multi-API choice impacts throughout our system, including storage engine, query, gateway, and more. We have tunable consistency, and conflict resolution. For scalability and performance features, we have geo distribution, active-active, and elasticity. On enterprise front, we are a Ring Zero Azure services. That means wherever there's a new Azure region build-up happening, Azure Cosmos DB will be there. Encryption, network isolation, role-based access, are all needed to support enterprise software.

Azure Cosmos DB Usage

How is Azure Cosmos DB doing? Azure Cosmos DB is one of the fastest growing Azure services. We have 100% year-over-year growth in transaction and storage. How big can a customer become? I'm just citing one of the single instance customers, one of this instance customers actually supports over 100 million requests per second over petabytes of storage. This customer is globally distributed in 41 regions. Who uses Azure Cosmos DB? We power many first-party partners from Teams Chat, Xbox games, LinkedIn, Azure Active Directory, and more. We also power many critical external workloads, such as Adobe Creative Cloud, Walmart, GEICO, ASOS. All of our APIs have strong customers. We also observed many of our customers use several APIs, not just one, simply because depends on your data modeling, dataset, and your query. Customers tend to pick the right set of APIs to better support that scenario.

Azure Cosmos DB High Level Logical Flow

This diagram illustrates a high-level Azure Cosmos DB architecture. Customers creates a database account. The data is stored in the storage unit. Customer sends various requests from simple CRUD and query operations. What differentiates Azure Cosmos DB the most is actually our customer. On the left-hand side, you can see many of our customer accounts, but they can speak different languages. As you request a write to the write API gateway, request will be translated and planned. Sub-request will be routed to relevant backend storage. Result will be processed in the gateway if needed before sending it back. Azure Cosmos DB is designed with high performing multi-tenancy, that means a VM can contain many storage units, an API gateway process can support multiple customers' requests. In this talk, you will see multi-API and high performing multi-tenancy plays a big part in our system.

Storage Engine

Sridharan: As we dive deep into the storage engine, I wanted to talk about a couple of themes you'll see. In Cosmos DB, we believe highly in redundancy. If there's one of something you should assume that it fails. If there's more than one, assume that you have to prepare to have them swap out at a moment's notice, and we will need redundancy at every layer.

Resource Hierarchy

I want to quickly cover some of the simple concepts in the Cosmos DB service. Cosmos DB has a tiered resource hierarchy. At a top level you have the database account, which is how customers interact with the database. Within an account, you can have several databases. A database is just a logical grouping of containers. Containers are like SQL tables. They hold user data and documents that are stored in physical partitions that are then horizontally scaled. Entities such as containers are provisioned with request units, which determine the throughput that you can derive from them. It's like the currency that you use to interact with Cosmos DB. Each operation in the container ends up consuming these request units, and is restricted to be within the provision value for the container within any one second. Every container also comes with a logical partition key. The logical partition key is basically just a routing key that is used to route user data to a given physical partition where the data resides. Each logical partition key lives in one physical partition, but each physical partition stores many such logical partition keys. That's pretty much how we ended up getting the horizontal scale we talked about. For instance, consider a container that has the logical partition key, city. Documents with the partition key value Seattle or London may end up in partition one, while Istanbul ends up in the physical partition three. We have several types of partitioning schemes in Cosmos DB. They can be a single field path, or hierarchical, or composite, where you could have multiple keys contributing to the partitioning strategy.

Each partitioning scheme is optimized for its own use scenarios. For instance, when you think about CRUD operations, a single partition key may be good enough. When you consider complex systems where you have data skew, for instance, where you have data skew but you still want to optimize for query usage patterns, hierarchical partition keys can still be useful. For instance, with Azure Active Directory where you have login tenants that are vastly different such as Microsoft or Adobe, but you still want horizontal scalability because a single tenant may be larger than what a physical partition can handle. You want data locality for these queries. You can have hierarchical partition keys where you first partition by the tenant, but also downstream by an organization or team, allowing you to benefit from the horizontal scalability but also data locality for fairly complex queries such as ORDER BYs, or CROSS JOINs across the different partitions.

The Cosmos DB Backend

The Cosmos DB backend is a multi-tenant shared nothing distributed service. Each region is divided into multiple logical fault-tolerant groups of servers that we call clusters. Every cluster contains multiple fault domain boundaries, and servers within the cluster are spread across these fault domains to ensure that the cluster as a whole can survive any failures in a subset of these fault domains. Within each cluster, customers' physical partitions are spread across the different servers across all the different fault domains. A single partition for a given user comprises of four replicas that each hold copies of the entire data for that partition. We have four copies of your data for every partition. Different partitions even within a container are spread across various machines in a cluster. Across partitions, we divide them across multiple clusters, allowing us to get elastic and horizontal scale-up. Since we're multi-tenant, every server hosts data from multiple accounts. By doing this, we ensure that we minimize the cost and get better density and packing of customer data per server. To do this, we need strong resource governance.

The Backend Replica

Let's take a look at some of these replicas and partitions in detail. Each replica of a partition has a document store, which is written to a B-tree, an inverted index, which is an optimized B+ tree holding index terms used for queries, and a replication queue which logs writes as they happen, and is used to build and catch up slow replicas. Simple reads and writes are just served from the document store. We can always just look it up by ID and basically replace or read the value. However, queries that involve more than simple reads such as ORDER BYs, or GROUP BYs, or aggregates, or even sub-queries on any fields are processed in conjunction with the B-tree and the inverted index. Given that you can index pads, or wildcards, or just say, index everything, we end up building these indexes at runtime as you insert these documents. There is a brilliant paper some of our colleagues wrote on schema agnostic indexing and how we make it work that is linked here,

Write a Single Document

Looking at a specific single write, if a user issues the write request to a given partition, the write is first directed to the primary replica of that partition. The primary replica first prepares a consistent snapshot of the message and dispatches that request to all of the secondaries concurrently. Each of the secondaries then commits that prepared message and updates its local state, which involves updating the B-tree, updating the inverted index, and the replication queue. The primary then waits for a quorum of secondaries to respond confirming that they have acknowledged and committed the write. This means that they've written it fully to disk. Once a quorum of secondaries act back, the primary then commits the write locally. It updates the replication queue, its local B-tree, and the inverted index before acknowledging the write to the client. If there's any replicas that didn't respond to the quorum write, then the replication queue is used to catch them up, in this case, the middle replica that didn't act back.

Multi-Region Writes

That works great if you're in a single region. Once again, if you have a single region, you only have one of something, and we like redundancy at Cosmos DB. We do offer having multiple regions with multi-homing. In the scenario where you have multiple regions, a client in region A will first write to the primary replica in region A. The primary then follows the same sequence as before and commits to its own secondaries. One of the secondaries is then elected to be a cross-region replication primary and replicates the write to region B's primary replica. This then follows the same channel as before, and region B's primary commits it to its own secondaries, and writes are transparently propagated across various regions. Conversely, if a user in region B writes to the primary in region B, then the inverse flow happens. First, the write is first committed to its primary and secondaries in region B. One of the secondaries there is elected as the cross-region replication primary, and writes to region A, which then follows the same sequence as before, ensuring that writes are made available transparently across all the different regions.

As both people are writing to both regions, one of the things that can happen is conflicts. Cosmos DB has several tunable conflict resolution policies to help resolve these conflicts. The user can configure, for instance, last writer wins on the timestamp of the document, or they can provide a custom field that defines how to resolve conflicts as some path within the document. The users can also configure a custom stored procedure that allows them to customize exactly how conflict resolution happens. Or they can have a manual feed and say I want to resolve these offline. Additionally, there's also API specific policies such as Cassandra, which applies last writer wins on a timestamp but at a property level. All of these are configurable at an account or a collection basis.

Reads and Tunable Consistency

When it comes to reads, Cosmos DB has a set of five well-defined tunable consistencies that have been derived based off of industry research and user studies. These consistency levels are overridable on individual reads and allow for trading off between consistency, latency, availability, and throughput. On one side, you have strong, which gives you a globally consistent snapshot of your data. Next to that we have a regionally consistent consistency called bounded staleness, which about half of our requests today use. In this mode, users can also configure an upper bound on the staleness, ensuring that reads don't lag behind writes by more than N seconds, or K writes across various regions. The next consistency we have is session, which provides read your own write guarantees within a client session. This gives users the best compromise between consistency and throughput. Finally, we have consistent prefix and eventual which gives you the best latency and throughput but the least guarantees on consistency. Of course, within a single region, the write behavior does not change with the consistency model. Writes are always durably committed to a quorum of replicas.

Let's look at how some of these are achieved in detail. When a client issues a read that is eventual or session, these reads are randomly routed to any secondary replica. Across different reads, because of this random selection, reads are typically distributed across all the secondaries. In the case of a bounded staleness, or a strong read, the reads are pushed to two random secondaries. We also ensure that quorum is met across the two different secondaries. Given different request types, be it writes that go to the primary or reads that go to the secondaries, we basically load balance our workloads across all the replicas in our partition, making sure that we get better use of the replicas and provisioned throughput.

Ensuring High Availability

Coming to availability, across various components, we continually strive to ensure that we have high availability in all of our backend partitions. One of the things we do is load balancing. We constantly monitor and rebalance our replicas and partitions both within and across our clusters to ensure that any one server isn't overloaded as customer workload patterns change over time. This load balancing is done in various metrics such as CPU or disk usage, or so on. Additionally, things always happen, like our machines go down, there's upgrades, the power may go out on part of our data center, and we need to react to this. If a secondary replica goes down, because of any of these things, due to maintenance or upgrades or so on, new secondaries are automatically selected from the remaining servers and rebuilt from the current primary using the existing B-tree and replication queue. If a primary goes down, a new primary is automatically elected from the remaining replicas using Paxos. This new secondary is guaranteed to be chosen so that the latest acknowledged write is available. By doing this, we ensure that we have high availability within a partition, as long as there's enough machines around even as the system evolves and changes.

Finally, we have strong admission control and resource governance. Requests are admission controlled on entry before any processing occurs based on their throughput requirements to ensure that any one customer doesn't overload the machines, and there's fair usage based on throughput guarantees. With simpler CRUD requests, this is relatively easy to do. Basically, when you have a write, I know the size of your write, and I can figure out the throughput requirements. For reads, similarly, once I look up the document, I know how much throughput you're using. However, when you get to a complex query that has like a JOIN, or a GROUP, or you're deep in the middle of an index or an ORDER BY, resource governance gets harder. Here we have checkpoints in the query VM runtime, where we report incremental progress of throughput, and we yield execution if you exhaust your budget. By doing this, we ensure that customers get the ability to make forward progress as they work through a query, but also respect resource governance and the bounds of a given replica or partition. If, however, the requests do exceed the budget allocated, say you have a large query and there's not enough budget remaining within that replica to handle that query, they do get throttled on admission, with a delay in the future that allows the partition to operate under safe limits.

Elasticity - Provisioned Throughput

Beyond availability, customer's data needs change over time. Traffic can be variable with follow the sun, or batch workloads, or spiky workloads, or analytical, and so on. Their stored data can also change over time. Cosmos DB handles this by offering various strategies to help with the elasticity of a service when data needs change. Customers can leverage various strategies like auto-scale, throughput sharing, serverless, or elastic scale out for their containers. The model on the left shows a simple provisioned throughput container. The container has a fixed guaranteed throughput of 30,000 RUs that is spread evenly across all the partitions. The container will always guarantee that 30,000 request units worth of work is available at any one second, and this is divided evenly across all of the different physical partitions. This is useful for scenarios where the workload is predictable, or constant, or well known a priori. On the right, we have an auto-scale container. Customers provide a range of throughput for the container, and each partition gets an equivalent proportion of it. In this case, the workloads can scale elastically and instantaneously between the minimum and maximum throughput specified, but the customer is only billed for the maximum actual throughput consumed, which is great for spiky or bursty workloads. To do this, we rely heavily on our load balancing mechanisms we discussed before, because we need to constantly monitor these servers and ensure that any replicas and partitions that get too hot get moved out if any one server gets oversubscribed.

Elastic Scaling

An additional case is if you need to grow your throughput or storage needs beyond what one physical server can handle. Say a user wants to scale their throughput from 30,000 to 60,000 request units, this may require a scale out operation. If it does, under the cover, the first thing we do is allocate new partitions that can handle the desired throughput. We then detect the optimal split point based on consumption or storage on the existing partitions, and then migrate the data from the old to the new based on the split point with no downtime to the user. New writes arriving to the partition are also copied over so the user doesn't see any impact from this operation. Finally, when the data is caught up, there is an atomic swap between the old and new partitions, and the newly scaled partitions now take the user traffic. Note that these are all independent operations. Each partition once it's caught up will automatically transition atomically to the new partitions. Similarly, let's say that we detect that your storage is growing beyond what a single partition can handle, and it may be better to split. We do partition this the same way as a throughput split, where we create two child partitions, detect the split point, and migrate the data to the new child partitions, which can elastically grow horizontally. We did recently add the ability to merge partitions back as you reduce the throughput needs. While we didn't need to do this earlier, we found out that this became important especially for scenarios around query where data locality is super important and fanning out to fewer partitions yielded better latencies and performance for the end customer. In this case, we once again follow the same flow where we provision a new partition, copy the data inbound, and then do the swap.

Multi-API Support

Finally, Cosmos DB is a service that supports multiple APIs, from SQL, Mongo, Gremlin, Cassandra, and Tables. For all of these APIs, we provide the same availability, reliability, and elasticity guarantees, while ensuring that API semantics are maintained. Consequently, we leverage a common shared infrastructure in the storage engine across all the APIs. The replication stack, the B-tree and the storage layer, the index and elasticity are all unified. This allows us to optimize these scenarios across all the APIs and ensure that features in this space benefit all APIs equally. We can see this because our active-active solution is available in Cassandra and SQL, and even Mongo, where we have an active-active Mongo API. However, each API can have its own requirements in terms of functionality. In these cases, we have extensibility points in the storage engine to support them. For instance, with patch semantics, or index term and query runtime semantics, where type coercions and type comparisons and such can vary vastly across APIs, or even conflict resolution behavior where resolving conflicts can have API specific behavior like we saw in Cassandra. By carefully orchestrating where and when these extensibility points happen, we can ensure that we give users the flexibility of the API surface while still guaranteeing availability, consistency, and elasticity.

API Gateway Problem Space

Tsai: That's a very cool database engine. I always joke about it, Vinod got an easy problem. He got to charge our customers and API gateway does not. You had seen this diagram earlier, we are zooming a little bit into that middle box. As you can see, our API gateways have formed a fleet of microservices. Each microservice is specialized in one API. For each API, there are many microservices. Why is our API gateway problem space unique? Our API gateway is designed for multi-API. Today, we support Documents SQL, Mongo API, Cassandra API, and Gremlin API. Each API has its own semantics and protocols. Our gateway needs to be able to understand various protocol and also implement the correct semantics. Just picture key-value store, you probably can picture a simple query and request go over the wire. Think about graph. Graphs usually tend to deal with traverse, and super node as a very computationally heavy request. Our gateway is going up for multi-tenancy. Know that we do not charge our customer for these gateway services, everything here is COGs for us. Unlike the backend storage and CPU, we delegate to the customer as an RU. This gateway needs to be performing and fair, reducing the impact of noisy neighbor, and also contain the blast radius. This gateway needs to be highly available because that's one of the Azure Cosmos DB [inaudible 00:28:24].

API Gateway Design Choices

This is our API gateway design choices. First, our gateway federation is separate from our backend federation. That gives us the flexibility in scaling independently from the backend for growing or shrinking. Second, to make multi-API implementation effective, we abstract our platform host from API specific interop. This enables us to quickly stand up another protocol or scenario if we desire to, and also allow us to optimize a platform layer once, and can benefit all scenarios. Platform layer including how we talk to the backend, knowing the partitions, and also talking about memory management. We need to be efficient, use the nodes to reduce COGs. Again, these are free services. However, we also need to balance between maximize use of a node and yet maintain high availability at peak traffic. A lot of heuristic and tuning has been put in. We implemented resource governance to reduce blast radius and to contain noisy neighbor. Our platform is deeply integrated with underlying hardware, such as NUMA node boundary conservation and core affinitization.

How We Leverage a VM

This diagram illustrates how we leverage a VM. Our processes are all small and fixed sized, as you can tell from this diagram. With a process per API, we can have better working set and localized locality, such as caching. We have multiple processes per API on a VM. Understanding the hardware that you're deploying on, this is very important. We never span our process across NUMA node. NUMA stands for Non-Uniform Memory Access. When a process is across NUMA node, memory access latency can increase if you have cache misses. By not crossing the NUMA node makes our performance a lot more predictable. We also affinitize each gateway process to a small set of unique CPU cores. This will allow us be free from OSes from switching us between process, by sharing the cores, and also, between the API gateway process they won't compete. Second, understand the language and framework that it depends on. Our API gateway is implemented in .NET Core. We manage the process for performance. We put in consideration for potential latency such as GC configuration, throughput heuristic. We also have been leveraging low allocation API and buffer pooling technique to reduce the GC overhead. We are very fortunate that we work closely with .NET team. We are a .NET early adopter. We provide performance and quality feedback to .NET team.

Load Balancing at all Levels

Load balancing is a difficult problem, you want to do it well. Vinod mentioned about load balancing in our backend. The same thing is happening in the frontend. With our API gateway design, load balancing happens in all different levels. Within a VM, there are connection rerouting to balance between process of the same API. Between VMs, account load balancing can happen by monitoring system if a VM is getting too hot in CPU or memory. Account load balancing can happen within a cluster or between clusters. These are passive load balancing, because you load balance when a VM may be out of a comfort zone. What we are working on is active load balancing, where we load balance upon request arrival. We are building a layer 7 customized load balancer. This is a work in progress. It will leverage VM health not just VM's liveness. It will have capability to leverage historical insight in ML for future prediction. All this load balancing is done in a non-observable way to a customer, and maintain the high availability.

Tuning the System for Performance

Sridharan: Performance is another critical component to manage as the code base evolves. It's one of the first things to go if you're not paying close attention to it. Cosmos DB applies multiple approaches to ensure that we retain and improve performance over time. Firstly, for every component, we have a rigorous validation pipeline that monitors the current throughput and latency continuously. We make sure that we never regress this performance, and every change that goes into the product has to meet or exceed this performance bar. I learned this the hard way when I first joined the team. Even a 1% regression in latency or throughput is considered unacceptable. That sometimes means that as changes happen, developers must reclaim performance in other components so that the overall performance bar is met. Within the storage engine, we have a number of things we do to ensure that we have the performance guarantees needed for the service. The first is our inverted index, which uses a lock-free B+ tree optimized for indexing, which was written in collaboration with Microsoft Research. This gives us significant benefits over a regular B-tree when doing multiple batched updates across different pages, which is pretty common in indexes. We also optimize our content serialization format in our storage with a binary format that is length prefixed, so that projections as you're looking at a deeply nested document is highly efficient for queries, and we can read and write properties super efficiently. We also ensure that we use local disks with SSDs instead of remote disks to avoid the network latency when committing changes. Instead, we use the fact that we have four such replicas to achieve high availability and durability. We also have custom allocators to optimize for various request patterns, for instance, a linear allocator for a query, or we have a custom async scheduler with coroutines that ensures that we can optimize for concurrency and resource governance all through the stack.

Within the API gateway, the optimizations tend to be about managing stateless scale out scenarios that focus on proxying data. We usually use low allocation .NET APIs such as span and memory, and minimize the transforms parsing or any deserialization needed all to reduce variance in latency and garbage collections. Additionally, to ensure predictability in memory access patterns, we have a hard NUMA affinity within our processes. We also use fixed size processes across all of our APIs, so that we can optimize for one process size when it comes to core performance characteristics across the entire fleet that we have. Within Azure, our VMs in our cluster are placed within a single proximity placement group, which allows us to provide guarantees around internode latencies, especially when we're doing things like load balancing. Finally, we partner closely with the .NET team to ensure that we continue to benefit from the latest high-performance APIs that we can leverage in our software stack.

Takeaways from Building Azure Cosmos DB

Tsai: These are our takeaways from building Azure Cosmos DB. Reliability and performance are features, you must design it in on day one, if not day zero. The architecture had to respect that and everything that you put in had to respect that. Reliability and performance is a team culture. As Vinod mentioned, it could be test by 1000 paper cuts, 1% of a performance regression 20 of those will be 20% of regression. It is very important that continuous monitoring and continuously holding that bar, you must allocate the time to continuously improve your system, even when you think that you are done with the features. Third, leverage customer insight. This is also super important. With all the data in your hand, how do you decide what feature to build, or that a customer can benefit, or what optimizations you should optimize, to prioritize so that most customers can benefit from it. One other good example that we have is actually the recent optimization of query engine. Even though we've implemented Mongo API, we know Mongo supports nested arrays, we have so many of our customers who don't even have nested arrays. Our query engine implements optimization that a single code pass is actually much more performing, and a lot of customers benefit from it. Last, stay ahead of the business growth and trend. Nobody builds services on day one, and power billions of requests, and petabytes of data. You continuously monitor to see if you're hitting a bottleneck, you need to work on removing that bottleneck. Sometimes you might need to reimplement if your initial design does not support that scalability. Training is also really important to know, what is upcoming, and how do you prioritize those?

Questions and Answers

Anand: Could you expand on how you verify that you don't have performance regressions?

Sridharan: There's a number of things that we look at. The primary metrics we're looking for are throughput, latency, and CPU. We want to make sure that for our critical workloads, at the very least, throughput, latency, and CPU are constant and do not degrade over time. We basically take a single node, we load it up to 90% CPU, for, say, a single document read, a single document write, a fairly well-known query, and so on, and make sure that the throughput that we derive from a node, the latency that we see at 90% CPU, the memory that we see are all meeting the benchmark that we want. Then we basically add specifically well-known workloads over time as we see scenarios that derive from production. We also have observability in our production system where we monitor specific queries, signatures, and so on, and make sure that we don't regress that as we do deployments and such.

Tsai: I think, in summary, essentially is some layers of guardrails. There's a first-line guardrail, more like a benchmark unit testing, and then the workload, like simulated workload, and then real workload. Then you will hope that you catch it earlier, because those are the places that you can catch it with more of a precise measurement. Then when you catch low, and further out, then it's just more like, I'm seeing something that is not what I expected.

Anand: I don't know if this person was asking in just a standalone question. I don't think it's related to the replication thing. Just query performance, I think. What I understood is, you run your own benchmarks based on your query log, a representative query log, is that correct?

Sridharan: Are you talking about for the performance regression stuff?

Anand: Yes.

Sridharan: For variable workloads, yes, there's a degree of variability, which is why our performance runs similar to what you do with like, YCSB or any of the major benchmarks, so-to-speak. We have a well-known set of queries that we execute to measure the performance to ensure that we meet the bar for latency, CPU, and throughput. A 1% regression is literally just any regression in either latency or throughput core CPU for any of these well-known workloads.

Anand: You have a graph workload, you could have an analytic workload, you can have an OLTP workload, you can have a document retrieval workload, you have a document database, you can have a key-value workload. It's a little complicated.

Tsai: Now you know how painful it is when people say why don't you adopt a new release of the operating system, new hardware, all the things that matter that will go in. I think the question there about, like 1%, 1% is actually a very interesting question. Because sometimes your performance environment, 1%, the 1% could be a noise, and we don't know. We observe on per change, and we also observe the trend. Sometimes we can catch a larger regression immediately. That test by 1000 paper cuts, you hope to catch it after 20 paper cuts, not 1000. That's how we operate.

Sridharan: We do use known benchmarks like YCSB, but we also have our own benchmarks that we have built on top of that, especially like he said, like when you're dealing with a graph workload, figuring out the sets of relationships we want to test is also a crucial part of it. For instance, friends of friends is a common query that most people want to ask for graphs, or third hop, or whatever. We tend to build our performance benchmarks around that.

Anand: YCSB is mostly for NoSQL, correct?

Sridharan: Yes. A lot of our APIs are NoSQL, like between Mongo, or Cassandra, or documents API. We've extended some of that for our graph database as well.

Anand: What about traditional RDBMS workloads? Do you guys support those? Do you have a storage engine that is like a RDBMS type of engine, like InnoDB, or something like this?

Sridharan: We just announced the Azure Cosmos DB for Postgres. At least, this particular engine that we're talking about in this talk is geared towards NoSQL, primarily.

Anand: Stay ahead of business growth and trend, what are current research areas being explored related to current business trends in enterprise?

Tsai: I think there's two observations in a big organization that we observe. One is actually more of industrial training. There's SQL, there's NoSQL. Actually, it's two spectrums of things. We start observing the new trend, more like a distributed SQL, trying to bring NoSQL closer to SQL space, or in light of the SQL capability. The second trend that we observe is actually, we call it SaaS, Software as a Service. Actually, a lot of developers would like to write very little code, and so chaining things together as experience and actually enabling developer's fast productivity is super key.

Sridharan: I think there's one aspect, like you said, the distributed SQL and trying to merge what we think of as traditional relational with NoSQL is a direction we've been headed towards for the last at least decade. We're pushing further on, almost getting all three of the consistency, availability, and partitioning, with caveats, and everyone picks their own. That is one area for sure. I think the other part is, how do we get lower and lower latency? We keep talking about people building complex systems with caching on top of databases, because the database takes too long to run your query, and you need to have Redis on top of your database. Or if you have like Text Search, you build some other caching layer or another key-value store you shove the results in? How do we integrate these to provide more of a low latency experience for people who really need the high performance and low latency, but at the same time, give you the flexibility of the query? I think is another area that is currently being explored.

Anand: There's some knowledge or talk about HTAP, Hybrid Transactional and Analytic Processing. Is that where you'd like to go, with these other things, graph and NoSQL combined in that? There were papers like Tenzing, Google published a while ago, which was this hierarchical storage engines with one API. Those were mostly for analytic processing. It was a paper maybe 10-plus years ago.

Sridharan: HTAP is definitely something we see us looking at, if you look at, for instance, things like the Synapse link where you can ingest with OLTP and then run analytical workloads on top. I was just trying to formulate the answer on how I see the staying ahead parts, and how I can talk about it. At the very least, I know that that is a direction we are exploring as well.

Tsai: The analytic query or information, some are not needed in real time, some are actually needed in real time, and differentiate those and provide different solutions for those, certainly could be a strategy tackling that particular space. We just need to figure out what is in scope with the customer as product as itself. What's in scope as a solution, end-to-end, because a lot of times people do not just use a product itself. When our customers architect for their problem space, you see many components are chained together. It is about, how do you leverage those, and solve the problem with the best outcome that you will hope for?


See more presentations with transcripts


Recorded at:

Jul 15, 2023