CAP and Cloud Data Management
This article first appeared in Computer magazine and is brought to you by InfoQ & IEEE Computer Society.
The relative simplicity of common requests in Web data management applications has led to data-serving systems that trade off some of the query and transaction functionality found in traditional database systems to efficiently support such features as scalability, elasticity, and high availability.
The perspective described here is informed by my experience with Yahoo’s PNUTS (Platform for Nimble Universal Table Storage) data-serving platform, which has been in use since 2008.1 As of 2011, PNUTS hosted more than 100 applications that support major Yahoo properties running on thousands of servers spread over 18 datacenters worldwide, with adoption and usage growing rapidly.2
The PNUTS design was shaped by the reality of geo-replication-accessing a copy across a continent is much slower than accessing it locally-and we had to face the tradeoff between availability and consistent data access in the presence of partitions. It is worth noting, however, that the realities of slow access lead programmers to favor local copies even when there are no partitions. Thus, while the CAP theorem limits the consistency guarantees programmers can offer during partitions, they often make do with weaker guarantees even during normal operation, especially on reads.
Background: acid and consistency
Database systems support the concept of a transaction, which is informally an execution of a program. While the systems execute multiple programs concurrently in interleaved fashion for high performance, they guarantee that the execution’s result leaves the database in the same state as some serial execution of the same transactions.
The term ACID denotes that a transaction is atomic in that the system executes it completely or not at all; consistent in that the database remains unchanged; isolated in that the effects of incomplete execution are not exposed; and durable in that results from completed transactions survive failures.
The transaction abstraction is one of the great achievements of database management systems, freeing programmers from concern about other concurrently executing programs or failures: they simply must ensure that their program keeps the database consistent when run by itself to completion. The database system usually implements this abstraction by obtaining locks when a transaction reads or writes a shared object, typically according to a two-phase locking regimen that ensures the resulting executions are equivalent to some serial execution of all transactions. The system first durably records all changes to a write-ahead log, which allows undoing incomplete transactions, if need be, and restores completed transactions after failures.
In a distributed database, if a transaction modifies objects stored at multiple servers, it must obtain and hold locks across those servers. While this is costly even if the servers are collocated, it is more costly if the servers are in different datacenters. When data is replicated, everything becomes even more complex because it is necessary to ensure that the surviving nodes in a failure scenario can determine the actions of both completed transactions (which must be restored) and incomplete transactions (which must be undone). Typically, the system can achieve this by using a majority protocol (in which writes are applied to most of the copies, or quorum, and a quorum member serves the reads). In addition to the added costs incurred during normal execution, these measures can force a block during failures that involve network partitions, compromising availability, as the CAP theorem describes.3,4
Both the database and distributed systems literature offer many alternative proposals for the semantics of concurrent operations. Although the database notions of consistency apply to a distributed setting (even though they can be more expensive to enforce and might introduce availability tradeoffs), they were originally designed to allow interleaving of programs against a centralized database. Thus, the goal was to provide a simple programming abstraction to cope with concurrent executions, rather than to address the challenges of a distributed setting.
These differences in setting have influenced how both communities have approached the problem, but the following two differences in perspective are worth emphasizing:
- Unit of consistency. The database perspective, as exemplified by the notion of ACID transactions, focuses on changes to the entire database, spanning multiple objects (typically, records in a relational database). The distributed systems literature generally focuses on changes to a single object.5
- Client- versus data-centric semantics. The database community’s approach to defining semantics is usually through formalizing the effect of concurrent accesses on the database; again, the definition of ACID transactions exemplifies this approach-the effect of interleaved execution on the database must be equivalent to that of some serial execution of the same transactions. But the distributed systems community often takes a client-centric approach, defining consistency levels in terms of a client that issues reads and writes sees (potentially) against a distributed data store in the presence of other concurrently executing clients.
The notions of consistency proposed in the distributed systems literature focus on a single object and are client-centric definitions. Strong consistency means that once a write request returns successfully to the client, all subsequent reads of the object-by any client-see the effect of the write, regardless of replication, failures, partitions, and so on. Observe that strong consistency does not ensure ACID transactions. For example, client A could read object X once, and then read it again later and see the effects of another client’s intervening write because this is not equivalent to a serial execution of the two clients’ programs. That said, implementing ACID transactions ensures strong consistency.
The term weak consistency describes any alternative that does not guarantee strong consistency for changes to individual objects. A notable instance of weak consistency is eventual consistency, which is supported by Amazon’s Dynamo system,6 among others.1,5 Intuitively, if an object has multiple copies at different servers, updates are first applied to the local copy and then propagated out; the guarantee offered is that every update is eventually applied to all copies. However, there is no assurance of the order in which the system will apply the updates-in fact, it might apply the updates in different orders on different copies. Unless the nature of the updates makes the ordering immaterial-for example, commutative and associative updates-two copies of the same object could differ in ways that are hard for a programmer to identify.
Researchers have proposed several versions of weak consistency,5 including
- read-your-writes-a client always sees the effect of its own writes,
- monotonic read-a client that has read a particular value of an object will not see previous values on subsequent accesses, and
- monotonic write-all writes a client issues are applied serially in the issued order.
Each of these versions can help strengthen eventual consistency in terms of the guarantees offered to a client.
Cloud data management
Web applications, a major motivator for the development of cloud systems, have grown rapidly in popularity and must be able to scale on demand. Systems must serve requests with low latency (tens of milliseconds) to users worldwide, throughput is high (tens of thousands of reads and writes per second), and applications must be highly available, all at minimal ongoing operational costs. Fortunately, full transactional support typically is not required, and separate systems perform complex analysis tasks-for example, map-reduce platforms such as Hadoop.
For many applications, requests are quite simple compared to traditional data management settings-the data might be user session data, with all user actions on a webpage written to and read from a single record, or it might be social, with social activities written to a single user record, and a user’s friends’ activities read from a small number of other user records.
These challenges have led to the development of a new generation of analytic and serving systems based on massively distributed architectures that involve clusters of thousands of machines. All data is routinely replicated within a datacenter for fault tolerance; sometimes the data is even georeplicated across multiple datacenters for low-latency reads. Massively distributed architectures lend themselves to adding capacity incrementally and on demand, which in turn opens the door to building multitenanted, hosted systems with several applications sharing underlying resources. These cloud systems need not be massively distributed, but many current offerings are, such as those from Amazon,6 Google,7,8 Microsoft,9 Yahoo,1 and the Cassandra and HBase open source systems.
Although Web data management provided the original motivation for massively distributed cloud architectures, these systems are also making rapid inroads in enterprise data management. Furthermore, the rapid growth in mobile devices with considerable storage and computing power is leading to systems in which the number of nodes is on the order of hundreds of millions, and disconnectivity is no longer a rare event. This new class of massively distributed systems is likely to push the limits of how current cloud systems handle the challenges highlighted by the CAP theorem.
The belief that applications do not need the greater functionality of traditional database systems is a fallacy. Even for Web applications, greater functionality simplifies the application developer’s task, and better support for data consistency is valuable: depending on the application, eventual consistency is often inadequate, and sometimes nothing less than ACID will suffice. As the ideas underlying cloud data-serving systems find their way into enterprise-oriented data management systems, the fraction of applications that benefit from (indeed, require) higher levels of consistency and functionality will rise sharply. Although it is likely that some fundamental tradeoffs will remain, we are witnessing an ongoing evolution from the first generation of cloud data-serving systems to increasingly more complete systems.
Bigtable and HBase are systems that write synchronously to all replicas, ensuring they are all always up to date. Dynamo and Cassandra enforce that writes must succeed on a quorum of servers before they return success to the client. They maintain record availability during network partitions, but at the cost of consistency because they do not insist on reading from the write quorum. Mega-store comes closer to the consistency of a traditional DBMS, supporting ACID transactions (meant to be used on records within the same group; it uses Paxos for synchronous replication across regions. (Note that because new systems are being announced in this space at a rapid pace, this is not meant to be a comprehensive survey.)
Pnuts: a case study
Yahoo has 680 million customers and numerous internal platforms with stringent latency requirements (fewer than 10 ms is common). Servers can fail, and individual datacenters can suffer network partitions or general shutdown due to disaster, but data must remain available under any failure conditions, which is achieved via replication at datacenters. We developed PNUTS to support CRUD-create, retrieve, update, delete-workloads in this setting.
Many applications have moved to PNUTS either from pure LAMP (Linux, Apache, MySQL, PHP) stacks or from other legacy key-value stores. Illustrative applications include Yahoo’s user location, user-generated content, and social directory platforms; the Yahoo Mail address book; Yahoo Answers, Movies, Travel, Weather, and Maps applications; and user profiles for ad and content personalization. The reasons for adopting PNUTS include flexible records and schema evolution; the ability to efficiently retrieve small ranges of records in order (for example, comments by time per commented-upon article); notifications of changes to a table; hosted multidatacenter storage; and above all, reliable, global, low-latency access.
At Yahoo, the experience with PNUTS led to several findings:
- The cloud model of hosted, on-demand storage with low-latency access and highly available multidatacenter replication has proved to be very popular.
- For many applications, users are willing to compromise on features such as complex queries and ACID transactions.
- Additional features greatly increase the range of applications that are easily developed by using the PNUTS system for data management, and, not surprisingly, the adoption for these applications. In particular, providing support for ordered tables-which allows arranging tables according to a composite key and enables efficient range scans-sparked a big increase in adoption, and we expect support for selective replication and secondary indexes to have a similar effect.
Users have pushed for more options in the level of consistency.
In the context of this article, the most relevant features are those that involve consistency.
PNUTS was one of the earliest systems to natively support geographic replication, using asynchronous replication to avoid long write latencies. Systems that make copies within the same datacenter have the option of synchronous replication, which ensures strong consistency. This is not viable for cross-datacenter replication, which requires various forms of weak consistency.
However, eventual consistency does not always suffice for supporting the semantics natural to an application. For example, suppose we want to maintain the state of a user logged in to Yahoo who wants to chat. Copies of this state might be maintained in multiple georegions and must be updated when the user decides to chat or goes offline. Consider what happens when the two regions are disconnected because of a link failure: when the link is restored, it is not sufficient for both copies to eventually converge to the same state; rather, the copies must converge to the state most recently declared by the user in the region where the user was most recently active.
It is possible to update the copies via ACID transactions, but supporting ACID transactions in such a setting is a daunting challenge. Fortunately, most Web applications tend to write a single record at a time-such as changing a user’s chat status or home location in the profile record-and it is acceptable if subsequent reads of the record (for example, by a friend or user) do not immediately see the write. This observation is at the heart of the solution we adopted in PNUTS, called timeline consistency.
In timeline consistency, an object and its replicas need not be synchronously maintained, but all copies must follow the same state timeline (possibly with some replicas skipping forward across some states). PNUTS does not allow objects to go backward in time or to not appear in the timeline associated with the object. The approach essentially is primary copy replication.1 At any given time, every object has exactly one master copy (in PNUTS, each record in a table is an object in this sense), and updates are applied at this master and then propagated to other copies, thereby ensuring a unique ordering of all updates to a record. Protocols for automatically recognizing master failures and transferring mastership to a surviving copy ensure high availability and support automated load-balancing strategies that transfer this mastership to the location where a record is most often updated.
To understand the motivation behind this design, consider latency. Data must be globally replicated, and synchronous replication reduces latency unacceptably. Database systems support auxiliary data structures such as secondary indexes and materialized views, but maintaining such structures synchronously in a massively distributed environment further increases latency.
The requirement of low-latency access therefore leads to asynchronous replication, which inherently compromises consistency.10 However, because object timelines provide a foundation that can support many read variants, applications that can tolerate some staleness can trade off consistency for performance, whereas those that require consistent data can rely on object timelines.
Timeline consistency compromises availability, but only in those rare cases where the master copy fails and there is a partition or a failure in the messaging system that causes the automated protocol for transferring mastership to block. Additionally, timeline consistency weakens the notion of consistency because clients can choose to read older versions of objects even during normal operation. Again, this reflects a fundamental concern for minimizing latency, and is in the spirit of Daniel Abadi’s observation that the CAP theorem overlooks an important aspect of large-scale distributed systems-namely, latency (L). According to Abadi’s proposed reformulation,11 CAP should really be PACELC:
If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?
Selective record replication
Many Yahoo applications have a truly global user base and replicate to many more regions than needed for fault tolerance. But while an application might be global, its records could actually be local. A PNUTS record that contains a user’s profile is likely only ever written and read in one or a few geographic regions where that user and his or her friends live. Legal issues also arise at Yahoo that limit where records can be replicated; this pattern typically follows user locality as well.
To address this concern, we added per-record selective replication to PNUTS.12 Regions that do not have a full copy of a record still have a stub version with enough metadata to know which regions contain full copies for forwarding requests. A stub is only updated either at record creation or deletion, or when the record’s replica location changes. Normal data updates are only sent to regions containing full copies of the record, saving bandwidth and disk space.
The case for a consistency spectrum
Cloud data management systems designed for real-time data serving and workload updates amply illustrate the realities of the CAP theorem: such systems cannot support strong consistency with availability in the presence of partitions. Indeed, such massively distributed systems might settle for weaker consistency guarantees to improve latency, especially when data is georeplicated. In practice, a programmer using such a system must be able to explicitly make tradeoffs among consistency, latency, and availability in the face of various failures, including partitions.
Fortunately, several consistency models allow for such tradeoffs, suggesting that programmers should be allowed to mix and match them to meet an application’s needs.
We organize the discussion to highlight two independent dimensions: the unit of data that is considered in defining consistency and the spectrum of strong to weak consistency guarantees for a given choice of unit.
Unit of consistency
While the database literature commonly defines consistency in terms of changes to the entire database, the distributed systems literature typically considers changes to each object, independent of changes to other objects. These are not the only alternatives; intuitively, any collection of objects to which we can ensure atomic access and that the system replicates as a unit can be made the unit of consistency. For example, any collection of objects collocated on a single server can be a reasonable choice as the unit of consistency (from the standpoint of ensuring good performance), even in the presence of failures.
One widely recognized case in which multirecord transactions are useful is an entity group, which comprises an entity and all its associated records. As an example, consider a user (the "entity") together with all user-posted comments and photos, and user counters, such as number of comments. It is frequently useful to update the records in an entity group together, for example, by inserting a comment and updating the number-of-comments counter. Usually, an entity group’s size is modest, and a single server can accommodate one copy of the entire set of records.
Google’s App Engine provides a way to define entity groups and operate on them transactionally; Microsoft’s Azure has a similar feature that allows record grouping via a partition key as well as transactional updates to records in a partition. The basic approach to implementing transactions over entity groups is straightforward and relies on controlling how records are partitioned across nodes to ensure that all records in an entity group reside on a single node. Then, the system can invoke conventional database transaction managers without using cross-server locks or other expensive mechanisms.
This model has two basic restrictions: first, the entity group must be small enough to fit on a single node; indeed, for effective load balancing, the size must allow many groups to fit on a single node. Second, the definition of an entity group is static and typically specifies a composite key over the record’s attributes. A recent proposal considers how to relax the second restriction and allow defining entity groups more generally and dynamically.13
A consistency spectrum
We begin by discussing a spectrum of consistency across copies of a single object and then discuss how to generalize these ideas to handle other units of consistency.
Consistency models for individual objects. Timeline consistency offers a simple programming model: copies of a record might lag the master copy, but the system applies all updates to every copy in the same order as the master. Note that this is a data-centric guarantee. From the client’s perspective, monotonic writes are guaranteed, so an object timeline-a timestamp generated at the master object that identifies each state and its position on the object’s timeline-can support several variants of the read operation, each with different guarantees:
- Read-any. Any copy of the object can be returned, so if a client issues this call twice, the second call might actually see an older version of the object, even if the master copy is available and timeline consistency is enforced. Intuitively, the client reads a local copy that later becomes unavailable, and the second read is served from another (nonmaster) copy that is more stale.
- Critical-read. Also known as monotonic read, critical-read ensures that the copy read is fresher than any previous version the client sees. By remembering the last client-issued write, the critical-read operation can extend to support read-your-writes, although to make this more efficient, it might be necessary to additionally cache a client’s writes locally.
- Read-up-to-date. To get the current version of the object, read-up-to-date accesses the master copy.
- Test-and-set. Widely used in PNUTS, test-and-set is a conditional write applied only if the version at the master copy when the write applied is unchanged from the version previously read by the client issuing the write. It is sufficient to implement single-object ACID transactions.
Timeline consistency over entity groups. A natural generalization of timeline consistency and entity groups is to consider entity group timelines rather than individual records. The timeline has a master copy of each entity group, rather than each record, and applies transactional updates to an entity group (possibly affecting multiple records) and at the master copy, just like individual updates of a single record in timeline consistency. The transaction sequence is then logged, asynchronously shipped to the sites with copies of the entity group, and reapplied at each such site.
Although this generalization should be supportable with performance and availability characteristics comparable to timeline consistency and entity group consistency, I am not aware of any systems that (yet) do so. It seems an attractive option on the consistency spectrum, and covers many common applications that would otherwise require full ACID transactions.
Offering consistency choices
Geographic replication makes all records always available for read from anywhere. However, anytime a distributed system is partitioned due to failures, it is impossible to preserve both write consistency and availability. One alternative is to support multiple consistency models and let the application programmer decide whether and how to degrade in case of failure.
Eric Brewer suggests14 thinking in terms of a partition mode-how a client enters and exits this mode, and what it does while in partition mode and upon exit. Intuitively, a client enters the partition mode (due to a failure of some kind, triggered by a mechanism such as a time out) when it cannot complete a read or write operation with the desired level of consistency. The client must then operate with the recognition that it is seeing a version of the database that is not strongly consistent, and when emerging from partition mode (when the system resolves the underlying failure and signals this state in some way), the client must reconcile inconsistencies between the objects it has accessed.
In the PNUTS implementation of timeline consistency, a client enters the partition mode when it attempts to write an object but the write is blocked because the master copy is unreachable, and the mastership transfer protocol also is blocked, typically because of a partition or site failure. At this point, the client should be able to choose to degrade to eventual consistency from timeline consistency and continue by writing another copy. However, the system is now in a mode in which the given object does not have a unique master. At some future point, the client must explicitly reconcile different versions of the object-perhaps using system-provided version vectors for the object-if the weaker guarantees of eventual consistency do not suffice for this object.
PNUTS is not this flexible-it requires the programmer to choose between timeline consistency and eventual consistency at the level of a table of records. It treats each record in a table as an object with the associated consistency semantics. Inserts and updates can be made to any region at any time if eventual consistency is selected. For programmers who understand and can accept the eventual consistency model, the performance benefits are great: the system can perform all writes locally at the client, greatly improving write latencies. Given a server node failure, another (remote) node will always be available to accept writes.
PNUTS can also enter partition mode if a read request requires access to the master copy; again, the client has the choice of waiting for the system to restore access or proceeding by reading an available copy. Furthermore, a client can choose to read any available copy if a slight lag from the master copy is acceptable even during normal operation.
Although massively distributed systems provide multiple abstractions to cope with consistency, programmers need to be able to mix and match these abstractions. The discussion of per-object timeline consistency highlights how data- and client-centric approaches to defining consistency are complementary. The discussion of how to build on per-object timeline consistency to support different client-side consistency guarantees carries over to timeline consistency over entity groups. Indeed, this is a useful way to approach consistency in distributed systems in general:
- decide on the units of consistency that the system is to support;
- decide on the consistency guarantees the system supports from a data-centric perspective;
- decide on the consistency guarantees the system supports from a client-centric perspective; and
- expose these available choices to programmers through variations of create-read-write operations, so they can make the tradeoffs among availability, consistency, and latency as appropriate for their application.
For example, we could allow
- the type of consistency desired-timeline or eventual-to be a property of a collection of objects;
- alternative forms of reading an object that support various client-centric consistency semantics;
- flexible definition of groups of objects to be treated as one object for consistency purposes; and
- specification of how to degrade gracefully to weaker forms of consistency upon failure.
Determining the right abstractions and the right granularities for expressing these choices requires further research and evaluation. It is time to think about programming abstractions that allow specifying massively distributed transactions simply and in a manner that reflects what the system can implement efficiently, given the underlying realities of latency in wide-area networks and the CAP theorem.
The tradeoff between consistency on one hand and availability/performance on the other has become a key factor in the design of large-scale data management systems. Although Web-oriented systems led the break from traditional relational database systems, the ideas have begun to enter the database mainstream, and over the next several years, cloud data management for enterprises will offer database administrators some of the same design choices.
The key observation is that we have choices-not just between ACID transactions and full RDBMS capabilities on one side and NoSQL systems offering no consistency guarantees and minimal query and update capabilities on the other. We will see systems that are somewhere in the middle of this spectrum, striving to provide as much functionality as possible while satisfying the availability and performance demands of diverse application settings. Designing abstractions that cleanly package these choices, developing architectures that robustly support them, and optimizing and autotuning these systems will in turn provide research challenges for the next decade.
In writing this article, I was strongly influenced by the experience gained in designing, implementing, and deploying the PNUTS system (also known as Sherpa within Yahoo), and thank the many people who contributed to it and to the many papers we wrote jointly. PNUTS is a collaboration between the systems research group and the cloud platform group at Yahoo. I also thank the anonymous referees and Daniel Abadi for useful feedback that improved the article, and Eric Brewer for sharing a preprint of his article that appears elsewhere in this issue.
1. B.F. Cooper et al., "PNUTS: Yahoo!’s Hosted Data Serving Platform," Proc. VLDB Endowment (VLDB 08), ACM, 2008, pp. 1277-1288.
2. A. Silberstein et al., "PNUTS in Flight: Web-Scale Data Serving at Yahoo," IEEE Internet Computing, vol. 16, no. 1, 2012, pp. 13-23.
3. E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing. (PODC 00), ACM, 2000, pp. 7-10.
4. S. Gilbert and N. Lynch, "Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, June 2002, pp. 51-59.
5. W. Vogels, "Eventually Consistent," ACM Queue, vol. 6, no. 6, 2008, pp. 14-19.
6. G. DeCandia et al., "Dynamo: Amazon’s Highly Available Key-Value Store," Proc. 21st ACM SIGOPS Symp. Operating Systems Principles (SOSP 07), ACM, 2007, pp. 205-220.
7. F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," ACM Trans. Computers, June 2008, article no. 4; doi:10.1145/1365815.1365816.
8. J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," Proc. Conf. Innovative Database Research (CIDR 11), ACM, 2011, pp. 223-234.
9. P.A. Bernstein et al., "Adapting Microsoft SQL Server for Cloud Computing," Proc. IEEE 27th Int’l Conf. Data Eng. (ICDE 11), IEEE, 2011, pp. 1255-1263.
10. P. Agrawal et al., "Asynchronous View Maintenance for VLSD Databases," Proc. 35th SIGMOD Int’l Conf. Management of Data, ACM, 2009, pp. 179-192.
11. D.J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design," Computer, Feb. 2012, pp. 37-42.
12. S. Kadambi et al., "Where in the World Is My Data?" Proc. VLDB Endowment (VLDB 2011), ACM, 2011, pp. 1040-1050.
13. S. Das, D. Agrawal, and A.E. Abbadi, "G-Store: A Scalable Data Store for Transactional Multi Key Access in the Cloud," Proc. ACM Symp. Cloud Computing (SoCC 10), ACM, 2010, pp. 163-174.
14. E. Brewer, "Pushing the CAP: Strategies for Consistency and Availability," Computer, Feb. 2012, pp. 23-29.
About the Author
Raghu Ramakrishnan heads the Web Information Management Research group at Yahoo and also serves as chief scientist for cloud computing and search. Ramakrishnan is an ACM and IEEE Fellow and has received the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Packard Foundation Fellowship, and the Distinguished Alumnus Award from IIT Madras. Contact him at email@example.com.
Computer, the flagship publication of the IEEE Computer Society, publishes highly acclaimed peer-reviewed articles written for and by professionals representing the full spectrum of computing technology from hardware to software and from current research to new applications. Providing more technical substance than trade magazines and more practical ideas than research journals. Computer delivers useful information that is applicable to everyday work environments.
Dmytro Svarytsevych Oct 30, 2014