Real-World Consistency Explained: Uwe Friedrichsen Discusses His Favourite Academic Papers
At the microXchg 2016 conference, held in Berlin, Germany, Uwe Friedrichsen presented a deep-dive into “real-world consistency explained”. Friedrichsen referenced multiple academic papers and discussed topics such as ACID vs BASE, his belief that many developers may not fully understand consistency guarantees with a typical SQL database, and how the implementation of consistency models should be pushed up into the application-level code of distributed (microservice) systems.
InfoQ sat down with Friedrichsen, fellow at codecentric AG, and explored the ideas presented in his talk in more detail. Key takeaways from the discussion included a learning path for better understanding the topic of consistency, the importance of implementing a correct consistency model throughout the system (not just within the data store), and how emerging technologies like Storage-Class Memory (SCM) and Remote Direct Memory Access (RDMA) may change our view of consistency.
InfoQ: Hi Uwe, thanks for talking to InfoQ today. Can you explain the basic premise of your microXchg talk, and also the motivations for sharing this knowledge?
That is a tricky question, as it was a series of insights that led me to create this talk, and so it take me a bit more than just two or three sentences to answer the question. I will give it a try, though - and apologize upfront for the lengthy answer.
When reasoning about consistency, most people (including myself for quite a long time) only have two models in mind: Strong consistency, usually based on ACID transactions in non-distributed systems, and weak consistency, usually based on BASE transactions in distributed systems - plus maybe a bit of Paxos or alike to stretch the concept of strong consistency into the world of distributed systems.
For a developer, this means that you either have a quite simple programming model that does not scale well on distributed systems, or you end up with an extremely hard programming model which works on distributed systems. Based on my observations, the latter model is really problematic in most projects as BASE transactions and their implications are not understood very well. Most developers make implicit consistency assumptions when working with BASE transactions that do not hold true - partially due to a lack of knowledge, and partially because the underlying programming model is so mind-twisting.
In that setting, I stumbled upon some papers of Peter Bailis which were extremely interesting. One of the driving ideas of these papers was: How much consistency can you get without giving up scalability/availability (which are basically the same thing)? The ideas presented in these papers were intriguing. Peter Bailis and his co-workers presented a multitude of consistency models between the two models described before, each providing a different set of consistency guarantees and many of them highly available (i.e., scalable).
As each additional consistency guarantee makes the life of a developer a lot easier and many of them were even highly available, I decided to dive deeper. This led me to a plethora of other papers, all dealing with consistency in distributed systems. Even though the research implementations shown in the papers were usually not mature enough to be used in commercial software development, a clear trend became visible: Provide developers with better consistency guarantees, i.e., provide them with an easier to grasp programming model without giving up high availability or low latency.
Besides the insights, I got from those papers, there were two additional drivers for the talk. Some of the papers I read while diving deeper into the consistency topic dissected the ACID consistency model and reminded me of something, I learned many years ago but almost had forgotten along my way through IT: If we talk about ACID transactions, we usually have the strongest variant - serializable transactions - in mind. Serializable transactions are great because they are easy to reason about. They are the perfect consistency model, most of us have in mind when thinking about transactions: Start a transaction, do something, commit. Either all of it works or nothing. No inconsistencies, no anomalies - just great.
However, ACID in the real world rarely means serializable transactions. The ANSI SQL standard defines several isolation (i.e., consistency) levels and serializability is the strongest level amongst them. The problem with serializability is that it usually means a big performance hit and thus rarely any production databases run serializable isolation. This means that you can and will experience anomalies in production environments, even if you use ACID transactions, but most developers are not aware of that. This mismatch between the mental model of most developers and reality leads to problems in production more often than most people assume.
The last driver is an unfortunate tendency (not only) in the area of data stores that I have observed for quite a while. It is that "one size fits all" thinking. About 10 years ago, everything needed to be a relational database. Then, with the rise of NoSQL new options came up that helped to fill the gaps of relational databases. Unfortunately, the "one size fits all" habit of the relational database times stayed: First, the new solution for everything was Hadoop, after that MongoDB, and now we see everybody jumping on Cassandra.
Even though all those database are great in some areas, they are not a general purpose solution that suits any arbitrary need. There are many dimensions to consider when reasoning about an appropriate data store, e.g., the required consistency model, the richness of the data model, the richness of the query model, the scalability and many more. None of the databases is great in all dimensions. Thus, it is necessary to carefully reason about the right fit for the given requirements. The diversity of storage requirements also requires diversity in the solution domain.
These were the drivers that led to the microXchg talk - as I was afraid, a way too long answer, but I have no idea how to put this in two or three sentences ...
InfoQ: You mentioned that a lot of the talk was inspired by Adrian Colyer's "Morning Paper" blog series. Are you a regular reader, and if so, would you recommend reading academic style papers as a developer/architect?
Yes, I have Adrian's excellent "Morning paper" blog in my news feed for about a year now. Before I read it sporadically. The downside of his blog is that he very often picks extremely interesting papers and also his original posts are usually excellent, which means that following his blog is a real time killer... but it is definitely worth the time.
I would recommend that developers read academic-style papers, if you really want to deep-dive into a topic. Actually, quite often those papers are the only source of good in-depth content. If you are new to a topic, you usually can start with a book or some online tutorials or courses. If you decide then to dig deeper, you still find some blog posts or an "expert level" talk or article. But for a real deep-dive, you have to dig down to the fundamentals and climb up to the highest peaks - and those parts are usually only covered in academic style papers, at least in the area of distributed systems.
InfoQ: In your talk you suggested that ACID may be a great programming model, but the reality of the implementation of this model may often be different. Could you explain a little more about this please?
I think I already have answered this question partially in my reply to your first question. The core point is that the typical mental model regarding ACID transactions are usually serializable transactions ("ideal transactions"), but due to the huge performance hit inflicted by serializable isolation, most database use a weaker isolation level. Thus, we design and implement our applications with an ideal transaction model in mind, neglecting that reality is not ideal, i.e., anomalies will hit us. This leads to applications that are confronted with data anomalies they do not know how to deal with - which in the end means "strange" production problems, no one knows how to deal with because no one knows how they can occur.
InfoQ: When talking about the "present" of real-world consistency you mentioned that cloud computing, microservices, NoSQL and BASE are all important topics within this theme. Do you believe that consistency models are given enough focus within an average development project, for example, when designing scaling or data storage/access guarantees?
Based on my observations, no.
In my opinion, there are several reasons for that shortcoming. First of all, there is a lack of knowledge. Though, I would not blame anybody for that. Understanding consistency, especially in distributed systems, is a very challenging, sometimes brain-twisting topic and it takes time to really grasp it. On the other hand, most developers and architects get flooded with stuff to do all day: tons of requirements, contrary stakeholder opinions, arbitrary deadlines, new technologies every other day, complemented with lots of bogus routine reporting. When should those people find the time to dive into the challenges of consistency? Thus, the understanding of consistency usually remains quite shallow.
Additionally, if you want to achieve better consistency guarantees in a distributed system than just plain eventual consistency as Pat Helland described it in his famous "Life beyond distributed transactions" [PDF] paper, it cannot be left to the datastore only. As with resilience to maximize availability in a distributed system, applications and the underlying systems, i.e., the databases, have to play together. This means that you can no longer leave consistency just to the data store, but you have to explicitly design it into your application. Again, Pat Helland described this very clearly in another famous paper, "Building on quicksand" [PDF].
This means that we need to explicitly reason about the consistency guarantees, our application operations need to satisfy the business requirements. We also need to discuss this topic with our stakeholders in order to find the right consistency level, which often is challenging. And last but not least, we have to implement the consistency model on application and data store level, which is also challenging for most designers and developers.
And if we take the continuous lack of time into account, I mentioned before, we find most developers left without time to reason about and design systems that provide them with a better consistency model, which in turn makes their programming model a lot easier, which in turn saves them a lot of time - a bit unfortunate, but we see this problem quite often, not only with respect to consistency.
InfoQ: Could you talk about the current research that is exploring the "frontiers" of consistency models? What are the trends for memory and storage?
One frontier is “how far we can push consistency in a distributed system without giving up high availability (including scalability)?” Actually, this frontier is clear. The best you can get is causal consistency, accepting sticky high availability (which means that a client always sends its requests to the same node of the data store). Most of the current research is about finding better ways to implement the consistency models between pure eventual consistency and causal consistency - easier to reason about, easier to use, more efficient, more general. None of the current solutions is ready for commercial software development yet, but I expect a lot of advances in the future.
On the other hand, a lot of effort is put in strong consistency models in distributed systems with low-latency properties. Accepting that strong consistency never can be highly available, the effort is put into making it as efficient and fast as possible. Having in mind that some use cases require strong consistency, and that a strong consistency programming model is a lot easier than a weak consistency programming model, this is also a very valuable research direction.
Besides those research trends, we see new memory and storage trends emerge. Storage class memory (SCM) fills the gap between RAM and SSD, i.e., provides us with persistent memory that is in the best case as fast as today's RAM. Also RDMA (Remote DMA) implementations become available, i.e., access to remote computer's RAM, bypassing any CPUs. Even if slower than accessing local RAM, RDMA still is about two orders of magnitude (80-100 times) faster than local SSD access.
Those two storage trends open up lots of new system architecture options and new application architecture options, which could be revolutionary. Since the beginning of computing, we always lived with the relation: Speed of CPU >> speed of volatile memory >> speed of persistent memory. Therefore, we put incredible amounts of effort into keeping the CPU busy - caching all the way between CPU and persistent memory (ultimately RAM is nothing but another cache layer) and lots of tricks in our architectures to make sure that we maximize utilization of our CPUs. With the new storage trends, the old relation does no longer hold true: We now need many CPUs to saturate our storage devices, which makes a lot of "proven knowledge" obsolete.
I actually do not know yet where this will lead us to, but I am sure that new system and application architectures will emerge that fit those new constraints better than the existing ones. This will also have consequences for consistency models in distributed systems, but actually I think, this will become a lot bigger than just new consistency options. We will most likely end up with completely new ways to design and implement our applications.
InfoQ: If readers are keen to learn more about the present and future of consistency, especially in relation to the application-level requirements and effects on the programming model, could you recommend a list of articles to read, or some form of 'crash course' or guidelines?
Sure. Watch my microXchg talk... just kidding. Still, I haven't found any compact introduction to the topic yet. For the preparation of the talk I have used ~40 papers, and I have included a list of ~20 references in my slide deck (slides 55-57).
Therefore, unfortunately I cannot provide you with any compact crash course or alike. I only can offer you a small guideline through the references, you find in my slide deck. Here it is:
I would start with "Building on Quicksand" by Pat Helland and Dave Campell. This paper provides you with the rationale why we need to reason about consistency at the application operation level. Without understanding the "why", nothing makes a lot of sense - therefore this paper first.
Then I would read the "Out of the Fire Swamp" blog series by Adrian Colyer which nicely sums up a lot of the research that is currently going on.
After that, I would read the "Highly Available Transactions: Virtues and Limitations" paper by Peter Bailis et al. which defines kind of a consistency tree as well as "A Critique of ANSI SQL Isolation Levels" by Hal Berenson which describes the limitations of ACID transaction. Adrian Colyer also summed up some of the concepts described in those papers in some recent "Morning paper" blog posts.
Those papers and posts provide you with the core basics. Afterwards dive deeper wherever you like, either using the other references, I provided in my slide deck or references, you find in the aforementioned papers and posts.
InfoQ: Thanks again for your time today! Is there anything else you would like to share with the InfoQ reader?
Thank you a lot! I think that even though we have only just scratched the surface of the topic and there is a lot more to share, I already have said way too much!
I hope that I was able to spark some curiosity to dig deeper into the topic.
The video recording of Uwe Friedrichsen’s microXchg talk, “Real-world consistency explained”, can be found on the conference’s YouTube channel
One of the new transaction models related to this topic is Try-Confirm/Cancel (TCC) available in open source as part of Atomikos TransactionsEssentials