
CRDTs in Production


Summary

Dmitry Martyanov talks about how PayPal developed a distributed system dealing with consistency issues and shares the lessons he learned in developing the system based on an eventually consistent data store. The solution utilizes conflict-free replicated data types with causality tracking to achieve strong eventual consistency for critical data in multi-master, multi-datacenter DB deployments.

Bio

Dmitry Martyanov is a Software Engineer at PayPal working with a focus on distributed systems and resilient architectures.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Martyanov: Thank you, everyone. Welcome to the CRDTs in Production talk. Let's get started. If you work on the development of an international product, customer experience is always impacted by the distance to the data center serving the request. You start considering ways to move data centers closer to all your customers, and eventually you come up with a deployment schema containing multiple data centers across the globe. But in reality, this transition is extremely challenging, especially if your data processing relies on transactions.

I started working on this initiative in 2016 when I joined the PayPal regulatory compliance platform team. In this talk, I would like to cover, first, our thought process when we started developing this distributed system with eventual consistency based on CRDTs, and second, our learnings from this technical journey.

Let me provide some context. PayPal operates in more than 200 countries, and its products have to be aligned with regulatory requirements for each particular region. The compliance platform team is responsible for the technical support of these parts. At a very high level, it looks like a state machine of compliance statuses, where transitions might be performed by different actors. I will give you a simple example. When you create an account, your compliance status is in some initial state. Then you submit documents proving your identity. Once these documents are reviewed, your compliance status is flipped to some completed state and you are able to start using the products.

Our task was to design and execute a transition of the system to a state where you can deploy it in a geo-distributed environment, with multiple data centers, high availability, and low dependencies on infrastructure. Honestly, when I first read the concept document about what this desired state should look like, I thought, "Wait. Is this feasible?" Because the CAP theorem tells us about the trade-off between availability and consistency.

Fundamentals

I would like to recap some fundamentals. We all know that shared mutable state is a root cause of the majority of problems in software: it increases accidental complexity and introduces side effects, and good design usually minimizes shared mutable state. That's why we like functional programming. But since this problem appeared at the beginning of the age of programming, because of hardware architecture, an immediate solution was introduced, and this solution is mutexes, or locks. The idea is to limit access to a data point for the sake of consistency of data modifications. And locks are an excellent solution. They work fast, they are very reliable, and they have been validated over time.

When you think about this concept in the space of mutable state and data stores, it is implemented through transactional processing. Transactions guarantee us atomicity, consistency, isolation, and durability of our operations. They also work very well when you are within the same data center and you don't have cross-globe message propagation.

But when you move to a geo-distributed environment, transactions do not work so well anymore, because the physical distance between data centers contributes a lot to transaction time. If you have high-load traffic, you start observing transaction failures and transaction overlaps, and this becomes a scalability bottleneck for your system. This is exactly what the CAP theorem tells us about the trade-off between consistency and availability.

Eventual Consistency

At this point, some teams start exploring the other side of this spectrum in the CAP theorem, sacrificing a little consistency for an increase in availability. For most of them, it works pretty well. But if you really have a high load of traffic and your data modifications happen much more frequently than replication, this works even worse, because instead of transaction failures and transaction overlaps, you will see data losses, which usually are not under your control and are very difficult to trace.

This happens because a distributed system contains multiple components. Each of them has a local copy of your data, called a replica. Each component is reliable by itself and guarantees strong consistency internally, and the components communicate with each other through asynchronous message passing. This communication is not so reliable: you cannot rely on the time required for the communication, or on the connection being successfully established at all.

When you succeed with a put operation on replica A, then in a subsequent read operation for the same data point you should be able to observe the result of that put. But if you perform the read operation for the same data point on another replica, the result is not defined, because you might or might not observe the expected value, depending on whether replication has already happened or not. This is an extremely important problem, because it literally means that every time you read a data point, you don't know whether it is current or not, and you have no means to verify it. And of course, if you transform this data value into some other value, your system might start to diverge. So we started thinking about how to build a system that will not diverge eventually, and we started our project with an overview of existing solutions to this problem.

Affinity Based Approaches

The first class of approaches is affinity-based approaches. They rely on the fact that within the same replica you have strong consistency. The idea is to introduce constraints on your data access patterns so that an actor will not communicate with different replicas at the same time. The most popular affinity solution is the sticky session: within a customer session, all requests are handled in the same data center and you get read-your-writes behavior. The problem with affinity-based approaches is that your data has to be sliceable in such a way. It works really well when you have a customer who modifies his personal data and nobody else touches that data. But when you have multiple actors interacting with the same data points, it is not a trivial task to define these constraints, to define affinity here.

There is also another affinity approach where you have, for example, geographic affinity, like all European requests going to the European data center. This will definitely help, but the more affinity you introduce, the less your system is really distributed, and you will start facing issues when something goes wrong with that data center and disaster recovery is required. That's why we gave up on affinity.

Coordinator Based Approaches

The second class is coordinator-based approaches. These rely on the existence of some coordinator that manages conflicts for your environment. It doesn't necessarily mean that every customer talks to the coordinator, but the coordinator handles certain critical components. We treated a coordinator very seriously as a solution, but when we started adding more detail to our design, we recognized that it is very complex to handle all the possibilities of the coordinator life cycle, because it might fail and you need to recover it in another zone, and there is some intermediate state where you need to handle your requests somehow, or put them on hold. That's why we gave up on the coordinator.

Consensus Based Approaches

The third class is consensus-based approaches. The idea is that replicas communicate with each other and have a protocol to reach agreement about data transformations. This is actually a very fundamental topic in distributed systems, and there are several very reliable products built on top of consensus, like Google Spanner or FoundationDB. But let me give you some context on why it didn't work for us.

If you work in product development, your service stack looks like multiple layers that handle client requests. The top layer is your business logic for compliance, for example, what types of documents are required for verification. Then you have a domain platform that provides data access functions, descriptions of your entity objects, and flow control mechanisms. And the last layer of your service is something like service infrastructure, which is responsible for downstream dependency discovery, routing and balancing, and failover strategies like request retries. The same situation exists on the data store side. You have some space where you manage your data, but at the same time there is a huge amount of work related to the configuration of the data store: how it is deployed, what the failover strategy is, and what the logical-to-physical mappings are.

When you are on the product development team, you usually own these parts, and the service infrastructure is provided for you as a service. Usually, when you need to call some downstream dependency, you don't necessarily know how the pools of that downstream dependency are organized. The same goes for the data store: you don't know how many real hardware nodes are used to support your database.

But if you want to go with consensus, you need to work on those parts, because you need to know the configuration of your cluster: how many replicas there are, how they interact with each other, how to react if a replica fails, and so on. So typically, consensus is implemented in some shared layer. For large enterprises, this is very difficult to do, because the responsibility and implementation will be shared across multiple teams, and you need to be injected into your service infrastructure roadmap, which has its own plans for what it wants to work on. If you own your database end to end and you own every process that works with your data, consensus is a good solution for you. And this is, I think, what the FoundationDB team does, and the Google Spanner class of systems.

Conflict-Free Replicated Data Types

So at this moment, we started looking for a solution that would live in the product domain, and we started looking at conflict-free replicated data types. At that point, I already had some experience with CRDTs in Riak. I had also looked at CRDTs in Akka Distributed Data. I knew that [inaudible 00:12:07] is pretty positive about the safety of using CRDTs, so we decided to give it a try.

There are two big classes of CRDTs. The first class is commutative (operation-based) CRDTs, where replicas communicate with each other by passing update operations; these update operations have to be commutative and associative so that ordering does not matter, and your environment has to provide exactly-once delivery. A very rough example of commutative CRDTs is payments in a blockchain network: the add-subtract operation is commutative and associative, and blockchain magic helps you receive events only once.
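As a rough illustration of why ordering does not matter for such operations, here is a small Scala sketch; the Balance type and the deltas are invented for the example, this is not blockchain code:

    object BalanceExample {
      // Operation-based sketch: each update is "add a delta to the balance".
      // Addition of deltas is commutative and associative, so any delivery order
      // yields the same result -- provided every delta is delivered exactly once.
      final case class Balance(amount: BigDecimal) {
        def apply(delta: BigDecimal): Balance = Balance(amount + delta)
      }

      val deltas = List(BigDecimal(25), BigDecimal(-10), BigDecimal(5))

      // Any permutation of the deltas folds to the same balance.
      val forward  = deltas.foldLeft(Balance(BigDecimal(0)))((balance, delta) => balance(delta))
      val backward = deltas.reverse.foldLeft(Balance(BigDecimal(0)))((balance, delta) => balance(delta))
      assert(forward == backward)
    }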

The second class is convergent (state-based) CRDTs, where replicas communicate with each other by sending their entire state, and when another replica receives that state, it merges it with its local one. This merge operation also has to be commutative and associative, but it has to be idempotent as well, and the environment only needs to support at-least-once delivery, which is much easier.

The classical example of a convergent CRDT is the grow-only set. This is a set of unique elements without a remove operation. You can do a thought experiment and permute what can happen with this data structure in a concurrent environment, and you will see that eventually it is consistent and convergent: it will contain the union of all the elements from the intermediate stages.
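A minimal sketch of a grow-only set in Scala; the GSet name and shape are my own, not the speaker's code:

    // State-based grow-only set (G-Set). The merge is a set union, which is
    // commutative, associative, and idempotent, so replicas converge no matter
    // how often or in what order states are exchanged.
    final case class GSet[A](elements: Set[A]) {
      def add(a: A): GSet[A] = GSet(elements + a)
      def merge(other: GSet[A]): GSet[A] = GSet(elements union other.elements)
    }

    object GSet {
      def empty[A]: GSet[A] = GSet(Set.empty[A])
    }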

Convergent CRDTs

We decided to go with convergent CRDTs. For that, we needed to implement a merge function for our business data types that would be commutative, associative, and idempotent. This is not an easy task, because business data doesn't look like an academic data structure such as a set; it contains fields and parameters. But we were very inspired by the fact that, since this is just a data type plus operations on it, implementing a CRDT is similar to the situation where you need to add one more field to your data model or one more column to your database. You don't need to go talk to the DB administrators. You don't need to go talk to service infrastructure. You just modify your object, modify your API interfaces, and persist the new data model in your data store.

Online Flight Check-in System

Before I introduce our solution, CRDTs based on causality tracking, I want to go through an example of how replication works in a typical data store. I will not use the compliance domain as the example; I will use an online flight check-in system, because everyone has experience with it, when you want to pick a seat. On the diagram, we see two replicas, replica A and replica B, and in some initial state the value is different in each. In replica A, we have the value 12F, and in replica B, we have the value 16D. After cross data center replication, replica B has the value 12F, so this value was replicated to replica B.

What exactly happened here? A database usually maintains some metadata about your records. It might be a timestamp of when the record was last modified or touched, or it might be some generation counter. When the data store sees that the values are different, it uses this metadata to resolve the conflict. And surprisingly, this is also a CRDT. This is last-write-wins, which has a merge function equal to max(a, b). So 220 is greater than 150, and that's why 12F overrides 16D.
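A hedged sketch of that last-write-wins register as a merge function; the LwwRegister name is mine, and the timestamps are the ones from the example:

    // Last-write-wins register: merge keeps the value with the larger timestamp.
    // This is itself a CRDT, but it silently discards the losing value.
    final case class LwwRegister[A](value: A, timestamp: Long) {
      def merge(other: LwwRegister[A]): LwwRegister[A] =
        if (other.timestamp > timestamp) other else this
    }

    // From the example: max(150, 220) = 220, so
    // LwwRegister("12F", 220L).merge(LwwRegister("16D", 150L)).value == "12F".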

We don't want any data in our design to be discarded because of database meta-information that is not under our control, because we want to build our solution on top of an existing data store. That's why we do this trick: instead of storing a scalar value, we extend our data type to a map. Where before it was 12F, now it becomes A1: 12F, and B1: 16D in replica B. The key represents a replica identifier plus an atomic counter of operations. We can maintain an atomic counter within a replica because the replica gives us strong consistency. So the next insert into replica A will be A2, and so forth.

Since these keys are globally unique, we want cross data center replication to merge these maps by key. In this situation, none of the values will be lost; we still have both of them. If you think about this map, the keys are globally unique and immutable, because the counter is new every time. This is an add-only map where we can only insert a new key with some value. And this is a CRDT: the merge is commutative, idempotent, and associative.
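A minimal sketch of this add-only map, assuming a key is a replica identifier plus a per-replica counter; the Dot and DottedValues names are hypothetical:

    // A "dot" uniquely identifies one write: which replica made it, and that
    // replica's local operation counter at the time.
    final case class Dot(replicaId: String, counter: Long)

    // Add-only map from dots to values. Because keys are globally unique and
    // immutable, merging by key union never loses a value, and the merge is
    // commutative, associative, and idempotent.
    final case class DottedValues[A](entries: Map[Dot, A]) {
      def put(dot: Dot, value: A): DottedValues[A] = DottedValues(entries + (dot -> value))
      def merge(other: DottedValues[A]): DottedValues[A] = DottedValues(entries ++ other.entries)
    }

    // Replica A holds {A1 -> 12F}, replica B holds {B1 -> 16D}; after replication
    // both hold {A1 -> 12F, B1 -> 16D}.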

Let's look at this map in detail. The set of keys of this map uniquely identifies the state of the replica at some moment in time, which means we can use this data to maintain causality. So what is causality? If you think about the evolution of your value as a graph, causality is the edges. For example, we have some initial value, 12F, and some client flips it to 10A. 12F is causal to 10A, so we can drop 12F. Then a client flips 10A to 5C; 10A is causal to 5C, and we can drop 10A.

Here is another situation. 12F is again causal to 10A, so we can drop 12F, but another client who doesn't know about this transition wants to flip 12F to 5C. In this situation 10A is not causal to 5C, so we cannot drop 10A. Here, 10A and 5C are concurrent values, and we call them siblings. And we cannot resolve either of them [inaudible 00:19:16] at the current moment. When we sum up all the keys of this map, it gives us a causality vector. To use this causality vector, we want to give it back to our clients when they read data. For that, we need to extend the signatures of the get and put operations: we add the causality vector to the return type of the get operation, and we add the causality vector as one of the parameters of the put operation. It is very similar to how compare-and-set works, where you provide some version of the value, and when you want to write, that version is compared with the current one; if it is not the same, you will get some [inaudible 00:19:59] exception. The only difference is that we don't need exact equivalence of the causality vector to complete a commit. We also attach the causality vector information to the data value for each put operation.
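A sketch of what the extended get and put signatures might look like; the trait and type names are assumptions for illustration, not PayPal's actual API:

    object CausalApi {
      // The causality vector summarizes which writes the reader has observed,
      // e.g. Map("A" -> 2L, "B" -> 1L) for {A2, B1}.
      type CausalityVector = Map[String, Long]
    }

    trait CausalStore[A] {
      import CausalApi.CausalityVector

      // get returns the current value(s) -- siblings come back as a list --
      // plus the causality vector the client must hand back on its next write.
      def get(key: String): (List[A], CausalityVector)

      // put carries the causality vector observed at read time; the store drops
      // every stored version covered by it and keeps concurrent ones as siblings.
      def put(key: String, value: A, observed: CausalityVector): Unit
    }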

Aerospike Datastore

Okay. Now, as I said, we wanted to implement this on an available data store, because we didn't want to develop our own, and we used Aerospike. May I ask you to raise a hand if you have experience with Aerospike? Not so many. Okay. This is a hybrid-memory key-value store. It has very high performance: high throughput and low latency. It is proven that the majority of write and read operations complete within one millisecond. Hybrid memory means that the index of the keys is stored in operational memory, but the records themselves are stored on SSD disks. Aerospike has a mode of strong consistency within its cluster, and it also has cross data center replication. So far, it looks like a good fit.

The data model of Aerospike is designed around the concept of a record. You might think of a record as a row in a classical relational database. A record has a key and metadata, but the values themselves are stored inside bins. You might think of a bin as a column in a relational database, with the only difference that Aerospike does not require you to have the same set of bins for all records; each record might have its own unique set of bins. Aerospike also gives you an API to iterate over all the bins within a record.

We also used two more features of Aerospike. The first is user-defined functions. These are Lua operations that are executed atomically per record, and they helped us maintain the atomic counter. If you don't use a user-defined function, it means that within a replica you need some locking mechanism, such as optimistic or pessimistic locks, to maintain this counter. But it's easier to do it on the server side.

The second good feature of Aerospike is that cross data center replication can be configured so that bins are replicated in isolation. If you have two bins, bin one and bin two, in replica A, and bin two and bin three in replica B, the result of cross data center replication will be the union of all these bins, which is a very good fit for our design. If you wanted to repeat this in MongoDB, for example, and your structure is something like a JSON document, it would not be so easy, because you would actually need to build a new JSON that is the union of all the keys inside it. So this helped us maintain the causality map with the merge operation I talked about before.

Now I would like to walk through one example of how it works. I will again use the flight check-in system, where you want to choose your seat. Once online registration opens, you go to the browser and choose a seat, 12F. Since there is no initial state, we execute a put operation containing 12F and an empty causality vector. We create a new bin in replica A, A1, which contains 12F from nothing. We successfully performed the operation on the database, but your browser is hanging; something goes wrong. You don't want to lose the opportunity to pick the right seat, so you decide to go with the mobile app, expecting that it works better. The mobile app connects to replica B. There is no initial state in replica B, so you don't see any seat in your account, and you pick another seat, 10D. The mobile app executes a put operation with value 10D and an empty causality vector. We create bin B1 in replica B, which contains value 10D from nothing.

At this moment, the two replicas each have a single value, but they don't know about each other. This is exactly eventual consistency. Then cross data center replication happens: bin A1 is replicated to replica B, and bin B1 is replicated to replica A. Now each replica has the same set of bins, but neither of them is causal to the other; they are parallel to each other, and there is nothing we can do about it. Then your browser comes back and says, "Hey, you successfully chose your seat, 12F." It also gives back the causality information, because at that moment there was only one bin, A1, so your causality vector will be A1. You are surprised: "I just picked seat 10D, so I want to at least verify that the system works fine," and you decide to pick a new seat, 10F, expecting that it will override everything else. This situation actually happened to me; that is how I came up with this example. In fact, most of the time for this presentation was spent on finding a good example to show this.

This 10F put operation carries the causality context A1. And this is the moment when our CRDT magic happens. We create a new bin, A2, which has the value 10F from A1. The database sees that this causality vector, A1, covers bin A1, which means bin A1 is causal to A2, so we can drop bin A1 because it is overridden in our logic. Our client knew about bin A1: I knew there was this data value and I wanted to override it, so drop the previous version and add a new one. You still have bin B1, which is parallel to A2, but A1 is causal to A2, and that's why we drop bin A1. Okay, you decide that you are fine. Then, before your flight, you go to the airline representative and ask him to change your class to business. He reads your data from replica A, and at this moment your causality vector will be {A2, B1}, because there are bins A2 and B1.

Okay, this part I will skip for now. He decides to give you another seat, 5C, which is business class. So he performs a put operation with causality vector {A2, B1}, and it is persisted to replica A. We create bin A3, because this is the third operation in replica A, and it carries the causality vector {A2, B1}. The CRDT magic happens again: we know that the vector {A2, B1} covers B1 and covers A2, so we can drop those bins, and we have only one single value, 5C with {A2, B1}. This bin has overridden both of them. But in replica B at this moment, we still have two bins, A1 and B1. Then replication happens again, and bin A3 is replicated to replica B with causality vector {A2, B1}. Even without having A2 in replica B at all, we are able to make a simple vector comparison: {A2, B1} covers both B1 and A1. That means A3 definitely overrides these two, and we drop them.

Eventually, you see that our state is convergent. If you permute the operations, it will be convergent as well. If you need one more cluster in your deployment schema, you just create a new dimension in your vector, like C, D, E, and so on, and you say that its default is 0. That actually allows you to include new replicas in your deployment schema, and to exclude them too, because you can compare against empty values. So this is the example. I hope it is understandable.
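A hedged sketch of the pruning rule used in this walkthrough: a stored bin is dropped when the incoming causality vector covers its dot, otherwise it stays as a sibling. The helper names are mine, and Dot is the hypothetical case class from the earlier sketch, repeated here so the snippet stands alone.

    // Dot repeated from the earlier sketch so this snippet stands alone.
    final case class Dot(replicaId: String, counter: Long)

    object CausalMerge {
      // A dot is covered by a causality vector if the vector has seen at least
      // that many writes from that replica; unknown replicas default to 0, which
      // is also what lets new clusters join (or leave) the deployment schema.
      def covers(cv: Map[String, Long], dot: Dot): Boolean =
        cv.getOrElse(dot.replicaId, 0L) >= dot.counter

      // Applying a write: drop every stored version the writer had already
      // observed, then add the new value under its freshly assigned dot.
      // Versions not covered by the vector are concurrent and stay as siblings.
      def applyPut[A](stored: Map[Dot, A],
                      observed: Map[String, Long],
                      newDot: Dot,
                      value: A): Map[Dot, A] =
        stored.filterNot { case (dot, _) => covers(observed, dot) } + (newDot -> value)
    }

    // From the walkthrough: replica B holds {A1 -> 12F, B1 -> 10D}; A3 arrives
    // with observed vector Map("A" -> 2L, "B" -> 1L), which covers both A1 and
    // B1, so only A3 -> 5C survives.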

Learnings

Now I would like to share our learnings. I started working on this project in 2016, and a little more than a year ago we went to production; it has now been running for about a year. What I learned from this is that CRDTs are real. They are feasible. This is not only an academic paper. They definitely require you to rethink how you approach concurrency, but they allow you to achieve a convergent, predictable state of your data without strong synchronization across clusters.

The second learning is about how you approach concurrency when there is no room for a concurrent modification exception or any flow-control mechanism that allows you to say, "Hey, stop. This put operation cannot be executed." You don't have that capability. You need to educate yourself about the right trade-off between consistency and correctness, because even in my example, some of the values were correct at some moment in time, and your service might reply back saying, "We are fine. We set your seat to 10A," but eventually, for consistency purposes, you might discard that data. You also have to engage your product owners to build the right service for your particular domain.

The third learning is to never underestimate concurrent data access. When we started this project, we of course did some assessments of the expected rate of concurrent data operations we would face. Once we went live, we recognized that this rate was much higher. The problem was that the transition to a distributed deployment pattern impacted many other teams, and some of them might misunderstand how downstream dependencies work or how they handle traffic. For example, we had a situation where a message was dequeued twice, once in each data center.

What I am trying to say is that we usually try not to over-engineer solutions, but at the same time we design them with a fair-weather assumption: that everything works fine, the database always responds, and messages are always dequeued only once. That is not always true. If your solution is durable enough to handle such cases, you will actually be able to help your partners trace their own problems, and on the whole it becomes a very collaborative, positive effort. So that was my learning.

Caveat #1: CV Propagation

Now I would like to share some caveats about this design. The first one is causality vector propagation. We need to pass the causality vector back during the read operation and receive it with the write operation. The question is, how far should we propagate this causality vector? And the answer is: up to the moment of making a decision, because concurrency happens not only across threads; concurrency happens in human behavior too. For example, you want to book a hotel and you see that the hotel is still available. By the time you ask your wife, "Is this hotel okay?", it's no longer available. That is concurrency. Sometimes you have a customer-facing application that shows data to customers or users, and you don't know how long they will keep that browser session open and when they will change something. So you need to propagate the causality vector as far as you can. If you have some batch decision engine, that is easy; for customer-facing components, you need to cache it at least at the browser session level.

What I'm trying to say is, you might be thinking, "Hey, let's cache it somewhere in the middle." In that case, you will not handle your entire concurrency; you will handle only the portion of concurrency that is related to that particular flow.

Caveat #2: Siblings Explosion

The second caveat is sibling explosion. You might notice that if I do not provide a causality vector for my put operations, every value goes in as a new one and this map grows, which is dangerous for performance. This situation is definitely not good, and if you face it in your scenarios, you may need to rethink what is actually concurrent for you, because if you have something like a hard override, that is not a concurrent operation: you just hard override, and you don't care what the previous state was. But to save ourselves from a snowball of these problems, we introduced a constraint limiting the size of this map, and we prune the oldest records if the size of the map goes beyond some number. Of course, this is a very rough discard of data, but it can save your performance eventually.
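A rough sketch of that kind of guard rail; the function name and the limit are illustrative, not the production values, and keys are (replicaId, counter) pairs as in the dotted map sketched earlier:

    object SiblingGuard {
      // If writers keep omitting the causality vector, every put lands as a new
      // sibling and the version map grows without bound. Past a size limit we
      // prune the entries with the smallest counters -- a deliberately crude
      // discard of data, but it protects read and replication performance.
      def pruneSiblings[A](versions: Map[(String, Long), A], maxSiblings: Int): Map[(String, Long), A] =
        if (versions.size <= maxSiblings) versions
        else versions.toSeq
          .sortBy { case ((_, counter), _) => counter }
          .takeRight(maxSiblings)
          .toMap
    }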

Caveat #3: Wait, Siblings?

The third one is siblings themselves. This is an interesting thing: siblings in an eventually consistent system are a natural thing. When one replica is updated with one piece of data and another replica is updated with another, and those values are concurrent, you might actually see siblings even within one data center. The problem is how to handle them. Ideally, it should be a collaborative decision with the product side too, because the appearance of these siblings is probably a leak in some data access pattern, and it might require your attention. For example, a customer was not happy, tried the mobile app and the web application, and this created two isolated parallel records. But if you have a huge technical landscape, of course not all of your clients will be happy that suddenly all the values become lists. So what should we do with that?

What we did was introduce another, much simpler CRDT on top of this list, something like a monotonic type reconciliation: we have a strict gradation of which values we prefer. All these lists are collapsed in our platform, and clients still see only scalar values.
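A minimal sketch of such a monotonic reconciliation, assuming a compliance-style status with a fixed preference order; the statuses and their ranking are invented for illustration:

    object StatusReconciliation {
      // Collapse siblings into one scalar value using a fixed preference order.
      // Picking the highest-ranked status is commutative, associative, and
      // idempotent, so layering it on top of the sibling list is itself a CRDT.
      val statusRank: Map[String, Int] = Map(
        "INITIAL"   -> 0,
        "PENDING"   -> 1,
        "COMPLETED" -> 2,
        "FAILED"    -> 3   // assumed here to take precedence over COMPLETED
      )

      def reconcile(siblings: List[String]): Option[String] =
        siblings.sortBy(s => statusRank.getOrElse(s, -1)).lastOption
    }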

Okay, thank you. That is all I wanted to say. It seems like I'm a little early; I'm sorry. Thank you all for listening, and I will be happy to answer your questions.

Questions and Answers

Woman: Thank you. If you have questions, if you could please line up by that wall or the other wall, and then we'll just pass along the mic. Thank you. Perfect.

Man 1: So can you go back two slides?

Martyanov: Which one?

Man 1: Next one. That one. I guess, what I don't quite understand is, if A1 is 10A and B1 is 10A, both have no causality. How is the conflict resolved?

Martyanov: If they are not causal to each other, this is exactly the sibling situation, right? When you have conflicting values - this is the one. These are the siblings. These two values, B1 and A2, are not causal; neither is causal to the other, and we keep them both. And this is exactly what I am trying to say: you can introduce one more CRDT on top of what you have, like last-write-wins, for example. But as I said, this is a compromise to satisfy your clients, who want your API to look like a scalar value, a scalar value in their understanding. But if your flows are designed in a way that produces siblings, you have to address it somehow. You have to address it.

Man 1: The third one in the middle, at the bottom. If B1 was 10F, with no causality.

Martyanov: No. B1 is 10D.

Man 1: Yes, if it was 10F and you got to the third step where B1 is 10F, and A2 is 10F, how do we resolve that they're both the same value, right? I guess, what I don't quite get is you started by saying that this is a way to beat CAP.

Martyanov: Right.

Man 1: Right? But then you've also said, "A and B are not consistent."

Martyanov: I'm saying A and B are parallel to each other, right? If you think about these two bins, they are parallel to each other.

Man 1: But they're never the same value. Oh sorry. There is a delay between A and B being the same values?

Martyanov: I'm sorry. I didn't get it. Maybe we can discuss it afterwards.

Man 2: For those of us who are less inclined to implement this from scratch, are there any libraries or things that you know of that have implemented this for just kind of [crosstalk 00:39:18]?

Martyanov: Okay, this is a good question, actually. You can look at Riak; the Riak database has open sourced part of its code as a library supporting CRDTs. You can also search for CRDT on GitHub and you will see a couple of implementations. For example, this exact implementation was inspired by the dotted version vector implementation for Erlang, and you might use it in some way. But I would say that our design is a little bit more sophisticated [inaudible 00:39:56], for example, Aerospike capabilities. So for your situation, it might require some more workarounds and some downstream layers. You can also look at Akka Distributed Data. They use basically the same approach, but they also have the constraints of how an Akka cluster actually works.

Man 3: Thanks for your talk. Did you keep causality vectors for multiple dimensions, or for the entire record that you were keeping?

Martyanov: This is a good question. If you think about the fact that we have multiple values, you need to define what atomicity you want to achieve in a key-value data store. This example is about exactly the same value in different data stores. It doesn't cover the case where I have two different values maintained by two different microservices that need to be consistent with each other. This talk is not about that; transactionality across different keys is a separate thing, how you [inaudible 00:41:09].

But if you are in a document-oriented world, and I know that some companies do document-oriented persistence, where you just have an entire snapshot of what's going on in your system, then that is exactly one value.

Man 3: So just a follow-up. What I'm asking is: if your record has an enumerated status value, you can keep a causality vector for the status changes across different data centers. But what if you also want to keep track of a date, like a date and a status, two different columns?

Martyanov: Yes, you're right. When you think about this 10D, for example, this is just a scalar value. In our case, this is actually a JSON object itself, and it has, for example, the actor, the status itself, the date of modification, and a lot of other business-related information. And this reconciliation, for example between 10D and 10F, seems unreasonable for this example. But if you think that this is not just 10D, this is 10D modified from a mobile session and so on and so forth, you can say, for example, "A mobile session always has higher priority than a web browser session," and you can resolve it that way.

So yes, there are also timestamps, and you can operate with them. I only want to say that if you use timestamps to resolve these conflicts, you might face a situation where, for example, your data center works in isolation and a lot of things happen outside it. Once the connection is established again, the last write will simply win, but who knows which of these data centers is, how to say, the correct one, right?

Also, a timestamp is a very tricky thing, because, for example, I face things like clock skew problems weekly. And if you work in large enterprises, again, this is a philosophical question. A lot of people reason as if time is always correct. It's not true. If you have a few hundred thousand different machines, time is not always the same. That's why a timestamp is a very flaky foundation for making decisions.

Man 4: Great job, Dmitry. One question, maybe somewhat naive, and you might have just answered it in your previous answer. In this example, what happens if that passenger never talks to the gate agent? They've just done the check-in through the mobile app and through the browser, and then they show up to board their flight. Are they going to get 10D, or are they going to get …

Martyanov: Yes, this is a great question. As I said, your clients want to have one scalar value, and for your business purposes, you want to resolve it. At the same time, you might have business logic where the airline itself calls the customer, saying, "Hey, we see that your check-in is somehow malformed. We want to get it resolved." That would be a customer care action.

But at the same time, you might say, "Okay, last wins, so this is the last one. 10F is the last one, the winner." So this is another, orthogonal dimension of the decision. This is a product call. We are infrastructure people, a platform team; we want our values to evolve in a convergent way. Actually, if you think about a stack of multiple microservices, if in some microservice you get a concurrent modification exception, you actually need to propagate this problem to everybody else, and they need to handle it somehow. So it only depends on when you resolve it, because even in a synchronized system, while one of your microservices, the one in charge of seat selection, raises a concurrent modification exception, another one has already sent an e-mail saying, "You successfully …" and so on. So it is only a question of the point in time at which you resolve this problem.

Man 4: So basically, the idea is that the system will provide you with all the possible variants, and then it's up to your business logic to decide when and how to go about reconciling.

Martyanov: All the variants which are concurrent to each other, right. Exactly.

Man 5: Hi. This is a great presentation, and I have a small doubt here. Here we are talking about two different people in two different countries selecting two different seats in this particular example. But let's say I have a person in North America selecting 12A and somebody in Africa selecting the same 12A, and both of them are on different continents and both of them will have that transaction done. And when the replication happens, even if you merge it, that'll be …

Martyanov: You're right. And the same happens when you, for example, order something from your favorite marketplace and they call you up afterwards and say, "Oh, we are out of stock," right? What I am trying to say is that concurrent access to a data point is not a trivial problem. And you're right: one person in replica A might book the same seat that is booked in replica B. But you will at least have a set of deterministic rules that can tell you why it happened, roughly speaking. It is not simply because of replication. So you might work around it later, after replication happens, and check, "Do I have any concurrent situation?" or something like that.

Because otherwise, for example, when replication happens here, it is just one value overriding another. This is what happens in a classical database if you use GoldenGate or any other replication solution: one of these values will override the other, depending on some timestamp value. It is the same problem.

Man 5: No, I think since it's an airline, it might be fine; we can resolve it later. Let's take the concept of a prepaid wallet. I have $100 in my wallet and I make two transactions from two different countries on two different websites. In that case, how would you justify this?

Martyanov: Yes, I understand. This is another problem, and it is not solved here, because these are convergent CRDTs. Actually, what you're talking about is a better fit for commutative CRDTs, where you have some evolution of a value that is commutative and associative. But this is also one of the problems that we haven't … This design doesn't touch that. I'm sorry.

Man 5: Do you want to talk about how it was solved? I also ended up in similar problems there.

Martyanov: I would say that you can maintain these counters separately in each replica. For example, you will have your evolution of subtract and add balance operations in replica A, and subtract and add balance operations in replica B. And of course, you will have some deviation and the possibility that you will go beyond your limit, for example. As of today, I don't see any way to prevent that with 100% confidence.

Man 6: Hey. So I guess in my organization, we've worked a little bit with Riak. And we ran into this problem, which I guess I don't know if you really … you kind of skimmed over it. But the operations need to be idempotent. So a lot of operations are not idempotent, like the wallet example or incrementing a counter or something like that. Is that just … do you say, "Okay. The infrastructure is there. That's a problem for the engineers and the product team"?

Martyanov: This is actually the same question that was asked before, because this reconciliation has to be idempotent, this one here. Because if you have, for example, 10D, another 10D, another 10D, and 10F, you have to resolve it in a way where all the 10Ds collapse together. This is exactly your idempotence: no matter how many times.

First of all, this is a convergent CRDT, where you deal with state and not with transitions. And for state, idempotence is not so difficult, I would say, because, for example, if you want to say that my status is completed, it doesn't matter how many times you say your status is completed; it will be completed. If you have parallel values, your status is completed and your status is failed, and you say, for example, that failed has a higher priority than completed, then it also doesn't matter how many times failed and completed occurred in your logic.

So for us, idempotence was not a big deal. But I understand that for commutative CRDTs, when you operate with transactions, for example… exactly, right? This was the slide that says it: for commutative CRDTs, idempotence is not required, but exactly-once delivery is required. So this is another approach.

We can talk a little bit about exactly-once delivery. The problem is that you need to define the scope in which exactly-once delivery is guaranteed. It might be your runtime, it might be your cluster, it might be something like until the restart of your system, because you can use a GUID for operations, for example. And if you see an operation with a GUID you have already processed, it means it is not a new operation, and you just note that it was delivered already.
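For instance, a minimal sketch of that GUID-based deduplication in Scala (the names are invented); the scope and lifetime of the seen set is exactly the trade-off being described:

    // De-duplicating operations by GUID turns at-least-once delivery into
    // effectively-once processing -- but only within the lifetime of `seen`
    // (this runtime, this cluster, until a restart, and so on).
    import scala.collection.mutable

    final class OperationDedup {
      private val seen = mutable.Set.empty[String]

      // Runs the operation only the first time this GUID is observed.
      def applyOnce(guid: String)(op: => Unit): Boolean =
        if (seen.add(guid)) { op; true } else false
    }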



Recorded at:

Dec 19, 2018
