Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Panel: The Promises and Perils of Eschewing Distributed Coordination

Panel: The Promises and Perils of Eschewing Distributed Coordination



The panelists talk about the promises and perils of eschewing coordination in distributed systems. They cover a diverse range of opinions and use-cases (control planes, streaming engines, SQL databases, service discovery systems, etc.).


Cindy Sridharan is a distributed systems engineer. Colm MacCárthaigh is an engineer at Amazon Web Services. Chenggang Wu is a Ph.D. student at UC Berkeley. Armon Dadgar is currently the CTO of HashiCorp. Peter Mattis is the co-founder of Cockroach Labs. Sean T. Allen is vice president of engineering at Wallaroo Labs and a member of the Pony core team.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Sridharan: My name is Cindy, and I'm going to be moderating this panel. What was it even called? "The Promises and Perils of Eschewing Distributed Coordination."

The Difference between Paxos and Raft

Let's start with Paxos, because someone was just talking about Paxos in this previous talk. When you think of distributed consensus, you think of Paxos. At least maybe a decade ago, the reason people used to try to avoid it or not do it was because Paxos was so hard to implement. I know that there are at least two people here, Peter [Mattis] and Armon [Dadgar], who have actually implemented Raft in their real products. Now that they've actually really implemented Raft, there really isn't much of a difference between Paxos and Raft. Could you elaborate a little more on that? Would you still say that's true?

Mattis: Yes, I would say that's true. They're fundamentally distributed consensus algorithms, and if you squint, they look very similar. A PhD thesis recently came out from Heidi Howard, and she talks about these distributed consensus algorithms, and the alternatives, and the variance of Raft and Paxos, and they´re all on this spectrum of distributed consensus. In my experience, Paxos was fundamentally avoided because it seemed hard to implement, because the descriptions of it were opaque, frankly.

Raft, the description of it was much more accessible, so people glommed on to the accessibility. When you get down to the implementation, they really had the same problems underneath. From my experience with five years with Raft, it probably would have been just as easy or difficult to use Paxos. I don't know if you have had any different experience. We've been using Raft for five years now, we're starting to think about extensions and whatnot. Some of these things have already been explored in the realm of the variance of Paxos and trying to figure out how they apply to Raft is something that we're actively exploring.

Dadgar: Yes, I think I'll echo you. There's no fundamental difference with Raft. I think it was just the explainability, the accessibility, that made it possible. I think one of the challenges though, is there are so many degrees of freedom. When you think about building a consensus-driven system of, "Do you have a long-lived leader? Do you do a per-operation? Do you have multiple quorums, single quorum?", there are so many of these choices, and I think what Raft did well that made it simple is, it was opinionated. It was the Rails of consensus.

There are no choices. This is just how it works, this is just how it is, not good or bad. There are tradeoffs with all those things. I think you guys have talked a lot about the nuances in there and some of the cool things you're doing with Cockroach to eke out more performance. I think it's not necessarily good or bad. It's just one set of choices you could make. Otherwise, when you approach Paxos in a very unopinionated way, it can be just, "Ok, there are so many tradeoffs here. How do you get to a workable system?"

Scaling Infinitely

Sridharan: One of the reasons people try to avoid distributed consensus is to, quote-unquote, "scale infinitely." I know that you, Sean [Allen], have mentioned that in order to scale infinitely you need to avoid coordination. The way to avoid coordination is to not do distributed transactions. You've said that. How many people really do even have to scale infinitely? Is that even really a thing? The second question is, "What are the use cases where that's even remotely required?"

Allen: I think the scale infinitely is more of a theoretical thing. I don't know of anybody who, at this point in time, has to scale infinitely. At the same time, an awful lot of my day-to-day work is working with people who are unable to build systems that they want to because it's too expensive, in terms of the amount of machines that they need in order to accomplish things, and lowering the overhead of how they design their systems.

Avoiding consensus where possible, etc., can result in applications being built that otherwise wouldn't be built. Talking about things in terms of, "Yes, if you avoid coordination, you could, in theory, scale infinitely," it's nice, and it makes for a great Papers We Love talk, and everything like that. But I think the real-world use case for that is that if you can avoid overhead for things like distributed consensus or whatnot, whole new applications become possible within the constraints that somebody might have for hardware budgets, etc.

AWS Systems Avoiding Global Coordination

Sridharan: I want to ask you, Colm [MacCárthaigh] - because you work at Amazon, and Amazon put out this amazing paper last year about how to build cloud native databases that actually scale. One of the things at the very heart of the paper was that in order to really scale, you need to be avoiding distributed consensus. The way I look at consensus or distributed coordination is I see it as an implementation detail. What the user or the end user really cares about is they just want the behavior. They just want the guarantees that distributed transactions provide. They're not necessarily concerned about the implementation details or what really happens within the systems.

Aurora was a very interesting paper, because it took a completely different view to actually providing somewhat the same guarantees, but without actually using coordination. Not only with Aurora; I also believe - I could be wrong - that a lot of other AWS systems tend to avoid global coordination. Could you talk a little bit about that? What's the philosophy behind that thinking?

MacCárthaigh: Sure, I'll try to. From the very beginning at AWS, we were building for internet scale. AWS came out of and had to support as an early customer, which is audacious and ambitious. They're a pretty tough customer, as you can imagine, one of the busiest websites on Earth. At internet scale, it's almost all uncoordinated. If you think about CDNs, they're just distributed caches, and everything's eventually consistent, and that's handling the vast majority of things.

Then people are building patterns on top of that, where they end up having their own versioning built in to try to get around eventual consistency. They're trying to come up with new lockless patterns that can work in these environments, and reinventing CRDTs in different contexts, and all this stuff. It's just so familiar to us, and it's been second nature. Where we have built Paxos systems, generally, it's been not directly for coordinating strong consistency on data itself. It's generally been for leader election and then leaving it to single leaders. It's worked incredibly well for us, it builds big, reliable systems. We just haven't really found much pressure to get extremely tight coordination, extremely granular cases. We just don't have many use cases.

Service Discovery Coordination and the Pros & Cons of Building Control Planes

Sridharan: Armon [Dadgar], one of the things you mentioned in a previous talk was the fact that the difference between eventual consistency and strong consistency, at least in systems like Consul, isn't so much about correctness, but really about the operator experience; that it's much easier to reason about fully consistent systems as it is to reason about eventually consistent systems. When I talk about Consul, the number one use case I think of is service discovery. There's been this recent trend in this industry where Matt Klein, who wrote Envoy, has been talking a lot about how service discovery is really an eventually consistent problem. Systems like ZooKeeper and Consul take their own approach to service discovery. Does service discovery really require strong coordination? And what are the pros and cons of building control planes, specifically, with relaxed guarantees around agreement?

Dadgar: A super interesting thing you pulled out is user accessibility, user experience question. I think a lot of the tradeoffs that we'll talk about internally of strong versus eventual consistency is about user experience. In fact, we spend a lot of time with Amazon's API because of Terraform. A common issue we get all the time with Terraform is, "Create an instance, or create whatever, and then describe it." It's, "That thing doesn't exist."

I think there are obvious reasons for that, in terms of scalability of the backend control plane and things like that. But there's a user experience aspect of that where it's like, "How does the thing I just got a response back saying it created not exist?" You look at a system like that and you say, "Totally makes sense from the design of, 'How do I scale this thing? How do I operate at cloud scale?'" But from a user impact, you get an untold number of tickets where we have to explain to users, "Yes, just hit Terraform apply again, and it'll exist the next time you run."

With service discovery, there's a similar challenge there. In abstract, do you need strong consistency for service discovery? No. Going back to the question of large scale systems, probably no system at a higher scale than probably DNS, the ultimate distributed eventually consistent system. God knows how many trillion requests it does per second. I think the answer is no, but when you really try and understand it, it's almost a meme. You're debugging any issue, and at the end of the day, it's always a DNS issue.

These systems are hard to debug, they're hard to understand the behaviors. It's hard to reason about, because the state doesn't exist in one place, and state's not consistent. When I think about service discovery, Consul often gets lumped into the strongly consistent category, because it's a system running Raft at the center. That said, the overall architecture, I think if you zoom out, isn't necessarily strongly consistent. You have agents of the edges caching. You have DNS caches.

If you consider the system as a whole, it's an eventually consistent system. The beating heart at the center database is strong. I think that goes back to the user experience question. When someone wants to understand, 'What is the state of the system?", they get a consistent answer. I think that's really why that design goal exists. I think an important tradeoff between eventual and strong consistency is human operators who have to understand and debug a system, not necessarily, technically, whether or not you need it.

Sridharan: Colm [MacCárthaigh], would you agree?

MacCárthaigh: Yes, mostly. When we started out, we were big promoters of eventual consistency. S3 is eventually consistent data store, and it was our flagship data store. If you look at S3's history, we gradually added stronger read-after-write consistency and then strong consistency. A lot of that was driven by usability. Having idempotent request patterns and having people being able to repeat things is fine when you've got people who totally understand distributed systems top to bottom. It's particularly hard for 99% of developers, it's just incomprehensible. We do need all that consistency.

We built a DNS service at Amazon, Amazon Route 53. We felt like we'd learned that lesson, and so we made an API that could do strong consistency. You can actually make a change against Route 53, and poll, and treat it as a synchronous call, and wait for it to go in sync. Eight years later, nobody calls it. It is depressing, it's defying all expectation.

Coordination-Free Architecture

Sridharan: Chenggang [Wu], I'm going to pass the mic on to you. You are the author of this great paper which came out last year called the Anna KV Store. Anna, for those who aren't aware of it, is a key-value store, it's reported to be much faster than Redis. If that doesn't get your attention, I don't know what else would, because people care about benchmarks. That is a completely coordination-free architecture, where you have virtual actors that talk to each other, but completely avoid coordination. Can you talk a little bit about why that came to be? What's the main spurring behind that?

Wu: The initial observation is within a single machine. If multiple threads are accessing, reading, writing to a shared memory location, it introduces pretty expensive contention overhead, because you either have to use logs or use some more advanced atomic instruction to get the linearizability right. The problem is, as you go from a single machine deployment to a distributed setting, you have multiple machines. Then you have to use this distributed message passing to coordinate between different thread. It's tough to have different communication mechanisms.

Within a single node, you have different thread serializing the memory location using logging. In a distributed setting, you have message passing. It'd be nice to have a single execution model across different scales within a single machine, NUMA or geo-distributed deployment. Then if you have this distributed message passing mechanism, then the problem is different replicas make updates in different order. Then it becomes a problem of, "How do you maintain this correctness? How do you make sure that these updates eventually converge in the same space?"

Anna is completely coordination free. That, at the same time, poses some burden to the application developer, because now they have to program in a way that's monotonic, which means their application has to be resilient against message reordering and duplication stuff. That's how lattice comes into play, Anna uses this lattice composition. Lattice basically wraps your state and accepts updates in a way that is associative, commutative, and idempotent. That achieves eventual replica conversions, which then in turn, guarantees eventual consistency.

Lattice enforces the programmer an easy way to ensure that their program is monotonic. Lattice and CRDT have the same flavor, except that CRDT is also considered as a lattice, because it also has a merge function that accepts, tolerates, these out-of-order messages. The difference is lattices are typically simple, and small, and easy-to-understand. By carefully composing these simple lattices together, you're going to be able to build some pretty strong isolation models.

The second issue is that having eventual consistency is typically not sufficient, because eventual consistency is like a liveness property. It guarantees that something good will eventually happen in the future. The problem is, before that future, the client can observe anything. Out-of-order writes may happen, and some write could even get overwritten.

What we've been trying to do is, on top of this careful lattice composition, we also come up with some clever protocols to have some safety guarantee. For example, supporting stuff like repeatable read, atomic visibility, and causal consistency, and stuff like that for Anna. That enables the application that's built on top of Anna to more easily understand the application behavior.

Sridharan: Sean [Allen], doesn't this sound somewhat similar to Wallaroo and Pony's actor model, in general? I know that Pony, the language, and Wallaroo, which is the streaming engine built using Pony, uses - I'm not super familiar with it - uses coordination-free actors. Could you just describe a little bit about how that really works? What are the tradeoffs made that are inherent in that architecture?

Allen: First of all, we loved the Anna paper when it came out, because we're, "Look, there's more people having fun with this." We have a different set of constraints than what Anna was going for. In Pony and the stuff we built on top of it, we really went with a single writer solution for it. Then coming up with a coordination-free way across a cluster of different processes that are working with one another to route data to wherever the specific single writer is that it needs to go, because one of the things that we needed was a programming model where there wouldn't be out-of-order writes. This is for the people who are our target audience that we're trying to sell to.

Out-of-order writes would be considered violating the laws of physics, or otherwise get us kicked out of our contracts, etc. Primarily, it comes down to, "This is Pony as an actor model language." You don't have any locks. The only shared data structures really are the queues that exist for any actor, where there's a small little bit of atomic operations around it so that you can have something popping off the head and then adding stuff to the back side of these queues at the same time.

When we built Wallaroo, we built it over the top of that, because there is no distributed Pony the way there is a distributed Erlang right now. We added routing to that. There's a consistent hashing algorithm you can use. Each machine is capable of determining, on its own, where the cluster changes in size, where data is supposed to live, which was a move that we did over the course of time to having less coordination., as very early on, any time there was a change in the size of the cluster, everybody would have to come to a consensus as to where data was going to live in the future. That actually made things really complicated, and so we moved to one where it's like, "No, each node on its own can know who's going to own this, based on some consistent hashing." There are a lot of similarities to Anna, but then a lot of very wide differences.

Asynchronous Consensus

Sridharan: One of the reasons people tend to say distributed coordination is slow is because it's inherently a very synchronous process. If you want quorum writes, a certain number of replicas need to agree on that. This isn't, strictly speaking true, because I know CockroachDB does asynchronous replication, asynchronous consensus. There are other systems that do other flavors of that. Are these designed to get around some of the problems with distributed consensus, or do they offer other optimizations? How do they look? Mattis: I'm not quite sure what you're referring to as asynchronous writes in CockroachDB. When a user writes into a data table, that is synchronously replicated. It's distributed consensus, it's quorum. You say you want three replicas, and two have to be written by the time it returns. Interestingly, we do have eventual consistency in certain parts within the database. These are notoriously tricky parts within the database to get right. I'll give you one example of that, which is the schema of the database. That is asynchronously distributed, it's eventually consistent.

We followed this paper that came out at Google called F1 and how this is done. We have to do this for performance reasons. You couldn't, on every single user operation, do a synchronous lookup for what the schema for the table is. That'd be way too expensive. We have to have that cache on every node of the system, and have it updated in some way that they can play together and not be munging the user data underneath, but do this really fast. This is basically just an eventually consistent distribution of the system. We allow two versions of the schema at one time, and we march through in a particular way.

It's notoriously difficult to get this right, and super complex. I think the right thing to be doing this is completely hidden internally. You, as the user, never see access to that. I think Armon [Dadgar] and the others have touched on this; having the user visible model presents strong consistency even if there's weaker consistency taking place inside, I feel, it’s extremely important.

Dadgar: I'll just add quickly. Often times, you can have a synchronous top-level API, but the internal implementation is async. I'm not sure about the Cockroach Raft implementation, but knowing ours, the API call looks synchronous from the developers. Internally, once it taps into the Raft library, there are many different asynchronous operations in flight that are being batched. There's streaming and pipelining, and batching taking place.

The internal library is treating it as a bunch of different asynchronous operations that they need to commit, and there's some order at the end of the day, but the user experience is one of a synchronous transaction call. I think they're not necessarily one or the other, you can present a synchronous API, but have it be implemented in an async way, or present an asynchronous API but have it implemented in a synchronous way.

Interesting Consistency Bugs

Sridharan: The next question I want ask. What's the most interesting consistency bug that you've hit? Let's get into story time. It can be either thing, eventual consistency, full consistency, I don't care. What's the most interesting bug that you've had to fix, and how did that deepen your appreciation for the subject?

MacCárthaigh: Do I get to claim that CPUs are distributed consensus and that coherent caches count? Couldn't you say Spectre and Meltdown? I think, fundamentally, that is what's going on, and has been where I've spent the most of my time.

Dadgar: You can bill Intel for those hours, I'm sure.

Mattis: I'm not sure that this is the most interesting consistency bug I've ever encountered. It's just the one that came to mind when you mentioned this. One of the other places we have eventual consistency within CockroachDB is this gossip protocol that's used to exchange information. One of the bits of information they exchange is their node name. Everybody's gossiping, "I'm node 1," "I'm node 2," "I'm node 3," and then elsewhere in the system, "I need to get a piece of data replicated on nodes 1, 2, and 3," and you look at this map, this eventually consistent map.

There's a protocol to do this gossip provocation. It's one of the very first things. I think it's the first line of code written inside CockroachDB, about five years ago. It's old, it's crusty, but it's been working. I think it was just last year that we found a case where one of the things we'd assumed at some point - this was a stupid assumption in hindsight - was the name for a node would never be reused by another name. Clearly, people on this stage were laughing. I hate to name him, but it was made by our CEO at some point, brilliant guy, easy mistake to make.

What would happen is that there's this thing introduced in the system saying, "If someone else starts gossiping a certain piece of information, then we're going to delete it from our gossip map." The result of the bug was that under Kubernetes, which assigns you your DNS name, when you take down the system and restart it, it'd very frequently come up that two nodes would think that they have the same name for a short period of time. Then the system would delete this address from this internal gossip system, and nobody would be able to talk to this node, but it would be able to talk to everybody else.

The end result was that this looked like there was an asymmetric network partition. Nobody could talk to this one node, but he could talk to everybody else. That was a very confusing debugging session to figure out what was going on there.

Dadgar: I'll add to that. That's a very nasty problem, we ran into that ourselves too. We were, "Yes, this is a tricky one." At some point, you need somebody to eject one of the nodes in the case of a permanent name overlap, if it's temporary confusion. Related to that, our experience has been, people view Raft and Paxos as, "They're super complicated, very difficult to get right." What's nice about eventual consistent systems is they're easy. I think our experience has been the exact opposite.

We have both Serf as an eventually consistent peer-to-peer gossip and a Raft implementation. I'd say we've had 10 times as many bugs, and they're 10 times as difficult to debug in the eventually consistent system than the consistent system. The reason for that is when you talk about a Paxos type system, you have a few nodes, talking 3, 5, 10, maybe on the outside. You can easily understand the state of it. There's only so many ways a thing goes wrong with 10 nodes, where when you go to an eventually consistent system of 10,000 nodes interacting in an asynchronous way, there's no easy way to say, "At this point in time, hit pause. What's the state of this system?" and try to get the message delivery in that exact pathological case that it happened. It was a nightmare.

One nightmare issue was trying to debug this particular failure at a customer site where they had a power loss on one rack, and that would cause the system to get into a permanently oscillating state, where even though the rack had a power loss and then it came back on, it was forever stuck in this cycle of detecting all the nodes as failed and detecting all the nodes as rejoining, back and forth forever. It was a super subtle thing where it was like the power loss to that rack had just killed enough nodes, that basically the failure messages would totally fill up the buffers of the other nodes, and it was just enough nodes failing at the exact same moment that basically you'd get this constant, if you think like a wave pool, the failure wave would propagate to one side of the pool, and then right behind it is the alive wave. It'd bounce off the wall, and then the failure wave would go the other direction. It'd just bounce back and forth.

It was just enough nodes that the system could just keep perfectly oscillating. It was just the right size to fill the queue buffers. It was one of these things where unless you have 2,000 nodes, and you power off an entire rack, and then you plug it back in 30 seconds later, it's just an impossible thing to recreate, where you don't have that type problem in a consensus system.

Mattis: How did you find the bug?

Dadgar: Just reams of telemetry. It was just going through the telemetry and logs and trying to recreate the problem. Once we knew what it was, we were able to reproduce it by just spinning up 1,000 nodes in Amazon and turning on an IP filter rule all at one time that basically looked like a power loss to those 40 nodes. We could prove it, but it was one of these things where it was like three weeks of looking through logs to really understand what the hell was going on.

Allen: I think mine would be a little different. Years ago, when I was working at a company, and we'd been using MySQL with a bunch of different replicas and everything for it, it reached a point where that wasn't scaling, and we switched over to a proprietary SQL database that had a MySQL-like API so we could just switch it over to it. Then it handled all the replication and everything for it. Every now and then, we'd start to see really weird results in the database, where we're, "We have no Earthly idea where this is coming from." People would spend time every now and then going, "Why, every now and then, does a customer end up in this really weird state?"

People are looking around through our code base for probably 18 months, trying to figure out where these occasional things would come from. It'd be like, "I have some time. I'm going to look into this." Eventually, it came down to reading exactly what their implementation was, how things were serializable, linearizable, in terms of how it was doing replicas inside the database. It was possible, in certain scenarios, that we would hit every now and then, for you to end up with a state which in theory, you wouldn't actually be able to get, assuming you did have the affordances of, "This is supposed to work just like a single MySQL should." You're supposed to be immune to all of that.

That was the first time I really started thinking about the affordances that are provide; if you provide people and say, "This is going to be the case," that you can still end up with these surprising violations of that as a provider of these sorts of things, and the amount of time you can potentially go into trying to understand where these weird results are coming from.

Service Discovery

Participant 1: For service-to-service communication, there is RPC, and then there is a service discovery that is in [inaudible 00:29:52]. There are two ways to do service-to-service discovery. One is that your client can subscribe to ephemeral node of ZooKeeper or Consul. The service owner will tell where it is running, and you get the changes, you make the service call. The other way is that you do some kind of HAProxy, where you have targets, or you can do a [inaudible 00:30:19] feed where a service owner will register with HAProxy. Among these two approaches, both seem to work. Is there any one which is better than the other one? What are your thoughts on it?

Dadgar: I can take a stab. I'll rephrase it, it was kind of cloudy, I don't know if everyone caught that. There are two flavors of service discovery. One is directly querying the central system, like ZooKeeper or Consul. Another one is having an intermediate central load balancer like HAProxy. You update that, and people go talk to the load balancer. Both seem to work. What are the tradeoffs?

I think, in practice, you use both. To a point, the service discovery is a pre-flight. Before you talk to the actual service, you're pre-flighting to understand, "Who should I talk to?", and then there's the actual path. There are times when that pre-flight request might be too expensive, if you have some high frequency, low latency trade system, you might not want to do a service discovery inline all the time, where you're, "I do one request a day and I'm not latency sensitive," maybe you don't care.

Where you see those tradeoffs is in cases where you're much more sensitive to that pre-flight cost. You might have a hot path go through a load balancer, and the load balancer's being asynchronously updated as the backends change. Versus, if you're not latency sensitive, you might say, "It's fine. I'll just eat the cost and do the discovery inline." I think part of it's also volume of scale. It's cheaper to say, "I'm going to do a periodic reconfiguration of 50 load balancers," which will do 20 million requests a second, than try and have 20 million requests per second against the service discovery system. As you go through different scale tipping points, latency thresholds, what makes sense will change.

MacCárthaigh: We see both patterns at AWS, although 10 years ago, it was pretty much all load balancers. Now, you see HAProxys, and Envoys, and Istios, and all the meshes, and we have our own as well come about. I don't think the service discovery aspect really makes much difference in practice. Finding nodes and who to talk to is the relatively easy part of it. The decision is usually around, "Do I need to keep persistent connections open for a long time? Do I need authentication inbound?"

Sometimes, load balancing can be better, for surprising reasons. A central actor will usually make a better job of balancing load than a relatively hard-to-coordinate set of distributed actors. You can usually get tighter margins. That can matter for smaller deployments. If you only have one or two backends or targets, if you have a lot of distributed actors, it's a coin flip. You might just overload one versus the other, whereas a load balancer can really decide. It's just stuff like that that tends to decide it.

One of my funny stories that I probably should have mentioned, was a load balancer that was getting about two requests per day; what was happening is the customer had built their daily reporting mechanism to run on these web servers as a batch job, and then they were calling a script using the load balancer as a distribution mechanism. The first requests get to go to web server A and the second request gets to go to web server B. I remember getting a really painful customer ticket, where they had complained that one day we had sent both requests to the same target.

Accessible Testing Frameworks

Participant 2: Whenever you're working on a system where cap is in play, one of the challenges, I think, is writing automated tests which test your system, creating asymmetric network outages, etc., like inducing split-brain, see how valid implementation is, and things like that. Have you guys had experience using such testing frameworks which are accessible, easily usable by an average developer to write test scenarios for such things?

Sridharan: The verification of a distributed system. Who wants to take this?

Dadgar: Cockroach probably has a lot too on this.

Mattis: I can talk about what we do. I'm sorry to disappoint, we have nothing that's easy to use. We're not using anything off the shelf, we do things at various levels. We have unit tests, you can configure within a single unit test. We use Go, so it's all within the Go test infrastructure. We built another test infrastructure on top of that. You can start up a unit test which starts up a 10-node Cockroach cluster and control how the network connectivity is between these nodes in this cluster. They're all within the same process.

Then we have nightly tests, which actually spin up clusters on GCP and AWS, and some of those are also controlling the network connectivity. I can't remember the name. There is a tool we use, and it's escaping me right now, a third party library we use to do that control.

We also run Jepsen tests. Jepsen Kyle Kingsbury, he writes this awesome Call Me Maybe blog about breaking distributed systems. Part of this is he's gotten this library of test infrastructure. We run that nightly as well. He has a bunch of custom code inside Jepsen. I love his blogs, Jepsen is not the easiest thing to use. If you're trying to do it for your own system, I don't recommend it unless you're building a distributed database, or doing something like Consul or Etcd.

We also do a variety of manual testing as well. We can't get away from manual testing, it is there. It's like defense and depth at a bunch of different levels.

Dadgar: One thing I'd add is there was a really interesting paper that came out a year or two ago, talking about exactly how do you debug these complex systems and reproduce their problems. The really interesting, maybe counter-intuitive finding was that - and they surveyed a ton of different systems, like Hadoop, and a bunch of distributed databases, and things like that - that the vast majority of the bugs in those systems could be replicated on a single box.

The point of the paper was that you can actually find a lot of these edges, which are still mostly timing issues and asynchronous communication problems, that you can actually simulate them on box by doing inter-process IP filters, dropping packets, or doing simple profiling that would just, "Ok, drop every third packet," or things like that. I think that paper was really interesting, you can go crazy with distributed, chaos testing, all sorts of fault-injection type stuff. But the vast majority of your problems actually, with a reasonable test harness on one box, you actually can expose.

Some of the testing we do for a lot of our systems, obviously we'll do large soak tests on big clusters. A lot of the most effective stuff is we shim out the network layers and we have in-memory implementations that'll allow us to just do fault injection on those things. You can't receive, but you can send, you can send, but you can't receive. We drop packets at some rate. You can actually just run that all in memory. Basically, you just say, "I'm going to put three Raft actors and do consensus on one process," and can stress test the system by just creating a bunch of those fault injections. A lot of that stuff's pretty effective and accessible, you don't need an enormous cluster.

Allen: When we were building Wallaroo over the course of time, we'd gone through a whole bunch of different tools that we built ourselves to do it. Just starting, for stuff that we were just doing on developer laptops with being able to mess with the network, between just having a couple processes running on a machine. God, it's probably 90% of the value that we've ever gotten from tools like that. We've built more sophisticated tools over the course of time, but honestly, most of them were not worth the effort, probably.

As you try to build these tools that can allow you to do really complicated scenarios and tests like this, you often end up in a scenario where you're, "Is it actually a bug in my thing, or is there a bug in my test harness that was doing this?" In our case, it's even worse because we have a system where you can get different results based on the order that things occur, and that we have no control over the order that these things would happen across CPUs, etc. It becomes practically impossible for these.

In a lot of them, we're actually having conversation right now about, "Should we drop this?", because we spend more time chasing down what turns out to be bugs in the test harness, than actual bugs in the system under test itself. I think we had one bug that we found with the test harness in the last year that was in the system under test, and we fixed probably 25 bugs in the test harness itself.

Testing in Order to Deal with Weaker Consistency Models

Sridharan: Speaking of testing, it just makes me wonder. One of the reasons why people like strong consistency is for correctness. Do you think we could ever get to the point where our testing could be good enough that - we can still work with eventually consistent models, but in variance in the code and tests, whether it's unit tests, or soak tests, or testing your production, any manner of testing - that we can get so advanced that we can deal with weaker consistency models?

Mattis: I'd love to see that future, it's hard to say. There has to be some fundamental advances made there for us to see that future, but that isn't to say we can't get there. I've been in the industry for 25 years now, there have been some amazing fundamental advances. I have more power in my phone than I did on my desktop 25 years ago. I can believe it, I just don't see it as being a near term thing. I see some incremental advances coming.

Something I learned not too long ago, that maybe some of the other people on stage know about is, we hear about chaos testing, but you should go watch this. It's a video from Peter Alvaro talking about lineage-driven fault injection. It's a fun video, he drops a lot of F-bombs in it, but he talks about how we could do fault injection testing in a more principled way. I think that's a fundamental advance there. We just need several more of those to happen.

Advanced Testing and Strong Consistency

Sridharan: Do you think there's anything, Chenggang [Wu], in academia? Are there any forms of advanced testing, any research happening in that space, that can provide the same guarantees as consistency or strong consistency models?

Wu: I am aware that there's some work by, for example, Natasha Cruz. The underlying storage system offers some weak guarantees, so not serializability, linearizability, stuff like that. Because your application may have some restricted form of data access pattern, read-write pattern, and stuff like that, you may still get the illusion that your application satisfies the serializable requirement. If the application developer can’t come up with this set of invariances, then we're able to provide some proof that shows, "Your application is correct under this consistency model."

Testing in Larger Scale Distributed Systems

Sridharan: Colm [MacCárthaigh], in your talk, you talked a little bit about using formal methods at AWS to verify services. What kind of testing do the largest scale distributed systems undergo?

MacCárthaigh: We have similar unit testing frameworks; we do in-memory network simulation, and can simulate basic network faults that way. About two years ago, I was working on TLS; the next version of TLS, which became TLS 1.3, one of the design challenges we faced in TLS 1.3, we had no idea how to figure out how to do 0-RTT messages without also permitting replays. We finally fixed it, -ish, and got it all the way down to, "We now tolerate exactly one replay."

I actually plugged this in to a bunch of these test frameworks, and saw that actually a lot of things are not tolerant of replays. They're tolerant of re-transmits that might include incrementing a message number or something like that. But if you replay exactly the same message, it can cause havoc in a lot of these systems. We have stuff like that lying around that we can play with. We do a certain amount of formal verification, where we feel it's really merited. We have papers out there with TLA+, but we have other tools that we use as well.

I, myself, mostly use F* because I just find it easiest to use. That takes a fair amount of effort. Then we do a lot of real-world game day testing. Actually, before every single AWS region comes online, we take the opportunity to break it a lot, and see what we learn and every time we learn something. We'll keep doing stuff like that, there's no end to it.

Formal Verification

Participant 3: Colm [MacCárthaigh], you mentioned formal verification a few times. I was just wondering about anyone else's experience in formal verification, and how useful they found it?

Mattis: Come to my talk tomorrow, I'll mention it briefly. It's just a little bullet point on one of my slides. One of our engineers used TLA+, PlusCal, actually, which is the imperative version of TLA+, to prove our new transaction protocol. It gave us a nice bit of confidence, it's a nice pat on the back that you didn't completely muck something up.

Dadgar: What I'll add is I think a lot of these systems are not necessarily designed with usability in mind. From the person who brought you LaTeX is TLA+. Your mileage will vary, and it will all be bad. They're just not very usable systems, that's one side of it. The other part of it is it's like model-based testing, which what you're proving is that your TLA+ model is correct, not that your implementation is correct. I think it's the same problem you have with model-based testing.

You haven't necessarily proven that it's not just a problem with the model when you do model-based testing. I think some of the problem is that the prover will say, "Yes, this white, clean room implementation is good." There's always the translation into the real production version, which there's always the performance optimizations, and the shortcuts, and the things you have to do to actually make the system work in a production setting. Sometimes, those can introduce correctness bugs where, "This little cache we added here actually totally broke the correctness of this thing."

Allen: I do like, though, that if you have a formally verified TLA+ model or whatever, that it cuts down on your debugging space when you're trying to figure out where it's wrong. It's like, "Is our protocol just fundamentally flawed, or are we human and we messed up the implementation?" It's nice in those scenarios, the couple times when we've done the formal proofs, for it to go, "Ok, we messed up somewhere in the implementation. Let's go find it." That at least cuts the search space down by an awful lot.

Dadgar: It's like compiler bugs. "Is it my app launching or is it the compiler bug?" Ruling out the compiler is actually nice. Yes, I think that's the nice advantage, you don't worry that it's fundamentally flawed.


See more presentations with transcripts


Recorded at:

Aug 08, 2019