
The Anatomy of a Distributed System


Summary

Tyler McMullen talks through the components and design of a real system, as well as the theory behind them. The system is built to perform very high volumes of health checks, done across a cluster of machines for reliability and scalability. It includes things like gossip, failure detection, leader election, logical clocks, and consistency trade-offs.

Bio

Tyler McMullen is CTO at Fastly, where he’s responsible for the system architecture and leads the company’s technology vision. As part of the founding team, he built the first versions of Fastly’s Instant Purging system, API, and Real-time Analytics. Before Fastly, he worked on text analysis and recommendations at Scribd.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

All right, let's get started. So after this morning, we had a chance to see what it is like to develop a highly consistent banking system.

Next up, we have ultra-large scale, which I think is the best way to describe it. We have Tyler McMullen from Fastly, talking about how to build distributed systems. Whatever he says about distributed systems, you should take seriously. He gets spontaneous load tests from various state agencies, which of course we will not name. Without further ado, Tyler.

Tyler McMullen: Hello, wow, it is a lot of people here. Hmm. All right.

So, yep, as Chris was saying, I'm Tyler McMullen, you can find me on Twitter right here. If you disagree with the things that I say, you can yell at me, or you can yell at me regardless. I'm fine with that. And I am the CTO of Fastly, if you like the things that I'm going to be describing here, this is interesting to you, you should feel free to come up and talk to me afterward. We are always looking for various skilled distributed systems people.

Okay, what's up? I figured, you know, it is the first day of QCon SF, I thought we would start nice and slow, have a gentle, high-level, description of distributed systems. I'm just kidding, we're going to build a gnarly distributed system together.

So this talk is kind of my response to what I see as the two types of distributed systems talks, right? You usually end up having either talks about theory without implementation, which is useful in its own right, or implementation without theory: we tie this technology to that technology and it is useful. I try to bridge the gap between those two; this is my attempt to do that. We will have plenty of theory and we will talk about specific technologies.

The Project

We will talk about cluster membership, rendezvous hashing, and CRDTs. I have 45 minutes to do it, so let's get started. You can talk about theory or practice all you want, but if you don't talk about something that is real, it kind of loses a lot of meaning. And so, the project that we're going to talk about is kind of specific to Fastly, but a lot of the principles apply to other things.

So, if you are not familiar with what Fastly does, we're essentially, like, a middle man. We are a proxy, and that's one way of putting it. We sit in between your users and your origin servers. We try to do things at that middle layer that you cannot necessarily do at any individual place in the network.

So really, in the lowest-possible terms, we take a request from the end user on the left and send it through to the origin server on the right. That is easy enough. But it is not just one server on the right; many of our customers have whole clusters, or sometimes data centers. So we are talking about load balancing: an individual request coming from the user is routed to one origin server or another. Easy enough, right? This is where the project comes in. If you are doing load balancing, you need a way of detecting when a server has gone down, in some way or another. And so we need some way of making sure that this doesn't happen. And, yeah. I have a lot of animations in this talk.

We, of course, want to route around failed origin servers, so we need some idea of health checks. And the problem is gnarlier than that; we cannot health check everything. It is not just one data center that we're talking about, we have many data centers. There we go.

So we have many data centers here, and the trick with this is that the idea of health and health checking in a distributed system like this depends on your perspective within the network. So from each of these, from the perspective of each of the individual data centers, different origins may appear to be up or down, right?

The route that you take through the internet has a great effect on the apparent health, right? It is all about perspective and vantage point inside the network. So we are designing a system that does health checks, but it needs to operate on a per-data-center basis; that makes it easier, and then we are not sharing information between multiple different data centers, great. And it gets a little bit more complicated, even: each of these data centers is not a single server. Each of these data centers has many servers in it, between 4 and 64, or sometimes 168 servers, right? And the trouble is, you of course cannot do health checks from every single server. That is just called a DDoS.

Let's say that you have your health checks set up to run once every second; if every server is health checking every second, the origin gets thousands of requests per second. So what we want to do is to assume that all of the servers in one of our data centers have the same perspective on the network. And then we say, okay, only one of our servers within each data center should do health checks on a given origin. Now we have a distributed systems problem.

Because, of course, oh, right. Of course, sorry. One of the other troubles with this is that, due to the sheer asynchrony of the network, not every server in the data center has the same view of which origins exist or don't exist at any one time. If you say, hey, Fastly, I have a new origin server sitting here, we cannot put a lock on the system and say, hold on, everybody stop, we need to deploy a new origin now. That information arrives at different servers at different moments. So for individual minutes, sometimes seconds, individual servers have a different view of which origins exist and which do not. Now it is even more complicated. Okay.

So, essentially, we are talking about a need to distribute work. We need a way to decide on ownership: which one of our servers, which node, is going to take ownership of doing the health checks for each individual origin server. We will only have one inside each data center doing this. So we need to distribute this work across them, and then we need to share the information that they are learning from doing these health checks with all the other servers in the data center. Is this making sense? Okay. Good. Great.

But, of course, we have to deal with server failures on our side. If an individual server inside one of our data centers fails, we need to make sure that the work that that server was doing is taken over by someone else. We are talking about 50-plus data centers spread around the world, and tens of thousands of origin servers that we are worried about health checking.

So, to recap. The system that we're going to design is something that keeps track of origin health, per data center, and makes sure that, regardless of failures of our own servers, there is always someone doing health checks of your origin servers. The system needs to handle origin servers gracefully, be in sync if possible and be graceful when out of sync, and have only one server health checking a given origin within each data center.

One Possible Solution

So are you with me? Cool. We will talk about a possible solution to this. We could use a system like etcd, Consul, or ZooKeeper. I'm sure many of you have heard of the CAP theorem: if you have partitions in your network, which you do, you can only choose between consistency and availability. You can't have both.

And so, essentially, in the case of Consul or etcd or ZooKeeper, we are talking about atomic broadcast, and if there are partitions inside of the data center, that system may, in fact, stop. Which is great in some ways, right, because it means we can apply a total order to every event that happens in the system, which would be great. It would make the whole thing a lot easier. But my argument is that, in this case, and also because this would be a boring talk if I just used Consul, this is actually too strong a consistency guarantee for this application.

Eventual Consistency

In the presence of partitions, we are going to choose availability. Where Consul chooses consistency, our system chooses availability. We are talking about some form of eventual consistency. Great, and really the thing that we care about, the thing that we actually care about here, is this idea of forward progress. We want to make sure that, regardless of what happens inside our network and data center, each individual server can always make forward progress. Another way of looking at this is to say that, in the case of a system or network failure of some kind, I would prefer for us to do two health checks against your origin servers than to do zero health checks. Does that make sense?

So, obviously, you know, this is the idea of coordination avoidance. You need to communicate in some way, but you want to make sure that the communication or protocol will not block progress on any individual node.

All right. A great paper on this particular thing is Coordination Avoidance in Database Systems by the venerable Peter Bailis. If we cannot just wrap Paxos around the problem, what can we do, and what guarantees can we continue to make regardless of that? We cannot rely on ordering and atomicity to solve the problem, so we need to decide which invariants we need and which we don't.

Ownership

So we will talk about ownership. This is the idea that we need to spread work around and handle failures when they occur. So, how many people have heard of rendezvous hashing? Only a few, neat. I'm happy to help. Rendezvous hashing was invented at the University of Michigan in 1996. Like consistent hashing, which many of you may know about, it solves a similar problem, but in a different way and for different circumstances. What it basically looks like is this.

So it is actually really, really straightforward. H is just a hash function, simple enough. Any hash function, you probably want to choose a good one. S is one of the set of live servers in a particular data center. And O, in this case, is the thing that we are hashing, which here is an origin server that we are deciding the owner of. And what we get out of this is a weight or priority.

So, what this ends up looking like is that, if I pass an origin into this, and I iterate through this hash function for every server in one of our data centers, what I end up with is an ordered list, a prioritized list of which server should own a particular origin. Does this make sense? Right. So that's the whole idea here: for any individual origin server, we have a prioritized list. We know that if the first server in the list is up, that is the one that should own it. If it is down, here's the next one in the list that should own it.
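To make that concrete, here is a minimal sketch of rendezvous hashing in Go. The hash choice, the server names, and the ownerPriority helper are my own illustrative assumptions, not Fastly's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// weight hashes the (server, origin) pair to a single number.
// Any reasonable hash works; FNV-1a is used here only for brevity.
func weight(server, origin string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(server))
	h.Write([]byte{0})
	h.Write([]byte(origin))
	return h.Sum64()
}

// ownerPriority returns the live servers ordered by their weight for this
// origin: index 0 should own the health check, index 1 takes over if the
// first server fails, and so on.
func ownerPriority(liveServers []string, origin string) []string {
	out := append([]string(nil), liveServers...)
	sort.Slice(out, func(i, j int) bool {
		return weight(out[i], origin) > weight(out[j], origin)
	})
	return out
}

func main() {
	servers := []string{"cache-01", "cache-02", "cache-03", "cache-04"}
	fmt.Println(ownerPriority(servers, "origin.example.com"))
}
```

Every node computes the same ordering from the same inputs, so as long as nodes agree on the live server set, they agree on the owner, and when the top server dies, ownership falls to the next entry without any coordination.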

Okay. So let's talk about failure detection, then. So, if we're going to communicate in a cluster, or move work around in the cluster, we need to know who is in the cluster, out of the cluster, who is alive and who is dead at any particular moment. All right? Nice.

And so, one of the ways that people have started doing this lately, one of the more popular failure detectors, is the SWIM failure detector; this has become more popular in the last few years. I will walk through roughly how this works. The idea is that, every so often, servers that are in a data center wake up and say, hmm, okay, I know this server used to be here, my peer used to be here. So I will wake up and say, hey, are you up? And that server responds and is, like, yeah, I'm up. Great, okay, cool. Now we know that the server is up, and server one broadcasts that around the network and says, hey, three is still alive, everything is cool, keep moving.

On the other hand, sometimes what happens is that server one will say, hey, are you up? And server three doesn't respond, and nothing happens. And so the idea with SWIM is that server one says, hey, two and four, can you check on three? Maybe there's a communication blip between us, I cannot tell if it is up right now. All right? And so then three is checked by two and four, three responds, and we know it is still up, everything is cool. With a server that really is down, you do the same thing, but three never responds, and now we know that it is dead. Right?

Memberlist

And so that's, like, a simple protocol. But it is hard to get all the little bits of it correct, so I would not recommend trying to implement SWIM yourself if you are making a production system. I would recommend memberlist, a HashiCorp project; it is a way of determining which servers are still alive at any one particular moment.

That's the whole idea behind failure detectors. And a failure detector is, like, a totally fascinating area of distributed systems that I recommend reading about that we don't have time to go into here.
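For a sense of what that looks like in practice, here is a rough sketch of joining a cluster with HashiCorp's memberlist library in Go; the node name and peer address are made up, and error handling is trimmed to the essentials.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// The default LAN config is a reasonable starting point inside a data center.
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = "cache-01" // this node's unique name (hypothetical)

	list, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatalf("failed to create memberlist: %v", err)
	}

	// Joining any existing member is enough; the SWIM gossip spreads
	// knowledge of the new node to everyone else.
	if _, err := list.Join([]string{"10.0.0.2"}); err != nil {
		log.Printf("join failed (maybe we are the first node): %v", err)
	}

	// The live member set is what we feed into rendezvous hashing.
	for _, m := range list.Members() {
		fmt.Printf("member %s at %s\n", m.Name, m.Addr)
	}
}
```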

So we will use memberlist; that gives us an easy way to achieve the failure detection that we need. Okay. So, we have a way of distributing work using rendezvous hashing, we will use SWIM for failure detection and cluster membership, and we need a way to communicate. You are all probably familiar with the concept of gossip. A server takes messages that it received from other servers, or generated itself, and sends them to other servers in the network and says, here is what I know; or it says, hey, what do you know, please tell me things.

Push and Pull

So this is the idea behind push versus pull in gossip. Push is the fire-and-forget type of thing: here are my messages. The other way of doing gossip is the pull system: instead of saying, here are my messages, we say, hey, what do you have, please tell me things. There are advantages and disadvantages to both. If you are pushing, you know that you are getting your messages out. And when you look at the growth rate of message delivery in a push-based gossip system, you see a huge spike at the beginning, and then it slowly levels out until it gets to every server on the network. Pull, on the other hand, is a totally different beast. Instead of firing and forgetting, you are waiting for someone to ask you about your messages. Right?

And so, if you look at the growth rate of a pull-based gossip system, what you end up seeing is that, for a little while, the message does not get around to many servers. And then someone asks you, and as soon as it starts to grow, it grows rapidly and hits every server quickly. So they are both O(log N) time, but they get there in slightly different ways.

And so, in our particular case, for this particular application, we will go with a pull-based system, I will try to explain why, as we get into a later part of the talk. For now, just know, we will go with gossip, it will be a pull-based system.
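As a sketch of what a pull round might look like, the loop below wakes up periodically, picks a random live peer, sends a summary of what it knows, and merges whatever comes back. The Store and Transport interfaces and their method names are placeholders I have assumed, standing in for the delta-CRDT map and the HTTP exchange described later in the talk.

```go
package main

import (
	"math/rand"
	"time"
)

// Store is whatever state we are gossiping; the concrete delta-CRDT map
// comes later in the talk. These method names are illustrative only.
type Store interface {
	Summary() []byte             // e.g. a serialized version vector
	Delta(summary []byte) []byte // everything the peer hasn't seen yet
	Merge(delta []byte)
}

// Transport abstracts "send my summary, get a delta back" (HTTP in the
// real system).
type Transport interface {
	Exchange(peer string, summary []byte) ([]byte, error)
}

// gossipLoop wakes up periodically, picks a random live peer, and pulls
// whatever it is missing. Failures are simply retried next round, so no
// node ever blocks waiting on another.
func gossipLoop(store Store, t Transport, peers func() []string, every time.Duration) {
	for range time.Tick(every) {
		ps := peers()
		if len(ps) == 0 {
			continue
		}
		peer := ps[rand.Intn(len(ps))]
		delta, err := t.Exchange(peer, store.Summary())
		if err != nil {
			continue // coordination-free: just try someone else next round
		}
		store.Merge(delta)
	}
}

func main() {} // wiring of a concrete Store/Transport is omitted in this sketch
```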

Convergence

Okay, and so, yes. Convergence. Again, we are making an eventually consistent system, we are using rendezvous hashing for ownership, and some sort of gossip for communication. That is the infrastructure of our system, the core guts of it: gossip, failure detection, ownership. Okay. And now we need to talk about how we keep everything in sync, how we know that the system will converge in some way.

We need to design the protocol. So, okay, we will talk about convergence. This is the idea that, regardless of which order the messages arrive in, when they arrive, and who is up and who is down, eventually every node will converge to the same set of values. How do we do that?

Causality

So, going back to the eventual consistency decision that we made early on, we can do better than basic weak eventual consistency. Linearizability, in one of those CP-type systems, assigns a total order to events, meaning that for anything that happens in a CP system, you know that this event happened before or after any other event. There is no non-determinism in there. With causality, we assign a partial order to events. What does that mean? It allows us to say that some events happened before other events, but also that some events happened at the same time. You can think of it as much more consistent with how the world actually works; the universe is more causally consistent than it is linearizable.

So we will give an example of this. We have two actors here, Jack and Jill. They are trying to decide what they are going to eat for lunch. Jack suggests, let's have arugula. Great. Jack then says, okay, arugula; he tells Jill, I have decided upon arugula. And Jill, on the other hand, is like, I don't like arugula; Jill wants burgers. So what happens in this case is that we know arugula happened before burgers.

So, essentially -- okay. When Jill made the decision of burgers, she already knew that Jack had said arugula. Great. So we know that the arugula event happened before the burgers event. Jill replicates that back to Jack, and Jack says, I don't want that, I don't want burgers, I want calzone. While, at the same time, Jill is like, I changed my mind, I don't want burgers, I want daal. Great. In this case, what we can say is that burgers happened before calzone, and burgers happened before daal. As we replicate the messages back and forth to each other, we now realize that calzone doesn't happen before daal, and daal did not happen before calzone; they are concurrent with each other. That's lunch-based causality.

Lattices

And now, let's talk about something that is related: lattices. A lattice is a data structure, a mathematical concept of some kind. In this case, if you want to read about it more, it is a join semi-lattice: a data structure to which you can apply a least upper bound, a.k.a. a join, a.k.a. a merge function. So if you have two join semi-lattices, there's a merge function that you can apply to them. Let's talk about what that means.

And so, lattices start from a root, or a bottom value, despite the fact that our dot is at the top right here. From that value, if you have two or three actors that are operating on this lattice, they can diverge. And the branches, no matter how much they diverge and no matter how much time has passed, can be merged back together in some way; at least two of them can always be merged, and then they can diverge again after that.

Okay, let's go back to causality. It turns out that lattices are super useful for modeling causality, for reasons you can see. And specifically, in this case, what we are talking about is a concept called version vectors. We will walk through an example of a version vector scenario. So, in this case, we have three servers: S1, S2, and S3. Right? And S1 has an event of some kind. In this case, maybe it is doing a health check, for instance.

Okay, so S1 increments its individual index in that version vector. So we have three nodes, three indexes, three slots in the version vector, S1 is going to increment one because an event happened. S1 can then replicate that down to S2, and now S2 sees, okay, I have seen S1's event. From there, we can go, okay, we have more events that are occurring. In this case, S2 has its own event and now its version vector is 1-1-0 because it sees S1's event and has its own, and S3 is not seeing any of these, it is incrementing its own. We can merge these back together from S3 to S2, and now S2 knows that it has seen the events from every other node. So that's the whole idea behind version vectors.

So that gives us a way of tracking causality, it gives us a way of seeing that this thing happened before this other thing. So, in this particular case, given that S3 had not actually seen any of the other events that occurred on the other parts of the system, its event is concurrent with all of the other ones. And you can determine that based on the version vector. And so, of course, it continues on. And it turns out that you can model this exactly the way that you model lattices.

And so, again, if you have your root value there at the bottom, 0, 0, 0, for our version vector, it can diverge, it can merge, and it can continue merging there and, of course, diverge again. Is this making sense? Cool.
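Here is a small Go sketch of version vectors, enough to reproduce the S1/S2/S3 walkthrough above; the node IDs and method names are my own, purely for illustration.

```go
package main

import "fmt"

// VersionVector maps a node ID to the number of events seen from that node.
type VersionVector map[string]uint64

// Event records a new local event, e.g. the result of a health check.
func (v VersionVector) Event(node string) { v[node]++ }

// Merge takes the pairwise maximum, the lattice join of two version vectors.
func (v VersionVector) Merge(other VersionVector) {
	for n, c := range other {
		if c > v[n] {
			v[n] = c
		}
	}
}

// HappensBefore reports whether v is strictly dominated by other.
// If neither vector dominates the other, the two histories are concurrent.
func (v VersionVector) HappensBefore(other VersionVector) bool {
	strict := false
	for n, c := range v {
		if c > other[n] {
			return false // we have seen something they haven't
		}
	}
	for n, c := range other {
		if c > v[n] {
			strict = true // they have seen at least one event we haven't
		}
	}
	return strict
}

func main() {
	s1, s2, s3 := VersionVector{}, VersionVector{}, VersionVector{}

	s1.Event("s1") // S1 has an event: {s1:1}
	s2.Merge(s1)   // replicated to S2
	s2.Event("s2") // S2: {s1:1, s2:1}
	s3.Event("s3") // S3: {s3:1}, having seen none of the above

	fmt.Println(s1.HappensBefore(s2))                       // true
	fmt.Println(s3.HappensBefore(s2), s2.HappensBefore(s3)) // false false -> concurrent

	s2.Merge(s3) // S2 now summarizes every event: {s1:1, s2:1, s3:1}
	fmt.Println(s2)
}
```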

Coordination-Free Distributed Map

One way of applying lattices to distributed systems problems is through this idea of CRDTs. And so what we're going to design next is what is called a coordination-free distributed map, a CRDT map of some kind. We need a data structure that we can replicate, that respects causality in our case, and that is guaranteed to converge. So again, a CRDT map. Great.

And so what it looks like is something like this, and this goes back to what we were talking about before with our pull-based gossip system. When one node wants to exchange state with another node, it wakes up and looks at its own version, its own version vector, and says, okay, this is what I have seen, this is my summary of the causal context I have seen so far in the distributed system. It serializes that and sends it out over the network; in this case we use HTTP, but you can use anything you want to communicate between them. The version gives a way of summarizing: it says, okay, I have seen up to this particular cut in our causal time.

And what that does is it allows the node that it is communicating with to come up with a delta. That node also has its own set of operations it has seen; these two are operating independently. It takes the version and says, okay, I know you have seen these, I will show you the new ones I have seen. It constructs the delta, it sends it back, and then the first node can just merge it. Again, this delta we are operating on is actually a lattice, so it can be merged back together with whatever state that node has. Pretty cool.

And so, this is again called a -- this is not just a regular CRDT map, in this case we are talking about a delta CRDT map. It is not exchanging the entire state, it is exchanging deltas, which are themselves a lattice. Okay.

So, let's just look at code. It is surprisingly easy to do this. This particular project was written in Go. What we're talking about in this case is a very simple thing. We have the shared map at the top that we're going to exchange back and forth with each other, and that shared map has its underlying storage and a version vector that goes with it, so it remembers a summary of what its current state looks like. Each of the shared map's records includes its value and a version vector dot. Where a version vector is a way of summarizing the state, a version vector dot says, this is one particular event: it is just a tuple of a server ID and an event number. It does not give you the full causal context, but it tells you where in that causal context this one particular value came from. Okay.
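A sketch of what those structures might look like in Go is below; the type and field names are my approximation of what is being described, not the actual Fastly code.

```go
package health

// VersionVector summarizes the causal history a node has seen:
// node ID -> highest event number observed from that node.
type VersionVector map[string]uint64

// Dot identifies one specific event: "event number N on server S".
// It pins a value to a single point in the causal history without
// carrying the whole context.
type Dot struct {
	Node    string
	Counter uint64
}

// Record is one entry in the shared map: the value itself (for us, the
// latest health-check result for an origin) plus the dot of the event
// that produced it.
type Record struct {
	Value []byte
	Dot   Dot
}

// SharedMap is the state each node keeps and gossips about: the records,
// keyed by origin, plus a version vector summarizing everything this
// node has seen so far.
type SharedMap struct {
	Records map[string]Record
	Version VersionVector
}
```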

And so, again, what we're going to do for our delta CRDT map is this: the first step is easy. We take the version vector, serialize it, and send it over the wire. When our chosen gossip partner receives the message, it goes through each record in its map and asks, did this version vector happen before this particular event? Because you have a version vector that contains a summary of every event the other node has seen, and you have the individual dot this value came from, you can ask, did this event happen later, or concurrently? If either of those is true, the other node has not seen this event yet, and that gives us a way to decide what to put into the delta. Great.

So, yes. If the version happened before our individual record, or is concurrent with it, we add it to the delta; that is all we do, and we send the delta back. The merge function is slightly more complicated, but not that bad. The idea here is that we go through each record in the delta, and because our own state may have changed in the meantime, we check whether the dot associated with the record happened before our version vector, that is, whether maybe we already got this message. If it did, we skip it; we know that we got it, or we have seen something later than it. Then we check and see, okay, was our version of this value older? If it was older, great, we merge it.

And then, of course, we need some way of determining what to do if they are concurrent, because causality does not give us a total order, it gives us a partial order. So we have not actually solved the problem of how to merge these records together if they conflict with each other. What is that little sparkle that we're going to use there? This actually brings us straight back to rendezvous hashing, which gives us a way of breaking a tie in this case. There are many ways to break ties in delta CRDTs, but in this case, rendezvous hashing is perfect for us. Not only does rendezvous hashing allow us to determine ownership for a particular piece of work, it lets us break ties. So when a node sees its operation has been usurped, it says, I should stop doing this work, because somebody at a higher priority has done it. So it determines both tie breaking and ownership of a particular thing, which is pretty cool.
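Putting those last steps together, here is a hedged sketch of the delta and merge logic with a rendezvous-hash tie-break, re-declaring the types from the earlier sketch so it stands alone. The "newer versus concurrent" check is simplified to comparing dots from the same writer; the real system presumably does this more carefully.

```go
package health

import "hash/fnv"

type VersionVector map[string]uint64

type Dot struct {
	Node    string
	Counter uint64
}

type Record struct {
	Value []byte
	Dot   Dot
}

type SharedMap struct {
	Records map[string]Record
	Version VersionVector
}

// seen reports whether the causal history summarized by v includes d.
func (v VersionVector) seen(d Dot) bool { return v[d.Node] >= d.Counter }

// Delta returns every record the requesting node has not seen yet,
// judged purely by the version vector it sent us.
func (m *SharedMap) Delta(remote VersionVector) map[string]Record {
	out := map[string]Record{}
	for key, rec := range m.Records {
		if !remote.seen(rec.Dot) { // later than, or concurrent with, their history
			out[key] = rec
		}
	}
	return out
}

// Merge folds a received delta into our state. Records we already know
// about are skipped; newer ones from the same writer replace ours; the
// rest are treated as concurrent and resolved by rendezvous weight, the
// same function that decides ownership, so every node breaks the tie
// the same way and the map converges deterministically.
func (m *SharedMap) Merge(delta map[string]Record) {
	for key, rec := range delta {
		if m.Version.seen(rec.Dot) {
			continue // we already have this event or something later from that node
		}
		ours, ok := m.Records[key]
		switch {
		case !ok:
			m.Records[key] = rec // we had no value at all
		case ours.Dot.Node == rec.Dot.Node && rec.Dot.Counter > ours.Dot.Counter:
			m.Records[key] = rec // strictly newer value from the same writer
		case weight(rec.Dot.Node, key) > weight(ours.Dot.Node, key):
			m.Records[key] = rec // concurrent writers: rendezvous weight breaks the tie
		}
		// Fold the event into our causal summary either way.
		if rec.Dot.Counter > m.Version[rec.Dot.Node] {
			m.Version[rec.Dot.Node] = rec.Dot.Counter
		}
	}
}

// weight is the rendezvous hash from earlier: higher wins.
func weight(node, key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(node))
	h.Write([]byte{0})
	h.Write([]byte(key))
	return h.Sum64()
}
```

The important property is that every node applies the same deterministic rule, so merges commute and the map converges regardless of the order in which gossip arrives.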

And so we have built a delta-state CRDT map. It is easier than expected, it is coordination free, and it converges to a deterministic state. So no matter how out of order or out of sync a node is, it will come back to the same state as every other node in the system. That is pretty cool, and this is based on this paper. It is a short, digestible paper out of one of the Portuguese universities that is doing fantastic work on this. And in the paper, you can see it: we send the version vector, generate the delta, when we receive the delta, we merge it, and we randomly pull, or gossip. I like this system; it is a beautiful example of applying academic research, basically a one-to-one mapping.

So after this, we achieved all the goals that we wanted to without a CP system, or linearizability. The system is causally consistent, and every node can still always make forward progress. Great. Okay.

Edge Compute

And so, we use rendezvous hashing, SWIM, and CRDTs. So why is the CTO of a content delivery company talking about this, and why should you care about it? It is because of the concept of edge computing: we can do so much more in edge-based scenarios than we do today. Another way of describing edge computing is coordination-free distributed systems. You cannot do a CP-based edge computing system; it does not make any sense. So all of the research that has been going on for decades in coordination-free distributed systems applies to edge computing scenarios.

So why hasn't this caught on more, why aren't we building these distributed systems that are not strongly consistent but still achieve the user's goals? They are harder, sure. And I would argue that one of the things that holds us back is the prevalence of this idea in the developer's mind: the single system image. It says that a distributed system, regardless of the number of nodes we are running it on, and regardless of where they are, should provide an interface to our users that appears to be one system. You see this in distributed databases all the time. The idea is that they want to hide the fact that it is running across hundreds of nodes, so that it appears to be one individual computer. Another way of describing this is to say that the system is linearizable. To an outside observer, it appears that every event has a specific order.

For any two events, you can say that one of them happened before the other. But as we were talking about earlier, that's not how the world and the universe work. Causal consistency is a better way of looking at how things happen in the real world, but it is trickier, and it is not the way that we think about things as we are designing systems. It is our attempt to apply the single system image idea that leads us to bad, hacky, failure-prone designs. And in edge computing, it simply doesn't work; you cannot hide the fact that you are running across many nodes all simultaneously.

And so, as I was preparing this talk, I was reminded that a friend had actually written an article about this. It is called A Certain Tendency of the Database Community, the shadiest title I have ever seen. What he says there is: we posit that striving for distributed systems that provide single system image semantics is flawed and at odds with how systems operate in the real world.

So what we've walked through today is a coordination-free distributed system that maintains the requirements and the invariants that we set out to achieve early on, without using linearizability or a single system image. So this stuff is not as hard as we believe it to be. It is hard and tricky, and if you get it wrong, it is hard to debug. But we need new metaphors. We have the single system image idea, and that's a great metaphor for a certain type of distributed system, but we don't have one for the types of systems that I'm talking about today. There is no easy way to say, this is what I am imagining this system is going to do.

New Metaphors

We need new metaphors and intuition for these types of things. We have been building systems for a long time using this idea of the single system image, and what we are lacking is the intuition to build systems like this. We also need more tools and frameworks for it. So, that's all I've got for you. Thank you.

We have some time for questions.

So when you talk about the pull, is that happening on the client side, or node to node?

It is a node-to-node system; within each individual data center, servers are waking up and pulling, and it is all done simultaneously.

[Editor’s note: Speaker far from mic].

I'm sorry, I cannot hear you.

In that case, there is no partition tolerance, you need it with the same network to do the -- I'm sorry, it is too noisy. Can you come up afterward and ask me? Great.

Any other questions? Oh, one over there.

Hi, great talk, thank you. You glossed over the reasons for not using ZooKeeper or Consul. Why wouldn't you use those?

In this particular case, the question was about why I wouldn't use Consul. The answer is that, for many systems, I would absolutely use it. But it really comes down to this decision: when there is a network failure, what is preferable? Is it preferable for us to go out of sync briefly, as long as we have a way of getting back, or is it preferable for us to stop the system? In some cases, stopping the system is the right answer. As we were talking about with the financial systems earlier, you don't want two bank accounts to diverge in a way that cannot be put back together, or that can be abused, so you wouldn't want to use an AP system there. For us, it is preferable to have two health checks happening rather than zero.

Hi, I didn't understand what about the model was a pull-based system; it appeared there was a node that was sending out its version vector, and then the delta happens and it gets it back. That appeared to be a push-based system.

Yes, it is a pull-based system. In that, the node that sends the message out isn't sending messages, there is no payload to that. It is really just saying, I want you to send me messages, but here is the -- here is a way of figuring out which messages I'm going to want. That's all it is. Yeah.

This is more of a clarification question to help understand what you just said.

Sure.

So, first question: this whole thing that you explained, was this custom developed specifically for Fastly, for network failure detection only, or was there another use case?

So this was a system that we specifically designed for this particular problem. As part of that, we ended up putting a lot of work into some of the HashiCorp projects that we were using, like memberlist, to make it easier for people doing this in the future, and we have internal projects that resulted from it. The ideas are applicable to many problems; it is a matter of doing it the first time.

So this was applied to failure detection?

Correct.

Thank you.

You're welcome.

Now, I'm a little new to distributed systems, so bear with me here.

Cool.

But it seems like the entire algorithm had the dependency that every node needed to have a complete node set and origin set. So, we didn't talk about how that registration process worked, or where that would play a role.

Sure. So, having an entire node set for the other servers in its data center is a requirement for this particular system. Luckily, the SWIM-based failure detection system we are using has a way of dealing with that. To add a new node, all I have to do is contact any other node in the system, and knowledge of that node will get to everybody else; it is an invariant of this particular failure detector.

And as for the full set of origins -- I guess I didn't talk about that. The interesting part is that, since we are using rendezvous hashing and we know that there's a specific priority order, when I learn about a new origin server, I can wait a little while if I'm low on the priority list. If I wait long enough and I have not seen anything from anyone higher than me, I will take over. Alternatively, it would remain in sync even if you had every node start doing the work for every origin server: eventually they start gossiping about it, and a node sees that there's a higher priority node doing this and says, I will shut mine down. So there's a way that it continues to stay in sync over time, regardless of whether or not the node set is actually the same across all of them. Anyway, I hope that explains it.

So, your talk seemed to end with, like, a dot dot dot. This is where we are, how do we go from here to something better? Why did you write -- why did you come up with this talk, what do you expect to happen?

Yes, why did I come up with this talk, man. That's an interesting -- I will go lay down now. No, so, the whole idea is showing that, like, this stuff is possible, and it is actually not as hard as a lot of people might believe it to be. We are almost in some ways afraid, you have, like, you heard of NIH syndrome, not invented here, and some people have NNIH syndrome, not NIH, they are afraid to invent these components. And I want to show that this is actually possible. And alternatively, I look at LASP, a language -- it is an academic language, and it has CRDTs built into it as the primary data structures. So when you run the system, you don't have to think about it. The operations that you can do are all fundamentally causally consistent. So long-term, that's the kind of thing that I'm talking about, like, where I think we need new frameworks and maybe new languages to describe the things that we are doing in a way that is not weird and scary.

Can you talk a little bit about performance and scalability and network load with a big, chatty system like this?

Totally. So the system that this replaced was one that was way chattier, because it didn't have any idea of ownership, necessarily, or really of the whole delta idea. What that meant was that everyone was talking all the time, and they were multicasting within the data center, which our network systems team was losing their minds about. So we said, we will cut out the broadcast and multicast, and cut the number of messages that need to be sent down to what I see as essentially the minimum. And so, I don't know, I don't have a really great answer for you on this particular one. All I can say is it was a hell of a lot better than the last system. All right. Thank you so much. Oh -- all right. Thanks, everyone.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.


Recorded at:

Dec 12, 2017
