Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Interviews Camille Fournier on Consensus Systems: ZooKeeper, Etcd, Consul

Camille Fournier on Consensus Systems: ZooKeeper, Etcd, Consul


1. We are here at CraftConf 2015 in Budapest, I’m sitting here with Camille Fournier. Camille you are here at the conference to give a talk on ZooKeeper and similar projects, so can you give us an overview of what ZooKeeper does?

Sure, so ZooKeeper and etcd and to some extent Consul, Consul is a little bit more interesting, but all these systems I’m calling them consensus systems, or consensus services, because I can’t figure out a better name for them. They are systems that are written based on consensus algorithms, so ZAB which is sort of like Paxos or Raft, and they are systems that are used primarily for the purposes of either doing distributed locking or providing data for service orchestration and sort of distributed metadata, so they are key/value stores that keep all of their data in memory, for very fast read access, they are designed for read heavy workloads.

The unique qualities of these systems versus other key/value stores like Cassandra or Riak or whatever, are that they provide, first of all they are highly consistent, that’s what they do, all writes all go through a voting protocol to make sure that they are written to a majority of replicas, everyone sees them and agrees that, yes this data is valid, we are going to write it to disk and persist it so they’ll never lose data, you can use them to build compare-and-set operations or they even provide those compare-and-set operations as primitives.

So that’s one of the biggest and most important parts of them, a lot of the distributed data stores have that ability, you can sort of put them into a mode where they provide highly, highly consistent writes, but a lot of the time you don’t want to do that, you want to let them be a little more optimized for availability or doing dirty reads, so you are reading old data or data that it’s in the process of being written and the ZooKeepers and etcs and whatevers in the world are really optimized for very high consistency. If you really need to write things and always see what you write and be guaranteed that that write will not partially succeeded or partially fail, it will always succeed or fail, that’s one of the biggest things that they provide. The other two things that they provide that most of these distributed data stores do not provide are ephemeral nodes or nodes with a time to live, so you can create data that will automatically be cleaned up by the system, so DNS sort of has this. If you advertise something, put some data into DNS that has a TTL and it will eventually expire and refresh itself. These have that to a much tighter time bound, so in ZooKeeper they are called the ephemeral nodes, in other systems they are just nodes that you write but you create a TTL on them, you have to continually update or refresh the TTL in order to keep the node available.

The final really big thing that they do is notifications, watches, events. So you can say: “ I’m interested in this particular piece of data, let me know if it changes” and you set that up and the system will let you know when and if that data changes, and that is not generally something that distributed data stores provide, you know if you even want to use like pub/sub messaging for example, RabbitMQ something like that, you are listening to a channel but somebody is publishing consciously a lot of data to that channel. This is a little bit more lightweight, it’s not designed to do pub/sub messaging, it’s really designed for events that may not occur that often but that are very important for you to know about and act on when they occur, so like I said, service orchestration is a very popular use case, if a service advertises its existence by creating a node and you are watching to see if that node changes, when that service goes away or dies, that node will change and possibly be deleted, and you will automatically know to refresh your view of the world at that point, you don’t have to go poll the database or poll the file system or whatever to find that out.


2. The notifications, how is that, for instance in ZooKeeper, how are they implemented, is there a network connection back to me or…?

Yes, so in ZooKeeper, ZooKeeper has a thick client model, so that’s one of the big differences between ZooKeeper and the etcds and Consuls of the world, ZooKeeper is built with the notion that you are going to have a client, that’s a thick client, you create stateful sessions to the ZooKeeper, you have 2 channels of communication, you have a synchronous channel, you have an asynchronous channel, so the asynchronous channel carries a lot of the heartbeat traffic and will carry the watch notifications back to you, so when something changes you will get a notification on that channel and you’ll do something with it.

And I think one of the interesting things about these two models, so in etcd, I don’t know Consul very well but I did some research on etcd as part of my research for this talk, you’ll actually do an HTTP longpoll, so you will longpoll and that longpoll will return when something changes and the system is implemented in that way. So it’s essentially opening a socket, HTTP in that case. I think that difference, I think that’s a very interesting difference between ZooKeeper and between sort of etcd/Consul, those two different models.

The original system of this sort is a system called Chubby out of Google, so they published a paper a number of years ago called the Chubby lock service or something like that, it’s a great paper, I highly recommended that anyone who is interested in building distributed systems, read it because it’s a great engineering tale, it’s really, it’s not so much about algorithms, it’s about how they made the choices they made to build the system, and Chubby is a thick client model, so you are creating these long lived sockets and ZooKeeper is very much in that fashion and in that manner, and in some ways it’s kind of an old school manner, and we were writing C++ or Java, languages that support threading fairly well, it made sense to say: “You know what, we are not going to have like a bunch of teeny, tiny little services all over the place, these services are a little bit bigger, there are processes that we're running that we may need to be coordinating, they are kind of thick processes themselves, so putting a thick client on a thick process to coordinate with ZooKeeper is not a big deal and you get some optimizations, you are not opening and closing sockets constantly to do your ZooKeeper actions, right?"

Now of course the world has changed and in the world of, you are trying to coordinate cron jobs or really micro microservices or you are using languages where threading is not as easily supported, it’s much harder to reason about and build these thick clients, and that actually has been a big part of the struggle for adoption I think for ZooKeeper in certain languages, Ruby I’ve heard is, there are Ruby clients for ZooKeeper, people definitely use it but it’s not easy, it doesn’t work as well as people might like.

So etcd is providing an HTTP access which is really nice, because everything supports HTTP, curl it and you’re done, and that’s great and I think that there is definitely overhead consequences to that, but it’s definitely an interesting design decision and actually I have been talking to the ZooKeeper community about what it would take for us to start to support HTTP access, we have an HTTP Proxy that can be used, but it doesn’t support all of the functionality, it doesn’t support, you can’t really create an ephemeral node for example via HTTP, because that would be a fundamental new design piece for our systems to implement, we actually have to implement that TTL timer on something that isn’t just been heartbeated by your standard service, so I think it’s an interesting potential evolution for the ZooKeeper product and I think it’s a very interesting design decision for etcd, I think it has a great strength if you are in an environment where you have these very lightweight microservices things that are not, you don’t want to have to have a big thick client around and you are not using the JVM maybe, right?

You want to use it for things sort of like cron jobs even. I’m not sure on the other hand what the scaling applications will be of that, and maybe there is not, I don’t know and that’s the other thing about of course the new systems like etcd and Consul, is they are new systems, so one of my friends makes this great analogy that you are looking for the blood spatter pattern and when you start to use a new system, the blood spatter pattern around the system tells you where to be careful, and that’s the bug report and the stories and the war stories that come out either in talks or in blog posts or in mailing lists about “I was using the system and like accidentally this client started doing this thing and the system just went haywire” and sometimes the answer to that is: “Don’t do that thing, it’s not a big we can fix, you just need to not do that thing”.

And so the newer systems, etcd and Consul, they don’t have that blood spatter pattern quite as clear yet. I haven't looked at Console but I couldn’t find any good benchmarks for example on etcd, like what is the performance or the performance restrictions for doing this type of system - I don’t know, I know people are working on it, but that’s I think another one of the trade-offs here, I do believe fundamentally that HTTP has got to have some consequences, that can't be a free choice, I could be wrong.


3. You mentioned that Chubby paper by Google, so if people want to be, want to look at interesting papers or interesting things to read in that area, what else would you recommend?

That’s a good question actually, so there is a book on running ZooKeeper, I have to admit it I have not actually read it, but some of the fellow ZooKeepers, Flavio and I forget, maybe Ben Reed wrote that book, so I think if you are interested in running ZooKeeper, certainly it’s an O'Reilly book, it’s a good book to read. There are papers, so ZAB has a paper, so that's the ZooKeeper Atomic Broadcast is the protocol that we actually use inside of ZooKeeper to do our consensus, so obviously Paxos is the originally consensus protocol, Leslie Lamport, years and years ago, and it’s also famously difficult to understand.

ZAB is still pretty difficult to understand, it’s a little bit simpler, makes some simplifying assumptions by virtue in fact it uses TCP, so TCP provides some nice simplifying assumptions but it’s still fairly complicated. And there are a few papers that are related to ZooKeeper out there if you search the academic literature you can find them, then the newest hotness in consensus is of course Raft which was written, partly, to be understandable, it was written partly because Paxos and even ZAB are so difficult to wrap your head around because they are very complex protocols, so the author of the Raft, Diego [Ongaro], out of Stanford I believe, he was actually at this conference last year I believe talking about his work, and he went to great lengths to create an algorithm that is still correct but that people could understand and they actually tested that, he actually taught the two different algorithms to students and then tested them to see which one they could remember more easily and they could remember Raft much more easily which is really exciting so you’ve seen a lot of people adopt that Raft algorithm and start implementing it in systems, so again etcd and Consul and people have implementations of it in all kinds of languages which is very interesting, so if you are interested in the general problem of consensus and the algorithms around it, I think that there is a plenty of literature out there, I think there is a little bit less literature about the implementing of the actual system themselves, and that’s why I love that Chubby paper so much because I’m not an academic, I’m an engineer, I really like to hear about the actual real world trade-offs that people have to make when building these systems and the surprising things that you learn when you put a new system in the hands of a giant team of developers as they did at Google, so I don’t know any other papers like that off the top of my head, I’m sure they exist.

Werner: So hearing this I should ditch the Paxos paper and go straight for Raft.

Well I think there is nothing, it’s good to learn the classics, you know and it maybe that if you read Raft first you would have a basic sort of fundamental understanding of what consensus really means that could actually make Paxos easier to understand, so maybe you should just try reading them in that order.

Werner: Ok, that’s a good tip, so we’ll all do that, and thank you Camille!

Thank you very much!

May 25, 2015