
Kafka Needs No Keeper


Summary

Colin McCabe talks about the ongoing effort to replace the use of Zookeeper in Kafka: why they want to do it and how it will work. He discusses the limitations they have found and how Kafka benefits both in terms of stability and scalability by bringing consensus in house. He talks about their progress, what work is remaining, and how contributors can help.

Bio

Colin McCabe is a Kafka committer at Confluent, working on the scalability and extensibility of Kafka. Previously, he worked on the Hadoop Distributed Filesystem and the Ceph Filesystem.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

McCabe: Today we're going to be talking about Kafka and how it's going to evolve in the future. Kafka has gotten a lot of mileage out of Zookeeper. Zookeeper is used in many ways in Kafka. We use it to store metadata, we use it to receive notifications when certain things change, and it's actually done pretty well at that. We have a lot of respect for Zookeeper, for the community behind it and for the authors, but it is still a second system. It's another thing people have to learn. It's another moving part. All of those things.

What we're talking about today is a new Kafka improvement proposal called KIP-500 that's talking about how we can move beyond Zookeeper and basically use Kafka itself to store metadata and manage it. I want to emphasize that this isn't going to be a one to one replacement of Zookeeper. We're not replacing Zookeeper with another system, which is substantially like Zookeeper.

It's not about using something like one of the many projects which are similar to Zookeeper but maybe have a slightly different API, slightly different implementation. It's actually about managing the metadata in a very different way, and a way that we feel would be much more extensible and supportable in the future.

Along the way I'm going to talk a little bit about how we got here and what direction we've been going. There's actually a lot of previous work that's been leading up to this. KIP-500 is the culmination of a lot of work that the community has been doing for many years.

Evolution of Apache Kafka Clients

To start off, I'm going to talk a little bit about the evolution of Apache Kafka clients. You guys probably know already, the client is the thing that talks to the server here. We basically have three kinds of clients. One of them is the producer that will put messages into the system, then there's the consumer that will pull them out, and then there are administrative tools that will do stuff like creating topics, deleting topics, and security-related things like setting up security settings.

In the very beginning we had producers. They would write to topics. That was obviously going through the broker. I'm using this Kafka symbol here to represent the broker processes. We would have consumers that would read those topics. There was a hitch here, which is that the consumers needed to periodically store the offset that they had last fetched from, and we used Zookeeper for that in the beginning.

We also used Zookeeper to understand what consumer group we were part of, and understanding this is key to doing things like rebalancing when a consumer from a consumer group goes down, and all that stuff. Of course, the administrative tools would also go directly to Zookeeper. There was no broker involvement in the administrative tools. If you wanted to delete a topic or create a topic, you would just basically have a script that launched a Scala process, and that would manipulate the Zookeeper state directly. This was the situation back in the Kafka 0.8 days.

Consumer Group Coordinator

I guess I should talk a little bit about why there were limitations here before we go on. First of all Zookeeper is not really a high bandwidth system. If you have a lot of consumers doing offset fetches and offset commits, it can really start to bog down the system and it can interfere with stuff which is also key for Zookeeper to do. Secondly, we're going to talk more about this later, but this is a real obstacle to having effective security because all of the security that we implemented was actually on the broker, and by going through Zookeeper you're bypassing that layer of security.

We started to move away from this. The first instance of this was the consumer group coordinator. As I mentioned earlier, originally all of the consumer group stuff would just go through Zookeeper and all of the offset fetch, offset commit stuff would also go through Zookeeper. You had these end-users who were talking directly to Zookeeper.

In order to avoid this, we were able to create these consumer APIs. Basically, this offset fetch and offset commit became APIs on the broker. They were no longer things that you would talk directly to Zookeeper to do, but you could talk to the broker and then the broker could store this stuff internally in a topic.

Two things here. We're creating a new API on the broker and we're storing the data in a different place. It's no longer being stored in Zookeeper, but it's being stored in Kafka itself. This is an example of Kafka on Kafka. We're using the system to store our own metadata here.

The second thing we did was basically the same for group partition assignment. We created a similar set of APIs: the join group API, the sync group API, the heartbeat API, and so we were able to take Zookeeper out of the consumer read path. That's really good because then we can scale out the number of consumers more effectively and we can also basically secure consumers a lot more easily. After this, the diagram looks a little more like this. The producer and the consumer are going through the regular brokers and the admin tools are still talking directly to Zookeeper.
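As a rough illustration of what that looks like from the client side today, here is a minimal Java consumer sketch. The broker address, group ID, and topic name are placeholders, but subscribe, poll, and commitSync are the standard consumer client APIs that go through the broker (group coordinator) rather than Zookeeper.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // brokers only -- no Zookeeper address
        props.put("group.id", "example-group");          // group membership handled by the coordinator broker
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() triggers the JoinGroup/SyncGroup/Heartbeat APIs under the hood
            consumer.subscribe(Collections.singletonList("example-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
            // OffsetCommit API: offsets land in the internal __consumer_offsets topic, not in Zookeeper
            consumer.commitSync();
        }
    }
}
```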

Kafka Security and the Admin Client

Before we talk more, we should have a brief digression about security here. As Kafka was growing in popularity, there was a greater demand to have security and in particular, policies that we could enforce at the topic level and things like that. This is something that we implemented on the broker side through access control lists, or ACLs, we usually call them. You could have an ACL that would say you can read from this thing or you can't read from this thing, stuff like that, but again, when you go directly to Zookeeper you're bypassing ACLs. It's a bit of a hole in the security model, in that sense.

Our next goal here was basically to rework the admin tools in terms of APIs on the broker, and in order to do that we created this thing called the admin client. The admin client is similar to the producer and consumer in the sense that it's basically a class which communicates with the broker over well-defined APIs. These APIs were things like create topics, delete topics, alter configs.

We've been reworking the admin tools so that they use the admin client rather than talking directly to Zookeeper. By doing this, we can do things like give people the ability to create topics that start with a certain prefix, but not the ability to create topics that start with another prefix. They can't delete certain topics, but maybe they can create their own topics. Rather than being an all or nothing, security is now a lot more fine-grained.
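Here is a minimal sketch of what that looks like with the Java AdminClient. The bootstrap address, topic name, and principal are made-up placeholders, but createTopics and createAcls with a PREFIXED resource pattern are the standard admin APIs that make this prefix-based permission model possible.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class AdminClientExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // talk to the broker, never to Zookeeper

        try (AdminClient admin = AdminClient.create(props)) {
            // CreateTopics API -- the broker validates and applies the change
            NewTopic topic = new NewTopic("team-a.orders", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();

            // Prefixed ACL: allow this principal to create topics starting with "team-a."
            AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "team-a.", PatternType.PREFIXED),
                    new AccessControlEntry("User:team-a", "*",
                            AclOperation.CREATE, AclPermissionType.ALLOW));
            admin.createAcls(Collections.singletonList(binding)).all().get();
        }
    }
}
```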

In fact, we can even have programs which create Kafka topics as part of their regular operations, so it's no longer just something that the admin has to do from the command line. At this point the diagram looks a lot more like this. You have the producer, consumer, and admin client all going through these APIs and all talking to the broker, and then the broker will manage what needs to be done in Zookeeper.

In addition to security, there's actually a lot more benefits that we've only really touched on here, and basically these are the benefits that you get by encapsulating the data here. What do I mean by that? What I mean is that the data in Zookeeper is in a certain format. Over time that format will evolve. On the other hand, people may use different versions of the client. It's actually really nice to have an API in the middle that can translate, let's say, an older request into the new format which we're now using. This helps with compatibility, because we're no longer directly accessing basically the brain of Kafka, we're actually going through these APIs, which give us the ability to have a well-defined interface. We can also validate these APIs, so we can say stuff like, "You can't create this topic with these certain settings." It became possible to have policies, which is really something we couldn't have when we're directly modifying stuff in Zookeeper. Those are all pretty important things there.
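As a sketch of the kind of validation this enables, here is a hypothetical create-topic policy using Kafka's pluggable CreateTopicPolicy interface. The partition limit is an invented example; a real deployment would register a class like this on the broker via the create.topic.policy.class.name setting.

```java
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;
import org.apache.kafka.server.policy.CreateTopicPolicy.RequestMetadata;

// Hypothetical broker-side policy: reject topics that request too many partitions.
public class MaxPartitionsPolicy implements CreateTopicPolicy {
    private static final int MAX_PARTITIONS = 100;  // assumed limit, for illustration only

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void validate(RequestMetadata request) throws PolicyViolationException {
        Integer partitions = request.numPartitions();
        if (partitions != null && partitions > MAX_PARTITIONS) {
            throw new PolicyViolationException(
                    "Topic " + request.topic() + " requests " + partitions
                    + " partitions, which exceeds the allowed maximum of " + MAX_PARTITIONS);
        }
    }

    @Override
    public void close() { }
}
```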

Inter Broker Communication

So far we've just been talking about clients and how they interact with the system. It's actually equally important to think about how brokers interact with other brokers and with Zookeeper. In the case of Kafka, we actually have a lot of interacting to do. Brokers need to register themselves with Zookeeper so that people know where they are. Obviously we have to manage ACLs; we have to know what they are. We have this concept of dynamic configuration, which is basically configurations that you can change over time in a managed way and have them changed at runtime.

We also have this thing called the ISR, which is basically the in-sync replica set, and this has to do with how we do replication. This is all metadata that we need to worry about. In order to manage this, Kafka elects a single broker to be the controller. This is a familiar pattern if you've used Zookeeper before, this idea of having a leader, which is elected by Zookeeper, where you delegate certain responsibilities to that node. Of course, controller election is very important for us. Controller election is managed by Zookeeper.

The model that we have here is basically that Zookeeper will elect a single controller, and that controller will update the rest of the system with the changes to metadata that are made. That's done through certain APIs. I'm not going to go into detail about what they are, but they are basically the ones pictured in the top right here. The controller will be responsible for pushing out these updates to the metadata.

Even though I just got done saying this is the model, there are actually some exceptions to it. One of them is that there are cases where brokers go directly to Zookeeper, and one of them is when changing the in-sync replica set. Currently, the leader of a partition will directly communicate with Zookeeper when changing the in-sync replica set. This creates a lot of the same problems that we saw with clients, which is that now it's hard to change the format. What if you have a broker from an older version that's changing this and it's overwriting the new format with an old one? There are also issues about notification. We'll get into those later.

One of the things that we've been working on lately is actually creating APIs to avoid having this direct Zookeeper access from brokers. The model that we'd like to get to is a model where we centralize the Zookeeper access in the controller. The controller is the node that we're going to delegate this work to, and that gives us, again, a lot of the same benefits we had before. It's a lot easier to reason about a system where there's a single source of truth, in this case the controller and ideally Zookeeper is the single source of truth, but the controller should be the bottleneck through which changes flow to it.

It's a lot easier to reason about that than having dozens of brokers that are all making changes at once, potentially overwriting each other's changes. We have the benefits of better compatibility, and of course we just have better encapsulation. I should note in passing that there are facilities in Zookeeper to manage concurrent writes. There's basically a compare-and-swap type primitive. Again, that makes the system pretty hard to reason about if you start using it a lot.

Broker Liveness

Another important one here is broker liveness. This is basically who's in the cluster. This is actually a very important concept in any clustered system. You have to know who's alive, who's part of the system and what responsibilities they have at any given moment. This is one of the things that Zookeeper gives us currently, and the way that it's done is by maintaining this Zookeeper session. The Zookeeper session is basically opened when the broker starts talking to the Zookeeper cluster, and the broker will create this node that describes where it is, how you contact it and so on. That node will contain information like the IP addresses you can use, the rack and so on.

Let's say that a broker actually goes away. In that case, the Zookeeper session will also go away and this ephemeral ZNode will be removed. Currently that will trigger a watch, and that will update the controller, and then the controller will push out these updates basically saying what just happened, that the broker is offline.
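To make the mechanism concrete, here is a stripped-down sketch of the ZooKeeper primitives involved: an ephemeral znode for registration plus a watch that fires when it disappears. This is not Kafka's actual registration code; the connection string, znode path, and payload are placeholders.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralRegistrationSketch {
    public static void main(String[] args) throws Exception {
        // Session: if this client dies or is partitioned away, the session expires
        // and every ephemeral node it created is removed automatically.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 6000, event -> { });

        // "Register" by creating an ephemeral znode carrying host/port/rack info.
        byte[] info = "{\"host\":\"broker1\",\"port\":9092,\"rack\":\"r1\"}"
                .getBytes(StandardCharsets.UTF_8);
        zk.create("/brokers/ids/1", info, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A controller-like observer sets a watch; it fires with NodeDeleted
        // when the session (and therefore the registration) goes away.
        zk.exists("/brokers/ids/1", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("Broker 1 is gone -- push out updated metadata");
            }
        });
    }
}
```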

Network Partition Resilience

This seems like a pretty simple system. You have Zookeeper here and whenever the session goes away, the controller is notified and then the controller pushes out the updates. Let's talk about what can go wrong. That's always the interesting part here.

One thing that can go wrong is maybe the node's not down at all but it's just partitioned out from the cluster in some way. You've probably heard the term network partition before. The idea is that maybe some nodes can see each other, but some other nodes can't.

The network partition that we like the most is the total partition, where basically the node goes away. You could also regard this as being the case when the node crashes or the hardware goes away. Logically, it's out of the cluster. No one can see it.

Kafka handles this pretty well, because the Zookeeper session will go away and then the controller will notify everyone about what's going on and all the good things. Then there's the case of just being partitioned from a broker or several brokers. In this case there will actually be some potential loss of availability because you may not be able to replicate. If you have a partition which relies on that other broker, that will not be replicated. At least you keep consistency in the system.

Then the case which is most interesting for this talk is when you partition from Zookeeper. I guess there's two cases that are most interesting for this talk, but this is one of them. This case is interesting because you will not be regarded as part of the cluster, and yet nodes can maybe continue talking to you, which creates a bit of an uncomfortable situation. Regardless of some weird situations that can happen, you'll still basically keep consistency here because the controller will remove you. It will remove leadership of all the partitions that this node has. Similar to case two, we lose some availability but we still keep the consistency.

Then there's a fourth case which actually we don't handle that well and which KIP-500 will actually improve a lot, which we'll see, and that's the case of the controller partition. If you remember from earlier, the controller is actually telling us about the changes that are happening in cluster topology. If we can't talk to the controller, or rather if the controller can't talk to us, we miss out on that stuff, but at the same time we're still regarded as part of the cluster because we keep our Zookeeper session alive.

This is a very specific example of how the separation between the controller and Zookeeper actually creates awkward situations because they're two separate systems. Keep that in mind. We'll come back to that later.

Metadata Inconsistency

Logically, we have this model where Zookeeper is the source of truth, Zookeeper tells the controller about what's going on and the controller then pushes out these updates to everyone else. In a sense everybody is getting these updates in an asynchronous way because they happen first in Zookeeper, the controller becomes aware, and then they're pushed out. This is the ideal model of what we would like to happen. In reality though, this doesn't always work, because there may be cases where, again, if you have a partition, the controller will try to send out the information, but it simply won't be received. In this case, you can actually end up with divergence. As a last resort, sometimes people have to actually force a controller election, and there's a way to do that from within Zookeeper.

This is actually frustrating. You would really like the system to always know whether the metadata is consistent, even if there's some weirdness going on at the network level. There's a second reason why this is annoying, which is that when a new controller comes up, we actually have to load all of the metadata from Zookeeper because remember, the metadata is not stored on the controller. The metadata is stored on Zookeeper.

Performance of Controller Initialization

This brings us to the performance issue here. Again, when you're loading this metadata, the complexity is O(N), where N is the number of partitions. As your number of partitions increases, this loading time will increase as well. This loading time is actually critically important because while the controller is loading the metadata it can't be doing things like electing a new leader, stuff like that. That, again, can lengthen the period when the system is unavailable.

I should note in passing that controller election doesn't necessarily create unavailability. It only creates unavailability if you needed the controller to do something during that time. It's not as bad as it may seem, but it still isn't good. In order to scale, we'd actually like to get rid of this O(N) behavior.

Another O(N) thing is pushing all of the metadata. As the number of partitions grows, that metadata will increase. It really sucks to have to tell someone all the metadata every time you do an election.

This complexity here, obviously the number of brokers will factor in because we have to send this information to every broker. How can we get around this problem? I guess I gave it away with this slide title, but the solution here is to actually do something more like what Kafka's users are already doing, inside Kafka itself: specifically, we want to treat metadata as an event log. Rather than sending out snapshots, we'd actually like to treat it as a log. I'll give an example here.

Metadata as an Event Log

If you treat metadata as an event log, then each change to it can become a message. For example, creating a topic becomes a message, deleting a topic and so on. These changes can be propagated to all the brokers in the same way that we propagate changes to consumers.

The nice thing about having this log is that there's a clear ordering of what's happening and we can also send deltas rather than these snapshots. For example, if I have read up to a certain offset, then I don't need to be sent everything, I just need the updates that happened after that offset. It's also easy to quantify what the lag is. How far behind are you actually? If you know the offset then you can have that information.
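Here is a toy sketch of that idea, just to show the shape of it. It is not Kafka's actual metadata record format, only an in-memory log where a follower that knows its last applied offset fetches just the deltas after it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "metadata as an event log": each change is an ordered record,
// and a follower that knows its last offset only needs the deltas after it.
public class MetadataLogSketch {
    record MetadataEvent(long offset, String description) { }

    private final List<MetadataEvent> log = new ArrayList<>();

    public synchronized long append(String description) {
        long offset = log.size();
        log.add(new MetadataEvent(offset, description));
        return offset;
    }

    // A broker that has applied everything up to lastApplied fetches only the tail.
    public synchronized List<MetadataEvent> fetchAfter(long lastApplied) {
        return new ArrayList<>(log.subList((int) (lastApplied + 1), log.size()));
    }

    public static void main(String[] args) {
        MetadataLogSketch log = new MetadataLogSketch();
        log.append("CreateTopic foo, partitions=3");
        long applied = log.append("CreateTopic bar, partitions=1");
        log.append("DeleteTopic foo");

        // The broker's lag is simply (latest offset - applied offset).
        System.out.println("Deltas to apply: " + log.fetchAfter(applied));
    }
}
```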

Traditionally, Kafka has these consumers, they consume at an offset and they're consuming from brokers. In this case we're actually talking about the brokers that are consuming. What are they consuming from, though, in this case? It's the controller. The controller will maintain this metadata log internally. How do we implement this controller log? If you recall back to the earlier slides, we were able to implement the log which stored the consumer information just by using a regular Kafka topic. It was nothing special. It had some underscores before it, but it's not that special.

In this case we actually can't do that, because right now the controller is involved in choosing the leader for partitions. This doesn't really work. We have a circular dependency when we're dealing with metadata. What we really need is a self-managed quorum here where there's not a dependency on an external system to choose the leader. The nodes should be able to choose their own leader, and this is actually where Raft comes into the picture. The main property that we really like about Raft for this application is the last one – the fact that leader election is by a majority of the nodes rather than being done by an external system. Aside from that though, actually, the replication protocols are not too different. Raft has terms, Kafka has epochs. They're pretty similar.

The push-pull thing is interesting. In the traditional Raft context, changes are pushed out. In Kafka they're traditionally pulled. It turns out it's not that big of a difference. You can actually adapt the system pretty easily.

The Controller Quorum

Let's talk about the controller quorum a little bit more here. How do we choose the active controller? First of all, why do we even have multiple controllers? The biggest reason is high availability, so we want to actually have the ability to fail over when the controller fails. Secondly, by having a quorum, we can actually take advantage of some of the properties of Raft here. We can just say that the active controller is the leader of the Raft quorum, and the active controller is the only one that can write to the log. The number of nodes that we'd like here is basically going to be very similar to Zookeeper, so we're probably going to want three nodes, maybe five. We're probably not going to want to go to too many nodes, in most cases.
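A quick back-of-the-envelope sketch of the quorum math, which also shows why the usual choices are odd sizes like three or five:

```java
public class QuorumMath {
    // Majority needed for Raft leader election and commit.
    static int majority(int quorumSize) {
        return quorumSize / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.printf("quorum of %d: majority=%d, tolerates %d failure(s)%n",
                    n, majority(n), n - majority(n));
        }
        // quorum of 3: majority=2, tolerates 1 failure(s)
        // quorum of 4: majority=3, tolerates 1 failure(s)  -- even counts add little
        // quorum of 5: majority=3, tolerates 2 failure(s)
    }
}
```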

When I said we weren't going to build a better Zookeeper, I lied, because we are, but not in the sense that we're creating a general-purpose system. Only in the sense that we want to have a replication system which is vaguely reminiscent of Zookeeper.

It turns out that this gives us practically instant failover because Raft election is pretty quick. Furthermore, the standbys have all the data in memory so they don't have to load it from some other place. Another way that failover is instant is the fact that brokers don't have to refetch everything. This ties in with what we talked about earlier, which is that if the broker knows the offset that it read up to, it just needs to read the offsets after that. It doesn't need to refetch everything in the world. Another advantage we can have here is we can actually cache this metadata to disk. When you're starting up your broker, maybe you actually don't want to have to refetch everything.

I will note in passing that there are cases where you probably do want to refetch everything. Maybe your disk was wiped out or maybe you're just so far behind that you don't want to replay all the deltas. We hope that that's the exception rather than the rule.

These things are going to become increasingly important as we start adding more zeroes to the number of topics, the number of partitions that Kafka supports. As Kafka starts scaling out, it's critical that we get rid of the O(N) behavior in the system.

Let's talk a little bit about broker registration, which we discussed earlier in the context of how it works now. Of course, broker registration was just the act of building a map of the cluster. In the post-KIP-500 world, brokers just send heartbeats to the active controller and the controller can use this to build a map of who is in the cluster. The controller can also use its responses to tell brokers things that they should do. For example, maybe they should be fenced if there's already a broker with that ID, and stuff like that.

Fencing is an important topic in general. If brokers can't keep up with the metadata updates that are happening, they need to be removed from the cluster because otherwise, they're going to be propagating really old information. Brokers should also self-fence if they can't talk to the controller. This closes off a lot of the problem scenarios that we've had in the past. There's a lot of different types of fencing in Kafka, and I couldn't possibly talk about all of them. This is a very important one.
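As a purely hypothetical sketch of that broker-side logic (the real protocol is specified in follow-up KIPs, and every name here is invented), a heartbeat-plus-self-fencing loop might look something like this:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of broker-side liveness under a heartbeat-based design.
// Not the actual post-KIP-500 protocol; interfaces and timings are invented.
public class HeartbeatSketch {
    private static final Duration HEARTBEAT_INTERVAL = Duration.ofMillis(500);
    private static final Duration FENCE_AFTER = Duration.ofSeconds(2);

    interface ControllerClient {
        // Returns true if the active controller acknowledged the heartbeat.
        boolean sendHeartbeat(int brokerId, long lastAppliedMetadataOffset);
    }

    public static void run(ControllerClient controller, int brokerId) throws InterruptedException {
        Instant lastAck = Instant.now();
        long lastAppliedOffset = 0;
        while (true) {
            boolean acked = controller.sendHeartbeat(brokerId, lastAppliedOffset);
            if (acked) {
                lastAck = Instant.now();
            } else if (Duration.between(lastAck, Instant.now()).compareTo(FENCE_AFTER) > 0) {
                // Self-fence: stop serving clients rather than hand out stale metadata.
                System.out.println("Cannot reach the controller; fencing this broker");
                return;
            }
            Thread.sleep(HEARTBEAT_INTERVAL.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake controller that stops acknowledging after a few heartbeats.
        ControllerClient fake = new ControllerClient() {
            int calls = 0;
            public boolean sendHeartbeat(int brokerId, long lastAppliedMetadataOffset) {
                return ++calls <= 3;
            }
        };
        run(fake, 1);
    }
}
```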

Let's talk a little bit about network partitions. We obviously have the same case that we had before, where something's totally partitioned. The behavior hasn't really changed. If you can't contact the controller, then you will be removed from the cluster, as is appropriate. The broker partition hasn't really changed either. Same behavior there. In the case of the controller partition, which was a problem before, it's actually not a problem now. If you can't talk to the controller then you can't send heartbeats, and so, therefore, you can be removed from the cluster and not be causing problems as a zombie. This is actually an example of where the post-KIP-500 system eliminates some metadata inconsistency. Of course, we don't have the case we had before where we can't contact Zookeeper, because we don't have Zookeeper.

Let's switch gears a little bit and talk about deployment. I think deployment is one of the things that people are most excited about when they think about getting rid of Zookeeper. I do understand why. Not to really knock Zookeeper, but it is a separate system. It's another configuration file with a different format, different stuff in it. It's another set of metrics and another set of admin tools that you have to learn. Finally, and perhaps even worst of all, it's another set of security configurations that you have to master to really secure your system. It's alarming how many people are running without Zookeeper security right now. Definitely getting everything under the same security umbrella is really going to help.

When we talk about how controller nodes are going to be deployed, there are actually two options that we're going to support. One of them is a shared controller node, where the controller is actually colocated with the broker. This is similar to what we have now, where a node that is a controller also has broker duties associated with it.

The other option that we're going to support is having separate controller nodes, and this also has a precedent in the sense that typically bigger clusters will run separate Zookeeper nodes to avoid having the performance of the brokers negatively interact with Zookeeper. Both modes, I think, will be useful, and it will depend on the size of the cluster and maybe what hardware you have, stuff like that. For small clusters, colocation, I think, will be very useful.

Roadmap

I've talked about a lot of stuff, but what's the roadmap for actually doing this stuff? Pretty important question.

There's three basic phases here, I think. One of them is that we remove the client-side Zookeeper dependencies, and this is actually almost done. We've been creating APIs for the last stragglers who still access Zookeeper directly from the client-side. We haven't yet deprecated direct Zookeeper access, but that's happening soon.

The phase after this is to remove the broker side Zookeeper dependencies, and in this phase we actually will try to centralize the access to Zookeeper in the controller, and this will involve creating some broker side APIs in a parallel echo of what we did in the client. I think this phase should also involve fully removing the Zookeeper access for the tools. You may ask why we didn't do it in the previous phase, and the answer is we try to have a pretty generous deprecation policy in Kafka.

Finally, the final phase is this controller quorum and basically reworking the controller in terms of it, and this includes implementing Raft as well.

Once we've implemented all this stuff you're going to need a way to upgrade to it. This is a problem because right now you probably have tools running around which are directly accessing Zookeeper. You have brokers running around accessing Zookeeper, and you have a lot of state in Zookeeper. If you were to try to jump directly to a KIP-500 release, it wouldn't be really possible to do that through a rolling upgrade.

You guys probably already know this, but a rolling upgrade is one in which you upgrade nodes one by one and there's no actual downtime, and this is something Kafka has supported for a long time and it's very important to us.

In order to keep that working, we have this concept of a bridge release. In the bridge release, there's no Zookeeper access from the tools and there's also no Zookeeper access from the brokers, except the controller. The rationale behind this will be apparent in a few slides. Let's say you're doing this upgrade to a post-KIP-500 release. You would start from this bridge release, which remember, has no Zookeeper access from tools and no Zookeeper access from brokers other than the controller.

The first thing you would do is you would start the new controller nodes for the release you're upgrading to. That quorum will elect a leader, and it will claim the leadership in Zookeeper. Basically, the new controller will take the leadership and maintain the leadership throughout the upgrade process. At this point, you can start upgrading the nodes one by one, just as you usually would, and the reason this will work, of course, is because all of the Zookeeper access is going through the new controller. Basically, it's handling all of that. This also means that the new software will need to know how to send these old LeaderAndIsr messages to the existing nodes. Finally, once you've rolled all the brokers, you can decommission the Zookeeper nodes.

In summary, it is going to be possible to upgrade to a post Zookeeper release with zero downtime.

Conclusion

Wrapping up here, Zookeeper has served us pretty well. What we're doing here is not so much replacing Zookeeper as switching to a different paradigm, a different way of managing metadata that we think will be much more effective. These changes have been going on for a long while, and there are more reasons to do it than just scalability or even just deployability. There's also the improved encapsulation, the improved security, and the improved ability to upgrade and to freely change our internal formats. Those things are all important here, I would say.

I think this is a great example of how managing metadata as a log is much more effective, because it eliminates this need to send full snapshots. As the size of the system grows, and we expect Kafka to really grow over time as people start using more and more data and more and more topics, it becomes more and more important to have the good properties of a log: to basically have deltas rather than sending the full state, to have caching, and to be able to know what the ordering is for metadata changes.

We also want to improve our availability, to improve controller failover, to improve fencing and basically be more scalable and more robust. Along the way it's important that this metadata log be self-managed. I touched on why that is and why that needs to be the case here.

Finally, it'll take a few releases to fully implement KIP-500. There's going to be additional Kafka improvement proposals for some of these APIs for Raft, for how we're going to handle metadata. KIP-500 is like the overall roadmap, but there are going to be more detailed proposals to come.

It's very important to us, and really the whole community is very interested in rolling upgrades and we are going to support those. We talked a little bit about how that's all going to work. In conclusion, we believe that once this is implemented, Kafka will need no keeper.

Questions and Answers

Participant 1: You mentioned Raft. Is it an external system or service, or are you going to be implementing that?

McCabe: We're going to implement Raft in Kafka itself. The reason for that is Kafka is a system for managing logs. We think Raft is part of our core competency. If you're a system managing logs and you use another system to manage your logs, it's bad. All the usual software engineering reasons apply.

Participant 2: How do you make sure that the participation in the Raft consensus is only done by the brokers which are in-sync replicas, and not the ones that have already been taken out by the controller?

McCabe: Sorry, I didn't quite get that.

Participant 2: You have a controller and, let's just say, 10 brokers. The controller is keeping track of the in-sync replicas for all of these brokers. One of the brokers never actually received the metadata, so the controller will possibly remove that from the in-sync replica group. Is that correct?

McCabe: In most cases, the leader will remove a broker from the in-sync replica set, not the controller. There are cases where the controller will do it, but in general it's normally the leader of the partition.

Participant 2: Ok. In that case, suppose the leader actually removes a broker from the in-sync replica set, and then the leader dies. There will be a Raft consensus that would elect a new leader.

McCabe: I think we're talking about two different things here. If we're using Raft, we're not using the ISR. There are going to be basically two replication systems: one of them is the one we have now, and the other will be Raft. Raft does have something similar to the in-sync replica set, but it's not going to be the in-sync replica set we have in Kafka. It's not going to have exactly the same properties.

Participant 3: What if you have only two Kafka workers and something bad happens to this network? Who will be controller?

McCabe: In general, if you only have two nodes then you can have a Raft quorum, but it will be of size either one or two, and that's basically not going to provide any redundancy.

Participant 3: Let's say you have four brokers and it separates, two and two. Who will be the controller?

McCabe: If there's a network partition and it's two and two, then you can't form a quorum basically. First of all, I'm assuming that you have an odd number of controller nodes. That's how you would normally deploy it. Even if you did choose to have four nodes, the rule is going to be, basically, you need a majority.

Let's say your quorum is size three. That means you need two nodes for a majority. If your quorum is size four, then you would need three, I think. The point is, at a certain point you have to sacrifice availability. If you can't form a quorum, then you must sacrifice availability. This is the same thing we have with Zookeeper right now. Let's say you have a Zookeeper quorum of size three and you lose two nodes. They go away. Then your quorum is done and your system is down until you can bring up at least one more node. It's not changing, basically, in that regard.

Participant 4: Are you expecting that if you have a pool of 10 or 20 brokers, they will all be potential controllers, or do you have to explicitly set these three or these six could-be controllers? Or will they all just duke it out and say, "Hey, five, that sounds like a good number," and choose?

McCabe: The answer is the second one. There's only going to be certain nodes that you would nominate as potential controllers, which again is similar to how we do Zookeeper now. You're not going to run a Zookeeper node alongside every broker in a very large cluster. You wouldn't have a controller node on every broker in a post-KIP-500 cluster.

 


 

Recorded at:

Mar 03, 2020
