InfoQ Homepage Presentations Kafka: A Modern Distributed System

Kafka: A Modern Distributed System

View Presentation

Speed:

Download

52:24

Summary

Tim Berglund covers Kafka's distributed system fundamentals: the role of the Controller, the mechanics of leader election, the role of Zookeeper today and in the future. He looks at how read and write consistency work, how they are tunable, and how recent innovations like exactly-once semantics and observer replicas work.

Bio

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the Senior Director of Developer Experience. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to Distributed Systems, and is the author of Gradle Beyond the Basics. He blogs very occasionally at http://timberglund.com.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Berglund: I am Tim Berglund. I'm here to talk to you today about Kafka as a distributed system. I work for a company called Confluent. We make a cloud service and an event streaming platform that's based on Apache Kafka. I spend most of my time talking about Kafka itself and the architectures that are arising around event-driven things and around events. There's a lot of exciting things to talk about.

Today, a little bit different. Who uses Kafka right now? Who is new to Kafka? I'm going to give you a little bit of a primer on Kafka basics. We don't have a lot of time for that. I want to just give you the shape of things, basically. Then I want to dive into three things, how replication works, how leader election works, and how consumer groups work in Kafka. There are some interesting distributed systems problems that you have to solve to do those things.

Distributed Systems Simulation

I know there's two ways I could do this. One is I could make a bunch of animations on slides that you could watch as I talk about them. That's fun. I like doing that. The other is I have these Sharpies and I have these gigantic Post-it Notes that I purchased here in London this morning. I'm not even going to tell you how much I paid for these Post-it Notes. It's a scandal. I want to know who's responsible. We're going to not worry about that. I'm putting it aside because what we're going to do for two of these things is I'm going to get volunteers up here, on stage, we're going to act these things out. Some of you are going to be brokers. Some of you are going to be consumers. I'm going to be ZooKeeper in all cases, when a ZooKeeper is necessary. That's what we're going to do.

I was talking to some British colleagues here this morning, who work for Confluent also, and they pointed out that that seems a very American thing to do. It sounded absolutely terrifying to him. I don't know any other way to be. I have difficulty pretending to be anybody else. We're going to do that. I think I will need at most six. There might be one where I need eight volunteers. Start thinking about whether you want to be a part of that especially if you're, I don't know, in the front row, you have easy access to the stage. We'll get there. It's going to be a little bit from now. You got time to psych up for it. You don't have to talk. I'll do all the talking. What you're going to do is you're going to hold this. You're going to write numbers on it. You might peel one off and stick it on your shirt and write other numbers on it. I've never done this before. We're going to see how it goes. You get to be a part of it. I should say, I've never done it before. I have wanted to do this for years. I am excited that we finally get to explore the space together.

Kafka Basics - Topics

Some Kafka basics, just to make sure we're all on the same page. Topics. Kafka is a system for managing and logging events. One of the biggest anti-patterns I see for people who are new to Kafka and just trying to learn how to think about it, is that they think about it as a queue. That's terrible. It's not really a queue. The abstraction you want to think is that it's a log. It's a distributed log. The difference is that a log stores things and a queue doesn't. Or if it does, it does it in incidental way.

You have these things called topics. Unfortunately, there are message queues that also have things called topics, but we overload terms all the time. You produce events into that. Events are modeled as key value pairs. That might be important for some of our in-person, acted out exercises later. They get produced to these topics and they're kept in order. That's it. They live in the topics until an elapsed time that you specify goes by. By default, that's seven days, but you can make it infinity. You can make it seven minutes. You can make it whatever you want. That retention period is a dial you get to set. A Kafka topic can be the place that data goes to live and is the system of record.

Our previous speaker was describing systems and data structures and things that can emerge around a structure like this, when you use a log as your system of records. That's what Kafka is. It is fundamentally a log. However, it would not be any fun if it were just one log stored on one server. That wouldn't be very reliable, and it wouldn't be very scalable. We want to be able to grow this beyond just one machine. We're going to partition the log. We take that one log that's got stuff in it, and we're going to split it into pieces.

Now there's one topic in three partitions. When we act out some of the protocols we're going to go through later, I'm going to stick with that three partitions. Sometimes topics have 50 partitions. I wasn't sure how many people would be in the room. Considering what I paid for these Post-it Notes, that would have been $500 USD to make that work. I think it just would have gotten inconvenient. Three is a good number. We're just going to go with three.

You take the log. You split it up into pieces. These boxes, these logs, that's an actual file. The simplest way to think about that is just as a log file that is stored on a disk somewhere. Of course, if you really go under the covers, it could be several segmented files. It doesn't have to be literally one file. Just think of it as a file, and you append events to that file. There are three files, and we store them on separate machines. Those machines, I'm going to call them brokers. Some of you are going to get to be brokers in a little bit. You can go home tonight or back to your hotel, if you're traveling here, and you say, "Today, I was a broker and it involved a Post-it Note and a Sharpie. It seemed it would be harder than that." That's what you got.

Here's our first problem that we have to solve. This one's really easy. When you now have a key value pair and you want to produce that to the topic, you have to decide what partition to write to. The answer to that is a scheme that shows up all over in all distributed systems. It's called consistent hashing. What we do is we just take the key, and we run it through a hash function. It doesn't matter what hash function it is, as long as we always use the same one. It doesn't need to be cryptographically secure, or anything like that. It just needs to be a hash function that's nice and fast. All of our producers are always using the same hash function. Hash function, output mod, number of partitions, gives you the partition number. Those things will always get produced to the same partition if they have the same key.

Ordering in Kafka

This is a little bit separate from our concern today, since as quickly as possible, I want to get down into some interesting weeds. Ordering in Kafka when you're thinking from an application perspective, and not from an operational perspective like we'll be looking at today, from an application perspective, ordering in Kafka is by key. Once we take a topic, we break it into pieces, and we put those pieces on different computers. We no longer have a guarantee of global ordering. You'll never know, globally, what the ordering of all the events were. However, for a given key, you'll know. Those will always be written to the same log on the same broker, and they will always be in order.

Which, when you're brand new to Kafka sounds terrible. You're like, "I want to queue. Why I can't, I have something that acts like a queue. It's not. It's broken." You'd be amazed how seldom it's actually a problem. That ordering problem usually doesn't show up, and ordering by key is typically what you need. Every once in a while, you come up with a need for global ordering, and those are interesting for how rare they are. You come up with other ways of solving them. All those ways involve a topic of one partition. You get there. It's very simple. There's one way to do it. If you're using Kafka, you probably have more data than comfortably fits in one partition. If you have a global ordering problem, you have some scheme of reducing that data, the things that you need to order so that it will fit into one partition. I'm just going to leave that topic for now and produce that remaining key.

Replicas

Replicas, we're going to talk about how Kafka does replication. I just want to tell you what the basic scheme is, so everybody's got this picture in their head. Here I've got four brokers, and one topic that has four partitions. It does not always need to be that neat. I could have three brokers and a topic that has 10 partitions. These numbers don't need to align at all. This just makes this slide pretty. That's why I picked it. Those four partitions I have assigned to the four brokers. This is nice. I have scaled out my topic. Each one of those machines can handle storage, can handle Pub/Sub independently to the others. If I lose one of those machines, I lose my data. That seems bad. What I'm going to do is I'm going to replicate it. Three is a common replication factor. This is tunable. It doesn't need to be three. Sometimes you'll read in tutorials and documentation, just the assumption that it's three. The reality is, it almost always is three, but it doesn't have to be. That's a dial that you set.

What Kafka does is it elects one leader. There's one replica that's the leader. For each partition, we've got three replicas. One of those replicas is the leader. When there's an application that's producing to that topic, it does the write to the lead replica. Whatever broker the lead replica is on, that's the one it sends the write to. When you read, you read from that leader also. I'm going to repeat that later. The reading and writing is to the leader and the others are there just for redundancy. That way, for example, when one of those goes away, we're able to elect a new leader, like we did here. Now suddenly, Broker 2 is the leader for that partition.

A few more things, I mentioned producers. Producer is just a program that writes to a topic. It's client application. This is what you actually write when you're running a Kafka cluster, when you're building applications on Kafka. You write a program that's a producer. You can do that with an API called the Producer API. It's very low level and just lets you put messages into topics. There's higher-level APIs and other ways to be a consumer that are not quite so explicit, that let you deal with higher level abstractions of stream processing. Not thinking explicitly about putting a message in somewhere.

Consumers, likewise, these are ways of getting messages out. A couple interesting points that are distinguishing features of Kafka. One is that consuming a message doesn't make the message go away. It's non-destructively reading it. That means I can have more than one consumer on a topic. If I've got some event stream, and two services that want to consume that event stream. You know what? You folks, I'd even give you a three, if you wanted three. Or if you wanted a few thousand, it would be ok to have a few thousand different consumers that are consuming from one topic. Because consuming does not destroy, you can have many interests on one event stream, many applications that are reading that one event stream.

Scaling Out a Consumer Group

Because consuming might be computationally expensive, we can scale out a consumer group. You can just add another instance of a consumer group. That consumer group gets partitions assigned to it. This is a hugely important part of Kafka, that once I've got a consuming application, if I deploy a second instance of that, if there are more partitions to distribute, there are three partitions here. I can deploy yet another instance of that consumer, and it will get a partition assigned to it. Without doing really anything that I'm aware of at the application level, I'm just using the consumer API. I get this horizontally scalable thing. I become horizontally scalable. That's because of the consumer group protocol, which we're going to look at in some detail.

It's super useful. It's a thing that gets reused in a number of interesting ways in different parts of Kafka. It's also not perfect, the consumer group thing. It distributes work to some set of workers, that is, partitions, to consumer group members. There are things it doesn't do. This is why higher-level APIs like the Kafka Streams API and languages like KSQL exist. Because there are some thorny application side state management problems, that the consumer group protocol doesn't address. I just want to leave that as a little flag in your head. If you're new to Kafka, and you're tickled by the interesting distributed systems problems that I'm presenting here, and you look into Kafka more. When you look into Kafka as an application programmer, and you're using that consumer group API, or consumer API, and you remember, "Tim said this was really cool. Look at all this assigning that it's doing of work. I remember that." There are still reasons to look at Kafka Streams for developing stateful stream processing applications rather than just using vanilla consumers all the time.

Kafka Streams are built on top of the consumer API. Really, the only thing anybody ever does to Kafka is be a producer or be a consumer. That's what you can do. Kafka has its own communication between brokers and its own replication problems it's solving, and these interesting interactions between clients and brokers. As far as you know, all you do is you put events into topics and you get events out of topics. There could be any number of more abstract ways of doing that, like a functional stream processing API like Kafka Streams, or a streaming SQL like KSQL. You're still just a producer and a consumer. Everything's a producer and a consumer.

What I've given you so far, just this 15-minute summary of Kafka is actually adequate as a set of abstractions for you to think about. It's all that you need for groundwork.

Let's get into the pieces. I want to talk about a part of Kafka called the controller. This is foundational. We need the controller to think about how replication works. I want to talk about replication. I want to start with the controller. I want to remind you of some things I just said. These are important facts about Kafka that we need to keep in mind to think about what the controller is and what it does. First of all, remember that partitions are replicated. We take topics. We split them into pieces. We take those pieces and we make copies of them.

One of the great things architecturally about Kafka is that it's a system for dealing with events. What do we know about events as data structures? They're immutable. Events are immutable, sometimes tragically. As data structures, that's never a tragedy. It's good. That's a great simplifying assumption, lets us get away with all behavior in event-driven systems that would be harmful with a database-based system or an entity-based system. It's good that events are immutable. That's a very helpful thing.

Partitions themselves are not immutable because you add to them. The actual partition and the replica of the partition is the thing that you change. The fact that we're going to replicate partitions means now we have these mutable objects whose state we need to keep consistent in the system. That's a three word bullet, there's lots of pain that results from that. As I said, writes, when I'm producing, that always goes to the lead replica. Partitions are replicated. There's a leader and followers. For n replicas, there's one leader, there's n-1 followers. Writes always go to the lead replica.

Followers ask the leader for new writes. I'm a producer. I produce to the lead replica of the partition, of the topic that I'm writing to. I don't then go to the followers and produce to them. That follower doesn't then go to its followers and say, "I have stuff for you. What are you doing right now? I have some writes." It just doesn't do that. Those followers have the responsibility of periodically going and asking their leaders for what new writes exist.

We always read from the leader. When I'm a consumer and I'm reading. I don't read from followers. I read from the leader. This makes it easier for us to reason about consistency. When we act this out, I hope it'll make sense how we do this. Why is that asterisk there? That is troubling me. Who put that there? That's because in a recent release, the idea has been introduced of a replica that is asynchronous. It's a follower and it's asynchronously replicated. You can get a consumer to read from it. It's called an observer replica. That's a footnote. Google it, check it out. I'm not talking about observers today. I needed to put that asterisk there for purposes of keeping this simple, I want to say reads always come from the leader. I don't want anybody who knows better to be mad. Yes, there are observers. No, we're not going to talk about observers.

When is a write considered a write? Is it considered a write when I send it and I hope maybe it got written somewhere? Is it considered a successful write when the leader acknowledges it and writes it to disk? Is it considered a write when the leader writes it to disk and all the followers have written it to disk? These are all options. It's all tunable at the producer level, and that is there. Just as a note, I don't want to talk about the tunable producer settings in detail. What constitutes a successful write is a thing you get to determine. That is your trade-off of durability versus latency.

This is also very important. When I'm reading, I can only read data. I just said, what constitutes a successful write is up to you. Does it have to get to the followers? It doesn't have to, if you don't want it to. When I'm consuming, I can only consume data that has gotten to the followers. Consuming, I'm only able to see stuff that is fully replicated. If you're playing the part of a consumer you get what you get. If you play the part of a lead replica later, you may have to know that, "This is going to be complicated".

Controllers

What is the controller? The controller is a broker. There is one broker in the Kafka cluster that is elected as the controller. There's always one. By default, it's the first broker that comes up in a Kafka cluster. It gets to be the controller. If it goes down, a new one is elected. There is always one. Its job is to monitor the health of all the other brokers in the cluster. We've got all these partitions scattered all over my cluster. I've got three brokers. Let's go with three. For any given broker, you are the follower for some partitions and you're the leader for others. When you die, and you will die, the replicas for which you are leader now have to go to somebody else. It's the job of the controller to mediate that election of new leaders. Some other broker who was a follower replica for that same partition is going to get elected as the leader. The controller mediates that process.

Then when that has happened, the controller tells all of the other brokers, "There's a new leader in town for replica six of partition one of topic page views." All the other brokers need to do that. All the brokers have that metadata. They all know who's got what partition and who's the leader? That has to get communicated.

How does this voting work? It's very simple. Think of ZooKeeper as a hierarchical, in-memory tree. It's a hierarchical file system looking thing. I can write things to paths. I have a path called /controller in ZooKeeper. This is a special node. The things you write in ZooKeeper are called nodes. That's an ephemeral node. To create an ephemeral node, you just use the API to say, "ZooKeeper, I want an ephemeral node." You, the creator of that ephemeral node, keep a session to that thing. It's there as long as you're connected to it. As long as that session is up. When that session goes away the node goes away. It seems convenient. We might be able to use that.

When the broker boots, it tries to write a little description of itself to that controller node, and the first write wins. When the controller dies, the session expires, and the next broker in line now wins. Because the first broker to write controller became controller, the second broker to come up tried to and it's in line waiting to do that write. It knows it has not become controller until the first session goes away and then it wins. ZooKeeper gives us this nice way of doing this.

If you're following Kafka news, there is an important KIP, or Kafka Improvement Proposal called KIP-500 that has to do with killing ZooKeeper. If you operate a Kafka cluster, you've wanted to kill ZooKeeper for years. It's ok. ZooKeeper really never did anything intentionally to hurt you. It's just been this workhorse of an effective little strongly consistent quorum thing. Kafka is getting rid of ZooKeeper. A year-and-a-half from now this talk will have to change. Talk about how the post-ZooKeeper world works.

Here's an example of what the controller might manage. A controlled shutdown of a broker, say, I'm killing a broker. I send a SIG_TERM to a broker, to the JVM process. That broker then tells the controller, "I'm going down." The broker then says, "There are 300 replicas for which you're leader. I'm going to go through them one at a time, and find the next person in line. The next broker in line is going to be the new leader for those partitions. I'm going to tell that broker it's the leader for those partitions now. Then I'm going to tell everybody else, that it's the leader for those partitions now." Informs leaders, informs the rest of the brokers.

Broker Failure

Likewise, if that broker fails, same thing. There's a broker failure scenario. This is initiated by the controller when it sees a broker no longer responding. When it sees a broker die, it basically just goes through the shutdown scenario. Same thing. Already talked about controller failure. That's the controller. That gets us to the replication protocol, which relies on the broker, which relies on these and these.

Replication

Replication, remember, just like we talked about some basics with the controller, I want to talk about some basics with replication. The log is a sequence of records and those records are immutable. We want to replicate each partition, each log. There's one leader, n-1 followers. Producers always write to the leader. Consumers always read from the leader. Followers query the leader for new replicas. If you're a follower, it's your job to reach out to your leader and get the new stuff.

Which gets us to another term so-called in-sync replicas, better known to the world as ISRs. I never was a fan. I just age-wise missed them, no harm. It's fine. This joke is made occasionally. A replica is considered in-sync, if it is within a certain number of messages or a certain amount of time of the leader. By default, a replica is out of sync if it is more than 10 seconds behind the leader. These are tunable things. I can set this to a number of messages and an amount of time. Rather than just saying a replica is live or dead, we consider a replica in-sync or out of sync. A slow replica may drop out of this list. The live followers, who are eligible for leader election when we die as leader, are the ISRs, or the in-sync replicas. You have to be in-sync. To be in-sync, you can still be behind. There can be messages that I've got as the leader that none of my followers have. Think about that. That's just going to be true all the time. Stuff gets written to me. They haven't gotten around to asking. They're always behind. In-sync is this fuzzy thing, but I might be able to elect them as a leader even though they've got messages I don't have. That's ok, for reasons that I hope are going to be clear.

Recall that for each partition, ZooKeeper knows the leader. A thing called the epoch, which we're not going to need, and the ISR list. For every partition, I know who's the leader, and I know what followers are in-sync. I also can define a thing called the High Water Mark. This is the thing that the leader knows. The leader knows the High Water Mark, which is the highest replicated offset, offset's just a number of the message in the log, common to all the partitions in the ISR list. I as the leader, I will always have messages that you followers don't.

People are writing to me, and the followers are always, after that, coming and asking for new messages. I'm always going to have stuff they don't. They'll be caught up to various levels. I will know because they asked me, at the time they asked me what offsets I gave to them. When I send those off, and they say, "Got them." They acknowledge that. Then I'll know, everybody's caught up through offset 1,048,576. That's my High Water Mark. I'll get 10 new messages produced, but my High Water Mark is still back here. I'll know I'm ahead of the High Water Mark, but that High Water Mark means everybody's got those. That's important, because I am not allowed to expose messages to a reader past the High Water Mark. The contract is for it to be readable it has to be fully replicated. Any broker can die and I won't lose those messages if I have allowed them to be readable. I just said that.

Scenario

I need three replicas, one producer, one consumer. Ben you're going to be a broker. Here you're a broker. Raul, you're a broker. Ron, you're a consumer. Lars, you're a producer. I would like you guys to write just B1 in big letters so that they can see them back there: B1, B2, B3. You're the brokers. Just write consumer. I have a producer and I have a consumer. Write producer in big letters there. Probably good, if you just take those off and stick them on you.

Ben, you are also going to be the leader. Yes, just want to put L on there. Broker 3, Broker 1, put an F underneath your broker, your followers. Producer, it's your job to come up with the messages. This is creativity on your part. I would like you to keep them short so these guys can write them, because Ben is going to have to write things. I'm thinking positive integers that are small, would be good as messages, or short words that are easy to spell. On your next piece of paper on your pad, brokers, this is going to be your actual log. Remember, messages are key value pairs. We don't need to know that right now. Just the value is fine. In fact, sometimes people produce messages using no key at all, animals. We're not animals. Just to keep it simple, just a value, a number is fine. A word is fine too. In fact, if you want, here's a smaller pad, you can actually write your message on that, and you can walk it over to Ben.

Ben, you're going to receive that message. Producer, go ahead and produce. What have we got? The number, 1.

Lars: That could be impossible, you always start with 0.

Berglund: Yes. Producer, you would need to connect to a bootstrap broker and identify the topic that you want to write to, that you'd get some metadata back saying what partitions that topic had. You would hash your key, or if you didn't have a key, do some other things. Then you'd make a decision that that partition is the one you want to write to. I've selected a single partition of this broker and left all that stuff out of the process because it would just be ugly. You know that the Ben partition is the one you want to write to. You're going to go take it over to the leader. You can go take that message over to Broker 2. You can give it to him, now Broker 2. That's one way to log it. You just write that. That actually works. You have to assign it an offset. You want to assign it an offset. I like this, that's offset 0. Would you like to produce another number producer?

Lars: Yes. What if I said no?

Berglund: I would fail you over. I can do that. Forty-two, excellent number. Why don't you go ahead and produce that. Ben, if you would assign that an offset, please. One more. This is fine. You can stick those on your body, wherever you like. That's no, I didn't mean that. Kafka is untyped, it's fine. We're not using schema registry right now, which bodes ill for the project. We're just getting started. That is now going to be offset 3. I think that's enough messages for now. We're going to do one more, don't worry.

Follower 1, if you could just ask him if he's got any messages. Ben, you'd send a copy of them over the network, but you can just show him, he's right there. We don't need to be pedantic. Follower 1 is going to do that. Broker 3, would you also ask the follower if he's got any messages? I just want you to write those down. Stop right there at two. You've got two of them written down. You just had a major GC pause. Could you turn around?

Consumer, you would like to read from this topic and you happen to know that your leader is Ben. Could you go over to Ben and ask him what he's got.

Ron: Hi, Ben, what have you got?

Ben: These messages.

Ron: Can I get from him too?

Berglund: No. Which messages can Ben expose?

Ron: Only the one he has committed.

Berglund: The High Water Mark is right here. Offset 0 and 1, he can't expose number 2. He's got it. It's in his log. He's not going to not write it. It hasn't been replicated because we had a little GC problem over here. Consumer, you can't take them. You can make a copy of them. You just write them down.

Ben: It's not destructive?

Berglund: It's not destructive. If it were a legacy message queue then it would do that. There you go. Broker 3, wait, back up. Look, you have message 2. Consumer, you're going to go ask Ben for more messages, and now Ben can tell you about the cat.

Ron: You have more messages?

Ben: Yes, you can have a cat now.

Berglund: Look at that. If we spend all our time on this, there are actually some really interesting failure scenarios of times things can break in ways, everything that can go wrong.

Consumer Groups - Requirements

Let's talk about consumer groups and how this works. Remember consumer groups. When you've got more than one application trying to consume from a stream, you need to be able to scale that computation out, depending on what it is. It's going to have to happen. Rather than replicas, you guys are going to become partitions. We need to be able to assign partitions as consumers enter and leave the group. We want the clients to decide how those partitions get assigned, because we want brokers to be as stupid as possible. That is an overriding priority in Kafka. We want brokers to be the dumbest things they can possibly be. That's actually a very good thing. Brokers don't do much. That allows Kafka to be what it is.

We want to be able to scale to on the order of 10 to the 4, individual consumer groups, and 10 to the 2 consumers in a topic. That's a lot of consumption that we can do. We want this to be strongly consistent so that the group always knows once a decision is made, that decision is strongly consistent across the group. That has some unfortunate side effects, historically, about pausing consumption while this stuff gets figured out to make it strongly consistent. We can't have people reading while you're trying to decide what partition goes where.

Who are the persons in this drama? There are the consumers. There's the special consumer called the controller. One of those consumers is the leader of the group. There's also a controller of the consumer group, and that's one of the brokers that controls that thing.

Here are our protocol operations: FindCoordinator, JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. Without walking you through each one of these steps, I'm going to try to do the acting out of this.

I have three partitions, and I need one more consumer. I need one more person. You, sir, are now a consumer, if you would rename yourself. I need another consumer, and broker, broker, broker. Instead of leader and follower, if you could write P0, P1, and P2, you're partitions of a topic. You are going to be a consumer. You just want to write consumer and label yourself.

I have six partitions. You're also P3, you're also P4, and you're also P5. We start counting at 0 because we're computer scientists. If you two could turn around. You're not on yet, and you are. We don't know any of that, that's gone. It's a brand new day. As far as data goes, I'll just feed you your data, just to keep this simple.

You, wake up. You're alive for the first time. You're a new JVM process. You've got your bootstrap broker list, which is a list of addresses of brokers that are out there in the world. To be a Kafka application, you don't need to know all of the nodes in a cluster. That would be silly. You need to know a few. You have two or three well-known names that are going to be there in the cluster, and the rest of them you don't care. You've got your bootstrap list Mr. Consumer. One of them is Broker 3 here. You wake up and as a part of your protocol, you run an operation called FindCoordinator.

You could send it to your bootstrap broker, so you ask Broker 3. Say, "I would like to find my group." What you do is you pass him a thing called your group ID. That's the name of the application. Every consumer to configure it, you will not start if you don't give it a group ID. Your group ID is QCon. That's your group ID. You can stick that right in front of this one. You wake up. You talk to your bootstrap broker, which is Broker 3. You say, "I'm QCon".

Ron: I'm Qcon.

Berglund: "Nice to meet you." You don't know what to do yet. It's ok. I'm about to tell you. What you're going to do is you've got this internal topic called consumer_offsets. It's keyed by that. You're going to go to that consumer_offsets topic and look at that key, and automatically, this is this nice way that Kafka automatically gives you, poor bootstrap broker, of knowing who to send him to. That partition, that that key is going to get written to in that magical internal topic, has a leader somewhere. Remember, I'm ZooKeeper, I happen to know, again, it's Ben. I pick on my coworker. You're going to say, "You want to talk to Broker 2." Go ahead and tell him.

Raul: You want to talk to Broker 2.

Berglund: It was quiet, but he did say it. You can go back. You know that now, you've got that state. Now you want to say, "I know who to talk to. I'm going to join my group." Same thing, you want to say, "I'm going to go ahead and join my group now." What do I know? I know my group ID. I'm brand new. I don't know my member ID. I know I'm a consumer. I know a few other little bits of metadata. Now you can go to Ben, and you could say, "I would like to join the group." Level 3 consumer looking for group. You can go over to Ben and say you'd like to join.

Ron: Ben, I want to join the group.

Ben: Of course you can.

Berglund: Ben, you say, "You're the first one to join this group. You're the leader".

Ben: You're the leader.

Berglund: Send him back with a member ID. We're going to say you're member, write M=1. We'll say your member ID is 1. Now what you're going to do is you are going to ratify that. You're going to do a call called SyncConsumer. This is the next step. Basically, you go back to Ben and you say, "Thank you. I'm good. I got that. I am Member 1 of group QCon. I'm off and running." At that point, the consumer group is ready to go. You can start consuming messages. You don't need to actually consume any messages. I want to speed things up a little bit here. You have now properly joined the group. Something you have to do every few seconds is you have to go back over to Ben. Say, "Ben, I'm ok." Which I think is a little clingy. You know people like that. Some people would accuse me of being like that. I think that's unfair. You every once in a while have to do that.

Another thing Ron has to do at this point, since he's synced, Ben told him about the partitions in the group. Ron has to make a decision about what partitions he's going to consume. What partitions are there here? 0, 1, 2, 3, 4, 5, six of them. Ron is one consumer. Guess what he's got? All of them. That's pretty easy. He knows what's happening. When you're consuming, you're going to consume from all those partitions.

If you would turn around and wake up. It's a brand new day. You're alive for the first time. Go find your coordinator. Your bootstrap broker is Broker 1. Why don't you go over to Broker 1, tell Broker 1 you want to find your coordinator. Broker 1, introduce him to Ben, please.

Participant 1: This is Ben.

Berglund: Now you join. You say, "I would like to join the group".

Lars: I would like to join the group.

Ben: Welcome to the group.

Berglund: Welcome to the group. Here's your member ID. You're now a member of this group. You can come back over. When is poor Ron going to know that this happened? This seems important. He doesn't have any partitions. He needs partitions. Ron, give Ben a Heartbeat.

Ron: [inaudible 00:51:02].

Ben: Hold on a minute.

Berglund: "Hold on a minute," says Ben. Somebody joined your group. You need to go back and do a rebalance. Ron, go back, you can do a rebalance. What Ron is going to do now is rejoin the group. You don't have to do that. That he learns about that at the Heartbeat stage, that somebody else has joined the group. If you turn around, you would do the same thing. You'd find your controller. You'd join. Then the next Heartbeat that Ron does, as the group leader, he's going to know, "I have to do a rebalance. Let me rejoin." Ron will decide their partition assignments because he is the lead consumer and he has that power. They get them at their next Heartbeat. That's the simple version.

Conclusion

We saw them. We saw a replication. We saw a leader election. We saw consumer groups. If you want to know more about stuff like this, and Kafka, and listen to podcasts, and tutorials, and example code, go to the URL, that QR code.

See more presentations with transcripts

Recorded at:

May 05, 2020

Tim Berglund

InfoQ Software Architects' Newsletter