Wes Reisz, Matthew Clark, Gwen Shapira, and Ian Thomas discuss the evolution of event-driven architectures over the decades, the advantages that EDA offers, and thoughts for the future.
Key Takeaways
- Most companies adopt an event-driven architecture as part of their evolution from a monolith to microservices, driven by a need for the scaling that EDA provides.
- Having well-defined boundaries in your system is very important for EDA. Events are used to communicate outside your domain boundary.
- Idempotency of events is critical. In production, you don't want duplicate data or transactions to occur. It also facilitates testing, as you can replay events into a staging environment.
- A good rule of thumb is to use orchestration when you're coordinating events within your bounded context, and to use event-driven choreography for interactions across domains.
- Properly configuring Kafka and your topics takes some effort. When possible, make reversible decisions, so you can adjust your design if necessary.
Transcript
Background
Reisz: My name is Wes Reisz. I'm a platform architect with VMware, working on Tanzu. I chair the QCon San Francisco software conference, and I'm lucky enough to be one of the co-hosts for the InfoQ podcast.
Gwen, what I'd like you to do is introduce yourself, and maybe talk a little bit about the systems you build. Then, how did you land on event driven? What brought you there?
Shapira: Basically, I'm a software engineer, a principal engineer at Confluent. I lead the cloud native Kafka team. We're running Kafka as a service at large scale for our customers. Before that, I was an engineer and a committer on Apache Kafka.
How Confluent Landed on Event-Driven Architecture
We landed on event-driven, basically, after we tried everything else. Not exactly like that. I spent a lot of time with our customers who were already using Kafka for event driven. I got to learn the patterns with my customers and how they solved problems: what problems it solved, what it created. Then when I started managing Kafka in the cloud, we found ourselves with a monolith, and we knew we had to solve it. You always start with a monolith. They're very fast to write. We knew we wanted something better, and we had a bunch of different options. The thing that really got us to event driven was the fact that it looked like it would allow us to avoid finger pointing between teams, because everything is through events. It's recorded forever. You can actually see what messages were sent, and reconstruct the whole logical flow of the system, if needed, on a staging environment. If you saw something in production that wasn't what you expected, you could actually take the entire topic of events and see what happens in another system. For us, it was huge. It's not like, is this my responsibility or your responsibility. We got to a really well-defined: this is what you own, these are the events you react to, and we can take it from there.
Reisz: That brings up a lot of other questions, though, on how things actually react to all those events and how they're choreographed. Ian, what about you?
Background, and How PokerStars Sports Landed on Event-Driven Architecture
Thomas: I am a senior principal engineer, working for Flutter International, which is the current incarnation of a job I started seven years ago, working for Sky Bet. I'm in the betting and gaming industry. Over the years, I've worked on Sky Bet, Betstars, and now latterly PokerStars Sports. I've got a few different angles from an event-driven point of view. One of the things that I'd be quite interested to learn more about myself is how PokerStars has grown over the years, because that's probably one of the biggest real-time event-driven systems in existence.
I joined Sky Bet back in 2014. The main thing that we'd adopted then was a pattern to take data out of a monolithic system, a massive Informix database, and spread it out to engineering teams within the organization to allow them to have control of the data, and then they could build frontends that would scale. Since then, I've worked on various other incarnations of systems, including some backed by Kafka that have been quite successful, looking at how we can actually use it to manage state in its own right, which has been a really interesting journey. Quite a lot of different angles. One of the things that I've been working on recently is looking at how we use real-time events across our frontends, and taking our poker heritage and bringing that to sports betting and gaming.
Background, and How BBC Landed on Event-Driven Architecture
Clark: I'm Matthew. I'm head of architecture at the BBC. I'm sure everyone knows the BBC. We have dozens of websites and apps, and with that, hundreds of services under the hood. It's quite a broad range of things. It's quite fun to keep on top of, but there's lots of microservice thinking and cloud-based thinking, so event-based architectures have to fall into that. It's not a dogmatic thing. It's not that we use that everywhere. A lot of the time request-based is a better solution. There are always these pros and cons. Event-based has to play a part where it has so many advantages. Fundamentally, if you have something like a search engine or a recommendation engine, it isn't going to fill itself. You need those events to come in and populate it, so it becomes a good service.
Importance of Knowing the Domain Model When Working With an Event-Driven System
Reisz: One of the first questions that I wanted to start off with is maybe just some things you didn't expect when you went into an event-driven system, some things that maybe caught you by surprise early on in your journey. I'll give you one example from my own viewpoint. I found that when I used event-driven systems, it was a little bit hard: I had to really know the domain extremely well before I got involved, to really understand the choreography that was happening. Gwen, you talked a bit about choreography and orchestration. What is the importance of really knowing the domain model, for example, when you're working with an event-driven system?
Shapira: Especially as an architect who tries to advise other teams, you also have to know what you don't know. A lot of your job is to draw the boundaries of this thing and say, this is what you own, and don't step outside. If you want to do something outside, you send a message and someone else will own it. Trust them to do the right things; they own their domain. It is funny how the culture and the architecture work together, because if you try to write an orchestrated system versus a choreographed one, you actually have to know everyone's logic. You are the one who's like, I'll call this and this will happen. Then we'll call the other thing. If this fails, I have to call this other thing. I feel like, in many ways, a culture of choreography means that you're an expert in your domain, you define the boundaries. Then you don't have to worry about other domains. There will be other experts, and you can trust them for that. I think it's a good company culture.
The Surprises with an Event-Driven System
Reisz: Ian, what are some of the things that surprised you when you started working with event-driven systems from maybe a more classic, monolithic type system?
Thomas: I think the big one that seems to come up time and again is moving from this idea of something being synchronous to something having the time axis to consider as well. Especially when you've got potentially disparate data sources, or different disparate producers of data, and thinking about, is this actually happening before that? How do I handle this? Then, moving on from that to thinking, what happens if I see this event twice? What happens if I never saw it? How do I reconcile my consistency over time? You can see that all over the place if you look at it just in terms of people moving from synchronous to asynchronous programming models, just within a monolith. You've got similar situations. When that's also distributed across different systems, and you've got to work out, how do I go and inspect that data, or how do I see when this thing happened in another system, or play back a log? That's quite challenging. I'd say yes, probably the time element.
Reisz: Matthew, any thoughts?
Clark: Yes, I agree with what was said. Understanding what state things are at, and whether you've lost something, whether you've got a race condition, these things get seriously hard, seriously gritty, definitely. We talk about how stateless is a wonderful paradigm. You get that with serverless functions. You don't need to care, you just worry about the current moment. Whereas in a world where you're event driven, and you have your microservice, it's got an awful lot of state. It's received a lot of events. If you've lost some, you're in trouble. It might have to pass them on to something else. What happens if that fails, or needs a redeployment or something? Suddenly you look at this and go, this isn't a trivial problem. This isn't the dream. When I moved from that classic REST API that I was very happy with, which was very simple, suddenly this isn't the panacea, is it? It's got all kinds of challenges.
How to Deal with Unordered Events
Reisz: One of the questions that was asked is how to deal with things like those unordered events. Ian, you talked a little bit about having to deal with different events that may come in at different times. How do you deal with the idea that an event may not necessarily show up in the expected, synchronous order? How do you deal with something like that?
Thomas: For us, when we were looking at this, the most important place it came up was in bet placement, which is, someone's actually spending some money with you. The key word that gets tossed around is idempotence: making sure that your events can be replayed without severe consequences, especially financial ones. It's a case of education really, understanding that it's a possibility, and designing the system with that in mind, as with most things. We have lots of things that we have to think about in terms of, if we've got this event multiple times, how do we discard things? If we haven't seen it, how do we play back or push new events into the system to try and get the consistency correct? One of the biggest pushbacks that we had from some of our operations people was whether this is right or wrong to do in production. You make up your own mind. If you've got a database and your data is inconsistent, then you at least have the ability to go in and tweak it. You can run some SQL commands: "I can fix this." When you're relying on an event log being played back, you've got to think about, "What's that control plane like? What's my method of getting myself back into a good state?"
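To make the idempotence point concrete, here is a minimal Java sketch of a deduplicating consumer. The event type, ID field, and settleBet method are hypothetical; a real system would keep the seen-ID set in a durable store (for example, a database unique constraint) rather than in memory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent-consumer sketch: deduplicate on a unique event ID before
// applying side effects, so replays and redeliveries are harmless.
public class BetPlacementConsumer {

    // Hypothetical event shape: a unique ID plus a payload.
    record BetPlacedEvent(String eventId, String betSlip) {}

    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    public void onEvent(BetPlacedEvent event) {
        // add() returns false if the ID was already present, i.e. a duplicate.
        if (!processedEventIds.add(event.eventId())) {
            return; // already applied, safe to ignore
        }
        settleBet(event); // the actual side effect runs at most once per ID
    }

    private void settleBet(BetPlacedEvent event) {
        // business logic goes here (hypothetical)
    }
}
```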
Getting Back Into a Good, Known State
Reisz: Gwen, what do you suggest on having people think about to get yourself back into a good known state?
Shapira: I am a big believer in making sure everything is idempotent. You go a bit further back and trust that if you replay, it will not get you into a worse state. In my mind, the biggest blocker to really doing async events is not that async events are that hard, it's that people have not deep down accepted that this is the only way. Synchronous, scalable, and high performance: you're not going to get all three, basically. It can be synchronous and high performance, but it does not scale. You can be synchronous and try to scale, but you'll have very large queues; it's not going to be very performant. If you want something that's performant and scales, you have to be async. Once you start going, I have to do it, then, really, is it that hard to have an idempotent event? It's usually not that hard. It's just that you have to accept: I'm in a new world, and I'm not trying to recreate my old world with new tools. I'm actually in a new world now.
Choreography vs. Well-Defined Orchestration
Reisz: Nandip asked a question around well-defined business processes. What I read when I see this is choreography versus orchestration, back to what we were talking about. Is it always the case that everything should be choreography, or are there cases when we need that well-defined orchestration that has individual steps? Matthew?
Clark: There's never one right answer. We've gone a bit of both ways. Sometimes you can operate with an orchestration setup, other times not. To pick up on what we were saying before: assume that you will at some point get replayed. No matter what you use, you are going to find bugs in your event-driven messages, for example, where you need to replay things. Even if your technology is very good at emitting the right things at the right place and guaranteeing at-least-once delivery, you are going to have to handle that repetition of content at some point, because it's just going to be part of what you do.
Reisz: Ian, Gwen, any thoughts?
Thomas: I really like Yan Cui's compromise on this one, which is that within a bounded context, orchestration is probably the right thing to do. When you're looking at communication between different contexts, that's when event-driven choreography really comes into play, and it's powerful then. It's still not a complete slam dunk, of course. I think that's probably a really good starting point for a definition.
Reisz: That's a good one. That's exactly what was in my mind too, Yan Cui. He's got a great blog post out there that dives into this, if you want a little bit more about the differences between the two.
Separating Events and Creating Topics on a Kafka Architecture
Gwen, there's a question here about separating events, and how you really start to think about your topics. When someone comes up to you and asks about separating events and creating topics on a Kafka architecture, how do you talk to them about that? What do you tell them to think about? What do you tell them to consider?
Shapira: It's interesting, because I used to answer those questions for databases: what should be in this table, and what should be separate dimensions. It just feels like the same thing keeps coming back. First of all, getting a good, very old school book on data modeling basically never hurts; modeling is modeling. You have the domain-driven design book. Then, on the other hand, you have one of the old school data warehouse modeling or data modeling systems. The thing that you want to take into account in Kafka is a bit of the scaling requirements. That's the thing that it does slightly differently. If some event is just super common, then you will probably want to separate the main measurement and metrics topics from things that are slightly more infrequent, because they will probably be processed separately, and you'll want to react to them on different timelines.
The other important thing is really the ordering guarantees, which is something that doesn't come up in databases. If stuff is in different topics, then you have no control over what order it's in. It could be processed in any order, and you need to be ok with that. If you want things to be in one order, you put them on the same topic, on the same partition, and you have full ordering right there.
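For illustration, a short Java producer sketch of that guarantee, using the standard Kafka client; the topic, key, and values are invented. Because both records share a key, they hash to the same partition, and consumers see them in the order they were sent.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true"); // keeps ordering intact under retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition => these two events are consumed in order.
            // Events with other keys, or on other topics, may interleave arbitrarily.
            String accountId = "account-42";
            producer.send(new ProducerRecord<>("account-events", accountId, "deposited"));
            producer.send(new ProducerRecord<>("account-events", accountId, "withdrawn"));
        }
    }
}
```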
Then a lot of it is just business logic. I saw a question passing by about how big an event should be. It's like how big a function should be. If it gets overly big, it's probably a smell. At the end of the day, do you have good boundaries for your model? Is an event something that is a real world event in your business? Does it align to some business thing that is going on? That's the main consideration. You don't want to artificially chop things up in different ways.
Thomas: We used to have quite a lot of conversations with engineers who were looking specifically at Kafka and Kafka Streams, and understanding how their topic design affected their streams app, because there are quite a lot of long-term implications, specifically if you're using it for storing state, and compacted topics. People were getting the wrong number of partitions set up from the beginning.
Shapira: One thing that I caution people about, externally and also internally with my cloud managers: you don't want to turn temporary limitations into a religion. If you think something is the right business thing to do, but you have to make a compromise because technology forces a compromise, you want to very clearly document, "We wanted to do X but it was actually impossible." Because otherwise you don't know; maybe a year from now X will be possible and you can go back to it. For example, Kafka used to have a limited number of partitions. That limit is long gone, and it's in the process of being even more gone. People designed an entire ideology around it, and it's very hard to tell whether you do it because it's the right thing, or because you believe in limitations that actually no longer exist.
Making Reversible Design Decisions
Reisz: Ian, I want you to double-click on that for a minute. You said long-term implications of your topic design, like what? Describe that a little bit more?
Thomas: Sometimes it's a bit of naivety in terms of thinking how easy it is to change things after the fact, and looking at the throughput that you might need. That ties to something Gwen touched on about the size of an event: if your events get too big, there are issues with the replication model that you want to have, and how much traffic you're going to be sending between brokers. The main thing for us was that if we were holding our state in a compacted topic, and then suddenly realized, hold on, we don't have enough partitions to support the throughput that we've now got as this has grown, all of those previous events will be on the wrong partition if you try and widen out. You've got to play through with people, how are you actually intending to scale this up if you need to in the future? Are you aware of what the constraints are of your choice now?
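A small sketch of why widening breaks in-place state, assuming Kafka's default murmur2-based partitioner; the key and partition counts are invented:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class RepartitionEffect {
    public static void main(String[] args) {
        // Kafka's default partitioner: toPositive(murmur2(keyBytes)) % numPartitions.
        // Widening the topic changes the mapping, so records already compacted
        // under the old partition count sit where new writes for the key won't go.
        byte[] key = "customer-123".getBytes(StandardCharsets.UTF_8); // invented key
        int with6  = Utils.toPositive(Utils.murmur2(key)) % 6;
        int with12 = Utils.toPositive(Utils.murmur2(key)) % 12;
        System.out.printf("6 partitions -> p%d, 12 partitions -> p%d%n", with6, with12);
    }
}
```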
We tend to try and model things in that Amazon type-1, type-2 framing: is this something you can just do for now and not worry about, because it will change easily in the future? That's one of those ones where, if you don't have enough understanding of how the system or the actual technology you're working with works, you can turn a type-2 thing into a type-1 quite easily without really meaning to. It's making sure people are aware that this is a constraint: just keep it in mind when you're designing your system and how you're putting your data through this technology.
Clark: Indeed, I find that even with the great type-1, type-2 idea, is this a reversible decision, that's one of the challenges I do have with event-driven architectures per se. It can lock you into things that are hard to change later. Once you've got multiple clients that are consuming your events, changing that event format becomes a really tricky thing to do. You hope that you can add new fields to your JSON or whatever without your clients caring, but it always still feels a very nervous thing to do. I don't think we've quite worked out how you handle that one.
Shapira: There's an entire book on that. The Greg Young book.
Reisz: Let's talk about that, Gwen, because there were some questions that came up. How do you address problems like that?
Gwen, you talked about a book?
Shapira: Yes. It was by Greg Young. He wrote an entire book on event versioning, which just goes to show that it's not an easy problem, and I'm not going to solve it for you in five minutes right now. Kafka is well known for being fanatical about stability in its own protocol. You can take a 0.8 broker and a 3.0 producer and a 1.0 consumer, and just have it all work. It comes at a cost, which is that you evolve things incredibly slowly. Every client and every application has a big: if you get events of version 1, do this; if you get events of version 2, do that. It's highly non-magical.
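One common way to contain that version branching, sketched here in Java with hypothetical event shapes, is to upcast old versions to the current shape at the edge, so the business logic only ever sees the latest version:

```java
// Sketch of event upcasting: convert old event versions to the current shape
// before business logic runs. All event shapes here are hypothetical.
public class Upcaster {
    record OrderV1(String orderId, long amountPence) {}
    record OrderV2(String orderId, String currency, long amountMinorUnits) {}

    static OrderV2 upcast(Object event) {
        if (event instanceof OrderV1 v1) {
            // V1 predates multi-currency support; assume the historical default.
            return new OrderV2(v1.orderId(), "GBP", v1.amountPence());
        }
        return (OrderV2) event; // already the current version
    }
}
```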
Day-2 Concerns with Event-Driven Systems
Reisz: This morning, Katharina Probst in her keynote mentioned a bunch of day-2 operations things for microservices. She listed some things like load testing, chaos engineering, AIOps, monitoring. When we talk about event-driven systems, what are some day-2 concerns that you need to be thinking about? You talked about versioning, for example. What are some things that you need to be thinking about that you maybe don't really consider right off the bat?
Clark: A couple that come to mind; scale is definitely one of them. What happens if a large number of events get republished? If you're a microservice owner, you might find that one of your upstream publishers is suddenly republishing things for whatever reason. Maybe they had a bug or something. You need to be able to handle that. Or, at the very least, you probably have a queue in front of you from which you can handle that backlog, and you do not want that backlog to last a particularly long time. You have an interesting scale challenge, where from nowhere all this traffic can come from those events. Then there's the fact that you're storing that state: how are you storing it? What happens if you redeploy yourself? Are you making sure that you're not dropping anything during those moments?
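A concrete day-2 guard for that backlog scenario is alerting on consumer lag. Here is a rough Java sketch using the Kafka AdminClient; the group ID and bootstrap address are placeholders:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Lag = latest offset in each partition minus the group's committed offset.
// A lag that keeps growing means the consumer can't absorb the backlog.
public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets("my-service") // hypothetical group ID
                .partitionsToOffsetAndMetadata().get();
            for (var e : committed.entrySet()) {
                long end = admin.listOffsets(Map.of(e.getKey(), OffsetSpec.latest()))
                                .partitionResult(e.getKey()).get().offset();
                System.out.println(e.getKey() + " lag=" + (end - e.getValue().offset()));
            }
        }
    }
}
```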
Reisz: Ian, what are your thoughts?
Thomas: I completely agree with both of those. Perhaps one of the ones that I've noticed over time is past day-2, more like day-600, when the people who built the system have moved on, and there's the fear of new people coming in and trying to work out how this thing works, not being able to change things, and being worried. A lot of it comes back to what you mentioned at the start: the domain, and how it is documented. How are people able to change things? What's it like to actually come in cold and try and adopt this system, and evolve it to suit the current needs of the company?
Reisz: Gwen, any thoughts from you?
Shapira: Yes, I feel like my day-2 items are things we should have done on day-0. A test framework: you have all those microservices, and you're going to upgrade them independently. People have talked about building confidence, so when you make a change, you want a test framework that, A, will not take too long to run, maybe an hour or two, but not that much longer; B, will mostly reliably pass, with a few green builds every single day; and C, is fairly easy to use, evolve, and diagnose. I discovered on day-2 that upgrades and releases are actually hard because we don't really have a great test framework, and now we have to basically stop a bunch of production projects and go back to the drawing board. We have 50, 60 services, and we're not even that large. How do we actually test scenarios that involve all of them, to be confident that we did not break anything else?
Monitoring and Observability of Event-Driven Systems
Reisz: There are a bunch of questions here around observability, monitoring, and things like that. I want to shift over and give each of you an opportunity to talk a little bit about the importance of observability and monitoring in event-driven systems, and any tools. Ian, when we were trading some emails, you talked about day-2 ideas of actually building some monitoring tools into what you're working with. I'd love to hear some tips, tricks, and thoughts from each of you on monitoring an event-driven system.
Thomas: Some of the things that gave us the most value were adding tracing, to be able to see the lifetime of messages and records as they go through various parts of the system. That, coupled with tools like Kibana, can be really powerful to understand exactly how things are moving through each app. One of the questions that we constantly got asked was, have we seen this? One of the things you don't always have the luxury of is being the producer that's the source of events. For us, we take a lot of data from third-party suppliers that have scouts at football matches publishing updates, and we often just don't know, has this happened? We should have seen that the score on this football match has reached 3-0, or whatever, but we don't have that state, so what events have we seen, and in what order? We built some tooling that allowed us to really quickly dive onto a production box and play back some events.
The thing that always tripped us up, before we spent the time to build internal tooling around this, was that we wanted TLS between the broker and the clients. This was Kafka specific. That enforced our ACLs, so you might not have permission to see a certain topic; you've got to think about what you're doing there. If you do have some debug facility, make sure you're not going to be messing with your actual production consumers, so that they're not getting knocked around all over the place, and make sure you're considering how it will actually affect a working system. Then, if ever we needed to extract data; everyone's systems are different, but we had multiple levels of jump hosts to get to our actual Kafka brokers. Then you're thinking about, how do I actually extract useful information from this in a way that I can take away and triage in a PIR, or something like that? It comes down to, you don't really find out your requirements until you need them, so make sure you've put the time aside and put the effort in place to build the things that you need.
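A minimal sketch of the tracing idea Ian opens with: carry a trace ID in Kafka record headers so each service can log it as the message passes through. In practice you would likely lean on OpenTelemetry's Kafka instrumentation rather than hand-rolling this; the names here are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class Tracing {
    // Producer side: stamp every outgoing record with a trace ID header.
    static ProducerRecord<String, String> withTrace(String topic, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers().add("trace-id",
            UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
        return record;
    }

    // Consumer side: pull the ID back out and include it in every log line,
    // so a log aggregator can reconstruct the message's journey end to end.
    static String traceId(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("trace-id");
        return h == null ? "none" : new String(h.value(), StandardCharsets.UTF_8);
    }
}
```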
Reisz: Your mileage may vary. Absolutely. Matthew, anything that you all learned on the observability front that might be some good advice for other folks?
Clark: As Ian says, tracing is really good, isn't it? We do a lot with Amazon X-Ray and it works very well. Then individually, at each microservice level, it's getting the logging right so you can diagnose where there are issues. As long as you've got some broker in between each microservice, be it Kafka, or Kinesis, or whatever, then you hopefully can discover and isolate the one microservice that's letting you down, and address it as quickly as you can.
Reisz: Gwen, anything from you?
Shapira: The only thing that I have to add is maybe the idea of sampling: you can have an external system that samples some of the events, especially if everything that's going on is very high scale, and then double-checks them in the background for outliers, and that nothing unexpected is happening, like events being overly large. Ian just spoke to it: you know what the shape of your data should be. That's how you detect, we should have seen this and it's not here, kind of thing. We also know that we should not expect that many authorization attempts a second; if we get that, probably something went terribly wrong. We have built this system that goes in the background and double checks some rules on samples. I think that has served us quite well.
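A rough sketch of that background sampling checker in Java; the topic, sample rate, and size rule are invented, and the checker runs in its own consumer group so it cannot disturb the real consumers:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Background checker: sample ~1% of events and apply cheap shape rules.
public class SamplingChecker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sampling-checker"); // separate group: no impact on prod consumers
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("auth-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    if (ThreadLocalRandom.current().nextDouble() > 0.01) continue; // ~1% sample
                    if (r.value().length() > 100_000) { // rule: events should not be overly large
                        System.err.println("Oversized event at " + r.topic() + "/" + r.partition()
                            + "@" + r.offset() + ": " + r.value().length() + " bytes");
                    }
                }
            }
        }
    }
}
```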
Lessons Learned Through War Stories
Reisz: Gwen, I'm going to start with you on this one, because I think you already mentioned one. There were a lot of requests for different war stories that led you to some different lessons. What are some lessons that you learned the hard way through some war stories? Tell us about the war story, and maybe the lesson?
Shapira: I think that's the one that relates to the versioning discussion from earlier. We basically wanted to upgrade a lot of things. We had about 1000 instances of a single service type, and we wanted to just upgrade them. It's a stateless service, which makes it easy. We just pushed about 1000 upgrade events through our pipeline, hoping they would all get processed, and 997 of them managed, over time, to upgrade themselves, and 3 wouldn't. We couldn't even really see why. The event was getting there, everything looked fine. We had traces. We had logs everywhere. Eventually, we discovered that those were our three oldest services, basically, the first three customers we ever had, dating back to 2017. They had a different authorization key that prevented them from downloading the things they needed in order to upgrade themselves. Nobody even remembered exactly how the key got there. Apparently, it was a different type of event. It was just three, so we ended up brute-forcing them. That's the kind of thing: even if you're very careful about evolution, one step at a time you evolve away into a system that is totally incompatible with whatever happened in 2017, which nobody even remembers. I think the main lesson here is just don't have anything that is that old. Everything has to be upgraded every three months, six months, maybe a bit longer if you have less churn in the projects you work on.
Reisz: Ian, what about you, tell us some war stories?
Thomas: I've got a couple that sprang to mind as you asked that. They're both from a few years ago, so I don't think I'm hurting anybody's feelings by saying these. One of them was precisely about the size of events, or rather the size of things linked to an event. On Sky Bet, one of the ways that pages are built is that this flow of information out of Informix goes through various RabbitMQs, is processed by Node, and eventually gets stored in Mongo documents. Because of the way the updates happened, we tended to read the document from Mongo, work out what the update meant for the document, and then write it back. There was a bug in that logic that meant we didn't ever really delete stuff from the Mongo document. Because it was a homepage, I think the horse racing homepage on the site, it just gradually got bigger, and bigger, and bigger. It wasn't obvious straightaway, but when the site went down, unfortunately on Boxing Day, which is a pretty big day for sports betting in the UK, all these things were going wrong and we couldn't work out why. It was basically because we had saturated our network by pulling this document in and out of Mongo so frequently that we couldn't handle it anymore. That was a pretty interesting day.
The other one that I can think of that was quite difficult to work out, and probably speaks to something around best practices when working with Kafka, was this really weird situation where we had two producers, which is a smell straight away. We had two producers writing to a topic, but records with the same key were ending up on different partitions. The long and short of it was that one of them was a Node app and the other one was written in Kotlin. The way the key was used, and the data type used to produce the actual partition hash, meant that when the integer was used in Kotlin, it overflowed. It was actually producing a different hash to the Node.js one. That was quite a day, looking for it.
Shapira: How did you find it?
Thomas: I can't remember. It was a few years ago now. We were just literally going line by line in these programs, like what is different? The only thing that we ended up concluding was this one is Node and that one's basically the JVM. What could possibly be different in the implementations? It was just a number.
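For reference, the Java client derives the partition from a murmur2 hash of the key bytes, computed in 32-bit integer arithmetic with the sign bit masked off. Any producer in another language has to reproduce that exactly, overflow included, or same-keyed records diverge, which is essentially the bug described above. A sketch of the reference computation, with an invented key and partition count:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class ReferencePartition {
    public static void main(String[] args) {
        byte[] key = "match-1234".getBytes(StandardCharsets.UTF_8); // hypothetical key
        int hash = Utils.murmur2(key);               // 32-bit arithmetic; may be negative
        int partition = Utils.toPositive(hash) % 12; // sign bit masked, then modulo
        System.out.println("expected partition: " + partition);
        // A cross-language contract test asserting this value against each
        // producer's own partitioner would have caught the mismatch early.
    }
}
```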
Day-2 Advice on Operating an Event-Driven System and Things That Aren't Great For Event-Driven Systems
Reisz: I wanted to focus on day-2. If you could sit down with someone and give them one piece of advice on what to think about for day-2, or long-term operation of an event-driven system, what might you suggest? We've been talking mostly about Kafka, but it doesn't necessarily have to be Kafka. What might you suggest to them?
Clark: Do your very best to keep things as simple as you possibly can, because it is extraordinary just how complicated these things get. The story I would have told if we had time was about the moment we had all these different systems emitting all these different events, and thought, wouldn't it be great if we standardized the events and put them all together, and made this one super topic of all the events? Of course, that was a terrible idea, because they all have their different properties, scale in different ways, and are needed in different ways. Just like the microservice concept: keep things separate, keep things simple. Don't just assume event-driven is the answer, because it's a great solution but it's not always the right one. Just be aware, it might not be as simple as it looks at first.
Reisz: What are some systems that maybe aren't the best for event-driven systems? Do you have any thoughts on that, Matthew?
Clark: Fundamentally, if yours is a user-facing thing, it ends with a request, doesn't it? A user turning up going, give me a thing. At some point, your event has to turn into a request. It's all about working out where that is. At the BBC, we actually prefer to do quite a lot of request-based handling when the user comes in, so we can respond to who they are. We want to be dynamic in that regard. That's one example: you cannot realistically prepare it ahead of time, because you want to respond to the moment.
Reisz: Ian, what are some things that aren't great event-driven systems? Then, what is your recommendation for someone for day-2?
Thomas: Things that aren't great? I think one nice way to think about it is: if you've got a workflow, and you want to be able to identify all the steps in that workflow and keep an eye on it as a deliberate entity, that's quite a nice case for orchestration, rather than event-driven.
My advice is similar, don't force fit it where you don't need it, but also be quite deliberate in designing your data to allow it to evolve. Keep in mind the way that you are choosing to implement it. Do you need SNS or SQS or Kinesis? Think about the constraints of the actual broker and systems you're using and design for them, rather than against them.
Shapira: When not to use event driven? I would almost say, start with Node, and look for places where you need this level of reliability, this ability to replay, really strong decoupling, really large scale. Basically, keep an eye on when you'll need event-driven rather than starting there, because I do feel like it adds a layer of complexity, and maybe you will never get there, who knows? Maybe your startup will not be that successful.
In terms of day-2, I'll be slightly self-serving and say that you do have an option not to run Kafka yourself. It just removes a bunch of pain to hand it off to someone who is actually fairly excited and happy to take care of it. I think it's true in general, like we don't do our own monitoring. We have a bunch of third-party providers that do our monitoring for us. We don't run our own Kubernetes, we use AKS, EKS, GKE for all those. Yes, basically, it's nice to have things that you don't have to worry about every once in a while.