Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Cloud-Native Data Pipelines with Apache Kafka

Cloud-Native Data Pipelines with Apache Kafka



Gwen Shapira discusses how data engineering requirements changed in a cloud-native world, and how the solutions change with them. She shares architectural patterns that are commonly used to build cloud native data infrastructure, and how they help us build flexible, scalable and reliable pipelines to give our business visibility on all our data.


Gwen Shapira is a principal data architect at Confluent helping customers to achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating microservices, relational and big data technologies. She specializes in building real-time reliable data processing pipelines using Apache Kafka.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Shapira: I am going to talk about cloud-native data pipelines. I am Gwen Shapira, I'm an Apache Kafka committer, I worked on Kafka for the last four, five years or so, lots of different use cases, users, and customers. I worked for Confluent which is a company building data streams and event stream systems on top of Kafka. As you may guess, I really like Kafka, I really like distributed systems, I like microservices as this special subset of distributed systems, I like data in general, and I really love cats. This presentation will involve the best of microservices, distributed systems, data, and cats.

What Is A Cloud Native Application?

Because we are talking about cloud-native, you can't escape the question of what the hell is a cloud-native application? Does it even mean anything at all? In my team, at my day job, a large part of what we do is adopt Apache Kafka to run well in managed service as part of the Confluent cloud. Then, we have quite of an ongoing discussion, what is cloud-native enough? How cloud-native something can be. Is this cloud-native? Is that cloud-native? It's question from executives, from customers. We can't really go through the day without having some definition.

Usually, there is a bunch of buzzwords and terms around cloud-native and if you'll notice, this is kind of a subset, it's my 4 favorite out of 12. You'll see that none of them has absolutely nothing to do with the cloud, just general good, old, engineering best practices. Some of them look very different in the cloud, for example, take resilience. For the last 20 years in which I've been building software, we had to build resilient software but as long as it was my own data center, I could pretend stuff. I could pretend that if I only picked the right vendors and picked the right people and have enough process in place, my network will never ever break. My disks will never ever flake on me. It was a nice dream, but every once in a while the networks did break in an unexpected way. We yelled at the vendor, we fired the random person who headed the bit of process and we could go back to pretending again.

Suddenly, we're in the cloud where machines disappear, where the disks may have performance that changes widely, the network may or may not be reliable. It's very clear that there's absolutely nothing we can do about it, changing vendors is incredibly hard and it's not going to help us a lot anyway. Firing people is not going to change all that much, they are not responsible for the network anyway, that was the entire point. We already fired everyone who was responsible for the network when we first moved to the cloud.

Suddenly, we're, "There's nothing we can do. We have to write better software architectures. We tried everything else, we need to write software that is resilient to failures in the underlying layers, no choice." and similar thing for the CCT. We always wanted to scale up and down but it was always like, it's going to be nice but you know scale up means being able to run on more machines next quarter when the machines show up. We had time, it was never a big deal, never urgent. Suddenly we're in the cloud, we can get machines any time we want and there is a real business value in being elastic. Being elastic means that I use only the resources that I really need right now. I don't need to purchase ahead, it saves me real money.

Suddenly those ideas have more urgency. I am going to talk a lot about how our architectures evolve especially around resilience and elasticity which I think is the core of what clouds are about.

Cloud Native Architectures

One of the things that you will note in part of this debate, a lot of people talk about, "Is this system cloud-native by my database? My database is cloud-native. My database is more cloud-native than the other guy's database." Maybe, but none of it matters. The same way that in modern networks we use TCP, which is a very reliable network protocol built on top of very unreliable components below it, it's exactly the same way you're going to build cloud-native architectures on top of components that may or may not be cloud-native, that may or may not be elastic. They may or may not be resilient, it doesn't matter.

The architecture as a whole, the thing that you're building, can and has to be cloud-native, elastic, resilient, irrespective of the components that you're using. This is a sentiment that I think you already heard that I'm trying to remember which one of today's sessions mentioned it, but I remember looking at someone's slide and saying, "Oh my god. This is exactly what I have in my slide and it will look ridiculous." I think it was the second session of the day when he was talking about building cloud-native architectures, Monzo, the bank thing. Here also, we are building something resilient on top of something that isn't.

What do cloud-native architectures look like? Well, that's pretty easy, we all know. They look like this, they are made up of a lot of microservices. The reason they are made up of a lot of microservices is that you have individual components that are easier to scale up and down. You scale up and down just the parts you need. If something crashes it's easy to failover again just the parts you need, and they all look quite stateless. We forgot something, they also talk to each other. Normally they do it with REST APIs although I'm pretty sure someone here mentioned GRPC and there's few other methods as well. It doesn't matter, you have microservices, they talk to each other.

You take a good look at that and try to think what is this image missing? Because it's missing something, we totally forgot. Something, somewhere in your organization, there is someone with a different architecture diagram that looks totally different and it says my data architecture. It doesn't have any microservice on it, it doesn't have any application. It has a bunch of databases and ETA lines going between the databases and maybe some Hadoop somewhere on the site. If you're really unlucky, this diagram did not change since 1985, it has been known to happen. These segregations that you have the application architecture, and the data architecture, and they don't know about each other, they have zero overlap, and yet you're supposed to work together at the core of the problem.

Just think about it, the data engineering track is over in the other room. You're over here, clearly, something went wrong somewhere on the way because all your applications need data.

My thesis is that we're building different architectures for the cloud and in the same way we need to reconsider how we are treating data in the cloud. We cannot treat data the way we treat microservices. Microservices and especially the communication between microservices is about context are about boundaries. It's about really sharing the minimal information, the APIs, exposing only the things that you need to know. Having very deep classes with very small APIs is like the golden holy grail of great microservice design.

Data is the total opposite, data is about the unexpected. It's about mixing and matching, looking for interesting patterns. The more data you put and mix together, the more powerful your data systems become. There is a real gap to reconcile here and that's what I want to address.

Let's take a look at the simplest possible microservice architecture I could possibly come up with on the slide. We are selling something, refrigerators, whatever, and we have this order service. The web service will talk to that and create orders, orders have to be validated. Basic things, does the user exist? Does the price look responsible? Not too much discount for the sales guy, please. Do you buy no more than 15 refrigerators at once? This kind of thing and then you get a response, that's fairly typical.

Fraud Detection

Now let's say that someone shows up and tells you, "Hey, you're the owner of this. You'll need fraud detection." What do we do? You may be tempted to say "Well, obviously, fraud detection requires lots of data." If you think about it, fraud detection is about detecting abnormal patterns, you'll need to know what is normal. This means that you'll need to know the history of what orders look like and maybe the history of what this customer looks like and maybe what this specific product looks like and maybe also alerts from credit card companies about countries that have more danger or known exposure or whatever, target doing a leak, this kind of thing. You say, "Oh, ok, no problem. I have all the data in my data warehouse. Let's just build a fraud service that will do all that on top of my data warehouse." This is fine but note how this is completely out of bend. This is not part of the purchase process and this is what a lot of people do today.

A few months ago I got a call from Home Depot and they say, "Hey, did you buy a laptop?" I said, "No, I did not buy a laptop," and they said, "Well, call your bank. Your credit card has been stolen." This was three days after they got the order. I'm glad that they didn't actually charge my credit card but this is still something that they lost money, they may have already shipped the laptop to someone and lost an item. Doing this out of bend especially if it's a data warehouse that only fills up like once a week or once a day has quite a lot of risk to it.

The other thing you may be tempted to do is say, "Well, we want to do it online. Let's have the order service call the fraud service, have the fraud service call five other microservices that have all the information, put it all together, and come up with an answer." The problem with this is, first of all, you have to modify the order service. Maybe you don't want to do it. Who says you have to? Maybe the fraud service can be too slow which slows down the entire process.

You're starting to filter down a lot of dependencies and creating something that maybe you don't need to build something that complex. What we really want is to build smarter microservices that don't just know about the one transaction that you just created, not just about the current orders that they have to validate, but they are actually aware about the entire history of what's going on in the rest of the company. Maybe not all of it, we are not putting a data warehouse inside our microservice but the bits that they actually need to use. They should be able to have ongoing awareness of it not as a data warehouse that is loaded once a day but every time something happens, you want them to get an update that, "Hey, you have new orders and now the average order value is actually a bit like that."

If we do it really well then we can start building things that are pretty cool. We can start doing something like service mesh and let's say we have two ideas of how to do the fraud detection. We have one of them over here and the other one over there and we can start by diverting 1% of the orders over there, see how much fraud we detect if we have too many false positives and play it that way. We can build really nice architectures if we only had enough intelligence in the microservice itself.


There are obviously a bunch of challenges, the first is that we are very used to pretending that microservices are stateless. The reason we can pretend that is because the state is elsewhere. I'm saying let's move a bunch of the state into the microservice. How do we deal with microservices that actually have state in them? Then the rest is that data is almost the opposite of what microservices is about. Data is shared, you need your data to also be available to a bunch of other microservices, you need to be aware of a lot of history. How do we do that?


For the rest of the talk, I want to share some design patterns. It's not the whole solution, I'm not doing a reference architecture, here is how you are going to build your systems from now on forever because you have your context. You have the things that you know about your requirements and how you do things in your world. I figured that few design patterns that have been proven to be useful for other people dealing with this kind of problems may be also useful for you as you come to solve your problems. Let's start taking a look at those.

The first one is that I want you to start publishing events. As microservices do things, it's changed the state of the world. They create orders, they validate them, I want you to let the world know what is getting done. I want to start by explaining what events are because this is something that everyone gets wrong on the first try. Events are not commands, if I talk to a microservice and say, "Please validate this order," this is not an event, this is me telling someone what to do. If I have a question, "Hey, what's the address of customer with ID 53?" That's not an event, that's me asking you a question.

Those things are not events, events are me telling the world about things that happened, "This customer changed his address. Everyone, you should know, this customer now lives somewhere else. An order was created. The validation for this order failed. The validation on this credit card failed." Those are things that you can tell other services and some may be interested. The fraud detection service may want to know that, "Hey, something went wrong with this credit card and someone changed their address." Those are two important bits of information. Events are used for notification and they are also just data. I have some data and if I create an event out of it, it's now data that is available for someone else.

Buying An iPad (With REST)

This is super theoretical so I want to go through a specific example for a few microservices, a bit more than in the other example, and what it looks like when I change the architecture to event-driven. Let's say that I have this web server and I want to submit an order, so I create an order. Once the order is created, let's say it was validated, I want to ship it. I ping the shipping service and say, "Hey, ship this order." The shipping service needs to know, where to? It calls the customer service, say, "Hey, where does this customer live?" It gets an address, it ships, this is fairly traditional architecture. Note how everything is either requests or commands.

The first thing we can do is actually use events to notify services about things that happened. When the order service creates an order, instead of saying, "Hey, ship it," it just creates an event, an order was created. The shipping service will obviously be interested, note how the event is published on Apache Kafka, it’s a service box but down there. Everyone who is interested can subscribe to the topic and know that, "Hey, an order was created." Among others, the shipping service will now be able to pick up that, "Hey, we have an order." I know that I'm supposed to ship it.

Why is it a big deal? Because we broke down the dependency, if the shipping service is down, the order service no longer have to know or care. When the shipping service is back up, there is obviously a sysadmin somewhere or a Kubernetes process. It will pick up the event, everything will be ok.

The shipping service still has to call the customer service to know where do people live but they don't have to. The customer service can also publish an event, " Hey, this person changed the address." This is data, the shipping service can accept this data and basically populate its own cache and now have a small copy of where everyone lives. If it needs to ship something to customer 53, it's now a local cache LOOKUP to say, "Oh, customer 53, he lives down in this street," and just ship it. No more remote calls again, it reduces dependency, it reduces the chance that the customer service is down while we're trying to ship something. This is very nice decoupled architecture that we like.

One of the problems here is what is known as the chicken and egg problem. I want to build an amazing event. We have an architecture but nobody is publishing the events. Nobody publishing the events because nobody needs events. Nobody needs events because how can I build anything that uses events if there are no events? This will go around in circle for a very long time basically waiting for someone to breakdown first and do the extra work that is worked to get everyone to start. A cool trick to it is to be aware of the fact that all the data always ends up in some database, and B, every database in the world logs events already. They log every insert, every update, every delete. You can listen to those events and copy them over to Kafka.

Imagine the updates and deletes going on in the database and you use the connector, the ones that you normally use. There is a bunch but the open-source ones is called Debezium. It's by Red Hat, I guess now by IBM, open-source, very cool, very reliable, used in production. You can see the events that it creates in Kafka, it reads updates and insert and delete. What it publishes is basically pairs of two records. It has the key, the primary key is a change, the before record, and the after record, so you have complete information. Every service that reads it know what happened before, what happened after, and the diff. It's not just the N equals N plus one. This is incredibly useful, very rich source of information, great way to jumpstart your event streaming process.

Local State for Microservices

Let's say that one way or another, either by publishing or by stealing from someone else, you manage to get a bunch of events. Now what? I say that the next cool thing to do with events is to create this local state, this local cache that allows you to remove the dependency from asking other services questions.

We have this stream of events, “this happened, this happened”, we had an order, it was validated, we have another order, etc. Local state is basically what is the latest things that happen to every item, in that case, the kid order, what is the latest things that happened to it? I don't care that the order was created, and then it was validated, and then it was the shipped. What's the status of order number one right now? Oh, it's already shipped, fantastic. This is the way to answer questions to your microservice.

One of the reasons that people really resist creating those local views which will be as you've seen like few examples, they can be incredibly useful, people resist it because they see it as duplicating a lot of data. That's true to an extent but it's not as bad as you may think. First of all, one of the reasons people hate duplication is that they can diverge. I have my copy of your data, you have your copies of your data, and five months later we don't even know what data is there and we cannot agree on anything. It happened again and again but if we all build our source of data out of the same shared event stream, we will have the exact identical copies, you know, plus-minus maybe few microseconds in-between. It will be slightly eventually consistent but we won't get into total la-la land where my data is so different from your data. It reduces the risk of duplicates.

You just pick the data you need, you create a view that's very optimized for your use case so it's not actually not that big. The previous talk explained how they created an index that takes just one megabyte of data out of terabytes, you have to be efficient about it. The other nice thing I'll show you in a second how you can shard it and scale it out with your application and then the fourth point is basically that sometimes duplicates are not that bad. They can be kind of nice and cute.

The idea of sharding, there are a lot of ways and dimensions in which you can scale. You take the same thing and run multiple copies of it, you can build bigger machines. Sharding is about taking a lot of things that are essentially identical and randomly assigning them to machines usually with the help of a hash function. In this case, we are talking about, let's say that we are single-machine processing all the orders. Cannot do it anymore, no problem, we can have two of them and then make sure one of them gets the even events, one of them gets the odd events. They'll have 2 100% distinct copies of database, 1 with the even events, 1 with the odd events. They can just work in parallel because you only care about specific or one order at a time or at least that's how we modeled the data in my hypothetical example. You can obviously see how with hash function you can scale it out quite a bit.

Better Than Shared DB

I really like it, I think it's way better architecture than everyone sharing a bit database partially because I don't have to fight with the DBA to get the data in just the way that I need it for my application, part of it because I reduce dependencies on other services. Part of it is because it's because it's low latency. It's a local cache. It's just all memory called, it's incredibly efficient and part of it is because unlike a database, events - not just data, their notifications, which means that this architecture allows me to do something like that.

I can say I have some state, microservices have been running for a while, and I can call it and say, "Hey, I want to know about the really big orders. The salespeople want to be notified so can you give me a list of all the big orders?" I get them and then I say, "Fantastic but now here is a callback. Please run this function every time this happens again." Then this function is like, send an email to the salesperson to let him know that, "Hey, we have another big order." You can actually combine triggers and data. Theoretically, all databases can do triggers, in practice, they have so many limitations that no sane DBA will ever let you write triggers on their database. This is another constant pain to deal with but if you have an application that gets all the event, it's just Java programming. You can query the local shard of the database and have a callback.

Reporting Live From Streams of Events

So far we talked about data in the context of a single microservice but the really big pain with microservices is that I have something like 20,000 microservices in my organization but the CFO still insists on getting a single report about expenditures every morning at 8:00 a.m. Now, what do I do? How do I handle all those 80,000 microservices? We go back to the stream of events and we say that we use them to create reports. Reports have a bunch of requirements. They have to be aggregated and usually quite complex aggregation. They combine data from a lot of services. They are going to be updated in real-time, that's not something the CEO asked for but that's how I justify my project budget. Sometimes auditors need it, I think MiFID II regulation needs you to be able to generate updated report at any time which basically means that this has to be in real-time. Then obviously we're cloud-native so scalable and resilient is order of the day.

This is in architectures that we built once and twice on Confluent, so if you use Confluent monitoring systems, this is basically what's going on behind the scenes. You publish events to all those streams of events in Kafka and all those topics and you create an application that reads data from a lot of topics and creates a local aggregate. Using something like Kafka Streams or Flink or KSQL, you do all your joins and groupbys and window groupbys and averages and all those calculations inside your application code. Then what happens, that will happen whether you use Kafka Streams or Flink, it will create a local database using RocksDB. It has ongoing copy of the results of your report and you will get a rest point that allows you to query this local state that gets continuously updated.

You actually are using modern stream processing system. You actually get quite a lot for free and you usually do it in a very high-level DSL, Scala, Python, Java, or SQL. The other thing that you get for free and this starts sounding a bit magical, you can scale it out. As long you have enough partitions in Kafka, you can add more and more copies of this application. If there's too much data for one machine to process, you can have multiple, and you get failover. If one machine fails over completely, loses its local state, another machine can recover the local state and continue processing.

People didn't really believe me about the local state, so I want to share some details. The idea behind state recovery is that inside our application, those orange dots are basically steps in processing. Think about it as read events, join, aggregate, and so on. Some of those points are stateful, they actually have local state, what is the current value of my aggregate? Every time I update the local state, I also write an update to an optimized topic in Kafka. That's the changelog topic. If you ever used Kafka Streams, you'll know about the changelog topics that show up in Kafka when you do it.

If something happened to instance one and instance two has to take over, it reads the changelog. It's a compacted topic so it should take something between five seconds and if you have really huge state, then maybe half an hour to recover it. Now you have a new running copy of your local state and you can continue processing new events. It does take time and definitely if your state grows too large because of various issues and we have a lot of discussion on how to prevent it from happening and how to optimize, but this is the main idea.

3-layer Data Model

This is the part of the presentation that I am least certain about, I wanted to share it and I hesitate. The reason that I am uncertain is that it looks a lot like things that I already tried in the '90s and kind of work but also kind of don't. The reason I'm sharing it is because every time I told about it to a customer, he was like, "Oh my God. How come you are not talking about it in presentations?" "Well, because I don't know if it's a good idea." "No, no. It's a fantastic idea. It will solve everything for us." The next bit, please take with more salt than the rest.

You're talking about those topics and obviously, applications produce to the topic and you have a lot of applications reading from them. Ideally, a lot of microservices and a lot of reports. Who controls the data in this topic? Because if something makes a change, it can potentially break a lot of stuff, people have to agree, what is a customer? What is an account? Does the order has line items inside the order or do you put the line items somewhere else? People have to agree on those things. Who decides? The person publishing? The person consuming? It took as a few iterations to really solve the problem. The first part was that it's incredibly hard to convince people to publish events. If I put too many roadblocks if I try to say, "You have to publish events in this exact format," they basically go away and never come back and I really need them to publish events. The compromise often ends up being, "Here is a topic. Just write to it. Don't worry about anything else, just produce something. That will be a good enough start." That's obviously nice but those events very often have missing fields. I've seen them being in the sales in the wrong currency, I've seen people publish XML after the entire organization supposedly immigrated to JSON three years ago, a lot of those things happen. Usually, there will be someone who cares a lot about data integrity, an integration team, architecture team. There's always like the group that really cares about standards of data, what is an account? What fields should an order have? It's their job to write the service that reads raw data, figures out where to find the rest of the data, how to arrange it, and write it to another topic.

Now you have a topic who is data that should be clean and standardized and expected. This is important because now as a consumer you can start making assumptions about the data in your code. As a consumer you can sometimes just build an app directly on this topic and a lot of the time this is going to be absolutely fine. Enjoy building an app on clean, standardized data. That keeps arising in real-time, every one of those cycles is a real-time processor. Usually, it's enough but sometimes you want your applications to have this local database. You want some pre-processing, you want the data that is joined, aggregated, sliced and diced in exactly the way you want, filtered, etc. Here you have two choices. You can create a very big stream processing job that just does everything that was the reports that they showed you in the earlier pattern, or you can split it into two.

You'll have one stream processing job doing the data slicing, dicing, managing, etc., and you'll have the consumers that only reads those nice events, populate it in a local state, and maybe expose an HD port to allow people to allow people to build reports on top of it. You have those options but in general in this system usually, the consumers end up doing most of the work, I don't know if it's fair or not. You need to protect them, that's why we have the extra layer of clean data because they are the most vulnerable to data suddenly changing and blowing up the entire architecture but they also usually stand to get the most benefits. They are the ones that make the business' benefit, publishing events is not a big deal. Using the events to make a better fraud detection that saves your organization millions, better recommendation systems that increases sales, those are the things that make a difference and they definitely require more work.

One thing that you'll notice is that some of those topics have to have controls in them. We need to know exactly what data goes into them because we have expectations about what data will come out.

In Event Streaming World Event Schemas ARE the API

This may look familiar, those events are used to communicate between teams, between systems, and, in essence, these events are going to have a schema, a definition, what fields, and what pipes is in there, and it's going to serve as an API between services. Someone asked me, "Isn't the whole sending messages from one service to another tons of overhead and really complicated? Can't we just use REST and send some JSON?" At the end of the day, it doesn't matter if you use some REST and have JSON or if you send Protobuf events via Kafka or whatever it is you do. The hard thing is defining APIs and maintaining them and you'll have to do it no matter what technology you'll use. Figure out the APIs and then separately figure out if you want to transport things via Kafka or via REST but you'll have the same problem elsewhere.

In the Kafka world, I know that in APIs there is a lot of systems for documenting and managing APIs. In the Kafka world, we use what is called a schema registry, it used to be a Confluent schema registry but now everyone has a schema registry, it's a design pattern and not a product. The idea is that you basically announce, "On this topic, this here is a list of schemas that is allowed to be stored there." It used to be that all the schemas had to be more or less of the same type plus or minus some fields but now you can store any set of schemas belongs in this topic and the producers know about it and the consumers know about it. Consumers will use this set of schemas as an assumption when they build their system.

The producers, on the other hand, will have stabilizers that are tied to the schema definitions. Every time they try to produce something that doesn't belong, they'll get a sterilization error with the goal of keeping the topic clean because if a bad event goes into the topic, we don't know what is the blast radius. We don't know if 1 microservice will fail or 50 or 100 or if we just sent wrong data to the report that the CFO is sending to some regulatory commission and he will go to jail and the company will close down. You may laugh but this has actually happened for real, so you want to be very careful about the data that goes into the system. Therefore, producers get the pain of having to comply with the schemas.

Orchestration vs Choreography

Another thing that people keep asking me about is about how do I create business processes on top of microservices? There are two main patterns and honestly, I can't choose. It’s the classic consultant answer, it depends, I'll just show the pros and cons of each of them.

The first is known as orchestration. You'll have one microservice called the orchestrator which basically encodes the entire business process. Start with this and then if this happens, do this, if that happens, do that, and you have your entire business process in one place. It's relatively easy to follow, if you read the log, you know what happened which is nice. On the other hand, the downside is that you basically have one massive dependency. You can't make any change to the business process without ending up modifying the orchestrator. I now, on some of my systems, live in this world and basically I constantly have to open PRs to the orchestrator team and they constantly have to review my PRs and merge them and deal with a bunch of merge conflicts because every single change that me or anyone else in the company makes has to also make a change to the orchestrator. It can be quite painful to develop but a lot easier to debug than any alternative.

The other option is what we call choreography. Instead of the image of an orchestra where you have someone that tells everyone to play at what time, you have a bunch of dancers on stage and each memorizes his own script and they react to each other, someone jumps and someone leaps and catches them, etc. It's like a choreographed dance, and this may mean that I can at any time add another step by reading events, reacting to an event that someone else sent, sending events that maybe someone else will respond to, and we can integrate organically with each other. The downside is that if step four failed to work as expected, I have to start tracking down things in a lot of different places to figure out where the hell did things went wrong?

Take Away Points

Now time to summarize few things I want you to remember. First of all, as you design cloud-native architectures, do me a favor and don't forget that your microservices are in fact faithful. The data is somewhere, and you need to know where it is and how to use it appropriately usually leaving it up to a DBA who may still be stuck 20 years earlier. He is not going to work as expected, so you really want to take responsibility of the data part of your architecture.

Publish events because this is the basis of creating event-driven systems. It can take you to serverless, it can take you to stream processing. This is really a great start to building awesome architectures and then use those events to build those local views and local caches and integrated live real-time reports using things like Kafka Streams and Flink that are really built for creating both microservices and stream processors.

The last point, a lot of what's making architecture work is really about not just caring about meeting your deadline but thinking about how to build something that will be useful for other people. Is my event something that other people can find useful? Did I publish all my internal states in places where other people can find them? Am I sharing the data? Am I sharing my reports? Being nice is good life philosophy in general but I think also in those interconnected architectures, it actually, in fact, makes for better architectures.

Questions and Answers

Participant 1: You gave an example of the schema registry and you gave an example of an event that might be published to a topic which doesn't contain all the data or needs cleaning up. Are those two patterns exclusive or would you advice blending them?

Shapira: One of the reasons we have this three layers model is that we found out that we couldn't really impose restrictions on the producers, like the publishers of the data, without losing a large number of them because they started getting those sterilization errors from the schema registry and just said, "I don't want to deal with this BS," and went away, this was a problem. First of all, try to put schema registry in front of your publishers and if they cooperate, you are working with people who are nice to each other, you all won because now you just don't need an extra layer. If they stop cooperating, which is something that definitely happened to a bunch of companies I worked with, that's where you are like, "Ok, just publish whatever you want and we'll add another process to take care of it." Then you get the other process.

Participant 2: To tackle on the previous question, we use pact DSL to do schema registry and validation. We don't use Confluent at the moment so I'm not sure how similar or dissimilar that is, but one of the nice things is that the publishers know when they break to consumers. Could that help with adoption?

Shapira: Sorry, the question was around a system where you want the publishers to know about consumers?

Participant 3: Yes, in the case where the publisher knows that he's breaking certain consumers is they need to version up their schema or what have you, would you get better cooperation?

Shapira: Usually by the time you broke consumers, it's way too late, you want to catch those issues like in development. When I am writing code on my personal machine, I want to know that I'm going to break consumers. Given schema registry or, as you said, many other methods, the earlier you know that you made the breaking change in your code, the more chance you'll have to fix it. On the other hand, again, some people are nice and some people will be motivated and write code and fix it. If you have that, fantastic. Keep your developers, raise their salary, everything.

There are cases where if people run the local test and it breaks and they get this weird sterilization error, schema doesn't work and they make some schema changes and it still doesn't work, after a while they are like, "Do I really need to publish the data?" They are like, "Well, I can hit my deadline faster if I am not publishing my data. Then I won't run into these errors." That's something that I am really trying to prevent. That's why I have trouble with this three-layer model because in essence it cuddles and promotes what I see as antisocial behavior, people not taking responsibility for their actions. For a lot of companies, that was literally the only way to get this architecture anywhere. I don't know if you want to start there but know that this thing exists in case you start hiring not so nice developers and don't feel like firing them.

Participant 4: Do you have something, some good practices, if it goes to event attributes, for example, event version or maybe some timestamp?

Shapira: Yes, a lot, event versions are important. Event timestamps are often very useful. Every event should have who produced the event because if something goes wrong you want to know where to come back to. What else is a bit useful? Sometimes I see people put like general location if they deal with like multi-country, multi-region putting location in the header of the event. If you deal especially in Europe, there is some events that you can or cannot use for specific purposes, and you can and cannot store in specific ways, and you cannot store in specific countries. All those specific limitations around your business process very often go in the header and like the event metadata.

My general advice about what goes into the event metadata is to think about what processes you have in the organization that don't care about the payload but still need to know about all the events. Things like audits, like did all the event make it to their final destination? Things like GDPR compliance. Did I store PII in the correct way? They don't actually care about the data, they care about the label. PII could be used for marketing, it cannot be used for marketing, never store beyond the 30 days, etc., and that's about it. Those things, store them usually as strings because who knows what your payload is. Usually store them as strings, something that everyone can read including auditors and it will be very easy to enforce across the organization.

Participant 5: Any advice on how to handle breaking schema changes when you have to make some change that's not forward-compatible?

Shapira: I would normally say don't but I notice that even people who directly report to me don't listen so I don't know, this probably is not even that useful. I'd say that don't break compatibility inside the same topic because inside the same topic, that's exactly where the expectations are enforced. If you absolutely have to break compatibility, have another topic and as people upgrade, they will transition to the next topic. You may have a time where you actually need to copy the data back and forth and translate which is a pain in the ass but, hey, breaking compatibility is going to be a pain in the ass as well which is why first advice, don't do it.


See more presentations with transcripts


Recorded at:

Aug 03, 2019