InfoQ Homepage Presentations From Batch to Streaming to Both

From Batch to Streaming to Both

View Presentation

Speed:

Download

41:47

Summary

Herman Schaaf talks about how the streaming data platform at Skyscanner evolved over time. This platform now processes hundreds of billions of events per day, including all their application logs, metrics and business events. He explains what got them here, their current plans and why they may want to skip some of the steps along the way.

Bio

Herman Schaaf is a senior software engineer at Skyscanner, where he works primarily on building the central data platform. Before this he worked on applications in machine learning and machine translation, including an offline mobile application that can recognize and translate Chinese to English.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Schaaf: About two-and-a-half years ago, I found myself in a room with four other engineers at Skyscanner. We were tasked with answering a question that I thought should be fairly simple to answer. That question was, where does our company's revenue come from? As it turned out, it wasn't quite that easy to answer. The reason we had this question in the first place was, we now had to comply with regulations known as SOX. It's in the U.S. It stands for Sarbanes-Oxley. It essentially meant that we needed to be able to show for every line in our financial statements and our records, every line there was traceable and that we could prove that the data there was reliable, and from reliable sources. That it couldn't have been altered in any fundamental way by anyone inside or outside our company.

In this case, it wasn't that easy to answer. We were scribbling on whiteboards, trawling through GitHub. Asking questions to other teams trying to figure out, how does revenue actually flow through our streaming pipeline? It actually took us a couple of weeks to get enough confidence in the answer to give the recommendation, and say, "This is exactly how we're going to secure it. This is how we're going to make sure that these numbers don't get fundamentally changed." It gave me quite a bit of a pause, how did we end up in this situation, where no one individual or one team even could tell from beginning to end, from producer all the way through to financial statements how the data flowed? What was happening in there?

What I'd like to do today is share a story with you of how the data platform at Skyscanner evolved over time, from batch to streaming and to both. By sharing these stories, I hope to share some of the lessons and help you avoid some of the traps that we fell into along this journey, doing things in a fairly agile fashion as we did. When I originally submitted this talk, the title was, from batch to streaming to both. In the weeks following, I've been rethinking it and thought that a more accurate title might in fact be, from batch to streaming to both, and then back again, perhaps. I'll get into why there's a question mark at the end there. This new title wasn't quite as catchy, so I didn't change it.

Background

My name is Herman. I work at Skyscanner. If you haven't heard of Skyscanner, we help you plan trips. We do that by helping you find the best flights from A to B. When you get to B, you can also find good accommodation, and car hire, all these things. When I say good flight, that might actually differ depending on the person. That might mean cheapest. It might mean lowest carbon footprint. It might mean most comfortable. We allow you to search across all these different things. Over the years, we've grown to be fairly popular. At peak we have about 100 million monthly active users. It's available in 30-plus languages. We've had 100 million or more app downloads. We have about 1200 partners, by which I mean online travel agents, and airlines, and hotels. Fairly big.

With that scale, comes also a pretty big data platform. What I'm showing you here is the number of messages per second that we were processing from 2016 to fairly recently, in our streaming data platform. Our platform wasn't always streaming. This isn't the full timeline. Back in 2016, we were doing about 20,000 messages per second. Later in the year, about 80,000 messages per second. That grew to about 2 million messages per second now. That's fairly big scale. To put that into perspective, it's about 172 billion events per day. It would be several petabytes, if we stored everything, which we don't. Even if you stored it in an efficient format, like Parquet, it would be quite a lot of data.

Our story started with a batch pipeline. Like many organizations, we started with data being in lots of disparate sources. We had Mixpanel. We had Google Analytics. We had an OLAP database that we lovingly referred to as The Cube. This cube allowed you to do slice and dice queries, fairly effectively and efficiently. It was hosted on-premise because we had not yet transitioned to the cloud. We were running into some scaling problems. Firstly, we didn't really want to wait 24 hours for batch extracts to happen. As we started processing more messages, the cube is starting to show its age. Together with a transition to the cloud and AWS, we started moving and exploring the possibilities of a streaming data pipeline.

We went from batch to streaming. We didn't really know whether this whole streaming thing was actually going to catch on. We put Kafka in the middle. Kafka looked like a really promising technology at the time. We said, on the producer side, we'll have our servers communicating directly with Kafka. Writing messages there. We'll have an HTTP API which we can use for web clients and mobile clients. They can write to that instead using REST and JSON. This HTTP API would convert the JSON to Protobuf. Pretty early on, we had this convention that all topics in Kafka should be Protobuf. This was a really good thing. If you're not using Protobuf as a message format, like Avro, or Thrift, or any other number of ones, if you're not using a binary format like that, I really recommend you go out and do that. We'll see some of the benefits later.

An interesting thing here is that we used our Kafka cluster for three very different data as well. We had metrics. We had business events. We had application logs. Metrics would be pulled from Kafka and delivered to OpenTSDB, which is a very scalable time series database, based on Hbase. Our data platform users will be able to query that through Grafana. Business events would go to S3 as Parquet, and be queryable through systems like Athena, or Redshift, or any number of engines that we later added. Application logs would be sent to Elasticsearch, and our customers within Skyscanner, the engineers would be able to query their application logs in Kibana.

This is still a good design. I think it's fundamentally a really good thing to do. Because it's paradoxically both decoupled and unified. It's decoupled in the sense that the producers no longer need to know which consumers are going to be interested in the data. They could just produce to Kafka and forget about it. That's good. The consumers similarly don't need to know exactly who produced the data. You don't have this web of interactions that you might have if all the consumers need to be aware of the producers and vice versa. It's decoupled. That's a good thing. It's simultaneously unified in the sense that all the messages now have to go through Kafka. Kafka becomes, to some extent, your single source of truth. In fact, around that time, we started talking about this concept of the single unified log, which basically just means that all the messages would now go through Kafka. If you had a question, you could answer it from Kafka using the data in Kafka. All the other systems would derive their truth from Kafka, from this single unified log.

This was all good. Our users really liked it. Pretty soon they were telling us, "What we'd really like to do is add some transformations to the topics that we have in Kafka now. We would like to merge topics. We'd like to split them into only the information that we care about. We'd like to enrich them with further information from other APIs." We said, "This is exactly what we think stream processing is for." We wanted to exercise a little bit of control, though. We didn't want everyone to reinvent the wheel and build their own stream processing system. We standardized on Apache Samza. If you haven't heard of this, Apache Storm is another similar one. We standardized in Samza. We basically made a templated version of this. That any team in Skyscanner could easily create within a couple of clicks by filling out two very simple Java functions. You could have a stream processor up and running within minutes. Fairly soon, we had hundreds of these Samza jobs running. They could do any transformation. We didn't put any limits on that. Things were great. We were pretty happy.

There was one problem, however, that we found also fairly early on, which was that as we were transitioning to the cloud we had some services that had hundreds of containers. When they did a blue-green deploy that would suddenly add hundreds of new connections to the Kafka cluster. That would then have a cascading effect of lots of extra metadata on the cluster and would cause some temporary degradation. We decided that for that reason, mainly, we would make sure that the servers within Skyscanner that are now deployed to AWS, would also talk through this HTTP API to Kafka. They weren't talking to Kafka directly anymore.

This was a really good move, in retrospect, because it actually allowed us to decouple the production from Kafka itself, which means we now have a place we could implement any business logic. We could enforce the usage of Protobuf rather than have it as a convention. We could check that certain fields are set. We can make sure that messages are good enough quality to go in, all these things. Also, some other surprising benefits, we could have this API deployed in multiple regions whereas our Kafka cluster was only deployed in one region at the time. This API would be in multiple regions. We could have local sinks that we write the data to, and then forward the data onto our Kafka cluster, the central one, which means that write resiliency is a lot higher. Even if our central Kafka cluster was having an issue, we would not lose messages. We would still be able to take those, whether from server-side, or from mobile devices, or from the web. This proxy was a really great move.

We were in a pretty good place. Then we realized that maybe we have become the victims of our own success, to some extent. We now had about 100,000 messages per second going through the platform. It was in this room where we were trying to trace our revenue events that we saw that actually these revenue events were going through an epic journey. First, they would go from the server to the HTTP API. That's all easy enough, then to Kafka. That's all, as we expected. Then a Samza job would pick that up. It would pick up this revenue event. It might add some extra information, something like device information extracted from the user agent, for example. That's good. Then another Samza job would read from the second topic and write out to a third topic. In this case, it might add some information like geolocation information. As we started tracing it, we realized there was a third Samza job writing to a fourth topic, and a fourth job, and a fifth. It went on a long, epic journey all the way until finally, the final Samza job, a number of hops later. At this point, we probably needed a Samza job to do some deduplication, because distributed systems introduce duplicates pretty easily. Before we commit the final results that we will use in our reports to S3, we have a deduplication job in there as well. This was fairly complicated.

Lesson 1: Conway's Law is true for Data Platforms

The first lesson is that Conway's Law is true for data platforms. If you haven't heard Conway's Law before, it basically says that organizations which design systems are constrained to produce designs, which are copies of their communication structures. Here, I've substituted systems for data platforms because I think the same thing goes for data platforms in particular. When we were looking at our revenue events, in this case, we realized that all these different hops, all these different Samza jobs basically mirrored all the different organizational hops that the data had to go through.

To understand this, it helps to give a little bit of context about how Skyscanner is structured. We operate using a squads and tribes model, which basically means we have small, 67-people autonomous teams. They solve problems in an independent way. They try not to have too many dependencies. What happened was, for example, the producers would produce the data. It wasn't in their interest, really, to make sure it made it all the way through. They didn't have the full picture. Similarly, other teams would create Samza jobs, enrich the data, write it back, and so forth, all the way through to the financial reports. No one single team had the full picture. Conway's Law is true.

We perhaps didn't realize this enough early on. One thing we did say was that we're a small platform team. We're trying to do the stream processing thing for quite a large company. We're trying to make that platform as self-serve as possible. I still think being self-serve is very good. It basically means, we enable our customers as data platform users, to use our platform without us needing to get involved, without us needing to have deep knowledge of every single dataset. Because as a small platform team, you just can't do that. What we didn't realize though, was that metadata then becomes absolutely critical. It was metadata to some extent that we lacked when we were looking at the pipeline and trying to figure out how events flow through the system.

Let's talk about metadata. In the architecture diagram earlier, there was no schema registry. How did consumers know which schema to use? How did producers tell consumers which schema to use? The answer to that in Skyscanner's case, at least, is convention. We had a simple convention for topic names. Every topic at Skyscanner looks like this. It's prod, or sandbox, or local, so the ..... This tells you from a glance, which schema was used. We store all the schemas in a central repository where everyone has access to it. You can easily see, this is the schema file name, and the message. Go look it up in that repository. This was a pretty good convention because it actually allowed us, as a small platform team, to share quite a bit of data about each topic and know how to interpret the data in that topic without us needing to go build an API to support this. Nowadays, Confluence's schema registry might actually do the same thing. However, until recently, I believe, it at least didn't support Protobuf. Back when we built this, the schema registry didn't exist.

Pretty soon we had topics looking like this, prod.identity-service.audit.identity.AuditMessage. That's fairly good. It tells you what it is and what schema to use when you read it. We also had topics like, prod.flyingcircus.applog.applog.Message, which is less useful. It came down to the team how useful they want to make these topic names and how much context they want to give it. It was a very useful thing.

I want to distinguish here between three different types of metadata: descriptive, structural, administrative. With descriptive metadata, you're answering questions like, what does this dataset mean? Who owns it? What does it contain? Where does it come from? Structural metadata is more about, how does this relate to other datasets? How is it organized? How is this dataset sorted and partitioned? Administrative, which is about how far does this dataset date back? How frequently is it updated? How large is it? How complete is it? All three kinds are very important.

When we started looking into it, we had some descriptive metadata by virtue of using Protobuf and by having our topic naming convention. That was good. We had some structural metadata, but we didn't know, not without a lot of effort, at least, how datasets relate to one another. This was something we missed. We could tell if customers were to ask us how something was partitioned. We could go and look that up. It wasn't easy for our customers to tell that. Then administrative metadata, we didn't have much of it at all. Similarly, we could go look it up if someone asked with a fair bit of trouble. We didn't surface this to our customers. We didn't have a lot of visibility into these ourselves.

Lesson 2: Metadata is Critical

The second lesson is that metadata is critical. Especially, when you're building a self-serve platform that you want a lot of people to use in your organization. Especially, the relationships between datasets, if we had this information, determining the revenue pipeline, and determining where our revenue came from would have been a lot easier. Ideally, of course, it would be automated. Ideally, we'd have all this from the start, but we didn't. We did the next best thing, which was we started to gather this information manually. When users wanted to archive something to S3, we asked them, could you please use your knowledge of this system to tell us what is the hierarchy? How did events actually flow to make it all the way to the final topic that you now want to commit to S3? That was a start. It gave us some form of metadata. We did it knowing that this would get out of date. It would be better if this was automated. A bit of metadata is already better than not at all. The tools like schema registry are a start here, but they're not the full solution. You need to also think about how you can make these relationships between datasets and the three different types of metadata, make those visible as well.

Many of you have probably seen this picture before. It's from an XKCD comic. It shows the flow of characters through Jurassic Park. I like to compare this to topics flowing through Kafka. To explain this diagram briefly, on the X-axis we have time, and on the Y-axis we have the proximity of different characters to one another. If the characters are in the same scene, then the lines are close together. If they're far apart geographically, then the lines are further apart. As you might expect in Jurassic Park, some characters make it all the way through. Some characters don't make it all the way through. The reason I'm saying this, is a bit like a Kafka cluster. Jurassic Park, we thought this is what our pipeline looked like. It looked a bit more like the movie "Primer." When we started looking into it, things were pretty difficult to follow. That's the point here.

Lesson 3: Data Engineers Control the Plot Line

Data engineers control the plot line. By this I mean that when you design the system, the restrictions you put in place on your data platform users determine the plot line you're going to get, determines whether you're going to get "Primer" or "12 Angry Men." If I had to choose a plot line for my platform, I would rather have a plot like "12 Angry Men" than like "Primer." That's what we did. We realized that these Samza jobs, these stream processing jobs, they were causing us a bit of pain. Firstly, we didn't enforce using Protobuf there, for example, but we trust that people would use Protobuf. More to the point, we couldn't really trace the lineage through Samza. We didn't have a lot of control because it was essentially free code and people could do in them whatever they wanted. This made it pretty difficult to trace things through the system.

What if we could go straight from the HTTP API to S3? This would also avoid a problem that we had with Samza jobs in terms of repeatability. When things go wrong in your streaming pipeline, it's really worth thinking about what you're going to do in that situation. For example, if a Samza job stopped, then it would affect all the downstream Samza jobs. Users would come to us and ask, "What happened? Where did my data go? Which team do I talk to"?

Another problem here is that when you inevitably do discover a bug in your Samza stream processor, long after the fact, it's very difficult to go and fix it. You basically have to do a replay. You have to send that data through the Samza job again. That would have unpredictable consequences on the downstream Samza jobs if you don't have a good handle on what's happening in them.

What if we could go straight from this HTTP API that we introduced straight to S3? A bit less like "Primer," a bit more like "12 Angry Men." That's what we did. We said, we're going to still have Kafka. We're going to send all our metrics, and application logs, and even business events, everything we were sending there before we're still going to send to Kafka. That will still be going through stream processing, all the usual things. In parallel, we're going to also send it to AWS Kinesis. It's a managed service that AWS offers. It's a bit like Kafka. We chose it not for any big technical reason other than it's very easy to spin up in multiple regions, and AWS pretty much manages it for us. We chose Kinesis here and we started sending things to Kinesis, and then reading them from Flink. Flink is another stateful stream processor. Using Flink, we could abstract away, firstly, whether it's Kafka or Kinesis. It has adapters for both. Then send those messages to S3 every couple of minutes. It would batch them together into a bigger file, into a Parquet file and then send it to S3.

This time, we also learned a lesson. We said that, of course, we still need transformations. We can't completely do away with them. Transformations are useful. Datasets aren't in their final form the moment they are produced. We allowed transformations to be done, but this time only in the archive, and only in a repeatable way. It meant that we didn't need to do any replay anymore. We could just read that day's data with our Spark job, for example, to do the transformation again and write the data again. It wouldn't necessarily need to have an impact on any other jobs. We also realized that we need to make sure that these Spark jobs that do these transformations on the archive, have to at least register their dependencies and register where they're going to write, so that when we do find errors, and we do correct them that we know what we need to do to correct it in the downstream systems as well. This was really important.

We also decided that this is a good opportunity for us to measure completeness and tell our users, give them a bit more metadata about the datasets that they are working with. Tell them that all the data that we found in our HTTP API has been delivered to S3. All the transformations you were expecting to be done have been done. We can verify that for you as a platform.

Lesson 4: Repeatability is Important

Repeatability is very important. Repeatability is something that is not always discussed. It's more of a failure case, in some cases, for streaming platforms. It seems streams often have to choose between replays and accepting errors as permanent. Replays can be very complicated. In our revenue pipeline example, we had about a 30-step process for doing replays that were really complicated. We didn't have a lot of confidence in doing them because you don't do them very often. If you can avoid them entirely, that might be better. That's the approach we went with here. It might not work in every case. We found that to be a bit simpler.

Batch processing is good because it can be done again, any time. The source data is always there. It's a bit easier to manage. Going straight to the archive in small batches, micro-batches, gives you the benefits of both. You still have very low latency but your replays can be at least quite simpler.

Some of you who are familiar with data architectures may be asking, is this a Lambda architecture? Is this what you're telling me about, Herman? Yes, it is. For those of you who haven't heard this term, I just want to make clear that Lambda architecture here, it doesn't have anything to do with AWS Lambda, or Lambda functions. I believe the name derives from the fact that the Lambda character in Greek has two little legs. This corresponds to the two legs in the data platform. There's a streaming leg and there's a batch leg. This is what we built. We had a stream processing side and we had a batch side that we introduced now. It's a little bit different. This has been used in other companies like Yahoo and in Netflix quite successfully for a number of years. They use Lambda Architectures, but they came at it from a different angle. They actually had Hadoop batch processing pipelines. They weren't satisfied with the high latency. They weren't satisfied with the 24-hour wait they needed. They introduced the parallel streaming architecture. We came at it from the other direction. We had a streaming pipeline, latency was fine. The error case, the replays that we had to do, those were pretty complicated. We wanted to simplify the lineage and increase our visibility over what was going on. We added the batch pipeline.

It almost feels this is an inevitable architecture to some extent. Then I want to question, is this really a Lambda Architecture? Because what we're doing is pretty low latency on both these legs. In fact, we realized that we could write files to S3 every 5 minutes. We could also use a technology known as Delta Lake, which is an open-source technology, and use that as a sink for streaming and for batch, and as a source for streaming and for batch.

Just to talk a little bit about Delta Lake. This was initially developed by a company called Databricks. It's essentially metadata on top of Parquet in your archive: could be HDFS, could be S3. What it gives you is ACID compliant features and semantics on your S3 archive. Basically, you can do updates and deletes on data in your archive. You can do very nice things like sort the data, they call it Z-ordering, which can make certain types of queries very efficient. In our case, because of GDPR, we need to be able to look up individual users and delete their information if requested. Delta Lake allows us to do this. In our experiments, we found that using Delta Lake instead of just there in Parquet could speed up these needle and haystack queries up to 100 times.

It also does something really nice, which is, it gives us the ability to roll up these very small Parquet files that we're constantly delivering into bigger files, which we know improves the query performance as well. From our experiments, those speed up typical queries by about two times, just those roll-ups. Delta Lake is able to do these roll-ups without affecting running queries. That's where the ACID database compliance comes in. It's really nice technology.

We put this in place in our pipeline. We had Flink delivering files to S3. At the moment, it's delivering them as Parquet. Then we have another transformation step that loads them into Delta Lake. I'm hoping in the future these will merge into one and we'll be writing directly to Delta Lake. In essence, we have very low latency on both sides. We have 1-minute latency in our Kafka pipeline, and we have about 5-minutes latency right now in our so-called batch pipeline. I think it's more appropriate to call it a micro-batch pipeline.

I think in the future here, it seems to me our streaming architecture that we originally started with, we had some reasons to go back to batch, but it's more a micro-batch this time. Our micro-batch architecture, in this case, the latency is getting lower and lower. In the future, it looks like latency could be low enough that most of the use cases we were using Kafka for before, might not need to be done in Kafka anymore. We could do this in our batch pipeline instead. Perhaps, we'll only reserve Kafka for very real-time use cases like metrics and application logs.

Key Takeaways

The key takeaways. The first one I talked about was Conway's Law and how that's true for data platforms. This basically means that your data platform structure and the data in it specifically will mirror that of your organization. Luckily, for me, the advice I can give you here is going to be fairly generic because it will depend on your exact organizational layout, what this means. It's really worth thinking about it upfront, and realizing that this is something you need to account for.

The second thing was that metadata is absolutely critical. Having a self-serve architecture is very good. As your organization grows, you don't want your data platform team to become a bottleneck for using data. You can't know about every dataset. That means you need to provide your users and yourself with the metadata you need to work with these datasets efficiently and make good decisions. It's as important as the data.

Data engineers control the plot line. With this, I mean that the restrictions you put on the platform, there's the ways that you're allowed to use the data that actually matters. It determines whether you get a really complicated plot line like "Primer," or whether you get a pretty simple plot line like "12 Angry Men," or "Jurassic Park".

Then I talked about repeatability and how that's important. How it's often overlooked in streaming pipelines. It's a difficult topic. Getting repeatability right, in our case, meant completely avoiding streaming. That's not necessarily the solution. I think you can do repeatability with good partitioning schemes on Kafka, for example. You can achieve some of these same semantics. Or, by keeping your system fairly simple then repeatability is not such a big problem. It is very important to know what your plan is here and know how you're going to keep it under control.

Then finally, I briefly touched on Delta Lake. It's a technology that we're starting to use now at Skyscanner. We're using it right now as a bit of a batch sink, and probably a batch source. I'm hoping soon also as a streaming source, which will give our customers the ability to choose whether they want very low latency, maybe 5 minutes, maybe 1 minute, or whether they are ok with having daily batch transformations done on the archive. Either way, they can use Spark. Either way, it's the same code. That's quite a nice thing. It's quite a nice thing to be able to do. If you haven't heard of Delta Lake, that's a good technology to go explore next. We're already seeing quite a bit of benefit from this at Skyscanner. I think we'll see more of that in the year to come.

If you do all of this right, all these many different things, then I hope you don't, one day, like me, find yourself in a room where you have to answer the question, where does your company's revenue come from?

Questions and Answers

Participant 1: Have you considered introducing lineage technologies such as Apache Atlas or IGC type of technologies?

Schaaf: Yes. That's really important. That's something we're actively looking at now and deciding what the best system will be for us. We've looked at Atlas. We've looked interestingly at Glue, specifically, because we use AWS Glue for pretty much everything. We're thinking what we'll do is store the metadata for lineage in that instead of using existing open-source solutions, because our system is quite customized. Having this metadata at all is the important thing. Apache Atlas, I believe, is a good option. We will likely be using it soon but aren't yet.

Participant 2: Looking at what you were saying with your stream processing, it seemed your issue there wasn't so much batch versus stream but was almost the fact you had so many different dependencies in the stream engines. You had so many teams where one team has a little bit of data, has a new topic. The next team reads that, and the next one. How is that working with the batch one, when you change it later to the Spark engine? Where you're having lots of batch jobs that all depend on each other. Was that really what you fixed, having so many dependencies, and actually said that each team instead of depending on the previous team is going to start from the beginning and add most of their own data at that point?

Schaaf: Yes. We'll see how it ends up. This side of the pipeline is fairly new. Time will tell whether we were successful in simplifying the whole thing. What I can say is that, by having the metadata at least this time to know what the dependencies are, teams can decide whether they want to start from scratch from source, or use some of the existing metadata. You can get the picture and you can see what's available already quite a bit more easily than you could before. We also added some useful distinctions here in terms of the quality of the data. We were thinking of, let's put it into Databricks, terms like bronze, silver, gold, and so you can know the quality of each data set. We will prefer that people build their datasets on top of gold datasets so that the transformations happen in as few places as possible and not all over the place. Metadata is important to make this visible and for teams to know which datasets to build their own transformations on, and not necessarily do everything that other teams have already done. Time will tell whether we were completely successful here.

Participant 3: What was the relationship between those architectural choices on your original performance graph? Also, in your Lambda with your two flows of data in the streaming and batch space? What was the split of the 2 million per second across that?

You had a performance graph that was showing your increase in rate over time. All those different architectural changes, at what points did they happen on those improvements? Were the improvements a result of the architectural choices at all?

Then on your final picture with the split of metadata versus the business events, where does the 2 million per second fit across that? Is that both of those or one of them?

Schaaf: The Kafka cluster right now, in all the pictures is doing 2 million. We haven't reduced the amount of data in Kafka at this stage. That might happen in the future. The batch side is doing quite a bit less, I think, about 20% of that.

In terms of how things played out in the timeline, the batch pipeline we introduced, we started introducing this new batch pipeline about a year ago. It's been rolling out. It's still in the early phases of seeing how it works, but it's processing only the business events and not the metrics and application logs. Actually, as time goes on, we're splitting these completely. We're trying to not use the same Kafka cluster for all these different data use cases, because they are actually quite different, and have different profiles. What seems will happen is we will use one Kafka cluster for metrics, another for logs. Then probably, this batch pipeline for business events only.

See more presentations with transcripts

Recorded at:

May 11, 2020

Herman Schaaf

InfoQ Software Architects' Newsletter

From Batch to Streaming to Both

Summary

Bio

About the conference

Transcript

Background

Lesson 1: Conway's Law is true for Data Platforms

Lesson 2: Metadata is Critical

Lesson 3: Data Engineers Control the Plot Line

Lesson 4: Repeatability is Important

Key Takeaways

Questions and Answers

Related Sponsors

This content is in the QCon Software Development Conference topic

Related Topics:

Related Editorial

Popular across InfoQ