
A Dive into Streams @LinkedIn with Brooklin


Summary

Celia Kung talks about Brooklin – LinkedIn’s managed data streaming service that supports multiple pluggable sources and destinations, which can be data stores or messaging systems. She dives deeper into Brooklin’s architecture and use cases, as well as their future plans.

Bio

Celia Kung manages the data pipelines team at LinkedIn. Previously, she was the lead engineer for building Oracle change-data capture support for Brooklin, as well as a new Kafka mirroring solution that has fully replaced Kafka MirrorMaker at LinkedIn.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kung: First off, I’d like to thank all of you for choosing to attend my talk here today. Before I get started, I just wanted to address a potential confusion around the topic of my presentation – since we are in New York, this talk won’t be about the city of Brooklyn, or the beautiful bridge that is just right outside this conference. Hopefully, you are here to learn about this Brooklin. I get questions all the time whenever I place orders for custom things like hats, T-shirts, and stickers with our Brooklin logo on them: “Are you sure you wanted to spell ‘Brooklin’ like that, with an ‘i’ and not a ‘y’?” I always have to reassure them that I did intend to spell it that way.

How did we come up with the name Brooklin with an “i”? The definition of a “brook” is a stream and, at LinkedIn we like to try and incorporate “in” in a lot of our team names, product names or event names, so, if you put the two together, that gives you “brookin” but that sounds an awful lot like “broken” and that is not what we wanted to name our product after. Instead, if you combine the word “brook” and “LinkedIn”, that gives you “brookLinkedIn”, which sounds a little bit better, but it still sounds awkward. What we did was, we simply dropped the latter half of that word, and that’s how we came up with Brooklin.

Here is a picture of me wearing one of those customized T-shirts with our logo on it. My name is Celia [Kung] and I manage the data pipelines team, which includes Brooklin as part of the streams infrastructure work at LinkedIn. At LinkedIn it’s a tradition, whenever we are introducing ourselves to a large audience or a bunch of people that we don’t know, to talk about something that they cannot find on our LinkedIn profile. Something about me that’s not on my LinkedIn profile is that I am a huge shoe fanatic. I used to line up to buy shoes at stores in the mall, right before it opened, and, at one point, I used to have over 70 pairs of shoes. I used to keep them neat and tidy in their original boxes so that they could stay clean. At the time, I was living with my parents, who, needless to say, were not very happy about my shoe obsession.

Let’s dive into the presentation, here is a little bit of an outline of what I will be covering today. First, I’ll start off with a brief background and motivation on why we wanted to build a service like Brooklin. Then, I’ll talk about some of the scenarios that Brooklin was designed to handle, as well as some of the real use cases for it within LinkedIn. Then, I’ll give an overview of the Brooklin architecture and lastly, I’ll talk about where Brooklin is today and our plans for the future.

Background

When we talk about streams infrastructure, one of the main focuses is serving nearline applications. These are applications that require a near real-time response, so in the order of seconds or maybe minutes. There are thousands of such applications at LinkedIn, for example the Live Search Indices, or Notifications that you see on LinkedIn whenever one of your connections changes their profile and updates their job.

Because they are always processing, these nearline applications require continuous and low-latency access to data. However, this data could be spread out across multiple database systems. For example, at LinkedIn, we used to store most of our source-of-truth data in Oracle databases. As we grew, we built our own document store called Espresso, and we now also have data stored in MySQL and in Kafka. LinkedIn acquired some companies that built their applications on top of AWS, and then LinkedIn itself was acquired by Microsoft, which called for the ability to stream data between LinkedIn and Azure.

Over the years, LinkedIn’s data became spread out across many heterogeneous data systems, and to support all of these nearline applications that needed continuous and low-latency access to this data, we needed the right infrastructure to consume from these sources. From the company’s perspective, we wanted to allow application developers to focus on data processing and application logic, and not on the movement of data from these sources. The streams infrastructure team was tasked with the challenge of finding the right framework to do this.

When most of LinkedIn’s source-of-truth data was stored in just Oracle and Espresso, we built a product called Databus, whose sole responsibility was to capture live updates being made to Oracle and Espresso and then stream these updates to applications in near real-time. This didn’t scale so well, because the Databus solution for streaming data out of Oracle was quite specialized and separate from the solution for streaming data out of Espresso, despite the fact that the two lived within the same product. At that point in time, we also needed the ability to stream out of a variety of other sources.

What we learnt from Databus was that building specialized and separate solutions to stream data from and to all of these different systems, really slows down development and is extremely hard to manage. What we really needed was a centralized, managed and extensible service that could continuously deliver this data in near real-time to our applications, and this led us to the development of Brooklin.

What is Brooklin?

Now that you know a little bit about the motivation behind building Brooklin, what is Brooklin? Brooklin is a streaming data pipeline service that can propagate data from many source systems to many destination systems. It is multitenant, which means that it can run several thousand data streams at the same time, all within the same cluster, and each one of these data streams can be individually configured and dynamically provisioned. Every time you want to bring up a pipeline that moves data from A to B, you simply make a call to the Brooklin service and it provisions the pipeline dynamically, without you needing to modify a bunch of config files and then deploy to the cluster. The biggest reason behind building Brooklin was to make it easy to plug in support for additional sources and destinations going forward, so we built Brooklin with extensibility in mind.

Pluggable Sources & Destinations

This is a very high-level picture of the environment. On the right-hand side, you have nearline applications that are interested in consuming data from all the various sources on the left-hand side. Brooklin is responsible for consuming from these sources and shipping the events over to a variety of destinations that the applications can asynchronously consume from. In terms of sources, we focused on databases and messaging systems: the most heavily used databases at LinkedIn are Oracle and Espresso, and Kafka is the most popular messaging-system source. On the destination side, most of our applications consume from Kafka.

Scenario 1: Change Data Capture

Now that you know a little bit about Brooklin, I’d like to talk about some of the scenarios that Brooklin was designed to handle. Most of these scenarios fall into two major categories at LinkedIn. The first scenario is what is known as Change Data Capture. Change Data Capture involves capturing live updates that are made to the database and streaming them in the form of a low-latency change stream. To give you a better idea of what Change Data Capture is, I’ll walk through a very simple example of how this is used at LinkedIn. Suppose there is a LinkedIn member whose name is Mochi, and she updates her LinkedIn profile to reflect the fact that she has switched jobs. Mochi used to work as a litterbox janitor at a company called Scoop, and now she switches jobs and works at a company called Zzz Catnip Dispensary, where she is a sales clerk. She updates her LinkedIn profile to reflect this change.

As a social professional network, LinkedIn wants to inform all of Mochi’s colleagues about her new job change. For example, we want to show Mochi’s job update in the newsfeed of Rolo, who is one of Mochi’s connections, and we want this to show up in the newsfeed so that Rolo could easily engage in this event by liking or commenting on it to congratulate his friend on her new job.

Mochi’s profile data is stored in a member database at LinkedIn, and one way to enable the use case that I just talked about is to have the News Feed Service, which is a backend service, constantly query the member database to try to detect any changes made to it. It can run a lot of SELECT * queries and try to figure out what has changed since the last time it ran the query, to enable a use case like this.

Imagine that there is another service called the Search Indices Service that is also interested in capturing live updates to the same profile data, because now, if any member on the LinkedIn site does a search for Zzz Catnip Dispensary, you want Mochi, who is their newest employee, to show up in the search results for Zzz Catnip Dispensary and no longer in the search results of her previous company, Scoop. The Search Indices Service could use the same mechanism and also query the member database frequently to poll for the latest changes, but in reality, there could be many such services that are interested in capturing changes to the same data, to power critical features on the LinkedIn app like notifications or title standardization. They could all query the database, but doing so is very expensive, and as the number of applications grows, or the size of your database grows, you are at risk of bringing down your entire database. This is dangerous, because you don’t want a live read or write on the LinkedIn site to be slow or to fail just because you have a bunch of backend services querying the same dataset.

We solved this by using a known pattern, called Change Data Capture (CDC) where all of the applications in the backend don’t query the database directly, but they instead consume from what is called a change stream. Brooklin is the force behind propagating these changes from the live sources into a change stream.

There are a couple of advantages to doing it this way. First, you get better isolation: because applications are completely decoupled from the database source, they don’t compete for resources with the online traffic on the site. Because of this, you can scale the database and your applications independently of each other. Secondly, applications reading from a particular change stream can be at different points in the change timeline. For example, if you have four applications reading from a particular change stream, some can be reading from the beginning, some from the middle, some from the end – it doesn’t matter, they can all read at their own pace.

Here is a high-level overview of what Change Data Capture looks like at LinkedIn. Brooklin is responsible for consuming from the source databases, and note that Brooklin itself does not run SELECT * queries on the database, so it doesn’t compete for resources with the online queries. How it reads depends on the source: for example, to stream data out of Oracle, Brooklin may read from trail files, and to stream from MySQL, Brooklin would read from the binlog. The mechanism depends on the source, but Brooklin does not run SELECT * queries against it.

When it captures these changes from the database, it feeds them into any messaging system, and our most heavily used messaging system, as many of you may know, is Kafka. Because Kafka provides pretty good read scaling, all of these applications, if necessary, could all read from the same Kafka topic to retrieve these changes.
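
To make the consuming side concrete, here is a minimal Java sketch of a downstream application reading a change stream from a Kafka topic using the standard Kafka consumer API. The broker address, consumer group, topic name, and the treatment of each record value as a plain string are assumptions for illustration; in practice the change events would typically be schema-encoded (for example, Avro).

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ProfileChangeReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
        props.put("group.id", "news-feed-service");                 // each application uses its own group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "ProfileTopic" is an illustrative change-stream topic name.
            consumer.subscribe(List.of("ProfileTopic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is one change event; the application applies its own logic,
                    // e.g. updating a news feed or a search index.
                    System.out.printf("partition=%d offset=%d change=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```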

Scenario 2: Streaming Bridge

That was Change Data Capture, and the second scenario that Brooklin was designed to handle is what is called a streaming bridge. A streaming bridge is applicable when you want to move data between different environments: for example, if you want to move data from AWS to Azure, move data between different clusters within your datacenter, or move data across datacenters. Any time you want to stream data with low latency from X to Y, a streaming bridge is what you need.

Here is a hypothetical use case of a streaming bridge. In this example, Brooklin is consuming from two Kinesis streams in AWS and moving this data into two Kafka topics in LinkedIn. Because Brooklin is multitenant, that same Brooklin cluster is also reading from two other Kafka topics in LinkedIn, and feeding this data into two EventHubs in Azure.

Having a streaming bridge allows your company to enforce policies. For example, you might have a policy that states that any data that is streamed out of a particular database source, needs to be encrypted because it contains some sensitive information. You may also have another policy that states that any data that is coming into your organization, should be of a particular format – say Avro or JSON. With Brooklin you can configure these pipelines to enforce such policies. This is very useful because having a bridge allows a company to manage all of their data pipelines in one centralized location, instead of having a bunch of different application teams manage their one-off data pipelines.

Mirroring Kafka Data

Perhaps the biggest Brooklin bridge use case at LinkedIn is mirroring Kafka data. Kafka is used heavily at LinkedIn – it actually came out of LinkedIn, as most of you may know – and it stores a lot of data such as tracking, logging, and metrics information. Brooklin is used to aggregate all of this data from each of our datacenters, to make it easy to access and process in one centralized place. Brooklin is also used to move data between LinkedIn and external cloud services; for example, we can move data between Kafka at LinkedIn and Kafka running on Azure. Perhaps the biggest scale test for Brooklin was the fact that we have fully replaced Kafka MirrorMaker at LinkedIn with Brooklin.

For those of you who are not familiar with Kafka MirrorMaker, it is a tool that is included within the open source Kafka project that supports streaming data from one cluster to another and many companies use Kafka MirrorMaker for this purpose.

At LinkedIn, we were seeing a tremendous growth in the amount of Kafka data, year after year, and in fact we were probably one of the largest users of Kafka in terms of scale. As this Kafka data continued to grow, we were seeing some real scale limitations with Kafka MirrorMaker. It wasn’t scaling well, it was difficult to operate and manage at the scale that LinkedIn wanted to use it, and it didn’t provide very good failure isolation.

For example, if Kafka MirrorMaker hit any problems mirroring a particular partition or a particular topic, oftentimes the whole Kafka MirrorMaker cluster would simply fall over and die. This is problematic because all of the topics and partitions that the cluster is responsible for mirroring go down with it; they are all halted just because there is a problem with one particular pipeline or topic. In fact, to mitigate this operational nightmare, we were running a nanny job whose sole purpose was to constantly check the health of all of these Kafka MirrorMaker clusters and, if one of them was down, restart it.

This wasn’t working well for LinkedIn. What did we do? We already had a generic streaming solution in Brooklin that we had recently built, so we decided to double down on Brooklin and use it to mirror Kafka data as well. When we talk about Brooklin pipelines that are used for Kafka mirroring, these are simply Brooklin pipelines that are configured with a Kafka source and a Kafka destination. It was almost as easy as that. By doing this, we were able to get rid of some of the operational complexities around managing Kafka MirrorMaker and this is best displayed by showing you what the Kafka MirrorMaker topology used to look like in LinkedIn.

I’ll take a very specific example, where we want to mirror two different types of Kafka data – tracking and metrics data. We want to mirror from the tracking and metrics clusters into the respective aggregate tracking and aggregate metrics clusters, in, let’s say, three of our datacenters. To do this very simple thing, we actually had to provision and manage 18 different Kafka MirrorMaker clusters (one per data type, per source datacenter, per aggregate destination: 2 × 3 × 3 = 18). You can imagine that if we wanted to make a simple change – say, adding some topics to the metrics-to-aggregate-metrics mirroring pipelines – we had to make nine config changes, one for each of the nine Kafka MirrorMaker clusters involved, and deploy to those nine clusters as well.

In reality, we have more than just two types of Kafka clusters at LinkedIn, and we have more than just three datacenters, so you can imagine the spaghetti-like topology that we had with Kafka MirrorMaker. In fact, we had to manage probably over 100 Kafka MirrorMaker clusters at that time. This just wasn’t working well for us.

Let me show you what the topology looks like when we replaced Kafka MirrorMaker with Brooklin. With the same example of mirroring tracking and metrics data into aggregate tracking and aggregate metrics data, in three of our datacenters, we only need three Brooklin clusters to do this. We are able to reduce some of the operational overhead by doing this because Brooklin is multitenant, so it can power multiple data pipelines within the same cluster. Additionally, a lot of management overhead is taken away from the picture because we are able to dynamically provision and individually configure these pipelines. So, whenever we want to make a simple change, like the one I mentioned earlier of adding some topics to the whitelist, it’s very easy to do that. We don’t need to make any config changes, we don’t need to do any deployments.

Brooklin Kafka Mirroring

Brooklin’s solution for mirroring Kafka data was designed and optimized for stability and operability, because of our experiences with Kafka MirrorMaker. We had to add some additional features to Brooklin to make this work. Before this, Brooklin could only support consuming from one topic in the source and writing to one topic in the destination, but Brooklin mirroring was the first use case where we needed to support a star-to-star, or many-to-many, configuration, so we had to add features to operate at the level of each individual topic. We added the ability to manually pause or resume mirroring at every granularity: we can pause the entire mirroring pipeline, pause an individual Kafka topic, or pause just a single partition if we want to. Additionally, if Brooklin faces any issues during mirroring – for example, issues mirroring one particular topic or partition – it has the ability to auto-pause those partitions. Contrast that with Kafka MirrorMaker, where, if it saw any issues with a particular topic, it would simply fall over and leave all of the other partitions hanging. With Brooklin, we can auto-pause, and this is very useful whenever we see transient issues with mirroring a particular topic or partition. It also reduces the need for a nanny job or any manual intervention with these pipelines.
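
The per-partition pause-and-resume behavior maps naturally onto the pause() and resume() calls of the standard Kafka consumer. The sketch below is my own simplified illustration of auto-pausing a misbehaving partition and resuming it after a configurable delay; it is not Brooklin's actual implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/** Simplified sketch: auto-pause a misbehaving partition, resume it later. */
public class AutoPauseTracker {
    private final KafkaConsumer<byte[], byte[]> consumer;
    private final Duration pauseDuration;                       // configurable resume delay
    private final Map<TopicPartition, Instant> pausedUntil = new HashMap<>();

    public AutoPauseTracker(KafkaConsumer<byte[], byte[]> consumer, Duration pauseDuration) {
        this.consumer = consumer;
        this.pauseDuration = pauseDuration;
    }

    /** Called when mirroring a partition keeps failing: pause only that partition. */
    public void autoPause(TopicPartition tp) {
        consumer.pause(Set.of(tp));                              // other partitions keep flowing
        pausedUntil.put(tp, Instant.now().plus(pauseDuration));
    }

    /** Called periodically from the poll loop: resume partitions whose timer has expired. */
    public void maybeResume() {
        Instant now = Instant.now();
        pausedUntil.entrySet().removeIf(entry -> {
            if (now.isAfter(entry.getValue())) {
                consumer.resume(Set.of(entry.getKey()));
                return true;
            }
            return false;
        });
    }
}
```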

Whenever a topic is auto-paused, Brooklin has the ability to automatically resume them after a configurable amount of time. In any case when anything is paused, the flow of messages from other partitions is completely unaffected. We really built this with isolation at every single level.

Application Use Cases

Now that you know the two scenarios that Brooklin was designed to handle – Change Data Capture and Streaming Bridge – I’ll talk about some of the real use cases for Brooklin within LinkedIn. The first one is very popular: caches. Whenever there is a write to a particular data source, you want your cache to stay fresh and reflect this write too, and Brooklin can help keep your cache in sync with the real source.

The second one is similar to what I mentioned earlier, it’s for Search Indices. Suppose that you worked at a company and now you work at company X, you want to show up in the search results of company X and no longer in the search results of your previous company. You can use Brooklin to power Search Indices as well, this is what we do at LinkedIn.

Whenever you have source-of-truth databases, you might want to do some offline processing on the full dataset. One way you can do this is to take daily or twice-daily snapshots of your entire dataset and ship them over to HDFS, where you can do offline processing and run reports. But if you do this, the datasets and the results can become pretty stale, because you are only taking these snapshots once or twice a day. Instead, you can consume from a Brooklin stream to merge the change events into your dataset in HDFS more frequently, and do ETL in a more incremental fashion.

Another common use case is for materialized views. Oftentimes, when you are provisioning a database, you choose a primary key that best fits the access patterns of the online queries that will always be retrieving this data. But there can be some applications that want to access the data using a different primary key, for example, so you can use Brooklin to create what is called a materialized view of the data by streaming this data into another messaging system.

Repartitioning is similar to the materialized views use case, but it’s simpler – it’s just to repartition the data based on a particular key. I’ll skip through some of these use cases in the interest of time, I think I talked about a few of them before. I’d really like to get into the architecture of Brooklin.

Architecture

To describe the architecture, I’ll focus on a very specific and simple use case, which is where Brooklin streams updates made to member profile data – the scenario I described earlier. Brooklin retrieves all the changes being made to the member database and streams them into a Kafka topic, where the news feed service can consume these change events, run some processing logic, and then enable features such as showing the job update on the news feed of all of Mochi’s connections. In this case, the source is the profile table of the member database in Espresso, the destination is Kafka, and the application that wants to consume from it is the news feed service.

Datastream

Before I jump into the architecture, it’s important to understand the fundamental concept in Brooklin, which is known as a datastream. A datastream describes the data pipeline: it maps the specific source you want to stream data from to the specific destination you want to stream data into. It also holds additional configuration and metadata about the pipeline, such as who owns the datastream, who should be allowed to read from it, and which application needs to consume from it.

In this particular example, we will be creating a datastream called MemberProfileChangeStream. Its source is Espresso, specifically the profile table of the member database, which has eight partitions. The place we want to stream this data into is Kafka, specifically a topic called ProfileTopic, which also has eight partitions. We also have some metadata in the datastream, such as the application that should have read permissions on the stream and which team at LinkedIn owns the datastream – which in this case is the news feed team.
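
As a rough illustration of the information that this datastream carries, here is a plain Java sketch of its fields; the field names and shape are my own, purely for illustration, and are not Brooklin's real datastream schema.

```java
import java.util.List;

/** Illustrative only: the kind of information a datastream carries (not Brooklin's real schema). */
public record DatastreamSpec(
        String name,               // "MemberProfileChangeStream"
        String sourceConnector,    // "Espresso"
        String sourceResource,     // "MemberDatabase/ProfileTable"
        int sourcePartitions,      // 8
        String destinationType,    // "Kafka"
        String destinationTopic,   // "ProfileTopic"
        int destinationPartitions, // 8
        String owner,              // "news-feed-team"
        List<String> readers) {    // applications allowed to consume

    public static DatastreamSpec memberProfileExample() {
        return new DatastreamSpec(
                "MemberProfileChangeStream",
                "Espresso", "MemberDatabase/ProfileTable", 8,
                "Kafka", "ProfileTopic", 8,
                "news-feed-team", List.of("news-feed-service"));
    }
}
```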

To get Brooklin started, we need to create a datastream. To do this, at LinkedIn we offer a self-service UI portal where application teams can go to provision their data pipelines within minutes, whenever they need to. An application developer goes to this self-service portal, selects that they need to stream data from Espresso – specifically the profile table of the member database – and selects that they need to stream this data into a Kafka topic. It’s pretty easy: they just select the source and destination and click “create”. The create request goes through a load balancer to one of the Brooklin instances in our cluster – in this diagram, the instance on the left-hand side – and specifically to what is called the datastream management service (DMS), which is our REST API for operations on datastreams. We can create, read, update, and delete datastreams, and configure these pipelines dynamically.
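
Under the hood, the portal's "create" button amounts to a request against the DMS REST API. The endpoint path and JSON payload below are hypothetical, purely to illustrate what dynamically provisioning a pipeline looks like; they are not the real DMS interface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateDatastreamExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and payload, for illustration only.
        String body = """
                {
                  "name": "MemberProfileChangeStream",
                  "source": "Espresso:/MemberDatabase/ProfileTable",
                  "destination": "Kafka:/ProfileTopic"
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://brooklin.example.com/datastream"))  // behind the load balancer
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("DMS responded: " + response.statusCode());
    }
}
```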

Once the Brooklin instance receives this request to create a datastream, it simply writes the datastream object into ZooKeeper. Brooklin relies on ZooKeeper to store all of our datastreams, so that it knows which data pipelines it needs to process. Once the datastream has been written to ZooKeeper, there is a leader coordinator that is notified of this new datastream. Let me explain what a coordinator is. The coordinator is the brains of a Brooklin host: each host has a coordinator, and upon startup of the cluster, one of the coordinators is elected as the leader, using ZooKeeper’s standard leader-election recipe. Whenever a datastream is created or deleted, the leader coordinator is the only one that is notified.

Because Brooklin is a distributed system, we don’t want this datastream to be processed by just a single host in the cluster. Instead, we want to divide this work among all of the hosts in the cluster. The leader coordinator is responsible for calculating this work distribution: it breaks the datastream down into what are called datastream tasks, which are the actual work units of Brooklin.

The tasks may, for example, be split up by partition – it really depends on the source. The leader coordinator calculates the work distribution, comes up with a task assignment, and simply writes these assignments back into ZooKeeper. So ZooKeeper is not only our metadata store for datastreams, it is also used for coordination across the Brooklin cluster: it communicates the assignments to all of the coordinators in the rest of the cluster. Each coordinator that was assigned a datastream task is now aware that it has something to process.
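
As a sketch of the leader's calculation, assuming the simplest strategy of splitting by source partition and distributing partitions round-robin across instances (the real strategy is connector-dependent and this is not Brooklin's code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified sketch of how a leader might split a datastream into per-partition tasks. */
public class LeaderAssignmentSketch {

    /** A work unit: one datastream and a subset of its source partitions. */
    public record DatastreamTask(String datastream, List<Integer> partitions) {}

    /** Round-robin the source partitions across the live instances. */
    public static Map<String, List<DatastreamTask>> assign(
            String datastream, int sourcePartitions, List<String> instances) {

        Map<String, List<Integer>> partitionsPerInstance = new HashMap<>();
        for (int p = 0; p < sourcePartitions; p++) {
            String instance = instances.get(p % instances.size());
            partitionsPerInstance.computeIfAbsent(instance, i -> new ArrayList<>()).add(p);
        }

        Map<String, List<DatastreamTask>> assignment = new HashMap<>();
        partitionsPerInstance.forEach((instance, partitions) ->
                assignment.put(instance, List.of(new DatastreamTask(datastream, partitions))));
        // In Brooklin, the resulting assignment is persisted to ZooKeeper, which the
        // other coordinators watch to learn which tasks they now own.
        return assignment;
    }
}
```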

The coordinators now have the datastream tasks that are assigned to them, and they need to hand these tasks off to the Brooklin consumers. The Brooklin consumers are the ones that are responsible for actually fetching and reading the data from the source. How do the coordinators know which Brooklin consumer to hand a task off to? As I showed you in the datastream definition, there is a specific source – in this case Espresso – and this information is also held within the work units, the datastream tasks. So the coordinator knows that it needs to hand this task off to the Espresso consumer specifically.

Once the Espresso consumer receives this task, it can start the processing. It starts streaming data from whatever source it needs to stream from – in this case Espresso – and the datastream task also tells it that it needs to stream from the member database, specifically the profile table. The consumers each start streaming their own portion of the dataset, and they propagate this data over to the Brooklin producers.

How do the consumers know which producers to pass this data onto? Similar to what I mentioned earlier, the datastream task has the mapping between the source type and the destination type and in this particular example it is Kafka. The Espresso consumers know that they need to deliver their data over to the Kafka producers within their instance.
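
Conceptually, this handoff is just a lookup keyed on the source and destination types recorded in the task. Below is a hypothetical sketch; the interface and method names are invented for illustration and are not Brooklin's actual classes.

```java
import java.util.Map;

/** Hypothetical sketch of routing a task to the right consumer and producer by type. */
public class ConnectorRouter {

    interface SourceConsumer { void startTask(String datastreamTask); }
    interface DestinationProducer { void send(String destination, byte[] key, byte[] value); }

    private final Map<String, SourceConsumer> consumersByType;      // e.g. "Espresso" -> Espresso consumer
    private final Map<String, DestinationProducer> producersByType; // e.g. "Kafka"    -> Kafka producer wrapper

    public ConnectorRouter(Map<String, SourceConsumer> consumers,
                           Map<String, DestinationProducer> producers) {
        this.consumersByType = consumers;
        this.producersByType = producers;
    }

    /** The coordinator hands a task to the consumer that matches its source type. */
    public void dispatch(String sourceType, String datastreamTask) {
        consumersByType.get(sourceType).startTask(datastreamTask);
    }

    /** The consumer looks up the producer that matches the task's destination type. */
    public DestinationProducer producerFor(String destinationType) {
        return producersByType.get(destinationType);
    }
}
```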

Once the Kafka producers start to receive this data from the Espresso consumers, they can start to feed it into the destination – in this case, the profile topic in Kafka. Data starts flowing into Kafka and, as soon as this happens, applications can start to asynchronously consume from the destination. In this example, the news feed service can start to read those profile changes from Kafka.

Earlier in this presentation, I talked about many applications potentially needing access to the same data. Because profile data is very popular with nearline applications at LinkedIn, there could be a notifications service and a search indices service that are also interested in profile updates. In this scenario, the Kafka destination – the profile topic – can be read by many applications. Each of these application owners still needs to go to our self-service UI portal to create their own datastream, so each team still has its own logical datastream. However, under the hood Brooklin knows how to re-use specific datastream tasks, or work units, so that if a particular source and destination combination is very common, it won’t do the same work twice. It won’t stream from the database or write into a separate Kafka topic twice; it will re-use those components and simply allow the additional services to consume from the Kafka topic that already exists with that data.

Overview of Brooklin’s Architecture

Stepping back from the particular example that I’ve been talking about, here is a very broad overview of Brooklin’s architecture. Brooklin doesn’t depend on very much; it only requires ZooKeeper, for storage of datastream metadata as well as for coordination across the Brooklin cluster. The datastream management service is our REST API, which allows users or services to dynamically provision and configure individual pipelines. Coordinators are the brains of each Brooklin instance; they are responsible for acting upon changes to datastreams, as well as for distributing task assignments (on the leader) and receiving task assignments from ZooKeeper.

In reality, Brooklin can be configured with more than just an Espresso consumer and a Kafka producer; it can be configured with any number of different types of consumers and any number of different types of producers. In fact, you can configure your Brooklin cluster to be able to consume from MySQL, Espresso, Oracle and write to EventHubs, Kinesis, and Kafka. One Brooklin cluster can take care of all of this variety of source and destination pairs.

Brooklin – Current Status

That was a very high-level view of Brooklin’s architecture, and now I’ll talk about where Brooklin is today. Today, Brooklin supports streaming from a variety of sources – namely Espresso, Oracle, Kafka, EventHubs, and Kinesis – and it has the ability to produce to both Kafka and EventHubs. Note that Brooklin’s consumer and producer APIs are standardized, so support for additional sources and destinations can be plugged in.

Some of the features we have built into Brooklin so far: it is multitenant, so it can power multiple streams at the same time within the same cluster. In terms of guarantees, Brooklin guarantees at-least-once delivery, and order is maintained at the partition level: if multiple updates are written to the same partition, we guarantee that these events will be delivered to that partition in the same order.
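
One common way to obtain at-least-once delivery in a pipeline like this is to commit the source checkpoint only after the corresponding records are known to have reached the destination, so a crash replays records rather than dropping them. Here is a minimal Kafka-to-Kafka illustration of that ordering – my own sketch with assumed broker addresses and topic names, not Brooklin's code.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtLeastOnceMirrorSketch {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "source-cluster:9092");    // assumption
        cprops.put("group.id", "mirror-sketch");
        cprops.put("enable.auto.commit", "false");                 // commit only after the produce succeeds
        cprops.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cprops.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "destination-cluster:9092"); // assumption
        pprops.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        pprops.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(List.of("SourceTopic"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                    // Re-using the source record's key preserves per-partition ordering.
                    producer.send(new ProducerRecord<>("DestinationTopic", r.key(), r.value()));
                }
                producer.flush();       // simplest (blocking) way to wait for the sends
                consumer.commitSync();  // checkpoint only after the destination has the data
            }
        }
    }
}
```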

Additionally, we made some improvements specifically for the Kafka mirroring use case in Brooklin: finer control of each pipeline – the pause and resume features that I talked about earlier – as well as improved latency with what is known as a flushless-produce mode.

If you are familiar with Kafka, you know that the Kafka producer can flush all the messages in its buffer by calling KafkaProducer.flush(). However, this is a blocking, synchronous call, and it can lead to significant delays when mirroring from source to destination. We added the ability for Brooklin’s Kafka mirroring to run flushless, where we don’t call flush at all; instead, we keep track of the checkpoint for every single message and record when each one has been successfully delivered to the destination Kafka cluster.
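
Here is a sketch of the flushless idea – my own illustration, not Brooklin's implementation: instead of the blocking flush used in the previous sketch, attach a callback to every send, record which source offsets the destination has acknowledged, and only advance the consumer checkpoint up to the highest contiguous acknowledged offset.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

/** Sketch of checkpointing without KafkaProducer.flush(). */
public class FlushlessCheckpointer {
    // Per source partition: offsets acked by the destination, and the next offset safe to commit.
    private final Map<TopicPartition, ConcurrentSkipListSet<Long>> acked = new ConcurrentHashMap<>();
    private final Map<TopicPartition, Long> safeToCommit = new ConcurrentHashMap<>();

    public void sendMirrored(KafkaProducer<byte[], byte[]> producer, String destTopic,
                             TopicPartition sourcePartition, long sourceOffset,
                             byte[] key, byte[] value) {
        producer.send(new ProducerRecord<>(destTopic, key, value), (metadata, exception) -> {
            if (exception == null) {
                // Record the ack; the poll loop later advances the commit point.
                acked.computeIfAbsent(sourcePartition, tp -> new ConcurrentSkipListSet<>())
                     .add(sourceOffset);
            }
            // On failure the offset is never acked, so the checkpoint cannot move past it
            // and the record will be re-read after a restart (at-least-once).
        });
    }

    /** Advance the commit point while the lowest outstanding offsets have been acked. */
    public long commitPointFor(TopicPartition sourcePartition, long firstUncommittedOffset) {
        ConcurrentSkipListSet<Long> offsets =
                acked.getOrDefault(sourcePartition, new ConcurrentSkipListSet<>());
        long next = safeToCommit.getOrDefault(sourcePartition, firstUncommittedOffset);
        while (offsets.remove(next)) {
            next++;
        }
        safeToCommit.put(sourcePartition, next);
        return next;  // the consumer may commit up to this offset for the partition
    }
}
```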

For streams coming from Espresso, Oracle, or EventHubs as a source, Brooklin is moving 38 billion messages per day. We manage over 2,000 datastreams for the company, across more than 1,000 unique sources, and we serve over 200 applications that need data from these sources.

Specifically for Brooklin mirroring Kafka data, which is huge at LinkedIn, we are mirroring over 2 trillion messages / day with over 200 mirroring pipelines and mirroring tens of thousands of topics.

Brooklin – Future

For the future of Brooklin, we want to continue adding support to consume from more sources and write to more destinations. Specifically, we want to work on the ability to do change capture from MySQL, consume from Cosmos DB and Azure SQL, and add the ability to write to Azure Blob storage, Kinesis, Cosmos DB, Azure SQL, and Couchbase.

We are also continuing to optimize the Brooklin service itself. We want Brooklin to be able to auto-scale its datastream tasks based on the needs of the traffic. We also want to introduce something called passthrough compression. The Kafka mirroring case is very simple: you just want to read some bytes of data from the source Kafka topic and deliver those bytes into the destination Kafka topic, without doing any processing on top of the data. However, with the standard Kafka consumer and producer libraries, upon consumption the Kafka consumer will de-serialize the messages stored in the source, and then, when we produce to the destination during mirroring, the Kafka producer will re-serialize these messages and compress them again for the destination.

Because we are mirroring just bytes of data, we don’t care about the message boundaries of the Kafka data; we just want to mirror chunks of data from Kafka as is. We want to skip the de-serialization in the consumer and the re-serialization in the producer, which is what we call passthrough compression. We know that this can save a lot of CPU cycles for Brooklin and therefore reduce the size of our Brooklin mirroring clusters.

We also want to do some read optimizations. If there are a couple of datastreams interested in the member database – say there are two tables inside the member database that customers are interested in – today we would have to create one datastream to consume from table A and another to consume from table B. We want to optimize this so that we only need to read from the source once and can deliver the messages into the topic for table A and the topic for table B.

The last thing for the future of Brooklin: I am very excited to announce that we are planning to open-source Brooklin in 2019, and this is coming very soon. We have been planning to do this for a while, and it is finally coming to fruition.

Questions and Answers

Participant 1: I am really happy to hear about the open-sourcing at the end, I was going to ask about that. My team uses Kafka Connect, which is pretty similar to a lot of this and is part of Kafka. Can you give a quick pitch for why you might use Brooklin over Kafka Connect?

Kung: This is a pretty common question, and it’s a good one. As far as I know, with Kafka Connect you can configure a source or a sink, but my understanding is that you cannot configure both. If you configure Kafka Connect with a source of your choice, then the destination is automatically Kafka, and if you configure it with a sink, then your source is automatically Kafka.

Brooklin provides a generic streaming solution – we want to be able to plug in the ability to read from anywhere and write to anywhere, and Kafka doesn’t even have to be involved in the picture at all.

Participant 2: Fantastic talk. My team is a heavy user of Cosmos DB, which I saw on the “Future” slide, and I was curious where that is on the roadmap, and whether there are plans to support all the APIs that Cosmos offers – the document DB API, Mongo, etc.?

Kung: I have to say that this is very early. We have just started exploring, so we are not quite there in providing support for Cosmos just yet, but it is definitely on the roadmap.

Participant 3: How do you deal with schema changes from the tables up to the clients who consume this data?

Kung: How we handle schema changes varies by the source, so it depends. All of these schema changes need to be handled by the consumer: the consumer implementations that are specific to the source you are consuming from each need to take care of schema updates, and we handle it differently for each of our data sources. I’d be happy to give you a deep dive later if you can find me at this conference, to talk about how we solve it for Espresso versus how we solve it for Oracle, for example. It’s pretty low level.

Participant 4: Two questions. One is that we use Attunity or GoldenGate, one of the more established CDC products – how do you compare with those products? And number two, we have Confluent infrastructure that has a Connect framework. How do you play into this? How do you merge Brooklin platform into a Confluent platform?

Kung: Your first question: GoldenGate is for capturing change data from Oracle sources, right? As I mentioned earlier, Brooklin is designed to handle any source, so it doesn’t have to be a database; it could be a messaging system. And it doesn’t have to be Oracle either, but one of the features of Brooklin is to consume changes captured out of Oracle.

Participant 4: Maybe we should look for a more feature-by-feature comparison and if there is a website or someplace – it’s deeper than that.

Kung: I am happy to talk to you afterwards as well. Your second question was whether we plug into Kafka Connect at all. We don’t – Brooklin as a product doesn’t integrate with Kafka Connect. To elaborate, we use the native Kafka consumer and producer APIs to read from Kafka and produce to Kafka; we don’t use Kafka Connect.

Participant 5: On partitioning and on producing to Kafka, you mentioned ensuring partition keys. Do you guys have an opinion about how you go about finding those partition keys, mapping from one source to another?

Kung: How do you come up with the right partition keys when producing to Kafka? That is not something that Brooklin necessarily handles; each application that’s writing to Kafka needs to select the right partitioning key, depending on the ordering it needs. If it needs to guarantee that all updates to a particular profile arrive in order, then it would choose the profile ID as the partition key, for example – it just depends on the use case. Some applications don’t care about ordering within partitions, so they just use Kafka’s default partitioning strategy, which sends records to any partition, round-robin.
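
For example, with the standard Kafka producer, keying each record by profile ID keeps all updates for one profile in one partition, and therefore in order, while sending without a key leaves partition placement to the producer's default strategy. The topic name and value type below are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: the choice of partition key determines what ordering you get. */
public class PartitionKeyExamples {

    /** All updates for one profile share a key, so they land in the same partition, in order. */
    static void sendOrderedByProfile(KafkaProducer<String, String> producer,
                                     String profileId, String changeEvent) {
        producer.send(new ProducerRecord<>("ProfileTopic", profileId, changeEvent));
    }

    /** No key: the producer's default partitioning spreads records across partitions,
        so there is no per-profile ordering guarantee. */
    static void sendUnordered(KafkaProducer<String, String> producer, String changeEvent) {
        producer.send(new ProducerRecord<>("ProfileTopic", null, changeEvent));
    }
}
```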

Participant 6: Thank you for this amazing talk; I’m looking forward to seeing the open source release. Can you tell us how easily we could extend it and write plug-ins? Will you provide a framework, and will there be a constraint on the languages used to write plug-ins for more sources or destinations? A preview of the open source, basically.

Kung: Brooklin is developed in Java; when we open-source it, it will be Java and we designed it so that it’s easy to plug in your own consumer or producer, so the APIs are very simplified. I think each one of these APIs has four or five methods that you can use to implement your specific consumption from a particular source and write to a particular destination. We’ve done this a bunch of times, as you saw on the “Current” slide, we already support a bunch of sources. We built that with extensibility in mind, so hopefully it should be very easy for you to contribute to open-source by writing a consumer or producer in Java.
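
As a rough preview of what such a plug-in surface could look like, here is a hypothetical Java sketch; the interface and method names are invented for illustration and are not Brooklin's actual connector APIs.

```java
import java.util.List;

/** Hypothetical plug-in surfaces, for illustration only; not Brooklin's real APIs. */
public interface StreamingPlugins {

    /** A source-side plug-in: told which tasks it owns, it reads and forwards events. */
    interface SourceConnector {
        void start();
        void stop();
        void onTasksAssigned(List<String> datastreamTasks);  // react to new work units
        void validateDatastream(String datastreamJson);      // reject unsupported configurations
    }

    /** A destination-side plug-in: accepts events and reports when they are durable. */
    interface DestinationProducer {
        void send(String destination, byte[] key, byte[] value, Runnable onSuccess);
        void close();
    }
}
```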

Participant 7: Maybe it’s orthogonal to the problem you have looked at, but what kind of data cataloguing and data management tools do you use as a front end for Brooklin – what data is available and who can access it? Some data is premium and some is not.

Kung: I briefly touched upon this, but to go into a little more detail: we have a self-service portal UI where our developers can go to provision these pipelines – at LinkedIn it is called Nuage; it is like a cloud-based management portal. The Nuage team works with all sorts of source teams – the Espresso, MySQL, Oracle, and Kafka teams – to gather this metadata, and they house it so they can surface it in the UI, so that application developers know what is and isn’t available.

 


Recorded at:

Jul 31, 2019
