Celia Kung on LinkedIn's Brooklin Data Streaming Service

Increasing data volumes and the proliferation of database systems in organizations make it challenging to stream data to applications in near real time. Celia Kung from LinkedIn's data infrastructure team spoke at the QCon New York 2019 Conference last week about Brooklin, a managed data streaming service that supports pluggable sources and destinations. These sources and destinations can be data stores or messaging systems, which makes the solution flexible and extensible. Brooklin is part of the streams infrastructure platform developed at LinkedIn.

Brooklin is a Java-based, multi-tenant data ingestion service that can stream data from multiple sources to different destinations. It can read from messaging systems (like Kafka or Event Hubs), databases (Oracle, MySQL), or other datastores (e.g. NFS) and publish to destinations like Kafka, Event Hubs, HDFS, etc. The streams can be individually configured and dynamically provisioned. Brooklin doesn't perform "select * from table" queries on the source database for data replication. Instead, it reads the source database's logs, such as the binlog in MySQL. There is a self-service UI portal that developers can use to configure various data streams. Apache ZooKeeper is used to store metadata and coordinate the nodes.
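The difference between querying a whole table and reading a change log can be illustrated with a minimal sketch. This is not Brooklin's code or the MySQL binlog format; the ChangeEvent and LogTailer names and the in-memory list standing in for the log are assumptions made purely for illustration. The reader keeps a checkpoint and returns only the entries appended since the last poll.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical change log (not a real binlog record)."""
    offset: int   # position of the entry in the log
    table: str
    op: str       # "insert", "update", or "delete"
    key: str
    value: dict


class LogTailer:
    """Hypothetical reader that returns only entries newer than its checkpoint."""

    def __init__(self, log: List[ChangeEvent], checkpoint: int = -1) -> None:
        self._log = log
        self.checkpoint = checkpoint  # offset of the last event already returned

    def poll(self) -> List[ChangeEvent]:
        # Select entries past the checkpoint instead of rescanning every row.
        new_events = [e for e in self._log if e.offset > self.checkpoint]
        if new_events:
            self.checkpoint = new_events[-1].offset
        return new_events


# Usage: the second poll sees only the event appended after the first poll.
log = [ChangeEvent(0, "members", "insert", "m-1", {"headline": "Engineer"})]
tailer = LogTailer(log)
first_batch = tailer.poll()
log.append(ChangeEvent(1, "members", "update", "m-1", {"headline": "Data engineer"}))
second_batch = tailer.poll()
```

Because the checkpoint advances with each poll, the cost of a poll depends on how much has changed rather than on the size of the table, which is the general appeal of log-based capture.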

The idea behind this framework is to let developers focus on the data processing logic in their applications rather than on moving data between different systems. LinkedIn has been using Brooklin in production for the last three years for a variety of use cases, such as change data capture (CDC). The framework can also be used as a replacement for Kafka MirrorMaker for replicating data between different Kafka instances.

The primary use cases for this framework are nearline applications that require near real-time responses. Several applications at LinkedIn fall into this category, like Live Search Indices and Notifications.

Kung discussed two different scenarios of using Brooklin for data streaming. The first scenario, change data capture, is about capturing live updates made by a member on the LinkedIn website. Multiple services, like the Notifications Service, Search Indices Service, and News Feed Service, connect to the same database and retrieve the same dataset to display the updates on the website.
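One way to picture change data capture feeding several services is a single change stream that delivers every event to each subscriber. The sketch below is only an illustration under that assumption: ChangeStream, the handler functions, and the event fields are hypothetical names, not part of Brooklin or of LinkedIn's services.

```python
from typing import Callable, Dict, List

ChangeHandler = Callable[[dict], None]


class ChangeStream:
    """Hypothetical in-process stream that fans each change out to all subscribers."""

    def __init__(self) -> None:
        self._handlers: List[ChangeHandler] = []

    def subscribe(self, handler: ChangeHandler) -> None:
        self._handlers.append(handler)

    def publish(self, event: dict) -> int:
        # Every subscriber receives the same event; returns the subscriber count.
        for handler in self._handlers:
            handler(event)
        return len(self._handlers)


# Two stand-in consumers, loosely modeled on a search index and a notifier.
search_index: Dict[str, str] = {}
notifications: List[str] = []


def update_search_index(event: dict) -> None:
    search_index[event["member_id"]] = event["headline"]


def queue_notification(event: dict) -> None:
    notifications.append(f"{event['member_id']} updated their headline")


stream = ChangeStream()
stream.subscribe(update_search_index)
stream.subscribe(queue_notification)
delivered = stream.publish({"member_id": "m-1", "headline": "Data engineer"})
```

In this arrangement each consumer reacts to the same change once, instead of each one retrieving the dataset from the database on its own.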

The second scenario, the streaming bridge, is a data pipe that moves data between different environments, such as cloud services, clusters, or even data centers. For example, a streaming bridge can be used to transfer data from a Kinesis instance running on AWS to Event Hubs hosted on the Azure cloud platform. It supports the configuration of different data formats, such as Avro or JSON. Different policies for encryption or obfuscation can also be enforced.
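The shape of such a bridge, with a pluggable source and destination, a configurable output format, and policies applied in between, can be sketched roughly as follows. Everything here is an assumption for illustration: the Source and Destination protocols, the in-memory implementations, the obfuscate policy, and run_bridge are hypothetical names rather than Brooklin's API, and JSON is used as the only format to keep the example within the standard library.

```python
import hashlib
import json
from typing import Callable, Iterable, List, Protocol

Record = dict
Policy = Callable[[Record], Record]


class Source(Protocol):
    def read(self) -> Iterable[Record]: ...


class Destination(Protocol):
    def write(self, payload: bytes) -> None: ...


class InMemorySource:
    """Stand-in for a real source such as a stream or queue."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records = list(records)

    def read(self) -> Iterable[Record]:
        return iter(self._records)


class InMemoryDestination:
    """Stand-in for a real destination; collects the encoded payloads."""

    def __init__(self) -> None:
        self.payloads: List[bytes] = []

    def write(self, payload: bytes) -> None:
        self.payloads.append(payload)


def obfuscate(field: str) -> Policy:
    """Build a policy that replaces one field with its SHA-256 hex digest."""
    def apply(record: Record) -> Record:
        masked = dict(record)  # copy so the source record is left untouched
        if field in masked:
            raw = str(masked[field]).encode("utf-8")
            masked[field] = hashlib.sha256(raw).hexdigest()
        return masked
    return apply


def run_bridge(source: Source, destination: Destination,
               policies: List[Policy]) -> int:
    """Move every record across, applying policies and encoding as JSON."""
    count = 0
    for record in source.read():
        for policy in policies:
            record = policy(record)
        destination.write(json.dumps(record, sort_keys=True).encode("utf-8"))
        count += 1
    return count


# Usage: bridge two records and obfuscate the "email" field on the way.
src = InMemorySource([{"id": 1, "email": "a@example.com"}, {"id": 2}])
dst = InMemoryDestination()
moved = run_bridge(src, dst, [obfuscate("email")])
```

Keeping the source, destination, and policies behind small interfaces is what lets a bridge like this be reconfigured without touching the code that moves the data.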

She also discussed the following application use cases:

Cache
Search Indices
ETL or Data warehouse
Materialized Views or Replication
Repartitioning

For mirroring Kafka data between different instances, Brooklin MirrorMaker has completely replaced Kafka MirrorMaker (KMM) at LinkedIn. The several KMM instances that ran in each data center have also been replaced with one Brooklin instance per data center.

Kung mentioned that LinkedIn's Brooklin framework will be open sourced in the near future so that the developer community can start using it in their organizations.
