
Is Batch ETL Dead, and is Apache Kafka the Future of Data Processing?

| Posted by Daniel Bryant on Jan 22, 2018. Estimated reading time: 5 minutes |

Key Takeaways

  • Several recent data trends are driving a dramatic change in the old-world batch Extract-Transform-Load (ETL) architecture: data platforms operate at company-wide scale; there are many more types of data sources; and stream data is increasingly ubiquitous
  • Enterprise Application Integration (EAI) was an early take on real-time ETL, but the technologies used were often not scalable. This led to a difficult choice with data integration in the old world: real-time but not scalable, or scalable but batch.
  • Apache Kafka is an open source streaming platform that was developed seven years ago within LinkedIn
  • Kafka enables the building of streaming data pipelines from “source” to “sink” through the Kafka Connect API and the Kafka Streams API
  • Logs unify batch and stream processing. A log can be consumed via batched “windows”, or in real time by examining each element as it arrives

At QCon San Francisco 2016, Neha Narkhede presented “ETL Is Dead; Long Live Streams” and discussed the changing landscape of enterprise data processing. A core premise of the talk was that the open-source Apache Kafka streaming platform can provide a flexible and uniform framework that supports modern requirements for data transformation and processing.

Narkhede, co-founder and CTO of Confluent, began the talk by stating that data and data systems have significantly changed within the past decade. The old world typically consisted of operational databases providing online transaction processing (OLTP) and relational data warehouses providing online analytical processing (OLAP). Data from a variety of operational databases was typically batch-loaded into a master schema within the data warehouse once or twice a day. This data integration process is commonly referred to as extract-transform-load (ETL).

Several recent data trends are driving a dramatic change in the old-world ETL architecture:

  • Single-server databases are being replaced by a myriad of distributed data platforms that operate at company-wide scale.
  • There are many more types of data sources beyond transactional data: e.g., logs, sensors, metrics, etc.
  • Stream data is increasingly ubiquitous, and there is a business need for faster processing than daily batches.

The result of these trends is that traditional approaches to data integration often end up looking like a mess, with a combination of custom transformation scripts, enterprise middleware such as enterprise service buses (ESBs) and message-queue (MQ) technology, and batch-processing technology like Hadoop.

https://lh6.googleusercontent.com/L1ECZcrL8bFr6dPHhTs59chwB791guKf6VjJcoGXLQ-iHc0qYQE86MHMnZB918wlvt2nJtOx6zIr2XtSCDb4o9t7HDhmmMTU94iyoR7SXdA7ICRSnHpxBB4Bksflz8j-bEpjfjuD

Before exploring how transitioning to modern streaming technology could help to alleviate this issue, Narkhede dove into a short history of data integration. Beginning in the 1990s within the retail industry, businesses became increasingly keen to analyze buyer trends with the new forms of data now available to them. Operational data stored within OLTP databases had to be extracted, transformed into the destination warehouse schema, and loaded into a centralized data warehouse. Although this technology has matured over the past two decades, the data coverage within data warehouses remains relatively low because of the drawbacks of ETL:

  • There is a need for a global schema.
  • Data cleansing and curation is manual and fundamentally error prone.
  • The operational cost of ETL is high: it is often slow and time and resource intensive.
  • ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion.

Enterprise application integration (EAI) was an early take on real-time ETL, and used ESBs and MQs for data integration. Although effective for real-time processing, these technologies could often not scale to the magnitude required. This led to a difficult choice with data integration in the old world: real time but not scalable, or scalable but batch.

https://lh4.googleusercontent.com/3B9NlfTdsoeBTgBchM4vFrN7d786uZcqMGuTvfQosSTHSc2QVIl_lSMrAMYX1wLUKWQDxl6IQFSfbqgPuhB4m6cMmV0l4N7LRxsvLFt2DnigoV34Ewd3Wm9yxVdEzkhtdbPza3-G

Narkhede argued that the modern streaming world has new requirements for data integration:

  • The ability to process high-volume and high-diversity data.
  • Support for real-time processing from the ground up, which drives a fundamental transition to event-centric thinking.
  • Forward-compatible data architectures that can support additional applications needing to process the same data differently.

These requirements drive the creation of a unified data-integration platform rather than a series of bespoke tools. This platform must embrace the fundamental principles of modern architecture and infrastructure, and should be fault tolerant, be capable of parallelism, support multiple delivery semantics, provide effective operations and monitoring, and allow schema management. Apache Kafka, which was developed seven years ago within LinkedIn, is one such open-source streaming platform and can operate as the central nervous system for an organization’s data in the following ways:

  • It serves as the real-time, scalable messaging bus for applications, with no EAI.
  • It serves as the source-of-truth pipeline for feeding all data-processing destinations.
  • It serves as the building block for stateful stream-processing microservices.

Apache Kafka currently processes 14 trillion messages a day at LinkedIn, and is deployed within thousands of organizations worldwide, including Fortune 500 companies such as Cisco, Netflix, PayPal, and Verizon. Kafka is rapidly becoming the storage of choice for streaming data, and it offers a scalable messaging backbone for application integration that can span multiple data centers.

https://lh5.googleusercontent.com/h6hoeP3b_HrqzQNZzh2HG7nPHMhLjK98G7ufp8Cdu9Zr5z0B04sv7qxdmEpw28akLnYAzxTECYB9INrMVdf0V5JaCMCWy58ztCyVxO-IJIY5q1u9bOLDU7kfLL62ygiayyRvDP6j

Fundamental to Kafka is the concept of the log: an append-only, totally ordered data structure. The log lends itself to publish-subscribe (pubsub) semantics, as a publisher can easily append data to the log in immutable and monotonic fashion, and subscribers can maintain their own pointers to indicate current message processing.
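
This log-with-independent-offsets model can be illustrated with a minimal Python sketch (the class names are hypothetical and this is a conceptual model only, not the Kafka client API):

```python
class Log:
    """An append-only, totally ordered sequence of records."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)           # publishers only ever append
        return len(self._records) - 1          # offset of the new record


class Consumer:
    """Each subscriber maintains its own pointer (offset) into the log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log._records[self.offset:]
        self.offset = len(self.log._records)   # advance past what we have seen
        return records


log = Log()
log.append("order-created")
log.append("order-shipped")

fast = Consumer(log)
slow = Consumer(log)
print(fast.poll())   # ['order-created', 'order-shipped']
log.append("order-delivered")
print(fast.poll())   # ['order-delivered']
print(slow.poll())   # all three records: consumers progress independently
```

Because records are immutable and ordered, a slow subscriber never blocks a fast one; each simply reads from its own offset.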

Kafka enables the building of streaming data pipelines — the E and L in ETL — through the Kafka Connect API. The Connect API leverages Kafka for scalability, builds upon Kafka’s fault-tolerance model, and provides a uniform method to monitor all of the connectors. Stream processing and transformations can be implemented using the Kafka Streams API — this provides the T in ETL. Using Kafka as a streaming platform eliminates the need to create (potentially duplicate) bespoke extract, transform, and load components for each destination sink, data store, or system. Data from a source can be extracted once as a structured event into the platform, and any transforms can be applied via stream processing.
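
The “extract once, transform via streams” idea can be sketched as follows (illustrative Python only, not the Connect or Streams APIs; in Kafka these roles are played by source connectors, topics, and stream processors):

```python
# One source "connector" extracts each record once, as a structured event...
def extract(rows):
    for user, amount in rows:
        yield {"user": user, "amount": amount}   # structured event

# ...and any number of stream transforms consume the same event stream,
# each applying a different "T" without re-extracting from the source.
def to_warehouse(events):
    return [(e["user"], e["amount"]) for e in events]      # load as-is

def fraud_alerts(events):
    return [e for e in events if e["amount"] > 1000]       # different view, same data

events = list(extract([("alice", 50), ("bob", 5000)]))
print(to_warehouse(events))   # [('alice', 50), ('bob', 5000)]
print(fraud_alerts(events))   # [{'user': 'bob', 'amount': 5000}]
```

The point is that the extraction happens once; adding a new destination means adding a new transform over the existing event stream, not a new bespoke ETL pipeline.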

https://lh6.googleusercontent.com/Q5Bn3dKqFLJJ0HlRfK3M2jBsmrFs85SuhT82C6nO3R2gtWicrTuLgD7AOzH886A8p9mrhLGT0K-fNDRdoC5qOznpkj6rX8uEi0vW5lzqefgUPkZIt-OuXCpXy4_epht1Q8qkl0kB

In the final section of her talk, Narkhede examined the concept of stream processing — transformations on stream data — in more detail, and presented two competing visions: real-time MapReduce versus event-driven microservices. Real-time MapReduce is suitable for analytic use cases and requires a central cluster and custom packaging, deployment, and monitoring. Apache Storm, Spark Streaming, and Apache Flink implement this. Narkhede argued that the event-driven microservices vision — which is implemented by the Kafka Streams API — makes stream processing accessible for any use case, and only requires adding an embedded library to any Java application and an available Kafka cluster.

The Kafka Streams API provides a convenient fluent DSL, with operators such as join, map, filter, and window aggregates.

https://lh3.googleusercontent.com/hiGvHDdmh_dcf6SPWZdd9yqAPmtRj_JMZVFA0omzrz_rcVVGWHabG-MLrQsL689LL2hD5sxKpjan17fiFV_xxXxJ_oSngVVbj_Zo57Y0HgF5XM6bAWKqRUZ7jrp38xbi4QvKtyYT

This is true event-at-a-time stream processing — there is no micro-batching — and it uses a dataflow-style windowing approach based on event time in order to handle late-arriving data. Kafka Streams provides out-of-the-box support for local state, and supports fast stateful and fault-tolerant processing. It also supports stream reprocessing, which can be useful when upgrading applications, migrating data, or conducting A/B testing.
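
A rough sketch of event-time windowing (illustrative Python, not the Kafka Streams Java DSL): each record carries an event timestamp, records are grouped into fixed windows by that timestamp rather than by arrival time, and a late-arriving record therefore still lands in its correct window:

```python
from collections import defaultdict

WINDOW = 60  # one-minute windows, keyed by event time (not arrival time)

def window_counts(records):
    counts = defaultdict(int)
    for event_time, key in records:          # records may arrive out of order
        window_start = (event_time // WINDOW) * WINDOW
        counts[(window_start, key)] += 1     # late data updates its own window
    return dict(counts)

# The third record arrives last but carries event time 30,
# so it is counted in the [0, 60) window, not a later one.
records = [(30, "clicks"), (130, "clicks"), (30, "clicks")]
print(window_counts(records))   # {(0, 'clicks'): 2, (120, 'clicks'): 1}
```

Kafka Streams adds to this the hard parts the sketch ignores: incremental emission of results, retention limits on how late data may arrive, and fault-tolerant local state.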

Narkhede concluded the talk by stating that logs unify batch and stream processing — a log can be consumed via batched windows or in real time by examining each element as it arrives — and that Apache Kafka can provide the “shiny new future of ETL”.
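
This unification can be made concrete with a tiny sketch: the same log can be consumed in batched windows or element by element (illustrative Python, not an actual Kafka API):

```python
log = ["e1", "e2", "e3", "e4", "e5"]

# Batch: consume the log in fixed-size windows.
def batches(log, size):
    return [log[i:i + size] for i in range(0, len(log), size)]

# Stream: examine each element as it arrives.
def stream(log):
    for record in log:
        yield record

print(batches(log, 2))      # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
print(list(stream(log)))    # ['e1', 'e2', 'e3', 'e4', 'e5']
```

Either consumption style reads the same underlying ordered records, which is why a log-centric platform can serve batch and streaming workloads at once.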

The full video of Narkhede’s QCon SF talk “ETL Is Dead; Long Live Streams” can be found on InfoQ.

About the Author

Daniel Bryant is leading change within organisations and technology. His current work includes enabling agility within organisations by introducing better requirement gathering and planning techniques, focusing on the relevance of architecture within agile development, and facilitating continuous integration/delivery. Daniel’s current technical expertise focuses on ‘DevOps’ tooling, cloud/container platforms and microservice implementations. He is also a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, DZone and Voxxed, and regularly presents at international conferences such as QCon, JavaOne and Devoxx.

Very useful summary by Thomas Betts

After seeing Neha's presentation live, I've re-watched the video and shared it with dozens of co-workers. Even if it doesn't make people change their behavior overnight, (It's hard to break away from ETL when you've been doing it for years) I think everyone comes away with at least a few ideas of how they can approach a problem in new ways.

Having a 5-minute summary to share is very much appreciated.

Very nice, Is still ESB needed? by mehdi mohammadi

I'm wondering: do we still need to use an ESB or integration framework with Kafka?

Re: Very nice, Is still ESB needed? by Daniel Bryant

Hi Mehdi, in an ideal world (or new greenfield project), no, an ESB is not required -- the Kafka Connect API provides the integration points of E and L within ETL, and the Streams API provides the T.

In reality, any brownfield migration will most likely involve the technologies all working side by side, and the Connect API will also help with this.

The book "Kafka: The Definitive Guide" is well worth a read: shop.oreilly.com/product/0636920044123.do

Re: Very nice, Is still ESB needed? by Richard Clayton

I just want to "bump" this comment. "Kafka: The Definitive Guide" is fantastic - this is where you should start with Kafka.

Re: Very nice, Is still ESB needed? by liu rui

Great to see this reply, that's exactly the feedback I want to give to the article.

The original diagram ( the mess one) is a little bit misleading for me because I used to use the exact same diagram to describe the pre-ESB era. :D

For me, Kafka Streams is more about helping the ETL world, by providing events on a variety of topics in a timely style (message retention), with partitions and replicas designed to support parallel processing and failover.

It's quite different from ESBs and queues, which are more to do with transactional data and have the goal of keeping various systems in sync.
