Confluent, the company behind the Apache Kafka messaging framework, announced last week the general availability of Confluent Platform 3.0, the latest version of its open source platform, which now supports Kafka Streams for real-time data processing. The Confluent Platform can be used to create a scalable data platform built around Apache Kafka, a real-time, distributed, fault-tolerant messaging queue that scales to large numbers of messages.
Kafka Streams is a lightweight solution for real-time processing of data, useful in use cases such as fraud and security monitoring, Internet of Things (IoT) operations, and machine monitoring. It provides a new, native streaming development environment for Kafka. Developers will be able to leverage this library to build distributed stream processing applications using Kafka. Kafka covers the messaging and transport of data, while Kafka Streams covers the processing of that data.
Kafka Streams also supports stateful and stateless processing, as well as distributed, fault-tolerant processing of data. No dedicated cluster, message translation layer, or external dependencies are required to use Kafka Streams. It processes one event at a time rather than in micro-batches of messages, and it allows for late arrival of data and windowing over out-of-order data.
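To make that concrete, here is a minimal sketch of a Kafka Streams application as an ordinary Java program; the topic names and the filtering logic are hypothetical, and the snippet uses a recent version of the Streams DSL, which has evolved since the 0.10 release covered here:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FlaggingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-flagger");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read payment events one at a time, keep only the large ones,
        // and write them to an output topic (both topic names are made up).
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");
        payments.filter((accountId, amount) -> Double.parseDouble(amount) > 10_000)
                .to("flagged-payments");

        // The application is a plain Java process: no dedicated cluster required.
        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because Kafka Streams is a library embedded in a normal Java process, deploying such an application means running a JVM, not standing up a separate processing cluster.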
You can download Confluent Platform 3.0 or check out the documentation of the new release, including the Kafka Streams documentation and Quickstart Guide.
In related news, Confluent also announced last week Confluent Control Center, a commercial product to manage Kafka clusters. Available as part of Confluent Enterprise 3.0, Confluent Control Center is designed to help data engineering teams operationalize Kafka in their organizations. This management tool gives operators and data teams the ability to monitor the different components of a Kafka system, such as topics, producers, and consumers, and to understand what's happening with their data pipelines.
With Control Center, operators can examine their data environment at the message level, understand message delivery and possible bottlenecks, and observe the end-to-end delivery of messages in their native Kafka environment. The Control Center UI allows operators to connect new data sources to the cluster and configure new data source connectors to meet their specific needs.
If you are interested in learning more about Control Center, check out the upcoming webinar.
InfoQ spoke with Joseph Adler (Director of Product Management and Data Science) and Michael Noll (Product Manager) at Confluent about these announcements and how these products can help developers and operations teams.
InfoQ: How does Kafka Streams compare with other stream data processing frameworks like Storm, Spark Streaming or Apache Flink?
Joseph Adler & Michael Noll: Stream processing developers have a lot of different choices for a stream processing framework. The reality is that many of them are already using Kafka for their stream data pipelines. Kafka Streams builds upon the strong technological foundation of Apache Kafka, from which it derives its scalability, elasticity, fault tolerance, and many of its other features. We believe Kafka Streams lowers the barrier to entry into the world of stream processing, thereby allowing many companies to benefit from real-time insights into their business. Kafka Streams also inherits Kafka's security model of encrypting data in transit, which makes it a good choice for industries such as finance.
Frameworks like Spark and Flink are most often used by central data engineering teams to leverage the power of their big data and data warehouse installations. They are designed for "heavy lifting"—running complex processes that can take hours or longer.
Kafka Streams is well-suited to "fast applications" or "stream applications," where a fast response time is critical. The output is a purchase decision, an in-context offer, or a security alert. These developers tend to work in a line of business.
With Kafka Streams, you do not need to install and operate a separate cluster for your real-time processing needs like you do with existing stream processing frameworks. Many people that work on real-time data processing (e.g. for fraud detection, user activity tracking or fleet monitoring) have already opted to use Kafka as the messaging backbone of their data platform, so it's a natural choice to process all this data in a Kafka-native environment with Kafka Streams rather than adding another, separate piece of infrastructure and technology that you must understand, tune, and keep running.
InfoQ: Flink doesn't use micro-batches for stream data processing, which is similar to how Kafka Streams works. Are there any other similarities or differences between Kafka Streams and Flink?
Adler & Noll: Kafka Streams has learned from the industry's previous experience, in both academia and the open source community, with projects such as Apache Samza. This explains similarities in important areas, such as a proper time model to distinguish between event-time and processing-time semantics, and the ability to correctly handle late-arriving, out-of-order data. These features are a must for any stream processing use case in practice.
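As an illustration of the event-time model (not code from the interview), a custom timestamp extractor lets an application order records by when events actually occurred rather than when they happen to be processed. A sketch, assuming a hypothetical `Order` payload and the `TimestampExtractor` interface shape from recent Kafka versions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Event-time semantics: take each record's timestamp from the event payload
// instead of from the wall-clock time at which the record is processed.
public class OrderTimestampExtractor implements TimestampExtractor {

    // Hypothetical payload type carrying the time the event occurred.
    public static class Order {
        public long eventTimeMillis;
    }

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof Order) {
            return ((Order) record.value()).eventTimeMillis;
        }
        return partitionTime; // fall back to the stream's current time estimate
    }
}
```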
A key difference is that Kafka Streams supports elasticity, i.e. the ability to dynamically grow and shrink processing capacity. For example, with Kafka Streams you can start out running your stream processing application on only a single machine to process your incoming business data. When your data volume has grown so that one machine is not enough, then, during live operation and without downtime, you can simply start the same application on additional machines and they will automatically share the work.
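This elasticity comes from Kafka's consumer-group rebalancing: every instance of a Streams application started with the same `application.id` joins the same group, and the input topic partitions are redistributed across instances automatically. A configuration sketch, with placeholder values:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ScalingConfig {
    static Properties baseConfig() {
        Properties props = new Properties();
        // Every instance started with this same application.id joins one
        // consumer group; starting the same binary on another machine during
        // live operation rebalances the partitions, growing capacity with
        // no downtime.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        return props;
    }
}
```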
InfoQ: Kafka Streams supports windowing. Can you talk more about this feature and how it's useful in real-time data processing?
Adler & Noll: Windowing allows you to break your continuous stream of data into smaller chunks. Most commonly this windowing is based on time, for example to perform analysis over 5-minute intervals. Windowing is a critical feature for many use cases such as fraud detection ("In the past this person has never used her credit card more than once in an hour, now we see fifty transactions in the past five minutes; maybe the credit card was stolen") or trending topics ("In the past 24 hours most users on Twitter were interested in the US presidential election, the new Apple MacBook, and the latest Justin Bieber video").
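A minimal sketch of the fraud scenario described above, counting transactions per card over five-minute windows; the topic name and threshold are illustrative, and the DSL methods are from recent Kafka releases rather than the 0.10 API:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class FraudWindows {
    static void buildTopology(StreamsBuilder builder) {
        // Count transactions per card number (the record key) in fixed
        // five-minute windows and surface cards with suspiciously many.
        KStream<String, String> txns = builder.stream("card-transactions");
        txns.groupByKey()
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count()
            .toStream()
            .filter((windowedCard, count) -> count >= 50)
            .foreach((windowedCard, count) ->
                System.out.println("possible fraud: " + windowedCard.key()));
    }
}
```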
InfoQ: Can you discuss the differences between time-based and session-based windowing options, and when to use each?
Adler & Noll: Time-based windowing splits a data stream into, say, chunks of five minutes' worth of data. Think of using a stopwatch: every five minutes, you'd proclaim "new window of data!" Many use cases that require windowing, perhaps the great majority, are based on time.
In comparison, session-based windowing looks beyond such a strict stopwatch rule in order to group related events into so-called sessions. Think of these sessions as periods of activity. A common use case where you need session-based windowing is when analyzing user interaction events, for example to understand how users are reading the Financial Times website or interacting on Facebook.
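The contrast can be sketched in the Streams DSL; `clicks` is a hypothetical topic, the window sizes are arbitrary, and the method names are from recent Kafka releases:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowKinds {
    static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> clicks = builder.stream("clicks");

        // Time-based: fixed five-minute chunks, like reading a stopwatch.
        clicks.groupByKey()
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
              .count();

        // Session-based: events for a key belong to one session for as long
        // as they keep arriving within 30 minutes of each other; a longer gap
        // of inactivity closes the session.
        clicks.groupByKey()
              .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
              .count();
    }
}
```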
InfoQ: Can you discuss the security support in the Kafka framework, in terms of restricting access to messages and topics as well as encrypting the data transmitted through Kafka servers?
Adler & Noll: On the authentication side, Kafka supports SASL/Kerberos, SASL/PLAIN and SSL/TLS. For authorization, Kafka provides ACLs to control read/write/admin access to specific topics, and this can be configured for both authenticated users and for specific IPs.
Data-in-transit can be encrypted using SSL/TLS, and it is encrypted from data producers to Kafka brokers (servers), from the Kafka brokers to data consumers, and for the inter-broker communication that happens within a Kafka cluster.
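For illustration, a client-side configuration that enables encrypted transport might look like the following; the broker address and truststore details are placeholders, and topic-level ACLs themselves are administered on the broker side (typically with the kafka-acls tool):

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties tlsClient() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093"); // TLS listener
        // Encrypt data in transit between this client and the brokers.
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}
```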
InfoQ: Can a Kafka cluster be deployed on Docker containers? Are there any best practices or online resources for developers on this integration?
Adler & Noll: Yes, it is possible to deploy a Kafka cluster on Docker containers. Confluent provides experimental Docker images to run the Confluent Platform, which includes Apache Kafka. That said, customers that run Docker-based Kafka setups are still the exception and not the rule. On the one hand this is because Docker is, relatively speaking, still a young technology that isn’t super-mature yet. On the other hand, Kafka’s role in a data architecture is to store and serve data, i.e. it is a “stateful” service. Docker’s philosophy and best practice is to not run any such stateful service within containers—which preferably should be stateless—so bridging these two slightly orthogonal approaches requires some special care.
InfoQ: What's coming up in Kafka in terms of new features and enhancements?
Adler & Noll: In the forthcoming releases, the Apache Kafka community plans to focus on operational simplicity and stronger delivery guarantees. This work includes improved data balancing, more security enhancements, and support for exactly-once delivery in Apache Kafka. Confluent Platform will gain more clients and connectors, as well as extended monitoring and management capabilities in Confluent Control Center. Also, now that the first version of Kafka Streams has been released with Kafka 0.10, the Kafka community and Confluent will work on extending the functionality of Kafka Streams. One upcoming feature we're working on is a SQL interface for implementing your stream processing applications. This is an example of functionality we want to include to broaden the user base of Kafka Streams, and also to boost stream processing in general.