Confluent released KSQL: an interactive, distributed streaming SQL Engine for Apache Kafka. KSQL makes it easy to do stream processing operations like aggregations, joins, windowing, and sessionization on topics in Apache Kafka. Confluent announced the open source streaming SQL engine at the recent Kafka Summit conference in San Francisco.
KSQL allows developers to read, write, and process streaming data in real-time using SQL-like semantics. Some examples of stream processing include comparing two or more streams of data to understand anomalies and respond to them in real time. Unlike other distributed streaming and SQL frameworks, KSQL provides an event-at-a-time streaming SQL engine for Apache Kafka. Prior to KSQL, developers were required to use Java or Python programming to process data streams in Kafka.
Neha Narkhede, co-founder and CTO at Confluent, wrote about the features of KSQL framework and the use cases where it can be applied such as anomaly detection, monitoring, and streaming ETL.
Under the covers, KSQL uses Kafka’s Streams API to manipulate Kafka topics. There are two core abstractions in KSQL that are also the core abstractions in the Streams API: Streams and Tables.
Stream: Stream is a first order construct and a first class citizen in stream processing applications. A stream is an unbounded sequence of structured data ("facts") that are immutable (which means new facts can be inserted to a stream, but existing facts can never be updated or deleted). Streams can be created from a Kafka topic or derived from existing streams and tables.
Table: A Table in Kafka is a view of a STREAM or another TABLE and represents a collection of evolving facts. It is equivalent to a table in traditional databases but is kept continuously updated with the arrival of every new event and supports additional streaming semantics such as windowing. Facts in a table are mutable, which means new facts can be inserted to the table, and existing facts can be updated or deleted. Tables can be created from a Kafka topic or derived from existing streams and tables.
A topic in Apache Kafka can be represented as either a Stream or a Table in KSQL, depending on the intended semantics of the processing on the topic.
The following diagram shows how KSQL works with two steams of data coming into the system.
InfoQ spoke with Narkhede about the KSQL announcement. She talked about the motivation behind creating a SQL interface to run queries against streaming data.
KSQL is a critical part of their vision of a streaming-first data architecture with Kafka. In the streaming-first world, Kafka and KSQL provide the capabilities that were previously either not possible to do in real-time or were cumbersome. The Kafka log is the core storage abstraction for streaming data, which means the same data that goes into an offline data warehouse is now available for stream processing. Everything else is a streaming materialized view over the log created using KSQL, be it various databases, search indexes, or other data serving systems in the company. All data enrichment and ETL needed to create these derived views can now be done in a streaming fashion using KSQL.
InfoQ: Can you discuss the technical details of how KSQL works in terms of clustering and failover?
Neha Narkhede: There is a KSQL server process which executes queries. A set of KSQL processes run as a cluster. You can dynamically add more processing capacity by starting more instances of the KSQL server. These instances are fault-tolerant: if one fails, the others will take over its work. Queries are launched using the interactive KSQL command line client which sends commands to the cluster over a REST API. The command line allows you to inspect the available streams and tables, issue new queries, check the status of and terminate running queries. Internally KSQL is built using Kafka’s Streams API; it inherits its elastic scalability, advanced state management, and fault tolerance, and support for Kafka’s recently introduced exactly-once processing semantics. The KSQL server embeds this and adds on top a distributed SQL engine (including some fancy stuff like automatic byte code generation for query performance) and a REST API for queries and control.
InfoQ: Are there any performance considerations when using KSQL queries compared to other solutions like accessing the streaming data using the Kafka API?
Narkhede: KSQL is built using Kafka’s Streams API and integrates with Kafka tightly. This tight integration with the core fundamentals in Apache Kafka removes extra layers of data movement and serialization that you have to go through with non-native options for stream processing data in Kafka. So the overhead of using KSQL for processing streams of data in Kafka topics is relatively low. Having said that, KSQL is still in developer preview and we don’t yet have performance benchmarks. Our goal with the developer preview is to work with the Kafka community to ensure that the user experience of KSQL is outstanding. And in the months to come, we will invest in performance improvements, testing and operational stability.
InfoQ: What role do you see KSQL playing in the future in terms of providing a standard for querying the data streams?
Narkhede: Back when we created Kafka, JMS was the standard for messaging and Kafka’s simple APIs based on the log paradigm were new to the industry. Today, Kafka is the standard for not just messaging, but for managing data in real time. It did that due to the simplicity of user experience and broad applicability to the new problem domain of large-scale streaming data. Similarly, KSQL offers a SQL-like interface that modifies the SQL standard to make it suitable for stream processing. KSQL does that to support streams and tables as first-class abstractions which is essential for using the full potential of stream processing for real-world use cases like Streaming ETL, Monitoring, Anomaly Detection and Analytics. The simplicity and ease of operation that KSQL brings to the stream processing world will help it influence the new standard for querying streams of data.
InfoQ: Can you talk about the Kafka roadmap and any upcoming features that our readers would be interested in learning about?
Narkhede: We are releasing KSQL as a developer preview to start building the community around it and gathering feedback. We plan to add several more capabilities as we work with the open source community to turn it into a production-ready system from quality, stability, and operability of KSQL to supporting a richer SQL grammar including further aggregation functions and point-in-time
SELECT
on continuous tables–i.e., to enable quick lookups against what’s computed so far in addition to the current functionality of continuously computing results off of a stream.
KSQL is currently available in Developer Preview under Apache 2.0 license model and the team plans to make it production-ready in next couple of months.
Checkout the Quick Start and KSQL Docker Image to learn more about the tool. If you want to participate with the community there is also a KSQL Community Slack Channel. Another resource on KSQL is this screencast video that shows how you can use KSQL for real-time monitoring, anomaly detection, and alerting.