Apache Druid is a high-performance real-time datastore and its latest release, version 25.0, provides many improvements and enhancements. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready; Kubernetes can be used to launch and manage tasks eliminating the need for middle managers; simplified deployment; and a new dedicated binary for Hadoop 3.x users.
In order to produce real-time analytics and reduce time to insight for a variety of use cases, Druid's design incorporates concepts from data warehouses, time-series databases, and search systems.
It has a microservice-based distributed architecture that is designed to be cloud-ready and comprises several types of services such as: Coordinator service that manages data availability on the cluster, Overlord service that controls the assignment of data ingestion workloads, Broker service that handles queries from external clients and MiddleManager services that ingest data.
The image below shows the architecture of Apache Druid:
During the ingestion phase, Druid reads the data from the source system and stores it in data files called segments. In general, segment files contain a few million rows each. Every segment file is partitioned by time and organized in a columnar structure stored separately to decrease query latency by scanning only those columns actually needed for a query.
Druid supports both streaming and batch ingestion. It connects to a source of raw data, typically a message bus such as Apache Kafka (for streaming data loads), or a distributed file system, such as HDFS or cloud-based storage like Amazon S3 and Azure Blob Storage (for batch data loads), and can convert raw data to a more read-optimized format (segment) in a process called "indexing" Apache Druid can ingest denormalized data in JSON, CSV, Parquet, Avro and other custom formats.
It is possible to query data in Druid data sources using Druid SQL. Druid translates SQL queries into its native query language.
Druid comes with a web console that may be used to load data, manage data sources and tasks, and control server status and segment information. Additionally, you can execute SQL and native Druid queries in the console.
The image below shows the web console of Druid:
For situations where real-time ingest, fast query performance and high uptime are crucial, Apache Druid is frequently employed.
As a result, Druid is commonly used as a backend for highly concurrent APIs that require quick aggregations or to power the GUIs of analytical apps. Druid works best with event-oriented data.
Typical application areas are: Clickstream analytics (web and mobile analytics), Risk/fraud analysis, Network telemetry analytics (network performance monitoring), Application performance metrics and Business intelligence / OLAP.
It is used by many big players like Airbnb, British Telecom, Cisco, eBay, Expedia, Netflix and Paypal and has more than 12k stars on Github.