Druid — An introduction
Real-time analytics database
What is Druid?
Druid is an open-source, high-performance, real-time analytics database designed to power analytics applications at any scale while serving a high number of concurrent users.
Features of Druid
- Open source
- High ingestion rate
- Low query latency (sub-second queries) on terabyte datasets
- Support for high-dimensionality and high-cardinality data
- Column-oriented storage
- Distributed data store
- Self-healing & self-balancing
- Supports both real-time & batch ingestion
- Cloud-native
- Indexes to speed up queries
- Time-based partitioning of data, compressed storage
- Auto-summarization (rollup), late materialization
- Approximate algorithms (sketches)
- Support for text search
Node Types
Master Node
- Coordinator — manages data availability on the cluster
- Overlord — controls the assignment of data-ingestion workloads
Data Node
- MiddleManager — ingests and queries real-time data (can run multiple peons)
- Historical — serves queries over historical (published) segments
Query Node
- Broker — handles queries from external clients
- Router — routes requests to Brokers, Coordinators, and Overlords
- Query console (powered by Jetty)
- API service
Because each node type has different memory and CPU requirements due to the nature of the work it performs, it is a good idea to use different instance types for each. On AWS, a good set of options to choose from is the r5 (memory-optimized), m5 (general-purpose), i3 (storage-optimized), and c5 (compute-optimized) series.
External System Dependencies
- Zookeeper — manage cluster state
- Metadata store — MySQL or Postgres
- Deep Storage — S3, HDFS, or NFS
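As a sketch of how these dependencies are wired together, a common.runtime.properties fragment might look like the following. Hostnames, bucket names, and credentials are placeholders; using Postgres and S3 also requires the postgresql-metadata-storage and druid-s3-extensions extensions on the load list.

```properties
# common.runtime.properties (illustrative values)

# ZooKeeper — cluster state
druid.zk.service.host=zk1.example.com:2181

# Metadata store — Postgres
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>

# Deep storage — S3
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments
```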
Trying out Druid on EC2
# launch a t2.xlarge EC2 instance based on the Ubuntu 22.04 image
sudo apt update -y
sudo apt install openjdk-8-jdk -y
# download binary
wget https://dlcdn.apache.org/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
# untar
tar -xzf apache-druid-24.0.0-bin.tar.gz
cd apache-druid-24.0.0
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export DRUID_HOME=/home/ubuntu/apache-druid-24.0.0
export PATH=$JAVA_HOME/bin:$DRUID_HOME/bin:$PATH
# start druid
./bin/start-micro-quickstart
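Once the quickstart is up, queries can be sent to the Router's SQL endpoint (port 8888 by default). A minimal sketch using only the Python standard library; the URL assumes the quickstart's default ports:

```python
import json
from urllib import request

# Druid's SQL endpoint on the Router; 8888 is the quickstart default port.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def sql_payload(query: str) -> bytes:
    """Encode a Druid SQL query as the JSON body the endpoint expects."""
    return json.dumps({"query": query}).encode("utf-8")

def run_query(query: str):
    """POST a SQL query to the Router and return the parsed result rows."""
    req = request.Request(
        DRUID_SQL_URL,
        data=sql_payload(query),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the quickstart to be running):
# rows = run_query("SELECT * FROM INFORMATION_SCHEMA.TABLES")
```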
Installing Druid on K8s
helm dependency update helm/druid
helm install druid helm/druid --namespace demo --create-namespace
helm status druid -n demo
OR using operator
kubectl create namespace druid-operator
git clone https://github.com/druid-io/druid-operator.git
cd druid-operator/
helm -n druid-operator install cluster-druid-operator ./chart
pydruid is a Python package that exposes a simple API to create, execute, and analyze Druid queries.
pip install "pydruid[async,pandas,cli,sqlalchemy]"
Data Storage Format
Segment
- Versioned, immutable files
- Partitioned by time
- Recommended size: 300MB to 700MB
- Memory-mapped for faster access
These segments are prefetched into the cluster. On data nodes, segments are stored under var/druid/segments/<datasource>. Segments can be combined, replaced, and even deleted when no longer needed. To delete segments, first mark them unused via the ~/markUnused API, then submit a task of type kill to the ~/druid/indexer/v1/task API.
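The body of a kill task is a small JSON document. As a sketch, the hypothetical helper below (Python stdlib only) builds one for a given datasource and interval; the resulting JSON is what gets POSTed to the Overlord's task API:

```python
import json

def kill_task(datasource: str, interval: str) -> str:
    """Build the JSON body of a Druid kill task, which permanently deletes
    segments (already marked unused) within the given ISO-8601 interval."""
    return json.dumps({
        "type": "kill",
        "dataSource": datasource,
        "interval": interval,
    })

# Example interval format: "2016-06-27/2016-06-28"
# Submit the result with a POST to /druid/indexer/v1/task on the Overlord.
```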
When a data node fails, its segments are loaded onto other data nodes. This process is called balancing.
Optimization
- Conversion to a columnar format
- Indexing with bitmap indexes
- Compression
- Dictionary encoding with id storage minimization for String columns
- Bitmap compression for bitmap indexes
- Type-aware compression for all columns
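A toy sketch of how dictionary encoding and bitmap indexes work together: each distinct string is stored once, rows keep only small integer ids, and a bitmap per distinct value answers equality filters without scanning rows. This illustrates the idea only, not Druid's actual implementation:

```python
from collections import defaultdict

def encode_column(values):
    """Dictionary-encode a string column and build per-value bitmaps
    (modeled here as sets of row numbers) for fast filtering."""
    dictionary = {}             # distinct value -> integer id
    ids = []                    # per-row dictionary ids (the stored column)
    bitmaps = defaultdict(set)  # value id -> rows containing that value
    for row, value in enumerate(values):
        vid = dictionary.setdefault(value, len(dictionary))
        ids.append(vid)
        bitmaps[vid].add(row)
    return dictionary, ids, bitmaps

channel = ["#en", "#fr", "#en", "#de", "#en"]
dictionary, ids, bitmaps = encode_column(channel)

# A filter like channel = '#en' is answered from the bitmap alone:
en_rows = bitmaps[dictionary["#en"]]
```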
Druid Data Model
- Datasources — similar to tables in a relational database
- Primary timestamp — mandatory for every row
- Dimensions — attributes used for filtering and grouping (dictionary-encoded, bitmap-indexed)
- Metrics — pre-computed aggregates
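Rollup (auto-summarization) pre-aggregates rows at ingestion time: events are grouped by their timestamp truncated to the chosen granularity plus the dimension values, and the metrics are summed. A toy illustration of the idea in Python, not Druid's implementation:

```python
from collections import defaultdict
from datetime import datetime

def rollup(events):
    """Group (timestamp, page, count) events by hourly-truncated
    timestamp plus the page dimension, summing the count metric."""
    agg = defaultdict(int)
    for ts, page, count in events:
        t = datetime.fromisoformat(ts)
        bucket = t.replace(minute=0, second=0, microsecond=0)  # hourly grain
        agg[(bucket.isoformat(), page)] += count
    return dict(agg)

events = [
    ("2024-01-01T10:05:00", "home", 1),
    ("2024-01-01T10:40:00", "home", 2),
    ("2024-01-01T11:10:00", "home", 1),
]
summary = rollup(events)
# Two summarized rows survive instead of three raw events.
```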
Ingestion
Converting raw input data into segments is called ingestion. It involves two processes: indexing and handoff. Indexing creates new segments, and handoff publishes them to be served by Historicals. We define what happens during ingestion via the UI or via a JSON document called the ingestion specification. This spec can be submitted over REST to the Overlord service.
Types of Ingestion
Batch File Ingestion
- Native Batch Ingestion
- Hadoop Batch Ingestion
Stream Ingestion
- Kafka Indexing Service (Stream Pull)
- Stream Push
Connect & Parse
- Connect — Connect to the data source
- Parse — input data format
Note: To read data in other formats such as ORC, we need to enable the extension for that file type. The list of extensions is configured in the common.runtime.properties file, in the druid.extensions.loadList field.
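For example, enabling the ORC extension might look like this (the other entries shown are illustrative):

```properties
# common.runtime.properties
druid.extensions.loadList=["druid-orc-extensions", "druid-s3-extensions", "postgresql-metadata-storage"]
```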
Transform & Configure
- Parse Time — (format and default value)
- Transform — can add transformation before loading the data
- Filter — selective choice on what rows to include
- Configure Schema
Tune Parameters
- Partition
- Tune
- Publish
The ingestion specification is broadly categorized into three sections:
dataSchema — datasource, timestamp, dimensions, granularity
ioConfig — configuration for how data is read from the source
tuningConfig — configuration for how segments are constructed
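A minimal native batch spec showing the three sections might look like the following; the datasource name, file paths, and columns are illustrative:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
```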
The data ingestion can also be triggered externally with Apache Airflow using DruidOperator.
Observability
Druid exposes metrics for its various components.
Refer to the metrics page in the Druid documentation for the full list.
We can leverage druid-exporter coupled with kube-prometheus-stack to track metrics of Druid.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm -n monitoring install kube-prometheus-stack prometheus-community/kube-prometheus-stack
helm upgrade druid-exporter ./helm/ --install --namespace druid \
--set druidURL="DRUIDURL" \
--set druidExporterPort="8080" \
--set logLevel="info" --set logFormat="text" \
--set serviceMonitor.enabled=true --set serviceMonitor.namespace="monitoring"
# common.runtime.properties
druid.emitter=http
druid.emitter.logging.logLevel=debug
druid.emitter.http.recipientBaseUrl=http://druid-exporter-prometheus-druid-exporter.monitoring.svc.cluster.local:8080/druid
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor", "org.apache.druid.java.util.metrics.JvmMonitor"]
Druid Use cases
- Ads data analysis (digital marketing)
- User behavior analysis
- Real-time APM
- High-speed OLAP
- IoT and devices metrics
Companies Already Using Druid
Druid is used in production by companies such as Netflix, Airbnb, and Lyft, while some teams have chosen alternatives such as Snowflake's Snowpipe.