Druid — An introduction
Real-time analytics database
What is Druid?
Druid is an open-source, high-performance, real-time analytics database designed to power analytics applications at any scale while serving a high number of concurrent users.
Features of Druid
- Open source
- High ingestion rate
- Low query latency (sub-second queries) on terabyte datasets
- Support for high-dimensionality and high-cardinality data
- Column-oriented storage
- Distributed data store
- Self-healing & self-balancing
- Supports both real-time & batch ingestion
- Cloud-native
- Indexes to speed up queries
- Time-based partitioning of data, compressed storage
- Auto-summarization (rollup), late materialization
- Approximate algorithms (sketches)
- Support for text search
Node Types
Master Node
- Coordinator — manages data availability on the cluster
- Overlord — controls the assignment of data-ingestion workloads
Data Node
- MiddleManager — ingests and queries real-time data (can run multiple peons)
- Historical — serves queries over historical (published) segments
Query Node
- Broker — handles queries from external clients
- Router — routes requests to Brokers, Coordinators, and Overlords
- Query console (powered by Jetty)
- API service
Because each node type has different memory and CPU requirements due to the nature of the work it performs, it is a good idea to use different instance types for each. On AWS, a good set of options to choose from is the r5 (memory-optimized), m5 (general-purpose), i3 (storage-optimized), and c5 (compute-optimized) series.
External System Dependencies
- Zookeeper — manage cluster state
- Metadata store — MySQL or Postgres
- Deep Storage — S3, HDFS, or NFS
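As a sketch of how these dependencies are wired together, a common.runtime.properties fragment might look like the following. Hostnames, bucket names, and credentials are placeholders; using Postgres and S3 also requires the postgresql-metadata-storage and druid-s3-extensions extensions on the load list.

```properties
# common.runtime.properties (illustrative values)

# ZooKeeper — cluster state
druid.zk.service.host=zk1.example.com:2181

# Metadata store — Postgres
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>

# Deep storage — S3
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments
```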
Trying out Druid on EC2
# launch a t2.xlarge EC2 instance based on the Ubuntu 22.04 image
sudo apt update -y
sudo apt install openjdk-8-jdk -y
# download binary
wget https://dlcdn.apache.org/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
# untar
tar -xzf apache-druid-24.0.0-bin.tar.gz
cd apache-druid-24.0.0
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export DRUID_HOME=/home/ubuntu/apache-druid-24.0.0
export PATH=$JAVA_HOME/bin:$DRUID_HOME/bin:$PATH
# start druid
./bin/start-micro-quickstart
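Once the quickstart is up, queries can be sent to the Router's SQL endpoint (port 8888 by default). A minimal sketch using only the Python standard library; the URL assumes the quickstart's default ports:

```python
import json
from urllib import request

# Druid's SQL endpoint on the Router; 8888 is the quickstart default port.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def sql_payload(query: str) -> bytes:
    """Encode a Druid SQL query as the JSON body the endpoint expects."""
    return json.dumps({"query": query}).encode("utf-8")

def run_query(query: str):
    """POST a SQL query to the Router and return the parsed result rows."""
    req = request.Request(
        DRUID_SQL_URL,
        data=sql_payload(query),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the quickstart to be running):
# rows = run_query("SELECT * FROM INFORMATION_SCHEMA.TABLES")
```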
Installing Druid on K8s
helm dependency update helm/druid
helm install druid helm/druid --namespace demo --create-namespace
helm status druid -n demo
OR using operator
kubectl create namespace druid-operator
git clone https://github.com/druid-io/druid-operator.git
cd druid-operator/
helm -n druid-operator install cluster-druid-operator ./chart
pydruid is a Python package that exposes a simple API to create, execute, and analyze Druid queries.
pip install "pydruid[async,pandas,cli,sqlalchemy]"
Data Storage Format
Segment
- Versioned, immutable files
- Partitioned by time
- Recommended size: 300MB to 700MB
- Memory-mapped for faster access
These segments are prefetched into the cluster. On data nodes, segments are stored under var/druid/segments/<datasource>. Segments can be combined, replaced, and even deleted when no longer needed. To delete segments, first mark them unused via the ~/markUnused API, then submit a task of type kill to the ~/druid/indexer/v1/task API.
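The body of a kill task is a small JSON document. As a sketch, the hypothetical helper below (Python stdlib only) builds one for a given datasource and interval; the resulting JSON is what gets POSTed to the Overlord's task API:

```python
import json

def kill_task(datasource: str, interval: str) -> str:
    """Build the JSON body of a Druid kill task, which permanently deletes
    segments (already marked unused) within the given ISO-8601 interval."""
    return json.dumps({
        "type": "kill",
        "dataSource": datasource,
        "interval": interval,
    })

# Example interval format: "2016-06-27/2016-06-28"
# Submit the result with a POST to /druid/indexer/v1/task on the Overlord.
```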
When a data node fails, its segments are loaded onto other data nodes. This process is called balancing.
Optimization
- Conversion to a columnar format
- Indexing with bitmap indexes
- Compression
- Dictionary encoding with id storage minimization for String columns
- Bitmap compression for bitmap indexes
- Type-aware compression for all columns
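A toy sketch of how dictionary encoding and bitmap indexes work together: each distinct string is stored once, rows keep only small integer ids, and a bitmap per distinct value answers equality filters without scanning rows. This illustrates the idea only, not Druid's actual implementation:

```python
from collections import defaultdict

def encode_column(values):
    """Dictionary-encode a string column and build per-value bitmaps
    (modeled here as sets of row numbers) for fast filtering."""
    dictionary = {}             # distinct value -> integer id
    ids = []                    # per-row dictionary ids (the stored column)
    bitmaps = defaultdict(set)  # value id -> rows containing that value
    for row, value in enumerate(values):
        vid = dictionary.setdefault(value, len(dictionary))
        ids.append(vid)
        bitmaps[vid].add(row)
    return dictionary, ids, bitmaps

channel = ["#en", "#fr", "#en", "#de", "#en"]
dictionary, ids, bitmaps = encode_column(channel)

# A filter like channel = '#en' is answered from the bitmap alone:
en_rows = bitmaps[dictionary["#en"]]
```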
Druid Data Model
- Datasources — similar to tables in a relational database
- Primary timestamp — mandatory for every row
- Dimensions — attributes used for filtering and grouping (dictionary-encoded, bitmap-indexed)
- Metrics — pre-computed aggregates
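Rollup (auto-summarization) pre-aggregates rows at ingestion time: events are grouped by their timestamp truncated to the chosen granularity plus the dimension values, and the metrics are summed. A toy illustration of the idea in Python, not Druid's implementation:

```python
from collections import defaultdict
from datetime import datetime

def rollup(events):
    """Group (timestamp, page, count) events by hourly-truncated
    timestamp plus the page dimension, summing the count metric."""
    agg = defaultdict(int)
    for ts, page, count in events:
        t = datetime.fromisoformat(ts)
        bucket = t.replace(minute=0, second=0, microsecond=0)  # hourly grain
        agg[(bucket.isoformat(), page)] += count
    return dict(agg)

events = [
    ("2024-01-01T10:05:00", "home", 1),
    ("2024-01-01T10:40:00", "home", 2),
    ("2024-01-01T11:10:00", "home", 1),
]
summary = rollup(events)
# Two summarized rows survive instead of three raw events.
```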
Ingestion
Converting raw input data into segments is called ingestion. It involves two processes: indexing and handoff. Indexing creates new segments, and handoff publishes them to be served by Historicals. We define what happens during ingestion via the UI or via a JSON document called the ingestion specification. This spec can be submitted over REST to the Overlord service.
Types of Ingestion
Batch File Ingestion
- Native Batch Ingestion
- Hadoop Batch Ingestion
Stream Ingestion
- Kafka Indexing Service (Stream Pull)
- Stream Push
Connect & Parse
- Connect — Connect to the data source
- Parse — input data format
Note: To read data in other formats such as ORC, we need to enable the extension for that file type. The list of extensions is configured in the common.runtime.properties file, in the druid.extensions.loadList field.
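For example, enabling the ORC extension might look like this (the other entries shown are illustrative):

```properties
# common.runtime.properties
druid.extensions.loadList=["druid-orc-extensions", "druid-s3-extensions", "postgresql-metadata-storage"]
```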
Transform & Configure
- Parse Time — (format and default value)
- Transform — can add transformation before loading the data
- Filter — selective choice on what rows to include
- Configure Schema
Tune Parameters
- Partition
- Tune
- Publish
The ingestion specification is broadly categorized into three sections:
dataSchema — datasource, timestamp, dimensions, granularity
ioConfig — configuration for how data is read from the source
tuningConfig — configuration for how segments are constructed
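A minimal native batch spec showing the three sections might look like the following; the datasource name, file paths, and columns are illustrative:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
```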
The data ingestion can also be triggered externally with Apache Airflow using DruidOperator.
Observability
Druid exposes metrics for its various components.
Refer to the metrics page in the Druid documentation for the full list.
We can leverage druid-exporter coupled with kube-prometheus-stack to track metrics of Druid.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm -n monitoring install kube-prometheus-stack prometheus-community/kube-prometheus-stack
helm upgrade druid-exporter ./helm/ --install --namespace druid \
--set druidURL="DRUIDURL" \
--set druidExporterPort="8080" \
--set logLevel="info" --set logFormat="text" \
--set serviceMonitor.enabled=true --set serviceMonitor.namespace="monitoring"
# common.runtime.properties
druid.emitter=http
druid.emitter.logging.logLevel=debug
druid.emitter.http.recipientBaseUrl=http://druid-exporter-prometheus-druid-exporter.monitoring.svc.cluster.local:8080/druid
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor", "org.apache.druid.java.util.metrics.JvmMonitor"]
Druid Use cases
- Ads data analysis (digital marketing)
- User behavior analysis
- Real-time APM
- High-speed OLAP
- IoT and devices metrics
Companies Already Using Druid
Druid is used in production by companies such as Netflix, Airbnb, and Lyft, while some teams have chosen alternatives such as Snowflake's Snowpipe.