Apache Druid — Analytics DB for Timeseries Data

Rakshith S · Subex AI Labs · May 17, 2021

In this article, we will take a complete look at Apache Druid and its salient features: when and where to use Druid, how to deploy it, and a deep dive into its architecture. We cover all the details you need to get a basic idea of what the Druid architecture looks like. Understanding what's happening beneath Apache Druid is important if you want to extract the maximum benefit for your use case.

Official definition:- Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets.

In simple words: say your business or platform generates huge amounts of data every second, so you are sitting on a goldmine of data. This data by itself is of no value unless we extract meaningful information and insights from it to understand the various characteristics of the business. To achieve this, one might have to execute multiple complex queries on the dataset. In most scenarios, as the dataset grows, latency adds up and the load on the servers increases at an exponential rate. For such analytical use cases, Apache Druid is one of the best databases available in the market today, as it tackles these issues with ease.

Important Features:-

Druid is most often used as a database for powering use cases where real-time ingestion, fast query performance, and high uptime are important. Druid’s main value add is to reduce time to insight and action.

Built around timestamps:- The way backend storage and querying are designed is purely based on timestamps, which makes Druid the best fit for time-related queries.

Column-oriented:- Druid stores data by column, so a query scans only the columns it needs, unlike a traditional row-wise scan over every column.

From a UI perspective, Druid is used for dashboarding in powerful analytical applications. From a backend perspective, it is used for fast aggregations and highly concurrent API executions.
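
To make the last two points concrete, here is a minimal sketch of a time-bucketed aggregation sent to Druid's SQL-over-HTTP endpoint. The "purchases" datasource and its columns are hypothetical, and port 8888 assumes the Router of a local quickstart deployment:

```python
import requests

# Daily revenue over the last week: Druid buckets rows by __time and
# only scans the columns the query actually touches (__time, amount).
query = """
SELECT TIME_FLOOR(__time, 'P1D') AS day,
       SUM(amount)               AS revenue
FROM purchases
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1
ORDER BY 1
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": query})
for row in resp.json():
    print(row)  # e.g. {"day": "2021-05-10T00:00:00.000Z", "revenue": 1234.5}
```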

Figure: Time-series chart showing daily trends of online purchases for an e-commerce use case — Source: CrunchMetrics

When you should use Druid:-

  1. For data with a time component:- Druid is designed and optimized around timestamps, so everything from ingestion to sorting is built around the time column
  2. If your data has high-cardinality columns (user IDs, URLs, and the like)
  3. If most of your queries are group-by queries (scan and search queries are also supported); see the sketch after this list
  4. If you already have a data warehouse and are looking for a secondary DB for your analytics
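
As a sketch of points 2 and 3: a group-by over a hypothetical high-cardinality column, using Druid SQL's approximate distinct-count aggregate (the datasource and column names are made-up examples):

```python
import requests

# Top countries by (approximate) unique users: a group-by over the
# high-cardinality user_id column, answered with a sketch-based aggregate.
query = """
SELECT country,
       APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM purchases
GROUP BY country
ORDER BY unique_users DESC
LIMIT 10
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": query})
print(resp.json())
```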

Cases where Druid is not the right fit:-

  1. If you need updates to existing rows rather than inserts
  2. If your queries involve huge joins across multiple tables
  3. If you are looking for a data warehouse; Druid was specifically designed for data analytics, so loading your raw data into Druid purely for storage would be a bad idea
  4. If you have huge data and do not have enough computational hardware (CPU and memory). Apache Druid has been designed and developed solely with the goal of low-latency results, and hence it needs more memory and CPUs. As always, there is a trade-off between performance and hardware.
  5. If you need an offline reporting system where latency is not a concern

So, how do you get started with a Druid deployment?


Druid can be deployed either as plain processes or as Docker containers. The Docker deployment is generally preferred, as it gives you the flexibility of a microservice architecture and makes it easy to scale up and down. An important factor to consider when creating a Druid deployment is the configuration profile. All the components in Druid are tuned to work based on your machine's CPU and memory, so make sure you use the right set of configurations for your data load and machine specs. Picking the right configuration profile is quite easy, as Druid ships a bunch of configuration files for specific machine specs, and you can pick the right one for your machine. For single-server deployments, the following config profiles are provided by Druid: nano-quickstart, micro-quickstart, small, medium, large, xlarge.
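
Once the processes are up, every Druid process exposes a GET /status/health endpoint that you can use to sanity-check the deployment. A minimal sketch, assuming the default ports of a local single-server quickstart:

```python
import requests

# Default single-server quickstart ports (adjust to your deployment):
# 8888 = Router, 8081 = Coordinator/Overlord, 8082 = Broker, 8083 = Historical
SERVICES = {
    "router": "http://localhost:8888",
    "coordinator-overlord": "http://localhost:8081",
    "broker": "http://localhost:8082",
    "historical": "http://localhost:8083",
}

for name, base_url in SERVICES.items():
    try:
        # /status/health returns the JSON literal `true` when the process is healthy
        healthy = requests.get(f"{base_url}/status/health", timeout=5).json()
        print(f"{name}: {'healthy' if healthy else 'unhealthy'}")
    except requests.exceptions.RequestException:
        print(f"{name}: unreachable")
```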

Druid Architecture — Simplified

Figure: Druid architecture (Source: Apache Druid)

At the core of Apache Druid is a unique design, unlike any of its competitors: Druid is made up of multiple processes, and each process can be configured and scaled individually.

To understand why and how Druid performs so well, let us review its architecture. Before going into the technical nitty-gritty, let us first explore the lifecycle of data within Druid.

Lifecycle of data within Druid:-

  1. Data ingestion
  2. Creation of segments from the ingested data and persisting of these segments to deep storage
  3. Loading of segments from deep storage into the Historical cache, where they are served as queryable data
  4. Execution of queries against these segments on the Historicals
  5. Dropping of segments based on retention rules

Every component in Druid is given one of the above responsibilities. (Simple!)
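
To make steps 1 and 2 concrete, here is a sketch of submitting a native batch ("index_parallel") ingestion task to the Overlord. The datasource name, file path, and column names are hypothetical, and the spec is trimmed to the essentials:

```python
import requests

# A minimal native batch ingestion spec; all names below are hypothetical.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "purchases",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "product", "country"]},
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "none"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/tmp/data", "filter": "purchases.json"},
            "inputFormat": {"type": "json"},
        },
    },
}

# Submit the task to the Overlord (port 8081 in a single-server quickstart)
resp = requests.post("http://localhost:8081/druid/indexer/v1/task", json=ingestion_spec)
print(resp.json())  # e.g. {"task": "index_parallel_purchases_..."}
```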

Druid mainly has 3 server types:

1. Master Servers: Manages data availability and ingestion.

> Coordinator:- (Responsibility:- Segments)

Manages segments. The Coordinator constantly communicates with the Historicals to load/drop segments based on the configured rules, and also creates replicas.
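
The load/drop rules the Coordinator enforces can be set per datasource through its REST API. A sketch, where the datasource name, retention period, and tier are assumptions:

```python
import requests

# Keep one month of data with 2 replicas on the default tier, drop anything older.
rules = [
    {"type": "loadByPeriod", "period": "P1M", "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]

resp = requests.post(
    "http://localhost:8081/druid/coordinator/v1/rules/purchases",  # hypothetical datasource
    json=rules,
)
print(resp.status_code)  # 200 on success
```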

> Overlord:- (Responsibility: Data ingestion tasks)

The Overlord process is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, returning statuses to callers, and handing tasks over to the MiddleManagers for execution.
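
Once a task has been submitted (as in the ingestion sketch above), its status can be polled from the Overlord. The task ID below is a hypothetical placeholder for the one returned at submission time:

```python
import requests

task_id = "index_parallel_purchases_2021-05-17"  # hypothetical task ID

resp = requests.get(f"http://localhost:8081/druid/indexer/v1/task/{task_id}/status")
print(resp.json()["status"])  # contains a RUNNING / SUCCESS / FAILED state
```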

2. Query Servers: Receive queries from external clients and forward them to the Data servers.

> Brokers:- (Responsibility:- Queries)

The Broker is the process that routes queries from external clients. Using ZooKeeper, it keeps track of where the data is, i.e., on which Historical process each segment lives. The Broker also merges the result sets from all the individual processes into one final result.

Question:- How does ZooKeeper know where the data is?

On startup, Historical processes announce in ZooKeeper the segments they are serving.
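
To see the Broker's role end to end, the daily aggregation from earlier can also be expressed as a native JSON query and posted directly to the Broker, which fans it out to the Historicals and merges the partial results. Datasource and metric names are hypothetical, and port 8082 assumes a local quickstart Broker:

```python
import requests

# A native "timeseries" query: one aggregated row per day in the interval.
native_query = {
    "queryType": "timeseries",
    "dataSource": "purchases",
    "granularity": "day",
    "intervals": ["2021-05-01/2021-05-17"],
    "aggregations": [{"type": "doubleSum", "name": "revenue", "fieldName": "amount"}],
}

resp = requests.post("http://localhost:8082/druid/v2", json=native_query)
print(resp.json())  # per-day results, merged across Historicals by the Broker
```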

> Routers:- (Responsibility:- API gateway in front of the Druid Brokers, Overlords, and Coordinators)

The Apache Druid Router process can be used to route queries to different Broker processes. The Druid Console is also hosted by the Router process.

3. Data Server: Executes ingestion jobs and stores all queryable data.

> Historical:- (Responsibility:- Storing queryable data)

The data stored on a Historical is immutable.

> MiddleManager:- (Responsibility: Ingesting data)

The MiddleManager process is a worker process that executes submitted tasks. Middle Managers forward tasks to Peons that run in separate JVMs. The reason we have separate JVMs for tasks is for resource and log isolation. Each Peon is capable of running only one task at a time, however, a MiddleManager may have multiple Peons.

MiddleManagers take the incoming data and analyze it, build aggregations, build indexes, partition the data, build dictionary encodings, and write the result out as segments.
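
For a quick look at this layer, the Overlord can be asked which MiddleManagers (workers) it knows about and what their Peon capacity is. A sketch; note that the exact response fields may vary across Druid versions:

```python
import requests

# List the MiddleManagers registered with the Overlord and their task capacity
resp = requests.get("http://localhost:8081/druid/indexer/v1/workers")
for entry in resp.json():
    worker = entry["worker"]
    print(worker["host"], "peon capacity:", worker["capacity"])
```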

External Dependencies:-

4. Metadata storage:- (Responsibility: Stores metadata about Druid system components)

Druid uses a relational database such as PostgreSQL, MySQL, or Derby to store various metadata about the system, but not the actual data. This metadata includes details like where the segments are, what is in each segment, all the rules for querying and arranging data into segments, task details, runtime configurations for supervisors and datasources, etc.
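
Although the metadata store itself is internal, the Coordinator exposes what it holds. For example, you can list the datasources recorded in the metadata store (local quickstart port assumed):

```python
import requests

# Datasources the metadata store knows about, served via the Coordinator
resp = requests.get("http://localhost:8081/druid/coordinator/v1/metadata/datasources")
print(resp.json())  # e.g. ["purchases"]
```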

5. Deep storage:-

Druid uses deep storage to store any data that has been ingested into the system. It acts like a backup and is the way data is transferred in the background between Druid processes. To respond to queries, Historical processes do not read from deep storage; instead, they read the segments prefetched to their local disks before any queries are served, i.e., Historical processes do not fetch data on demand. The deep storage options are listed below.

In case of a clustered deployment:- S3, HDFS, or a network-mounted filesystem
In case of a single-server deployment:- local disk

ZooKeeper:-

Apache Druid uses Apache ZooKeeper (ZK) for the management of the current cluster state. The operations that happen over ZK are:

  1. Coordinator/Overlord:- Leader election
  2. Segment “publishing” protocol from Historical
  3. Segment load/drop protocol between Coordinator and Historical

Conclusion

We have tried to cover the overall architecture in a simplified way. However, there are many other features that boost the performance of Apache Druid, for instance the storage format (segments), the querying layer (Historicals), and so on. Also, Druid is horizontally scalable, which means you can keep adding servers as the need grows. Druid also supports a variety of data sources for both batch and live ingestion. We would love to cover these topics in more detail in the coming days.

Reference:- https://druid.apache.org/

Thanks for reading. Hope it was helpful:)
