What is Apache Druid?

Andrea Zanotti
Data Reply IT | DataTech
7 min read · Feb 22, 2021

In this article we will take a complete overview of the Apache Druid framework: what a timeseries is, how this kind of data can be handled, and how Druid's architecture is structured. In the end, we will see how data is ingested and queried inside the platform.

Timeseries: a definition

A time series is a sequence of data points ordered over time.

These discretized points are usually stored and analyzed in order to make predictions or find behavioural patterns. In common Big Data architectures, the standard approach is to gather data coming from all kinds of sources (user data, application data, web analytics, BI metrics, etc.), send all of it to an ETL stream processor and visualize the information in live dashboards.

In order to tailor the flow for a more analytic approach and to respond quicker to timeseries queries, our architecture can be divided into two distinct pipelines:

  • one dedicated only to user and application data, with the standard structure as described above;
  • another one that handles the rest of the gathered data and, after the ETL step, stores it in an analytic database before connecting it to the live dashboards.

As our analytic database, we will need something that can ingest data in real time, answers queries with low latency and offers high uptime… could Apache Druid be a perfect fit?

An introduction to Apache Druid

Druid is a high performance analytics data store for event-driven data, written in Java, born in 2011 and moved to an Apache License in February 2015. It has a lot of features, the most remarkable being:

  • Scalability, thanks to a self-healing architecture
  • Typical <1s to 4–5s query latency
  • Flexible table schemas over a columnar dataset
  • Time-based partitioning and querying
  • Native JSON + standard SQL querying
  • Fast data pre-aggregations

Data can be ingested into Druid from two kinds of sources: batch (native batch, Hadoop, etc.) and streaming (Kafka, Kinesis, Tranquility, etc.). In both cases, data is stored in a datasource, which is roughly equivalent to a table in a relational DB; during ingestion, data is partitioned by time into chunks, which are in turn split into segments.
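
As a rough, hedged sketch of how this time partitioning is declared, here is a minimal granularitySpec fragment of an ingestion spec, written as a Python dict; the chosen granularities are illustrative assumptions. segmentGranularity controls the size of the time chunks that segments are built from, while queryGranularity controls how finely timestamps are truncated within a segment.

```python
# Minimal sketch of the granularitySpec part of a Druid ingestion spec.
# The granularities below are illustrative assumptions, not taken from the article.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # each time chunk (and its segments) covers one day
    "queryGranularity": "HOUR",    # timestamps are truncated to the hour inside segments
    "rollup": True,                # pre-aggregate rows sharing timestamp + dimensions
}
```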

After the ingestion, a segment becomes an immutable, compressed columnar file. This immutable file is persisted in a deep storage (e.g. S3 or HDFS) and can be recovered even after a failure of all Druid servers. Storage exploits dictionary encoding and compressed bitmaps in order to reach high compression rates.

In Apache Druid, there are three types of columns (a schema sketch follows below):

  • Timestamp → used for partitioning and sorting
  • Dimension → a field that can be queried via filters
  • Metric → an optional field, representing data aggregated from other fields (e.g. the sum of the character counts of another field)
Druid column types example
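
To make these column types concrete, here is a minimal sketch of how they map to the timestampSpec, dimensionsSpec and metricsSpec sections of an ingestion spec, expressed as Python dicts; the field names (page_title, country, chars) are hypothetical examples.

```python
# Sketch of the dataSchema pieces that map to the three column types above.
# Field names ("page_title", "country", "chars") are hypothetical examples.
timestamp_spec = {"column": "timestamp", "format": "iso"}        # Timestamp column

dimensions_spec = {"dimensions": ["page_title", "country"]}      # Dimension columns (filterable)

metrics_spec = [
    {"type": "count", "name": "events"},                         # Metric: number of ingested rows
    {"type": "longSum", "name": "chars", "fieldName": "chars"},  # Metric: sum of character counts
]
```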

Queries can be sent to Druid in two different ways, without incurring performance penalties: native Druid JSON queries or DruidSQL. Druid is optimized for queries on a single fact datasource, with joins limited to a single, simple query-time lookup table.

Broader join support is limited, and joins are in any case not optimal in terms of query response time.
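
As an example of the SQL path, the following sketch sends a DruidSQL query to Druid's SQL HTTP endpoint; the host and port (the quickstart Router on 8888), the events datasource and its columns are assumptions for illustration.

```python
import requests

# Minimal sketch: run a DruidSQL query over HTTP.
# Host/port (quickstart Router on 8888) and the "events" datasource are assumptions.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

query = """
SELECT page_title, SUM(chars) AS total_chars
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY page_title
ORDER BY total_chars DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()
for row in response.json():   # results come back as a JSON array of rows
    print(row)
```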

Let’s take a deeper look under the hood.

Framework architecture

The Druid architecture can be divided into three different layers:

  • Data servers
  • Master servers
  • Query servers

Each Druid layer is devoted to a very specific task.

Apache Druid architecture

Data Server Layer

In this layer, data is ingested, stored and retrieved.

Data enters the framework through MiddleManagers, the components that handle ingestion by processing tasks and creating new segments. They are responsible for how data is indexed, pre-aggregated, split into segments and published into Deep Storage.

Deep Storage is responsible for persisting all ingested data and for handling storage backups and availability between processes. This component is pluggable, so users can choose the storage that best suits their needs.

Other components that are part of this sub-category are Zookeeper (another Apache project, used for internal service discovery, coordination and leader election) and the Metadata Storage (backed by a relational DB such as MySQL or PostgreSQL), which contains the metadata collected by all the other components.

The last entities of this layer are the Historicals: they handle query processing by fetching segments from Deep Storage to local disk and serving them when queries request them. If no info is found in the cache, the metadata of the new segment (its location and compression) is retrieved from Zookeeper and the data is processed from there. Historicals are triggered whenever a new segment enters the ingestion data queue.

The capability to adapt to different scenarios is the very reason this tool has been named Druid

Master Server Layer

This layer manages data availability and ingestion. Overlords are the components that balance the ingestion workload over MiddleManager processes, assigning ingestion tasks and coordinating segment publishing.

Coordinators, on the other hand, are components that balance data among Historical processes, notifying them when segments need to be loaded, dropping outdated segments and managing segment replication.

Query Server Layer

This layer is responsible for handling queries from external clients and providing the results. It uses components called Brokers, which receive queries from clients, identify which Historical and MiddleManager nodes are serving the relevant segments, split and rewrite the main query into subqueries and send them to each of these processes. At the very end, Brokers collect and merge the results and return them to the client.

After introducing the architecture, we will see how the actual Druid tasks are performed.

Ingestion flow

The ingestion of data from our sources consists of two main phases:

  • Indexing
  • Fetching

In the indexing phase, data is first gathered from the sources (1), then indexed and pre-aggregated (2) in order to create new segments, which are finally published (3) to the Deep Storage.

Indexing phase (components and relations used in this phase are highlighted)
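
From the user's point of view, steps (1)–(3) are driven by an ingestion task. The snippet below is a hedged sketch of a minimal native batch task, reusing the spec fragments sketched earlier, submitted to the Overlord's task endpoint; the host and port (quickstart default 8081), the local file path and the field names are all assumptions.

```python
import requests

# Sketch of submitting a native batch ingestion task to the Overlord.
# Host/port (quickstart: 8081), the local file path and field names are assumptions.
OVERLORD_TASK_URL = "http://localhost:8081/druid/indexer/v1/task"

task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["page_title", "country"]},
            "metricsSpec": [
                {"type": "count", "name": "events"},
                {"type": "longSum", "name": "chars", "fieldName": "chars"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
                "rollup": True,
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/data", "filter": "events-*.json"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

response = requests.post(OVERLORD_TASK_URL, json=task)
response.raise_for_status()
print(response.json())   # contains the id of the submitted task
```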

In the fetching phase, Coordinators periodically poll the Metadata Storage for newly published segments (4). When a Coordinator finds a segment that is published and in use but not yet available, it chooses a Historical process to load that segment and instructs it to do so (5). Finally, the Historical loads the segment and begins serving the data (6).

Fetching phase (components and relations used in this phase are highlighted)

Query flow

What happens when a user submits a query to Druid?

The process is described below as a single phase, called the querying phase.

It all starts when the client makes a request, which is handled by the Brokers (7), which identify the processes serving the corresponding segments (8). The query is then split, rewritten and sent (9) to Historicals and MiddleManagers, to be executed over the whole cluster (10).

Historicals fetch the data from the Deep Storage. However, for real-time data, the query may be sent to the MiddleManagers as well: some data may not be available in Deep Storage yet, existing only as segments still waiting to be written to the underlying file system.

At the end, the results of the query are collected and merged by the Brokers and returned to the client (11).

Querying phase (components and relations used in this phase are highlighted)
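
For completeness, the querying phase can also be triggered with a native JSON query instead of DruidSQL. The sketch below posts a simple timeseries query to the native query endpoint; the host and port (quickstart Router on 8888), datasource name and interval are assumptions.

```python
import requests

# Sketch of a native JSON timeseries query sent to the Broker (or Router) endpoint.
# Host/port (quickstart Router on 8888), datasource name and interval are assumptions.
DRUID_NATIVE_URL = "http://localhost:8888/druid/v2/"

native_query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2021-02-01/2021-02-22"],
    "aggregations": [
        {"type": "longSum", "name": "total_chars", "fieldName": "chars"},
    ],
}

response = requests.post(DRUID_NATIVE_URL, json=native_query)
response.raise_for_status()
for bucket in response.json():   # one entry per hourly time bucket
    print(bucket["timestamp"], bucket["result"])
```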

The key point is that the time spent during the ingestion phase to pre-aggregate and index data is recovered at query time. This is what lets users reach high-speed results, with average sub-second query response times.
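
To see why this trade-off pays off, here is a tiny, Druid-independent illustration of what rollup does conceptually: raw events sharing a truncated timestamp and the same dimension values collapse into a single pre-aggregated row, so queries scan far fewer rows. The events below are made up.

```python
from collections import defaultdict

# Conceptual, Druid-independent illustration of rollup: events sharing a truncated
# timestamp and the same dimensions collapse into one pre-aggregated row.
# The events below are made-up examples.
raw_events = [
    {"timestamp": "2021-02-22T10:01:03", "page_title": "druid", "chars": 120},
    {"timestamp": "2021-02-22T10:07:41", "page_title": "druid", "chars": 80},
    {"timestamp": "2021-02-22T10:59:59", "page_title": "kafka", "chars": 40},
]

rolled_up = defaultdict(lambda: {"events": 0, "chars": 0})
for event in raw_events:
    hour = event["timestamp"][:13]             # truncate to queryGranularity = HOUR
    key = (hour, event["page_title"])          # timestamp bucket + dimension values
    rolled_up[key]["events"] += 1              # count metric
    rolled_up[key]["chars"] += event["chars"]  # longSum metric

for key, metrics in rolled_up.items():
    print(key, metrics)
# Three raw rows become two stored rows; queries now scan the smaller, pre-aggregated set.
```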

Conclusions

In this article, we have analyzed:

  • the main features of the Apache Druid framework
  • which components define its architecture, and their purpose
  • the main steps that define ingestion and query phases

Apache Druid’s architecture is not exactly lightweight, given the number of components involved. Still, this is what allows it to scale horizontally, simply by increasing the number of nodes (MiddleManagers, Historicals, etc.). Furthermore, it delivers high performance when processing data, both in batch and in real time.

Since the code is open source, the community that uses and develops Druid keeps growing, and it is possible, for example, to find or write connectors in order to integrate the framework into architectures that need it.

The project is currently used in production by many successful companies, thanks to its good performance and the great variety of use cases it fits.

Do you want to test Apache Druid?

Thanks to quickstart guides written directly by the developers, you can easily try it out on your laptop with Docker, on an actual machine, or on a cluster (both on-premise and in the cloud!).
