Choosing OLAP Storage: Apache Druid

Aleh Belausau
Towards Data Engineering
15 min read · Mar 16, 2024

Previously, in the article How to Choose the Right OLAP Storage, I described the key metrics to consider when making the critical choice of OLAP storage for your needs. Now, I have decided to apply this approach in practice and examine the most popular, as well as some of the not-so-popular, OLAP storages. The main goal is to discern the strengths and weaknesses of each OLAP storage solution and determine the most fitting use case for each.

What is OLAP Storage?

Online analytical processing (OLAP) is software technology you can use to analyze business data from different points of view. In OLAP scenarios, datasets can be massive (billions or trillions of rows). Data is organized in tables with many columns, but only a few columns are needed to answer any particular query, and results must be returned in milliseconds or seconds. Basically, OLAP storage refers to storage optimized for such analytical workloads.

This installment of the research focuses on Apache Druid.

Overview

  • Developers — Apache Software Foundation
  • Written in — Java
  • Type — Column-oriented DBMS, Real-Time

Apache Druid is an open-source, column-oriented, distributed data store written in Java. It is specifically designed for rapid ingestion of large volumes of event data, offering low-latency queries on the stored information. Druid stands out as a robust real-time analytics database tailored for swift slice-and-dice analytics, commonly known as OLAP queries, especially on extensive datasets. A notable feature of Druid is its columnar storage format, which selectively loads only the columns required for a given query, significantly enhancing query speed when targeting specific subsets of data. The name "Druid" comes from the shapeshifting Druid class in many role-playing games, reflecting the idea that the system's architecture can shift to solve many different kinds of data problems.

Storage Architecture & Semi-Structured Data Support

1. Storage Format:

Druid uses column-oriented storage, meaning it loads only the exact columns needed for a particular query. This gives a huge speed boost to queries that touch just a few columns. In addition, each column's storage is optimized for its data type, which supports fast scans and aggregations.
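
For illustration, here is a minimal Druid SQL query (the web_events table and its columns are hypothetical) that touches only two columns; Druid reads just those column files for the scanned segments and never loads the rest of the row:

```sql
-- Hypothetical table: web_events(__time, country, page, bytes, user_id).
-- Only the 'country' and 'bytes' column files are read for the segments
-- covered by the time filter; all other columns stay untouched on disk.
SELECT
  country,
  SUM(bytes) AS total_bytes
FROM web_events
WHERE __time >= TIMESTAMP '2024-03-01 00:00:00'
  AND __time <  TIMESTAMP '2024-03-02 00:00:00'
GROUP BY country
```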

2. Separation of Compute and Storage:

Apache Druid does not separate compute and storage; it co-locates them for performance reasons. While this coupling can improve query latency, it also means that you cannot scale storage and compute resources independently, which can lead to inefficiencies in resource utilization.

3. Semi-Structured Data Support:

Apache Druid 24.0 added native support for ingesting and storing semi-structured data as-is. This is particularly useful for data from web APIs and from mobile and IoT devices, which often arrives in semi-structured formats. Many databases require flattening nested structures before storing and processing them in order to query efficiently. With Druid's nested-column support, developers can ingest and query nested fields while retaining the performance they have come to expect from Druid on fully flattened columns; internal benchmarks reported by the project show query performance on nested columns that is comparable to, or better than, flattened data.

Druid handles semi-structured data in a very natural way, with automatic schema evolution: new fields are picked up seamlessly. All fields are indexed, allowing for very fast data exploration, and Druid supports the most popular file formats for structured and semi-structured data.
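
As a sketch of what this looks like in practice, the following Druid SQL query reads fields out of a hypothetical nested JSON column named payload using the JSON_VALUE function introduced alongside nested-column support:

```sql
-- Hypothetical 'events' table with a nested JSON column 'payload', e.g.
-- {"device": {"os": "ios", "model": "SE"}, "metrics": {"latency_ms": 12}}.
-- JSON_VALUE extracts nested fields directly, without pre-flattening.
SELECT
  JSON_VALUE(payload, '$.device.os') AS device_os,
  AVG(JSON_VALUE(payload, '$.metrics.latency_ms' RETURNING DOUBLE)) AS avg_latency_ms
FROM events
GROUP BY 1
```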

Deployment & Pricing

1. Deployment Model:

Apache Druid can be deployed in different ways depending on your needs. Here are some of the common deployment models:

  • Single Server Deployment: This is suitable for small machines like laptops and is intended for quick evaluation use-cases. Druid includes a launch script, bin/start-druid, that automatically sets various memory-related parameters based on available processors and memory. It also includes a set of reference configurations and launch scripts for single-machine deployments.
  • Clustered Deployment: Apache Druid is designed to be deployed as a scalable, fault-tolerant cluster. A simple cluster features a Master server hosting the Coordinator and Overlord processes, two scalable, fault-tolerant Data servers running Historical and MiddleManager processes, and a Query server hosting the Broker and Router processes.
  • Cloud Deployment: Apache Druid can be deployed in the cloud and is particularly well-suited for cloud environments like AWS.

Druid can process a query in parallel across the entire cluster. It is typically deployed in clusters of tens to hundreds of servers, and can offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a few seconds.

2. Fully Managed Service Options:

Apache Druid can be used as a fully managed service through various providers. Here are a few options:

  • Rill Data: Rill offers a fully managed Apache Druid service. They provide end-to-end tools for modeling, storing, exploring, and distributing real-time metrics. Rill’s services and platform ensure the performance, reliability, and security required to meet the most demanding SLAs. They also offer features like Druid Monitoring, Druid Support, and Druid-as-a-Service.
  • XenonStack: XenonStack provides Apache Druid Managed Services and Analytics Solutions. They offer features like Managed Backup, Full and Daily Snapshots, Managed Operating System Patches and Updates, Hardening, Configuration, and Tuning.

3. Scalability:

Apache Druid offers several scalability options to meet different use cases:

  • Horizontal Scaling: Druid can be scaled horizontally by adding more servers to the cluster. The Druid cluster automatically re-balances itself in the background without any downtime.
  • Vertical Scaling: You can scale up individual Druid servers to handle more data or queries.
  • Elastic Scaling on AWS: When deployed on AWS, Druid can be configured to automatically scale up or down based on the workload. This includes scaling the number of EC2 instances, adjusting the instance types, and tuning other resources.
  • Scalable Analytics on AWS: AWS provides a solution for scalable analytics using Apache Druid. This solution allows you to efficiently deploy, operate, manage, and customize a cost-effective, highly available, resilient, and fault-tolerant hosting environment for Apache Druid analytics databases on AWS.

4. Pricing Model:

Apache Druid is an open-source software, which means it’s free to download and use. However, the total cost of ownership can vary depending on your deployment model and infrastructure. Here are some factors that can affect the cost:

  • Self-Managed Deployment: If you choose to deploy and manage Apache Druid yourself, either on-premises or in the cloud, your costs will include the infrastructure (servers, storage, network, etc.), and operational costs (maintenance, monitoring, backups, etc.).
  • Cloud Deployment: If you deploy Apache Druid in a cloud environment like AWS, you’ll pay for the cloud resources you use. This can include costs for compute (EC2 instances), storage (S3, EBS), and other AWS services.
  • Fully Managed Services: Some companies offer fully managed Apache Druid services, such as Rill Data and XenonStack. These services handle all aspects of running Apache Druid, including deployment, scaling, monitoring, and maintenance. The pricing for these services varies, so you’ll need to check with the individual providers for details.

While Apache Druid itself is free, the costs associated with running it in production can add up, especially at scale. It’s important to consider these factors when planning your Apache Druid deployment.

Management

1. Community/Support:

The Apache Druid community is dynamic and engaged, offering diverse channels for support. The Apache Druid Slack platform serves as a bustling hub with numerous users and committers, making it an excellent resource to seek assistance. Additionally, the GitHub repository provides opportunities to track Druid’s development progress, report issues, and contribute pull requests. Beyond community support, several third-party companies, such as Imply and Rill Data, offer commercial support and services for Druid, further enhancing the ecosystem’s robustness.

For a more in-depth understanding of Druid’s community, refer to the Apache Druid Community documentation page.

2. Documentation:

Apache Druid has extensive documentation that provides a comprehensive overview of its features, capabilities, and use cases, including a quickstart and in-depth guides. The documentation is open-source, and community contributions are encouraged.

3. Ease of Management:

The challenge of managing Apache Druid varies with the deployment model. A Single Server Deployment is the simplest and easiest to manage. However, as your data grows, transitioning to a more scalable Clustered deployment might become necessary. While this offers greater scalability and fault tolerance, it also introduces more complexity: you must manage multiple servers, each hosting different Druid processes, and attend to tasks such as load balancing, data replication, and failure recovery.

Deploying Druid in the cloud, such as on AWS, can simplify certain aspects of management, as many infrastructure tasks like hardware provisioning and scaling are handled by the cloud provider. However, you still need to manage the Druid software itself, and cloud deployments can introduce their own complexities, such as cost management and understanding cloud-specific performance characteristics.

Having a good understanding of Druid’s architecture and features will be crucial for effective management.

4. Learning curve:

Apache Druid is a powerful and complex system, and like any such system, there is a learning curve involved. Understanding Druid's key concepts, such as its architecture, data model, indexing, and query language, is crucial, and practical experience with setting up and managing a cluster, ingesting data, and running queries greatly aids learning. For more complex use cases, such as large-scale data ingestion, real-time analytics, or multi-tenant deployments, the learning curve can be steeper. While Apache Druid is powerful and flexible, it does require an investment of time and effort to learn effectively.

5. SQL Support:

Apache Druid supports a native SQL layer that translates SQL queries into efficient Druid queries.
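
For example, a typical analytical query can be written in plain SQL, and the planner compiles it into a native query; the table and columns below are illustrative:

```sql
-- A typical Druid SQL query; the planner translates it into a native
-- timeseries/groupBy query. Table and column names are hypothetical.
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  COUNT(*) AS events,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1
```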

Integration

1. Supported Data Sources:

Apache Druid supports a wide range of data sources for ingestion:

  • Object Stores: Druid can ingest data from various object stores including HDFS, Amazon S3, Azure Blob, and Google Cloud Storage.
  • Message Buses: Druid can stream data from message buses such as Kafka and Amazon Kinesis.
  • Data Lakes: Druid can batch load files from data lakes such as HDFS and Amazon S3.
  • File Formats: Druid supports most popular file formats for structured and semi-structured data, including JSON, CSV, TSV, Parquet, ORC, Avro, and Protobuf.

2. Cloud Services Integration:

Apache Druid can be integrated with various cloud services:

AWS Cloud:

Druid can be deployed on AWS EC2 instances. It also supports AWS S3 for deep storage and AWS Kinesis for real-time data ingestion.

Microsoft Azure:

While specific documentation for Azure is not readily available, Druid’s flexible architecture allows it to be deployed on any cloud platform that provides compute instances and blob storage. Therefore, it can be deployed on Azure VMs and use Azure Blob Storage for deep storage.

GCP Cloud:

A reference architecture for Apache Druid on Google Cloud Platform includes best practices for leveraging GCP services such as Compute Engine, Cloud Storage, and Cloud SQL. Druid is commonly paired with Kafka on GCP for event monitoring, financial analysis, and IoT monitoring.

3. SDK Support:

Apache Druid has SDK support in various programming languages:

  • Java: Druid provides a native Java API. There’s also a Java client and query generator for Druid called druidry.
  • Scala: scruid is a Scala client for Druid.
  • .NET: druid4net is a .NET client for Druid written in C#. It supports both the .NET full framework and .NET Core.
  • Rust: druid-io-rs is a fully asynchronous, future-enabled Apache Druid client library for the Rust programming language.

Please note that the Apache Druid team does not maintain these libraries and has not done extensive testing to ensure their quality. For a full list of libraries, visit the Community and Third-Party Software documentation page.

4. Supported Visualization Tools:

Apache Druid supports a variety of visualization tools:

  • Apache Superset: Superset is a modern data exploration and data visualization platform.
  • Deep.Explorer: A UI built for slice-and-dice analytics, ad-hoc queries, and powerful, easy data visualizations.
  • Grafana: Druid can be integrated with Grafana through a plugin.
  • Pivot: An exploratory analytics UI for Druid.
  • Plotly Python Library: You can use the Python Plotly library to create visualizations of your Apache Druid data.

For a full list of visualization tools, visit the Community and Third-Party Software documentation page.

Performance

1. Insert Operations:

Apache Druid is designed to ingest large volumes of event data and provides low latency queries on this data. Druid’s ingestion speed is one of its key strengths, and it can ingest millions of records per second. Here are some more details about its insert performance:

  • Real-Time Ingestion: Druid can ingest data in real-time from message buses like Kafka. This allows Druid to support use-cases that require real-time analytics.
  • Batch Ingestion: Druid also supports batch ingestion from object stores like HDFS or cloud storage. This is useful for backfilling data or ingesting large volumes of historical data.
  • Tuning Ingestion Performance: There are several ways to tune Druid’s ingestion performance. For example, you can adjust the size of the ingestion buffer to allow Druid to ingest larger batches of data at once. You can also tune the number of ingestion threads to allow Druid to ingest data in parallel.
  • Partitioning: Druid partitions your data during ingestion to ensure that it can be evenly distributed across your cluster and parallelized for faster query performance. You can control how your data is partitioned by specifying a partition spec in your ingestion spec.
  • Roll-Up: During ingestion, Druid can “roll up” data by aggregating it based on specified dimensions. This can significantly reduce the amount of data that needs to be stored and queried, improving both ingestion and query performance (see the ingestion sketch after this list).
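
As a sketch, the SQL-based batch ingestion below (available with the multi-stage query engine in Druid 24+) loads a hypothetical file from S3, rolls data up to minute grain with GROUP BY, and partitions the result by day; the bucket, input schema, and table name are assumptions:

```sql
-- SQL-based batch ingestion sketch (multi-stage query engine, Druid 24+).
-- The S3 URI, input schema, and target table are hypothetical.
INSERT INTO web_events_rollup
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1M') AS __time,  -- minute-grain roll-up
  country,
  page,
  COUNT(*)   AS event_count,
  SUM(bytes) AS total_bytes
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://my-bucket/events/2024-03-15.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "country", "type": "string"}, {"name": "page", "type": "string"}, {"name": "bytes", "type": "long"}]'
  )
)
GROUP BY 1, 2, 3          -- roll-up: one stored row per minute/country/page
PARTITIONED BY DAY        -- time partitioning: one time chunk per day
CLUSTERED BY country      -- secondary partitioning within each time chunk
```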

2. Update Operations:

Apache Druid supports data updates, but it’s important to note that it does not support single-record updates by primary key. Here are some key points about its update performance:

  • Overwrite: Druid supports overwriting existing data using time ranges. Data outside the replacement time range is not touched. Overwriting of existing data is done using the same mechanisms as batch ingestion.
  • Reindexing: Reindexing is an overwrite of existing data where the source of the new data is the existing data itself. It is used to perform schema changes, repartition data, filter out unwanted data, enrich existing data, and so on (see the sketch after this list).
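
A minimal sketch of both ideas, assuming a hypothetical web_events_rollup table: the REPLACE statement overwrites a single day by selecting from the table itself (a reindex), filtering out unwanted rows along the way:

```sql
-- Overwrite one day of a hypothetical table by reindexing the table itself
-- (multi-stage query engine). Rows outside the OVERWRITE interval are untouched.
REPLACE INTO web_events_rollup
OVERWRITE WHERE __time >= TIMESTAMP '2024-03-15 00:00:00'
            AND __time <  TIMESTAMP '2024-03-16 00:00:00'
SELECT *
FROM web_events_rollup
WHERE __time >= TIMESTAMP '2024-03-15 00:00:00'
  AND __time <  TIMESTAMP '2024-03-16 00:00:00'
  AND country <> 'invalid'   -- example: filter out unwanted rows while reindexing
PARTITIONED BY DAY
```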

3. Join Operations:

Apache Druid supports join operations both during data ingestion and at query time. Here are some key points about its join performance:

  • Join Operators: Druid has join operators for inner join, left join, right join, full outer join, and cross join, available using a join datasource in native queries or the JOIN operator in Druid SQL (see the sketch after this list).
  • Query-Time Lookups: These are simple key-to-value mappings preloaded on all servers involved in queries and can be accessed with or without an explicit join operator.
  • Performance Considerations: For best performance, it is recommended to avoid joins at query time whenever possible. Often this can be accomplished by joining data before it is loaded into Druid. For instance, you can use a SQL-based batch ingestion task with a JOIN clause in the SQL query. This allows you to join data from multiple tables during the ingestion process. However, there are situations where joins or lookups are the best solution available despite the performance overhead.
  • Star Schema Support: Druid can now handle star schema formats by allowing users to load dimension tables that can be joined with fact tables. This reduces the cost and time of a dimension update by up to 1000x, allowing queries to always use the latest dimension data.
  • Parallel Execution: Druid can execute join operations in parallel across multiple nodes in the cluster. This allows it to take advantage of all available resources and can significantly improve the performance of join operations.
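
A minimal sketch of a query-time join, assuming a hypothetical fact table web_events and a small dimension table country_dim loaded into Druid:

```sql
-- Hypothetical query-time join: a large fact table 'web_events' joined to a
-- small dimension table 'country_dim' that has been loaded into Druid.
SELECT
  d.country_name,
  SUM(f.bytes) AS total_bytes
FROM web_events AS f
INNER JOIN country_dim AS d
  ON f.country = d.country_code
WHERE f.__time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
```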

4. Aggregation Queries:

Apache Druid is designed to handle high-speed data ingestion and provides low latency queries on this data. Its aggregation performance is one of its key strengths. Here are some more details:

  • Aggregation Functions: Druid supports a wide range of aggregation functions, including count, sum, min, max, and average. These functions can be used both during ingestion to summarize data before it enters Druid, and at query time to summarize result data.
  • Post-Aggregations: Post-aggregations “re-process” the result of your query after the main aggregation step, which adds some overhead (see the sketch after this list).
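
A sketch combining both, with hypothetical table and column names; the derived ratio over the aggregates is computed as a post-aggregation step on the aggregated result:

```sql
-- Aggregations plus a derived expression over them; the final division is
-- computed as a post-aggregation on the aggregated result. Names are illustrative.
SELECT
  page,
  COUNT(*)   AS events,
  SUM(bytes) AS total_bytes,
  SUM(bytes) * 1.0 / COUNT(*) AS avg_bytes_per_event  -- post-aggregation
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY page
ORDER BY total_bytes DESC
LIMIT 10
```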

5. Materialized View Support:

Apache Druid supports materialized views through a pair of experimental extensions (materialized-view-maintenance and materialized-view-selection). This feature can greatly improve query performance, especially when the queried dataSource has a very large number of dimensions but the query requires only a few of them. In materialized-view maintenance, the dataSources users ingest are called base-dataSources; for each base-dataSource, you can submit derivativeDataSource supervisors to create and maintain other dataSources, called derived-dataSources, whose dimensions and metrics are subsets of the base-dataSource's. In materialized-view selection, Druid implements a new query type, view: when you issue a view query, Druid tries its best to optimize the query based on the query dataSource and intervals.

6. Indexing:

Apache Druid uses a unique indexing system to handle large volumes of data and provide low latency queries. Here are some details about its indexing:

  • Data Ingestion: Loading data in Druid is called ingestion or indexing. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called segments. For most ingestion methods, the Druid MiddleManager processes or the Indexer processes load your source data.
  • Indexing Service: The Apache Druid indexing service is a highly-available, distributed service that runs indexing related tasks. Indexing tasks are responsible for creating and killing Druid segments. The indexing service is composed of three main components: Peons that can run a single task, MiddleManagers that manage Peons, and an Overlord that manages task distribution to MiddleManagers.
  • Bitmap Indexes: Druid uses Roaring or CONCISE compressed bitmap indexes to create indexes that power fast filtering and searching across multiple columns.
  • Time-based Partitioning: Druid always partitions data by time first, and can additionally partition on other fields.

7. Streaming Ingestion:

Apache Druid supports real-time ingestion, where data can be ingested event-by-event. This allows Druid to support query-on-arrival, meaning you can query data as soon as it arrives: for streaming ingestion, the MiddleManager and Indexer processes respond to queries in real time with arriving data. Streaming ingestion is controlled by a continuously running supervisor, which runs and supervises a set of tasks over time.

Druid supports various ingestion methods, including Apache Kafka and Amazon Kinesis for streaming ingestion. Druid reads directly from Kafka or Kinesis and can ingest late data, meaning it can handle data that arrives after the time window for which it was expected.

Exactly-Once Guarantees: For supervised streaming ingestion from Kafka and Kinesis, Druid provides exactly-once guarantees, ensuring that each event is processed exactly once and thereby preventing data duplication.
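
As a sketch of query-on-arrival, the query below (hypothetical table) counts events from the last five minutes; with streaming ingestion, part of this result is served by real-time tasks whose segments have not yet been published:

```sql
-- Query-on-arrival: with streaming ingestion, rows are queryable as soon as
-- they are ingested, before segments are published. Table name is hypothetical.
SELECT
  TIME_FLOOR(__time, 'PT1M') AS minute_bucket,
  COUNT(*) AS events
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
GROUP BY 1
ORDER BY 1
```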

Strengths

  1. Streaming Ingestion: Druid supports real-time ingestion, enabling query-on-arrival and providing exactly-once guarantees, making it suitable for scenarios where data arrives continuously.
  2. Real-time Analytics and Low-Latency Queries: Apache Druid excels in providing low-latency queries on large volumes of event data, making it suitable for real-time analytics scenarios.
  3. Semi-Structured Data Support: The ability to ingest and store semi-structured data without flattening nested structures allows for more flexible handling of diverse data formats, such as those from web APIs, mobile devices, and IoT devices.
  4. Scalability: Apache Druid offers horizontal and vertical scaling options, allowing it to efficiently handle increasing workloads by adding more servers to the cluster or scaling up individual servers.

Weaknesses

  1. Limited Update Support: While Druid supports data updates, it lacks support for single-record updates by primary key, potentially limiting its use in scenarios requiring granular data modifications.
  2. Lack of Compute and Storage Separation: Druid combines compute and storage, which can limit the independent scalability of these resources, potentially leading to inefficiencies in resource utilization.
  3. Management Complexity: Managing a clustered deployment, especially as data scales, introduces complexities such as load balancing, data replication, and failure recovery, requiring careful attention and expertise.
  4. Learning Curve: The system’s complexity may pose a challenge, and users may need time and effort to grasp key concepts, set up and manage a cluster effectively.
  5. Resource Cost in Cloud Deployment: While Apache Druid itself is open source, deploying it in a cloud environment incurs costs for cloud resources, and fully managed services from third-party providers may involve additional expenses.

Best use case

Apache Druid is best suited for scenarios that require real-time analytics, low-latency queries on massive datasets, and efficient handling of semi-structured data. Its strengths in horizontal and vertical scalability make it ideal for use cases with growing workloads, while its support for diverse integration options, including the ability to read directly from Apache Kafka and Amazon Kinesis, makes it adaptable to various data sources and tech ecosystems.

Worst use case

Druid may not be the best fit for scenarios where independent scalability of compute and storage resources is crucial. The lack of separation between compute and storage can be a limitation in cases where efficient resource scaling is essential. Additionally, the management complexity may pose challenges for smaller deployments or situations where a simpler storage solution suffices.

In summary, Apache Druid stands out as a robust OLAP storage solution with strengths in real-time analytics, low-latency queries, and efficient handling of semi-structured data. Its column-oriented storage format, scalability options, and diverse integration support make it ideal for applications with growing workloads and the need for timely analytics on massive datasets.

However, it’s essential to acknowledge certain limitations and considerations associated with Apache Druid. The lack of separation between compute and storage, while offering performance benefits, may limit independent scalability and resource optimization. The management complexity, especially in clustered deployments, requires careful attention and expertise.

Always check the official documentation for the latest information on Apache Druid.

For a more comprehensive understanding of how to assess this information, please refer to the key metrics outlined in the article How to Choose the Right OLAP Storage when making the crucial decision for your OLAP storage needs. To enhance your proficiency in data management, explore the Strategic guide on mastering data for software developers.


Aleh Belausau
Towards Data Engineering

Data and Software Engineer specializing in using cloud technologies for business growth. https://www.linkedin.com/in/aleh-belausau/