Choosing OLAP Storage: ClickHouse

Aleh Belausau
Towards Data Engineering
13 min read · Mar 3, 2024

Previously, in the article How to Choose the Right OLAP Storage, I described the key metrics to consider when making the critical choice of OLAP storage for your needs. Now I have decided to apply this approach in practice and examine the most popular, as well as some of the not-so-popular, OLAP storage solutions. The main goal is to discern the strengths and weaknesses of each OLAP storage solution and determine the most fitting use case for each.

What is OLAP Storage?

Online analytical processing (OLAP) is software technology you can use to analyze business data from different points of view. In OLAP scenarios, datasets can be massive, with billions or trillions of rows. Data is organized in tables with many columns, but only a few columns are needed to answer any particular query, and results must be returned in milliseconds or seconds. OLAP storage, then, is storage optimized for such analytical workloads.

This part of the research examines ClickHouse.

Overview

  • Developers — ClickHouse, Inc.
  • Written in — C++
  • Type — Column-oriented DBMS

ClickHouse is a high-performance, column-oriented SQL database management system for online analytical processing. It is available both as open-source software and as a cloud offering.

In a column-oriented DBMS like ClickHouse, data is stored by column: values from the same column are stored together. This makes column-oriented databases better suited to OLAP scenarios; according to ClickHouse's documentation, they are at least 100 times faster than row-oriented databases at processing most analytical queries.

ClickHouse uses all available system resources to their full potential to process each analytical query as fast as possible. This is made possible due to a unique combination of analytical capabilities and attention to the low-level details required to implement the fastest OLAP database.

Storage Architecture & Semi-Structured Data Support

1. Storage Format:

ClickHouse uses a column-oriented storage format, which results in significant performance gains for analytical queries.
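
To make the later examples concrete, here is a minimal sketch of a MergeTree table; the table and column names (web_events and so on) are purely illustrative and are reused in the examples below.

    -- columns are stored and compressed independently;
    -- ORDER BY defines the sparse primary index, PARTITION BY the on-disk partitioning
    CREATE TABLE web_events
    (
        event_time  DateTime,
        user_id     UInt64,
        url         String,
        duration_ms UInt32
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (event_time, user_id);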

2. Separation of Compute and Storage:

ClickHouse supports the separation of compute and storage, meaning that computing resources and storage resources are managed independently. This allows for better scalability, cost-efficiency, and flexibility: you can scale storage and compute separately as needed, optimizing both performance and cost.

ClickHouse supports using S3 as storage for the MergeTree engine through S3BackedMergeTree. This table engine enables users to exploit the scalability and cost benefits of S3 while maintaining the insert and query performance of MergeTree.
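
A minimal sketch of what this looks like in practice, assuming an S3-backed disk and a storage policy named s3_main have already been declared in the server's storage configuration (the names and configuration details are assumptions and will differ per deployment):

    -- assumes a storage policy 's3_main', backed by an S3 disk,
    -- is already declared in the server's storage configuration
    CREATE TABLE events_s3
    (
        event_time DateTime,
        user_id    UInt64,
        payload    String
    )
    ENGINE = MergeTree
    ORDER BY (event_time, user_id)
    SETTINGS storage_policy = 's3_main';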

Implementing and managing a separated storage and compute architecture is, however, more complicated than a standard ClickHouse deployment.

3. Semi-Structured Data Support:

Starting with version 22.3, ClickHouse supports a JSON data type (introduced initially as an experimental feature). This gives ClickHouse greater flexibility in ingesting, storing, and retrieving semi-structured data and complex objects.
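
A small sketch of how this can be used; on releases where the JSON type is still experimental it must be enabled first, and the table and payload below are purely illustrative:

    -- enable the experimental type where required
    SET allow_experimental_object_type = 1;

    CREATE TABLE api_events
    (
        ts   DateTime,
        body JSON
    )
    ENGINE = MergeTree
    ORDER BY ts;

    INSERT INTO api_events FORMAT JSONEachRow
    {"ts": "2024-03-01 12:00:00", "body": {"user": {"id": 42}, "tags": ["a", "b"]}}

    -- nested paths can then be addressed with dot notation
    SELECT body.user.id FROM api_events;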

Deployment & Pricing

1. Deployment Model:

ClickHouse can be deployed in various models depending on your requirements:

  • Single Node Deployment: ClickHouse can be deployed as a single instance. This option is simpler and helps keep costs down, making it an attractive alternative for many use cases. With sufficient resources, ClickHouse can process billions of records on a single server.
  • Cluster Deployment: ClickHouse can also be deployed as a cluster. This involves installing the ClickHouse server on every machine in the cluster, setting up the cluster configuration files, creating local tables on each instance, and creating a Distributed table. A Distributed table is a kind of “view” over the local tables in a ClickHouse cluster; a SELECT query against it executes using the resources of all of the cluster’s shards (see the sketch after this list).
  • Cloud Deployment: ClickHouse can be deployed on cloud platforms such as AWS, Azure, and Google Cloud Platform.
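
A rough sketch of that cluster setup, assuming a cluster named my_cluster is already defined in the configuration files (all names are illustrative):

    -- local table created on every node of the hypothetical cluster
    CREATE TABLE hits_local ON CLUSTER my_cluster
    (
        event_date Date,
        user_id    UInt64,
        url        String
    )
    ENGINE = MergeTree
    ORDER BY (event_date, user_id);

    -- the Distributed "view" fans queries out to hits_local on every shard
    CREATE TABLE hits_all ON CLUSTER my_cluster AS hits_local
    ENGINE = Distributed(my_cluster, currentDatabase(), hits_local, rand());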

Implementing and managing a cluster architecture is more complicated than a single-node deployment. While self-managed ClickHouse allows separating storage and compute, ClickHouse recommends ClickHouse Cloud, which provides this architecture out of the box, without extra configuration, through the SharedMergeTree table engine.

2. Fully Managed Service Options:

There are fully managed service options available for ClickHouse:

  • DoubleCloud: DoubleCloud offers a fully managed service for ClickHouse. It provides backups, monitoring, sharding and replica configuration, and updates, freeing you up to focus on your business instead of routine tasks. DoubleCloud also supports ClickHouse over S3, so older data is automatically moved to and stored in S3 alongside EBS. You can deploy Managed ClickHouse in your own AWS or Google Cloud account.
  • ClickHouse Cloud: ClickHouse Cloud offers several service types including Development, Production, and Dedicated. Each service type is designed for specific use cases and offers different storage, memory, compute options, backup retention, and support.
  • Elest.io: Elest.io also offers a managed service for ClickHouse. They provide seamless deployment and maintenance of your ClickHouse instance with the backing of their expert support team.

While these services handle most of the management tasks, it’s important to understand your specific requirements and choose the service that best fits your needs.

3. Scalability:

ClickHouse is designed to provide excellent scalability. It can leverage all available CPU cores and disks to execute even a single query, not only on a single server but across all CPU cores and disks of a cluster. As a managed service, ClickHouse Cloud offers enhanced scalability features. Here are some of the ways ClickHouse Cloud achieves scalability:

  • Vertical and Horizontal Scaling: ClickHouse Cloud Production services can be scaled both vertically (by switching to larger replicas) and horizontally (by adding replicas of the same size). Vertical scaling typically helps queries that need a large amount of memory for long-running inserts or reads, while horizontal scaling helps with parallelization to support concurrent queries.
  • Automatic Scaling: ClickHouse Cloud scales services based on CPU and memory usage. It constantly monitors the historical usage of a service over a lookback window; if usage rises above or falls below certain thresholds, the service is scaled to match demand.

While ClickHouse provides these scalability features, implementing and managing a scalable architecture can be complex. It’s important to understand your specific requirements and choose the right scalability strategy for your needs.

4. Pricing Model:

ClickHouse is open-source and free to use. There are no licensing fees associated with ClickHouse. However, you may incur infrastructure costs if you deploy ClickHouse on cloud platforms or use managed services.

Management

1. Community/Support:

ClickHouse has a strong community and offers support services for all users and customers of ClickHouse Cloud. Users can access a community Slack channel and various resources, while cloud subscription customers receive dedicated support services.

2. Documentation:

ClickHouse provides comprehensive documentation on their official website. It includes a quick start guide, installation instructions, concepts, guides, and a detailed SQL reference. They also have a GitHub repository where the documentation is managed.

3. Ease of Management:

ClickHouse offers both self-managed and fully managed options. The self-managed option requires some expertise to operate and scale.

4. Learning curve:

The learning curve for ClickHouse can be steep, especially for beginners. While ClickHouse is a powerful tool for data analysis, it can require a significant investment of time and effort to learn, particularly for those new to database management systems.

5. SQL Support:

ClickHouse supports a largely ANSI-compatible SQL dialect, extended with features such as arrays, higher-order functions, and nested data structures.
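
As a small illustration of the dialect, arrays and lambda-based higher-order functions are supported alongside standard SQL:

    SELECT
        arrayMap(x -> x * 2, [1, 2, 3])       AS doubled,   -- apply a lambda to each element
        arrayFilter(x -> x > 2, [1, 2, 3, 4]) AS filtered;  -- keep only matching elements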

Integration

1. Supported Data Sources:

ClickHouse supports a wide range of data sources. Here are some of the core integrations:

  • Amazon MSK: Integration with Amazon Managed Streaming for Apache Kafka.
  • Amazon S3: Import from, export to, and transform S3 data in flight with ClickHouse’s built-in S3 functions (an example follows below).
  • Cassandra: Allows ClickHouse to use Cassandra as a dictionary source.
  • DeltaLake: Provides a read-only integration with existing Delta Lake tables in Amazon S3.
  • Google Cloud Storage: Import from, export to, and transform GCS data in flight with ClickHouse built-in S3 functions.
  • HDFS: Provides integration with the Apache Hadoop ecosystem by allowing data on HDFS to be managed via ClickHouse.
  • Hive: The Hive engine allows you to run SELECT queries on HDFS Hive tables.
  • JDBC: Allows ClickHouse to connect to external databases via the JDBC table engine. This is handy when no native integration engine, table function, or external dictionary exists for the data source, but a JDBC driver does.
  • Kafka: Integration with Apache Kafka, the open-source distributed event streaming platform.

Additional information about supported data sources can be found here: https://clickhouse.com/docs/en/integrations
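
For example, the built-in s3 table function can query object storage directly; the bucket path and credentials below are placeholders:

    -- count rows across Parquet files in a hypothetical bucket;
    -- the same function can also be the target of INSERT ... SELECT exports
    SELECT count()
    FROM s3(
        'https://my-bucket.s3.amazonaws.com/data/*.parquet',
        'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY',
        'Parquet'
    );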

2. Cloud Services Integration:

ClickHouse offers several integrations with Google Cloud, Azure, and AWS, providing a seamless experience for managing and analyzing data. Here are some of the core integrations:

GCP Cloud:

  • Google Cloud Storage: ClickHouse can import from, export to, and transform GCS data in flight.
  • ClickHouse Cloud on GCP: ClickHouse Cloud is available on Google Cloud Platform. This allows users to build real-time applications without the devops overhead of a self-managed installation.

Microsoft Azure:

  • Azure Blob Storage: RudderStack provides a Reverse ETL connection that makes it easy to send data from your ClickHouse Data Warehouse to Azure Blob Storage.
  • Azure Synapse Analytics: Pipes allows you to automatically load your ClickHouse data into Azure Synapse Analytics.
  • Bitnami package for ClickHouse: The Bitnami package for ClickHouse is available on the Microsoft Azure Marketplace. This package provides an up-to-date, secure, and ready-to-run virtual machine for ClickHouse.

AWS Cloud:

  • Amazon MSK: Integration with Amazon Managed Streaming for Apache Kafka.
  • Amazon S3: Import from, export to, and transform S3 data in flight with ClickHouse built-in S3 functions.
  • ClickHouse Cluster on AWS: This solution deploys a ClickHouse cluster on the AWS Cloud.
  • ClickHouse Cloud on AWS: ClickHouse Cloud is also available as a fully managed service on AWS.

3. SDK Support:

ClickHouse supports several SDKs and client libraries for different programming languages. Here are some of them:

  • Go: The Go client uses the native interface for a performant, low-overhead means of connecting to ClickHouse.
  • Java: The Java client and JDBC driver.
  • Python: Several Python libraries such as infi.clickhouse_orm, clickhouse-driver, clickhouse-client.
  • PHP: Libraries like smi2/phpclickhouse, 8bitov/clickhouse-php-client, bozerkins/clickhouse-client, simpod/clickhouse-client, seva-code/php-click-house-client.
  • NodeJs: Libraries such as clickhouse (NodeJs), node-clickhouse, nestjs-clickhouse, clickhouse-client, node-clickhouse-orm, clickhouse-ts.
  • Ruby: Libraries like ClickHouse (Ruby) and clickhouse-activerecord.
  • Rust: Libraries like clickhouse.rs and clickhouse-rs.
  • C#: Libraries like Octonica.ClickHouseClient, ClickHouse.Ado, and ClickHouse.Client.

Note that most of these libraries are community-maintained: ClickHouse, Inc. maintains only the official clients and has not done extensive testing to ensure the quality of third-party libraries.

4. Supported Visualization Tools:

ClickHouse supports a wide range of data visualization tools. Here are some of them:

  • Explo: A self-service BI tool that allows you to explore your data and build beautiful, interactive dashboards.
  • Grafana: An open-source platform for monitoring and observability.
  • Tableau: An interactive data visualization software.
  • Looker: A business intelligence software and big data analytics platform.
  • Metabase: An open-source business intelligence tool.
  • Superset: A data exploration platform designed to be visual, intuitive, and interactive.
  • Deepnote: A new kind of data science notebook.
  • Draxlr: A simple, self-hosted platform for data analysis.
  • Rocket BI: A cloud-native business intelligence platform.
  • Zing Data: A data visualization tool.

Some of these tools connect to ClickHouse out-of-the-box, while others require a connector to be installed.

Performance

1. Insert Operations:

ClickHouse is optimized for high-speed insert operations into its columnar storage. It supports bulk inserts and parallel processing, making it efficient for large-scale data ingestion. Data is typically written to MergeTree tables in parts, and background processes optimize the storage. It is recommended to add data in fairly large batches, such as 100,000 rows at a time, and to group data by partition key before uploading it to ClickHouse. Here are some key points about ClickHouse’s insert operations (a short sketch follows the list):

  • Insert Syntax: ClickHouse uses the INSERT INTO statement to insert data into a table. You can specify a list of columns to insert using the (c1, c2, c3) syntax. Data can be passed to the INSERT in any format supported by ClickHouse.
  • Performance: ClickHouse sorts the input data by primary key and splits it into partitions (by month, if a monthly partition key is used). Inserting data that spans many partitions in one go can significantly reduce the performance of the INSERT query, so add data in fairly large batches, such as 100,000 rows at a time.
  • Asynchronous Inserts: ClickHouse supports asynchronous inserts, which allow you to insert data without waiting for it to be written to disk. The server buffers and batches the writes itself, which can improve the performance of many small insert operations.
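
A short sketch of both approaches, using the hypothetical web_events table from earlier (the settings shown are the standard asynchronous-insert switches; tune them to your workload):

    -- batched insert: one statement carrying many rows
    INSERT INTO web_events (event_time, user_id, url, duration_ms) VALUES
        ('2024-03-01 10:00:00', 1, '/home', 120),
        ('2024-03-01 10:00:01', 2, '/docs', 95);

    -- asynchronous inserts: the server buffers and batches small writes itself
    SET async_insert = 1, wait_for_async_insert = 0;
    INSERT INTO web_events (event_time, user_id, url, duration_ms) VALUES
        ('2024-03-01 10:00:02', 3, '/pricing', 210);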

2. Update Operations:

ClickHouse is primarily optimized for read-heavy workloads; updates are less performant, but it is possible to modify or delete existing data. These operations are labeled “mutations” and are executed using the ALTER TABLE command. Here are some key points about ClickHouse’s update operations (a sketch follows the list):

  • Updating Data: Use the ALTER TABLE ... UPDATE command to update rows in a table.
  • Deleting Data: Use the ALTER TABLE ... DELETE command to delete rows.
  • Lightweight Deletes: Another option for deleting rows is to use the DELETE FROM command, which is referred to as a lightweight delete. The deleted rows are marked as deleted immediately and will be automatically filtered out of all subsequent queries, so you do not have to wait for a merging of parts or use the FINAL keyword. Cleanup of data happens asynchronously in the background.
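
A brief sketch of the three approaches against the hypothetical web_events table:

    -- mutations: affected parts are rewritten asynchronously in the background
    ALTER TABLE web_events UPDATE url = '/home-v2' WHERE url = '/home';
    ALTER TABLE web_events DELETE WHERE event_time < '2023-01-01';

    -- lightweight delete: rows are masked immediately and cleaned up later
    DELETE FROM web_events WHERE user_id = 42;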

3. Join Operations:

Join operations in ClickHouse can be slow if not optimized properly. Here are some key points about ClickHouse’s join operations (a sketch follows the list):

  • Join Types: ClickHouse supports all standard SQL JOIN types: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. It also supports additional join types such as SEMI JOIN, ANTI JOIN, ANY JOIN, and ASOF JOIN.
  • Performance: When running a JOIN, there is no optimization of the order of execution relative to other stages of the query: the join (a search in the right table) runs before filtering in WHERE and before aggregation. Each time a query is run with the same JOIN, the subquery is run again because the result is not cached.
  • Optimization: ClickHouse supports several join algorithms, including hash-based and merge-based joins, selectable via the join_algorithm setting. Choosing the right join strategy is crucial for optimizing query performance.
  • Large Tables: Joining large tables in ClickHouse can result in an out of memory error or slow performance.
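
A sketch of a join against a hypothetical users dimension table, with the join algorithm chosen explicitly; whether this particular algorithm helps depends entirely on the data and is only an illustration:

    -- 'users' is a hypothetical dimension table
    SELECT u.country, count() AS clicks
    FROM web_events AS e
    INNER JOIN users AS u ON e.user_id = u.id
    GROUP BY u.country
    SETTINGS join_algorithm = 'partial_merge';  -- merge-based join to limit memory use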

4. Aggregation Queries:

ClickHouse can generate analytical data reports in real time and is known for its high performance on aggregation queries. Here are some key points about ClickHouse’s aggregation performance (an example follows the list):

  • Parallel Processing: ClickHouse data warehouses collect aggregates using parallel processing followed by a merge to assemble final results.
  • GROUP BY Keys: The number and cardinality of GROUP BY keys affect performance. For example, a query producing a few dozen GROUP BY values uses less memory and runs faster than a query producing a much larger number of values.
  • Complexity of Aggregate Function: The complexity of the aggregate function used in the query can also impact the performance.
  • Number of Threads: The number of threads used in the scan can also affect performance.
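
A typical aggregation over the hypothetical web_events table might look like this; the functions and keys are illustrative:

    SELECT
        toStartOfHour(event_time)   AS hour,
        uniqExact(user_id)          AS users,         -- exact distinct count
        quantile(0.95)(duration_ms) AS p95_duration   -- approximate 95th percentile
    FROM web_events
    GROUP BY hour
    ORDER BY hour;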

Remember, the specific performance can depend on the particular use case, data volume, and system configuration. Always refer to the official documentation for the most accurate and up-to-date information.

5. Materialized View Support:

ClickHouse supports materialized views, which are stored as tables whose data is computed by a SELECT query. A materialized view is updated automatically as new data is inserted into its source table; it effectively acts as an insert trigger.
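
A small sketch of an incrementally maintained daily aggregate over the hypothetical web_events table (SummingMergeTree is just one of several possible target engines):

    -- the daily aggregate is maintained incrementally as rows are inserted
    CREATE MATERIALIZED VIEW daily_clicks
    ENGINE = SummingMergeTree
    ORDER BY day
    AS SELECT
        toDate(event_time) AS day,
        count()            AS clicks
    FROM web_events
    GROUP BY day;

    -- when reading, aggregate again (e.g. sum(clicks)) because parts merge in the background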

6. Indexing:

ClickHouse indexing is a powerful feature that can significantly improve query performance. Here’s an in-depth look at how indexing works in ClickHouse (an example follows the list):

  • Index Types: ClickHouse relies on a sparse primary index per table plus optional data-skipping (secondary) indexes such as minmax, set, and Bloom-filter-based indexes, including token and n-gram Bloom filters used for text search.
  • Sparse Primary Indexes: Unlike traditional relational database management systems, ClickHouse builds and uses a table’s sparse primary index. This means that instead of indexing every row, ClickHouse indexes every 8192nd row when using default settings.
  • Data Skipping Indexes: ClickHouse provides a different type of index, called “Skip” indexes, which can significantly improve query speed. These structures enable ClickHouse to skip reading significant chunks of data that are guaranteed to have no matching values.
  • Indexing Best Practices: There are best practices for indexing in ClickHouse, such as selecting a primary key that applies to the most common query patterns. However, no matter how carefully tuned the primary key, there will inevitably be query use cases that cannot efficiently use it.
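
As an illustration, a data-skipping index can be added to an existing table; the index type and granularity below are only examples and should be chosen based on the actual query patterns:

    -- add a Bloom-filter skip index on a column that is frequently filtered on
    ALTER TABLE web_events
        ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;

    -- build the index for parts that already exist on disk
    ALTER TABLE web_events MATERIALIZE INDEX idx_url;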

7. Streaming Ingestion:

ClickHouse provides a native integration with Apache Kafka for real-time data ingestion. It also supports other integrations for data ingestion such as Amazon Managed Streaming for Apache Kafka, Amazon S3, Google Cloud Storage, and more.
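
A rough sketch of the usual Kafka-engine pattern, in which a Kafka-backed table is paired with a materialized view that persists consumed rows into a MergeTree table; the broker, topic, and consumer-group names are placeholders:

    CREATE TABLE events_queue
    (
        event_time DateTime,
        user_id    UInt64,
        url        String
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list  = 'web_events',
             kafka_group_name  = 'clickhouse_consumer',
             kafka_format      = 'JSONEachRow';

    -- a materialized view continuously moves consumed rows into MergeTree storage
    CREATE MATERIALIZED VIEW events_queue_mv TO web_events AS
    SELECT event_time, user_id, url, 0 AS duration_ms
    FROM events_queue;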

Strengths

  1. High Query Performance: ClickHouse’s columnar storage structure and advanced query optimization techniques drive fast processing of complex analytical queries.
  2. Scalability: ClickHouse is built to handle large volumes of data and can be scaled horizontally or vertically as needed.
  3. Data Compression: By storing data in columns and utilizing advanced compression algorithms, ClickHouse reduces storage requirements.
  4. Distributed Processing: ClickHouse supports distributed data processing across multiple nodes, for more efficient processing of vast datasets and enhanced query performance.

Weaknesses

  1. Nontraditional indexing: The sparse index makes ClickHouse not so efficient for point queries retrieving single rows by their keys.
  2. Steep Learning Curve: ClickHouse can be complex to learn, especially for beginners.
  3. No Full-Fledged Transactions: ClickHouse does not support full-fledged transactions.
  4. Limited Data Modification: ClickHouse is not built to modify or delete already inserted data at a high rate and with low latency. Batch deletes and updates are available to clean up or modify data, for example to comply with GDPR.
  5. Limited Support for Complex Data Types: Support for some complex and semi-structured data types is still limited or experimental; the JSON type, for example, is a relatively recent addition.

Best use case

ClickHouse is best suited for real-time analytics, big data analytics, time-series workloads, and data warehousing. It is also ideal for use cases that require fast aggregations on large datasets. Furthermore, ClickHouse can serve as a robust data backend for training and evaluating machine learning models.

Worst use case

ClickHouse should not be used as a replacement for your relational database. It is not designed to handle row updates and deletions efficiently. It is also not efficient for point queries retrieving single rows by their keys.

In summary, ClickHouse is a high-performance, column-oriented DBMS tailored for OLAP tasks. Its strengths lie in scalable architecture, efficient storage, and support for semi-structured data. ClickHouse excels in real-time analytics and big data, providing robust integration with various sources and tools.

Despite its steep learning curve and limitations in traditional indexing and transactions, ClickHouse is ideal for scenarios requiring fast aggregations on large datasets. However, it’s not recommended for frequent data modifications or as a relational database replacement.

Always check the official documentation for the latest information on ClickHouse.

For a more comprehensive understanding of how to assess this information, please refer to the key metrics outlined in the article How to Choose the Right OLAP Storage when making the crucial decision for your OLAP storage needs. To enhance your proficiency in data management, explore the Strategic guide on mastering data for software developers.


Aleh Belausau is a Data and Software Engineer specializing in using cloud technologies for business growth. https://www.linkedin.com/in/aleh-belausau/