Beginner’s Guide to ClickHouse: Introduction, Features, and Getting Started

Suffyan Asad
27 min read · Jul 28, 2024


Introduction

ClickHouse, an open-source Online Analytical Processing (OLAP) database, stands out for its remarkable speed and exceptional performance in a wide range of data warehousing and analytics scenarios, such as analytical and time-series workloads.


Its rapid growth in popularity is a testament to its high-performance capabilities, earning it a place in the tech stacks of renowned companies such as eBay, Microsoft, Lyft, and IBM.

Official ClickHouse website:

Various benchmarks can be found demonstrating the speed advantage ClickHouse has. For example, ClickHouse has several entries in the benchmarks querying 1.1 billion taxi ride records here:

Performance benchmarks — ClickHouse compared to other databases on 1.1 billion taxi rides dataset

At the time of writing, most of the faster performers are databases that can take advantage of GPUs and co-processors with a very large number of cores, such as Intel Xeon Phi or NVIDIA Tesla. These database systems include the likes of BrytlytDB. An interesting point to note here is that ClickHouse running on a laptop with an Intel Core i5 4670K processor, 16GB RAM, and a SanDisk SSD has performed faster on this dataset than a 6-node Redshift cluster on one of the four test queries (which is on a smaller dataset). So a single-node ClickHouse instance running locally on a laptop is quite a powerful tool for quick data exploration and analytics.

Here is a link to ClickHouse documentation explaining the design choices behind its performance:

Following are some other ClickHouse performance and cost comparisons with other database systems:

  • This benchmark compares ClickHouse with Druid and Rockset, both in terms of performance and cost:
  • Article on CloudFlare using ClickHouse to replace the old system to handle scale:

Feature Highlights

Here are some prominent features of ClickHouse, which are beneficial for a wide range of use-cases:

  • ClickHouse is available both as an open-source product and as a managed cloud-based offering on AWS, Azure, and GCP. At the time of writing, Azure is in beta. ClickHouse Cloud offers the benefits of managed offerings such as automatic scaling and a pay-per-use model, as well as additional features such as separate scaling of storage and compute, among others. For more details and pricing, visit:
  • ClickHouse is a multi-node database system, which allows for horizontal scaling. It can execute parts of queries on multiple nodes for improved performance.
  • The use of columnar storage in ClickHouse, together with a vectorized query engine, enhances the efficiency of analytical queries. These features are particularly useful for typical analytical queries that aggregate a large number of rows. Data is stored by columns, and this storage mechanism, where the columns of multiple rows are stored together on disk, makes quick aggregation possible.
  • ClickHouse is a very fast OLAP database system, optimized for reporting queries that perform aggregations on large amounts of data. ClickHouse is not an OLTP system, and it is not suited for OLTP use-cases. ClickHouse is not built for performing a large number of mutations. Update and delete operations are asynchronous, and there is no support for transactions. Additionally, ClickHouse is not built for handling a large number of parallel connections; therefore, beyond a certain scale, it should be used as the warehouse in a warehouse-mart architecture, with application and user-facing queries going to the data marts.

Setting up a local 3-node ClickHouse cluster

The first step in experiencing ClickHouse is to set up a ClickHouse cluster locally. For running the code examples of this article, I have created a three-node, single-replica cluster using Docker and Docker Compose. So first, make sure Docker and Docker Compose are installed on your system.

Prerequisites

I prefer using Docker Desktop, as it comes with Docker and Docker Compose, and has a nice visual interface to manage containers and images, in addition to many other things.

  • Download and install Docker Desktop for your OS and system:
  • Make sure Docker and Docker Compose are running.
Commands to check docker and docker compose versions
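
If Docker Desktop is installed, both should report a version (the exact output depends on your installation):

```bash
# Check that Docker and the Docker Compose V2 plugin are installed
docker --version
docker compose version
```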

Note: If you are familiar with docker-compose, I’m using the command docker compose instead of docker-compose, as the former is the recommended way of running commands in Docker Compose V2. Link:

Setting up the cluster

Next, clone the following repository:

Or alternatively, download and extract the code.

Next, in the terminal, navigate to the folder containing the cloned repository, and start the cluster using the following command:

Command to start the cluster
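
Assuming the compose file sits at the root of the cloned repository, starting the cluster is roughly:

```bash
# Run from the folder containing the cloned repository; Ctrl + C stops the cluster later
docker compose up
```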

This command spins up a 3-node cluster, with all 3 nodes running ClickHouse and ClickHouse Keeper.

Starting 3 node ClickHouse cluster

Run the docker ps command to see all three nodes running:

3 Node cluster running

This local cluster has 3 nodes, each with 4GB RAM and 2 CPUs. It is possible to modify this code to run 2 nodes or 1 node. However, keep in mind that it is recommended to run ClickHouse Keeper on an odd number of nodes.

So, if there are 2 ClickHouse nodes, a third node should run only the ClickHouse Keeper. Additionally, in production environments, it is recommended to run ClickHouse Keeper on separate nodes.

This link contains documentation about horizontally scaling ClickHouse:

This page in ClickHouse documentation provides details about ClickHouse cluster configuration:

Additionally, this Github repository contains many example ClickHouse configurations that can be used as reference:

Alternate — Setting up a single-node ClickHouse instance

If you do not want to run the 3 node cluster, it is possible to run ClickHouse standalone in a Docker container using the official image:

https://hub.docker.com/r/clickhouse/clickhouse-server/

Connecting to the cluster

ClickHouse can be connected to using the ClickHouse Java driver, which can be downloaded from here:

Many database clients have ClickHouse connectivity built-in. I prefer using the community edition of DBeaver as database client. You can download it from here:

To connect to ClickHouse’s first node, select the option to create a new connection in DBeaver, and choose the ClickHouse driver. Because we are using the latest ClickHouse version, which at the time of writing this article is 24.5.3.5, do not use the legacy ClickHouse driver.

New Connection interface in DBeaver

Next, set the host to localhost and the port to 8123 to connect to the first node. The username is default and there is no password:

Connection details to connect to ClickHouse node 1

Next, select Test Connection to make sure it can connect:

Connection successful

Similarly, create 2 more connections, second one to node 2 (port: 8124 keeping the rest same) and third to node 3 (port: 8125 keeping the rest same). Make sure you can connect to all 3 nodes:

Connection successful to all 3 nodes

On any node, run the following command:

Command to print ClickHouse version
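
A simple way to do this is the version() function:

```sql
-- Prints the version of the ClickHouse server the client is connected to
SELECT version();
```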

It should print the running ClickHouse version. Next, run the SQL code:

Print ClickHouse cluster details
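
One way to inspect the cluster topology is the system.clusters table; a minimal sketch:

```sql
-- Lists the configured clusters and their nodes (shard and replica layout)
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters;
```
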
Details of three ClickHouse nodes

To shut down the setup, press Ctrl + C on the terminal window or tab where the docker compose up command was executed. This will terminate the nodes. Next, use the docker compose down command to clean up the nodes and start fresh if needed. It's better to delete the clickhouse-server.log and clickhouse-server.err.log files in the logs/logs_1, logs/logs_2, and logs/logs_3 folders to isolate logs for each run. These logs are also helpful in diagnosing any issues you may come across.

In production environments, it is recommended not to enable the default user. Instead, create users with appropriate permissions according to their needs and use cases. Please visit this page in the ClickHouse documentation to set up users and relevant settings:

We have not loaded any data into ClickHouse at this time, as loading data will be covered later in the article.

ClickHouse Table Engines

Table engines are central to optimizing ClickHouse for various use-cases. Table engines determine how data is stored, how queries are processed, how data is accessed, and more. Choosing the correct engine requires understanding the main engine families, the way they store and replace data, and their other properties. Quoting from the documentation, the engines in ClickHouse determine:

Link:

One important thing to note is that ClickHouse allows choosing a different engine to store data in materialized views, and this will be explored later in the article. For now, let’s explore some engine families:

MergeTree: The MergeTree engine family is the most widely used and robust engine family, fitting most use cases.

ReplicatedMergeTree: This is the replicated version of the MergeTree engine family, supporting data replication for high availability and redundancy in multi-node ClickHouse clusters. It is recommended for production use cases, except when there is a single-node cluster.

Our example cluster doesn’t have replication as it is configured to maintain a single replica of the data, so we won’t be able to see replication in action. However, the links in the setting up example cluster section contain detailed documentation and Docker Compose examples for such a setup that can be used as a reference.

Log: The Log engine family includes TinyLog, which is recommended for working with small amounts of data (up to 1 million rows) and quick data insertion. Many ClickHouse tutorials and examples use this engine family because of its simplicity.

Special Engines: This includes special engines such as Distributed, which is used to create distributed tables across nodes of a ClickHouse cluster and is also covered in this article, as well as other special-purpose engines like Dictionary, File, etc.

Integration Engines: These are used to connect to external data sources. For example, the PostgreSQL engine allows connecting to data on a PostgreSQL instance and running queries on it.

MergeTree (and ReplicatedMergeTree) engine family

The MergeTree engine family is the most robust and widely used engine family; it is designed for handling high data volumes and high-volume data ingestion. In this section, we’ll work with MergeTree tables.

For detailed documentation of MergeTree, please visit the links:

The data in MergeTree tables is sorted by the sorting key columns (or expressions) into blocks called granules of 8192 rows (this is a configurable parameter with a default value of 8192, which is suitable for most use-cases). If the sorting key contains more than one column or expression, the data is first sorted by the first column in the sorting key, and within each value of the first column, it is sorted by the second column, and so on. Each granule has a mark, which stores the values of the key columns at the start of that block.

The primary key is optional, and can be provided if it is different from the sorting key. If no primary key is provided, the sorting key also becomes the primary key of the table.

The data is sorted by the sorting key, and the sorting key marks (the values of the sorting key columns at the start of each granule) are used to find the rows matched by the query filters. Therefore, the sorting key should be selected according to the usage patterns of common queries for fast data extraction. In other words, MergeTree family tables need to be tuned for the queries that run on them; otherwise, a lot of data (or all of it) will have to be scanned to find the required rows.

An important aspect of the MergeTree family is the underlying storage of data, and how sorting key and marks work in determining the data to be scanned, and how query filters work. This is covered in detail in the ClickHouse documentation on the MergeTree engine family:

The data used for demonstration is the popular TLC Trip Record dataset. It contains Yellow and Green taxi trip records in New York City, including pickup and drop-off locations, fare, tip, distance, etc.

The dataset can be obtained from:

From here, download the Yellow Taxi trip records for the years 2023, 2022, and 2021. They are Parquet files. In the repository folder, place them in the data/rides_2023, data/rides_2022, and data/rides_2021 folders:

Data Files

The data folder is mounted on all nodes of the ClickHouse cluster at /var/lib/clickhouse/user_files. Connect your terminal to node 1 by running the following commands:

Commands to connect to node 1 and list files
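
The container name below is an assumption; use the name shown by docker ps for your setup:

```bash
# Open a shell on node 1 (replace clickhouse-node-1 with your container name)
docker exec -it clickhouse-node-1 bash

# Inside the container, list the mounted data folders
ls /var/lib/clickhouse/user_files
```
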
Data files available on Node 1

You can use the docker ps command to list all the running nodes of the cluster and to get their names for the docker exec command.

Creating the rides database, and trips table

Next, let’s create the table and load data into it. But first, let’s create a database to hold the tables. Run the following SQL code from DBeaver (or any client you are using to connect to ClickHouse) on node 1 (or any node).

Command to create the rides database
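
A sketch of the statement, assuming the database is created on every node with the ON CLUSTER clause (my_cluster is a placeholder; use the name returned by SHOW CLUSTERS):

```sql
-- Creates the database on all nodes of the cluster
CREATE DATABASE IF NOT EXISTS rides ON CLUSTER my_cluster;
```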

After that, hit refresh on all 3 nodes of the cluster, and the newly created database should be visible on all 3 nodes:

Rides database accessible on all 3 nodes.

Next, on the first node, create the table:

Create table command
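
The sketch below is an approximation: it keeps only the TLC Yellow Taxi columns referenced in this article, the types are simplified, and the ORDER BY choice beyond tpep_pickup_datetime is an assumption. The partitioning matches the PARTITION BY toYYYYMM(tpep_pickup_datetime) key discussed later:

```sql
CREATE TABLE rides.trips
(
    VendorID              Int32,
    tpep_pickup_datetime  DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count       Int64,
    trip_distance         Float64,
    PULocationID          Int32,
    DOLocationID          Int32,
    total_amount          Float64,
    congestion_surcharge  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(tpep_pickup_datetime)  -- monthly partitions
ORDER BY (tpep_pickup_datetime, PULocationID);  -- sorting key; pickup time first
```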

The trips table gets created on the first node of the cluster, and it is only accessible and queryable from that node. Creating distributed tables accessible from all 3 nodes is covered in later sections of this article. For now, let’s insert data into the table. Run the following SQL code:

Code to insert the 2023 trips data into the trips table
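
Because the Parquet files are mounted under the server’s user_files directory, they can be read with the file() table function; a hedged sketch using the abbreviated column list from above:

```sql
-- Reads all 2023 Parquet files from /var/lib/clickhouse/user_files/rides_2023
INSERT INTO rides.trips
SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
       trip_distance, PULocationID, DOLocationID, total_amount, congestion_surcharge
FROM file('rides_2023/*.parquet', 'Parquet');
```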

Running this query from the client gives the following error:

Error on insertion

This query has to be executed using the ClickHouse client on the node itself. Connect the terminal to node 1, and run the following commands:

Connecting to node 1 and running ClickHouse client
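
Again, the container name is an assumption:

```bash
# Open a shell on node 1 and start the ClickHouse client
docker exec -it clickhouse-node-1 bash
clickhouse-client
```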

This, of course, requires the ClickHouse client to be present on the node. It should already be installed and available on all nodes of the example cluster; otherwise, it can be installed. Link:

Run the same insert query on the ClickHouse client:

Loading trips data into the trips table

Note: Insertion may fail if each node has less than 4GB RAM. If you are using a computer that doesn’t have sufficient RAM, the data can be inserted file by file instead of using *.parquet in the insertion query, or a 2-node cluster can be used instead.

The set contains 38.31 million rows. Let’s run a row count and a total fare aggregation. These can now be run from the database client. Run the following two queries:

Aggregating query on the trips table
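
A sketch of the two queries, assuming the fare aggregation runs over total_amount:

```sql
-- Row count over the whole table
SELECT count() FROM rides.trips;

-- Sum and average of the total fare over the whole table
SELECT sum(total_amount), avg(total_amount) FROM rides.trips;
```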

These give the following results:

Result of the aggregate query on the entire 2023 trips data

The result was obtained very quickly: on my computer, it took a little over 0.5 seconds to sum, average, and count all 38 million rows. Let’s see the query plan:
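
The plan can be inspected with EXPLAIN, for example:

```sql
-- Shows the logical plan: a full scan of rides.trips followed by the aggregation
EXPLAIN
SELECT count(), sum(total_amount), avg(total_amount) FROM rides.trips;
```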

Query plan of the first query

This query plan shows a simple scan of the entire dataset, followed by the aggregate operations. Next, let’s filter by tpep_pickup_datetime, which is one of the ordering columns:

Query with filter
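
A sketch of the filtered query, assuming a one-month window (which roughly matches the 3 million rows mentioned below):

```sql
-- tpep_pickup_datetime is part of the sorting key, so only matching granules are read
SELECT count(), sum(total_amount), avg(total_amount)
FROM rides.trips
WHERE tpep_pickup_datetime >= '2023-01-01'
  AND tpep_pickup_datetime <  '2023-02-01';
```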

This query executes very fast, even faster than the previous one. Execution times vary a lot, but they are consistently faster than the previous query on the whole set. On my computer, the runs took less than 0.1 seconds, and although over 3 million rows contributed to the final result, it was really fast.

Query with filter, executes significantly faster.

For comparison, let’s filter on a column that is not part of the sorting key, i.e. passenger_count:

Query filtered by passenger_count column
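
For instance, a predicate such as the following (the exact value is an assumption):

```sql
-- passenger_count is not in the sorting key, so the whole table is scanned
SELECT count(), sum(total_amount), avg(total_amount)
FROM rides.trips
WHERE passenger_count = 6;
```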

This query is slower, as the entire dataset is scanned to produce the result. Although the number of rows returned by the filter is significantly smaller, the system cannot take advantage of the filter because the data is not ordered by the passenger_count column. On my computer, this query takes 0.3 to 0.7 seconds, which is similar to the time taken by the whole-data aggregation query.

Execution time of the query filtering data on passenger_count column

This demonstrates the advantage of optimizing tables for frequent queries, as less data scanned results in faster query execution over large datasets.

Next, let’s look at materialized views, and the SummingMergeTree engine.

SummingMergeTree engine

SummingMergeTree is a form of MergeTree that sums all numeric columns across the rows that share the same sorting key value. If multiple rows are present for a sorting key value, they are combined into one row whose numeric columns are equal to the sums of the values in those rows.

The ClickHouse documentation recommends using a MergeTree table to store the full data, and SummingMergeTree tables for the aggregated data that feeds the reports.

Link to documentation:

Another thing to note is that the data is summed eventually, which implies that it is possible to have some parts not aggregated because they have not been merged yet. It is possible to run the OPTIMIZE query on the table to merge all the remaining parts. This and other details about the SummingMergeTree table engine are covered in detail, with examples, in the following article:

Note that the ReplicatedMergeTree and ReplicatedSummingMergeTree engines are also available for tables with replication.

Materialized Views

Materialized views in ClickHouse are pre-populated views that make faster query execution possible by shifting complex operations to the insert process instead of the queries that retrieve data. Materialized views are refreshed in real time when new data is inserted into the source table(s). The ClickHouse documentation describes materialized views as:

Materialized views allow users to shift the cost of computation from query time to insert time, resulting in faster SELECT queries.

Unlike in transactional databases like Postgres, a ClickHouse materialized view is just a trigger that runs a query on blocks of data as they are inserted into a table. The result of this query is inserted into a second “target” table.

and

Materialized views in ClickHouse are updated in real time as data flows into the table they are based on, functioning more like continually updating indexes.

Link to documentation:

The primary benefit of using materialized views in ClickHouse is the shift of complex processing to the insertion process, making each query simpler and faster if directed to the views. This not only enhances performance but also reduces the burden on the system during query execution. By pre-computing and storing the results of complex queries, materialized views can significantly reduce the time required to retrieve data, making them ideal for real-time analytics and reporting.

Shift of costly computations to insertion process. Image has been taken from the ClickHouse documentation at: https://clickhouse.com/docs/en/materialized-view

Here is the syntax of MaterializedView in the CREATE VIEW documentation:

Materialized View syntax
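
Paraphrasing the documentation, the statement has roughly this shape:

```sql
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]view_name [ON CLUSTER cluster]
[TO [db.]target_table]
[ENGINE = engine]
[POPULATE]
AS SELECT ...
```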

Link:

The POPULATE flag populates the view with the data already existing in the source table. If it is not specified, any new data that gets inserted into the table is inserted into the view, but not the data that existed before the view was created. However, the documentation advises against using it, as any data being inserted into the table while the view is being created is not inserted into the view.

Additionally, as seen in the syntax, it is possible to specify an engine, which can be different from the table engine of the source table. Choosing the right table engine is key to deriving maximum performance from tables and views.

Let’s see this in action by creating a view that summarizes rides by pickup date, vendor ID, and pickup location. It’ll use the SummingMergeTree engine to sum the passenger_count, trip_distance, total_amount, and congestion_surcharge columns:

Query to create and populate the view
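
A sketch of the view definition; the view name trips_summary_mv is an assumption and is reused in the queries that follow:

```sql
CREATE MATERIALIZED VIEW rides.trips_summary_mv
ENGINE = SummingMergeTree  -- sums numeric columns that are not part of the sorting key
ORDER BY (VendorID, pickup_date, PULocationID)
POPULATE  -- also load the data already present in rides.trips
AS
SELECT
    VendorID,
    toDate(tpep_pickup_datetime) AS pickup_date,
    PULocationID,
    passenger_count,
    trip_distance,
    total_amount,
    congestion_surcharge
FROM rides.trips;
```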

Here, we have defined the sorting key as VendorID, pickup_date, PULocationID. The rest of the columns get summed. As the POPULATE flag is present, the existing data in the table is also inserted into the view. It takes a few seconds to create the view on my computer because it runs the aggregation on all the data in the table. Once complete, it can be queried:

Selecting a few rows from the view
Rows in the materialized view

Next, let’s run two queries, one on the view and the second one on the table:

Queries to examine the aggregation done by SummingMergeTree engine
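
Sketches of the two queries, using the assumed view name from above:

```sql
-- 1. On the view: returns a single aggregated row per sorting-key value
SELECT *
FROM rides.trips_summary_mv
WHERE VendorID = 1 AND PULocationID = 1 AND pickup_date = '2023-01-01';

-- 2. On the table: returns every underlying row for the same filter
SELECT VendorID, tpep_pickup_datetime, PULocationID,
       passenger_count, trip_distance, total_amount, congestion_surcharge
FROM rides.trips
WHERE VendorID = 1 AND PULocationID = 1
  AND toDate(tpep_pickup_datetime) = '2023-01-01';
```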

Here, the first query filters rows with VendorID = 1, PULocationID = 1, and pickup_date = 2023-01-01. The result is only one row, because the SummingMergeTree engine merges all the rows for each value of the sorting key into one row:

Only one row is returned by running the query on the view

Whereas the second query returns all the rows from the table:

One row in the view contains aggregate of multiple rows in the table

Note that we haven’t used a GROUP BY expression in the definition of the materialized view, but due to the use of the SummingMergeTree engine, all the rows with the same value of the sorting key get collapsed into a single row, and the numerical columns are summed in this new row. This can be easily verified from the data above. If the MergeTree engine is used instead of SummingMergeTree, the view will contain all the rows:

Using MergeTree in the view

The above query returns the following result:

If MergeTree engine is used in the view, the rows are not aggregated

Additionally, views can be dropped by using the DROP VIEW query.

Next, let’s examine the rows in the view by month and year:

Query to examine rows by month and year
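
A sketch of the breakdown query over the view:

```sql
-- Number of rows in the view per pickup year and month
SELECT toYear(pickup_date) AS year, toMonth(pickup_date) AS month, count() AS num_rows
FROM rides.trips_summary_mv
GROUP BY year, month
ORDER BY year, month;
```
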
Data in the view

The view contains the data in the table, which is mostly for the year 2023. It also contains a few rows for the years 2001, 2003, 2009, and others.

If new rows are inserted into the table, the view is also updated. To demonstrate, let’s insert rows for the month of January 2024. Download the Parquet file containing the data, and place it in the data/rides_2024 folder in the repository. As mentioned before, this directory is mounted to all three nodes of the cluster. Next, connect to the first node as previously demonstrated and insert the data using the ClickHouse client; the view gets updated:

Command to insert January 2024 data into the table
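
The insert mirrors the earlier one, pointed at the 2024 folder:

```sql
-- Reads the January 2024 Parquet file mounted under user_files/rides_2024
INSERT INTO rides.trips
SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
       trip_distance, PULocationID, DOLocationID, total_amount, congestion_surcharge
FROM file('rides_2024/*.parquet', 'Parquet');
```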

After insertion, run the same query as above, and the number of rows for January 2024 in the view is updated:

View contains more rows for January 2024.

Additionally, there is an optional TO <table> clause in the syntax of the SQL to create materialized views. It sends the result of the view query to a target table. Views are a very powerful concept in ClickHouse, and the documentation link demonstrates many examples of using views:

Additionally, this article from the ClickHouse blog demonstrates chaining materialized views to create a pipeline that populates multiple tables stemming from a streaming table connected to Kafka:

This demonstrates the power and flexibility of materialized views in ClickHouse.

Mutations in ClickHouse

In ClickHouse, UPDATE and DELETE operations are called mutations. Mutations are implemented as asynchronous processes that run in the background, so their effect is not instantly visible: the requested operations do not get applied immediately. Because the commands that issue mutations return immediately, it is necessary to track the progress of mutations separately; we’ll see how to check the progress and completion of mutations later in this section. Mutations are executed with the ALTER TABLE query clause, and here is a full list of mutation operations:

An important thing to remember here is that ClickHouse is designed as a data warehouse, with the expectation that changes to existing data are infrequent and done in bulk. In general, the data stored in ClickHouse is not expected to change, and insertions arrive in large batches.

Next, let’s see the update and delete operations in action:

ALTER TABLE … UPDATE and DELETE

To update rows, the ALTER TABLE <table name> UPDATE query clause is used. Its syntax is:

ALTER TABLE syntax
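
Paraphrasing the documentation:

```sql
ALTER TABLE [db.]table [ON CLUSTER cluster] UPDATE column1 = expr1 [, ...] WHERE filter_expr
```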

This operation is asynchronous by default, and it may not start or complete immediately. Additionally, it is not possible to update key columns. Update is a heavy operation and requires rewriting data across many parts. Let’s try to update the number of passengers in each trip by adding 1 to it. The update query will be:

Query to update
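
A sketch of the attempted statement (note that no WHERE clause is given):

```sql
-- Attempted update of every row, without a WHERE clause
ALTER TABLE rides.trips UPDATE passenger_count = passenger_count + 1;
```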

But, instead of performing the update, it returns with an error:

Error in update query

This is because the WHERE clause is mandatory in the update query syntax. Altering rows is a heavy operation, and ClickHouse doesn’t allow issuing it without an explicit filter. Because the WHERE clause is required, a workaround is to execute it with a WHERE TRUE clause, but this should be done with caution. The data in our example table is not that big, so let’s run it:
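
The statement with the workaround filter looks roughly like this:

```sql
-- WHERE true satisfies the mandatory filter while still touching every row
ALTER TABLE rides.trips UPDATE passenger_count = passenger_count + 1 WHERE true;
```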

Update clause returns immediately, and starts the asynchronous UPDATE operation

Status of mutations can be checked by querying the system.mutations table:

Query to check mutations
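
A sketch of the status query:

```sql
-- Recent mutations on the trips table; is_done = 1 means the mutation has finished
SELECT mutation_id, command, create_time, is_done
FROM system.mutations
WHERE database = 'rides' AND table = 'trips'
ORDER BY create_time DESC;
```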

The query returns the following result:

Status of UPDATE operation

We can see in the query result that the UPDATE operation is complete, as is_done is 1. It is 0 while a mutation is in progress.

Next, let’s update the pickup date for some rows:

Updating the tpep_pickup_datetime column
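
For example, an attempt along these lines (the expression and filter are assumptions) fails:

```sql
-- tpep_pickup_datetime is part of the table's key and cannot be updated
ALTER TABLE rides.trips
UPDATE tpep_pickup_datetime = tpep_pickup_datetime + INTERVAL 1 DAY
WHERE passenger_count = 0;
```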

This query results in the following error:

Error when updating a key column

This update is not allowed because key columns cannot be updated.

Similar to update, the delete operation is a mutation, and its syntax is:

Syntax of delete query
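
Paraphrasing the documentation:

```sql
ALTER TABLE [db.]table [ON CLUSTER cluster] DELETE WHERE filter_expr
```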

Link to documentation:

Although both UPDATE and DELETE are mutations and are performed asynchronously by default, it is possible to make the query wait until the mutations are complete, and it can also be configured to wait until all replicas have been updated. This is done using the mutations_sync setting.
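
For example, a mutation can be made to block until it has finished on the local server (1) or on all replicas (2) by attaching the setting to the query; a sketch with an assumed filter:

```sql
-- Waits until the delete has been applied on all replicas before returning
ALTER TABLE rides.trips DELETE WHERE passenger_count = 0
SETTINGS mutations_sync = 2;
```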

Read more here:

Lightweight DELETE for MergeTree engine

A lightweight DELETE operation is available on tables that use the MergeTree engine. It uses the DELETE FROM syntax and works by immediately marking rows as deleted so that they no longer appear in subsequent queries. The rows marked for deletion are then cleaned up during subsequent merge operations. Read about the lightweight delete operation, its limitations, pros and cons, and use-cases here:
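
A sketch of the lightweight form (the predicate is an assumption):

```sql
-- Rows are masked immediately and physically removed by later merges
DELETE FROM rides.trips WHERE trip_distance < 0;
```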

Deleting bulk data by dropping partitions

It is recommended to delete bulk data by dropping entire partitions. This is possible if the filter of the delete operation matches the partition key of the table. This is another consideration while designing the schema for ClickHouse: bulk delete operations become very fast and simple if entire partitions can be dropped.

The example table rides.trips is partitioned by month and year. In the create statement of the table, the partition key has been provided as PARTITION BY toYYYYMM(tpep_pickup_datetime). Therefore, we can drop the partitions for certain months, deleting all the data for those months. For example, if we want to drop the data for the month of June 2023, the following query can achieve that:

Deleting June 2023 data by dropping the partition
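
Because the partition key is toYYYYMM(tpep_pickup_datetime), the June 2023 partition has the ID 202306:

```sql
-- Drops the entire June 2023 partition in one operation
ALTER TABLE rides.trips DROP PARTITION 202306;
```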

Now, when the data is queried, it can be observed that the data for the month of June, 2023 is no longer present:

Summary after deleting the data

In addition to dropping partitions, it is possible to detach and attach partitions. So, to replace or update data in bulk, it is possible to create a temporary table and fill it with the new data.

After performing validations, attach the new partitions to the final table, detaching the partitions containing the old data. Depending on the requirements, the old partitions can then be dropped or attached to a history table.

This process can be depicted in the following diagram:

Replacement of partition with one containing new data

The process of data replacement by replacing partitions is covered in detail in my article:

Although the article focuses on PostgreSQL, the concepts are generic and applicable here. However, due to differences in database systems, the syntax and some mechanics may vary. Refer to ClickHouse documentation for implementation details.

Additionally, this article covers various mechanisms available in ClickHouse for updating and deleting data, each with their pros and cons and use-cases in detail, and I highly recommend reading it:

One more article I highly recommend on partitions in ClickHouse:

In summary, updating and deleting data in ClickHouse involves understanding the asynchronous nature of mutations, the limitations on key columns, and the efficient strategies for bulk operations such as dropping partitions. By leveraging these features and understanding their implications, you can effectively manage data in ClickHouse and optimize your database operations.

Distributed table — using the DistributedTable engine

So far, all the tables created reside on one node of the cluster. In ClickHouse, it is possible to create tables that span multiple nodes. In fact, it is possible to create tables that span all nodes, and all the parts of those tables are replicated on the replica nodes if they exist. Distributed tables can be created using the Distributed engine.

A distributed table doesn’t store any data; it is a logical entity over a set of physical tables. The Distributed table engine is described in the ClickHouse documentation as:

Tables with Distributed engine do not store any data of their own, but allow distributed query processing on multiple servers.

Link:

To demonstrate, we’ll create three tables, called rides.trips_part, on all three nodes. The table on node 1 will contain data for the year 2023, the one on node 2 will contain 2022, and the table on node 3 will contain 2021:

Distributed table created using the rides tables on all three nodes.

The first step is to create the table on all 3 nodes and insert data into them. Connect to all 3 nodes using the docker exec command as demonstrated before. Use the docker ps command to get the names of the containers running the nodes. Then create the tables and load the data into them:

Commands to create rides.trips_part table on node 1 and insert trips data for 2023 into it.
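
The sketch below shows node 1 with the 2023 files; nodes 2 and 3 only differ in the source folder (rides_2022 and rides_2021). The column list, types, and ordering mirror the earlier trips table and are approximations:

```sql
-- Run in clickhouse-client on node 1 (use rides_2022 / rides_2021 on nodes 2 and 3)
CREATE TABLE rides.trips_part
(
    VendorID              Int32,
    tpep_pickup_datetime  DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count       Int64,
    trip_distance         Float64,
    PULocationID          Int32,
    DOLocationID          Int32,
    total_amount          Float64,
    congestion_surcharge  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(tpep_pickup_datetime)
ORDER BY (tpep_pickup_datetime, PULocationID);

INSERT INTO rides.trips_part
SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
       trip_distance, PULocationID, DOLocationID, total_amount, congestion_surcharge
FROM file('rides_2023/*.parquet', 'Parquet');
```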

Similarly, run the following on node 2:

Commands to create rides.trips_part table on node 2 and insert trips data for 2022 into it.

Similarly, run the following on node 3:

Commands to create rides.trips_part table on node 3 and insert trips data for 2021 into it.

After running the commands on all three nodes, refresh the connections to all three nodes on DBeaver and the newly created tables should be visible:

Newly created rides tables on all three nodes

Note: It is possible to run the create table query with the on cluster <cluster_name> clause to create the table on all three nodes. But the data load will have to be done on each node individually. To get the cluster name, run the show clusters command.

For more information on running distributed DDL queries on the entire cluster instead of individual nodes, please visit this section of the ClickHouse documentation:

Next, create the distributed table:

Command to create the distributed table
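
A sketch, again with a placeholder cluster name and an assumed table name of trips_distributed:

```sql
-- The distributed table is created on every node and routes queries to rides.trips_part
CREATE TABLE rides.trips_distributed ON CLUSTER my_cluster
AS rides.trips_part
ENGINE = Distributed(my_cluster, rides, trips_part, toYear(tpep_pickup_datetime));
```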

Here, we have specified the on cluster clause, so the distributed table gets created on all three nodes.

Additionally, we have specified the optional sharding key as toYear(tpep_pickup_datetime). This is a requirement to insert data into the distributed table, and will be covered later.

Here is the link to the documentation of Distributed table engine, containing syntax and other details about defining and using distributed tables in ClickHouse:

Now, let’s check the counts and other data:

Total rows on the distributed table, and the explain query to get the query plan
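
Sketches of the three queries against the distributed table (same assumed name as above):

```sql
-- 1. Total rows across all three nodes
SELECT count() FROM rides.trips_distributed;

-- 2. Query plan; the ReadFromRemote step shows data coming from the other nodes
EXPLAIN SELECT count() FROM rides.trips_distributed;

-- 3. Row counts per year and month, pulled from all three underlying tables
SELECT toYear(tpep_pickup_datetime) AS year,
       toMonth(tpep_pickup_datetime) AS month,
       count() AS num_rows
FROM rides.trips_distributed
GROUP BY year, month
ORDER BY year, month;
```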

The above query returns:

And the second query returns the following query plan:

Query plan of a query picking data from multiple nodes.

In the query plan, there is ReadFromRemote (Read from remote replica), which indicates that some data has been obtained from other nodes. ClickHouse not only obtains but also partially processes data, including grouping and aggregations, on the other nodes before sending the results to the node on which the query was executed. These partial results are then used to assemble the final result.

Finally, the last query returns the following result:

Result of the third query

Here, the data for all months of the years 2021, 2022, and 2023, along with the number of rows, can be seen. There are some rows in each file that do not belong to these years, but overall, the result has been pulled from the rides.trips_part tables on all three nodes.

Inserting data into a distributed table

ClickHouse documentation mentions two ways of inserting data into distributed tables:

  1. Insert directly into the distributed table. Because the distributed table itself is a virtual entity and holds no data, the data is forwarded to the respective nodes, but the table must have a sharding_key specified (a sketch of this approach follows this list).
  2. Insert data directly into the underlying tables on the nodes of the cluster. This method is described as the more efficient and flexible method in the documentation.
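
A sketch of the first approach, inserting through the distributed table so that rows are routed by the sharding key:

```sql
-- Rows are forwarded to the nodes according to toYear(tpep_pickup_datetime)
INSERT INTO rides.trips_distributed
SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
       trip_distance, PULocationID, DOLocationID, total_amount, congestion_surcharge
FROM file('rides_2024/*.parquet', 'Parquet');
```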

Conclusion, References and Additional Reading

In conclusion, ClickHouse is a powerful and versatile columnar database management system designed for real-time analytical processing. It offers efficient data storage, fast query processing, and flexible data manipulation through features like materialized views, distributed tables, and asynchronous mutations. With its ability to handle large-scale data and perform complex queries quickly, ClickHouse is an invaluable tool for businesses needing robust data analytics solutions. Whether dealing with real-time data streams or performing bulk data operations, ClickHouse provides the tools and performance required to meet diverse and demanding data processing needs.

ClickHouse is a highly capable system with a vast set of features, making it suitable for many analytics use cases. Covering it comprehensively is not possible in anything less than a 200-page book. This article merely scratches the surface and provides some insights into its capabilities. It aims to offer enough information for POCs and to help determine whether it is suitable for the reader’s use case.

Therefore, the links below, which cover various aspects of ClickHouse comprehensively with demonstrations of use cases and details not possible to include in this article, are the most valuable items I can provide.

  • In-depth: Clickhouse vs Big Query:
  • How we turned ClickHouse into our event mansion:
  • How to use ClickHouse to store event data for 100M+ users aka the PostHog Event Mansion:
  • In-depth: Clickhouse vs. PostgreSQL:
  • A look into primary key and index granularity in ClickHouse:
  • ClickHouse Docker Compose recipes:
  • 5 things to know about ClickHouse SummingMergeTree:
  • Using partitions in ClickHouse:
  • Chaining Materialized Views to create ingestion pipelines:

Following are some links from ClickHouse’s vast and comprehensive documentation that were used while researching for this article, and I believe they are useful when starting to use ClickHouse:

  • Why is ClickHouse so fast:
  • Table Engines in ClickHouse:
  • MergeTree Engine:
  • Database Engines in ClickHouse:
  • JDBC Bridge for connecting to external sources:
  • Configuring external storage e.g. S3:
  • Distributed table engine:
  • Horizontal scaling ClickHouse:
  • Setting-up a simple ClickHouse cluster:
  • User Roles and Settings in ClickHouse:
  • Working with Parquet Data in ClickHouse:

Please feel free to ask any questions or provide feedback!

Acknowledgements

I want to thank my colleague and friend Faheem Sharif, who helped me in writing this article, mainly contributing to the distributed table section.
