Getting data into and scaling for Billions of records with ClickHouse

Shamsul
9 min read · Sep 22, 2022


This is part of my series of articles that started with In search of the fastest data processing engine. It is a step-by-step guide to setting up a ClickHouse cluster, bulk-inserting data from S3, and querying the dataset. In this article, we’ll set up a simple single-node (1 shard, 1 replica) ClickHouse cluster on Kubernetes in Amazon EKS. As the series progresses, we’ll add complexity to the cluster by bringing in a larger dataset, multiple shards, replicas, and a special node called a “Keeper”.

A generic implementation of a big data OLAP system that is optimized for every workload is difficult to achieve, if not impossible. Every implementation is optimized for a set of specific use cases. That’s one of the reasons benchmarks often end up biased. So, even though I’m using a benchmark dataset here, the objective is not to create a formal benchmark of ClickHouse versus other OLAP systems. Nor are the results directly comparable to those produced by the AMPLab benchmark.

Dataset

For the dataset, I have chosen the AMPLab Big Data Benchmark dataset, which has been prepared in various input sizes in S3. That’s one of the reasons I chose it: anyone can follow along from a laptop with a local Kubernetes setup using a smaller dataset. However, I’ll be doing this exercise on AWS. The dataset is publicly available at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]

+-----------+---------------+---------------+-----------------+-----------------+
| S3 Suffix | Rankings rows | Rankings size | UserVisits rows | UserVisits size |
+-----------+---------------+---------------+-----------------+-----------------+
| /1node/   | 18 Million    | 1.28 GB       | 155 Million     | 25.4 GB         |
| /5nodes/  | 90 Million    | 6.38 GB       | 775 Million     | 126.8 GB        |
+-----------+---------------+---------------+-----------------+-----------------+

Unlike the AMPLab benchmark, I will use both the 1node and 5nodes datasets in a single node ClickHouse cluster. You can find more information regarding the dataset here.

Note: If you don’t care about Kubernetes and just want to try the ClickHouse part in your own setup, you can skip the following sections and move directly to the Load 1node dataset into ClickHouse section.

Setting up the Kubernetes cluster on EKS

Make sure you have the following tools installed on the machine from which you’ll be accessing the Kubernetes cluster. I’ll be using Kubernetes version 1.23, but older versions should work as well.

The IAM security principal that you’re using must have permissions to work with Amazon EKS IAM roles and service-linked roles, AWS CloudFormation, and a VPC and related resources. You must complete all of the following steps as the same IAM user.

Three node EKS cluster

  • Create your Amazon EKS cluster with managed Linux EC2 nodes. We will primarily use these initial 2 nodes to host control-plane components.
eksctl create cluster --name clickhouse-cluster --region us-east-1 --node-volume-size 30 --version 1.23

When the above command is completed you should see something like the following.

EKS cluster "clickhouse-cluster" in "region-code" region is ready.
  • Update kube-proxy, coredns, and aws-node addons
eksctl utils update-kube-proxy --cluster=clickhouse-cluster --region us-east-1 --approve
eksctl utils update-aws-node --cluster=clickhouse-cluster --region us-east-1 --approve
eksctl utils update-coredns --cluster=clickhouse-cluster --region us-east-1 --approve
  • Create a managed node-group with 1 EC2 machine of type r6gd.2xlarge. We will use this third node with NVMe SSD instance store volume for hosting the ClickHouse pod.
eksctl create nodegroup --cluster clickhouse-cluster \
  --region us-east-1 --name chi-node-group \
  --node-type r6gd.2xlarge --nodes 1 --nodes-min 1 \
  --nodes-max 1 --node-volume-size 30 \
  --node-labels "disktype=nvme-ssd"
  • View your cluster nodes. By now we should have 3 nodes.
kubectl get nodes -o wide

Note: This EKS cluster, including the instance store volume in the EC2 machine, is not optimized for production use.

Configure a single-node ClickHouse cluster

We’ll use the Altinity Operator for ClickHouse to instantiate our cluster. The operator makes it easy to create, configure, and manage ClickHouse clusters running on Kubernetes. For example, once the operator is installed, a single-node cluster definition can be as simple as the following.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "simple-01"
spec:
  configuration:
    clusters:
      - name: "simple"
  • Install the operator
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
  • Verify that the operator is up and running
kubectl get pods -n kube-system -l app=clickhouse-operator

The operator should be in running status.

  • Now, create the following StorageClass named local-storage.
kubectl apply -f https://raw.githubusercontent.com/shams858/clickhouse-on-k8s/main/k8s/single-node/local-storage-class.yaml
  • We’ll create the following PersistentVolume that uses the NVMe SSD instance store volume attached to the EC2 instance. We are using the nodeAffinity property of the PersistentVolume to consume the instance storage on the node labeled disktype=nvme-ssd.
kubectl apply -f https://raw.githubusercontent.com/shams858/clickhouse-on-k8s/main/k8s/single-node/nvme-ssd-pv.yaml
  • Now, we’ll create the ClickHouse cluster with 1 shard and 1 replica. We have defined a volumeClaimTemplate to use the previously created PersistentVolume, and nodeAffinity to run the ClickHouse pod on the previously created r6gd.2xlarge EC2 instance labeled disktype=nvme-ssd.
kubectl apply -f https://raw.githubusercontent.com/shams858/clickhouse-on-k8s/main/k8s/single-node/cluster.yaml
  • Verify the PersistentVolumeClaim which should be in Bound STATUS.
kubectl get pvc
  • Verify that the pod is in Running STATUS
kubectl get po -l clickhouse.altinity.com/app=chop
  • Verify the block devices from the running pod.
kubectl exec -it $(kubectl get po --output=jsonpath={.items..metadata.name} -l clickhouse.altinity.com/app=chop) -- lsblk

You should see something like the following.

Great, if you have done everything correctly, you now have a single-node ClickHouse cluster running on Kubernetes.

Monitoring with Prometheus and Grafana (optional)

Set up Prometheus by following the instructions provided here to pull metrics from ClickHouse into Prometheus. Then set up Grafana to display the data accumulated in Prometheus.

Load 1node dataset into ClickHouse

  • Get a shell to the ClickHouse running container to access the clickhouse-client from there.
kubectl exec -it $(kubectl get po --output=jsonpath={.items..metadata.name} -l clickhouse.altinity.com/app=chop)  -- /bin/sh

Now in the container shell that appears, run clickhouse-client.

clickhouse-client
  • Create the rankings table with the Log engine. ClickHouse offers various types of tables, called table engines. Log is a lightweight engine with minimal functionality; it is most effective when you need to quickly write many small tables and read them later as a whole.
  • Create the second table, uservisits, with the MergeTree engine. MergeTree is the most versatile and performant engine family, and it comes with unique features and index management that other engines lack.
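The CREATE TABLE statements appeared as images in the original post. Here is a sketch of what they might look like, based on the AMPLab schema; the table names match those used later in this article, but the exact column types and the MergeTree sorting key are my assumptions:

```sql
-- rankings: small table, Log engine (column types assumed from the AMPLab schema)
CREATE TABLE rankings_1node
(
    pageURL     String,
    pageRank    UInt32,
    avgDuration UInt32
)
ENGINE = Log;

-- uservisits: large table, MergeTree engine
-- ORDER BY visitDate is an assumed sorting key, not taken from the original
CREATE TABLE uservisits_1node
(
    sourceIP     String,
    destURL      String,
    visitDate    Date,
    adRevenue    Float32,
    userAgent    String,
    countryCode  String,
    languageCode String,
    searchWord   String,
    duration     UInt32
)
ENGINE = MergeTree
ORDER BY visitDate;
```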

Now that the tables have been created, we can load data into them from S3. However, we won’t download the data from S3 to our machine and then upload it to ClickHouse; instead, we’ll load data directly from S3 into the ClickHouse server. ClickHouse has a powerful set of table functions that return temporary objects that can be used in the FROM clause of a query. By implementing the IStorage table abstraction, anything can be queried like a table even though it’s not a physical database table.

ClickHouse table abstraction
  • Now, execute the following query. Note the use of wildcards in the S3 path, and how the format, table structure, and compression are provided as function arguments.
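The query itself was shown as an image in the original post; a reconstruction using ClickHouse’s s3 table function would look roughly like this (the exact S3 path layout and column structure are assumptions based on the AMPLab dataset description):

```sql
-- Load the 1node rankings data directly from S3.
-- Format and table structure are passed as function arguments.
INSERT INTO rankings_1node
SELECT *
FROM s3(
    'https://s3.amazonaws.com/big-data-benchmark/pavlo/text/1node/rankings/*',
    'CSV',
    'pageURL String, pageRank UInt32, avgDuration UInt32'
);
```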

It took around 20 seconds to load 18 million rows.

  • Now, load data into the uservisits_1node table.
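Again, the original showed this query as an image; it likely looked roughly like the following (path and column structure assumed):

```sql
INSERT INTO uservisits_1node
SELECT *
FROM s3(
    'https://s3.amazonaws.com/big-data-benchmark/pavlo/text/1node/uservisits/*',
    'CSV',
    'sourceIP String, destURL String, visitDate Date, adRevenue Float32,
     userAgent String, countryCode String, languageCode String,
     searchWord String, duration UInt32'
);
```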

It took 402 seconds to insert 155 million rows, a rate of 384K rows per second. Can we accelerate this rate of insertion? Exploring the ClickHouse CPU metrics in Prometheus/Grafana, I found that not all of the CPU cores were utilized while querying S3 for data loading. We’d like to take advantage of the max_insert_threads setting to utilize all of the available CPU cores.

We’ll first truncate the table and then rerun the Insert query with the required settings parameters.

TRUNCATE TABLE uservisits_1node;
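The parallelized insert was shown as an image in the original; with a SETTINGS clause it would look roughly like the following. The value 8 matches the 8 vCPUs of an r6gd.2xlarge, but the value actually used in the original is not shown, so treat it as an assumption:

```sql
INSERT INTO uservisits_1node
SELECT *
FROM s3(
    'https://s3.amazonaws.com/big-data-benchmark/pavlo/text/1node/uservisits/*',
    'CSV',
    'sourceIP String, destURL String, visitDate Date, adRevenue Float32,
     userAgent String, countryCode String, languageCode String,
     searchWord String, duration UInt32'
)
SETTINGS max_insert_threads = 8;
```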

With the above settings in place, it took about 78 seconds to insert 155M rows, a rate of 1.98M rows per second. A very impressive number, right? We gained more than a 5x speedup by parallelizing the query.

As expected, now all of the cores have been utilized as shown in the following graph from Grafana. At the same time, a higher value for max_insert_threads has led to higher memory usage.

CPU core utilization
Memory for Queries

Now, let's take a look at some interesting graphs. ClickHouse uses sparse indexing, and data is written to tables in parts, which are later merged in the background. Each part has its own primary index; when parts are merged, their primary keys are merged and sorted as well. Storing similar data together yields better compression. The graphs make it clear that insertion initially consumes more storage (and memory for the primary index), but once the parts are merged, sorted, and compressed, much less storage is used.
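You can watch this part behavior yourself by querying the system.parts system table (not from the original article, just a handy check):

```sql
-- Active data parts and their on-disk size, per table
SELECT
    table,
    count() AS active_parts,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active
GROUP BY table;
```

Running this during and after the insert shows the part count rising and then shrinking as background merges complete.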

Memory and Disk usage over time

Load 5node dataset into ClickHouse

  • Create the rankings table for the 5nodes dataset.
  • Create the uservisits table for the 5nodes dataset.
  • Load data into the 5nodes rankings table from S3. It loaded at 7.17 million rows/sec.
  • Load data into the 5nodes uservisits table from S3. It loaded at 1.08 million rows/sec.
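The 5nodes DDL and load statements were images in the original; presumably they mirror the 1node ones with the /5nodes/ path, e.g. for uservisits (table names and settings assumed):

```sql
-- Reuse the 1node table definition (same columns and engine) for the 5nodes copy
CREATE TABLE uservisits_5nodes AS uservisits_1node;

INSERT INTO uservisits_5nodes
SELECT *
FROM s3(
    'https://s3.amazonaws.com/big-data-benchmark/pavlo/text/5nodes/uservisits/*',
    'CSV',
    'sourceIP String, destURL String, visitDate Date, adRevenue Float32,
     userAgent String, countryCode String, languageCode String,
     searchWord String, duration UInt32'
)
SETTINGS max_insert_threads = 8;
```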

Queries

  • Query 1
SELECT pageURL, pageRank FROM [rankings table] WHERE pageRank > 1000;
  • Query 2
SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM [uservisits table] GROUP BY substring(sourceIP, 1, 8);
  • Query 3
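The original showed Query 3 as an image. In the AMPLab benchmark it is a join plus aggregation; it looks roughly like the following, using the same bracketed placeholders as Queries 1 and 2 (the upper date bound varies by benchmark variant; 1980-04-01 is the value used by variant 3A, and may differ from what the article actually ran):

```sql
SELECT
    sourceIP,
    sum(adRevenue) AS totalRevenue,
    avg(pageRank)  AS avgPageRank
FROM [rankings table] AS R
INNER JOIN [uservisits table] AS UV ON R.pageURL = UV.destURL
WHERE UV.visitDate BETWEEN '1980-01-01' AND '1980-04-01'
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;
```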

Results

  • Query1 on 1node dataset
6552 rows in set. Elapsed: 0.223 sec. Processed 18.00 million rows, 1.22 GB (80.67 million rows/s., 5.49 GB/s.)
  • Query2 on 1node dataset
1954299 rows in set. Elapsed: 3.074 sec. Processed 155.00 million rows, 4.07 GB (50.43 million rows/s., 1.33 GB/s.)
  • Query3 on 1node dataset
1 row in set. Elapsed: 0.859 sec. Processed 18.99 million rows, 1.32 GB (22.12 million rows/s., 1.53 GB/s.)
  • Query1 on 5nodes dataset
32888 rows in set. Elapsed: 0.978 sec. Processed 90.00 million rows, 6.12 GB (92.07 million rows/s., 6.26 GB/s.)
  • Query2 on 5nodes dataset
2067317 rows in set. Elapsed: 13.718 sec. Processed 751.75 million rows, 19.76 GB (54.80 million rows/s., 1.44 GB/s.)
  • Query3 on 5nodes dataset
1 row in set. Elapsed: 4.144 sec. Processed 94.40 million rows, 6.53 GB (22.78 million rows/s., 1.57 GB/s.)
1node and 5nodes dataset response time for all three queries

Cleanup

Let's delete the EKS cluster to clean up the complete setup.

eksctl delete cluster --region us-east-1 --name clickhouse-cluster

Conclusion

This reproducible setup can be used to compare ClickHouse performance from version to version. For example, we ran the queries with a default ClickHouse setup on Kubernetes, yet there is room for improvement, at least for Query 2. We could optimize this GROUP BY query and compare the results using the same setup. All the Kubernetes manifests and SQL scripts are available in this GitHub repo.

Acknowledgment

I acknowledge the work of Pavlo et al. (SIGMOD 2009) and the U.C. Berkeley AMPLab for preparing the dataset and queries.

Wendell, P. (n.d.). Big Data Benchmark. Retrieved September 22, 2022, from https://amplab.cs.berkeley.edu/benchmark/
