Using AWS S3 to backup & restore Clickhouse data
Using AWS S3 as a storage disk for tables also gives us the opportunity to use S3 as backup/restore storage for our database. Clickhouse can manage the backup and restore process automatically, including incremental backups, once the storage disk is configured. Let’s see how to set things up right.
Setting up AWS S3 as a backup disk
In order to add S3 as backup storage, add a new s3.xml file under the /etc/clickhouse-server/config.d dir:
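Below is a minimal sketch of what this s3.xml could look like; the bucket URL and credentials are placeholders to replace with your own values:
<clickhouse>
    <storage_configuration>
        <disks>
            <!-- the s3 disk that will hold our backups -->
            <s3>
                <type>s3</type>
                <!-- bucket name and /backups/ path are example values -->
                <endpoint>https://my-bucket.s3.amazonaws.com/backups/</endpoint>
                <access_key_id>YOUR_ACCESS_KEY_ID</access_key_id>
                <secret_access_key>YOUR_SECRET_ACCESS_KEY</secret_access_key>
            </s3>
        </disks>
    </storage_configuration>
    <backups>
        <!-- allow the s3 disk to be used as a backup destination -->
        <allowed_disk>s3</allowed_disk>
        <!-- must match the path part of <endpoint> above -->
        <allowed_path>/backups/</allowed_path>
    </backups>
</clickhouse>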
This config declares the s3 disk with the given access credentials. Define <endpoint> to specify the bucket and the directory to save backups to. Then define the <backups> block to allow using the s3 disk as a backup device. Note that the <allowed_path> value should be the same as the path directory from the <endpoint> param (/backups/ in our case).
Restart Clickhouse to make sure everything works fine:
sudo clickhouse restart
Now we’re ready to backup data to AWS S3.
Backup a table to S3
In order to back up a single table, we use the following query:
BACKUP TABLE test TO Disk('s3', 'test_backup.zip')
Here, we create an S3 backup named test_backup.zip from the test table. It takes some time depending on the table size. If everything went fine, we get a success message:
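The exact output depends on the Clickhouse version, but it should look roughly like this (the id below is just an illustration):
┌─id───────────────────────────────────┬─status─────────┐
│ 7b0a98c6-4e1b-4a2f-9d05-3b1c2a6f8e91 │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘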
It took 34 seconds to back up our test table with 20m rows, which takes ~200MB compressed. Nice.
Restore a table
To restore a table we use the following query:
RESTORE TABLE test AS test_restored
FROM Disk('s3', 'test_backup.zip')
This query will restore the table from the test_backup.zip backup into the test_restored table. If we omit AS test_restored, Clickhouse will restore directly into the test table (which should be removed beforehand). If everything went fine, we get this message:
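Again, the exact form can vary between versions, but it should be something along these lines (illustrative id):
┌─id───────────────────────────────────┬─status───┐
│ 4c2f7d1e-8a3b-42d6-b5c9-0e6f1a2d3b47 │ RESTORED │
└──────────────────────────────────────┴──────────┘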
Backup & restore full database
Instead of backing up and restoring each table, we can back up the full database. Let’s create a sample database and some tables in it.
CREATE DATABASE db;
USE db;
CREATE TABLE t1 (id UInt32) ENGINE=MergeTree ORDER BY id;
CREATE TABLE t2 (dt DateTime, msg String) ENGINE=MergeTree ORDER BY dt;
Let’s populate the t1 and t2 tables with some generated data:
INSERT INTO t1 SELECT * FROM generateRandom('id UInt32') LIMIT 1000000;
INSERT INTO t2 SELECT * FROM generateRandom('dt DateTime, msg String') LIMIT 1000000;
Backing up database
Now let’s back up the db database to the S3 disk:
BACKUP DATABASE db TO Disk('s3', 'db.zip')
Restoring database from backup
First, let’s delete our local db database:
DROP DATABASE db;
Now let’s restore db from the S3 db.zip backup:
RESTORE DATABASE db FROM Disk('s3', 'db.zip')
And in about 11 seconds our database was fully restored.
We can confirm all restored tables are available:
USE db;
SHOW TABLES;

┌─name─┐
│ t1   │
│ t2   │
└──────┘
Incremental backups
In production environments, creating a full backup can be a problem for huge databases. And since we deal with Clickhouse, we most certainly deal with huge databases.
Backing up the full dataset can take more time than the interval we’d want to back data up at. This is exactly the case for incremental backups, where we back up only the new portion of data that arrived after the last backup.
First of all, we have to create a base backup (for the first time only), which is done as a usual backup:
BACKUP TABLE test TO Disk('s3', 'test-base.zip')
Now let’s assume some time has passed and a new portion of data has arrived:
INSERT INTO test
SELECT number + 2000000000, randomPrintableASCII(15), now() - rand32()
FROM numbers(1000);
In order to create an incremental backup, we have to specify the base_backup setting to let Clickhouse know which existing backup to use to compute the data delta:
BACKUP TABLE test TO Disk('s3', 'test-2022-10-03.zip')
SETTINGS base_backup = Disk('s3', 'test-base.zip');
As we can see, this last backup took much less time to complete compared to the full backup process.
The next day (or after the defined period), you should specify test-2022-10-03.zip as the new base_backup so Clickhouse computes the new delta correctly:
BACKUP TABLE test TO Disk('s3', 'test-2022-10-04.zip')
SETTINGS base_backup = Disk('s3', 'test-2022-10-03.zip');
And so on: you specify your previous backup file name as the base_backup for your new backup file.
Restoring incremental backup
In order to restore the full table, you have to specify the last backup file name, and Clickhouse will automatically merge all deltas:
RESTORE TABLE test FROM Disk('s3', 'test-2022-10-04.zip')
We can actually restore data from any snapshot in the incremental backup chain (including, of course, the base file):
RESTORE TABLE test FROM Disk('s3', 'test-2022-10-03.zip')
Managing backup data
You can access the full list of your backup operations from the system.backups table:
SELECT * FROM system.backups
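If you only need the essentials of each operation, you can select a few columns instead; treat this as a sketch, since column names may vary slightly between Clickhouse versions:
SELECT id, name, status, start_time, error
FROM system.backups;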
We can also see the files Clickhouse creates in the S3 bucket.
If you use incremental backups, do not delete those files. If you only make full backups, you can delete older files periodically to save some space.
Summary
AWS S3 is natively supported by Clickhouse as storage for backing up and restoring data. Using incremental backups makes this solution practical for production environments.