Using AWS S3 to backup & restore Clickhouse data
Using AWS S3 as a storage disk for tables also gives us the opportunity to use S3 as backup/restore storage for our database. Clickhouse can manage the backup and restore process automatically, including incremental backups, once the storage disk is configured. Let’s see how to set things up right.
Setting up AWS S3 as a backup disk
In order to add S3 as backup storage, add a new s3.xml file under the /etc/clickhouse-server/config.d dir:
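Below is a minimal sketch of what this s3.xml could look like; the bucket URL and credentials are placeholders to replace with your own values:
<clickhouse>
    <storage_configuration>
        <disks>
            <!-- the s3 disk that will hold our backups -->
            <s3>
                <type>s3</type>
                <!-- bucket name and /backups/ path are example values -->
                <endpoint>https://my-bucket.s3.amazonaws.com/backups/</endpoint>
                <access_key_id>YOUR_ACCESS_KEY_ID</access_key_id>
                <secret_access_key>YOUR_SECRET_ACCESS_KEY</secret_access_key>
            </s3>
        </disks>
    </storage_configuration>
    <backups>
        <!-- allow the s3 disk to be used as a backup destination -->
        <allowed_disk>s3</allowed_disk>
        <!-- must match the path part of <endpoint> above -->
        <allowed_path>/backups/</allowed_path>
    </backups>
</clickhouse>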
This config declares the s3 disk with the given access credentials. Define <endpoint> to specify the bucket and the directory to save backups to. Then define the <backups> block to allow using the s3 disk as a backup device. Note that the <allowed_path> value should be the same as the path directory from the <endpoint> param (/backups/ in our case).
Restart Clickhouse to make sure everything works fine:
sudo clickhouse restart
Now we’re ready to backup data to AWS S3.
Backup a table to S3
In order to back up a single table, we use the following query:
BACKUP TABLE test TO Disk('s3', 'test_backup.zip')
Here, we create an S3 backup named test_backup.zip from the test table. It takes some time depending on the table size. If everything went fine, we get a success message:
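The exact output depends on the Clickhouse version, but it should look roughly like this (the id below is just an illustration):
┌─id───────────────────────────────────┬─status─────────┐
│ 7b0a98c6-4e1b-4a2f-9d05-3b1c2a6f8e91 │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘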
It took 34 seconds to back up our test table with 20m rows, which takes ~200MB compressed. Nice.
Restore a table
To restore a table we use the following query:
RESTORE TABLE test AS test_restored
FROM Disk('s3', 'test_backup.zip')
This query will restore the table from the test_backup.zip backup into the test_restored table. If we omit AS test_restored, Clickhouse will restore directly into the test table (which should be removed beforehand). If everything went fine, we get this message:
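Again, the exact form can vary between versions, but it should be something along these lines (illustrative id):
┌─id───────────────────────────────────┬─status───┐
│ 4c2f7d1e-8a3b-42d6-b5c9-0e6f1a2d3b47 │ RESTORED │
└──────────────────────────────────────┴──────────┘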
Backup & restore full database
Instead of backing up and restoring each table, we can back up the full database. Let’s create a sample database and some tables in it.
CREATE DATABASE db;
USE db;
CREATE TABLE t1 (id UInt32) ENGINE=MergeTree ORDER BY id;
CREATE TABLE t2 (dt DateTime, msg String) ENGINE=MergeTree ORDER BY dt;
Let’s populate the t1 and t2 tables with some generated data:
INSERT INTO t1 SELECT * FROM generateRandom('id UInt32') LIMIT 1000000;
INSERT INTO t2 SELECT * FROM generateRandom('dt DateTime, msg String') LIMIT 1000000;
Backing up database
Now let’s back up the db database to the S3 disk:
BACKUP DATABASE db TO Disk('s3', 'db.zip')
Restoring database from backup
First, let’s delete our local db database:
DROP DATABASE db;
Now let’s restore db from the S3 db.zip backup:
RESTORE DATABASE db FROM Disk('s3', 'db.zip')
And in about 11 seconds our database was fully restored.
We can confirm all restored tables are available:
USE db;
SHOW TABLES;

┌─name─┐
│ t1   │
│ t2   │
└──────┘
Incremental backups
In production environments, creating a full backup can be a problem for huge databases. And since we deal with Clickhouse, we most certainly deal with huge databases.
Backing up the full dataset can take more time than the interval we’d want to back data up at. This is exactly the case for incremental backups, where we back up only the new portion of data that arrived after the last backup.
First of all, we have to create a base backup (for the first time only), which is done as a usual backup:
BACKUP TABLE test TO Disk('s3', 'test-base.zip')
Now let’s assume some time has passed and a new portion of data has arrived:
INSERT INTO test
SELECT number + 2000000000, randomPrintableASCII(15), now() - rand32()
FROM numbers(1000);
In order to create an incremental backup, we have to specify the base_backup setting to let Clickhouse know which existing backup to use to compute the data delta:
BACKUP TABLE test TO Disk('s3', 'test-2022-10-03.zip')
SETTINGS base_backup = Disk('s3', 'test-base.zip');
As we can see, this last backup took much less time to complete compared to the full backup process.
The next day (or after the defined period), you should specify test-2022-10-03.zip as the new base_backup so Clickhouse computes the new delta correctly:
BACKUP TABLE test TO Disk('s3', 'test-2022-10-04.zip')
SETTINGS base_backup = Disk('s3', 'test-2022-10-03.zip');
And so on: you specify your previous backup file name as the base_backup for your new backup file.
Restoring incremental backup
In order to restore the full table, you have to specify the last backup file name, and Clickhouse will automatically merge all deltas:
RESTORE TABLE test FROM Disk('s3', 'test-2022-10-04.zip')
We can actually restore data from any snapshot in the incremental backup chain (including, of course, the base file):
RESTORE TABLE test FROM Disk('s3', 'test-2022-10-03.zip')
Managing backup data
You can access the full list of your backup operations from the system.backups table:
SELECT * FROM system.backups
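If you only need the essentials of each operation, you can select a few columns instead; treat this as a sketch, since column names may vary slightly between Clickhouse versions:
SELECT id, name, status, start_time, error
FROM system.backups;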
We can also see the files Clickhouse creates in the S3 bucket.
If you use incremental backups, do not delete those files. If you only make full backups, you can delete older files periodically to save some space.
Summary
AWS S3 is natively supported by Clickhouse as storage for backing up and restoring data. Using incremental backups makes this solution practical for production environments.