Migrating Data from Snowflake to ClickHouse using S3 and Parquet
Snowflake is now the most popular SQL data warehouse according to the influential DB-Engines site. That said, ClickHouse is catching up rapidly and is now the leading open source competitor. ClickHouse offers consistent low latency response, freedom to run anywhere, and outstanding cost efficiency. That last benefit includes costs that are both low and predictable, important properties in uncertain economic times.
So perhaps you are wondering how your Snowflake application would run on ClickHouse?
Fortunately it is not hard to load data into ClickHouse and find out. There are many ways, but we’ll focus on an easy method that allows you to migrate large tables efficiently using S3 object storage and Parquet files, a popular storage format for columnar data. The procedure can move data temporarily for testing or for a permanent migration.
The rest of this article shows the data migration process for a Snowflake sample table containing 150M rows. Along the way we’ll introduce some simple automation to generate SQL to create ClickHouse schema and load data. We will also show steps to optimize the table after loading.
Note on samples
You can find all code samples at https://github.com/Altinity/clickhouse-sql-examples. On the Snowflake side I used an X-Small data warehouse. On the ClickHouse side I used Altinity.Cloud with ClickHouse version 22.4.5.9 on an AWS m5.large VM.
Altinity.Cloud uses 100% open source ClickHouse builds. The procedures described in this article work on any ClickHouse build version 22.3 or above, whether from ClickHouse, Inc., or Altinity.
Dumping a Snowflake Table to Parquet
Let’s start by connecting to Snowflake with our favorite client. In my case, that’s DBeaver. We need the connection to execute SQL commands that actually move the data.
Once connected, issue the following commands to create a schema and a stage that points to S3. In case you have not used them before, stages are a Snowflake feature that lets you read data from and write data to external locations. Fill in your own AWS access key and secret key.
CREATE SCHEMA S3;

CREATE OR REPLACE STAGE S3.BUCKET
URL='s3://your-migration-bucket/snowflake'
CREDENTIALS=
(aws_key_id='aws_access_key',aws_secret_key='aws_secret_key')
FILE_FORMAT = (TYPE = PARQUET);
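If you want to double-check that the stage was created with the expected settings, Snowflake can describe it for you. This is an optional sanity check on the S3.BUCKET stage defined above.

DESCRIBE STAGE S3.BUCKET;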
Next, we write the sample table SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS to the stage using the Snowflake COPY INTO command. This table should already exist in your data warehouse. If it does not, refer to the Snowflake instructions on sample datasets. Here is the command to copy the table into S3.
COPY INTO @S3.BUCKET/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS
HEADER=TRUE
There are a few important points to note.
- The COPY INTO command writes Parquet files to s3://your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/. If you look under this URL with a utility like ‘aws s3 ls’ you will see all the files there. (You can also list them from inside Snowflake, as shown just after this list.)
- We don’t need to specify Parquet as the output format, since the stage already does that.
- We do need to specify HEADER=TRUE. This ensures that the Parquet files will preserve the original column names.
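If you would rather stay in your SQL client than switch to the AWS CLI, Snowflake's LIST command shows the files under a stage path. Here it points at the same prefix we just wrote to.

LIST @S3.BUCKET/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/;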
Snowflake can select data from a stage as well as a table. It’s a nice feature to check that all the files were written properly. (Pro tip: it’s not fast.) The following commands should both print the same result.
SELECT COUNT(*)
FROM @S3.BUCKET/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/;

SELECT COUNT(*)
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS;
If the answers match, this part of the migration is finished. Let’s assume it worked and move on to the next step.
(If not, you need to figure out why the COPY command did not work. Check the Snowflake error messages to diagnose issues.)
A brief digression: What is Parquet?
You can skip this section if you know about Parquet already.
Parquet is an open source format to store columnar data. It is defined by the Apache Parquet project and has a couple of important properties that make it well-suited for migration projects.
First, columnar organization gives good compression and enables data warehouses to run SQL queries directly on Parquet data. Second, it’s widely supported. Google BigQuery, Spark, Amazon Redshift, and many others also use Parquet. You can even read and write Parquet data yourself with easy-to-use libraries in languages like Python.
Here are three practical things to know about Parquet when migrating data.
- Files — Parquet data is stored in files. Each file contains the schema of the stored data as well as the data itself, organized in columns. Tools can read the schema and automatically figure out the names of columns, data types, and whether a column is nullable. You do not need to know where the data came from to load it into another database.
- Datasets — A collection of Parquet files is a dataset. Libraries to process Parquet data can read all files in the dataset at once.
- Storage media — You can store Parquet files on a file system, in object storage like Amazon S3, or in HDFS. The simplest way to create a dataset is to put the Parquet files that belong to it into something that looks like a directory.
We can treat the files generated by Snowflake as a dataset. We used an S3 URL that looks like a directory. The last name in the directory path is the table name. This is not required by Parquet but is convenient to remember which files belong to which table. We’ll use this convention when reloading the data to ClickHouse.
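Because the files embed their schema, you can inspect it before creating any tables. Here is a minimal sketch using ClickHouse's s3 table function, assuming a ClickHouse 22.3+ client and the same bucket, credentials, and path used throughout this article (note the nullable-column caveat discussed in the loading section below).

DESCRIBE TABLE s3('https://s3.us-east-1.amazonaws.com/your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/*.parquet',
'aws_access_key', 'aws_secret_key', 'Parquet')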
Loading Parquet data into ClickHouse
The first step to load Parquet data into ClickHouse is to create an equivalent table in ClickHouse. The Snowflake description of the table looks like the following. (I used DESCRIBE TABLE to get this.)
name |type |kind |null?|
---------------|------------|------|-----|
O_ORDERKEY |NUMBER(38,0)|COLUMN|N |
O_CUSTKEY |NUMBER(38,0)|COLUMN|N |
O_ORDERSTATUS |VARCHAR(1) |COLUMN|N |
O_TOTALPRICE |NUMBER(12,2)|COLUMN|N |
O_ORDERDATE |DATE |COLUMN|N |
O_ORDERPRIORITY|VARCHAR(15) |COLUMN|N |
O_CLERK |VARCHAR(15) |COLUMN|N |
O_SHIPPRIORITY |NUMBER(38,0)|COLUMN|N |
O_COMMENT |VARCHAR(79) |COLUMN|N |
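For reference, the description above comes from running Snowflake's DESCRIBE TABLE command against the source table:

DESCRIBE TABLE SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS;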
Let’s create an equivalent table definition in ClickHouse. Execute the following command in your favorite client. I recommend clickhouse-client as it does not time out and can use SET commands to change system settings, which in turn helps ClickHouse INSERT commands run faster.
CREATE TABLE IF NOT EXISTS ORDERS (
O_ORDERKEY Int128,
O_CUSTKEY Int128,
O_ORDERSTATUS String,
O_TOTALPRICE Decimal(12, 2),
O_ORDERDATE Date,
O_ORDERPRIORITY String,
O_CLERK String,
O_SHIPPRIORITY Int128,
O_COMMENT String
)
Engine=MergeTree()
PARTITION BY tuple()
ORDER BY tuple()
If you have not seen ClickHouse SQL previously, the CREATE TABLE may surprise you. The Engine clause names MergeTree as the table type; it's the workhorse table engine for big data in ClickHouse. The PARTITION BY and ORDER BY clauses are also specific to ClickHouse; here the tuple() setting means we don't care about table partitions or row order. Finally, the native data types are different from standard SQL but easy to understand.
This is an inefficient mapping, but it's enough to load data and start looking at it in ClickHouse. Don't panic if you are an experienced ClickHouse user; we'll fix it by the end of the article.
Now that we have a table it's time to load data. ClickHouse can read data from S3 using the s3 table function. If your table does not have nullable columns you can use a simple command like the following. The SET statement controls the number of threads used for insertion; you should make this as large as possible for your host.
-- Set to number of vCPUs.
SET max_insert_threads=2;

INSERT INTO ORDERS
SELECT * FROM s3('https://s3.us-east-1.amazonaws.com/your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/*.parquet',
'aws_access_key', 'aws_secret_key', Parquet)
Due to outstanding issue 35346, ClickHouse cannot automatically read the schema from Parquet files that have nullable columns. The ORDERS table does not in fact have nullable columns, so we could use the command above as-is. However, you can also provide the full schema explicitly. The format is identical to the ClickHouse table definition above and looks like the following.
INSERT INTO ORDERS
SELECT * FROM s3('https://s3.us-east-1.amazonaws.com/your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/*.parquet',
'aws_access_key', 'aws_secret_key', Parquet,
'O_ORDERKEY Int128, O_CUSTKEY Int128, O_ORDERSTATUS String, O_TOTALPRICE Decimal(12, 2), O_ORDERDATE Date, O_ORDERPRIORITY String, O_CLERK String, O_SHIPPRIORITY Int128, O_COMMENT String')
Execute either command as you please. It will take a few minutes to load. Once loading is complete, count the number of rows in the ORDERS table. You should see the following.
SELECT count() FROM ORDERS

┌───count()─┐
│ 150000000 │
└───────────┘
Your data is ready. You can now start to run queries on the table.
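As a first taste, here is a simple aggregation over the columns we just loaded. It is only an illustration, not a benchmark query; adapt it to whatever questions you normally ask of this data.

SELECT O_ORDERPRIORITY, count() AS orders, round(avg(O_TOTALPRICE), 2) AS avg_price
FROM ORDERS
GROUP BY O_ORDERPRIORITY
ORDER BY orders DESC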
Automating ClickHouse schema generation
Generating ClickHouse schema is painful if you have more than a couple of tables. I therefore wrote a Python script that reads the Parquet data and automatically generates the CREATE TABLE and INSERT commands. You can do the same. Follow these steps on Linux.
First, clone the Altinity clickhouse-sql-examples GitHub repository.
git clone https://github.com/Altinity/clickhouse-sql-examples.git
Next, set up a Python virtual environment to run the generation script.
cd clickhouse-sql-examples/parquet
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
Fill out the environment variables required to access Amazon S3 and find the Parquet dataset. To do this, copy env.sh.template to env.sh, fill in values, and source the file. Here are typical values.
# Bucket and keys for S3 access; must have read permission on the bucket.
export AWS_S3_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="aws_access_key"
export AWS_SECRET_ACCESS_KEY="aws_secret_key"

# Path to Parquet dataset of the form bucket_name/dir1/.../dirN/table_name/
export S3_DATASET_PATH="your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/"
Now run the generation script. If all goes well you’ll see the CREATE TABLE and INSERT commands pop out.
$ python generate-ch-schema.py
-- Automatically generated DDL and INSERT for Parquet data
-- AWS_REGION: us-east-1
-- S3_DATASET_PATH: your-migration-bucket/snowflake/SNOWFLAKE_SAMPLE_DATA/TPCH_SF100/ORDERS/
-- Table name: ORDERS
CREATE TABLE IF NOT EXISTS ORDERS (
O_ORDERKEY Int128,
O_CUSTKEY Int128,
O_ORDERSTATUS String,
O_TOTALPRICE Decimal(12, 2),
...
Note: This script has only been tested on a few tables. Log an issue on the project if you hit problems.
Optimizing the table after it is in ClickHouse
Snowflake partitions, compresses, and stores data in object storage without guidance from users. This means you get pretty good query speed just by loading the data. Users don’t have to specify settings, because Snowflake makes automatic choices to ensure good performance. (There are in fact ways to guide Snowflake storage organization, but many applications never need to use them.)
ClickHouse uses similar techniques but with an important difference: users make explicit choices about how to organize tables.
Let’s jump straight to a better schema for our data, then explain how we arrived at it. Here is a new CREATE TABLE statement that will reduce the table size by about 30%, which means less I/O to read the table. Run the commands below to create the table, copy in the raw data, and then optimize the table parts.
CREATE TABLE IF NOT EXISTS ORDERS_OPTIMIZED (
O_ORDERKEY UInt64,
O_CUSTKEY UInt64,
O_ORDERSTATUS FixedString(1),
O_TOTALPRICE Decimal(12, 2) CODEC(ZSTD(10)),
O_ORDERDATE Date,
O_ORDERPRIORITY String,
O_CLERK String CODEC(ZSTD(10)),
O_SHIPPRIORITY UInt8,
O_COMMENT String CODEC(ZSTD(10))
)
Engine=MergeTree()
PARTITION BY toYYYYMM(O_ORDERDATE)
ORDER BY (O_ORDERDATE, O_CUSTKEY);

-- Set to number of vCPUs.
SET max_insert_threads=2;

INSERT INTO ORDERS_OPTIMIZED SELECT * FROM ORDERS;

OPTIMIZE TABLE ORDERS_OPTIMIZED FINAL;
Note: The INSERT command is very slow on the small Altinity.Cloud instance I’m using. It runs quickly on larger VMs.
What did we just do? Here is a quick summary:
- Use PARTITION BY to split the table into separate parts for each month, which is a common choice for time series data that extends over multiple years. It results in about 80 partitions after running OPTIMIZE TABLE, which you can verify with the query shown after this list. That's a good number for a data set this size. (Over 1000 partitions is generally not recommended, as ClickHouse has to access more files when processing queries.)
- ORDER BY the order date followed by the customer key. This also creates an index on those values that makes queries with a time filter fast; it helps somewhat when filtering on O_CUSTKEY as well. This is again a common pick: order by time and arrange columns left to right by increasing cardinality.
- Reduce data type sizes wherever possible. UInt8 for O_SHIPPRIORITY and FixedString(1) for O_ORDERSTATUS are far more compact than Int128 and String.
- Use ZSTD compression for String and Decimal values instead of the default LZ4 compression. ZSTD can compress both relatively well even though the columns have test values with a lot of random characters.
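To verify the partition count mentioned in the first bullet, you can query ClickHouse's system.parts table. This is an optional check; it counts the distinct partitions that have active parts for the optimized table.

SELECT count(DISTINCT partition) AS partitions
FROM system.parts
WHERE table = 'ORDERS_OPTIMIZED' AND active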
The big question, of course, is whether it helped. Let's compare the overall table sizes using the following query on the ClickHouse system.tables table.
SELECT
name, total_rows AS rows,
formatReadableSize(total_bytes) AS size
FROM system.tables
WHERE name LIKE 'ORDERS%' ORDER BY name

┌─name─────────────┬──────rows─┬─size─────┐
│ ORDERS │ 150000000 │ 6.06 GiB │
│ ORDERS_OPTIMIZED │ 150000000 │ 4.37 GiB │
└──────────────────┴───────────┴──────────┘
It did! In fact, the table after our quick hand optimization is just about the same size as the 4.3 GiB that Snowflake itself reports for this data. That is another way to say that Snowflake storage optimization is pretty good, since it's largely automatic.
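If you want to see Snowflake's own number, one convenient place to look is the output of SHOW TABLES, which includes rows and bytes columns. This is a hedged example; the exact privileges needed depend on your account setup.

SHOW TABLES LIKE 'ORDERS' IN SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;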
We can delve into the details for each column using the following query on system.columns. This is a great query to guide further optimization, which we'll mostly leave as an exercise for the reader; one illustrative tweak appears after the output below.
SELECT table, name,
formatReadableSize(sum(data_compressed_bytes)) AS tc,
formatReadableSize(sum(data_uncompressed_bytes)) AS tu
FROM system.columns
WHERE table LIKE 'ORDERS%'
GROUP BY table, name ORDER BY name, table

┌─table────────────┬─name────────────┬─tc─────────┬─tu─────────┐
│ ORDERS │ O_CLERK │ 754.98 MiB │ 2.24 GiB │
│ ORDERS_OPTIMIZED │ O_CLERK │ 432.78 MiB │ 2.18 GiB │
│ ORDERS │ O_COMMENT │ 2.79 GiB │ 6.92 GiB │
│ ORDERS_OPTIMIZED │ O_COMMENT │ 1.54 GiB │ 6.73 GiB │
│ ORDERS │ O_CUSTKEY │ 736.25 MiB │ 2.24 GiB │
│ ORDERS_OPTIMIZED │ O_CUSTKEY │ 647.09 MiB │ 1.09 GiB │
│ ORDERS │ O_ORDERDATE │ 1.29 MiB │ 286.10 MiB │
│ ORDERS_OPTIMIZED │ O_ORDERDATE │ 1.26 MiB │ 278.55 MiB │
…
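As one illustration of the kind of further tweak the column sizes might suggest (my own suggestion, not something measured in this article): O_ORDERPRIORITY holds only a handful of distinct values, so converting it to LowCardinality(String) may shrink it further. ClickHouse can rewrite the column in place.

ALTER TABLE ORDERS_OPTIMIZED MODIFY COLUMN O_ORDERPRIORITY LowCardinality(String)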
Finally, we can run sample queries to check the difference in performance. As a quick check the following query runs in 0.084 seconds on ORDERS_OPTIMIZED vs. 0.18 seconds on ORDERS on my Altinity.Cloud server.
SELECT O_ORDERDATE AS date, uniq(O_CUSTKEY) AS unique_customers
FROM ORDERS
WHERE date BETWEEN toDate('1993-12-01') AND toDate('1993-12-31')
GROUP BY date ORDER BY date
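To reproduce the comparison, run the same statement against the optimized table; only the table name changes, and clickhouse-client prints the elapsed time after each query.

SELECT O_ORDERDATE AS date, uniq(O_CUSTKEY) AS unique_customers
FROM ORDERS_OPTIMIZED
WHERE date BETWEEN toDate('1993-12-01') AND toDate('1993-12-31')
GROUP BY date ORDER BY date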
You might want to know the difference between Snowflake and ClickHouse performance. That's a great topic that deserves more space than we have here. We'll get to it in a future article that explores the trade-offs fully.
Conclusion
As this article demonstrates, S3 and Parquet provide a great mechanism to migrate data quickly between Snowflake and ClickHouse. Once on ClickHouse you can optimize the table structure and commence meaningful head-to-head comparisons between databases. I hope this will help you in your projects to explore moving between ClickHouse and Snowflake.
Would you like to try out ClickHouse yourself? You can get a free Altinity.Cloud trial and try out the examples in this article or (even better) load your own Snowflake data.
And if you do choose to migrate, Altinity has the deepest industry experience of any enterprise ClickHouse provider. Contact us at info@altinity.com or via our Contact Us page. We’re also on the AltinityDB slack workspace. No matter where you find us, we’ll be happy to explore your needs and illuminate your path to great analytic applications on ClickHouse.
Originally published on the Altinity Blog on June 1, 2022.