Getting Started with Snowflake and Matillion
Some of us here at Pandera Labs recently attended Snowflake's Unite the Data Nation conference in Chicago. After seeing the product demo, learning about Snowflake's capabilities, and watching the real-world customer presentations, I was excited to try Snowflake on my own.
Snowflake
I went to the Snowflake page and discovered that they offered a demonstration account with $400 of credits. So, I clicked the “Start for Free” link and I was on my way.
The Snowflake documentation contains step-by-step instructions for getting started, including complementary videos. Following the instructions, I was up and running in no time. I looked over the "Snowflake in 20 Minutes" tutorial and decided to modify the instructions to use our own AWS S3 account and sample data, with a goal of loading data from our S3 account into this new Snowflake account.
I used the web interface menus to create my objects and load the table, but also captured the SQL for every step:
Creating a Warehouse
CREATE WAREHOUSE ETL WITH WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
Creating a Schema
CREATE SCHEMA "CSV_INGEST_POC".stage;
Creating a Table
CREATE TABLE stage.stage_current_quarter_planned
(
product_id varchar(128),
territory_code varchar(128),
period int,
user varchar(128),
week_end_date date,
date date,
sum_total_pde numeric(13, 4)
);
Creating a File Format
CREATE FILE FORMAT "CSV_INGEST_POC"."PUBLIC".pipe_delimited
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = '|'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');
Creating an External Stage
CREATE STAGE "CSV_INGEST_POC"."PUBLIC".MY_S3 URL = 's3://my-sample-bucket/'
CREDENTIALS = (AWS_KEY_ID = 'my_key' AWS_SECRET_KEY = 'my_secret_key');
Loading the Table
COPY INTO "CSV_INGEST_POC"."STAGE"."STAGE_CURRENT_QUARTER_PLANNED"
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
FILE_FORMAT = '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"'
ON_ERROR = 'ABORT_STATEMENT' PURGE = FALSE;
12.7 million rows loaded in 22 seconds
Querying the Loaded Table
SELECT
case product_id
when 'YYY' then 'Product Y'
when 'XXX' then 'Product X'
when 'ZZZ' then 'Product Z'
end as Product,
territory_code,
sum(sum_total_pde)
from stage.stage_current_quarter_planned
group by 1, 2
order by count(*) desc;
Aggregated to 27,00 rows in 1.7 seconds
Querying the External S3 File Directly
SELECT DISTINCT $1
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
( FILE_FORMAT => '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"' );
14.8 seconds to get a list of distinct products
I completed all of this work in less than 20 minutes. I found the interface very intuitive and the database very fast, even with the smallest warehouse size.
Matillion
I'm not a big fan of using scripting languages and writing lots of code to develop data warehouses. We were looking for an ETL/ELT tool to complement Snowflake, and Matillion has some desirable features:
- Pay-as-you-go pricing starting at $1.37 per hour
- Push-down ELT architecture
- GitHub integration
- Scalability
I went to the Matillion website and signed up for a demo. The demo was impressive, and we signed up for a two-week trial. Like Snowflake, Matillion has an easy-to-follow, step-by-step getting-started guide, and its documentation is full of examples and videos.
I wanted to try out two common test cases with Matillion: the first is a full file extract, and the second is an incremental load.
Full File Extract
For the full file extract, I load a file from an AWS S3 bucket into a staging table, then compare the staged data to the target table to determine what changes need to be applied. The last step applies those changes to the target.
Orchestration
- Truncate the staging table
- Our process is designed to have only one file in the folder per day, so I set the max iterations to 1 so that only one file is processed.
- After the load, the detect changes transform is called
- After the detect changes, we apply the changes
- Finally, execute a Python script to move the processed file to an archive
Detect Changes
- Read Stage
- Use Calculator to format fields and rename columns to match target
- Read Target
- Detect Changes
- Load Records into Work Table
Apply Changes
- Read Work Table
- Replicate to Update and Insert Paths
- Filter and Load New Records
- Filter and Update Changed Records
- Deletes and Unchanged Records are ignored
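As a rough SQL sketch of the insert and update paths above (the work table, target table, key columns, and change-indicator values are illustrative assumptions, since Matillion generates its own push-down SQL):

```sql
-- Hypothetical work table produced by the Detect Changes step, with an
-- indicator column flagging each row as New, Changed, Deleted, or Identical.

-- Insert path: load brand-new records into the target.
INSERT INTO target_current_quarter_planned
SELECT product_id, territory_code, period, user, week_end_date, date, sum_total_pde
FROM work_current_quarter_planned
WHERE change_indicator = 'N';

-- Update path: apply changed records to the target.
UPDATE target_current_quarter_planned t
SET sum_total_pde = w.sum_total_pde
FROM work_current_quarter_planned w
WHERE t.product_id = w.product_id
  AND t.territory_code = w.territory_code
  AND t.week_end_date = w.week_end_date
  AND w.change_indicator = 'C';

-- Deleted ('D') and identical ('I') rows are simply ignored.
```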
Incremental Load
For the incremental load, we will load an incremental file from our AWS S3 bucket. The source system does not provide an indicator of whether a record is new or changed, so we will be performing an upsert.
Orchestration
- Truncate Table
- With our process again designed to have only one file in the folder per day, I set the max iterations to 1 so that only one file is processed.
- Call Transformer to upsert data
- Archive file
Upsert
- Read Stage Table
- Use the table update component, which supports separate update and insert mappings
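The table update component pushes the upsert down to Snowflake; conceptually it behaves like a MERGE statement along these lines (the target table name and key columns are illustrative assumptions, not the component's actual generated SQL):

```sql
MERGE INTO target_current_quarter_planned t
USING stage.stage_current_quarter_planned s
  ON  t.product_id = s.product_id
  AND t.territory_code = s.territory_code
  AND t.week_end_date = s.week_end_date
WHEN MATCHED THEN
  UPDATE SET t.sum_total_pde = s.sum_total_pde
WHEN NOT MATCHED THEN
  INSERT (product_id, territory_code, period, user, week_end_date, date, sum_total_pde)
  VALUES (s.product_id, s.territory_code, s.period, s.user, s.week_end_date, s.date, s.sum_total_pde);
```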
Thoughts
I found both products intuitive and easy to use. The learning curve is not steep for anyone who has used a relational database and an ELT/ETL tool in the past: armed with only the documentation from the products' websites, I was able to learn both products and successfully write real-world examples in a matter of hours. I even reached out to support and found both vendors to be very responsive and helpful. The combination of these two tools is both very cost-effective and powerful.