Getting Started with Snowflake and Matillion
Some of us here at Pandera Labs recently attended Snowflake's Unite the Data Nation conference in Chicago. After seeing the product demo, learning about Snowflake's capabilities, and watching the real-world customer presentations, I was excited to try Snowflake on my own.
Snowflake
I went to the Snowflake page and discovered that they offered a demonstration account with $400 of credits. So, I clicked the “Start for Free” link and I was on my way.
The Snowflake documentation contains step-by-step instructions for getting started, including complementary videos. Following the instructions, I was up and running in no time. I looked over the "Snowflake in 20 Minutes" tutorial and decided to modify the instructions to use our own AWS S3 account and sample data, with a goal of loading data from our S3 account into this new Snowflake account.
I used the web interface menus to create my objects and load the table, but also captured the SQL for every step:
Creating a Warehouse
CREATE WAREHOUSE ETL WITH WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
Creating a Schema
CREATE SCHEMA "CSV_INGEST_POC".stage;
Creating a Table
CREATE TABLE stage.stage_current_quarter_planned
(
product_id varchar(128),
territory_code varchar(128),
period int,
user varchar(128),
week_end_date date,
date date,
sum_total_pde numeric(13, 4)
);
Creating a File Format
CREATE FILE FORMAT "CSV_INGEST_POC"."PUBLIC".pipe_delimited
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = '|'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');
Creating an External Stage
CREATE STAGE "CSV_INGEST_POC"."PUBLIC".MY_S3 URL = 's3://my-sample-bucket/'
CREDENTIALS = (AWS_KEY_ID = 'my_key' AWS_SECRET_KEY = 'my_secret_key');
Loading the Table
COPY INTO "CSV_INGEST_POC"."STAGE"."STAGE_CURRENT_QUARTER_PLANNED"
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
FILE_FORMAT = '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"'
ON_ERROR = 'ABORT_STATEMENT' PURGE = FALSE;
12.7 million rows loaded in 22 seconds
Querying the Loaded Table
SELECT
case product_id
when 'YYY' then 'Product Y'
when 'XXX' then 'Product X'
when 'ZZZ' then 'Product Z'
end as Product,
territory_code,
sum(sum_total_pde)
from stage.stage_current_quarter_planned
group by 1, 2
order by count(*) desc;
Aggregated to 27,00 rows in 1.7 seconds
Querying the External S3 File Directly
SELECT DISTINCT $1
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
( FILE_FORMAT => '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"' );
14.8 seconds to get a list of distinct products
I completed all of this work in less than 20 minutes. I found the interface very intuitive and the database very fast, even with the smallest warehouse size.
Matillion
I'm not a big fan of using scripting languages and writing lots of code to develop data warehouses. We were looking for an ETL/ELT tool to complement Snowflake, and Matillion has some desirable features:
- Pay-as-you-go pricing starting at $1.37 per hour
- Push-down ELT architecture
- GitHub integration
- Scalability
I went to the Matillion website and signed up for a demo. The demo was impressive, and we signed up for a two-week trial. Like Snowflake, Matillion has an easy-to-follow, step-by-step getting-started guide, and its documentation is full of examples and videos.
I wanted to try out two common test cases with Matillion: the first is a full file extract, and the second is an incremental load.
Full File Extract
For the full file extract, I load a file from an AWS S3 bucket into a staging table, then compare the staged data to the target table to determine what changes need to be applied. The last step applies those changes to the target.
Orchestration
- Truncate the staging table
- Our process is designed to have only one file in the folder per day, so I set the max iterations to 1 so that only one file is processed.
- After the load, the detect changes transform is called
- After the detect changes, we apply the changes
- Finally, execute a Python script to move the processed file to an archive
Detect Changes
- Read Stage
- Use Calculator to format fields and rename columns to match target
- Read Target
- Detect Changes
- Load Records into Work Table
Apply Changes
- Read Work Table
- Replicate to Update and Insert Paths
- Filter and Load New Records
- Filter and Update Changed Records
- Deletes and Unchanged Records are ignored
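As a rough SQL sketch of the insert and update paths above (the work table, target table, key columns, and change-indicator values are illustrative assumptions, since Matillion generates its own push-down SQL):

```sql
-- Hypothetical work table produced by the Detect Changes step, with an
-- indicator column flagging each row as New, Changed, Deleted, or Identical.

-- Insert path: load brand-new records into the target.
INSERT INTO target_current_quarter_planned
SELECT product_id, territory_code, period, user, week_end_date, date, sum_total_pde
FROM work_current_quarter_planned
WHERE change_indicator = 'N';

-- Update path: apply changed records to the target.
UPDATE target_current_quarter_planned t
SET sum_total_pde = w.sum_total_pde
FROM work_current_quarter_planned w
WHERE t.product_id = w.product_id
  AND t.territory_code = w.territory_code
  AND t.week_end_date = w.week_end_date
  AND w.change_indicator = 'C';

-- Deleted ('D') and identical ('I') rows are simply ignored.
```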
Incremental Load
For the incremental load, we will load an incremental file from our AWS S3 bucket. The source system does not provide an indicator of whether a record is new or changed, so we will be performing an upsert.
Orchestration
- Truncate Table
- With our process again designed to have only one file in the folder per day, I set the max iterations to 1 so that only one file is processed.
- Call Transformer to upsert data
- Archive file
Upsert
- Read Stage Table
- Use the table update component, which supports separate update and insert mappings
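The table update component pushes the upsert down to Snowflake; conceptually it behaves like a MERGE statement along these lines (the target table name and key columns are illustrative assumptions, not the component's actual generated SQL):

```sql
MERGE INTO target_current_quarter_planned t
USING stage.stage_current_quarter_planned s
  ON  t.product_id = s.product_id
  AND t.territory_code = s.territory_code
  AND t.week_end_date = s.week_end_date
WHEN MATCHED THEN
  UPDATE SET t.sum_total_pde = s.sum_total_pde
WHEN NOT MATCHED THEN
  INSERT (product_id, territory_code, period, user, week_end_date, date, sum_total_pde)
  VALUES (s.product_id, s.territory_code, s.period, s.user, s.week_end_date, s.date, s.sum_total_pde);
```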
Thoughts
I found both products intuitive and easy to use. The learning curve is not steep for anyone who has used a relational database and an ELT/ETL tool in the past: armed with only the documentation from the products' websites, I was able to learn both products and successfully write real-world examples in a matter of hours. I even reached out to support and found both vendors to be very responsive and helpful. The combination of these two tools is both very cost-effective and powerful.