Getting Started with Snowflake and Matillion

Michael Gohl
Pandera Labs
May 8, 2018 · 5 min read


Some of us here at Pandera Labs recently attended Snowflake’s Unite the Data Nation conference in Chicago. After seeing the product demo, learning about Snowflake’s capabilities, and watching the real-world customer presentations, I was excited to try out Snowflake on my own.

Snowflake

I went to the Snowflake page and discovered that they offered a demonstration account with $400 of credits. So, I clicked the “Start for Free” link and I was on my way.

The Snowflake documentation contains step-by-step instructions for getting started, including complementary videos. Following the instructions, I was up and running in no time. I looked over the “Snowflake in 20 Minutes” tutorial and decided to modify the instructions to use our own AWS S3 bucket and sample data, with the goal of loading that data into the new Snowflake account.

I used the web interface menus to create my objects and load the table, but also captured the SQL for every step:

Creating a Warehouse

CREATE WAREHOUSE ETL WITH WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

Creating a Schema

CREATE SCHEMA "CSV_INGEST_POC".stage;

Creating a Table

CREATE TABLE stage.stage_current_quarter_planned
(
  product_id varchar(128),
  territory_code varchar(128),
  period int,
  user varchar(128),
  week_end_date date,
  date date,
  sum_total_pde numeric(13, 4)
);

Creating a File Format

CREATE FILE FORMAT "CSV_INGEST_POC"."PUBLIC".pipe_delimited
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = '|'
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');

Creating an External Stage

CREATE STAGE "CSV_INGEST_POC"."PUBLIC".MY_S3 URL = 's3://my-sample-bucket/'
CREDENTIALS = (AWS_KEY_ID = 'my_key' AWS_SECRET_KEY = 'my_secret_key');
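Before loading anything, a quick sanity check is to list the files visible through the new stage. The folder path below is just the sample path used later in this walkthrough:

-- List the files the stage can see under the sample folder
LIST @"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/;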

Loading the Table

COPY INTO "CSV_INGEST_POC"."STAGE"."STAGE_CURRENT_QUARTER_PLANNED"
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
FILE_FORMAT = '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"'
ON_ERROR = 'ABORT_STATEMENT'
PURGE = FALSE;

12.7 million rows loaded in 22 seconds

Querying the Loaded Table

SELECT
  case product_id
    when 'YYY' then 'Product Y'
    when 'XXX' then 'Product X'
    when 'ZZZ' then 'Product Z'
  end as Product,
  territory_code,
  sum(sum_total_pde)
from stage.stage_current_quarter_planned
group by 1, 2
order by count(*) desc;

Aggregated to 27,00 rows in 1.7 seconds

Querying the External S3 File Directly

SELECT DISTINCT $1
FROM '@"CSV_INGEST_POC"."PUBLIC"."MY_S3"/current-quarter-plan/my_current_quarter_planned_20180410.txt'
( FILE_FORMAT => '"CSV_INGEST_POC"."PUBLIC"."PIPE_DELIMITED"' );

14.8 seconds to get a list of distinct products

I completed all of this work in less than 20 minutes. I found the interface very intuitive and the database very fast, even with the smallest warehouse size.

Matillion

I’m not a big fan of using scripting languages and writing lots of code to develop data warehouses, so we were looking for an ETL/ELT tool to complement Snowflake. Matillion has some desirable features:

  • Pay-as-you-go pricing starting at $1.37 per hour
  • Push-down ELT architecture
  • GitHub integration
  • Scalability

I went to the Matillion website and signed up for a demo. The demo was impressive, and we signed up for a two-week trial. Like Snowflake, Matillion has an easy-to-follow, step-by-step getting started guide, and its documentation is full of examples and videos.

I wanted to try out two common use cases with Matillion: a full file extract and an incremental load.

Full File Extract

For the full file extract, I load a file from an AWS S3 bucket into a staging table, then compare the staged data to the target table to determine which changes need to be applied. The last step applies those changes to the target.

Orchestration

  • Truncate the staging table
  • Load the file from S3; our process is designed to drop only one file in the folder per day, so I set the max iterations to 1 so only that file is processed
  • After the load, call the Detect Changes transformation
  • After detecting changes, call the Apply Changes transformation
  • Finally, execute a Python script to move the processed file to an archive

Detect Changes

  • Read the stage table
  • Use a Calculator component to format fields and rename columns to match the target
  • Read the target table
  • Detect changes
  • Load the records into a work table (a hand-written SQL sketch of this flow follows the list)
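To make the Detect Changes flow concrete, here is a minimal hand-written Snowflake SQL sketch of the same comparison. This is not the SQL Matillion generates; the target table public.current_quarter_planned, the work table public.work_current_quarter_planned, the join keys, and the indicator values are illustrative assumptions.

-- Hand-written approximation of the Detect Changes flow (illustrative object names).
CREATE OR REPLACE TABLE public.work_current_quarter_planned AS
SELECT
  s.*,
  CASE
    WHEN t.product_id IS NULL THEN 'N'                -- new: present in stage only
    WHEN s.sum_total_pde <> t.sum_total_pde THEN 'C'  -- changed
    ELSE 'I'                                          -- identical
  END AS indicator
FROM stage.stage_current_quarter_planned s
LEFT JOIN public.current_quarter_planned t
  ON  s.product_id     = t.product_id
  AND s.territory_code = t.territory_code
  AND s.week_end_date  = t.week_end_date;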

Apply Changes

  • Read the work table
  • Replicate the rows to update and insert paths
  • Filter and load new records
  • Filter and update changed records
  • Deletes and unchanged records are ignored (a SQL sketch of this flow follows the list)
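Continuing the same hand-written sketch, applying the changes amounts to inserting the rows flagged as new and updating the rows flagged as changed. The table names and indicator values carry over from the previous example and remain assumptions, not Matillion’s generated SQL.

-- Insert records flagged as new.
INSERT INTO public.current_quarter_planned
  (product_id, territory_code, period, user, week_end_date, date, sum_total_pde)
SELECT product_id, territory_code, period, user, week_end_date, date, sum_total_pde
FROM public.work_current_quarter_planned
WHERE indicator = 'N';

-- Update records flagged as changed; deletes and unchanged rows are left alone.
UPDATE public.current_quarter_planned
SET sum_total_pde = w.sum_total_pde
FROM public.work_current_quarter_planned w
WHERE public.current_quarter_planned.product_id     = w.product_id
  AND public.current_quarter_planned.territory_code = w.territory_code
  AND public.current_quarter_planned.week_end_date  = w.week_end_date
  AND w.indicator = 'C';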

Incremental Load

For the incremental load, we will load an incremental file from our AWS S3 bucket. The source system does not indicate whether a record is new or changed, so we will perform an upsert.

Orchestration

  • Truncate the staging table
  • As before, our process is designed to drop only one file in the folder per day, so I set the max iterations to 1 so only that file is processed
  • Call the transformation to upsert the data
  • Archive the file

Upsert

  • Read the stage table
  • Use the Table Update component, which supports separate update and insert mappings (a MERGE-based sketch follows the list)
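For comparison, the same upsert can be written by hand in Snowflake as a single MERGE statement. This is a sketch using the illustrative target table and key columns from the earlier examples, not the SQL the Table Update component actually produces.

-- Hand-written upsert equivalent (illustrative object and key names).
MERGE INTO public.current_quarter_planned t
USING stage.stage_current_quarter_planned s
  ON  t.product_id     = s.product_id
  AND t.territory_code = s.territory_code
  AND t.week_end_date  = s.week_end_date
WHEN MATCHED THEN UPDATE SET
  sum_total_pde = s.sum_total_pde
WHEN NOT MATCHED THEN INSERT
  (product_id, territory_code, period, user, week_end_date, date, sum_total_pde)
  VALUES (s.product_id, s.territory_code, s.period, s.user, s.week_end_date, s.date, s.sum_total_pde);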

Thoughts

I found both products intuitive and easy to use. The learning curve is not steep for anyone who has used a relational database and an ELT/ETL tool in the past: armed with only the documentation from the products’ websites, I was able to learn both products and successfully build real-world examples in a matter of hours. I even reached out to support and found both vendors very responsive and helpful. The combination of these two tools is both cost-effective and powerful.


Michael Gohl
Pandera Labs

Over 20 years of experience designing and developing high-performance ETL/ELT and data warehousing solutions.