Data Loading into Snowflake using Matillion — Saama Analytics

Saama Analytics · Apr 8, 2020

Snowflake is a modern data warehouse built for the cloud. It supports distributed data processing, near-limitless auto-scaling out/in (horizontal) for concurrency and scaling up (vertical) for performance, and offers the full power of ANSI SQL on a pay-as-you-go model (no upfront costs or lock-in), making it one of the most preferred technologies for enterprises looking to migrate to the cloud or modernize their data platforms (read more here). This article discusses a practical migration use case where we used Matillion for data loading into Snowflake.

Data loading into Snowflake is typically done as follows:

  • Exporting data from the source (using scripts or ETL/ELT tools)
  • Staging the exported data in an external stage such as AWS S3 or Azure Blob, or in a Snowflake internal stage, ideally as chunks of 50–100 MB files for performance (to leverage all warehouse cores)
  • Using the COPY INTO SQL command to ingest these files into Snowflake database tables (see the sketch below)
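
For illustration, here is a minimal COPY INTO sketch; the stage name, landing table and CSV settings below are assumptions, not objects from the actual project:

```sql
-- Ingest staged files into a Snowflake landing table.
-- @LANDING_STAGE, LANDING.SALES_ORDERS and the CSV options are illustrative.
-- Splitting exports into ~50-100 MB files lets the warehouse load them in parallel.
COPY INTO LANDING.SALES_ORDERS
FROM @LANDING_STAGE/sales_orders/
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';
```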

For one of our customers, we had to migrate from on-premise SQL Server to Snowflake. This involved hundreds of tables. They also had various other data sources such as Salesforce and SAP, to name a few. Naturally, with so many data sources, data loading into Snowflake was an integral part of the discussion.

We came up with two options:

Option 1: In-house framework (developed in both Java and Python)

  • Pro: provides a lot of automation scripts that leverage metadata, plus support for complex transformations.
  • Con: any new scenario had to be hand-coded, and coverage would improve only with time.

Option 2: A modern ELT/ETL tool for both orchestration and transformation

  • Pro: a full-blown platform (provides most of the required connectors) for orchestration and transformation.
  • Con: the automation implemented in our in-house framework was not readily available here.

We then adopted the best of both worlds by integrating Matillion (as an engine) with our metadata-driven approach, which improved our time to value from 10 weeks (with 8 engineers) to 6 weeks (with 4 engineers).

Summarizing the need for automated data orchestration pipelines, we prioritized the asks as follows:

  • Rapid automation for new data sources and/or tables
  • Configure metadata using our existing framework
  • Leverage generic connectors
  • Export data from the source and stage it on AWS S3
  • Ingest it into Snowflake (Landing)
  • Transform the data from landing into the Data Lake / Operational Data Store (ODS) using SCD2
  • Optionally, the ability to support data quality rules (DQ)
  • Optionally, the ability to support change data capture (CDC) / slowly changing dimension (SCD)
  • Leverage Snowflake best practices

Here are the details on various building blocks:

  1. Set up the Snowflake account
  2. Configure the required metadata
  3. Develop the ELT and ETL jobs leveraging the above
    • Ensure audit and error logs are captured
    • Provide a validation or reconciliation mechanism
  4. Actual performance metrics

1. Single Snowflake Account to Support a Multi-Environment Setup

If a single Snowflake account has to support multiple environments (e.g. Dev, QA, UAT, PROD), things can easily get out of hand. Environment-related roles, virtual warehouses, databases and users must be set up carefully so that no environment collides with another.

Sequence for Set Up ⇒ ROLE → VIRTUAL WAREHOUSE → DATABASE → USERS

  • Usage of the global roles (AccountAdmin, SysAdmin, SecurityAdmin) should be restricted.
  • Leverage environment-specific roles for managing and maintaining each environment, e.g. Dev_SysAdmin, Dev_SecurityAdmin.
  • Any new user gets assigned environment-specific roles and permissions.
  • Dev_Data_Loader is reserved for Matillion jobs and is used by the load process only.
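
A minimal sketch of a Dev environment set up in that sequence (the role, warehouse, database and user names below are illustrative, not the customer's actual objects):

```sql
-- ROLE -> VIRTUAL WAREHOUSE -> DATABASE -> USERS; all object names are illustrative.
USE ROLE SECURITYADMIN;
CREATE ROLE IF NOT EXISTS DEV_SYSADMIN;
CREATE ROLE IF NOT EXISTS DEV_DATA_LOADER;   -- used only by the Matillion load process

USE ROLE SYSADMIN;
CREATE WAREHOUSE IF NOT EXISTS DEV_WH WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS DEV_LANDING_DB;

-- Grants are scoped to Dev objects only, so environments cannot collide.
GRANT USAGE ON WAREHOUSE DEV_WH TO ROLE DEV_DATA_LOADER;
GRANT USAGE ON DATABASE DEV_LANDING_DB TO ROLE DEV_DATA_LOADER;

USE ROLE SECURITYADMIN;
CREATE USER IF NOT EXISTS MATILLION_DEV PASSWORD = '********' DEFAULT_ROLE = DEV_DATA_LOADER;
GRANT ROLE DEV_DATA_LOADER TO USER MATILLION_DEV;
```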

2. Metadata Design Approach ⇒

Usually, when a data lake is built, the source system structure is kept intact.
Using a metadata-driven approach to run the Matillion job offers a robust solution: modularity, ease of maintenance, shorter development cycles and fast onboarding of changes when tables are added or removed.

Model of the metadata table:

| Column Name | Description |
| --- | --- |
| METADATA_SK | Incremental surrogate key |
| SOURCE_SYSTEM | Source system, e.g. Salesforce, SQL Server, S3 |
| SOURCE_OBJECT_NAME | Table name or S3 file URL |
| SOURCE_PREFIX_DB | Object prefix in the case of S3, or the schema/database name to use at the source |
| LANDING_DB | Snowflake landing database name (Matillion stages all ODBC or S3 reads here first) |
| LANDING_SCHEMA | Landing schema name, e.g. SalesForce, Sql_Server, S3 |
| LANDING_TABLE | Landing table name |
| DATALAKE_DB | Snowflake data lake database name (tables are SCD1 or SCD2 type) |
| DATALAKE_SCHEMA | Snowflake data lake schema name, e.g. SalesForce, SQL_Server, S3 |
| DATALAKE_TABLE | Data lake table name |
| CDC_COLUMN_NAME | Delta identifier column name, when only incremental data has to be fetched from the source |
| CDC_MIN_DATE | Minimum value of the delta identifier fetched |
| CDC_MAX_DATE | Maximum value of the delta identifier fetched |
| PRIORITY | Priority, if a given table has to be run first |
| IS_ACTIVE | Flag to enable or disable the load of a table |
| ADDITIONAL_WHERE_CLAUSE | Additional WHERE clause to apply on the source side |
| AUDIT_RELATED_COLUMNS… | Created By, Created Date, Updated By, Updated Date, etc. |
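
For reference, a minimal DDL sketch of such an ETL_METADATA table; the data types and defaults below are assumptions:

```sql
-- Minimal ETL_METADATA sketch; column types and defaults are assumptions.
CREATE TABLE IF NOT EXISTS ETL_METADATA (
    METADATA_SK             NUMBER IDENTITY,          -- incremental surrogate key
    SOURCE_SYSTEM           VARCHAR,                  -- Salesforce, SQL Server, S3 ...
    SOURCE_OBJECT_NAME      VARCHAR,                  -- table name or S3 file URL
    SOURCE_PREFIX_DB        VARCHAR,
    LANDING_DB              VARCHAR,
    LANDING_SCHEMA          VARCHAR,
    LANDING_TABLE           VARCHAR,
    DATALAKE_DB             VARCHAR,
    DATALAKE_SCHEMA         VARCHAR,
    DATALAKE_TABLE          VARCHAR,
    CDC_COLUMN_NAME         VARCHAR,
    CDC_MIN_DATE            TIMESTAMP_NTZ,
    CDC_MAX_DATE            TIMESTAMP_NTZ,
    PRIORITY                NUMBER,
    IS_ACTIVE               BOOLEAN        DEFAULT TRUE,
    ADDITIONAL_WHERE_CLAUSE VARCHAR,
    CREATED_BY              VARCHAR,
    CREATED_DATE            TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP(),
    UPDATED_BY              VARCHAR,
    UPDATED_DATE            TIMESTAMP_NTZ
);
```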

3. Basic Framework of ETL Orchestration Job

  1. Select all rows from the metadata table, i.e. ETL_METADATA.
  2. Map the metadata columns to job variables that feed the Matillion job.
  3. As the variable values change per iteration, the same single ELT job runs for each table.

This is how the main job is designed; it runs on the parameters held in the metadata table (ETL_METADATA).

Main job, which will:

  • Query the metadata [ entries for the many tables to be loaded into the data lake ]; the tables can be processed concurrently or sequentially (see the query sketch after this list).
  • Load data from the source to landing:
    • The Database component in Matillion stages the data to landing.
    • The Database component can point to Salesforce, SQL Server or any other DBMS.
    • The S3 Load component is used for S3 buckets, i.e. CSV, XML, AVRO or PARQUET files.
  • A second job loads the data from landing into the data lake.
  • All parameters, such as which table to load, are derived from the metadata schema.
  • For every step, a common audit/error notification framework is called to record failures or normal processing.
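
The driving query behind that iteration could look roughly like this, assuming the ETL_METADATA layout above (the column selection is illustrative):

```sql
-- Each returned row becomes one set of job-variable values for the
-- parameterized Matillion load job; active tables only, by priority.
SELECT SOURCE_SYSTEM, SOURCE_OBJECT_NAME, SOURCE_PREFIX_DB,
       LANDING_DB, LANDING_SCHEMA, LANDING_TABLE,
       DATALAKE_DB, DATALAKE_SCHEMA, DATALAKE_TABLE,
       CDC_COLUMN_NAME, CDC_MAX_DATE, ADDITIONAL_WHERE_CLAUSE
FROM   ETL_METADATA
WHERE  IS_ACTIVE = TRUE
ORDER  BY PRIORITY;
```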

SCD2 Generic Job ⇒
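
The generic SCD2 job essentially closes the current version of changed records and inserts new versions. Below is a minimal Snowflake SQL sketch of that pattern; the table, key, hash and SCD column names are illustrative, and this is not the actual Matillion-generated SQL:

```sql
-- Step 1: close the current data lake version of every record whose attributes
-- changed in the latest landing load (ROW_HASH is an illustrative change-detection column).
MERGE INTO DATALAKE.CUSTOMER d
USING LANDING.CUSTOMER l
   ON d.CUSTOMER_ID = l.CUSTOMER_ID AND d.IS_CURRENT = TRUE
WHEN MATCHED AND d.ROW_HASH <> l.ROW_HASH THEN UPDATE
  SET d.EFFECTIVE_END_DATE = CURRENT_TIMESTAMP(),
      d.IS_CURRENT         = FALSE;

-- Step 2: insert a new current version for changed keys and brand-new keys.
-- Assumes DATALAKE.CUSTOMER holds the landing columns plus the three SCD columns, in this order.
INSERT INTO DATALAKE.CUSTOMER
SELECT l.*,
       CURRENT_TIMESTAMP() AS EFFECTIVE_START_DATE,
       NULL                AS EFFECTIVE_END_DATE,
       TRUE                AS IS_CURRENT
FROM   LANDING.CUSTOMER l
LEFT   JOIN DATALAKE.CUSTOMER d
       ON  d.CUSTOMER_ID = l.CUSTOMER_ID
       AND d.IS_CURRENT  = TRUE
WHERE  d.CUSTOMER_ID IS NULL;   -- changed rows were closed in step 1, so they appear here too
```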

4. Actual Performance Metrics of Migrating Data from SQL Server to Snowflake

RA_LND = SQL Server to Landing
LND_DL = Landing to DataLake (ODS)

* Migration log of some of the tables.

From landing (in Snowflake) to the data lake (in Snowflake) using SCD2:

  • 76 minutes for 1 billion records, including the SCD2 functionality
  • [ the snapshot above shows some of the large tables; 600+ tables were involved in total ]

This concludes our journey of data loading into Snowflake using Matillion and a metadata-driven approach.

Snowflake also announced Streams and Tasks last year, which you can use to build ETL and SCD pipelines. Read here for a complete how-to guide.

It would be great to hear about your experience and your use of other ELT/ETL tools for migrating to Snowflake!

Author: Prajakta Borkar
Enterprise Data Warehouse Architect with 14+ years of experience across a host of technologies including Oracle Exadata, Snowflake, SQL Server, Shell/PowerShell/PLSQL scripting, DataStage and Ab Initio, specializing in modern cloud technologies such as Snowflake, Matillion, Denodo, Power BI and Business Objects, along with data modelling and data analytics.

Originally published at https://saamaanalytics.com
