Snowflake Data Ingestion for RDBMS

Vishal Garg
4 min read · Mar 21, 2023


Requirement:- All OLTP databases need to be ingested into Snowflake as our single source of truth/data platform. These databases can be either on premises or in the cloud.

Platform/tools used in the POC:- I used the AWS platform for this requirement, with an on-premises Microsoft SQL Server as the source and Snowflake, a SaaS offering on AWS, as the destination. I used AWS Glue, Python and Spark to achieve the end objective. Of course, it is worth mentioning that before you begin such an ETL/ELT journey you should ensure source-to-destination connectivity and firewall openings for the required ports.

Solution Details:- To solve this requirement I developed three solutions, considering aspects such as frequency of data, lag between source and destination, batch versus near-real-time loading, change data capture (CDC) to keep the destination in sync, where predicates, latency, and volume of data. I used the following three approaches.

1. Data Load using AWS Glue via S3:- In this approach I used an S3 bucket, with the required IAM roles, as staging for the incoming data, and then a Snowflake Storage Integration to S3 to load the data. I used "COPY INTO" to lift the data for both full and incremental loads. I made use of the AWS Glue Data Catalog to capture the metadata of the source DB via an AWS Glue crawler, and I configured IAM roles with permissions on S3 and the Glue staging to capture the source. An important point to consider: your VPC/subnet CIDR should have a minimum of 5 free IPs to run the ELT job that loads S3. I learnt this networking lesson the hard way; the Spark executors of the job need sufficient IPs to run. I used a where predicate in the SQL query as the CDC filter to get the incremental load; we can also use the Filter or Custom Transform nodes to get the incremental data. A minimal sketch of the job and the COPY INTO statement follows the figure caption below.
Near real-time ingestion from SQL Server to Snowflake, with Tableau presentation
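To make this concrete, here is a minimal sketch of such a Glue job: it reads the source table from the Glue Data Catalog, applies a where predicate as the CDC filter, stages Parquet files in S3, and notes the COPY INTO that would then be run in Snowflake. The catalog database, table, bucket, watermark column and Snowflake object names are placeholders, not the actual setup from this POC.

```python
# Hypothetical Glue job: stage incremental rows to S3, then COPY INTO Snowflake.
# Database, table, bucket and column names below are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source metadata comes from the Glue Data Catalog populated by the crawler.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sqlserver_catalog_db",   # assumed catalog database
    table_name="dbo_orders",           # assumed catalog table
)

# CDC filter: a where predicate on an assumed watermark column.
df = dyf.toDF().filter("last_updated_at >= '2023-03-01'")

# Land the incremental slice in the S3 staging prefix used by the storage integration.
df.write.mode("append").parquet("s3://my-staging-bucket/orders/incremental/")

job.commit()

# Then, in Snowflake (once the files are staged), something along these lines:
#   CREATE STAGE IF NOT EXISTS orders_stage
#     URL = 's3://my-staging-bucket/orders/'
#     STORAGE_INTEGRATION = s3_int;
#   COPY INTO analytics.public.orders
#     FROM @orders_stage/incremental/
#     FILE_FORMAT = (TYPE = PARQUET)
#     MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```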

2. Direct Data Load using Custom Connector:- I used a custom JDBC connector to load data (full and incremental) from the source to the destination. I made use of the Snowflake custom .jar to establish connectivity from AWS Glue to the Snowflake DB, downloading it from Central Repository: net/snowflake/snowflake-jdbc/3.13.28 (maven.org). An important point to remember: you should upload this .jar to an S3 location that the connection can access. I defined an IAM role to achieve this; otherwise the connection will never succeed. The custom connector establishes the JDBC connection using this .jar along with runtime parameters for Snowflake. The source connectivity is still the same, i.e. via the Glue Data Catalog. This is also a low-code visual approach. I used a where predicate in the SQL query as the CDC filter to get the incremental load; we can also use the Filter or Custom Transform nodes to get the incremental data. A rough code equivalent of this write is sketched after the figure caption below.

Direct load well suited for batch ingestion using custom JDBC .jar
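The custom connector hides the plumbing behind the Glue visual editor, but the equivalent write with the Snowflake JDBC driver looks roughly like this sketch. The account URL, credentials, warehouse, table and watermark column are placeholders; the snowflake-jdbc .jar from the Maven link above must be on the job's classpath.

```python
# Hypothetical equivalent of the custom-connector write, using the Snowflake
# JDBC driver directly from a Glue/Spark job. All connection values are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table registered in the Glue Data Catalog.
df = glue_context.create_dynamic_frame.from_catalog(
    database="sqlserver_catalog_db",
    table_name="dbo_orders",
).toDF()

# CDC filter as a where predicate on an assumed watermark column.
incremental = df.filter("last_updated_at >= '2023-03-01'")

# Write straight to Snowflake over JDBC using the snowflake-jdbc driver.
(incremental.write
    .format("jdbc")
    .option("url", "jdbc:snowflake://<account>.snowflakecomputing.com/?db=ANALYTICS&schema=PUBLIC&warehouse=LOAD_WH")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("dbtable", "ORDERS")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())
```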

3. Direct Data Load Using Spark:- This is a code-heavy approach but highly flexible for loading data from source to destination. In this approach I downloaded the Snowflake JDBC .jar from Central Repository: net/snowflake/snowflake-jdbc/3.13.28 (maven.org) and an additional Spark connector .jar from https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.12/2.11.2-spark_3.3/ to establish the connectivity from AWS Glue to Snowflake. Once we have both .jars (JDBC + Spark) we need to place them in S3 with adequate permissions using an IAM role. Then I used PySpark code to establish connectivity from the Glue Data Catalog, applied CDC using SQL queries with where predicates, and finally used the Spark context to load the data into Snowflake. This is a code-based approach but with a lot of flexibility to play with the data. Here I used Glue 4.0 and ensured the number of free IPs in the subnet CIDR was sufficient to spawn the executors. I used a where predicate in the SQL query as the CDC filter to get the incremental load; we can also use the Filter or Custom Transform nodes to get the incremental data. A minimal PySpark sketch follows the figure caption below.

Data Loading using Spark
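Here is a minimal PySpark sketch of this approach, assuming both .jars are staged in S3 and attached to the Glue job. It reads from the Glue Data Catalog, applies the CDC where predicate, and writes through the Spark-Snowflake connector, which handles the staging and COPY INTO behind the scenes. All connection options, table and column names are placeholders.

```python
# Hypothetical PySpark write using the Spark-Snowflake connector
# (snowflake-jdbc + spark-snowflake jars attached to the Glue 4.0 job).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Source table from the Glue Data Catalog.
df = glue_context.create_dynamic_frame.from_catalog(
    database="sqlserver_catalog_db",
    table_name="dbo_orders",
).toDF()

# CDC via a where predicate on an assumed watermark column.
incremental = df.filter("last_updated_at >= '2023-03-01'")

# Placeholder Snowflake connection options.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}

# The connector stages the data and issues COPY INTO on our behalf.
(incremental.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "ORDERS")
    .mode("append")
    .save())
```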

Load Metrics:- The following table illustrates the load statistics for the different approaches.

