Doing DevOps for Snowflake with dbt in Azure
How to Use CI/CD to Deploy Your Snowflake Database Scripts with dbt in Azure DevOps
by Venkatesh Sekar
I recently wrote about the need within our Snowflake Cloud Data Warehouse client base to have a SQL-centric data transformation and DataOps solution. In my previous post, I stepped through how to create tables using custom materialization with Snowflake.
Continuing in that vein, I was recently asked by a customer to provide a path for them to do database DevOps for Snowflake. In general, database DevOps has involved quite a bit of complexity and ongoing tweaking to try and get it right. There are some tools available in the market today including:
But, other than Sqitch, they don’t support Snowflake yet, although, with the amount of momentum that Snowflake has in the market, I expect they will provide support in the not too distant future.
Having used dbt as a data transformation and Jinja template-based tool, I was interested to see if it could potentially be the key to help unlock database DevOps for Snowflake.
As noted above, I was able to create the ‘persistent_table’ materialization which provided an answer for creating ‘source tables’ in DBT, and having done that I next developed a simple CI/CD process to deploy database scripts for Snowflake with dbt in Azure DevOps.
Stay with me and I’ll step you through how to setup dbt to deploy the scripts. As always, the code is available in my git repo venkatra/dbt_hacks.
Here is a glimpse into the tools and solutions that I am using to make this happen…
dbt is a command line tool based on SQL and is primarily used by analysts to do data transformations. In other words, it does the ‘T’ in ELT.
It facilitates writing modular SQL Selects and takes care of dependencies, compilation, and materialization in run time.
Azure Devops provides developer services to support teams in planning work, collaborating on code development, and building and deploying applications.
Organizations across industries rely on Snowflake for their cloud data warehousing needs — net new data warehouses, migrations from legacy DW appliances (Netezza, Teradata, Exadata, etc.), and migrations from traditional Hadoop and Big Data platforms (Hive, HBase, Impala, Drill, etc.). Our clients are also using Snowflake for high-value solution areas such as Security Analytics and Cloud Visibility Monitoring.
Snowflake is fully relational ANSI SQL cloud data warehouse and allows you to leverage the tools that you are used to and familiar with while also providing instant elasticity, per second consumption-based pricing, and low management overhead across all 3 major clouds — AWS, Azure, and GCP (GCP is in private preview).
Continuous Integration with Azure Pipelines
The Continous Integration (CI) process is achieved using Azure Pipelines within Azure DevOps. This pipeline is typically invoked after the code has been committed, and the pipeline tasks generally handle:
- Code compilation
- Unit Testing
- Distributing to a repository, e.g., Maven
In the case of a database scripts file there isn’t a great deal of validation that can be done, other than the following:
- Code formatting check
- Scripts follow certain in-house practices, e.g., naming conventions
- Script compilation (this is possible in SQLServer via DACPAC/BACPAC).
Snowflake currently does not have a tool that validates the script before execution, but it can validate during deployment, so in the Build phase I typically do these checks:
- Code format check
- Naming convention check
- Distributing to a repository, e.g., Maven
Identifying the Commit Changes
Given the set of all scripts, it’s essential to determine which scripts were added or updated. If these scripts can’t be identified, you will end up re-creating the entire database, schema, etc. which is not desired.
To solve this issue, I’ll use the Azure DevOps Python API. Going through the docs, you will see different REST endpoints and determine detailed information on what was committed, when it was committed, and who committed it, etc.
The python script IdentifyGitBuildCommitItems.py has been developed in response to this. Its sole purpose is to get the list of commits that is part of the current build and their artifacts (the files that were added/changed). Once identified it would write them into a file ‘ListOfCommitItems.txt’ during execution.
I’ll review the results in the below sections.
Identifying the Deployable Scripts
During the course of development, the developer might have created scripts for table creation as well as developed transformation models, markdown documentation, shell scripts, etc. The ‘ListOfCommitItems.txt’ that was created earlier would contain all of these scripts. Note that if a file was committed multiple times, the script will not de-dup the commits.
To keep things modular, the script FilterDeployableScripts.py was created. Its responsibilities are to:
- Parse the ‘ListOfCommitItems.txt’
- Identify the SQL scripts from various commits
- Filter out only those scripts which are to be materialized as ‘persistent_tables’
- Write the result to the file ‘DeployableModels.txt’
The build pipeline is a series of steps and tasks:
- Install Python 3.6 (needed for the Azure DevOps API)
- Install Azure-DevOps python library
- Execute Python script: IdentifyGitBuildCommitItems.py
- Execute Python script: FilterDeployableScripts.py
- Copy the files into Staging directory
- Publish the artifacts (in staging directory)
These are captured in azure-build-pipeline.yml
The following screenshot highlights the list of artifacts that get published by the build. It also provides a sample output of ‘ListOfCommitItems.txt’ which was captured in the initial run.
Notice that the ‘DeployableModels.txt’ file contains only the CONTACT table definition file, and ignores all other files that are not meant to be run.
Now take a look at the next screenshot from a different build run — during this build run we saw the following:
- The script file ‘deploy_persistent_models.sh’ was updated
- The table definition for ‘ADDRESS’ was added. You could see that the script identified only these changes and captures them in the ‘ListOfCommitItems.txt’ and safely ignores all the other files.
Again, the ‘DeployableModels.txt’ file contains only the ADDRESS table definition file and is not concerned with any other files that are not meant to be run.
A Continuous Deployment (CD) process is achieved with Azure Release Pipelines. The pipeline we are working on is geared towards the actual deployment to a specific snowflake environment, e.g. Snowflake Development Environment.
The “Stage” section is usually specific to the environment in which the deployment needs to happen. It consists of the following tasks as seen below:
This is a Bash task with inline code below:
#Install the latest pipsudo pip install -U pip# Then upgrade cffisudo apt-get remove python-cffisudo pip install — upgrade cffisudo apt-get install git libpq-dev python-dev# Specify the version on installsudo pip install cryptography==1.7.2sudo pip install dbtdbt — help
This is a Bash task with inline code below:
export SNOWSQL_ACCOUNT=$(ENV_SNOWSQL_ACCOUNT)export SNOWSQL_USER=$(ENV_SNOWSQL_USER)export DBT_PASSWORD=$(ENV_DBT_PASSWORD)export SNOWSQL_ROLE=$(ENV_SNOWSQL_ROLE)export SNOWSQL_DATABASE=$(ENV_SNOWSQL_DATABASE)export SNOWSQL_WAREHOUSE=$(ENV_SNOWSQL_WAREHOUSE)export DBT_PROFILES_DIR=./chmod 750 ./deploy_persistent_models.sh./deploy_persistent_models.sh
This sets up the various env configurations needed for dbt and used as part of the execution.
The ‘ENV_’ are variables that will be substituted at run time. They need to be defined in the variable section as below:
Logs of the DBT_RUN Task
Upon release, the table will be created in Snowflake. Here is a screenshot of the successful run and the logs of the DBT_RUN task:
and below is the artifact in Snowflake:
Are There Any Limitations?
Keep these limitations in mind when leveraging dbt for CI/CD with database objects…
- Deployment of objects in a specific order is a roadmap item
- Enhancements for Views, Functions, etc. is a roadmap item
- Ultimately, every possible scenario won’t be covered by this approach so take a close look at what you need to do from a design and planning perspective
I hope you’ll try to replicate this simple CI/CD process to deploy database scripts for Snowflake with dbt in Azure DevOps. While there are some limitations, the potential is there to add value to your data ops pipelines.
You should also check out John Aven’s recent blog post (a fellow Hashmapper) on Using DBT to Execute ELT Pipelines in Snowflake.
If you use Snowflake today, it would be great to hear about the approaches that you have taken for Data Transformation, DataOps, and CI/CD along with the challenges that you are addressing.
Some of My Other Stories
I hope you’ll check out some of my other recent stories also…
Don’t Do Analytics Engineering in Snowflake Until You Read This (Hint: dbt)
Using dbt To Create Tables Using Custom Materialization
Quick Tips for Using Snowflake with AWS Lambda
A Working Example Using AWS Lambda Serverless Compute and Snowflake’s Cloud Data Warehouse Together
Feel free to share on other channels and be sure and keep up with all new content from Hashmap here.
Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.