Using dbt to Execute ELT Pipelines in Snowflake
DevOps and DataOps for Snowflake with dbt and Azure DevOps
by John L. Aven and Prem Dubey
ELT (Extract, Load, and Transform) has become increasingly popular over the last few years. This approach has been driven, in part, by the growing popularity of cloud data warehouses such as Snowflake, which our clients are using not only for new cloud data warehouses, but also as the top SaaS, cloud-native target when migrating from legacy data warehouses such as Netezza, Teradata, and Exadata, and from traditional Big Data platforms.
Furthermore, with the increasing popularity of DevOps and DataOps practices to increase velocity and reduce risk, it has become desirable to be able to integrate the modification and deployment of changes in a data warehouse through a CI/CD pipeline technology such as Jenkins or Azure DevOps.
With that in mind, we will do our best in this post to provide you with:
- An introduction to dbt (Data Build Tool), an open-source DataOps solution
- A walkthrough to create an ELT Data Pipeline in Snowflake with dbt
- An approach for automating pipeline deployment via a CI/CD pipeline with Azure DevOps.
dbt for DataOps
While Snowflake provides an unmatched cloud data warehousing experience with a multi-cluster, shared data architecture that separates storage from compute, dbt is a game-changing approach to managing ELT workloads: the orchestration and management of processing pipelines. dbt provides a unique level of DataOps functionality that enables Snowflake to do what it does well while abstracting pipeline orchestration away from the cloud data warehouse service.
dbt brings the software engineering world to the SQL-savvy ELT developer. It treats SQL queries as models, i.e. a SELECT script within a package. All models in dbt represent tables and views within the data warehouse. These models can be versioned, and their schema and data can be tested, much as one would write unit tests for a piece of software code. The models are linked by references to form a DAG (Directed Acyclic Graph), a very common computing model found in many current data-centric tools (Spark, Airflow, TensorFlow, …). A DAG represents the order of query execution, as well as the lineage of data as generated through the models.
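To make the DAG idea concrete, here is a small Python sketch (illustrative only, not dbt's internal code) that treats each model's ref() links as graph edges and derives a valid execution order with a topological sort. The model names are the ones used in the pipeline later in this post.

```python
# Python 3.9+ (graphlib is in the standard library)
from graphlib import TopologicalSorter

# Each model maps to the set of models it ref()s (its upstream dependencies).
model_refs = {
    "supplier_parts": set(),
    "average_supply_cost_and_available_quantity_per_part": {"supplier_parts"},
    "supplier_parts_aggregates": {
        "supplier_parts",
        "average_supply_cost_and_available_quantity_per_part",
    },
    "customer_nation_region": set(),
    "average_acctbal_by_nation": {"customer_nation_region"},
    "customer_nation_region_aggregates": {
        "customer_nation_region",
        "average_acctbal_by_nation",
    },
    "analytics_view": {
        "supplier_parts_aggregates",
        "customer_nation_region_aggregates",
    },
}

# A valid execution order: every model runs after everything it references.
order = list(TopologicalSorter(model_refs).static_order())
print(order)
```

Since every other model is (transitively) a dependency of analytics_view, it always sorts last, exactly as the final view of a pipeline should.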
Components of dbt
Everything in dbt is one of three types of files — .sql, .yml, .md. Working in reverse order:
- .md files are documentation files. They are a variant on traditional markdown and are used as documentation for models
- .yml files are configuration files. These handle everything from configuring the project (dbt_project.yml) to the detailed configuration and testing in a schema.yml file. Dbt espouses the concept of Configuration-as-Code (CaC), and so most .yml files have some Jinja templating involved — or contain the template parameters for the models. Any configuration can also be (and frequently is) overridden in a .sql file according to design and performance requirements.
- .sql files contain models and tests. Models are SELECT statements that result in a new (or incrementally updated) table or view in the data warehouse. They can even be ephemeral and only exist while the pipeline is executing. These files are compiled into executable SQL scripts fit to run on the specified database. In fact, in Snowflake, ephemeral tables are nested in the compiled SQL and are effectively temporary tables.
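As an example of the configuration and testing a .yml file carries, a minimal schema.yml might look like the sketch below (the description text is illustrative; the model and column names match the pipeline later in this post):

```yaml
version: 2

models:
  - name: supplier_parts
    description: "Supplier, part, and supply-cost data joined into one model"
    columns:
      - name: part_key
        tests:
          - not_null
      - name: supplier_key
        tests:
          - not_null
```

The not_null entries are schema tests: dbt compiles each into a query against the model and fails the run if the assertion does not hold.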
After installing dbt via pip (or, as we prefer, pipenv), a new project is created with ‘dbt init’.
A dbt project can have many different ‘modules’, and a number of these are highlighted below:
Models
This directory will contain the .sql files that make up the ELT pipeline(s). They are always a select statement and result in the creation of a table, view or ephemeral (temp) table of the data when executed. They form a sequence of execution stages by referencing other models.
Tests
These are .sql files that are used to execute custom tests on data. For example, if you want to make sure a percentage of values in a certain column is within a certain range, you would write a model that would validate this assumption on the resulting model.
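A custom data test is just a SELECT that returns the rows violating an assumption; dbt fails the test if the query returns any rows. A minimal sketch (the threshold and column choice are illustrative):

```sql
-- Fail if any part has a negative average supply cost
SELECT
    part_key,
    average_supply_cost
FROM {{ ref('average_supply_cost_and_available_quantity_per_part') }}
WHERE average_supply_cost < 0
```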
Macros
These are .sql files that are templatized with Jinja. The idea is for these models to be highly reusable (think DRY — Don’t Repeat Yourself) and are important in a full CaC solution. In a software engineering solution, these are the functions used to build bigger solutions — and sometimes they too are called macros.
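A minimal macro sketch (the name and logic are illustrative). Once defined in the macros directory, it can be called from any model, e.g. SELECT {{ average_of('account_balance') }} FROM {{ ref('supplier_parts') }}:

```sql
{% macro average_of(column_name) %}
    avg({{ column_name }}) AS average_{{ column_name }}
{% endmacro %}
```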
Analysis
These are models which do not run when you deploy the project. Instead, they are designed to be ‘on-demand’ queries that are most frequently associated with specialized views or tables used in analytics solutions — such as a data science pipeline.
Docs
These are markdown files (with special decorations — see official documentation) that are used to document models. These are optional — but should be used in any production environment.
Logs
These are run logs. They are generated when a ‘dbt run’ command is executed.
Target Directories (e.g. dev)
These are created when compiling, running, or building documentation. They contain all of the meta content and compiled SQL code.
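Putting these pieces together, a project directory (including the directories generated at run time) looks roughly like the sketch below; the layout is illustrative and the project name comes from the demo repo:

```text
snowflake_tpc_demo/
├── dbt_project.yml      # project-level configuration
├── models/              # .sql models that form the ELT pipeline(s)
│   └── schema.yml       # model-level configuration and tests
├── tests/               # custom data tests
├── macros/              # reusable, Jinja-templated SQL
├── analysis/            # on-demand queries, not run on deploy
├── docs/                # markdown documentation for models
├── logs/                # run logs from 'dbt run'
└── target/              # compiled SQL and docs (created by compile/run)
```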
Not all components shown above have been used in the demo, nor are all types of functionality addressed. If you would like to know more about the functionality supported in dbt please reference the official documentation.
Setting up dbt for Snowflake
To use dbt with Snowflake, either locally or through a CI/CD pipeline, the executing machine should have a profiles.yml within the ~/.dbt directory with the following content (appropriately configured). The ‘sf’ profile below (choose your own name) will be placed in the profile field in dbt_project.yml.
sf:
  target: dev                 # environment to default to
  outputs:
    dev:                      # configuring a specific environment
      type: snowflake
      account: YOUR_ACCOUNT
      user: YOUR_USER_NAME
      password: YOUR_PASSWORD
      role: ROLE_WITH_RIGHTS
      database: DATABASE_NAME
      warehouse: SPECIFIC_WAREHOUSE
      schema: SCHEMA_TO_WORK_IN
      threads: NUMBER_OF_THREADS_BETWEEN_1_AND_8
Creating an ELT Data Pipeline
In this section, we’ll take a look at the series of transformations — these can be found within this git repo.
The problem we are addressing uses the TPC-H benchmark datasets. The goal is to create an analytics view of the data with aggregations of:
- Average Account Balance for all Customers in a Nation
- Average Available Quantity for each Product, regardless of Supplier
- Average Supply Cost for each Product, regardless of Supplier
These three values will be combined into a table with columns containing the following information: Supplier Name, Supplier Account Balance, Customer Account Balance, Part Available Quantity, Supply Cost for each Part and Supplier Pair, Part Name, Part Manufacturer, Part Brand, Part Type, Part Size, Nation, Region, Market Segment and Customer Name.
To generate the appropriate view of the data, various approaches could be taken. But, in order to demonstrate the functionality of the pipeline, we do so as follows:
1. supplier_parts — join the PART, SUPPLIER and PARTSUPP tables on the PARTKEY and SUPPKEY fields
WITH supplier_parts AS (
SELECT *
FROM snowflake_sample_data.tpch_sf1.supplier
JOIN snowflake_sample_data.tpch_sf1.partsupp
ON s_suppkey=ps_suppkey
JOIN snowflake_sample_data.tpch_sf1.part
ON ps_partkey=p_partkey
)
SELECT
s_suppkey AS supplier_key,
s_name AS supplier_name,
s_nationkey AS nation_key,
s_acctbal AS account_balance,
ps_availqty AS available_quantity,
ps_supplycost AS supply_cost,
p_partkey AS part_key,
p_name AS part_name,
p_mfgr AS part_manufacturer,
p_brand AS part_brand,
p_type AS part_type,
p_size AS part_size,
p_container AS part_container,
p_retailprice AS part_retail_price
FROM supplier_parts
2. average_account_balance_by_part_and_supplier — compute an average aggregation of the supplier account balance by part and supplier
SELECT
part_key,
supplier_key,
avg(account_balance) AS average_account_balance
FROM {{ref('supplier_parts')}}
GROUP BY part_key, supplier_key
3. average_supply_cost_and_available_quantity_per_part — compute the average supply cost and available quantity for each part, across its suppliers
SELECT
part_key,
avg(available_quantity) AS average_available_quantity,
avg(supply_cost) AS average_supply_cost
FROM {{ref('supplier_parts')}}
GROUP BY part_key
4. supplier_parts_aggregates — join the various aggregations back to the supplier_parts table
SELECT
sp.*,
ascaqsp.average_available_quantity,
ascaqsp.average_supply_cost
FROM {{ref('supplier_parts')}} as sp
JOIN {{ref('average_supply_cost_and_available_quantity_per_part')}} AS ascaqsp
ON ascaqsp.part_key = sp.part_key
5. customer_nation_region — join the CUSTOMER, NATION and REGION tables by the NATIONKEY
WITH customer_nation_region AS (
SELECT *
FROM snowflake_sample_data.tpch_sf1.customer
JOIN snowflake_sample_data.tpch_sf1.nation
ON n_nationkey=c_nationkey
JOIN snowflake_sample_data.tpch_sf1.region
ON n_regionkey=r_regionkey
)
SELECT
r_name AS region,
n_name AS nation,
n_nationkey AS nation_key,
c_name AS customer_name,
c_acctbal AS account_balance,
c_mktsegment AS market_segment
FROM customer_nation_region
6. average_acctbal_by_nation — compute an aggregate average of the customer account balance over the nation field
SELECT
nation_key,
avg(account_balance) AS average_account_balance
FROM {{ref('customer_nation_region')}}
GROUP BY nation_key
7. customer_nation_region_aggregates — join the aggregate account balance back to the customer_nation_region table
SELECT
cnr.*,
aabn.average_account_balance AS avg_acct_bal_by_nation
FROM {{ref('customer_nation_region')}} AS cnr
JOIN {{ref('average_acctbal_by_nation')}} AS aabn
ON aabn.nation_key = cnr.nation_key
8. analytics_view — join of customer_nation_region_aggregates to supplier_parts_aggregates. Conceptually, this is to be used for either a BI Dashboard or a Data Analytics/Science workstream.
SELECT
spa.supplier_name,
spa.account_balance AS supplier_account_balance,
cnra.account_balance AS customer_account_balance,
cnra.avg_acct_bal_by_nation AS average_account_balance_by_nation,
spa.available_quantity,
spa.average_available_quantity,
spa.supply_cost,
spa.average_supply_cost,
spa.part_name,
spa.part_manufacturer,
spa.part_brand,
spa.part_type,
spa.part_size,
cnra.nation,
cnra.region,
cnra.market_segment,
cnra.customer_name
FROM {{ref('supplier_parts_aggregates')}} AS spa
JOIN {{ref('customer_nation_region_aggregates')}} AS cnra
ON
spa.nation_key = cnra.nation_key
The above series of transformations forms a DAG (Directed Acyclic Graph). This graph demonstrates the lineage, or provenance, of the chain of transformations, and is a useful visualization for understanding how the various tables/views were produced throughout the transformation process.
Filters are applied throughout the various stages in order to guarantee the appropriate columns are present.
Steps 1, 2, 3, 5 and 6 materialize as ephemeral tables, steps 4 and 7 as tables, and step 8 as a view. And although the solution isn’t intended to be deployed from your console, you could download the source code and create and serve all of the documentation with the ‘dbt docs generate’ and ‘dbt docs serve’ commands.
Integrating dbt with Azure DevOps
In the spirit of bringing software engineering practices to the SQL-centric ELT data warehousing world, any modern approach would be incomplete without automating the testing and deployment process through a CI/CD pipeline. In addition to local execution, we’ll review how the dbt ELT pipeline has been integrated with Azure DevOps.
To run the project locally, you can use Pipenv to set up dbt, and the scripts in the Pipfile to execute various tasks. Pipenv is the “new kid on the python block” that brings the best of all the packaging worlds to Python. It creates a virtual environment while managing dependencies and scripts using a Pipfile.
Below are the steps to setup Pipenv and the supported commands for this project:
pip install pipenv           # in case it isn't already installed
pipenv install               # at the project root
cd snowflake_tpc_demo
pipenv run build_and_serve   # build and run the documentation server
pipenv run test              # runs the tests
pipenv run build             # generate the docs
pipenv run build_and_run     # generate docs, serve, and run
To set up CI/CD for this project, you can use Jenkins or Azure DevOps. We are going to discuss Azure DevOps here, but for Jenkins, take a look at the Jenkinsfile in the Jenkins branch of the repo.
Getting Started with Azure DevOps
Azure DevOps is the rebranded and cloud-native version of Visual Studio Team Services (VSTS). It is a collection of services for DevOps enablement, and it currently offers:
- Azure Pipelines to continuously build, test, and deploy to any platform and cloud.
- Azure Boards for powerful work management.
- Azure Artifacts for Maven, npm, and NuGet package feeds.
- Azure Repos for unlimited cloud-hosted private Git repos.
- Azure Test Plans for planned and exploratory testing.
To get started with Azure DevOps, you’ll need an Azure account. Once you have that set up, log in to the Azure portal and search for Azure DevOps in the search bar at the top. Once redirected to the Azure DevOps page, create an organization. After creating the organization, you can create a project and invite other members of your team to the project for collaboration.
Automating Pipeline Deployment via CI/CD pipeline with Azure DevOps
To automate the flow, you can securely move the ~/.dbt/profiles.yml onto the executor where the deployment will be running. Go to Library in Pipelines and upload your profiles.yml as a secure file. You can download this file while running the deployment.
Azure Pipelines stores the pipeline as code in the azure-pipelines.yml file, which uses the YAML schema reference.
Let’s take a look at our azure-pipelines.yml.
The trigger block defines the branches you plan to deploy from.
The pool block sets the base VM image; Azure Pipelines currently supports images for macOS, Ubuntu, and Windows.
If we want to run our deployment in a docker container, we can use the container block to define the name of the image and its arguments.
Steps are the various stages of the pipeline, and you can have multiple steps in a pipeline. Next, you will need to download the profiles.yml file from the Library to the executor container. Use the DownloadSecureFile task to download the file; the file must already be uploaded to your Azure DevOps organization before you can use it. Once you have downloaded the file, move it to the ~/.dbt directory as profiles.yml so the dbt on the executor can pick it up.
You can either have dbt prepackaged in your docker container, or if you want to use the latest version, you can download it as part of the deployment process using pip install dbt. You need to add /home/vsts_azpcontainer/.local/bin to $PATH, as pip installs dbt in this directory by default.
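Pulling the pieces together, a minimal azure-pipelines.yml following those steps might look like the sketch below. The branch name, secure-file name, VM image, and dbt stages are illustrative; adapt them to your repo.

```yaml
trigger:
  branches:
    include:
      - master

pool:
  vmImage: 'ubuntu-latest'

steps:
  # Fetch the secure profiles.yml uploaded to the Library
  - task: DownloadSecureFile@1
    name: dbtProfile
    inputs:
      secureFile: 'profiles.yml'

  - script: |
      mkdir -p ~/.dbt
      cp $(dbtProfile.secureFilePath) ~/.dbt/profiles.yml
      pip install dbt
      export PATH="$PATH:$HOME/.local/bin"
      dbt compile
      dbt test
      dbt run
    displayName: 'Compile, test, and run the dbt project'
```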
You’re Ready to Start Using dbt for DataOps with Snowflake
Once all of this is done, go to the project directory and execute the appropriate dbt commands. It is advisable to keep compile, test, and run as separate stages, in that order, so that the pipeline breaks if there is an error.
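That staging advice can be sketched in Python by shelling out to the dbt CLI, with each stage run separately so the first failure stops the pipeline. The --target flag and the "dev" value are illustrative and depend on your profiles.yml.

```python
import subprocess

# Keep compile, test, and run as separate stages, in that order
DBT_STAGES = ["compile", "test", "run"]

def dbt_command(stage, target="dev"):
    """Build the argv for a single dbt stage."""
    return ["dbt", stage, "--target", target]

def run_pipeline(project_dir="."):
    for stage in DBT_STAGES:
        # check=True raises CalledProcessError, breaking the
        # pipeline at the first failing stage
        subprocess.run(dbt_command(stage), cwd=project_dir, check=True)

print(dbt_command("run"))  # → ['dbt', 'run', '--target', 'dev']
```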
While dbt is typically invoked from the command line, it is possible to import and invoke dbt as a Python module. You can also use Prefect, Airflow, dbt Cloud, and other orchestration tools to execute the pipeline on custom schedules for your dbt project. Since dbt has a Python API, you can also run it as a serverless function with Azure Functions or AWS Lambda.
More advanced DataOps approaches could be implemented, e.g. serving the dbt documentation server. However, since going much further would involve building out infrastructure as code, we’ll save that for a later post.
Hopefully, we’ve helped increase your understanding of the capabilities of dbt as a tool for DataOps, given you a clear path for creating an ELT Data Pipeline in Snowflake, and demonstrated automating pipeline deployment with Azure DevOps.
Using dbt as a key part of your DataOps solution to design and execute your ELT processes in Snowflake brings SQL data pipelines to the software engineering world of DevOps.
Need Help with Your Snowflake Cloud Data Warehouse, DevOps, or DataOps?
If you’d like additional assistance with Snowflake, DevOps, or DataOps, Hashmap offers a range of enablement workshops and assessment services, cloud migration services, and consulting service packages as part of our Snowflake service offerings — we would be glad to work through your specific requirements — please reach out.
Did you know we’re a dbt technology partner? That means we guide you to a better understanding of applying dbt to a modern data stack in the cloud to start bringing you value. Check out our dbt focus page.
Prem Dubey is a Site Reliability Engineer (connect with him on LinkedIn) at Hashmap working across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers, partners, and the community.
John Aven, PhD, is the Director of Engineering at Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. Be sure and connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.