Using dbt to Execute ELT Pipelines in Snowflake
DevOps and DataOps for Snowflake with dbt and Azure DevOps
by John L. Aven and Prem Dubey
ELT (Extract, Load, and Transform) has become increasingly popular over the last few years. This approach has been driven, in part, by the growing popularity of cloud data warehouses such as Snowflake, which our clients are using not only for new cloud data warehouses, but also as the top SaaS, cloud-native target when migrating from legacy data warehouses such as Netezza, Teradata, and Exadata, and from traditional Big Data platforms.
Furthermore, with the increasing popularity of DevOps and DataOps practices to increase velocity and reduce risk, it has become desirable to be able to integrate the modification and deployment of changes in a data warehouse through a CI/CD pipeline technology such as Jenkins or Azure DevOps.
With that in mind, we will do our best in this post to provide you with:
- An introduction to dbt (Data Build Tool), an open-source DataOps solution
- A walkthrough to create an ELT Data Pipeline in Snowflake with dbt
- An approach for automating pipeline deployment via a CI/CD pipeline with Azure DevOps.
dbt for DataOps
While Snowflake provides an unmatched cloud data warehousing experience with a multi-cluster, shared data architecture that separates storage from compute, dbt is a game-changing approach to managing ELT workloads: the orchestration and management of processing pipelines. dbt provides a unique level of DataOps functionality that enables Snowflake to do what it does well while abstracting pipeline orchestration away from the cloud data warehouse service.
dbt brings the software engineering world to the SQL-savvy ELT developer. It treats SQL queries as models, i.e. a SELECT script within a package. All models in dbt represent tables and views within the data warehouse. These models can be versioned, and their schema and data can be tested, much as one would write unit tests for a piece of software code. The models are linked by references to form a DAG (Directed Acyclic Graph), a very common computing model found in many current data-centric tools (Spark, Airflow, TensorFlow, …). A DAG represents the order of query execution, as well as the lineage of data as generated through the models.
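To make the DAG idea concrete, here is a small Python sketch (illustrative only, not dbt's internal code) that treats each model's ref() links as graph edges and derives a valid execution order with a topological sort. The model names are the ones used in the pipeline later in this post.

```python
# Python 3.9+ (graphlib is in the standard library)
from graphlib import TopologicalSorter

# Each model maps to the set of models it ref()s (its upstream dependencies).
model_refs = {
    "supplier_parts": set(),
    "average_supply_cost_and_available_quantity_per_part": {"supplier_parts"},
    "supplier_parts_aggregates": {
        "supplier_parts",
        "average_supply_cost_and_available_quantity_per_part",
    },
    "customer_nation_region": set(),
    "average_acctbal_by_nation": {"customer_nation_region"},
    "customer_nation_region_aggregates": {
        "customer_nation_region",
        "average_acctbal_by_nation",
    },
    "analytics_view": {
        "supplier_parts_aggregates",
        "customer_nation_region_aggregates",
    },
}

# A valid execution order: every model runs after everything it references.
order = list(TopologicalSorter(model_refs).static_order())
print(order)
```

Since every other model is (transitively) a dependency of analytics_view, it always sorts last, exactly as the final view of a pipeline should.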
Components of dbt
Everything in dbt is one of three types of files — .sql, .yml, .md. Working in reverse order:
- .md files are documentation files. They are a variant on traditional markdown and are used as documentation for models
- .yml files are configuration files. These handle everything from configuring the project (dbt_project.yml) to the detailed configuration and testing in a schema.yml file. Dbt espouses the concept of Configuration-as-Code (CaC), and so most .yml files have some Jinja templating involved — or contain the template parameters for the models. Any configuration can also be (and frequently is) overridden in a .sql file according to design and performance requirements.
- .sql files contain models and tests. Models are SELECT statements that result in a new (or incrementally updated) table or view in the data warehouse. They can even be ephemeral and only exist while the pipeline is executing. These files are compiled into executable SQL scripts fit to run on the specified database. In fact, in Snowflake, ephemeral tables are nested in the compiled SQL and are effectively temporary tables.
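As an example of the configuration and testing a .yml file carries, a minimal schema.yml might look like the sketch below (the description text is illustrative; the model and column names match the pipeline later in this post):

```yaml
version: 2

models:
  - name: supplier_parts
    description: "Supplier, part, and supply-cost data joined into one model"
    columns:
      - name: part_key
        tests:
          - not_null
      - name: supplier_key
        tests:
          - not_null
```

The not_null entries are schema tests: dbt compiles each into a query against the model and fails the run if the assertion does not hold.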
After installing dbt via pip (or, as we prefer, pipenv), a new project is created with ‘dbt init’.
A dbt project can have many different ‘modules’, and a number of these are highlighted below:
Models
This directory will contain the .sql files that make up the ELT pipeline(s). They are always a select statement and result in the creation of a table, view or ephemeral (temp) table of the data when executed. They form a sequence of execution stages by referencing other models.
Tests
These are .sql files that are used to execute custom tests on data. For example, if you want to make sure a percentage of values in a certain column is within a certain range, you would write a model that would validate this assumption on the resulting model.
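A custom data test is just a SELECT that returns the rows violating an assumption; dbt fails the test if the query returns any rows. A minimal sketch (the threshold and column choice are illustrative):

```sql
-- Fail if any part has a negative average supply cost
SELECT
    part_key,
    average_supply_cost
FROM {{ ref('average_supply_cost_and_available_quantity_per_part') }}
WHERE average_supply_cost < 0
```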
Macros
These are .sql files that are templatized with Jinja. The idea is for these models to be highly reusable (think DRY — Don’t Repeat Yourself) and are important in a full CaC solution. In a software engineering solution, these are the functions used to build bigger solutions — and sometimes they too are called macros.
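A minimal macro sketch (the name and logic are illustrative). Once defined in the macros directory, it can be called from any model, e.g. SELECT {{ average_of('account_balance') }} FROM {{ ref('supplier_parts') }}:

```sql
{% macro average_of(column_name) %}
    avg({{ column_name }}) AS average_{{ column_name }}
{% endmacro %}
```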
Analysis
These are models which do not run when you deploy the project. Instead, they are designed to be ‘on-demand’ queries that are most frequently associated with specialized views or tables used in analytics solutions — such as a data science pipeline.
Docs
These are markdown files (with special decorations — see official documentation) that are used to document models. These are optional — but should be used in any production environment.
Logs
These are run logs. They are generated when a ‘dbt run’ command is executed.
Target Directories (e.g. dev)
These are created when compiling, running, or building documentation. They contain all of the meta content and compiled SQL code.
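Putting these pieces together, a project directory (including the directories generated at run time) looks roughly like the sketch below; the layout is illustrative and the project name comes from the demo repo:

```text
snowflake_tpc_demo/
├── dbt_project.yml      # project-level configuration
├── models/              # .sql models that form the ELT pipeline(s)
│   └── schema.yml       # model-level configuration and tests
├── tests/               # custom data tests
├── macros/              # reusable, Jinja-templated SQL
├── analysis/            # on-demand queries, not run on deploy
├── docs/                # markdown documentation for models
├── logs/                # run logs from 'dbt run'
└── target/              # compiled SQL and docs (created by compile/run)
```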
Not all components shown above have been used in the demo, nor are all types of functionality addressed. If you would like to know more about the functionality supported in dbt please reference the official documentation.
Setting up dbt for Snowflake
To use dbt with Snowflake, either locally or through a CI/CD pipeline, the executing machine should have a profiles.yml within the ~/.dbt directory with the following content (appropriately configured). The ‘sf’ profile below (choose your own name) will be placed in the profile field in dbt_project.yml.
sf:
  target: dev                 # environment to default to
  outputs:
    dev:                      # configuring a specific environment
      type: snowflake
      account: YOUR_ACCOUNT
      user: YOUR_USER_NAME
      password: YOUR_PASSWORD
      role: ROLE_WITH_RIGHTS
      database: DATABASE_NAME
      warehouse: SPECIFIC_WAREHOUSE
      schema: SCHEMA_TO_WORK_IN
      threads: NUMBER_OF_THREADS_BETWEEN_1_AND_8
Creating an ELT Data Pipeline
In this section, we’ll take a look at the series of transformations — these can be found within this git repo.
The problem we are addressing uses the TPC-H benchmark datasets. The goal is to create an analytics view of the data with aggregations of:
- Average Account Balance for all Customers in a Nation
- Average Available Quantity for each Product, regardless of Supplier
- Average Supply Cost for each Product, regardless of Supplier
These three values will be combined into a table with columns containing the following information: Supplier Name, Supplier Account Balance, Customer Account Balance, Part Available Quantity, Supply Cost for each Part and Supplier Pair, Part Name, Part Manufacturer, Part Brand, Part Type, Part Size, Nation, Region, Market Segment and Customer Name.
To generate the appropriate view of the data, various approaches could be taken. But, in order to demonstrate the functionality of the pipeline, we do so as follows:
1. supplier_parts — join the PART, SUPPLIER and PARTSUPP tables on the PARTKEY and SUPPKEY fields
WITH supplier_parts AS (
SELECT *
FROM snowflake_sample_data.tpch_sf1.supplier
JOIN snowflake_sample_data.tpch_sf1.partsupp
ON s_suppkey=ps_suppkey
JOIN snowflake_sample_data.tpch_sf1.part
ON ps_partkey=p_partkey
)
SELECT
s_suppkey AS supplier_key,
s_name AS supplier_name,
s_nationkey AS nation_key,
s_acctbal AS account_balance,
ps_availqty AS available_quantity,
ps_supplycost AS supply_cost,
p_partkey AS part_key,
p_name AS part_name,
p_mfgr AS part_manufacturer,
p_brand AS part_brand,
p_type AS part_type,
p_size AS part_size,
p_container AS part_container,
p_retailprice AS part_retail_price
FROM supplier_parts
2. average_account_balance_by_part_and_supplier — compute an average aggregation of the supplier account balance by part and supplier
SELECT
part_key,
supplier_key,
avg(account_balance) AS average_account_balance
FROM {{ref('supplier_parts')}}
GROUP BY part_key, supplier_key
3. average_supply_cost_and_available_quantity_per_part — compute the average supply cost and available quantity for each part, across its suppliers
SELECT
part_key,
avg(available_quantity) AS average_available_quantity,
avg(supply_cost) AS average_supply_cost
FROM {{ref('supplier_parts')}}
GROUP BY part_key
4. supplier_parts_aggregates — join the various aggregations back to the supplier_parts table
SELECT
sp.*,
ascaqsp.average_available_quantity,
ascaqsp.average_supply_cost
FROM {{ref('supplier_parts')}} as sp
JOIN {{ref('average_supply_cost_and_available_quantity_per_part')}} AS ascaqsp
ON ascaqsp.part_key = sp.part_key
5. customer_nation_region — join the CUSTOMER, NATION and REGION tables by the NATIONKEY
WITH customer_nation_region AS (
SELECT *
FROM snowflake_sample_data.tpch_sf1.customer
JOIN snowflake_sample_data.tpch_sf1.nation
ON n_nationkey=c_nationkey
JOIN snowflake_sample_data.tpch_sf1.region
ON n_regionkey=r_regionkey
)
SELECT
r_name AS region,
n_name AS nation,
n_nationkey AS nation_key,
c_name AS customer_name,
c_acctbal AS account_balance,
c_mktsegment AS market_segment
FROM customer_nation_region
6. average_acctbal_by_nation — compute an aggregate average of the customer account balance over the nation field
SELECT
nation_key,
avg(account_balance) AS average_account_balance
FROM {{ref('customer_nation_region')}}
GROUP BY nation_key
7. customer_nation_region_aggregates — join the aggregate account balance back to the customer_nation_region table
SELECT
cnr.*,
aabn.average_account_balance AS avg_acct_bal_by_nation
FROM {{ref('customer_nation_region')}} AS cnr
JOIN {{ref('average_acctbal_by_nation')}} AS aabn
ON aabn.nation_key = cnr.nation_key
8. analytics_view — join of customer_nation_region_aggregates to supplier_parts_aggregates. Conceptually, this is to be used for either a BI Dashboard or a Data Analytics/Science workstream.
SELECT
spa.supplier_name,
spa.account_balance AS supplier_account_balance,
cnra.account_balance AS customer_account_balance,
cnra.avg_acct_bal_by_nation AS average_account_balance_by_nation,
spa.available_quantity,
spa.average_available_quantity,
spa.supply_cost,
spa.average_supply_cost,
spa.part_name,
spa.part_manufacturer,
spa.part_brand,
spa.part_type,
spa.part_size,
cnra.nation,
cnra.region,
cnra.market_segment,
cnra.customer_name
FROM {{ref('supplier_parts_aggregates')}} AS spa
JOIN {{ref('customer_nation_region_aggregates')}} AS cnra
ON
spa.nation_key = cnra.nation_key
The above series of transformations forms a DAG (Directed Acyclic Graph). This graph demonstrates the lineage, or provenance, of the chain of transformations, and is a useful visualization for understanding how the various tables/views were produced throughout the transformation process.
Filters are applied throughout the various stages in order to guarantee the appropriate columns are present.
Steps 1, 2, 3, 5 and 6 materialize as ephemeral tables, steps 4 and 7 as tables, and step 8 as a view. And although the solution isn’t intended to be deployed from your console, you could download the source code and create and serve all of the documentation with the ‘dbt docs generate’ and ‘dbt docs serve’ commands.
Integrating dbt with Azure DevOps
In the spirit of bringing software engineering practices to the SQL-centric ELT data warehousing world, any modern approach would be incomplete without automating the testing and deployment process through a CI/CD pipeline. In addition to local execution, we’ll review how the dbt ELT pipeline has been integrated with Azure DevOps.
To run the project locally, you can use Pipenv to set up dbt, and the scripts in the Pipfile to execute various tasks. Pipenv is the “new kid on the python block” that brings the best of all the packaging worlds to Python. It creates a virtual environment while managing dependencies and scripts using a Pipfile.
Below are the steps to setup Pipenv and the supported commands for this project:
pip install pipenv           # in case it isn't already installed
pipenv install               # at the project root
cd snowflake_tpc_demo
pipenv run build_and_serve   # build and run the documentation server
pipenv run test              # runs the tests
pipenv run build             # generate the docs
pipenv run build_and_run     # generate docs, serve, and run
To set up CI/CD for this project, you can use Jenkins or Azure DevOps. We are going to discuss Azure DevOps here, but for Jenkins, take a look at the Jenkinsfile in the Jenkins branch of the repo.
Getting Started with Azure DevOps
Azure DevOps is the rebranded and cloud-native version of Visual Studio Team Services (VSTS). It is a collection of services for DevOps enablement, and it currently offers:
- Azure Pipelines to continuously build, test, and deploy to any platform and cloud.
- Azure Boards for powerful work management.
- Azure Artifacts for Maven, npm, and NuGet package feeds.
- Azure Repos for unlimited cloud-hosted private Git repos.
- Azure Test Plans for planned and exploratory testing.
To get started with Azure DevOps, you’ll need an Azure account. Once you have that set up, log in to the Azure portal and search for Azure DevOps in the search bar at the top. Once redirected to the Azure DevOps page, create an organization. After creating the organization, you can create a project and invite other members of your team to the project for collaboration.
Automating Pipeline Deployment via CI/CD pipeline with Azure DevOps
To automate the flow, you can securely move the ~/.dbt/profiles.yml onto the executor where the deployment will be running. Go to Library in Pipelines and upload your profiles.yml as a secure file. You can download this file while running the deployment.
Azure Pipelines stores the pipeline as code in the azure-pipelines.yml file, which uses the YAML schema reference.
Let’s take a look at our azure-pipelines.yml.
The trigger block defines the branches you plan to deploy from.
The pool block sets the base VM image; Azure Pipelines currently supports images for macOS, Ubuntu, and Windows.
If we want to run our deployment in a docker container, we can use the container block to define the name of the image and its arguments.
Steps are the various stages of the pipeline, and you can have multiple steps in a pipeline. Next, you will need to download the profiles.yml file from the Library to the executor container. Use the DownloadSecureFile task to download the file; the file must already be uploaded to your Azure DevOps organization before you can use it. Once you have downloaded the file, move it to the ~/.dbt directory as profiles.yml so the dbt on the executor can pick it up.
You can either have dbt prepackaged in your docker container, or if you want to use the latest version, you can download it as part of the deployment process using pip install dbt. You need to add /home/vsts_azpcontainer/.local/bin to $PATH, as pip installs dbt in this directory by default.
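Pulling the pieces together, a minimal azure-pipelines.yml following those steps might look like the sketch below. The branch name, secure-file name, VM image, and dbt stages are illustrative; adapt them to your repo.

```yaml
trigger:
  branches:
    include:
      - master

pool:
  vmImage: 'ubuntu-latest'

steps:
  # Fetch the secure profiles.yml uploaded to the Library
  - task: DownloadSecureFile@1
    name: dbtProfile
    inputs:
      secureFile: 'profiles.yml'

  - script: |
      mkdir -p ~/.dbt
      cp $(dbtProfile.secureFilePath) ~/.dbt/profiles.yml
      pip install dbt
      export PATH="$PATH:$HOME/.local/bin"
      dbt compile
      dbt test
      dbt run
    displayName: 'Compile, test, and run the dbt project'
```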
You’re Ready to Start Using dbt for DataOps with Snowflake
Once all of this is done, go to the project directory and execute the appropriate dbt commands. It is advisable to keep compile, test, and run as separate stages, in that order, so that the pipeline breaks if there is an error.
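That staging advice can be sketched in Python by shelling out to the dbt CLI, with each stage run separately so the first failure stops the pipeline. The --target flag and the "dev" value are illustrative and depend on your profiles.yml.

```python
import subprocess

# Keep compile, test, and run as separate stages, in that order
DBT_STAGES = ["compile", "test", "run"]

def dbt_command(stage, target="dev"):
    """Build the argv for a single dbt stage."""
    return ["dbt", stage, "--target", target]

def run_pipeline(project_dir="."):
    for stage in DBT_STAGES:
        # check=True raises CalledProcessError, breaking the
        # pipeline at the first failing stage
        subprocess.run(dbt_command(stage), cwd=project_dir, check=True)

print(dbt_command("run"))  # → ['dbt', 'run', '--target', 'dev']
```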
While dbt is typically invoked from the command line, it is possible to import and invoke dbt as a Python module. You can also use Prefect, Airflow, dbt Cloud, and other orchestration tools to execute the pipeline on custom schedules for your dbt project. Since dbt has a Python API, you can also run it as a serverless function with Azure Functions or AWS Lambda.
More advanced DataOps approaches could be implemented, e.g. serving the dbt documentation server. However, since going much further would involve building out infrastructure as code, we’ll save that for a later post.
Hopefully, we’ve helped increase your understanding of the capabilities of dbt as a tool for DataOps, given you a clear path for creating an ELT Data Pipeline in Snowflake, and demonstrated automating pipeline deployment with Azure DevOps.
Using dbt as a key part of your DataOps solution to design and execute your ELT processes in Snowflake brings SQL data pipelines to the software engineering world of DevOps.
Need Help with Your Snowflake Cloud Data Warehouse, DevOps, or DataOps?
If you’d like additional assistance with Snowflake, DevOps, or DataOps, Hashmap offers a range of enablement workshops and assessment services, cloud migration services, and consulting service packages as part of our Snowflake service offerings — we would be glad to work through your specific requirements — please reach out.
Did you know we’re a dbt technology partner? That means we guide you to a better understanding of applying dbt to a modern data stack in the cloud to start bringing you value. Check out our dbt focus page.
Prem Dubey is a Site Reliability Engineer (connect with him on LinkedIn) at Hashmap working across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers, partners, and the community.
John Aven, PhD, is the Director of Engineering at Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. Be sure and connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.