Using dbt to Execute ELT Pipelines in Snowflake

DevOps and DataOps for Snowflake with dbt and Azure DevOps

John Aven
Hashmap, an NTT DATA Company
12 min read · Aug 9, 2019


by John L. Aven and Prem Dubey

ELT (Extract, Load, and Transform) has become increasingly popular over the last few years. This approach has been driven in part by the growing popularity of cloud data warehouses such as Snowflake, which our clients are using not only for new cloud data warehouses but also as the top SaaS, cloud-native solution for migrating from legacy data warehouses such as Netezza, Teradata, and Exadata, and from traditional Big Data platforms.

Furthermore, with the increasing popularity of DevOps and DataOps practices to increase velocity and reduce risk, it has become desirable to be able to integrate the modification and deployment of changes in a data warehouse through a CI/CD pipeline technology such as Jenkins or Azure DevOps.

With that in mind, we will do our best in this post to provide you with:

  1. An introduction to dbt (Data Build Tool), an open-source DataOps solution
  2. A walkthrough to create an ELT Data Pipeline in Snowflake with dbt
  3. An approach for automating pipeline deployment via a CI/CD pipeline with Azure DevOps.

dbt for DataOps

While Snowflake provides an unmatched cloud data warehousing experience with a multi-cluster, shared data architecture that separates storage from compute, dbt is a game-changing approach to managing ELT workloads (the orchestration and management of processing pipelines). Dbt provides a level of DataOps functionality that lets Snowflake focus on what it does well while keeping transformation orchestration and management outside the cloud data warehouse service itself.

Dbt brings the software engineering world to the SQL-savvy ELT developer. It treats SQL queries as models, that is, SELECT statements packaged within a project. All models in dbt represent tables and views within the data warehouse. These models can be versioned, and their schema and data can be tested, much as one would write unit tests for a piece of software code. The models are linked by references to form a DAG (Directed Acyclic Graph), a very common computing model found in many current data-centric tools (Spark, Airflow, TensorFlow, …). A DAG represents the order of query execution, as well as the lineage of the data generated through the models.

Components of dbt

Everything in dbt is one of three types of files — .sql, .yml, .md. Working in reverse order:

  • .md files are documentation files. They are a variant on traditional markdown and are used as documentation for models.
  • .yml files are configuration files. These handle everything from configuring the project (dbt_project.yml) to the detailed configuration and testing in a schema.yml file. Dbt espouses the concept of Configuration-as-Code (CaC), so most .yml files involve some Jinja templating or contain the template parameters for the models. Any configuration can also be (and frequently is) overridden in a .sql file according to design and performance requirements (see the schema.yml sketch after this list).
  • .sql files contain models and tests. Models are SELECT statements that result in a new (or incrementally updated) table or view in the data warehouse. They can even be ephemeral and only exist while the pipeline is executing. These files are compiled into executable SQL scripts fit to run on the specified database. In fact, in Snowflake, ephemeral tables are nested in the compiled SQL and are effectively temporary tables.
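To make the .yml configuration concrete, here is a minimal schema.yml sketch. The model and column names are illustrative rather than taken from the demo repo; it documents a model and attaches dbt's built-in not_null test to two of its columns:

    version: 2

    models:
      - name: supplier_parts
        description: "Join of the PART, SUPPLIER and PARTSUPP tables"
        columns:
          - name: ps_partkey
            description: "Part key carried over from PARTSUPP"
            tests:
              - not_null
          - name: ps_suppkey
            description: "Supplier key carried over from PARTSUPP"
            tests:
              - not_null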

After installing dbt via pip (or, as we prefer, pipenv), a new project is created with ‘dbt init’.
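For example (the project name below is illustrative):

    pip install dbt            # or: pipenv install dbt
    dbt init my_dbt_project    # scaffolds the standard project directories
    cd my_dbt_project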

A dbt project can contain many different ‘modules’ (directories), and a number of these are highlighted below:

Models

This directory contains the .sql files that make up the ELT pipeline(s). Each model is a SELECT statement and results in the creation of a table, view, or ephemeral (temporary) table when executed. Models form a sequence of execution stages by referencing other models.
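As an illustration, a model is just a templated SELECT; the one below (hypothetical names, not taken from the demo repo) materializes a view on top of another model via ref(), which is also how dbt infers the execution order:

    -- models/supplier_part_counts.sql (illustrative sketch)
    {{ config(materialized='view') }}

    select
        ps_partkey,
        count(distinct ps_suppkey) as supplier_count
    from {{ ref('supplier_parts') }}
    group by ps_partkey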

Tests

These are .sql files used to execute custom tests on data. For example, if you want to make sure that the percentage of values in a certain column falls within a certain range, you would write a query that validates this assumption against the resulting model.
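As a sketch of that idea (hypothetical model and column names), a custom data test is a SELECT that returns the offending rows; the test passes when the query returns no rows:

    -- tests/assert_acctbal_mostly_in_range.sql (illustrative sketch)
    -- Fails if more than 5% of rows have an account balance outside -1000..10000
    with stats as (
        select
            count(*) as total_rows,
            sum(case when customer_account_balance not between -1000 and 10000
                     then 1 else 0 end) as out_of_range_rows
        from {{ ref('customer_nation_region_aggregates') }}
    )
    select *
    from stats
    where out_of_range_rows > 0.05 * total_rows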

Macros

These are .sql files templatized with Jinja. The idea is for these snippets to be highly reusable (think DRY: Don’t Repeat Yourself), and they are important in a full CaC solution. In a software engineering analogy, these are the functions used to build bigger solutions, and sometimes they too are called macros.
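A trivial example (hypothetical, not from the demo) is a macro that wraps a rounded average, which any model can then call inside its SELECT list as {{ rounded_avg('c_acctbal') }}:

    -- macros/rounded_avg.sql (illustrative sketch)
    {% macro rounded_avg(column_name, precision=2) %}
        round(avg({{ column_name }}), {{ precision }})
    {% endmacro %}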

Analysis

These are models which do not run when you deploy the project. Instead, they are designed to be ‘on-demand’ queries that are most frequently associated with specialized views or tables used in analytics solutions — such as a data science pipeline.

Docs

These are markdown files (with special decorations — see official documentation) that are used to document models. These are optional — but should be used in any production environment.

Logs

These are run logs. They are generated when a ‘dbt run’ command is executed.

Target Directories (e.g. dev)

These directories are created when compiling, running, or building documentation. They contain all of the meta content and the compiled SQL code.

Not all of the components listed above are used in the demo, nor are all types of functionality addressed. If you would like to know more about the functionality supported in dbt, please refer to the official documentation.

Setting up dbt for Snowflake

To use dbt on Snowflake, either locally or through a CI/CD pipeline, the executing machine should have a profiles.yml within the ~/.dbt directory with content like the following (appropriately configured). The ‘sf’ profile below (choose your own name) is referenced by the profile field in dbt_project.yml.
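A representative profiles.yml for Snowflake looks roughly like the following sketch; the placeholder values are assumptions to be replaced with your own account, credentials, and objects:

    sf:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: <your_snowflake_account>
          user: <your_username>
          password: <your_password>
          role: <your_role>
          database: <your_database>
          warehouse: <your_warehouse>
          schema: <your_schema>
          threads: 4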

Creating an ELT Data Pipeline

In this section, we’ll take a look at the series of transformations — these can be found within this git repo.

The problem we are addressing uses the TPC-H benchmark datasets. The goal is to create an analytics view of the data with aggregations of:

  1. Average Account Balance for all Customers in a Nation
  2. Average Available Quantity for each Product, regardless of Supplier
  3. Average Supply Cost for each Product, regardless of Supplier

These three values will be combined into a table with columns containing the following information: Supplier Name, Supplier Account Balance, Customer Account Balance, Part Available Quantity, Supply Cost for each Part and Supplier Pair, Part Name, Part Manufacturer, Part Brand, Part Type, Part Size, Nation, Region, Market Segment and Customer Name.

To generate the appropriate view of the data, various approaches could be taken. But, in order to demonstrate the functionality of the pipeline, we do so as follows:

  1. supplier_parts — join the PART, SUPPLIER and PARTSUPP tables on the PARTKEY and SUPPKEY fields
  2. average_account_balance_by_part_and_supplier — compute an average aggregation of the supplier account balance by part and supplier
  3. average_supply_cost_and_available_quantity_per_part — compute the average aggregation of the supply cost and available quantity by part for the supplier
  4. supplier_parts_aggregates — join the various aggregations back to the supplier_parts table
  5. customer_nation_region — join the CUSTOMER, NATION and REGION tables by the NATIONKEY
  6. average_acctbal_by_nation — compute an aggregate average of the customer account balance over the nation field (sketched after this list)
  7. customer_nation_region_aggregates — join the aggregate account balance back to the customer_nation_region table
  8. analytics_view — join customer_nation_region_aggregates to supplier_parts_aggregates. Conceptually, this is to be used for either a BI Dashboard or a Data Analytics/Science workstream.
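To make one of these steps concrete, a model along the lines of average_acctbal_by_nation (step 6) could look like the sketch below. The column names are assumptions based on the TPC-H schema, and the model in the repo may differ:

    -- models/average_acctbal_by_nation.sql (illustrative sketch)
    {{ config(materialized='ephemeral') }}

    select
        n_name as nation,
        avg(c_acctbal) as average_customer_account_balance
    from {{ ref('customer_nation_region') }}
    group by n_name

Because it references customer_nation_region with ref(), dbt knows to execute that model first, which is how the lineage DAG is derived.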

The above series of transformations forms the DAG (Directed Acyclic Graph) shown below. This graph demonstrates the lineage, or provenance, of the chain of transformations and is a useful visualization for understanding how the various tables/views were produced throughout the transformation process.

Filters are applied throughout the various stages in order to guarantee the appropriate columns are present.

Steps 1, 2, 3, 5, and 6 materialize as ephemeral tables; steps 4 and 7 materialize as tables; and step 8 materializes as a view. And, although the solution isn’t intended to be deployed from your console, you could download the source code and, with the following commands, generate all of the documentation and launch the documentation server.
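Assuming the repo's Pipfile wires up the standard dbt commands (a sketch; the exact script names in the Pipfile may differ):

    pipenv run dbt docs generate   # compiles the project and builds the documentation artifacts
    pipenv run dbt docs serve      # serves the docs, including the lineage graph, on a local web server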

Integrating dbt with Azure DevOps

In the spirit of bringing software engineering practices to the SQL-centric ELT data warehousing world, any modern approach would be incomplete without automating the testing and deployment process and integrating with a CI/CD process. In addition to local execution, we’ll review the ELT pipeline (dbt) that has been integrated with Azure DevOps.

To run the project locally, you can use Pipenv to set up dbt and the scripts in the Pipfile to execute various tasks. Pipenv is the “new kid on the Python block” that brings the best of all the packaging worlds to Python. It creates a virtual environment while managing dependencies and scripts via a Pipfile.

Below are the steps to set up Pipenv and the commands supported for this project:
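A minimal sketch of that local workflow (the exact scripts defined in the repo's Pipfile may differ):

    pip install pipenv         # install Pipenv itself
    pipenv install             # create the virtual environment and install dependencies from the Pipfile
    pipenv run dbt compile     # compile the models to executable SQL
    pipenv run dbt test        # run the schema and data tests
    pipenv run dbt run         # execute the models against Snowflake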

To set up CI/CD for this project, you can use Jenkins or Azure DevOps. We are going to discuss Azure DevOps here, but for Jenkins, take a look at the Jenkinsfile in the Jenkins branch of the repo.

Getting Started with Azure DevOps

Azure DevOps is the rebranded and cloud-native version of Visual Studio Team Services (VSTS). It is a collection of services for DevOps enablement, and it currently offers:

  • Azure Boards for planning and work tracking
  • Azure Repos for source control
  • Azure Pipelines for CI/CD
  • Azure Test Plans for manual and exploratory testing
  • Azure Artifacts for package management

To get started with Azure DevOps, you’ll need an Azure account. Once you have that set up, log in to the Azure portal and search for Azure DevOps in the search bar at the top. Once redirected to the Azure DevOps page, create an organization. After creating the organization, you can create a project and invite other members of your team to collaborate.

Automating Pipeline Deployment via CI/CD pipeline with Azure DevOps

To automate the flow, you need to securely provision ~/.dbt/profiles.yml on the executor where the deployment will run. Go to Library under Pipelines and upload your profiles.yml as a secure file. The pipeline can then download this file at deployment time.

Azure Pipelines stores the pipeline as code in the azure-pipelines.yml file, which follows the YAML schema reference.

Let’s take a look at our azure-pipelines.yml.
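Here is a minimal sketch of what such a file can look like; the branch, images, and paths below are assumptions, and the actual file in the repo may differ:

    # azure-pipelines.yml (illustrative sketch; assumes the dbt project sits at the repo root)
    trigger:
      - master

    pool:
      vmImage: 'ubuntu-latest'

    # Run the steps inside a Python container
    container: python:3.7

    steps:
      # Fetch the profiles.yml previously uploaded to the Library as a secure file
      - task: DownloadSecureFile@1
        name: dbtProfile
        inputs:
          secureFile: 'profiles.yml'

      # Place the profile where dbt expects it
      - script: |
          mkdir -p ~/.dbt
          cp $(dbtProfile.secureFilePath) ~/.dbt/profiles.yml
        displayName: 'Provision dbt profile'

      # Install dbt and make the user-level install location visible on PATH
      - script: |
          pip install --user dbt
          echo "##vso[task.prependpath]/home/vsts_azpcontainer/.local/bin"
        displayName: 'Install dbt'

      # Keep compile, test, and run as separate steps so a failure stops the pipeline early
      - script: dbt compile
        displayName: 'dbt compile'
      - script: dbt test
        displayName: 'dbt test'
      - script: dbt run
        displayName: 'dbt run'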

The trigger block defines the branches whose changes you plan to deploy.

The pool block specifies the base VM image we are going to use; Azure Pipelines currently provides hosted images for macOS, Ubuntu, and Windows.

If we want to run our deployment in a Docker container, we can use the container block to define the image name and arguments.

Steps are the various stages of the pipeline, and you can have multiple steps in a pipeline. Next, you will need to download the profiles.yml file from the Library onto the executor container. Use the DownloadSecureFile task to fetch the file; it must first be uploaded as a secure file in your Azure DevOps organization’s Library before you can use it. Once you have downloaded the file, move it to the ~/.dbt directory as profiles.yml so that dbt on the executor can pick it up.

You can either have dbt prepackaged in your Docker container, or, if you want to use the latest version, you can download it as part of the deployment process using pip install dbt. You will also need to add /home/vsts_azpcontainer/.local/bin to $PATH, as pip installs dbt into this directory by default.

You’re Ready to Start Using dbt for DataOps with Snowflake

Once all of this is done, go to the project directory and execute the appropriate dbt commands. It is advisable to keep compile, test, and run in separate stages, in that order, so that the pipeline breaks if there is an error.
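For instance, run from the project root (prefix with pipenv run if you are using Pipenv):

    dbt compile   # stage 1: fail fast on compilation and Jinja errors
    dbt test      # stage 2: fail the pipeline on schema or data test failures
    dbt run       # stage 3: build the tables and views in Snowflake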

While dbt is typically invoked from the command line, it is possible to import and invoke dbt as a Python module. You can also use Prefect, Airflow, dbt Cloud, and other orchestration tools to execute the pipeline on custom schedules for your dbt project. Since dbt has a Python API, you can also run it as a serverless function with Azure Functions or AWS Lambda.

More advanced DataOps approaches could be implemented — e.g. serving the dbt documentation server. However, since going much further would involve building out infrastructure as code, we’ll save that for a later post.

Hopefully, we’ve helped increase your understanding of the capabilities of dbt as a tool for DataOps, given you a clear path for creating an ELT Data Pipeline in Snowflake, and demonstrated automating pipeline deployment with Azure DevOps.

Using dbt as a key part of your DataOps solution to design and execute your ELT processes in Snowflake brings SQL data pipelines into the software engineering world of DevOps.

Need Help with Your Snowflake Cloud Data Warehouse, DevOps, or DataOps?

If you’d like additional assistance with Snowflake, DevOps, or DataOps, Hashmap offers a range of enablement workshops and assessment services, cloud migration services, and consulting service packages as part of our Snowflake service offerings — we would be glad to work through your specific requirements — please reach out.

Did you know we’re a dbt technology partner? That means we guide you to a better understanding of applying dbt to a modern data stack in the cloud to start bringing you value. Check out our dbt focus page.


Prem Dubey is a Site Reliability Engineer (connect with him on LinkedIn) at Hashmap working across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers, partners, and the community.

John Aven, PhD, is the Director of Engineering at Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. Be sure and connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
