Data modeling using dbt Core and Snowflake

Maksim Kazartsev
8 min read · Oct 30, 2023


My name is Maksim, and I like learning and experimenting with new tools and technologies.

Today I will share how to upload the Superstore dataset source files to Snowflake and use dbt Core to transform and organize the data into a dimensional model.

Data lineage (generated in VS Code)

Tools used

  • Snowflake
  • dbt Core
  • git
  • VS Code (IDE)

Project phases

The project comprises 4 phases:

  1. Installing and configuring dbt Core and the environment on a laptop
  2. Loading raw data into Snowflake
  3. Configuring connection to Snowflake
  4. Creating models in dbt

Step 1. Installing and configuring dbt Core and the environment on a laptop

Prerequisites: Prior to installing dbt Core, I downloaded and installed Git, Python, pip, and venv.

Create a new virtual environment

python3 -m venv dbt-env

Activate the environment (this is the Windows command; on macOS/Linux, run source dbt-env/bin/activate instead)

dbt-env\Scripts\activate

Install the dbt adapter for Snowflake. This installs both dbt-core and dbt-snowflake:

pip install dbt-snowflake
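
To verify the installation, run dbt --version; it should list dbt-core together with the snowflake plugin:

dbt --version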

Initialize a dbt project. I opened a terminal, navigated to the directory where I wanted the project to live, and ran the command below, which creates the dbt_superstore_project folder.

dbt init dbt_superstore_project
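
Depending on the dbt version, the generated scaffold looks roughly like this (the example models can be removed later):

dbt_superstore_project/
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
├── tests/
└── dbt_project.yml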

Initialize a Git repository in the project directory

git init

On GitHub, I created a repository named dbt_superstore_project and, under the <> Code button, copied its URL: https://github.com/kazarmax/dbt_superstore_project.git

Set the remote URL for my local repository to point to my GitHub repository:

git remote add origin https://github.com/kazarmax/dbt_superstore_project.git

Push my local repository to GitHub:

git add .
git commit -m "Initial commit"
git branch -M main
git push -u origin main

Step 2. Loading raw data into Snowflake

Prerequisites: First, I created an account at https://www.snowflake.com/ and signed in.

Below are the steps to load the three Superstore data source files to Snowflake: supestore_orders.csv, supestore_people.csv, and supestore_returns.csv.

In Snowflake, create a new SQL worksheet and run the following commands:

Create a new virtual warehouse

create warehouse wh_superstore;

Create a database for raw data

create database raw;

Create a database for dbt models

create database analytics;

Create a schema in the raw database

create schema raw.superstore;

Create tables for the raw data

orders table (will store data from the supestore_orders.csv source file)

CREATE TABLE raw.superstore.orders(
row_id INTEGER NOT NULL PRIMARY KEY
,order_id VARCHAR(14) NOT NULL
,order_date DATE NOT NULL
,ship_date DATE NOT NULL
,ship_mode VARCHAR(14) NOT NULL
,customer_id VARCHAR(8) NOT NULL
,customer_name VARCHAR(22) NOT NULL
,segment VARCHAR(11) NOT NULL
,country VARCHAR(13) NOT NULL
,city VARCHAR(17) NOT NULL
,state VARCHAR(20) NOT NULL
,postal_code VARCHAR(50)
,region VARCHAR(7) NOT NULL
,product_id VARCHAR(15) NOT NULL
,category VARCHAR(15) NOT NULL
,subcategory VARCHAR(11) NOT NULL
,product_name VARCHAR(127) NOT NULL
,sales NUMERIC(9,4) NOT NULL
,quantity INTEGER NOT NULL
,discount NUMERIC(4,2) NOT NULL
,profit NUMERIC(21,16) NOT NULL
);

sales_managers table (will store data from the supestore_people.csv source file)

CREATE TABLE raw.superstore.sales_managers(
person VARCHAR(17) NOT NULL PRIMARY KEY
,region VARCHAR(7) NOT NULL
);

returned_orders table (will store data from the supestore_returns.csv source file)

CREATE TABLE raw.superstore.returned_orders(
returned BOOLEAN NOT NULL
,order_id VARCHAR(14) NOT NULL
);

Create an internal stage in Snowflake to store the source data files by running the following SQL commands:

USE DATABASE raw;
USE SCHEMA superstore;
CREATE STAGE file_stage;

Load the .csv source files to the stage in the Snowflake interface

Loading source files to stage in Snowflake
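
I uploaded the files through the Snowflake UI, but as an alternative (assuming the SnowSQL CLI is installed and connected), the files can be uploaded with a PUT command; the local path below is a placeholder:

PUT file:///path/to/supestore_orders.csv @raw.superstore.file_stage AUTO_COMPRESS = FALSE;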

Load data from the uploaded .csv source files into the created tables

-- Loading data into the 'orders' table 
copy into raw.superstore.orders
from @file_stage/supestore_orders.csv
file_format = (
type = 'CSV'
field_delimiter = '|'
skip_header = 1
);
-- Loading data into the 'sales_managers' table
copy into raw.superstore.sales_managers
from @file_stage/supestore_people.csv
file_format = (
type = 'CSV'
field_delimiter = '|'
skip_header = 1
);
-- Loading data into the 'returned_orders' table
copy into raw.superstore.returned_orders
from @file_stage/supestore_returns.csv
file_format = (
type = 'CSV'
field_delimiter = '|'
skip_header = 1
);
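
A quick sanity check that all three loads succeeded:

select 'orders' as table_name, count(*) as row_count from raw.superstore.orders
union all
select 'sales_managers', count(*) from raw.superstore.sales_managers
union all
select 'returned_orders', count(*) from raw.superstore.returned_orders;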

Step 3. Configuring connection to Snowflake

Fill in the profiles.yml file in the ~/.dbt/ directory using your Snowflake account settings:

dbt_superstore_project:
  outputs:
    dev:
      account: nb11111.ca-central-1.aws
      database: analytics
      password: my_password
      role: dev role
      schema: dbt_superstore_mkazartsev
      threads: 1
      type: snowflake
      user: snowflake_user
      warehouse: wh_superstore
  target: dev
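
To avoid storing the password in plain text, dbt can read it from an environment variable with the env_var function; SNOWFLAKE_PASSWORD here is an arbitrary variable name of my own choosing:

password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"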

Check the connection. In the dbt project directory, run the following in a terminal:

dbt debug

If the previous steps were completed correctly and the profiles.yml file was filled in properly, it shows Connection test: OK connection ok

Testing connection to Snowflake

Step 4. Creating models in dbt

Declaring the sources

Create a new YML file models/staging/src_superstore.yml and fill it in to declare the sources. Also, provide basic descriptions for the tables and ID columns, along with not_null generic tests to ensure the ID columns in the source tables have no missing values.

version: 2

sources:
  - name: superstore
    description: Data from the Superstore dataset
    database: raw
    schema: superstore
    tables:

      - name: orders
        description: Raw orders data
        columns:
          - name: order_id
            description: Order identifier
            tests:
              - not_null
          - name: customer_id
            description: Identifier of the customer who placed the order
            tests:
              - not_null
          - name: product_id
            description: Identifier of the product in the order
            tests:
              - not_null

      - name: returned_orders
        description: Raw data for returned orders
        columns:
          - name: order_id
            description: Order identifier
            tests:
              - not_null

      - name: sales_managers
        description: Raw data for regional sales managers
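
These source declarations and their not_null tests can be checked on their own, before any models exist, using dbt's source selector:

dbt test --select source:superstore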

Creating staging models

For each of the source tables, create a staging model for light transformation, cleaning, and standardization.

For the orders table

-- models/staging/stg_orders.sql
select
row_id
,order_id
,order_date
,ship_date
,ship_mode
,customer_id
,customer_name
,segment
,country
,city
,state
,case when city = 'Burlington' and postal_code is null then '05401' else postal_code end as postal_code -- clean piece of data
,region
,product_id
,category as product_category
,subcategory as product_subcategory
,product_name
,sales
,quantity
,discount
,profit

from {{ source('superstore', 'orders') }}

For the returned_orders table

-- models/staging/stg_returned_orders.sql
select
distinct order_id

from {{ source('superstore', 'returned_orders') }}

For the sales_managers table

-- models/staging/stg_sales_managers.sql
select
person as manager_name
,region

from {{ source('superstore', 'sales_managers') }}

Create a new YML file models/staging/stg_superstore.yml and fill it in to declare the staging models. Also, provide basic descriptions for models and columns.

version: 2

models:
  - name: stg_sales_managers
    description: Staged data of regional sales managers
    columns:
      - name: manager_name
        description: First and last name of a regional sales manager
      - name: region
        description: Region a sales manager is responsible for

  - name: stg_returned_orders
    description: Staged data of returned orders. Has just one field with the ID of a returned order.
    columns:
      - name: order_id
        description: Identifier of a returned order

  - name: stg_orders
    description: Staged data of orders.

Edit the dbt_project.yml file to configure materialization so that the staging models are materialized as views and the marts models as tables, by adding the following block at the end of the file:

models:
  superstore:
    staging:
      +materialized: view
    marts:
      +materialized: table
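
These folder-level defaults can be overridden for an individual model, if needed, with a config block at the top of the model file:

{{ config(materialized='view') }}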

Run the dbt build command in the dbt project directory to run the created models and their tests. If the previous steps were done correctly, the command ends with a Completed successfully message, and the staging models appear as views in Snowflake.
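
During development, it can also be handy to build just a subset of the project using dbt's node selection, for example:

dbt build --select staging
dbt build --select stg_orders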

Materialized staging models in Snowflake

Creating dimensional models

Based on the created staging models, create dimensional models in the models/marts/core/ dbt project directory

dim_customers

-- models/marts/core/dim_customers.sql
with customers as (
select
distinct customer_id
,customer_name
from {{ ref('stg_orders') }}
)
select
100 + ROW_NUMBER() OVER(order by null) as id
,*
from customers
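
A note on the surrogate keys used in these dimension models: 100 + ROW_NUMBER() over an arbitrary order is simple, but the values are not stable across rebuilds. A common alternative is a hashed key from the dbt_utils package; a sketch, assuming dbt_utils is declared in packages.yml and installed with dbt deps:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

Then, in a model:

{{ dbt_utils.generate_surrogate_key(['customer_id']) }} as id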

dim_products

-- models/marts/core/dim_products.sql
with products as (
select
distinct product_id
,product_name
,product_category
,product_subcategory
,segment

from {{ ref('stg_orders') }}
)
select
100 + ROW_NUMBER() OVER(order by null) as id
,*
from products

dim_shipping

-- models/marts/core/dim_shipping.sql
with shipping as (
select
distinct ship_mode
from {{ ref('stg_orders') }}
)
select
100 + ROW_NUMBER() OVER(order by null) as id
,*
from shipping

dim_regions

-- models/marts/core/dim_regions.sql
with regions as (
select
distinct region as region_name
from {{ ref('stg_sales_managers') }}
)
select
100 + ROW_NUMBER() OVER(order by null) as region_id
,*
from regions

dim_regional_managers

-- models/marts/core/dim_regional_managers.sql
with regional_managers as (
select
manager_name
,r.region_id as region_id

FROM {{ ref('stg_sales_managers') }} sm
JOIN {{ ref('dim_regions') }} r ON sm.region = r.region_name
)

select
1000 + ROW_NUMBER() OVER(order by null) as id
,*
from regional_managers

dim_geo

-- models/marts/core/dim_geo.sql
with geo as (
select
distinct country
,city
,state
,r.region_id
,postal_code

from {{ ref('stg_orders') }} as o
join {{ ref('dim_regions') }} r ON o.region = r.region_name
)
select
10 + ROW_NUMBER() over(order by null) as id
,*
from geo

dim_calendar

-- models/marts/core/dim_calendar.sql
with recursive date_cte as (
select dateadd(day, 0, '2000-01-01') as date
union all
select dateadd(day, 1, date)
from date_cte
where date <= '2030-01-01'
)

select
date::date as date,
to_char(date, 'yyyymmdd')::int as date_id,
extract(year from date)::int as year,
extract(quarter from date)::int as quarter,
extract(month from date)::int as month,
extract(week from date)::int as week,
dayofweekiso(date)::int as dow,
dayname(date) as week_day
from date_cte
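
If the recursive version runs into Snowflake's recursion limit in your account, the same date spine can be built without recursion using the GENERATOR table function; a sketch (10959 rows covers 2000-01-01 through 2030-01-01):

select dateadd(day, row_number() over (order by null) - 1, '2000-01-01'::date) as date
from table(generator(rowcount => 10959))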

Create two fact models: fct_sales and fct_returns

fct_sales

-- models/marts/core/fct_sales.sql
with sales as (
select
order_id
,to_char(order_date, 'yyyymmdd')::int AS order_date_id
,to_char(ship_date, 'yyyymmdd')::int AS ship_date_id
,sales
,profit
,quantity
,discount
,products.id as dim_products_id
,customers.id as dim_customers_id
,shipping.id as dim_shipping_id
,geo.id as dim_geo_id

from {{ ref('stg_orders') }} as orders

join {{ ref('dim_products') }} as products on
products.product_id = orders.product_id
and products.product_name = orders.product_name
and products.product_category = orders.product_category
and products.product_subcategory = orders.product_subcategory
and products.segment = orders.segment

join {{ ref('dim_customers') }} as customers on
customers.customer_id = orders.customer_id

join {{ ref('dim_shipping') }} as shipping on
shipping.ship_mode = orders.ship_mode

join {{ ref('dim_geo') }} as geo on
geo.country = orders.country
and geo.city = orders.city
and geo.state = orders.state
and geo.postal_code = orders.postal_code
)

SELECT
100 + ROW_NUMBER() over(order by null) AS id
, *
from sales

fct_returns

-- models/marts/core/fct_returns.sql
with returned_orders as (
select
distinct order_id

from {{ ref('stg_returned_orders') }}
)
select
100 + ROW_NUMBER() OVER(order by null) as id
,*
from returned_orders
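
The fact models can be covered by schema tests as well. For example, a relationships generic test checks that every dimension key in fct_sales resolves to a row in its dimension; a sketch of a models/marts/core/core.yml file (the filename is my own choice):

version: 2

models:
  - name: fct_sales
    columns:
      - name: dim_customers_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: id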

Run the dbt build command in the dbt project directory again to build the created models and run their tests. If everything was done correctly in the previous steps, the command ends with a Completed successfully message, with views created in Snowflake for the staging models and tables for the dimensional models.

Materialized dimensional models in Snowflake
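
The lineage diagram at the top of this post can also be explored through dbt's built-in documentation site, generated and served locally with:

dbt docs generate
dbt docs serve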

Conclusion

dbt is a powerful data transformation tool. In this project, I showcased how to install, configure, and use dbt to clean, test, and transform data from the raw layer into an analysis-ready dimensional model.

Link to the project repository on GitHub — https://github.com/kazarmax/dbt_superstore_project

dbt project in VS Code

Follow me on LinkedIn — https://www.linkedin.com/in/kazarmax/
