Building a dbt project from scratch

Alice Bui
Joon Solutions Global
8 min read · Jan 15, 2024

Building a dbt project from scratch can be tough for some data folks. I looked around, and there’s no blog that covers all the steps we need in one place. So, here it is! This article will take you through the whole process, from connecting to your data warehouse and configuring your dbt project to setting up CI/CD & dbt Cloud.

Step 1: Setting up the environment & git repo and organizing the dbt project

Prerequisites: dbt is installed.

1.1. Creating a git repository

Follow this guide to create a git repository: Quickstart for repositories — GitHub Docs
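
If you prefer the command line, a minimal sequence looks like this (a sketch only; the repository name and remote URL are placeholders):

# Create a local repository and push it to a new GitHub remote
git init joon_bi_dbt
cd joon_bi_dbt
echo "# joon_bi_dbt" > README.md
git add README.md
git commit -m "Initial commit"
git branch -M main
git remote add origin git@github.com:your-org/joon_bi_dbt.git
git push -u origin main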

1.2. Setting up Python environment & dbt dependencies

  • Set up a Python environment for development
# Create Python virtual environment
python -m venv path/to/your/venv

# Install dependencies
source path/to/your/venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
  • Install dbt dependencies
dbt deps
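
dbt deps installs the packages listed in a packages.yml file at the project root. A minimal example (dbt_utils is shown purely for illustration; list whichever packages your project actually uses, such as the one providing the required_tests operation used later in this article):

## packages.yml (illustrative example)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1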

1.3. Organizing folders according to the dbt project structure

Lastly, structure the files, folders, and models following dbt's guide: How we structure our dbt projects | dbt Developer Hub
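
Following that guide, a typical layout looks roughly like this (a sketch only; the staging folders and marts will mirror your own sources and business domains):

joon_bi_dbt/
├── dbt_project.yml
├── packages.yml
├── models/
│   ├── staging/        # 1:1 with source tables; renaming & type casting
│   ├── intermediate/   # purpose-built transformation steps
│   └── mart/           # business-defined entities
├── seeds/
├── snapshots/
├── macros/
└── tests/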

Step 2: Connecting to the data warehouse using the profiles.yml file

You’ll need a profiles.yml file containing your data platform’s connection details. A profiles.yml file can store as many profiles as you need; typically, you would have one profile for each warehouse you use. This file generally lives outside of your dbt project (by default dbt looks for it in ~/.dbt/, and you can point elsewhere with the --profiles-dir flag or the DBT_PROFILES_DIR environment variable) so that sensitive credentials are not checked into version control, but profiles.yml can be safely checked in when environment variables are used to load the sensitive credentials.

2.1. Configuring the profiles.yml file locally

The configuration of the profiles.yml file varies based on the data warehouse you are using. For detailed configuration instructions for each type of data platform provider, please refer to the dbt documentation: About dbt Core data platform connections | dbt Developer Hub

To set up the file correctly, there are certain concepts you should grasp:

  • Understanding targets: A typical profile for an analyst using dbt locally will have a target named dev, and have this set as the default.
  • Understanding target schemas: The target schema represents the default schema that dbt will build objects into, and is often used as the differentiator between separate environments within a warehouse. In development, a pattern we’ve found to work well is to name the schema in your dev target dbt_<username>.
  • Understanding warehouse credentials: It’s recommended that each dbt user has their own set of database credentials, including a separate user for production runs of dbt; this helps debug rogue queries, simplifies ownership of schemas, and improves security.
# example 1 - snowflake
joon_bi_dbt:
  target: dev
  outputs:
    dev:
      account: # your data warehouse's account identifier
      database: DEV
      password: # password of your data warehouse user
      role: # role of your data warehouse user, e.g. TRANSFORMER
      schema: # personal schema, e.g. DBT_ALICE
      threads: 10
      type: snowflake
      user: # user name of your data warehouse user, e.g. alice
      warehouse: # warehouse name, e.g. TRANSFORMING

# example 2 - bigquery
joon_bi_dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      dataset: # personal schema, e.g. dbt_alice
      fixed_retries: 1
      location: asia-southeast1
      method: oauth
      priority: interactive
      project: # BigQuery project id, e.g. joon-bi-data-warehouse-dev
      threads: 32
      timeout_seconds: 300
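
Once the profile is in place, it is worth verifying the connection before going further:

# Validate profiles.yml, dbt_project.yml, and the warehouse connection
dbt debug --target dev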

2.2. Using environment variables in the profiles.yml file

Credentials can either be placed directly into the profiles.yml file or loaded from environment variables. Using environment variables is especially useful for production deployments of dbt. We can set it up as follows, and the file can then be checked into version control:

joon_bi_dbt:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: # your data warehouse's account identifier
      # Replace with your user/password or set the environment variables
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      database: DEV
      warehouse: TRANSFORMING
      # Replace with dbt_yourname or set the environment variable DBT_SCHEMA
      schema: "{{ env_var('DBT_SCHEMA') }}"
      threads: 10
      client_session_keep_alive: False
    ci:
      type: snowflake
      account: # your data warehouse's account identifier
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      database: DEV
      warehouse: TRANSFORMING
      # Set the environment variable DBT_SCHEMA_CI
      schema: "{{ env_var('DBT_SCHEMA_CI') }}"
      threads: 10
      client_session_keep_alive: False
    prod:
      type: snowflake
      account: # your data warehouse's account identifier
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: PRODUCER
      database: PROD
      warehouse: PRODUCTION
      # Set the environment variable DBT_SCHEMA_PROD
      schema: "{{ env_var('DBT_SCHEMA_PROD') }}"
      threads: 10
      client_session_keep_alive: False

If the DBT_USER and DBT_PASSWORD environment variables are present when dbt is invoked, then these variables will be pulled into the profile as expected. If any environment variables are not set, then dbt will raise a compilation error.
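
Note that env_var also accepts a second argument as a default value, e.g. "{{ env_var('DBT_SCHEMA', 'dbt_alice') }}", which avoids the compilation error when the variable is unset. To run dbt locally against this profile, you might export the variables first (placeholder values):

# Placeholder credentials - substitute your own
export DBT_USER=alice
export DBT_PASSWORD='********'
export DBT_SCHEMA=dbt_alice

dbt run --target dev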

Details can be found in these guides: Connection profiles | dbt Developer Hub, About env_var function | dbt Developer Hub

Step 3: Defining project-specific configurations in dbt_project.yml

3.1. Defining the directories of the dbt project

## dbt_project.yml
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'joon_bi_dbt'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'joon_bi_dbt'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files

3.2. Configuring models

## dbt_project.yml (cont.)
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

models:
  joon_bi_dbt:
    # Configs indicated by + apply to all models under the given directory
    +required_tests: {"unique.*|not_null|at_least_one": 3}
    +materialized: table

    staging:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: staging

    intermediate:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: intermediate

    mart:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: mart

snapshots:
  joon_bi_dbt:
    +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
    +schema: snapshot
  • Setting materialization types for models: This is a case-by-case decision. Depending on data size and how frequently the raw data is updated, base & staging tables can, for instance, be materialized as views.
  • Adding required tests: unique, not_null, and at_least_one are configured as required tests for models in all layers. Note that the +required_tests config and the required_tests run-operation used later in CI are not dbt Core features; they come from a testing package (e.g. dbt-meta-testing) or a custom macro.
  • Generating custom database & schema: With the configuration above, the outcome will be:
[Image: dbt database design]
[Image: dbt schema design]

By default, dbt generates the schema name for a model by concatenating the custom schema onto the target schema, as in <target_schema>_<custom_schema>. The target.name context variable is used in the configuration above (and in the macro sketch below) to vary the generated database and schema per environment. You must additionally ensure that your different dbt environments are configured appropriately. It is recommended that:

  • dev: Your local development environment; configured in a profiles.yml file on your computer.
  • ci: A continuous integration environment running on Pull Requests in GitHub, GitLab, etc.
  • prod: The production deployment of your dbt project, like in dbt Cloud, Airflow, or similar.

Details can be found in these documents: Custom schemas | dbt Developer Hub, Custom databases | dbt Developer Hub
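
If you want production models to build into the custom schema alone (staging, intermediate, mart) while dev and CI keep the <target_schema>_<custom_schema> prefix, you can override dbt's built-in generate_schema_name macro. A minimal sketch, assuming the target names used in this article (place it in the macros/ folder):

## macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'prod' -%}
        {#- in prod, build straight into the custom schema, e.g. "mart" -#}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {#- in dev/ci, prefix with the target schema, e.g. "dbt_alice_mart" -#}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}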

Step 4: Implementing CI/CD workflow

4.1. Adding a GitHub Actions workflow to the .github folder

The pr_to_master workflow is triggered when a pull request to the main branch is opened. It lints the code with sqlfluff, checks required tests, and runs and tests the models, selecting only the changed models when a previous production manifest is available.

## .github/workflows/pr_to_master.yml
name: pr_to_main

on:
  pull_request:
    branches:
      - main
    paths:
      - 'models/**'
      - 'snapshots/**'
      - 'dbt_project.yml'
      - 'packages.yml'
      - 'requirements.txt'

env:
  DBT_PROFILES_DIR: ./
  ARTIFACT_DBT_MANIFEST_NAME: dbt_manifest
  WORKFLOW_DBT_MANIFEST: push_to_master.yml
  DBT_USER: ${{ secrets.DBT_USER }}
  DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
  DBT_SCHEMA_CI: ${{ secrets.DBT_SCHEMA_CI }}
  DBT_SCHEMA: ${{ secrets.DBT_SCHEMA }}

jobs:
  pr_to_main:
    name: pr_to_main
    runs-on: ubuntu-latest

    steps:
      - name: Check out
        uses: actions/checkout@main

      - uses: actions/setup-python@v1
        with:
          python-version: "3.7.x"

      - name: Install dependencies
        run: |
          pip3 install -r requirements.txt
          dbt deps --target ci

      - name: Seeding
        run: dbt seed --target ci

      - name: Snapshot
        run: dbt snapshot --target ci

      - name: Check required tests
        run: dbt run-operation required_tests --target ci

      - name: Lint with sqlfluff
        run: |
          git fetch origin main:main
          git diff main --name-only --diff-filter=d | egrep '^models/.*\.sql$' | xargs -r sqlfluff lint

      ## Download the dbt manifest uploaded by another workflow (default: push_to_master.yml).
      # This manifest file is used to select only the changed ("delta") models.
      - name: Download dbt manifest
        id: download_dbt_manifest
        uses: dawidd6/action-download-artifact@v2
        with:
          github_token: ${{ secrets.GIT_TOKEN }}
          workflow: ${{ env.WORKFLOW_DBT_MANIFEST }}
          workflow_conclusion: success
          branch: main
          name: ${{ env.ARTIFACT_DBT_MANIFEST_NAME }}
          path: .
          check_artifacts: false
        continue-on-error: true

      ## Run and test only the delta models, deferring references to unchanged
      # upstream models to the state captured in the downloaded manifest.
      - name: Run models
        id: run_dbt
        if: steps.download_dbt_manifest.outcome == 'success'
        run: dbt run --select state:modified+ --defer --state . --target ci

      - name: Test models
        if: steps.download_dbt_manifest.outcome == 'success'
        run: dbt test --select state:modified+ --defer --state . --target ci

      ## In some cases the manifest artifact cannot be downloaded. Because the
      # previous dbt state is then missing, all models are run and tested.
      - name: Run all models
        if: steps.download_dbt_manifest.outcome == 'failure'
        run: dbt run --target ci

      - name: Test all models
        if: steps.download_dbt_manifest.outcome == 'failure'
        run: dbt test --target ci

## pr_to_master.properties.json
{
  "name": "Check Pull Request Workflow",
  "description": "Template of workflow to check every PR made to the default branch.",
  "iconName": "example-icon",
  "categories": [
    "SQL", "Python"
  ],
  "filePatterns": [
    "packages.yml",
    "dbt_project.yml",
    "profiles.yml",
    "requirements.txt"
  ]
}
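
The download step above assumes a companion workflow on the main branch that uploads the manifest after each production run. That workflow is not shown in this article, so here is a hedged sketch of what push_to_master.yml might look like (the artifact name must match ARTIFACT_DBT_MANIFEST_NAME above; adapt the steps to your own deployment):

## .github/workflows/push_to_master.yml (sketch, not part of the original setup)
name: push_to_master

on:
  push:
    branches:
      - main

env:
  DBT_PROFILES_DIR: ./
  DBT_USER: ${{ secrets.DBT_USER }}
  DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
  DBT_SCHEMA_PROD: ${{ secrets.DBT_SCHEMA_PROD }}

jobs:
  push_to_master:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@main

      - uses: actions/setup-python@v1
        with:
          python-version: "3.7.x"

      - name: Install dependencies
        run: |
          pip3 install -r requirements.txt
          dbt deps --target prod

      - name: Run models
        run: dbt run --target prod

      # Upload the manifest so PR builds can select only the changed models
      - name: Upload dbt manifest
        uses: actions/upload-artifact@v3
        with:
          name: dbt_manifest
          path: target/manifest.json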

4.2. Adding secrets in GitHub Actions

  • Go to project Settings > Secrets and variables > Actions
  • Create repository secrets for DBT_USER, DBT_PASSWORD, DBT_SCHEMA_CI, and DBT_SCHEMA (these configure the credentials and target schema for the CI run), plus GIT_TOKEN (the token used by the manifest download step).
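
If you use the GitHub CLI, the same secrets can also be created from the terminal (placeholder values):

# Create the repository secrets used by the workflow above
gh secret set DBT_USER --body "ci_user"
gh secret set DBT_PASSWORD --body "********"
gh secret set DBT_SCHEMA_CI --body "dbt_ci"
gh secret set DBT_SCHEMA --body "dbt_alice"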

Step 5: Setting up dbt Cloud

To set up dbt Cloud, follow the guides below:

5.1. Connecting a data platform to dbt Cloud

About data platform connections | dbt Developer Hub

5.2. Configuring git

Git configuration in dbt Cloud | dbt Developer Hub

5.3. Deployment environments & jobs

Deployment environments | dbt Developer Hub

Deploy jobs | dbt Developer Hub

5.4. Adding environment variables

To configure the credentials and target schema for production:

  • Go to Deploy > Environments > Environment variables
  • Create variables DBT_USER, DBT_PASSWORD, DBT_SCHEMA_PROD

Further steps

With the setup above in place, you can take the following next steps for your dbt project:

  • Create test models to exercise the CI/CD workflow and confirm that models are stored in the correct database and schema.
  • Migrate & refactor legacy SQL into dbt models

Good luck with your dbt project!

Alice Bui
Analytics Engineer @ Joon Solutions | GDE, dbt, Looker, Airflow Certified