Building a dbt project from scratch

Alice Bui
Joon Solutions Global
8 min read · Jan 15, 2024

Building a dbt project from scratch can be tough for some data folks. I looked around, and there’s no blog that covers all the steps we need in one place. So, here it is! This article will take you through the whole process, from connecting to your data warehouse and configuring your dbt project to setting up CI/CD & dbt Cloud.

Step 1: Setting up the environment & git repo and organizing the dbt project

Prerequisites: dbt is installed.

1.1. Creating a git repository

Follow this guide to create a git repository: Quickstart for repositories — GitHub Docs
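
If you prefer the command line, a minimal sequence looks like this (a sketch only; the repository name and remote URL are placeholders):

# Create a local repository and push it to a new GitHub remote
git init joon_bi_dbt
cd joon_bi_dbt
echo "# joon_bi_dbt" > README.md
git add README.md
git commit -m "Initial commit"
git branch -M main
git remote add origin git@github.com:your-org/joon_bi_dbt.git
git push -u origin main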

1.2. Setting up Python environment & dbt dependencies

  • Set up a Python environment for development
# Create Python virtual environment
python -m venv path/to/your/venv

# Install dependencies
source path/to/your/venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
  • Install dbt dependencies
dbt deps
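
dbt deps installs the packages listed in a packages.yml file at the project root. A minimal example (dbt_utils is shown purely for illustration; list whichever packages your project actually uses, such as the one providing the required_tests operation used later in this article):

## packages.yml (illustrative example)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1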

1.3. Organizing folders according to the dbt project structure

Lastly, structure the files, folders, and models following dbt's guide: How we structure our dbt projects | dbt Developer Hub
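
Following that guide, a typical layout looks roughly like this (a sketch only; the staging folders and marts will mirror your own sources and business domains):

joon_bi_dbt/
├── dbt_project.yml
├── packages.yml
├── models/
│   ├── staging/        # 1:1 with source tables; renaming & type casting
│   ├── intermediate/   # purpose-built transformation steps
│   └── mart/           # business-defined entities
├── seeds/
├── snapshots/
├── macros/
└── tests/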

Step 2: Connecting to the data warehouse using the profiles.yml file

You’ll need a profiles.yml file containing your data platform’s connection details. A profiles.yml file can store as many profiles as you need; typically, you would have one profile for each warehouse you use. This file generally lives outside of your dbt project (by default dbt looks for it in ~/.dbt/, and you can point elsewhere with the --profiles-dir flag or the DBT_PROFILES_DIR environment variable) so that sensitive credentials are not checked into version control, but profiles.yml can be safely checked in when environment variables are used to load the sensitive credentials.

2.1. Configuring the profiles.yml file locally

The configuration of the profiles.yml file varies based on the data warehouse you are using. For detailed configuration instructions for each type of data platform provider, please refer to the dbt documentation: About dbt Core data platform connections | dbt Developer Hub

To set up the file correctly, there are certain concepts you should grasp:

  • Understanding targets: A typical profile for an analyst using dbt locally will have a target named dev, and have this set as the default.
  • Understanding target schemas: The target schema represents the default schema that dbt will build objects into, and is often used as the differentiator between separate environments within a warehouse. In development, a pattern we’ve found to work well is to name the schema in your dev target dbt_<username>.
  • Understanding warehouse credentials: It’s recommended that each dbt user has their own set of database credentials, including a separate user for production runs of dbt; this helps debug rogue queries, simplifies ownership of schemas, and improves security.
# example 1 - snowflake
joon_bi_dbt:
  target: dev
  outputs:
    dev:
      account: # your data warehouse's account identifier
      database: DEV
      password: # password of your data warehouse user
      role: # role of your data warehouse user, e.g. TRANSFORMER
      schema: # personal schema, e.g. DBT_ALICE
      threads: 10
      type: snowflake
      user: # user name of your data warehouse user, e.g. alice
      warehouse: # warehouse name, e.g. TRANSFORMING

# example 2 - bigquery
joon_bi_dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      dataset: # personal schema, e.g. dbt_alice
      fixed_retries: 1
      location: asia-southeast1
      method: oauth
      priority: interactive
      project: # BigQuery project id, e.g. joon-bi-data-warehouse-dev
      threads: 32
      timeout_seconds: 300
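
Once the profile is in place, it is worth verifying the connection before going further:

# Validate profiles.yml, dbt_project.yml, and the warehouse connection
dbt debug --target dev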

2.2. Using environment variables in the profiles.yml file

Credentials can either be placed directly into the profiles.yml file or loaded from environment variables. Using environment variables is especially useful for production deployments of dbt. We can set it up as follows, and the file can then be checked into version control:

joon_bi_dbt:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: # your data warehouse's account identifier
      # Replace with your user/password or set the environment variables
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      database: DEV
      warehouse: TRANSFORMING
      # Replace with dbt_yourname or set the environment variable DBT_SCHEMA
      schema: "{{ env_var('DBT_SCHEMA') }}"
      threads: 10
      client_session_keep_alive: False
    ci:
      type: snowflake
      account: # your data warehouse's account identifier
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      database: DEV
      warehouse: TRANSFORMING
      # Set the environment variable DBT_SCHEMA_CI
      schema: "{{ env_var('DBT_SCHEMA_CI') }}"
      threads: 10
      client_session_keep_alive: False
    prod:
      type: snowflake
      account: # your data warehouse's account identifier
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: PRODUCER
      database: PROD
      warehouse: PRODUCTION
      # Set the environment variable DBT_SCHEMA_PROD
      schema: "{{ env_var('DBT_SCHEMA_PROD') }}"
      threads: 10
      client_session_keep_alive: False

If the DBT_USER and DBT_PASSWORD environment variables are present when dbt is invoked, then these variables will be pulled into the profile as expected. If any environment variables are not set, then dbt will raise a compilation error.
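
Note that env_var also accepts a second argument as a default value, e.g. "{{ env_var('DBT_SCHEMA', 'dbt_alice') }}", which avoids the compilation error when the variable is unset. To run dbt locally against this profile, you might export the variables first (placeholder values):

# Placeholder credentials - substitute your own
export DBT_USER=alice
export DBT_PASSWORD='********'
export DBT_SCHEMA=dbt_alice

dbt run --target dev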

Details can be found in these guides: Connection profiles | dbt Developer Hub, About env_var function | dbt Developer Hub

Step 3: Defining project-specific configurations in dbt_project.yml

3.1. Defining the directories of the dbt project

## dbt_project.yml
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'joon_bi_dbt'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'joon_bi_dbt'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files

3.2. Configuring models

## dbt_project.yml (cont.)
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

models:
  joon_bi_dbt:
    # Configs indicated by + apply to all models under the given directory
    +required_tests: {"unique.*|not_null|at_least_one": 3}
    +materialized: table

    staging:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: staging

    intermediate:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: intermediate

    mart:
      +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
      +schema: mart

snapshots:
  joon_bi_dbt:
    +database: "{% if target.name == 'prod' %} PROD {% else %} DEV {% endif %}"
    +schema: snapshot
  • Setting materialization types for models: This is a case-by-case decision. Depending on data size and how frequently the raw data is updated, base & staging tables can, for instance, be materialized as views.
  • Adding required tests: unique, not_null, and at_least_one are configured as required tests for models in all layers. Note that the +required_tests config and the required_tests run-operation used later in CI are not dbt Core features; they come from a testing package (e.g. dbt-meta-testing) or a custom macro.
  • Generating custom database & schema: With the configuration above, the outcome will be:
[Image: dbt database design]
[Image: dbt schema design]

By default, dbt generates the schema name for a model by concatenating the custom schema onto the target schema, as in <target_schema>_<custom_schema>. The target.name context variable is used in the configuration above (and in the macro sketch below) to vary the generated database and schema per environment. You must additionally ensure that your different dbt environments are configured appropriately. It is recommended that:

  • dev: Your local development environment; configured in a profiles.yml file on your computer.
  • ci: A continuous integration environment running on Pull Requests in GitHub, GitLab, etc.
  • prod: The production deployment of your dbt project, like in dbt Cloud, Airflow, or similar.

Details can be found in these documents: Custom schemas | dbt Developer Hub, Custom databases | dbt Developer Hub
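
If you want production models to build into the custom schema alone (staging, intermediate, mart) while dev and CI keep the <target_schema>_<custom_schema> prefix, you can override dbt's built-in generate_schema_name macro. A minimal sketch, assuming the target names used in this article (place it in the macros/ folder):

## macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'prod' -%}
        {#- in prod, build straight into the custom schema, e.g. "mart" -#}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {#- in dev/ci, prefix with the target schema, e.g. "dbt_alice_mart" -#}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}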

Step 4: Implementing CI/CD workflow

4.1. Adding a GitHub Actions workflow to the .github folder

The pr_to_master workflow is triggered when a pull request to the main branch is opened. It lints the code with sqlfluff, checks required tests, and runs and tests the models, selecting only the changed models when a previous production manifest is available.

## .github/workflows/pr_to_master.yml
name: pr_to_main

on:
  pull_request:
    branches:
      - main
    paths:
      - 'models/**'
      - 'snapshots/**'
      - 'dbt_project.yml'
      - 'packages.yml'
      - 'requirements.txt'

env:
  DBT_PROFILES_DIR: ./
  ARTIFACT_DBT_MANIFEST_NAME: dbt_manifest
  WORKFLOW_DBT_MANIFEST: push_to_master.yml
  DBT_USER: ${{ secrets.DBT_USER }}
  DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
  DBT_SCHEMA_CI: ${{ secrets.DBT_SCHEMA_CI }}
  DBT_SCHEMA: ${{ secrets.DBT_SCHEMA }}

jobs:
  pr_to_main:
    name: pr_to_main
    runs-on: ubuntu-latest

    steps:
      - name: Check out
        uses: actions/checkout@main

      - uses: actions/setup-python@v1
        with:
          python-version: "3.7.x"

      - name: Install dependencies
        run: |
          pip3 install -r requirements.txt
          dbt deps --target ci

      - name: Seeding
        run: dbt seed --target ci

      - name: Snapshot
        run: dbt snapshot --target ci

      - name: Check required tests
        run: dbt run-operation required_tests --target ci

      - name: Lint with sqlfluff
        run: |
          git fetch origin main:main
          git diff main --name-only --diff-filter=d | egrep '^models/.*\.sql$' | xargs -r sqlfluff lint

      ## Download the dbt manifest uploaded by another workflow (default: push_to_master.yml).
      # This manifest file is used to select only the changed ("delta") models.
      - name: Download dbt manifest
        id: download_dbt_manifest
        uses: dawidd6/action-download-artifact@v2
        with:
          github_token: ${{ secrets.GIT_TOKEN }}
          workflow: ${{ env.WORKFLOW_DBT_MANIFEST }}
          workflow_conclusion: success
          branch: main
          name: ${{ env.ARTIFACT_DBT_MANIFEST_NAME }}
          path: .
          check_artifacts: false
        continue-on-error: true

      ## Run and test only the delta models, deferring references to unchanged
      # upstream models to the state captured in the downloaded manifest.
      - name: Run models
        id: run_dbt
        if: steps.download_dbt_manifest.outcome == 'success'
        run: dbt run --select state:modified+ --defer --state . --target ci

      - name: Test models
        if: steps.download_dbt_manifest.outcome == 'success'
        run: dbt test --select state:modified+ --defer --state . --target ci

      ## In some cases the manifest artifact cannot be downloaded. Because the
      # previous dbt state is then missing, all models are run and tested.
      - name: Run all models
        if: steps.download_dbt_manifest.outcome == 'failure'
        run: dbt run --target ci

      - name: Test all models
        if: steps.download_dbt_manifest.outcome == 'failure'
        run: dbt test --target ci

## pr_to_master.properties.json
{
  "name": "Check Pull Request Workflow",
  "description": "Template of workflow to check every PR made to the default branch.",
  "iconName": "example-icon",
  "categories": [
    "SQL", "Python"
  ],
  "filePatterns": [
    "packages.yml",
    "dbt_project.yml",
    "profiles.yml",
    "requirements.txt"
  ]
}
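
The download step above assumes a companion workflow on the main branch that uploads the manifest after each production run. That workflow is not shown in this article, so here is a hedged sketch of what push_to_master.yml might look like (the artifact name must match ARTIFACT_DBT_MANIFEST_NAME above; adapt the steps to your own deployment):

## .github/workflows/push_to_master.yml (sketch, not part of the original setup)
name: push_to_master

on:
  push:
    branches:
      - main

env:
  DBT_PROFILES_DIR: ./
  DBT_USER: ${{ secrets.DBT_USER }}
  DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
  DBT_SCHEMA_PROD: ${{ secrets.DBT_SCHEMA_PROD }}

jobs:
  push_to_master:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@main

      - uses: actions/setup-python@v1
        with:
          python-version: "3.7.x"

      - name: Install dependencies
        run: |
          pip3 install -r requirements.txt
          dbt deps --target prod

      - name: Run models
        run: dbt run --target prod

      # Upload the manifest so PR builds can select only the changed models
      - name: Upload dbt manifest
        uses: actions/upload-artifact@v3
        with:
          name: dbt_manifest
          path: target/manifest.json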

4.2. Adding secrets in GitHub Actions

  • Go to project Settings > Secrets and variables > Actions
  • Create repository secrets for DBT_USER, DBT_PASSWORD, DBT_SCHEMA_CI, and DBT_SCHEMA (these configure the credentials and target schema for the CI run), plus GIT_TOKEN (the token used by the manifest download step).
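
If you use the GitHub CLI, the same secrets can also be created from the terminal (placeholder values):

# Create the repository secrets used by the workflow above
gh secret set DBT_USER --body "ci_user"
gh secret set DBT_PASSWORD --body "********"
gh secret set DBT_SCHEMA_CI --body "dbt_ci"
gh secret set DBT_SCHEMA --body "dbt_alice"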

Step 5: Setting up dbt Cloud

To set up dbt Cloud, follow the guides below:

5.1. Connecting a data platform to dbt Cloud

About data platform connections | dbt Developer Hub

5.2. Configuring git

Git configuration in dbt Cloud | dbt Developer Hub

5.3. Deployment environments & jobs

Deployment environments | dbt Developer Hub

Deploy jobs | dbt Developer Hub

5.4. Adding environment variables

To configure the credentials and target schema for production:

  • Go to Deploy > Environments > Environment variables
  • Create variables DBT_USER, DBT_PASSWORD, DBT_SCHEMA_PROD

Further steps

With the setup above in place, you can take the following next steps for your dbt project:

  • Create test models to exercise the CI/CD workflow and confirm that models are stored in the correct database and schema.
  • Migrate & refactor legacy SQL into dbt models

Good luck with your dbt project!

Alice Bui
Analytics Engineer @ Joon Solutions | GDE, dbt, Looker, Airflow Certified