Step-by-step guide to run dbt in production with Google Cloud Platform

Ivan Toriya
Jul 8, 2021


If you’re here, I suppose you already know what dbt is (if not, check the official docs or video introduction) and are asking the same question I had: “How do I run dbt in production (read: set up a system to run a dbt job on a schedule) with the open-source dbt CLI?” There is a great article from Mete Atamel on a Scheduled serverless dbt + BigQuery service, but it is quite tech-savvy for some (it was for me at the time): it requires you to know the gcloud CLI and to understand the Google Cloud Platform ecosystem. Here, I want to build on that article and democratize dbt implementation further by writing a simple step-by-step guide to running dbt in production with GCP.

Introduction

Before we begin, I suppose you already know:

  1. How to work with dbt locally (if not, check the tutorial).
  2. How to store your dbt project as a GitHub repository.
  3. How to create a project in GCP.

Let’s start by drafting a diagram of the services we’re going to use:

dbt in production inside GCP + alerting (optional)

Overview of how the implementation works:

  • When you commit anything to the master or main branch in GitHub, it triggers Cloud Build to create a new Docker image with the updated source code.
  • Cloud Scheduler then runs this image every 24 hours (the schedule is configurable) with a script of your dbt commands.
  • I also use Logging + Monitoring to send alerts to a Telegram chat if my models or tests fail.

Create a service account in GCP

First, you need to create a service account in GCP with both the BigQuery Data Editor and BigQuery User roles and save the JSON keyfile somewhere safe. This service account will be used by dbt to access your BigQuery project.

Note: If you are using external tables (i.e. you query data stored in Google Cloud Storage, Drive, or Bigtable), then you also need the Storage Admin role on the resource you’re using.

Navigation Menu in GCP → IAM & Admin → Service Accounts → Create Service Account

Name the account so that you know its function:

Grant the roles BigQuery Data Editor and BigQuery User:

Click “Continue” and then “Done”:

Now, click on the created account and go to “Keys” and create a new key:

Create a private JSON key and save it in a secure place:
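If you prefer the command line, here is a rough gcloud sketch of the clicks above. The account name dbt-runner, the project ID placeholder, and the keyfile name are my own assumptions, not part of the original setup:

gcloud iam service-accounts create dbt-runner --display-name="dbt-runner"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dbt-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:dbt-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.user"

gcloud iam service-accounts keys create dbt-service-keyfile.json \
  --iam-account="dbt-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com"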

Create profiles.yml inside dbt project

- dbt-project-name/
- analysis
- data
- logs
- macros
- models
- snapshots
- tests
- dbt_project.yml
- packages.yml
- README.md
- profiles.yml

When you are using dbt locally, you store profiles.yml outside your dbt project for security’s sake, so that you never commit your credentials into a repository. But for the GCP setup you should create this file inside your dbt project, because we are going to store the credentials securely in Secret Manager.

The file should look like this:

i_am_a_profile_name:  # your profile name from dbt_project.yml
  target: dev  # your development environment
  outputs:

    dev:
      type: bigquery
      method: service-account
      project: project-name  # name of the project in BigQuery
      dataset: dbt_yourname  # your dataset for development (usually your name)
      threads: 4
      keyfile: /secrets/dbt-service-keyfile
      timeout_seconds: 300
      priority: interactive
      retries: 1

    prod:  # your production environment
      type: bigquery
      method: service-account
      project: project-name  # name of the project in BigQuery
      dataset: dbt_production  # your dataset for production
      threads: 4
      keyfile: /secrets/dbt-service-keyfile
      timeout_seconds: 300
      priority: interactive
      retries: 1

Create script.sh inside dbt project

- dbt-project-name/
- analysis
- data
- logs
- macros
- models
- snapshots
- tests
- dbt_project.yml
- packages.yml
- README.md
- profiles.yml
- invoke.go
- Dockerfile
- script.sh

The script is just the commands we want dbt to run in GCP:

#!/bin/sh
dbt deps --profiles-dir . # Pulls the most recent version of the dependencies listed in your packages.yml from git
dbt debug --target dev --profiles-dir .  # check the connection for the dev target
dbt debug --target prod --profiles-dir . # check the connection for the prod target
dbt run --target prod --profiles-dir .   # build the models in production
dbt test --data --target dev --profiles-dir . # run data tests against the dev target

You can add any commands that you normally use; just don’t forget the --profiles-dir . flag, as by default dbt expects to find profiles.yml in ~/.dbt/, not inside the project.
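One optional tweak, and purely my own suggestion rather than part of the original setup: adding set -e makes the shell stop at the first failing command, so a failed dbt run isn’t masked by a later command that succeeds. A minimal sketch:

#!/bin/sh
set -e  # exit as soon as any command fails, so the failure propagates to the Go server
dbt deps --profiles-dir .
dbt run --target prod --profiles-dir .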

Create invoke.go inside dbt project

- dbt-project-name/
- analysis
- data
- logs
- macros
- models
- snapshots
- tests
- dbt_project.yml
- packages.yml
- README.md
- profiles.yml
- invoke.go

Cloud Run expects HTTP requests, but we run shell commands with the dbt CLI. That’s why we need to set up a Go-based HTTP server that calls script.sh with our commands when it receives a request. The invoke.go file looks like this:

UPDATE 13.03.2022 | Not all logs were printed, as if they were cut off by the request’s status code response (I don’t know why). I’ve fixed the code below and now all logs are printed as intended.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "os/exec"
)

func handler(w http.ResponseWriter, r *http.Request) {
    log.Print("helloworld: received a request")
    cmd := exec.Command("/bin/sh", "script.sh")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    err := cmd.Run()
    if err != nil {
        log.Fatalf("cmd.Run() failed with %s\n", err)
    }
}

func main() {
    log.Print("helloworld: starting server...")
    http.HandleFunc("/", handler)

    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }

    log.Printf("helloworld: listening on %s", port)
    log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}
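To sanity-check the server itself before building the image, you can run it locally. This assumes you have Go installed; the dbt commands in script.sh will only succeed if dbt and the keyfile are also available on your machine:

go run invoke.go
# in a second terminal:
curl http://localhost:8080/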

Create Dockerfile inside dbt project

- dbt-project-name/
- analysis
- data
- logs
- macros
- models
- snapshots
- tests
- dbt_project.yml
- packages.yml
- README.md
- profiles.yml
- invoke.go
- Dockerfile

Now, we need to bring it all together. We write instructions in a Dockerfile so that Cloud Build can build an image from it for Cloud Run. You can find the latest version of the dbt Docker image here. Currently I’m using version 1.2.

UPDATE 14.09.2022 | Updated the dbt version and the correct location for the new images.

# Stage 1: build the Go HTTP server that will call script.sh
FROM golang:1.13 as builder
WORKDIR /app
COPY invoke.go ./
RUN CGO_ENABLED=0 GOOS=linux go build -v -o server

# Stage 2: start from the dbt-bigquery image and add the server plus the dbt project
FROM ghcr.io/dbt-labs/dbt-bigquery:1.2.latest
USER root
WORKDIR /dbt
COPY --from=builder /app/server ./
COPY script.sh ./
COPY . ./

ENTRYPOINT ["./server"]
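If you want to smoke-test the image locally before pushing anything, something like the following should work. The image tag and the local keyfile name are placeholders of my own, and this assumes Docker is installed:

docker build -t dbt-production .
docker run --rm -p 8080:8080 \
  -v "$(pwd)/dbt-service-keyfile.json:/secrets/dbt-service-keyfile:ro" \
  dbt-production
# in a second terminal:
curl http://localhost:8080/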

Cloud Run set up

Now we are ready to build our container with Cloud Build and deploy it to Cloud Run. We can set all of this up straight from Cloud Run. Create a new service:

Name it:

Choose “Set up with Cloud Build”:

Connect to GitHub and choose your repository with dbt project:

Enter your ^master$ or ^main$ branch and the location of the Dockerfile, then click “Save”:

Choose “Require authentication (Manage authorized users with Cloud IAM)”, because we don’t want this service to be public. Then click “Create”:

The service is created and the Docker image should be deployed. Now we need to provide the credentials that will be accessed inside our image. Remember the profiles.yml file, where one of the key-value pairs is keyfile: /secrets/dbt-service-keyfile? We need to store the JSON keyfile somewhere, and since we’re in GCP, why not use Secret Manager for this. Find the product and enable its API if needed.

Create new secret:

Upload the saved JSON keyfile:

Now, go back to Cloud Run, click on your created dbt-production service, then go to “Edit & Deploy New Revision”:

Go to “Variables & Secrets”, click on “Reference a secret”, choose secret dbt-service-keyfile

Grant the needed permissions. For “Reference method” choose “Mounted as volume” and for “Mount path” enter secrets, so the file will be mounted as /secrets/dbt-service-keyfile and we can access it from profiles.yml:

Click Done and then Deploy.
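For reference, here is a rough gcloud sketch of the same result. The service name, region, and secret name are placeholders, and note that this one-off deploy does not set up the GitHub-triggered Cloud Build pipeline that the UI flow above configures:

# the Cloud Run runtime service account also needs the Secret Manager Secret Accessor role on this secret
gcloud secrets create dbt-service-keyfile --data-file=dbt-service-keyfile.json

gcloud run deploy dbt-production \
  --source . \
  --region europe-west1 \
  --no-allow-unauthenticated \
  --set-secrets="/secrets/dbt-service-keyfile=dbt-service-keyfile:latest"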

Testing

To test your service manually, use Cloud Shell. Click the info icon near your URL and then “Invoke in Cloud Shell”:

Hit Enter and click “Authorize”:
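If you’d rather call the service yourself (from Cloud Shell or any machine where gcloud is authenticated), a rough equivalent with a placeholder URL is:

curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://dbt-production-xxxxxxxxxx-ew.a.run.app/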

In script.sh I have commands:

dbt debug --target dev --profiles-dir .
dbt debug --target prod --profiles-dir .

After running, we can see the results in the console (or in the Cloud Run logs). They should all be valid:

Note: If all of your commands in script.sh together take longer than 5 minutes to complete, you should change the “Request timeout” setting in Cloud Run (by default it’s 300 seconds; the maximum is 3600 seconds). Otherwise, the request will return a 504 error. Go to “Edit & Deploy New Revision”; inside the “Container” settings, under “Capacity”, you can find “Request timeout”.
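The same setting can also be changed from the command line; this is a sketch with a placeholder service name and region:

gcloud run services update dbt-production --region europe-west1 --timeout=3600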

Cloud Scheduler set up

We’re almost there. The last step is to create a scheduled job that will invoke our Cloud Run service at the desired frequency.

First, we need to create a service account for this:

Grant Cloud Run Invoker and Cloud Run Service Agent roles to it:

Now go to Cloud Scheduler and create a new job:

The frequency is in unix-cron format; 0 3 */1 * * means “At 03:00 on every day-of-month.” You can use this site to understand the format better.
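A few more unix-cron examples, in case they help (the schedules themselves are just illustrations):

# 0 3 */1 * *   -> every day at 03:00
# 0 */6 * * *   -> every 6 hours, on the hour
# 30 2 * * 1-5  -> at 02:30 on weekdays (Monday to Friday)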

Configure the job’s target

  1. Choose HTTP target type
  2. Enter your Cloud Run URL
  3. Choose GET method for HTTP requests
  4. Choose Add OIDC token
  5. Enter the email of the service account that you’ve created
  6. Enter your Cloud Run URL again as the audience

Click “Create”
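And, for completeness, a gcloud sketch of the scheduler side; the job name, service account email, region, and URL are placeholders of my own:

# allow the scheduler's service account to invoke the Cloud Run service
gcloud run services add-iam-policy-binding dbt-production \
  --region europe-west1 \
  --member="serviceAccount:cloud-scheduler@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker"

# create the daily job
gcloud scheduler jobs create http dbt-production-daily \
  --location=europe-west1 \
  --schedule="0 3 */1 * *" \
  --uri="https://dbt-production-xxxxxxxxxx-ew.a.run.app/" \
  --http-method=GET \
  --oidc-service-account-email="cloud-scheduler@YOUR_PROJECT_ID.iam.gserviceaccount.com"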

You can test it by clicking the “Run Now” button:

After a while, the Result should change to “Success”. But that just means the request completed without errors. To check that the dbt commands ran successfully, go to Cloud Run and click on “Logs”:

After inspecting the logs, you can tell whether all dbt commands ran successfully. If so, we’ve implemented dbt in production. Congratulations!

Costs (without BigQuery)

Well, it’s really nothing. For this setup alone, we spent $0.30 running dbt in production last month:

Next Steps

I hope the guide helped and that you now have dbt running in production. In the next post, I’ll write about how to monitor the health of your data using GCP’s Logging and Monitoring services and how to send alerts to email and Telegram chats if something happens to your dbt models or tests. Stay tuned!

If you have any questions, feel free to reach out to me on LinkedIn or write them in the comments.
