Data Transformations With DBT

George Saju
Litmus7 Systems Consulting
6 min read · Oct 17, 2022

What is DBT?

Data Build Tool (dbt) is a powerful SQL-based transformation workflow tool that marries conventional SQL transformations with ‘software engineering best practices’ like modularity, portability, CI/CD, and documentation.

With dbt, anyone who knows how to write SQL SELECT statements can build blocks of SQL-based transformations (called models), carry out functional and technical tests in a structured manner, and schedule jobs to produce reliable, actionable datasets for analytics. The tool acts as an orchestration layer on top of the data warehouse to improve and accelerate data transformations and integrations.

dbt (data build tool) makes data engineering accessible to people with data analyst skills: they can transform the data in the warehouse using simple SELECT statements, effectively expressing the entire transformation process as code. We can write custom business logic in SQL, automate data quality testing, deploy code, and create data documentation side by side with that code. This matters today because there is a great shortage of data engineering professionals in the marketplace, and since the destination of most current data engineering pipelines is still a data warehouse, SQL-based transformation fits naturally. Anyone who knows SQL can now build production-grade data pipelines, reducing the barrier to entry that previously limited staffing for legacy technologies. However, when adopting a SQL-based transformation engine, special care has to be taken to ensure that queries are optimized from a CPU/compute standpoint, especially since most modern data warehouses are “rent-seeking” cloud platforms where storage is cheap but compute is weighed in gold.

Connecting DBT with BigQuery

1. Click BigQuery to set up your connection.

2. Click Upload a Service Account JSON File in the BigQuery settings.

3. Select the JSON file you downloaded in Generate BigQuery Credentials. dbt Cloud will fill in all the necessary fields.

4. Click Test at the top. This verifies that dbt Cloud can access your BigQuery account.

5. If you see “Connection test Succeeded!”, click Continue. If it fails, you might need to go back and regenerate your BigQuery credentials.

6. The connection details will then be visible in the settings.
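For dbt Core users, the equivalent connection is configured in a profiles.yml file rather than through the dbt Cloud UI. A minimal sketch, in which the profile name, GCP project ID, dataset, and keyfile path are all placeholder assumptions:

```yaml
# ~/.dbt/profiles.yml — hypothetical profile; substitute your own values
my_bigquery_project:            # profile name referenced in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/service-account.json   # the JSON key downloaded from GCP
      project: my-gcp-project-id               # BigQuery project
      dataset: analytics_dev                   # dataset dbt builds models into
      threads: 4
```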

DBT Models

A model is basically a SELECT statement. Models are defined in .sql files, typically in the models directory. Model properties can be declared in .yml files in the models/ directory.

Here is a sample model we created for obtaining department_name, department_id, and product_id, placed in models/staging (dbt_department_test.sql).
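The model itself appears as a screenshot in the original post; a minimal sketch of what such a staging model might look like, where the source name and table are assumptions based on the description above:

```sql
-- models/staging/dbt_department_test.sql
-- Hypothetical reconstruction: the source/table names are assumptions.

with departments as (

    select
        department_id,
        department_name,
        product_id
    from {{ source('retail', 'departments') }}  -- assumed source definition

)

select * from departments
```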

DBT Test Cases

The schema.yml file can be used to define assertions that the data should adhere to. For example, here we want to ensure that the attribute department_id is unique and not null. Setting these via a configuration file means that if we want to modify these properties in the future, we can do so without touching the main SQL block, minimizing the chance of introducing regression issues.
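The uniqueness and not-null assertions described above would be declared roughly like this (model and column names follow the example in this post):

```yaml
# models/staging/schema.yml — unique and not_null tests on department_id
version: 2

models:
  - name: dbt_department_test
    columns:
      - name: department_id
        tests:
          - unique
          - not_null
```

Running `dbt test` then compiles each assertion into a SQL query that fails if any rows violate the rule.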

DBT Common Commands

Some of the commonly used DBT commands are as follows: [2]

1. dbt run executes compiled SQL model files against the current target database. dbt connects to the target database and runs the SQL required to materialize all data models using the specified materialization strategies.

2. dbt compile generates executable SQL from source, model, test, and analysis files.

3. dbt init helps get you started using dbt Core.

4. dbt build will run models, test tests, snapshot snapshots, and seed seeds.

5. dbt docs has two supported subcommands: generate and serve.

dbt docs generate is responsible for generating your project’s documentation, which contains metadata about the tables and views produced by the models in your project.

dbt docs serve starts a webserver (on port 8080 by default) to serve your documentation locally.
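The commands above are typically combined into a short development loop. A sketch of a typical session — the project name and `--select` arguments are illustrative, not from the original post:

```shell
dbt init my_project                    # scaffold a new dbt project
dbt compile                            # render Jinja and emit executable SQL into target/
dbt run --select staging               # materialize only the staging models
dbt test --select dbt_department_test  # run the tests defined in schema.yml
dbt build                              # run + test + snapshot + seed, in DAG order
dbt docs generate                      # build the documentation site
dbt docs serve                         # browse it locally
```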

Jinja Templating in DBT

Jinja is a templating engine used to improve code readability and promote reusable SQL queries. Templating is a long-established practice in software engineering that helps separate the data, logic, and presentation layers, as in MVC.[1] Using Jinja turns the dbt project into a programming environment for SQL, enabling things that aren’t normally possible in plain SQL, such as using control structures and environment variables for production deployments.

We can recognize Jinja based on the delimiters the language uses, which we refer to as “curlies”:

  • Expressions {{ … }}: Expressions are used when we want to output a string. We can use expressions to reference variables and call macros.[2]
  • Statements {% … %}: Statements are used for control flow, for example, to set up for loops and if statements, or to define macros.[2]
  • Comments {# … #}: Jinja comments are used to prevent the text within the comment from compiling.[2]

Here’s an example of a dbt model we used that leverages Jinja:

The below screenshot shows how we can write software-engineering-style if-then-else control statements:
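Since the screenshots are not reproduced here, this is a sketch of the kind of Jinja control flow described above — the department list, column names, and dev/prod filters are all hypothetical:

```sql
-- Hypothetical model: the list and filters are illustrative assumptions.
{% set departments = ['grocery', 'electronics', 'apparel'] %}

select
    product_id,
    {% for dept in departments %}
    max(case when department_name = '{{ dept }}' then department_id end)
        as {{ dept }}_department_id{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('dbt_department_test') }}
{% if target.name == 'dev' %}
where mod(product_id, 100) = 0   -- sample rows in dev to save compute
{% else %}
where product_id is not null     -- full data in production
{% endif %}
group by 1
```

The `{% for %}` loop expands into one column per department at compile time, and the `{% if target.name == 'dev' %}` block swaps the filter depending on the deployment target.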

DBT Documentation

DBT provides a way to generate documentation for the dbt project and render it as a website. Here is an example of our model documentation. This gives data analysts, business analysts, etc., an interface to search existing columns, their descriptions, and the code used to populate the table.
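The searchable column descriptions come from the same .yml property files that hold the tests; a sketch, with illustrative description text:

```yaml
# models/staging/schema.yml — descriptions surfaced in the dbt docs site
version: 2

models:
  - name: dbt_department_test
    description: "Department/product staging model built from the source system."
    columns:
      - name: department_id
        description: "Unique identifier of the department."
      - name: department_name
        description: "Human-readable department name."
      - name: product_id
        description: "Identifier of a product belonging to the department."
```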

DBT Lineage

In the DBT documentation site we can see the lineage graph at the bottom right. Here is the lineage graph for dbt_department_test.
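dbt builds this lineage graph by parsing the {{ ref() }} and {{ source() }} calls in each model to infer dependencies. A minimal sketch — the downstream model name is hypothetical:

```sql
-- models/marts/department_summary.sql (hypothetical downstream model)
-- Referencing dbt_department_test with ref() makes dbt draw an edge from
-- dbt_department_test to this model in the lineage graph, and ensures the
-- upstream model is built first.

select
    department_id,
    department_name,
    count(product_id) as product_count
from {{ ref('dbt_department_test') }}
group by 1, 2
```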

DBT Core vs DBT Cloud

DBT provides two options for implementation: (1) dbt Core and (2) dbt Cloud.

  • dbt Core includes a command line interface (CLI) so that users can execute dbt commands from a terminal program. With dbt Core, users must switch between the terminal and the user interface of the data warehouse, which makes analytics engineering practices more difficult.
  • dbt Cloud, on the other hand, integrates the analytics engineering workflow in a better way, through a single browser-based environment.

With dbt Core we need external scheduling tools like Jenkins or Airflow, whereas dbt Cloud includes a built-in scheduler that runs jobs composed of dbt commands.

References

[1] https://medium.com/analytics-and-data/jinja-the-sql-way-of-the-ninja-9a64fc815564

[2] https://docs.getdbt.com/docs/building-a-dbt-project/documentation
