Onboarding: DBT 101

Shaojun Luo
Aug 6

What’s DBT:

At Ro, we use DBT (data build tool) to integrate SQL commands into data pipelines. While Snowflake's UI is used for lightweight analytic work, DBT is used to automate interactions with the remote database. Here is a demonstration of how we develop a data pipeline using DBT:

[Figure: a demonstration of the DBT workflow at Ro]

Typically, a DBT project consists of several models, and each model is defined by a SQL file. When you execute DBT on your local machine, it compiles the project, generates the documentation, and creates executable files for server deployment. These executable files can be integrated with Airflow or other services to build ETL pipelines with scheduled tasks.
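
For instance, a model is just a SELECT statement saved as a .sql file; the model and table names below are hypothetical:

-- models/example/active_users.sql (hypothetical model)
-- DBT materializes the result of this SELECT as a view or table
select
    user_id,
    created_at
from {{ ref('stg_users') }}   -- ref() points to another model in the project
where is_active = true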

Before you use DBT:

(1) You need to understand how Git works and clone the dataops repository.
(2) You should have a Snowflake account set up.
(3) Set up DBT following the instructions in the dataops/build folder.
(4) Have some idea of how an ETL pipeline works.

Some tips:
(1) If you use Anaconda as your Python source, do not use conda to create the virtual environment. Use virtualenv as instructed.
(2) If you have both Python 2 and Python 3 installed, use pip3 to install the requirements.
(3) In .bash_profile, do not use your email as the SNOFLAKE_USER variable; it will cause a syntax error. Use an arbitrary name that identifies you.
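
As a sketch, the relevant line in your .bash_profile would look like this (the value is a placeholder; see the setup instructions in dataops/build for the full list of variables):

# ~/.bash_profile
export SNOFLAKE_USER=jane_doe   # an identifiable name, not your email address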

Start Building a Project:

After finishing the setup, you can create a small test project of your own to make sure everything works. The official tutorial may help, but here we provide a more specific version for Ro's environment:

1. Open a terminal, change to the dataops/build folder, and activate the DBT virtual environment.
cd dataops/build
. env/bin/activate
2. Type the following command to create a project named ro_onboard.
dbt init ro_onboard
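
Depending on your DBT version, the generated template looks roughly like this:

ro_onboard/
├── dbt_project.yml          # project configuration
└── models/
    └── example/             # sample models shipped with the template
        ├── my_first_dbt_model.sql
        └── my_second_dbt_model.sql

(plus empty folders such as macros/, tests/, and data/)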

3. Now you should find a new folder called ro_onboard. Open it with cd ro_onboard. DBT has already created a simple project template. Open the project configuration file dbt_project.yml and modify the following fields:

...
name: '[name_your_project]'
profile: 'rodata'
...
models:
  [name_your_project]:
    example:
      materialized: view
...

You can also change the materialized method to table, so that compiling generates tables instead of views.
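
Alternatively, DBT supports setting the materialization per model with a config block at the top of the SQL file, which overrides the project-level setting:

-- at the top of a model's .sql file
{{ config(materialized='table') }}

select ...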

4. Inside the ro_onboard folder, execute:

dbt run

DBT will then compile and materialize the models. This creates (1) a local folder called target containing compiled SQL files that can serve as executables for Airflow and other programs, and (2) two tables under the ROMAN.[SNOFLAKE_USER]_dbt schema, which you should see in the Snowflake UI. This schema is the default location for your DBT data objects (you should also put all your staging data objects here).

5. Now put your own SQL queries inside the models folder as .sql files. Modify the corresponding fields in dbt_project.yml, then re-run the project to check that your scripts create new tables, named after your SQL files, in ROMAN.[SNOFLAKE_USER]_dbt.
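
For example, if you add a hypothetical model models/staging/stg_orders.sql, you would extend the models: block in dbt_project.yml like this:

models:
  ro_onboard:
    example:
      materialized: view
    staging:                 # hypothetical folder containing stg_orders.sql
      materialized: table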

If everything goes as expected, you should now be able to fully interact with the remote database. Whenever you have questions, DBT's documentation is always a good reference.

Final Notes:

Most of Ro's project files are in the dataops/build folder. Feel free to explore the more advanced features and logic:

  • Project configuration is in dbt_project.yml
  • Model tags are also in dbt_project.yml
  • Models are defined in models/
  • Model documentation and schema tests are defined in models/models.yml (see the sketch after this list)
  • Sources are defined in models/sources.yml
  • Macros are defined in macros/
  • Analysis (queries that will be compiled but not materialized, e.g. for exports) are defined in analysis/
  • “Data tests” (as opposed to the “schema tests” mentioned above) would be defined in tests/, though this project doesn't have any yet
  • “Seed data” would be stored in data/, though this project doesn't have any yet
  • Code I wrote to help create/maintain this project, e.g. to convert models to use ref/source and to generate sources.yml, is in tasks/
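
As a sketch of the schema tests mentioned above (the model and column names are hypothetical), an entry in models/models.yml looks like this:

version: 2

models:
  - name: stg_orders         # hypothetical model
    description: One row per order
    columns:
      - name: order_id
        description: Primary key of the orders table
        tests:
          - unique
          - not_null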

If you want to check the DAGs and descriptions of all of Ro's models, you can run:

dbt compile
dbt docs generate
dbt docs serve

This will open a page in your web browser where you can find all the documentation for these models.
