At Ro, we use DBT (data build tool) to integrate SQL commands into data pipelines. While Snowflake's UI is used for lightweight analytical work, DBT is used to automate interactions with the remote database. Here is a demonstration of how we develop a data pipeline using DBT:
Typically, a DBT project consists of several models, where each model is a SQL file. When you run DBT on a local machine, it compiles the project, generates documentation, and creates compiled SQL files for server deployment. These compiled files can be integrated with Airflow or other services to create ETL pipelines with scheduled tasks.
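As a rough sketch of that workflow, the standard DBT CLI commands below cover the compile/run/docs cycle (these are generic dbt commands, not Ro-specific):

```sh
dbt compile        # render models into plain SQL under target/compiled/
dbt run            # build the models (tables/views) in the warehouse
dbt docs generate  # produce the documentation artifacts for the project
```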
Before you use DBT:
(1) You need to understand how Git works and clone the repository (see the sketch below).
(2) You should have a Snowflake account set up.
(3) Complete the DBT setup using the instructions in
(4) Have some idea of how an ETL pipeline works.
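For prerequisite (1), the clone step might look like the following; the repository URL is a placeholder, since it is not given here:

```sh
# Placeholder URL; substitute Ro's actual dataops repository.
git clone <dataops-repo-url>
cd dataops/build
```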
A few notes on the setup (see the sketch after these notes):
(1) If you use Anaconda as your Python source, do not use conda to create the virtual environment; use virtualenv as instructed.
(2) If you have both Python 2 and Python 3 installed, use pip3 to install the requirements.
(3) In your .bash_profile, do not use your email as the SNOFLAKE_USER variable; it will cause a syntax error. Use an arbitrary name that identifies you.
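Putting those notes together, a minimal setup might look like this; the environment name and requirements file are assumptions, and only SNOFLAKE_USER is named in this doc:

```sh
# Sketch only; exact names are assumptions based on the notes above.
virtualenv dbt-env                  # note (1): virtualenv, not conda
source dbt-env/bin/activate
pip3 install -r requirements.txt    # note (2): pip3 if Python 2 is also present

# note (3): add to ~/.bash_profile -- an identifier, NOT your email
export SNOFLAKE_USER="jdoe"
```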
Start Building a Project:
After finishing the setup, you can create a small test project of your own to make sure everything works. The official tutorial may help, but here we provide a more specific version for Ro's environment:
1. Open a terminal, navigate to the dataops/build folder, and activate the DBT virtual environment.
2. Run the following command to create a project named ro_onboard:
dbt init ro_onboard
3. Now you should find a new folder called ro_onboard; open it with cd ro_onboard. DBT has already created a simple project template there. Open the project configuration file dbt_project.yml and modify the following fields: name: '[name_your_project]', profile: 'rodata', and the models: section. You can also change the materialization to table so that compiling generates tables instead of views (see the example configuration after this list).
4. Inside the ro_onboard folder, execute dbt run. DBT will then compile and materialize the models, and you should see (1) a local folder called target containing the compiled SQL files, which can serve as executables for Airflow and other programs, and (2) in the Snowflake UI, two new tables under the ROMAN.[SNOFLAKE_USER]_dbt schema. This is the default location of your DBT data objects (you should also put all your staging data objects here).
5. Now put your own SQL queries inside the models folder as .sql files (a sample model follows this list). Modify the corresponding fields in dbt_project.yml, then re-run the project to check that your scripts create new tables, named after your SQL files, in the same schema.
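For step 3, the edits to dbt_project.yml might look like the sketch below; the exact layout depends on the template dbt init generates, so treat the model block as an assumption:

```yaml
# dbt_project.yml (sketch; only the fields discussed above are shown)
name: 'ro_onboard'    # [name_your_project]
profile: 'rodata'     # must match a profile in ~/.dbt/profiles.yml

models:
  ro_onboard:
    materialized: table   # build tables instead of the default views
```

And for step 5, any valid SELECT works as a model; here is a hypothetical minimal example:

```sql
-- models/my_first_model.sql (hypothetical example)
-- DBT materializes this SELECT as a table named my_first_model.
select
    current_date as run_date,
    1            as answer
```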
If everything goes as expected, you should now be able to fully interact with the remote database. Whenever you have questions, DBT's documentation is always a good reference.
Most of Ro's project files are in the dataops/build folder. Feel free to explore its more advanced features and logic (the locations below follow dbt's standard project layout):
- Project configuration is in dbt_project.yml
- Model tags are also in dbt_project.yml
- Models are defined in models/
- Model documentation and schema tests are defined in .yml files under models/ (e.g. schema.yml)
- Sources are defined in models/sources.yml
- Macros are defined in macros/
- Analyses (queries that will be compiled but not materialized, e.g. for exports) are defined in analysis/
- "Data tests" (as opposed to the "schema tests" mentioned above) would be defined in tests/, though this project doesn't have any yet
- "Seed data" would be stored in data/, though this project doesn't have any yet
- Code I wrote to help create/maintain this project, e.g. to convert models to use source() and to generate sources.yml, is in
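For reference, model documentation and schema tests in a schema.yml file typically look like this (hypothetical model and column names):

```yaml
# models/schema.yml (hypothetical example)
version: 2

models:
  - name: my_first_model
    description: "Shows up on the DBT docs site."
    columns:
      - name: answer
        description: "Example column."
        tests:
          - not_null
          - unique
```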
If you want to check the DAGs and descriptions of all Ro's models, you can run:
dbt docs generate
dbt docs serve
This will pop up a page in your web browser where you can find all the documentation for these models.