Simple tips to make datasets cleaner using dbt and BigQuery
While building a data warehouse (DW), it is quite common for table names to change, leaving old, unused tables behind, or for personal schemas to accumulate test tables that are no longer needed. Over time this pollutes the schemas: developers can no longer tell whether a table is still in use or can safely be deleted, and the tables simply sit in the database taking up unnecessary space. To improve this situation, I want to describe two very simple tips that help keep schemas (called datasets in BigQuery) a little more organized.
The first tip pertains to table expiration settings in the development environment
By default, tables created by dbt never expire. However, you can define expiration rules through dbt so that tables generated by a model will expire after a specified number of hours. There are two methods to configure this:
{{ config(
    hours_to_expiration = 72
) }}

select *
from {{ ref('my_model') }}
The configuration above is set directly in the model and causes the table it generates to expire after 72 hours (you can choose whatever expiration time is most convenient for you). However, I personally prefer the second option below, which is done in dbt_project.yml.
...
models:
  project_dbt:
    +hours_to_expiration: 72
    staging: ...
I prefer this second option because, with a bit of Jinja, it is easy to apply the 72-hour expiration only to tables in the development schemas/datasets. An example of how to do this is shown below.
models:
  project_dbt:
    +hours_to_expiration: "{{ '72' if target.name == 'dev' else none }}"
    staging: ...
Development targets are usually named “dev” in dbt’s profiles.yml, which is why I use that name in the code.
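For context, this is roughly what a minimal profiles.yml with a dev and a prod target for BigQuery can look like. This is only a sketch; the project, dataset, and authentication values below are placeholders you would adjust to your environment.

project_dbt:
  target: dev                   # default target used when --target is not passed
  outputs:
    dev:
      type: bigquery
      method: oauth             # or service-account, depending on your setup
      project: my-gcp-project   # placeholder GCP project
      dataset: dev_analytics    # placeholder development dataset/schema
      threads: 4
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: analytics        # placeholder production dataset/schema
      threads: 4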
With the hours_to_expiration configuration above, if a model runs against the development target, the tables it generates expire after 72 hours; if it runs in production, the setting is simply not applied. It is crucial not to apply this configuration to the production pipeline, otherwise production tables would expire.
Of course, you can apply this setting to the development schemas differently if that suits your project better. The point is that when a table generated by a model is no longer in use and is forgotten in the development database, there is no need to worry: it will be deleted automatically after the specified time.
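If you want to confirm that the expiration was applied, BigQuery exposes table options through INFORMATION_SCHEMA. A minimal sketch, assuming a development dataset named dev_analytics (a placeholder name):

-- Lists tables in the dataset that have an expiration set (dataset name is a placeholder)
select
  table_name,
  option_value as expiration_timestamp
from dev_analytics.INFORMATION_SCHEMA.TABLE_OPTIONS
where option_name = 'expiration_timestamp';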
The second tip is related to the ephemeral configuration in dbt
Another strategy to keep the production schema/dataset cleaner is to use ephemeral materialization in dbt. Models materialized as ephemeral are not written to the database at all, which makes this materialization a good fit for the layers that come before the mart layer. Instead of creating intermediate tables, dbt injects ephemeral models as CTEs into the compiled SQL of the models that reference them, as illustrated below.
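As a rough illustration of what ephemeral materialization does (the model, source, and column names below are hypothetical), consider a staging model configured as ephemeral:

-- models/staging/stg_orders.sql (hypothetical model)
{{ config(materialized = 'ephemeral') }}

select
    id as order_id,
    customer_id,
    amount
from {{ source('shop', 'orders') }}

This model is never created as a table or view in BigQuery. When a mart model selects from {{ ref('stg_orders') }}, dbt inlines it as a CTE in that mart’s compiled SQL, roughly like this (the exact CTE name and the fully qualified table names vary by dbt version and project):

with __dbt__cte__stg_orders as (
    select
        id as order_id,
        customer_id,
        amount
    from shop.orders
)
select *
from __dbt__cte__stg_orders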
Implementing this configuration results in a cleaner production schema, since the staging and intermediate tables that feed the marts layer are no longer persisted. It can also reduce storage, which matters in data warehouses where storage is billed.
However, it’s essential to exercise caution when configuring ephemeral materialization. Personally, I recommend applying this configuration only in production. During development, materializing as a table might be more appropriate.
One of the main drawbacks of ephemeral materialization is its impact on the debugging process. Since it doesn’t write to the database, debugging becomes more challenging as observing its output and correcting errors becomes more cumbersome. This is why I prefer applying this configuration solely in the production environment.
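One way to soften this drawback is to inspect the SQL dbt generates. Running dbt compile (or looking at the files under target/compiled/ after a run) gives you the full query with the ephemeral CTEs already inlined, which you can paste into the BigQuery console while debugging. For example (the model name is hypothetical):

dbt compile --select my_mart_model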
Having laid out these considerations, here’s an example of how to configure ephemeral materialization specifically for the production environment. This configuration is implemented in the dbt_project.yml file.
...
models:
  project_dbt:
    +hours_to_expiration: "{{ '72' if target.name == 'dev' else none }}"
    staging:
      +materialized: "{{ 'ephemeral' if target.name == 'prod' else 'table' }}"
    intermediate:
      +materialized: "{{ 'ephemeral' if target.name == 'prod' else 'table' }}"
    marts:
      +materialized: table
...
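With this in place, the same project behaves differently depending on which target you run it against. Assuming the usual dev and prod target names in profiles.yml:

dbt run --target dev    # everything materialized as tables; dev tables expire after 72 hours
dbt run --target prod   # staging and intermediate are inlined as CTEs; only marts are persisted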
In this simple article, I showed you how to make datasets/schemas a little cleaner with table expiration settings and ephemeral materialization.
Feel free to comment with your questions and suggestions, or to criticize.
For more content like this, you can follow Indicium here on Medium and LinkedIn, and follow me on those networks as well. ;)
Thank you!