A dbt Model Code Generator

Alex Reilly
3 min read · Jan 9, 2023


If, like us here at LGC, you are an avid user of dbt but struggle to work with thousands of raw tables (dbt sources), then this article might be for you.

Trying to follow best practices and dbt community guidance around simplicity of your data model design can be a challenge when you have a large number of sources. Creating staging models and trying to avoid doing too much heavy lifting in a single model means your project size can quickly get out of hand.

Unfortunately, as of today, dbt has no dynamic way of defining models — which means one model per file. There’s an interesting discussion on the practicality of dynamic models in dbt, which you can find here.

We currently have 7000+ models in our dbt project. Nearly all of our base and staging models are identical in contents except for perhaps a few column/model names and refs. Here’s an example of one:

{{
    config(
        materialized='view'
    )
}}

{{
    generate_staging_model(
        src=source('sfdc', 'campaign'),
        unique_columns=['campaign_id', 'timestamp'],
        unique_key='id'
    )
}}

Creating and maintaining thousands of these files becomes nearly impossible with projects this large. We looked at the dbt-labs codegen package which seemed great for creating YAML files but we found the base model generator wasn’t what we were looking for — so we decided to build our own.

dbtgen

We built an application in Python to use a user-defined SQL template and model-scoped variables to generate the dbt models. You can find the GitHub project here.

Using the staging model example above, we wanted to be able to generate this from a template like the one below. Note that both the file contents and the file name have been parameterised.

# .dbtgen/models/staging/stg_sfdc_{name}.sql

{{
    config(
        materialized='view'
    )
}}

{{
    generate_staging_model(
        src=source('sfdc', '${name}'),
        unique_columns=${unique_columns},
        unique_key='id'
    )
}}

Then, we could define the key-value pairs for parameters that were referenced within the template. These could be defined in YAML like so:

# .dbtgen/models/staging/models.yml

models:
  account:
    unique_columns:
      - account_id
  campaign:
    unique_columns:
      - campaign_id
      - timestamp
  opportunity:
    unique_columns:
      - opportunity_id

We reserved the parameter key ‘name’ for the model name; any other parameter keys could be defined in the YAML file above.
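As a rough sketch of how this rendering step could work (function and variable names here are illustrative, not dbtgen’s actual API): the `${...}` placeholders in the file body match Python’s built-in `string.Template` syntax, while the `{name}` placeholder in the file name can be filled with `str.format`.

```python
# Hypothetical sketch of the core rendering step, assuming the parameters
# have already been loaded from models.yml into a dict.
from string import Template

def render_models(template_text: str, template_filename: str, params: dict) -> dict:
    """Render one dbt model per entry under the YAML 'models' key.

    Returns a mapping of output file name -> rendered file contents.
    """
    rendered = {}
    for model_name, model_params in params["models"].items():
        # 'name' is reserved for the model name; merge in any other keys.
        context = {"name": model_name, **(model_params or {})}
        # ${...} placeholders in the body use string.Template substitution;
        # Jinja's {{ ... }} braces pass through untouched.
        body = Template(template_text).safe_substitute(context)
        # {name} in the file name uses str.format-style substitution.
        filename = template_filename.format(name=model_name)
        rendered[filename] = body
    return rendered
```

A list value like `unique_columns` is substituted via its Python repr, which conveniently renders as a valid list literal inside the generated model.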

Within the dbt project, we used the directory .dbtgen/ to store these template and parameter files, mirroring the sub-directory structure of the dbt models directory (models/ by default).

└── my-project
    ├── .dbtgen
    │   ├── models
    │   │   └── staging
    │   │       ├── stg_sfdc_{name}.sql   # Template file
    │   │       └── models.yml            # Parameters file
    ·
    ·
    ├── models
    │   └── staging
    │       ├── stg_sfdc_account.sql
    │       ├── stg_sfdc_campaign.sql
    │       └── stg_sfdc_opportunity.sql
    ·
    ·

The application would then traverse each folder in .dbtgen/ to find both a template (*.sql) and a models.yml from which to generate the models.

Arguments

Borrowing syntax from dbt, we added a few command line arguments to the app.

+-----------------+------------------------------------------------------+
| Argument | Description |
+-----------------+------------------------------------------------------+
| --select, -s | Select models, qualified path to directory of models |
| --run, -r | Run in execute mode, default = False (compile mode) |
| --overwrite, -o | Overwrite any existing model files, default = False |
+-----------------+------------------------------------------------------+

By default, we wanted the application to run in “compile” mode and simply log the contents of the files so that the user could check these looked as expected before creating them.

Once happy, the --run argument could be passed to the same command to create the files.
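The command-line interface described above could be wired up with `argparse` along these lines (a minimal sketch using the flag names from the table, not dbtgen’s actual parser):

```python
# Hypothetical sketch of the dbtgen CLI using Python's argparse.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dbtgen")
    parser.add_argument("--select", "-s",
                        help="select models: qualified path to a directory of models")
    parser.add_argument("--run", "-r", action="store_true",
                        help="execute mode: write files (default is compile mode, log only)")
    parser.add_argument("--overwrite", "-o", action="store_true",
                        help="overwrite any existing model files")
    return parser
```

With this shape, `dbtgen -s models/staging` would compile and log, while adding `--run` (and optionally `--overwrite`) would write the files.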

And that’s it! Rather than maintaining a few thousand model files individually we can now manage these using a handful of YAML files instead.
