Serving dbt docs on GitLab (Static) Pages

Tom Klimovski
Published in gammadata
Oct 27, 2022

It can be argued that data engineering is already a specialized form of software engineering. What people interpret as data engineers being slow to adopt best practices from traditional software engineering has more to do with the unique difficulties of working with data (especially at scale) than with a lack of awareness of, or desire to use, those practices.

For the data engineer, dbt is a great step in the right direction: it finally brings some software development practices to the data engineering world.

In this post, we will explain how to capture the output of dbt docs generate and host it on GitLab Static Pages.

Prerequisites

  1. You already have a GitLab account and project set up
  2. Connectivity and secrets between your GitLab pipeline and database have already been configured

Add a ‘Pages’ step to your .gitlab-ci.yml

There are a couple of ways to do this. You can step through the GitLab Pages wizard, which is reachable from the project's navigation bar, or you can add the Pages step directly to .gitlab-ci.yml:

pages:
  stage: deploy
  environment:
    name: production
  script:
    - poetry install
    - "poetry run dbt deps --project-dir dbt --profiles-dir \"${DBT_PROFILES_DIR}\" --vars \"${DBT_VARS}\""
    - "poetry run dbt docs generate --project-dir dbt --profiles-dir \"${DBT_PROFILES_DIR}\" --vars \"${DBT_VARS}\""
    # Inline manifest.json and catalog.json into a new index2.html
    - python ${CI_PROJECT_DIR}/scripts/dbt_documentation.py
    # Make sure the public folder exists before moving the page into it
    - mkdir -p ${CI_PROJECT_DIR}/public
    - rm ${CI_PROJECT_DIR}/dbt/target/index.html && mv ${CI_PROJECT_DIR}/dbt/target/index2.html ${CI_PROJECT_DIR}/public/index.html
  artifacts:
    paths:
      # The folder that contains the files to be exposed
      # at the Page URL
      - public
  rules:
    # On main branch and this isn't a pipeline triggered
    # by an external repo wanting to update its target
    # version via webhook
    - if: $CI_COMMIT_BRANCH == 'main' && $TARGET_ID == null

Get around CORS

dbt docs generate outputs a number of files. The ones we care about are manifest.json, catalog.json and index.html.
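
Before going further, it can help to confirm that these three files were actually produced. The following is a minimal sketch, not part of the original pipeline: the function name missing_docs_files is hypothetical, and the target folder you pass in (for example dbt/target) is an assumption based on the --project-dir used in the job.

```python
from pathlib import Path

# The three files this article relies on from `dbt docs generate`.
REQUIRED_DOCS_FILES = ("index.html", "manifest.json", "catalog.json")


def missing_docs_files(target_dir):
    """Return the names of required docs files that are absent from target_dir."""
    target = Path(target_dir)
    return [name for name in REQUIRED_DOCS_FILES if not (target / name).is_file()]
```

In a pipeline script you could call missing_docs_files("dbt/target") and exit with a non-zero status if the returned list is not empty, so that a failed docs build is caught before deployment.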

Simply moving those files into public for GitLab to serve doesn’t work: index.html points to our JSON files, and those requests then get blocked by Cross-Origin Resource Sharing (CORS) restrictions. Without a web server it isn’t possible to read or share this documentation, which is what dbt docs serve does for you: it spins up a web server for the documentation.

This is described in detail in this issue: https://github.com/dbt-labs/dbt-docs/issues/53

To get around this, we can use Data Banana’s Python script, which updates the JavaScript code within index.html to include the contents of the JSON files directly.
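
To illustrate what "spinning up a web server" means here, the sketch below builds a plain static file server with Python's standard library. It is only an illustration of the idea, not a description of how dbt docs serve is implemented; the function name make_docs_server is hypothetical.

```python
import functools
import http.server


def make_docs_server(directory, port=0):
    """Create an HTTP server for the files in directory, bound to localhost.

    port=0 asks the operating system for any free port. Call serve_forever()
    on the returned server to start handling requests.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

For local viewing, something like make_docs_server("dbt/target", 8080).serve_forever() would make the generated files available at http://127.0.0.1:8080/, assuming that folder holds the output of dbt docs generate.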

In case their page gets deleted, here’s an excerpt of their code:

import json
import os
import re

# Assumption: path to the dbt project, relative to where the script runs.
# 'dbt' matches the --project-dir used in the pipeline.
PATH_DBT_PROJECT = 'dbt'

search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'index.html'), 'r') as f:
    content_index = f.read()

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'manifest.json'), 'r') as f:
    json_manifest = json.loads(f.read())

# In the static website there are 2 more projects inside the documentation: dbt and dbt_bigquery
# This is technical information that we don't want to provide to our final users, so we drop it
# Note: depends on the connector, here we use BigQuery
IGNORE_PROJECTS = ['dbt', 'dbt_bigquery']
for element_type in ['nodes', 'sources', 'macros', 'parent_map', 'child_map']:  # navigate into manifest
    # We transform to list to not change dict size during iteration, we use default value {} to handle KeyError
    for key in list(json_manifest.get(element_type, {}).keys()):
        for ignore_project in IGNORE_PROJECTS:
            if re.match(fr'^.*\.{ignore_project}\.', key):  # match keys that start with '*.<ignore_project>.'
                del json_manifest[element_type][key]  # delete element
                break  # the key is gone, so stop checking the remaining projects

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'catalog.json'), 'r') as f:
    json_catalog = json.loads(f.read())

# Write a copy of index.html with both JSON documents embedded in the JavaScript
with open(os.path.join(PATH_DBT_PROJECT, 'target', 'index2.html'), 'w') as f:
    new_str = "o=[{label: 'manifest', data: "+json.dumps(json_manifest)+"},{label: 'catalog', data: "+json.dumps(json_catalog)+"}]"
    new_content = content_index.replace(search_str, new_str)
    f.write(new_content)
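
One thing to watch with the excerpt above: str.replace does nothing when the search string is absent, so if the generated index.html ever differs (for example after a dbt upgrade, which is an assumption on my part), the script would quietly write an unchanged page. A hedged variant of the replacement step, wrapped in a hypothetical helper named inline_docs_json, could fail loudly instead:

```python
import json

# Same search string as the excerpt; it is tied to the JavaScript in the
# dbt-generated index.html and may not match every dbt version (assumption).
SEARCH_STR = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'


def inline_docs_json(index_html, manifest, catalog, search_str=SEARCH_STR):
    """Return index_html with manifest and catalog embedded in place of search_str.

    Raises ValueError when search_str is missing, rather than returning
    the page unchanged.
    """
    if search_str not in index_html:
        raise ValueError("search string not found in index.html; nothing was replaced")
    new_str = (
        "o=[{label: 'manifest', data: " + json.dumps(manifest)
        + "},{label: 'catalog', data: " + json.dumps(catalog) + "}]"
    )
    return index_html.replace(search_str, new_str)
```

Raising an error makes the pipeline job fail, which is usually preferable to publishing a page that cannot load its data.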

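This is the line in .gitlab-ci.yml that runs scripts/dbt_documentation.py:

    - python ${CI_PROJECT_DIR}/scripts/dbt_documentation.py

The script above outputs index2.html, which the following line in .gitlab-ci.yml moves to the public folder for serving within GitLab:

    - rm ${CI_PROJECT_DIR}/dbt/target/index.html && mv ${CI_PROJECT_DIR}/dbt/target/index2.html ${CI_PROJECT_DIR}/public/index.html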

dbt docs and working with images

Within dbt docs, you can also link to images. If you do, and you use the assets directory as described in the dbt documentation, be sure to copy that assets directory to public/assets so that the index.html served by GitLab Pages knows where to find the images.
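
As a minimal sketch of that step, the helper below copies an assets folder into public/assets. The function name copy_assets and any paths you pass in (for example dbt/assets) are assumptions for illustration; adjust them to wherever your project keeps its images.

```python
import shutil
from pathlib import Path


def copy_assets(assets_dir, public_dir):
    """Copy assets_dir to <public_dir>/assets, creating folders as needed."""
    destination = Path(public_dir) / "assets"
    # dirs_exist_ok (Python 3.8+) allows copying over an existing folder.
    shutil.copytree(assets_dir, destination, dirs_exist_ok=True)
    return destination
```

Calling copy_assets("dbt/assets", "public") from the documentation script, or running the equivalent copy command in the job's script section, would place the images next to index.html in the published artifact.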

Push and visit the documentation URL

After you commit and push to your repository, your documentation page should be live at https://YOUR_GITLAB_USERNAME.gitlab.io/YOUR_REPOSITORY. Only users who have access to your GitLab repository can access this page.

The Settings/Pages section in GitLab also shows details about your static Pages site.

Conclusion

This setup generates and deploys the dbt documentation after every change to the main branch. Hosting the documentation in a tool you already use keeps it close at hand.


Tom Klimovski
gammadata

Engineer with a strong focus on GCP. Love a great API and an even better IPA.