Serving dbt docs on GitLab (Static) Pages
It can be argued that data engineering is already a specialized form of software engineering. What people interpret as data engineers (DEs) being slow to adopt best practices from traditional software engineering has more to do with the unique difficulties of working with data (especially at scale) than with a lack of awareness of, or desire to use, those practices.
For the data engineer, dbt is a great step in the right direction: it finally brings some software development practices to the DE world.
In this post, we will explain how to capture the output of dbt docs generate and host it on GitLab Static Pages.
Prerequisites
- You already have a GitLab account and project set up
- Connectivity and secrets between your GitLab pipeline and database have already been configured
Add a ‘Pages’ step to your .gitlab-ci.yml
There are a couple of ways to do this. You can either step through the GitLab wizard from the navigation bar, or add the Pages step directly to your .gitlab-ci.yml:
pages:
  stage: deploy
  environment:
    name: production
  script:
    - poetry install
    - poetry run dbt deps --project-dir dbt --profiles-dir "${DBT_PROFILES_DIR}" --vars "${DBT_VARS}"
    - poetry run dbt docs generate --project-dir dbt --profiles-dir "${DBT_PROFILES_DIR}" --vars "${DBT_VARS}"
    # Inline manifest.json and catalog.json into a new index2.html
    - python ${CI_PROJECT_DIR}/scripts/dbt_documentation.py
    # Make sure the public folder exists before moving files into it
    - mkdir -p ${CI_PROJECT_DIR}/public
    - rm ${CI_PROJECT_DIR}/dbt/target/index.html && mv ${CI_PROJECT_DIR}/dbt/target/index2.html ${CI_PROJECT_DIR}/public/index.html
  artifacts:
    paths:
      # The folder that contains the files to be exposed
      # at the Page URL
      - public
  rules:
    # On main branch and this isn't a pipeline triggered
    # by an external repo wanting to update its target
    # version via webhook
    - if: $CI_COMMIT_BRANCH == 'main' && $TARGET_ID == null
Get around CORS
dbt docs generate outputs a number of files; the ones of concern are manifest.json, catalog.json and index.html.
Just moving those files into public for GitLab to serve doesn’t work, because index.html points to our JSON files, which then get blocked by Cross-Origin Resource Sharing (CORS). Without a web server, it isn’t possible to read or share this documentation. That is what dbt docs serve does for you: it spins up a web server for the documentation.
This is described in detail in this issue: https://github.com/dbt-labs/dbt-docs/issues/53
To get around this, we can use Data Banana’s Python script, which updates the JavaScript code within index.html to include the contents of the JSON files directly.
In case their page gets deleted, here’s an excerpt of their code:
import json
import os
import re

# Assumption: the excerpt does not define this path; 'dbt' matches the
# --project-dir used in the pipeline and is resolved from the repository root.
PATH_DBT_PROJECT = 'dbt'

search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'index.html'), 'r') as f:
    content_index = f.read()

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'manifest.json'), 'r') as f:
    json_manifest = json.loads(f.read())

# In the static website there are 2 more projects inside the documentation: dbt and dbt_bigquery
# This is technical information that we don't want to provide to our final users, so we drop it
# Note: this depends on the connector, here we use BigQuery
IGNORE_PROJECTS = ['dbt', 'dbt_bigquery']
for element_type in ['nodes', 'sources', 'macros', 'parent_map', 'child_map']:  # navigate into manifest
    # We transform to list to not change dict size during iteration, we use default value {} to handle KeyError
    for key in list(json_manifest.get(element_type, {}).keys()):
        for ignore_project in IGNORE_PROJECTS:
            if re.match(fr'^.*\.{ignore_project}\.', key):  # match keys of the form '*.<ignore_project>.*'
                del json_manifest[element_type][key]  # delete element
                break  # the key is gone, so stop checking the remaining projects

with open(os.path.join(PATH_DBT_PROJECT, 'target', 'catalog.json'), 'r') as f:
    json_catalog = json.loads(f.read())

# Write a copy of index.html with both JSON documents embedded inline
with open(os.path.join(PATH_DBT_PROJECT, 'target', 'index2.html'), 'w') as f:
    new_str = "o=[{label: 'manifest', data: "+json.dumps(json_manifest)+"},{label: 'catalog', data: "+json.dumps(json_catalog)+"}]"
    new_content = content_index.replace(search_str, new_str)
    f.write(new_content)
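This is the line in .gitlab-ci.yml that runs scripts/dbt_documentation.py:
- python ${CI_PROJECT_DIR}/scripts/dbt_documentation.py
The script above outputs index2.html, which the following line in .gitlab-ci.yml moves to the public folder for serving within GitLab:
- rm ${CI_PROJECT_DIR}/dbt/target/index.html && mv ${CI_PROJECT_DIR}/dbt/target/index2.html ${CI_PROJECT_DIR}/public/index.html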
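One fragile point in this approach is that the replacement depends on an exact match with a minified JavaScript snippet inside index.html, and Python's str.replace returns the input unchanged when nothing matches. The following is a minimal sketch of a guard, not part of the original script: the helper name inline_docs_json is hypothetical, and it assumes the same search string as the excerpt above.

```python
import json

# Minified snippet the original script searches for.
# Assumption: it may differ between dbt versions.
SEARCH_STR = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'


def inline_docs_json(index_html: str, manifest: dict, catalog: dict) -> str:
    """Return index_html with the manifest and catalog embedded inline.

    Hypothetical helper: raises ValueError instead of silently
    returning an unchanged page when the search string is missing.
    """
    if SEARCH_STR not in index_html:
        raise ValueError(
            "search string not found in index.html; "
            "the dbt docs bundle may have changed"
        )
    new_str = (
        "o=[{label: 'manifest', data: " + json.dumps(manifest)
        + "},{label: 'catalog', data: " + json.dumps(catalog) + "}]"
    )
    return index_html.replace(SEARCH_STR, new_str)
```

Failing loudly here makes the pipeline job fail, which is usually easier to notice than a published page that cannot load its data.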
dbt docs and working with images
Within dbt docs, you can also link to pictures. If you do, and you use the assets directory as described in the documentation, be sure to move that assets directory to public/assets so the index.html within GitLab Pages knows where to find the images.
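As a rough sketch, the copy could be done in the same Python script that prepares index2.html. The helper name copy_assets and the example paths are assumptions for illustration; adjust them to wherever your assets directory actually lives.

```python
import shutil
from pathlib import Path


def copy_assets(src_dir: str, dst_dir: str) -> bool:
    """Copy an assets folder into the folder served by GitLab Pages.

    Hypothetical helper. Returns False when src_dir does not exist,
    so projects without images are left untouched.
    """
    src = Path(src_dir)
    if not src.is_dir():
        return False
    # dirs_exist_ok (Python 3.8+) allows re-running into an existing folder;
    # missing parent folders of dst_dir are created automatically.
    shutil.copytree(src, Path(dst_dir), dirs_exist_ok=True)
    return True
```

For example, copy_assets("dbt/assets", "public/assets") would place the images next to the published index.html, assuming the assets directory sits at dbt/assets.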
Push and visit the documentation URL
After you commit and push to your repository, your documentation page should be live at https://YOUR_GITLAB_USERNAME.gitlab.io/YOUR_REPOSITORY. If Pages access control is enabled for the project, only users who have access to your GitLab repository can view this page.
Under Settings/Pages in GitLab, you should also be able to find the relevant details for your static Pages site.
Conclusion
This setup will generate and deploy the dbt documentation after every change to the main branch. Having the documentation hosted in a tool that you already use keeps it close at hand.