Centralized data catalog for decentralized teams
Objective
- In the data engineering team at Taager, our mission is to provide a complete data catalog that data analysts can use as the first step in their data discovery journey.
- At the same time, we are adapting our data infrastructure to new business requirements and needs. Since this new infrastructure does not yet replace 100% of the features provided by the current platform, we cannot deprecate the old one. That’s why we decided to create a new data project to drive this migration without impacting the business.
- This (microservice-like) approach might be good enough for a software engineer, but as we stated previously, we want to merge all the data generated by both projects in a single place. To solve this, we’ve defined a pipeline that automatically updates the data catalog whenever the legacy repository, the new repository, or both are updated.
Tools
We are going to use the following:
- dbt to generate the data catalog
- Gitlab Pages to publish the data catalog
- Gitlab CI to publish the new version of the data catalog after a modification
- Python to generate the HTML standalone page used by Gitlab Pages
Implementation
1. Linking repositories
At first, we have two independent repositories (let’s call them legacy and new) with no connection between them.
The first step is to create a deploy token in the legacy repository following these instructions. The token only needs the read_repository scope.
Then we add the token name and secret value as CI/CD variables in the new repository.
Finally, in our new repository’s packages.yml file, we include the git URL of our legacy project with the two new variables embedded, as you can see in the snippet below. After that, running dbt deps imports the legacy models into our new project.
packages:
  - git: "https://{{env_var('DBT_USER_NAME')}}:{{env_var('DBT_ENV_SECRET_DEPLOY_TOKEN')}}@gitlab.com/<legacy-repository>.git"
    revision: master # or any branch you want to pull
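To see what dbt renders from that entry, here is a small sketch of the env_var substitution. The variable values and the env_var helper are hypothetical stand-ins, not dbt internals; dbt resolves the real Jinja expression itself at parse time.

```python
import os

# Hypothetical values standing in for the CI/CD variables set in GitLab
os.environ.setdefault("DBT_USER_NAME", "catalog-deploy-token")
os.environ.setdefault("DBT_ENV_SECRET_DEPLOY_TOKEN", "s3cr3t")

def env_var(name: str) -> str:
    """Minimal stand-in for dbt's env_var() Jinja function."""
    return os.environ[name]

# Roughly the authenticated URL dbt renders from the packages.yml entry
url = (
    f"https://{env_var('DBT_USER_NAME')}:"
    f"{env_var('DBT_ENV_SECRET_DEPLOY_TOKEN')}"
    "@gitlab.com/<legacy-repository>.git"
)
print(url)
```

Prefixing the secret variable with DBT_ENV_SECRET_ also tells dbt to scrub the value from its logs.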
2. Include trigger
Now we can manually trigger dbt deps in our new project and have all our models in a single place, but as we hate manual steps, we are going to include a trigger that executes dbt deps automatically.
In our old repository’s .gitlab-ci.yml, we’ll include the following code as a final step.
staging:
  stage: trigger
  trigger:
    project: <new-repository>
    branch: main
  only:
    - master
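As an aside, the same downstream pipeline can also be started from a script via GitLab’s pipeline triggers API, which is handy for ad-hoc rebuilds. A sketch only; the project id and trigger token below are hypothetical placeholders:

```python
import urllib.parse

# Hypothetical project id and pipeline trigger token for <new-repository>
project_id = "12345"
trigger_token = "glptt-example-token"

# GitLab's pipeline triggers API: POST /projects/:id/trigger/pipeline
api_url = f"https://gitlab.com/api/v4/projects/{project_id}/trigger/pipeline"
payload = urllib.parse.urlencode({"token": trigger_token, "ref": "main"}).encode()

# Uncomment to actually start a pipeline (requires a real trigger token):
# import urllib.request
# urllib.request.urlopen(api_url, data=payload)
print(api_url)
```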
In the new repository’s .gitlab-ci.yml file, we need to include a step executed on the main branch, like the following one, so the previous trigger can start a pipeline.
build:
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
  only:
    - main
3. Generate the standalone HTML
At this point, we can automatically generate dbt docs with ALL our datasets, but we want to publish them, so we’ll generate a standalone HTML page that can be published on its own.
The following Python script performs this process:
import json

project_dir = '/builds/<new-repo>'
search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'

with open(f'{project_dir}/target/index.html', 'r') as f:
    content_index = f.read()

with open(f'{project_dir}/target/manifest.json', 'r') as f:
    json_manifest = json.loads(f.read())

with open(f'{project_dir}/target/catalog.json', 'r') as f:
    json_catalog = json.loads(f.read())

with open(f'{project_dir}/target/dbt_docs.html', 'w') as f:
    new_str = (
        "o=[{label: 'manifest', data: " + json.dumps(json_manifest) + "},"
        "{label: 'catalog', data: " + json.dumps(json_catalog) + "}]"
    )
    new_content = content_index.replace(search_str, new_str)
    f.write(new_content)
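The trick is a plain string replacement: dbt’s index.html normally fetches manifest.json and catalog.json at runtime, and the script swaps that loader expression for the two JSON artifacts inlined directly. A minimal, self-contained sketch of the same substitution, with toy data standing in for the real files under target/:

```python
import json

# Toy stand-ins for dbt's generated artifacts (hypothetical content)
content_index = "<script>var o=LOADER; render(o);</script>"
search_str = "LOADER"
json_manifest = {"nodes": {"model.example": {}}}
json_catalog = {"nodes": {"model.example": {}}}

# Same substitution as the script: inline both JSON documents into the page
new_str = (
    "[{label: 'manifest', data: " + json.dumps(json_manifest) + "},"
    "{label: 'catalog', data: " + json.dumps(json_catalog) + "}]"
)
standalone = content_index.replace(search_str, new_str)
print(standalone)
```

After the replacement, the page no longer needs to fetch any external JSON, which is what makes it publishable as a single file.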
We need to include this Python code in our new repository; we named it generate_dbt_docs.py.
This code needs to be triggered by our CI pipeline, so we need to update the new repository’s .gitlab-ci.yml again:
build:
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
    - dbt docs generate
    - python3 utils/generate_dbt_docs.py
  only:
    - main
Our pipeline will generate a dbt_docs.html file that users can open in a browser.
4. Publish to Gitlab Pages
The last thing to do is to host the dbt_docs.html file generated in the previous step.
Gitlab provides a free hosting service called Gitlab Pages for code in our Gitlab repository.
The only thing we need to do is update our new repository’s .gitlab-ci.yml file once more.
pages:
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
    - dbt docs generate
    - python3 utils/generate_dbt_docs.py
    - mkdir public
    - mv target/dbt_docs.html public/index.html
  artifacts:
    paths:
      - public
  only:
    - main
This job must be named pages so the runner knows it should publish to Gitlab Pages. By default, Gitlab Pages looks for a file named index.html inside the public artifact directory.
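To sanity-check the artifact before the Pages deployment goes live, you can serve the public directory locally the same way Pages serves it. A small sketch; the index.html content here is a placeholder written to a temp directory, not the real generated docs:

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request

# Hypothetical stand-in for public/index.html produced by the pages job
root = pathlib.Path(tempfile.mkdtemp())
(root / "index.html").write_text("<html><body>dbt docs</body></html>")

# Serve the directory; like Pages, the server resolves "/" to index.html
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=str(root))
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
html = urllib.request.urlopen(f"http://127.0.0.1:{port}/").read().decode()
server.shutdown()
print(html)
```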
Next Steps
This is the Taager data engineering team’s first step in providing a complete self-service data platform.
One possible next step is allowing data analysts to build their data marts on our new repository. They’ll only need to include this repository in their packages.yml. The first step towards data mesh 😜
Another step could be sending a notification to data analysts once the documentation has been modified, including in the message the last commit description or the models dbt detected as changed.
Bibliography
Python code extracted from: https://lightrun.com/answers/dbt-labs-dbt-docs-export-documentation-site-as-a-set-of-static-pages.