Centralized data catalog for decentralized teams

Omar Helwani
Taager Tech Blog
Nov 2, 2022

Objective

  • In Taager’s data engineering team, one of our core missions is to provide a complete data catalog that data analysts can use as the first step of their data discovery journey.
  • At the same time, we are adapting our data infrastructure to new business requirements and needs. Because this new infrastructure does not yet replace 100% of the features provided by the current platform, we cannot deprecate the old one. That’s why we decided to create a new data project to guide this migration without impacting the business.
  • This (microservice) approach could be good enough for a software engineer, but as we stated previously, we want to merge all the data generated by both projects in a single place. To solve this, we’ve defined a pipeline that automatically updates the data catalog whenever the legacy repository, the new repository, or both are updated.

Tools

We are going to use the following:

  • dbt to generate the data catalog
  • Gitlab Pages to publish the data catalog
  • Gitlab CI to publish a new version of the data catalog after every change
  • Python to generate the standalone HTML page served by Gitlab Pages

Implementation

1. Linking repositories

At first, we have two independent repositories (let’s call them legacy and new) with no connection.

The first step is to create a deploy token in the legacy repository, following Gitlab’s instructions. This token only needs the read_repository scope.

Form to generate the deploy token in the legacy project

Then we store this new token’s name and secret value as CI/CD variables in the new project.

Example of how these variables should be in the CI/CD variables in the new project repository
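If you prefer to script this setup instead of clicking through the UI, the same thing can be done through the Gitlab REST API. The sketch below is illustrative only: it assumes a personal access token with api scope and placeholder project IDs, and it reuses the two variable names that appear in packages.yml further down.

import requests

GITLAB_API = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": "<personal-access-token>"}  # needs "api" scope
LEGACY_PROJECT_ID = "<legacy-project-id>"
NEW_PROJECT_ID = "<new-project-id>"

# 1. Create a read-only deploy token in the legacy project
resp = requests.post(
    f"{GITLAB_API}/projects/{LEGACY_PROJECT_ID}/deploy_tokens",
    headers=HEADERS,
    json={"name": "dbt-docs", "scopes": ["read_repository"]},
)
resp.raise_for_status()
token = resp.json()  # contains the generated "username" and the secret "token"

# 2. Store the token's username and secret as CI/CD variables in the new project
for key, value in [
    ("DBT_USER_NAME", token["username"]),
    ("DBT_ENV_SECRET_DEPLOY_TOKEN", token["token"]),
]:
    requests.post(
        f"{GITLAB_API}/projects/{NEW_PROJECT_ID}/variables",
        headers=HEADERS,
        json={"key": key, "value": value, "masked": True},
    ).raise_for_status()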

Finally, to import the legacy models into our new project whenever we run:

dbt deps

we need to include in our new repository’s packages.yml file the git URL of our legacy project, with the two new variables embedded, as you can see in the snippet below.

packages:
  - git: "https://{{env_var('DBT_USER_NAME')}}:{{env_var('DBT_ENV_SECRET_DEPLOY_TOKEN')}}@gitlab.com/<legacy-repository>.git"
    revision: master # or any branch you want to pull

2. Include trigger

Now we can manually run dbt deps in our new project and have all our models in a single place, but since we hate manual steps, we are going to include a trigger that executes dbt deps automatically.

In our legacy repository’s .gitlab-ci.yml, we’ll include the following code as a final step.

staging:
  stage: trigger
  trigger:
    project: <new-repository>
    branch: main
  only:
    - master

In the new repository’s .gitlab-ci.yml file, we need to include a job that runs on the main branch, like the following one, so the trigger above has something to start.

build:  # job name (any descriptive name works here)
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
  only:
    - main
How the pipeline looks after including the trigger

3. Generate the standalone HTML

At this point, we can automatically generate dbt docs covering ALL our datasets, but we also want to publish them, so we’ll generate a standalone HTML page that can be hosted on its own.

The following Python script performs this process:

import json

project_dir = '/builds/<new-repo>'
# Snippet in dbt's index.html that loads manifest.json and catalog.json at runtime
search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'

with open(f'{project_dir}/target/index.html', 'r') as f:
    content_index = f.read()

with open(f'{project_dir}/target/manifest.json', 'r') as f:
    json_manifest = json.loads(f.read())

with open(f'{project_dir}/target/catalog.json', 'r') as f:
    json_catalog = json.loads(f.read())

# Inline both JSON files into the HTML so the page works as a single standalone file
with open(f'{project_dir}/target/dbt_docs.html', 'w') as f:
    new_str = "o=[{label: 'manifest', data: " + json.dumps(json_manifest) + "},{label: 'catalog', data: " + json.dumps(json_catalog) + "}]"
    new_content = content_index.replace(search_str, new_str)
    f.write(new_content)

We need to include this Python code in our new repository; we named it generate_dbt_docs.py and placed it in the utils/ folder referenced by the CI pipeline below.

This code needs to be triggered by our CI pipeline; hence we need to update the new repository’s .gitlab-ci.yml again.

build:  # job name (any descriptive name works here)
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
    - dbt docs generate
    - python3 utils/generate_dbt_docs.py
  only:
    - main

Our pipeline will generate a dbt_docs.html file that users can open in a browser.
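One caveat worth guarding against: if a future dbt release changes the generated index.html, the search_str above will no longer match and str.replace() will silently do nothing, publishing a page that still tries to fetch manifest.json and catalog.json at runtime. A small check like the sketch below, which could be added to generate_dbt_docs.py, makes the CI job fail instead; the path and search string are the same ones used in the script above.

import sys

project_dir = '/builds/<new-repo>'
search_str = 'o=[i("manifest","manifest.json"+t),i("catalog","catalog.json"+t)]'

with open(f'{project_dir}/target/index.html') as f:
    if search_str not in f.read():
        # Fail the pipeline instead of publishing a half-working catalog
        sys.exit('search_str not found in index.html; update generate_dbt_docs.py')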

4. Publish to Gitlab Pages

The last thing to do is to host the dbt_docs.html file generated in the previous step.

Gitlab provides Gitlab Pages, a free static-site hosting service tied to our Gitlab repository.

The only thing we need to do is update our new repository’s .gitlab-ci.yml file once more.

pages:
  stage: build
  image:
    name: ghcr.io/dbt-labs/dbt-snowflake:1.2.0
    entrypoint: [""]
  script:
    - dbt deps
    - dbt docs generate
    - python3 utils/generate_dbt_docs.py
    - mkdir public
    - mv target/dbt_docs.html public/index.html
  artifacts:
    paths:
      - public
  only:
    - main

This job needs to be named pages so the runner knows it is expected to publish a Gitlab Pages site. By default, Gitlab Pages serves a file named index.html, which must be included in the public artifact.

Your published data catalog should now be accessible at https://your.username.example.io/<new-repository>
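If you want the pipeline to verify that the catalog is actually being served, a tiny smoke test could look like the sketch below. The URL is a placeholder for your own Pages address, and keep in mind that a Pages deployment can take a minute to go live after the job finishes.

import requests

# Placeholder Pages URL; replace with your own namespace and project name
url = "https://your.username.example.io/<new-repository>"

resp = requests.get(url, timeout=30)
resp.raise_for_status()  # a non-200 response means the catalog is not being served
print(f"Data catalog is live at {url} ({len(resp.text)} bytes)")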

Next Steps

This is the Taager data engineering team’s first step in providing a complete self-service data platform.

One possible next step is allowing data analysts to build their data marts on our new repository. They’ll only need to include this repository in their packages.yml. The first step towards data mesh 😜

Another step could be sending data analysts a notification once the documentation has been modified, including in the message the last commit description or the models in which dbt detected changes.
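As a rough sketch of that notification idea (assuming a Slack incoming webhook stored in a hypothetical SLACK_WEBHOOK_URL CI/CD variable), a short script run at the end of the pages job could post the last commit message, which Gitlab CI exposes as the predefined CI_COMMIT_MESSAGE variable:

import os
import requests

# CI_COMMIT_MESSAGE is a predefined Gitlab CI variable with the full commit message
message = os.environ.get("CI_COMMIT_MESSAGE", "dbt docs updated")
webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical CI/CD variable

requests.post(
    webhook_url,
    json={"text": f"The data catalog was updated:\n{message}"},
).raise_for_status()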

Bibliography

Python code extracted from: https://lightrun.com/answers/dbt-labs-dbt-docs-export-documentation-site-as-a-set-of-static-pages.
