Democratizing the DBT Docs and the Elementary Quality Reports

Albert Franzi
Apr 18, 2024

In today’s fast-paced digital world, having immediate access to clear, accurate data documentation and quality reports isn’t just nice to have — it’s essential. Whether you’re a data analyst looking for the latest data transformations or a team leader needing assurance on data quality, quick and secure access to this information can drastically improve decision-making and operational efficiency across the company.

In this article, we’re diving into a practical solution to automate and secure the delivery of dbt docs and Elementary data quality reports. By using Airflow for orchestration, Amazon S3 for static content storage, and Nginx hosted on EKS to serve that content within our VPN, we create a robust system that not only safeguards your data but also makes it readily accessible to the entire company over a secure connection.

Background and Tools Overview

To truly democratize data within a company, it’s essential to understand and effectively utilize the tools that enable such accessibility. In this article, we focus on a toolkit that integrates dbt, Elementary data quality, Airflow, Amazon S3, and Nginx within an Amazon EKS cluster. Here’s how each component contributes to a streamlined data reporting system:

  • dbt docs: generates a complete, browsable catalog of our dbt project, including the lineage and internal dependencies between all our models and business logic.
  • elementary-data: a dbt package that collects execution and test metrics from our dbt runs and layers additional capabilities on top (e.g. anomaly detection tests, performance metrics), one of which is the generation of a data quality report.
  • nginx: deployed within an Amazon EKS (Elastic Kubernetes Service) cluster, it acts as a web server for the static sites generated by dbt and Elementary.
  • s3: provides a secure and scalable way to store and retrieve any amount of data. In our solution, it hosts the static content generated by dbt and Elementary while keeping access locked behind a set of private network policies.
  • airflow: the orchestrator with which we schedule all the work to be done by dbt and Elementary, relying on the KubernetesPodOperator.

By removing manual interventions and centralizing data access, this setup not only increases efficiency but also empowers all team members with the knowledge they need to drive the company forward.

The process

1. DBT Catalog

As we mentioned above, we love the KubernetesPodOperator; that’s why we encapsulate all our dbt models, together with our packages, into an ECR image every time we merge our code to the main branch.

Therefore, we run the following script after our dbt build task to generate and publish the catalog.
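As a sketch of what that post-build step can look like: the snippet below composes the bash command a pod would run to generate the dbt docs and the Elementary report, then sync them to S3. The bucket name, S3 prefixes, and CLI flags are assumptions for illustration, not our exact setup.

```python
# Hypothetical sketch: compose the publishing command handed to the pod.
# Bucket name and prefixes are placeholders.

DOCS_BUCKET = "s3://acme-dbt-docs"  # assumption: private bucket behind the IP policy


def build_publish_command(bucket: str = DOCS_BUCKET) -> str:
    """Chain the docs-generation and upload steps into one bash command."""
    steps = [
        "dbt docs generate",  # writes target/index.html + catalog.json
        "edr report",         # Elementary CLI; writes the HTML quality report
        f"aws s3 sync target/ {bucket}/dbt-docs/",
        f"aws s3 sync edr_target/ {bucket}/elementary/",
    ]
    return " && ".join(steps)
```

Running the chained command inside a single pod keeps the generated files on the same filesystem between steps, so no intermediate volume is needed.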

(Figure: the dbt Airflow task graph)

The build_dbt_operator method is just a wrapper we define on top of the KubernetesPodOperator to inject the DBT docker image, the env vars, the resource requests, and the service account.
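A minimal sketch of such a wrapper, shown here as a plain function returning the keyword arguments that would be passed through to the KubernetesPodOperator. The image URL, namespace, env vars, and service account name are all placeholders, not our real values.

```python
# Hypothetical sketch of build_dbt_operator: assemble the kwargs our wrapper
# injects into every KubernetesPodOperator it creates. All names below are
# placeholders for illustration.


def build_dbt_operator_kwargs(task_id: str, dbt_command: str,
                              image_tag: str = "latest") -> dict:
    """Return the keyword arguments for a dbt KubernetesPodOperator task."""
    return {
        "task_id": task_id,
        "name": task_id.replace("_", "-"),      # pod names must be DNS-safe
        "namespace": "data",                    # assumption
        "image": f"123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt:{image_tag}",
        "cmds": ["bash", "-cx"],
        "arguments": [dbt_command],
        "env_vars": {"DBT_TARGET": "prod"},     # assumption
        "service_account_name": "dbt-runner",   # grants the S3 write permissions
        "container_resources": {                # resource requests/limits sketch
            "requests": {"cpu": "500m", "memory": "1Gi"},
            "limits": {"memory": "2Gi"},
        },
    }
```

In an actual DAG these kwargs would be unpacked into `KubernetesPodOperator(**build_dbt_operator_kwargs(...))`; centralizing them in one helper keeps every dbt task consistent.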

2. The S3 catalog hosting

When hosting sensitive data documentation such as dbt docs on S3, it’s crucial to implement robust security measures to ensure that the data remains protected and accessible only to authorized users.

One effective security measure is to restrict access to the S3 bucket by filtering on IP addresses; in our case, we allow only the NAT Gateway IPs through which the Nginx K8s pods reach the bucket.
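A bucket policy along these lines implements that restriction: it denies `GetObject` to every caller whose source IP is not one of the NAT Gateway addresses. The bucket name and the IPs below are placeholders.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllButNatGatewayIPs",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::acme-dbt-docs/*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": ["203.0.113.10/32", "203.0.113.11/32"]
        }
      }
    }
  ]
}
```

Using an explicit `Deny` with `NotIpAddress` (rather than an `Allow`) ensures the restriction wins even if another statement or identity policy grants broader access.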

3. Nginx routing the VPN private traffic

Once we have our dbt catalog and the Elementary observability report in S3, we will proceed with deploying an Nginx pod to proxy internal requests within the VPN to our static websites.
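A sketch of the relevant Nginx server block, assuming a placeholder internal hostname, bucket, and region: it forwards VPN traffic to the S3 endpoint, and because the pod's outbound traffic exits through the NAT Gateway, the bucket policy above lets it through.

```nginx
# Illustrative only: hostname, bucket, and region are placeholders.
server {
    listen 80;
    server_name dbt-docs.internal.example.com;

    location / {
        # Forward to the private S3 bucket; the request leaves the cluster
        # through the NAT Gateway, whose IPs the bucket policy allows.
        proxy_pass https://acme-dbt-docs.s3.eu-west-1.amazonaws.com/dbt-docs/;
        proxy_set_header Host acme-dbt-docs.s3.eu-west-1.amazonaws.com;
    }
}
```

A second, near-identical server block serves the Elementary report under its own hostname.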

Our Nginx Helm chart contains the ALB annotations required to expose two DNS records for our resources, accessible only within the private network.
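The values could look roughly like this, using the AWS Load Balancer Controller annotations; the exact chart structure and the `${...}` placeholders are illustrative, with the `scheme: internal` annotation being what keeps the ALB private.

```yaml
# Illustrative Helm values; ${...} placeholders are filled at deploy time.
ingress:
  enabled: true
  ingressClassName: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal        # private ALB, VPN-only
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: ${certificate_arn}
  hosts:
    - host: ${dbt_docs_host}
      paths:
        - path: /
          pathType: Prefix
    - host: ${elementary_host}
      paths:
        - path: /
          pathType: Prefix
```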

Since we use ArgoCD to deploy our Helm charts, we inject all the ${variable} placeholders when terraforming the ArgoCD Application.
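As a sketch of that injection, assuming the argoproj-labs `argocd` Terraform provider: each Helm `parameter` block supplies the concrete value for one `${variable}` in the chart. Repository, paths, and names are placeholders.

```hcl
# Illustrative only; resource shape follows the argoproj-labs/argocd provider.
resource "argocd_application" "nginx_docs" {
  metadata {
    name      = "nginx-docs"
    namespace = "argocd"
  }

  spec {
    destination {
      server    = "https://kubernetes.default.svc"
      namespace = "data"
    }

    source {
      repo_url        = "https://github.com/acme/helm-charts.git"
      path            = "charts/nginx-docs"
      target_revision = "main"

      helm {
        # Fill the ${...} placeholders from the chart values.
        parameter {
          name  = "ingress.hosts[0].host"
          value = "dbt-docs.internal.example.com"
        }
        parameter {
          name  = "ingress.hosts[1].host"
          value = "elementary.internal.example.com"
        }
      }
    }
  }
}
```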

And that’s it, we have our DBT Catalog and Elementary reports ready to be consumed and accessible only within our magic VPN 🚀.

Your Insights Are Invaluable

As we continue to refine our approach and explore new tools and techniques, your feedback and insights are invaluable. We encourage you to share your experiences:

  • Have you implemented similar solutions in your own environments?
  • What challenges have you faced during such implementations?
  • Do you have suggestions or best practices that might enhance this framework further?

Please leave your comments below or reach out directly. Your input is crucial for us to learn, adapt, and improve our processes. Together, we can push the boundaries of what’s possible in data and make our data ecosystems not only more efficient but also more secure and compliant.
