Ensuring End-to-End Data Quality with dbt — How Migo did it
Learn how Migo, a customer marketing platform, makes use of modern data tools and methods to ensure data quality on model changes before merging PRs.
Bruce, a data engineer at Migo, shared valuable insights on “Ensuring End-to-End Data Quality with dbt” during the dbt Taipei Meetup #24. You can access Bruce’s slides and recording for more details. Since the talk was presented in Chinese, it’s a pity that the valuable content was limited to a Chinese-speaking audience. This blog post therefore shares Bruce’s key points in English, ensuring these insights reach a wider audience.
About Migo
Migo, a Customer Data Platform, integrates multiple channels to send marketing messages, manage customer loyalty systems, and provide customer dashboards. To achieve these data-driven functionalities, they utilize dbt and BigQuery as part of their data tech stack.
The Importance of Data Quality
For Migo, data quality is paramount. As their data products are consumed by end customers, any discrepancies can lead to significant customer issues and damage their brand. For instance, if the data is incorrect, customers might send erroneous messages or target the wrong users, leading to a potential backlash. To mitigate such risks, Migo focuses on preemptive data quality measures.
Bruce cited three major data quality issues that Migo had encountered in the past.
- Null data from the source.
- Errors in transformation, leading to duplicated data.
- Changes in table names without updating downstream dashboards.
To tackle these challenges, they implemented tests on source data and automated tests in their CI pipeline.
Testing Source Data
dbt tests form the foundation of their approach, covering built-in generic tests such as `unique`, `not_null`, `accepted_values`, and `relationships`. Additionally, they use the open-source package dbt-utils for extended testing capabilities and write custom generic data tests when needed.
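For illustration, declaring such tests on a source in a dbt `schema.yml` might look like the following sketch; the `crm` source, table, and column names are hypothetical, not Migo’s actual project:

```yaml
version: 2

sources:
  - name: crm                  # hypothetical source name
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique         # no duplicate customer IDs
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['active', 'churned', 'pending']
          - name: segment_id
            tests:
              # every segment_id must exist in the segments source
              - relationships:
                  to: source('crm', 'segments')
                  field: segment_id
              # an extended check from the dbt-utils package
              - dbt_utils.not_constant
```

Running `dbt test --select source:*` then executes every test declared on the sources.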
Moreover, Migo employs Elementary to detect anomalies in data quality metrics and to alert teams about issues via Slack and MS Teams.
Their process involves:
1. Running tests on source data: `dbt test --select 'source:*',{models}`
2. Using a post-hook to trigger `dbt_elementary`, saving results in `elementary_test_results` (see the sketch below).
3. Notifying the data team of the results through email and MS Teams.
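As a minimal sketch of the Elementary side (the package version and schema name are illustrative assumptions, not Migo’s actual configuration):

```yaml
# packages.yml - install the Elementary dbt package
packages:
  - package: elementary-data/elementary
    version: 0.16.1        # illustrative; pin to the latest release

# dbt_project.yml - route Elementary's models, including the
# elementary_test_results table, to their own schema
models:
  elementary:
    +schema: "elementary"
```

With the package installed, test results are collected into `elementary_test_results` after each run, and Elementary’s `edr monitor` CLI can forward alerts to channels such as Slack or MS Teams.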
Automating Tests in CI
To validate SQL queries before execution, Migo uses `dbt-dry-run`, which leverages BigQuery's dry-run capability. This approach confirms queries are error-free before any data is processed, preventing bad data from entering the warehouse without incurring the cost of actually running them. From dbt v1.8, you can try `dbt run --empty` instead.
In addition to the dry run, they compare data before and after changes by maintaining separate environments for CI and production. They built a Python tool that identifies modifications and notifies PR authors and reviewers of changes such as modified models, columns, row differences, exposures, and orphaned tables or views.
Their CI commands:

```shell
dbt compile
dbt-dry-run
dbt build --select {modified-files} --vars 'env: ci' --exclude 'source:*'
python modification.py
```
The Python tool then posts the report as a comment on the GitHub PR.
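A hypothetical GitHub Actions workflow tying these steps together might look like this; the workflow file, dependency list, and credentials handling are assumptions based on the commands above, not Migo’s actual pipeline:

```yaml
# .github/workflows/dbt-ci.yml (hypothetical)
name: dbt-ci
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install dbt-bigquery dbt-dry-run
      # BigQuery credentials/profiles setup omitted for brevity
      - name: Compile and dry-run queries against BigQuery
        run: |
          dbt compile
          dbt-dry-run
      - name: Build modified models into the CI environment
        # {modified-files} is the placeholder from the commands above
        run: dbt build --select {modified-files} --vars 'env: ci' --exclude 'source:*'
      - name: Compare CI data against production and comment on the PR
        run: python modification.py
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```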
Benefits of Testing
Implementing these tests not only resolves critical data quality issues but also aids in developing dbt models. Early testing saves time and prevents errors, ensuring a smoother development process.
Take Action on Data Quality Today
Data quality is a critical concern, as highlighted in the State of Analytics Engineering 2024. While not all teams may have the resources to implement comprehensive testing, adopting some of these practices can significantly improve your data pipeline.
Inspired by PipeRider, the predecessor of Recce, Migo built their own tool to compare data changes before and after transformations. Now you can take advantage of Recce, an open-source project, to streamline your data validation and integrate it into your CI pipeline, saving time and enhancing data quality.