Navigating the Data Quality Maze with Great Expectations, DBT, and Trino

Thosan Girisona
Data Engineering Indonesia
Oct 6, 2023 · 8 min read

In this article, we’ll continue our previous discussion in “Defusing Data ‘Time Bombs’ with DataHub Observability”. We’ll dive into the world of data quality, exploring the journey we embarked on in the eFishery data team. What started as a seemingly simple problem revealed its complexity, reaching beyond our data context.

We believe this issue isn’t unique to our team but a common challenge in many data-heavy companies. We hope our experiences can inspire and provide compelling reasons for you to initiate your data quality efforts. We’ll focus on the data quality problem from the perspective of a data team and how we tackled it using widely adopted solutions like Great Expectations (GX), DBT, and Trino.

Data quality requires a certain level of sophistication within a company to even understand that it’s a problem. — Colleen Graham

The “Why” Part

Every team endeavors to maximize its key metrics, whether by adding features for customers, reducing deployment times, or enhancing product engagement rates. Typically, teams have monitoring dashboards and established metrics. But here’s the catch: when those metrics turn green, there’s joy, and when they turn red, there’s disappointment.

That’s the instinctive response we all have. However, here’s where the doubt creeps in — how certain are you about the changes in those metrics? How confident are you in the metrics you’ve created?

Uncertainty of data

This uncertainty stems from the fact that most of the attention is directed toward the values and colors of the metrics, not the data’s integrity and quality. Consequently, while the dashboard provides you with data, it doesn’t always come with trust. It’s a straightforward truth: data can be obtained, but trust in that data is not guaranteed.

The Realization

We became acutely aware of the poor data quality when stakeholders began flagging data discrepancies across several dashboards meant to display identical values. This discovery understandably raised suspicions. Upon a thorough investigation, we uncovered a rather simple explanation for the discrepancies. Dashboard A relied on Table X, while Dashboard B drew data from Table Y, which was closely related to Table X. The anomaly lay in the SQL logic within Table Y, causing slight variations in the aggregated values compared to Table X.

The metric anomaly

Concerned about potential future incidents, we initiated a data quality drive. We led the way in constructing a robust data quality platform, leveraging Great Expectations (GX) within our data warehouse. Implementing GX was straightforward; it served as our quality assurance mechanism. Additionally, we designed a user-friendly data quality dashboard for efficient monitoring and visualization of results.

Initial technical implementation

Yet one important question remained: Who was truly responsible for this initiative? Would it be the Data Engineers (DE), Data Analysts/Architects (DA), Data Scientists (DS), or Data Governance Analysts (DG)?

who?

The answer is all of the above.

  • The Data Engineers were tasked with building and managing the data quality platform.
  • The Data Governance Analysts formulated clear rules of engagement and defined data quality goals.
  • The Data Analysts/Architects and Data Scientists, possessing the most intimate knowledge of the data, established quantitative criteria for data quality.

The Quality Metric

To provide further context regarding the technical implementation, our data quality metrics didn’t employ a simple pass-or-fail model. Instead, we relied on an “unexpected percentage” metric. This metric calculated the percentage of rows in a table that failed to meet certain expectation criteria. For example, if we expected an integer column’s values to be greater than 5, the unexpected percentage would be derived from the ratio of rows with values less than or equal to 5 to the total row count.
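A minimal sketch of this metric in plain Python (this is not GX’s internal implementation; the sample column and the greater-than-5 expectation are just the illustrative case from the paragraph above):

```python
def unexpected_percentage(values, predicate):
    """Share of rows that fail an expectation, as a percentage.

    `predicate` returns True when a value meets the expectation;
    rows where it returns False count as "unexpected".
    """
    if not values:
        return 0.0
    unexpected = sum(1 for v in values if not predicate(v))
    return 100.0 * unexpected / len(values)

# Expectation: values in this integer column should be greater than 5.
column_values = [3, 7, 9, 2, 10]
print(unexpected_percentage(column_values, lambda v: v > 5))  # 2 of 5 rows fail -> 40.0
```

A pass-or-fail model would report this column simply as “failed”; the 40.0 tells us how far it is from passing, which is what makes gradual, per-source improvement measurable.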

Naturally, this metric couldn’t be applied uniformly across all expectations. Some GX expectations would result in a 100% unexpected percentage when they failed, particularly those related to column data types. For instance, if we expected a column to be of integer type and it turned out to be a date, the unexpected percentage would be 100%.

The reason behind adopting this “unexpected percentage” metric was the diversity of data sources feeding into our data warehouse. Solving data quality issues was rarely a one-shot process; it often entailed gradual improvements on a per-source basis. Using this metric allowed us to prioritize which databases required thorough examination. Additionally, it facilitated tracking progress and establishing clearer data quality criteria.

The Resonance

As the initiative expanded, our Data Governance Analysts (DG) saw the potential benefits of extending data quality checks further upstream, not just within our data warehouse but also at the source databases. This sounded promising, but it added complexity for the DE team: GX, as we had implemented it, could only inspect data that had already landed in the data warehouse, not data sitting in the upstream source databases.

To address this limitation, the DE team decided to broaden the scope of our data quality platform by incorporating a query engine. But why choose a query engine? Primarily because it allowed us to perform data quality checks with consistent logic while also remaining database engine agnostic. Luckily, we already had an existing query engine in place, so we opted to use Trino.
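To illustrate why a query engine makes the checks engine agnostic, here is a hedged sketch: one templated Trino statement computes the unexpected percentage in-engine, and because Trino addresses every connected source through the same catalog.schema.table naming, the identical logic runs against MySQL, PostgreSQL, or the warehouse itself. The catalog, table, and column names below are made up for illustration, and this is not the actual GX–Trino integration code:

```python
def unexpected_pct_sql(table, expectation_sql):
    """Build a Trino SQL statement that computes the unexpected
    percentage for one expectation directly in the query engine.

    `table` is a fully qualified Trino name (catalog.schema.table),
    so the same statement works against any connected source database.
    `expectation_sql` is the boolean condition each row is expected to meet.
    Uses Trino's count_if() aggregate to count failing rows.
    """
    return (
        f"SELECT 100.0 * count_if(NOT ({expectation_sql})) / count(*) "
        f"AS unexpected_percentage FROM {table}"
    )

# Hypothetical check against a source database exposed as a Trino catalog:
sql = unexpected_pct_sql("mysql_prod.shop.orders", "amount > 5")
print(sql)
```

The generated statement would then be executed through the Trino connection GX uses, keeping the metric definition identical across every source.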

Trino — GX integration

Upon conducting data quality checks on several database sources and sharing the results with product teams and analysts, the data team received overwhelmingly positive feedback. It turned out that many of them had shared our concerns about data integrity and quality. The direct data quality checks at the database source provided them with the assurance they needed. In reality, this initiative resonated effectively with other teams because we all shared a common concern.

The Renaissance

After becoming a topic of discussion for many teams, the data quality initiative gained prominence. Now, almost every team wants data quality and integrity checks for their data metrics. It’s another great moment for the data team, but not so much for the DE team (again!): the more teams want data checks, the more diverse the types of checks they request, leaving the DE team to handle an ever-growing variety of requests.

GX already provides various types of expectations that can be readily used, such as expect column max to be between or expect column distinct values to be in set. However, due to the diversity of our data and the need for specific data quality checks, we developed custom GX expectations. One example is expect column values to equal in second table column.
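To make the cross-table expectation concrete, here is a simplified sketch of the check’s core logic, written as a standalone function rather than a full GX custom expectation class (whose boilerplate depends on the GX version); the function name, row shape, and column names are illustrative assumptions:

```python
def unexpected_pct_equal_in_second_table(rows_a, rows_b, key, col_a, col_b):
    """Core logic behind a cross-table expectation: for each key present
    in both tables, the value in table A's column should equal the value
    in table B's column. Returns the unexpected percentage over the
    rows that could be compared.
    """
    b_index = {row[key]: row[col_b] for row in rows_b}
    compared = [row for row in rows_a if row[key] in b_index]
    if not compared:
        return 0.0
    unexpected = sum(1 for row in compared if row[col_a] != b_index[row[key]])
    return 100.0 * unexpected / len(compared)

# Hypothetical example: Dashboard A's table vs Dashboard B's table.
table_x = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
table_y = [{"id": 1, "total": 10}, {"id": 2, "total": 25}]
print(unexpected_pct_equal_in_second_table(table_x, table_y, "id", "total", "total"))  # -> 50.0
```

A check like this is exactly what would have caught the Table X vs Table Y aggregation drift described earlier, before stakeholders did.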

GX development

Building custom expectations proved challenging initially, considering that GX’s documentation was not as comprehensive as it is at the time of writing this article. Now, you can easily understand the concept and technical requirements needed to create custom GX expectations through its official documentation.

The Quality Gate

With our growing confidence in our data, another delightful outcome of our data quality drive was our shift from mere quality assurance to the implementation of quality gates. We adopted these quality gates by harnessing the power of DBT model contracts within our data warehouse. For those unacquainted with the world of model contracts in DBT, you’re in for a treat; it’s like discovering a magical feature!

Picture this: Model contracts in DBT operate on a simple principle — if a table fails to meet the predefined data quality or data standards, it’s a no-go for publication. The secret sauce lies in DBT’s blue-green deployment method.

Here’s the scoop: DBT first builds the table in a temporary location and runs the quality checks there. If the table aces the test, the data gracefully transitions into the target table; if it flunks, nothing is published and an alert goes off.
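For illustration, a model contract is declared in the model’s YAML properties (supported since DBT 1.5); with the contract enforced, a build whose output doesn’t match the declared columns, data types, and constraints fails before anything reaches the target table. The model and column names below are hypothetical, not our actual tables:

```yaml
models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: integer
        constraints:
          - type: not_null
      - name: amount
        data_type: decimal(18, 2)
```

This turns the data standard into a hard gate: a schema drift or a null key stops the publication instead of silently flowing downstream.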

DBT model-contract

A big shoutout to our Data Warehouse team for seamlessly upgrading the installed DBT to support model contracts and rolling them out to most of our critical key tables! 🚀

Conclusion

In conclusion, our journey through the data quality maze has been enlightening. What initially seemed like a simple issue in our eFishery data team turned out to be a complex challenge (we believe) relevant to many data-heavy companies.

We’ve realized that data quality is a shared responsibility across departments, with trust in data being of utmost importance. It goes beyond surface-level metrics and hinges on data integrity and quality.

Our adoption of tools like Great Expectations, DBT, and Trino has resonated with other teams who recognize the value of data integrity at the source. As the initiative gains momentum, we face new challenges, but our trust in data and the implementation of quality gates equip us to navigate them.

In the end, our data quality journey has taught us that solving complex problems is a shared endeavor, and it’s not about placing blame but taking collective responsibility. By fostering a culture of data quality, we believe we can continue to grow and make data-driven decisions with confidence, ultimately driving success for fish farmers at eFishery and beyond.

Credit

This article represents the collaborative efforts of the dedicated members of the eFishery data team. I extend my sincere appreciation to our team leads, including Mas Rifan Kurnia, Wa’ Dimas Gilang, and Bang Khairida. Our fellow engineers, including Yusuf, Agung, and Fajar, have also made significant contributions.

Special thanks go out to our valued governance analysts, Almas, Vicca, and Ihsan, as well as our esteemed data warehouse architects, Tito, Verdy, and Irvan, for their enthusiastic participation in crafting this article.

We would also like to express our gratitude to the Data Engineering Indonesia community for their invaluable insights and reviews, which have enabled us to share this knowledge with a wider audience. Your contributions have played a pivotal role in ensuring the accessibility of this information. Thank you all!
