Mastering Data Quality: 6 Considerations for Success

Itai Sevitt
Wix Engineering
Apr 23, 2023

Introduction — The Importance of Data Quality

In large data organizations, data quality is crucial for many reasons. For one, accurate data leads to better decision-making. Additionally, high-quality data inspires trust within and outside of the organization. It also results in more efficient development since less time is spent on debugging data issues.
At Wix, we own thousands of ETLs (most of them daily batch jobs), which create our data warehouse tables. Our large number of data assets prompted us to devise a centralized solution that would allow our data engineers to test their data efficiently and with ease. We recognized the need to avoid implementing multiple internal solutions that would essentially perform the same function.
In this blog post, I will highlight six key topics that we have encountered in our own data quality journey, which you may want to consider when implementing a data quality solution within your organization.

Image by Midjourney

Open Source vs. Managed vs. In-House Solution

You have three options when adopting a data quality solution: build an in-house solution, use an existing data quality solution (either open-source or managed), or combine both.
There are pros and cons to each option. Developing an in-house solution provides a lot of flexibility but requires significant development effort. Using an existing data quality solution, on the other hand, saves development time and delivers new features with minimal dev effort, but offers limited customization and flexibility.
At Wix, we chose to use Great Expectations, a leading open-source solution for data quality. Through our use of Great Expectations, we’re able to draw on the expertise of a broad community of developers and data quality experts, while also customizing the solution to meet our specific needs. Additionally, we have the ability to contribute new tests and features to the package as needed.
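To give a feel for what this looks like in practice, here is a minimal sketch of a Great Expectations check using its pandas-backed API; the table, file, and column names are made up for the example and are not our actual assets.

```python
import pandas as pd
import great_expectations as ge

# Load a (hypothetical) warehouse table and wrap it with Great Expectations
orders = ge.from_pandas(pd.read_parquet("orders_daily.parquet"))

# Declare expectations about the data
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_unique("order_id")
orders.expect_column_values_to_be_between("total_amount", min_value=0)

# Run all declared expectations and inspect the overall outcome
results = orders.validate()
print(results["success"])
```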

Standard vs. User Tests

Who decides what to test and how? This depends on the structure and dynamics of your data organization and data warehouse. Do you have a central data warehouse where you want all tables to follow the same quality standards? Or do you prefer to let each team or individual user decide how to test their own data?
Having a clear testing strategy ensures that everyone is on the same page and that the right data quality tests are performed on the right data.
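As a hypothetical illustration of combining both approaches, a central set of standard checks can apply to every table, while each team registers additional checks for its own tables. All names and check definitions below are illustrative, not our actual configuration.

```python
# Standard checks the central platform applies to every table (illustrative)
STANDARD_CHECKS = [
    {"check": "expect_table_row_count_to_be_between", "kwargs": {"min_value": 1}},
    {"check": "expect_column_values_to_not_be_null", "kwargs": {"column": "date"}},
]

# Checks that individual teams register for their own tables (illustrative)
USER_CHECKS = {
    "payments.transactions": [
        {"check": "expect_column_values_to_be_between",
         "kwargs": {"column": "amount", "min_value": 0}},
    ],
}

def checks_for(table: str) -> list:
    """Every table gets the standard checks plus whatever its owners registered."""
    return STANDARD_CHECKS + USER_CHECKS.get(table, [])
```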

Ownership and Alerts

Ownership is essential for efficient data quality processes. When a data quality test fails, various stakeholders should be informed and/or handle the issue, depending on the cause of the failure. Proper ownership and alerting ensure that data quality issues are handled quickly and efficiently.
Alongside alerting, monitoring is equally important in providing an overall view of data quality across multiple assets, helping to identify problematic areas such as specific tests or data assets that frequently fail.
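One possible way to wire ownership into alerting is sketched below, with a hypothetical table-to-owner mapping and a stand-in notification helper; none of this is our actual tooling.

```python
# Hypothetical mapping from data asset to owning team and alert channel
OWNERS = {
    "marketing.campaign_spend": {"team": "marketing-data", "channel": "#marketing-data-alerts"},
    "payments.transactions": {"team": "payments-data", "channel": "#payments-data-alerts"},
}

def notify(channel: str, text: str) -> None:
    # Stand-in for a real notification client (Slack, email, PagerDuty, etc.)
    print(f"[{channel}] {text}")

def alert_on_failure(table: str, failed_checks: list) -> None:
    """Route a data quality failure to the owning team, falling back to a central channel."""
    owner = OWNERS.get(table, {"team": "data-platform", "channel": "#data-quality-alerts"})
    notify(
        owner["channel"],
        f"Data quality failure in {table} (owner: {owner['team']}): {', '.join(failed_checks)}",
    )

alert_on_failure("payments.transactions", ["amount is negative for 12 rows"])
```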

Data Lineage

Data issues are seldom isolated incidents; they are frequently interconnected with upstream problems. If a particular table fails a test, the cause could be a bug in a process several levels upstream of that table. Proper lineage allows tracing a data quality issue back to its origin in the upstream process.
Additionally, having a clear downstream lineage is also crucial as it enables us to understand the potential impact of a data issue on downstream processes and stakeholders. This helps us to prioritize and address the issue accordingly, reducing any negative consequences on our business operations.
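To make this concrete, here is a small sketch (not our actual tooling) that walks a lineage graph, represented simply as a table-to-parents mapping, both upstream to find candidate origins of an issue and downstream to estimate its impact. The table names are invented for the example.

```python
from collections import defaultdict

# Illustrative lineage: each table maps to the tables it reads from
UPSTREAM = {
    "reports.daily_revenue": ["core.orders"],
    "core.orders": ["raw.order_events"],
    "raw.order_events": [],
}

# Invert the graph so we can also walk downstream
DOWNSTREAM = defaultdict(list)
for table, parents in UPSTREAM.items():
    for parent in parents:
        DOWNSTREAM[parent].append(table)

def upstream_chain(table: str) -> list:
    """All ancestors of a table - candidate origins of a data quality issue."""
    chain, stack = [], list(UPSTREAM.get(table, []))
    while stack:
        parent = stack.pop()
        chain.append(parent)
        stack.extend(UPSTREAM.get(parent, []))
    return chain

def downstream_impact(table: str) -> list:
    """All descendants of a table - assets potentially affected by its issues."""
    impacted, stack = [], list(DOWNSTREAM.get(table, []))
    while stack:
        child = stack.pop()
        impacted.append(child)
        stack.extend(DOWNSTREAM.get(child, []))
    return impacted

print(upstream_chain("reports.daily_revenue"))   # ['core.orders', 'raw.order_events']
print(downstream_impact("raw.order_events"))     # ['core.orders', 'reports.daily_revenue']
```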

Image by natanaelginting on Freepik

Scalability

When designing a data quality solution, it's essential to consider scalability. Will you need to run dozens of tests per day, or tens of thousands?
What is the planned load on each component in the data quality solution? The scheduler, query engine or application that runs the tests, and the database where test results are saved are all crucial components to consider.
Moreover, you may also want to consider implementing features like parallel processing or distributed computing to help scale the solution.
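For example, a simple first step toward parallelism is running independent test suites in a thread pool; `run_suite` below is only a stand-in for whatever executes a table's tests against your query engine.

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(table: str) -> bool:
    # Stand-in for executing a table's data quality suite against the query engine
    return True

tables = ["core.orders", "core.users", "reports.daily_revenue"]

# Most of the run time is spent waiting on the query engine, so threads go a long way;
# a distributed scheduler is the next step up when a single machine is not enough.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(tables, pool.map(run_suite, tables)))

failed = [table for table, ok in results.items() if not ok]
print(failed)
```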

Moderation / Not testing everything

Testing is a crucial component in maintaining data quality, but it is not without cost. To keep costs under control, carefully consider which data assets actually require testing. Rather than running quality tests on every asset, strike a balance between testing upstream assets, where a single issue can affect a wide range of downstream tables, and directly testing the downstream data that end-users rely on, to maintain confidence in it.
Additionally, it is not always necessary to test entire tables regularly, particularly if they are updated frequently with new data. In these cases, it may be more efficient to only test the newly updated data during each run.
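As a sketch of that idea, a daily run can validate only the most recently loaded partition instead of the full table; the query shape, table name, and partition column below are hypothetical.

```python
from datetime import date, timedelta

def incremental_scope(table: str, partition_column: str = "date") -> str:
    """Build a query that scopes tests to yesterday's partition instead of the full table."""
    yesterday = date.today() - timedelta(days=1)
    return f"SELECT * FROM {table} WHERE {partition_column} = DATE '{yesterday}'"

# Run the quality suite only against the newly loaded data from the latest daily batch
print(incremental_scope("core.orders"))
```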

In conclusion, introducing a data quality solution in a large data organization is crucial for accurate decision-making, building trust, and efficient development. When implementing a data quality solution, it is essential to consider the topics we discussed: open-source, managed, or in-house solutions; testing strategy; ownership and alerts; proper lineage; scalability and cost of tests; and moderation in testing.
By taking these factors into account, organizations can ensure that their data quality solution is effective, efficient, and tailored to their specific needs.
