Data Quality Roadmap. Part I

Alexander Eliseev · Wrike TechClub · May 24, 2021

We at Wrike are creating a roadmap with best practices to help data engineers achieve good data quality in their teams and projects.

In Part I of this article, we describe:

  • Dimensions of data quality.
  • Practices that could help achieve data quality in a given dimension.
  • Best practices, relevant materials, and case studies.

As a data engineer, you can:

  • Describe new dimensions of data quality.
  • Describe best practices that aren’t covered by this article.
  • Add relevant materials or examples of best practices.
  • Describe how data quality is handled in your company and perform a gap analysis for your data engineering team.

Dimensions

Dimensions of data quality (based on Airbnb’s Data Quality articles) include:

  • Accurate data: the data contains everything its name implies, across its entire history.
  • Consistent data: everybody looks at the same data.
  • Timely data: refreshes on time at the right cadence.
  • Cost-Effective data: resources are spent on data effectively.
  • Usable data: easy to find and access.
  • Available data: all data that’s needed is available.

How to achieve quality in these dimensions

To achieve quality in all of these dimensions, we will approach it from several angles:

  1. Practices — What should we start doing to improve quality?
  2. Communication with data users — How and what should we communicate?
  3. Internal processes of data pipeline design — How should data engineering teams work internally?
  4. Knowledge sharing about data domain — How should we share our knowledge about the data domain?

We’ll dive deeper into every dimension, link it with the relevant practices, and describe every practice in detail.

The practices from this spreadsheet are described below in detail.

Data quality practices

Validation of data sources

Helps achieve:

  • Accurate data

We can break this down into the following:

Automatic validations

  • Basic sanity checks
  • Testing of data semantics: nullability, foreign keys, range of values, number of distinct values, and so on (see the sketch after this list)
  • Anomaly detection on the new data generated by the pipeline
  • Expectations of the dependencies that aren’t covered by the data owners
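
Here’s a minimal sketch of such automatic checks in Python with pandas; the tables, column names, and checks are hypothetical and only show the shape of the approach:

```python
import pandas as pd


def validate_orders(orders: pd.DataFrame, users: pd.DataFrame) -> list:
    """Run basic semantic checks and return a list of human-readable errors."""
    errors = []

    # Basic sanity check: the extract is not empty.
    if orders.empty:
        errors.append("orders extract is empty")
        return errors

    # Nullability: key columns must never be null.
    for col in ("order_id", "user_id"):
        if orders[col].isna().any():
            errors.append(f"{col} contains nulls")

    # Foreign key: every user_id must exist in the users table.
    missing = set(orders["user_id"].dropna()) - set(users["user_id"])
    if missing:
        errors.append(f"{len(missing)} user_id values are missing from users")

    # Range of values: amounts should be non-negative.
    if (orders["amount"] < 0).any():
        errors.append("negative amounts found")

    # Number of distinct values: order_id should be unique.
    if orders["order_id"].nunique() != len(orders):
        errors.append("duplicate order_id values")

    return errors
```

A non-empty error list can fail the pipeline run before bad data is published, and the same function can be rerun from a notebook when the model is updated.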

Manual validations

Covers the checks that are hard to automate:

  • Comparison against existing data sources and metrics.
  • Anomaly detection on historical data.
  • Advanced sanity checks by cohorts, segments, and so on.

Validations should be reusable when the model is updated; for example, the code could be saved in Jupyter Notebooks.

Validations could be reviewed as part of the data review process.

What can go wrong without this practice?

Every time you use your data, you make basic validations: you check that the data is present, sometimes you dive deeper and make manual validations, and you may even add automatic validations. So let’s compare this lazy validation approach with validations made during data source development.

The lazy approach may save a significant amount of time during data pipeline implementation, because validations require good domain knowledge, and the domain may be complex or change more frequently than you analyze it.

However, the lazy approach also has some risks:

  1. If some validations fail, all the data collected before the validation existed may be broken; you have to wait for a fix before you can collect valid data and continue your analysis.
  2. The time needed for analysis becomes unpredictable: sometimes validation takes a significant amount of time, and sometimes you wait even longer until enough data is collected (if the first risk materializes).
  3. If the analysis has strict deadlines, users may validate more shallowly than the data source designers would, or perform only manual validations, so the validations have to be duplicated whenever the data is needed again.
  4. Ignoring data quality may lead to wrong decisions based on wrong data, so data-driven decisions may stop looking reliable.
  5. It raises the threshold of skills and awareness needed to work with the data. In the extreme case, only a small number of analysts and managers can work with the data; their capacity is limited, which raises the cost of data-driven decisions, so gut-feeling decisions may be preferred.

Testing of data pipelines

Helps achieve:

  • Accurate data

Tests are useful to ensure that you can safely modify data pipelines. They shorten the feedback loop and preserve the guarantees you provide on a pipeline or transformation.

You can break this down into:

  • Integration testing
  • Unit testing of reused parts
  • Unit testing of business logic (see the sketch after this list)
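
Here’s a minimal sketch of a unit test of business logic with pytest; `sessionize` is a hypothetical transformation standing in for your own logic:

```python
from datetime import datetime, timedelta


def sessionize(events: list, gap: timedelta) -> int:
    """Count sessions: a new session starts when the gap between events exceeds `gap`."""
    sessions = 0
    previous = None
    for ts in sorted(events):
        if previous is None or ts - previous > gap:
            sessions += 1
        previous = ts
    return sessions


def test_sessionize_splits_on_gap():
    events = [
        datetime(2021, 5, 24, 10, 0),
        datetime(2021, 5, 24, 10, 10),  # within 30 minutes: same session
        datetime(2021, 5, 24, 12, 0),   # gap over 30 minutes: new session
    ]
    assert sessionize(events, gap=timedelta(minutes=30)) == 2


def test_sessionize_handles_empty_input():
    assert sessionize([], gap=timedelta(minutes=30)) == 0
```

Because the inputs are tiny and in memory, these tests give feedback in seconds instead of a full pipeline run.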

What can go wrong?

When you’re working with data pipelines, it’s typical for pipelines to take a lot of time to process the dataset. If you’re not creating a test dataset and test infrastructure, the biggest amount of feedback you get is when you’re deploying the code on acceptance or production, leading to a big feedback loop.

Typically you have limited time and attention to validate your dataset. If the basic checks aren’t covered by tests, you may spend most of your time resolving trivial bugs while bigger bugs or anomalies go unnoticed. So the faster your feedback loop is, the more bugs you can find and resolve.

Another typical failure mode: if you fix bugs in business logic without writing tests, the same bugs may return with the next code update.

Collect information about the usage of your data sources

Helps achieve:

  • Consistent data
  • Cost-Effective data

You’re going to need this practice to:

  • Own data products end-to-end.
  • Be responsible for the quality of everything data users see on top of your source: dashboards, spreadsheets, Slack messages, reports, derived data sources, etc.
  • Find out the expectations end users have on the derived data sources and products.
  • Find out how many resources are spent on processing your data source.
  • Find out where raw data sources are used instead of aggregated data sources and data marts.

Sometimes statistics and data lineage are available in data portals. I’ll explain it in the Knowledge Sharing section.
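
If such a portal isn’t available, one possible way to collect usage information yourself is to mine query logs, assuming your warehouse exposes them. The DataFrame and column names below are hypothetical, and the regex is deliberately naive (real SQL needs a proper parser):

```python
import re

import pandas as pd

# Naive pattern: table names that follow FROM or JOIN in the query text.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_usage(query_log: pd.DataFrame) -> pd.DataFrame:
    """Count queries and distinct users per referenced table."""
    rows = []
    for user, query in zip(query_log["user_name"], query_log["query_text"]):
        for table in TABLE_PATTERN.findall(query):
            rows.append({"table": table.lower(), "user": user})
    usage = pd.DataFrame(rows)
    return (
        usage.groupby("table")
        .agg(queries=("user", "size"), distinct_users=("user", "nunique"))
        .sort_values("queries", ascending=False)
    )
```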

What can go wrong without this practice?

You’re simply blind without the proper collection of the usage information. You don’t know the whole set of data users, their expectations, and usage patterns, so any major change will be difficult. You won’t be able to optimize your data source, find out the data usage patterns, and make relevant sources based on this information.

You should rely only on communication, which can be difficult in big organizations. It’s an unreliable approach since some users may be on vacation or simply forget about the use cases they had a while ago.

Even if the data is consistent at your level, it can be spoiled at later stages of the data pipeline, leaving the data user with inconsistent data.

These problems may add up and undermine the credibility of the data.

Cover all data sources with clear SLAs

Helps achieve:

  • Timely data

Focus points:

  • Data should be fresh (a minimal freshness check is sketched after this list).
  • Once you have a delay, it should be communicated clearly to your data users.
  • Data freshness should meet your clients’ expectations and you should make it clear if you’re not ready to meet them.
  • Sometimes you should use chaos engineering and increase delay artificially to make sure that your data clients are ready to work with outdated data.
  • To make your SLAs and data freshness clear, you can use visualization. See the Airbnb approach.
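
Here’s the minimal freshness check mentioned above, assuming you can query the latest loaded timestamp of a data source; the function name, timestamps, and thresholds are illustrative:

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at: datetime, sla: timedelta):
    """Return whether the source meets its freshness SLA and the current lag."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    return lag <= sla, lag


# Example: a daily table is expected to lag by at most 26 hours.
ok, lag = check_freshness(
    last_loaded_at=datetime(2021, 5, 23, 6, 0, tzinfo=timezone.utc),
    sla=timedelta(hours=26),
)
if not ok:
    print(f"SLA breach: data is {lag} behind, notify data users")
```

The same check can feed a dashboard or an alert in the channel where delays are communicated.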

What can go wrong?

Data freshness is one of the signals of reliable data for data users, so when the data is outdated, the data source looks broken to them. If the data source breaks frequently, it’s hard for them to believe that the other aspects of data quality are reliable, so they may duplicate manual validations, or even part of the data processing, in order to depend on fewer moving parts.

Sometimes data users will say that your data source is slow or constantly out of date. At that moment it’s better to already have metrics and SLAs, so you can find out where to improve the quality of your source.

Sometimes your data source works better than you’ve committed to, and your clients start expecting it to be ready earlier. They become overconfident, and when the source does break, they won’t expect it and may make decisions based on partially ready data.

Make sure that needed data sources are available at the right time

Helps achieve:

  • Available data

Data engineers should be involved in the design of information systems, data collection, and SaaS integrations, as well as the delivery of new features to provide all the relevant data at the right time.

This practice is connected to internal processes of data pipeline design and data governance, and we have similar options here:

  1. The traditional Data Warehouse approach (one team of data engineers works on data users’ requests).
  2. The Data Mesh approach: data engineers work inside the domain to deliver data products.
  3. Allowing data users to own the data themselves: data analysts or data scientists are responsible for the data sources they actively use.

What can go wrong?

When data engineers aren’t involved in the process that’s connected with the production of new data, their knowledge of the domain may be outdated and data users may prefer to make integrations by themselves or use raw data instead. In this case, it’s impossible to ensure the data quality and create a consistent data warehouse.

Data users can own data sources in the short run, and this approach may be cheaper and more natural to implement. But as the need for cross-domain analysis grows, without governance the data may be siloed and have uneven data quality across domains.

Communication with data users

Clear communication helps you get the full benefit of all the practices above, as well as of knowledge sharing.

This is often a two-way process:

  • You state the data is accurate, consistent, and usable, making the process transparent for users.
  • Your users validate your claims when they use the data, and give feedback when the quality doesn’t meet their expectations.
  • You collect all this feedback and are transparent about how you’re improving your data quality, or share your knowledge about known exceptions.

So communication about data quality is an ongoing process that doesn’t stop when you’ve released your data source. If you don’t nurture this process, alternative sources of truth may appear, and your data users may end up duplicating work or making wrong decisions.

To achieve all these goals you should:

  • Take responsibility for data accuracy and process your users’ feedback.
  • Communicate that the data is a single source of truth and ensure that there’s no work duplication.
  • Align expectations of data users with SLAs.
  • Optimize your bottlenecks: Data users are open to hearing that you’ll optimize/rewrite their data sources to make them more cost-efficient.
  • Promote the usage of data sources.
  • Survey your users (e.g., by using NPS).
  • Take responsibility for the data domain as soon as the data is ready, even before your data source is in production.

You can use many mediums here:

  • Central communication channels
  • Data help desk channel
  • Automated notifications about deprecated sources, based on real-time analysis of usage (see the sketch after this list)
  • Links to the relevant documentation, which improve the cohesion and discoverability of knowledge sharing
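
A minimal sketch of such an automated deprecation notice, assuming usage events are checked periodically and a Slack incoming webhook URL is configured; the source names and mapping are hypothetical:

```python
import requests

# Hypothetical mapping from deprecated sources to their replacements.
DEPRECATED_SOURCES = {"dm.sessions_v1": "dm.sessions_v2"}


def notify_deprecated_usage(table: str, user: str, webhook_url: str) -> None:
    """Post a notice when a deprecated source is queried, suggesting its replacement."""
    replacement = DEPRECATED_SOURCES.get(table)
    if replacement is None:
        return
    message = (
        f"{user} queried deprecated source `{table}`; "
        f"please switch to `{replacement}`."
    )
    requests.post(webhook_url, json={"text": message}, timeout=10)
```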

What can go wrong?

Advanced data users are constantly thinking about data quality; they’re trying to estimate how reliable the data is and how it could impact their decisions.

When you’re not claiming reliability or making the process transparent, data users may spend extra time reproducing the quality checks you’ve already done. If the data source is used frequently, a lot of time can be wasted.

Even if you’re claiming the data quality, data users could validate data additionally because they may have new requirements. If you’re not capturing these requirements, data sources may become unreliable and other sources of truth may appear.

Internal processes of data pipeline design

If you want to ensure data quality and compatibility regardless of where the data originates, the practices listed above should be established as internal processes across all data engineering teams.

Many of these points are related to data governance and engineering processes. These topics are broad and deserve a separate article, but here we’ll focus on the requirements for these processes that matter for data quality:

  • Ensure that the validations and tests of data pipelines are present during the data pipeline implementation and are easy to review.
  • Governance ensures a single source of truth.
  • You have clear practices for data evolution.
  • You can easily identify how your data source impacts other sources and update the tree of SLAs with new changes (see the sketch after this list).
  • You’ve designed and implemented engineering best practices to ensure the cost efficiency of your data sources.
  • You analyze the use cases of your data sources.
  • You can keep up with the pace of changes in the data domain and have enough resources to deliver data sources on time.
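
As a sketch of the impact-analysis point above: a lineage graph can be queried for downstream dependencies, for example with networkx. The edges here are hypothetical and would normally come from your metadata store:

```python
import networkx as nx

# Hypothetical lineage: edges point from a source to the data sources built on it.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.events", "dm.sessions"),
    ("dm.sessions", "report.engagement"),
    ("dm.sessions", "report.retention"),
])


def downstream_impact(graph: nx.DiGraph, source: str) -> set:
    """Return every data source that depends, directly or transitively, on `source`."""
    return nx.descendants(graph, source)


print(downstream_impact(lineage, "raw.events"))
# {'dm.sessions', 'report.engagement', 'report.retention'}
```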

One of the important things to highlight here is scalability. Depending on the frequency of change and the size of your team:

  • You may consider centralized (traditional) or decentralized and computational (data mesh) approaches. See comparison for more details.
  • You may need to choose between denormalized, dimensional data models (Kimball-style star and snowflake schemas) and normalized data models (Inmon’s 3NF warehouse, Data Vault, or Anchor modeling).
  • In data engineering, the normalized approach often goes together with computational data model management and computational governance.

What can go wrong?

Data engineers process a lot of data and many sources at the same time. Bad practices at the time of source design add up, leading to an unscalable set of pipelines that consume too many resources and are hard to fix.

You may have a single team of data engineers that designs everything. In that case, it’s easy to maintain a single source of truth, especially in slowly changing domains or in domains that don’t change at all.

But once your organization scales, you may need more data engineers working in separate domains, with more frequent changes in each domain. Separate teams may define the same term differently and store that information in different data sources, so your data users can end up with inconsistent results.

Knowledge sharing about the data domain

You need high cohesion to make all these approaches work together smoothly. For example, to your data users, a lack of knowledge about data quality looks the same as a lack of validations or SLAs. So you need to set up a knowledge-sharing process.

Considering the requirements for all the goals above, you could focus on these points to improve your data quality:

  • Data sources are as complete and accurate as possible, and all edge cases are clearly communicated.
  • All the anomalies and bugs in data are either fixed or clearly communicated.
  • All the major reasons for the change in data should be communicated clearly.
  • Applied validations and tests should be discoverable for users if they need to make their own validations.
  • Users who want to understand the domain should be able to find the relevant curated data sources before they stumble on raw data or on a source that isn’t the single source of truth.
  • Information about changes in the domain and data sources should be available for users.
  • Data source SLAs should be available for users before they make any implicit expectations.
  • Best practices for the interaction with data are available to data users.
  • It’s easy to find all the needed information about your data sources.
  • You can share knowledge, or even the data itself, for data sources in changing and emerging domains.

Knowledge sharing may take different forms: documentation, data portals and catalogs, chat channels and help desks, and so on.

There are many different use cases for data, and there’s no clear winner: some of these approaches may fit your situation perfectly.

Knowledge sharing is easy to get wrong, so its metrics should be defined and monitored: WAU for documentation views and updates (see the sketch below), the number of questions in chats that aren’t covered by documentation, and so on.
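
For example, the documentation WAU could be computed from view events, assuming they’re available as a DataFrame with a `user_id` column and a datetime `viewed_at` column (both hypothetical):

```python
import pandas as pd


def documentation_wau(views: pd.DataFrame) -> pd.Series:
    """Count distinct users viewing the documentation per calendar week."""
    views = views.assign(week=views["viewed_at"].dt.to_period("W"))
    return views.groupby("week")["user_id"].nunique()
```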

Data discovery and metadata management could also be covered as a separate topic, but I won’t dive into the details here.

What can go wrong?

Verbal communication between data users or with domain experts takes a lot of time, and some of the information may be unreliable or inaccurate, so a lack of knowledge sharing often leads to extra data validations or duplicated code.

If we’re implementing documentation and have low coverage, users may not want to spend extra time searching the documentation. It can be present but not discoverable, so it’s the same as if there’s no documentation.

If we’re making documentation on the level of the table and not referencing the whole domain, it may impact discoverability, too.

  • The lack of edge case documentation and cross-reference may lead to errors.
  • Knowledge sharing about the single source of truth without communicating data validations, tests, and SLAs may lead to the duplication of data validations or SLAs.
  • Lack of knowledge sharing about domain changes may leave downstream data sources outdated, and so on.

We can make it better together

You can share your practices in the comments to make this guide more complete. You can also schedule a meeting with Alexander Eliseev (the main maintainer of this roadmap) if you’d like help applying it or have feedback.

In Part 2, I’m going to describe case studies showing how these practices applied in different companies.
