Data Quality Roadmap. Part II: Case Studies
This is the second part of the Data Quality Roadmap article, showing how different companies apply the practices described in the first part.
Wrike Product Data Engineering case study
This is a case study for one of Wrike’s data engineering teams: Product Data Engineering.
We’re responsible for the data sources connected to our product: a SaaS platform for collaborative work management. We help analysts and product managers make product development decisions, and we help our engineering teams get feedback on their features.
Works well:
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Communication with data users
- Internal processes of data pipeline design
In progress:
- Validation of data sources
- Knowledge sharing about the data domain
Planned for the near future:
- Testing of data pipelines
- Make sure that needed data sources are available at the right time
Data quality practices
Validation of data sources
Current state: In progress
There’s a long story behind this practice:
Initially, we supported a small set of domains with data engineers and thorough validations, but we couldn’t add new sources quickly enough when data users needed them.
Our architecture and design approach weren’t scalable enough to keep up with these requests.
So we decided to reduce the amount of data validation, testing, and domain knowledge to the point where we could add sources as fast as possible. We set a target of transferring a data source in one or two hours and established a process for creating minimal automatic and manual validations that would still provide reasonable quality to data users. In other words, at that point we traded weaker data validation for better data availability.
Our next big goal is to add data engineers back into the domain teams. We have a good foundation to improve overall data quality and gain more data users.
We’ve already implemented tools for automatic validation: basic sanity checks, anomaly detection, monitoring of data schema, and publishing sources only after all checks have passed. As for manual validations, we have a data review process along with a code review process. We’re also publishing Jupyter Notebooks together with the code and validating them once the data is updated.
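To make the “publish only after all checks have passed” part concrete, here’s a minimal sketch assuming BigQuery and the google-cloud-bigquery client. The table names, the expected column set, and the stage-to-production copy are illustrative, not our actual tooling:

```python
# A minimal "validate, then publish" sketch. Table names and the expected
# column set are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

STAGE_TABLE = "my-project.stage.user_events"      # hypothetical staging table
PROD_TABLE = "my-project.production.user_events"  # hypothetical production table
EXPECTED_COLUMNS = {"user_id", "event_type", "event_ts"}


def sanity_checks_pass(table_id: str) -> bool:
    table = client.get_table(table_id)

    # Schema monitoring: fail if any expected column is missing.
    actual_columns = {field.name for field in table.schema}
    if not EXPECTED_COLUMNS.issubset(actual_columns):
        return False

    # Basic sanity check: the freshly loaded table must not be empty.
    if table.num_rows == 0:
        return False

    return True


def publish(stage_table: str, prod_table: str) -> None:
    # Publish only after all checks have passed: copy stage -> production.
    client.copy_table(stage_table, prod_table).result()


if sanity_checks_pass(STAGE_TABLE):
    publish(STAGE_TABLE, PROD_TABLE)
else:
    raise RuntimeError(f"Validation failed for {STAGE_TABLE}; not publishing")
```

In practice the automatic checks also include anomaly detection on the data itself, which is beyond this sketch.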
Now, we’re improving our best practices for data validations and taking responsibility for the bigger data domain, so this practice is in active development.
Testing of data pipelines
Current state: Planned for the near future
Right now, our data pipelines are clearly under-tested:
- We require tests for the hardest parts of transformations, such as real-time pipelines with a lot of business logic.
- We plan to add required tests for Airflow operators and common libraries.
- But we still don’t require tests for all data pipelines.
For a long time, testing wasn’t the most beneficial way to improve the quality of our data: we’ve mostly used the ELT approach with a small amount of business logic, and the common parts were tested manually and covered by data validations, so no additional testing was required.
We’re planning to add tests for common libraries in the near future, but integration testing requirements aren’t on our near-term roadmap.
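To show the kind of unit test we have in mind for common library code, here’s a minimal pytest-style sketch; the normalize_event function and its behavior are hypothetical, not part of our codebase:

```python
# A hypothetical pure transformation from a shared library, plus a pytest-style
# unit test for it. Both are illustrative only.
from datetime import datetime, timezone


def normalize_event(raw: dict) -> dict:
    """Keep only the fields downstream pipelines rely on and normalize types."""
    return {
        "user_id": int(raw["user_id"]),
        "event_type": raw["event_type"].lower().strip(),
        "event_ts": datetime.fromtimestamp(raw["timestamp_ms"] / 1000, tz=timezone.utc),
    }


def test_normalize_event():
    raw = {"user_id": "42", "event_type": " Task_Created ", "timestamp_ms": 1_600_000_000_000}
    event = normalize_event(raw)
    assert event["user_id"] == 42
    assert event["event_type"] == "task_created"
    assert event["event_ts"].year == 2020
```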
We’d be happy to hear your thoughts on making this process simple and beneficial.
Collect information about the usage of your data sources
Current state: Works well
Most of our users use BigQuery, so we collect usage statistics from all relevant projects automatically.
We use these statistics to deprecate data sources, communicate data issues to the relevant people, and highlight the Tableau dashboards that may be impacted.
Sometimes applying this data requires additional thought, so we’re planning to reduce that friction by integrating the data into our knowledge base.
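As an illustration of how such statistics can be collected, here’s a sketch that reads BigQuery’s INFORMATION_SCHEMA.JOBS_BY_PROJECT view and aggregates which tables were queried and by how many users over the last 30 days. The project name and region are placeholders, and this isn’t necessarily how our own pipeline is implemented:

```python
# Aggregate table usage from BigQuery job metadata. Project and region are
# placeholders; adjust to the projects you actually want to scan.
from google.cloud import bigquery

client = bigquery.Client()

USAGE_SQL = """
SELECT
  ref.dataset_id,
  ref.table_id,
  COUNT(*) AS query_count,
  COUNT(DISTINCT user_email) AS distinct_users
FROM `my-project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
  UNNEST(referenced_tables) AS ref
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY ref.dataset_id, ref.table_id
ORDER BY query_count DESC
"""

for row in client.query(USAGE_SQL).result():
    print(row.dataset_id, row.table_id, row.query_count, row.distinct_users)
```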
Cover all data sources with clear SLAs
Current state: Works well
We currently use Airflow as our main service for automation of ETLs and have several instances of Airflow — production, acceptance, and analytical — in several different locations (both on-prem and in GCP). We use a separate database to store metadata across all instances.
We use our own MetaDB to store and manage all SLAs. Our approach is similar to Airflow’s SLAs but adapted for our use cases (a minimal sketch follows this list):
- We have two kinds of SLAs; the first ensures that the on-duty data engineer knows about a problem before the data is published to our data users, so they can fix it before anyone notices.
- We publish the state of data sources so we can automatically identify when dependencies are ready.
- We make publishing explicit, so we can validate data before publishing. We don’t publish staging data sources that should remain private to the transformation.
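For illustration, here’s a minimal Airflow-flavored sketch of this approach: an internal SLA that alerts the on-duty engineer, and an explicit publish task that only runs after validation. The DAG, task callables, and notification logic are hypothetical, and the MetaDB integration isn’t shown:

```python
# A minimal Airflow DAG sketch: internal SLA + explicit publish-after-validation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_duty_engineer(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(f"SLA missed for {dag.dag_id}: {task_list}")  # e.g. send a chat alert


def load_stage(**context):
    ...  # load raw data into a private staging table


def validate(**context):
    ...  # run automatic validations; raise an exception to block publishing


def publish(**context):
    ...  # mark the data source as published / ready for dependants


with DAG(
    dag_id="user_events_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"sla": timedelta(hours=2)},  # internal SLA for the on-duty engineer
    sla_miss_callback=notify_on_duty_engineer,
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_stage", python_callable=load_stage)
    check = PythonOperator(task_id="validate", python_callable=validate)
    release = PythonOperator(task_id="publish", python_callable=publish)

    load >> check >> release  # publishing is explicit and happens only after checks
```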
You can see more details in our presentation at Airflow Summit 2020.
Make sure that needed data sources are available at the right time
Current state: Planned for the near future
As we’ve described in the data validation section, we’re barely involved in the domain itself, so we mostly provide raw data.
We’ve designed a process that helps us transfer all the needed data sources or events within one or two hours of a request, and we provide an SLA to our clients that relevant data sources can be transferred within a day.
Transferring raw data covers the basic needs of our clients, but we also plan to improve data quality for derived data sources. We’re now working on a project to add data engineers inside the domain, inspired by the Data Mesh approach.
Our main goal is to increase the number of questions that can be answered through self-service analysis: use-case analysis of a feature, the state of the account visible to the user, adoption of a feature inside an account, and so on.
Communication with data users
Current state: Works well
We’ve separated our data into Production and Acceptance layers, promising trustworthy data and careful change management in the Production layer.
We’re also improving the deprecation process in the Production layer.
Production isn’t always a single-source-of-truth layer, though: data may sometimes be duplicated and unreliable for some use cases.
We use Google BigQuery and have made several datasets that could be qualified as “Gold Standard,” and constantly communicate the quality to our end users, carefully processing all the feedback and nudging users to use the single source of truth layer.
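One simple way to surface a “Gold Standard” marker directly in BigQuery is to tag the datasets with labels. The sketch below assumes the google-cloud-bigquery client and hypothetical dataset names; it’s an illustration of the idea rather than a description of our actual process:

```python
# Label "Gold Standard" datasets so users can see the marker in the BigQuery UI.
# Dataset names and the label key/value are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

GOLD_DATASETS = ["my-project.core_reporting", "my-project.product_metrics"]  # placeholders

for dataset_id in GOLD_DATASETS:
    dataset = client.get_dataset(dataset_id)
    dataset.labels = {**(dataset.labels or {}), "quality": "gold-standard"}
    client.update_dataset(dataset, ["labels"])  # only the labels field is updated
```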
We haven’t implemented data lineage and data monitoring yet, but we plan to do so to improve the usage of our “Gold Standard” layer, for example by automatically recommending new data sources.
We also plan to improve this process through better data discoverability and documentation.
Internal processes of data pipeline design
Current state: Works well
Data Mesh describes a decentralized, federated, and computational approach to data governance, while the classic data warehousing approach is a centralized warehouse managed by a single data engineering team.
As for now, we have something in between. Our data domain is so big and complex that it can’t be managed by a single team, but we haven’t adopted the decentralized approach just yet. As we’ve described in “Make sure that needed data sources are available at the right time,” we have a plan to introduce the domain teams, so we’ll need federated and computational governance.
Currently, our approach for providing the raw layer is centralized: We review data, code, and validations inside the whole team. We also have data analysts who are responsible for designing and providing derived data sources; their data sources are reviewed internally in analytical domain teams, so they’re not following the same standards.
Our current approach is working well but as we dive deeper into domains, we have a plan to improve our governance approach.
Another thing worth mentioning is our engineering practices. They’re working well for us, and our on-duty and review processes help us share knowledge. For typical use cases, we have checklists that help us maintain a high standard.
Knowledge sharing about the data domain
Current state: In progress
We created a service based on XWiki that provides manually written documentation for data sources together with automatically generated metadata: usage information, links to Airflow DAGs and BigQuery tables, and data lineage.
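As a rough illustration of the idea, the sketch below combines a manually written description with automatically collected metadata into a single page body; the data structure and the rendering are hypothetical and far simpler than the real XWiki integration:

```python
# A hypothetical documentation record for a data source, rendered as a simple
# wiki-style page body. Field names and the rendering are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSourceDoc:
    name: str
    description: str                     # written manually by the data engineer
    airflow_dag_url: str                 # collected automatically
    bigquery_table: str                  # collected automatically
    monthly_query_count: int             # collected automatically
    upstream_sources: List[str] = field(default_factory=list)  # lineage


def render_page(doc: DataSourceDoc) -> str:
    lines = [
        f"= {doc.name} =",
        doc.description,
        f"* Airflow DAG: {doc.airflow_dag_url}",
        f"* BigQuery table: {doc.bigquery_table}",
        f"* Queries in the last 30 days: {doc.monthly_query_count}",
        "* Upstream sources: " + ", ".join(doc.upstream_sources),
    ]
    return "\n".join(lines)
```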
Now we’re trying to improve adoption: adding documentation to the Definition of Ready for ETLs, helping other teams start using the documentation internally, linking documentation to Slack threads, and so on.
During our experiments with data engineers inside product domains, we’re making the documentation for data domains similar to Airbnb’s design specs.
Airbnb’s case study
This is a case study for Airbnb, compiled by the authors of the roadmap from publicly available information. It is based on Airbnb’s description of their data quality efforts (part 1, part 2).
Works well:
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Internal processes of data pipeline design
- Validation of data sources
- Knowledge sharing about the data domain
- Testing of data pipelines
- Make sure that needed data sources are available at the right time
Unknown:
- Communication with data users
Data quality practices
Validation of data sources
Current state: Works well
They have:
- Tooling for automatic validations, which is mandatory for new pipelines.
- Mandatory automatic and manual validations based on the design specification of all certified sources.
- A review of data validations before release.
Testing of data pipelines
Current state: Works well
Collect information about the usage of your data sources
Current state: Unknown
Cover all data sources with clear SLAs
Current state: Works well
- …We also require that teams incorporate data pipeline SLAs into their quarterly OKR planning.
- SLA Tracker is a visual analytics tool to facilitate a culture of data timeliness at Airbnb.
Make sure that needed data sources are available at the right time
Current state: Works well
Communication with data users
Current state: Works well
They’ve made a project that provides a clear guarantee of data quality:
- …Midas, the initiative we developed as a mechanism to unite the company behind a shared “gold standard” that serves as a guarantee of data quality at Airbnb.
- Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
Internal processes of data pipeline design
Current state: Works well
They have a Midas certification process that ensures the use of best practices and good data governance.
Knowledge sharing about the data domain
Current state: Works well
- Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
- Design specs are reviewed before data source implementation to share knowledge.
Uber’s case study
This is a case study for Uber, based on their article Journey Toward Better Data Culture From First Principles. It was compiled by the authors of the roadmap from publicly available information.
Works well:
- Testing of data pipelines
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Communication with data users
- Knowledge sharing about the data domain
- Make sure that needed data sources are available at the right time
In progress:
- Validation of data sources
- Internal processes of data pipeline design
Data quality practices
Validation of data sources
Current state: In progress
- Freshness: time delay between production of data and when the data is 99.9% complete in the destination system including a watermark for completeness (default set to 3 9s), as simply optimizing for freshness without considering completeness leads to poor quality decisions
- Completeness: % of rows in the destination system compared to the # of rows in the source system
- Duplication: % of rows that have duplicate primary or unique keys, defaulting to 0% duplicate in raw data tables, while allowing for a small % of duplication in modeled tables
- Cross-data-center consistency: % of data loss when a copy of a dataset in the current datacenter is compared to the copy in another datacenter
- Semantic checks: captures critical properties of fields in the data such as null/not-null, uniqueness, # of distinct values, and range of values
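As an illustration of how one of these checks could be expressed (this is not Uber’s implementation), here’s a sketch that computes the share of rows with duplicated primary keys in a BigQuery table; the table and key column names are placeholders:

```python
# Compute the duplication metric: the share of rows whose primary key appears
# more than once. Table and key column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

DUPLICATION_SQL = """
SELECT
  SAFE_DIVIDE(
    COUNTIF(key_count > 1),  -- rows whose primary key is duplicated
    COUNT(*)
  ) AS duplicate_row_ratio
FROM (
  SELECT COUNT(*) OVER (PARTITION BY order_id) AS key_count
  FROM `my-project.raw.orders`
)
"""

row = next(iter(client.query(DUPLICATION_SQL).result()))
ratio = row.duplicate_row_ratio or 0.0
assert ratio == 0.0, f"Raw table has {ratio:.2%} duplicated rows"  # default: 0% in raw tables
```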
Testing of data pipelines
Current state: Works well
Collect information about the usage of your data sources
Current state: Works well
Usage is collected and provided in Databook.
Cover all data sources with clear SLAs
Current state: Works well
Make sure that needed data sources are available at the right time
Current state: Works well
Communication with data users
Current state: Works well
Internal processes of data pipeline design
Current state: In progress
Uber has a project for metrics standardization.
See more info in: The Journey Towards Metric Standardization.
They also collaborate with the engineering team to set up automatic capture of contextual information.
Knowledge sharing about the data domain
Current state: Works well
- Basic metadata: such as documentation, ownership information, pipelines, source code that produced the data, sample data, lineage, and tier of the artifact
- Usage metadata: statistics on who used it, when, popular queries, and artifacts that are used together
- Quality metadata: tests on the data, when do they run, which ones passed, and aggregate SLA provided by the data
- Cost metadata: resources used to compute and store the data, including monetary cost
- Bugs and SLAs: bugs filed against the artifact, incidents, recent alerts, and overall SLA in responding to issues by owners
See the article on Uber’s Databook for more info.
We can make it better together
Feel free to describe your case studies in the replies; I’ll be happy to write a second case studies article with your experiences included.
You can also schedule a meeting with Alexander Eliseev (the main maintainer of this roadmap) if you’d like help with applying this roadmap or if you have feedback.