Data Quality Roadmap. Part II: Case Studies
This is the second part of the Data Quality Roadmap article, showing how different companies apply the practices described in the first part.
Wrike Product Data Engineering case study
This is a case study for one of Wrike’s data engineering teams: Product Data Engineering.
We’re responsible for the data sources connected to our product: a SaaS platform for collaborative work management. We help analysts and product managers make product development decisions, and we help our engineering teams get feedback on their features.
Works well:
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Communication with data users
- Internal processes of data pipeline design
In progress:
- Validation of data sources
- Knowledge sharing about the data domain
Planned for the near future:
- Testing of data pipelines
- Make sure that needed data sources are available at the right time
Data quality practices
Validation of data sources
Current state: In progress
There’s a long story behind this practice:
Initially, we supported a small set of domains with data engineers and thorough validations, but we couldn’t add new sources quickly enough when data users needed them.
Our architecture and design approach weren’t scalable enough to keep up with these requests.
So we decided to reduce the amount of data validation, testing, and domain knowledge to the point where we could add sources as fast as possible. We set a target of transferring a data source in one or two hours and established a process for creating minimal automatic and manual validations that would still provide reasonable quality to data users. In other words, at that point we traded weaker data validation for better data availability.
Our next big goal is to add data engineers back into the domain teams. We have a good foundation to improve overall data quality and gain more data users.
We’ve already implemented tools for automatic validation: basic sanity checks, anomaly detection, monitoring of data schema, and publishing sources only after all checks have passed. As for manual validations, we have a data review process along with a code review process. We’re also publishing Jupyter Notebooks together with the code and validating them once the data is updated.
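To make the “publish only after all checks have passed” part concrete, here’s a minimal sketch assuming BigQuery and the google-cloud-bigquery client. The table names, the expected column set, and the stage-to-production copy are illustrative, not our actual tooling:

```python
# A minimal "validate, then publish" sketch. Table names and the expected
# column set are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

STAGE_TABLE = "my-project.stage.user_events"      # hypothetical staging table
PROD_TABLE = "my-project.production.user_events"  # hypothetical production table
EXPECTED_COLUMNS = {"user_id", "event_type", "event_ts"}


def sanity_checks_pass(table_id: str) -> bool:
    table = client.get_table(table_id)

    # Schema monitoring: fail if any expected column is missing.
    actual_columns = {field.name for field in table.schema}
    if not EXPECTED_COLUMNS.issubset(actual_columns):
        return False

    # Basic sanity check: the freshly loaded table must not be empty.
    if table.num_rows == 0:
        return False

    return True


def publish(stage_table: str, prod_table: str) -> None:
    # Publish only after all checks have passed: copy stage -> production.
    client.copy_table(stage_table, prod_table).result()


if sanity_checks_pass(STAGE_TABLE):
    publish(STAGE_TABLE, PROD_TABLE)
else:
    raise RuntimeError(f"Validation failed for {STAGE_TABLE}; not publishing")
```

In practice the automatic checks also include anomaly detection on the data itself, which is beyond this sketch.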
Now, we’re improving our best practices for data validations and taking responsibility for the bigger data domain, so this practice is in active development.
Testing of data pipelines
Current state: Planned for the near future
Right now, our data pipelines are clearly under-tested:
- We require tests for the hardest parts of transformations, such as real-time pipelines with a lot of business logic.
- We plan to add required tests for Airflow operators and common libraries.
- But we still don’t require tests for all data pipelines.
For a long time, testing wasn’t the most beneficial way to improve the quality of our data: we’ve mostly used the ELT approach with a small amount of business logic, and the common parts were tested manually and covered by data validations, so no additional testing was required.
We’re planning to add tests for common libraries in the near future, but integration testing requirements aren’t on our near-term roadmap.
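To show the kind of unit test we have in mind for common library code, here’s a minimal pytest-style sketch; the normalize_event function and its behavior are hypothetical, not part of our codebase:

```python
# A hypothetical pure transformation from a shared library, plus a pytest-style
# unit test for it. Both are illustrative only.
from datetime import datetime, timezone


def normalize_event(raw: dict) -> dict:
    """Keep only the fields downstream pipelines rely on and normalize types."""
    return {
        "user_id": int(raw["user_id"]),
        "event_type": raw["event_type"].lower().strip(),
        "event_ts": datetime.fromtimestamp(raw["timestamp_ms"] / 1000, tz=timezone.utc),
    }


def test_normalize_event():
    raw = {"user_id": "42", "event_type": " Task_Created ", "timestamp_ms": 1_600_000_000_000}
    event = normalize_event(raw)
    assert event["user_id"] == 42
    assert event["event_type"] == "task_created"
    assert event["event_ts"].year == 2020
```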
We’d be happy to hear your thoughts on making this process simple and beneficial.
Collect information about the usage of your data sources
Current state: Works well
Most of our users use BigQuery, so we collect usage statistics from all relevant projects automatically.
We use these statistics to deprecate data sources, communicate data issues to the relevant people, and highlight the Tableau dashboards that may be impacted.
Sometimes applying this data requires additional thought, so we’re planning to reduce that friction by integrating the data into our knowledge base.
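As an illustration of how such statistics can be collected, here’s a sketch that reads BigQuery’s INFORMATION_SCHEMA.JOBS_BY_PROJECT view and aggregates which tables were queried and by how many users over the last 30 days. The project name and region are placeholders, and this isn’t necessarily how our own pipeline is implemented:

```python
# Aggregate table usage from BigQuery job metadata. Project and region are
# placeholders; adjust to the projects you actually want to scan.
from google.cloud import bigquery

client = bigquery.Client()

USAGE_SQL = """
SELECT
  ref.dataset_id,
  ref.table_id,
  COUNT(*) AS query_count,
  COUNT(DISTINCT user_email) AS distinct_users
FROM `my-project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
  UNNEST(referenced_tables) AS ref
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY ref.dataset_id, ref.table_id
ORDER BY query_count DESC
"""

for row in client.query(USAGE_SQL).result():
    print(row.dataset_id, row.table_id, row.query_count, row.distinct_users)
```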
Cover all data sources with clear SLAs
Current state: Works well
We currently use Airflow as our main service for automation of ETLs and have several instances of Airflow — production, acceptance, and analytical — in several different locations (both on-prem and in GCP). We use a separate database to store metadata across all instances.
We use our own MetaDB to store and manage all SLAs. Our approach is similar to Airflow’s SLAs but adapted for our use cases (a minimal sketch follows this list):
- We have two kinds of SLAs; the first ensures that the on-duty data engineer knows about a problem before the data is published to our data users, so they can fix it before anyone notices.
- We publish the state of data sources so we can automatically identify when dependencies are ready.
- We make publishing explicit, so we can validate data before publishing. We don’t publish staging data sources that should remain private to the transformation.
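For illustration, here’s a minimal Airflow-flavored sketch of this approach: an internal SLA that alerts the on-duty engineer, and an explicit publish task that only runs after validation. The DAG, task callables, and notification logic are hypothetical, and the MetaDB integration isn’t shown:

```python
# A minimal Airflow DAG sketch: internal SLA + explicit publish-after-validation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_duty_engineer(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(f"SLA missed for {dag.dag_id}: {task_list}")  # e.g. send a chat alert


def load_stage(**context):
    ...  # load raw data into a private staging table


def validate(**context):
    ...  # run automatic validations; raise an exception to block publishing


def publish(**context):
    ...  # mark the data source as published / ready for dependants


with DAG(
    dag_id="user_events_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"sla": timedelta(hours=2)},  # internal SLA for the on-duty engineer
    sla_miss_callback=notify_on_duty_engineer,
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_stage", python_callable=load_stage)
    check = PythonOperator(task_id="validate", python_callable=validate)
    release = PythonOperator(task_id="publish", python_callable=publish)

    load >> check >> release  # publishing is explicit and happens only after checks
```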
You can see more details in our presentation at Airflow Summit 2020.
Make sure that needed data sources are available at the right time
Current state: Planned for the near future
As we’ve described in the data validation section, we’re barely involved in the domain itself, so we mostly provide raw data.
We’ve designed a process that helps us transfer all the needed data sources or events within one or two hours of a request, and we provide an SLA to our clients that relevant data sources can be transferred within a day.
Transferring raw data covers the basic needs of our clients, but we also plan to improve data quality for derived data sources. We’re now working on a project to add data engineers inside the domain, inspired by the Data Mesh approach.
Our main goal is to increase the number of questions that can be answered through self-service analysis: use-case analysis of a feature, the state of the account visible to the user, adoption of a feature inside an account, and so on.
Communication with data users
Current state: Works well
We’ve separated our data into Production and Acceptance layers, promising trustworthy data and careful change management in the Production layer.
We’re also improving the deprecation process in the Production layer.
Production isn’t always a single-source-of-truth layer, though: data may sometimes be duplicated and unreliable for some use cases.
We use Google BigQuery and have made several datasets that could be qualified as “Gold Standard,” and constantly communicate the quality to our end users, carefully processing all the feedback and nudging users to use the single source of truth layer.
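One simple way to surface a “Gold Standard” marker directly in BigQuery is to tag the datasets with labels. The sketch below assumes the google-cloud-bigquery client and hypothetical dataset names; it’s an illustration of the idea rather than a description of our actual process:

```python
# Label "Gold Standard" datasets so users can see the marker in the BigQuery UI.
# Dataset names and the label key/value are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

GOLD_DATASETS = ["my-project.core_reporting", "my-project.product_metrics"]  # placeholders

for dataset_id in GOLD_DATASETS:
    dataset = client.get_dataset(dataset_id)
    dataset.labels = {**(dataset.labels or {}), "quality": "gold-standard"}
    client.update_dataset(dataset, ["labels"])  # only the labels field is updated
```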
We haven’t implemented data lineage and data monitoring yet, but we plan to do so to improve the usage of our “Gold Standard” layer, for example by automatically recommending new data sources.
We also plan to improve this process through better data discoverability and documentation.
Internal processes of data pipeline design
Current state: Works well
Data Mesh describes a decentralized, federated, and computational approach to data governance, while the classic data warehousing approach is a centralized warehouse managed by a single data engineering team.
As for now, we have something in between. Our data domain is so big and complex that it can’t be managed by a single team, but we haven’t adopted the decentralized approach just yet. As we’ve described in “Make sure that needed data sources are available at the right time,” we have a plan to introduce the domain teams, so we’ll need federated and computational governance.
Currently, our approach for providing the raw layer is centralized: We review data, code, and validations inside the whole team. We also have data analysts who are responsible for designing and providing derived data sources; their data sources are reviewed internally in analytical domain teams, so they’re not following the same standards.
Our current approach is working well but as we dive deeper into domains, we have a plan to improve our governance approach.
Another thing worth mentioning is our engineering practices. They’re working well for us, and our on-duty and review processes help us share knowledge. For typical use cases, we have checklists that help us maintain a high standard.
Knowledge sharing about the data domain
Current state: In progress
We created a service based on XWiki that provides manually written documentation for data sources together with automatically generated metadata: usage information, links to Airflow DAGs and BigQuery tables, and data lineage.
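As a rough illustration of the idea, the sketch below combines a manually written description with automatically collected metadata into a single page body; the data structure and the rendering are hypothetical and far simpler than the real XWiki integration:

```python
# A hypothetical documentation record for a data source, rendered as a simple
# wiki-style page body. Field names and the rendering are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSourceDoc:
    name: str
    description: str                     # written manually by the data engineer
    airflow_dag_url: str                 # collected automatically
    bigquery_table: str                  # collected automatically
    monthly_query_count: int             # collected automatically
    upstream_sources: List[str] = field(default_factory=list)  # lineage


def render_page(doc: DataSourceDoc) -> str:
    lines = [
        f"= {doc.name} =",
        doc.description,
        f"* Airflow DAG: {doc.airflow_dag_url}",
        f"* BigQuery table: {doc.bigquery_table}",
        f"* Queries in the last 30 days: {doc.monthly_query_count}",
        "* Upstream sources: " + ", ".join(doc.upstream_sources),
    ]
    return "\n".join(lines)
```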
Now we’re trying to improve adoption: adding documentation to the Definition of Ready for ETLs, helping other teams start using the documentation internally, linking documentation to Slack threads, and so on.
During our experiments with data engineers inside product domains, we’re making the documentation for data domains similar to Airbnb’s design specs.
Airbnb’s case study
This is a case study for Airbnb, compiled by the authors of the roadmap from publicly available information. It is based on Airbnb’s description of their data quality efforts (part 1, part 2).
Works well:
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Internal processes of data pipeline design
- Validation of data sources
- Knowledge sharing about the data domain
- Testing of data pipelines
- Make sure that needed data sources are available at the right time
Unknown:
- Communication with data users
Data quality practices
Validation of data sources
Current state: Works well
They have:
- Tooling for automatic validations, which is mandatory for new pipelines.
- Mandatory automatic and manual validations based on the design specification of all certified sources.
- A review of data validations before release.
Testing of data pipelines
Current state: Works well
Collect information about the usage of your data sources
Current state: Unknown
Cover all data sources with clear SLAs
Current state: Works well
- …We also require that teams incorporate data pipeline SLAs into their quarterly OKR planning.
- SLA Tracker is a visual analytics tool to facilitate a culture of data timeliness at Airbnb.
Make sure that needed data sources are available at the right time
Current state: Works well
Communication with data users
Current state: Works well
They’ve made a project that provides a clear guarantee of data quality:
- …Midas, the initiative we developed as a mechanism to unite the company behind a shared “gold standard” that serves as a guarantee of data quality at Airbnb.
- Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
Internal processes of data pipeline design
Current state: Works well
They have a Midas certification process that ensures the use of best practices and good data governance.
Knowledge sharing about the data domain
Current state: Works well
- Usability: certified data is clearly labeled in internal tools, and supported by extensive documentation of definitions and computation logic.
- Design specs are reviewed before data source implementation to share knowledge.
Uber’s case study
This is a case study for Uber, based on their article Journey Toward Better Data Culture From First Principles. It was compiled by the authors of the roadmap from publicly available information.
Works well:
- Testing of data pipelines
- Collect information about the usage of your data sources
- Cover all data sources with clear SLAs
- Communication with data users
- Knowledge sharing about the data domain
- Make sure that needed data sources are available at the right time
In progress:
- Validation of data sources
- Internal processes of data pipeline design
Data quality practices
Validation of data sources
Current state: In progress
- Freshness: time delay between production of data and when the data is 99.9% complete in the destination system including a watermark for completeness (default set to 3 9s), as simply optimizing for freshness without considering completeness leads to poor quality decisions
- Completeness: % of rows in the destination system compared to the # of rows in the source system
- Duplication: % of rows that have duplicate primary or unique keys, defaulting to 0% duplicate in raw data tables, while allowing for a small % of duplication in modeled tables
- Cross-data-center consistency: % of data loss when a copy of a dataset in the current datacenter is compared to the copy in another datacenter
- Semantic checks: captures critical properties of fields in the data such as null/not-null, uniqueness, # of distinct values, and range of values
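As an illustration of how one of these checks could be expressed (this is not Uber’s implementation), here’s a sketch that computes the share of rows with duplicated primary keys in a BigQuery table; the table and key column names are placeholders:

```python
# Compute the duplication metric: the share of rows whose primary key appears
# more than once. Table and key column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

DUPLICATION_SQL = """
SELECT
  SAFE_DIVIDE(
    COUNTIF(key_count > 1),  -- rows whose primary key is duplicated
    COUNT(*)
  ) AS duplicate_row_ratio
FROM (
  SELECT COUNT(*) OVER (PARTITION BY order_id) AS key_count
  FROM `my-project.raw.orders`
)
"""

row = next(iter(client.query(DUPLICATION_SQL).result()))
ratio = row.duplicate_row_ratio or 0.0
assert ratio == 0.0, f"Raw table has {ratio:.2%} duplicated rows"  # default: 0% in raw tables
```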
Testing of data pipelines
Current state: Works well
Collect information about the usage of your data sources
Current state: Works well
Usage is collected and provided in Databook.
Cover all data sources with clear SLAs
Current state: Works well
Make sure that needed data sources are available at the right time
Current state: Works well
Communication with data users
Current state: Works well
Internal processes of data pipeline design
Current state: In progress
Uber has a project for metrics standardization.
See more info in: The Journey Towards Metric Standardization.
They also collaborate with the engineering team to set up automatic capture of contextual information.
Knowledge sharing about the data domain
Current state: Works well
- Basic metadata: such as documentation, ownership information, pipelines, source code that produced the data, sample data, lineage, and tier of the artifact
- Usage metadata: statistics on who used it, when, popular queries, and artifacts that are used together
- Quality metadata: tests on the data, when do they run, which ones passed, and aggregate SLA provided by the data
- Cost metadata: resources used to compute and store the data, including monetary cost
- Bugs and SLAs: bugs filed against the artifact, incidents, recent alerts, and overall SLA in responding to issues by owners
See the article on Uber’s Databook for more info.
We can make it better together
Feel free to describe your case studies in the replies; I’ll be happy to write a second case studies article with your experiences included.
You can also schedule a meeting with Alexander Eliseev (the main maintainer of this roadmap) if you’d like help with applying this roadmap or if you have feedback.