From Talk to Action: Reviewing What Prevents You from Raising Data Quality

Karen Hsieh
Published in In the Pipeline
7 min read · Jul 18, 2024

Many people talk about data quality but don’t act. Tiankai Feng stated that “data quality management starts with human responsibility and collaboration, and not with buying a tool”.

In an era where data is a crucial asset, maintaining its quality is paramount. Despite widespread recognition of its importance, many organizations struggle to turn this understanding into action. This challenge is compounded by the misconception that purchasing a tool can solve all data quality issues.

Just like in fitness, buying a gym membership or workout equipment won’t automatically make you fit. True success requires consistent effort and commitment. Similarly, while tools play a vital role in data quality management, they are not a silver bullet.

Recce, the data validation toolkit for dbt projects, was developed with the belief that true data quality starts with human efforts. It is designed to facilitate and encourage these efforts, making it easier for teams to collaborate and uphold high data standards. It’s about empowering people to take responsibility and work together towards maintaining data quality.

Bad Data Quality Breaks Trust

Poor data quality impacts teams across the organization, leading to a significant erosion of trust. When data is unreliable, it undermines confidence in reports, dashboards, and analytics. Decision-makers become hesitant, and the organization may suffer from indecisiveness and inefficiency. Furthermore, external stakeholders, such as customers and partners, lose trust when they encounter inconsistencies and errors in the data presented to them. This erosion of trust can lead to strained relationships and a tarnished reputation.

  • Errors in Downstream Tables: Renaming only the source tables without updating downstream dependencies can lead to cascading errors, affecting multiple reports and analyses.
  • Empty Numbers in Business Reports: When referral columns no longer exist, business reports show missing or incorrect numbers, causing confusion and misguided decisions.
  • Downtime on Reports and Dashboards: Inaccurate or incomplete data can lead to downtime on critical reports and dashboards, delaying important business insights.
  • Failure of Training ML Models: Poor data quality can lead to failures in training machine learning models, resulting in inaccurate predictions and unreliable outcomes.

A bad data merge may cause a flood of customer calls, leave executives without reports in crucial meetings, and destroy the efforts of your colleagues. These are not mere examples; these are real cases we heard from our user interviews. They illustrate the pervasive impact of poor data quality and underscore the importance of maintaining high standards to preserve trust within the organization and with external stakeholders.

What’s Stopping You from Taking Action on Data Quality?

You’ve probably heard about the issues mentioned above or even experienced them within your own team. So why haven’t you taken action? 😵 The answer often lies in the absence of certain critical factors. Without them, organizations struggle to move from discussion to implementation. Below are the key factors whose absence keeps individuals and teams from committing to data quality.

You’ve Gotten Away Without It So Far

When real cases happen, people learn. As the saying goes, “No amount of guidance from others can replace the lesson learned from one’s own mistake.” It often takes experiencing a significant data quality issue firsthand to truly understand its impact. Data quality problems can severely damage trust, both within the organization and with external stakeholders. Only after facing these challenges do teams fully appreciate the importance of proactive data quality management.

Absence of Leadership Support and Resources

Effective data quality management requires support from leadership. This includes allocating resources, setting priorities, and fostering a culture that values data quality. Without leadership driving initiatives and making data quality a strategic focus, it is challenging to make significant improvements.

Lack of Cross-Team Collaboration

High data quality is essential when multiple teams, such as Data Engineers (DE), Data Scientists (DS), and Data Analysts (DA), work together in the data pipeline. Without cross-functional collaboration, constant firefighting becomes the norm. Ensuring everyone works with reliable data requires effective communication and cooperation across teams.

Insufficient Management of High Volume Data Models

Managing a vast number of data models, often 500+ or even 1,000+, requires meticulous attention to detail and systematic processes to ensure data integrity across all models. Without these processes in place, the complexity can quickly lead to errors and inconsistencies.

Inadequate Handling of Complex Data Pipelines

The complexity of data pipelines, often represented by long Directed Acyclic Graphs (DAGs) that span multiple sources, stages, intermediates, and marts, all the way to multiple BI tools, data products, and ML models, means that a small error can propagate and magnify, causing significant issues downstream. Without rigorous data quality measures, managing this complexity is highly challenging.
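To make the propagation concrete, here is a minimal sketch of how one might list everything downstream of a single model. The lineage map and model names are hypothetical, invented only for illustration; in a real project the lineage would come from your own tooling.

```python
from collections import deque

# Hypothetical lineage: each node maps to the nodes that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["int_order_items", "orders_mart"],
    "int_order_items": ["orders_mart"],
    "orders_mart": ["revenue_dashboard", "churn_model"],
    "revenue_dashboard": [],
    "churn_model": [],
}


def downstream_of(model, lineage):
    """Return every node reachable from `model`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(model, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited through another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


# A mistake in one staging model reaches a mart, a dashboard, and an ML model.
affected = downstream_of("stg_orders", LINEAGE)
print(affected)
```

Even in this toy graph, one staging model touches four downstream assets, which is why a small error rarely stays small.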

Reliance Solely on Code Testing and CI

Even with thorough testing and continuous integration (CI), data issues can still occur. Basic dbt tests and additional checks such as syntax checks, SQLFluff, dbt-project-evaluator, and dbt-checkpoint are essential but not sufficient on their own. Relying solely on code testing and CI leaves gaps, so robust data tests are also required to ensure the integrity and reliability of the data itself. Organizations must keep refining their processes and addressing emerging issues to maintain data quality.
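As an illustration of checking the data itself rather than the code, the sketch below profiles one column in two builds of the same model and reports what changed. The function names, rows, and column are all hypothetical, and this is not a description of how any particular tool works; it only shows the idea of comparing data before and after a change.

```python
def profile(rows, column):
    """Summarize one column: row count, null count, distinct non-null values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def diff_profiles(before, after):
    """Return only the metrics that changed, as (before, after) pairs."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


# Hypothetical data: the same model built from production and from a change.
prod_rows = [
    {"order_id": 1, "referral_code": "A"},
    {"order_id": 2, "referral_code": "B"},
    {"order_id": 3, "referral_code": None},
]
dev_rows = [
    {"order_id": 1, "referral_code": None},
    {"order_id": 2, "referral_code": None},
    {"order_id": 3, "referral_code": None},
]

changes = diff_profiles(
    profile(prod_rows, "referral_code"),
    profile(dev_rows, "referral_code"),
)
print(changes)
```

The code change behind `dev_rows` could pass linting and compile cleanly, yet the profile shows the column going entirely null, which is the kind of issue only a look at the data reveals.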

Lack of a Culture of Accountability

A culture of accountability is essential for maintaining high data quality. When everyone in the organization understands their role in data quality and takes ownership, the overall quality improves. Without training and awareness programs, fostering this culture is difficult, and data quality remains a fragmented responsibility.

Practical Steps to Achieve High Data Quality

While the absence of certain elements can hinder action on data quality, it’s not necessary to have everything in place before you start making improvements. Instead, use these points to review your current situation and identify what might be missing. For example, you might not have a huge number of data models or a complex data pipeline, but if you have a culture of accountability supported by leadership, you are already in a strong position to take action. Assess your specific context, leverage your strengths, and address any gaps to enhance your data quality practices effectively.

Conduct Root Cause Analysis

Effective root cause analysis is the first step towards resolving data quality issues. It involves identifying the underlying causes of data problems rather than just addressing the symptoms. Here’s how you can approach it:

  • Retro to Face the Problems: Conduct retrospectives to face the problems head-on. This involves setting aside time to reflect on recent data issues, understanding what went wrong, and discussing potential improvements with the team.
  • Collaborate and Discuss: Engage in collaborative discussions within the team to dive deep into the problems. Utilize the collective knowledge and experience of your team members to uncover the key factors contributing to data quality issues.

Research Best Practices

Ensuring high data quality requires staying informed about and implementing best practices. These practices provide a foundation for effective data quality management and can significantly improve the reliability of your data.

You will want to research best practices thoroughly before implementing new workflows, because the research you do during these months will influence the decisions that shape data quality for years.

Typically, it starts with an individual surveying solutions, running proofs of concept (PoCs) locally, sharing the options with the team, and deciding on a solution together. It’s no different from evaluating any other tech tool.

Make Strategic Build or Buy Decisions

Ethan Aaron discussed the myth of Build vs Buy in the data world.

Does your team know Python? You are going to BUILD.
Does your team not know Python? You are going to BUY.

Ethan argues that tech-driven data teams love building things, while business-driven data teams buy tools since they cannot build. It’s natural. He reminds us to make decisions based on team and company priorities, considering maintenance and iteration efforts.

Even if your team has the ability to build, it’s important to evaluate each situation carefully. Leverage open-source solutions and contribute to projects if they lack something you need. Calculate the effort involved, including working hours and budget. Often, getting a great deal with a paid service can be less costly than dedicating extensive engineering time to building a solution from scratch.
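A back-of-the-envelope version of that calculation might look like the sketch below. Every number is a placeholder assumption, not a benchmark; substitute your own hours, rates, and quotes.

```python
def build_cost(build_hours, yearly_maintenance_hours, hourly_rate, years):
    """Engineering cost of building and then maintaining a tool in-house."""
    return (build_hours + yearly_maintenance_hours * years) * hourly_rate


def buy_cost(yearly_license, onboarding_hours, hourly_rate, years):
    """License fees plus the internal time spent adopting a paid service."""
    return yearly_license * years + onboarding_hours * hourly_rate


# Placeholder inputs for a three-year horizon (all values are assumptions).
build = build_cost(build_hours=400, yearly_maintenance_hours=120, hourly_rate=80, years=3)
buy = buy_cost(yearly_license=12_000, onboarding_hours=40, hourly_rate=80, years=3)

print(f"build: {build}, buy: {buy}")
print("cheaper option:", "buy" if buy < build else "build")
```

With these made-up inputs buying comes out cheaper, but the result flips easily as the inputs change, so the value lies in writing the assumptions down and comparing them against team and company priorities.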

Taking Action on Data Quality

Achieving and maintaining high data quality is a multifaceted challenge, but it’s crucial for building trust, ensuring accurate decision-making, and driving organizational success. By understanding the factors that influence data quality and implementing practical steps, you can make significant improvements.

Now is the time to act 💪.

We’d love to hear your stories and experiences with data quality. Share what you do to raise data quality and the challenges you face. We’d like to help! 🙌

Recce
