5 Common Misconceptions About Data Quality for Data Engineers

Jens Wilms · Published in In the Pipeline · 4 min read · Sep 7, 2023

As a product manager in the data space, I’ve talked to over a hundred data engineers, and one thing keeps coming up: everybody knows data quality is something they should work on, yet it often gets overlooked or pushed to the backlog. That is completely unnecessary. These are the five common misconceptions that prevent data engineers from having complete trust in their data.

Myth 1: Improving Data Quality Takes Too Long to Set Up

Improving data quality? Do you mean manually setting up tests for every pipeline, discussing every data quality rule with subject-matter experts (SMEs), and monitoring every pipeline daily?

It doesn’t have to be that difficult. Many tools in the modern data stack have made it almost effortless to build trust in your data. dbt can provide a manageable single source of truth; a CI/CD process with GitHub Actions or GitLab CI/CD can catch breaking changes before they are merged; and tools like PipeRider make it easy to set up monitoring and validation so that even subtle issues such as data drift are caught early.
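To make that concrete, here is a minimal, hand-rolled sketch of the kind of checks these tools generate and run for you (not-null, uniqueness, accepted ranges). The orders table and its columns are hypothetical; in a real project you would declare the equivalent as dbt tests or let PipeRider profile the table rather than writing it by hand:

```python
# A sketch of basic data quality checks; table and column names are hypothetical.
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality violations found in the orders table."""
    problems = []
    if df["order_id"].isnull().any():
        problems.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        problems.append("order_id is not unique")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems

if __name__ == "__main__":
    orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 7.5]})
    for problem in check_orders(orders):
        print(f"FAILED: {problem}")
```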

Myth 2: Maintaining High Data Quality is Expensive

Some of the tools available for data monitoring can seem expensive, and the bill can grow quickly when they run heavy queries on Snowflake.

But in reality, putting no effort into data quality is more expensive. A study by IBM estimated that bad data costs the US economy around $3 trillion a year. Not being proactive about data quality means bad business decisions and, perhaps more importantly, dropping everything to go into “firefighting” mode when an analyst discovers an issue.

Investing time and money in data quality infrastructure and practices may appear costly up front, but it pays off in the long run. Moreover, several easy-to-use open-source tools can get you started without a significant investment.

Myth 3: It’s Difficult to Track Changes and Lineage

As data pipelines grow larger and more complex across multiple models, schemas, and sources, many assume that tracking end-to-end lineage, and the upstream and downstream impact of any change, inherently requires substantial manual effort.

However, tools like PipeRider integrated with dbt provide easy, automated ways to gain visibility into data quality and changes across pipelines. PipeRider can automatically generate visual lineage graphs showing the dependencies between dbt models and how a new pull request will change them. Compared to piecing lineage together manually, this gives a bird’s-eye view of how data flows through the DAG of models. PipeRider’s lineage diff feature can also highlight the specific changes between code versions, so impacted areas are easy to spot. Instead of treating the pipeline as a black box, teams can trace data from raw sources to the final serving layers and see the impact of a change with just a couple of commands.
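If it helps to picture what a lineage diff does conceptually, here is a toy sketch (not PipeRider’s actual implementation): given the dependency graph of models, find everything downstream of a change. The model names follow a hypothetical dbt-style staging/marts layout:

```python
# Toy impact analysis: find all models downstream of a changed model.
from collections import defaultdict

def downstream_impact(parents: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Return all models that depend, directly or transitively, on a changed model."""
    children = defaultdict(set)
    for model, deps in parents.items():
        for dep in deps:
            children[dep].add(model)

    impacted, stack = set(), list(changed)
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# Hypothetical dbt-style DAG: staging -> intermediate -> marts
dag = {
    "stg_orders": [],
    "int_orders_enriched": ["stg_orders"],
    "fct_revenue": ["int_orders_enriched"],
    "dim_customers": [],
}
print(sorted(downstream_impact(dag, {"stg_orders"})))
# ['fct_revenue', 'int_orders_enriched']
```

The real tools build this graph from your dbt project automatically; the point is simply that impact analysis is a small graph traversal, not a manual archaeology exercise.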

Myth 4: Data Quality is Just About Monitoring

Many believe that guaranteeing strong data quality is only about reactive monitoring — running checks in production to identify issues and then scrambling to address them. The assumption is that as long as you can detect problems, you can manage quality.

In practice, purely reactive monitoring is not enough; you need a proactive strategy. Steps like version control, thorough impact analysis of changes, testing frameworks, and gradual deployment help prevent issues before they ever reach production. Proactive data governance also facilitates collaboration and knowledge sharing around improving quality. This doesn’t have to be difficult: implementing checks at various stages of the process, as sketched below, goes a long way. Rather than just sounding alarms, modern tools like PipeRider help bake quality directly into development workflows.
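As an illustration of what “checks at various stages” can look like, here is a minimal sketch of a quality gate that could run as a CI step: it executes a set of checks and returns a non-zero exit code on failure so the pipeline blocks the merge. The checks themselves are placeholders; in a real setup they would be dbt tests, PipeRider checks, or queries against a staging environment:

```python
# Sketch of a proactive quality gate for CI; the checks are placeholders.
import sys

def run_checks() -> dict[str, bool]:
    """Placeholder checks; replace with real queries against a staging environment."""
    return {
        "row_count_above_zero": True,
        "no_null_primary_keys": True,
        "schema_matches_contract": False,  # pretend this one failed
    }

def main() -> int:
    results = run_checks()
    failures = [name for name, passed in results.items() if not passed]
    for name in failures:
        print(f"FAILED: {name}")
    return 1 if failures else 0  # non-zero exit blocks the merge/deploy

if __name__ == "__main__":
    sys.exit(main())
```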

Myth 5: There’s No Easy Way to Collaborate on Data Quality

The one thing that frustrates all data engineers: never-ending meetings. It’s frustrating to deal with data quality on top of an already-packed meeting schedule, isn’t it? The misalignment arises when downstream data consumers, such as analysts, define their quality requirements separately from the upstream engineering implementation, resulting in the need for even more meetings.

Nevertheless, there are several practices that can greatly enhance collaboration: data contracts, CI/CD checks, Slack notifications, and shared reports. Ultimately, the goal is to establish a self-service approach to managing data quality.
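To show how lightweight a data contract can be, here is a minimal sketch: the analyst-facing columns and types are declared once, and the producer validates every change against them in CI instead of in a meeting. The contract, table, and column names here are all hypothetical:

```python
# Sketch of a data contract check; contract and column names are hypothetical.
import pandas as pd

ORDERS_CONTRACT = {
    "order_id": "int64",
    "customer_id": "int64",
    "amount": "float64",
}

def validate_contract(df: pd.DataFrame, contract: dict[str, str]) -> list[str]:
    """Return a list of contract violations for the given DataFrame."""
    violations = []
    for column, dtype in contract.items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            violations.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return violations

orders = pd.DataFrame({
    "order_id": pd.Series([1, 2], dtype="int64"),
    "customer_id": pd.Series([10, 11], dtype="int64"),
    "amount": pd.Series([9.99, 5.00], dtype="float64"),
})
print(validate_contract(orders, ORDERS_CONTRACT) or "contract satisfied")
```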

In conclusion, understanding these five misconceptions about data quality will empower data engineers to make better decisions when it comes to managing their data pipelines. With the right investment in tools, processes, and collaboration, it’s possible to achieve high data quality without breaking the bank or losing precious time.
