Data quality: Garbage in, garbage out
Fixing typos and the like is a $4.5 billion market
My first job years ago was building financial models for banks. This was back when model features were hand-crafted and approved by steering committees. Pre-deep learning days. Since features were hand-crafted, extensive data quality checks were the norm. I regularly audited how data flowed from one system to another. ‘Garbage in, garbage out’ (GIGO — one letter away from the Filipino word for ‘stupid’), as my ex-boss would say.
Data quality was a big problem. There would always be that duplicate, outlier, missing data, corrupted text, or typo. But even after years of advances in data engineering and “artificial intelligence”, data quality remains a big problem. In fact, it is a growing problem. But that is also why it is an exciting problem to solve.
As always, I like to preface my articles with what it is not about:
- This is not about what data quality is as a concept and how it is solved. There are great articles already out there: here, here, and here
- This is focused on structured tabular data. The nature of data quality issues differs across data types like images/video, audio, and text
Why data quality is a growing problem
More decisions are based on data. The growth of BI tools like Tableau, market intelligence tools like AppAnnie, and A/B testing tools like Optimizely all points to the trend of data-driven decisions. These tools are useless if the underlying data cannot be trusted, and we lose trust in them fast if data quality issues keep popping up. Bad decisions are made and business opportunities are lost. Gartner regularly surveys the cost of bad data, and for years it has remained above $10 million per enterprise, even though enterprises spend around $200,000 annually on data quality tools.
More companies are built on data. Waves of machine learning companies depend on it — look at the chart below :). If you’ve talked to machine learning engineers and data scientists, you’ll always hear how important good data is.
Why data quality remains a problem
Lots of data, but static one-time checks are still the status quo. The cloud changed the mindset from storing only useful data to storing all potentially useful data. Streaming technologies like Kafka and cloud data warehouses like Snowflake are lowering the friction to store more data — at higher velocity, wider variety, and larger volume. To ensure quality, data rules specifying ranges and schemas are manually written by consultants or data quality analysts. This is an expensive, time-consuming process that is usually done only once. Set and forget. There are two problems with this. The first is that the majority of rules are learned in production — when a problem surfaces. A case of unknown unknowns. The second is that data is dynamic, so the rules need to be dynamic too.
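To make "rules specifying ranges and schemas" concrete, here is a minimal sketch of what a hand-written, static check might look like in Python with pandas. The table, column names, dtypes, and thresholds are all hypothetical; the point is that the rules are frozen at the time they are written.

```python
import pandas as pd

# Hypothetical hand-written data quality rules for a transactions table.
# These values are decided once by an analyst and then "set and forget".
EXPECTED_SCHEMA = {"txn_id": "int64", "amount": "float64", "country": "object"}
AMOUNT_RANGE = (0.0, 10_000.0)          # assumed valid range for `amount`
ALLOWED_COUNTRIES = {"US", "GB", "PH"}  # assumed valid values for `country`

def run_static_checks(df: pd.DataFrame) -> list[str]:
    """Return human-readable violations of the static rules."""
    violations = []

    # Schema check: every expected column must exist with the expected dtype.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # Range check: amounts outside the hard-coded bounds.
    if "amount" in df.columns:
        lo, hi = AMOUNT_RANGE
        bad = df[(df["amount"] < lo) | (df["amount"] > hi)]
        if len(bad):
            violations.append(f"amount out of range for {len(bad)} rows")

    # Allowed-values check: countries not in the hard-coded set.
    if "country" in df.columns:
        unknown = set(df["country"].dropna()) - ALLOWED_COUNTRIES
        if unknown:
            violations.append(f"unexpected country codes: {sorted(unknown)}")

    return violations
```

The two failure modes above fall straight out of this sketch: the checks only cover what someone thought to write down, and if the upstream system starts sending a new country code or a wider range of amounts, the rules either fire constantly or quietly go stale.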
“Trusted” sources of data are still problematic. Three sectors have data quality top of mind more than any other: the public sector, healthcare, and financial services. Yet these industries continue to be plagued with bad data. The data governance leader of a healthcare company told me that the government data they receive monthly is littered with errors. They needed dedicated staff to clean it. Imagine, we are being governed based on dirty data. On the financial services side, stock data from tier 1 data providers often has problems. Bloomberg is known for having better data quality than other vendors, but I remember always having to sense-check the data and tell customer support when it did not make sense.
Specialization of data professionals. Data professionals used to cover more of the gamut of ingesting, transforming, cleaning, and analyzing/querying data. Now we have different professionals tackling each step. Data engineers handle ingestion and transformation (sometimes transformation is skipped and data is pushed directly into data warehouses). Data scientists build models and are busy tuning deep learning knobs to improve accuracy. And business analysts query the data and write reports. They have the most context on what the data should look like, yet they are separate from the data engineers who write and implement the data quality checks.
TL;DR: a growing problem with no great solutions
Huge opportunity up for grabs
The current data quality software market is almost $2 billion. But the real opportunity is much bigger, around $4.5 billion, since there is a large labor component that can be automated. There are 21,000 data quality roles in the US, each paid $80,000 annually (roughly $1.7 billion in direct labor). This doesn’t capture people working in professional services firms whose job title is ‘consultant’. Assuming they multiply the direct labor component by 1.5x, the total labor component is about $2.5 billion. The market will continue to grow healthily as data quality becomes more important in ML applications and business decisions.
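A quick back-of-envelope with the figures above, for anyone who wants to check the arithmetic. The 1.5x consultant multiplier is the author's stated assumption, and all figures are rounded.

```python
# Back-of-envelope market sizing from the figures cited above.
software_market = 2.0e9      # current data quality software market (~$2B)
roles = 21_000               # data quality roles in the US
salary = 80_000              # average annual pay per role
consultant_multiplier = 1.5  # assumed uplift for professional services staff

direct_labor = roles * salary                       # ~$1.7B
total_labor = direct_labor * consultant_multiplier  # ~$2.5B
total_opportunity = software_market + total_labor   # ~$4.5B

print(f"direct labor:      ${direct_labor / 1e9:.2f}B")
print(f"total labor:       ${total_labor / 1e9:.2f}B")
print(f"total opportunity: ${total_opportunity / 1e9:.2f}B")
```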
While Informatica, SAP, Experian, Syncsort, and Talend currently dominate the market, a new wave of startups is cropping up. Current products and how they are sold can be much better — from being a turnkey solution that plugs into any part of the data pipeline, to not requiring a 6-month implementation period and 6-figure upfront commitments, to automating data quality rule writing and monitoring.