Easy, not Simple: Diagnosing Data Quality Issues
No data science program will be successful if its source data quality issues aren’t addressed. Anyone who says their source data doesn’t have quality issues hasn’t looked at it hard enough, or hasn’t talked enough to business users or data warehouse analysts. Every source system will have endemic quality issues, and it is the duty of the data professional to address them in a meaningful way.
Many organizations struggle even to define what the issues are, because it seems like an overly simple exercise in Who, What, When, and Where. But the exercise is easy, not simple.
Who
To diagnose your data quality issues, you’ll need to enlist some aid. If you are approaching this problem from an IT standpoint, you can’t rely on IT analysts alone; you’ll need to listen to all of the business users and other downstream users of the data to understand what they might be doing to transform that data into something usable. If you are a downstream analyst, internal data broker, or citizen data developer, you’ll need to listen to more than the concerns of those within your data consumer circle; you’ll also need to take into account the upstream vulnerabilities that IT may know of.
The more serious your data quality issues and the larger your organization, the larger the downstream systems will be. Throughout any data exercise, you’ll find many workarounds in the wild. Data is where business and IT collide, and it’s often a huge friction point of misunderstandings and…