Making Data Useful
All about data provenance
Unstructured data, inherited data, exhaust data, obfuscated data, and other goblins
If you’re about to jump on the citizen data scientist bandwagon, there are a few things you should know about data provenance…
Data provenance: “Who collected it and why?”
Society is plagued by distorted expectations about data and littered with nonsense like “numbers can’t lie” and “it’s just your opinion until you show me the data” (no, it’s still your opinion) and “I looked at data, so now I’m informed.”
There comes a time in every child’s life when they must learn that:
1) The tooth fairy isn’t real.
2) Things don’t just magically work out because you have some numbers. It really matters where those numbers came from. (Some children are a few decades overdue for this developmental milestone.)
Anyone can put some electronic scribbles in a table and call it data. That doesn’t make it good/true/useful/worthy in the sense you associate with Science.
Even if the dataset was collected carefully, are you sure you know what happened to it on its way to you? The only reason that a villain won’t remove inconvenient rows from your dream dataset (“Hide data from you? I would never! Those were outliers.”) or aggregate things in a way that skews the message before sharing data…

