In theory, theory and practice are the same. In practice, however…

…all this shiny data science has one catch.
Dataset preparation sounds so ordinary and uncool that virtually nobody writes about it, not even people in industry.

Furthermore, I often see things in papers that outright give me shivers, like “we ignored every data point which we were unable to parse.” Don’t they realize that “unparsable points” are not just outliers but usually a heavily biased subset, and that removing them biases the remaining dataset in some other, unknown way?
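
To make that concrete, here is a minimal sketch of the check I would like to see before anything gets dropped: count the parse failures per group instead of silently discarding them. Everything in it is made up for illustration (the records, the `source` field, the helper names `try_parse` and `drop_report`), and it assumes "parsing" just means converting a string to a number.

```python
from collections import Counter

# Hypothetical records: (source, raw_value). All values are invented for this sketch.
records = [
    ("web", "12.5"), ("web", "7.0"), ("web", "9.1"), ("web", "n/a"),
    ("phone", "3,5"), ("phone", "4,0"), ("phone", "2.5"), ("phone", "unknown"),
]

def try_parse(raw):
    """Return the value as a float, or None if it cannot be parsed."""
    try:
        return float(raw)
    except ValueError:
        return None

def drop_report(rows):
    """Fraction of rows per source that a 'drop what we cannot parse' rule would remove."""
    total = Counter(source for source, _ in rows)
    dropped = Counter(source for source, raw in rows if try_parse(raw) is None)
    return {source: dropped[source] / total[source] for source in total}

report = drop_report(records)
print(report)
```

In this toy data the "phone" rows use a decimal comma, so the cleanup removes most of them while leaving the "web" rows nearly intact. The surviving dataset is no longer a fair sample of both sources, which is exactly the kind of bias a one-line "we ignored unparsable points" hides.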


Let’s leave the resulting bias to the authors, though. We have a different problem, which is bad enough on its own: “cleaning” the dataset this way often means that the results, or even the entire method, do not apply to real-world tasks at all.

For example, “98% precision at speech recognition” sounds great, until you realize that the missing 2% were street names, addresses, and phone numbers, which are “out of vocabulary” because each of them is too rare on its own. And “2% of words” quickly turns into something like “70% of real-world dialogs failed to parse”.
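
A back-of-the-envelope sketch of how a small per-word error rate compounds into a large per-dialog failure rate. It assumes that word errors are independent and that a single misrecognized word is enough to break a dialog; both are simplifications of mine, and the function names are hypothetical.

```python
import math

def dialog_failure_rate(word_error_rate: float, words_per_dialog: int) -> float:
    """Probability that at least one word in a dialog is misrecognized,
    assuming independent errors (a simplifying assumption)."""
    return 1.0 - (1.0 - word_error_rate) ** words_per_dialog

def words_for_failure_rate(word_error_rate: float, target: float) -> int:
    """Smallest dialog length (in words) at which the failure rate reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - word_error_rate))

print(dialog_failure_rate(0.02, 60))       # roughly 0.70
print(words_for_failure_rate(0.02, 0.70))  # 60
```

Under these assumptions, a dialog of about 60 words is already enough to go from a 2% word error rate to a 70% dialog failure rate. In reality the errors are not independent: they cluster on exactly the words the task depends on, such as the address or the phone number, so the real picture can be worse than this estimate suggests.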

I can understand academics who need to get the paper published instead of trying to bite off too much, of course. But this gap between theory and practice creates a heavily skewed impression, and a nasty shock for people who move from studying the theory to applying it.

And the belief that this problem does not exist also kills the market niche for products that could solve it, because why would you pay for a product that solves a problem you don’t believe exists? Until you face it yourself, of course.

Now, I have a decent enough programming background to deal with it myself, so the real question is: what do I recommend to people who don’t? Any ideas?