With AI and Data, it’s “Junk in, Junk Out”

Eric David Halsey · Published in Source Institute · Jun 6, 2017
Does your data look like this? Photo credit: rawdonfox via Foter.com / CC BY

At a recent fixer session, part of a practical AI course we're building, we helped Brian Hankey analyze data from a payday loan company to determine which factors predict loan repayment. There I ran into the concept of "cleaning" data. I'd encountered the idea before in reading about AI, but it turned out to be far more important than I expected. Here's what I learned:

First, data cleaning covers a variety of techniques for preparing data for use by all kinds of AI algorithms (more on that in a later post). This includes things like turning all the data into numbers (male and female into 1 and 0), making sure it's all on the same scale (not some data from 1–5 and other data from 0–100), and getting rid of data that won't be useful or could be misleading (I'll sketch the first two steps in code below). On that last point, the following question came up while we worked through the process:

Salim Virani: Why wouldn’t you include [all the data]?

Valyo Yolovski: Because your model can actually start performing worse if you feed it junk. It's junk in, junk out.
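To make those first two cleaning steps concrete, here's a minimal pandas sketch. The column names are hypothetical, invented for illustration rather than taken from Brian's dataset:

```python
import pandas as pd

# Hypothetical loan data. The column names are invented for
# illustration and are not from Brian's actual dataset.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "credit_score": [520, 710, 640, 580],   # on a 300-850 scale
    "satisfaction": [3, 5, 2, 4],           # on a 1-5 scale
})

# Step 1: turn categories into numbers (male -> 1, female -> 0).
df["gender"] = (df["gender"] == "male").astype(int)

# Step 2: put everything on the same scale (min-max to 0-1), so a
# 1-5 column isn't dwarfed by a 300-850 column.
for col in ["credit_score", "satisfaction"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

print(df)
```

In practice you'd likely reach for library helpers that do the same thing, such as scikit-learn's MinMaxScaler, but the idea is identical.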

The models Valyo was referring to were decision trees and random forests (more on those here). He explained that these specific algorithms can actually perform worse when given more data, incomplete data included, as Vladimir (Vladi) Tsvetkov, the CTO of Hacker.Works, elaborated:

Salim Virani: Why would you exclude something that has mostly blanks?

Vladimir Tsvetkov: Yes, because it's not bringing any information when it's blank. And if you have a couple filled in, then you're fitting the model to just those two records.

Salim Virani: So it overfits [matches too closely with the data you have, such that it's not good at predicting with new data]?

Vladimir Tsvetkov: Yeah, it might.
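As a rough sketch of that rule of thumb, assuming pandas and scikit-learn (the column names and the 50%-blank threshold are my own illustration, not from the session):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical feature table: 200 loans, one column blank in all but 2 rows.
df = pd.DataFrame({
    "income": rng.normal(3000, 800, 200),
    "loan_amount": rng.normal(500, 150, 200),
    "obscure_field": [np.nan] * 198 + [1.0, 0.0],  # filled in for only 2 records
})
repaid = rng.integers(0, 2, 200)  # stand-in label: was the loan repaid?

# Drop any column that is blank in more than half the rows; with only
# two values present, the model would just memorize those two records.
mostly_blank = df.columns[df.isna().mean() > 0.5]
X = df.drop(columns=mostly_blank)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, repaid)
print("dropped:", list(mostly_blank))
```

Dropping the column outright is the blunt version of Vladi's advice; if a column had enough filled-in values, imputing the blanks might be worth considering instead.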

But more than that, there’s a broader point about how we should think about data. Sally Hadidi phrased it this way:

The problem with Big Data is that it is inherently a big, ugly mess that data scientists have to spend hours or days unraveling and putting in order before they can do anything meaningful.

Now she’s talking about big data, but what we found in this fixer session was that the same principle (albeit on a smaller scale) applied to just a few hundred data points. We underestimated the time and importance of data cleaning. As a result, the fixer session became largely about that process. Without cleaning the data in this way, we risked getting “junk” results.

In this context, Valyo framed the dangers of seeing AI as just putting data into an algorithm and getting great results:

That's the danger of API machine learning. It's the same thing as: 'I have this machine learning classifier, one line of code; I put data in and it spits something out, and I don't even know what's going on, but someone gave me data and I put it there.' That doesn't work. You need people who understand how things work underneath.

For Brian, this was an important takeaway. By the end of the session, it was clear to him that the first step toward the conclusions he wanted was tackling problems with his data: combining it and cleaning it. That realization led him to rethink how he would use algorithms to analyze his data, and to reconsider how he would acquire and store data in the future.
