Death, Taxes, and Data Cleaning

Eric David Halsey
Published in Source Institute
4 min read · Jun 6, 2017
Photo credit: the@w00d via Foter.com / CC BY-NC-SA

Brian Hankey ran a payday loan business in Singapore. He wanted to use his payment data to predict future defaults and he knew his data was getting less useful with every passing day.

We organized a 2.5-hour session between him, Valyo Yolovski, a data scientist who was just finishing up a year of work on a series of quick AI projects, and Vladimir (Vladi) Tsvetkov, the CTO of Hacker.Works. We’re building an online course based on practical AI challenges like this. Valyo started the session off:

It would be useful to first look at whether there are any high correlations between any of the features.

A feature is data scientist speak for the name of one of the variables, in this case, information like zip code, income, and whether or not the loan was repaid.

They started by using linear regression to spot those obvious correlations. Five minutes later, they were looking at something like a multiplication table showing the correlation between every pair of features in the data. It showed that higher income correlated with higher rent, and that having many other loans correlated with defaulting. Vladi explained the results and pointed out that they needed to go further to get something useful for Brian.

I mean, if we look at who paid and who hasn’t, who’s been approved and who hasn’t, most likely we’ll come up with samples from a data science book, some obvious correlations. Like between rent and income. There aren’t going to be any insights there.
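Still, that first correlation pass is quick to produce. Here is a minimal sketch of it in pandas; the file name and the columns (income, rent, defaulted, and so on) are hypothetical stand-ins for Brian’s actual loan data, with repayment assumed to be encoded as a 0/1 column:

```python
# Minimal sketch of the first step: a feature-correlation matrix with pandas.
# File and column names are hypothetical stand-ins for the real loan data.
import pandas as pd

loans = pd.read_csv("loans.csv")

# Pearson correlations between every pair of numeric features,
# laid out like a multiplication table.
corr = loans.corr(numeric_only=True)
print(corr.round(2))

# Features most correlated with defaulting (assumed 0/1), strongest first.
print(corr["defaulted"].abs().sort_values(ascending=False))
```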

Valyo and Vladi agreed that they’d likely be able to generate better loan approval guidelines with an optimised decision tree. In this case, a decision tree would be a way to ask the right questions to get a more accurate probability of repayment.

They’d get there by using a random forest algorithm. A random forest basically builds a whole bunch of different, randomly varied decision trees and combines their votes into a single prediction, which tends to be more accurate than any one tree on its own.
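The article doesn’t say which tools Valyo used, but as a rough sketch of what that step can look like with scikit-learn, assuming a cleaned CSV and hypothetical column names:

```python
# Rough sketch of the random-forest step with scikit-learn.
# File and column names are hypothetical; the real session used Brian's data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

loans = pd.read_csv("loans_clean.csv")
X = loans.drop(columns=["defaulted"])   # features: income, rent, other loans, ...
y = loans["defaulted"]                  # 1 = defaulted, 0 = repaid (assumed)

# Hold out part of the ~2,000 rows so accuracy isn't measured on memorized noise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# An ensemble of randomized decision trees whose votes are combined.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, forest.predict(X_test)))

# Which borrower traits drive the prediction of repayment vs. default.
for name, score in sorted(zip(X.columns, forest.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```

The feature importances at the end are the part that maps back to Brian’s question: which borrower traits actually predict repayment.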

Brian was eager to get started, but Valyo made it clear that wasn’t how this was going to work:

First is getting to know the data and cleaning it, and that takes three hours. Then, maybe in an hour and a half you actually do the machine learning part.

Brian’s eyes widened.

Valyo explained:

I think I need a few more hours for proper cleaning. Then we would have some results almost immediately, because they don’t take time with such a small data set, like 2,000… But then it’s going to begin recognizing things that are noise as patterns. But if you manage to bring up your accuracy, you might find some interesting insights.

But again, this might not happen with decision trees; they might get you to 75 or 80%, maybe even up to 90%, but then it’s going to plateau. It’s not going to move forward. And then you might need neural networks and more data. They might get you 95 or 96% accuracy, but then you’re not going to know what’s going on in the background. That’s the tradeoff.

After the last push to finish cleaning the data, Valyo stood up, put down his pizza, and explained why data cleaning isn’t just about preparing the data for an algorithm. It was also how he built a deeper understanding of the data he was working with:

I think a lot of things are going to come up in the data cleaning process that [the owner of the data] needs to be there to talk about it. And then you frame. You have a goal at the beginning, you do data cleaning, you do some basic statistics correlation, things like that, then you touch base again. Now I have a much better idea about what the data is. Now if I sit down and do things, it’s not going to be three hours, it’s going to be an hour and a half maybe. Because now I’m in the mindset, but that takes time. Then you can do the machine learning thing for like an hour, hour and a half: tweak models, test, discuss what’s going on…

You need people who understand how things underneath work. Not only understand, are good developers, but also need to have the thinking of “okay, this is the data and I understand what data is” that doesn’t just mean you can read the rows and understand that it’s clean, you understand the context behind the problem. So it’s credit card default rates, this is payday loans, I understand the payday loan industry.
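The cleaning itself is unglamorous. A minimal sketch of the kinds of passes Valyo describes, again with a hypothetical file and hypothetical column names:

```python
# Sketch of typical cleaning passes on a small loan data set.
# File and column names are hypothetical.
import pandas as pd

loans = pd.read_csv("loans_raw.csv")

# Drop exact duplicate rows and records we can't learn from.
loans = loans.drop_duplicates()
loans = loans.dropna(subset=["defaulted"])   # unlabeled rows are unusable

# Normalize messy text fields before they become features.
loans["zip_code"] = loans["zip_code"].astype(str).str.strip()

# Coerce numbers that arrived as strings ("S$1,200" style) into floats.
loans["income"] = (
    loans["income"].astype(str).str.replace(r"[^\d.]", "", regex=True).astype(float)
)

# Fill remaining gaps conservatively, and flag that we did so.
loans["rent_missing"] = loans["rent"].isna()
loans["rent"] = loans["rent"].fillna(loans["rent"].median())

loans.to_csv("loans_clean.csv", index=False)
```

Each of these passes forces exactly the kind of conversation Valyo mentions: why is this field missing, what does this odd value mean, which rows should count at all.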

The random forest did ultimately find two specific types of borrowers who were very likely to repay. But the unexpected lesson for me was that out of the 2.5-hour session, the first five minutes were spent on linear regression, another five on the random forest, and the remaining two hours and twenty minutes on cleaning the data.

That got me thinking about taxes. Because when you’re growing up, nobody ever explains how to pay them, but they’re so damned important. Data cleaning seemed to work the same way. You may not read about it a lot, but for most AI algorithms, it’s just a fact of life.

And nobody died, it was just a catchy headline. I regret nothing.
