Data Cleaning


A while ago I created my first real model at my last internship. I had already done logistic regressions and neural networks before, but the dataset was always ready to plug into a black box and it would just work. For example, with the MNIST dataset loaded from Keras, all you need to do is reshape the 28x28 pixels that correspond to a digit into a 1x784 vector and normalize them. It would look something like this:
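
```python
# Rough sketch of that preprocessing: load MNIST from Keras,
# flatten each 28x28 image into a 784-long vector and scale to [0, 1].
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0
```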

But in the real world, before we do anything, we must first select which information is relevant for our model, and only then can we treat it. So the first thing I had to do was readjust my expectations. I was not going to design a super cool neural network on day one. Actually designing the model is about 30% of the work. The biggest chunk of the job is understanding the problem and creating and cleaning the database.

My model was meant to predict which customers would close their accounts in the next 3 or 6 months. To select the variables from the almost literal sea of data that the bank has (you could even include the astrological sign of the customer, but I highly doubt it would be significant), we first had to pick the ones that would make sense: cash flow, credit, etc.

Checking the correlation between variables

To keep the algorithm stable, we need to guarantee that there is no strong correlation between the variables themselves, and that each variable has some sort of correlation with the event we are trying to predict.

Correlation with the response variable

Unfortunately, in the industry, we often don’t have the luxury of perfectionism. Deadlines are tight and therefore we need to be fast. At the bank, a simple model is usually built first so we can check with the client whether the results make sense. So a quick test we usually did there was to calculate the correlation between the response variable and a column of random numbers (some people even use the CPF, kind of like an SSN here in Brazil). If a variable has a lower correlation with the response than the random column we just created, we get rid of it.
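
In pandas, that quick test could look something like the sketch below (the dataframe df and the response column name closed_account are just placeholders for whatever the real table and flag are called):

```python
import numpy as np
import pandas as pd

# df is assumed to be the modeling table; "closed_account" stands in
# for the 0/1 response variable (will the customer close the account?).
rng = np.random.default_rng(42)
df["random_noise"] = rng.random(len(df))

# Absolute correlation of every candidate variable with the response.
corr = df.corr(numeric_only=True)["closed_account"].abs()
baseline = corr["random_noise"]

# Keep only the variables that beat the random column.
keep = corr[corr > baseline].index.drop(["closed_account"], errors="ignore")
print(keep.tolist())
```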

Correlation between each other

An example that helped me understand why that is relevant for logistic regressions was trying to predict whether a car accident will be fatal or not. Suppose that our database has two linearly dependent columns, which means they are perfectly correlated: speed in km/h and speed in m/s (v_ms ≈ 0.27*v_kmh). So our equation would look something like this:

log(P(fatal)/P(not_fatal)) = β0 + β_v_kmh*x1 + β_v_ms*x2 + (other terms)

But because the two columns are linearly dependent (x2 = 0.27*x1), it can be rewritten as

log(P(fatal)/P(not_fatal)) = β0 + (β_v_kmh + 0.27*β_v_ms)*x1 + (other terms)

Now suppose that speed is not relevant at all in predicting fatality (obviously not the case, but bear with me). We would then expect β_v_kmh = β_v_ms = 0. But because our variables are linearly dependent, we could just as well end up with β_v_kmh = -0.27 and β_v_ms = 1, which also makes the combined weight of x1 zero. The coefficients are not uniquely determined, and therefore our model is unstable.
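
In practice, a quick way to catch pairs like that before fitting anything is to look at the feature correlation matrix. A minimal pandas sketch, again assuming the candidate variables live in a dataframe df:

```python
import numpy as np
import pandas as pd

# Assumes df holds the candidate explanatory variables.
corr_matrix = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Flag columns that are almost linearly dependent on another one
# (the 0.9 threshold is an arbitrary choice) and drop one of each pair.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```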

Balancing the Dataset

The event we were trying to predict here (the closing of the accounts) is very rare: less than 1%. Which is great for the bank, as they are not losing business, and the chances of me getting a full-time offer are bigger, but the model gets way more complex (which can also be a plus, as it is waaay more interesting).

In every tutorial I’ve done so far, the ratio between events and non-events is nicely balanced, which makes it hard for a random guess to be an accurate prediction. But if the dataset is unbalanced, it is easy for the model to converge to a “fake optimal” solution. This happens especially when you don’t have access to the cost function of your neural network or logistic regression. As I was using the SAS routine proc neural, I could not write my own cost function (just choose from a few built-in ones).

This may be a bit confusing, so let me try to paint a picture: if you have an exam and you know that 99 of the questions have answer A and only 1 has answer B, you’d probably just mark A everywhere and call it a day. But for us, labeling every customer as one that will not close their account doesn’t help at all. We can’t make any decisions! (This also makes accuracy a terrible indicator of how well your model is performing. I will talk about that in a different post.)

When balancing the dataset, we have to be careful not to make the model too aggressive. If our balanced set has a ratio of 1:1, we will flag a lot more accounts as future cancellations, and the bank would probably lose money trying to retain customers that were not going to leave in the first place.

At the bank people did it in two ways. The first is to keep 100% of the events and sample non-events until the desired ratio is reached, which is great for a fast prototype but throws away a ton of information. On my baseline model, a 1:2 ratio would only use 0.6% of the dataset!!
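
A rough pandas sketch of that undersampling, assuming the same hypothetical closed_account flag and a target ratio of one event to two non-events:

```python
import pandas as pd

# Split the modeling table by the (assumed) 0/1 event flag.
events = df[df["closed_account"] == 1]
non_events = df[df["closed_account"] == 0]

# Keep every event and sample non-events for a 1:2 ratio, then shuffle.
sampled_non_events = non_events.sample(n=2 * len(events), random_state=42)
balanced = pd.concat([events, sampled_non_events]).sample(frac=1, random_state=42)
```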

Another way is to resample the events with replacement until we reach the desired ratio. But I feel that could lead to overfitting the model. That is a super uneducated guess though, so if you know more about it, I’d love to hear!
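
The oversampling version is just as short; sampling with replacement is what duplicates the rare events (same hypothetical closed_account flag as above):

```python
import pandas as pd

events = df[df["closed_account"] == 1]
non_events = df[df["closed_account"] == 0]

# Duplicate events (sampling with replacement) until the ratio is 1:2,
# while keeping every non-event.
oversampled_events = events.sample(n=len(non_events) // 2, replace=True, random_state=42)
balanced = pd.concat([oversampled_events, non_events]).sample(frac=1, random_state=42)
```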

In the end, building your own model instead of using canned routines from Python libraries like Keras or SAS’ proc neural would probably be the best way to do it. But writing a custom cost function would also require us to come up with a relationship between the cost of false negatives and false positives, and so on, which would take so much time that it was not feasible.

Next time I will talk about how we measured the effectiveness of our model using the Kolmogorov–Smirnov test.
