Dealing with data preprocessing problems

Wilame
Oct 1, 2018

From June 2020, I will no longer be using Medium to publish new stories. Please visit my personal blog if you want to continue reading my articles: https://vallant.in.

Using data from internet repositories to study Machine Learning (ML) is great, but it comes at a price: a lot of the data we use is already clean and ready to use. ML training datasets usually leave little to be done: you just have to account for missing data or create new variables from the existing ones.

But we don’t always get clean data to work with. In fact, you can run into a lot of problems when you receive a raw data file.

Data preprocessing problems come in many flavors, but some of the most common are:

  • Missing data
  • Manual input
  • Data inconsistency
  • Regional formats
  • Numerical units
  • Wrong data types
  • File manipulation
  • Missing anonymization

Let’s talk about each of these problems now.

Missing data

Missing data is so common that we are all used to it. There are many ways to deal with it, including deciding not to use incomplete records at all. But the best way to decide what to do with missing data is to understand why it is missing. It usually happens through one of three mechanisms:

Missing completely at random (MCAR) — the propensity for a data point to be missing has no relation to any values in the data set, missing or observed. It is therefore truly random. For instance, some salaries in a dataset may be missing because someone forgot to type them in or the information was simply lost.

MCAR: missing values are random

Missing at random (MAR) — MAR is more complicated. The behavior is still random, but the reason the data is missing is connected to another observed variable in the data set, not to the missing value itself. For instance, you may notice that managers are less inclined to disclose their salaries, independently of how much they earn.

MAR: notice that only managers omit their salaries. Omission is connected to the ‘Role’ variable.

Missing not at random (MNAR) — MNAR occurs when the ‘missingness’ depends on the missing data itself. Say you notice that people who earn more than 5 000 €/month omit their salaries: that is an MNAR case. Here, the omission is linked to the salary value, and therefore to the ‘Salary’ variable itself.

MNAR: missing salaries have a strong connection with the salary value itself.

What should you do with missing data? The decision is between imputing values and deleting the incomplete records. Deletion is acceptable in the first case (MCAR), but deleting MAR or MNAR data can lead to bias, because you are likely to end up with fewer representatives of a given group. A solution could be to delete the entire variable instead of deleting the cases.

The other possible solution to missing data is imputation. Here, you replace missing values with estimates such as the mean, median, mode, a random sample, an interpolated value, etc. If the variable is categorical, you can even create a new ‘N/A’ class.
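
As a minimal sketch of both ideas with pandas (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "role": ["analyst", "manager", "analyst", "manager"],
    "salary": [3200.0, np.nan, 2900.0, np.nan],
    "city": ["Paris", None, "Lyon", "Paris"],
})

# Numerical variable: impute the median of the observed values
df["salary"] = df["salary"].fillna(df["salary"].median())

# Categorical variable: create an explicit "N/A" class instead
df["city"] = df["city"].fillna("N/A")

print(df)
```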

Manual input

A manual input error is a problem because it can lead to missing data, but also to data inconsistency. Let’s revisit the salary example. Say that, instead of using only numerical values, someone inputs salaries with extra non-numerical symbols. Before working with this data, you have to remove those symbols in order to convert the column to a numerical type.
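
Here is one way that cleanup could look (a rough sketch: the example values are made up, and stripping every non-digit would also drop decimal marks, so adapt it to your data):

```python
import pandas as pd

# Salaries typed by hand, with stray currency symbols and spaces
salaries = pd.Series(["3 200 €", "EUR 2900", "4,100", "5000"])

# Strip everything that is not a digit, then convert to a number
cleaned = salaries.str.replace(r"\D", "", regex=True).astype(int)
print(cleaned.tolist())  # [3200, 2900, 4100, 5000]
```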

Data Inconsistency

Data inconsistency is a huge problem because it creates variations in the data that simply should not exist. A simple example is when someone starts representing gender as “male” and “female” and later switches to “M” and “F”. Since “male” and “M” mean the same thing, they should be represented in the same way, and likewise for “female” and “F”.
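
One common fix is to map every known spelling to a single canonical label, as in this small sketch (the labels and mapping are illustrative):

```python
import pandas as pd

gender = pd.Series(["male", "M", "female", "F", "Female"])

# Lowercase first, then map every known spelling to one canonical label
canonical = {"male": "M", "m": "M", "female": "F", "f": "F"}
gender = gender.str.lower().map(canonical)
print(gender.tolist())  # ['M', 'M', 'F', 'F', 'F']
```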

Another inconsistency problem is when different pieces of information are stored under the same variable. Say a phone number field is supposed to store one number, but someone inputs two. Which one should you use? In summary, data inconsistency compromises data integrity.

Regional formats

When you receive raw data, it’s important to check which format the data is expressed in. For instance, if you are working with dates, which format should you use? Day/month or month/day? Are decimal numbers written with a comma or with a period? Convert everything to the format you need and stick to it, and look out for inconsistent representations.
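
pandas lets you declare these conventions up front instead of guessing. A small sketch, assuming a European-style file (the sample data is invented):

```python
from io import StringIO
import pandas as pd

# European-style export: ';' separator, ',' decimal mark,
# '.' thousands separator, day/month/year dates
raw = "date;amount\n01/02/2018;1.234,56\n15/03/2018;987,10"

df = pd.read_csv(StringIO(raw), sep=";", decimal=",", thousands=".")
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
print(df)
```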

Numerical units

Data along the same dimension should be expressed in the same unit. If you decide to use Fahrenheit to measure temperature, don’t suddenly start storing Celsius in the same column. If you need both representations of the same measure, create separate columns. Also pay attention to things like “1 000” versus “1K”, or “1 mi” versus “1 000 000”. If two values measure the same thing, keep them in the same unit.
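
A minimal sketch of normalizing mixed temperature readings to one unit before they go into a single column (the readings are invented):

```python
def to_fahrenheit(value, unit):
    """Normalize a temperature reading to Fahrenheit."""
    if unit == "F":
        return value
    if unit == "C":
        return value * 9 / 5 + 32
    raise ValueError(f"Unknown unit: {unit}")

# Readings that arrived in mixed units
readings = [(68.0, "F"), (20.0, "C"), (98.6, "F")]
print([to_fahrenheit(v, u) for v, u in readings])  # [68.0, 68.0, 98.6]
```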

Wrong data types

Databases are strict about the types of information they store. This happens for optimization reasons, but also to avoid human error. So be careful when you add something to your database: “3” and “three” are the same thing to a human, but not to a computer. Always use the requested data type, and look for wrong representations in the data when you start working with it.
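
In pandas, coercion makes those wrong representations visible instead of letting them hide inside a text column. A small sketch with invented values:

```python
import pandas as pd

ages = pd.Series(["3", "three", "41", "12"])

# Coerce to numbers; anything unparseable ("three") becomes NaN,
# so it can be spotted and fixed instead of silently polluting a model
print(pd.to_numeric(ages, errors="coerce").tolist())
```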

File manipulation

Sometimes you have to open a file to see what’s inside, and sometimes the software you use for that can break your data. This happens, for instance, with Excel, which can silently alter dates, long numbers (such as credit card numbers), etc. File manipulation can also become a problem when you deal with CSV and other text formats: pay attention to things like the separator, the text qualifier, the encoding, etc.
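
A defensive way to load such a file is to declare everything explicitly. A sketch only: the file name and every parameter value below are assumptions about what the source system produced, not universal defaults.

```python
import pandas as pd

# Hypothetical export file; adjust each parameter to the actual source
df = pd.read_csv(
    "raw_export.csv",
    sep=";",             # field separator used by the export
    quotechar='"',       # text qualifier
    encoding="latin-1",  # encoding declared by the source system
    dtype=str,           # read everything as text: no type guessing,
)                        # so long card numbers are not mangled
```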

Missing anonymization

Not all data is good for every purpose. In fact, some data must be anonymized or simply removed before analysis. This is important to preserve privacy, improve security and avoid bias.

When you are working with data, sensitive information like passwords, credit card numbers, names, emails, addresses or any other information capable of identifying an individual should be removed, as you never know who will have access to the data.
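
If you need a stable pseudonym to join tables, one option beyond plain removal is hashing the identifier before dropping it. A sketch with invented data; note that hashing alone is pseudonymization, not full anonymization, since hashes of known emails can still be matched:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    "email": ["ana@example.com", "joe@example.com"],
    "salary": [3200, 4100],
})

# Replace the identifier with a hash-based pseudonym, then drop it
df["user_id"] = df["email"].map(
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
)
df = df.drop(columns=["email"])
print(df)
```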

Also, it’s important to remove some data for ethical reasons and to avoid introducing bias to the model.

Summary

These are a few of the problems I have encountered while analysing data, but you may find others. The important thing to remember is that preprocessing information is as important as building a good model, and it should be just as much of a concern when dealing with Machine Learning.
