Nearly all real-world datasets have missing values, and this is not just a minor nuisance: it is a serious problem that we need to account for. Missing data is a tough problem, and, unfortunately, there is no single best way to deal with it. In this article, I will explain the most common and time-tested methods.
To understand how to deal with missing data, you first need to understand what types of missing data there are. Their differences can be difficult to grasp, so I highly recommend reading my previous post, where I explain the types of missing values as simply as possible.
Missing data may come in a variety of forms: an empty string, NA, N/A, None, -1, or 999. The best way to prepare for dealing with missing values is to understand the data you have: understand how missing values are represented, how the data was collected, where missing values are not supposed to be, and where they are used specifically to represent the absence of data. Domain knowledge and data understanding are the most important factors in successfully dealing with missing data; moreover, they are the most important factors in every part of a data science project.
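As a quick sketch (the file contents and sentinel values below are invented for illustration), pandas lets you declare which representations should be parsed as missing when loading the data:

```python
import pandas as pd
from io import StringIO

# Hypothetical raw file where missing mileage appears as "N/A", 999,
# or an empty field (empty fields are treated as NaN by default)
raw = StringIO("mileage,color\n42000,red\n999,blue\n,black\nN/A,red\n")

# Tell the parser which extra sentinels to convert to NaN
df = pd.read_csv(raw, na_values=["N/A", 999])
print(df["mileage"].isna().sum())  # 3 entries recognized as missing
```

Which sentinels to pass depends entirely on how your particular dataset encodes missingness, which is exactly why the data-understanding step comes first.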
With data Missing Completely at Random (MCAR), we can drop the missing values as they occur, but with Missing at Random (MAR) and Missing Not at Random (MNAR) data, dropping them could introduce bias into the model. Moreover, dropping MCAR values may seem safe at first, but by dropping samples we are still reducing the size of the dataset. It is usually better to keep values than to discard them; in the end, the amount of data plays a very important role in a data science project and its outcome.
For the sake of clarity, let's imagine that we want to predict the price of a car given some features, and, of course, the data has some missing values. The data might look like the table illustrated below. In every method described below, I will reference this table for a clearer explanation.
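A hypothetical version of that table can be sketched as a DataFrame. The NaN pattern follows the descriptions later in the article (mileage missing in the second and third rows, color missing in the first), but every concrete value here is invented:

```python
import numpy as np
import pandas as pd

# Hypothetical car dataset; the NaN pattern matches the article's table,
# the concrete numbers are invented
df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "color":   [np.nan, "red", "blue", "black"],
    "year":    [2015, 2012, 2017, 2010],
    "model":   ["sedan", "hatchback", "sedan", "suv"],
    "price":   [13500, 9800, 15200, 7400],
})
print(df.isna().sum())  # mileage: 2, color: 1, the rest complete
```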
- Listwise deletion.
If the missing values in some variable are MCAR and their number is not very high, you can drop the missing entries, i.e. drop all the data for a particular observation if the variable of interest is missing.
Looking at the table illustrated above, if we wanted to deal with all the NaN values in the dataset, we would drop the first three rows, because each of them contains at least one NaN value. If we wanted to deal just with the mileage variable, we would drop only the second and third rows, because in those rows the mileage column has missing entries.
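On a toy frame shaped like the table (same NaN pattern, invented values), both variants of listwise deletion are one-liners in pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "color":   [np.nan, "red", "blue", "black"],
    "year":    [2015, 2012, 2017, 2010],
    "model":   ["sedan", "hatchback", "sedan", "suv"],
    "price":   [13500, 9800, 15200, 7400],
})

# Drop every row that contains at least one NaN (the first three rows here)
complete_rows = df.dropna()

# Drop only the rows where mileage is missing (the second and third rows)
mileage_rows = df.dropna(subset=["mileage"])

print(len(complete_rows), len(mileage_rows))  # 1 2
```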
- Dropping variable.
There are situations when a variable has a lot of missing values. In this case, if the variable is not a very important predictor of the target, it can be dropped completely. As a rule of thumb, when 60–70 percent of a variable's values are missing, dropping the variable should be considered.
Looking at our table, we might consider dropping the mileage column, because 50 percent of its data is missing. But since this is below the rule-of-thumb threshold, and mileage is MAR (this was discussed in the previous article on the types of missing values) and one of the most important predictors of a car's price, dropping the variable would be a bad choice.
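A sketch of this rule of thumb on the same hypothetical frame; the 0.6 threshold below is the article's guideline, not a pandas default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "color":   [np.nan, "red", "blue", "black"],
    "year":    [2015, 2012, 2017, 2010],
    "price":   [13500, 9800, 15200, 7400],
})

missing_frac = df["mileage"].isna().mean()  # 0.5 in this toy table

# Drop the column only past the ~60-70 percent rule of thumb
if missing_frac > 0.6:
    df = df.drop(columns=["mileage"])

print(missing_frac, "mileage" in df.columns)  # 0.5 True -> column survives
```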
- Encoding missing variables in continuous features.
When a variable is positive by nature, encoding missing entries as -1 works well for tree-based models, which can account for the missingness of the data through the encoded value.
In our case, the mileage column would be our choice for encoding missing entries. If we used tree-based models (Random Forest, boosting), we could encode NaN values as -1.
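A minimal sketch on the hypothetical table; -1 is safe here only because real mileage can never be negative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "price":   [13500, 9800, 15200, 7400],
})

# mileage is strictly positive, so -1 is an unambiguous "missing" marker
# that a tree-based model can split on
df["mileage"] = df["mileage"].fillna(-1)
print(df["mileage"].tolist())  # [42000.0, -1.0, -1.0, 78000.0]
```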
- Encoding missing entry as another level of a categorical variable.
This method also works best with tree-based models. Here, we encode the missing entries in a categorical variable as another level. Again, tree-based models can account for the missingness with the help of a new level that represents missing values.
The color feature is a perfect candidate for this encoding method. We could encode NaN values as 'other', and the model would account for this level during training.
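The same idea as a sketch on the hypothetical table (the level name 'other' is just the article's choice; any label not already used by the variable works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": [np.nan, "red", "blue", "black"],
    "price": [13500, 9800, 15200, 7400],
})

# Turn missingness into its own category level
df["color"] = df["color"].fillna("other")
print(df["color"].tolist())  # ['other', 'red', 'blue', 'black']
```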
- Mean/Median/Mode imputation.
With this method, we impute the missing values with the mean or the median of a variable if it is continuous, and with the mode if it is categorical. This method is fast, but it reduces the variance of the data.
The mileage column in our table could be imputed with its mean or median, and the color column with its mode, i.e. its most frequently occurring level.
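Both imputations on the hypothetical table, as a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "color":   [np.nan, "red", "blue", "black"],
})

# Continuous variable: fill with the median (mean would also be 60000 here)
df["mileage"] = df["mileage"].fillna(df["mileage"].median())

# Categorical variable: fill with the mode (most frequent level)
df["color"] = df["color"].fillna(df["color"].mode()[0])

print(df["mileage"].tolist())  # [42000.0, 60000.0, 60000.0, 78000.0]
```

Note how both missing mileage entries receive the identical value 60000; this collapsing toward the center is exactly the variance reduction mentioned above.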
- Predictive models for data imputation.
This method can be very effective if correctly designed. The idea is to predict the value of each missing entry using the other features in the dataset. The most common prediction algorithms for imputation are linear regression and k-nearest neighbors.
Considering the table above, we could predict the missing values in the mileage column using the color, year, and model variables. Using the target variable, i.e. the price column, as a predictor is not a good choice, because it leaks data into future models: if we imputed the missing mileage entries using the price column, information about the price would be baked into the mileage column.
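A minimal regression-imputation sketch on the hypothetical table, using only `year` as a predictor for brevity (a real model would also encode `color` and `model`), and deliberately leaving `price` out to avoid target leakage:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "year":    [2015, 2012, 2017, 2010],
    "price":   [13500, 9800, 15200, 7400],
})

# Fit a least-squares line mileage ~ year on the rows where mileage is known
known = df.dropna(subset=["mileage"])
slope, intercept = np.polyfit(known["year"], known["mileage"], 1)

# Predict mileage for the rows where it is missing
missing = df["mileage"].isna()
df.loc[missing, "mileage"] = slope * df.loc[missing, "year"] + intercept

print(df["mileage"].round(1).tolist())
```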
- Multiple Imputation.
In multiple imputation, instead of imputing a single value for each missing entry, we impute a set of plausible values that reflect the natural variability of the data. This method also uses predictive models, but multiple times, creating several differently imputed datasets. Each completed dataset is then analyzed, and the results are pooled into a single estimate. This is a highly preferred method for data imputation, but it is moderately sophisticated; you can read more about it here.
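A hand-rolled sketch of the idea on the hypothetical table. The noise scale below is an invented assumption; real multiple-imputation procedures, such as MICE, estimate the imputation uncertainty properly and pool with the full Rubin's rules rather than a plain average:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [42000.0, np.nan, np.nan, 78000.0],
    "year":    [2015, 2012, 2017, 2010],
})

rng = np.random.default_rng(42)
known = df.dropna(subset=["mileage"])
slope, intercept = np.polyfit(known["year"], known["mileage"], 1)
missing = df["mileage"].isna()
pred = slope * df.loc[missing, "year"] + intercept

m = 5                 # number of imputed datasets
noise_std = 5000.0    # assumed residual spread (invented for this sketch)
estimates = []
for _ in range(m):
    completed = df.copy()
    # Imputation step: prediction plus random noise, so each dataset differs
    completed.loc[missing, "mileage"] = pred + rng.normal(0, noise_std, int(missing.sum()))
    # Analysis step: here, just the mean mileage of the completed dataset
    estimates.append(completed["mileage"].mean())

# Pooling step: combine the per-dataset estimates
pooled = float(np.mean(estimates))
print(round(pooled, 1))
```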
There are a lot of methods that deal with missing values, but there is no single best one; dealing with missing values involves experimenting and trying different approaches. There is one approach, though, that is considered the best: preventing the missing data problem in the first place with a well-planned study in which the data is collected carefully. So, if you are planning a study, consider designing it carefully to avoid problems with missing data.
Click the 💚 if you liked the article, so more people can see it here on Medium. This article is best read together with my previous article, which can be found here. If you have any questions, you can write them in the comments section below, and I will do my best to answer them. You can also email me directly or find me on LinkedIn.