
A Guide To KNN Imputation

Kyaw Saw Htoon
6 min read · Jul 3, 2020


How to handle missing data in your dataset with Scikit-Learn’s KNN Imputer

Missing values exist in almost all datasets and it is essential to handle them properly in order to construct reliable machine learning models with optimal statistical power. In this article, we will talk about what missing values are, how to identify them, and how to replace them by using the K-Nearest Neighbors imputation method. To demonstrate this method, we will use the famous Titanic dataset in this guide.

What are Missing Values?

A missing value can be defined as a data value that is neither captured nor stored for a variable in the observation of interest. There are three types of missing values -

Missing Completely at Random (MCAR)

MCAR occurs when the missingness on a variable is completely unsystematic. When our dataset is missing values completely at random, the probability of missing data is unrelated to any other variable and unrelated to the variable with missing values itself. For example, MCAR would occur when data is missing because the responses to a research survey about depression are lost in the mail.

Missing at Random (MAR)

MAR occurs when the probability of the missing data on a variable is related to some other measured variable but unrelated to the variable with missing values itself. For example, the data values are missing because males are less likely to respond to a depression survey. In this case, the missing data is related to the gender of the respondents. However, the missing data is not related to the level of depression itself.

Missing Not at Random (MNAR)

MNAR occurs when the missing values on a variable are related to the variable with the missing values itself. In this case, the data values are missing because the respondents failed to fill in the survey due to their level of depression.
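To make these three mechanisms concrete, here is a minimal simulation sketch in Python. The data is entirely hypothetical (an invented depression score and the respondent's sex) and the drop-out probabilities are arbitrary; the point is only how the probability of missingness does or does not depend on the variables.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data: a depression score and the respondent's sex
df = pd.DataFrame({'depression': rng.normal(50, 10, n),
                   'is_male': rng.integers(0, 2, n).astype(bool)})

# MCAR: every response has the same 10% chance of being lost in the mail
mcar = df['depression'].mask(rng.random(n) < 0.10)

# MAR: males are less likely to respond, regardless of their score
mar = df['depression'].mask(rng.random(n) < np.where(df['is_male'], 0.30, 0.05))

# MNAR: the higher the depression score, the less likely a response
mnar = df['depression'].mask(rng.random(n) < df['depression'] / 100)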

Effects of Missing Values

Having missing values in our datasets can have various detrimental effects. Here are a few examples -

  1. Missing data can limit our ability to perform important data science tasks such as converting data types or visualizing data.
  2. Missing data can reduce the statistical power of our models, which in turn increases the probability of a Type II error, i.e. the failure to reject a false null hypothesis.
  3. Missing data can reduce the representativeness of the samples in the dataset.
  4. Missing data can distort the validity of scientific trials and lead to invalid conclusions.

Identifying Missing Values

Finding missing values with Python is straightforward. First, we will import Pandas and create a data frame for the Titanic dataset.

import pandas as pd

df = pd.read_csv('titanic.csv')

Next, we will remove some of the independent variable columns that are of little use to the KNN Imputer or to any machine learning model we might build later. These columns include the passenger names, passenger IDs, and cabin and ticket numbers.

df = df.drop(['Unnamed: 0', 'PassengerId', 'Name',
              'Ticket', 'Cabin'], axis=1)

We will then chain Pandas’ data frame methods ‘.isna()’ and ‘.any()’ to detect missing values. This returns a Boolean value for each column, where ‘True’ indicates that the column contains missing values.

df.isna().any()

As we can see, the columns ‘Age’ and ‘Embarked’ have missing values. Instead of ‘.any()’, we can also use ‘.sum()’ to find out the number of missing values in each column.

df.isna().sum()

There you go. Now, we know that ‘Age’ has 177 and ‘Embarked’ has 2 missing values.

KNN Imputer

KNN Imputer was first supported by Scikit-Learn in December 2019 when it released version 0.22. This imputer uses the k-Nearest Neighbors method to replace each missing value with the mean of the values from the ‘n_neighbors’ nearest neighbors found in the training set. By default, it measures closeness with a Euclidean distance metric that ignores missing coordinates (‘nan_euclidean’).
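Before applying it to the Titanic data, here is a minimal toy sketch of the mechanics. The numbers are made up; with ‘n_neighbors=2’, the missing entry is filled with the mean of the corresponding feature from the two closest complete rows.

import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second row is missing its first feature
X = [[1.0, 2.0],
     [np.nan, 3.0],
     [3.0, 4.0],
     [8.0, 9.0]]

# The two nearest neighbors of row 1 (measured on the observed feature)
# are rows 0 and 2, so the NaN becomes (1.0 + 3.0) / 2 = 2.0
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))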

To use this imputer on our Titanic data, we will import it from Scikit-Learn’s impute package -

from sklearn.impute import KNNImputer

One thing to note here is that the KNN Imputer does not recognize text data values. It will generate errors if we do not change these values to numerical values. For example, in our Titanic dataset, the categorical columns ‘Sex’ and ‘Embarked’ have text data.

A good way to modify the text data is to perform one-hot encoding or create “dummy variables”. The idea is to convert each category into a binary data column by assigning a 1 or 0. Other options would be to use LabelEncoder or OrdinalEncoder from Scikit-Learn’s preprocessing package.
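As a quick illustration of that alternative, here is a sketch using Scikit-Learn’s OrdinalEncoder on a made-up toy frame (not the Titanic data). Note that ordinal codes impose an arbitrary ordering on the categories, which a distance-based method like the KNN Imputer would treat as meaningful; that is one reason one-hot encoding is often the safer choice here.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Toy frame: each category becomes an integer code instead of dummy columns
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

print(OrdinalEncoder().fit_transform(toy))
# Categories are coded alphabetically: female=0/male=1 and C=0/Q=1/S=2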

In this tutorial, we will stick to one-hot encoding. First, we will select the categorical variables with text data and generate dummy variables by using the ‘pd.get_dummies()’ function from Pandas. An important caveat here is that we are setting the ‘drop_first’ parameter to True in order to prevent the Dummy Variable Trap.

Note: You can also use Scikit-Learn’s LabelBinarizer method here.

cat_variables = df[['Sex', 'Embarked']]
cat_dummies = pd.get_dummies(cat_variables, drop_first=True)
cat_dummies.head()

Now we have 3 dummy variable columns. In the “Sex_male” column, 1 indicates that the passenger is male and 0 indicates female. The “Sex_female” column is dropped since the “drop_first” parameter is set to True. Similarly, there are only 2 columns for “Embarked” because the third one has been dropped.

Next, we will drop the original “Sex” and “Embarked” columns from the data frame and add the dummy variables.

df = df.drop(['Sex', 'Embarked'], axis=1)
df = pd.concat([df, cat_dummies], axis=1)
df.head()

Another critical point here is that the KNN Imputer is a distance-based imputation method and it requires us to normalize our data. Otherwise, the different scales of our variables will lead the KNN Imputer to generate biased replacements for the missing values. For simplicity, we will use Scikit-Learn’s MinMaxScaler, which will scale our variables to values between 0 and 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df.head()

Now that our dataset has dummy variables and normalized values, we can move on to the KNN imputation. Let’s import the imputer from Scikit-Learn’s impute package and apply it to our data. In this example, we are setting the parameter ‘n_neighbors’ to 5. So, each missing value will be replaced by the mean value of its 5 nearest neighbors measured by Euclidean distance.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

Ok, the verdict is in! Let’s see the results.

df.isna().any()
df.isna().sum()

As demonstrated above, our data frame no longer has missing values. Each one has been replaced with the mean of the values from its k nearest neighbors.

Conclusion

There are different ways to handle missing data. Simple methods include removing the entire observation if it has a missing value, or replacing the missing values with the mean, median, or mode. However, these methods can waste valuable data or reduce the variability of your dataset. In contrast, the KNN Imputer maintains the value and variability of your dataset, and it is more precise than imputing with a single average value.
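As a rough illustration of that last claim, here is a sketch on synthetic data comparing mean imputation (Scikit-Learn’s SimpleImputer) with the KNN Imputer. The data and the roughly 20% missingness rate are invented; the point is that filling every gap with a single constant shrinks the column’s spread more than KNN imputation does.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(200) < 0.2, 0] = np.nan   # knock out ~20% of the first column

mean_filled = SimpleImputer(strategy='mean').fit_transform(X)
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X)

# Mean imputation collapses every gap to one value; KNN keeps more spread
print(mean_filled[:, 0].std(), knn_filled[:, 0].std())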
