Approach to Handling Missing Values

A comprehensive guide that explains how to locate and deal with missing values

Suraj Yadav
7 min read · Aug 4, 2022

Why is it necessary to deal with missing data?

In most circumstances, real-world data is missing a significant number of values. There can be many reasons for each absent value: some data may have been lost or corrupted, and there are often other, more specific causes. Missing required data hurts the accuracy of your model, and feeding incomplete data to an algorithm can lead to inaccurate parameter estimates. If you fail to account for missing information, the credibility of your results will suffer.

The causes of missing values

Have you ever wondered why some records in a dataset are missing?

The following are some of the possible explanations for missing data:

  • Many surveys fail to obtain data because respondents avoid answering particular questions. Some people feel awkward discussing personal details such as their income, alcohol consumption, or smoking history, and most omit these on purpose.
  • Information is sometimes pieced together indirectly from a variety of sources rather than being gathered in one place. Data corruption is a serious problem here, and some values end up missing or corrupted due to a lack of maintenance.
  • Data collection errors are another cause of incomplete information. For instance, it is challenging to eliminate all human error from manual data entry.

Load in the dataset

This dataset was obtained from http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data, with some modifications.

import pandas as pd

df = pd.read_csv("creditApprovalUCI.csv")
df

This loads the data into a DataFrame of 690 rows and displays it.

Let’s figure out how many of each variable’s values are missing and put them in ascending order:

df.isnull().sum().sort_values(ascending=True)

The code above counts the missing observations for each variable using pandas' isnull() and sum() methods, and sorts the counts with sort_values().
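
If you also want each column's share of missing values rather than the raw count, pandas can compute it in one line (a small extension beyond the original output):

# Fraction of missing values per column, largest first
df.isnull().mean().sort_values(ascending=False)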

Drop Missing Data

In the process of dealing with missing data, you have the option of either discarding the missing data or substituting other values for the missing ones.

To drop every row that is missing at least one value, type the following:

df_clean = df.dropna()
df_clean

This results in a clean DataFrame without any missing data; as we can see at the bottom of the DataFrame output, 564 rows remain (down from the initial 690).
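
dropna() also accepts parameters that give finer control over which rows are discarded. Here is a brief sketch; the subset columns and the threshold are chosen purely for illustration:

# Drop a row only when all of its values are missing
df.dropna(how='all')

# Drop rows with missing values in specific columns only
df.dropna(subset=['A2', 'A14'])

# Keep only rows that contain at least 14 non-missing values
df.dropna(thresh=14)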

Imputation using either the mean or the median

Imputation using the mean or the median involves filling in missing values with the variable's mean or median, which can only be done for numerical variables. The mean or median is calculated from the train set; the same values are then used to impute missing data in the train and test sets, as well as in any future data we intend to score with the machine learning model.

If a variable is normally distributed, use mean imputation; otherwise, use median imputation. Note that when a large share of values is missing, mean and median imputation can distort the distribution of the original variables.
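
One quick, informal way to choose between the two is to inspect each variable's skewness; values far from zero suggest a skewed distribution, for which the median is the safer choice. The column names below assume the numerical variables used in the next step:

# Skewness close to 0 suggests a roughly symmetric, normal-like distribution
df[['A2', 'A3', 'A8', 'A11', 'A14', 'A15']].skew()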

# Import the required libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In mean and median imputation, the statistics should be calculated from the variables in the train set, so let's split the data into train and test sets, along with their respective targets, keeping only the numerical variables:

X_train, X_test, y_train, y_test = train_test_split(
    df[['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'A16']],
    df['A16'],
    test_size=0.3,
    random_state=42
)

Let’s check to see how many of these values are missing from the train set:

X_train.isnull().sum()

Using the SimpleImputer() function available in scikit-learn, let’s create a median imputation transformer:

simple_imputer = SimpleImputer(strategy='median')

In order to carry out mean imputation, we need to tell SimpleImputer() to use the mean strategy, which looks like this:

simple_imputer = SimpleImputer(strategy='mean')

Let’s fit SimpleImputer() to the train set so that it learns the median values of the variables:

simple_imputer.fit(X_train)

Let’s replace the missing values with the learned medians:

X_train = simple_imputer.transform(X_train)
X_test = simple_imputer.transform(X_test)

The median of each variable in the train set was learned by SimpleImputer() via the fit() method and saved in the statistics_ attribute. After that, we used the transform() method of SimpleImputer() to substitute known values for missing ones in both the training and testing data.
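
We can inspect the learned medians directly:

# One learned median per column, in the column order of the train set
simple_imputer.statistics_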

Let’s check the train data to see whether any missing values remain:

pd.DataFrame(data=X_train, columns=['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'A16']).isnull().sum()

SimpleImputer() returns NumPy arrays. We can transform the array back into a DataFrame using pd.DataFrame(X_train, columns=['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'A16']).
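
Alternatively, if you are on scikit-learn 1.2 or later (an assumption about your environment), the imputer can be configured to return pandas DataFrames directly:

# Return DataFrames with column names instead of NumPy arrays
simple_imputer = SimpleImputer(strategy='median').set_output(transform='pandas')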

Imputation using the mode or frequent category

The mode is used to fill in missing values in mode imputation. This method is usually used with categorical variables, which is why it is called “frequent category imputation.” Using the train set, frequent categories are estimated, and then those estimates are used to impute values in the train set, the test set, and any future datasets.

# Load the dataset
import pandas as pd
df = pd.read_csv("creditApprovalUCI.csv")

Create a train set and a test set from the original dataset, and keep only the categorical variables:
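
The categorical columns here, A1, A4, A5, A6, A7, A9, A10, A12, and A13, are the same ones the imputer reports on below:

X_train, X_test, y_train, y_test = train_test_split(
    df[['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']],
    df['A16'],
    test_size=0.3,
    random_state=42
)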

Use the SimpleImputer() function available in scikit-learn to generate a frequent category imputer:

simple_imputer = SimpleImputer(strategy='most_frequent')

Fit the imputer to the train set so that it learns the most frequent values:

simple_imputer.fit(X_train)

The following are the values the imputer encountered most frequently in the categorical columns A1, A4, A5, A6, A7, A9, A10, A12, and A13:

simple_imputer.statistics_

Output:
array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)

Replace missing values with frequent categories:

X_train = simple_imputer.transform(X_train)
X_test = simple_imputer.transform(X_test)

We can verify that there are no missing values left in the data:

pd.DataFrame(data=X_train, columns=['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']).isnull().sum()

A1     0
A4     0
A5     0
A6     0
A7     0
A9     0
A10    0
A12    0
A13    0
dtype: int64

Note that SimpleImputer() returns a NumPy array, not a pandas DataFrame.

We split the data into train and test sets, keeping only the categorical variables, so that scikit-learn could fill in the missing values. Next, we set up SimpleImputer() with strategy='most_frequent' as the imputation method. With the fit() method, the imputer learned the most frequent category of each variable and stored it in its statistics_ attribute. With the transform() method, those learned statistics were used to fill in the missing values in the train and test sets, which were returned as NumPy arrays.
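
If you want to handle both variable types in a single step, scikit-learn's ColumnTransformer can apply a different imputer to each group of columns. The following is a minimal sketch, assuming the same column split used in the sections above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditApprovalUCI.csv")

numeric_cols = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
categorical_cols = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df['A16'],
    test_size=0.3, random_state=42
)

# Median for the numeric columns, most frequent category for the categorical ones
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', SimpleImputer(strategy='most_frequent'), categorical_cols),
])

# Learn the statistics from the train set only, then apply them to both sets
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)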

Replacement of missing data with an arbitrary number

Arbitrary number imputation means substituting a fixed, arbitrary value for missing data; frequently used values are 999, 9999, or -1. This approach works well with numerical variables.

# Load the data and import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

df = pd.read_csv("creditApprovalUCI.csv")

Create two distinct data sets, one for training and one for testing, while retaining only the numerical variables:

X_train, X_test, y_train, y_test = train_test_split(
    df[['A2', 'A3', 'A8', 'A11']],
    df['A16'],
    test_size=0.3,
    random_state=42
)

Configure SimpleImputer() in such a way that it will substitute 99 for any values that are missing:

simple_imputer = SimpleImputer(strategy='constant', fill_value=99)

Note: if your dataset contains categorical variables, SimpleImputer() will impute 99 for any missing values in those variables as well.
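
For categorical variables, a common variant is to impute an explicit label rather than a number; the label 'Missing' below is only an illustrative choice:

# Impute a dedicated category instead of a number
simple_imputer = SimpleImputer(strategy='constant', fill_value='Missing')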

Let’s fit the imputer to the train set:

simple_imputer.fit(X_train)

Let’s substitute 99 for the values that are missing:

X_train = simple_imputer.transform(X_train)
X_test = simple_imputer.transform(X_test)
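
As before, we can run a quick sanity check to confirm that no missing values remain, converting the NumPy array back into a DataFrame first:

pd.DataFrame(X_train, columns=['A2', 'A3', 'A8', 'A11']).isnull().sum()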

I hope you find this article helpful and have learned some new things ❤

Clap if you enjoyed this article and follow for more content like this.

Reference:

Python Feature Engineering Cookbook, by Soledad Galli
