Ways to Handle Continuous Column Missing Data & Its Implementations

Ganesh Dhasade
Published in The Startup
5 min read · Sep 1, 2020

In my previous blog, I explained missing values and their types.

In this blog, I will explain, with implementations, how to handle missing values in continuous data columns.

Photo by: Scott Graham | Unsplash.com

Continuous Data: Continuous data is quantitative data that can be measured; it has an infinite number of possible values within a selected range, e.g. temperature, height, weight.

The dataset used in the examples is the Titanic dataset from Kaggle.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data = pd.read_csv("train.csv")
Data.isnull().sum()

Null values present in the dataset
1. Mean / Median / Mode Imputation

Assumption: Data is Missing Completely at Random (MCAR).

Description: Replace NaN values with the mean, median, or mode (the most frequent value) of the variable.

Implementation: Handle Age missing values

# Function to impute NaN values with the mean/median/mode
def impute_nan(DataFrame, ColumnName, ImputeValue):
    DataFrame[ColumnName + "_Imputed"] = DataFrame[ColumnName].fillna(ImputeValue)

# Call the function to impute the median value
median = Data.Age.median()
impute_nan(Data, 'Age', median)

Similarly, we can replace the NaN values with the mean or mode.

Disadvantage: this method distorts the original variance and can weaken correlations with other variables. The code below plots the density of the original and imputed columns to visualize the change in distribution.

# Code to plot the distribution of variance in the data columns
fig = plt.figure()
ax = fig.add_subplot(111)
Data['Age'].plot(kind='kde', ax=ax)
Data['Age_Imputed'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

Imputation with the mean/median/mode distorts the variance of the data.
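The variance shrink can be seen directly on a minimal, self-contained sketch (the values below are made up for illustration, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values)
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, np.nan, 35.0, np.nan, 54.0, 2.0]})

# Impute the median into a new column, mirroring the impute_nan function above
median = df["Age"].median()
df["Age_Imputed"] = df["Age"].fillna(median)

# Filling NaNs with a single constant shrinks the spread of the column
print(df["Age"].std())          # spread of the observed values
print(df["Age_Imputed"].std())  # smaller after imputation
```

Every imputed row lands exactly on the median, so the imputed column's standard deviation is always less than or equal to that of the observed values.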

2. Random Sample Imputation

Assumption: Data is Missing Completely at Random (MCAR).

Description: Take random observations from the non-missing values of the column and use them to replace the NaN values.

Implementation: Handle Age missing values

Step 1: Create a copy of the Age column.

Step 2: Draw a random sample of non-NaN values, one for each NaN to be filled.

Step 3: Align the sampled values with the index of the NaN rows and assign them.

# Function to impute NaN values with random observations
def impute_random_nan(Data, ColName):

    # Make a copy of the column
    Data[ColName + "_random"] = Data[ColName]

    # Select random sample values to fill the NaNs
    # (.dropna() ignores NaNs; sample as many values as there are NaNs)
    random_sample = Data[ColName].dropna().sample(Data[ColName].isnull().sum(), random_state=0)

    # pandas needs matching indexes to assign values, so re-index the
    # sample with the positions of the NaN rows
    random_sample.index = Data[Data[ColName].isnull()].index

    Data.loc[Data[ColName].isnull(), ColName + '_random'] = random_sample

# Call the function to impute random sample values
impute_random_nan(Data, 'Age')

Advantage: it preserves the variance of the original distribution. Let's plot the curves to compare the distributions.

# Code to plot the distribution of variance in the data columns
fig = plt.figure()
ax = fig.add_subplot(111)
Data['Age'].plot(kind='kde', ax=ax)
Data['Age_random'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

The distortion of the variance is minimal after imputation.

Disadvantage: this randomness does not suit every situation; without a fixed random_state, every run can impute different values.
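The three steps above can be sketched on a small synthetic column (values are illustrative, not from the Titanic data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0, 54.0, 2.0, 38.0]})

def impute_random(data, col, seed=0):
    # Step 1: copy the column
    data[col + "_random"] = data[col]
    # Step 2: draw one non-NaN value per NaN, seeded for reproducibility
    sample = data[col].dropna().sample(data[col].isnull().sum(), random_state=seed)
    # Step 3: align the sample's index with the NaN rows, then assign
    sample.index = data[data[col].isnull()].index
    data.loc[data[col].isnull(), col + "_random"] = sample

impute_random(df, "Age")
print(df[["Age", "Age_random"]])
```

Every imputed value is drawn from the observed values of the same column, which is why the distribution changes so little.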

3. Capturing NAN Values With a New Feature

Assumption: Data is NOT Missing Completely at Random (MCAR).

Description: Add a new indicator feature that marks which observations were imputed, giving the model a signal about missingness.

Implementation: Handle Age missing values

Step 1: Replace null values with mean/mode/median or random_sample_value.

Step 2: Add a new feature/column to indicate NAN value replaced/not replaced.

Step 3: Drop the Age column and keep the imputed and indicator columns.

# 1. Add a new column with 1 for null values and 0 for non-null values
Data['Age_NAN'] = np.where(Data['Age'].isnull(), 1, 0)
# 2. Impute random sample values in the Age column - call the function
impute_random_nan(Data, 'Age')
# Display the results
Data[['Age', 'Age_random', 'Age_NAN']].head(10)
# 3. Drop the Age column
Data = Data.drop('Age', axis=1)
At index 5, the NaN Age value is replaced with a random value, and Age_NAN is set to 1 to capture its importance.

Advantage: Capture the importance of missingness.

Disadvantage: it creates additional features (curse of dimensionality); e.g. if 10 columns have null values, 10 extra columns need to be created.
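A compact, self-contained sketch of the three steps (median fill is used here for brevity; random sampling works the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# Flag the rows that are about to be imputed (done before any in-place fill)
df["Age_NAN"] = np.where(df["Age"].isnull(), 1, 0)
# Fill the NaNs (median here; any imputation strategy works)
df["Age_filled"] = df["Age"].fillna(df["Age"].median())
# Drop the original column, keeping the imputed and indicator columns
df = df.drop("Age", axis=1)
print(df)
```

The Age_NAN column now lets a downstream model weigh imputed rows differently from fully observed ones.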

4. End of Tail Distribution Imputation

Assumptions: Data is NOT Missing At Random (i.e. it is MNAR); the data is skewed at the tail end.

Description: Take a value from the end of the distribution's tail (beyond the 3rd standard deviation of the distribution curve) and use it to replace NaN/null values.

Implementation: Handle Age missing values

Step 1: Plot the distribution of the data to see where the tail ends, using the code below:

from scipy.stats import norm

# Take the non-NaN values
NonNANData = Data['Age'].dropna()
# Fit a normal distribution to the data
mu, std = norm.fit(NonNANData)
# Plot the histogram with the fitted density curve
plt.hist(NonNANData, bins=50, density=True, alpha=0.6, color='g')
x = np.linspace(NonNANData.min(), NonNANData.max(), 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
plt.show()

Step 2: Calculate the extreme tail value (mean + 3 standard deviations) and use it to replace the NaN values.

# Function to impute NaN values with the extreme tail value
def impute_nan_TailValue(Data, variable, ExtremeValue):
    Data[variable + "_end_distribution"] = Data[variable].fillna(ExtremeValue)

# Calculate the extreme end-tail value
ExtremeEndTailMean = Data.Age.mean() + 3 * Data.Age.std()
# Call the function to replace NaN values with the extreme value
impute_nan_TailValue(Data, 'Age', ExtremeEndTailMean)

# Display histograms of the data distributions
Data.Age.hist(bins=50, color='g')
plt.title('Age : Normal Distribution Histogram', fontsize=15)
plt.show()
Data.Age_end_distribution.hist(bins=50, color='g')
plt.title('Age End Distribution : Normal Distribution Histogram', fontsize=15)
plt.show()

Data distribution after imputing NaN values with the end-tail value

Advantage: captures the importance of missingness, if any.

Disadvantage:

  • Changes the covariance/variance, i.e. distorts the original distribution and may bias the data.
  • If there are many NaN values, imputing them may mask true outlier values.
  • If there are few NaN values, the replaced values may themselves be treated as outliers.
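A minimal sketch of the computation (values are illustrative; mean + 3·std marks the right tail under an approximate-normality assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0]})

# The value three standard deviations above the mean lies past the
# right tail of (approximately normal) data
extreme = df["Age"].mean() + 3 * df["Age"].std()
df["Age_end_distribution"] = df["Age"].fillna(extreme)

# The imputed rows now sit beyond every observed value
print(extreme)
```

Because the fill value exceeds every observed Age, the imputed rows are clearly separated from the genuine observations, which is both the strength and the outlier-related weakness noted above.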

5. Arbitrary Imputation

Assumptions: Data is NOT Missing At Random (i.e. it is MNAR).

Description: Replace NaN values with a fixed, arbitrary value. The arbitrary value should not be one that occurs frequently in the dataset.

Implementation: Handle Age missing values

The arbitrary value can be any value of choice. Sometimes it is decided on the basis of domain or system knowledge.

Here we select 0 and 100 to replace the NaN values.

# Function to replace NaN values with an arbitrary value
def impute_nan_ArbitraryValue(df, variable, value):
    df[variable + "_" + str(value)] = df[variable].fillna(value)

# Call the function to replace NaN values with arbitrary values
impute_nan_ArbitraryValue(Data, 'Age', 0)
impute_nan_ArbitraryValue(Data, 'Age', 100)
# Display the top 10 rows
Data[['Age', 'Age_0', 'Age_100']].head(10)

NaN values replaced with the arbitrary values 0 and 100

Advantage: Capture the importance of missingness.

Disadvantage:

  • Changes co-variance/variance; may create outliers
  • May distort the original distribution of data.
  • Hard to decide which value to use.
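As a small self-contained sketch, the technique extends naturally with a sanity check that the chosen value is not already present in the column; the warning logic is my addition, not part of the article's code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

def impute_arbitrary(data, col, value):
    # Warn if the arbitrary value already occurs: it should be
    # distinguishable from genuine observations
    if value in data[col].dropna().values:
        print(f"warning: {value} already occurs in {col}")
    data[f"{col}_{value}"] = data[col].fillna(value)

impute_arbitrary(df, "Age", 0)
impute_arbitrary(df, "Age", 100)
print(df.head())
```

A check like this partially addresses the "hard to decide which value to use" disadvantage by ruling out values that collide with real data.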

Conclusion:

The implementations above show different ways to handle missing continuous data. The most widely used methods are random sample imputation and mean/median/mode imputation.

For reference: Jupyter notebook — code available at GitHub: https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Continous_Missing_Data.ipynb
