Practical technique for filling missing values in a data set

Krish · The Startup · Nov 28, 2019 · 5 min read

If you, like me, have wondered about techniques for filling missing values in a messy data set beyond applying blanket functions to the whole thing, read on.

You might be familiar with the Titanic data set from Kaggle. It’s one of the classic contests for practicing machine learning, and it has quite a few missing values in the ‘Age’ field, which makes it a good data set for demonstrating the idea.

When you load the data and take a look at the information:

import pandas as pd

data = pd.read_csv('train.csv')
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

There are 177 missing values for Age (891 rows, 714 non-null).
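A quick way to confirm per-column counts like these is isnull().sum(). Here it is on a tiny made-up frame standing in for the Titanic data:

```python
import pandas as pd

# toy frame standing in for the Titanic data
df = pd.DataFrame({
    'Age': [22.0, None, 26.0, None],
    'Sex': ['male', 'female', 'female', 'male'],
})

# count missing values per column
print(df.isnull().sum())  # Age: 2, Sex: 0
```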

The available 714 data points for age are distributed as below:

You could use mean or median, or even forward fill to complete the missing values for the age field. If you explore the data further, you might find some interesting patterns.
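For reference, those blanket versions look like this (sketched on a small made-up series rather than the real Age column):

```python
import pandas as pd

ages = pd.Series([22.0, None, 28.0, 40.0, None])

mean_fill = ages.fillna(ages.mean())      # every gap becomes 30.0 (the mean)
median_fill = ages.fillna(ages.median())  # every gap becomes 28.0 (the median)
forward_fill = ages.ffill()               # each gap copies the previous value
```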

#1 Explore all segments of data.

There are some other fields such as ‘Sex’, ‘Pclass’ and ‘Embarked’. Pclass is the ticket class and Embarked is the port of embarkation. You could segment the data based on these fields and you get:

Rather than filling the median value across the board, I chose to segment the data further, look at the distributions, and then fill with each segment’s median. This keeps the filled values closer to the original distribution, controls the outliers, and makes sure I have good data points for training.
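The per-segment medians are easy to inspect with a groupby. A minimal sketch on a small hypothetical frame (the real data would group on all three fields):

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male'],
    'Pclass': [1, 1, 3, 3, 3],
    'Age': [40.0, 36.0, 21.0, None, 25.0],
})

# median age per (Sex, Pclass) segment -- these become the fill values
segment_medians = df.groupby(['Sex', 'Pclass'])['Age'].median()
print(segment_medians)
```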

Another approach I considered was extracting the title from the Name field: titles such as Mrs., Mr., Dr., Major, etc. My assumption was that the title would indicate whether the person is an adult. This segmentation would further strengthen the technique of breaking down the data before filling the missing values.

import re

x = data.Name.apply(lambda x: re.findall(r"[A-Z]{1}[a-z]+\.", x)[0])
x.value_counts()

Output:

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Col.           2
Mlle.          2
Capt.          1
Mme.           1
Jonkheer.      1
Lady.          1
Countess.      1
Don.           1
Ms.            1
Sir.           1
Name: Name, dtype: int64

This turned out to be a tricky exercise: what if the test data set has a title that the train data set doesn’t? With free text, that’s quite a hassle. In addition, while Master vs. Mr. distinguishes young from old, Mrs. vs. Miss doesn’t. So I dropped the idea of segmenting by child/adult using the title.

(Note: I did use this to extract a feature: vip_indicator. This was a binary feature that would be 1 if the title was a specialized one and 0 if it was a regular one. Was it a useful feature? I leave that up to you in case you want to try :))
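A sketch of that vip_indicator feature might look like the following. The exact set of “regular” titles is my assumption here, as is the sample of names:

```python
import pandas as pd
import re

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'McCrie, Major. Timothy',
])

# titles treated as "regular"; anything else counts as specialized
REGULAR_TITLES = {'Mr.', 'Mrs.', 'Miss.', 'Master.'}

titles = names.apply(lambda s: re.findall(r"[A-Z][a-z]+\.", s)[0])
vip_indicator = (~titles.isin(REGULAR_TITLES)).astype(int)
print(vip_indicator.tolist())  # [0, 0, 1]
```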

#2 Choose the best statistic.

After segmenting the data, I found that for each segment the median is a good statistic that doesn’t deviate much from the original distribution.

Below is the age distribution after using different techniques:

The ‘detailed’ one is the method I used after segmenting the data; let’s call it segment-and-fill. The Titanic data set presented here is cleaned up, ready to go, and properly ordered, so ffill proves to be a strong contender. Nevertheless, filling the entire data set’s missing values with the mean or median in one go fattens the distribution towards the center (as expected), deviating from the original distribution.

Using fillna with median or mean on the entire data set introduces some outliers as well.

Though using ffill is a good alternative in the Titanic data set case, in the absence of a pre-cleaned, ordered data set, the segment-and-fill technique can be useful. I’ve used segment-and-fill technique at work and it gave better final results than others. (Unfortunately I can’t share those results with you in this public forum).

#3 Modularize your technique for reuse.

The function below breaks the original data into segments, fills the missing Age values with each segment’s median, and joins the data back together.

# function to fill missing Age values segment by segment
def fill_age_values(df):
    embarked = df.Embarked.unique()
    pclass = df.Pclass.unique()
    sex = df.Sex.unique()

    count = 0
    frames = {}
    for i in range(len(sex)):
        for j in range(len(pclass)):
            for k in range(len(embarked)):
                count += 1
                segment = df[(df.Sex == sex[i]) &
                             (df.Pclass == pclass[j]) &
                             (df.Embarked == embarked[k])].copy()
                # fill missing ages with this segment's median (the 0.5 quantile)
                segment['Age'] = segment['Age'].fillna(segment['Age'].quantile())
                frames['df_{}'.format(count)] = segment

    return pd.concat(frames.values()).sort_index()
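If you prefer a more compact version, pandas’ groupby-transform achieves the same segment-wise median fill without the explicit loops. A sketch on a small hypothetical frame (note this variant leaves rows with a missing segment key untouched rather than dropping them):

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female'],
    'Pclass': [1, 1, 1, 1],
    'Embarked': ['S', 'S', 'C', 'C'],
    'Age': [40.0, None, 30.0, None],
})

# fill each missing Age with the median of its (Sex, Pclass, Embarked) segment
df['Age'] = df['Age'].fillna(
    df.groupby(['Sex', 'Pclass', 'Embarked'])['Age'].transform('median')
)
```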

So to recap:

  1. Explore all segments of data: spend some time dicing the data and look at every segment available. One thing to watch for: make sure the fields you segment by have no (or very few) missing values themselves.
  2. Choose the best statistic: Once you segment the data, choose which statistic would best suit your purpose. Also think about the final performance metric of your model and how this feature would affect it.
  3. Modularize your technique for reuse: Future proof your technique to save yourself some hassle later.

Thank you for reading. Share your thoughts if this technique proves useful to your work!
