How c☐n we deal w☐th mis☐ing data?

Let’s overcome missing data together!

Ahmet Talha Bektaş

9 min readJan 11, 2023

Q: What bothers data scientists?
A: As you could guess, “Irregular Data”

Introduction
1-)For categorical data types
1.1-) Filling by Mode
1.2-) Filling by “Unknown”
2-)For numerical data types
2.1-) Filling by Mean
2.2-) Filling by Median
3-)For both numerical and categorical data types
3.1-)Deleting rows
4-)Easy way
4.1-)SimpleImputer
5)Best ways
5.1-)Filling by “Feature Engineering”
5.2-) Filling by Regression
Author

Introduction

Before we start you can find my notebook for this article both my Kaggle and my GitHub.

What is the missing value?

Sometimes users would not want to give their information or data. In this case, missing values in our data is existing.

The creator of the dataset can show missing values in various ways to us. The most common missing value signs are :

None
”Unknown”
“unk”
0
NaN

What is the main purpose of filling in missing data?

Our main purpose is generally trying to find a good way to make our data more similar to Standard Normal Distribution.

However, sometimes, it can not or should not be similar to Standard Normal Distribution!

You will better understand when we do examples.

Let’s learn how to fill in missing values!

1-)For categorical data types

In categorical data types, we can fill by 2 methods:

Filling by Mode: In categorical data types, we can not measure mean or median, however; we can calculate the mode. We will put the mode of the column to missing values.
Filling by “Unknown”: We can fill unknown values with “unknown” string 😁

1.1-) Filling by Mode

“Mode” is the most frequent value in the column. If missing data is small in contrast to all data, we can assume that these missing values are probably the most frequent value (mode).

Let’s make an example!

I am going to use the NFL dataset.

If you do not know how to read data you should read this article.

Importing Necessary Libraries

Pandas ➱ For reading data and manipulation
Matplotlib ➱ For data visualization operations
Seaborn ➱ For data visualization

import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

Reading Data

df=pd.read_csv("nfl.csv")

EDA

If you haven’t met with EDA yet, I recommend you to read this article.

df.head()
# it shows the first 5 rows

Output:

df.info()
#It shows general information about columns

Output:

df.isnull().sum()
#It shows if there are any missing values in our dataset

Output:

I chose the “FieldPosition” column because its data type is object and it has 8602 empty rows.

Let’s look at the distribution!

I want to look at the distribution of that column before filling in empty rows.

I will use data visualization tools. If you don’t know how to visualize data you should read How can you make up your data? (Part-1) and (Part-2). Also, Supermarket Analysis is a good practice article for data visulation.

plt.figure(figsize=(12,8))

sns.histplot(data=df, x="FieldPosition",kde=True)

plt.xticks(rotation=90);

Output:

Filling missing values!

We can easily fill inside the empty rows by using the fillna() function.

df["FieldPosition"].fillna(df["FieldPosition"].mode().values[0],inplace=True)
# Here I am filling the "FieldPosition" column by its mode. 
# inplace=True means do these changes in DataFrame too.

Let’s look at the distribution again!

plt.figure(figsize=(12,8))

sns.histplot(data=df, x="FieldPosition",kde=True)

plt.xticks(rotation=90);

Output:

1.2-) Filling by “Unknown”

Sometimes there could be a correlation between missing data. And if we fill that as mode we could lose that correlation. As a consequence, we fill these data with “Unknown” string. Of course, in this case, you can fill in whatever you want such as “Missing”, “Unk” or “your_name” 😜

Let’s make an example!

I am going to use Argentina Car Prices Dataset.

Reading Data

df=pd.read_csv("argentina_cars.csv")

EDA

df.head()

Output:

df.info()

Output:

df.isnull().sum()

Output:

I chose the “color” column because its data type is object and it has 11 empty rows.

Let’s look at the distribution!

plt.figure(figsize=(8,6))

sns.histplot(data=df, x="color",kde=True)

plt.xticks(rotation=45);

Output:

Filling missing values!

df["color"].fillna("Unknown",inplace=True)

Let’s look at the distribution again!

plt.figure(figsize=(8,6))

sns.histplot(data=df, x="color",kde=True)

plt.xticks(rotation=45);

Output:

2-)For numerical data types

In numerical data types, we can fill by 2 methods:

Filling by Mean
Filling by Median

2.1-) Filling by Mean

“Mean” is the average of the column. In numerical missing values, you can not fill data by maximum or minimum value. If we do that, these values will be Outliers. Thus, we will fill in data by the average of the column.

Let’s make an example!

I am going to use BigMart Sales Dataset.

Reading Data

df=pd.read_csv("sales.csv")

EDA

df.head()

Output:

df.info()

Output:

df.isnull().sum()

Output:

Let’s look at the distribution!

sns.histplot(data=df, x="Item_Weight",kde=True);

Output:

Filling missing values!

df["Item_Weight"].fillna(df["Item_Weight"].mean(),inplace=True)

Let’s look at the distribution again!

sns.histplot(data=df, x="Item_Weight",kde=True);

Output:

2.2-) Filling by Median

“Median” is the middle value when we sort values from smaller to bigger. In numerical missing values, we can fill our data by using the median of the column.

Let’s make an example!

I am going to use the same dataset as before.

In this case, I will just fill by the median instead of the mean.

Reading Data

df=pd.read_csv("sales.csv")

I will not look at the EDA again because it is the same as before.

Let’s look at the distribution!

sns.histplot(data=df, x="Item_Weight",kde=True);

Output:

Filling missing values!

df["Item_Weight"].fillna(df["Item_Weight"].median(),inplace=True)

Let’s look at the distribution again!

sns.histplot(data=df, x="Item_Weight",kde=True);

Output:

3-)For both numerical and categorical data types

This way works with all data types!

Deleting rows 😅

3.1-)Deleting rows

If the number of missing values is smaller than the number of other values, you can just delete these values. As a data lover, I do not recommend you that because the data is data. we would not like to lose data.

Let’s make an example!

I am going to use Argentina Car Prices Dataset.

Reading Data

df=pd.read_csv("argentina_cars.csv")

EDA

df.head()

Output:

df.info()

Output:

df.isnull().sum()

Output:

df.shape

Output:

(510, 12)

510 rows and 12 columns.

Let’s look at the distribution!

sns.histplot(data=df, x="gear",kde=True)
plt.xticks(rotation=90);

Output:

Filling missing values!

df.drop(df[df["gear"].isnull()].index,axis=0,inplace=True)

Let’s look at the distribution again!

sns.histplot(data=df, x="gear",kde=True)
plt.xticks(rotation=90);

Output:

Nothing changed in the figure but let’s look at to the missing values and shape!

df.isnull().sum()

Output:

There are no missing values in gear anymore because we dropped that values.

df.shape

Output:

(509, 12)

Now we have 509 rows. One row has just gone.

4-)Easy way

We will use a new library from sklearn, it will fill in data for us.

SimpleImputer

4.1-)SimpleImputer

Let’s make an example!

I am going to use the NFL dataset.

Importing SimpleImputer

from sklearn.impute import SimpleImputer

Reading Data

df=pd.read_csv("nfl.csv")

EDA

df.head()

Output:

df.info()

Output:

df.isnull().sum()

Output:

Let’s look at the distribution!

plt.figure(figsize=(40,10))
sns.histplot(data=df, x="Humidity",kde=True)
plt.xticks(rotation=90);

Output:

Filling missing values!

Firstly, we should define our strategy for filling in missing data.

imputer = SimpleImputer(strategy='most_frequent')

For strategy, you can also use:

“mean”: just for numeric data
“median”: just for numeric data
“constant”: you should also define fill_value for replacing missing data.
“most_frequent”: it can be used with categorical or numeric data.

Secondly, we should teach imputer through our data.

imputer.fit(df)

Lastly, filling inside of empty data with our strategy.

imputed_df = imputer.transform(df)

The type of imputed_df is NumPy array so we need to convert that to DataFrame.

imp_df=pd.DataFrame(imputed_df,columns=df.columns)

Let’s look at to missing values!

df.isnull().sum()

Output:

As you see, all missing values are filled 🎉

If you want to learn more about SimpleImputer you can read the documentation of SimpleImputer.

Let’s look at the distribution again!

plt.figure(figsize=(40,10))
sns.histplot(data=df, x="college",kde=True)
plt.xticks(rotation=90);

Output:

5)Best ways

I will show you the best 2 ways of handling missing values. I highly recommend these ways:

Filling by “Feature Engineering”
Filling by Regression

5.1-)Filling by “Feature Engineering”

Feature engineering means extracting features from existing data, and extracting logical columns from existing data.

Let’s make an example!

I am going to use Titanic Dataset.

Reading Data

df=pd.read_csv("titanic.csv")

EDA

df.head()

Output:

df.info()

Output:

df.isnull().sum()

Output:

Extracting Feature

I will extract titles from the “Name” column.

df['Title']=df['Name'].str.extract(' ([A-Z,a-z]+)\.', expand=False)
# We extract words that have not only a space before itself but also a dot(".") at the end of the word.
# expand = True returns a DataFrame
# expand =False returns a Series
df['Title'].value_counts()

Output:

Let’s look at the distribution!

sns.histplot(data=df, x="Age",kde=True);

Output:

Filling missing values!

df['Age'].fillna(df.groupby("Title")["Age"].transform("mean"),inplace=True)
# Here, I am filling in missing values by mean of age in terms of "Title".
# As you know, we use “Miss” when addressing young girls and women under 30,
# use “Mrs.” when addressing a married woman, 
# “Ms.” when you are not sure of a woman's marital status, etc.

# So filling missing age values in terms of titles is very very logical 😁

Let’s look at the distribution again!

sns.histplot(data=df, x="Age",kde=True);

Output:

5.2-) Filling by Regression

Regression is a field of Artificial Intelligence that helps to predict numerical value.

Regression is so fun topic but it is long. Therefore; I will show that in the next notebooks. If you want to learn Regression,

Stay tuned!

My next article will be about “REGRESSION”.

Author:

Ahmet Talha Bektaş

If you want to ask anything to me, you can easily contact me!

📧My email
🔗My LinkedIn
💻My GitHub
👨‍💻My Kaggle
📋 My Medium

How c☐n we deal w☐th mis☐ing data?

Let’s overcome missing data together!

Table of contents:

Introduction

However, sometimes, it can not or should not be similar to Standard Normal Distribution!

1-)For categorical data types

1.1-) Filling by Mode

1.2-) Filling by “Unknown”

2-)For numerical data types

2.1-) Filling by Mean

2.2-) Filling by Median

3-)For both numerical and categorical data types

3.1-)Deleting rows

4-)Easy way

4.1-)SimpleImputer

5)Best ways

5.1-)Filling by “Feature Engineering”

5.2-) Filling by Regression

Stay tuned!

My next article will be about “REGRESSION”.

Ahmet Talha Bektaş

Written by Ahmet Talha Bektaş