TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Machine Learning: Handling the missing values in Data: The Easy Way

Introduction to SciKit Imputer

Akash Panchal
Published in TDS Archive · 3 min read · Mar 11, 2020


Photo by Helloquence on Unsplash

Most machine learning algorithms cannot work with missing values in the features. Take the Melbourne housing data as an example.

melbourne_data.describe()

Here, the count row shows that BuildingArea has only 7130 non-missing values, while most of the features have 13580.

There are a few things you can do to deal with missing values:
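If you would rather see the gaps directly than read them off the count row, a quick sketch like this does it. The toy DataFrame is made up for illustration and only borrows a few column names from the housing data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1970.0, np.nan, 2005.0],
})

# count() is what describe() reports in its first row: non-missing values per column.
non_missing = toy.count()

# isna().sum() gives the complementary number: missing values per column.
missing = toy.isna().sum()
print(missing)
```

On the real data, the same two calls on melbourne_data tell you which columns need attention.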

1. Get rid of the corresponding data

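melbourne_data_rows_dropped = melbourne_data.dropna(subset=["BuildingArea"])

This returns a new DataFrame without the rows where BuildingArea is missing; melbourne_data itself is left unchanged. You can see that the number of rows has decreased now.

melbourne_data_rows_dropped.describe()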
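One thing that is easy to miss: dropna gives back a new DataFrame and leaves the original alone unless you reassign it. A minimal sketch on made-up data (the toy frame and the name rows_kept are just for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
})

# Keep only the rows where BuildingArea is present.
rows_kept = toy.dropna(subset=["BuildingArea"])

# The original still has all of its rows; the result has fewer.
print(len(toy), len(rows_kept))
```

The price of this option is visible right away: half of the toy rows are gone, including their perfectly good Rooms values.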

2. Get rid of the entire attribute.

melbourne_data_no_area = melbourne_data.drop("BuildingArea", axis=1)

This returns a new DataFrame without the feature/attribute; melbourne_data itself is left unchanged. See below: the BuildingArea column is gone.

melbourne_data_no_area.describe()
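The same call can also be written with the columns keyword, which some people find easier to read than axis=1. A small sketch on made-up data (the toy frame and variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
})

# Equivalent to toy.drop("BuildingArea", axis=1).
without_area = toy.drop(columns=["BuildingArea"])

# All rows survive, but the feature is lost entirely.
print(list(without_area.columns), len(without_area))
```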

3. Set the missing values to some value

Approach A

If you think the attribute is important enough that you must include it for training, you can fill in the missing values.

Fill them with what?

Well, you can replace the missing values with median, mean or zeros.

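median = melbourne_data["BuildingArea"].median()
# Assign the result back; inplace=True on a selected column may not update the DataFrame in recent pandas versions
melbourne_data["BuildingArea"] = melbourne_data["BuildingArea"].fillna(median)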

This will replace all the missing values with the calculated median. You will also notice that the mean of the attribute has changed, since the filled-in values now count towards it.

melbourne_data.describe()

You may want to follow a similar process for the “YearBuilt” attribute as well. You will also have to save the median, as it will be needed later to fill the missing values in the test set and in any new data (easy to overlook).
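Saving the median matters because the test set should be filled with the value learned from the training set, not with its own median. Here is a minimal sketch, assuming a hypothetical train/test split with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets; the values are invented for illustration.
train = pd.DataFrame({"BuildingArea": [80.0, np.nan, 120.0, 100.0]})
test = pd.DataFrame({"BuildingArea": [np.nan, 150.0]})

# Compute the median once, on the training data only, and keep it around.
saved_median = train["BuildingArea"].median()

# Reuse the same saved value for both sets.
train["BuildingArea"] = train["BuildingArea"].fillna(saved_median)
test["BuildingArea"] = test["BuildingArea"].fillna(saved_median)
print(saved_median)
```

With more than one attribute this bookkeeping gets tedious, since you need one saved value per column.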

Approach B: Introducing SimpleImputer

Scikit-Learn provides the SimpleImputer class to handle the above task with ease. You can use it in the following way:

First, decide on the strategy. It can be one of these: mean, median, most_frequent, or constant.

Second, create the imputer instance using the decided strategy

import pandas as pd
from sklearn.impute import SimpleImputer

# 1. Remove the categorical (object-typed) columns
melbourne_data = melbourne_data.select_dtypes(exclude=["object"]).copy()

# 2. Fit the imputer on the numerical data (learns one median per column)
imputer = SimpleImputer(strategy="median")
imputer.fit(melbourne_data)

# 3. Fill the missing values and rebuild a DataFrame from the resulting array
X = imputer.transform(melbourne_data)
melbourne_data_tr = pd.DataFrame(X, columns=melbourne_data.columns,
                                 index=melbourne_data.index)

This calculates the median of every attribute and fills the missing values of each attribute with its own median.
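A fitted imputer also keeps the learned values around, so there is nothing to save by hand: they sit in its statistics_ attribute and are reused whenever you call transform on new data. A rough sketch, assuming a hypothetical train set and some new data with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data; the values are invented for illustration.
train = pd.DataFrame({
    "BuildingArea": [80.0, np.nan, 120.0, 100.0],
    "YearBuilt": [1950.0, 1990.0, np.nan, 1970.0],
})
new_data = pd.DataFrame({
    "BuildingArea": [np.nan, 150.0],
    "YearBuilt": [2001.0, np.nan],
})

# Learn one median per column from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# statistics_ holds the learned medians, in column order.
learned = dict(zip(train.columns, imputer.statistics_))

# transform() fills the gaps in the new data with the training medians.
new_filled = pd.DataFrame(
    imputer.transform(new_data),
    columns=new_data.columns,
    index=new_data.index,
)
print(learned)
```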

melbourne_data_tr.describe()
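If you are unsure which strategy to pick, it helps to see what each one fills in for the same column. The sketch below uses a tiny made-up column with one large value, so the mean and the median clearly disagree:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up feature column with a single missing entry (row index 3).
column = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# Record the value each strategy puts in place of the missing entry.
filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled[strategy] = imputer.fit_transform(column)[3, 0]

print(filled)
```

The mean gets pulled up by the single large value, while the median and the most frequent value do not, which is one reason the median is a common choice for skewed features such as areas or prices.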

Note: Categorical values are not covered here, as they are outside the scope of this tutorial.

Code on Github

Python code: Full Python code with all three approaches explained.

Streamlit code: Oh, you love Streamlit too? Find the Streamlit code here.

Data: Used Melbourne housing data.
