Data Preprocessing Using Python Sklearn

Kesari Mohan Reddy
7 min read · Nov 3, 2018

Data analysis consists of several major steps. In this article we are going to discuss data preparation. In my previous article I already discussed EDA.

The full code implementation for data preprocessing is here.
The dataset we are using in this example is here.

In this article I’m going to cover three topics: data scaling, missing data handling, and categorical to numeric conversion.

Data Scaling:
Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. This is a problem, because most machine learning algorithms use the Euclidean distance between data points in their computations, and features on larger scales dominate that distance. To suppress this effect, we need to bring all features to the same level of magnitude, which can be achieved by scaling.

Standard Scaler:
The StandardScaler assumes that your data is normally distributed within each feature, and it scales each feature so that the distribution is centered around 0 with a standard deviation of 1. The mean and standard deviation are calculated separately for each feature, and the feature is then scaled as:

z = (x - mean) / standard deviation
Code Implementation:
In the code implementation below, we use the sklearn.preprocessing module.
To demonstrate this scaling, we create a dummy dataframe with two columns, C1 and C2, built on very different magnitudes.
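A minimal sketch of such a dataframe (the values here are illustrative, not from the original post):

```python
import pandas as pd

# Two columns on very different scales
df = pd.DataFrame({
    'C1': [1, 2, 3, 4, 5],                  # small magnitude
    'C2': [1000, 2000, 3000, 4000, 5000],   # large magnitude
})
```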

Create a StandardScaler object and fit it on the dataframe. Fitting computes the mean and variance of each column:
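A short sketch, continuing with the dummy dataframe df from above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)

print(scaler.mean_)  # per-column means
print(scaler.var_)   # per-column variances
```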

The StandardScaler.transform() method then transforms the data:
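Continuing the sketch:

```python
scaled = scaler.transform(df)
print(scaled)  # each column now has mean 0 and unit variance
```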

Now look at the output and observe how the values have been standardized, converted from their original ranges to a common scale.

MinMaxScaler: It shrinks the range of each feature so that it lies between 0 and 1 (or -1 to 1 if negative values exist).

The preprocessing.MinMaxScaler() API is available to achieve this:
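A short sketch, reusing the dummy dataframe from above:

```python
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
print(minmax.fit_transform(df))  # every column now lies in [0, 1]
```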

RobustScaler:

The RobustScaler uses a method similar to the Min-Max scaler, but it uses the interquartile range instead of the min and max, which makes it robust to outliers.
The preprocessing.RobustScaler() API is available to achieve this:
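A short sketch, again on the dummy dataframe:

```python
from sklearn.preprocessing import RobustScaler

# Centers each column on its median and scales by its IQR,
# so outliers have far less influence than with StandardScaler
robust = RobustScaler()
print(robust.fit_transform(df))
```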

Soon I’m planning to write an article about the IQR, quartiles, outlier detection, and box plots, which will give you a good understanding of the IQR.

Missing Data Handling:

Due to many reasons (human errors, sensor problems, bugs in software), a few columns may lack information in the data we process: missing data, the most irritating words in the data world.
When a few columns (features) in our data are missing values, our agenda is to predict or fill the missing data in the dataset, so that we end up with a proper model.
Handling missing values is one of the greatest challenges faced by analysts, because making the right decision on how to handle them produces robust data models. Let us look at different ways of imputing missing values.
Note: We will be using Python libraries such as NumPy, Pandas, and scikit-learn to handle these values.

In this article we are going to work through missing data in several steps:
1. Visualize the dataset to see which columns are impacted (missingno module).
2. Data profiling using pandas_profiling.
3. Impute missing values (both continuous and categorical features).

What is a missing value?
A missing value means the column value may be any of ‘NaN’, ‘NA’, ‘None’, ‘ ’, ‘?’, and others. If you have something different from ‘NaN’, you should standardize it by using np.nan.
To demonstrate the examples below, I’m using the Titanic train dataset.

Analyzing each column to identify how many rows have missing values:

1. Load titanic_train.csv and create a pandas dataframe.
2. The info() method reports the non-null count of each column: the Age and Cabin columns have missing values. Age has data in only 714 rows, with the remaining 177 rows missing; the Cabin column has data in only 204 rows.
3. Visualize the missing data using the missingno module (see the sketch below).
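A minimal sketch of these steps, assuming titanic_train.csv is in the working directory:

```python
import pandas as pd
import missingno as msno

df = pd.read_csv('titanic_train.csv')
df.info()        # Age: 714 non-null, Cabin: 204 non-null (out of 891 rows)

msno.matrix(df)  # visual matrix of where values are missing, per column
```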

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:
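In code, a one-line sketch:

```python
msno.heatmap(df)  # nullity correlation between columns
```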

Bar Graph:
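The missingno bar chart shows the non-null count of each column; a short sketch:

```python
msno.bar(df)  # bar chart of non-null counts per column
```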

2. Using the pandas_profiling module:
It provides detailed information about each column. It’s very tough to cover everything the command below provides; you should experience it yourself by executing the code snippet below.
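A minimal sketch, assuming the pandas_profiling package is installed:

```python
import pandas_profiling

# Generates a detailed per-column report (types, missing values,
# distributions, correlations) for the whole dataframe
pandas_profiling.ProfileReport(df)
```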

A few dataframe APIs for analyzing missing data (one pandas call for each question is sketched below):
→ Report, for each column, how many values are missing.
→ Report how many rows of each column have data.
→ Display, for each column, whether each value is present or null (True means the column value is missing, i.e. NaN).
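```python
# For each column, how many values are missing?
print(df.isnull().sum())

# How many rows of each column have data?
print(df.count())

# True means the value is missing (NaN)
print(df.isnull().head())
```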

So far we have analyzed the missing values of each feature. In the sections below we discuss a few techniques for imputing (filling) missing values:

→ Deleting rows
→ Fill with a fixed value
→ Continuous feature: replace with the mean/median value
→ Categorical feature: replace with the most frequently occurring value
→ Use an ML algorithm (linear regression / KNN)

Deleting Rows:
Let’s start with the simplest thing you can do: removal. As mentioned before, while this is a quick solution and might work in some cases when the proportion of missing values is relatively low (<10%), most of the time it will make you lose a ton of data. Imagine that just because of missing values in one of your features you have to drop the whole observation, even if the rest of the features are perfectly filled and informative!
After dropping every row with a missing value, only 183 rows remain:
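A sketch of row removal with pandas:

```python
# Drop every row that has at least one missing value
complete = df.dropna()
print(complete.shape)  # only 183 of the 891 rows survive
```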

Fill with Fixed Values:
Filling with a fixed value is only useful to get a model running. It is not at all a recommended approach unless we are sure of what we are filling and why.
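A one-line sketch (filling everything with 0 is purely illustrative):

```python
filled = df.fillna(0)  # every NaN becomes the fixed value 0
```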

Continuous Feature: Replace with the Mean/Median Value
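A sketch for the continuous ‘Age’ column of the Titanic dataframe:

```python
# Replace missing ages with the column mean (or median)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# df['Age'] = df['Age'].fillna(df['Age'].median())  # alternative
```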

Categorical Feature: Replace with the Most Frequently Occurring Value
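A sketch for a categorical column (an assumption here: ‘Embarked’ is the column being imputed):

```python
# Fill missing values with the most frequent category (the mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```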

Use an ML Algorithm (Linear Regression / KNN)
Using the features that do not have missing values, we can predict the nulls with the help of a machine learning algorithm. This method may result in better accuracy, unless a missing value is expected to have very high variance. We will use linear regression to replace the nulls in the ‘Age’ feature, using the other available features. Instead of sticking to a single algorithm, one can experiment with different algorithms and check which gives the best accuracy.
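A minimal sketch of regression-based imputation. It reloads a fresh copy of the dataframe (so ‘Age’ still contains its nulls) and assumes that Pclass, SibSp, Parch, and Fare have no missing values in this dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('titanic_train.csv')

# Numeric predictor columns assumed to be fully populated
features = ['Pclass', 'SibSp', 'Parch', 'Fare']

known = df[df['Age'].notnull()]
missing = df[df['Age'].isnull()]

lr = LinearRegression()
lr.fit(known[features], known['Age'])

# Replace the null ages with the model's predictions
df.loc[df['Age'].isnull(), 'Age'] = lr.predict(missing[features])
```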

KNN is a machine learning algorithm that works on the principle of distance measures. It can be used when there are nulls present in the dataset: KNN fills in a missing value by taking the majority of the K nearest values. In this particular dataset, taking into account a person’s age, sex, class, etc., we will assume that people with the same data for the above-mentioned features will have the same kind of fare.

Unfortunately, at the time of writing, the scikit-learn implementation of the K-Nearest Neighbours algorithm in Python does not support the presence of missing values.
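A sketch using the third-party fancyimpute package instead (an assumption: it is installed via pip install fancyimpute), which does accept NaNs:

```python
import pandas as pd
from fancyimpute import KNN

df = pd.read_csv('titanic_train.csv')
numeric = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]

# Each NaN is filled from the k=3 nearest rows by Euclidean distance
filled = pd.DataFrame(KNN(k=3).fit_transform(numeric),
                      columns=numeric.columns)
```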

Categorical to Numeric Conversion: soon I’m planning to write a full-length article on this topic.
