Data Preprocessing before model training

Yash Joshi
Published in Analytics Vidhya
5 min read · Jul 14, 2020

You have a dataset and you’re ready to start model training and prediction. But wait: is this data ready to train the algorithm? As you probably know, the answer is usually a big NO. So what does it take to make the data ready for model building? The steps below are commonly referred to as data preprocessing techniques:

  1. Importing the libraries
  2. Reading the data set
  3. Checking and handling missing values
  4. Encoding techniques
  5. Feature Scaling
  6. Splitting the data set

Now let’s go through these one by one:

IMPORTING THE LIBRARIES: First, import the Python libraries that will be used across the steps. Pandas is essential, as it reads files in different formats (CSV, Excel, JSON, etc.) and provides a variety of options for manipulating and analyzing data. Similarly, NumPy is very helpful for performing mathematical operations on the data. Other libraries can be imported as and when they are required.

READING THE DATA SET: Pandas reads files of different formats into a pandas DataFrame: a two-dimensional, mutable table (rows and columns) that can hold heterogeneous data. Here I’m taking a Loan dataset (make sure the data file is in the same directory as the working notebook, or pass its full path). Calling data.head() displays the first 5 rows (starting from index 0).
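As a minimal sketch of the import and read steps: the loan file itself is not included with this post, so the column names and values below are hypothetical stand-ins, loaded from an in-memory string so that the example runs on its own. With a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a loan CSV file (assumed columns and values).
# With a real file, pass its path instead: pd.read_csv("your_file.csv")
csv_text = """Loan_ID,Gender,ApplicantIncome,LoanAmount,Loan_Status
LP001,Male,5849,,Y
LP002,Male,4583,128,N
LP003,Female,3000,66,Y
LP004,,2583,120,Y
LP005,Male,6000,141,Y
LP006,Female,5417,267,N
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (index 0 to 4).
print(data.head())

# shape is a (rows, columns) tuple.
print(data.shape)
```

Empty fields in the CSV are read as NaN, which is what the missing-value checks in the next step look for.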

CHECKING FOR MISSING VALUES: Once the data is loaded, the first thing to do is check for null or missing values. Most machine learning algorithms fail to work with datasets that contain null values. In pandas, data.isnull().sum() reports how many missing values each column contains.
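A small sketch of the missing-value check, using a hypothetical toy frame in place of the loan data (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data.
data = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = data.isnull().sum()
print(missing_counts)

# Keep only the names of columns with at least one missing value.
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)  # ['Gender', 'LoanAmount']
```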

Now the main part is to handle these missing values. For categorical variables, we generally replace missing values with the mode. For numerical features, the options include mean imputation, median imputation, group-by imputation, and the KNN imputer. Forward fill and backward fill are also used; they fill a missing value with the value just before or just after it. It is also necessary to analyze the data and understand the domain and the pattern behind the missing values.
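The imputation options above can be sketched as follows. The frame and its column names are hypothetical, and each numerical strategy is stored in its own variable so the results can be compared; in practice you would pick one. KNNImputer comes from scikit-learn, which is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with gaps in one categorical and one numerical column.
data = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Urban", "Rural", "Urban", "Rural"],
    "ApplicantIncome": [5000.0, 3000.0, 4200.0, 3500.0, 6000.0, 2500.0],
    "LoanAmount": [150.0, 100.0, np.nan, 110.0, 170.0, np.nan],
})

# Categorical column: replace missing values with the mode.
data["Gender"] = data["Gender"].fillna(data["Gender"].mode()[0])

# Numerical column, median imputation.
median_filled = data["LoanAmount"].fillna(data["LoanAmount"].median())

# Group-by imputation: fill with the mean of the row's own Property_Area group.
group_means = data.groupby("Property_Area")["LoanAmount"].transform("mean")
group_filled = data["LoanAmount"].fillna(group_means)

# Forward fill copies the previous value; backward fill copies the next one.
# A gap in the last row has no next value, so backward fill leaves it missing.
ffilled = data["LoanAmount"].ffill()
bfilled = data["LoanAmount"].bfill()

# KNN imputation: fill from the most similar rows across numerical columns.
knn = KNNImputer(n_neighbors=2)
knn_filled = knn.fit_transform(data[["ApplicantIncome", "LoanAmount"]])

print(median_filled.tolist())
print(group_filled.tolist())
print(knn_filled)
```

Forward and backward fill only make sense when the row order carries meaning (for example, time-ordered records), which is one reason to understand the pattern behind the missing values first.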

There is one more approach: estimating the missing values with a machine learning model. Here, the rows where the value is available form the training data, and the rows where it is missing are the ones to predict. Using a regression or classification algorithm, we can predict the missing values. (Just attaching 2 snippets for reference.) You can check a very informative blog dealing in depth with types of missing values here.
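A minimal sketch of model-based imputation, assuming a hypothetical frame in which LoanAmount is predicted from ApplicantIncome with scikit-learn's LinearRegression. The choice of model and predictor column is an illustration only, not a recommendation for the real loan data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: LoanAmount is missing in two rows.
data = pd.DataFrame({
    "ApplicantIncome": [2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0],
    "LoanAmount": [60.0, 90.0, np.nan, 150.0, 180.0, np.nan],
})

# Rows with a known target become the training data;
# rows with a missing target are the ones to predict.
known = data[data["LoanAmount"].notnull()]
unknown = data[data["LoanAmount"].isnull()]

model = LinearRegression()
model.fit(known[["ApplicantIncome"]], known["LoanAmount"])

# Write the predictions back into the missing cells only.
data.loc[unknown.index, "LoanAmount"] = model.predict(unknown[["ApplicantIncome"]])

print(data)
```

For a categorical column, the same pattern applies with a classifier in place of the regressor.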

ENCODING TECHNIQUES: Most machine learning algorithms work with numerical data only, so any categorical columns have to be encoded as numeric values. There are two common ways to perform encoding: label encoding and one-hot encoding. Label encoding is used for ordinal data: each unique label in the column is assigned an integer, starting from 0. One-hot encoding (also called dummy variables) is used for nominal data. Here each unique value is converted to its own column that holds either 1 or 0, so the resulting columns contain binary values only. In both cases the encoded values are just a numeric representation of the categories.

An example to show how label encoding works
Before and after label encoding
Location has 6 unique values
After applying one-hot encoding, each unique value is converted to a column with binary values only; 1 indicates that the row has that location.
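Both encodings can be sketched on a small hypothetical frame (the columns and categories are assumptions, and Location has three values here rather than six). scikit-learn's LabelEncoder assigns integers in alphabetical order of the labels, so for a column with a real ranking an explicit mapping is shown as well.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns.
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", "Not Graduate"],
    "Credit_Grade": ["Low", "High", "Medium", "Low"],
    "Location": ["Pune", "Delhi", "Mumbai", "Delhi"],
})

# Label encoding: integers from 0, assigned in sorted (alphabetical) order.
le = LabelEncoder()
df["Education_encoded"] = le.fit_transform(df["Education"])
print(list(le.classes_))  # ['Graduate', 'Not Graduate'] -> codes 0 and 1

# For ordinal data, alphabetical order may not match the real ranking
# (High < Low < Medium), so an explicit mapping keeps the intended order.
grade_order = {"Low": 0, "Medium": 1, "High": 2}
df["Credit_Grade_encoded"] = df["Credit_Grade"].map(grade_order)

# One-hot encoding: one 0/1 column per unique value of a nominal column.
one_hot = pd.get_dummies(df["Location"], prefix="Location", dtype=int)
print(one_hot)
```

Each row of one_hot contains exactly one 1, in the column for that row's location.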

FEATURE SCALING: Sometimes, while training the model, it’s necessary to bring all the numerical variables to a common scale, so that variables with large ranges do not dominate the others during training. Normalization and standardization are the two most common techniques. In normalization, the values are rescaled to a fixed range, either 0 to 1 or -1 to 1. In standardization, the values are transformed to have zero mean and a variance of 1.
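A short sketch of both techniques using scikit-learn's MinMaxScaler and StandardScaler on hypothetical numbers. Min-max normalization computes (x - min) / (max - min) per column, and standardization computes (x - mean) / std per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (e.g. income and loan term).
X = np.array([
    [1000.0, 10.0],
    [2000.0, 20.0],
    [3000.0, 30.0],
    [4000.0, 40.0],
])

# Normalization: each column is rescaled to the range 0 to 1.
normalized = MinMaxScaler().fit_transform(X)

# Standardization: each column ends up with zero mean and unit variance.
standardized = StandardScaler().fit_transform(X)

print(normalized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

MinMaxScaler also accepts feature_range=(-1, 1) for the -1 to 1 variant. A common practice, not covered in this post, is to fit the scaler on the training portion only and reuse it to transform the test portion.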

Here you can go through this great blog on feature selection techniques, as this is also an important factor that needs to be considered.

SPLITTING THE DATA: When giving the data to the model, we use one part of it for training and keep the remaining part as test data. Generally the split is 80% training data and 20% testing data (but this is not fixed; you can choose another ratio depending on the size of the data). This lets us measure how accurate the predictions are by comparing the predicted values against the actual test values. When unpacking the result of the split, keep the variables in the order given; for example, scikit-learn’s train_test_split returns the training and test parts of each input in that order (X_train, X_test, y_train, y_test).

Taking test size as 20%
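A minimal sketch of an 80/20 split with scikit-learn's train_test_split. The arrays are hypothetical, and random_state is an assumed value added only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features (10 rows, 2 columns) and target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the rows for testing.
# The unpacking order matters: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```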

Now you are done with the preprocessing steps and good to go with model training.
