Machine Learning — Data Preprocessing

Iftikhar Liaquat
Published in Analytics Vidhya
6 min read · Apr 25, 2020

The most important step in building a machine learning model is data preprocessing. If we skip data preprocessing, the resulting model may not make predictions as accurate as it should. The following steps are included in preprocessing a dataset:

  1. Handling the Missing Data
  2. Handling the Categorical Data
  3. Splitting the dataset into the Training set and Test set
  4. Applying the Feature Scaling

Goal

Our dataset contains information about customers. The goal is to predict whether a customer with certain attributes will buy the product or not.

Dataset

But as mentioned above, we have to do the data preprocessing first. So, let's import the necessary libraries, load the data, and separate the features (independent variables) from the outcome (dependent variable). After that we'll preprocess the data.

In our case the “Purchased” column is the dependent variable and the rest of the columns are independent. So, let's load and split the dataset.

Loading the dataset into pandas dataframe
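The loading code appeared as an image in the original post; here is a minimal, self-contained sketch of it. The column names and sample values are assumptions based on the surrounding text (the original reads a CSV file from disk; an inline string is used here so the snippet runs on its own):

```python
import io

import pandas as pd

# The original loads a CSV file from disk; this inline string is a stand-in
# with assumed column names and sample values so the snippet is runnable.
csv_data = """Country,Age,Salary,Purchased
Brazil,44,72000,No
Canada,27,48000,Yes
Cuba,30,54000,No
Canada,38,61000,No
Cuba,40,,Yes
Brazil,35,58000,Yes
Canada,,52000,No
Brazil,48,79000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_data))

# Independent variables (features) are all columns except the last;
# the dependent variable (outcome) is the "Purchased" column.
features = dataset.iloc[:, :-1].values
outcome = dataset.iloc[:, -1].values
```

Note the two empty CSV fields: pandas reads them as `NaN`, which is exactly the missing data we'll handle in the next step.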

After doing this, let's start the data preprocessing.

1. Handling the Missing Data

The first step in data preprocessing is to handle missing data in the dataset. It's common to have missing values in a dataset, and to get accurate predictions from the model we have to handle them. There are multiple ways to handle missing data:

  • We can remove the records that have missing data from the dataset. This is generally a bad technique, as the records in a dataset are crucial; removing them will affect the outcome of the model.
  • We can replace the missing data with the mean, the median, or the most frequent value of the column. This is the most common approach to handling missing data.

In our case we'll replace the missing values in the “Age” and “Salary” columns with the mean of the respective column.

Handling missing data
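The code from the screenshot isn't recoverable, but a common way to do this replacement in current scikit-learn is `SimpleImputer` (the post's era may have used `sklearn.preprocessing.Imputer`, which has since been removed). The feature matrix below is an assumed sample:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Assumed sample features: Country, Age, Salary (np.nan marks missing values).
features = np.array([
    ['Brazil', 44.0, 72000.0],
    ['Canada', 27.0, 48000.0],
    ['Cuba', 30.0, np.nan],
    ['Canada', np.nan, 61000.0],
], dtype=object)

# Replace missing values in the Age and Salary columns (indices 1 and 2)
# with the mean of each column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
features[:, 1:3] = imputer.fit_transform(features[:, 1:3])
```

After this, the missing Age becomes the mean of the other ages and the missing Salary the mean of the other salaries.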

Now the data in features will be:

Data after handling missing data

And you can see that we have successfully handled the missing data.

2. Handling the Categorical Data

Categorical data is data that takes a limited number of possible values. It can be numerical or textual in nature. Since machine learning models are mathematical equations, they can only work with numbers, so we need to encode the textual data as numbers. In our dataset, the “Country” and “Purchased” columns are categorical. We'll use LabelEncoder to encode the data.

Handling categorical data
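The encoding step from the screenshot can be sketched as follows (the country values are an assumed sample). `LabelEncoder` assigns integer codes in alphabetical order of the labels:

```python
from sklearn.preprocessing import LabelEncoder

countries = ['Cuba', 'Canada', 'Brazil', 'Canada', 'Cuba']

encoder = LabelEncoder()
encoded_countries = encoder.fit_transform(countries)

# Classes are sorted alphabetically: Brazil -> 0, Canada -> 1, Cuba -> 2.
print(list(encoded_countries))  # [2, 1, 0, 1, 2]
```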

After this the data in the features will be:

We have successfully handled the categorical data. But by doing this we have created a problem that will affect the model's predictions. Earlier we said that machine learning models are mathematical equations. Looking at the data, especially the “Country” column, it seems as if the country “Cuba” has higher priority than “Canada” and “Brazil”, since “Cuba” has an encoded value of 2, which is greater than the encoded values of “Canada” and “Brazil” (1 and 0 respectively).

To resolve this issue, we'll use the dummy variable technique. A “dummy variable” or “indicator variable” is an artificial variable created to represent an attribute with two or more distinct categories/levels. We'll replace the “Country” column with a number of columns equal to the number of distinct values in the column. Since there are three distinct values in the “Country” column, we'll add three columns to our feature data that represent the country of the customer: “Brazil”, “Canada”, and “Cuba” respectively (scikit-learn orders the categories alphabetically). Only 0 and 1 are valid values in these columns, and in each row exactly one of them is 1 while the others are 0. It's like a set of switches: the column that is ON indicates the country of the customer. Let's handle this problem in our code.

Instead of using “LabelEncoder” we'll use “OneHotEncoder” on the “Country” column, replacing the lines of code above.
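In current scikit-learn, the idiomatic way to one-hot encode a single column is `OneHotEncoder` inside a `ColumnTransformer` (the `categorical_features` argument the original post likely used has since been removed). A sketch with assumed sample data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumed sample features: Country, Age, Salary.
features = np.array([
    ['Brazil', 44.0, 72000.0],
    ['Canada', 27.0, 48000.0],
    ['Cuba', 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 (Country) and pass the numeric columns through.
transformer = ColumnTransformer(
    [('country', OneHotEncoder(), [0])],
    remainder='passthrough',
)
features = transformer.fit_transform(features)
if hasattr(features, 'toarray'):  # densify if a sparse matrix was returned
    features = features.toarray()

# Columns are now: Brazil, Canada, Cuba, Age, Salary.
```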

After doing this our feature data will look like:

Now let's encode the “Purchased” column. Since this is the dependent variable, the machine learning model will know that it's a category, so we can simply use “LabelEncoder” for it.
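Encoding the outcome is a one-liner (sample values assumed):

```python
from sklearn.preprocessing import LabelEncoder

purchased = ['No', 'Yes', 'No', 'No', 'Yes']
outcome = LabelEncoder().fit_transform(purchased)

# Alphabetically, 'No' -> 0 and 'Yes' -> 1.
print(list(outcome))  # [0, 1, 0, 0, 1]
```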

After encoding the outcome will become:

3. Splitting the Dataset into the Training set and Test set

Machine learning — as the name suggests, the machine is going to learn the relations in our dataset. Since the machine is learning, we need to test it. To do that we will split our dataset into two parts: one part will be used to train the machine (the training data) and the other to test it (the test data). For this purpose we'll use “train_test_split”, a utility provided by scikit-learn.

I have selected 20% of the dataset for testing; the remaining 80% is for training. Usually we use 20% to 30% of the dataset as test data.
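The split can be sketched as follows. The 20% test size matches the text; the sample data and the `random_state` value are assumptions, the latter added for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten assumed sample rows of feature data and their outcomes.
features = np.arange(20).reshape(10, 2)
outcome = np.arange(10)

# Hold out 20% of the rows for testing; train on the remaining 80%.
features_train, features_test, outcome_train, outcome_test = train_test_split(
    features, outcome, test_size=0.2, random_state=0
)

print(features_train.shape, features_test.shape)  # (8, 2) (2, 2)
```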

4. Applying the Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. It is also known as data normalization. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Also, gradient descent converges much faster with feature scaling than without it.

In our case, we will transform the values of the “Age” and “Salary” columns so that they lie in a specific range, e.g. -1 to +1. The following are the two most common techniques for feature scaling:

  • Standardization: a very effective technique that re-scales a feature so that it has a distribution with mean 0 and variance 1, i.e. x' = (x − mean) / standard deviation.
  • Normalization (min-max scaling): re-scales a feature so that its values lie between 0 and 1, i.e. x' = (x − min) / (max − min).

For our dataset we'll be using “Standardization”, with help from “StandardScaler” in scikit-learn.
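The scaling step can be sketched like this (sample values assumed). Note that the scaler is fitted on the training data only and then reused on the test data, so the test set is scaled with the training set's mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed sample Age/Salary values after the earlier preprocessing steps.
features_train = np.array([[44.0, 72000.0],
                           [27.0, 48000.0],
                           [30.0, 54000.0],
                           [38.0, 61000.0]])
features_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
# Fit on the training data, then apply the same transformation to both sets.
features_train = scaler.fit_transform(features_train)
features_test = scaler.transform(features_test)
```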

Now the features_train and features_test will look like:

In our case, we don't need to scale the “Purchased” column. But there will be certain cases where we have to apply feature scaling to the dependent variable, e.g. in regression where the dependent variable takes large values.

Great, we are done with preprocessing the dataset and can now use the data as input for our model.

Please let me know if you have any questions or need any clarification.

Resources:

You can find the code from this blog in my GitHub repository.
