Data Preprocessing in Machine Learning

Aisika Roy
Nov 5 · 4 min read

Before starting a machine learning task, it is usually insightful to look at examples from the dataset. Data gathered from different sources arrives in a raw format that is not suitable for analysis, so most datasets are processed first before an ML algorithm can be trained on them. Every dataset is different and poses unique challenges. It can contain unformatted real-world data with missing values, capped values, and categorical values, all of which make it difficult to train a model. So, we need to preprocess the data so that our model can be more accurate.

Data preprocessing is commonly estimated to take 60 to 80 percent of the effort in a typical machine learning project. Once your data is preprocessed well enough, your model will train more smoothly and tend to give better results.

Now, let’s see the steps of data preprocessing one by one.

Step 1: Importing required Libraries

A library is a collection of modules that can be called and used. Many things in the programming world do not need to be written explicitly every time they are required; there are functions for them, which can simply be invoked. Libraries commonly used in machine learning include NumPy, pandas, and scikit-learn.

Importing libraries
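The original import block is not shown here, so the following is only a minimal sketch of a typical one. The aliases `np` and `pd` are the conventional ones, and the exact set of libraries is an assumption rather than the author's own list.

```python
# Conventional aliases; which libraries a project needs depends on the task.
import numpy as np     # numerical arrays and math
import pandas as pd    # tabular data (DataFrame)
import sklearn         # scikit-learn: preprocessing utilities and models

# Printing the versions is a quick way to confirm the imports worked.
print(np.__version__, pd.__version__, sklearn.__version__)
```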

Step 2: Importing the Dataset

We import the dataset from the directory that contains it and store it in a pandas DataFrame using the pandas library.

Importing the 'Salary_Data.csv' file
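As a rough sketch of this step, the snippet below loads a CSV file with `pd.read_csv`. The contents of Salary_Data.csv are not shown in the post, so the column names and values are hypothetical stand-ins, written to a temporary directory only so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file: the columns and values below are
# made up, because the post does not show what Salary_Data.csv contains.
csv_text = "YearsExperience,Salary\n1.0,40000\n2.0,45000\n3.0,\n4.0,60000\n"

with tempfile.TemporaryDirectory() as data_dir:
    csv_path = os.path.join(data_dir, "Salary_Data.csv")
    with open(csv_path, "w") as f:
        f.write(csv_text)

    # read_csv parses the file into a DataFrame; empty fields become NaN.
    dataset = pd.read_csv(csv_path)

print(dataset.head())
```

With a real dataset, only the `pd.read_csv(...)` call is needed, pointed at the actual file path.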

Then we will inspect it with a few different commands to gain some intuition about the data.

Information about all the columns
Description of each column
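The two captions above most likely correspond to `DataFrame.info()` and `DataFrame.describe()`; that mapping is an assumption, since the original screenshots are not available. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset's columns are not shown.
dataset = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0],
    "Salary": [40000.0, 45000.0, None, 60000.0],
})

# info() prints each column's name, dtype, and non-null count.
dataset.info()

# describe() returns count, mean, std, min, quartiles, and max
# for every numeric column.
summary = dataset.describe()
print(summary)
```

The non-null count from `info()` and the `count` row from `describe()` both give a first hint about missing values.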

Step 3: Missing Data in Dataset

In real-life data there can be many missing values, and we have to replace them with some value. We will do that here using the scikit-learn library for Python, replacing each missing value with the mean of the values in that column.

The first step is to count how many null values each column has; the next is to replace those null values with the column mean.
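A minimal sketch of those two steps, assuming scikit-learn's `SimpleImputer` (the post does not say which imputer class it used) and a hypothetical `Salary` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing salary.
dataset = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0],
    "Salary": [40000.0, 50000.0, np.nan, 60000.0],
})

# Count the null values in each column.
null_counts = dataset.isnull().sum()
print(null_counts)

# Replace each NaN with the mean of the non-missing values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
dataset["Salary"] = imputer.fit_transform(dataset[["Salary"]]).ravel()
print(dataset)
```

Other strategies such as `"median"` or `"most_frequent"` may suit skewed or categorical columns better than the mean.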

Step 4: Encoding categorical data

Our data may be in qualitative form. For example, in spam email detection, the data is text. It is harder for machines to understand and process text than numbers, since the models are based on mathematical equations and calculations. So, we should encode the categorical data as numbers.
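One possible way to do this, sketched on hypothetical `Country` and `Purchased` columns: one-hot encode a nominal feature with `pd.get_dummies` and map a two-class label to integers with `LabelEncoder`. The post does not show which encoders it used, so this choice is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One-hot encoding: one 0/1 column per category, so no artificial
# ordering is implied between the countries.
features = pd.get_dummies(dataset[["Country"]], dtype=int)

# Label encoding: each class is mapped to an integer (classes are sorted).
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(dataset["Purchased"])

print(features)
print(labels)
```

One-hot encoding is usually preferred for features with no natural order, because plain integer codes would suggest that one category is "larger" than another.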

Step 5: Splitting the Dataset

Now we need to split our dataset into two sets: a training set and a test set. We will train our machine learning models on the training set, i.e., the models will try to learn the correlations in the training set, and then we will test the models on the test set to check their accuracy. Generally, about 80% of the dataset is kept as the training set and the remaining 20% as the test set. We will split our dataset into 4 parts: training set features, training set labels, test set features, and test set labels.
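The 80/20 split into four parts can be sketched with scikit-learn's `train_test_split`. The arrays `X` and `y` are hypothetical placeholders, and `random_state=0` is an arbitrary choice to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```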

Step 6: Feature Scaling

The final step of data preprocessing is to apply feature scaling.

It is a method used to standardize the range of the independent variables, or features, of the data. Let's take the example of a dataset containing estimated salaries by age. In this case the range of age is 18 to 65, but the range of income is 1,000 to 1,000,000. This difference in scale may cause problems when running gradient descent.
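A minimal sketch of standardization with scikit-learn's `StandardScaler`, using made-up age and income values in the ranges mentioned above. The post does not say which scaler it used, so `StandardScaler` is an assumption; `MinMaxScaler` would be another option.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of [age, income].
X_train = np.array(
    [[18, 1_000], [30, 50_000], [45, 200_000], [65, 1_000_000]], dtype=float
)
X_test = np.array([[25, 30_000], [50, 400_000]], dtype=float)

scaler = StandardScaler()

# Learn the mean and standard deviation from the training set only ...
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse those same statistics on the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into preprocessing.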

Figure: change in the contour plot after feature scaling

Here, if gradient descent starts from a, it will take a very long time to reach the global minimum compared to b. Without scaling, the contours of the cost function are elongated; feature scaling makes them circular, thus solving the problem.
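The effect can be illustrated with a toy example that is not taken from the figure. It assumes a simple quadratic cost, 0.5 * (c1 * w1^2 + c2 * w2^2), whose contours are elongated when c1 and c2 differ a lot and circular when they are equal. The function `gd_steps` and all the numbers are hypothetical.

```python
import numpy as np

def gd_steps(curvatures, start, tol=1e-3, max_steps=100_000):
    """Count gradient descent steps until the weights are within tol of the
    minimum (the origin) for the cost 0.5 * sum(curvatures * w**2)."""
    curvatures = np.asarray(curvatures, dtype=float)
    w = np.asarray(start, dtype=float)
    # The largest stable step size is limited by the steepest direction.
    lr = 1.0 / curvatures.max()
    for step in range(1, max_steps + 1):
        gradient = curvatures * w
        w = w - lr * gradient
        if np.linalg.norm(w) < tol:
            return step
    return max_steps

# Elongated contours: one direction is 100 times steeper than the other.
steps_elongated = gd_steps([1.0, 100.0], start=[1.0, 1.0])

# Circular contours: both directions are equally steep.
steps_circular = gd_steps([1.0, 1.0], start=[1.0, 1.0])

print(steps_elongated, steps_circular)
```

In the elongated case the step size has to stay small to suit the steep direction, so progress along the flat direction is slow; with circular contours a single step size suits both directions.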

Now, you are all set to go! Choose the most accurate model and train it.

ML Basics

Fundamentals of machine learning
