Data Preprocessing in Machine Learning Model

Published in

Analytics Vidhya

5 min read · Apr 3, 2020

Data Preprocessing

Data preprocessing is the process of converting raw data into a format that a model can understand. Raw data often contains missing values, noise, and errors, and preprocessing techniques address these problems. The whole data set is then split into a training set and a test set; the training set is used to train the machine learning model.

Raw Data

Raw data is unprocessed computer data. It may be stored in a file, or it may just be a collection of numbers and characters stored somewhere on the computer’s hard disk. For example, information entered into a database is often called raw data.

Why Data Preprocessing?

Raw data contains many missing values, noisy values, and errors, so it cannot be used directly in machine learning models. Only preprocessed data is fed to the models.

Better data preprocessing generally improves the accuracy of the model, which makes it one of the most important steps when building machine learning and deep learning models.

Data Preprocessing Techniques

1. Getting the dataset.
2. Importing libraries.
3. Importing dataset.
4. Encoding Categorical Data.
5. Finding Missing Data.
6. Splitting dataset into training and test set.
7. Feature scaling.

1. Getting the Dataset

First, we get the dataset from a website or any other source. The dataset is usually in CSV or Excel format. CSV stands for Comma-Separated Values, and Excel is the standard Microsoft Excel spreadsheet format. The Titanic data set is used to illustrate the data preprocessing techniques in this article.

Titanic: Machine Learning from Disaster (www.kaggle.com)

2. Importing libraries

All the required packages are imported for data preprocessing. The pandas package is used for handling the data sets, NumPy is used for array operations on the data, and matplotlib is used for visualizing the data.
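A minimal sketch of the imports, assuming the three packages are installed under their usual names and using the common aliases (the aliases are a convention, not a requirement):

```python
# Core libraries used for the preprocessing steps
import numpy as np               # array operations
import pandas as pd              # loading and handling tabular data
import matplotlib.pyplot as plt  # visualizing the data

# Quick check that the libraries are usable
values = np.array([1, 2, 3])
frame = pd.DataFrame({"value": values})
print(frame["value"].sum())
```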

3. Importing Dataset

The dataset is imported into the Python file by using pandas.
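As a rough sketch, loading a CSV file with pandas could look like the following. The in-memory sample and its lowercase column names are made up for illustration so that the snippet runs on its own; with the real data you would pass the path of the downloaded file to `pd.read_csv` instead, and the column names in that file may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file (made-up rows).
# With the real data, pass the file path to pd.read_csv instead.
sample_csv = io.StringIO(
    "survived,pclass,sex,age,fare\n"
    "0,3,male,22,7.25\n"
    "1,1,female,38,71.2833\n"
    "1,3,female,,7.925\n"
)

dataset = pd.read_csv(sample_csv)

print(dataset.head())   # first rows of the table
print(dataset.shape)    # (number of rows, number of columns)
```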

4. Encoding Categorical Data

The dataset has some values in string format. Most machine learning models accept only numerical values, so the strings are converted into numerical values with categorical encoding techniques.

Two techniques:
1. Label encoder
2. One-hot encoder

Label Encoder

scikit-learn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels.

The “sex” column is not numerical, so it is converted into numerical values.
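A small sketch of this step, assuming scikit-learn's `LabelEncoder` and a tiny made-up frame with a “sex” column (the `sex_encoded` column name is only for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample rows; the real data set has many more
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

encoder = LabelEncoder()
# Learn the distinct labels and map each one to an integer code
df["sex_encoded"] = encoder.fit_transform(df["sex"])

print(encoder.classes_.tolist())   # labels in sorted order: ['female', 'male']
print(df["sex_encoded"].tolist())  # [1, 0, 0, 1]
```

The codes follow the sorted order of the labels, so “female” becomes 0 and “male” becomes 1 here.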

One Hot Encoder
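
One-hot encoding creates one 0/1 column per category instead of a single integer code. A minimal sketch, assuming scikit-learn's `OneHotEncoder` and a hypothetical “embarked” column with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
one_hot = encoder.fit_transform(df[["embarked"]]).toarray()

print(encoder.categories_[0].tolist())  # ['C', 'Q', 'S']
print(one_hot.tolist())
# [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each row has exactly one 1, in the column that matches its category.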

5. Finding Missing Data

Missing data can be filled in by various methods. The following methods are used for filling the missing values:
1. Mean/median
2. Zero method
3. Most frequent values

Counting the missing values in each column and listing the five columns with the most shows where the gaps are. We can now fill the missing values in the “age” and “fare” columns.
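One way to produce such a count, sketched with pandas on a small made-up frame (the values are illustrative, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in "age" and "fare"
df = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex":  ["male", "female", "female", "male"],
})

# Count missing values per column, largest first
missing_counts = df.isnull().sum().sort_values(ascending=False)

print(missing_counts.head(5))  # up to five columns with the most missing values
```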

Mean/Median

The mean (or median) of the column that contains missing values is computed, and that value is then filled into the missing places.
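A minimal sketch with pandas `fillna`, using made-up numbers so the result is easy to check by hand. Using the mean for “age” and the median for “fare” is only an illustrative choice:

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap in each column
df = pd.DataFrame({
    "age":  [20.0, np.nan, 30.0, 40.0],
    "fare": [10.0, 20.0, np.nan, 90.0],
})

# mean() and median() skip missing values by default
df["age"] = df["age"].fillna(df["age"].mean())        # mean of 20, 30, 40 is 30.0
df["fare"] = df["fare"].fillna(df["fare"].median())   # median of 10, 20, 90 is 20.0

print(df)
```

The median is less sensitive to extreme values: the mean of the “fare” values would have been 40.0 because of the single large fare.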

Zero Method

The missing values are filled with zeros.
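
Sketched with pandas on a made-up column:

```python
import numpy as np
import pandas as pd

# Made-up sample with two gaps
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan]})

# Replace every missing value with 0
df["age"] = df["age"].fillna(0)

print(df["age"].tolist())  # [22.0, 0.0, 26.0, 0.0]
```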

Most Frequent Values

The missing values are filled with the most frequently occurring value in that particular column.
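
A short sketch using the pandas `mode` method on a hypothetical “embarked” column with made-up values:

```python
import pandas as pd

# Hypothetical categorical column with one gap
df = pd.DataFrame({"embarked": ["S", "C", None, "S"]})

# mode() ignores missing values; take the first entry in case of ties
most_frequent = df["embarked"].mode()[0]
df["embarked"] = df["embarked"].fillna(most_frequent)

print(most_frequent)            # S
print(df["embarked"].tolist())  # ['S', 'C', 'S', 'S']
```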

6. Splitting Dataset Into Training and Test Set

The data set is split into a training set and a test set. Only the important features are chosen for the model training process, because choosing the most important features in the data set can increase the accuracy.
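
A minimal sketch of this step, assuming scikit-learn's `train_test_split`. The rows, the chosen feature list, and the 80/20 split are all illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded sample rows
df = pd.DataFrame({
    "pclass":   [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "sex":      [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    "age":      [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

features = ["pclass", "sex", "age"]  # hypothetical choice of features
X = df[features]                     # independent variables
y = df["survived"]                   # target variable

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```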

7. Feature Scaling

Feature scaling is a technique for standardizing the independent features in the data to a fixed range. It is often used with regression algorithms to improve accuracy.
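
As one possible illustration, the sketch below uses scikit-learn's `StandardScaler`, which rescales each feature to zero mean and unit variance. This is a swapped-in choice: `MinMaxScaler` would be the alternative if a fixed range such as 0 to 1 is wanted. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features (columns: age, fare)
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[28.0, 8.05]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same statistics to the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting the scaler on the training set and reusing it on the test set keeps information from the test data out of the training process.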