Data Preprocessing: Machine Learning

TC. Lin
6 min read · Jan 1, 2023


This article continues from the previous: Introduction to Machine Learning.

Data preprocessing is a crucial step in machine learning. All string values in the data must be converted into numerical values so that the computer can understand and learn the underlying patterns, and thus predict an outcome for a given input.

In this article, we will break down each step of data preprocessing with an example using scikit-learn. In general, data preprocessing consists of the following steps:

1. Importing data

2. Dealing with missing data

3. Encoding categorical data

4. Splitting data

5. Feature scaling

Before we start, the three most important libraries we use in machine learning are: 1. NumPy, 2. Pandas, 3. Matplotlib.

NumPy is considered the backbone of data science, and it dominates numerical computing in Python. It is the foundation for turning data into series of numbers.

Pandas is a data analysis tool built around the idea of the DataFrame, and it helps to get raw data ready for machine learning.

Matplotlib is a plotting library that is widely used in data science; it allows us to turn our data into plots and figures.

These three libraries are imported via:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1. Importing Data

Data is imported using Pandas, usually in the following way:

df = pd.read_csv('data.csv')

With the help of Pandas, we can take a quick look at our data. The dataset used throughout this article contains the columns Country, Age, Salary, and Purchased, with a few missing values in the Age and Salary columns.
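A minimal sketch of inspecting the DataFrame (data.csv and its column names follow the example used in this article):

print(df.head())  # first five rows of the DataFrame
df.info()         # column dtypes & non-null counts, which reveals missing values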

Machine learning separates the feature variables and the prediction variable (the column that we want to predict) as X and y respectively. The data can be split in either of the following ways:

X = df.drop('Purchased', axis=1).values
y = df['Purchased'].values

or

X = df.iloc[:, :-1].values # -> [rows, columns]
y = df.iloc[:, -1].values

2. Dealing with Missing Data

It is important to deal with missing data in the dataset, as it can greatly affect the training process and therefore the predictions.

In general, if a dataset is large, it is often acceptable to simply remove the rows that contain missing data. For example, deleting 1,000 rows from a dataset of 100,000 rows removes only 1% of the data.
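In Pandas, dropping these rows is a one-liner (a sketch, assuming the DataFrame df from above):

# Drop every row that contains at least one missing value
df = df.dropna()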

However, if this is not the case, we have to fill in the missing values in order to prevent information loss.

There are several ways to fill in the missing values, and the most common are filling the values with 1. the average (mean), 2. the median, 3. the most frequent value (for frequency-related data).

Filling values with the median is very useful as it is robust to extreme outliers, which can skew the average of the entire dataset.

Filling the values with the average is one of the most classic approaches, and in the following example, I will demonstrate how to fill the missing values with the average.

To fill in the missing values, we make use of Scikit-Learn, the most widely used machine learning framework, which offers many built-in functions.

Taking the dataset above as an example, we can fill in the missing values with the mean:

from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(X[:, 1:3])                    # learn the mean of each column
X[:, 1:3] = imputer.transform(X[:, 1:3])  # fill in the missing values

> Imputing is another way of saying filling in the missing values.

> X[:, 1:3] means selecting ALL the rows of data, and columns ‘Age’ and ‘Salary’ -> [Rows, Columns].
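For reference, switching to the median or the most frequent value only requires changing the strategy argument (a sketch following the same pattern as above):

imputer = SimpleImputer(missing_values=np.nan, strategy='median')        # robust to outliers
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # for frequency-related data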

3. Encoding Categorical Data

To make categorical data numerical, we have to encode it. However, simply assigning numerical codes such as 0, 1, 2 is not ideal. Using the dataset above with "Country" as an example, assigning 0 to France, 1 to Spain, and 2 to Germany would make the computer think that there is an order to these values, which is something we want to prevent.
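To see the problem concretely, here is a sketch of the naive integer encoding described above (for illustration only; this is what we want to avoid):

# Integer codes imply an order: Germany (2) > Spain (1) > France (0),
# which a model may interpret as a meaningful ranking
codes = {'France': 0, 'Spain': 1, 'Germany': 2}
print([codes[c] for c in ['France', 'Spain', 'Germany']])  # [0, 1, 2]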

To encode categorical data properly, we can use the OneHotEncoder provided by Scikit-Learn. The Country column will be turned into 3 different columns (one for each of the 3 countries). For example, France becomes the vector (1, 0, 0) and Germany becomes (0, 1, 0), etc.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# For ColumnTransformer, transformers=[('name', EncoderClass, columns of X to encode)];
# remainder tells it what to do with the columns that are not encoded
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X)) # fits & transforms the columns in one step & forces the result to a NumPy array
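To see the one-hot encoding on its own, here is a minimal sketch on a toy column (the country values are illustrative; note that OneHotEncoder orders the encoded columns alphabetically by category):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
countries = [['France'], ['Spain'], ['Germany'], ['France']]
print(enc.fit_transform(countries))
# France -> [1, 0, 0], Germany -> [0, 1, 0], Spain -> [0, 0, 1]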

4. Splitting Data into Training & Test Sets

It is crucial to separate the data into 1. a training set and 2. a test set, and sometimes a validation set. This is because we want our computer to learn the patterns of our data from the training set, and then be evaluated on the test set (data that the computer has not seen before) to check whether it is learning well.

To do so, we use:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # random_state makes the split reproducible

The most common way is to keep 80% of the dataset as the training set, and the remaining 20% as the test set.
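We can sanity-check the split sizes with a quick sketch:

print(X_train.shape, X_test.shape)  # roughly 80% / 20% of the rows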

5. Feature Scaling

Note: Not all machine learning models require feature scaling, but some do.

Feature scaling rescales the feature variables of the data onto the same scale. This prevents some features from dominating others, or from barely being considered by the ML model at all.

The two most common ways are: 1. Standardization 2. Normalization.

Standardization: subtract the mean of the feature from each value, then divide by the standard deviation: x' = (x − mean) / standard deviation.

Normalization: subtract the minimum of the feature from each value, then divide by the range: x' = (x − min) / (max − min).
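A minimal NumPy sketch of both formulas on a toy feature column:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

standardized = (x - x.mean()) / x.std()           # mean 0, standard deviation 1
normalized = (x - x.min()) / (x.max() - x.min())  # values between 0 and 1

print(standardized)  # approx. [-1.34 -0.45  0.45  1.34]
print(normalized)    # [0.  0.333...  0.667...  1.]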

Which to apply?

  • Normalization is recommended when most of the features follow a normal distribution. (a specific situation)
  • Standardization works well in most situations. (when in doubt, go for standardization, as it generally works well in the training process)

Note

We do not have to apply feature scaling to:

  1. Binary values
  2. Equations with coefficients of the same scales
  3. Values between 0 & 1

For the test set, we ONLY transform, as the test set needs to be scaled with the same parameters as the training set.

The scaler is fitted only on the training set, so that the test set is treated as brand-new data and does not contribute to the scaling formula.

from sklearn.preprocessing import StandardScaler # Standardization

sc = StandardScaler()

# fit gets the mean & standard deviation of each of the features;
# transform applies the formula and actually manipulates the values
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# We need to apply the SAME scaler fitted on the training set onto the test set
X_test[:, 3:] = sc.transform(X_test[:, 3:])

> X_train[:, 3:] skips the first three columns, which are the one-hot encoded country columns; binary dummy variables do not need scaling (see point 1 of the note above).

> Continue reading: Linear Regression
