Data Preprocessing Methods with Scikit-Learn — Python

Changhyun Kim
Jul 12, 2022


Data preprocessing is one of the key steps in data analysis and machine learning. Effective preprocessing is crucial for extracting as many insights as possible from data, and it can also help us obtain higher accuracy with ML models.

In this article, we will look at several ways of preprocessing data with Scikit-Learn, a free machine learning library for Python. Scikit-Learn is definitely one of the most useful and widely used libraries for machine learning, but in this article we will focus solely on preprocessing. Basic preprocessing steps such as dropping null values will not be covered here; instead, we will discuss methods that Scikit-Learn performs especially well, such as data encoding and feature scaling.

1. Data Encoding

Some of the most widely used data encoding methods are Label Encoding and One-Hot Encoding. Let us go through these methods with brief explanations and Python examples.

a) Label Encoding

Label encoding is basically a way of encoding categorical variables to numerical variables. For example, let’s consider a basket that contains fruits.

basket = ['apple', 'orange', 'grape', 'strawberry', 'melon', 'plum', 'banana', 'melon', 'plum', 'plum', 'grape', 'watermelon', 'melon', 'orange']

There are eight unique fruits (apple, orange, grape, strawberry, melon, plum, banana and watermelon), and some of them appear more than once in the basket. Now let's convert these categorical values to numeric form using the sklearn.preprocessing.LabelEncoder class.

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
labels = encoder.fit_transform(basket)
print(labels) #[0 4 2 6 3 5 1 3 5 5 2 7 3 4]

Using LabelEncoder, we can see that the categorical values (fruits) have been converted to numerical labels.

In order to understand which number represents which fruit, we can check the encoder's .classes_ attribute, where the position (index) of each class corresponds to its encoded label:
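print(encoder.classes_)
# ['apple' 'banana' 'grape' 'melon' 'orange' 'plum' 'strawberry' 'watermelon']
# apple=0, banana=1, grape=2, melon=3, orange=4, plum=5, strawberry=6, watermelon=7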

We can also convert the numerical labels back to the original categorical values by using the function inverse_transform():
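print(encoder.inverse_transform(labels))
# ['apple' 'orange' 'grape' 'strawberry' 'melon' 'plum' 'banana' 'melon'
#  'plum' 'plum' 'grape' 'watermelon' 'melon' 'orange']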

As we have seen, label encoding allows us to convert categorical values to numeric values. However, label encoding must be used only in appropriate cases. For example, feeding label-encoded features into a regression model can cause serious problems, because the model treats 2 as a larger value than 1. This means a regression model would treat grape (2) as 'greater' than banana (1), which is simply not true, since the fruits are not ordinal data in this case. For situations like this, we introduce a different encoding method called one-hot encoding.

b) One-Hot Encoding

One-hot encoding can be best explained with a small example.

Starting from an original dataset with a single categorical column, one-hot encoding adds a new feature (column) for each unique category and assigns a binary value (0 or 1) depending on the row's value. For example, a row that contains only 'Apple' gets 1 in the 'Apple' column and 0 in the 'Banana' and 'Orange' columns, as in the illustration below.
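Original        One-hot encoded
Fruit           Apple   Banana   Orange
Apple             1       0        0
Banana            0       1        0
Orange            0       0        1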

Let us bring back the same basket from the previous example. One option is to one-hot encode the data with OneHotEncoder() from Scikit-Learn; in the approach below, we first use LabelEncoder again to obtain integer labels. The code and result are as seen below.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
basket = ['apple', 'orange', 'grape', 'strawberry', 'melon', 'plum', 'banana', 'melon', 'plum', 'plum', 'grape', 'watermelon', 'melon', 'orange']
# encode the strings to integers, then reshape into a 2-D column vector
encoder = LabelEncoder()
labels = encoder.fit_transform(basket).reshape(-1, 1)
# one-hot encode the integer labels (returns a sparse matrix)
onehot_encoder = OneHotEncoder()
onehot_labels = onehot_encoder.fit_transform(labels)
onehot_labels.toarray()

We can see that the data are one-hot encoded with OneHotEncoder() from Scikit-Learn.
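As a side note, in recent versions of Scikit-Learn (0.20 and later), OneHotEncoder can handle string categories directly, so the LabelEncoder step can be skipped:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder expects a 2-D array, so reshape the list into a column
onehot_labels = OneHotEncoder().fit_transform(np.array(basket).reshape(-1, 1))
onehot_labels.toarray()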

Even though Scikit-Learn is a really convenient and effective library for many aspects of data analytics and machine learning, for one-hot encoding there is a much easier way of doing the same job using Pandas. Pandas provides the pd.get_dummies() function, which takes a dataframe and returns a one-hot encoded dataframe right away.

import pandas as pd
basket_df = pd.DataFrame(basket, columns = ['Fruit'])
pd.get_dummies(basket_df)
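The resulting dataframe has one indicator column per fruit, prefixed with the original column name:

list(pd.get_dummies(basket_df).columns)
# ['Fruit_apple', 'Fruit_banana', 'Fruit_grape', 'Fruit_melon',
#  'Fruit_orange', 'Fruit_plum', 'Fruit_strawberry', 'Fruit_watermelon']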

Instead of going through LabelEncoder and OneHotEncoder from Scikit-Learn, using pd.get_dummies() can save you time and effort!

2. Feature Scaling

Feature scaling is a method of ‘normalizing’ the variables or features of a dataset. It may be necessary in machine learning for several reasons: it can make training faster, and it can help gradient descent converge smoothly.

We will be looking at two different feature scaling methods from Scikit-Learn: StandardScaler and MinMaxScaler. The Iris data, a dataset provided by Scikit-Learn, will be used throughout this section for understanding the various scalers.

from sklearn.datasets import load_iris
import pandas as pd
# load the Iris dataset and put the features into a dataframe
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df

a) StandardScaler()

StandardScaler() in Scikit-Learn scales the values of each feature so that their mean is 0 and their variance is 1. This is often called standardization. Note that it changes the scale of the data, not its shape, so it does not turn an arbitrary distribution into a Gaussian one. Standardization still matters for many ML algorithms; for example, models such as SVMs and regularized linear or logistic regression tend to perform better when the features are on a comparable, standardized scale.
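Concretely, standardization replaces each value x with (x - mean) / std. As a minimal sketch, here is the same computation by hand for a single column (Scikit-Learn uses the population standard deviation, i.e. ddof=0):

col = iris_df['sepal length (cm)']
z = (col - col.mean()) / col.std(ddof=0)  # standardized column: mean 0, variance 1

Applying StandardScaler() to the whole dataframe does this for every feature at once: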

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standard_iris = scaler.fit_transform(iris_df)
standard_iris = pd.DataFrame(standard_iris, columns = iris.feature_names)
standard_iris

Using StandardScaler(), we can see that the values of the Iris data have been transformed. Since StandardScaler() is supposed to rescale the data to zero mean and unit variance, let's check that:
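print(standard_iris.mean())  # all means are approximately 0
print(standard_iris.var())   # all variances are approximately 1 (pandas uses ddof=1, so not exactly 1)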

We can clearly see that the mean is very close to 0 and the variance is nearly 1, indicating that the data have been successfully scaled.

b) MinMaxScaler()

MinMaxScaler() is another method of scaling data: it rescales each feature to a given range, [0, 1] by default. (A different range such as [-1, 1] can be chosen with the feature_range parameter.) After scaling the Iris data, the values of every feature should therefore lie between 0 and 1.
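The underlying formula is simple: each value x becomes (x - min) / (max - min). A minimal sketch for a single column:

col = iris_df['sepal length (cm)']
scaled = (col - col.min()) / (col.max() - col.min())  # values now lie in [0, 1]

MinMaxScaler() applies the same transformation to every feature: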

from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
minmax_iris = minmax.fit_transform(iris_df)
minmax_iris = pd.DataFrame(minmax_iris, columns = iris.feature_names)
minmax_iris
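We can verify the range of the scaled values:

print(minmax_iris.min())  # 0.0 for every feature
print(minmax_iris.max())  # 1.0 for every feature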

We can see that the data have been min-max scaled correctly using MinMaxScaler(), since the minimum and maximum values for each feature are 0 and 1, respectively.

Using these data encoding and feature scaling methods, we can be better prepared for training ML models on our dataset. However, it is also very important to choose the preprocessing method appropriate for the data and the model, rather than applying methods blindly. I will introduce more data preprocessing and scaling methods with Python that can improve your ML performance in the future. Thank you!

