Data-Preprocessing: Begin before you think you’re ready…

Photo by Pixabay from Pexels

Data preprocessing is the foundation on which all machine learning models are built. That being said, for an optimal result, our data integrity had better be stronger than concrete.

It’s not just an important “step” in the grand scheme of things; it’s more of an interpreter between our data and the machine learning algorithm. Language can either be a barrier or, very possibly, a breaker of barriers, and rest assured, data pre-processing is the latter.

Put simply, for our algorithms to work, we need to feed them data in a format they can compute on. Just like computers operate in binary, that’s pretty much what we’re going to do: turn data of an unfavorable type into ones and zeroes, my friend…….1s and 0s.
That’s not all: there are missing values to either replace or remove, data to split into training and testing sets, and lastly, features to scale, only AFTER the split.

By the end of this post, you will have cleaned a dataset ready for further analysis…

We begin by importing pre-requisite libraries…

import numpy as np
import pandas as pd

NumPy, so that we can work with arrays; Pandas, so that we can import the dataset and work with the matrix of features and the target variable.

Importing the dataset and first impressions…

df = pd.read_csv('Data.csv')
df
df.info()
df.nunique()
df.describe()

We can immediately notice that the “Country” feature is not numerical, and that there are 3 unique countries (Encoding).
The same goes for the “Purchased” variable, which also happens to be our target variable: a binary output, but a string nonetheless.
The feature “Age” is in the range of 27–48 while “Salary” is in the range of 48000–83000 (Feature Scaling).
And the most obvious one, the NaN values (Missing Values).

Separating the dependent and the independent features…

X = df.iloc[:, :-1].to_numpy() # to_numpy for further operations
y = df.iloc[:, -1].to_numpy()
print(X)
print(y)

So, through the iloc[] method we’re asking to capture all rows and all the columns except the last one, convert them to a NumPy array, and store it in the variable “X”.

Similarly, “y” has all the rows and only the last column.

Now we fill the gaps… (Missing Values)

Missing values don’t bode well when training machine learning models, like, at all (except for Naïve Bayes and some tree-based algorithms), so we’ll have to deal with them one way or another.

We could simply remove the rows that are bothering us by using the dropna() method, but there are times you can make the most of the data by replacing the missing values with something sensible.

We could use fillna(), the interpolate() method, or even an external library to make that happen; just pick one method that suits you best, and stick to it.
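
For reference, here’s a minimal sketch of those pandas options on the same df (assuming the column names shown by df.info() above, i.e. “Age” and “Salary”):

# Pandas alternatives for handling missing values (pick one, don't mix)
df_dropped = df.dropna()                             # remove rows containing NaN
df_filled = df.fillna(df.mean(numeric_only = True))  # replace NaNs with each numeric column's mean
df_interp = df.copy()
df_interp[['Age', 'Salary']] = df_interp[['Age', 'Salary']].interpolate()  # linear interpolation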

The external library way sounds interesting, let’s go with that…

from sklearn.impute import SimpleImputer

So, we’re accessing a class called SimpleImputer from the impute module of the sklearn (scikit-learn) library.

impute_er = SimpleImputer(missing_values = np.nan, strategy = 'mean')

Just like in object-oriented programming, we have made an instance/object of the SimpleImputer class.
The first argument tells it what counts as missing (np.nan), and the second tells it what to replace those values with (the strategy).

At the moment we’re replacing missing entries with the mean of the corresponding column; other options include most frequent, constant, and median, which are pretty much self-explanatory.
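
For illustration, those alternative strategies would be set up like this (none of these objects are used below, they’re just sketches):

# Alternative SimpleImputer configurations (illustrative only)
median_imputer = SimpleImputer(missing_values = np.nan, strategy = 'median')
frequent_imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
constant_imputer = SimpleImputer(missing_values = np.nan, strategy = 'constant', fill_value = 0)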

X[:, 1:3] = impute_er.fit_transform(X[:, 1:3])
print(X)

The fit method computes the column means from the matrix of features, and the transform method replaces the missing values with them.
The output, as expected, feels a bit more complete.

The language barrier…

We know why we’re doing this (because machine learning models can’t handle strings), but here’s the thing: if we encode the “Country” feature with 0 for France, 1 for Spain and 2 for Germany, the model will ‘think’ that Germany has greater value than the other countries, which is not at all true, unless we’re talking football (Germany 4 World Cups, France 2, Spain 1).

The aforementioned method is called Label Encoding, and it’s useful when there is a hierarchy of things, like Senior > Junior > Associate.

One Hot Encoding will help us with our current predicament by making as many separate columns as the number of categories, which in our case would mean 3 columns, each depicting 0 for False and 1 for True.

P.s… We’ll be removing one of the three columns to avoid multicollinearity.

One way to go about this is to use the pd.get_dummies() method, which can work directly on a data frame, but we’ll go ahead and use an external class (ColumnTransformer) to work with our matrix of features.
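
In case you’re curious, the pandas shortcut would look something like this; it’s just a sketch on the original data frame and not what we’ll use below:

# The pandas one-liner (works on the DataFrame directly, not on our NumPy matrix X)
dummies_df = pd.get_dummies(df, columns = ['Country'], drop_first = True)
print(dummies_df.head())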

#Class
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#Object
CT_obj = ColumnTransformer( transformers = [('encoder', OneHotEncoder(drop = 'first'), [0])], remainder = 'passthrough')
# Connecting and replacing
X = np.array(CT_obj.fit_transform(X))
print(X)

drop = 'first' removes one of the three columns made from the “Country” feature, to avoid multicollinearity, and
remainder = 'passthrough' keeps the rest of the features, on which we did not perform One Hot Encoding.
And as for our target feature, let’s use LabelEncoder, just to get the gist of its code…

#Class
from sklearn.preprocessing import LabelEncoder
#Object
LE_obj = LabelEncoder()
# Connecting and replacing
y = LE_obj.fit_transform(y)
print(y)

When to use LabelEncoder or OneHotEncoder.

Bottom line: use LabelEncoder when dealing with ordinal features (a hierarchy), or when there are too many categories to one-hot encode (we used it on a binary target here just to familiarise ourselves with the code).

Use OneHotEncoder when there is no hierarchy but we still want to separate those classes, and the number of classes is small.
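
As an aside, when a feature (rather than the target) is genuinely ordinal, scikit-learn also offers OrdinalEncoder, which lets you spell out the hierarchy explicitly. A minimal sketch with a made-up “rank” feature, not part of our dataset:

# Illustrative only: encoding a hypothetical ordinal feature with an explicit order
from sklearn.preprocessing import OrdinalEncoder
rank_encoder = OrdinalEncoder(categories = [['Associate', 'Junior', 'Senior']])
ranks = rank_encoder.fit_transform([['Junior'], ['Senior'], ['Associate']])
print(ranks) # Associate -> 0, Junior -> 1, Senior -> 2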

Splitting data into training and testing…

Why?… so that we can evaluate the performance of the model, and it is much easier to establish how close the model’s estimates are.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
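
By default the split is shuffled randomly on every run, so if you want a reproducible split (say, to compare results across runs), you can fix the seed; the value 42 below is just an arbitrary example:

# Optional: fix the random seed so the 80/20 split is the same every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)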

Feature Scaling

Not all machine learning models need this (Multiple Linear Regression, for example, doesn’t), but there will be times when we need all our features on the same scale (for distance-based algorithms like KNN) so that we can avoid a few features dominating the others. In other words, features on a larger scale would carry more weight in the final outcome, and the features on a smaller scale would effectively be ignored.

What you must understand is that there is a reason why we’re scaling only after the train/test split of the data.

The two main feature scaling techniques

So, let’s say we’re using the standardisation method: “x” is a value in the dataset, and “mean(x)” and “std dev(x)” are pretty much self-explanatory, namely the mean and the standard deviation of that particular column (feature).
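
To make the two formulas concrete, here’s a tiny sketch on a made-up numeric column (the numbers are arbitrary, just for illustration):

# The two scaling formulas spelled out on a toy column (illustrative values only)
col = np.array([27., 35., 48.])
standardised = (col - col.mean()) / col.std()              # values land roughly between -3 and +3
normalised = (col - col.min()) / (col.max() - col.min())   # values land between 0 and 1
print(standardised)
print(normalised)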

The test set is supposed to be completely new values that will be used to predict outcomes for which we already have the answers to.

“Completely new” being the key words, so when we’re scaling the features, we cannot possibly include the test set as well, because the mean and standard deviation would come out different.
Which is why we scale our features only after the train/test split.

Standardisation works well in general, and the resulting values will be more or less in the range of -3 to +3.
Normalisation works great when most of the features are normally distributed (bell curve) and gives output in the range of 0 to 1.

P.s, in case you were wondering, we do not apply feature scaling on the One hot encoded columns, because that would make it lose its value as an indicator of a particular class (Germany, France, Spain).

from sklearn.preprocessing import StandardScaler
scale_obj = StandardScaler()
X_train[:, 2:] = scale_obj.fit_transform(X_train[:, 2:])
X_test[:, 2:] = scale_obj.transform(X_test[:, 2:])
print(X_train)
print(X_test)

You’ll notice that we only applied .transform to the test set. Had we used fit_transform on the test set, the scaler would have been different; that is, the mean and standard deviation of the test set would have been used instead of those of the training set. We need to keep the same scaler for our model, which is why, with only .transform, we use the scaler fitted on the training set to transform the values in the test set too.

Conclusion

Data pre-processing is an art. The value of a true data scientist lies not only in solving the business problem thrown at you, but also in knowing which tools to use when, and in preparing the prologue for a meaningful narrative.
