Data Pre-processing in Python for Beginners

Evan Budianto
Data Science Indonesia
6 min read · Aug 20, 2021

When dealing with a machine learning project, real-world data is typically not ready to be used. There might be missing values or incorrect types in the dataset we receive. This rawness has to be dealt with first so that ML algorithms can be applied to the data. It is a common problem that all data-related professionals have to face.

The process of dealing with unclean data and transforming it into a form more appropriate for modeling is called data pre-processing. This step can be considered mandatory in the machine learning process for several reasons, such as:

  • data errors: Statistical noise or missing data needs to be corrected.
  • data types: Most machine learning algorithms require input data in the form of numbers.
  • data complexity: Some data may be so complex that an algorithm cannot perform well on it. Complexity can also be a cause of overfitting in a model.

While data pre-processing can differ from case to case, there are some common tasks that can be applied:

  • data cleansing
  • feature selection
  • data scaling
  • feature engineering
  • dimensionality reduction

We will explore these steps and implement them on a sample dataset using Python libraries.

Data Cleansing: Handling missing values

One of the most common data cleansing tasks is dealing with missing values. Basically, there are two ways to handle them:

  1. Remove rows with missing values
  2. Impute missing values

Removing rows is the simplest strategy and easy to execute. In contrast, imputing missing values is more involved. We can impute values using rules such as:

  • A constant value that has meaning within the domain and is distinct from other data, like 0 or -1.
  • A measure of central tendency, such as the mean, median, or mode.
  • A value predicted from the other features.

Even though most ML algorithms require a complete dataset, not all of them fail when data is missing. Some algorithms, like k-Nearest Neighbors and Naive Bayes, are robust to missing values, while others, like decision trees, can treat a missing value as a distinct value. Nevertheless, the scikit-learn implementations of those algorithms are not robust to missing values.

We are going to use the SimpleImputer class to replace every missing value, marked as NaN, with the mean value of its column. You can download the dataset here: Melbourne Housing Snapshot.

Four features have missing values. We will work on the ‘Age’, ‘BuildingArea’, and ‘YearBuilt’ features.
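Here is a minimal sketch of mean imputation with SimpleImputer, assuming the snapshot is saved as melb_data.csv; only ‘BuildingArea’ and ‘YearBuilt’ are imputed here, since the exact column names (including ‘Age’) may differ in your copy of the data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the Melbourne Housing Snapshot; the file name is an assumption
df = pd.read_csv("melb_data.csv")

# Columns to impute; adjust to whichever columns are missing in your copy
cols = ["BuildingArea", "YearBuilt"]

print(df[cols].isna().sum())               # how many values are missing per column

imputer = SimpleImputer(strategy="mean")   # replace every NaN with the column mean
df[cols] = imputer.fit_transform(df[cols])

print(df[cols].isna().sum())               # all zeros after imputation
```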

Feature Selection

In a nutshell, feature selection means removing irrelevant features. The reasons we need to do this are to:

  • reduce complexity
  • produce models that are easier to understand
  • reduce computational cost
  • prevent overfitting
  • improve model performance

These are common feature selection techniques, grouped by their underlying approach:

[Figure: overview of feature selection techniques (credit: machinelearningmastery.com)]

When using statistics-based feature selection, it is important to choose a method based on the data types of the input and output variables. The following decision tree helps decide which statistical method is suitable for our data:

[Figure: decision tree for choosing a statistical feature selection method (credit: machinelearningmastery.com)]

We are going to use the RFE method to select the most important features from our dataset. Recursive Feature Elimination (RFE) is popular due to its flexibility and ease of use. It reduces model complexity by removing features one by one until the desired number of features remains.

The scikit-learn Python machine learning library provides an implementation of RFE. To use it, the class is configured with the chosen algorithm via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.
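A sketch of RFE on the housing data follows; the choice of DecisionTreeRegressor as the estimator, six selected features, and using the numeric columns with Price as the target are all illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("melb_data.csv")               # file name is an assumption
df = df.dropna(subset=["Price"])                # make sure the target is complete

# Candidate features: all numeric columns except the target (illustrative choice)
X = df.select_dtypes("number").drop(columns=["Price"]).fillna(0)
y = df["Price"]

# Keep the six most important features according to the estimator
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=6)
rfe.fit(X, y)

# Selected=True marks the features RFE keeps; rank 1 means selected
for column, selected, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"Column: {column}, Selected={selected}, Rank: {rank}")
```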

The six most relevant features according to RFE are indicated by “Selected=True”

Feature Scaling

Many machine learning algorithms perform better when numerical input variables are scaled. This includes algorithms that use a weighted sum of the inputs, like linear regression, algorithms that use distance measures, like k-nearest neighbors, and gradient descent-based algorithms.

There are two common methods for scaling: normalization, which rescales values to the range [0, 1], and standardization, which rescales values to have zero mean and unit variance.

For the Melbourne Housing dataset, we are going to implement normalization using the scikit-learn class called MinMaxScaler.
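A minimal sketch of min-max normalization; the set of columns to scale is an illustrative assumption:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("melb_data.csv")               # file name is an assumption

# Columns to normalize (illustrative choice)
cols = ["Landsize", "BuildingArea", "YearBuilt"]

scaler = MinMaxScaler()                         # rescales each column to [0, 1]
df[cols] = scaler.fit_transform(df[cols])

print(df[cols].max())                           # every maximum is now 1.0
```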

All maximum values have been scaled to 1

Feature Engineering

Feature engineering is the process of transforming data so that it better represents the underlying problem to the predictive models. It is an iterative process that interplays with data selection and model evaluation, again and again.

The general process of feature engineering is commonly divided between numerical and categorical features.

Examples of feature engineering transformations include the following (a short pandas sketch follows the list):

  • Feature Generation: feature 1 + feature 2, feature 1 × feature 2, feature 1 / feature 2, etc.
  • Decomposing Categorical Attributes: item_color -> is_red, is_blue; gender -> is_male, is_female (one-hot encoding)
  • Decomposing a Date-Time: datetime -> hour_of_day; hour -> morning, night
  • Reframing Numerical Quantities: weight -> above_70, below_70
  • etc.
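A minimal pandas sketch of a few of these transformations; the source columns (Rooms, Bathroom, Date, Landsize) come from the Melbourne data, while the derived feature names and thresholds are made up for illustration:

```python
import pandas as pd

df = pd.read_csv("melb_data.csv")                       # file name is an assumption

# Feature generation: combine two existing features
df["RoomsPerBathroom"] = df["Rooms"] / df["Bathroom"].replace(0, 1)

# Decomposing a date-time: extract parts of the sale date
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["SaleMonth"] = df["Date"].dt.month
df["SaleYear"] = df["Date"].dt.year

# Reframing a numerical quantity into a binary flag (threshold is illustrative)
df["LargeLand"] = (df["Landsize"] > 500).astype(int)

print(df[["RoomsPerBathroom", "SaleMonth", "SaleYear", "LargeLand"]].head())
```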

Tips for doing numerical feature engineering effectively:

  1. Ask a domain expert
  2. Discretize continuous values
  3. Combine two or more features
  4. Use simple descriptive statistics

Next, for handling categorical features, there are several methods collectively called encoding. These are three common encoding techniques, each followed by a short example.

Label Encoding

  • Gives every unique category a numerical ID.
  • Useful for non-linear and tree-based algorithms.
  • Does not increase dimensionality.
  • Useful for ordinal data.
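A minimal sketch using scikit-learn’s LabelEncoder; encoding the ‘Type’ column of the housing data is an illustrative choice:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("melb_data.csv")               # file name is an assumption

encoder = LabelEncoder()
# Each unique value in Type (e.g. 'h', 't', 'u') becomes an integer ID
df["Type_encoded"] = encoder.fit_transform(df["Type"])

print(df[["Type", "Type_encoded"]].drop_duplicates())
```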

One-Hot Encoding

  • Creates a new feature for every unique value.
  • Memory usage depends on the number of unique categories.
  • Similar to dummy encoding, which generates n-1 new columns, while OHE generates n new columns, where n is the number of unique values in the encoded feature.

Binary Encoding

  • Variables -> numerical label (label encoding) -> binary number -> split every digit into different columns.
  • Useful for features with a large number of unique values; increases dataset dimensionality only logarithmically.
  • Only needs about log2(n) new columns, where n is the number of unique values in the encoded feature.
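A sketch of binary encoding using the third-party category_encoders package (an assumption; it is not part of scikit-learn and must be installed separately, e.g. with pip):

```python
import pandas as pd
import category_encoders as ce                  # third-party package, an assumption

df = pd.read_csv("melb_data.csv")               # file name is an assumption

# Suburb has hundreds of unique values, a good candidate for binary encoding
encoder = ce.BinaryEncoder(cols=["Suburb"])
encoded = encoder.fit_transform(df)

# Roughly log2(number of suburbs) new Suburb_* columns instead of one per suburb
print([c for c in encoded.columns if c.startswith("Suburb")])
```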

The following code shows how to implement one-hot encoding with the pandas get_dummies function.
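A minimal sketch, one-hot encoding the ‘Type’ column (the column choice is illustrative):

```python
import pandas as pd

df = pd.read_csv("melb_data.csv")               # file name is an assumption

# get_dummies creates one 0/1 column per unique value of Type
dummies = pd.get_dummies(df["Type"], prefix="Type")
df = pd.concat([df, dummies], axis=1)

print(dummies.head())
```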

Dimensionality Reduction

More input features often make a predictive modeling task more challenging to model, a problem generally referred to as the curse of dimensionality.

Dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

One of the most popular techniques for dimensionality reduction in machine learning is Principal Component Analysis (PCA).
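A minimal PCA sketch with scikit-learn; standardizing first and keeping two components are illustrative choices:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("melb_data.csv")               # file name is an assumption

# PCA needs complete numeric data, so keep only numeric columns and drop missing rows
X = df.select_dtypes("number").dropna()

# Standardize first so features with large ranges do not dominate the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                       # keep the two directions of largest variance
components = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)            # share of variance captured by each component
```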

Handling Outliers

Many datasets contain outliers that can heavily affect model training results. In Python, outliers can be easily detected using a boxplot visualization.
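A minimal sketch with pandas and matplotlib, plotting the two columns inspected below:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("melb_data.csv")               # file name is an assumption

# Points drawn beyond the whiskers are potential outliers
df[["Landsize", "BuildingArea"]].dropna().plot(kind="box", subplots=True, figsize=(10, 4))
plt.show()
```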

Both the Landsize and BuildingArea features have outliers

We can adjust the outliers without any additional library using the winsorization method: outlier values are replaced by threshold values called the upper and lower bounds.
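A minimal sketch of winsorization using only pandas, with the 5th and 95th percentiles as the lower and upper bounds (the percentile choice is an assumption):

```python
import pandas as pd

df = pd.read_csv("melb_data.csv")               # file name is an assumption

for col in ["Landsize", "BuildingArea"]:
    lower = df[col].quantile(0.05)              # lower bound
    upper = df[col].quantile(0.95)              # upper bound
    # Values below/above the bounds are replaced by the bounds themselves
    df[col] = df[col].clip(lower=lower, upper=upper)

print(df[["Landsize", "BuildingArea"]].describe())
```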

Those are several common methods for data preparation. Every project is unique and may need a different approach to data pre-processing and cleansing.
