Data Preprocessing Steps for Machine Learning in Python (Part 1)

Learn with Nas
Women in Technology
Sep 30, 2023 · 14 min read

Data Preprocessing, also known as Data Preparation or Data Cleaning, is the practice of identifying and correcting erroneous or misleading records within a dataset. It involves pinpointing flawed, incomplete, or irrelevant parts of the data and then modifying, replacing, or removing this impure or coarse data [1]. Data preprocessing techniques are used to prepare the data that trains AI models, including machine learning models, and they are generally applied at the early stages of a project to ensure accurate results [2]. Keep in mind that data preprocessing is a broad term covering a wide range of tasks, from formatting the data to creating features, all depending on the nature of your AI project.

This preparatory phase not only enhances the overall quality of the data but also streamlines the modelling process, ultimately leading to more reliable and accurate predictive models. This article delves into the vital role that Data Preprocessing plays in the context of Machine Learning, shedding light on its various aspects and emphasizing its necessity for achieving meaningful and impactful results.

Why is it important?

The significance of Data Preprocessing in Machine Learning cannot be overstated, as it forms the cornerstone of any successful data analysis or machine learning endeavour. In the realm of data-driven technologies, the quality and suitability of data directly influence the outcomes and effectiveness of machine learning models.

Data Preprocessing involves a series of steps such as:

  1. Data Collection
  2. Data Cleaning
  3. Data Transformation
  4. Feature Engineering: Scaling, Normalization and Standardization
  5. Feature Selection
  6. Handling Imbalanced Data
  7. Encoding Categorical Features
  8. Data Splitting

Step 1: Data Collection

The cornerstone of machine learning is rooted in data. Collecting data involves gathering information aligned with the goals and objectives of your AI project. If you feed subpar or low-quality data into your model, it will not produce satisfactory outcomes. This holds true regardless of the model’s complexity, the expertise of the data scientist, or the financial investment in the project [3].

While some companies have been accumulating data for years, ensuring a steady supply for machine learning, those lacking sufficient data can turn to reference datasets available online to complete their AI projects. Discovering new data and sharing it can be achieved through three methods: collaborative analysis (DataHub), web (Google Fusion Tables, CKAN, Quandl, and Data Market), and a combination of collaboration and web use (Kaggle). Additionally, there are specialized data retrieval systems, including data lakes (Google Dataset Search) and web-based platforms (WebTables) [3].

Assuming we have all the necessary data, we can proceed with creating a dataset.

# import library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load the data
df = pd.read_csv('data/credit_scoring_eng.csv')
df.head(5)

Result:

Data Description

  • children - number of children in the family
  • days_employed - number of days employed
  • dob_years - client's age in years
  • education - client's education level
  • education_id - education identifier
  • family_status - marital status
  • family_status_id - marital status identifier
  • gender - client's gender
  • income_type - type of employment
  • debt - whether the client has a loan debt
  • total_income - monthly income
  • purpose - purpose of the loan application

Step 2: Data Cleaning

This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers and duplicates. Various techniques can be used for data cleaning, such as imputation, removal or transformation [4].

Implementing in Python:

2-a: Handling missing values

# check the dataset information
df.info()

Result:

Findings:

There are missing values in the days_employed and total_income columns: both contain fewer non-null entries than the dataset's 21,525 rows.

# check the percentage
df.isna().sum() / len(df)

Result:

Findings:

The missing-value percentage for both columns is around 10%.

# visualize missing data using a seaborn heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

Result:

Findings:

1. The missing values form a pattern: they appear to be driven by employment type, since clients with the income_type 'student' or 'unemployed' have no income and therefore leave the 'days_employed' and 'total_income' columns empty (a quick check of this pattern is sketched after this list).

2. This conclusion is reinforced by the seaborn heatmap: whenever the value in the 'days_employed' column is missing, the 'total_income' value in the same row is also missing (the missingness is symmetrical).

3. Since the missing values occur only in the 'days_employed' and 'total_income' columns, and both columns hold float data (numeric/ratio scale), the gaps will be filled using a statistical summary such as the mean or median.

4. The median is chosen to fill in the missing values because, unlike the mean, it is robust to outliers [5].
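Before imputing, the suspected pattern can be verified directly. Below is a minimal sketch (not part of the original notebook, using the df loaded above) that counts the missing 'total_income' values per 'income_type' and checks that 'days_employed' and 'total_income' are missing in the same rows:

# count missing 'total_income' values per employment type
missing_by_income_type = (
    df.assign(income_missing=df['total_income'].isna())
      .groupby('income_type')['income_missing']
      .sum()
      .sort_values(ascending=False)
)
print(missing_by_income_type)

# rows where exactly one of the two columns is missing (expected to be 0 if the missingness is symmetrical)
print((df['total_income'].isna() != df['days_employed'].isna()).sum())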

# function to fill in missing values with the per-group median
def data_imputation(data, column_grouping, column_selected):
    # data            => the DataFrame to be processed
    # column_grouping => the column used to group rows before taking the median
    # column_selected => the column whose NaN values will be filled

    # get the unique category groups
    group = data[column_grouping].unique()

    # loop through each value in the grouping column
    for value in group:
        # median of the non-missing values in this group
        median = data.loc[(data[column_grouping] == value) & ~(data[column_selected].isna()), column_selected].median()

        # replace the missing values in this group with the group median
        data.loc[(data[column_grouping] == value) & (data[column_selected].isna()), column_selected] = median

    # return the dataframe after filling the missing values
    return data

# apply the function to the 'total_income' column
# (assumes an 'age_category' column was derived earlier in the full notebook)
df = data_imputation(data=df, column_grouping='age_category', column_selected='total_income')

# apply the function to the 'days_employed' column
df = data_imputation(data=df, column_grouping='age_category', column_selected='days_employed')

# verify that the missing values have been filled
df.info()

Result:

2-b: Handling outliers

# check for outliers in the 'children' column
sns.boxplot(df['children'])

# check descriptive statistics of the 'children' column
df['children'].describe()

Findings:

1. Based on the statistical data above, I will replace the value 20 with the value 2, assuming it was an input error.

2. I will remove the minus sign (-), assuming it was an input error.

# replace the value 20 with the value 2
condition_children = df['children'] == 20
df['children'] = df['children'].mask(condition_children, 2)

# remove the minus sign
df['children'] = abs(df['children'])

# check for outliers in the 'days_employed' column
sns.boxplot(df['days_employed'])

# percentage of rows with negative values or extreme outliers in 'days_employed'
len(df.loc[(df['days_employed'] < 0) | (df['days_employed'] > 200000)]) / len(df)

Result:

0.8990011614401858

Findings:

1. Two issues were identified in the 'days_employed' column:

  • Too many digits after the decimal point.
  • Negative values and extreme outliers, affecting a large share of rows (approximately 89%).

2. The steps to address these issues are as follows:

  • Remove the minus sign (-).
  • Round the values.
  • Replace the outlier values.

# remove minus sign (-), assuming it was an input error
df['days_employed'] = abs(df['days_employed'])
# round 
df['days_employed'] = round(df['days_employed'],0)
# check data distribution
df['days_employed'].describe()

Result:

Findings:

The mean does not represent the data well because it is skewed by outliers. Therefore, the outliers will be replaced with the median value.

# Replace outlier with median
condition_de = (df['days_employed'] > 200000) & (df['days_employed'].notnull())
df['days_employed'] = df['days_employed'].mask(condition_de, df['days_employed'].median())
# verify the result
sns.boxplot(df['days_employed'])

Result:

2-c: Handling Duplicates

Findings:

1. There are 72 duplicate rows in the dataset (see the quick check below).

2. These duplicate rows will be removed, and the index will be reset.
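As a quick sanity check (a sketch, not from the original notebook), the number of fully duplicated rows can be confirmed before dropping them:

# count fully duplicated rows
print(df.duplicated().sum())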

# remove duplicate data and do reset index
df = df.drop_duplicates().reset_index(drop=True)

You can visit my GitHub account to access the complete code related to the above example.

Step 3: Data Transformation

Data transformation involves technically converting data from one format, standard, or structure to another, without changing the dataset’s content. This is typically done to prepare the data for consumption by an application or user, or to enhance data quality. The specifics of data transformation can vary based on the techniques employed. In this article, I will utilize a data aggregation technique for data transformation [6].

Data aggregation is a method used to present data in a summarized form. Given the likelihood of data originating from diverse sources, combining all incoming data into a cohesive description is the essence of data aggregation. This facet of data processing holds significance as it hinges on the quality and quantity of the data at hand. An illustrative example of this process is generating an annual sales report by consolidating quarterly or monthly data [7].

There are many ways to aggregate data in Pandas, including:

a. Utilizing the groupby() function: Grouping involves breaking a dataset down into smaller subsets based on specific variables. This approach is widely used for data exploration and analysis. Pandas' groupby() function is highly versatile and allows data to be grouped by one or more columns; after grouping by the selected variables, a range of aggregation functions can be applied to the resulting groups [8].

Implementing in Python:

We will use groupby() to analyze whether user reviews and professional (critic) reviews influence platform sales.

# preparing the dataset
top2_ref_df = (
    reference_df
    .groupby(['platform', 'name'])[['total_sales', 'critic_score', 'user_score']]
    .sum()
    .query('platform == "PS3" & critic_score > 0 & user_score > 0')
    .reset_index()
)
top2_ref_df = top2_ref_df[['name', 'platform', 'total_sales', 'critic_score', 'user_score']]
top2_ref_df
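The example above uses a single aggregation (.sum()). As a generic sketch (with a small hypothetical DataFrame, not the games dataset), groupby() can also apply several aggregation functions at once via .agg():

import pandas as pd

# hypothetical sales data, for illustration only
toy_sales = pd.DataFrame({
    'platform': ['PS3', 'PS3', 'X360', 'X360', 'X360'],
    'total_sales': [1.2, 0.8, 2.5, 1.5, 0.5],
})

# group by platform and apply several aggregation functions at once
summary = toy_sales.groupby('platform')['total_sales'].agg(['count', 'sum', 'mean'])
print(summary)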

b. Using the pivot_table() function: We've seen how the GroupBy concept lets us explore relationships within a dataset. A pivot table is a similar operation, commonly found in spreadsheet software and other programs that work with tabular data. A pivot table takes column-wise input data and organizes it into a two-dimensional table, providing a multidimensional summary of the information. Distinguishing between pivot tables and GroupBy can be confusing at times; it helps to think of pivot tables as essentially a multidimensional form of GroupBy aggregation. In other words, you still perform the split-apply-combine process, but both the splitting and the combining happen across a two-dimensional grid rather than along a one-dimensional index [9].
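To make this analogy concrete, here is a minimal sketch (with a toy DataFrame, not the games dataset) showing that a pivot table is equivalent to a GroupBy on two keys followed by unstack():

import pandas as pd

# toy data, for illustration only
toy = pd.DataFrame({
    'region':   ['NA', 'NA', 'EU', 'EU'],
    'platform': ['PS3', 'X360', 'PS3', 'X360'],
    'sales':    [1.0, 2.0, 3.0, 4.0],
})

# two-dimensional summary with pivot_table ...
pivot = toy.pivot_table(index='platform', columns='region', values='sales', aggfunc='sum')

# ... is equivalent to a GroupBy on two keys followed by unstack()
grouped = toy.groupby(['platform', 'region'])['sales'].sum().unstack()

print(pivot.equals(grouped))  # True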

Implementing in Python:

We will analyze the top 5 platforms in the NA, EU, and JP regions and visualize how their market share varies from one region to another.

# aggregate sales for each platform in the NA, EU, and JP regions
agg_selected_region_platform = pd.pivot_table(
    data=reference_df,
    index='platform',
    values=['na_sales', 'eu_sales', 'jp_sales'],
    aggfunc='sum'
).sort_values(by='jp_sales')
agg_selected_region_platform

We can visualize the data:

# visualizing the sales for each platform in NA, EU, JP regions
plt.figure(figsize=(20,6))
plt.title('Distribution of game sales on each platform in the EU, NA, and JP regions')
sns.lineplot(data=agg_selected_region_platform)
plt.show()

You can visit my GitHub account to access the complete code related to the above example.

Step 4: Feature Engineering: Scaling, Normalization and Standardization

Feature engineering constitutes a pivotal stage in the creation of accurate and efficient machine learning models. A significant facet of feature engineering involves scaling, normalization, and standardization, encompassing the alteration of data to enhance its suitability for modeling. Employing these methods can enhance model accuracy, mitigate the influence of outliers, and ensure uniformity in data scale. This article delves into the fundamentals of scaling, normalization, and standardization [10].

Feature Scaling

Feature scaling is a crucial step in data preprocessing, aiming to standardize the values of features or variables within a dataset to a uniform scale. The primary objective is to ensure that all features have a fair influence on the model, avoiding the dominance of features with higher values. The necessity for feature scaling arises when working with datasets that encompass features having diverse ranges, units of measurement, or orders of magnitude. In such scenarios, discrepancies in feature values can introduce bias in model performance or hinder the learning process. Through the application of feature scaling, the features in a dataset can be harmonized to a consistent scale, simplifying the construction of precise and efficient machine learning models. Scaling promotes meaningful feature comparisons, enhances model convergence, and prevents specific features from dominating others solely based on their magnitude [10].

Why Should We Use Feature Scaling?

Certain machine learning algorithms exhibit sensitivity to feature scaling, whereas others remain mostly unaffected by it. Let’s delve into a detailed examination of this aspect.

1. Gradient-Descent-Based Algorithms

Machine learning algorithms that use gradient descent as an optimization technique (such as linear regression and logistic regression) require the data to be scaled [10]. See the pipeline sketch after this list.

2. Distance-Based Algorithms

Algorithms based on distance metrics, such as K-nearest neighbors (KNN), K-means clustering, and support vector machines (SVM), are highly influenced by the range of features. This is because these algorithms rely on calculating distances between data points to ascertain their similarity [10].
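For gradient-descent-based estimators (point 1 above), a common pattern is to chain a scaler and the model in a single pipeline so that scaling is always applied before optimization. A minimal sketch, assuming a feature matrix X and target y exist:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# scale the features, then fit a gradient-descent-based regressor
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
# model.fit(X, y)  # X and y are assumed to be defined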

Implementing in Python:

A function will be created to calculate the distance using the k-nearest neighbors algorithm based on two distance metrics: Euclidean and Manhattan. We will then compare the distance results on both unscaled and scaled data.

The df dataset:

# libraries assumed to be imported earlier in the notebook
import pandas as pd
import sklearn.neighbors

# function for calculating kNN distances
def get_knn(df, n, k, metric):
    """
    Display the k nearest neighbors:
    param df: pandas DataFrame in which to search for similar objects
    param n: index of the object for which the k nearest neighbors are sought
    param k: number of nearest neighbors to display
    param metric: name of the distance metric
    """
    # feature_names is a global list of feature columns, defined before this function is called
    nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=metric, algorithm='brute')
    nbrs.fit(df[feature_names])
    nbrs_distances, nbrs_indices = nbrs.kneighbors([df.iloc[n][feature_names]], k, return_distance=True)

    # combine the neighbor rows with their distances
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]],
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
    ], axis=1)

    return df_res

Using unscaled data (df):

# euclidean metric - unscaled data
get_knn(df, 1, 50, 'euclidean')

Result:

# manhattan metric - unscaled data
get_knn(df, 1, 50, 'manhattan')

Findings:

When using unscaled data, the results are the same for both distance metrics: for the object at index 1, both metrics return the same nearest neighbors (for example, the rows at indices 3920, 4948, 2528, and 3593).

Using scaled data:

For instance, age and income are measured on different scales (age in years, income in dollars), so the data needs to be scaled.

MaxAbsScaler scales each feature by its maximum absolute value: every observation is divided by the maximum absolute value of that variable. After this transformation, the values roughly fall within the range of -1 to 1.

# scaling the data using MaxAbsScaler
import sklearn.preprocessing

feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())
df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

The df_scaled dataset:

# euclidean metric - scaled data
get_knn(df_scaled, 1, 10, 'euclidean')
# manhattan metric - scaled data
get_knn(df_scaled, 1, 10, 'manhattan')

The question is: does unscaled data affect the kNN algorithm? If so, how?

Yes. When the data is not scaled, the results are the same regardless of the metric used, because the feature with the largest values (income, measured in dollars) dominates the distance calculation, while features on smaller scales (such as age, measured in years) contribute almost nothing. The resulting neighbors can therefore be misleading.

In calculations it is important to keep the features on a consistent scale. For example, age and income use different scales (age in years, income in dollars).
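To see why, consider a tiny numeric illustration (hypothetical values, not rows from the dataset):

import numpy as np

# two hypothetical clients: [age in years, income in dollars]
a = np.array([25, 40_000])
b = np.array([60, 41_000])

# unscaled: the income difference (1000) dominates the age difference (35)
print(np.linalg.norm(a - b))  # ~1000.6

# after MaxAbs-style scaling (divide each feature by its maximum absolute value),
# both features contribute on a comparable scale
scale = np.array([60, 41_000])
print(np.linalg.norm(a / scale - b / scale))  # ~0.58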

You can visit my GitHub account to access the complete code related to the above example:

Normalization

Normalization, a data preprocessing approach, rescales feature values within a dataset to a consistent range. This is done to streamline data analysis and modeling and to mitigate the influence of disparate scales on machine learning model accuracy [10]. A common form is min-max normalization, which maps each feature to the [0, 1] range using x' = (x - min) / (max - min); this is what scikit-learn's MinMaxScaler (used below) implements.

Implementing in Python:

# import library
from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the training data only (X_train and X_test are assumed to be defined)
scaler = MinMaxScaler().fit(X_train)

# transform the training data
X_train_scaled = scaler.transform(X_train)

# transform the test data
X_test_scaled = scaler.transform(X_test)

Standardization

Standardization, a form of scaling, involves centering values around the mean and adjusting the standard deviation to one unit. Consequently, the attribute’s mean becomes zero, and the resulting distribution maintains a unit standard deviation [10].

Now, let's proceed with scikit-learn's StandardScaler for standardizing features. This process removes the mean and scales to unit variance, resulting in a mean of 0 and a standard deviation of 1, which aligns the data with a standard normal distribution [11].
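As a minimal sketch (with a hypothetical toy column, not the project data), manual standardization using z = (x - mean) / std matches StandardScaler's output:

import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy column, for illustration only
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# manual standardization: z = (x - mean) / standard deviation
z_manual = (x - x.mean()) / x.std()

# StandardScaler performs the same computation (population standard deviation)
z_scaler = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_scaler))  # True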

Implementing in Python:

We will compare the metric results before and after implementing StandardScaler.

Before standardizing the features:

# imports for the model and metrics (if not already imported earlier in the notebook)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

beforeScaling_lr = LogisticRegression(random_state=42)

# train the model on the training set
beforeScaling_lr.fit(features_train, target_train)

# predict on the validation set
y_predict_valid_lr = beforeScaling_lr.predict(features_valid)

# predicted probabilities on the validation set
y_probability_valid_lr = beforeScaling_lr.predict_proba(features_valid)[:, 1]

# evaluate performance with the F1 score and AUC-ROC
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Standardizing the features:

# standardize the numeric features using StandardScaler
# (df_numerical is the list of numeric column names, defined earlier in the notebook)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_train[df_numerical] = scaler.fit_transform(features_train[df_numerical])
features_valid[df_numerical] = scaler.transform(features_valid[df_numerical])
features_test[df_numerical] = scaler.transform(features_test[df_numerical])

After standardizing the features:

afterScaling_lr = LogisticRegression(random_state=42)

# train the model on the training set
afterScaling_lr.fit(features_train, target_train)

# predict on the validation set
y_predict_valid_lr = afterScaling_lr.predict(features_valid)

# predicted probabilities on the validation set
y_probability_valid_lr = afterScaling_lr.predict_proba(features_valid)[:, 1]

# evaluate performance with the F1 score and AUC-ROC
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Findings:

Both the F1 score and the AUC-ROC score increased after standardizing the features.

You can visit my GitHub account to access the complete code related to the above example.

In Part 2, I will delve into Feature Selection, Handling Imbalanced Data, Encoding Categorical Features, and Data Splitting. Keep an eye out for this continuation, where we'll explore these essential steps in detail!

References:

1. Shaomin Wu, A review on coarse warranty data and analysis (2013)

2. George Lawton, Data Preprocessing (2022), https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing

3. Yuliia Kniazieva, What is Data Collection in Machine Learning (2022), https://labelyourdata.com/articles/data-collection-methods-AI

4. Deepak Jain, Data Preprocessing in Data Mining (2023), https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

5. Cross Validated discussion, https://stats.stackexchange.com/questions/143700/which-is-better-replacement-by-mean-and-replacement-by-median

6. Chiradeep BasuMallick, What Is Data Transformation? Types, Tools, and Importance (2022), https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/

7. Data Science Wizards, Introduction to Data Transformation (2023)

8. Jake VanderPlas, Python Data Science Handbook: Aggregation and Grouping, https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html

9. Jake VanderPlas, Python Data Science Handbook: Pivot Tables, https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html

10. Aniruddha Bhandari, Feature Engineering: Scaling, Normalization and Standardization (2023)

11. Scikit-learn documentation, StandardScaler, https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

12. Nate Rosidi, Advanced Feature Selection Techniques for Machine Learning Models (2023)
