Feature scaling in machine learning

Paresh Patil
5 min read · Jul 5, 2023


Table of contents:

What is feature scaling?
Need for feature scaling
Standardization
Comparison before and after standardization
Effect of scaling
Comparison of Distributions
Why is scaling important?
Effect of outliers
When to use

Feature scaling is usually the last step in the feature engineering pipeline, and the good part is that it is one of the easiest steps in feature engineering.

What is feature scaling?

Feature scaling is a technique to standardize the independent features present in the data within a fixed range.

Suppose you have a dataset that contains input features (iq, cgpa) and a target variable (LPA). Bringing your independent features down into a smaller, common range is called feature scaling.

Need for feature scaling

Sometimes your input features are on very different scales. Consider the following dataset and focus on age and salary.

Age is in the range of tens, while salary is in the range of thousands. Now imagine you are working with an algorithm that works by calculating the Euclidean distance between two points, such as KNN.

Feature scaling is important in KNN to prevent features with larger magnitudes from dominating the distance calculation, ensure fair comparisons between features, improve convergence speed, and reduce the influence of outliers.
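To make this concrete, here is a minimal sketch with made-up age and salary values (not taken from the dataset below) showing how the salary column dominates the Euclidean distance when the features are left unscaled:

import numpy as np

a = np.array([25, 40000])   # [age, salary]
b = np.array([45, 42000])   # very different age, similar salary
c = np.array([26, 80000])   # similar age, very different salary

# salary differences are in the thousands while age differences are in the
# tens, so salary decides the distance almost on its own
print(np.linalg.norm(a - b))   # ~2000.1
print(np.linalg.norm(a - c))   # ~40000.0

Even though point b has a completely different age, it ends up much closer to a than point c does, simply because its salary is closer.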

Generally, before feeding data to the model, you scale the features, which means bringing them onto the same scale. After scaling, the values typically lie in a small range, such as -1 to 1.

Types of feature scaling:

Standardization:

It is also called z-score normalization. Suppose you have two columns, age and salary, each containing 500 values, and you have to standardize the age column.

The formula for that is:

z = (xi - μ) / σ

where:

xi = individual observation

μ = mean of all observations

σ = standard deviation

You have to calculate this for each value in the column. When you standardize using the above formula, you will get numbers like 2.3, -1.2, and so on. You will get 500 such numbers, and their mean will be 0 while their standard deviation will be 1:

μ = 0, σ = 1
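
As a quick sketch with made-up age values, you can check this behaviour by applying the formula by hand:

import numpy as np

age = np.array([22, 25, 31, 38, 45, 52])    # toy age column
z = (age - age.mean()) / age.std()          # (xi - μ) / σ

print(np.round(z, 2))       # values roughly between -1.3 and 1.6
print(round(z.mean(), 10))  # ~0.0
print(round(z.std(), 10))   # 1.0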

Example:

To prove my point, I am using the “Social Network Ads” dataset, which is available on Kaggle.

# importing necessary libraries
import numpy as np   # linear algebra
import pandas as pd  # data processing
import matplotlib.pyplot as plt
import seaborn as sns

# reading the dataset
df = pd.read_csv('Social_Network_Ads.csv')

# keeping only the necessary columns (Age, EstimatedSalary, Purchased)
df = df.iloc[:, 2:]

df.sample(5)

Output:

Splitting the data into train and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('Purchased', axis=1),
                                                     df['Purchased'],
                                                     test_size=0.3,
                                                     random_state=0)

X_train.shape, X_test.shape

Output:

Applying StandardScaler to the data (standardization):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit the scaler on the train set only, so it learns the mean and standard
# deviation from the training data (this avoids test-set leakage)
scaler.fit(X_train)

# transform both the train and test sets with the learned parameters
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

The official documentation of StandardScaler from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

One inconvenience with StandardScaler is that it returns a NumPy array, which you then need to convert back into a DataFrame:

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# rounded summary statistics before scaling
np.round(X_train.describe(), 1)

# rounded summary statistics after scaling
np.round(X_train_scaled.describe(), 1)
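
As a side note, if you are on scikit-learn 1.2 or newer, a small optional sketch that avoids the manual conversion is to ask the scaler to return DataFrames directly:

# requires scikit-learn >= 1.2
scaler = StandardScaler().set_output(transform="pandas")
X_train_scaled = scaler.fit_transform(X_train)   # already a DataFrame
X_test_scaled = scaler.transform(X_test)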

Comparison before and after standardization

Before scaling, the dataset had a mean age of 37.9 with a standard deviation of 10.2, and a mean estimated salary of 69,807.1 with a standard deviation of 34,641.2. After scaling, both variables have a mean of 0 and a standard deviation of 1, indicating that they have been standardized to a common scale.

Effect of scaling:

Before scaling, the dataset exhibited very different scales for the two variables. After applying the scaling transformation, both the age and estimated salary features were brought onto a common scale.
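
You can visualize this with a quick before/after scatter plot of the two features; a rough sketch, assuming the columns are named 'Age' and 'EstimatedSalary' as in the Kaggle CSV:

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

ax1.set_title('Before scaling')
ax1.scatter(X_train['Age'], X_train['EstimatedSalary'])

ax2.set_title('After standardization')
ax2.scatter(X_train_scaled['Age'], X_train_scaled['EstimatedSalary'])

plt.show()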

Comparison of Distributions

Following the scaling process, the shape of each variable's distribution remains unchanged; only the scale of the values has been transformed.
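
A rough sketch of how to check this for the age column, again assuming the 'Age' column name; the kernel density plot keeps the same shape, only the x-axis changes:

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))

ax1.set_title('Age distribution before scaling')
sns.kdeplot(X_train['Age'], ax=ax1)

ax2.set_title('Age distribution after scaling')
sns.kdeplot(X_train_scaled['Age'], ax=ax2)

plt.show()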

Why is scaling important?

By incorporating scaling, the accuracy of the model improves significantly. This improvement underscores the importance of scaling for achieving accurate and reliable predictions and for the overall performance of the model.

from sklearn.linear_model import LogisticRegression

# one model trained on the raw features, one on the standardized features
lr = LogisticRegression()
lr_scaled = LogisticRegression()

lr.fit(X_train, y_train)
lr_scaled.fit(X_train_scaled, y_train)

y_pred = lr.predict(X_test)
y_pred_scaled = lr_scaled.predict(X_test_scaled)

from sklearn.metrics import accuracy_score

print("Actual", accuracy_score(y_test, y_pred))
print("Scaled", accuracy_score(y_test, y_pred_scaled))

Output:

The accuracy of the model before scaling was 65.8%, whereas after applying the scaling technique, the accuracy increased significantly to 86.7%. This substantial improvement highlights the crucial role of scaling in enhancing model performance.

Effect of outliers

If you apply standardization to a column that contains outliers, their impact will not decrease: standardization only shifts and rescales the values, so you still need to handle outliers explicitly.
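
A tiny sketch with made-up salaries illustrates this: after standardization, the outlier stays just as extreme relative to the rest.

import numpy as np

salary = np.array([30000, 35000, 40000, 45000, 500000])   # last value is an outlier
z = (salary - salary.mean()) / salary.std()

print(np.round(z, 2))   # the outlier still sits far away from the other points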

When to use

If you are working with algorithms that rely on distance calculations or gradient descent, such as KNN, K-means clustering, PCA, linear regression, logistic regression, or neural networks, you can apply standardization by default.

Thank you for exploring the impact of feature scaling on model accuracy. To further explore the topic and experiment with feature scaling in practice, I have prepared a Jupyter Notebook with code examples. Access the notebook here

Connect with me:

LinkedIn: https://www.linkedin.com/in/pareshpatil122/

GitHub: https://github.com/paresh122

Portfolio: https://pareshpatil-portfolio.netlify.app/


Paresh Patil

Data wizard, blending science and analysis, conjuring insights to fuel innovation and drive data-driven excellence