Why doesn’t gradient descent converge with unscaled features?

Lavanya Gupta · Published in Analytics Vidhya · Mar 1, 2021


Ever felt curious about this well-known axiom: “Always scale your features”? Well, read on to get a quick graphical and intuitive explanation!

Motivation

I am sure all of us have seen this popular axiom in machine learning: Always scale your features before training!
While most of us know its practical importance, not many of us are aware of the underlying mathematical reasons.

In this super short blog, I explain what happens behind the scenes with our favorite Gradient Descent algorithm when it is fed features with very different magnitudes.

Understanding with an example

Let’s say we are trying to predict the life expectancy of a person (in years) using 2 numeric predictor variables/features: x1 and x2, where x1 is the age of the person and x2 is his/her salary. Clearly, x1 << x2.

This is a regression problem where we aim to learn the weights theta1 and theta2 for x1 and x2 respectively by minimizing the cost function — Mean Squared Error (MSE).
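For reference, here is the standard formulation this setup corresponds to (the post itself does not spell out the equations): with m training examples, learning rate alpha, and the bias term omitted to match the two weights above, the model, cost, and gradient descent update are

```latex
\hat{y} = \theta_1 x_1 + \theta_2 x_2,
\qquad
J(\theta_1, \theta_2) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2,
\qquad
\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}
        = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}.
```

Notice that the update for theta_j is proportional to the feature x_j itself. Since x2 (salary, in the tens of thousands) is orders of magnitude larger than x1 (age, in the tens), the gradient along theta2 dwarfs the one along theta1, which is exactly what distorts the cost surface below.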

If we plot theta1, theta2, and cost:

[Figure] Left: cost function with scaled features. Right: cost function with unscaled features (elongated in the direction of the smaller-magnitude feature).

Left figure: With feature scaling
The contours of the cost function are perfect circles (in 2D), i.e., the surface is a symmetric bowl (in 3D). Gradient descent…
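To see the effect numerically, here is a minimal NumPy sketch (not from the original post) that runs plain batch gradient descent on a synthetic version of the age/salary example, once on the raw features and once after standardization. The synthetic data, learning rates, and stopping rule are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, lr, max_iters=100_000, rel_tol=1e-3):
    """Batch gradient descent on the MSE cost J = ||X @ theta - y||^2 / (2m).

    Stops once the cost is within rel_tol of the least-squares optimum,
    and returns (theta, iterations_used)."""
    m = X.shape[0]
    theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form optimum, used only as a reference
    best_cost = np.sum((X @ theta_star - y) ** 2) / (2 * m)
    theta = np.zeros(X.shape[1])
    for i in range(1, max_iters + 1):
        grad = X.T @ (X @ theta - y) / m                  # dJ/dtheta for MSE
        theta -= lr * grad
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if cost <= (1 + rel_tol) * best_cost:
            return theta, i
    return theta, max_iters

# Synthetic version of the example: x1 = age (tens), x2 = salary (tens of thousands).
rng = np.random.default_rng(0)
m = 200
age = rng.uniform(20, 80, m)
salary = rng.uniform(20_000, 150_000, m)
y = 0.5 * age + 1e-4 * salary + rng.normal(0.0, 1.0, m)

X = np.column_stack([age, salary])
Xc = X - X.mean(axis=0)          # center both features (stands in for the bias term)
yc = y - y.mean()

# Unscaled: the salary column forces a tiny learning rate for stability,
# so progress along the age weight is extremely slow.
_, iters_raw = gradient_descent(Xc, yc, lr=5e-10)

# Scaled: also divide by the standard deviation; the contours become near-circular.
Xs = Xc / X.std(axis=0)
_, iters_scaled = gradient_descent(Xs, yc, lr=0.1)

print(f"iterations until near-optimal (unscaled): {iters_raw}")
print(f"iterations until near-optimal (scaled):   {iters_scaled}")
```

On the unscaled data the run typically exhausts its iteration budget, while the standardized run reaches near-optimal cost in well under a hundred steps; the exact numbers depend on the synthetic data, but the gap is precisely what the elongated contours in the figure above illustrate.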
