Why is scaling required in KNN and K-Means?

Pulkit Sharma
Aug 25, 2019 · 4 min read

KNN and K-Means are two of the most widely used machine learning algorithms. KNN is a supervised learning algorithm that can be used to solve both classification and regression problems. K-Means, on the other hand, is an unsupervised learning algorithm that is widely used to cluster data into groups.

What the two algorithms have in common is that both are distance-based. KNN finds the k closest neighbors of a new observation and, based on these neighbors, assigns it a class (for classification problems) or predicts a value (for regression problems). K-Means groups similar points together into clusters. Similarity here is defined by the distance between points: the smaller the distance, the more similar the points, and vice versa.
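To make the role of distance concrete, here is a minimal sketch of KNN classification using only the Python standard library. The function names (`euclidean`, `knn_predict`) and the toy points and labels are made up for illustration; a library implementation would be the usual choice in practice.

```python
import math
from collections import Counter


def euclidean(p, q):
    # Straight-line distance between two points with the same number of features
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=3):
    # Rank training points by their distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    # Take the labels of the k closest points and return the majority label
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well separated groups
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (2.0, 2.0), k=3)
print(prediction)
```

Every decision in this sketch goes through `euclidean`, so anything that distorts the distance also distorts the prediction.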

Why do we need to scale the data?

All such distance-based algorithms are affected by the scale of the variables. Suppose your data has an age variable, which gives the age of a person in years, and an income variable, which gives the person’s monthly income in rupees:

[Table: sample observations with their age and income. Source: Applied Machine Learning Course]

Here the age of the person ranges from 25 to 40, whereas the income ranges from 50,000 to 110,000. Let’s now try to find the similarity between observations 1 and 2. The most common way is to calculate the Euclidean distance: the smaller this distance, the closer the points and hence the more similar they are to each other. As a reminder, the Euclidean distance is given by:

Euclidean Distance = [(p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2]^(1/2)

Source: Applied Machine Learning Course

Here,

n = number of variables

p1,p2,p3,… = features of first point

q1,q2,q3,… = features of second point

The Euclidean distance between observations 1 and 2 is:

Euclidean Distance = [(100000 - 80000)^2 + (30 - 25)^2]^(1/2)

which comes out to around 20000.000625. Note how the large magnitude of income dominates the distance between the two points: the income difference contributes 400,000,000 to the sum under the root, while the age difference contributes only 25. This affects the performance of all distance-based models, because they give more weight to variables with a larger magnitude (income in this case).
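The arithmetic above can be checked with a few lines of Python. This is only a sketch: the tuple layout (income, age) and the names `obs1`, `obs2` and `income_share` are chosen here for illustration, while the numbers are the ones used in the calculation above.

```python
import math


def euclidean(p, q):
    # Square root of the sum of squared feature differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# (income, age) for observations 1 and 2, as used in the calculation above
obs1 = (100000, 30)
obs2 = (80000, 25)

distance = euclidean(obs1, obs2)

# Fraction of the squared distance that comes from the income difference alone
income_share = (obs1[0] - obs2[0]) ** 2 / distance ** 2

print(distance)
print(income_share)
```

The income share is practically 1, which means the age difference has almost no say in how similar the two observations appear.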

We do not want our algorithm to be affected by the magnitude of these variables; it should not be biased towards variables with a larger magnitude. To overcome this problem, we can bring all the variables to the same scale. One of the most common techniques for doing so is normalization (more precisely, z-score normalization, also known as standardization), where we calculate the mean and standard deviation of the variable. Then, for each observation, we subtract the mean and divide by the standard deviation of that variable:

z = (x - mean) / standard deviation
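A minimal sketch of this step in Python, using the standard library. The `standardize` helper and the age values are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption made here, since the article does not say which variant it used.

```python
import statistics


def standardize(values):
    # Subtract the mean and divide by the standard deviation.
    # Assumption: population standard deviation (divide by n, not n - 1).
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / std for v in values]


# Hypothetical ages, chosen to lie in the 25 to 40 range mentioned above
ages = [25, 30, 35, 40]
scaled_ages = standardize(ages)
print(scaled_ages)
```

After this transformation the variable has mean 0 and standard deviation 1, whatever its original units were.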

Apart from normalization, there are other methods to bring all the variables to the same scale, for example Min-Max Scaling. Here the scaling is done using the following formula:

x_scaled = (x - min) / (max - min)
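Min-max scaling can be sketched in the same way. The `min_max_scale` helper and the income values below are hypothetical and only meant to show the mechanics.

```python
def min_max_scale(values):
    # Map the smallest value to 0 and the largest to 1
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot min-max scale a constant variable")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly incomes in the 50,000 to 110,000 range mentioned above
incomes = [50000, 80000, 100000, 110000]
scaled_incomes = min_max_scale(incomes)
print(scaled_incomes)
```

Unlike the mean and standard deviation approach, the result is always bounded between 0 and 1, but a single extreme value can squeeze all the other observations into a narrow part of that range.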

For now, we will focus on normalization; you can try min-max scaling as well. Let’s see how normalization brings these variables to the same scale and hence improves the performance of these distance-based algorithms. If we normalize the above data, it will look like:

[Table: the same observations after normalization. Source: Applied Machine Learning Course]

Let’s again calculate the Euclidean distance between observations 1 and 2:

Euclidean Distance = [(0.608+0.260)^2 + (-0.447+1.192)^2]^(1/2)

This time the distance is around 1.144. We can clearly see that the distance is no longer dominated by the income variable; both variables now carry similar weight. Hence, it is always advisable to bring all the features to the same scale before applying distance-based algorithms like KNN or K-Means.
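The practical effect shows up when we ask which observation is the nearest neighbor of another. The sketch below uses four hypothetical people (these are not the rows of the table above) and made-up helper names, and it assumes the population standard deviation for scaling.

```python
import math
import statistics


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize_columns(rows):
    # Scale each column to mean 0 and standard deviation 1.
    # Assumption: population standard deviation (statistics.pstdev).
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]


def nearest_to_first(rows, names):
    # Name of the row closest to the first row (the first row itself is excluded)
    best = min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))
    return names[best]


# Hypothetical (age, monthly income) pairs
names = ["A", "B", "C", "D"]
people = [(25, 50000), (40, 52000), (27, 60000), (38, 110000)]

raw_neighbor = nearest_to_first(people, names)
scaled_neighbor = nearest_to_first(standardize_columns(people), names)
print(raw_neighbor, scaled_neighbor)
```

On the raw data, A’s nearest neighbor is B, who is 15 years older but earns almost the same. After scaling it is C, who is close to A in both age and income. A KNN model or a K-Means clustering built on the raw data would inherit the first, income-driven notion of similarity.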

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
