Importance of Feature Scaling for Artificial Neural Networks and K-Nearest Neighbors

Piyush Chaudhari
6 min read · Jul 4, 2019


This post is about a scenario I ran into while developing a neural network and a K-nearest neighbors model on bank churn data.

It explains the importance of feature scaling, which can improve the performance of machine learning algorithms such as neural networks and K-nearest neighbors by a considerable margin.

Let’s start the experiment by understanding the data on which we are trying to build our classification model.

Bank Churn Dataset

Suppose a bank named XYZ has collected data for the last 6 months to understand the patterns of customers who decided to exit the bank and customers who decided to stay. This dataset can be used to build an ML model that predicts whether a customer will leave the bank or stay, and based on these predictions the bank can decide to focus on the customers who might leave.

When we observe features like CreditScore, Age, NoOfProducts, Tenure, Balance, and EstimatedSalary, we see that they are all continuous values but have different scales and units. We will see how this affects the performance of the model and how it can be handled to improve performance.
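A quick way to see those differing scales (assuming the standard Churn_Modelling.csv column names) is to compare the summary statistics of the continuous columns:

import pandas as pd

dataset = pd.read_csv("Churn_Modelling.csv")
# Compare the ranges: Balance spans hundreds of thousands while Tenure spans only a few years
print(dataset[["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]].describe())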

Let's write some code in Python using scikit-learn to train the neural network and K-nearest neighbors models and observe their accuracy.

from sklearn.neural_network import MLPClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics

if __name__ == '__main__':
    # Load the churn data: columns 3-12 are the features, column 13 is the target (Exited)
    dataset = pd.read_csv("Churn_Modelling.csv")
    X = dataset.iloc[:, 3:13].values
    y = dataset.iloc[:, 13].values
    # Encode the categorical columns: Geography (index 1) and Gender (index 2)
    X[:, 1] = LabelEncoder().fit_transform(X[:, 1])
    X[:, 2] = LabelEncoder().fit_transform(X[:, 2])
    # One-hot encode Geography; OneHotEncoder(categorical_features=[1]) was removed
    # from newer scikit-learn, so ColumnTransformer does the same job here
    ct = ColumnTransformer([("geo", OneHotEncoder(), [1])], remainder="passthrough")
    X = ct.fit_transform(X)
    # 70/30 train-test split with a fixed seed so the runs are comparable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    # 10 hidden layers of 11 units each, tanh activation, up to 1000 epochs
    layers = (11,) * 10
    args = {"activation": "tanh", "max_iter": 1000, "hidden_layer_sizes": layers}
    model = MLPClassifier(**args)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))

The above Python code trains a neural network model with an initial set of hyperparameter values. When I measure the classification accuracy of this model on our dataset, it is 79.30%.

Now let's do feature scaling on this dataset and once again train the neural network model using the same hyperparameters.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardize the continuous columns: each gets mean 0 and standard deviation 1
cols = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]
dataset[cols] = scaler.fit_transform(dataset[cols])

The above lines of code show how standard scaling can be done with scikit-learn on the selected columns. Add this snippet to the previous code just after reading the data from the Churn_Modelling.csv file.
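One caveat: fitting the scaler on the full dataset before the train-test split leaks the test set's statistics into training. A leakage-free variant (a minimal sketch, assuming the same columns and split parameters as above) fits the scaler on the training rows only and reuses those statistics on the test rows:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train_df, test_df = train_test_split(dataset, test_size=0.3, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()
cols = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]
scaler = StandardScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])  # learn mean/std from the training data only
test_df[cols] = scaler.transform(test_df[cols])        # apply the training statistics to the test data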

You can read more about StandardScaler here:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

The standard scaler transforms the values in each given column so that the column's mean is 0 and its standard deviation is 1: each value x becomes z = (x − mean) / std.
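A quick sanity check of that formula, with made-up numbers rather than rows from the dataset:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[600.0], [650.0], [700.0], [850.0]])  # e.g. a small column of credit scores
z = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()  # the same formula applied by hand (population std)
print(np.allclose(z, manual))  # True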

The dataset after feature scaling:

If we now calculate the classification accuracy after feature scaling, it is 86.1%. An improvement of 6–7 percentage points is a good gain for simply scaling the features.

Why does feature scaling improve the performance of the neural network?

The neural network in our case uses gradient descent as the optimization algorithm to find the appropriate weights (w) for each feature.

We use gradient descent to optimize the cost function and update the weights. The gradient with respect to each weight is proportional to the corresponding input feature, so when features live on very different scales, a single learning rate cannot suit them all: the weight of a large-scale feature takes huge, unstable steps while the other weights barely move and need far more iterations to converge. If the large-scale feature is highly correlated with the target, the performance of the model as a whole is affected, because the weight for that feature does not converge to the value that gives the best performance. In our code we train for 1000 epochs, and without scaling, the weights for the unscaled features had not converged by then; I suspect they would have needed many more epochs. So instead of increasing the epochs, and with them the training time, the better option is to scale the features so that the network reaches its best performance without unnecessary extra epochs. The sketch below illustrates the gradient imbalance.
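A minimal sketch of that imbalance, using synthetic numbers (not the actual dataset): for a single linear neuron with a squared-error loss, the gradient for each weight is proportional to its input feature, so the Balance-scale feature dominates.

import numpy as np

rng = np.random.default_rng(0)
balance = rng.uniform(0, 250_000, size=100)  # large-scale feature, like Balance
tenure = rng.uniform(0, 10, size=100)        # small-scale feature, like Tenure
X = np.column_stack([balance, tenure])
y = rng.uniform(0, 1, size=100)              # arbitrary targets

w = np.zeros(2)
grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error w.r.t. the weights
print(grad)  # the Balance component is several orders of magnitude larger than the Tenure one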

Similarly, let's now look at the performance of K-NN and why feature scaling is so important for the K-NN model to perform better.

from sklearn.neighbors import KNeighborsClassifier  # this import was missing in the original snippet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Same preprocessing as in the neural network experiment
dataset = pd.read_csv("Churn_Modelling.csv")
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
X[:, 1] = LabelEncoder().fit_transform(X[:, 1])  # Geography
X[:, 2] = LabelEncoder().fit_transform(X[:, 2])  # Gender
ct = ColumnTransformer([("geo", OneHotEncoder(), [1])], remainder="passthrough")
X = ct.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Brute-force K-NN with k=9 and uniform neighbor weighting
args = {"algorithm": "brute", "leaf_size": 30, "n_jobs": -1, "n_neighbors": 9, "weights": "uniform"}
model = KNeighborsClassifier(**args)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
print(metrics.accuracy_score(y_test, y_pred))

The above Python code trains a K-NN model with scikit-learn on the same dataset without feature scaling; the resulting accuracy is 77.4%. If we do feature scaling before training the model, the accuracy rises to 83%.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Same standardization as in the neural network experiment
cols = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]
dataset[cols] = scaler.fit_transform(dataset[cols])

Add the above snippet, just as before, right after reading the CSV file to standardize the features.

Now I will explain why K-NN needs all the features to be on the same scale. K-NN uses the Euclidean distance to measure the closeness, or similarity, between any two data points, and the formula for the Euclidean distance between two points is:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

Euclidean distance between two points in n dimensions

In the K-NN algorithm, each feature is one dimension. For example, take two features from the dataset above, Balance and Tenure. They have very different scales: Tenure is the number of years someone has been a customer of the bank, which on average cannot be more than a few decades, while Balance can reach hundreds of thousands or millions. When these two variables are used to compute the Euclidean distance, Balance contributes almost all of the distance, so the features with a large scale dominate the search for nearest neighbors. To avoid this, feature scaling is important: it puts every feature on the same scale so that each one is given equal importance when finding the nearest neighbors.
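A minimal sketch of that domination, with made-up (Balance, Tenure) values rather than rows from the dataset:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Four hypothetical customers described by (Balance, Tenure)
X = np.array([[120_000, 2], [60_000, 9], [95_000, 5], [10_000, 7]], dtype=float)
Xs = StandardScaler().fit_transform(X)

# Raw distance between the first two customers: Balance drowns out Tenure
print(np.linalg.norm(X[0] - X[1]))   # ~60000.0, the 7-year Tenure gap is invisible
# After standardization, both features contribute comparably to the distance
print(np.linalg.norm(Xs[0] - Xs[1]))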

Conclusion:

We have seen, with code examples and a dataset whose features have different scales, that feature scaling is important for both the artificial neural network and the K-nearest neighbors algorithm, and that one should always take feature scaling into consideration before developing a model.

There are various feature scaling methods; each has its pros and cons, and which one is best suited depends on the problem at hand.
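For instance, min-max normalization is a common alternative to the standardization used in this post; a minimal sketch with scikit-learn's MinMaxScaler, applied to the same columns as before:

from sklearn.preprocessing import MinMaxScaler

cols = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]
dataset[cols] = MinMaxScaler().fit_transform(dataset[cols])  # rescales each column to the [0, 1] range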

Please share this post if you learned something from it.
