# K-Nearest Neighbor (KNN) Regression and the fun behind it

Hello there!

In my previous post, we developed a Polynomial Linear Regression (PLR) model to predict fuel efficiency of cars.

Link to that article: https://medium.com/sanrusha-consultancy/polynomial-linear-regression-9d691a605aa0

Using Polynomial Linear Regression we got an r2 score of 68% and a Root Mean Square Error of 4.34.

Let's use the KNN algorithm on the same data set to predict fuel efficiency. We should get better results.

The sklearn.neighbors module has two classes: KNeighborsRegressor for regression and KNeighborsClassifier for classification. Since the target in this case is continuous, we are going to use the KNeighborsRegressor class.
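Before reaching for the library class, it helps to see what KNN regression actually computes: to predict a value for a query point, it finds the K training points closest to it and averages their targets. Here is a minimal pure-Python sketch of that idea (the data points are made up for illustration; scikit-learn's KNeighborsRegressor is what we actually use in this post):

```python
def knn_predict(X_train, y_train, x_query, k):
    """Predict y for x_query as the mean target of its k nearest neighbors."""
    # distance from the query to every training point (1-D feature here)
    distances = [(abs(x - x_query), y) for x, y in zip(X_train, y_train)]
    distances.sort(key=lambda pair: pair[0])   # nearest first
    neighbors = [y for _, y in distances[:k]]  # targets of the k closest points
    return sum(neighbors) / k                  # average them

# toy data: displacement -> mpg, purely illustrative numbers
X_train = [100, 120, 150, 200, 300, 400]
y_train = [30, 28, 25, 22, 16, 13]

print(knn_predict(X_train, y_train, x_query=160, k=3))  # -> 25.0
```

With k=3, the three points nearest to 160 are 150, 120, and 200, so the prediction is the mean of 25, 28, and 22.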

Run the below lines of Python code to fit the training data to the KNN model.

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=9)
knn.fit(X_train, y_train)

Now, it’s time to predict y value based on X_test.

y_pred_knn=knn.predict(X_test)

Let’s extend the scatter plot of Multiple Linear Regression (MLR) and Polynomial Linear Regression (PLR) with KNN predicted values.

Run the below lines of Python code:

plt.scatter(X, y, color="blue")
plt.scatter(X, linear_reg.predict(X), color="red")
plt.scatter(X, y_poly_pred, color="green")
plt.scatter(X, knn.predict(X), color="yellow")
plt.title("Fuel Efficiency Prediction")
plt.xlabel("displacement")
plt.ylabel("mpg")
plt.show()

You should get the below graph.

The yellow points are the KNN results. They look much more aligned with the actual y values (blue) than the MLR values (red) and the PLR values (green).

Let's compare the Root Mean Square Error (RMSE) and r2 scores of the three models.

Run the below lines of Python code.

from math import sqrt
from sklearn.metrics import r2_score, mean_squared_error

print('R2 score for MLR', r2_score(y, linear_reg.predict(X)))
print('Root Mean Square error MLR', sqrt(mean_squared_error(y, linear_reg.predict(X))))
print('R2 score for PLR', r2_score(y, y_poly_pred))
print('Root Mean Square error PLR', sqrt(mean_squared_error(y, y_poly_pred)))
print('R2 score for KNN', r2_score(y, knn.predict(X)))
print('Root Mean Square error KNN', sqrt(mean_squared_error(y, knn.predict(X))))

It should give the below result:

R2 score for MLR 0.6481521023561414
Root Mean Square error MLR 4.62376921742559
R2 score for PLR 0.688808733323848
Root Mean Square error PLR 4.348428859091055
R2 score for KNN 0.7514144992162307
Root Mean Square error KNN 3.8864811545505065

The KNN model has a higher r2 score and a lower error.

Congratulations! You are on the path of improvement!

BTW, did you notice the n_neighbors=9 option while creating the instance of the KNeighborsRegressor class?

When it comes to KNN, choosing how many neighbors to consider (the K number) is critical for model performance. A low K number tends to produce high error; however, a high K number does not guarantee an improvement in error either.

Finding the right K is a process of optimization.

Run the below lines of Python code to understand what I mean.

from sklearn import neighbors

from math import sqrt

from sklearn.metrics import mean_squared_error

rmse_val = []  # to store rmse values for different k

for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)  # fit the model
    pred = model.predict(X_test)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate rmse
    rmse_val.append(error)  # store rmse values
    print('RMSE value for k=', K, 'is:', error)

It should show the below output:

RMSE value for k= 1 is: 7.053834579750295
RMSE value for k= 2 is: 5.441291963113916
RMSE value for k= 3 is: 4.541024129053596
RMSE value for k= 4 is: 4.513171651336827
RMSE value for k= 5 is: 4.151970098182612
RMSE value for k= 6 is: 4.024999235614188
RMSE value for k= 7 is: 3.9269413024345967
RMSE value for k= 8 is: 3.870707350804194
RMSE value for k= 9 is: 3.873132234407178
RMSE value for k= 10 is: 3.83917910529473
RMSE value for k= 11 is: 3.927154454170978
RMSE value for k= 12 is: 3.8934578093360876
RMSE value for k= 13 is: 3.8844842689991883
RMSE value for k= 14 is: 3.9141210482648803
RMSE value for k= 15 is: 3.9530763892849077
RMSE value for k= 16 is: 3.9243184020777506
RMSE value for k= 17 is: 3.924336668906001
RMSE value for k= 18 is: 3.954351280038823
RMSE value for k= 19 is: 3.959182188509304
RMSE value for k= 20 is: 3.9930392758183393

The RMSE values clearly decrease for K between 1 and 10 and then generally increase again from K=11 onwards.

If you draw a plot of these values, it will look like the below.

Run the below Python code to draw the plot:

# plotting the rmse values against k values
import pandas as pd

curve = pd.DataFrame(rmse_val)  # elbow curve
curve.plot()

This graph indicates how to find an optimized value of K for KNN algorithm.
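Instead of eyeballing the elbow curve, you can also pick the K with the lowest RMSE programmatically. A minimal sketch, using the RMSE values printed by the loop above:

```python
# rmse_val as printed by the K loop above; list index 0 corresponds to K=1
rmse_val = [
    7.053834579750295, 5.441291963113916, 4.541024129053596,
    4.513171651336827, 4.151970098182612, 4.024999235614188,
    3.9269413024345967, 3.870707350804194, 3.873132234407178,
    3.83917910529473, 3.927154454170978, 3.8934578093360876,
    3.8844842689991883, 3.9141210482648803, 3.9530763892849077,
    3.9243184020777506, 3.924336668906001, 3.954351280038823,
    3.959182188509304, 3.9930392758183393,
]

best_k = rmse_val.index(min(rmse_val)) + 1  # shift index back to K
print('Best K:', best_k, 'with RMSE', min(rmse_val))  # -> Best K: 10
```

Note that this picks K=10 on this particular test split, while the grid search in the next step searches only K=2 through 9 and scores with cross-validation rather than on the test split, so it can land on a different value.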

You can also use grid search to find the optimum K value.

Run the below lines of Python code to get the optimum value of K:

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)
model.best_params_

It should return below result

`{'n_neighbors': 9}`
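For completeness, here is a self-contained version of the grid search on synthetic data (the numbers below are made up for illustration, not the mpg data set). It also shows that GridSearchCV refits on the full training set with the best K, so you can score or predict with it directly:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(50, 450, size=(200, 1))             # fake "displacement"
y = 40 - 0.07 * X.ravel() + rng.normal(0, 2, 200)   # fake "mpg" with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}
search = GridSearchCV(KNeighborsRegressor(), params, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # the K with the best cross-validation score
print(search.score(X_test, y_test))  # refit best model, scored on held-out data
```

The best K you get here depends on the synthetic data, not on the mpg data set used in the rest of this post.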

Congratulations! Now you know how to use KNN regression algorithm.
