K-Nearest Neighbor (KNN) Regression and the fun behind it

Sanjay Singh
Oct 3, 2019

Hello there!

In my previous post, we developed a Polynomial Linear Regression (PLR) model to predict the fuel efficiency of cars.

Link to that article: https://medium.com/sanrusha-consultancy/polynomial-linear-regression-9d691a605aa0

Using Polynomial Linear Regression we got an R² score of about 0.69 and a Root Mean Square Error of about 4.35.

Let's use the KNN algorithm on the same data set to predict fuel efficiency. We should get better results.

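If you are starting fresh rather than continuing from the previous post's notebook, here is a minimal setup sketch. The file name auto-mpg.csv, the single displacement feature, and the split parameters are my assumptions, not taken from the original article:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('auto-mpg.csv')  # hypothetical file name for the auto-mpg data set
X = df[['displacement']].values   # feature used in the plots below (assumption)
y = df['mpg'].values              # target: fuel efficiency in miles per gallon
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
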
The sklearn.neighbors module provides two estimators: KNeighborsRegressor for regression and KNeighborsClassifier for classification. Since our target (mpg) is continuous, we are going to use KNeighborsRegressor.

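Before fitting, it helps to see what KNN regression actually does: to predict a value for a new point, it finds the k training points closest to it and averages their targets. Here is a toy sketch of the idea with made-up numbers; Euclidean distance and the plain mean match scikit-learn's defaults, but everything else is illustrative:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict y for one point x by averaging the targets of its k nearest neighbors."""
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to each training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()                         # average their targets

# toy data: displacement -> mpg (made-up numbers)
X_toy = np.array([[100.0], [150.0], [200.0], [300.0], [350.0]])
y_toy = np.array([30.0, 25.0, 22.0, 16.0, 14.0])
print(knn_predict(X_toy, y_toy, np.array([180.0]), k=3))  # averages 22, 25, 30 -> about 25.67
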
Run the lines of Python code below to fit a KNN model on the training data.

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=9)  # predict from the 9 nearest neighbors
knn.fit(X_train, y_train)                 # fit on the training data

Now it's time to predict y values for X_test.

y_pred_knn=knn.predict(X_test)

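As a quick check (this step is my addition, not part of the original walkthrough), you can score these test-set predictions directly:

from math import sqrt
from sklearn.metrics import r2_score, mean_squared_error

print('Test R2 for KNN:', r2_score(y_test, y_pred_knn))
print('Test RMSE for KNN:', sqrt(mean_squared_error(y_test, y_pred_knn)))
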
Let’s extend the scatter plot of Multiple Linear Regression (MLR) and Polynomial Linear Regression (PLR) with KNN predicted values.

Run the lines of Python code below.

import matplotlib.pyplot as plt

plt.scatter(X, y, color="blue", label="actual")
plt.scatter(X, linear_reg.predict(X), color="red", label="MLR")
plt.scatter(X, y_poly_pred, color="green", label="PLR")
plt.scatter(X, knn.predict(X), color="yellow", label="KNN")
plt.title("Fuel Efficiency Prediction")
plt.xlabel("displacement")
plt.ylabel("mpg")
plt.legend()
plt.show()

You should get a graph like the one below.

The yellow points are the KNN results. They align much more closely with the actual y values (blue) than the MLR values (red) and the PLR values (green) do.

Let's compare the Root Mean Square Error (RMSE) and R² scores of the three models.

Run the lines of Python code below.

from math import sqrt
from sklearn.metrics import r2_score, mean_squared_error

print('R2 score for MLR', r2_score(y, linear_reg.predict(X)))
print('Root Mean Square error MLR', sqrt(mean_squared_error(y, linear_reg.predict(X))))
print('R2 score for PLR', r2_score(y, y_poly_pred))
print('Root Mean Square error PLR', sqrt(mean_squared_error(y, y_poly_pred)))
print('R2 score for KNN', r2_score(y, knn.predict(X)))
print('Root Mean Square error KNN', sqrt(mean_squared_error(y, knn.predict(X))))

It should give the result below.

R2 score for MLR 0.6481521023561414
Root Mean Square error MLR 4.62376921742559
R2 score for PLR 0.688808733323848
Root Mean Square error PLR 4.348428859091055
R2 score for KNN 0.7514144992162307
Root Mean Square error KNN 3.8864811545505065

The KNN model has a higher R² score and a lower error than both MLR and PLR.

Congratulations! You are on the path of improvement!

By the way, did you notice the n_neighbors=9 option when creating the instance of KNeighborsRegressor?

When it comes to KNN, choosing the number of neighbors K is critical for model performance. A low K tends to give high error because the model overfits to noise in individual points, but a high K does not guarantee lower error either, since averaging over too many neighbors can underfit.

Finding a good K is an optimization process.

Run the lines of Python code below to see what I mean.

from sklearn import neighbors
from math import sqrt
from sklearn.metrics import mean_squared_error

rmse_val = []  # to store RMSE values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)                     # fit the model
    pred = model.predict(X_test)                    # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate RMSE
    rmse_val.append(error)                          # store the RMSE value
    print('RMSE value for k=', K, 'is:', error)

It should show the output below.

RMSE value for k= 1 is: 7.053834579750295
RMSE value for k= 2 is: 5.441291963113916
RMSE value for k= 3 is: 4.541024129053596
RMSE value for k= 4 is: 4.513171651336827
RMSE value for k= 5 is: 4.151970098182612
RMSE value for k= 6 is: 4.024999235614188
RMSE value for k= 7 is: 3.9269413024345967
RMSE value for k= 8 is: 3.870707350804194
RMSE value for k= 9 is: 3.873132234407178
RMSE value for k= 10 is: 3.83917910529473
RMSE value for k= 11 is: 3.927154454170978
RMSE value for k= 12 is: 3.8934578093360876
RMSE value for k= 13 is: 3.8844842689991883
RMSE value for k= 14 is: 3.9141210482648803
RMSE value for k= 15 is: 3.9530763892849077
RMSE value for k= 16 is: 3.9243184020777506
RMSE value for k= 17 is: 3.924336668906001
RMSE value for k= 18 is: 3.954351280038823
RMSE value for k= 19 is: 3.959182188509304
RMSE value for k= 20 is: 3.9930392758183393

The RMSE values clearly decrease as K goes from 1 to 10 and then generally increase again from 11 onwards.

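You can also read the best K off the list programmatically; a small sketch assuming rmse_val from the loop above:

best_k = rmse_val.index(min(rmse_val)) + 1  # +1 because K started at 1
print('Best K:', best_k, 'with RMSE:', min(rmse_val))

With the output above, this picks K = 10.
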
If you plot these values, the graph will look like the one below.

Run the Python code below to draw the plot.

# plotting the RMSE values against K values
import pandas as pd

curve = pd.DataFrame(rmse_val)  # elbow curve
curve.plot()

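For a clearer elbow plot with labeled axes, a cosmetic variation on the same data (again assuming rmse_val from the loop above):

import matplotlib.pyplot as plt

plt.plot(range(1, 21), rmse_val, marker='o')
plt.xlabel('K (number of neighbors)')
plt.ylabel('RMSE')
plt.title('Elbow curve: RMSE vs K')
plt.show()
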
The elbow of this curve shows how to find an optimized value of K for the KNN algorithm: pick the K where the RMSE stops improving.

You can also use grid search to find the optimum K value.

Run the lines of Python code below to get the optimum value of K.

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)  # 5-fold cross-validation over the K grid
model.fit(X_train, y_train)
model.best_params_

It should return the result below.

{'n_neighbors': 9}

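GridSearchCV refits the best model on the full training set by default, so you can pull it out and evaluate it directly; a short sketch assuming the variables from the grid search above:

from math import sqrt
from sklearn.metrics import mean_squared_error

best_knn = model.best_estimator_  # KNeighborsRegressor(n_neighbors=9), already refit on X_train
pred = best_knn.predict(X_test)
print('Test RMSE with tuned K:', sqrt(mean_squared_error(y_test, pred)))

Note that the params grid above only goes up to 9, so values such as 10, which the elbow loop favored, were never tried; widening the n_neighbors list would let grid search consider them.
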
Congratulations! Now you know how to use the KNN regression algorithm.

