# K-Nearest Neighbor (KNN) Regression and the fun behind it

Hello there!

In my previous post, we developed a Polynomial Linear Regression (PLR) model to predict fuel efficiency of cars.

Using Polynomial Linear Regression we got an r2 score of 68% and a Root Mean Square Error of 4.34.

Let's apply the KNN algorithm to the same data set and predict fuel efficiency. We should get better results.

The sklearn.neighbors module provides two classes: KNeighborsRegressor for regression and KNeighborsClassifier for classification. Since our target (mpg) is continuous, we will use KNeighborsRegressor.
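Under the hood, a KNN regressor predicts by averaging the targets of the K nearest training points. Here is a minimal NumPy sketch of that idea on toy data (the values are made up for illustration, not the fuel-efficiency data set):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict for one query point by averaging the targets
    of its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest points
    return y_train[nearest].mean()    # average their targets

# Toy 1-D data: y is roughly double x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# The 3 nearest points to x=2.1 are x=2, 3, 1, so the
# prediction is the mean of their targets: (4 + 6 + 2) / 3 = 4.0
print(knn_predict(X, y, np.array([2.1]), k=3))
```

This is exactly what KNeighborsRegressor does by default; the scikit-learn class adds efficient neighbor search and options such as distance-weighted averaging.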

Run the below lines of Python code to fit the KNN model on the training data.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=9)
knn.fit(X_train, y_train)

Now it's time to predict y values for X_test.

y_pred_knn=knn.predict(X_test)

Let’s extend the scatter plot of Multiple Linear Regression (MLR) and Polynomial Linear Regression (PLR) with KNN predicted values.

Run the below lines of Python code.

plt.scatter(X, y, color="blue")
plt.scatter(X, linear_reg.predict(X), color="red")
plt.scatter(X, y_poly_pred, color="green")
plt.scatter(X, knn.predict(X), color="yellow")
plt.title("Fuel Efficiency Prediction")
plt.xlabel("displacement")
plt.ylabel("mpg")
plt.show()

You should get a graph like the one below.

The yellow points are the KNN results. They align much more closely with the actual y values (blue) than the MLR values (red) and PLR values (green).

Let's compare the Root Mean Square Error (RMSE) and r2 scores of the models.

Run below lines of Python code.

from math import sqrt
from sklearn.metrics import r2_score, mean_squared_error

print('R2 score for MLR ', r2_score(y, linear_reg.predict(X)))
print('Root Mean Square error MLR', sqrt(mean_squared_error(y, linear_reg.predict(X))))
print('R2 score for PLR ', r2_score(y, y_poly_pred))
print('Root Mean Square error PLR', sqrt(mean_squared_error(y, y_poly_pred)))
print('R2 score for KNN ', r2_score(y, knn.predict(X)))
print('Root Mean Square error KNN', sqrt(mean_squared_error(y, knn.predict(X))))

It should give the below result:

```
R2 score for MLR  0.6481521023561414
Root Mean Square error MLR 4.62376921742559
R2 score for PLR  0.688808733323848
Root Mean Square error PLR 4.348428859091055
R2 score for KNN  0.7514144992162307
Root Mean Square error KNN 3.8864811545505065
```

The KNN model has a higher r2 score and a lower error.

Congratulations! You are on the path of improvement!

By the way, did you notice the n_neighbors=9 option while creating the instance of KNeighborsRegressor?

When it comes to KNN, choosing the number of neighbors K is critical for model performance. A very low K tends to overfit and gives high error on test data, while a very high K smooths predictions toward the overall average and does not guarantee improvement either.

Finding the right K is a process of optimization.

Run the below lines of Python code to see what I mean.

from math import sqrt
from sklearn import neighbors
from sklearn.metrics import mean_squared_error

rmse_val = []  # to store RMSE values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)  # fit the model
    pred = model.predict(X_test)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate RMSE
    rmse_val.append(error)  # store the RMSE value
    print('RMSE value for k= ', K, 'is:', error)

It should show the below output:

```
RMSE value for k=  1 is: 7.053834579750295
RMSE value for k=  2 is: 5.441291963113916
RMSE value for k=  3 is: 4.541024129053596
RMSE value for k=  4 is: 4.513171651336827
RMSE value for k=  5 is: 4.151970098182612
RMSE value for k=  6 is: 4.024999235614188
RMSE value for k=  7 is: 3.9269413024345967
RMSE value for k=  8 is: 3.870707350804194
RMSE value for k=  9 is: 3.873132234407178
RMSE value for k=  10 is: 3.83917910529473
RMSE value for k=  11 is: 3.927154454170978
RMSE value for k=  12 is: 3.8934578093360876
RMSE value for k=  13 is: 3.8844842689991883
RMSE value for k=  14 is: 3.9141210482648803
RMSE value for k=  15 is: 3.9530763892849077
RMSE value for k=  16 is: 3.9243184020777506
RMSE value for k=  17 is: 3.924336668906001
RMSE value for k=  18 is: 3.954351280038823
RMSE value for k=  19 is: 3.959182188509304
RMSE value for k=  20 is: 3.9930392758183393
```

The RMSE value clearly decreases as K goes from 1 to 10 and then generally increases from 11 onwards.

If you plot these values, the curve will look like the graph below.

Run the below Python code to draw the plot.

import pandas as pd

# plotting the RMSE values against k values
curve = pd.DataFrame(rmse_val)  # elbow curve
curve.plot()
plt.show()

This elbow-shaped curve shows how to find an optimized value of K for the KNN algorithm: pick a K near the bottom of the curve, where the error stops improving.

You can also use grid search to find the optimum K value.

Run the below lines of Python code to get the optimum value of K.

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)
model.best_params_

It should return the below result:

`{'n_neighbors': 9}`
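Beyond best_params_, a fitted GridSearchCV also exposes the best cross-validated score and a refit estimator. A short sketch, using synthetic data as a stand-in for the article's X_train/y_train:

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustration only)
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 2 * X_train.ravel() + rng.normal(scale=0.5, size=100)

params = {'n_neighbors': list(range(2, 10))}
model = GridSearchCV(neighbors.KNeighborsRegressor(), params, cv=5)
model.fit(X_train, y_train)

print(model.best_params_)  # the winning K from the grid
print(model.best_score_)   # mean cross-validated r2 score of that K
best_knn = model.best_estimator_  # already refit on all of X_train (refit=True by default)
```

Because refit=True by default, best_estimator_ can be used for predictions directly, without fitting KNeighborsRegressor again by hand.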

Congratulations! Now you know how to use KNN regression algorithm.

