# Wine Quality Classification Using KNN

## A guide to tune hyperparameters of KNN with Grid Search and Random Search

# Introduction

K-Nearest-Neighbors (KNN) is a supervised, non-parametric technique used for classification and regression: supervised because the data is already labelled, and non-parametric because it makes no assumption about the underlying data distribution. Unlike most models, it has no explicit training phase (it is a so-called lazy learner): all the training data is kept and consulted at prediction time, which makes training fast but the testing phase slower and costlier.

In KNN, K is the number of nearest neighbors. Deciding the “right” number is important, not only because the algorithm requires such parameter, but also because the appropriate number of nearest neighbors determines the performance of the model.
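To make the idea concrete, here is a minimal from-scratch sketch of a KNN prediction on toy data (not the wine dataset used below): a query point is labelled by a majority vote among its K nearest training points.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of point x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: two small clusters with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))  # → 1
```

This is only an illustration of the vote; in the rest of the article we rely on scikit-learn's optimized implementation.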

In this article, I want to use K-Nearest-Neighbors to predict the wine quality class, a label between 0 and 2. The dataset is available in the scikit-learn library. The attributes of this dataset are:

- alcohol
- malic acid
- ash
- alcalinity of ash
- magnesium
- total phenols
- flavanoids
- nonflavanoid phenols
- proanthocyanins
- color intensity
- hue
- od280/od315 of diluted wines
- proline

# Prepare Data

We start by loading the wine dataset from scikit-learn datasets.

```python
# scikit-learn dataset library
from sklearn import datasets, preprocessing
import numpy as np
from matplotlib import pyplot as plt

wine = datasets.load_wine(as_frame=True)
wine.frame

print(wine.feature_names)
# ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
#  'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
#  'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

print(list(wine.target))
# [0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2]  (59 zeros, 71 ones, 48 twos)
```

To reach better results, we standardize the data with the function preprocessing.scale, which gives each feature zero mean and unit variance:

```python
X_scaled = preprocessing.scale(wine.data)

X_scaled.mean(axis=0)
# array([-8.38280756e-16, -1.19754394e-16, -8.37033314e-16, -3.99181312e-17,
#        -3.99181312e-17,  0.00000000e+00, -3.99181312e-16,  3.59263181e-16,
#        -1.19754394e-16,  2.49488320e-17,  1.99590656e-16,  3.19345050e-16,
#        -1.59672525e-16])

X_scaled.std(axis=0)
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
```

We split the dataset into training and test set using the function train_test_split:

```python
from sklearn.model_selection import train_test_split

# Split dataset into training set (80%) and test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, wine.target, test_size=0.2)
```
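As a side note, train_test_split shuffles differently on every run; when reproducibility matters, a sketch like the following (assuming the same scaled data) fixes the shuffle with random_state and keeps the class proportions balanced across the two sets with stratify:

```python
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)

# random_state fixes the shuffle so results are reproducible;
# stratify keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=42, stratify=wine.target)
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)
```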

# Create KNN Model

We build a KNN classifier with K=6.

```python
# Import k-nearest neighbors classifier model
from sklearn.neighbors import KNeighborsClassifier

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=6)

# Train the model using the training sets
knn.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = knn.predict(X_test)
```

# Model Evaluation

Once the model is trained, we can predict on our testing data.

```python
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))
# Test accuracy: 0.9722222222222222
```

We can see that our model seems accurate with a test accuracy of 97%.

```python
print("Goodness of fit: {}".format(metrics.r2_score(y_test, y_pred)))
# Goodness of fit: 0.9583333333333334

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
metrics.ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test)
plt.show()
```

From the confusion matrix, we can observe that class 0 was misclassified as 1 once, while class 2 was misclassified as 1 twice. The performance seems pretty good. But what if we were simply lucky with this particular split? A single train/test split can give a misleading performance estimate, hiding overfitting and a loss of generalization. To avoid this situation, we'll use K-fold cross-validation.
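For per-class detail beyond the confusion matrix, scikit-learn's classification_report gives precision, recall and F1 for each class. A self-contained sketch (the random_state of 0 here is only illustrative, not used elsewhere in this article):

```python
from sklearn import datasets, preprocessing, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
# precision, recall and F1 per class complement the confusion matrix
print(metrics.classification_report(y_test, knn.predict(X_test)))
```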

# K-fold Cross Validation

K-fold cross-validation is a technique that splits the data into K "folds" of equal size. K−1 folds are used to train the model and the remaining one to test it; the procedure is repeated K times so that every fold serves as the test set once, and the average test performance is used to evaluate the model. Earlier we used K=6 neighbors; now we want to be sure we are using the right number of nearest neighbors, so we'll select the best value of K for KNN by splitting the dataset into 5 folds.
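The fold mechanics can be illustrated with scikit-learn's KFold on ten toy samples: each of the 5 folds holds out a different fifth of the data.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each fold holds out a different 1/5 of the data
    print(f"fold {fold}: train size {len(train_idx)}, test size {len(test_idx)}")
```

(For classifiers, cross_val_score actually uses a stratified variant by default, which additionally preserves class proportions in each fold.)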

```python
from sklearn.model_selection import cross_val_score

k_range = range(1, 31)
# list of scores from k_range
k_scores = []

# loop through reasonable values of k
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # obtain cross_val_score for KNeighborsClassifier with k neighbours
    scores = cross_val_score(knn, X_scaled, wine.target, cv=5, scoring='accuracy')
    # append the mean score for k neighbors to the k_scores list
    k_scores.append(scores.mean())

print(k_scores)

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
```

From the plot, we observe that we reach the maximum performance for K=7, 8, 17, 18, 25 and several other values. The accuracies are very similar, and it's difficult to choose which value of K is best. For this reason, we'll try techniques that tune the hyperparameters and choose optimal values for the model.
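One simple, reproducible tie-breaking rule, sketched here by recomputing the scores above, is to take the smallest K that reaches the maximum cross-validated accuracy, using np.argmax (which returns the first index of the maximum):

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)

k_range = range(1, 31)
k_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_scaled, wine.target, cv=5, scoring='accuracy').mean()
            for k in k_range]

# np.argmax returns the first index of the maximum, i.e. the smallest best K
best_k = list(k_range)[int(np.argmax(k_scores))]
print(best_k)
```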

# Grid Search

Hyperparameter tuning is the problem of choosing a set of hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. A traditional way to perform it is the **Grid Search**, a technique that builds a model for every combination of hyperparameters specified and evaluates each model. It can be computationally expensive when the dataset is very big.

For the KNN model, we specify a grid of n_neighbors values from 1 to 24 that will be searched using K-fold cross-validation. We'll use the function GridSearchCV of the scikit-learn library.

```python
from sklearn.model_selection import GridSearchCV

# create a new KNN model
knn2 = KNeighborsClassifier()

# create a dictionary of all values we want to test for n_neighbors
param_grid = {"n_neighbors": np.arange(1, 25)}

# use grid search to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)

# fit model to data
knn_gscv.fit(X_scaled, wine.target)
```

```python
# check the top-performing n_neighbors value
print("Best parameter: {}".format(knn_gscv.best_params_))
```

We can observe that 7 is the optimal value for n_neighbors, while best_score_ gives the performance of the model with this hyperparameter: the mean accuracy over the K-fold cross-validation scores.

```python
# check the best score
print("Best score: {}".format(knn_gscv.best_score_))
```

# Random Search

Random Search is a technique that does not evaluate all the combinations of hyperparameters in the search space; instead, it "randomly" chooses a combination at every iteration. This lower computational cost is the advantage of Random Search over Grid Search.
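The random choice can be illustrated with scikit-learn's ParameterSampler, which RandomizedSearchCV uses under the hood to draw candidate combinations. Here it draws 10 of the 48 possible combinations in the search space we'll define below (24 values of K × 2 weight options):

```python
from sklearn.model_selection import ParameterSampler

param_dist = {"n_neighbors": list(range(1, 25)),
              "weights": ["uniform", "distance"]}
# draw only n_iter=10 of the 24 * 2 = 48 possible combinations
samples = list(ParameterSampler(param_dist, n_iter=10, random_state=5))
print(len(samples))  # 10
```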

As before, we define the parameter values for K to search. We also consider another parameter besides K, the **weights** parameter, which has two options:

- uniform: *all points in the neighborhood are weighted equally*
- distance: *closer neighbors weigh more heavily than further ones*

These hyperparameters will be searched using the function RandomizedSearchCV of the library scikit-learn.

```python
from sklearn.model_selection import RandomizedSearchCV

k_range = list(range(1, 25))
weight_options = ['uniform', 'distance']
param_dist = dict(n_neighbors=k_range, weights=weight_options)

rand = RandomizedSearchCV(knn, param_dist, cv=5, scoring='accuracy',
                          n_iter=10, random_state=5)
rand.fit(X_scaled, wine.target)

# check the top-performing parameter values
print("Best parameter: {}".format(rand.best_params_))

# check the best score
print("Best score: {}".format(rand.best_score_))
```

Unlike Grid Search, Random Search chose K=17 instead of 7, and its optimal weights option weighs all the points in the neighborhood equally. The scores of Grid Search and Random Search are the same here; knowing that Random Search is computationally cheaper, it would be the preferred technique, especially when the dataset is very big.

Thanks for reading! I hope you enjoyed my post. The GitHub repository can be found here.