Wine Quality Classification Using KNN

A guide to tuning the hyperparameters of KNN with Grid Search and Random Search

Eugenia Anello
Nov 30, 2020 · 7 min read
Photo by Andrea Piacquadio on pexels

Table of contents

Introduction · Prepare Data · Create KNN Model · Model Evaluation · K-fold Cross Validation · Grid Search · Random Search

Prepare Data

#scikit-learn dataset library
from sklearn import datasets, preprocessing
import numpy as np
from matplotlib import pyplot as plt

#load the wine dataset, returning the data as a pandas DataFrame
wine = datasets.load_wine(as_frame=True)
wine.frame
print(wine.feature_names)
'''['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']'''
print(list(wine.target))
'''[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]'''
#standardize each feature to zero mean and unit variance
X_scaled = preprocessing.scale(wine.data)
X_scaled.mean(axis=0)
'''array([-8.38280756e-16, -1.19754394e-16, -8.37033314e-16, -3.99181312e-17,
-3.99181312e-17, 0.00000000e+00, -3.99181312e-16, 3.59263181e-16,
-1.19754394e-16, 2.49488320e-17, 1.99590656e-16, 3.19345050e-16,
-1.59672525e-16])'''
X_scaled.std(axis=0)
'''array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])'''
from sklearn.model_selection import train_test_split

# Split dataset into training set (80%) and test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, wine.target, test_size=0.2)
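One caveat: preprocessing.scale standardizes using statistics computed over the whole dataset, so information from the test split leaks into preprocessing. A common alternative is to fit a StandardScaler on the training split only — a sketch, not the article's code (the random_state is an assumption, added for reproducibility):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
# split first, so the test set stays untouched (random_state is assumed)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the train split only
X_test = scaler.transform(X_test)        # reuse the train statistics on the test split
```

With this ordering the scaler never sees the held-out samples, so the test accuracy is a cleaner estimate.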

Create KNN Model

#Import knearest neighbors Classifier model
from sklearn.neighbors import KNeighborsClassifier
#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=6)
#Train the model using the training sets
knn.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = knn.predict(X_test)
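Besides the hard labels from predict, KNeighborsClassifier also exposes the fraction of neighbor votes per class through predict_proba — a small sketch (the random_state is an assumption, not the article's split):

```python
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# each row sums to 1: the share of the 6 nearest neighbors voting for each class
proba = knn.predict_proba(X_test)
print(proba[:3])
```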

Model Evaluation

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Test accuracy:",metrics.accuracy_score(y_test, y_pred))
#Test accuracy: 0.9722222222222222
print("Goodness of fit: {}".format(metrics.r2_score(y_test, y_pred)))
#Goodness of fit: 0.9583333333333334
metrics.plot_confusion_matrix(knn, X_test, y_test)
plt.show()
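Accuracy alone can hide per-class behavior. A sketch that adds scikit-learn's classification_report for per-class precision, recall and F1 (the random_state is an assumption, not the article's split):

```python
from sklearn import datasets, metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# per-class precision, recall and F1, labeled with the dataset's class names
report = metrics.classification_report(y_test, y_pred,
                                       target_names=wine.target_names)
print(report)
```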

K-fold Cross Validation

from sklearn.model_selection import cross_val_score, cross_val_predict

k_range = range(1, 31)
#list of scores from k_range
k_scores = []
#loop through reasonable values of k
for k in k_range:
    #obtain cross_val_score for KNeighborsClassifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_scaled, wine.target, cv=5, scoring='accuracy')
    #append mean of scores for k neighbors to k_scores list
    k_scores.append(scores.mean())
print(k_scores)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
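Rather than reading the best K off the plot, np.argmax can pick it directly from k_scores — a compact sketch of the same loop:

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)

k_range = range(1, 31)
# mean 5-fold CV accuracy for each candidate k
k_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_scaled, wine.target,
                            cv=5, scoring='accuracy').mean()
            for k in k_range]

# argmax returns the first k that reaches the highest mean accuracy
best_k = k_range[int(np.argmax(k_scores))]
print("Best k:", best_k, "with mean CV accuracy:", max(k_scores))
```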

Grid Search

from sklearn.model_selection import GridSearchCV

#create a new knn model
knn2 = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
param_grid = {"n_neighbors": np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)
#fit model to data
knn_gscv.fit(X_scaled, wine.target)
#check top performing n_neighbors value
print("Best parameter: {}".format(knn_gscv.best_params_))
#check the best score
print("Best score: {}".format(knn_gscv.best_score_))
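After fitting, knn_gscv.cv_results_ stores the mean cross-validated score for every candidate value, which is handy for inspecting the whole curve rather than just the winner. A sketch:

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)

param_grid = {"n_neighbors": np.arange(1, 25)}
knn_gscv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_gscv.fit(X_scaled, wine.target)

# cv_results_ holds one mean test score per candidate n_neighbors
for k, score in zip(knn_gscv.cv_results_["param_n_neighbors"],
                    knn_gscv.cv_results_["mean_test_score"]):
    print(k, round(score, 3))
```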

Random Search

from sklearn.model_selection import RandomizedSearchCV

k_range = list(range(1, 25))
weight_options = ['uniform', 'distance']

param_dist = dict(n_neighbors=k_range, weights=weight_options)
#use a fresh classifier so the search starts from default hyperparameters
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=5, scoring='accuracy', n_iter=10, random_state=5)

rand.fit(X_scaled, wine.target)
#check top performing n_neighbors value
print("Best parameter: {}".format(rand.best_params_))
#check the best score
print("Best score: {}".format(rand.best_score_))
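Note that the searches above were fit on the full dataset, so best_score_ is a cross-validation estimate, not a test score. A sketch that holds out a test split before searching and then evaluates best_estimator_ on it (the split's random_state is an assumption):

```python
from sklearn import datasets, preprocessing
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X_scaled = preprocessing.scale(wine.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, wine.target, test_size=0.2, random_state=5)

param_dist = {"n_neighbors": list(range(1, 25)),
              "weights": ["uniform", "distance"]}
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=5,
                          scoring="accuracy", n_iter=10, random_state=5)
rand.fit(X_train, y_train)  # the search never sees the test split

# best_estimator_ is already refit on the whole training split
test_acc = rand.best_estimator_.score(X_test, y_test)
print("Held-out test accuracy:", test_acc)
```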

The Startup

Medium's largest active publication, followed by +755K people. Follow to join our community.

Eugenia Anello

Written by

I am a Data Science student and a Traveller enthusiast | I learn something new everyday | https://www.linkedin.com/in/eugenia-anello-545711146
