K-fold CV — Hyper-parameter tuning in Python

Little Dino
5 min read · Mar 30, 2022


Introduction

We went through the idea of K-fold CV in the last post; today we’re going to look at how to implement it in Python. In particular, we’ll focus on hyper-parameter tuning, not model performance estimation. Let’s get started!

Step by step explanation

We’ll use the Iris dataset and the K nearest neighbor algorithm in this example, and our goal is to predict the sub-species of an Iris flower based on its characteristics. If you haven’t heard of K nearest neighbor, don’t freak out; you can still learn K-fold CV. We’ll go through the process step by step.

1. Import packages

The first thing we do is import the packages we need. By convention, we use a single cell to import all the packages.

import pandas as pd                                    # data handling
import numpy as np                                     # numerical operations
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.neighbors import KNeighborsClassifier    # K nearest neighbor model
from sklearn.model_selection import cross_validate    # K-fold cross validation
from prettytable import PrettyTable                    # table for displaying results

2. Read data

Remember to save the Iris data in the same directory as your Python file; you can download the data here.

data = pd.read_csv('IRIS.csv')

3. Quickly inspect the data

data.head()

What we see is the first 5 rows of the dataset. Specifically, we have 4 characteristics of the Iris flower (sepal width and length, petal width and length), and the target attribute is the last column (the sub-species an Iris flower belongs to, which is what we want to predict).

First 5 rows of Iris dataset

Of course we can do more inspection. Say we want to see a statistical summary of all numerical columns; just type data.describe().
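For example, the call below prints the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

data.describe()  # count, mean, std, min, 25%, 50%, 75%, max for each numeric column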

4. Train/Test split

To make our performance estimation justifiable, we need to split the data into a training set and a testing set, and we’ll use the common 80/20 split (80% training, 20% testing). Remember that the testing set is ONLY used in testing, which is the last step of the whole process.

The characteristics (independent attributes) are the first 4 columns, and the target is the last column. Thus, we first create two variables, characteristics and target, which are the corresponding subsets of data.

Then, we use the function train_test_split to split the dataset. The first argument is the independent attributes (characteristics), the second is the target variable. test_size is the proportion of data used for testing; we set it to 0.2 since we chose an 80/20 split. random_state just makes sure the split is the same every time (the randomness in splitting is fixed).

characteristics = data.iloc[:,:4] # the first 4 columns 
target = data.iloc[:,-1] # the last column
x_train, x_test, y_train, y_test = train_test_split(characteristics, target, test_size=0.2, random_state=2727)

5. Cross validation for hyper-parameter tuning

We’re finally here!!! The first thing we do is pick some candidate hyper-parameters for the K nearest neighbor model. In this example, we allow p to be [1, 2, 3] and n_neighbors to be [2, 3, 4, 5, 6]. The meaning of these parameters is not important; you only need to know we want to find the BEST model among these hyper-parameter combinations.

To illustrate, we apply grid search using for loops. Namely, we perform K-fold cross validation (K=10) on EVERY model, then we select the one with the best average accuracy.

Actually, the cross_validate function pretty much does everything for us. We just need to pass a few parameters to this function, then voilà, we have the score of each fold. Specifically, the parameters are the model we choose (knn), the independent attributes of the training data (x_train), the target attribute of the training data (y_train), the K (cv), and the scoring measure (scoring).

However, cross_validate returns a dictionary, and its key test_score holds an array of scores, one for each fold. Therefore, we take the average of these scores as a general representation of model performance (np.mean(scores['test_score'])).

⚡ In the cross_validate function, we pass in the TRAINING data. It’s really important that we don’t use testing data in the training phase.

hyperparameter_score_list = []
for p in range(1, 4):                # p = 1, 2, 3
    for neighbor in range(2, 7):     # n_neighbors = 2, 3, 4, 5, 6
        knn = KNeighborsClassifier(p=p, n_neighbors=neighbor)
        scores = cross_validate(knn, x_train, y_train, cv=10, scoring='accuracy')
        mean_score = np.mean(scores['test_score'])   # average accuracy over the 10 folds
        hyperparameter_score_list.append([p, neighbor, mean_score])

6. Choose the hyper-parameters

In this example, we just print the hyper-parameters and the corresponding accuracy scores in a table for easy visual comparison, but there are more automatic ways to select hyper-parameters (see the GridSearchCV sketch at the end of this step).

myTable = PrettyTable(["p (distance)", "Number of neighbors", "Avg accuracy"])
for row in hyperparameter_score_list:
    myTable.add_row([row[0], row[1], round(row[2], 3)])
print(myTable)
Table for hyper-parameter selection

It’s clear that multiple combinations of hyper-parameters share the same highest average accuracy, 0.983. For simplicity, we arbitrarily choose p=2, n_neighbors=3 as our best parameters.
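As mentioned above, this selection can also be automated. Here is a minimal sketch using scikit-learn’s GridSearchCV, which runs the same cross-validated grid search internally; the grid below simply mirrors the candidates we tried by hand.

from sklearn.model_selection import GridSearchCV

# Same candidate values as the manual for loops
param_grid = {'p': [1, 2, 3], 'n_neighbors': [2, 3, 4, 5, 6]}

# 10-fold CV on every combination in the grid, scored by accuracy
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search.fit(x_train, y_train)

print(grid_search.best_params_)  # hyper-parameters with the highest average accuracy
print(grid_search.best_score_)   # that highest average accuracy

By default, GridSearchCV also refits the best model on the whole training set, so grid_search.score(x_test, y_test) would give the same kind of final test accuracy we compute in the next step.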

7. Train a model and Evaluate the model performance on testing set

Congratulations, this is the LAST step! We’ve selected the hyper-parameters that work best among all the candidates, so we now fit a model with these parameters.

⚡ This model is fit on the whole training set; we only split the training set into training and validation folds during the cross-validation step.

Then, we use the model to predict the labels of the testing set and report this accuracy as the final model performance estimate. This can be done with the score method of the fitted model; you just pass in the independent and target attributes of the testing set.

knn = KNeighborsClassifier(p=2, n_neighbors=3)
knn_best_model = knn.fit(x_train, y_train)
print("Best Model Testing Score: ", knn_best_model.score(x_test, y_test))

The final accuracy turns out to be 0.933! It shows that K nearest neighbor with p=2, n_neighbors=3 performs pretty well on the testing set. One thing to note is that accuracy alone is not a comprehensive measure; you still need to look at other performance measures to determine whether the model is appropriate.
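For example, per-class precision and recall and the confusion matrix often reveal issues that a single accuracy number hides. A quick sketch using scikit-learn’s standard metrics (variable names match the code above):

from sklearn.metrics import classification_report, confusion_matrix

# Predicted labels of the testing set
y_pred = knn_best_model.predict(x_test)

# Precision, recall, and F1-score for each sub-species
print(classification_report(y_test, y_pred))

# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_test, y_pred))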

Coding

I know it’s a long journey and it looks difficult, but the coding part is really not that hard. I put all the code together below so you can try to implement and understand it, and I hope this article helps you understand K-fold CV a bit more 🥸🥸🥸.
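Here is the full script, assembled from the snippets in each step:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from prettytable import PrettyTable

# Read and inspect the data
data = pd.read_csv('IRIS.csv')
data.head()

# Train/test split (80/20)
characteristics = data.iloc[:, :4]  # the first 4 columns
target = data.iloc[:, -1]           # the last column
x_train, x_test, y_train, y_test = train_test_split(characteristics, target, test_size=0.2, random_state=2727)

# 10-fold cross validation for every hyper-parameter combination
hyperparameter_score_list = []
for p in range(1, 4):
    for neighbor in range(2, 7):
        knn = KNeighborsClassifier(p=p, n_neighbors=neighbor)
        scores = cross_validate(knn, x_train, y_train, cv=10, scoring='accuracy')
        mean_score = np.mean(scores['test_score'])
        hyperparameter_score_list.append([p, neighbor, mean_score])

# Print the results for comparison
myTable = PrettyTable(["p (distance)", "Number of neighbors", "Avg accuracy"])
for row in hyperparameter_score_list:
    myTable.add_row([row[0], row[1], round(row[2], 3)])
print(myTable)

# Fit the chosen model on the whole training set and evaluate on the testing set
knn = KNeighborsClassifier(p=2, n_neighbors=3)
knn_best_model = knn.fit(x_train, y_train)
print("Best Model Testing Score: ", knn_best_model.score(x_test, y_test))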
