🏋🏼‍♂️ How to use cross-validation to train models in scikit-learn

Johni Douglas Marangon
6 min read · Dec 7, 2023


Today, we are going to take a look at how to use cross-validation to train a machine learning model in scikit-learn. In summary, we will use cross-validation to extract metrics that describe how well the model performs, and after that use all of the data to train the final model.

The common approach to training a model is to split the dataset into training and test sets; you can do this quickly with the train_test_split helper function.
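As a point of reference, here is a minimal sketch of that conventional split (the 25% test size and the random_state are just illustrative choices, not from the original walkthrough):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


iris = load_iris()

# Hold out 25% of the samples for testing; they are never seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)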

However, by partitioning the data into two or more sets you reduce the number of samples that can be used to fit the model. In this approach you "waste" data that could be used to train the final model. If you don't have much data, this can be a problem.

What is cross-validation?

Cross-validation is a statistical method for evaluating a model and testing its performance on unseen data in order to detect overfitting. It is useful for selecting the best model and tuning hyperparameters because it uses different portions of the data to train and test the model over different iterations.

There are many different types of cross-validation, but the most common is k-fold cross-validation. This technique divides the dataset into multiple subsets, or folds. Each fold takes a turn as the validation set while the remaining folds are used for training, until every fold has been used once. In the end, the performance metrics (such as accuracy, precision, recall, or others) from each step can be combined to provide a more robust evaluation of the model's performance.

I recommend watching this video as a study aid. Common choices for the value of k in k-fold cross-validation are 5, 10, or 20, but the choice depends on the specific dataset and the computational resources available. Additionally, there are variations of cross-validation, such as stratified k-fold, leave-one-out, and leave-p-out cross-validation, each with its own use case and advantages.

Take a look at the scikit-learn model selection module to see the available cross-validation techniques and how to use them. The choice of technique depends on the size and nature of the data; in general, the default technique, k-fold cross-validation, works well.
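To make the mechanics concrete, here is a small sketch (not part of the original walkthrough) that shows how KFold assigns ten toy samples to train and validation indices:

import numpy as np
from sklearn.model_selection import KFold


X_demo = np.arange(10).reshape(-1, 1)  # ten toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    # Each sample lands in the validation set exactly once across the 5 folds.
    print(f"Fold {i}: train={train_idx} validation={val_idx}")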

The goal here is to show how cross-validation (CV for short) can be used to build more accurate and reliable models while still using all of the data to train the final model. This approach is especially valuable when you don't have many samples.

Keep in mind: you should never use the same samples to train and test the model. It is a mistake.

To reproduce the examples in this post, use a Google Colab Scratchpad and copy and paste the code.

Let's get started 🕹️

First of all, let’s start by installing all required libraries:

pip install scikit-learn matplotlib

To save time, let's use the popular Iris dataset. Import the samples:

from sklearn.datasets import load_iris


iris = load_iris()

# Feature matrix and target vector (species labels)
X, y = iris.data, iris.target

print(iris.target_names)

Another important component of scikit-learn is the Pipeline. A Pipeline lets you compose a sequence of transformations and a final estimator, allowing you to build a simpler workflow:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


pipe = Pipeline(
    [
        ("scaler", StandardScaler()),             # standardize the features
        ("pca", PCA(n_components=3)),             # reduce to 3 components
        ("estimator", DecisionTreeClassifier()),  # final classifier
    ],
    verbose=True,
)

Now, you have a working pipeline that you can use to apply the cross-validation technique.

Computing cross-validated metrics

The cross_val_score function is the simplest way to use cross-validation. The following example demonstrates how to estimate the accuracy of the pipeline by computing the score over 5 folds.

from sklearn.model_selection import cross_val_score


CV = 5

scoring = "accuracy"

scores = cross_val_score(pipe, X, y, scoring=scoring, cv=CV)
print(f"{scoring}: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

Take a look at the plot to better understand the results:

import matplotlib.pyplot as plt
import numpy as np


folds = np.arange(1, CV + 1)

plt.rc('grid', linestyle="-", color='black')
plt.scatter(x=folds, y=scores)
plt.xticks(folds)

plt.grid(True)

plt.title("Cross-validation scores")

plt.xlabel("k-fold")
plt.ylabel(f"Scores [{scoring}]")

plt.show()

The scoring parameter defines the evaluation metric used to calculate the score. The most popular metric is accuracy, but here you can find many other scoring options.

Try another metric, the macro-averaged F1-score:

scores = cross_val_score(pipe, X, y, scoring='f1_macro', cv=CV)
scores

The same behavior, but with another score.
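If you want several metrics from a single run, the cross_validate function (a sibling of cross_val_score, not used in the original walkthrough) accepts a list of scorers:

from sklearn.model_selection import cross_validate


results = cross_validate(pipe, X, y, scoring=["accuracy", "f1_macro"], cv=CV)

# One score per fold for each requested metric.
print(results["test_accuracy"])
print(results["test_f1_macro"])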

Computing cross-validated confusion matrix

A confusion matrix is a tool used to evaluate the performance of a classification model. The matrix compares the predicted classes of a model with the actual classes of the data, breaking down the results into four categories: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). It is used to evaluate the model's ability to correctly classify instances of two or more classes, and it lets you examine the performance of the model for each class individually. See this video to understand the details of how the confusion matrix works.
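As a quick illustration, here is a toy binary example (made up for this explanation, separate from the Iris walkthrough) that extracts the four categories directly:

from sklearn.metrics import confusion_matrix


y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")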

In this section we will create a confusion matrix with the cross-validation approach.

The cross_val_predict function returns, for each sample, the prediction that was computed while that sample was in the test fold:

from sklearn.model_selection import cross_val_predict


CV = 5

y_pred = cross_val_predict(
    pipe,
    X,
    y,
    cv=CV,
)

To plot the confusion matrix, use the code below:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


cm = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y, y_pred),
    display_labels=iris.target_names,
)
cm.plot();

On the other hand, you can use the classification_report function to extract more metrics from the predicted results:

from sklearn.metrics import classification_report


print(classification_report(y, y_pred, target_names=iris.target_names))

We have covered the most common metrics used to evaluate classification models with cross-validation and get a picture of the model's performance. In the next section we start the training step.

Training the model

After collecting and reviewing the metrics, you are ready to fit the model with all of the samples. Just call the fit method with all the data and save the model in pickle format.

import pickle


pipe.fit(X, y)

with open("pipe.bin", "wb") as f:
    pickle.dump(pipe, f)

Load the saved model and call the predict_proba method to make a prediction:

with open("pipe.bin", "rb") as f:
_pipe = pickle.load(f)

X_pred = [[3, 2, 4, 0.2], [ 4.7, 3, 1.3, 0.2 ]]

y_pred = _pipe.predict_proba(X_pred)

for prediction in y_pred:
for class_name, proba in zip(iris.target_names, prediction):
print(f"{class_name}: {proba}")
print("---")

Here we have a model fitted with all the data and the assurance that it will perform well, based on the metrics extracted using cross-validation.

Bonus: Use cross-validation to train a text classifier

We will walk through a code snippet that demonstrates, step by step, how to train a text classifier using cross-validation.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import pandas as pd


data = [
    ("Great for the jawbone.", "pos"),
    ("What a waste of money and time!.", "neg"),
    ("And the sound quality is great.", "pos"),
    ("I advise EVERYONE DO NOT BE FOOLED!", "neg"),
    ("The commercials are the most misleading.", "neg"),
    ("Doesn't hold charge.", "neg"),
    ("It has kept up very well.", "pos"),
]

df = pd.DataFrame(data, columns=['text','label'])

X, y = df.text, df.label

pipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", SVC(probability=True, random_state=42)),
    ],
    verbose=True,
)


n_splits = 2
cv = StratifiedKFold(n_splits=n_splits, random_state=42, shuffle=True)

scoring = "accuracy"

scores = cross_val_score(pipe, X, y, scoring=scoring, cv=cv)
print(f"{scoring}: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

Finally, train the model on all the data and save it:

import pickle


pipe.fit(X, y)
with open("model.bin", "wb") as f:
    pickle.dump(pipe, f)
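As with the Iris pipeline, you can load the saved classifier and score new text (the sentences below are made up for illustration):

with open("model.bin", "rb") as f:
    text_clf = pickle.load(f)

# predict_proba is available because the SVC was created with probability=True.
new_texts = ["The battery life is great.", "Total waste of money."]

for text, probs in zip(new_texts, text_clf.predict_proba(new_texts)):
    print(text, dict(zip(text_clf.classes_, probs.round(3))))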

The main focus here is simplicity: applying the same commands to train a text classification model.

Conclusion

This article provided a quick, step-by-step guide to calculating metrics with cross-validation and then using all the samples to fit the final model.

Don't forget to clap (👏) if you found it helpful!
