Resampling in Python — Cross-Validation Part 3/3

Wendy Hu
4 min read · Feb 15, 2023


Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV)

Splits the dataset into a training set with all but one observation and a validation set with only that single held-out observation, and repeats the split so that every observation serves as the validation set exactly once. LOOCV is a special case of K-fold cross-validation where K equals the number of observations in the dataset.
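Although this post builds LOOCV by hand, a minimal sketch with scikit-learn's LeaveOneOut splitter (scikit-learn is not used anywhere else in this post; this is shown only to illustrate that LOOCV produces exactly n train/validation splits) looks like this:

# toy illustration that LOOCV is K-fold CV with K = n (not the insurance dataset)
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)   # a toy dataset with n = 10 observations

loo = LeaveOneOut()
print(loo.get_n_splits(X))         # prints 10, i.e. one split per observation

for train_index, val_index in loo.split(X):
    # each split trains on 9 observations and validates on the single held-out one
    print("train size:", len(train_index), "held-out index:", val_index)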

The LOOCV Python Example

1. Import libraries

# scientific computing
import pandas as pd
import numpy as np

# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm

2. Load the insurance dataset

# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)

# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]

# take a look at the first five rows
df.head()

This is what the data looks like.

3. Randomly create 1338 mutually exclusive folds from the original dataset. Since the original dataset has 1338 observations, we have 1338 folds and each fold holds exactly 1 observation. (1338 observations / 1338 folds = 1 observation per fold)

def get_folds(seed, k_folds):

    # set the random seed
    np.random.seed(seed)

    # get the size of each fold
    fold_size = np.floor(df.shape[0] / k_folds)

    # deep copy the original dataset and transpose it
    df_copy = df.copy().T

    # create the result dictionary
    df_folds = {}

    # loop through each fold
    for i in range(k_folds):

        # create a fold dataframe
        df_fold = pd.DataFrame()

        # get the name of the fold
        name = 'fold_{}'.format(i)

        # while the fold is smaller than the fold size, keep sampling rows
        while len(df_fold) < fold_size:

            # randomly choose an index from the copy of the original dataset
            sampled_index = np.random.choice(list(df_copy.T.index))

            # pop the row off the dataframe using the chosen index
            s_popped = df_copy.pop(sampled_index)

            # convert the popped row into a dataframe
            df_popped = s_popped.to_frame().T

            # concatenate this popped dataframe to the fold dataframe
            df_fold = pd.concat([df_fold, df_popped])

        # assign the name of the fold as the key and the fold dataframe as the value
        df_folds[name] = df_fold

    # return the result dictionary
    return df_folds

The first fold looks like this.

The first fold randomly selected from the original dataset

4. Write a helper function to fit a linear regression and print the validation MSE

def fit_linear_regression(df_train, df_val):

    # fit a linear regression model
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()

    # print the model summary
    print(lm.summary())
    print("\n")

    # predict the charges using the validation set
    pred_charges = lm.predict(df_val)

    # concatenate the actual charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})

    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
    print("\n")
    print("***************************************************")

5. Create the training set and the validation set for each fold, giving 1338 training/validation pairs. For each pair, the current fold is the validation set and the remaining folds form the training set. The linear regression model is then fit 1338 times, once per pair. Since fitting the model 1338 times is time-consuming, the function below stops after a single iteration to illustrate the point. Please feel free to remove the "break" at the end of the loop to run all 1338 iterations.

def get_train_val_fit_loocv(df_folds):

    # get the number of folds
    k_folds = len(df_folds.keys())

    # loop through each fold
    for i in range(k_folds):

        # prepare the training set from every fold except the current one
        df_train = pd.DataFrame()
        for j in range(len(df_folds.keys())):
            if i != j:
                name_train = 'fold_' + str(j)
                print("{} added to the training set".format(name_train))
                df_train = pd.concat([df_train, df_folds[name_train]])
        print("\n")
        print("Training set prepared")
        print(df_train)

        # prepare the validation set from the current fold
        name_val = 'fold_' + str(i)
        print("{} added to the validation set".format(name_val))
        df_val = df_folds[name_val]
        print("\n")
        print("Validation set prepared")
        print(df_val)

        # fit the linear regression model using that fold as the validation set and the rest of the folds as the training set
        fit_linear_regression(df_train, df_val)

        # remove this break to complete all 1338 iterations
        break
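With the folds from step 3 in hand, the whole procedure is kicked off with a single call (a sketch, reusing the df_folds dictionary built earlier):

# run the LOOCV loop (as written above it stops after the first iteration because of the break)
get_train_val_fit_loocv(df_folds)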

The output of the linear regression model using the first fold as the validation set and the rest of the folds as the training set:

  • Add folds 1 through 1337 to the training set. The training set looks like this, with 1337 observations.
The training set using all folds except the first
  • Add fold 0 to the validation set. The validation set looks like this, with 1 observation.
The validation set using the first fold
  • The linear regression model looks like this.
The summary of the linear regression model using the above training set and validation set
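One thing the walkthrough prints but never aggregates is the LOOCV estimate of the test error, which is simply the average of the n single-observation MSEs, CV(n) = (1/n) * (MSE_1 + ... + MSE_n). A compact sketch that computes it directly with pandas positional indexing (an alternative to the fold dictionary above; it fits all 1338 models, so it takes a while) could look like this:

# a sketch (not part of the original code): compute the LOOCV estimate of the test MSE
mse_list = []

# leave each observation out once, by position
for i in range(df.shape[0]):

    # the single held-out observation
    df_val = df.iloc[[i]]

    # the remaining n - 1 observations
    df_train = df.drop(df.index[i])

    # fit the same Gaussian GLM (i.e. linear regression) as above
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()

    # squared error on the held-out observation
    pred = lm.predict(df_val)
    mse_list.append(float((df_val['charges'].iloc[0] - pred.iloc[0]) ** 2))

# average the 1338 single-observation MSEs
print("LOOCV estimate of the test MSE: {}".format(np.mean(mse_list)))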

References

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R
