Resampling in Python — Cross-Validation Part 3/3

Wendy Hu
4 min read · Feb 15, 2023


Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV)

Splits the dataset into a training set with all but one observation and a validation set with only that single held-out observation, and repeats the split so that every observation serves as the validation set exactly once. LOOCV is a special case of K-fold cross-validation where K equals the number of observations in the dataset.
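Although this post builds LOOCV by hand, a minimal sketch with scikit-learn's LeaveOneOut splitter (scikit-learn is not used anywhere else in this post; this is shown only to illustrate that LOOCV produces exactly n train/validation splits) looks like this:

# toy illustration that LOOCV is K-fold CV with K = n (not the insurance dataset)
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)   # a toy dataset with n = 10 observations

loo = LeaveOneOut()
print(loo.get_n_splits(X))         # prints 10, i.e. one split per observation

for train_index, val_index in loo.split(X):
    # each split trains on 9 observations and validates on the single held-out one
    print("train size:", len(train_index), "held-out index:", val_index)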

The LOOCV Python Example

1. Import libraries

# scientific computing
import pandas as pd
import numpy as np

# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm

2. Load the insurance dataset

# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)

# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]

# take a look at the first five rows
df.head()

This is what the data looks like.

3. Randomly create 1338 mutually exclusive folds from the original dataset. Since the original dataset has 1338 observations, we have 1338 folds and each fold holds exactly 1 observation. (1338 observations / 1338 folds = 1 observation per fold)

def get_folds(seed, k_folds):

    # set the random seed
    np.random.seed(seed)

    # get the size of each fold
    fold_size = np.floor(df.shape[0] / k_folds)

    # deep copy the original dataset and transpose it
    df_copy = df.copy().T

    # create the result dictionary
    df_folds = {}

    # loop through each fold
    for i in range(k_folds):

        # create a fold dataframe
        df_fold = pd.DataFrame()

        # get the name of the fold
        name = 'fold_{}'.format(i)

        # while the fold is smaller than the fold size, keep sampling rows
        while len(df_fold) < fold_size:

            # randomly choose an index from the copy of the original dataset
            sampled_index = np.random.choice(list(df_copy.T.index))

            # pop the row off the dataframe using the chosen index
            s_popped = df_copy.pop(sampled_index)

            # convert the popped row into a dataframe
            df_popped = s_popped.to_frame().T

            # concatenate this popped dataframe to the fold dataframe
            df_fold = pd.concat([df_fold, df_popped])

        # assign the name of the fold as the key and the fold dataframe as the value
        df_folds[name] = df_fold

    # return the result dictionary
    return df_folds

The first fold looks like this.

The first fold randomly selected from the original dataset

4. Write a helper function to fit a linear regression and print the validation MSE

def fit_linear_regression(df_train, df_val):

    # fit a linear regression model
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()

    # print the model summary
    print(lm.summary())
    print("\n")

    # predict the charges using the validation set
    pred_charges = lm.predict(df_val)

    # concatenate the actual charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})

    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
    print("\n")
    print("***************************************************")

5. Create the training set and the validation set for each fold, giving 1338 training/validation pairs. For each pair, the current fold is the validation set and the remaining folds form the training set. The linear regression model is then fit 1338 times, once per pair. Since fitting the model 1338 times is time-consuming, the function below stops after a single iteration to illustrate the point. Please feel free to remove the "break" at the end of the loop to run all 1338 iterations.

def get_train_val_fit_loocv(df_folds):

    # get the number of folds
    k_folds = len(df_folds.keys())

    # loop through each fold
    for i in range(k_folds):

        # prepare the training set from every fold except the current one
        df_train = pd.DataFrame()
        for j in range(len(df_folds.keys())):
            if i != j:
                name_train = 'fold_' + str(j)
                print("{} added to the training set".format(name_train))
                df_train = pd.concat([df_train, df_folds[name_train]])
        print("\n")
        print("Training set prepared")
        print(df_train)

        # prepare the validation set from the current fold
        name_val = 'fold_' + str(i)
        print("{} added to the validation set".format(name_val))
        df_val = df_folds[name_val]
        print("\n")
        print("Validation set prepared")
        print(df_val)

        # fit the linear regression model using that fold as the validation set and the rest of the folds as the training set
        fit_linear_regression(df_train, df_val)

        # remove this break to complete all 1338 iterations
        break
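With the folds from step 3 in hand, the whole procedure is kicked off with a single call (a sketch, reusing the df_folds dictionary built earlier):

# run the LOOCV loop (as written above it stops after the first iteration because of the break)
get_train_val_fit_loocv(df_folds)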

The output of the linear regression model using the first fold as the validation set and the rest of the folds as the training set:

  • Add folds 1 through 1337 to the training set. The training set looks like this, with 1337 observations.
The training set using all folds except the first
  • Add fold 0 to the validation set. The validation set looks like this, with 1 observation.
The validation set using the first fold
  • The linear regression model looks like this.
The summary of the linear regression model using the above training set and validation set
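One thing the walkthrough prints but never aggregates is the LOOCV estimate of the test error, which is simply the average of the n single-observation MSEs, CV(n) = (1/n) * (MSE_1 + ... + MSE_n). A compact sketch that computes it directly with pandas positional indexing (an alternative to the fold dictionary above; it fits all 1338 models, so it takes a while) could look like this:

# a sketch (not part of the original code): compute the LOOCV estimate of the test MSE
mse_list = []

# leave each observation out once, by position
for i in range(df.shape[0]):

    # the single held-out observation
    df_val = df.iloc[[i]]

    # the remaining n - 1 observations
    df_train = df.drop(df.index[i])

    # fit the same Gaussian GLM (i.e. linear regression) as above
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()

    # squared error on the held-out observation
    pred = lm.predict(df_val)
    mse_list.append(float((df_val['charges'].iloc[0] - pred.iloc[0]) ** 2))

# average the 1338 single-observation MSEs
print("LOOCV estimate of the test MSE: {}".format(np.mean(mse_list)))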

References

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R
