Resampling in Python — Cross-Validation Part 2/3

Feb 13, 2023

K-fold cross-validation

K-fold cross-validation randomly splits the dataset into K mutually exclusive folds (think of folds as groups) of equal size. In the image above, 4 and 2 are randomly selected from the original dataset as the first fold, and 9 and 1 are selected as the second fold. To construct the first training/validation set, we use the first fold (4 and 2) as the validation set and the remaining folds as the training set (from 9 and 1 in the second fold through 7 and 5 in the fifth fold). Each of the other folds then takes its turn as the validation set in the same way. Leave-one-out cross-validation is the special case of K-fold cross-validation where K = N (the number of observations in the dataset).
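
The splitting described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names make_folds and train_val_splits are hypothetical, the ten observations are just the numbers 1 to 10, and the folds drawn will not match the ones in the image.

```python
import random

def make_folds(observations, k, seed=0):
    """Shuffle the observations and cut them into k equal-sized folds."""
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)
    fold_size = len(shuffled) // k
    # if len(shuffled) is not divisible by k, the remainder is left out
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(k)]

def train_val_splits(folds):
    """Yield one (training, validation) pair per fold."""
    for i, val in enumerate(folds):
        # the training set is every observation from every other fold
        train = [obs for j, fold in enumerate(folds) if j != i for obs in fold]
        yield train, val

# 10 observations split into 5 folds of 2
folds = make_folds(range(1, 11), k=5)
for train, val in train_val_splits(folds):
    print("validation:", val, "training:", train)

# leave-one-out: K equals the number of observations, so each fold has 1 element
loo_folds = make_folds(range(1, 11), k=10)
```

Each observation appears in exactly one validation set, and no training set ever contains its own validation fold.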

The K-Fold Cross-Validation Python Example

1. Import libraries

# scientific computing
import pandas as pd
import numpy as np

# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm

2. Load the insurance dataset

# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)

# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]

# take a look at the first five rows
df.head()

This is what the data looks like.

3. Randomly create 10 mutually exclusive folds from the original dataset. Since the original dataset has 1338 rows, each fold has 133 rows (1338 rows / 10 folds = 133.8 rows per fold, and we take the floor of this number). The 8 rows left over are not assigned to any fold.
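
As a quick check of the fold-size arithmetic, the snippet below uses the row count and number of folds stated above. The variable names are only for illustration.

```python
n_rows = 1338   # rows in the dataset, as stated above
k_folds = 10    # number of folds

# divmod returns the floor of the division and the remainder in one step
fold_size, leftover = divmod(n_rows, k_folds)

print("rows per fold:", fold_size)
print("rows assigned to a fold:", fold_size * k_folds)
print("rows left out:", leftover)
```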

def get_folds(seed, k_folds):
    
    # set the random seed
    np.random.seed(seed)
    
    # get the size of the fold (integer floor of rows / folds)
    fold_size = df.shape[0] // k_folds
    
    # deep copy the original dataset and transpose it
    df_copy = df.copy().T
    
    # create result dictionary
    df_folds = {}
    
    # loop through each fold
    for i in range(k_folds):
        
        # create a fold dataframe
        df_fold = pd.DataFrame()
        
        # get the name of the fold
        name = 'fold_{}'.format(i)
        
        # keep sampling until the fold holds fold_size rows
        while len(df_fold) < fold_size:
            
            # randomly choose a remaining row label (a column of the transposed copy)
            sampled_index = np.random.choice(df_copy.columns)
            
            # pop the row off the dataframe using the chosen index
            s_popped = df_copy.pop(sampled_index)
            
            # convert the popped row into a dataframe
            df_popped = s_popped.to_frame().T
            
            # concatenate this popped dataframe to the fold dataframe
            df_fold = pd.concat([df_fold, df_popped])
        
        # assign the name of the fold as the key and the fold dataframe as the value
        df_folds[name] = df_fold
    
    # return the result dictionary
    return df_folds
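
Popping one row at a time works, but it gets slow as the dataset grows. A possible alternative, sketched below, shuffles the rows once and slices the result into consecutive blocks. The function name get_folds_shuffled and the synthetic demo data are assumptions for illustration, and for a given seed it will not reproduce the exact folds that get_folds returns.

```python
import numpy as np
import pandas as pd

def get_folds_shuffled(df, seed, k_folds):
    """Hypothetical alternative to get_folds: shuffle once, then slice."""
    fold_size = len(df) // k_folds
    # shuffle all rows in a single, reproducible step
    shuffled = df.sample(frac=1, random_state=seed)
    # take consecutive blocks of fold_size rows; leftover rows are dropped
    return {
        'fold_{}'.format(i): shuffled.iloc[i * fold_size:(i + 1) * fold_size]
        for i in range(k_folds)
    }

# synthetic stand-in for the insurance data (values are made up)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'age': rng.integers(18, 65, size=25),
    'charges': rng.uniform(1000, 50000, size=25),
})

demo_folds = get_folds_shuffled(df_demo, seed=1, k_folds=4)
print({name: len(fold) for name, fold in demo_folds.items()})
```

The returned dictionary uses the same 'fold_0', 'fold_1', ... keys, so it could be passed to the later steps unchanged.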

The first fold looks like this.

The first fold randomly generated from the original dataset

4. Write a helper function to fit a linear regression model

def fit_linear_regression(df_train, df_val):
    
    # fit a linear regression model
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()
    
    # print model summary
    print(lm.summary())
    print("\n")
    
    # predict the charges using the val set
    pred_charges = lm.predict(df_val)
    
    # concatenate the charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})
    
    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
    print("\n")
    print("***************************************************")
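
The mean squared error printed by this helper is the average of the squared differences between the actual and the predicted charges. A tiny hand-checkable sketch, with made-up numbers rather than values from the insurance data:

```python
# made-up values for illustration only
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

# squared difference for each observation: 4.0, 4.0, 9.0
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# average of the squared differences: 17 / 3
mse = sum(squared_errors) / len(squared_errors)
print("MSE: {}".format(mse))
```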

5. Create the training set and the validation set for each fold, which gives 10 training/validation sets. In each one, the current fold is the validation set and the remaining folds form the training set. Fit the linear regression model 10 times, once per training/validation set.

def get_train_val_fit(df_folds):
    
    # get the number of folds
    k_folds = len(df_folds.keys())
    
    # loop through each fold
    for i in range(k_folds):
        
        # prepare the training set
        df_train = pd.DataFrame()
        for j in range(k_folds):
            if i != j:
                name_train = 'fold_' + str(j)
                print("{} added to the training set".format(name_train))
                df_train = pd.concat([df_train, df_folds[name_train]])
        print("\n")
        print("Training set prepared")
        print(df_train)
        
        # prepare the val set
        name_val = 'fold_' + str(i)
        print("{} added to the validation set".format(name_val))
        df_val = df_folds[name_val]
        print("\n")
        print("Validation set prepared")
        print(df_val)
        
        # fit the linear regression model using that fold as the val set and the rest of the folds as the training set
        fit_linear_regression(df_train, df_val)
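
The function above prints one MSE per fold. A common next step is to average those values into a single cross-validated MSE. The sketch below shows one way to do that under several assumptions: it runs on synthetic data rather than the insurance dataset, cross_validated_mse is a hypothetical name, np.polyfit stands in for the GLM fit (a Gaussian GLM with the identity link gives the same coefficients as ordinary least squares), and np.array_split keeps every row by allowing fold sizes to differ by one instead of dropping the remainder.

```python
import numpy as np

def cross_validated_mse(x, y, k_folds, seed=0):
    """Average the validation MSE of a straight-line fit over k folds."""
    rng = np.random.default_rng(seed)
    # shuffle the row positions, then split them into k nearly equal folds
    folds = np.array_split(rng.permutation(len(x)), k_folds)
    fold_mses = []
    for i, val_idx in enumerate(folds):
        # training rows are the rows of every fold except fold i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # ordinary least squares fit of y = slope * x + intercept
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = slope * x[val_idx] + intercept
        fold_mses.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(fold_mses)), fold_mses

# synthetic data (made up for illustration): a noisy linear trend
rng = np.random.default_rng(42)
age = rng.uniform(18, 64, size=200)
charges = 250 * age + 3000 + rng.normal(0, 500, size=200)

cv_mse, fold_mses = cross_validated_mse(age, charges, k_folds=10)
print("per-fold MSE:", np.round(fold_mses, 1))
print("cross-validated MSE:", round(cv_mse, 1))
```

Averaging over the folds gives a steadier estimate of out-of-sample error than any single training/validation split.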