Resampling in Python — Cross-Validation Part 1/3

Feb 11, 2023

Resampling Methods

Resampling methods repeatedly draw samples from a dataset and refit the model on each sample to obtain additional information about the fitted model, such as an estimate of its prediction error on unseen data. The two major resampling methods are:

Cross-validation
Bootstrap

Cross-Validation

There are three major cross-validation approaches:

The validation set approach: randomly split a dataset once into a training set and a validation set.
Leave-one-out cross-validation (LOOCV): hold out each observation in turn, using that single observation as the validation set and all remaining observations as the training set, for a total of n training/validation sets, where n is the number of observations. No random splitting is involved.
K-fold cross-validation: randomly split a dataset into K mutually exclusive groups (folds) of approximately equal size. For each group, use all data except that group as the training set and the data in that group as the validation set, for a total of K training/validation sets.
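
The three splitting schemes can be sketched with index arrays alone. This is a minimal illustration under stated assumptions, not the code used later in this post: the helper names (validation_set_split, loocv_splits, kfold_splits) are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def validation_set_split(n, frac=0.5, seed=0):
    """Randomly split indices 0..n-1 into one training and one validation set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(frac * n))
    return perm[:n_train], perm[n_train:]

def loocv_splits(n):
    """Yield n (train, validation) pairs; each observation is held out once."""
    idx = np.arange(n)
    for i in range(n):
        yield np.delete(idx, i), idx[i:i + 1]

def kfold_splits(n, k, seed=0):
    """Yield k (train, validation) pairs from k disjoint, near-equal folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        # training indices are every fold except fold i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]
```

Note that LOOCV is the special case of K-fold with K equal to the number of observations, which is why it needs no random seed.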

The Validation Set Approach Python Example

Import libraries

# scientific computing
import pandas as pd
import numpy as np

# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm

Load the insurance dataset

# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)

# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]

# take a look at the first five rows
df.head()

The resulting data frame has two columns: age and charges.

Use the validation set approach: fit a linear regression model on the training set and evaluate it on the validation set.

def apply_validation_set(seed, split = 0.5):

    # set the random seed
    np.random.seed(seed)
    
    # get the training set
    df_train = df.sample(frac = split)

    # get the validation set (all rows that are NOT in the training set)
    df_val = df[~df.index.isin(df_train.index)]
    
    return df_train, df_val

def fit_linear_regression(df_train, df_val):
    
    # fit a linear regression model
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()
    
    # print model summary
    print(lm.summary())

    # predict the charges using the val set
    pred_charges = lm.predict(df_val)

    # concatenate the charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})

    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))

Calling the function with seed = 1 gives this model summary

df_train, df_val = apply_validation_set(1)
fit_linear_regression(df_train, df_val)

Calling the function with seed = 2 gives this model summary

df_train, df_val = apply_validation_set(2)
fit_linear_regression(df_train, df_val)

Calling the function with seed = 3 gives this model summary

df_train, df_val = apply_validation_set(3)
fit_linear_regression(df_train, df_val)

Notice that each seed produces a different fitted model (different estimates of the intercept and the age coefficient) because each seed produces a different training/validation split. The validation MSEs also vary from split to split. This illustrates the high variability of the validation set approach: its estimate of the test error depends heavily on which observations happen to land in each set.
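
To see this variability in one place, the split-fit-evaluate step can be repeated over several seeds and the MSEs collected. The sketch below is only an illustration under assumptions: it uses synthetic data (df_demo) as a stand-in for the insurance dataset, a hypothetical helper validation_mse, and np.polyfit in place of the statsmodels glm call (both fit the same least-squares line for a Gaussian family).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (assumption: NOT the real dataset)
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 5000, size=200)
df_demo = pd.DataFrame({'age': age, 'charges': charges})

def validation_mse(df, seed, split=0.5):
    """Fit charges ~ age on a random subset and return the MSE on the rest."""
    df_train = df.sample(frac=split, random_state=seed)
    df_val = df.drop(df_train.index)
    # least-squares line; polyfit returns the highest-degree coefficient first
    slope, intercept = np.polyfit(df_train['age'], df_train['charges'], deg=1)
    pred = intercept + slope * df_val['age']
    return float(np.mean((df_val['charges'] - pred) ** 2))

# one validation MSE per seed
mses = [validation_mse(df_demo, seed) for seed in range(10)]
print("mean MSE: {:.0f}, std of MSE: {:.0f}".format(np.mean(mses), np.std(mses)))
```

The spread of the values in mses is a direct measure of how much the error estimate depends on the particular split.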

References

An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani


Written by Wendy Hu