Resampling in Python — Cross-Validation Part 1/3

Wendy Hu
Feb 11, 2023


The Validation Set Approach

Resampling Methods

Resampling methods repeatedly draw samples from a dataset and refit a model on each sample to obtain additional information about the fitted model, such as an estimate of its test error or the variability of its estimates. The two major resampling methods are:

  • Cross-validation
  • Bootstrap
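Since this part of the series focuses on cross-validation, here is only a minimal sketch of the bootstrap idea for comparison: resample the data with replacement many times and look at how a statistic varies across the resamples. The function name `bootstrap_se_of_mean`, the toy numbers, and the choice of 1000 resamples are assumptions made for illustration, not part of the insurance example.

```python
import numpy as np

def bootstrap_se_of_mean(values, n_resamples=1000, seed=0):
    """Estimate the standard error of the sample mean by bootstrapping."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # each bootstrap sample draws n observations WITH replacement
    boot_means = np.array([
        rng.choice(values, size=n, replace=True).mean()
        for _ in range(n_resamples)
    ])
    # the spread of the resampled means approximates the standard error
    return boot_means.std(ddof=1)

# toy data, made up for illustration
sample = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])
se = bootstrap_se_of_mean(sample)
print("bootstrap SE of the mean: {:.3f}".format(se))
```

The estimate should land close to the textbook formula, the sample standard deviation divided by the square root of n, which is a quick sanity check on the resampling loop.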

Cross-Validation

There are three major cross-validation approaches:

  • The validation set approach: randomly split a dataset into a training set and a validation set.
  • Leave-one-out cross-validation (LOOCV): for each observation in turn, use that single observation as the validation set and all remaining observations as the training set, for a total of n training/validation sets.
  • K-fold cross-validation: randomly split a dataset into K mutually exclusive groups (folds) of approximately equal size. For each group, use all data except that group as the training set and the data in that group as the validation set, for a total of K training/validation sets.
Cross-validation approaches compared
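To make the difference between the three approaches concrete, the sketch below builds the training/validation index sets for each one on a toy dataset of 10 rows. The helper names (`validation_set_split`, `loocv_splits`, `kfold_splits`) are hypothetical and written with plain NumPy for illustration; libraries such as scikit-learn provide their own splitters.

```python
import numpy as np

def validation_set_split(n, frac=0.5, seed=0):
    """One random split: returns (train_idx, val_idx)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(n * frac))
    return idx[:n_train], idx[n_train:]

def loocv_splits(n):
    """n splits: each observation is the validation set exactly once."""
    idx = np.arange(n)
    return [(np.delete(idx, i), idx[i:i + 1]) for i in range(n)]

def kfold_splits(n, k=5, seed=0):
    """k splits: shuffle once, cut into k folds, hold out one fold at a time."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    return [
        (np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
        for i in range(k)
    ]

n = 10
train_idx, val_idx = validation_set_split(n)
print(len(train_idx), len(val_idx))   # sizes of the single split
print(len(loocv_splits(n)))           # number of LOOCV splits
print(len(kfold_splits(n, k=5)))      # number of K-fold splits
```

The validation set approach fits the model once, LOOCV fits it n times, and K-fold fits it K times, which is the main cost trade-off between them.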

The Validation Set Approach Python Example

Import libraries

# scientific computing
import pandas as pd
import numpy as np

# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm

Load the insurance dataset

# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)

# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]

# take a look at the first five rows
df.head()

This is what the data looks like.

The insurance DataFrame

Use the validation set approach to fit a linear regression model using the training set and evaluate the model using the validation set.

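def apply_validation_set(seed, split = 0.5):

    # set the random seed so the split is reproducible
    np.random.seed(seed)

    # get the training set: a random sample of `split` of the rows
    df_train = df.sample(frac = split)

    # get the validation set: the rows NOT in the training set
    # (assumes the DataFrame index is unique)
    df_val = df[~df.index.isin(df_train.index)]

    return df_train, df_val
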
def fit_linear_regression(df_train, df_val):

    # fit a linear regression model (a Gaussian GLM with the default identity link)
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()

    # print model summary
    print(lm.summary())

    # predict the charges using the validation set
    pred_charges = lm.predict(df_val)

    # concatenate the charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})

    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
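For reference, the number printed on the last line of fit_linear_regression can be written out as a formula. The notation is an assumption introduced here for illustration: the sum runs over the validation rows, n_val is the number of those rows, y_i is the observed charge, and the fitted coefficients come from the training set only.

```latex
\mathrm{MSE}_{\text{val}}
  = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 \, \text{age}_i
```

Because the coefficients are estimated on one half of the data and the errors are measured on the other half, this value serves as an estimate of the test error, not the training error.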

Calling the function with seed = 1 gives this model summary

df_train, df_val = apply_validation_set(1)
fit_linear_regression(df_train, df_val)
Model summary with seed = 1

Calling the function with seed = 2 gives this model summary

df_train, df_val = apply_validation_set(2)
fit_linear_regression(df_train, df_val)
Model summary with seed = 2

Calling the function with seed = 3 gives this model summary

df_train, df_val = apply_validation_set(3)
fit_linear_regression(df_train, df_val)
Model summary with seed = 3

Notice that each seed produces a different model (different estimated coefficients for the intercept and age). This is because each seed produces a different training/validation split. The MSEs also vary considerably from split to split. This illustrates the main drawback of the validation set approach: its estimate of the test error has high variance, because it depends heavily on which observations happen to land in each set.
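Three seeds give only a rough impression of that variability. The sketch below repeats the split for ten seeds and summarizes the spread of the validation MSEs. It runs on synthetic data generated inside the block, not on insurance.csv, and the coefficients, noise level, and the helper name `validation_mse` are all made up for illustration; `np.polyfit` stands in for the statsmodels fit so that the block only needs NumPy.

```python
import numpy as np

# synthetic stand-in for the insurance data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 64, size=n)
charges = 3000 + 250 * age + rng.normal(0, 8000, size=n)

def validation_mse(seed, split=0.5):
    """Split once with the given seed, fit charges ~ age, return validation MSE."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(n * split)
    train, val = idx[:n_train], idx[n_train:]
    # least-squares straight-line fit on the training rows only
    slope, intercept = np.polyfit(age[train], charges[train], deg=1)
    pred = intercept + slope * age[val]
    return np.mean((charges[val] - pred) ** 2)

# one validation MSE per seed
mses = np.array([validation_mse(seed) for seed in range(1, 11)])
print("min MSE: {:.0f}".format(mses.min()))
print("max MSE: {:.0f}".format(mses.max()))
print("std dev: {:.0f}".format(mses.std(ddof=1)))
```

The gap between the smallest and largest MSE comes purely from which rows fall into each half, since the data and the model are the same in every run.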

References

An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
