Resampling Methods
Resampling methods repeatedly draw samples from a dataset and refit a model on each sample to obtain additional information about the fitted model, such as an estimate of its test error or the variability of its coefficients. The major resampling methods are:
- Cross-validation
- Bootstrap
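As a rough illustration of the resampling idea, the sketch below applies the bootstrap to a simple linear regression. It draws rows with replacement, refits the line on each resample, and uses the spread of the refitted slopes as an estimate of the slope's standard error. The function name bootstrap_slope_se, the synthetic data, and the use of np.polyfit for the fit are assumptions made for illustration.

```python
import numpy as np

def bootstrap_slope_se(x, y, n_boot=500, seed=0):
    """Estimate the standard error of a regression slope by bootstrapping."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        # sample n row positions with replacement
        idx = rng.integers(0, n, size=n)
        # refit the least-squares line on the resampled rows
        slopes[b], _ = np.polyfit(x[idx], y[idx], deg=1)
    # the spread of the refitted slopes approximates the standard error
    return slopes.std(ddof=1)

# synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

se = bootstrap_slope_se(x, y)
print("bootstrap SE of the slope: {:.4f}".format(se))
```

Fixing the seed makes the resamples reproducible, so repeated calls with the same arguments return the same estimate.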
Cross-Validation
There are three major cross-validation approaches:
- The validation set approach: randomly split a dataset into a training set and a validation set.
- Leave-one-out cross-validation (LOOCV): hold out a single observation as the validation set and train on the remaining n - 1 observations. Repeat this for every observation, for a total of n training/validation sets. No random split is involved.
- K-fold cross-validation: randomly split a dataset into K mutually exclusive groups (folds) of approximately equal size. For each fold, use all data outside that fold as the training set and the data in that fold as the validation set, for a total of K training/validation sets.
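The splitting logic described above can be sketched with NumPy alone. The generator below is a minimal illustration that assumes K is at least 2. The name k_fold_indices is hypothetical and the function works on row positions rather than on a DataFrame.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    # shuffle the row positions once, then cut them into k nearly equal folds
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val_idx = folds[i]
        # every fold except the i-th forms the training set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# 10 observations, 5 folds: each validation fold holds 2 observations
splits = list(k_fold_indices(n=10, k=5))

# LOOCV is the special case k = n: each validation set holds one observation
loocv_splits = list(k_fold_indices(n=10, k=10))
```

The validation set approach corresponds to a single random split, so it uses only one training/validation pair instead of K.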
The Validation Set Approach Python Example
Import libraries
# scientific computing
import pandas as pd
import numpy as np
# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm
Load the insurance dataset
# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)
# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]
# take a look at the first five rows
df.head()
This is what the data looks like.
Use the validation set approach to fit a linear regression model using the training set and evaluate the model using the validation set.
def apply_validation_set(seed, split = 0.5):
    # set the random seed so the split is reproducible
    np.random.seed(seed)
    # get the training set: a random fraction of the rows
    df_train = df.sample(frac = split)
    # get the validation set: every row that is NOT in the training set
    df_val = df[~df.index.isin(df_train.index)]
    return df_train, df_val
def fit_linear_regression(df_train, df_val):
    # fit a linear regression model (Gaussian GLM with the default identity link)
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()
    # print model summary
    print(lm.summary())
    # predict the charges for the validation set
    pred_charges = lm.predict(df_val)
    # concatenate the charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})
    # print the mean squared error on the validation set
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
Calling the function with seed = 1 gives this model summary
df_train, df_val = apply_validation_set(1)
fit_linear_regression(df_train, df_val)
Calling the function with seed = 2 gives this model summary
df_train, df_val = apply_validation_set(2)
fit_linear_regression(df_train, df_val)
Calling the function with seed = 3 gives this model summary
df_train, df_val = apply_validation_set(3)
fit_linear_regression(df_train, df_val)
Notice that each seed produces a different fitted model (different estimates for the intercept and the age coefficient) because each seed produces a different training/validation split. The validation MSE also varies considerably from split to split. This illustrates the main drawback of the validation set approach: its estimate of the test error is highly variable, since it depends on which observations happen to land in the training set and which in the validation set.
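To put a number on this variability, one option is to repeat the split over many seeds and look at the spread of the validation MSEs. The sketch below is self-contained and makes several assumptions: it uses synthetic data as a stand-in for the insurance dataset (the coefficients and noise level are made up), it fits the line with np.polyfit, which gives the same least-squares fit as a Gaussian GLM with an identity link, and the helper name validation_mse is hypothetical.

```python
import numpy as np

def validation_mse(x, y, seed, split=0.5):
    """Fit y ~ x on a random training subset and return the validation MSE."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # shuffle the row positions and cut them into disjoint train/validation parts
    perm = rng.permutation(n)
    n_train = int(n * split)
    train, val = perm[:n_train], perm[n_train:]
    # least-squares line fitted on the training rows only
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    pred = intercept + slope * x[val]
    return np.mean((y[val] - pred) ** 2)

# synthetic stand-in for the insurance data (hypothetical numbers)
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).astype(float)
charges = 3000 + 250 * age + rng.normal(0, 5000, size=200)

# one validation MSE per seed, i.e. per random split
mses = np.array([validation_mse(age, charges, seed) for seed in range(20)])
print("mean MSE: {:.0f}, std of MSE: {:.0f}".format(mses.mean(), mses.std()))
```

A large standard deviation relative to the mean indicates that a single split gives an unreliable estimate of the test error.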
References
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani