K-fold cross-validation
K-fold cross-validation randomly splits the dataset into K mutually exclusive folds (think of folds as groups) of equal size. In the image above, observations 4 and 2 are randomly selected from the original dataset as the first fold, and 9 and 1 are selected as the second fold. To construct the first training/validation set, we use the first fold (4 and 2) as the validation set and the remaining folds as the training set (from 9 and 1 in the second fold through 7 and 5 in the fifth fold). Each fold takes a turn as the validation set, which yields K training/validation sets. Leave-one-out cross-validation is the special case K = N, where N is the number of observations in the dataset.
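The splitting described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `split_into_folds` is hypothetical, and it works on the indices 0 through 9 rather than the values shown in the image.

```python
import numpy as np

def split_into_folds(n_obs, k, seed=0):
    """Shuffle the indices 0..n_obs-1 and cut them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_obs)
    # array_split also copes with n_obs not being divisible by k
    return np.array_split(shuffled, k)

# 10 observations, K = 5: five folds of two observations each
folds = split_into_folds(n_obs=10, k=5)
for i, val_idx in enumerate(folds):
    # every fold except fold i goes into the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold", i, "validation:", val_idx, "training:", np.sort(train_idx))

# leave-one-out: K equals the number of observations, so each fold holds one index
loo_folds = split_into_folds(n_obs=10, k=10)
```

Each index appears in exactly one validation set, which is what "mutually exclusive" means here.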
The K-Fold Cross-Validation Python Example
1. Import libraries
# scientific computing
import pandas as pd
import numpy as np
# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm
2. Load the insurance dataset
# load the insurance dataset
df = pd.read_csv('insurance.csv', index_col = 0)
# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]
# take a look at the first five rows
df.head()
This is what the data looks like.
3. Randomly create 10 mutually exclusive folds from the original dataset. Since the original dataset has 1,338 rows, each fold has 133 rows (1,338 rows / 10 folds = 133.8 rows per fold, rounded down). The 8 rows left over are not assigned to any fold.
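As a quick check of the arithmetic, here is a minimal sketch using only the row count and the number of folds stated above (the variable names are illustrative):

```python
n_rows = 1338   # rows in the original dataset
k_folds = 10    # number of folds

fold_size = n_rows // k_folds             # floor of 133.8
leftover = n_rows - fold_size * k_folds   # rows that end up in no fold
train_size = fold_size * (k_folds - 1)    # rows in each training set

print(fold_size, leftover, train_size)    # 133 8 1197
```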
def get_folds(seed, k_folds):
    # set the random seed
    np.random.seed(seed)
    # get the size of the fold (integer division rounds down)
    fold_size = df.shape[0] // k_folds
    # deep copy the original dataset and transpose it, so each row becomes a column that can be popped
    df_copy = df.copy().T
    # create result dictionary
    df_folds = {}
    # loop through each fold
    for i in range(k_folds):
        # create a fold dataframe
        df_fold = pd.DataFrame()
        # get the name of the fold
        name = 'fold_{}'.format(i)
        # keep sampling while the fold dataframe is smaller than the size of the fold
        while len(df_fold) < fold_size:
            # randomly choose an index from the copy of the original dataset
            sampled_index = np.random.choice(list(df_copy.T.index))
            # pop the row off the dataframe using the chosen index
            s_popped = df_copy.pop(sampled_index)
            # convert the popped row into a dataframe
            df_popped = s_popped.to_frame().T
            # concatenate this popped dataframe to the fold dataframe
            df_fold = pd.concat([df_fold, df_popped])
        # assign the name of the fold as the key and the fold dataframe as the value
        df_folds[name] = df_fold
    # return the result dictionary
    return df_folds
The first fold looks like this.
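Popping one row at a time is easy to follow but slow on larger datasets. A possible alternative, sketched here as an assumption rather than as the method used in this article, is to shuffle all rows once and slice the result. The name `get_folds_shuffled` and the synthetic `toy` data are hypothetical stand-ins, since the real `insurance.csv` is not loaded here.

```python
import numpy as np
import pandas as pd

def get_folds_shuffled(data, seed, k_folds):
    """Shuffle the rows once, then slice k_folds equal-sized, disjoint folds."""
    fold_size = len(data) // k_folds
    # frac=1 returns every row in random order
    shuffled = data.sample(frac=1, random_state=seed)
    return {
        'fold_{}'.format(i): shuffled.iloc[i * fold_size:(i + 1) * fold_size]
        for i in range(k_folds)
    }

# synthetic stand-in with the same shape as the insurance data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(18, 65, size=1338),
    'charges': rng.normal(13000, 12000, size=1338),
})

toy_folds = get_folds_shuffled(toy, seed=42, k_folds=10)
print({name: len(fold) for name, fold in toy_folds.items()})
```

As with the loop-based version, the rows beyond `fold_size * k_folds` are left out of every fold.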
4. Write a helper function to fit a linear regression model
def fit_linear_regression(df_train, df_val):
    # fit a linear regression model
    lm = glm(formula = 'charges ~ age', data = df_train, family = sm.families.Gaussian()).fit()
    # print model summary
    print(lm.summary())
    print("\n")
    # predict the charges using the val set
    pred_charges = lm.predict(df_val)
    # concatenate the charges and the predicted charges
    df_pred = pd.concat([df_val['charges'], pred_charges], axis = 'columns').rename(columns = {0: 'pred_charges'})
    # print the mean squared error
    print("MSE: {}".format(np.mean((df_pred['charges'] - df_pred['pred_charges'])**2)))
    print("\n")
    print("***************************************************")
5. Create a training set and a validation set for each fold, giving 10 training/validation sets. For each one, the current fold is the validation set and the remaining folds form the training set. Fit the linear regression model 10 times, once per training/validation set.
def get_train_val_fit(df_folds):
    # get the number of folds
    k_folds = len(df_folds.keys())
    # loop through each fold
    for i in range(k_folds):
        # prepare the training set from every fold except fold i
        df_train = pd.DataFrame()
        for j in range(k_folds):
            if i != j:
                name_train = 'fold_' + str(j)
                print("{} added to the training set".format(name_train))
                df_train = pd.concat([df_train, df_folds[name_train]])
        print("\n")
        print("Training set prepared")
        print(df_train)
        # prepare the val set
        name_val = 'fold_' + str(i)
        print("{} added to the validation set".format(name_val))
        df_val = df_folds[name_val]
        print("\n")
        print("Validation set prepared")
        print(df_val)
        # fit the linear regression model using that fold as the val set and the rest of the folds as the training set
        fit_linear_regression(df_train, df_val)
The output of the linear regression model using the first fold as the validation set and the rest of the folds as the training set:
- Add folds 1 through 9 to the training set
- The training set looks like this with 133 rows / fold * 9 folds = 1,197 rows
- Add fold 0 to the validation set
- The validation set looks like this with 133 rows / fold * 1 fold = 133 rows
- This is what the linear regression model looks like
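The helper functions above print one MSE per fold. A common next step is to average those K values into a single cross-validation estimate of the test error. The sketch below illustrates that under several assumptions: `cross_val_mse` is a hypothetical name, the data are synthetic, and `np.polyfit` (ordinary least squares) stands in for the statsmodels Gaussian GLM, which fits the same straight line when the default identity link is used.

```python
import numpy as np

def cross_val_mse(x, y, k_folds, seed=0):
    """Return the per-fold validation MSEs and their average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k_folds)
    fold_mse = []
    for i, val_idx in enumerate(folds):
        # train on every fold except fold i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # degree-1 polyfit returns the slope first, then the intercept
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = intercept + slope * x[val_idx]
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
    return np.array(fold_mse), float(np.mean(fold_mse))

# synthetic data: charges grow linearly with age, plus noise (values are made up)
rng = np.random.default_rng(1)
age = rng.uniform(18, 64, size=200)
charges = 3000 + 250 * age + rng.normal(0, 1000, size=200)

fold_mse, cv_mse = cross_val_mse(age, charges, k_folds=10)
print("per-fold MSE:", fold_mse.round(0))
print("cross-validation MSE:", cv_mse)
```

Averaging smooths out the fold-to-fold variation, so the single number is a steadier basis for comparing models than any one validation split.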
References
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani