Cross-Validation — Introduction

A comparison of different validation methods and their performance in Python.

Divya Raghunathan
Analytics Vidhya
7 min read · Jun 10, 2020



In this article, we will go over some of the widely used validation methods employed in machine learning models and their pros and cons. We will see them in action using a real-life dataset in Python.

General theory

Why do we split the dataset? The simplest answer is to evaluate the performance of the model, i.e., to determine its predictive power.

The dataset is often split into two or three parts: train/test, or train/valid/test, respectively.

The three-part split comes in two flavors :

1) Train/Valid/Test Split

2) Cross-validation with Holdout set

The extra validation set is beneficial for hyperparameter tuning.

In the train/valid/test split method, the training and validation sets are used to tune the hyperparameters. We fit the model on the training data with every combination of hyperparameter values. Then, using each fitted model, we predict the validation set and evaluate the performance of every hyperparameter combination. The combination that yields the best result is chosen; what counts as "best" depends on the evaluation metric we pick (accuracy, precision, recall, etc.). Finally, we predict the test set using the model trained with the hyperparameter combination selected on the validation set. The test set is unseen and not used in fitting the model, so it lets us evaluate the model's performance without bias.

A drawback of the train/valid/test split method is that there is just ONE validation set for tuning the hyperparameters, so there is a high risk of "chance" or "luck" in the split. To overcome this drawback, we can use cross-validation. It works similarly to the train/valid/test split method, but the splitting is repeated many times over. There are different types of cross-validation: k-fold CV, leave-one-out (LOO), nested CV, etc. Let's take a look at k-fold CV.

K-fold CV with holdout set

This is the simplest form of cross-validation.


This is how k-fold CV with a holdout set works:

  • A test set is kept aside for the final evaluation.
  • The remaining data is split into k folds. For each hyperparameter combination, the model is trained on k-1 of the folds (the training data) and evaluated on the remaining fold (the validation set); this is repeated so that every fold serves as the validation set once, and the validation scores are averaged. The hyperparameter combination with the best average validation score is selected (a minimal sketch of this procedure follows this list).
  • Finally, the model is refit with the best hyperparameters and evaluated on the test set.
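Below is a minimal sketch of this procedure with scikit-learn. The names X and y stand for the predictors and response (assumed here to be NumPy arrays), and the Lasso estimator and the small alpha grid are placeholders for illustration only.

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Hold out a test set for the final evaluation
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
candidate_alphas = [0.01, 0.1, 1.0]  # placeholder hyperparameter grid
cv_scores = {}
for alpha in candidate_alphas:
    fold_errors = []
    for train_idx, valid_idx in kf.split(X_tv):
        model = Lasso(alpha=alpha)
        model.fit(X_tv[train_idx], y_tv[train_idx])
        fold_errors.append(mean_squared_error(y_tv[valid_idx], model.predict(X_tv[valid_idx])))
    cv_scores[alpha] = np.mean(fold_errors)  # average validation error across the k folds

best_alpha = min(cv_scores, key=cv_scores.get)

# Refit on all train+valid data and evaluate once on the held-out test set
final_model = Lasso(alpha=best_alpha).fit(X_tv, y_tv)
test_error = mean_squared_error(y_test, final_model.predict(X_test))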

Comparison in Python

We will be using a dataset from the UCI repository. It contains crime-rate data from different communities. There are 101 predictor variables such as household size, number of police officers, population, etc. The goal is to predict the total number of non-violent crimes per 100k population (the response).

The code and the dataset can be found here.

We will use Lasso regression for our prediction model. Since there are many variables, Lasso is a great tool for filtering out unnecessary features. With all other factors kept the same, we will implement the different validation methods and identify the best hyperparameters for each.
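As a quick illustration of this shrinkage effect (a toy sketch on synthetic data, not the crime dataset), larger values of the regularization strength alpha drive more coefficients exactly to zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which are informative
X_toy, y_toy = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

for alpha in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_toy, y_toy)
    print("alpha={}: {} of 20 coefficients shrunk to zero".format(alpha, int(np.sum(lasso.coef_ == 0))))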

The features in the dataset:

Snapshot of the variable names

The validation methods we will be testing are as below:

1. Train / validation / test split.

2. 5-Fold cross-validation.

3. 10-Fold cross-validation

Dataset split

Let's hold out 30% of the data as the test set and then carve a validation set out of the remaining 70%. With test_size = 0.2857 on the second split, the final proportions come out to roughly 50% train, 20% validation, and 30% test.

from sklearn.model_selection import train_test_split

X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size = 0.2857, random_state = 1)
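A quick sanity check of the resulting sizes (assuming X is the full predictor matrix) confirms the roughly 50/20/30 proportions:

# Sanity check of the split proportions
n = len(X)
print("train: {:.2f}".format(len(X_train) / n))   # ~0.50
print("valid: {:.2f}".format(len(X_valid) / n))   # ~0.20
print("test:  {:.2f}".format(len(X_test) / n))    # ~0.30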

Hyperparameter selection

Let's use three hyperparameters for this exercise: alpha, maximum iterations, and tolerance.

import numpy as np

alphas = np.logspace(-10,10,21)
max_iters = np.arange(50,75,5)
tols = np.linspace(0.0001,0.1,5)

1. Train / validation / test split.

We will go over this step by step to see what happens under the hood in python.

Create hyperparameter trios using itertools.product(alphas, max_iters, tols); this gives 525 unique combinations of the 3 hyperparameters. Next, we standardize the data. Standardizing is essential for Lasso regression so that all variables are on a level playing field. For this, we use StandardScaler(); refer to the sklearn documentation for more details on the scaler available in sklearn.preprocessing. Using a for loop, we fit the training data with each of the 525 hyperparameter combinations and then predict the crime rate (y) for the validation set. The evaluation metric we have chosen is mean squared error, so our winning hyperparameter trio is the one that produces the smallest error. Then we fit the scaler on the train+valid data and apply that scaling to the test data. Using the best trio obtained from the validation step, we refit the Lasso on the train+valid data and predict the crime rate (y) for the test set. The Python code for this entire process is below:

import itertools
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn import linear_model, metrics
from sklearn.preprocessing import StandardScaler

hyperparameter_trio = list(itertools.product(alphas, max_iters, tols))
print("The number of trios in total: {}".format(len(hyperparameter_trio)))

# Scale the data: fit on the training set only, apply to train and valid
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_valid = pd.DataFrame(scaler.transform(X_valid))

# Fit a Lasso for every trio and record the validation error
Validation_Scores = []
start = datetime.now()
for a in hyperparameter_trio:
    lm_trainlasso = linear_model.Lasso(alpha=a[0], max_iter=a[1], tol=a[2])
    lm_trainlasso.fit(X_train, y_train)
    Validation_Scores.append(metrics.mean_squared_error(y_valid, lm_trainlasso.predict(X_valid)))
end = datetime.now()
M1 = end - start

minerror_M1 = min(Validation_Scores)
besttrio_M1 = hyperparameter_trio[np.argmin(Validation_Scores)]

# Refit on train+valid with the best trio, then evaluate on the test set
scaler = StandardScaler()
scaler.fit(X_train_valid)
X_train_valid = pd.DataFrame(scaler.transform(X_train_valid))
X_test = pd.DataFrame(scaler.transform(X_test))
lm1 = linear_model.Lasso(alpha=besttrio_M1[0], max_iter=besttrio_M1[1], tol=besttrio_M1[2])
lm1.fit(X_train_valid, y_train_valid)
M1_terror = metrics.mean_squared_error(y_test, lm1.predict(X_test))
print("The prediction error for the test set is : {}".format(M1_terror))

2. 5-Fold cross-validation.

We will use GridSearchCV from the sklearn library for cross-validation. GridSearchCV tunes all combinations of our hyperparameter values (it searches over the parameters specified in a grid). GridSearchCV performs k-fold cross-validation and, by default, it uses 5 folds. The estimator parameter takes the object whose fit method is called on the training folds at each grid point; here we pass a Pipeline so that the transformations are performed sequentially, with the scaler fit on the training folds and applied to the validation fold (each step in the pipeline must implement fit and transform). The param_grid parameter lets you specify the hyperparameter values, and GridSearchCV iterates over all the values in that grid. The Python code for this process is below:

from datetime import datetime
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

start = datetime.now()
estimator = Pipeline([('scale', StandardScaler()), ('lasso', Lasso())])  # scale inside each fold
parameters = {'lasso__alpha': alphas, 'lasso__max_iter': max_iters, 'lasso__tol': tols}
lm2 = GridSearchCV(estimator=estimator, param_grid=parameters, cv=5,
                   scoring='neg_mean_squared_error', n_jobs=-1)
lm2.fit(X_train_valid, y_train_valid)
end = datetime.now()
M2 = end - start
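Once the grid search has run, the winning hyperparameters and the test-set error can be read off the fitted GridSearchCV object. The snippet below is a sketch of that step; lm2.predict passes the test data through the pipeline, so the scaler inside it handles the scaling.

print("Best hyperparameters:", lm2.best_params_)
print("Best CV score (negative MSE):", lm2.best_score_)
# GridSearchCV refits the best pipeline on all of X_train_valid by default (refit=True),
# so it can be used directly for prediction on the test set
M2_terror = metrics.mean_squared_error(y_test, lm2.predict(X_test))
print("The prediction error for the test set is : {}".format(M2_terror))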

3. 10-Fold cross-validation

The 10-fold cross-validation method is similar to the 5-fold method in terms of Python implementation; in GridSearchCV, we simply set the cv parameter to 10, as in the sketch below.
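For completeness, here is a sketch of the 10-fold version, reusing the same pipeline and parameter grid (the names lm3 and M3 are assumed here to mirror the pattern above):

start = datetime.now()
lm3 = GridSearchCV(estimator=estimator, param_grid=parameters, cv=10,
                   scoring='neg_mean_squared_error', n_jobs=-1)
lm3.fit(X_train_valid, y_train_valid)
end = datetime.now()
M3 = end - start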

Comparing the above three validation methods on four different criteria:

  1. Comparing based on the time taken for the code to run
  2. Comparing the hyperparameter values
  3. Comparing the selected coefficients
  4. Comparing the prediction results.

The results obtained are below

Model 1, the train/valid/test split method, takes the least time. As the number of folds increases, the model takes longer to run.

The maximum number of iterations needed for the model to converge is highest for the third model (10-fold CV). The alpha values are the same for 5-fold CV and 10-fold CV.

10 fold CV produces the least error on the test set.

Ranking the coefficients by absolute value in descending order gives the top 5 coefficients for each validation method. Most of the top coefficients are similar across the three models; PctForeignBorn, PersPerOccupHous, and MalePctNevMarr appear in the top list for all three models.

The train/valid/test method shrank only one coefficient to zero, whereas the CV models shrank the coefficients below to zero:

Index(['population', 'racePctWhite', 'racePctAsian', 'racePctHisp',
       'agePct65up', 'numbUrban', 'medIncome', 'pctWWage', 'perCapInc',
       'NumUnderPov', 'PctOccupManu', 'PersPerFam', 'PctYoungKids2Par',
       'PctTeen2Par', 'PctWorkMomYoungKids', 'NumImmig', 'PctImmigRec8',
       'PctRecImmig5', 'PctRecImmig10', 'PctNotSpeakEnglWell',
       'PctLargHouseFam', 'PctHousLess3BR', 'PctHousOwnOcc', 'PctVacantBoarded',
       'OwnOccHiQuart', 'OwnOccQrange', 'RentMedian',
       'MedRent', 'NumStreet', 'PctSameHouse85'],
      dtype='object', name='Vars')
Total features shrunk to zero by M1: 1
Total features shrunk to zero by M2: 30
Total features shrunk to zero by M3: 32

Out of 101 predictors, the train/valid/test method shrinks only one feature to zero. The 5-fold CV model provides better results, shrinking 30 features to zero, and the 10-fold CV method performs even better, shrinking 32 features to zero.
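A sketch of how these counts can be computed from the fitted models: lm1 is the final Lasso from the train/valid/test method, while the CV models expose the refit pipeline through best_estimator_ (the variable names follow the code above, and X is assumed to be the original DataFrame of 101 predictors).

import numpy as np

feature_names = X.columns  # assuming X is the original DataFrame of predictors

coef_m1 = lm1.coef_
coef_m2 = lm2.best_estimator_.named_steps['lasso'].coef_
coef_m3 = lm3.best_estimator_.named_steps['lasso'].coef_

for name, coefs in [("M1", coef_m1), ("M2", coef_m2), ("M3", coef_m3)]:
    zeroed = feature_names[coefs == 0]
    print("Total features shrunk to zero by {}: {}".format(name, len(zeroed)))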

Final thoughts

The cross-validation method to opt for depends on the business problem. There is often a trade-off between time and the evaluation metric.

Cross-validation is a vast topic; this article is a high-level overview of the concept.

Thank you for reading. I’d love to hear your thoughts and feedback on my articles. Please do leave them in the comment section below.
