Cross Validation in Machine Learning using StatsModels and Sklearn with Logistic Regression Example

Ramanpreet Bhatia
Published in Analytics Vidhya · 7 min read · Jul 5, 2021

In this tutorial, we will learn what cross validation is in machine learning and how to implement it in Python using the StatsModels and sklearn packages.

What is Cross Validation and why do we need it?

Cross validation is a resampling method in machine learning. To understand cross validation, we need to first review the difference between train error rate and test error rate.

The train error rate is the average error (the misclassification rate in classification problems) that results from applying the model to the same data it was trained on.

In contrast, the test error rate is the average error that results from using the trained model on an unseen test dataset (also known as a validation dataset).

In the absence of test data, we won’t be able to tell whether our model works equally well on unseen data, which is the ultimate goal of any machine learning problem. The process of using test data to estimate the average error when the fitted/trained model is used on unseen data is called cross validation. In simple words, we cross-validate our predictions on unseen data, hence the name “cross validation”.
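To see the gap between the two error rates in action, here is a minimal sketch (my own illustration on synthetic data, not part of the original tutorial):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic classification data with a held-out test set
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train error:", 1 - clf.score(X_train, y_train))  # error on data the model saw
print("test error :", 1 - clf.score(X_test, y_test))    # error on unseen data

The train error is typically the smaller of the two, which is exactly why it cannot be trusted as an estimate of performance on unseen data.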

Types of Cross Validation

There are three main types of cross validation. Some articles mention the bootstrap as a cross validation method, but I personally don’t count it as one.

  • Validation set approach

This approach is the simplest of all. Just divide your data into two parts, i.e. train and test datasets. Train your model on the train dataset and run the validation on the test dataset. This approach can be problematic because you are assuming that your test data represents the whole data, which may not hold in practice. As a result, your test error estimates can be very unstable. In addition, since machine learning methods tend to perform worse when trained on fewer observations, the validation set error rate may tend to overestimate the test error rate.

  • Leave-One-Out Cross Validation:

As the name suggests, you leave one observation out of the training data while training your model. Technically, this approach is the same as the one above, except that your test dataset has just 1 row. The difference is that you repeat this experiment in a for loop, taking 1 row as the test data in each iteration, collecting the test error for as many rows as possible, and averaging the errors at the end. Ideally, you should run the loop n times (where n = sample size). By this point you can already spot the problem: when your data is big, this method can be very inefficient.

  • k-Fold Cross Validation:

This is a hybrid of the above two types. We divide our data into k folds and run a for loop k times, taking one fold at a time as the test dataset in each iteration, and calculate the average error rate (or accuracy) at the end. A short sketch after this list shows how the LOOCV and k-fold splits are generated.
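To make the two resampling schemes concrete, here is a minimal sketch (my own illustration using sklearn’s LeaveOneOut and KFold splitters on a toy array; nothing here is specific to the heart data):

import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

X = np.arange(6).reshape(6, 1)  # six toy observations

# LOOCV: n iterations, each holding out exactly one row
for train_idx, test_idx in LeaveOneOut().split(X):
    print("LOO   train:", train_idx, "test:", test_idx)

# k-fold: k iterations, each holding out one fold
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=1).split(X):
    print("KFold train:", train_idx, "test:", test_idx)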

We will use the validation set approach and k-fold cross validation in this tutorial.

Application

Let’s get our hands dirty with some coding. Are you excited?

ME TOO!

Problem Setup

We will use the heart dataset to predict the probability of heart attack using all predictors in the dataset. In doing so, we also want to estimate the test error of this logistic regression model using cross validation.

Data Set Information:

I took this dataset from the Center for Machine Learning and Intelligent Systems (UCI Machine Learning Repository): https://archive.ics.uci.edu/ml/datasets/Heart+Disease

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The original “goal” field refers to the presence of heart disease in the patient and is integer valued from 0 (no presence) to 4; in the version used here it has been collapsed into a binary “target” field, where 0 = no/less chance of heart attack and 1 = more chance of heart attack.

Attribute Information:

Only 14 attributes used:

  1. #3 (age)
  2. #4 (sex)
  3. #9 (cp)
  4. #10 (trestbps)
  5. #12 (chol)
  6. #16 (fbs)
  7. #19 (restecg)
  8. #32 (thalach)
  9. #38 (exang)
  10. #40 (oldpeak)
  11. #41 (slope)
  12. #44 (ca)
  13. #51 (thal)
  14. #58 (target) (the predicted attribute)

Please note that this dataset has some missing data. For simplicity, we will just attempt a complete case analysis, i.e. drop the incomplete rows (see the sketch after loading the data below).

import pandas as pd
import numpy as np
df = pd.read_csv('heart.csv')
df.head()

[Output: the first five rows of the heart dataset]
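As noted above, we stick to a complete case analysis. A minimal sketch, assuming missing values are encoded as '?' as in the raw UCI files (a pre-cleaned heart.csv export may have no missing cells, in which case this is a no-op):

# keep complete cases only; '?' is the missing-value marker in the raw UCI files
df = df.replace('?', np.nan).dropna()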

Logistic Regression Model using StatsModels

The simplest and most elegant way (as compared to sklearn) to look at the initial model fit is to use StatsModels. I love the summary report it generates in just one line of code.

from statsmodels.formula.api import logit

fit_logit = logit("target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal", df).fit()
print(fit_logit.summary())

Optimization terminated successfully.
         Current function value: 0.348904
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:                 target   No. Observations:                  303
Model:                          Logit   Df Residuals:                      289
Method:                           MLE   Df Model:                           13
Date:                Sun, 04 Jul 2021   Pseudo R-squ.:                  0.4937
Time:                        19:07:27   Log-Likelihood:                -105.72
converged:                       True   LL-Null:                       -208.82
Covariance Type:            nonrobust   LLR p-value:                 7.262e-37
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.4505      2.571      1.342      0.180      -1.590       8.490
age           -0.0049      0.023     -0.212      0.832      -0.050       0.041
sex           -1.7582      0.469     -3.751      0.000      -2.677      -0.839
cp             0.8599      0.185      4.638      0.000       0.496       1.223
trestbps      -0.0195      0.010     -1.884      0.060      -0.040       0.001
chol          -0.0046      0.004     -1.224      0.221      -0.012       0.003
fbs            0.0349      0.529      0.066      0.947      -1.003       1.073
restecg        0.4663      0.348      1.339      0.181      -0.216       1.149
thalach        0.0232      0.010      2.219      0.026       0.003       0.044
exang         -0.9800      0.410     -2.391      0.017      -1.783      -0.177
oldpeak       -0.5403      0.214     -2.526      0.012      -0.959      -0.121
slope          0.5793      0.350      1.656      0.098      -0.106       1.265
ca            -0.7733      0.191     -4.051      0.000      -1.147      -0.399
thal          -0.9004      0.290     -3.104      0.002      -1.469      -0.332
==============================================================================

Brief Model Interpretation

We see that the sex, cp, thalach, exang, oldpeak, ca, and thal variables are significantly associated with heart attack (we are not inferring causality in this problem).
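Since logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A quick optional sketch using the fitted model above:

# odds ratios: exp(coefficient); values below 1 indicate reduced odds
print(np.exp(fit_logit.params))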

Cross Validation using Validation dataset approach

Let’s split our data into two sets, train and test.

from sklearn.model_selection import train_test_split

# splitting our dataset into train and test datasets
train, test = train_test_split(df, test_size=0.3, random_state=1)

Let’s fit the model on train data and look at the test error rate.

fit_logit_train = logit("target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal", train).fit()

# predicted probabilities on the held-out test set
test_pred = fit_logit_train.predict(test)

Optimization terminated successfully.
Current function value: 0.317208
Iterations 8

# converting probability to labels
def convert_prob_to_label(prob, cutoff=0.5):
    if prob > cutoff:
        label = 1
    else:
        label = 0
    return label

pred_labels = list(map(convert_prob_to_label, test_pred))
pred_labels = np.asarray(pred_labels)
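As a side note, pandas Series support vectorized comparisons, so the same labels can be produced without the helper function (an equivalent one-liner, assuming the same 0.5 cutoff):

# vectorized equivalent of the map above
pred_labels = (test_pred > 0.5).astype(int).to_numpy()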
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(test.target, pred_labels)
print(conf_matrix)

From the above confusion matrix, we can calculate the misclassification rate as the sum of the off-diagonal counts divided by the number of test observations:

mis_rate = (conf_matrix[1, 0] + conf_matrix[0, 1]) / len(test)
print(f"Misclassification rate = {mis_rate :.3f}")

Misclassification rate = 0.220

We got a good model to start with, with an error rate of 22%. Please note that we did not run any model selection (model selection is out of scope for this article). However, this test misclassification rate could be due to chance and might depend on the particular test split. As a result, we might overestimate the test error rate. To get a more stable estimate of the test error / misclassification rate, we can use k-fold cross validation.

k-Fold Cross Validation using Sklearn

When running k-Fold cross validation, there are two key parameters that we need to take care of.

  • Number of folds: We need to be cognizant of the number of folds. In practice, 5–10 folds work well for medium-sized data. Using too many folds can result in high variance in the results.
  • Number of repeats: We can repeat the experiment as many times as we want. Depending upon the computation power we have in hand, we can select a big number here; bigger is better.

For this specific problem, I am using repeated k-fold cross validation with five folds and 100 repeats to calculate the average misclassification rate.

Please note that StatsModels does not have its own cross validation utilities, so we switch to sklearn for this part.

from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = df.loc[:, ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']]
y = df['target']  # 1-d target vector, as cross_val_score expects

cv = RepeatedKFold(n_splits=5, n_repeats=100, random_state=1)
# note: if the solver warns about convergence on this data,
# LogisticRegression(max_iter=1000) is a safe fix
model = LogisticRegression()

# getting the misclassification rate (1 - accuracy) for each of the 500 splits
scores = 1 - cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
import seaborn as sns
ax = sns.histplot(x=scores, kde=True)
[Histogram: distribution of the test misclassification rate across the 500 splits]

The above histogram clearly shows the variability in the test error. Instead of providing a single-number estimate of the test error, it is always better to provide the mean and the standard error of the test error for decision making.
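For instance, the standard error of the mean across the 500 fold scores can be reported alongside the mean (a rough sketch: the fold scores come from overlapping training sets, so this simple formula somewhat understates the true uncertainty):

mean_err = np.mean(scores)
se_err = np.std(scores) / np.sqrt(len(scores))  # naive standard error of the mean
print(f"mean = {mean_err:.3f}, standard error = {se_err:.4f}")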

Conclusion

print(f"Mean of misclassification error rate in test date is, {np.mean(scores) : .3f} with standard deviation = {np.std(scores) : .4f} ")

Mean of misclassification error rate in test date is, 0.167 with standard deviation = 0.0424

In conclusion, our cross-validated misclassification rate is 16.7%. In this article, we learnt how cross validation helps us get a more stable and robust estimate of the test error. In the next blog, we will do a similar exercise using bootstrap methods.


Ramanpreet Bhatia
A computer scientist who is passionate about making sense of data.