Cross-Validation for Classification Models

Jaswanth Badvelu
Published in Analytics Vidhya
9 min read · Jun 5, 2020

In my previous blog, COVID-19 Predicting Death Rate using Classification, different classification machine learning models were built and compared to predict which patients are likely to die from COVID-19. In this blog, K-fold cross-validation is performed on the same dataset to validate and estimate the skill of the models used previously. The machine learning models used are Support Vector Machine, Logistic Regression, Decision Tree, and Random Forest.

Photo by Bonnie Moreland, some rights reserved

K Fold Cross-Validation

Normally, when building a machine learning model, 20–30% of the dataset is split off as test data and kept hidden to check whether the model performs well on unseen data. The training dataset is used to train the model and the test dataset is used to validate it. In K-fold cross-validation, the total dataset is divided into K splits instead of 2. These splits are called folds. Depending on the data size, generally 5 or 10 folds are used.

The procedure for K-fold cross-validation is as follows: all observations in the dataset are randomly sampled into K folds of approximately equal size. The model is then trained on K-1 folds while the remaining fold is held out to validate it, and this process is repeated for every fold. Finally, the average error rate/accuracy across the folds is reported. The figure below illustrates the process for 5-fold cross-validation.

Fig 1. 5 fold Cross-Validation (source)
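This procedure can be sketched in base R. This is a minimal illustration on a synthetic stand-in dataset (hypothetical columns and values), not the code from the repo:

```r
# Minimal 5-fold cross-validation in base R, using a synthetic
# stand-in for the real dataset (hypothetical columns and values).
set.seed(42)
df <- data.frame(Death     = rbinom(200, 1, 0.3),
                 Age_group = sample(1:5, 200, replace = TRUE))

k <- 5
# Randomly assign every observation to one of k folds of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(df)))

errors <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]   # train on K-1 folds
  test  <- df[folds == i, ]   # hold out 1 fold for validation
  fit   <- glm(Death ~ Age_group, data = train, family = binomial)
  pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  errors[i] <- mean(pred != test$Death)   # misclassification rate on the fold
}
mean(errors)   # average error rate across the folds
```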

Why should we use cross-validation?

  1. Use total data

Sometimes the dataset may be too small. In this case, if part of the data is split off for validation, the remaining data may not be enough to train the machine learning model to identify patterns, causing underfitting, which in turn increases the error induced by bias. This is where K-fold cross-validation comes into play: with K-fold cross-validation there is always enough data to both train and validate the model.

2. Model consistency

If the model is tested with only one test set, there is a chance of overfitting, which in turn increases variance. By using cross-validation, we validate our model on more data, which helps demonstrate the model's consistency on unseen data: if the accuracy/error rate is consistent across all the test sets, the machine learning model is consistent.

Correlation

Fig 2. Correlation Matrix for all variables.

The correlation matrix is plotted to identify Pearson's correlation coefficient for every pair of variables. These values measure the association between different variables and give information about the magnitude of the association as well as the direction of the relationship.

The higher the absolute value of the correlation coefficient, the stronger the relationship: as one variable increases, the other tends to increase if they are positively correlated or decrease if they are negatively correlated.
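In R, such a matrix can be computed with the built-in cor() function. The sketch below uses synthetic stand-in data; in the real dataset, categorical variables would need to be numerically encoded first:

```r
# Pearson correlation matrix, as in Fig 2 (synthetic stand-in data;
# categorical variables are assumed to be numerically encoded already)
set.seed(7)
df <- data.frame(Death     = rbinom(100, 1, 0.3),
                 Age_group = sample(1:5, 100, replace = TRUE),
                 Gender    = sample(0:1, 100, replace = TRUE))

corr_matrix <- cor(df, method = "pearson")
round(corr_matrix, 2)

# Optional: visualise it with the corrplot package, if installed
# library(corrplot)
# corrplot(corr_matrix, method = "color", type = "upper")
```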

Models used

Unlike my previous blog, where only 2 models were compared to predict death (one with all remaining variables and another with the 3 variables most highly correlated with the death variable), this time 7 different models are built using all 7 variables and compared, with complexity increasing for every model: Model 1 uses only the one variable (Age_group) with the highest correlation coefficient with the death variable, and the remaining variables are then added one by one until all are used. The models used are

Model[1] <- "Death~ Age_group"
Model[2] <- "Death~ Age_group+Hospital_status"
Model[3] <- "Death~ Age_group+Hospital_status+Asymptomatic"
Model[4] <- "Death~ Age_group+Hospital_status+Asymptomatic+Region"
Model[5] <- "Death~ Age_group+Hospital_status+Asymptomatic+Region+Transmission"
Model[6] <- "Death~ Age_group+Hospital_status+Asymptomatic+Region+Transmission+Occupation"
Model[7] <- "Death~ Age_group+Hospital_status+Asymptomatic+Region+Transmission+Occupation+Gender"

From here on, for the sake of simplicity, only model names are used instead of listing each model's variables when comparing. This time the error rate is used as the metric to compare different models. All 7 models are compared, and 5-fold cross-validation is used to estimate their performance under different machine learning models: Support Vector Machine, Logistic Regression, Decision Tree, and Random Forest. The one standard error rule is used to choose the best model.
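One way to run 5-fold cross-validation over formulas stored as strings, as above, is with the caret package. This is a sketch on synthetic stand-in data, shown for two of the formulas; the method argument can be swapped for the other learners:

```r
library(caret)
set.seed(1)

# Synthetic stand-in for the real dataset (hypothetical values);
# caret expects a factor outcome for classification
df <- data.frame(
  Death           = factor(rbinom(300, 1, 0.3)),
  Age_group       = sample(1:5, 300, replace = TRUE),
  Hospital_status = sample(0:1, 300, replace = TRUE)
)

Model <- c("Death ~ Age_group",
           "Death ~ Age_group + Hospital_status")

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
cv_error <- numeric(length(Model))

for (i in seq_along(Model)) {
  fit <- train(as.formula(Model[i]), data = df,
               method = "glm", family = binomial,  # e.g. "svmRadial", "rpart", "rf" for the other learners
               trControl = ctrl)
  cv_error[i] <- 1 - max(fit$results$Accuracy)    # error rate = 1 - accuracy
}
cv_error
```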

One standard error rule

For those unfamiliar with it, the one standard error rule is used in cross-validation to select the simplest model whose error is within one standard error of the best model (the model with the least error).

The reason for this is that the error rate of each model from cross-validation has variation, because the folds are sampled randomly from the data. Since the minimum error rate itself has variation, the one standard error rule says that rather than picking the model with the minimum error rate, we pick the simplest model whose error rate is within one standard error of that minimum: if two models' error rates are within one standard error of each other, we cannot tell them apart based on the sample data we have.
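The rule itself is only a few lines of code. A sketch, assuming the mean CV error rates and their standard errors are stored in vectors ordered from simplest to most complex model (the numbers below are illustrative, not this blog's actual values):

```r
# Pick the simplest model whose CV error is within one standard error
# of the minimum-error model; models are ordered simplest -> most complex.
one_se_choice <- function(cv_error, cv_se) {
  best      <- which.min(cv_error)            # model with the least error
  threshold <- cv_error[best] + cv_se[best]   # one standard error above it
  which(cv_error <= threshold)[1]             # first (simplest) model under it
}

# Illustrative numbers only:
one_se_choice(cv_error = c(0.151, 0.149, 0.149, 0.140, 0.129, 0.129, 0.130),
              cv_se    = rep(0.025, 7))
# -> 1: the simplest model is within one SE of the best, so Model 1 is chosen
```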

Support Vector Machine

Fig 3. shows us the average error rate for all the 7 models when the SVM model is used.

Fig 3. Average error rates with standard error for all the Models using SVM

It can be seen that as the complexity of the models increases, the error rate decreases. The least error rate is observed for Model 5, with an error rate of 0.129 (87.1% accuracy), and the error rate is the same for Model 6 and Model 7. Also, the average error rate is the same for Model 1, Model 2, and Model 3, with an error rate of 0.151 (84.9% accuracy).

So according to the graph in Fig 3, to predict death using the Support Vector Machine, Age_group alone is enough. For slightly better accuracy, the Age_group, Hospital_status, Asymptomatic, Region, and Transmission variables can be used.

Fig 4. shows the average error rates for all 7 models after 5 fold cross-validation was performed and standard error was plotted to identify the best model using one standard error rule.

Fig 4. 5 fold Cross-Validation error rates with standard error for all the Models using SVM

It can be seen from Fig 4 that Model 5 (Death~ Age_group+Hospital_status+Asymptomatic+Region+Transmission) has the minimum error rate compared to all other models. The error rate decreases with increasing model complexity until Model 5 and then starts increasing. But Model 1, Model 2, Model 3, and Model 4 fall within one standard error of Model 5, so according to the one standard error rule, the simplest of these models should be chosen, which is Model 1 with only one variable. So for the Support Vector Machine, Age_group alone is enough to predict the death rate.

Also, the error rates across the folds after performing cross-validation are close to each other, which shows the model is consistent.

Logistic Regression

Fig 5. shows us the average error rate for all the 7 models when the Logistic regression model is used.

Fig 5. Average error rates with standard error for all the Models using Logistic Regression

In the case of logistic regression, Model 6 performed best with an error rate of 0.115 (88.5% accuracy), followed by Model 5 and Model 3, while Model 1 performed worst with an error rate of 0.151 (84.9% accuracy). Here too, the error rate decreased with increasing model complexity until Model 3, increased slightly for Model 4, and then decreased again.

So according to the graph in Fig 5, if we have Age_group, Hospital_status, Asymptomatic, Region, Transmission, and Occupation, we can predict the death rate with 88.5% accuracy.

Fig 6. shows the average error rates for all 7 models using Logistic Regression after 5 fold cross-validation was performed and standard error was plotted to identify the best model using one standard error rule.

Fig 6. 5 fold Cross-Validation error rates with standard error for all the Models using Logistic Regression

From Fig 6, the best model after performing cross-validation is Model 3, with an error rate of 0.1356 (86.44% accuracy). The simplest model that falls within one standard error of Model 3 is Model 2, so using the one standard error rule, Model 2 is selected as the best model. The death rate can be predicted using logistic regression with an accuracy of 86.5% using the Age_group and Hospital_status variables. Also, the error rates across the folds after cross-validation are close to each other, which shows the model is consistent.

Decision Tree

Fig 7. shows us the average error rate for all the 7 models when the Decision Tree model is used.

Fig 7. Average error rates with standard error for all the Models using Decision Tree

In the case of the Decision Tree, Model 3 and Model 4 are the best performing models, with an error rate of 0.121 (87.9% accuracy), and the error rate starts increasing after Model 4 as model complexity increases. Model 1 was the worst performing, with an error rate of 0.151 (84.9% accuracy). Here the error rate decreased with increasing model complexity until Model 4, increased slightly for Model 5, and stayed constant for Models 6 and 7.

Fig 8. shows the average error rates for all 7 models after 5 fold cross-validation was performed and standard error was plotted to identify the best model using one standard error rule using Decision Tree.

Fig 8. 5 fold Cross-Validation error rates with standard error for all the Models using Decision Tree

From Fig 8, it can be seen that after performing cross-validation, the error rate is the same for Model 4 and Model 5, at 0.122 (87.8% accuracy). The simplest model that falls within one standard error of Models 4 and 5 is Model 3, so using the one standard error rule, Model 3 is selected as the best model. The death rate can be predicted using a Decision Tree with an accuracy of 87.6% using the Age_group, Hospital_status, and Asymptomatic variables. Also, the error rates across the folds after cross-validation are close to each other, which shows the model is consistent.

Random Forest

Fig 9. shows us the average error rate for all the 7 models when the Random Forest model is used.

Fig 9. Average error rates with standard error for all the Models using Random Forest

The best performing model for the Random Forest is Model 6, with an error rate of 0.119 (88.1% accuracy), and the error rate remained the same for Model 7. The worst performing model is again Model 1, with an error rate of 0.151 (84.9% accuracy). It is surprising to see that with the Age_group variable alone, every model is able to predict whether a patient will die with about 85% accuracy. Here too, the error rate decreased steadily as the complexity of the model increased.

Fig 10. shows the average error rates for all 7 models after 5 fold cross-validation was performed and standard error was plotted to identify the best model using one standard error rule for Random Forest.

Fig 10. 5 fold Cross-Validation error rates with standard error for all the Models using Random Forest

From Fig 10, the error rate for Model 4 and Model 5 is the same, at 0.1301. The simplest model that falls within one standard error of Models 4 and 5 is Model 3, so using the one standard error rule, Model 3 is selected as the best model. So for the Random Forest, the death rate can be predicted with an accuracy of 87% using the Age_group, Hospital_status, and Asymptomatic variables. Also, the error rates across the folds after cross-validation are close to each other, which shows the model is consistent.

Conclusion

After comparing all 7 models across the Support Vector Machine, Logistic Regression, Decision Tree, and Random Forest machine learning models, the Decision Tree predicts the death rate with the highest accuracy, followed by Random Forest. Using the one standard error rule, we can see that Model 3 is the best performing model for both Decision Tree and Random Forest, with the variables Age_group, Hospital_status, and Asymptomatic, which are the same variables used in my previous blog to predict the death rate. This confirms that the models built previously to identify the patients with the highest risk of death using Statistics Canada COVID-19 data are accurate.

Data

All the graphs are generated in R Studio using R Language. All the Code and Data used to generate these graphs and models can be found in my GitHub repo here: https://github.com/JaswanthBadvelu/IENG-3304-Data-Management-and-Analytics/tree/master/Lab4%20Cross%20Validation


Jaswanth Badvelu, Analytics Vidhya

I write articles about easy ways to implement Data Science and Machine Learning techniques in the real world.