Cross Validation and HyperParameter Tuning in Python

Saranya Mandava
6 min read · Sep 18, 2018


Hello everyone! In this blog post, I want to focus on the importance of cross validation and hyperparameter tuning, along with the techniques used. I will give a short overview of the topic and then an example implementation in Python.

Cross validation is a technique used to assess how well our model performs. We always need to test the accuracy of a model to verify that it is well trained on the data, without overfitting or underfitting. This validation is performed only after training the model.

First, let us understand the terms overfitting and underfitting.

While using statistical methods (like logistic regression, linear regression, etc.) on our data, we generally split the data into training and testing samples, fit the model on the training samples, and make predictions on the test samples. In doing so, there is a possibility of overfitting or underfitting the data.

Overfitting

In statistics, overfitting means our model fits too closely to our data: the fitted line passes through nearly every point in the graph, and such a model may fail to make reliable predictions on future data.

To lessen the chance or amount of overfitting, several techniques are available (e.g. model comparison, cross-validation, regularization, early stopping, pruning, Bayesian priors, or dropout).

Underfitting

Underfitting means our model does not fit the data well: it occurs when a statistical model or machine learning algorithm cannot adequately capture the underlying structure (trend) of the data, which hurts the model’s accuracy.

Let’s walk through an example to understand these concepts, using the Scikit-Learn library in Python on the Titanic dataset with logistic regression.

In previous posts, we examined the data for anomalies, so we know our data is clean. Therefore, we can skip data cleaning and jump straight into k-fold cross validation.

First import the required libraries.
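The original embedded snippet isn’t reproduced here, but a minimal sketch of those imports looks like this:

```python
# Core data handling
import pandas as pd

# Encoding categorical features
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Splitting, cross validation, and hyperparameter search
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# The model we will tune
from sklearn.linear_model import LogisticRegression

# Plotting
from matplotlib import pyplot as plt
```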

Let’s quickly go over the imported libraries.

  • Pandas — to load the data into a pandas data frame.
  • From Sklearn’s preprocessing sub-library, I’ve imported LabelEncoder and OneHotEncoder, so I can encode the categorical values in my dataset.
  • From Sklearn’s model_selection sub-library, I’ve imported train_test_split (so I can, well, split into training and test sets), along with KFold, RandomizedSearchCV, and GridSearchCV for performing k-fold cross validation, random search, and grid search hyperparameter tuning respectively.
  • From Sklearn’s linear_model sub-library, I’ve imported LogisticRegression, so I can run a logistic regression on my data.
  • From Matplotlib, I’ve imported pyplot in order to plot graphs of the data.

The first step is to split our data into training and testing samples.
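A sketch of that step, assuming the cleaned Titanic data is already loaded into a DataFrame called df with the target column Survived (the variable and column names are illustrative):

```python
# Assumes the cleaned Titanic data is in a DataFrame `df`
# with the target column `Survived` (names are illustrative).
X = df.drop("Survived", axis=1)
y = df["Survived"]

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```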

The next step is to fit the training data and make predictions using a logistic regression model.
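A minimal sketch of fitting and predicting (the solver choice is illustrative):

```python
from sklearn.metrics import accuracy_score

# Fit a plain logistic regression on the training split
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)

# Predict on the held-out test split and check accuracy
y_pred = logreg.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```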

Now, we need to validate our results and find the accuracy of our model’s predictions. This is important because it tells us how the model will perform, in terms of prediction accuracy, when it sees new data. In this article, let us understand this using the K-fold cross validation technique.

K-fold Cross Validation

Our dataset should be as large as possible for training, and removing a considerable part of it for validation poses the problem of losing a valuable portion of data that we would prefer to train on. To address this issue, we use the K-fold cross validation technique.

In K-fold cross validation, the data is divided into k subsets; we train our model on k-1 of the subsets and hold out the last one for testing. This process is repeated k times, so that each of the k subsets is used once as the test/validation set while the other k-1 subsets are put together to form the training set. We then average the scores across the folds, finalize our model, and test it against the held-out test set. Below is sample code performing k-fold cross validation on logistic regression.
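Since the original snippet isn’t shown here, a minimal equivalent using KFold and cross_val_score with 5 folds might look like this:

```python
# 5-fold cross validation of logistic regression on the training data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(solver="liblinear"),
    X_train, y_train,
    cv=kf, scoring="accuracy",
)

print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```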

The accuracy of our model is 77.673%, and now let’s tune our hyperparameters. In the code above, I am using 5 folds. But how do we know how many folds to use?

The more folds we use, the lower the error due to bias, but the error due to variance increases; more folds also take longer to compute and need more memory. With fewer folds, we reduce the error due to variance, but the error due to bias is bigger; it is also computationally cheaper. Therefore, for big datasets, k=3 is usually advised.

Hyperparameter Tuning

Hyperparameters are hugely important in getting good performance with models. In order to understand this process, we first need to understand the difference between a model parameter and a model hyperparameter.

Model parameters are internal to the model; their values can be estimated from the data, and we are often trying to estimate them as well as possible. Hyperparameters, on the other hand, are external to our model and cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. Hyperparameters are model-specific properties that are ‘fixed’ before you even train and test your model on data.

The process for finding the right hyperparameters is still somewhat of a dark art, and it currently involves either random search or grid search across Cartesian products of sets of hyperparameters.

There are a bunch of methods available for tuning hyperparameters. In this blog post, I chose to demonstrate two popular ones: the first is grid search and the second is random search.

Grid Search

GridSearchCV takes a dictionary of all the different hyperparameter values you want to test, feeds every combination through the algorithm for you, and then reports back which one had the highest accuracy.

Using grid search (even though there are more hyperparameters available), let’s tune the ‘C value’, also known as the ‘regularization strength’, of our logistic regression, as well as the ‘penalty’ of the logistic regression algorithm.

First, let us create a logistic regression object and define the different values over which we need to search.
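A sketch of that search, assuming we try both penalties and a range of C values spaced on a log scale (the exact grid is illustrative):

```python
import numpy as np

# Candidate hyperparameter values to search over
logistic = LogisticRegression(solver="liblinear")
param_grid = {
    "penalty": ["l1", "l2"],
    "C": np.logspace(0, 4, 10),  # regularization strengths from 1 to 10,000
}

# Exhaustively try every (penalty, C) combination with 5-fold CV
grid_search = GridSearchCV(logistic, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print("Best penalty:", grid_search.best_params_["penalty"])
print("Best C:", grid_search.best_params_["C"])
```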

The above code finds the best penalty to be ‘l2’ and the best C to be 1.0. Now let’s use these values and calculate the accuracy.
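A sketch of evaluating the tuned model on the held-out test set (using the refitted best estimator from the search above):

```python
# Evaluate the best estimator found by the grid search on the test split
best_model = grid_search.best_estimator_
print("Tuned test accuracy:", best_model.score(X_test, y_test))
```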

We achieved an unspectacular improvement in accuracy of 0.238%. Depending on the application though, this could be a significant benefit.

Random Search

Random search is performed by evaluating n uniformly random points in the hyperparameter space and selecting the one producing the best performance.

Now, we instantiate the random search and fit it like any Scikit-Learn model:
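A sketch of that, assuming C is sampled from a uniform distribution and 100 random settings are tried (both choices are illustrative):

```python
from scipy.stats import uniform

# Sample C uniformly between 0 and 4 instead of trying a fixed grid
param_distributions = {
    "penalty": ["l1", "l2"],
    "C": uniform(loc=0, scale=4),
}

random_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions,
    n_iter=100,      # number of random parameter settings to sample
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
```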

These values are close to the values obtained with grid search.

That’s it for this time! I hope you enjoyed this post. As always, I welcome questions, notes, comments and requests for posts on topics you’d like to read. See you next time!

Entire code can be found here.
