Machine Learning Model Evaluation and Hyper-Parameter Tuning: Looking Beyond Accuracy

David Lourenço Mestre
Empathy.co
Mar 22, 2019

Last month, we discussed the impact of gradient descent on deep learning optimisation and reviewed the importance of optimising the cost function. In this post, we will be exploring in detail how to evaluate model results, as well as best practice for optimising hyper-parameters.

To do this, we will be building a model for image recognition. We will show how to compute metrics to assess the quality of the model and some optimisation techniques for hyper-parameters.

Model Evaluation

Just like a student revising for an exam, your machine learning model will have to go through a process of learning and training before being ready to complete its intended task. This training will enable it to generalise and derive patterns from actual data, but how can we assess whether or not our model provides a good representation of our data? How can we validate the model and predict how it will perform with data it hasn’t seen before?

When assessing a data model, accuracy is the most frequently used metric. It gives a general understanding of how many data samples are misclassified; however, this information can be deceptive and can give us a false sense of security.

Normally, you’d split your data into a training set and a test set. You’d train the model on the training set, and would measure accuracy while testing the model on the test set. This is the fastest way to evaluate the model’s performance, but it’s not the best one.

Bias-variance tradeoff

Although there are many different metrics that can be used to measure a model’s performance, keeping bias and variance low is always essential. We define bias as any systematic difference between the output of our model and the ‘true’ value. Variance refers to how sensitive the model’s predictions are to fluctuations in the training data (see the figure below for an illustration).

Figure from Mastering Machine Learning with scikit-learn, p. 14

So, in which situations will we have to combat high bias or high variance?

If the model is overtrained, or too complex for a given training dataset, it will memorise the training patterns, including the noise in the input. In such situations, we will have high variance (overfitting) and the model will perform poorly on unseen data.

In the opposite scenario, the model will perform poorly and produce similar errors on both training and testing data. In this situation, we will see high bias (underfitting). The model will be too inflexible and won’t have enough features to fully represent the data.

We ideally want to achieve both low bias and low variance: predictions that are close to the true values and that stay stable across different training sets. Unfortunately, efforts to reduce one often increase the other, so we have to find a compromise. This balance between bias and variance is called the bias-variance tradeoff.

While it might not always be possible to find enough data to prevent overfitting, or to know exactly how complex a model should be, plotting the training and testing accuracies as functions of the number of training samples might help.
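As a rough sketch of what such a plot (a learning curve) could look like, the snippet below uses sklearn’s learning_curve utility; the estimator, dataset, and fold count are illustrative assumptions rather than the article’s own code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Illustrative dataset and estimator; any classifier with fit/predict works here.
X, y = load_digits(return_X_y=True)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# Plot mean accuracy for each training-set size.
plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training accuracy")
plt.plot(train_sizes, test_scores.mean(axis=1), "o-", label="validation accuracy")
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```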

K-fold cross-validation

In this section, we will use Keras to wrap a neural network and leverage sklearn to run a k-fold cross-validation. For the neural network, we will use the LeNet architecture. It was one of the first prominent deep convolutional architectures, it’s fairly easy to code, and it’s not too computationally expensive. The architecture consists of two sets of convolutional and subsampling (also known as pooling) layers, followed by a flattening convolutional layer, then two dense (fully connected) layers.

Figure from Deep Learning with Keras, p. 78

Before going through the code in detail, it may be useful to define what k-fold cross-validation is and why you should use it.

K-fold cross-validation is one of the most common methods for validating an estimated hypothesis on data, and for assessing how accurately a model performs and how well it generalises. In k-fold cross-validation, you randomly split the training data into ‘k’ equal-sized folds. In each iteration, one of the folds is used for performance evaluation, while the remaining folds are used for training. This process is executed ‘k’ times, so we obtain ‘k’ models and ‘k’ performance estimates.

Statistically, the average performance measured over k-fold cross-validation gives a proper estimate of how well a model does its task in general.

Cross-Validation Code

First, we’ll import the data from Keras. As a training set, we will be using the Fashion-MNIST dataset. Fashion-MNIST is a dataset consisting of Zalando’s product images. It has a training set of 60,000 samples and a test set of 10,000 samples. Each image within these sets is 28 pixels by 28 pixels.

Before passing the data to our model, we must declare the number of channels (also known as the depth of the image) and reshape the samples to 60,000 x [1, 28, 28] to match the input shape expected by the convolutional layers. Note that the dataset is composed of greyscale images, so we have just one channel.
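A minimal sketch of this step, assuming the Fashion-MNIST loader bundled with Keras. The article reshapes to channels-first [1, 28, 28]; the snippet below uses the channels-last equivalent [28, 28, 1], Keras’s default ordering, purely for portability:

```python
from keras.datasets import fashion_mnist
from keras.utils import to_categorical

# 60,000 training and 10,000 test greyscale images of 28 x 28 pixels.
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Add the single channel dimension and scale pixel values to [0, 1].
X_train = X_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# One-hot encoded labels for the 10 classes (used when fitting the model directly).
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)
```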

In the following step, we define a class for the convolutional neural network (LeNet):
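The article’s class isn’t reproduced here, so the following is a hedged sketch of a LeNet-style model builder in Keras: two convolution + pooling blocks, a flattening step, and two dense layers. Filter counts, layer sizes, and the Adam optimiser are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam


class LeNet:
    @staticmethod
    def build(input_shape=(28, 28, 1), classes=10, learning_rate=0.001):
        model = Sequential()

        # First convolution + subsampling (pooling) block.
        model.add(Conv2D(20, (5, 5), padding="same", activation="relu",
                         input_shape=input_shape))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # Second convolution + subsampling block.
        model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # Flatten, then two dense (fully connected) layers.
        model.add(Flatten())
        model.add(Dense(500, activation="relu"))
        model.add(Dense(classes, activation="softmax"))

        # Older Keras versions use `lr`; newer ones accept `learning_rate`.
        model.compile(loss="categorical_crossentropy",
                      optimizer=Adam(lr=learning_rate),
                      metrics=["accuracy"])
        return model
```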

To perform cross-validation, we will import the function ‘cross_val_score’ from sklearn. This function takes the classifier, the samples (X), the labels (y), and the number of folds (cv) as inputs:
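A sketch of that call, assuming the scikit-learn wrapper that shipped with Keras at the time (keras.wrappers.scikit_learn.KerasClassifier); the epoch count and batch size are illustrative:

```python
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

# Wrap the Keras model builder so sklearn can treat it as a classifier.
classifier = KerasClassifier(build_fn=LeNet.build, epochs=5, batch_size=128, verbose=0)

# Integer labels are fine here: the wrapper one-hot encodes them internally
# when the loss is categorical_crossentropy.
scores = cross_val_score(classifier, X_train, y_train, cv=5)
```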

After running cross-validation, the function returns a list of accuracies for the five folds. In order to know how it performs on average, we look at the mean and the standard deviation:
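For instance (variable names follow the sketch above):

```python
print("Accuracy per fold: %s" % scores)
print("Mean accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))
```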

As a side note, this metric measures the percentage of data samples properly classified, or, more precisely, the proportion of correct predictions with respect to the total number of samples. In the background, Keras classifies each sample, yields a vector with the probability for each class, and selects the class with the highest value as the model’s prediction. Finally, Keras compares the prediction against the true value.
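In other words, something like this toy illustration (not the article’s code):

```python
import numpy as np

# Softmax output for one sample over four hypothetical classes.
probabilities = np.array([0.05, 0.02, 0.80, 0.13])
predicted_class = np.argmax(probabilities)  # index of the highest probability
correct = (predicted_class == 2)            # compared against the true label
```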

After running cross-validation for 5 folds, and extracting the mean accuracy and the standard deviation, we have a more accurate assessment of the model’s performance and of how robust it is on average. We can see that the classifier achieves on average 90% accuracy. This value fluctuates from iteration to iteration with a standard deviation of roughly 0.4%. We can conclude that we have a low variance and a relatively low bias. Still, we encourage you to benchmark different CNN architectures with the Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist#benchmark

Hyper-Parameters Optimisation

When creating a neural network we have two types of parameters. There are the parameters that are learned during training, such as the weights, and the parameters that are hard-coded and optimised separately. This second class of parameters is called hyper-parameters. Examples include the dropout rate, learning rate, number of epochs, batch size, optimiser, number of layers, number of nodes, and so on.

Fine-tuning the hyper-parameters might improve predictions, but there isn’t a rule that will tell you how many layers, the number of epochs, or the batch size to use on your neural network.

Finding an optimal solution (or even a sub-optimal one) often involves repeatedly training different versions of a model with different sets of hyper-parameters. However, there are a few techniques that make hyper-parameter optimisation more systematic. One of the simplest is grid-search.

Grid-Search

Grid-search is a popular method for identifying the best set of hyper-parameters for a model. It’s a brute-force search that takes a set of possible values for each hyper-parameter we want to tune, evaluates the model’s performance for every combination, and, in the end, returns the combination with the best performance.

It’s a time-consuming method, but since the hyper-parameter space can have multiple local minima and acceptable solutions (and it’s easy to end up working with a suboptimal combination), a grid-search might improve your model’s performance. In a real scenario, we start with a broad, wide-ranging set of values for each hyper-parameter. After identifying which values and regions are worth exploring, we can narrow the ranges of the grid-search.

Grid-Search Example

We will carry on working with the sample code we were using before, but modified so we can pass a set of values for the learning rate, the epochs, and the batch size.

Even though we are using a low number of folds for the cross-validation (the cv parameter) and a modest number of values for the three hyper-parameters to reduce waiting times, it will still take over an hour to run the following configuration. We recommend setting n_jobs to -1 to run the grid-search in parallel on all processors:
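A hedged sketch of such a configuration with sklearn’s GridSearchCV; the value grids below are illustrative assumptions, and the learning_rate key works because it matches an argument of the LeNet.build sketch above:

```python
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

classifier = KerasClassifier(build_fn=LeNet.build, verbose=0)

# Illustrative grids; in practice, start broad and then narrow the ranges.
param_grid = {
    "learning_rate": [0.001, 0.01],
    "epochs": [5, 10],
    "batch_size": [64, 128],
}

grid = GridSearchCV(estimator=classifier, param_grid=param_grid,
                    cv=3, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
```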

After running the grid-search we can get the best combination of parameters, the best score, and a dictionary with each combination’s results:
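For example, continuing the sketch above:

```python
print("Best score: %.4f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Mean test accuracy for every combination that was evaluated.
means = grid_result.cv_results_["mean_test_score"]
params = grid_result.cv_results_["params"]
for mean, param in zip(means, params):
    print("%.4f with %s" % (mean, param))
```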

In the end, the best combination turned out to be the one we had previously used to run the cross-validation.

Conclusion

In this article, we went through some basic ideas on how to evaluate a machine learning model and how to fine-tune its hyper-parameters.

We explored a very common strategy for model evaluation, k-fold cross-validation, which, by averaging the performance of the individual folds, helps identify whether a model suffers from high bias or high variance.

We also explored how to search for the right set of hyper-parameters by applying grid-search, which exhausts all combinations of the hard-coded values to help you find the optimal set of hyper-parameters.

If you have any thoughts or questions about model evaluation or hyper-parameter optimisation please feel free to get in touch.

Key Takeaways

  • Accuracy measured on a training/testing split is the fastest and most frequently used way to assess a model, but it’s not the best.
  • Whatever metric you use to measure performance, you need to ensure you keep bias and variance as low as possible.
  • If a model is overtrained or too complex, you will see high variance. If a model is too inflexible and produces similar errors in both training and testing data, you will see high bias.
  • Attempts to reduce one often increase the other. A compromise is needed, known as the bias-variance tradeoff.
  • Average performance measured over k-fold cross-validation gives a proper estimate of how well a model does its task in general.
  • Neural networks have two types of parameters: those learned during training, and those that are hard-coded and optimised separately. The latter are hyper-parameters.
  • Optimising hyper-parameters often requires repeatedly training different versions of a model with different sets of hyper-parameters.
  • Grid-search is one of the simplest techniques for hyper-parameter optimisation.
