Whether we’re predicting water levels, queue lengths or bike rentals, at HAL24K we do a lot of regression, with everything from random forests to recurrent neural networks. And as good as our models are, we know they can never be perfect. Therefore, whenever we provide our customers with predictions, we also like to include a set of confidence intervals: what range around the prediction will the actual value fall within, with (e.g.) 80% confidence?

Unlike for classification problems, where machine learning models usually return the probability for each class, regression models typically return only the predicted value (with a few notable exceptions). In order to gauge the model’s confidence, we need to re-engineer our models to return a set of (differing) predictions each time we perform inference. We can then use the distribution of these predictions to calculate the model’s confidence intervals.

In this article we’ll show you how you can do this for any neural network, including those you’ve already trained. Our implementation will be in Keras — one of our favourite libraries for prototyping deep learning models.

# Our original model

As our example we’re going to use a simple LSTM model we trained earlier, designed to forecast traffic flow. We will evaluate its performance on a validation dataset that has no overlap with the data it was trained on. Here is one week of its predictions, for a single sensor:

Generally it’s performing pretty well — it outperforms the persistence baseline (a guess of the last known value) by 64% — although it struggles for the lowest traffic flow values and for any unexpected peaks.

As we’d hope, our model’s prediction errors follow an approximately normal distribution:

Our model’s predictions are on average correct, with a roughly equal rate of over and underprediction. For our re-engineered model to provide accurate confidence intervals, we require its predictions (for each inference call) to show a similar distribution around their average value.

# Re-engineering the model

To turn our single-valued regression model into one capable of returning multiple (different) predictions, we will repurpose a technique usually applied only during training — dropout. When we apply dropout we “turn off” a randomly chosen fraction of the units in the model (or in certain layers of it). During training this helps prevent overfitting: it reduces co-adaptation between units, forcing each unit to generalise well to unseen data. Quoting the original publication, this means:

> During training, dropout samples from an exponential number of different “thinned” networks.

By applying dropout when performing inference, we are therefore sampling one of these “thinned” networks to generate our predictions. By sampling enough times, we can build up a distribution of predictions from our single trained model, and use this distribution to calculate the confidence intervals of the original (un-“thinned”) model’s predictions.

In Keras we will apply dropout at inference time using the following function:
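A sketch of what this function could look like, written against a recent tf.keras (the function name *apply_dropout*, the toy model and the exact config fields are our assumptions; LSTM layers expose a *dropout* field in their config, standalone Dropout layers expose *rate*):

```python
from tensorflow import keras

def apply_dropout(model, dropout=0.5):
    """Rebuild a trained model with `dropout` set on every layer that
    supports it, keeping the trained weights."""
    config = model.get_config()
    for layer in config["layers"]:
        layer_config = layer["config"]
        if "dropout" in layer_config:   # recurrent layers such as LSTM
            layer_config["dropout"] = dropout
        if "rate" in layer_config:      # standalone Dropout layers
            layer_config["rate"] = dropout
    # Dropout changes no weight shapes, so the trained weights fit as-is
    model_dropout = model.__class__.from_config(config)
    model_dropout.set_weights(model.get_weights())
    return model_dropout

# Toy stand-in for the pretrained traffic-flow model
model = keras.Sequential([
    keras.Input(shape=(5, 1)),   # five timesteps, one feature
    keras.layers.LSTM(4),
    keras.layers.Dense(1),
])
model_dropout = apply_dropout(model, dropout=0.5)
```

Because the new model is rebuilt from the old model’s config, this works for both Sequential and functional models; subclassed models would need a different approach.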

This function takes the configuration and weights from a pretrained model (*model*), and uses them to create a new model (*model_dropout*) with the specified amount of dropout applied to all layers. We do this by looking for the layers which contain dropout and setting the dropout to the desired value. In our case we have LSTM layers, for which we want to change the *dropout* parameter (we will ignore *recurrent_dropout* for simplicity’s sake).

However, Keras turns off dropout by default when performing inference, so we cannot simply use this new model to generate our predictions. Instead we have to trick Keras into thinking we are still training the model — and so still using the dropout — by setting the *learning_phase* to 1. We therefore create a separate predict function *predict_with_dropout*, which takes both the model inputs and learning phase, and returns the model outputs (i.e. the predictions).
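In recent TensorFlow/Keras versions the learning-phase machinery has been deprecated, and calling a model with `training=True` achieves the same effect. A sketch of *predict_with_dropout* built that way, keeping the inputs-plus-learning-phase calling convention described above (the factory function and toy model are our own):

```python
import numpy as np
from tensorflow import keras

def make_predict_with_dropout(model_dropout):
    # The returned function mimics the signature described above: it
    # takes a list of model inputs whose final element is the learning
    # phase (1 = dropout active), and returns a list of output arrays.
    def predict_with_dropout(inputs):
        *input_data, learning_phase = inputs
        x = input_data[0] if len(input_data) == 1 else input_data
        # training=True keeps dropout active during inference
        outputs = model_dropout(x, training=bool(learning_phase))
        return [np.asarray(outputs)]
    return predict_with_dropout

# Toy dropout model for illustration
model_dropout = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])
predict_with_dropout = make_predict_with_dropout(model_dropout)
```

With the learning phase set to 1, repeated calls on the same input return different predictions; with 0, the model behaves deterministically.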

# Generating confidence intervals

We can then generate a set of predictions like so:
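A sketch of the sampling loop (here with a stochastic numpy stand-in for *predict_with_dropout*, so the snippet is self-contained; in practice this would be the Keras-backed function described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout(inputs):
    """Stand-in for the Keras-backed function: returns a slightly
    different prediction on every call, as a dropout-enabled
    network would."""
    *input_data, learning_phase = inputs
    base = np.asarray(input_data[0]).sum(axis=-1)
    noise = rng.normal(0.0, 0.1, size=base.shape) if learning_phase else 0.0
    return [base + noise]

input_data = [np.ones((5, 3))]   # five samples, three features
num_iterations = 20              # stochastic forward passes

predictions = np.stack(
    [predict_with_dropout(input_data + [1])[0]
     for _ in range(num_iterations)])

print(predictions.shape)  # (20, 5): one row per stochastic pass
```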

Note that *[1]* is being appended to *input_data* when we call *predict_with_dropout*, telling Keras we wish to use the model in the learning phase, with dropout applied. We predict with dropout 20 times, giving us 20 different predictions for each sample in the input data. More predictions make the confidence interval estimates more accurate, but also make the prediction process take longer; 20 predictions is therefore a fair compromise.

From these predictions it is then trivial to calculate the upper and lower limits for a given confidence interval:
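For example, with the predictions stacked into an array of shape *(num_iterations, n_samples)*, the limits can be read off with *np.percentile* (a sketch; the helper name is ours):

```python
import numpy as np

def confidence_interval(predictions, confidence=0.8):
    # For an 80% interval, cut 10% off each tail of the
    # per-sample distribution of predictions.
    tail = 100 * (1 - confidence) / 2
    lower = np.percentile(predictions, tail, axis=0)
    upper = np.percentile(predictions, 100 - tail, axis=0)
    return lower, upper

# 20 stochastic predictions for 7 samples (synthetic example)
preds = np.random.default_rng(1).normal(10.0, 1.0, size=(20, 7))
lower, upper = confidence_interval(preds, confidence=0.8)
```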

# Choosing the right dropout

In the above example we set *dropout = 0.5*, but we have no way of knowing whether this is the right amount of dropout to use. If the dropout is too large, the generated predictions will be very diverse, and the confidence intervals estimated from them will be too wide. Conversely, if the dropout is too small, the predictions will be too similar, and the confidence intervals will be too narrow. We can judge the suitability of the dropout by looking at the distribution of its predictions around the median. For a dropout of 0.5 we can see that the distribution of predictions is much broader than the errors we saw previously, so the amount of dropout needs to be reduced.

To determine the optimal dropout to use, we can look at the percentage of actual values which fall within each calculated confidence interval. For the optimal dropout we would expect 10% of actual values to fall within the 10% confidence interval, 20% within the 20% and so on. To choose the optimal dropout value, we calculate the percentage of actual values within the various predicted confidence intervals for a range of different dropout values.
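This calibration check can be sketched as follows (the helper name and the synthetic data are our own; when the samples are well calibrated, the measured coverage tracks the nominal confidence):

```python
import numpy as np

def interval_coverage(predictions, actual, confidence):
    # Fraction of actual values falling inside the predicted interval;
    # for the optimal dropout this should match `confidence`.
    tail = 100 * (1 - confidence) / 2
    lower = np.percentile(predictions, tail, axis=0)
    upper = np.percentile(predictions, 100 - tail, axis=0)
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Synthetic sanity check: samples and actuals from the same distribution
rng = np.random.default_rng(2)
preds = rng.normal(0.0, 1.0, size=(500, 1000))
actual = rng.normal(0.0, 1.0, size=1000)
for conf in (0.2, 0.5, 0.8):
    print(conf, round(interval_coverage(preds, actual, conf), 2))
```

In practice we would repeat this over a grid of dropout values (e.g. 0.2 to 0.5) and pick the one whose coverage curve lies closest to the diagonal.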

When the dropout is too large the confidence intervals are overestimated: for a dropout of 0.5, 40% of actual values fall within the 20% confidence interval. When the dropout is too small the confidence intervals are underestimated: for a dropout of 0.2, 20% of actual values fall within the 50% confidence interval. However, when the dropout value is just right, the confidence interval matches the distribution of actual values almost perfectly, in this case for a dropout value of 0.375.

Plotting the distribution of predictions for the optimal dropout of 0.375, we see it matches the prediction error pretty well:

As a final check we can plot the confidence intervals generated by our re-engineered model, for the same set of predictions we first displayed. We see that where the model performs worst the confidence intervals are broadest, so that all actual values bar one unexpected peak fall within the 95% confidence interval. These dropout-generated confidence intervals therefore give us a good measure of our neural network’s confidence for any set of predictions.

# About HAL24K

HAL24K is a Data Intelligence scale-up based in San Francisco, Amsterdam and London, delivering operational and predictive intelligence to cities, countries and companies.