Residual Machine Learning: Continuous as Categorical

Pete Condon
Feb 9, 2018


Sometimes it can be challenging to quantify how sure a prediction is.

When asked to predict a continuous variable (e.g. how much energy will be consumed tomorrow?) many machine learning algorithms will predict a single result without offering any guidance on how certain that result is.

This can be a problem if you want to understand the likelihood of passing key thresholds (e.g. will peak demand exceed capacity?), evaluate risk (e.g. what are the best and worst cases for a forecast storm?), or predict highly uncertain events (e.g. individual energy consumption).

Energy consumption for an individual customer doesn’t follow a nice, smooth curve; it fluctuates wildly for a variety of reasons:

<Figure 1, individual energy consumption>

Fitting a distribution to the errors of a model can be a solution, unless the size of the errors follows a pattern (i.e. heteroskedasticity). Individuals are far more predictable at 4am than they are at 7pm:

<Figure 2, residential energy consumption heat map>

While it would be possible to build a separate model for each hour, the sheer number of factors that drive heteroskedasticity in individual consumption (hour of the day, day of the week, weather, ownership of a PV system, air conditioner or swimming pool, etc.) makes it impractical to split the data into enough groups for separate models.

One underutilised solution is to split either the target variable, or the residuals from another model, into groups (sometimes referred to as bins) and then model the bins as a categorical variable.

This approach has two benefits:

  • exploring the uncertainty around the most likely prediction by examining the probabilities of the other categories, and
  • automatically handling latent variables.

Latent variables occur when there is something you don’t or can’t measure that is influencing the result. Consider the electricity consumption of a hypothetical factory:

<Figure 3, factory consumption>

The pattern here is quite strong, except when the machinery doesn’t turn on. There could be any number of reasons why this happens: maybe there were no orders that day, maybe something broke, maybe everyone is at a party. We’ll never know what happened, or if or when it will happen again, but our model will implicitly factor these events into the predictions:

<Figure 4, factory model>
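To make the idea concrete, here is a minimal sketch in which a simple frequency table stands in for the categorical model. In practice any classifier that outputs class probabilities could play this role. The function name, the bin labels and the data are all made up for illustration; they are not taken from the factory above.

```python
from collections import Counter, defaultdict

def fit_bin_frequencies(contexts, bins):
    """Estimate P(bin | context) from simple co-occurrence counts.

    A hypothetical stand-in for a real classifier: it only shows how
    per-category probabilities expose uncertainty and latent events.
    """
    counts = defaultdict(Counter)
    for context, bin_label in zip(contexts, bins):
        counts[context][bin_label] += 1
    return {
        context: {label: n / sum(c.values()) for label, n in c.items()}
        for context, c in counts.items()
    }

# Made-up history: the kind of day, and which consumption bin was observed.
day_types = ["weekday"] * 10 + ["weekend"] * 4
observed_bins = ["normal"] * 8 + ["near zero"] * 2 + ["near zero"] * 4

probabilities = fit_bin_frequencies(day_types, observed_bins)
print(probabilities["weekday"])
```

The weekday probabilities include a share for the "near zero" bin even though nothing in the inputs explains why the machinery stayed off. That is the sense in which the unmeasured cause is factored in implicitly.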

An important consideration is how big the residual bins should be: too small and the predictions may not be reliable, too big and they may not be useful. One option is to use quantiles (perhaps deciles or percentiles) so that each bin holds roughly the same number of records.
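As a rough sketch of the quantile option, the following uses only the Python standard library to compute the cut points and assign each residual to a bin. The helper names and the residual values are hypothetical, and quartiles are used instead of deciles simply to keep the example small.

```python
import bisect
import statistics
from collections import Counter

def quantile_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points for equal-count bins."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

def assign_bin(value, edges):
    """Return the 0-based index of the bin that value falls into."""
    return bisect.bisect_right(edges, value)

# Made-up residuals (actual minus predicted) with a long right tail.
residuals = [-4.0, -2.0, -1.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.8, 1.5, 3.0, 9.0]

edges = quantile_edges(residuals, n_bins=4)
bin_counts = Counter(assign_bin(r, edges) for r in residuals)
print(edges, dict(bin_counts))
```

Each bin ends up with the same number of records, at the cost of bins that are much wider in the tails than near the centre.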

Another option is to convert the continuous value into a Z score (subtract the mean, then divide by the standard deviation, so that values are expressed in standard deviations from the mean) and then apply a formula like this:

round(10^(round(log10(abs(residual)), 0.01)), 0.1)

which gradually increases the size of the steps as you move away from the mean, giving you more precision where it’s likely to be important:

<Figure 5, step chart>
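The Z score and the step formula above can be sketched in Python as follows. This is one interpretation and rests on two assumptions: that round(x, step) in the formula means "round to the nearest multiple of step", and that the sign of the residual should be restored afterwards (the formula as written takes the absolute value, so it discards the sign). The function names are hypothetical.

```python
import math
import statistics

def z_scores(values):
    """Standardise values: subtract the mean, divide by the standard deviation.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def round_to_step(x, step):
    """Round x to the nearest multiple of step (assumed meaning of round(x, step))."""
    return round(x / step) * step

def log_step_bin(z, log_step=0.01, linear_step=0.1):
    """Apply round(10^(round(log10(abs(z)), 0.01)), 0.1), then restore the sign."""
    if z == 0:
        return 0.0  # log10(0) is undefined, so zero gets its own bin
    magnitude = 10 ** round_to_step(math.log10(abs(z)), log_step)
    binned = round_to_step(magnitude, linear_step)
    # round(..., 10) only tidies floating-point noise such as 0.30000000000000004
    return math.copysign(round(binned, 10), z)

standardised = z_scores([1.0, 2.0, 3.0, 4.0, 10.0])
binned_values = [log_step_bin(z) for z in standardised]
```

Under this interpretation, values close to zero land on a 0.1 grid, while values far from zero that differ by more than 0.1 can share a bin, because the logarithmic rounding becomes the coarser of the two steps.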

The key takeaway is that with a model of the residuals it is possible to answer in terms of confidence, quantifying the risk of key thresholds being passed. This gives a better understanding of where resources are likely to be needed, improving the prioritisation of maintenance and upgrades, and minimising unnecessary spending.
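As an illustration of answering in terms of confidence, the sketch below turns predicted bin probabilities into the chance of exceeding a threshold. It assumes residuals are defined as actual minus predicted, and that each bin is summarised by a single representative residual value. The function name, the numbers and the units are invented for the example.

```python
def probability_of_exceeding(forecast, threshold, bin_probabilities):
    """Sum the probability of every residual bin that would push the
    actual value (forecast + residual) above the threshold.

    bin_probabilities maps a representative residual per bin to the
    probability the categorical model assigned to that bin.
    """
    margin = threshold - forecast
    return sum(p for residual, p in bin_probabilities.items() if residual > margin)

# Made-up model output for one time period, in hypothetical kW.
bin_probabilities = {-2.0: 0.10, -1.0: 0.20, 0.0: 0.40, 1.0: 0.20, 2.0: 0.10}

risk = probability_of_exceeding(
    forecast=8.5, threshold=10.0, bin_probabilities=bin_probabilities
)
print(f"Chance of exceeding capacity: {risk:.0%}")
```

How finely the risk can be resolved depends on the bin widths: a threshold that falls inside a wide bin can only be attributed wholly to one side or the other.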
