# Estimating per-input regression deviation

Recently I’ve worked on two regression problems in which it was useful to predict the estimate confidence as well as the expected value for each separate input. Conceptually this is simple, but I’ve found few definitive answers and ready-to-use models, so I’ll share a few thoughts and experiments. Since estimating “confidence” is not clearly defined, I’ll talk about estimating the parameters of a normal (Gaussian) distribution: the mean value and standard deviation.

### Motivation

There are a few reasons why we’d want to have an estimation of regression deviation:

• Report it to the end user (directly or by predicting a range).
• Detect which regions of the input space have more deviation in the target value.
• Discard high-deviation predictions to lower the expected error.
• Offer guarantees on the maximum predicted error.

These reasons are not independent of each other, I just want to stress we can arrive to the need for this estimation coming from several points of view.

### Input space assumptions

This approach relies on the input space having regions with varying target value statistics. There must exist regions in which the target value can be predicted with more confidence (less deviation) then in others. I believe this is often the case, here are a few common causes:

#### 1. Variable training set density

Different events have different occurrence frequencies and we sometimes need to model less frequent events. It might not be obvious in which input space regions we have more data, especially when the input space dimensionality is high.

Formally, we’re talking about the distribution of samples in input space p(x). Here’s a plot with an arbitrary function sampled across the domain of interest uniformly and normally.

If we were to train a model to predict the target value based on a sample of data normally distributed in the input domain, it’s likely it would have a greater error in regions in which the input density is lower.

#### 2. Hidden causalities and variable noise

It’s common that our input features don’t completely describe the domain. For example, let’s assume we’re predicting a simple function f(x, x) =x₀ + 0.1 * x₀ * x₁, but we only have x₀ as an input feature.

We can easily model this function and predictions will be useful across the whole input domain. However, the mean error will on average increase linearly with x₀ and it might be beneficial to report that.

Hidden causes could be considered input noise, but there could be other sources of noise variation. Measurement instruments might behave differently in different regions. Human annotators vary and that might be correlated to input space.

### Formal problem definition

Assume a dataset consisting N training pairs, each having a d-dimensional vector of input features xᵢ, i ∈ {0, N - 1}, xᵢ ∈ ℝᵈ and the value we want to predict yᵢ ∈ ℝ. Our regression model is a function y`ᵢ = f(xᵢ). Predictions y`ᵢ are imperfect due to data and modelling limitations and we want to reformulate the problem into predicting a normal distribution N(μᵢ, σ²ᵢ) with mean μᵢ = y`ᵢ = f(xᵢ) and standard deviation σ² = g(xᵢ).

This formulation has a well defined error metric. We can calculate the density of the distribution N(μᵢ, σ²ᵢ) in the point yᵢ: p(yᵢ | μᵢ, σ²ᵢ) and maximize it. The optimal solution will model g(x) to predict higher deviations in regions in which it’s harder to accurately predict the target and thus f(x) is on average further off from the known truth y.

#### Bayesian Linear Regression

The stated formulation of estimating a normal distribution of target values instead of a single point resembles Bayesian Linear Regression. However, BLR models N(μᵢ, σ²). Note that the standard deviation is not modeled per-sample, but as a single value for the whole input domain.

### Algorithms

There are many ways to approach this. There are models which do it out of the box (Gaussian Process), some can be easily adapted (ensembles, neighborhood) and I propose two ways to model σ² directly. Following is a description of these approaches. I’m sure there might be other approaches too. Feel free to comment.

#### Modeling σ² as a regression problem

The formal definition implies we could try to model the deviation function σ² = g(x) using an arbitrary regression algorithm.

Assume we have a training set D, after validation and testing data has been taken out. We can split it into two parts, D₁ and D₂. We use D₁ to train the regression mean estimator μ = f(x). We use this estimator to get mean point estimates μᵢ on D₂. We can now use the biased (uncorrected) formula for the standard deviation [1] with a population of N = 1 to get that σ²ᵢ = |yᵢ - f(xᵢ)|. Using the unbiased formula doesn’t work as it results in σ²ᵢ = ∞ as we typically don’t have many samples in the dataset with the same input value xᵢ. The biased formula tends to underestimate, but it’s still informative. Using this method we can obtain target deviations on the second part of the training set D₂ and use them for training the deviation estimator g(x).

If f(x) doesn’t overfit at all (a bold statement, but sometimes true) it’s not necessary to split D on D₁ and D₂. Splitting is methodologically correct since on unseen data g(x) will be approximating the deviation w.r.t. f(x) mean estimates. However, splitting results in a significant training-set reduction for both estimators. If f(x) overfits then the predictions on D₁ will be different then on D₂ and we need ensure there’s no overlap. This choice can be treated as a hyper-parameter: do whatever yields better estimates on the validation set.

The benefits of this approach are that it’s simple, it can be used with arbitrary regression algorithms for f(x) and g(x) estimation and it seems to perform very well. The drawbacks are that the deviation estimator is trained on highly biased samples, but we can always check this effect on test data.

#### Back-propagation through the probability density function

Consider the probability density function for the normal distribution N(μ, σ²):

It’s differentiable with respect to it’s parameters μ and σ²:

Following is a plot of the density function and it’s derivatives:

To make the derivatives more intuitive, consider the following example. In a normal distribution with μ = 0 and σ² = 1 consider points x₀ = 0.1 , x₁ = 1, x = 2.

Consider what would increase the probability density at each of those points. For point x₀ = 0.1 shifting μ towards it wouldn’t result in a significant increase in it’s likelihood, but a reduction in σ² would. Density at x₁ = 1 is primarily influenced by a shift in the μ and at x= 2 (and further out) both parameters have a similar influence, but that of σ² is greater.

However, both derivatives asymptotically approach zero the further they move away from μ. This is not a desirable property, so let’s consider the logarithm of the probability density function:

The log-density is the squared distance (x - μ)², divided by the “spread” σ², plus normalization. This is a very clear, intuitive similarity metric that we should be able to optimize using it’s gradients w.r.t. the parameters:

We can use an arbitrary differentiable function to calculate μ and σ², anything between a simple linear function to a complex neural net.

Experiments show this converges on the example dataset (described below). Some neural-net intuitions about learning-rates, weight initialization and convergence speed don’t hold, so a wider hyper-parameter exploration might help if you use this.

The advantages of this approach should be obvious: we can use the same neural-net techniques that have proven superior in many domains. The same disadvantages apply: the more complex the model, the more data and computation power we need for optimization.

#### Using an ensemble of mean predictors

To be precise, we can use bagging [2]. It’s a widely used technique, it’s most common representative the random forest algorithm. The general idea is that M weaker estimators are trained on subsets of the training data and the mean of their predictions returned as an estimate. The standard deviation can be estimated the same way.

The benefit of this approach is that it works very well both for μ and σ² estimation. The drawbacks of ensembles are that they often consume more memory and CPU time to use.

#### Nearest neighbors

The nearest-neighbors family of non-parametric methods often gets overlooked (being simple), but they can be very useful in this scenario. Neighborhood (either k-nearest or radius-based) target value averaging is easily extendible to include target value deviation averaging as well. Neighbor distance-weighting is applicable too.

Neighborhood-based approaches often perform very well. Their drawback is that they’re based on remembering the whole training set (consumes memory) and calculating distances from it when deployed (consumes CPU time). Improvements are used to alleviate that, but these approaches are still more applicable when the training dataset is lightweight.

#### Gaussian Process

Gaussian Process [3] is a model that predicts a normal distribution of target values per definition. It’s an interesting non-parametric model that I must admit I haven’t fully grasped yet. There’s a book [4] on the subject for those inclined.

In practice, fitting a GP requires the inversion of an N x N matrix (where N is the size of the training set) which has O(N³) complexity. This doesn’t scale to large datasets. Also, GP seems to be finicky about choosing the right kernel for the dataset at hand. I’ve had little success treating it as a black-box algorithm, even when training time is not prohibitive. Still, for smaller datasets that you have good understanding of (and can model the kernel appropriately) it might be a very useful tool.

### Experiment

All the code is available in a GitHub repository.

#### Dataset

To play with these concepts I’ve used the public bike-sharing dataset: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. I chose it because of it’s size (17k points, 16 attributes) and an interesting prediction target. It’s about predicting the number of bike rentals, daily or hourly. I did minimal data preprocessing (one-hot encoding of categorical attributes, exclusion of some attributes, standard scaling).

#### Models

The following models are evaluated:

• Making a separate deviation estimator (scikit-learn). Hyper-parameters are the choice of regression estimator both for mean and deviation modeling, data separation between the two models and hyper-parameters for these estimators.
• Back-propagating through the log-probability density function using PyTorch. I’ve used a simple linear function for both mean and deviation calculation. The only hyper-parameter is the weight decay.
• Ensemble of estimators. Hyper-parameters are the estimator type and count and estimator-specific hyper-parameters.
• K-nearest-neighbors. Hyper-parameters are k (neighborhood size) and if weighting should be used.

Models are implemented with a scikit-learn estimator interface. Hyper-parameter optimization is done using grid-search with the mean log-density as the evaluation metric.

#### Results

I’ve not spent too much time optimizing hyper-parameters so the results aren’t conclusive and they shouldn’t be used to compare algorithms. My goal with this experiment is to demonstrate that each of these approaches can be used, not which of them is best. Note that results tend to vary with each run, especially with the back-propagation model.

`       Model        MAE [1]   LN-PDF [2]   best hyper-parameter     ----------------- --------- ------------ ----------------------   Dev. regression      85.5         -7.2   split=0.5 [3], forest(300)                    Back-propagation     98.9        -10.3   alpha=1e-7              Bagging             137.6         -7.1   forest (3000)    kNN regression      111.7         -6.5   k=30, uniform`
`[1] Mean absolute error[2] Mean natural logarithm of the probability density function[3] Train-set split between mean and deviation estimators`

Following are a few scatter plots of deviation estimates against absolute error on the test set. Ideally the deviation estimate should be correlated with the error so that for high-error points the deviation estimate is also high.

The correlation is nowhere near ideal, but in most models for many high-error points the deviation estimation indeed is higher. Let’s see how that can be used to reduce the mean-absolute-error by discarding estimates with high deviation. On the x-axis is the not-discarded part of the test-set predictions.

We can see that all models get significant MAE improvements by discarding high-predicted-deviation samples. This really can be useful if requirements for error rates are strict.

### Conclusion

I really enjoy the concept of back-propagating through the probability density function. Even though it’s the most difficult model (of those evaluated) to properly optimize and it doesn’t yield the best results, it’s potential is great and there’s a certain elegance to it. Also it was fun revisiting and thinking through some old-school algorithms and appreciating them for their simplicity.

I believe this post demonstrates several valid approaches to the stated problem. Hope you find it useful!