Estimating Aggregate Variance: Introducing Random Sort

Pete Condon
Mar 18, 2018


A working implementation of this algorithm can be found here: https://github.com/PeteCondon/RandomSort

Models of individual or local behaviour can be useful in offering specific predictions and detailed simulations (e.g. who will turn on their air conditioner under a given scenario? Or how many assets will fail in a storm?), but when the results are aggregated they often underestimate the variance of the group (e.g. how much electricity will be consumed next month? Or how will reliability perform next month?):

<Figure 1, underfitting Probability of Exceedance bands>

The challenge this creates is that the models that understand the micro level the best can't explain the group's behaviour. Sometimes everything you're measuring heads in the same direction for reasons you may not be able to measure or predict in advance, like a heatwave or a storm. These are known as latent factors.

In this post, we describe a novel technique called Random Sort that we have developed to deal with underfitting of aggregate variance, with two key assumptions:

  • the mean of the model must match the mean of the actuals (no persistent over- or under-bias), and
  • the model must have a distribution of errors, or a model of residuals.

If these assumptions are met, we can take multiple estimates for each customer to understand the possible range of aggregate behaviour:

<Figure 2, estimates of purchases>
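As a rough illustration of this step (a sketch with made-up numbers, not the implementation from the repository linked above), the snippet below builds a matrix of estimates by adding residuals sampled from the model's error pool to each customer's point prediction. The `predictions`, `residuals` and `actuals` arrays are all hypothetical; 99 estimate columns are used so that each column can later be matched to a percentile, even though the figure shows only five.

```python
import numpy as np

rng = np.random.default_rng(42)

n_customers = 1000
n_estimates = 99   # one column per percentile later on; the figure shows five for clarity

# Hypothetical point predictions and a pool of observed model residuals.
predictions = rng.normal(100.0, 20.0, size=n_customers)
residuals = rng.normal(0.0, 10.0, size=5000)

# Each column is one estimate of every customer: prediction + a sampled error.
estimates = predictions[:, None] + rng.choice(
    residuals, size=(n_customers, n_estimates)
)

# Hypothetical actuals: wider individual errors plus a shared latent factor
# (e.g. a heatwave), so the model's residual pool understates the variance.
latent_factor = rng.normal(0.0, 8.0)
actuals = predictions + rng.normal(0.0, 14.0, size=n_customers) + latent_factor
```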

A good model will have errors that are independent: there is nothing left in the data that explains the result any further, so we can sample from our error distribution on the assumption that the overs and unders balance each other out. But this doesn't hold in the aggregated scenario if there are latent factors.

To get around this, we make the low values lower and the high values higher by sorting some of the estimates in each row. This means that Estimate 1 tends to end up with higher values, and Estimate 5 tends to end up with lower values:

<Figure 3, sorting>
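A minimal sketch of the sorting step, continuing the arrays above (the function name and signature are my own, not the repository's): each row sorts a randomly chosen subset of its columns in descending order, so the included columns push high draws towards Estimate 1 and low draws towards the last estimate.

```python
def random_sort(estimates, include_prob, rng):
    """Sort a random subset of columns within each row, descending.

    estimates    : (n_customers, n_estimates) array of sampled estimates.
    include_prob : (n_estimates,) chance that each column joins a row's sort.
    """
    result = estimates.copy()
    n_rows, n_cols = result.shape
    for row in range(n_rows):
        # Pick which columns take part in this row's sort.
        cols = np.flatnonzero(rng.random(n_cols) < include_prob)
        if cols.size > 1:
            # Descending sort: earlier estimates collect the higher values.
            result[row, cols] = np.sort(result[row, cols])[::-1]
    return result
```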

But the challenge is working out how often each column should be included in the sort.

We can work out how badly our model is underestimating the variance by calculating the percentiles from the estimates and comparing them with how often they are exceeded. The percentiles can be thought of as Probability of Exceedance (PoE X, a value that is exceeded X percent of the time). Ideally the predicted line should match the perfect line: the PoE 10 should be exceeded 10% of the time, and the PoE 90 should be exceeded 90% of the time. The data from Figure 2 looks like this:

<Figure 4, PoE curve>
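One way to compute such a curve under the same assumptions (this is my reading of the calibration check, not code from the repository): treat the PoE-p value as the (100 - p)th percentile of each customer's estimates and measure how often the actual exceeds it.

```python
def poe_curve(estimates, actuals):
    """Observed exceedance rate (in %) for PoE levels 1..99.

    PoE p is a value exceeded p% of the time, i.e. the (100 - p)th
    percentile of a customer's estimates.
    """
    levels = np.arange(1, 100)
    # Shape (n_customers, 99): the PoE-p value for every customer.
    poe_values = np.percentile(estimates, 100 - levels, axis=1).T
    observed = (actuals[:, None] > poe_values).mean(axis=0) * 100
    return levels, observed
```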

We already know that the model is underfitting, but the PoE curve allows us to measure it precisely:

<Figure 5, PoE difference>

Now we can use the absolute difference as the percentage of sorts in which each column is included, i.e. Estimate 1 is included in 33% of sorts because Percentile 1 is exceeded 34% of the time, and Estimate 99 is included in 17% of sorts because Percentile 99 is only exceeded 82% of the time. This results in aggregated estimates that capture the variance well:

<Figure 6, Random sorted>
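Putting the sketches together (still using the same hypothetical data and helper names from above): the absolute gap between each PoE level and its observed exceedance rate becomes that column's inclusion probability, and summing the randomly sorted columns across customers gives the widened range of aggregate outcomes.

```python
levels, observed = poe_curve(estimates, actuals)

# e.g. |1 - 34| / 100 = 0.33 for Estimate 1, |99 - 82| / 100 = 0.17 for Estimate 99.
include_prob = np.abs(levels - observed) / 100.0

sorted_estimates = random_sort(estimates, include_prob, rng)

# Each column summed across customers is one scenario for the group total.
aggregate_scenarios = sorted_estimates.sum(axis=0)
print(aggregate_scenarios.min(), aggregate_scenarios.max())
```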

The variance can then be reconciled against other known measurements (locations, asset types, etc.) to provide results that are consistent with a range of factors.

The big learning is that with Random Sort, it is possible to perform complicated scenario modelling that also considers that sometimes things just happen. For example, how much more electricity would be consumed if the number of air conditioners in a suburb doubled, and what would that do to peak demand at the substation?
