Reaching the depths of (power/geometric) ensembling when targeting the AUC metric

Published in

Data Science & Design

12 min readAug 22, 2016

Laurae: This post is about optimizing AUC when ensembling different (correlated/clustered) models using power/geometric (weighted) means. It has five parts and deals with different questions answered with either theoretical or pragmatic (= from practice) knowledge, along with practical results. This was originally posted at Kaggle but is scattered along different posts.

Part Summaries

Part 1: practical results of ensembling by power
Part 2: the formula for ensembling by power
Part 3: how to deal with clustered models
Part 4: this applies to similar ranking performance metrics (Gini…)
Part 5: (my pragmatic) theoretical framework

Part 1: practical results of ensembling by power

I’ll give here the the major properties of the ensembling techniques to optimize AUC that are not done using machine learning models, but done via manual intervention.

Ranking of the my different ensembles of 10 models using different ensembling techniques:

Simple Average: 0.827353 (123th)
Stretched Average: 0.827358 (122th)
Rank Average: 0.827280 (146th)
Power 2 Average: 0.827413 (99th)
Power 4 Average: 0.827473 (82th)
Power 8 Average: 0.827554 (62th)
Power 16 Average: 0.827497 (72th)
Power 32 Average: 0.827403 (104th)

Using simple averaging in this data set

Typical ensembling method when no weights are to be found
Confident models (in both sides) have a larger impact than non-confident models, if the difference between the extreme probabilities per row is high while maintaining a minima/maxima (i.e being confident with models not confident around gives less confidence to unallow to make big probability mistakes)
For AUC, weight is the probability (unless weighted average).

Public/Private: 0.840995/0.827353

Using stretching in this data set

Positively confident models are given lower weights than non-positively confident models when it comes to negative labels
Non-positively confident models are given higher weights to increase the diversity of the ensemble
Positively confident models have a higher skewed distribution allowing to decrease the variance of the diversity of the ensemble
For AUC, weight is the probability stretched from [0, 1], thus the weight for each model is [ (Xn — min(Xn) ) / (max(Xn — min(Xn)) ] for each model Xn approximately

Public/Private: 0.841001/0.827358

Using rank averaging in this data set

Confident models are not existing anymore due to mandatory scaling to [1, nrow]
Underconfident models may jinx everything if used inappropriately
For AUC, weight is 1 for every row/model unless weights are specified explicitly (weighted average)

Public/Private: 0.840928/0.827280

Using power averaging in this data set

The high positive probabilities of the positively confident models remain positively confident
The non-high probabilities of any models are knocked down
Only the high probabilities subsist, hence if many models agrees on a positive: it will remain positive (against the others) — and this scale required to remain positive is tighter (due to the power scaling)
For AUC, weight is 1 for every row/model unless weights are specified explicitly (weighted average)

Public/Private: 0.841082/0.827473

N.B: Choosing the power depends on the risk you want to take, on what you are trying to optimize, and on what you are using as inputs (a bad model input can screw up everything). It’s wise to choose only your best performing models. It’s all about risk optimization due to the decorrelation issues. Power optimization tends to give the highest performance for AUC, but this is not always the case.

Example of my selection for choosing power 4: (p.s: picture was sent during a chat with a teammate, so it’s not clean at all…!)

Power 2: 0.840976/0.827413

Power 4: 0.841082/0.827473

Power 8: 0.840810/0.827554

Power 16: 0.840525/0.827497

Power 32: 0.840362/ 0.827403

Part 2: the formula for ensembling by power

Michael Maguire wrote:
Thanks Laurae! Could you describe power averaging a bit more (i.e. formula)? I’m not familiar with it. Thanks!

Same as averaging, except you use power on each model.

Example, for 4 submissions, you average them in a general case like:

Final Submission = (Submission1 + Submission2 + Submission3 + Submission4) / 4

For a power averaging, you do the following:

Final Submission = (Submission1^Power + Submission2^Power + Submission3^Power + Submission4^Power) / 4

You use the properties of the power function (the further away you are from 1, the faster you are closer to 0). It means a model must be very confident on a positive label (here, target=1) to have its probability remaining the furthest away from the negative label (here, target=0).

Visualization:

When dealing with power averaging, you must keep in mind you are looking at how close your models are from 0, and not how close they are from 1 (due to the power function properties). It also means that a bad model used for ensembling can jinx your whole ensemble very harshly, even if you have 10 strong models that are supposed to “average out the error”. Visualizing the correlation between your models that will be used to ensemble is essential you understand the risk you are taking when ensembling using power average. The higher the decorrelation, the higher the risk of messing up things because of a bad input model => you are looking for high correlation straight from the beginning for the ensemble, not highly decorrelated/diverse models.

The probabilities that remain the highest at the end are the probabilities whose relative agreement (weighted down by the probability and the power) coming from each ensembled model is the highest. The reverse is also true for those who are the lowest probabilities. This is the reason power averaging usually fares better, specifically for AUC (does NOT work RMSE, Log loss, etc.), and sometimes even better than a Rank SVM if you using stacking.

Part 3: how to deal with clustered models

aldente wrote:
(1) Do you have experience with under what conditions is it best applicable to each kind of average? E.g. (a) which one is best for unbalanced/balanced dataset, (b) for other conditions…?

Power averaging only for AUC, nothing else (I might have missed some exotic performance metrics that may be optimized through power averaging though). In AUC you need optimize to get the most positives at the top of the rank (i.e you need the lowest amount of false positives at the highest possible probabilities).

Typical averaging or weighted averaging works well with most performance metrics when probabilities are taken into account.

aldente wrote:
(2) What’s best way to apply weighted average on each average? Example for power average, this,
(v1**2)*0.36 + (v2**2)*0.16 + (v3**2)*0.48
Or this, ?
(v1*0.36)**2 + (v2*0.16)**2 + (v3*0.48)**2

The first version is correct: (v1**2)*0.36 + (v2**2)*0.16 + (v3**2)*0.48

The second version leads to different scalings and this may cause issues as you are assessing probabilities with different scales.

aldente wrote:
(3) I tried power-averages on our best (rank-average) submission, but none gives better score. Even they’re much worse.
So I think maybe it’s as said in your statement “Visualizing the correlation between your models that will be used to ensemble is essential you understand the risk you are taking when ensembling using power average. The higher the decorrelation, the higher the risk of messing up things”. But I’m still not sure.
The weighted rank-average is,
v0_rank*0.16 + v1_rank*0.16 + v2_rank*0.16 + v3_rank*0.16 + v4_rank*0.36
The correlations has 2 groups of less-correlated submissions,

So in our case that none of our power averages has better public & private score, am I correctly interpreting your statement that power average works only for highly correlated submissions?

You have two sets of models that are not highly correlated. You need to make the power average separately per cluster of highly correlated models, then average out what is left.

i.e:

Cluster 1 = v1, v3, v4
Cluster 2 = v0, v2
Power averaging on Cluster 1 => X1
Power averaging on Cluster 2 => X2
Averaging on X1 and X2 => final submission

At power 2 you are already at potential_min(r²) = 0.5184, which should be way too low. And make sure you are using the same scales (from 0 to 1, no stretch, no conversion, i.e raw probability outputs).

potential_min(r^n) > 0.90 is recommended, although going under can yield significant performance boost (in my case, I could turn n=8 to have r⁸ = 0.80367 which is not that bad).

And yes, it does work better if you use highly correlated inputs.

Part 4: this applies to similar ranking performance metrics (Gini…)

eagle4 wrote:
If I may and now that I learnt about power average, I believe the power average may have worked for normalized Gini as well as AUC, such as the Lib Mut competition metric:https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction/details/evaluation.

It should work well on normalized Gini :) the formula should be the following using basic mathematics, supposing actual the labels:

Gini = (2*AUC — 1) * (0.5 * sum(actual == 0)) (using the Gini index property)
AUC = (Gini / sum(actual == 0)) + 0.5
Perfect Gini = 0.5 * sum(actual == 0)
Normalized_Gini = Gini / Perfect_Gini = [(2*AUC — 1) * 0.5 * sum(actual == 0)] / [0.5 * sum(actual == 0)]
Normalized_Gini = (2*AUC — 1) = 2*(Gini / sum(actual == 0))
Metric to optimize ==> AUC, the normalized Gini punishes twice faster than AUC

Perhaps aldente might found a solution to get #1 on it ^^

By the way, the evaluation metric provided at Kaggle is giving the wrong normalized Gini coefficient when opposite ties are present, right? (ex: pred = 0.2 and 0.2, label = 0 and 1)

Use the following for ties handling assuming the AUC/Gini relation is true:

FastNormGini <- function(y, x) {  # y = actual
  # x = predicted
  # using fast AUC computation to compute normalized Gini coefficient
  x1 = x[y==1]
  n1 = length(x1)
  x2 = x[y==0]
  n2 = length(x2)
  r = rank(c(x1,x2))
  return(2 * (sum(r[1:n1]) - n1*(n1+1)/2) / (n1*n2) - 1)}

Part 5: (my pragmatic) theoretical framework

eagle4 wrote:
@Laurae,
Could you please explain what you mean by “potential_min(r²) “ in your earlier post and how you got .51.
Thanks again for your posts !

The potential minimal R-squared is the (optimistic) minimal R squared you would get if you put to the power the minimal (calcluated) R-squared in your ensemble. The “potential” of a variable comes from Physics. It is “optimistic” because we assume the output of the ensemble is corrected by the other inputs in the ensemble.

Step 0

First of all, you need to split your ensemble into two virtual separate ensembles:

The first half are the strongest models
The second half are the “weakest” models (but are still strong due to their high correlation with the strongest models)

To determine the base R-squared you must use to calculate the potential minimum R-squared, you must have a “weight initializer”. The weight initializer is the only strongest model you have (the model giving you the lowest loss).

Once you found the “weight initializer”, you must compute the full correlation matrix of the inputs you will use for your ensemble. Let’s suppose the “weight initializer” is the model X_1.

Step 1

You must find the closest correlated model to the model X_1 (the one whose correlation is the highest with X_1). We call this model X_2. We exclude X_1 from the correlation matrix.

Then, find the closest correlated model to the model X_2 (the one whose correlation is the highest with X_2). We call this model X_3. We exclude X_2 from the correlation matrix (X_1 is already excluded also).

And loop until you exhausted all the input models. At the end, you “should” end on the weakest model, and you “should” have went through the strongest models to the weakest models (this is not always the case).

When the loop has ended (you exhausted all inputs), the minimal R-squared is the correlation between X_1 and X_n (n being the amount of model inputs you have). If you have 10 models, this would be between X_1 and X_10.

Fictive example (took a picture from Internet), with X_1 (the strongest model) being 1:

X_1 = model 1 — we delete model 1 from the correlation matrix
X_2 = model 2 (because the highest correlated model with the model 1 is the model 2) — we delete model 2 from the correlation matrix (1 and 2 are deleted)
X_3 = model 3 (because the highest correlated model with the model 2 is the model 3) — we delete model 3 from the correlation matrix (1, 2, and 3 are deleted)
X_4 = model 4 (because the highest correlated model with the model 3 is the model 4) — we delete model 4 from the correlation matrix (1, 2, 3, and 4 are deleted)
X_5 = model 5 (because the highest correlated model with the model 4 is the model 5) — we exhausted all inputs

And as expected, the lowest correlation with X_1 (model 1) is X_5 (model 5). Hence, the minimum R-squared is 0.579.

Step 2

Now you have the minimal R-squared, as you are limited by the amount of submissions you can make at Kaggle, you must find an optimal power to use. It breakdowns like this:

The higher the power you use, the higher the variance you add (the lower the potential minimal R-squared).
The lower the power you use, the higher the bias you add (the higher the potential minimal R-squared).
Large note: the variance and the bias FROM the inputs you have, not from the ensemble (it means the starting point of the variance/bias is supposedly your ensemble). The minimal R-squared serves as weight.

To balance the trade cost between variance and bias, a typical threshold I use (from my experience, but there is no known theories to analyze it yet though) for the potential minimal R-squared is 0.90: it means that

If you put the minimal R-squared to a power (potential minimal R-squared) that puts it below 0.90, you take the risk of adding too much variance to your ensemble (when you supposedly have a good bias/variance!!!).
If you put the minimal R-squared to a power (potential minimal R-squared) that puts it higher than 0.90, you take the risk of adding too much bias to your ensemble (when you supposedly have a good bias/variance!!!).
The threshold is defined as the power which puts the potential minimal R-squared below 0.90.
Note: the impact is close to negligible, due to the fact you start from the ensemble bias/variance.

Imagine you have a minimum R-squared of 0.988.

R^1 = 0.988
R^2 = 0.976
R^3 = 0.964
R^4 = 0.953
R^5 = 0.941
R^6 = 0.930
R^7 = 0.199
R^8 = 0.908
-----------
R^9 = 0.897

Here, the threshold is the power 9:

If you go over the power 9, you take the risk of increasing the variance too highly (and risk hurting slightly your model).
If you go under the power 9, you take the risk of increasing the bias too highly (and risk hurting slightly your model).

If you need to tinker for Kaggle (or for anything you cannot test a lot), I would recommend to drop 3 submissions:

One with a conservative biased power (very under 9)
One with the threshold power (9)
One with a conservative variance power (very over 9)

I tend to recommend the following:

First: 9^-0.5 = 3
Second: 9
Third: 9*(9^-0.5) = 27

But you are free to try more typical gut testings, like:

First: 4
Second: 8
Third: 16

There are cases where a biased power leads to a higher performance, while there are also cases where a heavily variance power leads to a higher performance.

Minimal decorrelation?

Potential minimal for the following reason: the power function is not linear, and hence increases decorrelation faster than the R-squared to power itself. Thus, defined as “optimistic”.

Quick example of why it can decorrelate faster (Base=best model aka X_1, Pred=worst model aka X_n):

Base (Initial): 0.75, 0.75, 0.75, 0.75
Pred (Initial): 0.6, 0.8, 0.7, 0.9
R-squared: 0.9739Base (Power 8): 0.100113, 0.100113, 0.100113, 0.100113
Pred (Power 8): 0.0167962, 0.167772, 0.057648, 0.430467
Potential minimal R-squared: 0.8090
R-squared: 0.5349 (decorrelated!!!!!)