“How many candy corns?” or a brief application of ensemble learning

Fabian Gallusser
Published in disney-streaming
Jun 21, 2021

How many candy corns are in this jar?

Take a guess! Answer at the end of the post….

It’s a common challenge, but always a difficult one. While some will come up with a quick, random guess to get it over with, others will look up the jar online, estimate its volume from its dimensions (accounting for the flat top and bottom, naturally), estimate the volume of a single candy corn, divide the two, adjust for empty space…

A third option exists: wait for everyone to cast their vote, then average all those responses. While not always feasible (and certainly not a strategy everyone can adopt!), this approach has historically performed well in a wide range of situations! The effect has been dubbed the ‘wisdom of crowds’: large groups of people can collectively provide better estimates than a smaller group of experts.

Back to the candy corn example. In our broader team, 42 estimates were ultimately provided. Averaging all answers would have produced the second-best estimate, less than 6% off the true value!

Our proposed ‘recursive bivariate regression’ approach

How do candy corns relate to our modeling methodology at Disney Streaming? When trying to model how various features and variables relate to a key metric of interest (for instance, modeling how user behavior is tied to churn), the standard approach consists of fitting a single model using all features as inputs (some might be filtered out or transformed, but we can assume without loss of generality that all are kept). The issue is that we are putting all our eggs in one basket with this single “expert” model. If there are any issues with the features (e.g. high collinearity), the entire model can crumble. Naturally, many techniques have been developed to detect and address these issues; however, it can be a long and cumbersome process to iteratively fit, fix, and re-fit models.

Why not rely instead on the ‘wisdom of crowds’, and fit many poor models? Instead of a single model with all features, let us fit many models with just a handful of features.

The idea is implemented as follows: for each of our variables of interest, loop through all other available features, fit the bivariate model (the variable of interest plus one other feature), and keep track of the results. For a given variable of interest, this produces a distribution of its coefficient as we control for every other signal in our dataset in turn, from which we can derive a single estimate and confidence interval based on the median and quantiles.
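A minimal sketch of that loop in Python with NumPy (the function name, the plain-OLS fits, and the 2.5%/97.5% quantile choice are illustrative assumptions, not the exact production implementation):

```python
import numpy as np

def recursive_bivariate_estimate(X, y, j, quantiles=(0.025, 0.5, 0.975)):
    """For feature j, fit one bivariate OLS model per other feature,
    collect the coefficient of feature j from each fit, and summarize
    that distribution with its median and outer quantiles."""
    n, p = X.shape
    coefs = []
    for k in range(p):
        if k == j:
            continue
        # Design matrix: intercept, feature of interest, one control feature
        A = np.column_stack([np.ones(n), X[:, j], X[:, k]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        coefs.append(beta[1])  # coefficient of the feature of interest
    lower, median, upper = np.quantile(coefs, quantiles)
    return median, (lower, upper)
```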

The key question remains: how does this approach compare to a more standard approach of fitting a single model with all available signals?

One good model or many bad ones?

We ran multiple simulations knowing the true underlying values, and fit both approaches:

  • single model with all variables (SGM — single good model; a sketch of this baseline follows below)
  • recursive bivariate models (MBM — multiple bad models)
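For contrast, here is a minimal sketch of the SGM baseline, assuming an OLS fit with statsmodels on a generic feature matrix `X` and outcome `y` (the model family and the 95% level are our assumptions; the post does not specify them):

```python
import statsmodels.api as sm

def single_model_estimates(X, y):
    """Fit one OLS model with all features at once and return, for each
    feature, its point estimate and 95% confidence interval."""
    model = sm.OLS(y, sm.add_constant(X)).fit()
    conf = model.conf_int(alpha=0.05)   # one [lower, upper] row per parameter
    return model.params[1:], conf[1:]   # drop the intercept row
```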

We then compared key metrics for both approaches over each of the iterations.

The standard metric for such a comparison is coverage, which measures how frequently the derived confidence interval contains the true value being estimated. In our case, however, this is not the most important metric. Our primary purpose is to identify drivers that, when acted upon, will move our output metric of interest in the expected direction. Whether the resulting effect is exactly as anticipated is secondary. We therefore define directionality as follows:

  • If the true coefficient is 0: whether our confidence interval contains 0
  • If the true coefficient is positive: whether the lower bound of our confidence interval is positive
  • If the true coefficient is negative: whether the upper bound of our confidence interval is negative

In the illustration above, two methods produce identical point estimates but confidence intervals of different widths.

  • The true value is in the orange interval, so we have coverage; but because 0 is also contained in the interval, the estimate would not be flagged as significant. We do not have directionality.
  • Conversely, the true value is not in the green interval, so we have no coverage; but 0 is not contained in the interval either, so the estimate would be flagged as significant, with the same sign (positive) as the true coefficient. We have directionality.
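This definition translates directly into a small check (a Python sketch; the function and argument names are ours, not from the post):

```python
def has_directionality(true_coef, ci_lower, ci_upper):
    """Does the confidence interval point in the right direction?"""
    if true_coef == 0:
        return ci_lower <= 0 <= ci_upper   # a null effect should not be flagged
    if true_coef > 0:
        return ci_lower > 0                # positive effect: whole interval above 0
    return ci_upper < 0                    # negative effect: whole interval below 0
```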

We ran 1000 simulations using 100,000 users and 50 correlated features along with 50 uncorrelated features.
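A sketch of how one such simulated dataset could be generated (the correlation structure, noise scale, and true coefficients below are illustrative assumptions, since the post does not spell them out):

```python
import numpy as np

rng = np.random.default_rng(2021)
n_users, n_corr, n_uncorr = 100_000, 50, 50

# Correlated block: a shared latent factor induces high pairwise correlation
latent = rng.normal(size=(n_users, 1))
X_corr = latent + 0.3 * rng.normal(size=(n_users, n_corr))
X_uncorr = rng.normal(size=(n_users, n_uncorr))
X = np.hstack([X_corr, X_uncorr])

# True coefficients: a mix of null, positive, and negative effects
true_beta = rng.choice([-1.0, 0.0, 1.0], size=n_corr + n_uncorr)
y = X @ true_beta + rng.normal(size=n_users)
```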

The results are as follows:

The y-axis indicates the fraction of simulations in which we correctly determined the direction of a given variable’s impact. For both the correlated (left) and uncorrelated (right) variables, our approach of aggregating bad models is better suited to delivering reliable recommendations to our stakeholders.

One of the reasons our approach outperforms the more traditional technique is related to the width of the confidence intervals:

The y-axis here represents the distribution of confidence interval widths on a logarithmic scale. Our approach typically returns confidence intervals roughly 10 times narrower, allowing us to draw correct inferences on a more regular basis. This is directly related to the high collinearity across variables, which significantly inflates interval width in the single model, an effect our approach is immune to.
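One standard way to quantify that inflation (a diagnostic we add here for illustration, not something from the post) is the variance inflation factor: it measures how much collinearity multiplies the variance of a coefficient in the full model, and the interval width grows with its square root:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def interval_width_inflation(X, idx):
    """VIF for column idx of X: how much collinearity multiplies the variance
    of that coefficient in the full model (width grows with its square root)."""
    exog = sm.add_constant(X)
    return variance_inflation_factor(exog, idx + 1)  # +1 skips the constant

# With the simulated X above, a correlated column (e.g. index 0) yields a VIF
# far larger than an uncorrelated column (e.g. index 50).
```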

Conclusion

We have detailed a simple method to draw correct insights at scale, without relying on overly complex techniques to address the standard issues that arise in data modeling. While this approach has already brought value in helping to identify the main features influencing a key metric of interest, we have only laid the foundations and are continuously refining it. What would the benefits of using other models be? Would a ‘recursive trivariate regression’ outperform our current bivariate technique?

I’m fully aware I promised the answer to the number of candy corns at the end of the post, but technically I didn’t specify when. Please add your guess in the comments, and we can give the ‘wisdom of crowds’ a second shot!
