Ensembles and Ensembles of Ensembles

Hyperparameter-free machine learning

Nicholas Teague
From the Diaries of John Henry
11 min read · Jul 24, 2020

I recently had the opportunity to attend another machine learning research conference, this one organized by ICML, hosted online with pre-recorded presentations, Zoom chats, and various other virtual interactive features (these things are too affordable not to attend, seriously no excuse). To be honest I'm finding these research conferences kind of difficult to navigate. Representing a startup company trying to attract users means navigating a field full of potential competitors, which necessitates a bit of caution. I mean that's what I tell myself; the reality is I'm sort of bad at networking, so yeah, there's a little of that at play as well.

Between poster sessions, tutorials, workshops, and invited talks there were several highlights, so I won't try to cover everything here. Instead I'll focus discussion on a closing workshop which I think covered some important ground, related to the field of automated machine learning, aka AutoML. I'll frame this essay around highlights of several papers that supplied key components of the modern paradigm, useful for understanding the foundations of today's libraries.

AutoML can generally be thought of as the category of machine learning libraries that incorporate one or more individually trained models in which training hyperparameters are abstracted away from the user. In some cases that abstraction may be achieved by a kind of automated architecture search and/or hyperparameter search, or in other cases we'll see that such abstractions may be achieved simply by aggregating multiple models into ensembles. Most common AutoML libraries are intended for tabular data applications, as are those discussed below, although I'm sure many of these techniques extend just as well to other modalities. Yeah so without further ado.

Ensembles

Dietterich, T. G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15. Springer, 2000.

It turns out that one of the most common features of AutoML implementations is the aggregation of models into ensembles, in which final predictions for inference are derived from some combination of constituent models. The aggregations rely on diversity among the models such that errors in different directions may offset each other in aggregation, sort of like how in statistics properties of a collective may be more reliable than properties of a single sample. Models may make different kinds of errors due to different architectures, or in some cases even with consistent architectures, such as from random perturbations to training set composition and/or initializations causing the optimization to get caught in different local minima of the loss function fitness landscape. In some cases, such as with under-parameterized models, single models on their own may not even be capable of reaching a true global solution in the fitness landscape, but through aggregation of surrounding points the solution may be better approximated. What's really cool is that this potential for improved performance through aggregation doesn't even require highly performant models; in fact component models that perform just slightly better than guessing, e.g. in a binary classification with error rate <0.5, may be sufficient to approximate an optimal solution given sufficient diversity.
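
To make that last point concrete, here's a minimal simulation sketch of my own (not from the Dietterich paper), assuming the component models' errors are independent: majority voting over a pool of weak learners that are each only slightly better than a coin flip yields a noticeably more accurate ensemble.

```python
# Sketch: majority voting over independent weak learners, each with error just below 0.5.
import numpy as np

rng = np.random.default_rng(42)
n_models, n_samples, error_rate = 25, 10_000, 0.45  # each model alone is barely better than guessing

# entry [i, j] is True if model i classifies sample j correctly (errors assumed independent)
correct = rng.random((n_models, n_samples)) > error_rate

# majority vote: the ensemble is right whenever more than half of the models are right
ensemble_correct = correct.sum(axis=0) > n_models / 2

print(f"single model accuracy ~ {1 - error_rate:.2f}")           # ~0.55
print(f"ensemble accuracy     ~ {ensemble_correct.mean():.2f}")  # noticeably higher than 0.55
```

The independence assumption is doing a lot of the work here, which is exactly why the sources of diversity described next matter.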

The ensembles themselves may be aggregated in several fashions, or in the context of AutoML libraries, several of these fashions combined. Model aggregation techniques include bagging, stacking, boosting, Bayesian methods, and k-fold cross-validation; we'll describe a few below. The model diversity can be achieved by many kinds of training permutations, such as applying different architectures, different hyperparameters, adding random noise to model initializations, varying data set composition, varying feature composition with feature subsets in training, or, another neat one for cases of regression or multi-label classification, aggregating label classes into different coarse-grained subsets by grouping distinct classes into common labels. What's important is that, through randomness, the expectation is that these variations will be wrong in different directions, allowing the collective to home in on the true solution.

Bagging

Breiman, L. Bagging predictors. Machine learning, 24(2): 123–140, 1996.

Bagging is a common model ensemble aggregation technique in which model diversity comes from variations in training data produced by bootstrap sampling from the training set. Bootstrap sampling refers to a statistical technique where samples drawn from a population, for instance in the AutoML case from the population of rows in a tabular data set, are drawn with replacement, which just means that across multiple draws from the same dataset a single observation may appear zero, one, or more times. As an example, if we have a four row tabular training set with features/label pairs we'll refer to as [a, b, c, d], and we populate a sampled four row training set for bagging, the bootstrap random sample might look something like this: [a, c, a, d]. Although some of the features/label pairs are missing from the returned set (like 'b'), and some are overrepresented (like 'a'), this is just one sampled training set, and through repetition of the sampling to populate multiple returned sets each of the features/label pairs would be expected to obtain a more level distribution across the sets. These deviations in training sample representation in each training run thus serve as a source of model diversity.
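
As a quick illustration, here's a tiny sketch of my own of that [a, b, c, d] example, using numpy's random sampling with replacement:

```python
# Sketch: bootstrap sampling with replacement from a four row training set.
import numpy as np

rng = np.random.default_rng(0)
training_rows = np.array(['a', 'b', 'c', 'd'])

# draw a bootstrap sample the same size as the original set, with replacement
bootstrap_sample = rng.choice(training_rows, size=len(training_rows), replace=True)
print(bootstrap_sample)  # some rows may be repeated, others omitted entirely
```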

The aggregation of these bagging models for inference can be achieved by simple voting in classification or averaging in regression, where voting just means that if we have e.g. an ensemble of five models for a binary classification inference operation, and two models predict True while three models predict False, we may select False as our final hypothesis. As a caveat to this method, it should be noted that in general bagging has been shown to work better when the constituent models are of a variety that may be considered "unstable", by which is meant that small changes in data properties may result in nonlinear changes to predictions. Examples of architectures that may exhibit this property include models in the decision tree paradigm or neural networks, whereas "stable" models include architectures like linear regression or nearest neighbors. Another limitation is that bagging works better when the error rates of the underlying models are fairly close, e.g. a majority vote between two highly accurate models and three highly uncertain models may not give the best result.
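
For a concrete sketch (my own, assuming scikit-learn rather than any particular AutoML library), bagging with decision trees as the "unstable" base learner might look like this:

```python
# Sketch: bagging decision trees, each fit on its own bootstrap sample, aggregated by vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 25 trees, each trained on a bootstrap sample of the training rows
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))
```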

Stacking

Ting, K. M. and Witten, I. H. Stacking bagged and dagged models. In Proceedings of the 14th International Conference on Machine Learning, pp. 367–375, 1997.

Stacking is a kind of model aggregation technique that may be built on top of bagging (where model diversity is achieved with bootstrap sampling), on top of "dagging" (similar to bagging but with sampling performed without replacement), or on top of other means to achieve model diversity as noted above. Stacking refers to cases where the aggregation of constituent models in inference is achieved by a different means than simply popular vote, more specifically by a second layer model that applies machine learning to weight the predictions of the models at the lower tier, thus adding a little more intelligence to model aggregation. Here we'll refer to the first tier of models as the 'student' models and the second tier as the 'teacher' model (borrowing this terminology from a presentation I saw from H2O). In some cases the teacher model may be a single model, or in other cases it may be an ensemble of models as well. In fact, one doesn't need to stop at a single stacking layer, as an ensemble of teacher models could itself be stacked by an additional tier of models, e.g. an 'administrator' model, and so on.

One of the benefits of stacking is accommodation of student models with diverse accuracies, as the teacher model will be able to recognize when one of the models is more accurate than the others and adjust weightings accordingly. In some variations the teacher layer may use a simpler model than the students as the aggregator, although in other cases the full diversity of model architectures from the student layer may be duplicated at the teacher layer. Note that some libraries, such as AutoGluon, include a pass-through of the training data used to train the student models into the training of the teacher models, similar to what is known as a skip connection in neural networks. In practice it may be appropriate to train the teacher layers on data points that were not used to train the student layers, such as, with k-fold cross-validation, what can be called out-of-fold data. From a parallelization standpoint, each of the student models can be trained in parallel, with the teacher models as a next step in the sequence.
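
As a rough sketch (my own, using scikit-learn's terminology rather than the student/teacher framing above), a two-tier stack with out-of-fold predictions feeding the second layer might look like this:

```python
# Sketch: diverse first-tier "student" models with a logistic regression "teacher";
# cv=5 means the teacher is trained on out-of-fold student predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

students = [('forest', RandomForestClassifier(random_state=0)),
            ('knn', KNeighborsClassifier())]

stacked = StackingClassifier(estimators=students,
                             final_estimator=LogisticRegression(),
                             cv=5,
                             passthrough=True)  # also feed the raw features to the teacher
stacked.fit(X_train, y_train)
print(stacked.score(X_test, y_test))
```

The passthrough=True flag is scikit-learn's analogue of the AutoGluon-style pass-through of training data noted above.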

Boosting

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, volume 96, pp. 148–156. Citeseer, 1996.

Boosting is an entirely different kind of model aggregation technique. Here we'll talk about the "AdaBoost" variant of boosting; other kinds of boosting techniques which we won't go into include gradient boosting, CatBoost, quantum boosting, and so on. Although AdaBoost aggregates a collection of diverse models (generally all from the same architecture of some weak learner), the training operation is more appropriate for sequential application, as each model's training will partly rely on properties derived from the one prior. Generally speaking, the source of model diversity in an AdaBoost aggregation is achieved by iteratively adjusting weights on the samples from a training set, where for a tabular training set a sample constitutes a row of features and the corresponding label, commonly denoted as a pair {Xi, yi}. In AdaBoost the weights for different samples are adjusted after each model training by comparing the inference error against the actual label yi corresponding to the features Xi vs the inference error against a false label y≠yi. This comparison allows us to identify those training samples that are harder to predict, and the effect of the weighting update is for each iteration to progressively increase attention toward these more challenging samples.

The expectation is that this increased weighting toward difficult samples will make subsequent models put more weight on edge cases than earlier iterations did, and then once some desired number of iterations has been performed a weighted aggregation of the sequence of models may be applied for inference, for instance applying more weight to models that achieved lower weighted error. There are some weaknesses: for instance, as we increase the number of classes in the labels, the minimum accuracy threshold the weak learner must achieve becomes a higher bar relative to random guessing. Another weakness can originate from cases of noisy labels, i.e. where some portion of the set is mislabeled, in which case boosting may apply too much weight to the noise sources and cause overfitting. Note that in some cases the comparison of classification error between the two label cases yi vs y≠yi may benefit from considering a non-boolean label metric which we'll call plausibility (similar to probability but without the need to sum to unity between the two cases), as may be available as an output from some weak learners.
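
For the flavor of it, here's a hedged sketch of my own (using scikit-learn's AdaBoostClassifier rather than the original paper's pseudocode), with shallow decision "stumps" as the weak learner:

```python
# Sketch: AdaBoost over decision stumps; each successive stump is fit with sample
# weights tilted toward the rows the previous stumps misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # a weak learner on its own
boosted = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```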

K-fold Cross-Validation

Parmanto, B., Munro, P. W., and Doyle, H. R. Reducing variance of committee selection with resampling techniques. Connection Science, 8(3–4):405–425, 1996.

Another common source of model diversity, and one I'm a little more familiar with, has diversity originating from partitioning training data into a rotating collection of train/validation partitions. For example, for 5-fold cross-validation the train/validation split may be achieved with 80%/20% of the training data, repeated five times where each time the validation split is a different 20%. Each of these five models can then be aggregated as an ensemble in inference. This relatively simple procedure has the benefit of drawing training from the entire train set, so that no data is permanently set aside for validation, and I expect that in the context of an AutoML library this type of operation may be coupled with many of the other aggregation techniques.
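
A simple sketch of my own (not from the Parmanto et al. paper) of what that might look like with scikit-learn: train one model per fold, then average the five models' predicted probabilities at inference.

```python
# Sketch: 5-fold ensemble, one model per fold, aggregated by averaging probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_new = X[:5]  # stand-in for inference data

models = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # the held-out fold (X[valid_idx], y[valid_idx]) remains available for validation
    models.append(model)

# aggregate the five models by averaging their predicted probabilities
avg_proba = np.mean([m.predict_proba(X_new) for m in models], axis=0)
print(avg_proba.argmax(axis=1))
```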

Bayesian

Dietterich, T. G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15. Springer, 2000.

Dietterich notes another aggregation technique in his writeup that I'll briefly highlight here for completeness. Bayesian aggregation is a probabilistic method which, as you might expect, borrows from Bayes' Theorem. More specifically, inference hypotheses are aggregated by evaluating a distribution over hypotheses in relation to a sample. I didn't follow this point completely, but apparently one way to support this distribution evaluation is by way of a Markov chain Monte Carlo sampling technique, which from what I gather is generally a useful tool in probabilistic inference. As far as means to develop a prior distribution for the Bayes analysis I am not sure and will leave that question as a reader exercise.
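
For the general flavor, here's a loose sketch of my own simplification (not Dietterich's formulation): weight each hypothesis by an approximate posterior and average the predictive distributions.

```python
# Sketch of Bayesian-style aggregation for a single binary inference:
# p(y=1 | x, data) ~= sum over hypotheses h of p(y=1 | x, h) * p(h | data)
import numpy as np

model_predictions = np.array([0.9, 0.7, 0.4])  # hypothetical p(y=1 | x, h) from three models
posterior_weights = np.array([0.5, 0.3, 0.2])  # stand-in for posterior p(h | data), sums to 1

p_y1 = np.sum(model_predictions * posterior_weights)
print(p_y1)  # 0.74
```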

AutoML Libraries

Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. ICML Workshop on Automated Machine Learning, 2020.

LeDell, E. and Poirier, S. H2O AutoML: Scalable Automatic Machine Learning. ICML Workshop on Automated Machine Learning, 2020.

The ICML workshop I attended included papers offered by developers from two different AutoML frameworks, AutoGluon (from Amazon) and H2O. Both papers include some further benchmarks versus other libraries which I'll leave to the reader to explore. An interesting point of differentiation between the two is their utilization of CASH techniques, which stands for Combined Algorithm Selection and Hyperparameter optimization: H2O makes some use of this technique in filtering models / hyperparameters for inclusion in the final returned set, vs AutoGluon which instead relies entirely on model diversity for homing in on a global solution. (I'm of course oversimplifying; there are other points of differentiation as well.) In fact this was one of my main takeaways from the workshop: that with sufficient model diversity in an ensemble, individual model hyperparameter tuning may not be required. Like a free lunch.

This wasn’t my only takeaway. Another point that caught my attention is that the data preprocessing methods available in these libraries for the most part appear somewhat rudimentary, which of course I take some comfort in since it means there remains value for a library like Automunge, although at the same time the fact that these libraries are starting to venture deeper into those waters, well, I don’t know, it’s just something that contributed to my mostly keeping my mouth shut at the workshop. When in doubt, keep quiet, as they say. After all, automating tabular data preprocessing is a very different kind of challenge than automating model training. It could be that there is no free lunch, that it would take some GPT-3 or similar capable AI to tame that challenge. My hope is that when that time comes, Automunge will be the platform on which it is conducted.

References

Breiman, L. Bagging predictors. Machine learning, 24(2): 123–140, 1996.

Dietterich, T. G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15. Springer, 2000.

Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. ICML Workshop on Automated Machine Learning, 2020.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, volume 96, pp. 148–156. Citeseer, 1996.

LeDell, E. and Poirier, S. H2O AutoML: Scalable Automatic Machine Learning. ICML Workshop on Automated Machine Learning, 2020.

Parmanto, B., Munro, P. W., and Doyle, H. R. Reducing variance of committee selection with resampling techniques. Connection Science, 8(3–4):405–425, 1996.

Ting, K. M. and Witten, I. H. Stacking bagged and dagged models. In Proceedings of the 14th International Conference on Machine Learning, pp. 367–375, 1997.

Chopin’s Mazurka in B major

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
