Ensemble Methods

Abolfazl Ravanshad
Apr 27, 2018


In this post, I am going to review ensemble methods in general. Ensemble methods combine multiple hypotheses to form a (hopefully) better hypothesis. All ensemble methods share the following two steps:

1- Producing a distribution of simple models on subsets of the original data

2- Combining the distribution into one “aggregated” model
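
As a minimal sketch of these two steps (assuming scikit-learn and a toy synthetic dataset, neither of which comes from this article), one could train several shallow decision trees on random bootstrap subsets and combine their predictions by majority vote:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data; the dataset and model choices here are illustrative only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
models = []

# Step 1: produce simple models trained on subsets of the original data.
for _ in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Step 2: combine them into one aggregated model (majority vote here).
votes = np.stack([m.predict(X) for m in models])
aggregated_prediction = (votes.mean(axis=0) > 0.5).astype(int)
```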

The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. The broader term multiple classifier systems also covers hybrid combinations of hypotheses that are not induced by the same base learner. An ensemble is itself a learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. Empirically, ensembles tend to yield better results when there is significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine [1].

Common types of ensembles

Simple averaging: One of the most popular ways of combining multiple hypotheses is to simply average the corresponding output values. Although it may seem too simple, it is a very effective ensemble method: it achieves most of the benefit of ensembles by reducing the variance of the estimate of the output class posteriors.
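
For illustration, here is a small sketch (assuming scikit-learn and synthetic data; the particular base classifiers are arbitrary choices) that averages the predicted class posteriors of a few simple models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000),
          GaussianNB(),
          DecisionTreeClassifier(max_depth=3)]

# Average the estimated class posteriors of the individual models ...
probas = [m.fit(X_tr, y_tr).predict_proba(X_te) for m in models]
avg_proba = np.mean(probas, axis=0)

# ... and predict the class with the highest averaged posterior.
y_pred = avg_proba.argmax(axis=1)
```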

Stacking (stacked generalization): The main concept is to use a new classifier to correct the errors of a previous classifier, hence the classifiers are stacked on top of one another. It involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained on the available data; then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. Theoretically, any arbitrary combiner algorithm can be used, although in practice a single-layer logistic regression model is often chosen.
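
scikit-learn ships a StackingClassifier that follows this recipe; a minimal sketch (with arbitrary, illustrative base learners and synthetic data) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The base learners are trained first; their predictions become the
# inputs of the logistic-regression combiner mentioned above.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```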

Bootstrap Aggregating (Bagging): This algorithm generates multiple bootstrap training sets from the original training set and uses each of them to train a classifier for inclusion in the ensemble. In general, bagging does more to reduce the variance of the base models than their bias, so bagging performs best relative to its base models when the base models have high variance and low bias. Decision trees are unstable (small changes in the training data can produce very different trees), which explains why bagged decision trees often outperform individual decision trees. Similarly, bagging, and bootstrapping in general, works well for neural network classifiers. However, Naive Bayes and kNN classifiers are stable, which explains why bagging is not particularly effective for them [2].
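
A minimal bagging sketch, assuming scikit-learn and synthetic data (the base learner and the number of estimators are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is trained on its own bootstrap sample of the training set;
# the trees' predictions are then combined by voting.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),  # high-variance, low-bias base model
    n_estimators=50,
    bootstrap=True,
    random_state=0,
).fit(X, y)
```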

Boosting: This algorithm builds an ensemble incrementally by training each new model instance to emphasize the training instances that previous models misclassified. Boosting does more to reduce bias than variance. For this reason, boosting tends to improve upon its base models most when they have high bias and low variance; examples of such models are Naive Bayes classifiers and decision stumps. Boosting's bias reduction comes from the way it adjusts its distribution over the training set. However, this way of adjusting the training set distribution causes boosting to have difficulty when the training data are noisy: the weights assigned to noisy examples often become much higher than those of the other examples, causing boosting to focus too much on those noisy examples and overfit the data [2].
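
As a sketch of this idea, AdaBoost with decision stumps (again assuming scikit-learn and synthetic data; the number of estimators is an arbitrary choice) captures the reweighting scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each new stump is trained with higher weights on the examples that
# the previous stumps misclassified.
boosted_stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump: high bias, low variance
    n_estimators=100,
    random_state=0,
).fit(X, y)
```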

Advantages of ensemble methods

  • Intuitively, ensembles allow the different needs of a difficult problem to be handled by hypotheses suited to those particular needs.
  • Mathematically, ensembles provide an extra degree of freedom in the classical bias/variance tradeoff, allowing solutions that would be difficult (if not impossible) to reach with only a single hypothesis.
  • Averaging many hypotheses makes ensembles less prone to overfitting than a single complex model (although, as noted above, boosting can still overfit noisy data).

Disadvantages of ensemble methods

  • A single model that closely matches the true data-generating process will generally beat an ensemble. So if the data come from a linear process, linear models will be much superior to ensemble models.
  • Ensemble models suffer from lack of interpretability. Sometimes we need predictions and explanations of the predictions. It is hard to convince people to act on predictions when the methods are too complex for their comfort level. Variable importance analysis can help with insights, but if the ensemble is more accurate than a linear additive model, the ensemble is probably exploiting some non-linear and interaction effects that the variable importance analysis can’t completely account for [3].
  • Ensemble methods are usually computationally expensive. Therefore, they add training time and memory requirements to the problem.

References:

  1. Ensemble learning (Wikipedia)
  2. Classifier ensembles: Select real-world applications
  3. When should I not use an ensemble classifier?
