Ensemble Methods (Algorithms)

Ahmed Imam
Published in The Startup · Jun 2, 2020

Random Forest & Gradient Boosting Algorithms

Tip: It’s recommended that you read the articles “Decision Tree: Regression Trees” and “Decision Trees Classifier” before going deeper into ensemble methods.

Introduction

The basic idea behind ensemble methods is to build a group, or ensemble, of predictive models (weak models), each independent of the others, and then combine their outputs in some way to obtain a more precise final prediction.

Because training different models with different algorithms and then combining their results is very difficult, ensemble methods use a single algorithm to train all of these models in a way that keeps them largely independent.

So, ensemble methods employ a hierarchy of two algorithms.

  1. The low-level algorithm is called a base learner. The base learner is a single machine learning algorithm that gets used for all of the models that will get aggregated into an ensemble.
  2. The upper-level algorithm (e.g. Bagging, random forest and gradient boosting algorithms) manipulates the inputs to the base learners so that the models they generate are somewhat independent.

Many algorithms could be used as the base learner, such as linear regression or SVM, but in practice the binary decision tree is the default base learner for ensemble methods.

An ensemble may consist of hundreds or even thousands of binary decision trees, and it naturally inherits their properties.

Random Forest

There are two ensemble learning paradigms: bagging (short for bootstrap aggregation) and boosting. The bagging paradigm is behind the random forest algorithm, whereas the boosting paradigm is behind the gradient boosting algorithm.

How Does the Bagging Algorithm Work?

  • The bootstrap aggregation algorithm uses what is called a bootstrap sample.
  • A bootstrap sample is a group of rows selected randomly, with replacement, from the original data set.
  • Replacement means that a bootstrap sample may contain multiple copies of a given row/observation, because rows are drawn randomly from the original data set one at a time, and that’s OK.
  • Each sample contains a number of instances equal to about 60% of the number of instances in the original data set (drawn with replacement).
  • The job of the bootstrap aggregation algorithm is to train each base learner on one of these samples.
  • The resulting models are averaged in regression problems.

𝑦̂ = (𝑦̂ 1 + 𝑦̂ 2 + … + 𝑦̂ S) / 𝑆

Where: 𝑦̂ is the average predicted output, 𝑦̂ s is the prediction of the base learner trained on sample s, and 𝑆 is the number of samples.

  • For classification problems, the models’ predictions can either be combined by majority vote, or probabilities can be derived from the percentage of models voting for each class.
  • Giving the decision trees sufficient depth allows the bagging algorithm to work properly and produce more precise predictions.
  • Random forest is therefore called a parallel model, because its trees are trained independently of one another.

Fig.1 Bagging Algorithm
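To make the bullet points above concrete, here is a minimal sketch of bagging for regression with decision trees as base learners. The data set, number of samples, and 60% sampling fraction below are illustrative assumptions for this sketch, not values prescribed by the algorithm.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative data set (assumption, not from the article)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

S = 50             # number of bootstrap samples / base learners
sample_frac = 0.6  # ~60% of the rows per sample, drawn with replacement

rng = np.random.default_rng(0)
trees = []
for _ in range(S):
    # Bootstrap sample: random rows drawn with replacement
    idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
    tree = DecisionTreeRegressor(max_depth=None)  # sufficiently deep trees
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The ensemble prediction is the average of the individual tree predictions
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)
```

Since each tree is trained on its own bootstrap sample, the loop could just as well run in parallel, which is why the model is called parallel.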

How Random Forest Differs from Bagging

The random forest algorithm has an advantage over plain bagging: the training samples do not incorporate all the attributes (columns) of the original data set; instead, each tree also works with a random subset of them.
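In scikit-learn this feature subsampling is exposed through the max_features parameter (applied at each split of a tree rather than per training sample); a minimal usage sketch, with an illustrative data set assumed:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data set (assumption)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# n_estimators = number of trees; max_features limits the columns
# considered for each split, which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```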

Advantages

Random forest is one of the most widely used ensemble learning algorithms. Why is it so effective? The reason is that by using multiple samples of the original dataset, we reduce the variance of the final model. Remember that low variance means less overfitting. Overfitting happens when our model tries to explain small variations in the dataset, which matters because our dataset is just a small sample of the population of all possible examples of the phenomenon we are trying to model.

Gradient Boosting

Gradient boosting develops an ensemble of tree-based models by training each of the trees in the ensemble on different labels and then combining the trees.

For a regression problem

Since the objective is to minimize the residuals (errors), each successive tree is trained on the errors left over by the collection of earlier trees. Let’s see how it works.

1. Start building our ensemble regressor with a constant model, 𝑓 = 𝑓 0, where 𝑓 0 is the average value of the actual (observed) outputs.

2. Then replace the label of each example (instance) i = 1, . . . , N in the training data set with the result of the following formula:

𝑦̂ 𝑖 = 𝑦𝑖 − 𝑓(𝑥𝑖)

Where 𝑦̂ 𝑖, called the residual, is the new label for example 𝑥𝑖, 𝑦𝑖 is the actual output, and 𝑓(𝑥𝑖) is the current prediction of the boosting model (here the mean value 𝑓 0, which is fixed across all values of X).

3. Now we use the modified training set, with residuals instead of the original labels, to build a new decision tree model, 𝑓1.

4. Get the output of the 𝑓1 decision tree; the boosting model is now defined as:

𝑓 = 𝑓 0 + 𝛼 · 𝑓1

where 𝛼 is the learning rate (a hyper-parameter with a value between 0 and 1).

5. Repeat step 2 to obtain new residuals, and again replace the labels in our training data set with these new residuals.

6. Build a new decision tree model, 𝑓2, on this updated data set and get its predictions.

7. The new boosting model is now redefined as:

𝑓 = 𝑓 0 + 𝛼 · 𝑓1 + 𝛼 · 𝑓2

8. The process continues until a maximum of 𝑀 trees (another hyper-parameter) have been combined.

9. For this reason, gradient boosting is called a sequential model.

Fig.2 Gradient Boosting Regression
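The steps above can be condensed into a short sketch. The data set, learning rate, tree depth, and number of trees below are illustrative assumptions; scikit-learn’s GradientBoostingRegressor implements the same idea in production form.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative data and hyper-parameters (assumptions, not from the article)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
alpha, M = 0.1, 100  # learning rate and maximum number of trees

# Step 1: start with a constant model f0 = mean of the observed outputs
f0 = y.mean()
prediction = np.full_like(y, f0, dtype=float)

trees = []
for _ in range(M):
    # Step 2/5: the new labels are the residuals left by the current model
    residuals = y - prediction
    # Step 3/6: fit a new tree to the residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    # Step 4/7: add the scaled tree output to the boosting model
    prediction += alpha * tree.predict(X)

def predict(X_new):
    """Final boosting model: f(x) = f0 + alpha * sum of tree outputs."""
    return f0 + alpha * sum(tree.predict(X_new) for tree in trees)
```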

So, what’s happening here? The boosting algorithm gradually reduces the residuals (errors) produced by each base learner.

It does this by feeding the sequential base learners in our ensemble these residuals as labels instead of the real ones.

A weight 𝛼 is applied to the output of each new base learner (tree) to refine the predicted value of the ensemble model.

Therefore, each additional tree added to the ensemble partially fixes the errors made by the previous trees, until the maximum number of trees has been combined. This is why it is called gradient boosting: step by step, the MSE cost function approaches its minimum and the predicted values improve.

Tip You can see the difference with bagging here: boosting reduces the bias (or underfitting) instead of the variance. As such, boosting can overfit. However, by tuning the depth and the number of trees, overfitting can be largely avoided.
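As an illustration of such tuning, here is a sketch using scikit-learn’s GradientBoostingRegressor with a small grid search over tree depth and tree count; the grid values and data set are assumptions for the example, not recommended defaults.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative data set (assumption)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Tune the two knobs mentioned in the tip: tree depth and number of trees
param_grid = {"max_depth": [2, 3, 4], "n_estimators": [50, 100, 200]}
search = GridSearchCV(GradientBoostingRegressor(learning_rate=0.1),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```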

Advantages

Gradient boosting is one of the most powerful machine learning algorithms, not just because it creates very accurate models, but also because it is capable of handling huge datasets with millions of examples and features. It usually outperforms random forest in accuracy but, because of its sequential nature, can be significantly slower to train.
