Ensemble Learning: An Integral Aspect of Machine Learning

Heena Rijhwani
Published in The Startup
Nov 21, 2020

The famous saying “There is no strength without unity” perfectly captures the idea behind Ensemble Learning. Ensemble techniques are widely used in statistics and machine learning to improve the performance of a model. They use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. There are mainly two types of ensemble techniques: Bagging and Boosting.

In Bagging, individual models are trained on random subsets of the original dataset, and their individual predictions are then aggregated. Hence, Bagging is also called Bootstrap Aggregation.

Let’s take an example.

Consider an original dataset D and several base models M1, M2, M3 and so on. Each model is given a sample of the dataset, and the training sets of the base classifiers are independent of one another. Some of the original records may appear more than once in a given training set while others may be left out.

For instance, Model 1 is given a sample of dataset D called D’, which is smaller than D. We resample records again for Model 2, and so on. This process is called row sampling with replacement. Each model is trained on its respective sample, so some records are shared across samples while others are not. Records that are not part of a specific sample are used to test that model’s accuracy; these are called Out-of-Bag (OOB) records. For test data, each model makes a prediction, and a voting classifier is then applied, i.e., the majority of the votes is taken as the final prediction or outcome. In other words, every model casts a vote for each test instance, and the final output is the class that receives more than half of the votes. If more than half of the models output 1, the final outcome will be 1.

Majority voting can be based on simply counting the individual votes, or the votes can be weighted by each model’s accuracy. Ensemble methods are also used for regression problems, where the prediction for new data is the simple or weighted average of all the individual predictions.
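As a rough sketch (assuming scikit-learn is available; the dataset and parameter values below are placeholders, not from the article), bagging with out-of-bag scoring and majority voting looks like this:

```python
# Minimal bagging sketch: bootstrap samples + majority vote (illustrative values only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Base model passed positionally (its keyword name differs across scikit-learn versions).
bagger = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,    # number of base models M1, M2, ..., M10
    max_samples=0.8,    # row sampling with replacement: each sample D' has 80% of D's rows
    bootstrap=True,
    oob_score=True,     # estimate accuracy on the Out-of-Bag (OOB) records
    random_state=42,
)
bagger.fit(X, y)

print("OOB accuracy estimate:", bagger.oob_score_)
print("Majority-vote predictions for 5 rows:", bagger.predict(X[:5]))
```

For regression, BaggingRegressor from the same module averages the individual predictions instead of taking a majority vote.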

RANDOM FOREST

Random Forest is a popular bagging technique thanks to its performance and scalability. It is an ensemble of decision trees, where each tree is built from a bootstrap sample (sampling with replacement, i.e., some records may be repeated) and a randomly selected subset of features chosen without replacement. The number of estimators, i.e., models used in the Random Forest, can be tuned to increase the accuracy of the model.

The hyperparameters in a Random Forest model include (see the sketch after this list):

  • Number of decision trees.
  • Number of records and features to be sampled.
  • Tree depth and split criterion (Gini impurity or entropy).
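For concreteness, here is one way these hyperparameters map onto scikit-learn’s RandomForestClassifier; the values below are arbitrary placeholders rather than recommendations:

```python
# Sketch: the listed hyperparameters expressed as RandomForestClassifier arguments.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees
    max_samples=0.8,      # fraction of records sampled (with replacement) per tree
    max_features="sqrt",  # number of features considered (without replacement) at each split
    max_depth=5,          # depth of each tree
    criterion="gini",     # split criterion: "gini" or "entropy"
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```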

Random forest is a supervised learning algorithm used for both classification and regression. Just as a forest is made up of trees, and more trees make a more robust forest, the random forest algorithm builds decision trees on data samples, gets a prediction from each of them, and finally selects the best answer by voting. As an ensemble method it outperforms a single decision tree because it reduces overfitting by averaging the results.

Working of Random Forest Algorithm:

Step 1 − First, start with the selection of random samples from a given dataset.

Step 2 − Next, this algorithm will construct a decision tree for every sample and get the prediction result from every decision tree.

Step 3 − In this step, voting will be performed for every predicted result.

Step 4 − At last, the most voted prediction result will be selected as the final prediction result.
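The four steps above can be sketched by hand roughly as follows; this simplified illustration uses placeholder data and samples rows only, leaving the per-split feature sampling to the tree itself:

```python
# Simplified by-hand sketch of Steps 1-4 (placeholder data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
rng = np.random.default_rng(1)
trees = []

# Steps 1-2: draw a bootstrap sample and build a decision tree on it.
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                 # sampling with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect every tree's prediction (its "vote") for some test rows.
votes = np.array([t.predict(X[:5]) for t in trees])            # shape: (n_trees, n_rows)

# Step 4: the most voted class becomes the final prediction.
final = np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
print("Final predictions:", final)
```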

Advantages:

  • It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
  • Random forest works well over a larger range of data than a single decision tree does.
  • It has lower variance than a single decision tree.
  • It offers great flexibility and very high accuracy.
  • Scaling of the data is not required; the algorithm maintains good accuracy even when the data is not scaled.
  • Random Forest maintains good accuracy even if a large proportion of the data is missing.

Disadvantages:

  • Complexity is the main disadvantage of Random Forest algorithms.
  • Building a Random Forest is harder and more time-consuming than building a single decision tree.
  • More computational resources are required to implement the Random Forest algorithm.
  • It is less intuitive when we have a large collection of decision trees.
  • Making predictions with a random forest can be time-consuming compared with simpler algorithms.

BOOSTING

Boosting is another ensemble technique which creates a strong classifier from a number of weak classifiers. First, a base classifier is trained on the given dataset; then a second model is built that attempts to correct the errors of the first, i.e., the records the first model misclassified are emphasized when training the next model. Boosting builds its classifiers sequentially, as opposed to bagging, which can build its classifiers in parallel.
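As a toy illustration of this sequential error correction (a gradient-boosting-style sketch on placeholder regression data, not the exact procedure of any particular library), each new tree below is fitted to the residual errors left by the models before it:

```python
# Toy sketch of sequential error correction: each tree fits the previous residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.zeros_like(y, dtype=float)       # start from a trivial model
models = []

for _ in range(50):
    residual = y - prediction                    # errors left by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                        # the next model focuses on those errors
    prediction += learning_rate * tree.predict(X)
    models.append(tree)

print("Mean absolute error after boosting:", np.mean(np.abs(y - prediction)))
```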

ADABOOST

AdaBoost, or Adaptive Boosting, is a commonly used boosting technique. It uses an iterative approach to learn from the mistakes of weak classifiers and combine them into a strong one.

Working of AdaBoost:

Step 1- All records are assigned a sample weight. This sample weight is equal to 1/n where n is the number of records. For example, if our dataset contains 8 records, the sample weight will be 1/8.

Step 2- A weak classifier is prepared on the training data using the weighted samples.

Decision trees having only one level are created for each feature and are known as stumps.

Now, to select the base learner for this round, we compare the entropies of these stumps and pick the one with the minimum entropy.

Step 3- Find the performance of the stump:

Performance of the stump (alpha) = ½ ln((1 − Total Error) / Total Error)

where the total error is calculated by summing the sample weights of the incorrectly classified observations. For instance, if we have only one incorrectly classified observation, the total error will be 1/8 and the performance will be ½ ln(7) ≈ 0.97.

Step 4- Update weights.

Higher weights are assigned to wrongly classified observations so that in the next iteration these observations carry more influence when the next stump is trained, and lower weights are assigned to correctly classified observations:

New weight wi = w(i−1) × e^(±alpha)

where wi is the new weight, w(i−1) is the old weight, and alpha is the performance calculated in the previous step. The exponent is positive (+alpha) for misclassified samples, which increases their weights, and negative (−alpha) for correctly classified samples, which decreases them. The updated weights are then normalized so that they sum to 1.

Step 5- Repeat from Step 2 until all the observations have been correctly classified or the maximum number of iterations has been reached.
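Putting Steps 1-5 together, here is a rough by-hand sketch of AdaBoost on placeholder data, using the standard weight formulas quoted above (labels are mapped to -1/+1 for convenience):

```python
# Rough by-hand AdaBoost sketch following Steps 1-5 (placeholder data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
y = np.where(y == 1, 1, -1)                     # labels in {-1, +1}

n = len(X)
weights = np.full(n, 1 / n)                     # Step 1: every record starts at 1/n
stumps, alphas = [], []

for _ in range(10):
    # Step 2: fit a one-level tree (stump) on the weighted samples.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Step 3: performance alpha = 0.5 * ln((1 - total_error) / total_error).
    total_error = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - total_error) / total_error)

    # Step 4: raise the weights of misclassified records, lower the rest, then normalize.
    weights *= np.exp(-alpha * y * pred)        # e^{+alpha} when wrong, e^{-alpha} when right
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: performance-weighted vote of all stumps.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("Training accuracy:", np.mean(np.sign(scores) == y))
```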

Gradient Boosting and Extreme Gradient Boosting or XGBoost are other commonly used boosting techniques.
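In practice these usually come from a library rather than a hand-written loop. A brief usage sketch, assuming scikit-learn is installed and with purely illustrative parameter values:

```python
# Illustrative library usage; parameter values are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X_train, y_train)

print("AdaBoost test accuracy:         ", ada.score(X_test, y_test))
print("Gradient Boosting test accuracy:", gbm.score(X_test, y_test))
# XGBoost provides a similar scikit-learn-style wrapper (xgboost.XGBClassifier),
# provided the xgboost package is installed.
```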
