**Ensemble Methods**

Ensemble methods are machine learning techniques that combine several base models into a single predictive model. In most situations, an ensemble produces more accurate predictions than any one of its member models. Ensembles appear in the winning solutions of many machine learning competitions. The winner of the Netflix Prize, for example, used an ensemble method to build a powerful collaborative filtering system.

We will mostly use Decision Trees to illustrate what Ensemble Methods are and why they are practical.

A Decision Tree arrives at a prediction through a series of questions and conditions. For instance, a simple Decision Tree can determine whether an individual should play outside or not. The tree takes several weather factors into account and, at each factor, either makes a decision or asks another question. In this example, every time it is overcast, we will play outside. If it is raining, however, we must ask whether it is windy. If it is windy, we will not play. But given no wind, tie those shoelaces tight because we're going outside to play.
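The tree described above can be written as a small function. This is only a sketch: the function name `should_play` and the string labels `"overcast"` and `"rainy"` are assumptions chosen for illustration, and because the text describes no other branches, any other outlook is rejected rather than guessed.

```python
def should_play(outlook: str, windy: bool) -> bool:
    """Walk the two-question weather tree described in the text."""
    if outlook == "overcast":
        return True           # overcast: always play
    if outlook == "rainy":
        return not windy      # rain: play only when there is no wind
    # No other branch is described in the text, so this sketch refuses to guess.
    raise ValueError(f"no branch defined for outlook={outlook!r}")


print(should_play("overcast", windy=True))   # True
print(should_play("rainy", windy=True))      # False
print(should_play("rainy", windy=False))     # True
```

Each `if` corresponds to one question in the tree, and each `return` corresponds to a leaf.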

Decision Trees can solve quantitative problems in the same format. In the tree to the left, we want to know whether or not to invest in commercial real estate. Is it an office building? A warehouse? An apartment building? Are economic conditions good or poor? How much will the investment return? The decision tree answers these questions.

When making Decision Trees, there are several factors we must take into consideration: On which features do we base our decisions? What is the threshold for classifying each question into a yes or no answer? In the first Decision Tree, what if we also wanted to ask whether we have friends to play with? If we have friends, we will play every time. If not, we might continue to ask ourselves questions about the weather. By adding an additional question, we hope to better separate the Yes and No classes.

This is where Ensemble Methods come in handy! Rather than just relying on one Decision Tree and hoping we made the right decision at each split, Ensemble Methods allow us to take a sample of Decision Trees into account, calculate which features to use or questions to ask at each split and make a final predictor based on the aggregated results of the sampled Decision Trees.

**Simple Ensemble Techniques:**

There are three main simple ensemble techniques:

1) Max Voting

2) Averaging

3) Weighted Averaging

**1)** **Max Voting: -**

· The max voting method is generally used for classification problems.

· In this technique, multiple models are used to make predictions for each data point.

· The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.

For example, suppose you asked 5 of your colleagues to rate your movie (out of 5), and three of them rated it 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating is taken as 4. You can think of this as taking the mode of all the predictions.

The result of max voting would be something like this:
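A minimal sketch of max voting, using the five ratings from the example. The helper name `max_vote` and the three "models" in the second part are hypothetical and only serve to show the vote being taken per data point.

```python
from collections import Counter


def max_vote(predictions):
    """Return the most common prediction, i.e. the mode of the votes."""
    return Counter(predictions).most_common(1)[0][0]


# Five colleagues rate the movie: three give 4, two give 5.
ratings = [4, 4, 4, 5, 5]
final_rating = max_vote(ratings)
print(final_rating)  # 4

# The same vote, taken per data point across three hypothetical classifiers.
model_predictions = [
    ["cat", "dog", "dog"],  # model A, one label per data point
    ["cat", "cat", "dog"],  # model B
    ["dog", "cat", "dog"],  # model C
]
voted = [max_vote(votes) for votes in zip(*model_predictions)]
print(voted)  # ['cat', 'cat', 'dog']
```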

**2)** **Averaging: -**

· As in max voting, multiple predictions are made for each data point.

· In this method, we take an average of predictions from all the models and use it to make the final prediction.

· Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

For example, in the case below, the averaging method takes the average of all the values.

i.e. (5+4+5+4+4)/5 = 4.4
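The same calculation in code, followed by the probability case mentioned above. The three probabilities are made-up values for illustration, and the 0.5 cut-off is an assumption rather than something the text prescribes.

```python
from statistics import mean

# Regression-style averaging of the five ratings.
ratings = [5, 4, 5, 4, 4]
average_rating = mean(ratings)
print(average_rating)  # 4.4

# Averaging class probabilities: three hypothetical models each estimate
# P(class = 1) for the same data point.
model_probabilities = [0.9, 0.6, 0.3]
average_probability = mean(model_probabilities)
predicted_class = 1 if average_probability >= 0.5 else 0  # assumed 0.5 cut-off
print(predicted_class)  # 1
```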

**3)** **Weighted Averaging: -**

· This is an extension of the averaging method.

· All models are assigned different weights defining the importance of each model for prediction.

· For instance, if two of your colleagues are critics while the others have no prior experience in this field, the answers of these two colleagues are given more importance than those of the others.

The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
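The weighted sum above can be checked with a few lines of code. The function name `weighted_average` is hypothetical. Dividing by the total weight is a small safeguard added here so the function also works when the weights do not sum to 1.

```python
def weighted_average(values, weights):
    """Weighted mean of the values, normalised by the total weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]  # the two critics get 0.23 each

result = weighted_average(ratings, weights)
print(round(result, 2))  # 4.41
```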

**Advanced Ensemble Techniques:**

**1)** **Stacking: -**

Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, KNN, or SVM) to build a new model. This model is used for making predictions on the test set.

Below is a step-wise explanation for a simple stacked ensemble:

1) The train set is split into 10 parts.

2) A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.

3) The base model (in this case, decision tree) is then fitted on the whole train dataset.

4) Using this model, predictions are made on the test set.

5) Steps 2 to 4 are repeated for another base model (say KNN) resulting in another set of predictions for the train set and test set.

6) The predictions from the train set are used as features to build a new model.

7) This model is used to make final predictions on the test-set predictions.
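The steps above can be sketched in plain Python. Everything here is an illustrative assumption rather than a reference implementation: the `Stump` class stands in for the decision tree, `NearestNeighbour` for KNN, `VoteTable` is a deliberately tiny meta-model, folds are assigned by index without shuffling, and the data is a toy one-feature problem.

```python
from collections import Counter, defaultdict


class Stump:
    """Base model 1: a one-split decision tree on a single numeric feature."""

    def fit(self, X, y):
        best = None
        for threshold in sorted(set(X)):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if x <= threshold else right) != label
                             for x, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
        _, self.threshold, self.left, self.right = best
        return self

    def predict(self, X):
        return [self.left if x <= self.threshold else self.right for x in X]


class NearestNeighbour:
    """Base model 2: 1-nearest-neighbour on a single numeric feature."""

    def fit(self, X, y):
        self.points = list(zip(X, y))
        return self

    def predict(self, X):
        return [min(self.points, key=lambda p: abs(p[0] - x))[1] for x in X]


class VoteTable:
    """Meta-model: for each combination of base predictions, remember the
    most common true label seen during training."""

    def fit(self, rows, y):
        counts = defaultdict(Counter)
        for row, label in zip(rows, y):
            counts[tuple(row)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.table.get(tuple(row), self.default) for row in rows]


def stacked_predict(base_classes, meta_model, X_train, y_train, X_test, n_folds=10):
    n = len(X_train)
    train_meta = [[] for _ in range(n)]   # one row of base predictions per train point
    test_meta = [[] for _ in X_test]
    for make_model in base_classes:       # step 5: repeat for every base model
        out_of_fold = [None] * n
        for fold in range(n_folds):       # steps 1-2: fit on 9 parts, predict the 10th
            held_out = [i for i in range(n) if i % n_folds == fold]
            kept = [i for i in range(n) if i % n_folds != fold]
            model = make_model().fit([X_train[i] for i in kept],
                                     [y_train[i] for i in kept])
            preds = model.predict([X_train[i] for i in held_out])
            for i, p in zip(held_out, preds):
                out_of_fold[i] = p
        full_model = make_model().fit(X_train, y_train)   # step 3
        test_preds = full_model.predict(X_test)           # step 4
        for row, p in zip(train_meta, out_of_fold):
            row.append(p)
        for row, p in zip(test_meta, test_preds):
            row.append(p)
    meta_model.fit(train_meta, y_train)   # step 6: train predictions become features
    return meta_model.predict(test_meta)  # step 7


X_train = list(range(20))
y_train = [0] * 10 + [1] * 10
X_test = [2.5, 7.5, 12.5, 17.5]
final = stacked_predict([Stump, NearestNeighbour], VoteTable(), X_train, y_train, X_test)
print(final)  # [0, 0, 1, 1]
```

The important design point is that the meta-model is trained only on out-of-fold predictions, so it never sees a base prediction made by a model that was fitted on that same data point.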

**2)** **Blending: -**

· Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set to make predictions.

· In other words, unlike stacking, the predictions are made on the holdout set only.

· The holdout set and the predictions are used to build a model which is run on the test set. Here is a detailed explanation of the blending process:

1) The train set is split into training and validation sets.

2) Model(s) are fitted on the training set.

3) The predictions are made on the validation set and the test set.

4) The validation set and its predictions are used as features to build a new model.

5) This model is used to make final predictions on the test set and its meta-features.
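A compact sketch of the blending steps, again under stated assumptions: both base models and the meta-model are hypothetical toy functions, the data is synthetic, and the meta-model is simplified to use only the base predictions as features (the text also allows the validation features themselves). The meta-model simply searches a small grid for the weight that best mixes the two base predictions on the holdout set.

```python
from statistics import mean


def fit_mean(X, y):
    """Hypothetical base model 1: always predict the training mean."""
    m = mean(y)
    return lambda xs: [m for _ in xs]


def fit_nearest(X, y):
    """Hypothetical base model 2: 1-nearest-neighbour on one feature."""
    points = list(zip(X, y))
    return lambda xs: [min(points, key=lambda p: abs(p[0] - x))[1] for x in xs]


def squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


def fit_blender(meta_features, targets, steps=10):
    """Meta-model: pick w so that w*p1 + (1-w)*p2 best fits the holdout targets."""
    def blend(w, rows):
        return [w * p1 + (1 - w) * p2 for p1, p2 in rows]
    best_w = min((i / steps for i in range(steps + 1)),
                 key=lambda w: squared_error(blend(w, meta_features), targets))
    return lambda rows: blend(best_w, rows)


# Toy one-feature regression data (synthetic, for illustration only).
X = [float(x) for x in range(12)]
y = [2.0 * x + (x % 3) for x in X]
X_test = [2.5, 6.5, 10.5]

# Step 1: split the train set into training and validation (holdout) parts.
val_idx = [i for i in range(len(X)) if i % 3 == 2]
train_idx = [i for i in range(len(X)) if i % 3 != 2]
X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_val, y_val = [X[i] for i in val_idx], [y[i] for i in val_idx]

# Step 2: fit the base models on the training part only.
base_models = [fit_mean(X_tr, y_tr), fit_nearest(X_tr, y_tr)]

# Step 3: predict on the validation set and on the test set.
val_meta = list(zip(*(model(X_val) for model in base_models)))
test_meta = list(zip(*(model(X_test) for model in base_models)))

# Step 4: fit the meta-model on the validation predictions.
blender = fit_blender(val_meta, y_val)

# Step 5: final predictions from the test meta-features.
final = blender(test_meta)
print(final)
```

Because the weight grid includes 0 and 1, the blended model can never do worse on the holdout set than the better of the two base models.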

**3)** **Bagging: -**

Bagging combines the outputs of several models (for example, all decision trees) to produce a more generalized result. Here is a question: if you build all the models on the same set of data and combine them, will that be useful? Given the same input, there is a good chance that these models will produce the same result. One way around this is bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. In classic bootstrapping, each subset has the same size as the original set.

The bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution of the complete set. The subsets created for bagging may, however, be smaller than the original set.

1) Multiple subsets are created from the original dataset, selecting observations with replacement.

2) A base model (weak model) is created on each of these subsets.

3) The models run in parallel and are independent of each other.

4) The final predictions are determined by combining the predictions from all the models.
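The four steps can be sketched as follows. The base model (a 1-nearest-neighbour regressor), the function names, the synthetic data, and the choice of 25 models are all assumptions made for illustration. The models are fitted in a plain loop here, but since none depends on another they could equally be fitted in parallel.

```python
import random
from statistics import mean


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) observations with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_nearest_neighbour(rows):
    """Hypothetical base model: 1-nearest-neighbour regressor over (x, y) rows."""
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict


def fit_bagged_ensemble(rows, n_models=25, seed=0):
    rng = random.Random(seed)
    # Steps 2-3: one independent base model per bootstrap sample.
    return [fit_nearest_neighbour(bootstrap_sample(rows, rng))
            for _ in range(n_models)]


def bagged_predict(models, x):
    """Step 4: combine the individual predictions by averaging."""
    return mean([model(x) for model in models])


# Synthetic one-feature regression data.
rows = [(float(x), 2.0 * x + (x % 3)) for x in range(20)]
models = fit_bagged_ensemble(rows)
prediction = bagged_predict(models, 7.5)
print(len(models), prediction)
```

For a classification problem, the averaging in `bagged_predict` would be replaced by a majority vote.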

**4)** **Boosting: -**

Before we proceed any further, here is another question: if a data point is incorrectly predicted by the first model, and then by the next (probably by all models), will combining the predictions give better results? Boosting is designed for such situations.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works in the steps below.

1) A subset is created from the original dataset.

2) Initially, all data points are given equal weights.

3) A base model is created on this subset.

4) This model is used to make predictions on the whole dataset.

5) Errors are calculated using the actual values and predicted values.

6) The observations that are incorrectly predicted are given higher weights.

(Here, the three misclassified blue-plus points will be given higher weights)

7) Another model is created and predictions are made on the dataset.

(This model tries to correct the errors from the previous model)

8) Similarly, multiple models are created, each correcting the errors of the previous model.

9) The final model (strong learner) is the weighted mean of all the models (weak learners).

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but each works well for some part of it. In this way, each model boosts the performance of the ensemble.
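A minimal AdaBoost-style sketch of the reweighting loop described above. Several things are assumptions made for illustration: labels are encoded as -1 and +1, the weak learner is a one-split decision stump on a single feature, every round trains on the full dataset rather than a subset, and all function names are hypothetical.

```python
import math


def stump_predict(threshold, sign, x):
    """A decision stump: predict `sign` above the threshold, `-sign` otherwise."""
    return sign if x > threshold else -sign


def fit_stump(X, y, weights):
    """Find the stump with the lowest weighted error (labels are -1 or +1)."""
    best = None
    for threshold in sorted(set(X)):
        for sign in (1, -1):
            error = sum(w for x, label, w in zip(X, y, weights)
                        if stump_predict(threshold, sign, x) != label)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def adaboost(X, y, n_rounds=3):
    n = len(X)
    weights = [1.0 / n] * n                      # all points start with equal weight
    ensemble = []
    for _ in range(n_rounds):
        # Fit a weak learner and measure its weighted error.
        error, threshold, sign = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this learner's say in the final vote
        ensemble.append((alpha, threshold, sign))
        # Raise the weight of misclassified points, lower the rest, renormalise.
        weights = [w * math.exp(-alpha * label * stump_predict(threshold, sign, x))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Final model: the sign of the weighted sum of the weak learners."""
    score = sum(alpha * stump_predict(t, s, x) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(X, y, n_rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in X]
print(predictions == y)
```

On this toy data no single stump can separate the classes, because the negative points sit in the middle of the range. The weighted combination of three stumps can.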

**Explain at least two different types and when you will apply them.**

**Solution:**

Algorithms based on Bagging and Boosting: Bagging and Boosting are two of the most common Ensemble Learning strategies. For each of these strategies, here are some of the most common algorithms. In the near future, I’ll produce individual entries detailing each of these algorithms.

**Bagging algorithms:**

· Random Forest

· Bagging meta-estimator

**Boosting algorithms:**

· AdaBoost

· Gradient Boosting Machine (GBM)

· XGBoost

· LightGBM

· CatBoost

**Random Forest:**

· Another ensemble machine learning algorithm that uses the bagging technique is Random Forest.

· It is an extension of the bagging estimator algorithm.

· Random forest uses decision trees as its base estimators.

· Unlike the bagging meta-estimator, random forest considers only a randomly chosen subset of features when determining the best split at each node of a decision tree.

To summarize, a Random forest randomly selects data points and features and builds multiple trees (Forest).
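The feature-sampling step is what distinguishes a random forest from plain bagging, and it can be sketched in a few lines. The helper name `candidate_features` and the feature names are hypothetical. Using the square root of the number of features as the subset size is a common default, assumed here rather than taken from the text.

```python
import math
import random


def candidate_features(feature_names, rng):
    """Pick the random subset of features one tree node may split on."""
    k = max(1, int(math.sqrt(len(feature_names))))  # assumed subset size
    return rng.sample(feature_names, k)


rng = random.Random(0)
features = ["outlook", "humidity", "wind", "temperature"]  # hypothetical names

# Each node draws its own subset, so different nodes consider different features.
for node in range(3):
    print(node, candidate_features(features, rng))
```

The random choice of data points is the bootstrap sampling already used in bagging.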

**AdaBoost:**

· AdaBoost, or adaptive boosting, is one of the most basic boosting algorithms.

· Modeling is usually done with decision trees.

· Multiple sequential models are constructed, each one correcting the previous model’s flaws.

· AdaBoost assigns higher weights to the observations that are incorrectly predicted, and the following model works to predict these observations correctly.

**XGBoost:**

· Extreme Gradient Boosting (XGBoost) is a more advanced version of the gradient boosting method.

· XGBoost has been shown to be a powerful machine-learning algorithm that has been widely used in competitions and hackathons.

· XGBoost is often reported to be roughly 10 times faster than older gradient boosting implementations, and it has good predictive power.

· It also includes several forms of regularization, which reduce overfitting and improve overall performance. Hence it is also known as a ‘regularized boosting’ technique.

**LightGBM:**

· Before looking at how LightGBM works, it is worth asking why we need it when there are so many other boosting algorithms (like the ones we have seen above).

· LightGBM is designed for very large datasets, where it typically takes less time to run than the other algorithms.

· LightGBM is a gradient boosting framework that employs tree-based algorithms and grows trees leaf-wise, whereas most other techniques grow them level-wise.

· Leaf-wise growth splits whichever leaf gives the largest reduction in loss, while level-wise growth splits every leaf at the current depth before moving one level deeper.

**Bagging and Boosting Have Some Similarities:**

1. Both of these strategies are ensemble approaches for generating N learners from a single learner.

2. Both use random sampling to generate many training data sets.

3. Both arrive at the final decision by averaging the N learners (or by taking the majority of them, i.e. majority voting).

4. Both are effective at lowering variance and increasing stability.

**When to use Bagging and Boosting:**

· Bagging reduces prediction variance by generating additional training sets from the original dataset: sampling with replacement produces multisets of the original data. It is a natural choice when a single model overfits, that is, when its predictions change a lot from one training sample to another.

· Boosting is an iterative strategy that adjusts the weight of each observation based on the previous classification.

· Boosting increases the weight of an observation if it was misclassified.

· Boosting generally produces good predictive models, and it is a natural choice when a single model is too simple and underfits.

**Conclusion:** Ensemble modeling can dramatically improve your model’s performance and can even be the difference between first and second place!