In probability class, the most basic concept we learn is sampling. When pulling candies out of a bucket, we can use either of two sampling methods.

**Sampling without Replacement**

- A sample is extracted from the sample space and is not returned to the pool. The probabilities of the remaining outcomes change as more samples are pulled out (the overall sample size shrinks).

**Sampling with Replacement**

- When we pull a sample out of the sample space, it is returned afterwards, so the probability of drawing a given type of outcome stays the same (the probabilities are unchanged as extraction continues).
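The two methods can be sketched with Python's standard `random` module. The bucket contents and the helper `p_red_after_removing_one_red` are made up for illustration.

```python
import random

# Hypothetical bucket of candies: 3 red, 2 blue.
bucket = ["red", "red", "red", "blue", "blue"]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Without replacement: each candy can be drawn at most once,
# so we can never draw more candies than the bucket holds.
without_replacement = rng.sample(bucket, k=5)

# With replacement: every draw sees the full bucket again,
# so the same candy may come up repeatedly and k may exceed len(bucket).
with_replacement = rng.choices(bucket, k=8)

def p_red_after_removing_one_red(n_red, n_total):
    """Chance of red on the next draw once one red candy is gone for good."""
    return (n_red - 1) / (n_total - 1)
```

Without replacement, the chance of red drops from 3/5 to 2/4 after one red candy is removed; with replacement it stays at 3/5 on every draw.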

These two simple concepts will help us understand the idea behind one of the ensemble methods, called Bagging.

*K-Fold Split*: Sampling without Replacement

*K-Fold Split*

Let’s say we have a dataset X with a total of K rows, indexed 1 → K.

K-fold splits the entire dataset into k blocks (folds), and each new dataset is built by leaving one block out. In the extreme case where every block is a single row, each new dataset has K-1 rows (one row is removed) and there are K different datasets.

The problem with K-fold comes from the concept of independence. The K datasets created by the split are not identical, but any two of them differ only in the single row each one leaves out; the remaining K-2 rows are shared. This leads to **highly correlated datasets**, so models trained on them produce similar outputs.
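The overlap between the split datasets is easy to check. Here is a minimal sketch for the one-row-per-block case, with a made-up dataset of 6 rows and a hypothetical helper `leave_one_out_datasets`.

```python
def leave_one_out_datasets(rows):
    """Return len(rows) datasets, each missing exactly one row."""
    return [rows[:i] + rows[i + 1:] for i in range(len(rows))]

rows = list(range(1, 7))  # hypothetical dataset with K = 6 rows
datasets = leave_one_out_datasets(rows)

# Rows shared by the first two datasets: everything except rows 1 and 2.
shared = set(datasets[0]) & set(datasets[1])
```

With K = 6, each dataset holds 5 rows and any two of them share 4, which is why the resulting models end up so similar.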

The final output from the K-fold split is:

Each function f1 → fK is a model trained on one of the newly created datasets, i.e. the data with one row missing.

The outputs can be combined with different types of aggregate functions, but a simple average is the most common.
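For the simple-average case, the aggregation could be written as follows. The symbols are assumptions, since the original formula is not preserved here: $f_i$ is the model fitted on the i-th dataset, $x$ is a new input, and $\hat{y}(x)$ is the final output.

$$
\hat{y}(x) = \frac{1}{K} \sum_{i=1}^{K} f_i(x)
$$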

**Bootstrap Aggregating — Bagging**: Sampling with Replacement

Each member of the ensemble is constructed by sampling from a total of N data examples, choosing N items uniformly at random with replacement.

Each individual bootstrap (dataset) is created from the original dataset in this way. Because the selection is random and with replacement, the same sample may appear multiple times within a single dataset.
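A single bootstrap can be sketched by drawing row indices instead of the rows themselves. The dataset size, the seed, and the helper name `bootstrap_indices` are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)
n = 10  # hypothetical dataset size
indices = bootstrap_indices(n, rng)

counts = Counter(indices)                         # how often each row was drawn
left_out = sorted(set(range(n)) - set(indices))   # rows never drawn
```

Any row with a count above 1 is a duplicate inside this bootstrap, and `left_out` lists the rows the bootstrap missed entirely.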

An advantage of the bootstrap method is its randomness: the irreducible error, which is assumed to follow a normal distribution, takes a different distribution in each new bootstrap, and this diversifies the datasets. This ultimately makes **bagging suitable for low-bias, high-variance models (that is, highly complex models). In other words, it reduces overfitting by averaging predictions.**

From the Bias and Variance point of view, bagging would:

**Reduce the Variance**: Averaging over independent samples

The variance shrinks as the number of ensemble members grows.

**Bias Unchanged**: Averaged prediction has the same expectation

Even with multiple bootstraps, the expectation of the prediction remains the same.

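As explained in the Ensemble Overview post, reducing the variance by a factor of 1/m is a theoretical condition (complete uncorrelation between the datasets is impossible). Therefore, when the m sampled predictions each have variance σ² and pairwise correlation ρ, the variance of their average can be shown to be:

$$
\operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^{m} y_i\right) = \frac{1}{m}(1-\rho)\,\sigma^2 + \rho\,\sigma^2
$$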
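To get a feel for how correlation limits the benefit, here is a small numeric sketch. It assumes the standard result that the average of m predictions with variance σ² and pairwise correlation ρ has variance (1 − ρ)σ²/m + ρσ². The function name is made up.

```python
def variance_of_average(sigma2, rho, m):
    """Variance of the mean of m predictions with pairwise correlation rho."""
    return (1 - rho) * sigma2 / m + rho * sigma2

# Print how the variance shrinks with m for three correlation levels.
for rho in (0.0, 0.5, 1.0):
    print(rho, [round(variance_of_average(1.0, rho, m), 3) for m in (1, 10, 100)])
```

With ρ = 0 the variance falls as 1/m; with ρ = 1 adding more members does not help at all; anything in between flattens out at ρσ².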

One might ask: if building each bootstrap is all about probability, isn’t it possible that some sample never makes it into a bootstrap?

You are right! Even with a large number of bootstraps, the method relies on random draws, so there is always a chance that some data will be left out.

The probability that a given sample is missing from a single bootstrap of size N is (1 − 1/N)^N, which approaches 1/e ≈ 0.368 as N increases. In other words, there is roughly a 1/3 (about 37%) chance that a sample never appears in a given bootstrap, and roughly a 2/3 (about 63%) chance that it appears at least once.
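The limit can be checked numerically. This sketch assumes the left-out probability for one fixed row is (1 − 1/N)^N, which follows from N independent draws that each miss the row with probability 1 − 1/N.

```python
import math

def prob_left_out(n):
    """Probability that one fixed row is missing from a bootstrap of size n."""
    return (1 - 1 / n) ** n

# The value settles quickly as n grows.
for n in (10, 100, 10_000):
    print(n, round(prob_left_out(n), 4))

limit = math.exp(-1)  # value approached as n grows, about 0.3679
```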

**The samples that are not included in a given bootstrap are called Out Of Bag (OOB)** for that bootstrap, and they can be used to validate the model. In a typical machine learning evaluation, the dataset is split 70:30 into training and test sets. A similar split arises naturally in the bootstrap method: since about one third of the data is OOB for each bootstrap, we can test each model on data it has never seen during training.

Bagging can be used with any type of supervised machine learning algorithm.

Overview of Bagging:

- From the original dataset (observations), create M bootstraps by sampling with replacement.
- Create a model for each training set and train it on its designated dataset.
- Use each model to predict the testing dataset (the OOB samples serve as the testing dataset).
- Gather the predictions from all models and feed them into the aggregate function.
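The steps above can be sketched end to end in plain Python. Everything here is illustrative: the base learner is a made-up one-feature threshold rule (`fit_stump`), the toy data is invented, and majority voting is picked as the aggregate function.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Hypothetical base learner: the best single threshold on a 1-D feature."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            preds = [left if x <= t else right for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(xs, ys, m, rng):
    """Train m stumps on bootstraps; keep each stump's out-of-bag row indices."""
    n = len(xs)
    members = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1: bootstrap
        model = fit_stump([xs[i] for i in idx],      # step 2: train on it
                          [ys[i] for i in idx])
        oob = set(range(n)) - set(idx)               # rows this model never saw
        members.append((model, oob))
    return members

def oob_accuracy(members, xs, ys):
    """Steps 3-4: each row is voted on only by models that did not train on it."""
    correct = total = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        votes = [model(x) for model, oob in members if i in oob]
        if votes:
            total += 1
            correct += Counter(votes).most_common(1)[0][0] == y
    return correct / total if total else float("nan")

def predict(members, x):
    """Majority vote over all ensemble members for a new input."""
    return Counter(model(x) for model, _ in members).most_common(1)[0][0]

# Toy data (made up for illustration): the label is 1 when x > 5.
xs = list(range(1, 11))
ys = [int(x > 5) for x in xs]
members = bagging(xs, ys, m=25, rng=random.Random(0))
```

Calling `oob_accuracy(members, xs, ys)` gives a validation score without setting aside a separate test split, and `predict(members, x)` returns the ensemble's vote for a new value.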

**Result Aggregating Functions**

Aggregations can be done through multiple methods depending on the nature of the dataset.

Let’s say we have outcomes shown below:

## Majority Voting

This simply compares the predicted outcomes and picks the most frequent one.

Ex. Suppose there are 10 bootstraps in total.

- 7 of them are showing 1 as output

- 3 of them are showing 0 as output

In this case, majority voting gives 1 as the final outcome of the aggregate function.
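In code, the example above could look like this minimal sketch (the function name is made up):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

# The example above: 7 models predict 1, 3 models predict 0.
predictions = [1] * 7 + [0] * 3
final = majority_vote(predictions)
```

Note that `Counter.most_common` breaks ties by first appearance, so an even split would need an explicit tie-breaking rule.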

## Weighted Voting

The weight used in this function is the *Training Accuracy* of the individual models.

Compare the conditional probability:

P(y = 1 | X new) vs P(y = 0 | X new)

Ex. As we can see from the diagram of outcomes,

P(y = 1| X new) = (0.80 + 0.75 + 0.88 + 0.65 + 0.78 + 0.83) / #Models

P(y = 0 | X new) = (0.91 + 0.77 + 0.95 + 0.82) / #Models

Compare the two scores and choose the class with the higher one. Here the numerators are 4.69 for y = 1 and 3.45 for y = 0, so y = 1 is chosen.
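Here is a minimal sketch of accuracy-weighted voting using the training accuracies from the example. One assumption differs from the text: the scores are divided by the total weight rather than by the number of models, so that they sum to 1. The winner is the same either way, because both classes share the denominator.

```python
def weighted_vote(predictions, weights):
    """Sum the weights behind each label, normalise, and pick the largest."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    norm = sum(totals.values())
    scores = {label: s / norm for label, s in totals.items()}
    return max(scores, key=scores.get), scores

# Training accuracies from the example: six models predict 1, four predict 0.
acc_for_1 = [0.80, 0.75, 0.88, 0.65, 0.78, 0.83]
acc_for_0 = [0.91, 0.77, 0.95, 0.82]
predictions = [1] * len(acc_for_1) + [0] * len(acc_for_0)
weights = acc_for_1 + acc_for_0

winner, scores = weighted_vote(predictions, weights)
```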

Another weighted voting method uses the *Predicted Probability* for each class.

The same logic applies, but this time the predicted probability of y = 1 is used as the weight instead of the training accuracy. (A higher weight is applied when the model is confident about its prediction.)
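One way to sketch this is to average each model's predicted probability of y = 1. The five probabilities below are invented for illustration, and the 0.5 cut-off is an assumption for the two-class case.

```python
def soft_vote(probs_of_1):
    """Average each model's P(y = 1); the rest of the mass goes to y = 0."""
    p1 = sum(probs_of_1) / len(probs_of_1)
    return (1 if p1 >= 0.5 else 0), p1

# Hypothetical predicted probabilities of y = 1 from five models (made up).
probs_of_1 = [0.9, 0.4, 0.45, 0.8, 0.3]
label, p1 = soft_vote(probs_of_1)

# For comparison: how many models would vote 1 under plain majority voting.
hard_votes_for_1 = sum(p >= 0.5 for p in probs_of_1)
```

Only two of the five models favour y = 1, so majority voting would pick 0, but those two are confident enough to pull the average above 0.5.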

## Stacking

Using another prediction model (Meta-Classifier/Regressor) to aggregate the results

Input: Predictions made by ensemble members

Target: Actual true label
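The input/target setup can be sketched with a deliberately tiny meta-classifier. Everything here is an assumption for illustration: the meta-model is a lookup table rather than a real learner, and the base predictions and labels are made up.

```python
from collections import Counter, defaultdict

def fit_meta(base_preds, labels):
    """Tiny meta-classifier: for each pattern of base predictions,
    remember the true label seen most often with that pattern."""
    table = defaultdict(Counter)
    for pattern, y in zip(base_preds, labels):
        table[tuple(pattern)][y] += 1
    fallback = Counter(labels).most_common(1)[0][0]  # used for unseen patterns
    lookup = {p: c.most_common(1)[0][0] for p, c in table.items()}
    return lambda pattern: lookup.get(tuple(pattern), fallback)

# Input: hypothetical predictions from three ensemble members, one row per example.
base_preds = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (1, 1, 0), (0, 0, 1)]
# Target: the actual true labels for the same examples.
labels = [1, 1, 0, 0, 1, 0]

meta = fit_meta(base_preds, labels)
```

For the pattern `(1, 0, 0)` majority voting would output 0, while the meta-classifier outputs 1 because it has learned that the first member tends to be right. In practice the meta-model is usually a proper classifier or regressor trained on predictions for data the base models did not see.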

Ensemble techniques such as bagging and boosting are very common with tree models. CART models are low-bias, high-variance learners, so they typically perform better when bagging is applied.