Math Intuition behind AdaBoost…

Arun Amballa
Published in Analytics Vidhya · 13 min read · Jun 29, 2020


Boosting

Boosting techniques have recently been gaining popularity in Kaggle competitions and other predictive analytics tasks. I’ll try to explain the concepts of Boosting and AdaBoost as clearly as possible.

🔒Limitations of Bagging Models

In a Random Forest, each individual tree is independent, so it does not focus on the errors or predictions made by the previous tree. That information could actually help the next tree concentrate on certain aspects of the data and improve the overall performance.

For example, if a particular data point is incorrectly predicted by the first tree, there is a good chance that the second tree will also predict it incorrectly, and the same holds for all the other trees in the random forest. The trees have no relation to each other and work in parallel on different subsets of the data, so none of these individual trees learns from the others.

Is there any way we can use what the previous tree has learned to improve the overall performance? That is not possible in a random forest, because its trees are built in parallel. So we look at another technique: Boosting. In boosting, the models are dependent on each other, and each subsequent model tries to reduce the overall error by working on the incorrect predictions of the previous model. Hence these models are built sequentially.

💥Boosting:

Definition: The term ‘Boosting’ refers to a family of algorithms which convert weak learners into strong learners.

Before we go further, here’s another question for you: if a data point is incorrectly predicted by the first model, and then by the next (and probably by all of the models), will combining the predictions provide better results? Such situations are taken care of by boosting.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works in the below steps.

1. A subset is created from the original data set.

2. Initially, all data points are given equal weights.

3. A base model is created on this subset.

4. This model is used to make predictions on the whole data set.

5. Errors are calculated using the actual values and predicted values.

6. The observations which are incorrectly predicted are given higher weights.
(Here, the three mis-classified blue-plus points will be given higher weights.)

7. Another model is created and predictions are made on the data set.
(This model tries to correct the errors from the previous model.)

8. Similarly, multiple models are created, each correcting the errors of the previous model.

9. The final model (strong learner) is the weighted mean of all the models (weak learners).

10. Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire data set, but they work well for some part of the data set. Thus, each model actually boosts the performance of the ensemble. (A short code sketch of this loop follows below.)
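To make the flow above concrete, here is a minimal sketch of that loop in Python. It is only an illustration, not a full implementation: the toy data comes from scikit-learn's make_classification, the weak learner is a depth-1 decision tree, and doubling the misclassified weights is an arbitrary stand-in for the exact AdaBoost update derived later in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for "the original data set" in the steps above
X, y = make_classification(n_samples=200, random_state=0)

weights = np.full(len(X), 1 / len(X))   # step 2: all points start with equal weight
models = []

for _ in range(5):
    # steps 3-4: fit a weak model using the current weights, then predict on the whole set
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # steps 5-6: find the errors and give the misclassified points a larger weight
    misclassified = pred != y
    weights[misclassified] *= 2.0   # illustrative factor only; AdaBoost's exact update comes later
    weights /= weights.sum()        # keep the weights summing to 1

    models.append(stump)            # steps 7-9: repeat, then combine the weak learners
```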

🚩 Boosting algorithms:

AdaBoost

GBM

XGBM

Light GBM

CatBoost

💥 AdaBoost:

Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns higher weights to the observations which are incorrectly predicted and the subsequent model works to predict these values correctly.
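If you just want to try AdaBoost before digging into the math, scikit-learn ships an implementation whose default base learner is a depth-1 decision tree (a stump). A minimal sketch, assuming scikit-learn is installed and using made-up toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary-classification data, purely to demonstrate the API
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 sequential stumps; each one focuses on the mistakes of the previous ones
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```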

💢Three Concepts behind AdaBoost:

So, let's start by using decision trees and random forests to explain the three concepts behind AdaBoost.

Concept 1. In a Random Forest, each time you make a decision tree you make a full-sized tree. Some trees might be bigger than others, but there is no predetermined maximum depth. In contrast, in a forest of trees made with AdaBoost, the trees usually consist of just one node and two leaves, which are called ‘Stumps’.

Stumps

💫Stumps:

· Stumps are not great at making accurate Classifications.

· A tree with one node and two leaves is called a Stump. So here we speak of a Forest of Stumps instead of a Forest of Trees.

For example, if we are using the data given above (left) to determine whether someone has Heart Disease or not, then a full-sized Decision Tree would take advantage of all 4 variables that we measured (Chest Pain, Good Blood Circulation, Blocked Arteries and Weight) to make a decision, as shown in the image above (right). In contrast, a stump can use only one variable to make a decision, as shown in the image below. Thus, Stumps are technically weak learners.

Stump

Concept 2. In a Random Forest, each tree has an equal vote on the final classification. In contrast, in a Forest of Stumps made with AdaBoost, some stumps get more say (performance) in the final classification than others.

Concept 3. Lastly, in a Random Forest each decision tree is made independently of the others. In other words, the order in which the trees are built does not matter, because a random forest builds its trees in parallel. In contrast, in a Forest of Stumps made with AdaBoost, order is important, since the stumps are built sequentially.

🪐To review, three ideas behind AdaBoost are:

1. AdaBoost combines a lot of ‘Weak Learners’ to make classifications. The weak learners are ‘Stumps’.

2. Some Stumps get more say (performance) in the classification than others.

3. Each Stump is made by taking the previous stump’s mistake into account.

💥Intuition and Math behind AdaBoost:

Imagine we have our sample, divided into a training set and a test set, and a bunch of classifiers. Each of them is trained on a random subset of the training set (note that those subsets can actually overlap; it is not the same as, for example, cross-validation).

Then, for each observation in each subset, AdaBoost assigns a weight, which determines the probability that this observation will appear in the training set. Observations with higher weights are more likely to be included in the training set.

Hence, AdaBoost assigns higher weights to the observations that were misclassified, so that they represent a larger part of the next classifier's training set. The aim is that, this time, the next classifier trained will perform better on them.

Boosting Algorithms Working

🚦How to create Forest of Stumps Using AdaBoost:

Below is the data set on which we are going to construct a Forest of Stumps. We create a Forest of Stumps with AdaBoost to predict whether a patient has heart disease, based on the patient's Chest Pain, Blocked Arteries status and Weight.

Data Set

🧡STEP 1: Giving Each Sample a Weight

The first thing we do is give each sample a weight that indicates how important it is for it to be correctly classified. At the start, all samples get the same weight, which is 1/(total number of samples or records). Since the data set contains 8 records, each sample gets a sample weight of 1/8, which makes every record equally important. However, after we make the first stump, these weights change in order to guide how the next stump is created.

NOTE: The Sample weight is different from Patient weight.
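As a tiny sketch (assuming NumPy), the initial weights for the 8 records look like this:

```python
import numpy as np

n_samples = 8                                        # the toy heart-disease data set has 8 records
sample_weights = np.full(n_samples, 1 / n_samples)   # every record starts equally important
print(sample_weights)                                # eight values of 0.125, i.e. 1/8 each
```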

💛STEP 2: Making First Stump in the Forest.

Now we need to make the first stump in the forest. This is done by finding the variable, Chest Pain, Blocked Arteries or Patient Weight, that does the best job classifying the samples. So, based on the above data set we classify the samples.

NOTE: Because all of the weights are the same, we can ignore the Sample weight column in the data set.

We start by seeing how well Chest Pain classifies the samples. Of the 5 samples with chest pain, 3 were correctly classified as having Heart Disease and 2 were incorrectly classified.

Of the three samples without chest pain, two were correctly classified as not having Heart Disease and one was incorrectly classified.

Now we do the same thing for Blocked Arteries and Patient Weight.

Blocked Arteries
Patient Weight

NOTE: We used the decision tree technique to determine that 176 was the best weight threshold to separate the patients.

💙STEP 3: Now we calculate the Gini index or the entropy for the three stumps. I have used Gini index.

💥 Gini

Gini says that if we select two items from a population at random, then they must be of the same class; the probability of this is 1 if the population is pure.

It works with a categorical target variable such as “Success” or “Failure”.

It performs only binary splits.

The higher the value of Gini, the higher the homogeneity.

CART (Classification and Regression Tree) uses the Gini method to create binary splits.

🚩Steps to Calculate Gini for a split:

Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities of success and failure (p² + q²).

Calculate Gini for the split using the weighted Gini score of each node of that split.

🚩Split on Weight:

Gini for the left sub-node = (1)*(1) + (0)*(0) = 1

Gini for the right sub-node = (4/5)*(4/5) + (1/5)*(1/5) = 0.68

Weighted Gini for the split on Weight = (3/8)*1 + (5/8)*0.68 = 0.8

Gini Index = 1 - Weighted Gini = 1 - 0.8 = 0.2
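A quick way to double-check this arithmetic is to code the same calculation. The sketch below uses the class counts implied by the worked example; which node sits on which side of the 176 threshold is my reading of the description.

```python
def gini_score(counts):
    # The post's "Gini" score for a node: p**2 + q**2 over the class proportions
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

# Class counts for the split on Patient Weight > 176
heavy = [3, 0]   # weight > 176: 3 patients, all with heart disease (pure node)
light = [4, 1]   # weight <= 176: 4 without heart disease, 1 with it
n = sum(heavy) + sum(light)

weighted_gini = (sum(heavy) / n) * gini_score(heavy) + (sum(light) / n) * gini_score(light)
gini_index = 1 - weighted_gini
print(weighted_gini, gini_index)   # roughly 0.8 and 0.2, matching the calculation above
```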

Similarly, calculate the Gini Index for all the other splits using the same formula. We get the Gini Index values shown in the image below.

Gini Index

The Gini Index for the Weight attribute is the lowest, so this will be our First Stump.

First Stump

Now we need to determine how much say this stump will have in the final classification. As discussed earlier, some stumps get more say in the final classification than others, and we determine how much say a stump has based on how well it classified the samples. The above stump made one error, noted in the Incorrect column and shown in the red box below: this patient weighs less than 176 and has heart disease, but the stump says they do not, because the stump predicts Heart Disease only when weight > 176.

💜STEP 4: Calculating Amount of Say for Particular Stump

📢NOTE: Amount of Say can also be referred to as the Performance of a Stump.

The total Error for a stump is the sum of the weights associated with the incorrectly classified Samples. Thus, in this case the total Error is 1/8 since only one sample has been classified incorrectly.

Total Error

NOTE: Because all of the sample weights add up to 1, the Total Error will always be between 0 for a perfect stump (a stump that classifies every sample correctly) and 1 for a horrible stump (a stump that classifies every sample incorrectly).

We use the Total Error to determine the Amount of Say this stump has in the final classification, with the following formula.

Amount of Say

We can draw a graph of the Amount of Say by plugging in a range of values between 0 and 1 for the Total Error. The blue line in the image below shows the Amount of Say for Total Error values between 0 and 1. When a stump does a good job and the Total Error is small, the Amount of Say is a relatively large positive value. When a stump is no better than chance, i.e. half of the samples are correctly classified and half are incorrectly classified, the Total Error is 0.5 and the Amount of Say is 0. And when a stump does a terrible job and the Total Error is close to 1, in other words when the stump misclassifies all the samples, the Amount of Say is a large negative value.

NOTE: If the Total Error is exactly 1 or 0, the Amount of Say equation blows up. In practice, a small error term is added to prevent this from happening.

So, with the Patient Weight > 176 stump, the Total Error is 1/8.

Plugging this into the formula, the Amount of Say that the First Stump has in the final classification is ½ · ln((1 − 1/8) / (1/8)) = ½ · ln(7) ≈ 0.97.
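The formula in the image above is the standard AdaBoost one, Amount of Say = ½ · ln((1 − Total Error) / Total Error). A small sketch of it, where the epsilon clip plays the role of the "small error term" mentioned in the note:

```python
import numpy as np

def amount_of_say(total_error, eps=1e-10):
    # Clip the error so the log does not blow up when it is exactly 0 or 1
    total_error = np.clip(total_error, eps, 1 - eps)
    return 0.5 * np.log((1 - total_error) / total_error)

print(amount_of_say(1 / 8))   # ~0.97, the say of the first stump
print(amount_of_say(0.5))     # 0.0 -> a coin-flip stump gets no say
print(amount_of_say(0.95))    # a large negative value for a terrible stump
```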

💛STEP 5: Modify the Weights of every sample

Now we need to learn how to modify the weights so that the next stump will take the errors that the current stump made into account.

Let's go back to the First Stump we made. When we created it, all of the sample weights were the same, which meant we did not emphasize the importance of correctly classifying any particular sample. But since the First Stump incorrectly classified the sample shown in the red box below, we will emphasize the need for the next stump to classify it correctly by increasing its sample weight and decreasing all of the other sample weights.

Let's start by increasing the sample weight for the incorrectly classified sample (the sample shown in the red box above).

The formula we use to increase the sample weight of an incorrectly classified sample is: New Sample Weight = Sample Weight × e^(Amount of Say).

Now we need to decrease the sample weights for all of the correctly classified samples.

The formula we use to decrease the sample weight of a correctly classified sample is: New Sample Weight = Sample Weight × e^(−Amount of Say).
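Plugging the first stump's numbers into these two updates reproduces the 0.33 and 0.05 values used in the next step; a quick numeric sketch:

```python
import numpy as np

say = 0.97            # Amount of Say of the first stump
old_weight = 1 / 8    # every sample started with this weight

increased = old_weight * np.exp(say)     # misclassified sample: ~0.33
decreased = old_weight * np.exp(-say)    # correctly classified samples: ~0.05

print(round(increased, 2), round(decreased, 2))   # 0.33 0.05
```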

🧡STEP 6: Keep Track of the New Sample Weights and Normalize.

Now we need to Normalize the new Sample Weights so that they will add up to 1. Right now, if you add up the new sample weights you will get 0.05+0.05+0.05+0.33+0.05+0.05+0.05+0.05=0.68

So, we divide each new Sample weight by 0.68 to get the normalized values as shown below.

Now when we add the new sample Weights, we get 1 (plus or minus a little rounding error).

Now we just transfer the normalized sample weights to the Sample Weights column, since those are what we will use for the next stump, as shown in the image below.
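The normalization itself is a one-liner; a small sketch, with the 0.33 weight placed at the position of the misclassified sample as in the example above:

```python
import numpy as np

new_weights = np.array([0.05, 0.05, 0.05, 0.33, 0.05, 0.05, 0.05, 0.05])
normalized = new_weights / new_weights.sum()   # divide every weight by 0.68
print(normalized.round(2))                     # roughly [0.07 0.07 0.07 0.49 0.07 0.07 0.07 0.07]
print(normalized.sum())                        # 1.0, up to floating-point rounding
```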

💙STEP 7: Creating New Data set for the Next model.

We can make a new collection of samples that contains duplicate copies of the samples with the largest sample weights.

New and Empty Data Set

So, we start by making a new, but empty, data set that is the same size as the original. Then we pick a random number between 0 and 1 and see where that number falls when we use the sample weights like a distribution.

If the number is between 0 and 0.07, then we add the sample shown in the red box below to the empty data set.

If the number is between 0.07 and 0.14 (0.07 + 0.07), we put the second sample into the new data set. If the number is between 0.14 and 0.21 (0.14 + 0.07), we put the third sample into the new data set, and if the number is between 0.21 and 0.70 (0.21 + 0.49), we put the misclassified sample into the new data set. This process goes on until we fill the new data set. So, based on the random numbers we draw, we put the corresponding samples into the new data set.

NOTE: In this case we must put exactly 8 samples into the new data set; the new data set should be the same size as the original data set.

Since the misclassified sample has a larger weight, it has a higher probability than the other samples of getting into the new data set.

Now we get rid of the original data set and use the new data set to train the next model. We again assign equal sample weights and repeat the above steps until all the samples are correctly classified or a stopping criterion, such as the number of estimators, is reached.
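The "pick a random number and see which interval it falls into" procedure is simply weighted sampling with replacement, so a NumPy sketch looks like this (the misclassified sample is assumed to sit at index 3, as in the worked example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized sample weights from the previous step (misclassified sample at index 3)
weights = np.array([0.07, 0.07, 0.07, 0.49, 0.07, 0.07, 0.07, 0.07])
weights = weights / weights.sum()   # remove rounding error so the probabilities sum to exactly 1

# Draw 8 row indices with replacement; heavier rows are picked more often
indices = rng.choice(len(weights), size=len(weights), replace=True, p=weights)
print(indices)   # index 3 typically shows up several times in the new data set
```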

Now we need to talk about how a forest of stumps created by AdaBoost makes classifications.

Suppose we have trained 10 models. For a given data sample, 6 of these stumps predict that the person has Heart Disease and 4 predict that the person does not. We then calculate the sum of the Amount of Say of the six stumps that predicted Heart Disease and the sum of the Amount of Say of the four stumps that predicted no Heart Disease, and we classify the patient into the class with the higher total Amount of Say.
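A sketch of that final vote, with made-up Amounts of Say purely for illustration:

```python
# Hypothetical Amounts of Say for 10 stumps and the class each one votes for
says   = [0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.8, 0.6, 0.5, 0.4]
labels = ["heart disease"] * 6 + ["no heart disease"] * 4

totals = {}
for say, label in zip(says, labels):
    totals[label] = totals.get(label, 0.0) + say   # sum the Amount of Say per class

print(totals)                          # e.g. {'heart disease': 3.4, 'no heart disease': 2.3}
print(max(totals, key=totals.get))     # the class with the larger total say wins
```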

Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…
Thanks.

Credits and Sources:

  1. StatQuest
  2. www.Analyticsvidhya.com
