AdaBoost from Scratch

AdaBoost Algorithm Explained and Implemented

Okan Yenigün
6 min read · Jul 21, 2022

AdaBoost (Adaptive Boosting) is a boosting algorithm for classification developed by Yoav Freund and Robert Schapire in 1995; they won the Gödel Prize in 2003 for their work. AdaBoost (like other boosting methods) builds a strong learner out of a large number of weak learners.

A weak learner (or weak classifier) is a model that performs better than a random guess but still does not provide adequate performance on its own. With AdaBoost we combine many weak classifiers; each one focuses on the mistakes of the ones before it, so together they form a strong model. These weak unit classifiers are called base classifiers and are usually decision trees.

PseudoCode

Set equal weights for each observation: w = 1/n

for t = 1 to T:

    Train a weak classifier h_t on the weighted data

    Choose its coefficient alpha_t (importance) from its weighted error

    Update the observation weights

end for

Output: the final classifier H(x) = sign(sum over t of alpha_t * h_t(x))
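As a quick illustration of the pseudocode, here is a minimal sketch of that loop in Python, assuming decision stumps from scikit-learn as the weak classifiers (the function names are mine, not part of the original article):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    # y is assumed to be encoded as -1 / +1
    n = len(y)
    w = np.full(n, 1 / n)                      # equal weights for each observation
    stumps, alphas = [], []
    for t in range(T):
        # train a weak classifier (a stump) on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # total error = sum of the weights of the misclassified observations
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        # importance (amount of say) of this stump
        alpha = 0.5 * np.log((1 - err) / err)
        # increase the weights of misclassified observations, decrease the others
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()                        # normalize so the weights sum to one
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # output: the sign of the weighted vote of all stumps
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)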

Algorithm

Let’s go through an example for a better understanding of the algorithm. As seen in the table below, we have a dataset consisting of 8 observations and 3 features (2 categorical and 1 continuous). 4 observations have a positive target and 4 have a negative one.

Dataset

First, we initialize the weights equally: w = 1/n = 1/8 = 0.125

Weights initialized.

Now we create our base learners (decision trees). Each decision tree will have a depth of one. Such trees are called stumps.

Stump. Source: Wikipedia

For each feature, we create a stump. To choose which base learner starts the sequence, we calculate the entropy (or Gini index, whichever you prefer) of each stump and select the one with the lowest value.


Creating Stumps

For categorical features;

First stumps for categorical features

If we split the target according to feature1;

  • black category: 3 correct predictions and 2 incorrect predictions.
  • red category: 2 correct predictions and 1 incorrect prediction.
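To double-check that split with a weighted Gini impurity, here is a small sketch using only the counts listed above:

def gini(count_a, count_b):
    # Gini impurity of a node whose two classes have counts count_a / count_b
    total = count_a + count_b
    p = count_a / total
    return 2 * p * (1 - p)

# feature1: black -> 3 correct / 2 incorrect, red -> 2 correct / 1 incorrect
g_black = gini(3, 2)                           # 0.48
g_red = gini(2, 1)                             # 0.44
weighted = (5 / 8) * g_black + (3 / 8) * g_red
print(round(weighted, 2))                      # 0.47, the S1 value used below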

For continuous numerical features;

  • Sort the feature
  • Get midpoints for adjacent values
  • Calculate Gini impurity (or information gain) for each point split

In our case;

  • Sorted: 98, 119, 127, 128, 135, 136, 154, 157
  • Midpoints: 108.5, 123, 127.5, 131.5, 135.5, 145, 155.5
  • Gini Impurities: 0.429, 0.333, 0.467, 0.375, 0.2, 0.333, 0.429

135.5 is the best value for the split since it has the lowest impurity.
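The same computation can be scripted. Below is a small sketch; the article's table is shown as an image, so the +1/−1 labels here are reconstructed to be consistent with the impurities listed above (they may be flipped relative to the original):

import numpy as np

values = np.array([98, 119, 127, 128, 135, 136, 154, 157])   # sorted feature values
labels = np.array([1, 1, -1, 1, 1, -1, -1, -1])               # reconstructed labels

def gini(y):
    # Gini impurity of a group of binary labels
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)
    return 2 * p * (1 - p)

midpoints = (values[:-1] + values[1:]) / 2
for m in midpoints:
    left, right = labels[values <= m], labels[values > m]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    print(m, round(weighted, 3))
# the lowest impurity (0.2) occurs at the 135.5 split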

The Gini indexes of the 3 stumps are:

S1 = 0.47, S2 = 0.5, S3 = 0.2

The stump that has the lowest Gini value will be our first stump in the sequence.

Total Error

The total error is the sum of the weights of the incorrectly predicted observations. Our first stump makes 1 incorrect prediction, so the total error is 1 × 0.125 = 0.125.

Importance

We will now calculate the importance (amount of say, or influence) of the base classifier (stump): importance = 0.5 × ln((1 − Total Error) / Total Error). As the formula shows, the smaller the total error, the greater the importance of the base classifier.

Total error vs. importance. Source: see references.

When the total error is 0.5, the importance of the classifier is 0 (no better than tossing a coin). If the stump does a bad job (total error greater than 0.5), its importance becomes negative.

In our example, the total error is 0.125, so importance = 0.5 × ln(0.875 / 0.125) ≈ 0.97.
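In code, this is simply (a small sketch of the calculation above):

import numpy as np

total_error = 0.125   # weight of the single misclassified observation
importance = 0.5 * np.log((1 - total_error) / total_error)
print(round(importance, 2))   # 0.97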

Now we will update the weights so that the next stump takes the errors the current stump made into account.

The first stump misclassified a single observation. We will now increase the weight of that observation and decrease the weights of all the others.

weight = 1/8 × e^0.97 = 1/8 × 2.64 ≈ 0.33 => new, increased weight for the misclassified observation.

To decrease the weights of the correctly classified observations, we use the negative of the importance (e^−importance).

weight = 1/8 × e^−0.97 = 1/8 × 0.38 ≈ 0.05 => new, decreased weight for each correctly classified observation.

Finally, we normalize the new weights so that they add up to one.
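A small sketch of this update, reproducing the numbers above (here the misclassified observation is placed last purely for illustration; the article's table does not say which row it is):

import numpy as np

importance = 0.97
w = np.full(8, 1 / 8)                    # current weights, all 0.125
misclassified = np.zeros(8, dtype=bool)
misclassified[-1] = True                 # assume the last row was misclassified

# increase the weight of the wrong prediction, decrease the others
w = np.where(misclassified, w * np.exp(importance), w * np.exp(-importance))
print(w.round(2))                        # [0.05 0.05 ... 0.05 0.33]

# normalize so the weights add up to one
w = w / w.sum()
print(w.round(2))                        # roughly [0.07 0.07 ... 0.07 0.5]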

Now we build a new dataset of the same size by sampling from the old one: we draw random numbers between 0 and 1 and, depending on where each number falls in the cumulative distribution of the weights, we copy the corresponding observation into the new (initially empty) dataset. For example, if a randomly chosen value is between 0.14 and 0.21, we copy the 3rd observation.

An observation with a higher weight has a higher probability of being found in the new dataset.

Random dataset

From this point, we start from the beginning with our new dataset.
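A small sketch of this resampling step, assuming the normalized weights from above (in practice, drawing row indices with probabilities proportional to the weights achieves the same thing):

import numpy as np

rng = np.random.default_rng()

w = np.array([0.07] * 7 + [0.5])         # normalized weights (misclassified row last)
w = w / w.sum()                          # guard against rounding drift
cum = np.cumsum(w)                       # cumulative distribution of the weights

# draw 8 random numbers and find the cumulative bin each one falls into
draws = rng.random(len(w))
indices = np.minimum(np.searchsorted(cum, draws), len(w) - 1)
print(indices)                           # rows of the original dataset to copy over

# equivalent one-liner: rng.choice(len(w), size=len(w), p=w)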

Python Code

From Scratch

Let’s create a dummy dataset to test;

from sklearn.datasets import make_gaussian_quantiles

def generate_dummy(n):
    X, y = make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=2)
    return X, y * 2 - 1   # relabel the classes as -1 / +1

X, y = generate_dummy(n=100)
Plot of dummy data
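The AdaBooster class used in the test below is not reproduced in the text, so here is a minimal sketch of what such a class could look like. The class name, the constructor argument, and the per-100-iteration error print are taken from the test code and its output; everything else is a standard AdaBoost implementation with decision stumps and is my own assumption, not necessarily the author's exact code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class AdaBooster:
    def __init__(self, n_iterations=50):
        self.n_iterations = n_iterations

    def fit(self, X, y):
        # y is expected to be encoded as -1 / +1
        n = len(y)
        w = np.full(n, 1 / n)                           # equal initial weights
        self.stumps, self.alphas = [], []
        for t in range(self.n_iterations):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)       # importance of this stump
            w = w * np.exp(-alpha * y * pred)           # reweight the observations
            w = w / w.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)
            if t % 100 == 0:
                print(f"{t}th iteration; error: {err}")
        return self

    def predict(self, X):
        votes = sum(a * s.predict(X) for a, s in zip(self.alphas, self.stumps))
        return np.sign(votes)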

Test:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

A = AdaBooster(400)
A.fit(X_train, y_train)
y_pred = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred), 4))
#OUT:
0th iteration; error: 0.3500000000000007
100th iteration; error: 0.388889913934225
200th iteration; error: 0.43300048902688026
300th iteration; error: 0.4141223413353688
The ROC-AUC: 0.9286

Sklearn

from sklearn.ensemble import AdaBoostClassifier

Parameters;

base_estimator: object => the weak learner type; the default is a decision stump (a DecisionTreeClassifier with max_depth=1).

n_estimators: int => number of iterations; default is 50.

learning_rate: float => weight applied to each classifier at each boosting iteration; higher values increase the contribution of each classifier. The default value is 1.

A = AdaBoostClassifier(n_estimators = 400)
A.fit(X_train, y_train)
y_pred_sk = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred_sk), 4))
#OUT:
The ROC-AUC: 0.9286


References

https://blog.paperspace.com/adaboost-optimizer/

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

https://www.youtube.com/watch?v=LsK-xG1cLYA
