AdaBoost from Scratch

AdaBoost Algorithm Explained and Implemented

Okan Yenigün
6 min read · Jul 21, 2022

AdaBoost (Adaptive Boosting) is a boosting algorithm for classification developed by Yoav Freund and Robert Schapire in 1995; they won the Gödel Prize in 2003 for their work. AdaBoost (like other boosting methods) builds a strong learner out of a large number of weak learners.

A weak learner (or weak classifier) is a model that performs better than a random guess but still does not provide adequate performance on its own. With AdaBoost we combine many weak classifiers; each one focuses on the mistakes of the ones before it, so together they form a strong model. These weak unit classifiers are called base classifiers and are usually decision trees.

PseudoCode

Set equal weights for each observation: w = 1/n

for t = 1 to T:

    Train a weak classifier h_t on the weighted data

    Choose its coefficient alpha_t (importance) from its weighted error

    Update the observation weights

end for

Output: the final classifier H(x) = sign(sum over t of alpha_t * h_t(x))
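As a quick illustration of the pseudocode, here is a minimal sketch of that loop in Python, assuming decision stumps from scikit-learn as the weak classifiers (the function names are mine, not part of the original article):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    # y is assumed to be encoded as -1 / +1
    n = len(y)
    w = np.full(n, 1 / n)                      # equal weights for each observation
    stumps, alphas = [], []
    for t in range(T):
        # train a weak classifier (a stump) on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # total error = sum of the weights of the misclassified observations
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        # importance (amount of say) of this stump
        alpha = 0.5 * np.log((1 - err) / err)
        # increase the weights of misclassified observations, decrease the others
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()                        # normalize so the weights sum to one
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # output: the sign of the weighted vote of all stumps
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)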

Algorithm

Let’s go through an example for a better understanding of the algorithm. As seen in the table below, we have a dataset consisting of 8 observations and 3 features (2 categorical and 1 continuous). 4 observations have a positive target and 4 have a negative one.

Dataset

First, we initialize the weights equally: w = 1/n = 1/8 = 0.125

Weights initialized.

Now we create our base learners (decision trees). Each decision tree will have a depth of one. Such trees are called stumps.

Stump. Source: Wikipedia

For each feature, we create a stump. To choose which base learner starts the sequence, we calculate the entropy (or Gini index, whichever you prefer) of each stump and select the one with the lowest value.


Creating Stumps

For categorical features;

First stumps for categorical features

If we split the target according to feature1;

  • black category: 3 correct predictions and 2 incorrect predictions.
  • red category: 2 correct predictions and 1 incorrect prediction.
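To double-check that split with a weighted Gini impurity, here is a small sketch using only the counts listed above:

def gini(count_a, count_b):
    # Gini impurity of a node whose two classes have counts count_a / count_b
    total = count_a + count_b
    p = count_a / total
    return 2 * p * (1 - p)

# feature1: black -> 3 correct / 2 incorrect, red -> 2 correct / 1 incorrect
g_black = gini(3, 2)                           # 0.48
g_red = gini(2, 1)                             # 0.44
weighted = (5 / 8) * g_black + (3 / 8) * g_red
print(round(weighted, 2))                      # 0.47, the S1 value used below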

For continuous numerical features;

  • Sort the feature
  • Get midpoints for adjacent values
  • Calculate Gini impurity (or information gain) for each point split

In our case;

  • Sorted: 98, 119, 127, 128, 135, 136, 154, 157
  • Midpoints: 108.5, 123, 127.5, 131.5, 135.5, 145, 155.5
  • Gini Impurities: 0.429, 0.333, 0.467, 0.375, 0.2, 0.333, 0.429

135.5 is the best value for the split since it has the lowest impurity.
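The same computation can be scripted. Below is a small sketch; the article's table is shown as an image, so the +1/−1 labels here are reconstructed to be consistent with the impurities listed above (they may be flipped relative to the original):

import numpy as np

values = np.array([98, 119, 127, 128, 135, 136, 154, 157])   # sorted feature values
labels = np.array([1, 1, -1, 1, 1, -1, -1, -1])               # reconstructed labels

def gini(y):
    # Gini impurity of a group of binary labels
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)
    return 2 * p * (1 - p)

midpoints = (values[:-1] + values[1:]) / 2
for m in midpoints:
    left, right = labels[values <= m], labels[values > m]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    print(m, round(weighted, 3))
# the lowest impurity (0.2) occurs at the 135.5 split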

The Gini indexes of the 3 stumps are:

S1 = 0.47, S2 = 0.5, S3 = 0.2

The stump that has the lowest Gini value will be our first stump in the sequence.

Total Error

The total error is the sum of the weights of the incorrectly predicted observations. Our first stump makes 1 incorrect prediction, so the total error is 1 × 0.125 = 0.125.

Importance

We will now calculate the importance (amount of say, or influence) of the base classifier (stump): importance = 0.5 × ln((1 − Total Error) / Total Error). As the formula shows, the smaller the total error, the greater the importance of the base classifier.

Total error vs. importance. Source: see references.

When the total error is 0.5, the importance of the classifier is 0 (no better than tossing a coin). If the stump does a bad job (total error greater than 0.5), its importance becomes negative.

In our example, the total error is 0.125, so importance = 0.5 × ln(0.875 / 0.125) ≈ 0.97.
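In code, this is simply (a small sketch of the calculation above):

import numpy as np

total_error = 0.125   # weight of the single misclassified observation
importance = 0.5 * np.log((1 - total_error) / total_error)
print(round(importance, 2))   # 0.97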

Now we will update the weights so that the next stump takes the errors the current stump made into account.

The first stump misclassified a single observation. We will now increase the weight of that observation and decrease the weights of all the others.

weight = 1/8 × e^0.97 = 1/8 × 2.64 ≈ 0.33 => new, increased weight for the misclassified observation.

To decrease the weights of the correctly classified observations, we use the negative of the importance (e^−importance).

weight = 1/8 × e^−0.97 = 1/8 × 0.38 ≈ 0.05 => new, decreased weight for each correctly classified observation.

Finally, we normalize the new weights so that they add up to one.
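A small sketch of this update, reproducing the numbers above (here the misclassified observation is placed last purely for illustration; the article's table does not say which row it is):

import numpy as np

importance = 0.97
w = np.full(8, 1 / 8)                    # current weights, all 0.125
misclassified = np.zeros(8, dtype=bool)
misclassified[-1] = True                 # assume the last row was misclassified

# increase the weight of the wrong prediction, decrease the others
w = np.where(misclassified, w * np.exp(importance), w * np.exp(-importance))
print(w.round(2))                        # [0.05 0.05 ... 0.05 0.33]

# normalize so the weights add up to one
w = w / w.sum()
print(w.round(2))                        # roughly [0.07 0.07 ... 0.07 0.5]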

Now we build a new dataset of the same size by sampling from the old one: we draw random numbers between 0 and 1 and, depending on where each number falls in the cumulative distribution of the weights, we copy the corresponding observation into the new (initially empty) dataset. For example, if a randomly chosen value is between 0.14 and 0.21, we copy the 3rd observation.

An observation with a higher weight has a higher probability of being found in the new dataset.

Random dataset

From this point, we start from the beginning with our new dataset.
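A small sketch of this resampling step, assuming the normalized weights from above (in practice, drawing row indices with probabilities proportional to the weights achieves the same thing):

import numpy as np

rng = np.random.default_rng()

w = np.array([0.07] * 7 + [0.5])         # normalized weights (misclassified row last)
w = w / w.sum()                          # guard against rounding drift
cum = np.cumsum(w)                       # cumulative distribution of the weights

# draw 8 random numbers and find the cumulative bin each one falls into
draws = rng.random(len(w))
indices = np.minimum(np.searchsorted(cum, draws), len(w) - 1)
print(indices)                           # rows of the original dataset to copy over

# equivalent one-liner: rng.choice(len(w), size=len(w), p=w)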

Python Code

From Scratch

Let’s create a dummy dataset to test;

from sklearn.datasets import make_gaussian_quantiles

def generate_dummy(n):
    X, y = make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=2)
    return X, y * 2 - 1   # relabel the classes as -1 / +1

X, y = generate_dummy(n=100)
Plot of dummy data
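The AdaBooster class used in the test below is not reproduced in the text, so here is a minimal sketch of what such a class could look like. The class name, the constructor argument, and the per-100-iteration error print are taken from the test code and its output; everything else is a standard AdaBoost implementation with decision stumps and is my own assumption, not necessarily the author's exact code.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class AdaBooster:
    def __init__(self, n_iterations=50):
        self.n_iterations = n_iterations

    def fit(self, X, y):
        # y is expected to be encoded as -1 / +1
        n = len(y)
        w = np.full(n, 1 / n)                           # equal initial weights
        self.stumps, self.alphas = [], []
        for t in range(self.n_iterations):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)       # importance of this stump
            w = w * np.exp(-alpha * y * pred)           # reweight the observations
            w = w / w.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)
            if t % 100 == 0:
                print(f"{t}th iteration; error: {err}")
        return self

    def predict(self, X):
        votes = sum(a * s.predict(X) for a, s in zip(self.alphas, self.stumps))
        return np.sign(votes)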

Test:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

A = AdaBooster(400)
A.fit(X_train, y_train)
y_pred = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred), 4))
#OUT:
0th iteration; error: 0.3500000000000007
100th iteration; error: 0.388889913934225
200th iteration; error: 0.43300048902688026
300th iteration; error: 0.4141223413353688
The ROC-AUC: 0.9286

Sklearn

from sklearn.ensemble import AdaBoostClassifier

Parameters;

base_estimator: object => the weak learner type; the default is a decision stump (a DecisionTreeClassifier with max_depth=1).

n_estimators: int => number of iterations; default is 50.

learning_rate: float => weight applied to each classifier at each boosting iteration; higher values increase the contribution of each classifier. The default value is 1.

A = AdaBoostClassifier(n_estimators = 400)
A.fit(X_train, y_train)
y_pred_sk = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred_sk), 4))
#OUT:
The ROC-AUC: 0.9286


References

https://blog.paperspace.com/adaboost-optimizer/

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

https://www.youtube.com/watch?v=LsK-xG1cLYA
