AdaBoost from Scratch
AdaBoost Algorithm Explained and Implemented
AdaBoost (Adaptive Boosting) is a boosting algorithm for classification, developed by Yoav Freund and Robert Schapire in 1995; they won the 2003 Gödel Prize for this work. AdaBoost, like other boosting methods, builds a strong learner from a large number of weak learners.
A weak learner (or weak classifier) is a model that performs better than a random guess but still does not provide adequate performance on its own. AdaBoost combines many weak classifiers; each one progressively learns from the mistakes of the ones before it, so that together they form a strong model. These weak unit classifiers are called base classifiers and are usually decision trees.
PseudoCode
Set equal weights for each observation: w_i = 1/n
for t = 1 to T:
    Train a weak classifier h_t on the weighted data
    Compute its total error and choose coefficient alpha_t (importance)
    Update the weights: increase misclassified observations, decrease the rest, then normalize
end for
Output: final classifier H(x) = sign(sum over t of alpha_t * h_t(x))
Algorithm
Let’s go through an example for a better understanding of the algorithm. As seen in the table below, we have a dataset consisting of 8 observations and 3 features (2 categorical and 1 continuous). 4 observations have positive labels and 4 have negative labels.
Firstly, we initialize the weights equally: w = 1/n = 1/8 = 0.125
Now, we create our base learners (decision trees). Each decision tree will have a depth of one; such one-split trees are called stumps.
For each feature, we create a stump. To choose which base learner to start with first in the sequence, the entropy (or Gini index, whichever you choose) of the stumps is calculated. The stump with the lowest entropy is selected.
If you need a refresher on decision trees, see the references at the end.
Creating Stumps
For categorical features;
If we split the target according to feature1;
- black category: 3 correct predictions, 2 incorrect predictions
- red category: 2 correct predictions, 1 incorrect prediction
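As a quick check, the weighted Gini impurity of this feature1 split can be computed directly (the helper below is not from the original article, just a small illustration):

```python
def weighted_gini(groups):
    # groups: list of (correct, incorrect) counts, one tuple per category
    total = sum(c + i for c, i in groups)
    g = 0.0
    for c, i in groups:
        n = c + i
        p = c / n                       # fraction of one class in this category
        g += (n / total) * (1 - p**2 - (1 - p)**2)  # weighted Gini of the group
    return g

# black: (3 correct, 2 incorrect), red: (2 correct, 1 incorrect)
print(round(weighted_gini([(3, 2), (2, 1)]), 2))  # 0.47
```

This matches the S1 = 0.47 value reported below for the first stump.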
For continuous numerical features;
- Sort the feature
- Get midpoints for adjacent values
- Calculate Gini impurity (or information gain) for each point split
In our case;
- Sorted: 98, 119, 127, 128, 135, 136, 154, 157
- Midpoints: 108.5, 123, 127.5, 131.5, 135.5, 145, 155.5
- Gini Impurities: 0.429, 0.333, 0.467, 0.375, 0.2, 0.333, 0.429
135.5 is the best value for the split since it has the lowest impurity.
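The midpoint search above can be reproduced in a few lines of NumPy. The labels in sorted order are not visible in this excerpt's table; the ones below are an assumption chosen to be consistent with the impurities listed above:

```python
import numpy as np

values = np.array([98, 119, 127, 128, 135, 136, 154, 157])
# Assumed labels in sorted order, consistent with the Gini values above
labels = np.array([1, 1, -1, 1, 1, -1, -1, -1])

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)
    return 1 - p**2 - (1 - p)**2

midpoints = (values[:-1] + values[1:]) / 2
for m in midpoints:
    left, right = labels[values <= m], labels[values > m]
    # impurity of each side, weighted by its share of the observations
    w = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
    print(m, round(w, 3))
```

Running this reproduces the list of impurities and confirms 135.5 as the best split.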
Gini indexes of 3 stumps are;
S1= 0.47, S2=0.5, S3=0.2
The stump that has the lowest Gini value will be our first stump in the sequence.
Total Error
The total error is the sum of the weights of the incorrect predictions. We have 1 incorrect prediction, therefore total error is 0.125.
Importance
We will now calculate the importance (amount of say, or influence) of the base classifier (stump): importance = (1/2) × ln((1 − TotalError) / TotalError). As the formula shows, the smaller the total error, the greater the importance of the base classifier.
When the total error is 0.5, the significance value of the classifier is 0 (tossing a coin). If the stump consistently does a bad job, its importance becomes negative.
In our example, the total error is 0.125. So, importance = (1/2) × ln(0.875 / 0.125) = (1/2) × ln(7) ≈ 0.97
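The importance formula can be checked numerically:

```python
import math

def importance(total_error):
    # alpha = 1/2 * ln((1 - E) / E): larger when the stump's error is smaller
    return 0.5 * math.log((1 - total_error) / total_error)

print(round(importance(0.125), 2))  # 0.97 -> our stump's amount of say
print(importance(0.5))              # 0.0  -> a coin-flip stump has no say
```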
Now we will update the weights so that the next stump takes the current stump's errors into account.
The first stump misclassified a single observation. We will now increase the weight of that erroneous observation and decrease the others.
To increase a weight, we multiply it by e^importance (e^0.97 ≈ 2.64):
weight = 1/8 × 2.64 = 0.33 => new, increased weight
To decrease a weight, we use the negative of the importance, e^−importance (e^−0.97 ≈ 0.38):
weight = 1/8 × 0.38 ≈ 0.05 => new, decreased weight
Finally, we normalize the new weights so that they add up to one.
Now, at this stage, we build a new dataset by sampling: we draw random values between 0 and 1 and, for each draw, find where it falls in the cumulative distribution of the normalized weights; the corresponding observation is copied into a newly created, empty dataset. For example, if the randomly chosen value falls between 0.14 and 0.21, we copy the 3rd observation.
An observation with a higher weight has a higher probability of being found in the new dataset.
From this point, we start from the beginning with our new dataset.
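A minimal NumPy sketch of this weighted resampling, using the updated weights from the walk-through (one misclassified observation at 0.33, seven correct ones at 0.05 each):

```python
import numpy as np

# Updated (unnormalized) weights from the example above
weights = np.array([0.33] + [0.05] * 7)
weights /= weights.sum()  # normalize so they add up to one

rng = np.random.default_rng(0)
n = len(weights)

# Map uniform draws through the cumulative distribution of the weights;
# heavier observations cover wider intervals, so they are picked more often.
cumulative = np.cumsum(weights)
draws = rng.random(n)
indices = np.searchsorted(cumulative, draws)
print(indices)  # indices of the observations copied into the new dataset
```

Observation 0 (the misclassified one) covers nearly half of the cumulative range, so it tends to appear several times in the new dataset.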
Python Code
From Scratch
Let’s create a dummy dataset to test;
from sklearn.datasets import make_gaussian_quantiles

def generate_dummy(n):
    # two classes drawn from concentric Gaussian quantiles
    X, y = make_gaussian_quantiles(n_samples=n, n_features=2, n_classes=2)
    return X, y * 2 - 1  # map labels {0, 1} to {-1, +1}

X, y = generate_dummy(n=100)
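The AdaBooster class used in the test below is not shown in this excerpt. A minimal sketch consistent with the pseudocode might look like the following; note it reweights samples directly via sample_weight instead of the explicit resampling described above, which is a common implementation choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class AdaBooster:
    """Minimal AdaBoost sketch for labels in {-1, +1}, using stumps."""

    def __init__(self, n_rounds):
        self.n_rounds = n_rounds
        self.stumps, self.alphas = [], []

    def fit(self, X, y):
        n = len(y)
        w = np.full(n, 1 / n)  # start with equal weights, w = 1/n
        for t in range(self.n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w[pred != y])              # total error = sum of misclassified weights
            err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
            alpha = 0.5 * np.log((1 - err) / err)   # importance of this stump
            w *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
            w /= w.sum()                            # renormalize to sum to one
            self.stumps.append(stump)
            self.alphas.append(alpha)
        return self

    def predict(self, X):
        # Final classifier: sign of the importance-weighted vote of all stumps
        scores = sum(a * s.predict(X) for a, s in zip(self.alphas, self.stumps))
        return np.sign(scores)
```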
Test:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
A = AdaBooster(400)
A.fit(X_train, y_train)
y_pred = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred), 4))

#OUT:
0th iteration; error: 0.3500000000000007
100th iteration; error: 0.388889913934225
200th iteration; error: 0.43300048902688026
300th iteration; error: 0.4141223413353688
The ROC-AUC score of the model is: 0.9286
Sklearn
from sklearn.ensemble import AdaBoostClassifier
Parameters;
base_estimator: object => the base model; the default is a decision tree with max_depth=1 (a stump).
n_estimators: int => number of iterations; default is 50.
learning_rate: float => scales the weight (importance) applied to each base classifier at every iteration; higher values increase the contribution of each stump. The default value is 1.
A = AdaBoostClassifier(n_estimators = 400)
A.fit(X_train, y_train)
y_pred_sk = A.predict(X_test)
print('The ROC-AUC:', round(roc_auc_score(y_test, y_pred_sk), 4))

#OUT:
The ROC-AUC: 0.9286
References
https://blog.paperspace.com/adaboost-optimizer/
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html