Developing the Right Intuition for Adaboost From Scratch

Srijani Chaudhury · Published in The Startup · Aug 15, 2020 · 9 min read

Photo by Clay Banks on Unsplash

Have you ever wondered, while going through Kaggle notebooks, how the top competitors come up with brilliant models with such high accuracy? The answer, quite simply, is the use of ensemble methods in their models. AdaBoost is an iterative ensemble technique in which the whole data is fed into several base models (weak classifiers) and the final prediction is made by calculating the weighted sum of these base models. The most interesting feature of this algorithm is that each base model is created in such a way that it learns from the mistakes made by the previous ones; as a result, it iteratively reduces the training error and improves the classification performance tremendously. Over the years, AdaBoost has helped in developing face detection, hand detection, and human detection technology. In this blog, I have tried to help you develop an intuition about the classifier and how to build it from scratch.

AdaBoost constructs a global and optimal combination of weak classifiers based on sample reweighting.

Ensemble Method

To understand the working of AdaBoost, let us first understand what is meant by an ensemble technique. In this method, we first divide the given data into child datasets, then feed those datasets into various base models (like SVM or logistic regression), and finally predict the result by aggregating the predictions of all of these base models.

Ensemble methods were introduced to convert a Low Biased High Variance model into a Low Biased Low Variance model.

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.

Variance is the variability of the model's prediction for a given data point; it tells us the spread of our predictions.

A Low Biased High Variance model is one that performs well on the training data (that's why it is low biased), but as soon as some data outside the training data is provided, the model becomes faulty (that's why it is high variance). Whenever a single ML model is applied to the data (like SVM or Random Forest), there is only one model studying the entire trend of the data, so naturally the model fits the training data very well, but whenever new trends appear in the data, the model cannot cope. When we use an ensemble method, however, several models study the data, so there are different models to capture different trends in it. Unlike the single ML model, the ensemble learns the data rather than memorizing it and develops an intuition, so when new data points come, it uses that intuition to predict them.

Ensemble methods: Bagging and Boosting

1. Bagging (stands for Bootstrap Aggregation) is an ensemble method where data points are collected by random sampling (with replacement) of the data provided, to form child datasets that are fed into separate ML models, each trained independently of the others. In Boosting, the data points are initially sampled at random and fed into an ML model to produce predictions; then, depending on the accuracy of those predictions, a weighted resampling of the data is formed and fed into the ML model of the next iteration.

2. In Bagging, the base models that are created are independent of each other, and the final prediction is made by aggregating the values predicted by each of the base models. But in Boosting, the base models are dependent on each other: each model is created keeping in mind the errors made by the previous one, so that the training error is reduced gradually.

3. In Bagging, the final result is predicted by taking the aggregate of the predictions made by each base model, such that each base model has an equal say. In Boosting, the final prediction is the weighted aggregate of the predictions made by each base model, which means the base models do not have an equal say in the final prediction: a model with higher accuracy has a higher say than a model with lower accuracy.

Example of Bagging: Random Forest

Examples of Boosting: AdaBoost, Gradient Boosting, XGBoost
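If you would like to see the two families side by side before we build anything by hand, here is a quick, illustrative sketch using scikit-learn's off-the-shelf estimators (the toy dataset and settings below are my own choices, not taken from any particular benchmark):

```python
# Rough side-by-side of bagging-style and boosting-style ensembles.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "Bagging (full trees)": BaggingClassifier(random_state=0),
    "Random Forest":        RandomForestClassifier(random_state=0),
    "AdaBoost (stumps)":    AdaBoostClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```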

AdaBoost

As I have mentioned earlier, AdaBoost (stands for Adaptive Boosting) is an ensemble method generally used for classification problems. Like other boosting methods, it gives us the final prediction by feeding the data into base models and then taking the weighted combination of their individual predictions.

There are three major properties of Adaboost:

  • The base models of the algorithm are weak learners known as Decision Stumps.
  • All the base models do not have an equal say in the final prediction. The priority given to a model depends on the weight assigned to it.
  • The base models are dependent on each other: each base model is created depending upon the errors of the previous one.

Decision Stump

The base models, or weak learners, in the case of AdaBoost are known as Decision Stumps. Like Random Forest, AdaBoost can be called a forest of stumps. We apply a greedy search over each feature provided to us, find the best splitting feature and its threshold, and split the entire data based on that. This split data is then used for predictions. In other words, we are creating a decision tree with a single node.

The Decision Stump is thus a weak learner, which means that no matter what the distribution over the training data is, it will always do better than chance when it tries to label the data. Doing better than chance means it always has an error rate of less than 1/2.

These decision stumps are High Biased models: they perform weakly on both the training and the testing data (which is quite obvious when the entire data is being classified based on a single feature), but being so simple they also have low variance. By combining the decision stumps we arrive at a Low Biased Low Variance model which works well on both training and testing data.

But a common question that arises is why we use weak learners instead of strong learners (a fully grown decision tree). The answer to this is very simple: it reduces the chance of the ensemble model overfitting. If the model overfits, it will perform well on the training data but not on the testing data. In other words, the decision stumps keep the variance of the base models low, and it is the boosting of these stumps that takes the model from being high biased to low biased.
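To make the idea of a stump concrete, here is a minimal sketch of what one looks like in code (the class name, the polarity attribute, and the example values are my own illustrative choices; the greedy search that actually picks the feature and threshold comes in the from-scratch build below):

```python
import numpy as np

class DecisionStump:
    """A single-node 'tree': one feature, one threshold, predictions in {-1, +1}."""
    def __init__(self, feature_index, threshold, polarity=1):
        self.feature_index = feature_index   # which column to split on
        self.threshold = threshold           # where to split it
        self.polarity = polarity             # -1 flips the stump's decision

    def predict(self, X):
        preds = np.ones(X.shape[0])
        preds[X[:, self.feature_index] < self.threshold] = -1
        return self.polarity * preds

# e.g. "predict +1 whenever feature 2 is at least 1.5, otherwise predict -1"
X = np.array([[0.2, 3.1, 2.0],
              [1.4, 0.3, 0.9]])
stump = DecisionStump(feature_index=2, threshold=1.5)
print(stump.predict(X))   # [ 1. -1.]
```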

Let's now build the AdaBoost algorithm from scratch

Let's see the mathematics behind the base models. The final prediction is the sign of the weighted sum of the weak classifiers:

H(x) = sign( α1·h1(x) + α2·h2(x) + … + αT·hT(x) )

  • alpha (α) is the weight assigned to each base model, i.e. how much say it has in the final prediction.
  • h(x) is the Decision Stump classifier used in this case.
  • T is the number of base models used in this ensemble method. The number of base models to be used in AdaBoost depends highly on the data provided to us.

Thus we find out the weighted sum of the classifiers in order to predict the final answer, or better to say, to find out H(x).
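As a tiny numerical illustration of this weighted sum (the alphas and stump outputs below are made-up values, just to show the mechanics):

```python
import numpy as np

alphas      = np.array([0.7, 0.4, 0.9])          # weight (say) of each stump
stump_preds = np.array([[ 1, -1,  1],            # stump 1's predictions on 3 samples
                        [ 1,  1, -1],            # stump 2
                        [-1, -1,  1]])           # stump 3

H = np.sign(alphas @ stump_preds)                # weighted vote of the three stumps
print(H)                                         # [ 1. -1.  1.]
```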

Let's see the step by step process involved here:

Step-1: We first assign the same sample weight to each of the data points, so we assign 1/(number of samples) to every data point.
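In code, this initialization is a single line (the sample count here is just an example):

```python
import numpy as np

n_samples = 8                              # e.g. eight training points
w = np.full(n_samples, 1 / n_samples)      # every point starts with weight 1/N
print(w)                                   # [0.125 0.125 ... 0.125]
```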

Step-2: We create a decision stump and classify the entire data on the basis of a threshold. We select the best feature by a greedy search, find its threshold, and split the entire data on the basis of that; the best feature and threshold are identified by calculating the weighted error of each candidate split. We then use this weak learner to give us predictions.
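Here is one way the greedy search could look in code; `best_stump` is a hypothetical helper name of my own, and it assumes labels in {-1, +1} and weights that sum to 1:

```python
import numpy as np

def best_stump(X, y, w):
    """Try every feature and every observed threshold; keep the split
    with the lowest weighted error."""
    n_samples, n_features = X.shape
    best = {"error": float("inf")}
    for feature_i in range(n_features):
        for threshold in np.unique(X[:, feature_i]):
            polarity = 1
            preds = np.ones(n_samples)
            preds[X[:, feature_i] < threshold] = -1
            error = np.sum(w[preds != y])      # weighted misclassification rate
            if error > 0.5:                    # worse than chance: flip the decision
                error, polarity = 1 - error, -1
            if error < best["error"]:
                best = {"feature": feature_i, "threshold": threshold,
                        "polarity": polarity, "error": error}
    return best
```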

Step-3: We find the weight of the entire model, in other words, the value of alpha. As the error rate increases, this weight decreases: if we plot alpha against the error rate, alpha is large and positive when the error is close to 0, drops to zero at an error of 1/2, and becomes negative beyond that.

Now the error is found out as

ε = (misclassifications)/(samples) = (Misclassifications)/N

when every sample carries the same weight, or more generally

ε = Σ (weights of the misclassified samples)

If the error > 0.5, we just flip the decision of the stump and set error = 1 − error. (As we have seen, in the case of a weak learner the error cannot stay greater than 0.5, as the weak learner will always be better than chance.)

Now we find out the weight (alpha) of the model using the formula

α = (1/2) · ln( (1 − ε) / ε )
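In code, with a made-up error value of 0.2:

```python
import numpy as np

eps   = 1e-10                                     # guards against log(0) for a perfect stump
error = 0.2                                       # weighted error of the current stump
alpha = 0.5 * np.log((1 - error + eps) / (error + eps))
print(alpha)                                      # ≈ 0.693: an accurate stump gets a big say
```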

Step-4: Now we have to update the weights of all the data points. We give more importance to those data points that have been labeled incorrectly and increase their weights; this is done to penalize the errors more than the correct predictions. On the other hand, less importance is given to those data points that were predicted correctly, and their weights are decreased, according to the formula:

w(i) = w(i−1) · e^(±α)

The positive sign is for wrongly predicted data points and the negative sign is for correctly predicted data points, where w(i) is the updated weight and w(i−1) was the previous weight.

Step-5: We normalize the updated weights so that they lie between 0 and 1 and add up to one, using the formula:

w(new) = w(i) / (sum of all the weights)
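Steps 4 and 5 together look like this in code (a tiny made-up example with four points, two of them misclassified):

```python
import numpy as np

y     = np.array([ 1, -1,  1,  1])     # true labels
preds = np.array([ 1,  1,  1, -1])     # stump predictions (2nd and 4th are wrong)
w     = np.full(4, 0.25)               # current (uniform) weights
alpha = 0.69                           # weight of the current stump

# e^(+alpha) for misclassified points, e^(-alpha) for correctly classified ones
w = w * np.exp(-alpha * y * preds)
# normalize so the weights again add up to one
w = w / np.sum(w)
print(w)    # the two wrong points now carry about 0.40 each, the right ones about 0.10
```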

We just have to follow the steps mentioned above iteratively for each base model. The number of iterations or the number of base models required depends highly upon the data provided.

So let's create an AdaBoost class with fit() and predict() methods from scratch.
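The original code is not reproduced here, so below is a minimal, self-contained sketch that follows the five steps above, with the greedy stump search folded into fit() (the stump here stores its attributes after fitting rather than taking them in the constructor). It assumes NumPy arrays and labels encoded as {-1, +1}; names such as n_clf are my own choices:

```python
import numpy as np

class DecisionStump:
    """A one-level decision tree: threshold a single feature, predict +1 or -1."""
    def __init__(self):
        self.polarity = 1          # -1 flips the stump's decision
        self.feature_index = None  # which feature the stump splits on
        self.threshold = None      # the split value
        self.alpha = None          # how much say this stump gets

    def predict(self, X):
        preds = np.ones(X.shape[0])
        preds[X[:, self.feature_index] < self.threshold] = -1
        return self.polarity * preds


class AdaBoost:
    """AdaBoost for labels in {-1, +1}, built on decision stumps."""
    def __init__(self, n_clf=5):
        self.n_clf = n_clf     # number of base models (iterations)
        self.clfs = []

    def fit(self, X, y):
        n_samples, n_features = X.shape

        # Step 1: start with uniform sample weights
        w = np.full(n_samples, 1 / n_samples)

        self.clfs = []
        for _ in range(self.n_clf):
            clf = DecisionStump()
            min_error = float("inf")

            # Step 2: greedy search over every feature and threshold
            for feature_i in range(n_features):
                for threshold in np.unique(X[:, feature_i]):
                    p = 1
                    preds = np.ones(n_samples)
                    preds[X[:, feature_i] < threshold] = -1

                    error = np.sum(w[y != preds])   # weighted error of this split
                    if error > 0.5:                 # worse than chance: flip it
                        error = 1 - error
                        p = -1

                    if error < min_error:
                        min_error = error
                        clf.polarity = p
                        clf.threshold = threshold
                        clf.feature_index = feature_i

            # Step 3: weight (amount of say) of this stump
            EPS = 1e-10
            clf.alpha = 0.5 * np.log((1.0 - min_error + EPS) / (min_error + EPS))

            # Steps 4 and 5: re-weight the samples and normalize
            preds = clf.predict(X)
            w *= np.exp(-clf.alpha * y * preds)
            w /= np.sum(w)

            self.clfs.append(clf)

    def predict(self, X):
        # weighted vote of all the stumps
        clf_preds = [clf.alpha * clf.predict(X) for clf in self.clfs]
        return np.sign(np.sum(clf_preds, axis=0))
```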

So we have now created our own AdaBoost classifier from scratch. Let's test the model.

We basically import our AdaBoost classifier, create our accuracy function, import train_test_split, fit our classifier on the training data, and obtain our results.
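Since the original test script is not shown, here is a sketch of how that test could look, assuming the AdaBoost class above is in the same file and using scikit-learn's breast cancer dataset with an 80/20 split (the dataset choice and random_state are my assumptions, so your exact accuracy may differ):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

data = datasets.load_breast_cancer()
X, y = data.data, data.target
y[y == 0] = -1                         # our AdaBoost expects labels in {-1, +1}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

clf = AdaBoost(n_clf=5)                # the class defined above
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy(y_test, y_pred))
```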

So now let's check the accuracy of the model:

Accuracy: 0.9736842105263158

Multiclass Adaboost

If our label is multiclass, AdaBoost works in a similar manner to the two-class case, with minor changes. If a multiclass classification problem involves k classes, then instead of being a single value in {−1, 1}, y becomes a k-dimensional vector, and f(x) likewise outputs a k-dimensional vector. So in the error function, instead of multiplying the two scalars y and f(x), we take the dot product of the y and f(x) vectors, and then we try to minimize this new error function.
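Written out, the change to the error function looks roughly like this; the 1/k scaling and the specific vector encoding of y follow the SAMME formulation of multiclass AdaBoost, which may differ in detail from other variants:

```latex
% Two-class case: exponential loss with a product of two scalars
L(y, f) = \exp\left(- y \, f(x)\right), \qquad y \in \{-1, +1\}

% k-class case: y is a k-dimensional vector (1 in the true-class slot, -1/(k-1)
% elsewhere), f(x) is k-dimensional, and the product becomes a dot product
L(\mathbf{y}, f) = \exp\left(-\tfrac{1}{k}\, \mathbf{y}^{\top} f(\mathbf{x})\right)
```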

Conclusion:

The AdaBoost algorithm is a vast topic of discussion, and research papers are regularly published to tune and improve it. It does have drawbacks: data imbalance leads to a decrease in classification accuracy, and training is time-consuming because the best split point of the current weak classifier has to be searched for again at every iteration. Even so, it is one of the most widely used boosting techniques for classification problems, and I have tried to build a simple intuition for it here.
