AdaBoost as a Classifier & Regressor.
Machine Learning Boosting Algorithm
Hello all, welcome to this brand new article on boosting techniques. I’m writing this article to explain the intuition behind the AdaBoost algorithm in Machine Learning. It is not a black box that takes some inputs and returns outputs. I know most of you know how to implement these algorithms using the Scikit-Learn library in Python, but knowing what happens inside the algorithm is very necessary and helpful. So stick with me till the end of the article to learn the actual mathematics and intuition behind the AdaBoost algorithm.
There are different types of boosting algorithms in machine learning, and they are used for solving both classification and regression problems.
Some of the most common ones are:
- AdaBoost (Adaptive Boosting)
- Gradient Boosting
- XGBoost (eXtreme Gradient Boosting)
- CatBoost (Category Boosting)
In this article, I talk about AdaBoost as a classifier as well as a regressor.
First, I will describe the algorithm and then explain it using a dataset.
Algorithm
AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work.
1. Initialize the sample weights as w = 1/N for each of the N rows, so that every record starts out equally important.
2. Build a stump for each feature, select the one with the lowest Gini impurity, compute its total error and its contribution (amount of say), and use that contribution to update the sample weights.
3. Normalize the new sample weights so that their sum is 1.
4. Construct the next tree (stump) using the new weights.
5. In the end, combine the results from all the trees: for classification, the final result is the class with the highest total contribution (the weighted vote); for regression, it is the contribution-weighted combination of the individual stump predictions.
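To make these steps concrete, here is a minimal NumPy sketch of the classification version of this loop. The function and variable names are my own, the labels are assumed to be encoded as -1/+1, and fit_stump stands in for any routine that fits a one-level tree on weighted data; it is meant to mirror the steps above, not to be a production implementation.

```python
import numpy as np

def adaboost_train(X, y, n_stumps, fit_stump):
    """Sketch of the AdaBoost training loop for classification.

    y is assumed to be encoded as -1 / +1, and fit_stump(X, y, weights)
    is assumed to return a one-level tree (stump) with a .predict(X) method.
    """
    n = len(y)
    weights = np.full(n, 1.0 / n)            # step 1: equal sample weights
    stumps, contributions = [], []

    for _ in range(n_stumps):
        stump = fit_stump(X, y, weights)      # step 2: weak learner on weighted data
        pred = stump.predict(X)
        total_error = weights[pred != y].sum()

        # contribution ("amount of say") of this stump
        contribution = 0.5 * np.log((1 - total_error) / (total_error + 1e-10))

        # raise weights of misclassified rows, lower weights of correct rows
        weights *= np.exp(-contribution * y * pred)
        weights /= weights.sum()              # step 3: normalize to sum 1

        stumps.append(stump)                  # step 4: next stump uses new weights
        contributions.append(contribution)

    return stumps, contributions

def adaboost_predict(X, stumps, contributions):
    """Step 5: contribution-weighted vote of all stumps."""
    scores = sum(c * s.predict(X) for s, c in zip(stumps, contributions))
    return np.sign(scores)
```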
I know the steps I mentioned may seem a little weird to you; don’t worry, I’ll walk you through them line by line in the upcoming sections.
Explanation:-
Note:- In this article, I'll explain AdaBoost with the help of a classification problem, because for regression we follow the same steps; the only change you will see is that the errors are measured as residuals instead of misclassifications.
Let’s start by taking a small dataset and reading it using the pandas read_csv function:-
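Since the CSV file itself is not included here, the sketch below builds an equivalent toy DataFrame by hand; the column names follow the dataset shown above, but the individual values are made up for illustration only.

```python
import pandas as pd

# Illustrative stand-in for the article's dataset -- the column names follow
# the dataset above, but these particular values are made up for demonstration.
df = pd.DataFrame({
    "chest_pain":       ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
    "blocked_arteries": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
    "body_weight":      [167, 152, 180, 120, 172, 125, 210, 105],
    "is_heart_patient": ["Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# If the data lives in a CSV file instead (hypothetical file name):
# df = pd.read_csv("heart_patients.csv")
print(df)
```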
- In the above dataset, we have three feature columns:
  1. Is Chest Pain Present
  2. Are Any Arteries Blocked
  3. Weight of the Person
- And the target column is Is Heart Patient.
First iteration:-
- As we can see, there are a total of 8 rows in our dataset. So first, we’ll initialize the sample weight of each row as w = 1/N, where N is the total number of records, i.e. 1/8 here. We only do this in the first iteration; for the next iterations we take the normalized weights from the previous iteration. This is because, in the beginning, all the samples are equally important. After applying the weights (see the sketch below), the dataset looks like this:-
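A quick sketch of this initialization on the DataFrame from above (the column name sample_weight is my own choice):

```python
# Every row starts out equally important: w = 1 / N
df["sample_weight"] = 1 / len(df)   # 1/8 = 0.125 for our 8 rows
print(df)
```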
- We’ll consider each column in the above dataset, except the last two columns (the target and the sample weights), to create weak decision-makers, and then try to figure out which predictions are correct and incorrect based on that column.
- Stumps:-
A stump is simply a tree with one root node and two leaf nodes.
The pictures given below are examples of stumps.
At this point, we don’t know which of the three feature columns should be selected as the root of the stump.
To select a column, we need to find the Gini impurity of each column.
Gini Impurity:-
Gini Impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.
Formula:-
Gini impurity = 1 − Σᵢ pᵢ²
- Where pᵢ is the probability of class i in the node. (A small code sketch of this calculation follows.)
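A small sketch of how the Gini impurity of a candidate stump could be computed on a yes/no feature column; both helper names are my own.

```python
import numpy as np

def gini_of_leaf(labels):
    """Gini impurity of a single leaf: 1 - sum(p_i ** 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_stump(feature, target):
    """Weighted average Gini impurity of the two leaves of a stump
    that splits on a yes/no feature column."""
    total = len(target)
    impurity = 0.0
    for value in np.unique(feature):
        leaf = target[feature == value]
        impurity += (len(leaf) / total) * gini_of_leaf(leaf)
    return impurity

# e.g. gini_of_stump(df["chest_pain"].values, df["is_heart_patient"].values)
```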
After applying this formula on three feature columns we got:
G.I for chest pain tree= 0.47
G.I for blocked arteries tree= 0.5
G.I for body-weight tree= 0.2
And we select the tree with the lowest Gini impurity. Here, the body-weight column will be the first decision-maker for our model.
- Here we selected the stump whose root node is body-weight, so we use this stump to make predictions on the dataset.
- Now, we’ll calculate the contribution (amount of say) of this tree (stump), i.e. body-weight, to our final decision using the formula:
Contribution = ½ × ln((1 − Total Error) / Total Error)
- This stump classified only one record incorrectly out of the 8, hence the total error is 1/8 (see Figure 1).
- Putting this into the above contribution formula, we get:
Contribution ≈ 0.97
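A quick check of this number in code, with the total error of 1/8 from above:

```python
import numpy as np

total_error = 1 / 8
contribution = 0.5 * np.log((1 - total_error) / total_error)
print(round(contribution, 2))   # 0.97
```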
- We’ll now calculate the new weight for each row using the formulas given below:-
New weight = old weight × e^(+Contribution)   (for incorrectly classified rows)
New weight = old weight × e^(−Contribution)   (for correctly classified rows)
Basically, these formulas decrease the weights of the correctly classified rows and increase the weights of the incorrectly classified ones.
- After calculating the new weights for each row, we populate them into the data frame:
So the next step is to normalize the new sample weights and add them in a new column, Normalized weights:
These new normalized weights will act as the sample weights for the next iteration.
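Putting the update and the normalization together, a small sketch assuming we have the contribution value from above and a boolean array misclassified that marks the rows the stump got wrong:

```python
import numpy as np

# misclassified: boolean array, True for the rows the stump predicted wrongly
new_weights = np.where(
    misclassified,
    df["sample_weight"] * np.exp(contribution),    # wrong rows: weight goes up
    df["sample_weight"] * np.exp(-contribution),   # correct rows: weight goes down
)
df["normalized_weight"] = new_weights / new_weights.sum()   # sums to 1
```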
- So far, in the first iteration, we have one stump (body-weight) and the normalized sample weights.
Second iteration:-
- In this iteration, we create a new dataset and apply the same process we followed in the first iteration, using the normalized weights from the first iteration.
Explanation of the above sentence
The normalized weights lie between 0 and 1, and to create the new dataset we need the old one. First, we build cumulative ranges from the normalized weights; for the dataset above (Figure 3) the ranges would be 0.07–0.14, 0.14–0.21, 0.21–0.49, 0.49–0.56, and so on. Then, for each row we want to add, we draw a random value between 0 and 1 and add the row whose range contains that value. For example, if the random value is 0.28, it falls in the range 0.21–0.49, so the row corresponding to that range is added to the new dataset. Similarly, if the random value is 0.08, it falls in the range 0.07–0.14, and the corresponding row is added. This process goes on until the new dataset has the same number of rows as the old one. The benefit of creating the new dataset this way is that the rows which were classified incorrectly have larger weights and are therefore selected more than once, so the new stumps focus on learning them. A small sketch of this sampling step is shown after this paragraph.
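A sketch of this sampling step in NumPy; searchsorted does the "which cumulative range does the random value fall into" lookup for us. In practice, pandas' df.sample(n=len(df), replace=True, weights=...) achieves the same thing in one line.

```python
import numpy as np

rng = np.random.default_rng(42)

# cumulative ranges built from the normalized weights, e.g. 0.07, 0.14, 0.21, 0.49, ...
cumulative = df["normalized_weight"].cumsum().values

# draw one random value in [0, 1) per row and find the range it falls into
picks = np.searchsorted(cumulative, rng.random(len(df)))
picks = np.minimum(picks, len(df) - 1)   # guard against floating-point round-off

new_df = df.iloc[picks].reset_index(drop=True)
# heavily weighted (previously misclassified) rows tend to appear more than once
```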
- Once the new dataset is created, we apply the same steps to it as we did to the original dataset: finding the Gini impurity, the total error, and so on.
This process of creating stumps goes on until the total error stops decreasing (or until a chosen number of stumps has been built).
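This whole procedure is what Scikit-Learn's AdaBoostClassifier (and AdaBoostRegressor for regression) implements for us. A minimal usage sketch on the toy DataFrame from earlier, with the yes/no columns encoded as 0/1 (the parameter values are just illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier

# Encode the yes/no columns as 0/1; the default weak learner is already a
# depth-1 decision tree, i.e. a stump, as in this article.
X = (df[["chest_pain", "blocked_arteries"]] == "Yes").astype(int)
X["body_weight"] = df["body_weight"]
y = (df["is_heart_patient"] == "Yes").astype(int)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
clf.fit(X, y)
print(clf.predict(X))

# For a numeric target, AdaBoostRegressor is used in exactly the same way:
# from sklearn.ensemble import AdaBoostRegressor
# reg = AdaBoostRegressor(n_estimators=50).fit(X, y_numeric)
```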
Prediction Rule:-
- Suppose m trees (stumps) classify a person as a heart patient and n trees (stumps) classify the person as healthy. Then the contributions of the m trees and of the n trees are summed separately, and whichever group has the higher total decides the final classification.
For example, if the contributions of the m trees add up to 1.2 and the contributions of the n trees add up to 0.5, the final result goes in favour of the m trees and the person is classified as a heart patient. A tiny code version of this rule is shown below.
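In code, this rule is just a comparison of two sums of contributions; a tiny sketch assuming we kept each stump's contribution and vote from training in a list called stump_votes (a name I made up):

```python
# stump_votes: list of (contribution, vote) pairs collected during training
heart_patient_say = sum(c for c, vote in stump_votes if vote == "heart patient")
healthy_say       = sum(c for c, vote in stump_votes if vote == "healthy")

prediction = "heart patient" if heart_patient_say > healthy_say else "healthy"
# e.g. 1.2 vs. 0.5  ->  "heart patient"
```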
Thank you so much for reading this article till the end. I hope you now understand the concept behind the AdaBoost classifier. In the next article, I’ll be talking about Gradient Boosting.
See you in the next article till then……….
Happy Learning!