Gradient Boosting vs AdaBoost — the simplest explanation of boosting, with visuals and Python code

Rohit Madan
6 min read · Nov 22, 2019


I have been wanting to do this for a while and I am excited: I want to explain these mathematical and ML techniques in simple English, so here we go.

Gradient Boosting and AdaBoost are both great ways to get a better final predictor (classifier or regressor) by combining the results of multiple weaker predictors and then using all of their knowledge to make the final prediction.

First, let’s discuss the underlying principle of boosting before we get to the differences.

Both of these ML techniques fall under the Ensemble Learning umbrella, where we use multiple predictors to make the final prediction. Boosting differs in a big way, though: each new predictor is trained with the mistakes of the previous predictors in mind, and we repeat this until we get a better-fitting ensemble that has learnt from its mistakes and makes its final decision after taking advice from all the predictors.

Consider the image below (source: Aurélien Géron). Here we train multiple decision trees and, of course, use all of them together to make the final prediction.

Regressor 1 (a DecisionTreeRegressor) fits a model on a noisy quadratic dataset. The model is the green line; the blue dots are the data points.

Regressor 1

As you can see, the model is underfitting: it leaves out a lot of data points. But this is only model 1, from regressor 1.

Next, we keep track of all the points the model gets wrong (all the data points that do not sit on the line in the image above) and store them. Regressor 2 is then trained with these errors in mind (AdaBoost and Gradient Boosting have different tricks for how they use the wrongly predicted data points, covered below), and we get another model that looks like this:

Regressor 2

See how the stored errors from the previous predictor are being used: the new predictor now tries to fit exactly those data points with a new model.

Great, 2 predictors and 2 models. Aren’t we done yet?

No, look at how many points are still wrongly predicted.

I can understand the long face, but there is still work to do. Our predictors are not covering all the points yet, so what do we do?

We repeat the process and get regressor 3:

Regressor 3

Now we have 3 models, trained on datapoints 1, datapoints 2 and datapoints 3. An ensemble model, as you remember, makes the best use of these predictors: it asks all 3 to predict an output for a data point and then combines their answers (votes for a classifier, a sum or average for a regressor) into one final prediction. Boosting has helped each predictor learn from the mistakes of the previous one, so we now have 3 different predictors that help each other but can also answer independently when asked.
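Here is the same picture in code, as a minimal sketch in the spirit of Géron’s example (the noisy quadratic data and the variable names are my own): each new DecisionTreeRegressor is trained on the errors left behind by the previous one, and for regression the “vote” is simply the sum of all three predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic data, similar in spirit to the figures above
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Predictor 1: fit the original targets
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Predictor 2: fit the errors (residuals) left over by predictor 1
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Predictor 3: fit the errors left over by predictor 2
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# the ensemble prediction is the sum of all three predictors' outputs
X_new = np.array([[0.1]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))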

For example, say we have the classes Apple, Orange and Grapes.

A new fruit arrives and it is an Apple. With boosting, Predictor 1 says “Not Apple”. Predictor 2 says “you missed this one, let me learn from your mistake” and answers “Maybe Apple (about 60%)”. Predictor 3 says “Predictor 2, you’re good, but you missed this mistake” and answers “Definitely Apple”.

So when the user asks, the model takes all 3 pieces of advice and responds that this is an apple.

And that’s how simple boosting is. In fact, check the image below to see the stages of boosting, with the left side representing the individual predictors and the right side representing the ensemble predictions.

The final prediction is made by the model in the bottom-right corner. The models on the left are Predictor 1, Predictor 2 and Predictor 3; the images on the right are the ensemble models that use all the predictors available so far. Look at the middle row: there are 2 predictors by then, and the ensemble model uses Predictor 1 and Predictor 2 together to make its prediction. In the first row the ensemble is the same as Predictor 1, but in the middle row the ensemble also uses the Predictor 2 model, which is why the curve changes in the right-hand image of that row.

So what’s the difference between AdaBoost and Gradient Boosting?

This is the basic principle behind both AdaBoost and Gradient Boosting; the difference lies in how the mistakes are used. In AdaBoost, when Predictor 1 makes a model, the data points it predicts incorrectly get their weights increased (and each predictor itself is weighted too: the idea is to penalise predictors that make many wrong predictions and reward those that make right ones). The next predictor is trained on this re-weighted data, makes its own right and wrong predictions, and the wrong ones are again up-weighted for the predictor after that, and so on until we get a final ensemble that makes better predictions. The trick is to keep adjusting the weights on the data points and on the predictors, round after round.
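To make that concrete, here is a simplified sketch of the idea (my own toy version, not the exact SAMME algorithm that scikit-learn implements): X and y stand for any binary classification dataset, and the two moving parts are the sample weights that grow on misclassified points and the “say” each predictor earns from its accuracy.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_adaboost(X, y, n_predictors=3, learning_rate=0.5):
    m = len(X)
    sample_weights = np.ones(m) / m  # start with equal weight on every data point
    predictors, predictor_say = [], []

    for _ in range(n_predictors):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weights)
        y_pred = stump.predict(X)

        # weighted error rate: how much weight currently sits on the mistakes
        err = sample_weights[y_pred != y].sum() / sample_weights.sum()
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log below well behaved

        # a more accurate predictor gets a bigger say in the final vote
        alpha = learning_rate * np.log((1 - err) / err)

        # boost the weights of the points this predictor got wrong
        sample_weights[y_pred != y] *= np.exp(alpha)
        sample_weights /= sample_weights.sum()

        predictors.append(stump)
        predictor_say.append(alpha)

    return predictors, predictor_say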

This sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better ~ Aurélien Géron

Gradient Boosting uses the same boosting and ensemble learning principles, but instead of re-weighting the incorrectly predicted data points, it treats the leftover errors themselves as the training target: the new predictor is fitted to the residual errors made by the previous predictor, producing a new model. Predictions are then made with the updated ensemble, the remaining errors are again used to fit the next predictor, and so on until we reach a final ensemble that makes better predictions. The trick is to keep fitting new predictors to whatever errors are left, until there is very little left to correct.

Simple, right?

There is one more concept called the learning rate, often denoted by alpha. It typically ranges from 0 to 1 and is used as a hyperparameter in both AdaBoost and Gradient Boosting. It scales how much each new predictor contributes to the ensemble, which trades bias against variance in the final model. The rule of thumb is:

If your model is overfitting, it has too much variance, so reduce the learning rate (and usually add more trees); if it is underfitting, it has too much bias, so increase the learning rate or add more estimators.
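As a hedged illustration of that trade-off (X_train and y_train are placeholders for your own data): a large learning rate lets each tree contribute fully, while a small learning rate shrinks each tree’s contribution, so you typically need more trees to compensate.

from sklearn.ensemble import GradientBoostingRegressor

# aggressive: every tree contributes fully; with many trees this tends to overfit
gbrt_fast = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)

# regularised: each tree's contribution is shrunk, so more trees are needed; usually generalises better
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)

# gbrt_fast.fit(X_train, y_train)
# gbrt_slow.fit(X_train, y_train)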

Now comes the final stage of this article: how to do boosting using Python.

How to code this with a Python library?

First, let me show you what a gradient booster for regression looks like:

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

Simple, right?

Let’s do an AdaBoost classifier:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)

Just replace X_train and y_train with your own dataset and voilà, you’ve boosted your model.
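If you want something you can run end to end without any data of your own, here is a small sketch that swaps in scikit-learn’s make_moons toy dataset as a stand-in for X_train and y_train:

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy two-class dataset standing in for your real data
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)

print("AdaBoost test accuracy:", ada_clf.score(X_test, y_test))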

Did you like this simple way of explaining things, or would you like me to cover something for you?

You can send me feedback on how to improve at rohitmadan16@gmail.com

May the force be with you.

FIN.

P.S. I’d like to thank Aurélien Géron for the graphics and for literally showing the light in days of darkness.
