Gradient Boosting vs AdaBoost: the simplest explanation of boosting, with visuals and Python code

Rohit Madan · Published in Analytics Vidhya · 6 min read · Nov 21, 2019

I have been wanting to do this for a while and I am excited: I want to explain these mathematical and ML techniques in simple English, so here we go.

Gradient Boosting and AdaBoost are both great ways to build a better final predictor (classifier or regressor) by combining the results of multiple predictors and then using all of their knowledge to make a final prediction.

First, let's discuss the underlying principle of boosting before we get to the differences.

Both of these are ML techniques under the Ensemble Learning umbrella, where we use multiple predictors (classifiers) to make a final prediction. Boosting, however, differs in a big way from bagging or hard-voting classifiers: in boosting we also look at the mistakes of the previous predictors and train each new predictor on those mistakes, and we repeat this until we get a better-fitting predictor that has learnt from the mistakes and makes a final prediction after taking advice from all the predictors.

Consider the image below (source: Aurélien Géron). Here we are training multiple decision trees, and we are of course going to use all of the predictors to make the final prediction.

Regressor 1 (a DecisionTreeRegressor) fits a model on a noisy quadratic dataset; the model looks like the green line and the blue dots are the data points.

Regressor 1

As you can see, the model is of course underfitting, since it leaves a lot of data points uncovered, but this is only model 1 from Regressor 1.

Next we keep track of everywhere the prediction is wrong (so all the data points that do not sit on the line in the above image) and use those mistakes. Regressor 2 is now trained on them (after adjusting some weights, which I will cover in detail below) and we get another model which looks like this:

Regressor 2

See how the stored datapoints from the previous predictor are being used to build a model that fits those points better?

Great, 2 predictors and 2 models, aren't we done yet?

No, look at how many points are still wrongly predicted.

One more predictor, I promise.

I can understand the long face, but there is still work to do: our predictors are not predicting all points correctly yet, so what do we do?

We repeat the process again and we get Regressor 3:

Regressor 3

Now we have 3 models which predict on the dataset; however, each predictor only predicts well on some subset of it.

Ensemble models, as you remember, make the best use of these predictors: the ensemble uses all 3 predictors to predict the output for a datapoint and then combines their answers (adding up the predictions for regression, or taking a vote for classification).

See how boosting has helped: the 3 predictors have each learnt from the mistakes of the previous ones, so we now have 3 different predictors which not only help each other but also provide their own independent advice when asked.

To understand what I mean, consider this example:

We have a bowl of fruit with an apple, an orange and grapes, and my eyes are closed, so I don't know which fruit I have picked and I have to guess. (It's easy for a person, but the machine has to predict which fruit this is.)

So the problem requires the machine to learn all 3 fruits, you're following me, right?

Now say I add another Apple to the bowl and ask the machine what fruit I have added.

The machine builds its models: Predictor 1 says "Not an apple", Predictor 2 learns from the mistake of model 1 and says "Maybe an apple (about 60%)", and Predictor 3 says "Predictor 2, you're good but you missed this mistake" and answers "Definitely an apple".

So now when I ask the machine what fruit I have added, the algorithm takes all 3 pieces of advice and responds that this is an apple (because it takes votes from all 3 predictors and the majority was in favour of apple).

Simple right :)

And that's how simple boosting is. In fact, check the image below to see the stages of boosting, with the left side representing the individual predictors and the right side representing the ensemble predictions.

See each new predictor learning on the left, and the combined model (on the right) using the learnings of all the predictors so far to make the best prediction.

The final ensemble model is in the bottom-right corner; the models on the left in each row are Predictor 1, Predictor 2 and Predictor 3, and the images on the right are the ensemble models, which use all the predictors available so far to make a prediction.

So what's the difference between AdaBoost and Gradient Boosting?

This basic principle is the same for both AdaBoost and Gradient Boosting; the difference between the two techniques is how each new predictor learns from the previous one. In AdaBoost, when Predictor 1 makes a model, the data points that are incorrectly predicted are given a bigger weight, so Predictor 2 is trained while paying extra attention to exactly those points. Each predictor also gets its own weight based on how accurate it is (the idea is to penalize a predictor that makes many wrong predictions and to reward one that makes right predictions when their votes are combined). The points that are still wrongly predicted get their weights increased again for the next predictor, and so on, until we end up with a final ensemble that makes better predictions by taking a weighted vote of all 3 predictors. So the trick is to keep adjusting the weights of the data points (and of the predictors) until we have a set of predictors that combine to give a better result.
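To make that weight shuffling concrete, here is a minimal hand-rolled sketch of the classic AdaBoost loop. This is my own toy version, not sklearn's exact implementation: the dataset and variable names are made up purely for illustration, and I leave out the learning_rate that sklearn also applies.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset; labels must be -1 / +1 for this sketch.
rng = np.random.RandomState(42)
X = rng.randn(200, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

n_predictors = 3
sample_weights = np.ones(len(X)) / len(X)   # start with uniform instance weights
stumps, stump_weights = [], []

for _ in range(n_predictors):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weights)      # focus on heavily weighted points
    pred = stump.predict(X)

    err = sample_weights[pred != y].sum()               # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))     # predictor weight: accurate stumps count more

    sample_weights *= np.exp(-alpha * y * pred)         # boost weights of misclassified points
    sample_weights /= sample_weights.sum()              # re-normalize

    stumps.append(stump)
    stump_weights.append(alpha)

# Final prediction: weighted vote of all the stumps.
scores = sum(a * s.predict(X) for a, s in zip(stump_weights, stumps))
y_pred = np.sign(scores)
print("training accuracy:", (y_pred == y).mean())

The important lines are the two weight updates: misclassified points get heavier, and a more accurate stump gets a bigger say in the final vote.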

Remember: AdaBoost has some similarities with the Gradient Descent technique, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better. ~ Aurélien Géron

Gradient Boosting uses the same boosting principle, but instead of fiddling with instance weights, each new predictor is trained on the errors that are still left over: the residual errors of the previous predictors become the new training targets, and the new predictor tries to fit them, giving a new model. Predictions are then made with the updated ensemble, the remaining errors again become the targets for the next predictor, and so on, until we get a final predictor that makes better predictions. So the trick is to keep fitting the leftover errors with new predictors until very little error remains, and then use all the predictors together to predict the output (by simply adding up their predictions in the regression case).
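If you want to see this residual-fitting trick in code, here is a minimal sketch done by hand on a made-up noisy quadratic dataset (the names tree_reg1, tree_reg2 and tree_reg3 are just illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic dataset, similar to the one in the figures.
rng = np.random.RandomState(42)
X = rng.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(100)

# Regressor 1: fit the original targets.
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Regressor 2: fit the residual errors left by Regressor 1.
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Regressor 3: fit the residual errors left by the first two.
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# Ensemble prediction: simply add up the predictions of all three trees.
X_new = np.array([[0.1]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
print(y_pred)

Notice that the ensemble's answer is just the sum of the three trees' predictions, which is essentially what GradientBoostingRegressor does under the hood (with an extra learning_rate shrinkage applied to each tree).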

That’s all, if you got this you know both of them.

Sometimes our ensemble models overfit a lot, and sometimes they underfit, so to regularize the ensemble model we need to keep the learning rate in mind.

The learning rate, usually denoted by alpha, typically ranges from 0 to 1 and is used as a hyperparameter in both AdaBoost and Gradient Boosting. It scales how much each new predictor contributes to the ensemble, thereby trading off bias and variance in the model.

Just remember the rule of thumb:

If your model is overfitting, it has too much variance, so reduce the learning rate; if it's underfitting, it has too much bias, so increase the learning rate.
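As a rough sketch of that trade-off (not a tuning recipe, and the numbers below are arbitrary choices of mine), compare two GradientBoostingRegressor setups: a lower learning_rate regularizes more, but you usually need more trees to compensate.

from sklearn.ensemble import GradientBoostingRegressor

# Each tree contributes fully: fits fast, but can overfit noisy data.
gbrt_fast = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)

# Stronger regularization: each tree contributes less, so more trees are needed.
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)

# Both are fitted the same way, e.g. gbrt_slow.fit(X_train, y_train)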

Okay, theory is great, but how do you code this with Python?

First let me show you what a gradient booster for regression looks like (since we can do both classification and regression with AdaBoost and Gradient Boosting).

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

That’s all.

Let's do an AdaBoost classifier.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)

Remember: just replace X_train and y_train with your own training data and you can use this exact code.

I would highly recommend trying the moons dataset or the MNIST dataset with both AdaBoost and Gradient Boosting.
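For example, assuming you want a quick end-to-end run on the moons dataset (the noise level and the train/test split below are arbitrary choices of mine, not from the article), something like this should work:

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy two-class "moons" dataset; this is what X_train and y_train stand for above.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200, learning_rate=0.5)
ada_clf.fit(X_train, y_train)
print("test accuracy:", ada_clf.score(X_test, y_test))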

Did you like this simple way of explaining things?

If you want me to cover a topic in simple terms, or if you feel I can improve and have feedback, email me at rohitmadan16@gmail.com.

May the force be with you.

FIN.

P.S. I'd like to thank Aurélien Géron for the graphics and for literally showing the light in days of darkness.
