A Complete Guide To Boosting Ensembles

Boosting is an ensembling technique that is becoming more popular by the day. It works remarkably well, but people often treat it as a black-box model. This blog therefore gives a tour of boosting ensembles: an introduction, the mathematics behind them, their main variants, their implementation, and how to interpret their results.

But before understanding boosting ensembles, it’s better to understand what ensembles are and why they are preferred over single machine learning algorithms. Let’s start with the power of ensembles and then move on to boosting.

Ensembles — Introduction

In machine learning, ensembles are models that combine multiple models (known as base learners) to form a single predictive model.

But when some individual machine learning algorithms already work remarkably well, a few questions arise:

WHY are ensembles better than individual models? WHY are ensembles capable of minimizing error? WHY do distinct weak base learners form a stronger ensemble than similar base learners?

To answer this mighty “WHY” and to understand ensembles, let me frame the explanation with an example from day-to-day life, and then move on to the mathematics, the implementation in Python, and the concept behind each ensemble technique.

Suppose David wants to try a newly opened restaurant for dinner. Before going, he wants to be a little more sure about the place, so he decides to check the restaurant’s reviews.

Top reviews of restaurant
WHY are ensembles better than individual models?

Suppose David has to decide on the basis of a single review. Whether it is positive or negative, he is not really convinced about the place. But when he reads multiple reviews that mostly agree, he becomes a little more confident.

In much the same way, an ensemble can be considered stronger than a single classifier: one classifier’s decision can be doubtful, but when many classifiers reach similar decisions, a data scientist can be more confident in the machine’s output.

WHY are ensembles capable of minimizing error?
Venn diagram shows how error reduces for group decision

Suppose someone highly recommends the restaurant. Believing a single review is difficult, as nothing is perfect: maybe the dish the reviewer tried is the only good one the restaurant serves, or maybe the staff member who attended the reviewer is the only kind one the restaurant has. There can be many similar reasons. In a nutshell, following a single review leaves us a little biased.

Instead, David checked multiple reviews. Most of them mention that the staff is rude, the food is not up to the mark except for some Chinese dishes, and the ambience is good. This gives David a clearer picture: he can now judge the place better and has a concrete reason for going or not going.

This is what ensemble learning tries to achieve: rather than trusting a single algorithm or a single model, an ensemble averages out the results of multiple models. It sounds simple, but ensembles are an outstanding approach to a problem.
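The error reduction can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every base learner is correct with the same probability and that the learners err independently of each other (a strong assumption that real models only approximate). The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent learners, each correct
    with probability p, gives the right answer (n is assumed to be odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy of the vote for a growing number of learners that are each 60% accurate
for n in (1, 5, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the accuracy of the vote rises with the number of learners, as long as each learner is better than chance.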

WHY do distinct weak base learners form a stronger ensemble than similar base learners?

Answering this question is now easy, but let me relate it to the example above. Checking many reviews is not enough for David; he should also make sure that he gets a variety of reviews. What if all the reviewers are Chinese food lovers, or only tried some drinks but loved the ambience? There are many ways in which even a group of reviews can be biased. On the other hand, if the reviewers tried a variety of food and have different tastes, the reviews become a strong tool for deciding about the restaurant.

This is what every domain wants: to study a problem and view it from every possible angle, which is what makes ensembles so useful. In an ensemble, rather than going with similar base learners or nearly identical hyper-parameters, it is usually better to choose a variety of base learners.
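The value of variety can be illustrated with a small simulation. This is a sketch under simplified assumptions: "diverse" learners are modelled as making independent errors, and "similar" learners as exact copies of one learner. The accuracy of 0.6, the 11 learners and the helper `vote_accuracy` are illustrative choices, not taken from any real model.

```python
import numpy as np

def vote_accuracy(correct):
    """correct: boolean array of shape (n_learners, n_samples),
    True where a learner classifies a sample correctly."""
    majority_right = correct.sum(axis=0) > correct.shape[0] / 2
    return majority_right.mean()

rng = np.random.default_rng(0)
n_learners, n_samples, p = 11, 20000, 0.6

# Diverse learners: each one is right 60% of the time, independently of the others
diverse = rng.random((n_learners, n_samples)) < p

# Similar learners: 11 copies of one learner, so they all err on the same samples
single = rng.random(n_samples) < p
similar = np.tile(single, (n_learners, 1))

acc_diverse = vote_accuracy(diverse)
acc_similar = vote_accuracy(similar)
print('diverse:', acc_diverse, 'similar:', acc_similar)
```

The copies vote exactly like the single learner they were copied from, so only the diverse group gains anything from voting.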

It is now clear that ensembles are, in one or more ways, superior to individual machine learning algorithms. It is time to understand boosting ensembles and their variants.


Boosting — Introduction

Boosting, also known as a sequential ensemble, is formed by combining several weak base learners, each dependent on the previous ones, into a strong learner. It is an iterative technique: we first build a model on the training data, then build a second model that attempts to reduce the errors of the first, then a third that tries to reduce the errors of the second, and so on.

As shown in the diagram, when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts the weak learners into a better-performing model.
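The re-weighting step can be sketched in a few lines. The text does not name a specific weighting rule, so this example assumes the classic AdaBoost update as one common choice; `reweight` is a hypothetical helper and the labels are assumed to be encoded as -1 and +1.

```python
import numpy as np

def reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: raise the weights of misclassified samples.
    Assumes the weighted error is strictly between 0 and 0.5."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()      # weighted error of this hypothesis
    alpha = 0.5 * np.log((1 - err) / err)          # say of this hypothesis in the final vote
    new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha  # normalise so the weights sum to 1

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified
w0 = np.full(5, 0.2)                    # start with equal weights
w1, alpha = reweight(w0, y_true, y_pred)
print(w1, alpha)
```

After the update the misclassified sample carries more weight than any correctly classified one, so the next hypothesis is pushed to get it right.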

Because base learners are added iteratively, boosting models overfit easily; hence low-variance, high-bias models are used as base learners in boosting ensembles.

Defining boosting again: boosting can be considered a combination of high-bias, low-variance models (like decision trees with low depth), as boosting can reduce the bias of a model without much impact on its variance.

Algorithm

Gradient boosting decision trees(GBDT)

GBDT is a boosting ensemble technique that can be implemented easily with scikit-learn. Along with all the usual boosting features, scikit-learn also provides row sampling in GBDT, which makes it a very powerful algorithm.

As per the above algorithm, the final function for GBDT can be given as follows:

The problem with this function is that the model starts overfitting because of the additive aggregation of base learners. To reduce overfitting, we introduce a learning rate into this equation.

The learning rate shrinks the contribution of each tree by a constant factor, so overfitting can be controlled through a trade-off between the learning rate and the number of base learners.
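For reference, the two functions discussed above can be written in the standard gradient boosting form. Apart from k (the number of base learners) and v (the learning rate), the symbols are assumptions made for this sketch: h_0 is the initial model, h_i is the i-th base learner, and γ_i is its weight.

```latex
% Additive model without shrinkage
F_k(x) = h_0(x) + \sum_{i=1}^{k} \gamma_i \, h_i(x)

% Additive model with learning rate v
F_k(x) = h_0(x) + v \sum_{i=1}^{k} \gamma_i \, h_i(x), \qquad 0 < v \le 1
```

A smaller v means each tree contributes less, so more trees are usually needed to reach the same training error.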

We are now left with two main hyper-parameters:

  1. k, i.e. the number of base learners (decision trees)
  2. v, i.e. the learning rate
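To see where k and v enter, here is a minimal from-scratch sketch of boosting for regression. It assumes a squared-error loss (so each tree is fitted to the current residuals) and uses scikit-learn's `DecisionTreeRegressor` as the base learner; the function names and the toy sine data are hypothetical and are not the implementation used by scikit-learn or XGBoost.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, k=50, v=0.1, max_depth=2):
    """Fit k shallow trees one after another, each on the current residuals."""
    f0 = float(np.mean(y))                 # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(k):
        residual = y - pred                # what the model so far gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + v * tree.predict(X)  # shrink each tree's contribution by v
        trees.append(tree)
    return f0, trees

def predict_boosted_trees(f0, trees, X, v=0.1):
    return f0 + v * sum(tree.predict(X) for tree in trees)

# Toy regression data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

mse = {}
for k in (1, 10, 100):
    f0, trees = fit_boosted_trees(X, y, k=k, v=0.1)
    mse[k] = float(np.mean((y - predict_boosted_trees(f0, trees, X, v=0.1)) ** 2))
    print(k, mse[k])
```

The training error keeps falling as k grows, which is exactly why k and v have to be tuned on held-out data rather than on the training set.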

We can use simple cross-validation techniques to find the optimal values of these hyper-parameters.

The depth of the trees strongly influences overfitting. The usual depth of an individual tree is 2, 3, 4, or at most 5; a depth of more than 5 can lead to overfitting.

Along with its vulnerability to overfitting, GBDT has other limitations:

GBDT is considered a slow process; on a large dataset it usually takes much more time than other machine learning algorithms.

Because each tree depends on the trees built before it, the trees in GBDT (or other boosting algorithms) cannot be built in parallel.

Thankfully XGBoost, another boosting algorithm, comes with all the advantages of GBDT and overcomes its training-time limitation.

For implementing GBDT, we can refer to the sklearn documentation.
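As a starting point, a minimal sketch of GBDT in scikit-learn could look like this. The synthetic dataset, the grid values and the `subsample=0.8` row-sampling fraction are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# subsample < 1.0 turns on row sampling; max_depth keeps the trees shallow
gbdt = GradientBoostingClassifier(subsample=0.8, max_depth=3, random_state=0)

# Cross-validate over the two main hyper-parameters: k and the learning rate
param_grid = {'n_estimators': [25, 50], 'learning_rate': [0.05, 0.1]}
search = GridSearchCV(gbdt, param_grid, scoring='f1', cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))   # F1 score on the held-out split
```

A grid search is used here because the grid is tiny; with wider ranges a randomized search is usually cheaper.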

Extreme gradient boosting(XGBoost)

XGBoost comes with a little tweak to GBDT:

Scikit-learn’s GBDT can perform row sampling, and XGBoost adds per-tree column sampling on top of it (for example through its colsample_bytree parameter). Column sampling speeds up training, usually without compromising model performance. In XGBoost we get all the benefits of GBDT along with column sampling.

Implementation

Unlike most other machine learning algorithms, XGBoost is not part of sklearn, but it can be installed from a pre-built binary wheel available on the Python Package Index (PyPI).

For Classification

# X_train and y_train are assumed to be an already prepared training split
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier as xgbc

classifier = xgbc(random_state=0)
# Search space: number of trees, and a learning rate drawn uniformly from [0.0095, 0.1595]
tuned_param = {'n_estimators': list(range(1, 101, 10)),
               'learning_rate': stats.uniform(0.0095, 0.15)}
model = RandomizedSearchCV(classifier, tuned_param, scoring='f1', n_iter=15, cv=3)
model.fit(X_train, y_train)
print(model.best_estimator_)

For Regression

# X_train and y_train are assumed to be an already prepared training split
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor as xgbr

regressor = xgbr(random_state=0)
# Search space: number of trees, and a learning rate drawn uniformly from [0.0095, 0.1595]
tuned_param = {'n_estimators': list(range(1, 101, 10)),
               'learning_rate': stats.uniform(0.0095, 0.15)}
model = RandomizedSearchCV(regressor, tuned_param, scoring='explained_variance', n_iter=15, cv=3)
model.fit(X_train, y_train)
print(model.best_estimator_)

In XGBoost, learning_rate, max_depth and n_estimators are the most commonly tuned hyper-parameters.

learning_rate: the weight applied to each newly added tree. This is an important hyper-parameter for managing the trade-off between overfitting and underfitting.

max_depth: the maximum depth of a tree.

n_estimators: the number of base learners.

Interpretation

Building a model and applying different machine learning algorithms is simply a 2–3 line code change; what matters more is interpreting how the model trains and what its results mean.

Interpretation is a two-stage process: stage 1 is model interpretation and stage 2 is result interpretation.

Feature Importance plot:

A feature importance plot is the simplest way to depict the relative importance of features. It gives a picture of how much different features are responsible for the model’s decisions.

Sometimes training the model only on the most important features provides better results.

The above plot was generated using weight as the importance type; we can use the other importance types too, to be fully confident about the relative feature importance.

Three feature importance types are:

  1. Weight. The number of times a feature is used to split the data across all trees.
  2. Cover. The number of times a feature is used to split the data across all trees weighted by the number of training data points that go through those splits.
  3. Gain. The average training loss reduction gained when using a feature for splitting.
XGBoost feature importance plot comparison for different importance_type
Implementation of feature importance plot in python
import matplotlib.pyplot as plt
import xgboost
# model_1 is assumed to be an already trained XGBoost model
def plot_importance(importance_type):
    xgboost.plot_importance(model_1, importance_type=importance_type,
                            title='Feature importance plot with type = ' + importance_type)
    plt.show()

imp_type = ['weight', 'gain', 'cover']
for type_ in imp_type:
    plot_importance(type_)

How accurate is the feature importance plot?

A good plot stays a good plot no matter which model or hyper-parameter values we use. Therefore, if a plot is independent of the hyper-parameters and the algorithm used, it can be termed a good plot and we can rely on it.

Consider the diagram below:

Here there is some difference in the absolute values of feature importance, but the relative feature importance is the same in all plots; hence we can rely on the above feature importance with confidence.

Code to test feature importance plot in python
import matplotlib.pyplot as plt
import numpy as np
from xgboost import XGBClassifier as xgbc

# X_train and y_train are assumed to hold the six features listed in col
def fi_plot_comp(base, depth):
    col = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
    model = xgbc(learning_rate=0.1, max_depth=depth, n_estimators=base)
    model.fit(X_train, y_train)
    plt.figure(figsize=(4, 3))
    index = np.arange(len(col))
    plt.bar(index, model.feature_importances_)
    plt.xlabel('Features', fontsize=10)
    plt.ylabel('Relative importance', fontsize=10)
    plt.xticks(index, col, fontsize=10, rotation=90)
    plt.title('feature imp plot for xgboost when \n max_depth = ' + str(depth) + ' & #estimators= ' + str(base))
    plt.show()

# Compare the plots across different numbers of base learners and tree depths
for x in [50, 100]:
    for y in [3, 4]:
        fi_plot_comp(x, y)
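The visual comparison can also be checked numerically by comparing feature rankings instead of bar heights. This is a minimal sketch that assumes scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for the XGBoost model and dataset used above; any fitted estimator that exposes `feature_importances_` can be passed to the hypothetical helper `importance_ranking` in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def importance_ranking(model):
    """Feature indices ordered from most to least important."""
    return np.argsort(model.feature_importances_)[::-1]

# Synthetic data with 6 features, 3 of which carry signal
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

rankings = {}
for n_estimators in [50, 100]:
    for max_depth in [3, 4]:
        model = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                           learning_rate=0.1, random_state=0)
        model.fit(X, y)
        rankings[(n_estimators, max_depth)] = importance_ranking(model)

# If the rankings agree across settings, the relative importance is stable
for setting, ranking in rankings.items():
    print(setting, ranking)
```

If the printed rankings differ noticeably between settings, the importance plot should be trusted less.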

Feature impact plot — SHAP value plot

This plot is very useful for visualizing:

  1. how different features are used to generate the output for the test data.
  2. how much each feature has contributed to a collaborative prediction.

A feature impact plot can be visualized in two formats:

(a) Distribution format

(b) Bar-Graph format

Comparing the two formats, the distribution plot seems more informative: along with the impact of each feature on the output, it shows how the weights of different features are distributed, i.e. the feature variance.

SHAP values (SHapley Additive exPlanations): a SHAP value measures how much each feature in a model contributes to pushing a prediction toward either class, positive or negative.
Implementation of SHAP value plot — distribution format in python
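import shap
import xgboost

# model is assumed to be an already trained XGBoost model
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Distribution view of each feature's impact across the test set
shap.summary_plot(shap_values, X_test)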
Implementation of SHAP value plot — bar-graph format in python
shap.summary_plot(shap_values, X_test, plot_type="bar")
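To make the idea of a feature's contribution concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is a simplification for illustration only: missing features are replaced by a fixed baseline value, whereas the shap library uses faster, tree-specific algorithms and a background data distribution. The names `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Average marginal contribution of each feature over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature at its baseline
        prev = predict(current)
        for i in order:
            current[i] = x[i]             # reveal feature i
            new = predict(current)
            phi[i] += new - prev          # credit the change in output to feature i
            prev = new
    return [total / len(orderings) for total in phi]

def toy_model(f):
    # Feature 0 matters most, and it also interacts with feature 2
    return 2.0 * f[0] + 1.0 * f[1] + 0.5 * f[0] * f[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for x and the prediction for the baseline, which is the additive property the name refers to.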

Test sample interpretation plot

The results below were generated for two test samples: for the first the model predicts 1, and for the second it predicts 0.

Consider the distribution plot: it clearly shows that feature 1 and feature 0 dominate in deciding the class label as 1, and that the other features didn’t contribute much. Something similar can be seen in the bar-graph plot.

The distribution plot gives additional information: along with feature importance, it shows toward which class a given feature is pushing the test sample.

Similarly, from the label “0” distribution plot, it is clear that feature 1 dominates in deciding the class label, feature 0 and feature 2 also made some contribution, and the remaining features made an almost negligible contribution.

Implementation of test sample plot — distribution format in python
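import shap
import xgboost

# model is assumed to be a trained XGBoost model and X_test a NumPy array
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
test_id = 5  # index of the test sample to explain
shap.summary_plot(shap_values[test_id:test_id+1], X_test[test_id:test_id+1, :])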
Implementation of test sample plot — bar-graph format in python
shap.summary_plot(shap_values[test_id:test_id+1], X_test[test_id:test_id+1,:], plot_type="bar")

Conclusion

Ensembles usually provide great results, and among ensembles boosting is especially popular with data scientists, but people give up on it because of its low interpretability. Boosting ensembles are not as uninterpretable as often assumed; in fact, there are many ways to visualize them.

References

  1. Interpretable Machine Learning with XGBoost
  2. XGBoost Documentation
  3. GBDT Documentation