Ensemble Learning: Bagging, Boosting, Stacking and Cascading Classifiers in Machine Learning using the scikit-learn and MLxtend libraries.

Saugata Paul
27 min read · Nov 30, 2018

Hi guys! Today I will give you a deep understanding of how ensemble models in Machine Learning work. We will start with an intuition of various ensemble learning strategies which are deployed by thousands of Kagglers and dive deeper into making more accurate predictions using a group of machine learning models.

This blog is a slightly lengthy one, but I have tried to cover most strategies with diagrams and examples. I have used the MLxtend library extensively. Visit its official documentation here: http://rasbt.github.io/mlxtend/

Introduction.

Ensemble learning is a strategy in which a group of models is used to solve a challenging problem by strategically combining diverse machine learning models into one single predictive model. In general, ensemble methods are used to improve predictive accuracy: several different models, also known as the base learners, are combined to predict the result, instead of relying on a single model.

Why do we train so many different classifiers instead of just one? Well, using several models to predict the final result reduces the likelihood of giving too much weight to the decisions made by a poor model.

The more diverse these base learners are, the more powerful the final model will be. For squared-error loss, the expected generalization error of a model can be decomposed into bias squared + variance + irreducible error. The irreducible error is beyond us! We cannot reduce it. However, by using ensemble techniques, we can reduce the bias and variance of a model, which reduces the overall generalization error.
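For reference, the decomposition mentioned above is usually written as follows for squared-error loss. The symbols are notation assumed here rather than taken from this post: $f$ is the true function, $\hat{f}$ is the learned model, and $\sigma^2$ is the variance of the noise in the labels.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the model, which is why ensemble techniques target them.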

The bias-variance trade-off is one of the most important ideas for telling a robust model apart from an inferior one. In machine learning, models which have a high bias tend to have a lower variance and vice-versa.

1. Bias: Bias is an error which arises due to false assumptions made in the learning phase of a model. A high bias can cause a learning algorithm to skip important information and correlations between the independent variables and the class labels, thereby under-fitting the model.

2. Variance: Variance tells us how sensitive a model is to small changes in the training data, that is, by how much the fitted model changes when the training data changes. High variance makes a model prone to fitting the random noise present in the dataset, thereby over-fitting the model.

To understand the Bias-Variance trade-off in Machine Learning models in more detail you can check this article: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

You can think of ensemble learning as analogous to the board of directors in a company, where the final decision is taken by the CEO. Instead of taking a decision all by himself, the CEO takes inputs from each of the board members before arriving at a final conclusion. The CEO, in this case, is the final model and the board members are the base learners which provide independent inputs to the CEO. This reduces the chance of committing an error when the CEO makes his final decision.

We use this approach regularly in our daily lives as well. For example, we ask for the opinions of different experts before arriving at conclusions, we read different product reviews before buying a product, and a panel of judges consults among themselves to declare a winner. In each of the above scenarios what we are actually trying to achieve is to minimize the likelihood of an unfortunate decision made by one person (in our case a poor model).

Typically, ensemble learning can be divided into four categories:

1. Bagging: Bagging is mostly used to reduce the variance in a model. A simple example of bagging is the Random Forest algorithm.

2. Boosting: Boosting is mostly used to reduce the bias in a model. Examples of boosting algorithms are Ada-Boost, XGBoost, Gradient Boosted Decision Trees etc.

3. Stacking: Stacking is mostly used to increase the prediction accuracy of a model. For implementing stacking we will use the mlxtend library, a separate package that works alongside scikit-learn.

4. Cascading: This class of models can be very accurate. Cascading is mostly used in scenarios where you cannot afford to make a mistake. For example, a cascading technique can be used to detect fraudulent credit card transactions, or when you want to be as sure as possible that you don’t have cancer.

In this article, I will mostly explain the ensemble learning strategies called Bagging, Boosting and Stacking with some code samples. I will also try to give an intuitive understanding of what cascading means and when we should use it.

For the purpose of this blog, I have chosen to work with the Iris dataset because of its simplicity. You can try all these techniques and apply them to any real-world dataset you want. However, the performance will vary across datasets. The Iris dataset can be downloaded from this link: https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv.

For those of you who aren’t aware of what the Iris dataset is, it’s basically a dataset which contains information about three species of flowers: setosa, versicolor, and virginica. The four features that distinguish these species from one another are petal length, sepal length, petal width, and sepal width. The main aim of a classification model is to learn the relationships between the features and the class label and classify the three species of flowers.

Let us draw the pair plots for all features in the Iris dataset. In this way, we can visually see how each feature separates the data with respect to the other.

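#Load the IRIS dataset and display pair plots.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris_dataset = pd.read_csv("iris.csv")
#The first four columns are the features; the fifth column ("species") is the class label.
X, y = iris_dataset.iloc[:,0:4], iris_dataset.iloc[:,4]
#Pair plots for iris dataset
plt.close()
sns.set_style("whitegrid")
#Note: seaborn versions before 0.9 call the "height" parameter "size".
sns.pairplot(iris_dataset, hue="species", height=3)
plt.show()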
Pair Plots to visually differentiate between the three species of flowers, based on all the independent features.

We can see that petal length and petal width are the two features that visually separate the three classes of flowers most clearly. We can tell this just by looking at the pair plots, even before training any machine learning model! Now let us try out different ensemble techniques to find out how they behave on the Iris dataset.

Bagging Classifiers (Bootstrap Aggregation)

The first ensemble technique that I will discuss is Bagging. Bagging stands for bootstrap aggregation. Bagging is one of the earliest ensemble algorithms, and an interesting and very powerful one. The core idea of bagging is to use bootstrapped replicas of the original dataset and use them to train different classifiers.

We will create subsets by randomly sampling a bunch of points from the training dataset, with replacement. Now we will train individual classifiers on each of these bootstrapped subsets. Each of these base classifiers will predict the class label for a given problem. This is where we combine the predictions of all the base models. This part is called the aggregation stage.

Typically, a simple majority vote is used for classification, and the mean of all predictions is used for regression, to combine the base models into the final output of the ensemble. A simple example of such an approach is the Random Forest algorithm. Bagging reduces the high variance of a model, thereby reducing the generalization error. Bagging is particularly useful when you have very limited data: by using bootstrapped samples we are able to get an estimate by aggregating the scores over many samples.
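The two stages described above (bootstrap, then aggregate) can be sketched by hand in a few lines. This is only a minimal illustration, not the way scikit-learn implements bagging internally. It assumes the copy of Iris bundled with scikit-learn, decision trees as base learners, and an arbitrary choice of 25 estimators.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris ships with scikit-learn, so no download is needed for this sketch.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rng = np.random.default_rng(0)
n_estimators = 25  # hypothetical choice, for illustration only

# Bootstrap stage: each tree is trained on a sample drawn with replacement.
trees = []
for _ in range(n_estimators):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregation stage: majority vote over the predictions of all trees.
all_preds = np.array([t.predict(X_test) for t in trees])  # (n_estimators, n_test)
bagged_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_preds.T])

bagged_accuracy = (bagged_pred == y_test).mean()
print("Bagged accuracy: %0.4f" % bagged_accuracy)
```

For a regression task the vote would simply be replaced by the mean of the individual predictions.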

Please take a look at the below diagrams to get an intuitive understanding of how bagging works at different stages.

Bagging Algorithm. Link: https://www.researchgate.net/figure/272682704_fig7_Table-1-Algorithm-1-Bagging
The above image shows different stages of a Bagging Algorithm.

Let’s understand the above diagram with a simple example. Let’s say we have a training set which contains 100K data points. We will create N subsets by randomly sampling 50K data points for each subset. Each of these N subsets will be used to train N different classifiers. At the aggregation stage, all these N predictions will be combined into one single model (referred to here as the meta-classifier). Out of the 100K points originally present in the dataset, if we remove 1000 points, the impact on the sampled datasets will be very small. If you think intuitively, some of these 1000 points might not be present in a given sampled dataset at all, and thus the number of points that will be removed from each sampled dataset will be very small. Even zero in some cases! To sum it up, the impact of removing 1000 such points on the base learners will be small, thereby reducing the variance of the model and making it more robust. Variance is nothing but sensitivity to noise, as we have discussed earlier.
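To put a rough number on this intuition, here is a small simulation of the example above. It assumes the subsets are drawn with replacement (as in the bootstrap step described earlier) and, purely for convenience, that the 1000 removed points are the first 1000 indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, subset_size, n_removed = 100_000, 50_000, 1_000

# Mark 1000 points (here simply the first 1000 indices) as "to be removed".
removed = np.zeros(n_points, dtype=bool)
removed[:n_removed] = True

# Draw a few bootstrapped subsets and measure the share of rows each would lose.
fractions = []
for _ in range(5):
    subset = rng.integers(0, n_points, size=subset_size)  # with replacement
    fractions.append(removed[subset].mean())

mean_fraction = float(np.mean(fractions))
print("Average share of each subset affected: %0.4f" % mean_fraction)
```

Since the removed points make up 1% of the original data, each subset is expected to lose about 1% of its rows, which is why a single base learner barely notices the change.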

Bagging Code Sample:

In the code sample below, we first initialize 8 different base learners. We will evaluate each of these 8 base learners on our training set and compare its accuracy to that of the bagging version of the same classifier.

In most of the cases below, we see that there is a slight increase in the model’s accuracy when we use the bagging version of each classifier as compared to the normal ones. For this classification task, I will use a 3 fold cross-validation to obtain the accuracy scores across different folds. You can increase the number of folds if you have a large real-world training set. For each of the base learners selected below, we see that bagging actually works! Using bagging has improved the mean accuracy by a certain margin (most notably for the AdaBoostClassifier).

#Imports needed for the bagging comparison.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
#Load the data first, then encode the string class labels as integers.
iris_dataset = pd.read_csv("iris.csv")
X, y = iris_dataset.iloc[:,0:4], iris_dataset.iloc[:,4]
encoder_object = LabelEncoder()
y = encoder_object.fit_transform(y)
RANDOM_SEED = 0
#Base Learners
rf_clf = RandomForestClassifier(n_estimators=10, random_state=RANDOM_SEED)
et_clf = ExtraTreesClassifier(n_estimators=5, random_state=RANDOM_SEED)
knn_clf = KNeighborsClassifier(n_neighbors=2)
svc_clf = SVC(C=10000.0, kernel='rbf', random_state=RANDOM_SEED)
rg_clf = RidgeClassifier(alpha=0.1, random_state=RANDOM_SEED)
lr_clf = LogisticRegression(C=20000, penalty='l2', random_state=RANDOM_SEED)
dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=RANDOM_SEED)
adab_clf = AdaBoostClassifier(n_estimators=5,learning_rate=0.001)
classifier_array = [rf_clf, et_clf, knn_clf, svc_clf, rg_clf, lr_clf, dt_clf, adab_clf]
labels = [clf.__class__.__name__ for clf in classifier_array]
normal_accuracy = []
normal_std = []
bagging_accuracy = []
bagging_std = []
#Score every base learner with and without bagging using 3 fold cross-validation.
for clf in classifier_array:
    cv_scores = cross_val_score(clf, X, y, cv=3, n_jobs=-1)
    bagging_clf = BaggingClassifier(clf, max_samples=0.4, max_features=3, random_state=RANDOM_SEED)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=3, n_jobs=-1)

    normal_accuracy.append(np.round(cv_scores.mean(),4))
    normal_std.append(np.round(cv_scores.std(),4))

    bagging_accuracy.append(np.round(bagging_scores.mean(),4))
    bagging_std.append(np.round(bagging_scores.std(),4))

    print("Accuracy: %0.4f (+/- %0.4f) [Normal %s]" % (cv_scores.mean(), cv_scores.std(), clf.__class__.__name__))
    print("Accuracy: %0.4f (+/- %0.4f) [Bagging %s]\n" % (bagging_scores.mean(), bagging_scores.std(), clf.__class__.__name__))

How did our Bagging classifiers behave compared to normal ones? Let’s check the output below.

Accuracy: 0.9538 (+/- 0.0367) [Normal RandomForestClassifier]
Accuracy: 0.9604 (+/- 0.0155) [Bagging RandomForestClassifier]

Accuracy: 0.9408 (+/- 0.0420) [Normal ExtraTreesClassifier]
Accuracy: 0.9412 (+/- 0.0577) [Bagging ExtraTreesClassifier]

Accuracy: 0.9534 (+/- 0.0087) [Normal KNeighborsClassifier]
Accuracy: 0.9869 (+/- 0.0092) [Bagging KNeighborsClassifier]

Accuracy: 0.9400 (+/- 0.0321) [Normal SVC]
Accuracy: 0.9604 (+/- 0.0422) [Bagging SVC]

Accuracy: 0.7998 (+/- 0.0058) [Normal RidgeClassifier]
Accuracy: 0.8607 (+/- 0.0248) [Bagging RidgeClassifier]

Accuracy: 0.9542 (+/- 0.0515) [Normal LogisticRegression]
Accuracy: 0.9600 (+/- 0.0320) [Bagging LogisticRegression]

Accuracy: 0.9473 (+/- 0.0329) [Normal DecisionTreeClassifier]
Accuracy: 0.9673 (+/- 0.0245) [Bagging DecisionTreeClassifier]

Accuracy: 0.6667 (+/- 0.0000) [Normal AdaBoostClassifier]
Accuracy: 0.9604 (+/- 0.0274) [Bagging AdaBoostClassifier]

Here, in the below code sample we can visually see how the accuracy improves on using a bagging classifier as compared to a normal one. From a simple implementation point of view, this code sample helps us understand how to implement the concept of bootstrap aggregation.

### Bagging.
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20,10))
n_groups = 8
index = np.arange(n_groups)
bar_width = 0.35
opacity = .7
error_config = {'ecolor': '0.3'}
#Two bars per classifier: the normal version and its bagging version, with error bars.
normal_clf = ax.bar(index, normal_accuracy, bar_width, alpha=opacity, color='g', yerr=normal_std, error_kw=error_config, label='Normal Classifier')
bagging_clf = ax.bar(index + bar_width, bagging_accuracy, bar_width, alpha=opacity, color='c', yerr=bagging_std, error_kw=error_config, label='Bagging Classifier')
ax.set_xlabel('Classifiers')
ax.set_ylabel('Accuracy (error bars: standard deviation across folds)')
ax.set_title('Accuracy of normal vs. bagging classifiers')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels((labels))
ax.legend()
#fig.tight_layout()
plt.show()

The vertical black line at the top of each bar shows the standard deviation of the accuracy across the cross-validation folds, which we use here as a rough indicator of the variance of the model. As we can see from the output above, this spread reduces for some classifiers when we use the bagging version (Random Forest, Logistic Regression and Decision Tree). However, for the other models it is larger than for the normal version of the classifier. This is most likely because of the lack of training data points. Remember we have only 150 observations in our training set. The greater the number of data points in our training set, the more robust the final models will be.

Bar graphs comparing the accuracy between different classifiers and their Bagging versions.

Impact of change in bagging accuracy with an increase in the sub-sampling ratio.

A very important factor to be kept in mind while building bagging classifiers is that the accuracy of a model doesn’t always increase when we increase the sub-sampling ratio. In the BaggingClassifier class, sub-sampling, i.e. the fraction of the data that gets into each of the base learners, is controlled by the parameter “max_samples”.
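As a quick check of what “max_samples” does, the sketch below fits a bagging classifier on the copy of Iris bundled with scikit-learn and inspects how many rows each base learner received. The decision tree base learner and the 10 estimators are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 rows

bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=10, max_samples=0.4, random_state=0)
bag.fit(X, y)

# estimators_samples_ holds the row indices drawn for each base learner.
sample_sizes = [len(idx) for idx in bag.estimators_samples_]
print(sample_sizes)
```

With a ratio of 0.4 on 150 rows, every base learner is trained on 60 rows drawn with replacement.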

In the code sample below, we will display the bagging scores for each of the base learners at various sub-sampling ratios. We will also plot the bagging scores for each of the base learners in a line chart to get a more intuitive understanding of this concept, namely that bagging scores don’t necessarily increase when we increase the sub-sampling ratio.

### Display the accuracy of different bagging classifiers at various sub sampling ratio in a Pretty table.
subsampling_ratio = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
various_bagging_scores = []
for clf in classifier_array:
    cv_scores = cross_val_score(clf, X, y, cv=3, n_jobs=-1)
    #print("\nAccuracy: %0.4f (+/- %0.4f) [Normal %s]" % (cv_scores.mean(), cv_scores.std(), clf.__class__.__name__))

    mean_bagging_score = []
    for ratio in subsampling_ratio:
        bagging_clf = BaggingClassifier(clf, max_samples=ratio, max_features=3, random_state=RANDOM_SEED)
        bagging_scores = cross_val_score(bagging_clf, X, y, cv=3, n_jobs=-1)
        mean_bagging_score.append(bagging_scores.mean())
        #print("Bagging accuracy: %0.4f [max_samples %0.2f]" % (bagging_scores.mean(), ratio))
    various_bagging_scores.append(mean_bagging_score)
various_bagging_scores.insert(0,subsampling_ratio)

#Compare performance and display it in a pretty table.
from prettytable import PrettyTable
table = PrettyTable()
labels.insert(0,"Max Samples")
#table.field_names = label_models
index=0
for value in various_bagging_scores:
    table.add_column(labels[index],value)
    index += 1
print(table)
Bagging accuracy for all classifiers for various values of sub-sampling ratio.

This code sample below helps us understand that the bagging accuracy doesn’t always increase when we increase the sub-sampling ratio.

#Plot the bagging scores using a line chart.
labels.remove("Max Samples")
various_bagging_scores.remove(various_bagging_scores[0])
x_axes = subsampling_ratio
color_map = ['blue','g','r','c','grey','y','black','m']
plt.figure(figsize=(20,10))
for index in range(0,len(labels)):
    plt.plot(x_axes, various_bagging_scores[index], color=color_map[index], label=labels[index])
plt.xlabel('Sub sampling Ratio')
plt.ylabel('Accuracy')
plt.title("Comparison b/w accuracy of different classifiers at various sub sampling ratio")
plt.legend()
plt.show()
A line chart showing the change in Bagging accuracy at different values of sub-sampling ratio.

As we can clearly see for the AdaBoost classifier (denoted by the magenta line), the bagging accuracy drops to a significantly lower value when we increase the sampling ratio from 0.7 to 0.8. For the Decision Tree classifier, the bagging accuracy falls beyond the sampling ratio of 0.4. This behavior is similar for most of the base learners. There is no evidence here that a higher sub-sampling ratio means a higher bagging accuracy. I have taken my sampling ratio to be 0.4. You can try and experiment with different values on different datasets.

Boosting Classifiers:

The second ensemble technique that we are going to discuss today is called Boosting. Boosting is used to convert weak base learners into strong ones. The predictions of a weak learner are only weakly correlated with the true class labels, whereas the predictions of a strong learner are highly correlated with the true class labels.

Boosting involves training the weak learners iteratively, each trying to correct the errors made by the previous model. This is achieved by training a weak model on the whole training data, then building a second model which aims at correcting the errors made by the first model. Then we build a third model which tries to correct the errors made by the second model and so on. Models are added iteratively until the training error stops improving noticeably or a chosen maximum number of models has been reached.

When a model is added at each stage, it is assigned a weight that is related to its accuracy. After a weak classifier is added, the weights of the data points are readjusted: the incorrectly classified points are given higher weights and the correctly classified points are given lower weights. Such an approach makes the next classifier focus on the mistakes made by the previous model.
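The re-weighting described above can be sketched with a hand-rolled version of discrete AdaBoost. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: it uses a synthetic binary dataset with labels in {-1, +1}, decision stumps as weak learners, and 10 rounds, all chosen only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with labels in {-1, +1}; purely illustrative data.
X, y01 = make_classification(n_samples=300, n_features=5,
                             n_clusters_per_class=1, random_state=0)
y = 2 * y01 - 1

n_rounds = 10
w = np.full(len(y), 1.0 / len(y))   # start with uniform point weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - err) / err)                  # weight of this model

    # Misclassified points (y * pred == -1) are scaled up, correct ones down.
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final model: sign of the weighted sum of the weak learners.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_acc = (np.sign(scores) == y).mean()
first_stump_acc = (stumps[0].predict(X) == y).mean()
print("First stump: %0.3f, boosted ensemble: %0.3f" % (first_stump_acc, ensemble_acc))
```

Note that there are two kinds of weights in this sketch: `w` belongs to the data points and steers the next stump, while `alpha` belongs to each stump and decides how much it counts in the final vote.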

Boosting reduces generalization error by taking a high-bias & low-variance model and reducing the bias by a significant level. Remember, bagging reduces variance. Similar to bagging, boosting also lets us work with both classification and regression models. Please take a look at the below diagrams to intuitively understand how boosting works at each of the stages. The diagram below shows the different stages in a boosting algorithm.

The different stages of a boosting algorithm.

Let’s understand the above diagram. We have a dataset D. The first thing that we will do at stage 0 is train a model on the whole dataset. The model may be either a classification or a regression model. Let’s name this model M_0. Let us assume this model M_0 is trying to fit a function h_0(x). Then, the prediction function for this model is given by y_pred=h_0(x). Model 0 is designed to have a high bias. Generally boosting is applied to a high bias and low variance model. High bias in a model basically refers to a high training error. High bias arises mostly due to some incorrect assumptions made at the training stage.

Now, after building the first model, we will, still at stage 0, compute the error in prediction made by the model M_0 for each data point. So, the error in the prediction for any data point is given by y - y_pred. Remember, there are lots of error functions out there, for example the squared error, the hinge loss, the logistic loss etc. But, for simplicity, we will focus on the simple difference error for this example.

Now that we have done these things in stage 0, what we will do in stage 1 is as follows. We will try to fit a model M_1 on the errors produced by the model at stage 0. Remember, M_1 is not trained on the actual class labels. M_1 is trained on the errors we got at the end of stage 0. Let’s say we get a function h_1(x), which has been trained on the errors generated by model M_0. Thus at the end of stage 1, the final model will actually be the weighted sum of the previous two prediction functions (as shown in the diagram). We will assign weights a_0 and a_1 to h_0(x) and h_1(x) respectively. Thus at the end of stage 1, the model looks like this: F_1(x) = a_0 * h_0(x) + a_1 * h_1(x), where a_0 and a_1 are weights assigned to the prediction functions. Remember, it is the data points with a high error that get the attention of the next model, because the next model is trained on those errors. The weights a_i, on the other hand, decide how much each function contributes to the final sum; in AdaBoost, for example, a more accurate weak learner receives a larger weight. In this way, we can make each model in the sequence focus on the errors made by the previous model.

Similarly, the model at the end of stage 2 will have the function F_2(x) = a_0 * h_0(x) + a_1 * h_1(x) + a_2 * h_2(x). Thus at the end of all stages, the final model is given by the summation of a_i * h_i(x), where the value of i ranges from 0 to N. Intuitively speaking, at each stage we are reducing the training error, which in other words means we are reducing the bias of the model.
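The stage-wise construction above can be sketched for a regression problem with the simple difference error. The sketch rests on a few assumptions that are not part of the original walkthrough: a toy sine-shaped dataset, decision stumps as the high-bias models, a_0 = 1, and one constant weight (0.5) for all later stages.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # toy regression target

n_stages = 5
learning_rate = 0.5        # plays the role of a_1 ... a_N (assumed constant)

# Stage 0: a deliberately simple, high-bias model h_0 trained on the labels.
h0 = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
F = h0.predict(X)          # F_0(x) = h_0(x), i.e. a_0 = 1
mse_per_stage = [np.mean((y - F) ** 2)]

for _ in range(n_stages):
    residual = y - F                                       # errors of the current model
    h = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residual)
    F = F + learning_rate * h.predict(X)                   # F_i = F_(i-1) + a_i * h_i
    mse_per_stage.append(np.mean((y - F) ** 2))

print(["%0.4f" % m for m in mse_per_stage])
```

The printed training errors should shrink from stage to stage, which is the bias reduction described above.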

Also, find below the Boosting algorithm.

Boosting Algorithm : https://www.researchgate.net/figure/Boosting-Algorithm-11_fig1_309031690

Boosting Code samples:

We will look at some of the most popular boosting classifiers in the below code sample. We will also see how the boosted models can be combined using an EnsembleVoteClassifier. EnsembleVoteClassifier is a very useful class available in the MLxtend package, which is used to combine the predictions of different machine learning models by using the concept of majority voting.

The EnsembleVoteClassifier implements two types of voting approach: “hard” and “soft”. In a “hard” voting approach we will predict the class label of the final model based on the majority vote obtained from all the base classifiers. For example, if 7 out of 10 base learners predict the class label to be “Yes” in a binary classification problem, we will take “Yes” to be the class label of the final ensemble model. In a “soft” voting approach, the predicted class probabilities of the base classifiers are averaged and the class with the highest average probability is chosen. Read more about EnsembleVoteClassifier at this link: http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
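To see how the two voting approaches can disagree, here is a tiny NumPy sketch for a single data point. In soft voting the class probabilities are averaged before the class is picked, and the probabilities below are made up for illustration; this is not the mlxtend implementation itself.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers
# over the classes ["No", "Yes"]; the numbers are made up for illustration.
probas = np.array([
    [0.45, 0.55],   # classifier 1 leans "Yes"
    [0.40, 0.60],   # classifier 2 leans "Yes"
    [0.90, 0.10],   # classifier 3 is confident in "No"
])
classes = np.array(["No", "Yes"])

# Hard voting: every classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_label = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities first, then pick the largest.
mean_proba = probas.mean(axis=0)
soft_label = classes[mean_proba.argmax()]

print("hard:", hard_label, "| soft:", soft_label)   # hard: Yes | soft: No
```

Two lukewarm votes win under hard voting, while one very confident classifier tips the balance under soft voting.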

Anyway, in the below example we see that the ensemble (boosted classifiers combined with majority voting) reaches an accuracy of 0.967. This matches the best individual model (Ada Boost, also 0.967) and is slightly higher than Gradient Boost (0.960) and XG Boost (0.961), so the gain is very slight. Still, you get my point!

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.classifier import EnsembleVoteClassifier
from xgboost import XGBClassifier
ada_boost = AdaBoostClassifier(n_estimators=5)
grad_boost = GradientBoostingClassifier(n_estimators=10)
xgb_boost = XGBClassifier(max_depth=5, learning_rate=0.001)
#Combine the three boosted models using hard (majority) voting.
ensemble_clf = EnsembleVoteClassifier(clfs=[ada_boost, grad_boost, xgb_boost], voting='hard')
boosting_labels = ['Ada Boost', 'Gradient Boost', 'XG Boost', 'Ensemble']
for clf, label in zip([ada_boost, grad_boost, xgb_boost, ensemble_clf], boosting_labels):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: {0:.3f}, Variance: (+/-) {1:.3f} [{2}]".format(scores.mean(), scores.std(), label))

Let’s see how the different classifiers performed on the IRIS dataset. Check the below output.

Accuracy: 0.967, Variance: (+/-) 0.018 [Ada Boost]
Accuracy: 0.960, Variance: (+/-) 0.027 [Gradient Boost]
Accuracy: 0.961, Variance: (+/-) 0.042 [XG Boost]
Accuracy: 0.967, Variance: (+/-) 0.033 [Ensemble]

Boosting Decision Regions:

Here, we look at the decision boundaries of each boosted classifier, and how all the three base models classify the data over a region. We will also take a look at the decision region generated by the voting classifier. In the MLxtend package, there is a very useful function called “plot_decision_regions” which can be used to visually see the decision regions for different classifiers. You can check the documentation for more techniques at this link: http://rasbt.github.io/mlxtend/user_guide/plotting/plot_decision_regions/.

Anyway, in the below example I have trained and fitted the model to two of the most important features, i.e. “petal_length” and “petal_width”. Remember the pair plots above?

#Decision Regions for all the boosting algorithms.
X = np.array(iris_dataset[['petal_length','petal_width']])
y = np.array(y)
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(20,16))
#One subplot per classifier, laid out on a 2 x 2 grid.
for clf, label, grd in zip([ada_boost, grad_boost, xgb_boost, ensemble_clf], boosting_labels, itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(label)
plt.show()
Decision Regions for all three boosted classifiers vs The final ensemble model.

Stacking Classifiers.

Stacking is an ensemble learning technique which is used to combine the predictions of diverse classification models into one single model also known as the meta-classifier.

All the individual models are trained separately on the complete training data set and fine-tuned to achieve a greater accuracy. The bias and variance trade-off is taken care of for each model. The final model, also known as the meta-classifier, is fed either the class labels predicted by the base models or the predicted probabilities for each class label. The meta-classifier is then trained based on the outputs given by the base models. In stacking, a new model is trained based on the predictions made by the previous models.

This process takes place sequentially. This means several models are trained at stage 1 and are fine-tuned. The predicted probabilities of each model from stage 1 are fed as an input to all the models at stage 2. The models at stage 2 are then fine-tuned and the corresponding outputs are fed to models at stage 3 and so on. This process occurs multiple times based on how many layers of stacking one would like to use.

The final stage consists of one single powerful model, which gives us the final output by combining the output of all the models present in the previous layers. This single powerful model at the end of a stacking pipeline is called the meta-classifier. Oftentimes, using stacking classifiers increases the prediction accuracy of a model. But there is no guarantee that using stacking will increase the prediction accuracy at all times!
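A single layer of stacking, as described above, can be sketched by hand with scikit-learn alone. This is a minimal illustration and not the mlxtend StackingClassifier: the three base learners, the train/test split, and the use of predicted probabilities as meta-features are all choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stage 1: train every base learner on the complete training set.
base_learners = [KNeighborsClassifier(n_neighbors=5),
                 DecisionTreeClassifier(max_depth=2, random_state=0),
                 GaussianNB()]
for clf in base_learners:
    clf.fit(X_train, y_train)

def meta_features(X_in):
    # One block of class probabilities per base learner, placed side by side.
    return np.hstack([clf.predict_proba(X_in) for clf in base_learners])

# Stage 2: the meta-classifier learns from the base learners' outputs.
meta_clf = LogisticRegression(max_iter=1000)
meta_clf.fit(meta_features(X_train), y_train)

stacked_accuracy = meta_clf.score(meta_features(X_test), y_test)
print("Stacked accuracy: %0.4f" % stacked_accuracy)
```

Because the base learners have already seen the rows that the meta-classifier is trained on, a common refinement (not shown here) is to build the meta-features from out-of-fold predictions instead.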

Take a look at the below diagrams to understand how stacking works. You can refer to the MLxtend documentation at this link: http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/, to get more ideas on how to implement stacking in different scenarios.

Stacking Algorithm.
The different stages of a typical Stacking Algorithm.

Stacking code sample:

In the below code sample, we will use eight different base learners and train each of them on the whole dataset. Each of these models can be fine-tuned using grid search cross-validation. Together, these eight models produce eight predicted class labels for every data point, one from each model.

At the final stage, the predictions of all the base models are used as input features to train a final model called the meta-classifier, which learns how to combine them (rather than applying a fixed rule such as majority voting). The meta-classifier in our case is the logistic regression model.

As we can see from the outputs below, stacking has indeed managed to increase the accuracy of the final model, although the increase is very small. But you get the idea! Like the previous examples, we will use a 3 fold cross-validation. Again, you can experiment with this value when you work with some real-world datasets. Further down, in a separate example, we will try grid search cross-validation on the base learners and see if the overall accuracy increases (well, it should actually).

RANDOM_SEED = 0
import numpy as np
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from mlxtend.classifier import StackingClassifier
X, y = iris_dataset.iloc[:,0:4], iris_dataset.iloc[:,4]
#Encode the string class labels as integers.
encoder_object = LabelEncoder()
y = encoder_object.fit_transform(y)
#Base Learners
rf_clf = RandomForestClassifier(n_estimators=10, random_state=RANDOM_SEED)
et_clf = ExtraTreesClassifier(n_estimators=5, random_state=RANDOM_SEED)
knn_clf = KNeighborsClassifier(n_neighbors=2)
svc_clf = SVC(C=10000.0, kernel='rbf', random_state=RANDOM_SEED)
rg_clf = RidgeClassifier(alpha=0.1, random_state=RANDOM_SEED)
lr_clf = LogisticRegression(C=20000, penalty='l2', random_state=RANDOM_SEED)
dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=RANDOM_SEED)
adab_clf = AdaBoostClassifier(n_estimators=100)
lr = LogisticRegression(random_state=RANDOM_SEED) # meta classifier
sclf = StackingClassifier(classifiers=[rf_clf, et_clf, knn_clf, svc_clf, rg_clf, lr_clf, dt_clf, adab_clf], meta_classifier=lr)
classifier_array = [rf_clf, et_clf, knn_clf, svc_clf, rg_clf, lr_clf, dt_clf, adab_clf, sclf]
labels = [clf.__class__.__name__ for clf in classifier_array]
acc_list = []
var_list = []
#Score every base learner and the stacked model using 3 fold cross-validation.
for clf, label in zip(classifier_array, labels):
    cv_scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f) [%s]" % (cv_scores.mean(), cv_scores.std(), label))
    acc_list.append(np.round(cv_scores.mean(),4))
    var_list.append(np.round(cv_scores.std(),4))
    #print("Accuracy: {} (+/- {}) [{}]".format(np.round(scores.mean(),4), np.round(scores.std(),4), label))

Let’s see how the above code helped in increasing the accuracy of prediction by using Stacking classifiers.

Accuracy: 0.9538 (+/- 0.0367) [RandomForestClassifier]
Accuracy: 0.9408 (+/- 0.0420) [ExtraTreesClassifier]
Accuracy: 0.9534 (+/- 0.0087) [KNeighborsClassifier]
Accuracy: 0.9400 (+/- 0.0321) [SVC]
Accuracy: 0.7998 (+/- 0.0058) [RidgeClassifier]
Accuracy: 0.9542 (+/- 0.0515) [LogisticRegression]
Accuracy: 0.9473 (+/- 0.0329) [DecisionTreeClassifier]
Accuracy: 0.9600 (+/- 0.0161) [AdaBoostClassifier]
Accuracy: 0.9608 (+/- 0.0424) [StackingClassifier]

A bar graph to see stacking actually increases the accuracy.

The below code block will plot the accuracy values for each of the base learners and also the accuracy of the final meta-classifier. We have managed to increase the accuracy value very slightly from 0.9600 (highest accuracy obtained from a single base learner) to 0.9608 (using the stacking classifier).

Bar graphs comparing the accuracy values for different classifiers.

Stacking Decision Regions:

For simplicity, we will look at the decision regions obtained using three of our base learners and also the final stacked meta-classifier. The three base learners that we will select for this purpose are RandomForestClassifier, SVC (the support vector classifier), and RidgeClassifier. Like before, I will train and fit the models on two of the most important features, i.e. “petal_length” and “petal_width”.

#Decision Regions for 4 algorithms.
X = np.array(iris_dataset[['petal_length','petal_width']])
y = np.array(y)
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(20,16))
#One subplot per classifier, laid out on a 2 x 2 grid.
for clf, label, grd in zip([rf_clf, svc_clf, rg_clf, sclf], ["Random Forest Classifier", "Support Vector Classifier", "Ridge Classifier", "Stacking Classifier"], itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(label)
plt.show()
Decision Regions for all three base classifiers vs the meta-classifier.

Stacking using probability score:

Instead of using the class labels predicted by the base learners, we can also use their predicted probability scores to train the meta-classifier. For this, we need to set “use_probas=True”. With “average_probas=True”, the probability scores of the base learners are averaged. With “average_probas=False”, the probability scores of all the base learners are stacked side by side and passed together as input to the next-level classifier.
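The difference between the two settings is easiest to see in the shape of the meta-features. Here is a minimal pure-Python sketch with made-up probability scores for 3 base learners, 2 samples and 3 classes. The helper names are hypothetical and only mimic the averaging and side-by-side stacking described above; they are not the library's internal code.

```python
# Made-up predict_proba outputs: probas[learner][sample][class].
probas = [
    [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]],  # base learner 1
    [[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]],  # base learner 2
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],  # base learner 3
]

def average_probas(probas):
    """Average the class probabilities over learners: one column per class."""
    n_learners = len(probas)
    n_samples = len(probas[0])
    n_classes = len(probas[0][0])
    return [
        [sum(p[i][k] for p in probas) / n_learners for k in range(n_classes)]
        for i in range(n_samples)
    ]

def stack_probas(probas):
    """Concatenate the class probabilities: one column per (learner, class) pair."""
    n_samples = len(probas[0])
    return [[value for p in probas for value in p[i]] for i in range(n_samples)]

averaged = average_probas(probas)
stacked = stack_probas(probas)

print("averaged shape:", (len(averaged), len(averaged[0])))
print("stacked shape: ", (len(stacked), len(stacked[0])))
```

With averaging, the meta-classifier sees one column per class. With stacking, it sees one column per learner and class, so it can learn to trust some base learners more than others.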

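Note that we have actually managed to increase the accuracy of the stacking classifier by using probability scores: 0.9673 as compared to 0.9608, as you will see in the example below! Keep in mind that the set of base learners is also slightly different here (GaussianNB takes the place of SVC and RidgeClassifier), so the two numbers are not a strictly like-for-like comparison.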

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

RANDOM_SEED = 0
X, y = iris_dataset.iloc[:,0:4], iris_dataset.iloc[:,4]
encoder_object = LabelEncoder()
y = encoder_object.fit_transform(y)
#Base Learners
rf_clf = RandomForestClassifier(n_estimators=10, random_state=RANDOM_SEED)
et_clf = ExtraTreesClassifier(n_estimators=5, random_state=RANDOM_SEED)
knn_clf = KNeighborsClassifier(n_neighbors=2)
lr_clf = LogisticRegression(C=20000, penalty='l2', random_state=RANDOM_SEED)
dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=RANDOM_SEED)
adab_clf = AdaBoostClassifier(n_estimators=100)
lr = LogisticRegression(random_state=RANDOM_SEED) # meta classifier
gnb_clf = GaussianNB()
#sclf = StackingClassifier(classifiers=[rf_clf, et_clf, knn_clf, svc_clf, rg_clf, lr_clf, dt_clf, adab_clf], use_probas=True, average_probas=False, meta_classifier=lr)
sclf = StackingClassifier(classifiers=[rf_clf, knn_clf, gnb_clf, lr_clf, et_clf, dt_clf, adab_clf], use_probas=True, average_probas=False, meta_classifier=lr)
classifier_array = [rf_clf, knn_clf, gnb_clf, lr_clf, et_clf, dt_clf, adab_clf, sclf]
labels = [clf.__class__.__name__ for clf in classifier_array]
for clf, label in zip(classifier_array, labels):
    cv_scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f) [%s]" % (cv_scores.mean(), cv_scores.std(), label))

Let’s see how the accuracy improved on using class probabilities instead of actual class labels.

Accuracy: 0.9538 (+/- 0.0367) [RandomForestClassifier]
Accuracy: 0.9534 (+/- 0.0087) [KNeighborsClassifier]
Accuracy: 0.9342 (+/- 0.0328) [GaussianNB]
Accuracy: 0.9542 (+/- 0.0515) [LogisticRegression]
Accuracy: 0.9408 (+/- 0.0420) [ExtraTreesClassifier]
Accuracy: 0.9473 (+/- 0.0329) [DecisionTreeClassifier]
Accuracy: 0.9600 (+/- 0.0161) [AdaBoostClassifier]
Accuracy: 0.9673 (+/- 0.0333) [StackingClassifier]

Stacking classifiers using Grid Search cross-validation.

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

RANDOM_SEED = 0
X, y = iris_dataset.iloc[:,0:4], iris_dataset.iloc[:,4]
encoder_object = LabelEncoder()
y = encoder_object.fit_transform(y)
#Base Learners.
rf_clf = RandomForestClassifier(random_state=RANDOM_SEED,n_jobs=-1)
knn_clf = KNeighborsClassifier(p=2, metric='minkowski',n_jobs=-1)
dt_clf = DecisionTreeClassifier(criterion='gini', random_state=RANDOM_SEED)
lr = LogisticRegression(random_state=RANDOM_SEED) # meta classifier
#sclf = StackingClassifier(classifiers=[rf_clf, et_clf, knn_clf, svc_clf, rg_clf, lr_clf, dt_clf, adab_clf], meta_classifier=lr)
sclf = StackingClassifier(classifiers=[rf_clf, knn_clf, dt_clf], meta_classifier=lr)
print("\nAccuracies of all classifiers using grid search cross validation.")
params = {'randomforestclassifier__n_estimators': np.arange(10,20),
          'randomforestclassifier__max_depth': np.arange(1,5),
          'kneighborsclassifier__n_neighbors': np.arange(1,20,2),
          'decisiontreeclassifier__max_depth': np.arange(1,5),
          'meta-logisticregression__C': [0.001,0.01,0.1,1,10,100,1000]}
gsearch_cv = GridSearchCV(estimator=sclf, param_grid=params, cv=5, refit=True)
gsearch_cv.fit(X, y)
cv_keys = ('mean_test_score', 'std_test_score', 'params')
print('Best parameters: %s' % gsearch_cv.best_params_)
print('Accuracy: %.2f' % gsearch_cv.best_score_)

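Let’s see the output below. As we can see, tuning the hyperparameters with grid search cross-validation has increased the accuracy of the ensemble model to 0.98. Note that this is the mean 5-fold cross-validation score of the best parameter combination, whereas the earlier experiments used 3-fold cross-validation.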

Accuracies of all classifiers using grid search cross validation.
Best parameters: {'decisiontreeclassifier__max_depth': 1, 'kneighborsclassifier__n_neighbors': 7, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 15}
Accuracy: 0.98
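As a rough back-of-the-envelope sketch (not part of the original notebook), the size of the grid searched above can be counted with the standard library. Here `range` stands in for `np.arange`, which yields the same values for these arguments.

```python
from math import prod

# Candidate values per hyperparameter, mirroring the `params` dictionary above.
grid = {
    "randomforestclassifier__n_estimators": list(range(10, 20)),
    "randomforestclassifier__max_depth": list(range(1, 5)),
    "kneighborsclassifier__n_neighbors": list(range(1, 20, 2)),
    "decisiontreeclassifier__max_depth": list(range(1, 5)),
    "meta-logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in grid.values())

cv_folds = 5
# Each candidate is fitted once per fold, plus one final refit (refit=True).
n_fits = n_candidates * cv_folds + 1

print("parameter combinations:", n_candidates)
print("stacking-classifier fits:", n_fits)
```

Every one of those fits trains all three base learners and the meta-classifier, which is why an exhaustive grid over a stacked model gets expensive quickly, even on a dataset as small as Iris.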

Decision Regions of Stacking classifiers using the best hyperparameters.

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
#Decision Regions for 4 algorithms.
X = np.array(iris_dataset[['petal_length','petal_width']])
y = np.array(y)
#Base Learners.
rf_clf = RandomForestClassifier(max_depth=3,n_estimators=15,random_state=RANDOM_SEED,n_jobs=-1)
knn_clf = KNeighborsClassifier(n_neighbors=7,p=2, metric='minkowski',n_jobs=-1)
dt_clf = DecisionTreeClassifier(max_depth=1,criterion='gini', random_state=RANDOM_SEED)
lr = LogisticRegression(C=0.1,random_state=RANDOM_SEED) # meta classifier
sclf = StackingClassifier(classifiers=[rf_clf, knn_clf, dt_clf], meta_classifier=lr)
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(20,16))
for clf, label, grd in zip([rf_clf, knn_clf, dt_clf, sclf],
                           ["RandomForestClassifier", "KNeighborsClassifier", "DecisionTreeClassifier", "StackingClassifier"],
                           itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(label)
plt.show()

Decision Regions for all three base classifiers vs the meta-classifier. (Using the best hyper-parameter values)

Multi-Level Stacking (Using 81 base learners across 4 stages)

The examples I have covered for Stacking classifiers are all ensemble models that have a single level. In general, however, models can be stacked across multiple levels. I will show you one such stacking design used by the winners of the KDD Cup competition. This image comes from a presentation prepared by the SAS team. The entire architecture was designed by Jeong-Yoon Lee, who is a Data Scientist at Microsoft, and winner of the KDD Cup! You can check out this wonderful video by the SAS team on YouTube for further elaboration: https://www.youtube.com/watch?v=9IyJ4HvubGo&t=1070s

The stacking architecture used by the winning team of KDD cup 2015.

The above stacking model uses 64 base learners on seven feature sets at level 1. All 64 models are trained independently on the training set. The predictions of these 64 models are fed to the Stage 1 ensemble, which has 15 more models. The predictions of these 15 models are further combined using majority vote, and their output is fed to the Stage 2 ensemble, which has 2 models. Finally, the output of these 2 models is fed to the meta-classifier in Stage 3, which gives us the final output. That is 64 + 15 + 2 = 81 models feeding the final meta-classifier.

Cascading classifiers:

In plain English, “cascading” means, according to Google, “a process whereby something, typically information or knowledge, is successively passed on”.

Cascading is one of the most powerful ensemble learning techniques, used by Machine Learning engineers and scientists when they want to be absolutely sure about the accuracy of a result. For example, suppose we want to build a machine learning model that detects whether a credit card transaction is fraudulent or not.

If you think about it, it’s a binary classification problem where class label 0 means the transaction is not fraudulent and class label 1 means the transaction is fraudulent. In such a setting, it’s very risky to put our faith completely in just one model. So what we do is build a sequence of models (or a cascade of models) to be absolutely sure that the transaction is not fraudulent. Cascade models are mostly used when the cost of making a mistake is very high. I will try to explain cascading with the help of a simple diagram.

Different stages in a cascade system of classifiers for a given query point.

Look at the above diagram. Given a transaction query point Xq, we feed it to Model 1. Model 1 can be anything: a random forest, a logistic regression model, or maybe a support vector machine. Basically, Model 1 predicts class probabilities to determine which class the query point is more likely to belong to.

Let’s say class label 1 means the transaction is fraudulent, and class label 0 means it is not. The predicted probabilities are P(Yq=0) and P(Yq=1), where Yq is the class label of the query point. Now let’s assume that P(Yq=0), i.e. the probability that the transaction is not fraudulent, is very high. If P(Yq=0) is extremely high, we will say that the transaction is not fraudulent. Let’s assume we have set a threshold of 99%. This means that if and only if P(Yq=0) > 0.99, we declare the final prediction to be not fraudulent. However, if P(Yq=0) ≤ 0.99, we are not sure enough whether or not it’s a fraudulent transaction, even though there may still be a high chance that it is not. In such a case, we want to be really sure before declaring the transaction safe. We need to be extremely careful, because if our model fails to detect a fraudulent transaction we might lose millions of dollars!

So even when we are only slightly unsure, we pass the query point on to another model, Model 2. Model 2 does the same thing: it receives the query point and predicts P(Yq=0). Just like in stage 1, if P(Yq=0) > 0.99, we declare the transaction to be not fraudulent and stop. But if we again get P(Yq=0) ≤ 0.99, we still aren’t sure! Hence, we pass the query point to yet another model, Model 3, in the cascade, which does the same thing.

In a typical cascading system, the complexity of the models increases as we move further down the cascade. Please note that all the models in a cascade are super powerful and have a very high accuracy on unseen data. However, it might happen that none of the models gives us a value of P(Yq=0) > 0.99. In such a case, there is typically a human being who sits at the end of the cascade. This person will personally call the customer and ask whether or not they made the transaction. Once the customer confirms that they made the transaction, we can be certain that it is not fraudulent.
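The prediction flow described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not a tested production design: each "model" is a hypothetical function that returns P(Yq=0) for a query point, and the toy models and their probabilities are made up.

```python
def cascade_predict(models, xq, threshold=0.99):
    """Pass xq through the cascade.

    Returns (label, stage): label is 0 as soon as one stage is confident
    that the transaction is not fraudulent, or None when no stage is
    confident and the case must be escalated to a human reviewer.
    """
    for stage, predict_p0 in enumerate(models, start=1):
        if predict_p0(xq) > threshold:
            return 0, stage
    return None, len(models) + 1  # the human sits after the last model


# Made-up stand-ins for Model 1..3; xq[0] plays the role of the amount.
def model_1(xq):
    return 0.95

def model_2(xq):
    return 0.995 if xq[0] < 100 else 0.60

def model_3(xq):
    return 0.90

models = [model_1, model_2, model_3]
small_payment = cascade_predict(models, [42.0])
large_payment = cascade_predict(models, [5000.0])
print(small_payment)
print(large_payment)
```

The small payment is cleared by the second stage, while the large one falls through every model and ends up with the human reviewer.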

Different training stages while building a cascade system of classifiers.

The above diagram shows us the different stages of training in a cascade system of classifiers. Here we have four models in a sequence. We train Model 1 on the whole training dataset and evaluate its performance on the test dataset. Wherever Model 1 is sure that the class label is 0, we collect those points in a dataset D’. This is stage 1 of the training phase. Thus D’ contains all the data points that Model 1 confidently assigns to class label 0, i.e. the transaction is not fraudulent.

Now, the data points we are not sure about (whether they are fraudulent or not) are passed to the next stage, i.e. Model 2 in our case. Hence, Model 2 trains only on the part of the dataset that does not contain points from D’. Model 2 is never exposed to the points for which we are already sure that the class label is 0; it trains only on the points for which we are not sure. At the end of stage 2, we again collect the points that Model 2 confidently assigns to class label 0 in a dataset D’’.

We repeat the same procedure for stage 3 as well, and so on until the training phase reaches the final model. Intuitively, the cascade is designed in such a way that each model in the sequence is trained only on the data points for which the previous models weren’t sure what the class label is. We must always train a model on the kind of data it will see at runtime in the production environment.
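The training procedure can be sketched as a loop in which every stage is fitted only on the points the previous stages were not sure about. Everything below is an illustrative assumption: `ToyStage` is a hypothetical one-feature classifier invented for this sketch, and the data is made up. In practice each stage would be a real model such as a random forest.

```python
import math

class ToyStage:
    """Hypothetical one-feature classifier, used only to make the sketch runnable."""

    def __init__(self, sharpness):
        self.sharpness = sharpness  # larger value = steeper decision curve

    def fit(self, xs, ys):
        xs0 = [x for x, y in zip(xs, ys) if y == 0]
        xs1 = [x for x, y in zip(xs, ys) if y == 1]
        mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
        self.midpoint = (mean0 + mean1) / 2
        self.sign = 1.0 if mean0 < mean1 else -1.0
        return self

    def p0(self, x):
        """Estimated P(y=0 | x): a sigmoid of the signed distance to the midpoint."""
        return 1.0 / (1.0 + math.exp(self.sharpness * self.sign * (x - self.midpoint)))


def train_cascade(stages, xs, ys, threshold=0.99):
    """Fit each stage on the points the earlier stages could not clear as class 0."""
    remaining = list(zip(xs, ys))
    fitted = []
    for stage in stages:
        if len({y for _, y in remaining}) < 2:
            break  # only one class left, nothing to learn
        stage.fit([x for x, _ in remaining], [y for _, y in remaining])
        fitted.append(stage)
        # Points confidently predicted as class 0 (D', D'', ...) are dropped.
        remaining = [(x, y) for x, y in remaining if stage.p0(x) <= threshold]
    return fitted, remaining


# Made-up data: one feature, label 1 = fraudulent.
xs = [1, 2, 3, 4, 5, 6, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1]
fitted, remaining = train_cascade([ToyStage(2.0), ToyStage(4.0), ToyStage(8.0)], xs, ys)
print(len(fitted), remaining)
```

On this toy data, each stage clears a few more of the easy class 0 points, and only the fraudulent transactions are left unresolved at the end of the cascade.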

I hope I was able to give an intuition of how cascading in machine learning works! For the above explanations on cascading, I have referred to Mr. Varma’s teachings and diagrams from the AAIC course. I have not provided a complete, tested implementation of cascading classifiers, as I have not built one myself yet. Keep watching this space for more. I will update this blog in the future with a full implementation of cascade classifiers once I have built one myself.

There is a very interesting paper on cascading models written by a few Ph.D. folks at Stanford University. They have built a super powerful framework called Cascaded Classification Models (CCM), which aims at improving the performance at each level by repeatedly instantiating the classifiers, coupled by their input/output variables, in a cascade. You can read the paper at this link: https://ai.stanford.edu/~koller/Papers/Heitz+al:NIPS08a.pdf

There is also a very interesting example in the scikit-image documentation on how to build a face detection model using cascade classifiers. If you are interested, you can visit http://scikit-image.org/docs/dev/auto_examples/xx_applications/plot_face_detection.html to see how it works!

CLOSING THOUGHTS:

If you have come this far, thank you for having the immense patience to read through all of this. I hope I was able to help you understand what ensemble techniques are and how they help in increasing the accuracy of Machine Learning models. In general, ensemble learning is mostly used by Kaggle participants to fine-tune their models to the extreme. In industry, however, stacking huge models across several levels is a very costly procedure, and a 0.5% increase in accuracy often won’t have much practical impact on a business problem. For Kaggle competitions, though, ensemble techniques are state-of-the-art strategies that can increase the accuracy by even 0.001, which might just help you win the competition!

Here are a few solid recommendations for additional reading on ensemble learning algorithms:

  1. The diagrams are drawn using an awesome tool provided by this website: https://www.draw.io/. You can use this tool to draw anything you like and it’s as simple as drawing using your crayons!
  2. Visit this link https://www.appliedaicourse.com/, to dive deep into Machine Learning/Deep Learning/AI. Mr. Varma has explained most of the high-level content using simple, robust mathematics.
  3. For additional details about ensemble techniques please read this nice blog written by Dr. Robi Polikar of Rowan University at this link: http://www.scholarpedia.org/article/Ensemble_learning
  4. Also, there is an extremely beautiful series of videos at MIT OpenCourseWare by Dr. Patrick Winston. Check this link: https://www.youtube.com/watch?v=UHBmv7qCey4

You can download the Jupyter notebook version of this blog from my GitHub profile at this link: https://github.com/saugatapaul1010/ensemble-learning-github/blob/master/Blog%20Code.ipynb
