Ensemble Learning

Sanjay Singh
Dec 2, 2019 · 7 min read

In my previous blog I explained Bias, Variance and Irreducible errors.

Here’s the link to the blog: Bias Variance Irreducible Error and Model Complexity Trade-off

One of the techniques to reduce these errors (bias and variance) is Ensemble Learning. It combines several machine learning models to get an optimized result: decreased variance (bagging), decreased bias (boosting), or improved prediction (stacking).

In this blog, you are going to have hands-on practice on Ensemble Learning methods.

Data Source:

We are going to use the Pima Indians Diabetes database. Download the diabetes.csv file from the link below.

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


The goal: predict the outcome (diabetic or not) based on the patient’s BMI, insulin level, age, and other feature values.

Let’s try different supervised learning methods and calculate their accuracy.

Execute the lines of code below to read the data into a pandas data frame, build the feature matrix and label array, and split the data into train and test sets.

#Import libraries
import pandas as pd
import numpy as np

#Read data into pandas dataframe
df=pd.read_csv(r'<put your file path here>\diabetes.csv')

#Define feature matrix (X) and label array (y)
X=df.drop('Outcome', axis=1)
y=df['Outcome']

#Split into train and test data sets (split parameters are illustrative)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Let’s try different classifiers and calculate their accuracy.

KNN Classifier:

from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train,y_train)
print("KNN Accuracy ",knn.score(X_test,y_test))

KNN Accuracy is 78%

KNN Accuracy  0.7857142857142857

Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier
dec_cls=DecisionTreeClassifier()
dec_cls.fit(X_train,y_train)
print("Decision Tree Accuracy ",dec_cls.score(X_test,y_test))

Decision tree classifier accuracy is about 78%

Decision Tree Accuracy  0.7792207792207793

Logistic Regression:

from sklearn.linear_model import LogisticRegression
lrc=LogisticRegression(max_iter=1000)  #max_iter raised so the solver converges
lrc.fit(X_train,y_train)
print("Logistic Regression Accuracy ",lrc.score(X_test,y_test))

Accuracy for Logistic Regression is 81%.

Logistic Regression Accuracy  0.8181818181818182

Support Vector Machine (SVM) Classifier:

from sklearn.svm import SVC
svc_classifier=SVC()
svc_classifier.fit(X_train,y_train)
print("SVC Accuracy ",svc_classifier.score(X_test,y_test))

SVC Accuracy is about 81%

SVC Accuracy  0.8181818181818182

Voting Classifier:

We trained different models (SVM, KNN, Logistic Regression, Decision Tree) on the same training data set and calculated their individual accuracy. How about combining these models and letting them vote on each prediction? This can be done using the VotingClassifier class from sklearn.

from sklearn.ensemble import VotingClassifier
vote_cls = VotingClassifier(estimators=[('svc', svc_classifier), ('lr', lrc), ('knn', knn), ('dt', dec_cls)], voting='hard')
vote_cls.fit(X_train, y_train)
print('Voting Classifier Accuracy ', vote_cls.score(X_test,y_test))

Voting classifier accuracy is 81%

Voting Classifier Accuracy  0.8181818181818182

Make a note of the voting='hard' option in VotingClassifier.

There are two kinds of voting: hard and soft.

a) In hard voting, the majority determines the outcome: for each sample, the classifier takes the mode of the individual models’ predicted labels. We had the following individual accuracy scores:

KNN Accuracy  0.7857142857142857
Decision Tree Accuracy 0.7922077922077922
SVC Accuracy 0.8181818181818182
Logistic Regression Accuracy 0.8181818181818182

The majority of the models score around 81%, so it is no surprise that the hard voting classifier also landed at 81% accuracy.

However, note that the hard voting classifier takes the mode of the predicted labels for each individual sample, not of the overall accuracy scores.
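As an illustration (the labels below are made up, not the model outputs above), here is how hard voting takes the per-sample mode of three classifiers’ predictions:

```python
from collections import Counter

# hypothetical predicted labels from three classifiers on four samples
preds = [
    [0, 1, 1, 0],  # classifier A
    [0, 1, 0, 0],  # classifier B
    [1, 1, 1, 0],  # classifier C
]

# hard voting: for each sample (column), take the most common label
majority = [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
print(majority)  # [0, 1, 1, 0]
```

Sample by sample, the majority label wins even when one classifier disagrees.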

b) Soft voting applies to probability-based classifiers (e.g. Logistic Regression); every base estimator must be able to predict class probabilities. The soft voting classifier averages the individual predicted probabilities (optionally weighted) and picks the class with the highest average.
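A toy sketch of soft voting, with made-up probabilities for one sample from two hypothetical classifiers:

```python
import numpy as np

# assumed per-class probabilities from two classifiers for one sample
p_lr  = np.array([0.9, 0.1])   # e.g. logistic regression
p_svc = np.array([0.4, 0.6])   # e.g. an SVC trained with probability=True

avg = (p_lr + p_svc) / 2       # equal-weight average -> [0.65, 0.35]
print(int(avg.argmax()))       # class 0 wins despite the SVC preferring class 1
```

Because the averaged probability for class 0 (0.65) beats class 1 (0.35), soft voting picks class 0 even though the two classifiers disagree on the label.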


So far we have used different models on the same training data set, obtained individual predictions, and used a voting classifier to combine them.

Instead of using different models on the same training data set, how about training copies of one model on several resampled subsets of the training data, then combining their outputs by voting (for classification) or averaging (for regression)? This is called Bagging (bootstrap aggregating).

Using bootstrap sampling, bagging creates several subsets of the original training data by sampling with replacement. On average, each bootstrap sample contains about 63% unique training points; the rest are duplicates.

Note: only the rows of the training data are resampled. The features are not touched; all the features are present in every subset.
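The ~63% figure comes from sampling with replacement: the chance that a given point is never drawn in n draws is (1 − 1/n)^n ≈ 1/e ≈ 0.37, so about 1 − 1/e ≈ 0.632 of the points appear at least once. A quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# bootstrap sample: draw n row indices with replacement
sample = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(sample)) / n
print(round(unique_fraction, 3))  # close to 0.632, i.e. 1 - 1/e
```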

Figure 1 explains Bagging.

Fig 1: Bagging

Simple decision trees are high-variance models, which makes them a natural fit for bagging, so let’s use bagging on the decision tree classifier.

We are going to train 25 base estimators (n_estimators=25), each on its own bootstrap sample of the training data.

from sklearn.ensemble import BaggingClassifier

#Bagging Decision Tree Classifier
#initialize base classifier
dec_tree_cls=DecisionTreeClassifier()
#number of base classifiers
no_of_trees=25
#bagging classifier
bag_cls=BaggingClassifier(base_estimator=dec_tree_cls,n_estimators=no_of_trees,random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train,y_train)
print("Bagging Classifier Accuracy ",bag_cls.score(X_test,y_test))

Accuracy has increased to 82%.

Bagging Classifier Accuracy  0.8246753246753247

As evident by this example, bagging has improved the accuracy.

Let’s try bagging with KNN classifier.

#Bagging KNN Classifier
#initialize base classifier
knn_cls=KNeighborsClassifier(n_neighbors=12)
#number of base classifiers
no_of_trees=25
#bagging classifier
bag_cls=BaggingClassifier(base_estimator=knn_cls,n_estimators=no_of_trees,random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train,y_train)
print("Bagging Classifier Accuracy ",bag_cls.score(X_test,y_test))

Accuracy is 78%.

Bagging Classifier Accuracy  0.7857142857142857

In the case of KNN, the accuracy remains the same: bagging has not improved the prediction.

Bagging brings good improvements to high-variance classifiers like a simple decision tree, but it could not improve KNN. This is because KNN is a stable, low-variance model: its prediction depends on neighboring data points, which change little across bootstrap samples.

Random Forest:

Random forest is an enhanced version of bagging. In bagging, the training data is split into several subsets without touching the features; each subset contains all the features.

Consider a typical decision tree classifier. If the training data set contains 11 features, a regular decision tree, as well as each tree in a bagging classifier, will consider all 11 features.

Regular Decision Tree Structure

In a random forest, instead of using all the features, a random subset of the features is considered at each split of each tree.

A random forest tree will look like the figure below.

Random Forest

There is more than one tree (the trees are called estimators), and each tree considers only a selected subset of the features.

Random forest is a fast and very effective classifier. Let’s use it on the same data set and see whether there is any improvement.

from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=53, n_jobs=-1, random_state=8)
rnd_clf.fit(X_train,y_train)
print("Random Forest Score ",rnd_clf.score(X_test,y_test))

Accuracy score is 83%

Random Forest Score  0.8311688311688312

So there is an improvement. However, finding the right number of estimators is key. The general belief is that the more estimators the merrier, but that’s not always true: beyond a point, extra trees add computation without improving accuracy.
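One common way to pick the number of estimators is to scan a few candidate values with cross-validation. A sketch on a stand-in sklearn dataset (not the diabetes data, so the exact scores will differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# mean 3-fold CV accuracy for a few candidate forest sizes
scores = {}
for n in (5, 25, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(clf, X, y, cv=3).mean()

print(scores)  # accuracy typically plateaus as n grows
```

Picking the value where the curve flattens keeps training fast without leaving accuracy on the table.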


In the case of bagging, the training data subsets are fed to the models in parallel, and the outcome is decided by combining the models’ individual predictions.

Boosting, in contrast, improves the performance of weak learners by reducing bias: each model learns from the mistakes of the previous one. Boosting follows sequential learning.

The diagram below explains boosting.



AdaBoost is a well-known ensemble boosting classifier. It works sequentially, as shown in the figure above. It starts with a random subset of the training data and iteratively trains the model, selecting the next training subset based on the prediction accuracy of the previous round. It reduces bias by assigning higher weights to misclassified observations, so in the next iteration those observations get a higher probability of being selected. The iterations continue until the specified maximum number of estimators is reached.
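A simplified sketch of the discrete AdaBoost weight update (the numbers and the misclassification pattern are illustrative, and the 0.5 factor follows the common binary formulation):

```python
import numpy as np

w = np.full(5, 0.2)                                  # uniform initial sample weights
miss = np.array([False, True, False, False, True])   # which samples the weak learner got wrong

err = w[miss].sum()                       # weighted error rate = 0.4
alpha = 0.5 * np.log((1 - err) / err)     # weak learner's weight in the ensemble
w[miss] *= np.exp(alpha)                  # increase weights of misclassified samples
w /= w.sum()                              # renormalize to a distribution

print(np.round(w, 3))  # misclassified samples now carry more weight
```

After the update, the misclassified samples dominate the distribution, so the next weak learner focuses on getting them right.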

Let’s use Adaboost and confirm if it improves the accuracy.

from sklearn.ensemble import AdaBoostClassifier
adb_cls=AdaBoostClassifier(n_estimators=153, learning_rate=1)
print("AdaBoost Classifier ",adb_cls.score(X_test,y_test))


AdaBoost Classifier  0.8376623376623377

Not bad! It has improved the accuracy to almost 84%.

Gradient Boosting Model (GBM)

The Gradient Boosting Model is one of the most widely used and most effective ensemble models.

Gradient Boosting can be expanded as Gradient Descent + Boosting.

Gradient descent focuses on minimizing a loss function. It can be explained well using linear regression.

The equation for linear regression is:

ŷ = w·x + b

The loss function is the Mean Squared Error (MSE):

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

where yᵢ is the actual value and ŷᵢ is the prediction for the i-th point.

Gradient descent focuses on finding the optimal value of the weight w such that the MSE is minimal.

It starts with a random value of w and calculates how changing w affects the MSE. It keeps updating w until it finds the minimum MSE, as shown in the figure below.

The size of each step is called the learning rate, which can be passed as a hyperparameter to the classifier. A high learning rate moves quickly towards the optimal point, but it might overshoot the minimum and miss the optimal value of w. A lower learning rate mitigates this risk but requires more compute, as more update steps are involved.
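The update loop above can be sketched in a few lines. Here gradient descent finds the weight w of a toy no-intercept linear model y = w·x by repeatedly stepping against the MSE gradient (the data, starting point, and learning rate are made up):

```python
import numpy as np

# toy data generated from y = 3x, so the optimal w is 3
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

w, lr = 0.0, 0.05                          # starting weight, fixed learning rate
for _ in range(200):
    grad = -2 * np.mean((y - w * X) * X)   # d(MSE)/dw
    w -= lr * grad                         # step against the gradient

print(round(w, 3))  # converges to 3.0
```

With this learning rate each step shrinks the error by a constant factor; a much larger rate would make the updates overshoot and diverge instead.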

Gradient boosting combines the two ideas: it follows the boosting mechanism of sequential learning, and each new model is fitted to the residual errors of the current ensemble, which amounts to a gradient-descent step on the loss function.
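The idea can be sketched by hand: start from a zero prediction and repeatedly fit a small tree to the current residuals (for squared loss, the residual is exactly the negative gradient). The data and parameters here are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel())

pred = np.zeros_like(y)              # initial prediction
lr = 0.5                             # learning rate (shrinkage)
for _ in range(50):
    residual = y - pred              # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * stump.predict(X)

mse = np.mean((y - pred) ** 2)
print(round(mse, 4))                 # training MSE shrinks toward 0
```

Each small tree corrects what the ensemble so far got wrong, and the learning rate controls how large each correction step is.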

Run the lines of code below to see if there is any improvement using gradient boosting.

Here, we pass different values of the learning rate and pick the best one based on the model score.

from sklearn.ensemble import GradientBoostingClassifier

lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.55, 1.65, 1.75]
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=53, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))


Learning rate:  0.05
Accuracy score (training): 0.798
Learning rate: 0.075
Accuracy score (training): 0.805
Learning rate: 0.1
Accuracy score (training): 0.816
Learning rate: 0.25
Accuracy score (training): 0.853
Learning rate: 0.5
Accuracy score (training): 0.902
Learning rate: 0.75
Accuracy score (training): 0.925
Learning rate: 1
Accuracy score (training): 0.940
Learning rate: 1.25
Accuracy score (training): 0.953
Learning rate: 1.55
Accuracy score (training): 0.935
Learning rate: 1.65
Accuracy score (training): 0.938
Learning rate: 1.75
Accuracy score (training): 0.919

The training accuracy reaches 95.3% with a learning rate of 1.25. Note that this is measured on the training set, so part of the gain may be overfitting; the score on the held-out test set should be checked as well.

This is a big improvement over the 78–81% of the individual base learners.


Happy Machine learning until next blog!



Data Science, Machine Learning and Artificial Intelligence
