# Ensemble Learning

In my previous blog, I explained bias, variance, and irreducible error.

Here's the link to that blog: Bias Variance Irreducible Error and Model Complexity Trade off

One technique for reducing these errors (bias and variance) is ensemble learning. It combines several machine learning models to get a better result: decreased variance (bagging), decreased bias (boosting), or improved predictions (stacking).

In this blog, you are going to get hands-on practice with ensemble learning methods.

Data Source:

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Challenge:

Predict the outcome (diabetic or not) based on the patient's BMI, insulin level, age, and other feature values.

Let’s try different supervised learning methods and calculate their accuracy.

Execute the lines of code below to read the data into a pandas DataFrame, get the feature matrix and label array, and split the data into train and test sets.

```python
# Import libraries
import pandas as pd
import numpy as np

# Read data into a pandas DataFrame
df = pd.read_csv(r'<put your file path here>\diabetes.csv')

# Define feature matrix (X) and label array (y)
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

# Define train and test data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

Let’s try different classifiers and calculate their accuracy.

KNN Classifier:

```python
from sklearn.neighbors import KNeighborsClassifier

# Fit a KNN classifier with 12 neighbors and score it on the test set
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN Accuracy ", knn.score(X_test, y_test))
```

KNN accuracy is about 78.6%.

`KNN Accuracy  0.7857142857142857`

Decision Tree Classifier:

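```python
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree with default settings and score it on the test set
dec_cls = DecisionTreeClassifier()
dec_cls.fit(X_train, y_train)
y_pred_dec = dec_cls.predict(X_test)
print("Decision Tree Accuracy ", dec_cls.score(X_test, y_test))
```

Decision tree classifier accuracy is about 78%. No `random_state` is set here, so the exact score can vary from run to run.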

`Decision Tree Accuracy  0.7792207792207793`

Logistic Regression:

```python
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression model and score it on the test set
lrc = LogisticRegression()
lrc.fit(X_train, y_train)
y_pred_log = lrc.predict(X_test)
print("Logistic Regression Accuracy ", lrc.score(X_test, y_test))
```

Accuracy for Logistic Regression is about 81.8%.

`Logistic Regression Accuracy  0.8181818181818182`

Support Vector Machine (SVM) Classifier:

```python
from sklearn.svm import SVC

# Fit a linear-kernel SVM and score it on the test set
svc_classifier = SVC(kernel="linear", random_state=0)
svc_classifier.fit(X_train, y_train)
y_pred_svc = svc_classifier.predict(X_test)
print("SVC Accuracy ", svc_classifier.score(X_test, y_test))
```

`SVC Accuracy  0.8181818181818182`

Voting Classifier:

We trained different models (SVM, KNN, Logistic Regression, Decision Tree) on the same training data set and calculated their individual accuracy. How about letting these models vote on each prediction and going with the majority? This can be done using the VotingClassifier class from sklearn.

```python
from sklearn.ensemble import VotingClassifier

# Combine the four classifiers; each name now matches the model it labels
vote_cls = VotingClassifier(
    estimators=[('svc', svc_classifier), ('lr', lrc), ('knn', knn), ('dec', dec_cls)],
    voting='hard')
vote_cls.fit(X_train, y_train)
y_pred_vote_cls = vote_cls.predict(X_test)
print('Voting Classifier Accuracy ', vote_cls.score(X_test, y_test))
```

Voting classifier accuracy is about 81.8%.

`Voting Classifier Accuracy  0.8181818181818182`

Make a note of the `voting='hard'` option in VotingClassifier.

There are two kinds of voting: hard and soft.

a) In hard voting, the majority determines the outcome. This is like taking the mode of the individual predictions. We had the following individual accuracy scores:

```
KNN Accuracy  0.7857142857142857
Decision Tree Accuracy  0.7922077922077922
SVC Accuracy  0.8181818181818182
Logistic Regression Accuracy  0.8181818181818182
```

Two of the four models score about 81.8%, and the hard voting classifier also ends up at about 81.8% accuracy.

However, note that a hard voting classifier takes the mode of the predicted labels for each observation; it does not take the mode of the models' overall accuracy scores.
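To make the per-observation vote concrete, here is a minimal sketch. The three classifiers and their predictions are made up for illustration; they are not outputs from the diabetes models.

```python
import numpy as np

# Hypothetical predictions from three classifiers on five test samples
# (rows = classifiers, columns = samples); values are made up for illustration.
predictions = np.array([
    [1, 0, 1, 1, 0],   # classifier A
    [0, 0, 1, 0, 0],   # classifier B
    [1, 1, 1, 0, 0],   # classifier C
])

# Hard voting: for each sample (column), pick the most frequent label
hard_vote = np.array([np.bincount(column).argmax() for column in predictions.T])
print(hard_vote)  # [1 0 1 0 0]
```

The vote is taken column by column, so the ensemble can be right on a sample even when one of its members is wrong.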

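b) Soft voting applies to classifiers that can output class probabilities (e.g. Logistic Regression). A soft voting classifier calculates the average, optionally weighted, of the class probabilities predicted by the individual models and picks the class with the highest average. For regression, the analogous approach is to average the predicted values.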
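A minimal sketch of soft voting follows. It assumes synthetic stand-in data from `make_classification` (the `X_demo`, `y_demo` and `Xd_*` names are hypothetical), so the score it prints says nothing about the diabetes data set. SVC is left out because it only provides probabilities when created with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in the diabetes X and y to reuse this
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

soft_vote_cls = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier(n_neighbors=12)),
        ('dec', DecisionTreeClassifier(random_state=0)),
    ],
    voting='soft',
)
soft_vote_cls.fit(Xd_train, yd_train)

# predict_proba returns the averaged class probabilities of the three models
avg_proba = soft_vote_cls.predict_proba(Xd_test)
print("Soft Voting Accuracy ", soft_vote_cls.score(Xd_test, yd_test))
```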

# Bagging

So far we have used different models on the same training data set, got individual predictions, and used a voting classifier to combine them.

Instead of using different models on the same training data set, how about drawing several subsets of the training data, training a model on each of them, and calculating the overall outcome by voting (for classification) or averaging (for regression)? This is called bagging.

Bagging creates these subsets of the original training data using bootstrap sampling, i.e. sampling with replacement. When each bootstrap sample is as large as the original training set, it contains on average about 63% of the unique training points.

Note: Only the rows of the training data are sampled. Features are not compromised; all the features are considered in every subset.
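The share of unique points in a bootstrap sample can be checked with a short simulation. This is a sketch with a hypothetical training-set size, not part of the diabetes workflow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # hypothetical training-set size

# A bootstrap sample draws n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
# For large n the expected fraction is 1 - 1/e, roughly 0.632
```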

Figure 1 explains Bagging.

Decision trees are high-variance models, and reducing variance is what bagging is meant for, so let's use bagging on this model first.

We are going to train 25 base estimators (`n_estimators`), each on its own bootstrap sample of the training data.

```python
from sklearn.ensemble import BaggingClassifier

# Bagging Decision Tree Classifier
# initialize base classifier
dec_tree_cls = DecisionTreeClassifier()

# number of base classifiers
no_of_trees = 25

# bagging classifier; the base model is passed positionally because the keyword
# was renamed from base_estimator to estimator in newer scikit-learn releases
bag_cls = BaggingClassifier(dec_tree_cls, n_estimators=no_of_trees,
                            random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train, y_train)
y_pred_bag = bag_cls.predict(X_test)
print("Bagging Classifier Accuracy ", bag_cls.score(X_test, y_test))
```

Accuracy has increased to 82%.

`Bagging Classifier Accuracy  0.8246753246753247`

As this example shows, bagging has improved the accuracy.
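The bagging code above passes `oob_score=True` but never reads the result. With that option, each training point is scored only by the estimators whose bootstrap sample left it out, which gives an accuracy estimate without touching the test set. A minimal sketch, assuming synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in X_train and y_train to reuse this
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

oob_bag_cls = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                bootstrap=True, oob_score=True, random_state=10)
oob_bag_cls.fit(X_demo, y_demo)

# oob_score_ is the accuracy measured on the out-of-bag training points
print("Out-of-bag accuracy estimate ", oob_bag_cls.oob_score_)
```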

Let's try bagging with the KNN classifier.

```python
# Bagging KNN Classifier
# initialize base classifier
knn_cls = KNeighborsClassifier(n_neighbors=12)

# number of base classifiers
no_of_estimators = 25

# bagging classifier; the base model is passed positionally because the keyword
# was renamed from base_estimator to estimator in newer scikit-learn releases
bag_cls = BaggingClassifier(knn_cls, n_estimators=no_of_estimators,
                            random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train, y_train)
y_pred_bag = bag_cls.predict(X_test)
print("Bagging Classifier Accuracy ", bag_cls.score(X_test, y_test))
```

Accuracy is about 78.6%.

`Bagging Classifier Accuracy  0.7857142857142857`

In the case of KNN, the accuracy remains the same. Bagging has not improved the prediction.

Bagging brings good improvements to classifiers like a simple decision tree; however, it could not improve KNN. This is because KNN is a stable model: its predictions are based on neighboring data points and change little when the training sample changes.

## Random Forest:

Random forest is an enhanced version of bagging. In bagging, the training data is sampled into several subsets without compromising features; each subset contains all the features.

Consider a typical decision tree classifier. If the training data set contains 11 features, a regular decision tree as well as every tree in a bagging classifier can use all 11 features.

In a random forest, instead of using all the features, each tree considers only a random subset of the features when choosing a split, in addition to being trained on its own bootstrap sample.

A random forest will look like the figure below.

There is more than one tree (the trees are called estimators), and each split in a tree is chosen from a randomly selected subset of the features.

Random forest is a fast and very effective classifier. Let's use it on the same data set and check whether there are any improvements.

```python
from sklearn.ensemble import RandomForestClassifier

# 53 trees, trained in parallel on all available cores
rnd_clf = RandomForestClassifier(n_estimators=53, n_jobs=-1, random_state=8)
rnd_clf.fit(X_train, y_train)
y_pred_rnd = rnd_clf.predict(X_test)
print("Random Forest Score ", rnd_clf.score(X_test, y_test))
```

Accuracy score is 83%

`Random Forest Score  0.8311688311688312`

So, there is an improvement. However, finding the right number of estimators is key. The general belief is that the more estimators the merrier, but that's not always true.
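One way to look for a suitable number of estimators is to compare a few candidates with cross-validation. The sketch below assumes synthetic stand-in data and an arbitrary list of candidate values, so its scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): swap in the diabetes X and y to reuse this
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

cv_scores = {}
for n_trees in [5, 25, 53, 100]:  # candidate values chosen for illustration
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=8)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[n_trees] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
    print(f"n_estimators={n_trees}: mean CV accuracy {cv_scores[n_trees]:.3f}")
```

Cross-validation is used here instead of the test set so that the test data stays untouched while tuning.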

# Boosting

In bagging, the training data subsets were fed to the models in parallel. The outcome was decided by combining the predictions of all the models.

Boosting improves the performance of weak learners by reducing bias: each weak learner learns from the mistakes of the previous model. Boosting follows sequential learning.

The diagram below explains boosting.

AdaBoost is a famous ensemble boosting classifier. It works sequentially, as explained in the figure above. It starts by training a model on the training data with every observation weighted equally. It then iteratively trains further models, adjusting the weights based on the prediction accuracy of the previous model. It reduces bias by assigning a higher weight to wrongly classified observations, so that in the next iteration these observations have more influence on training. This iteration continues until it reaches the specified maximum number of estimators.
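The reweighting step can be sketched for a single round. This follows the classic two-class AdaBoost update with labels encoded as -1/+1; the labels and weak-learner predictions are made up, and scikit-learn's implementation may differ in its details.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions, encoded as -1/+1
y_true = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])   # the second observation is misclassified

weights = np.full(len(y_true), 1 / len(y_true))  # start with equal weights

# Weighted error of the weak learner and its say (alpha) in the final vote
error = weights[y_true != y_weak].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of misclassified points, lower the rest, then renormalize
weights = weights * np.exp(-alpha * y_true * y_weak)
weights = weights / weights.sum()
print(weights)
```

After this round the misclassified observation carries about half of the total weight, so the next weak learner is pushed to get it right.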

Let's use AdaBoost and check whether it improves the accuracy.

```python
from sklearn.ensemble import AdaBoostClassifier

# 153 sequential weak learners with a learning rate of 1
adb_cls = AdaBoostClassifier(n_estimators=153, learning_rate=1)
adb_cls.fit(X_train, y_train)
y_adb_pred = adb_cls.predict(X_test)
print("AdaBoost Classifier ", adb_cls.score(X_test, y_test))
```

Outcome

`AdaBoost Classifier  0.8376623376623377`

Not bad! It has improved the accuracy to about 83.8%.

Gradient Boosting is one of the most used and most efficient ensemble models.

Gradient descent focuses on optimizing a loss function. It can be explained well using linear regression.

Below is the equation for linear regression.
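One common way to write it for a single feature, assuming $w$ denotes the weight, $b$ the intercept, $x$ the input, and $\hat{y}$ the prediction (the symbols other than $w$ are assumptions, not taken from the text):

$$\hat{y} = w x + b$$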

Below is the formula for the Mean Squared Error (MSE) loss function.
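In its standard form, assuming $n$ training points $(x_i, y_i)$ and a linear model with weight $w$ and intercept $b$ (notation assumed for illustration):

$$\mathrm{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (w x_i + b) \right)^2$$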

Gradient descent focuses on finding the optimal value of the weight w, such that MSE is at its minimum.

It starts with a random value of w and calculates the impact of changing w on MSE. It keeps changing w until it finds the minimum MSE, as shown in the figure below.

The size of each step is called the learning rate. The learning rate can be passed as a hyperparameter to the classifier. A high learning rate means moving fast towards the optimal point; however, it might also result in overshooting the lowest point and thereby missing the optimal value of w. Keeping the learning rate lower mitigates this risk, but it requires more computation, as more steps are needed to reach the minimum.
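The effect of the learning rate can be seen in a small sketch. It assumes a model without intercept, made-up data generated from y = 3x, and a fixed starting value of w = 0 instead of a random one so that the run is reproducible.

```python
import numpy as np

# Hypothetical data generated from y = 3x, so the optimal weight is w = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def gradient_descent(learning_rate, n_steps=50, w=0.0):
    """Minimize the MSE of the model y_hat = w * x with plain gradient descent."""
    for _ in range(n_steps):
        gradient = -2.0 * np.mean(x * (y - w * x))  # derivative of MSE with respect to w
        w = w - learning_rate * gradient            # step against the gradient
    return w

w_small = gradient_descent(learning_rate=0.01)  # small steps: ends close to 3
w_large = gradient_descent(learning_rate=0.2)   # steps too large: overshoots and diverges
print(w_small, w_large)
```

With the small learning rate each step shrinks the distance to the optimum, while the large one jumps past the minimum by more than it started with, so the error grows on every step.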

Gradient Boosting focuses on reducing the residual error. It follows the boosting mechanism of sequential learning: each new model is fitted to the errors left by the models before it, so that the loss function is reduced step by step.

Run the code below to see if there's any improvement using Gradient Boosting.

Here, we pass different values of the learning rate and look for the best one based on the model score.

```python
from sklearn.ensemble import GradientBoostingClassifier

lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.55, 1.65, 1.75]

for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=53, learning_rate=learning_rate,
                                        max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    # Note: this is the accuracy on the training data, not on the test data
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
```
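The idea of fitting each new model to the residuals can be sketched by hand. This is a simplified regression version with squared-error loss on made-up data, not what `GradientBoostingClassifier` does internally for classification; all names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))               # hypothetical 1-D feature
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, 200)   # hypothetical noisy target

learning_rate = 0.1
prediction = np.full(200, y_toy.mean())   # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                          # next weak learner targets the residuals
    prediction += learning_rate * tree.predict(X_toy)   # add a scaled correction
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```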

Result:

```
Learning rate:  0.05
Accuracy score (training): 0.798
Learning rate:  0.075
Accuracy score (training): 0.805
Learning rate:  0.1
Accuracy score (training): 0.816
Learning rate:  0.25
Accuracy score (training): 0.853
Learning rate:  0.5
Accuracy score (training): 0.902
Learning rate:  0.75
Accuracy score (training): 0.925
Learning rate:  1
Accuracy score (training): 0.940
Learning rate:  1.25
Accuracy score (training): 0.953
Learning rate:  1.55
Accuracy score (training): 0.935
Learning rate:  1.65
Accuracy score (training): 0.938
Learning rate:  1.75
Accuracy score (training): 0.919
```

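The training accuracy peaks at 95.3% with a learning rate of 1.25.

Note that this score is measured on the training data, so it is not directly comparable with the test accuracy of the earlier models, such as the roughly 78% of the single Decision Tree. A high training score can also be a sign of overfitting, so a fair comparison needs the score on X_test as well.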
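Because the scores above are computed on the training data, it helps to print the test score next to them. A minimal sketch, assuming synthetic stand-in data and a shortened list of learning rates; its numbers are illustrative and are not results for the diabetes data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the diabetes train/test split to reuse this
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

gb_scores = {}
for learning_rate in [0.05, 0.25, 1.25]:
    gb_demo = GradientBoostingClassifier(n_estimators=53, learning_rate=learning_rate,
                                         max_features=2, max_depth=2, random_state=0)
    gb_demo.fit(Xd_train, yd_train)
    # Store (training accuracy, test accuracy) for each learning rate
    gb_scores[learning_rate] = (gb_demo.score(Xd_train, yd_train),
                                gb_demo.score(Xd_test, yd_test))
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_scores[learning_rate][0]))
    print("Accuracy score (test): {0:.3f}".format(gb_scores[learning_rate][1]))
```

A large gap between the two scores for a given learning rate suggests that the model is fitting noise in the training data.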

Awesome!!

Happy Machine learning until next blog!


## Sanrusha

#### Data Science, Machine Learning and Artificial Intelligence
