Ensemble Modelling: How to Perform It in Python

Shivani Parekh
Published in Analytics Vidhya · 5 min read · Aug 26, 2020
Ensemble Model (image by author)

Hey! This is Shivani Parekh. In this article, we will look at how you can combine different machine learning models using ensemble methods and improve the overall accuracy of the model.

What is ensemble modelling and why do we use it?

Ensemble modelling combines two or more models and synthesizes their results into a single score. A single model can have biases, high variance, or inaccuracies that affect the reliability of its predictions, so an ensemble model can be used to mitigate these weaknesses.

We will be using the Breast Cancer Prediction dataset, which can be found at https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset

Overview of Breast Cancer Dataset

As we can see, there are 6 columns in this dataset, of which our target feature is diagnosis. We will be predicting whether a woman has breast cancer or not. diagnosis has two values, 0 and 1: 0 means she does not have breast cancer and 1 means she does.

So we'll start by importing the libraries required to make predictions.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

So, to recap: we are using pandas (a data analysis and manipulation tool), numpy to work with arrays, matplotlib.pyplot for plotting graphs, and sklearn (a machine learning library containing the ML algorithms we need). The sklearn.metrics module includes score functions, performance metrics, and pairwise metrics and distance computations.

Next, let's read the CSV file using pandas and show the first few rows using the head() method.

dataset=pd.read_csv("D:\\Breast_cancer_data.csv")
dataset.head()
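Before modelling, it is also worth a quick sanity check of the dataset's shape and the class balance of the target column. This is a small optional snippet; the column name diagnosis is the target described above.

print(dataset.shape)                        # number of rows and columns
print(dataset["diagnosis"].value_counts())  # how many 0s (no cancer) and 1s (cancer)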

Now let's separate the target feature from this dataframe by selecting the first 5 columns into X and the target (diagnosis) into the variable target using iloc.

X=dataset.iloc[:,0:5]
print(X)
target=dataset.iloc[:,5]
print(target)

Now, we split the dataset into train and test sets using the train_test_split() method. test_size takes the split proportion; here it is 20%, so 80% of the data is used for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, target_train, target_test = train_test_split(X, target, test_size = 0.20)
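Note that without a fixed random seed, the split (and therefore the accuracies reported below) will vary slightly on every run. If you want reproducible numbers, one option is to pass random_state and stratify, for example (optional, not used for the results shown here):

X_train, X_test, target_train, target_test = train_test_split(
    X, target, test_size=0.20, random_state=42, stratify=target)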

Now we will train the individual models and measure their accuracy.
The K-Nearest Neighbors algorithm is used first; n_neighbors specifies the number of neighbors to use.

# creating empty lists
estimators = []   # will store model names and their classifier instances
accuracys = []    # will store the accuracy of each model

model1 = KNeighborsClassifier(n_neighbors=3)
# adding model1 to the list; this step is needed for the ensemble method
estimators.append(("KNN", model1))
model1.fit(X_train, target_train)
target_pred1 = model1.predict(X_test)
KNNacc = accuracy_score(target_test, target_pred1)
print("KNN acc:", KNNacc)
# adding this model's accuracy to the list; this will be used for data visualization
accuracys.append(KNNacc)

We can see that the KNN model gives us an accuracy of 0.8508.

Now, what does model1.fit(X_train, target_train) do?
It trains the model on the training data.
What does model1.predict(X_test) do?
It predicts labels for the test data, here X_test.
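For instance, once the model has been fitted, you can also predict for a single row of the test set (just an illustration; row 0 is an arbitrary choice):

single_row = X_test.iloc[[0]]        # double brackets keep it a one-row DataFrame
print(model1.predict(single_row))    # prints the predicted class, 0 or 1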

Similarly, we do the same with the other two models we are using, i.e. DecisionTreeClassifier() and SVC().

model2 = DecisionTreeClassifier()
estimators.append(("cart", model2))
model2.fit(X_train, target_train)
target_pred2 = model2.predict(X_test)
Dtacc = accuracy_score(target_test, target_pred2)
print("Decision Tree acc:", Dtacc)
accuracys.append(Dtacc)

model3 = SVC()
estimators.append(("svm", model3))
model3.fit(X_train, target_train)
target_pred3 = model3.predict(X_test)
SVMacc = accuracy_score(target_test, target_pred3)
print("SVM acc:", SVMacc)
accuracys.append(SVMacc)

The decision tree accuracy is 0.8596 and the SVC accuracy is 0.8684.

So, this is the part where we use the ensemble method, VotingClassifier(). It takes the argument estimators=estimators (our list happens to share the parameter's name). Recall that estimators holds our 3 models: KNN, DecisionTree, and SVC. By default, VotingClassifier uses hard voting, i.e. a majority vote over the predicted classes.

ensemble=VotingClassifier(estimators=estimators)
ec=ensemble.fit(X_train,target_train)
target_pred=ec.predict(X_test)
print(target_pred)

Above, we have created a single model that predicts the output class based on the majority of votes it receives from the individual classifiers.
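To make this concrete, here is a minimal sketch that reproduces the hard (majority) vote by hand from the three individual predictions we already computed. This is only an illustration; VotingClassifier does it internally.

import numpy as np   # already imported above as np

# stack the three models' predictions: shape (3, number of test samples)
all_preds = np.vstack([target_pred1, target_pred2, target_pred3])

# with two classes and three voters, class 1 wins whenever at least 2 of the 3 models predict 1
manual_vote = (all_preds.sum(axis=0) >= 2).astype(int)
print(manual_vote[:10])   # compare with target_pred[:10] above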

Let's view the predicted and actual values side by side so we can compare them.

df=pd.DataFrame({'Actual':target_test, 'Predicted':target_pred})
df.head(20)

We compute the accuracy of the ensemble using accuracy_score(target_test, target_pred) and the confusion matrix, and we can also check performance using the classification report, which gives us precision, recall, f1-score, and support.

ensem_acc=accuracy_score(target_test,target_pred)
print("Accuracy of ensemble model is :",ensem_acc)
print(confusion_matrix(target_test,target_pred))
print(classification_report(target_test,target_pred))

So here we can see that the accuracy of the ensemble model is 0.8771, which is clearly higher than any of the individual models.
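As a side note, VotingClassifier also supports voting='soft', which averages the predicted class probabilities instead of counting class votes. Here is a minimal sketch of how that would look (SVC needs probability=True for this; this variant is not what produced the results above):

soft_ensemble = VotingClassifier(
    estimators=[("KNN", KNeighborsClassifier(n_neighbors=3)),
                ("cart", DecisionTreeClassifier()),
                ("svm", SVC(probability=True))],
    voting="soft")
soft_ensemble.fit(X_train, target_train)
print(accuracy_score(target_test, soft_ensemble.predict(X_test)))

Coming back to our hard-voting ensemble, let's print all the accuracies together for comparison.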

print("KNN acc:", KNNacc)
print("Decision Tree acc:", Dtacc)
print("SVM acc:", SVMacc)
print("Ensemble acc:", ensem_acc)
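As a quick preview of this comparison, here is a minimal matplotlib sketch of one way to plot the scores as a bar chart, using the accuracys list we built along the way (the model_names list is added here only for labelling):

model_names = ["KNN", "Decision Tree", "SVM", "Ensemble"]
scores = accuracys + [ensem_acc]   # the three individual accuracies plus the ensemble's

plt.bar(model_names, scores)
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.title("Individual models vs. ensemble")
plt.show()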

That covers the entire code for this example.

So that's pretty much it. In the next article, I will show you how to visualize and compare the accuracies of these models in more detail.

I hope you liked my article 😃. Please appreciate my hard work, if possible, by giving it some claps 👏👏. Thank you.
