Iris Species Classification Using Auto ML (pycaret)

Jerry John
Published in Analytics Vidhya · 7 min read · Jun 29, 2020


1. Introduction

In this article we are going to learn how to build a model using an approach that is quite different from the traditional, hand-rolled method used in many Kaggle notebooks. The drawback of the traditional method is that a lot of time is spent on data pre-processing, feature selection, model selection, hyperparameter tuning, and so on. These days many AutoML libraries are available that can simply be pip-installed and used very effectively. Much of that time-consuming work can be done with a couple of lines of code, and in many cases the resulting model is as accurate as, or more accurate than, one built the traditional way.

2. About the datasets

Each row of the table represents an iris flower: its species and the dimensions of its botanical parts, sepal and petal, in centimetres. Here we are looking at three different species of iris flowers:

  1. Iris Versicolor
  2. Iris Setosa
  3. Iris Virginica

The dataset includes 50 samples of each iris species along with measurements of each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other.

3. Installing necessary packages

Before we move on to the coding part, first install and import all the necessary packages.

!pip install pycaret

import numpy as np               # linear algebra
import pandas as pd              # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns            # data visualization

pycaret: the AutoML library we are using here.

numpy: library used for linear algebra.

pandas: library used for data processing and CSV file I/O (e.g. pd.read_csv).

matplotlib: library used for data visualization.

seaborn: library used for data visualization.

4. Getting the data

We can load the dataset using the pandas library.

# Reading the dataset into the "dataset" variable
dataset = pd.read_csv("../input/iris/Iris.csv")

# The iris dataset is now a pandas DataFrame.
# Showing the first 5 rows.
dataset.head()
First five rows

5. Plotting the graphs comparing each pair of columns

a. Petal Length & Petal Width

# Note: older seaborn used size=5; newer versions call this parameter height.
sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'PetalLengthCm', 'PetalWidthCm').add_legend()
Petal Length & Petal Width

b. Sepal Width & Petal Width

sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'SepalWidthCm', 'PetalWidthCm').add_legend()
Sepal Width & Petal Width

c. Sepal Width & Petal Length

sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'SepalWidthCm', 'PetalLengthCm').add_legend()
Sepal Width & Petal Length

d. Sepal Length & Petal Width

sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'SepalLengthCm', 'PetalWidthCm').add_legend()
Sepal Length & Petal Width

e. Sepal Length & Petal Length

sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'SepalLengthCm', 'PetalLengthCm').add_legend()
Sepal Length & Petal Length

f. Sepal Length & Sepal Width

sns.FacetGrid(dataset, hue='Species', height=5).map(plt.scatter, 'SepalLengthCm', 'SepalWidthCm').add_legend()
Sepal Length & Sepal Width

6. Checking whether the dataset is balanced or unbalanced

There are no null values and the data is in perfect condition, so no data pre-processing is needed here.
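A quick way to verify that claim is to count the missing values per column (a one-liner with pandas):

# Count of missing values in each column; all zeros means no nulls.
dataset.isnull().sum()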

We know that there are three different species of iris flowers (Iris versicolor, Iris setosa and Iris virginica). For any classification problem it is better if the dataset is balanced. For example, with two classes, infected (1) and non-infected (0), in 1000 rows of data, we can call the dataset balanced if roughly 50% of the rows are infected (1) and the other 50% are not infected (0). In our case the best situation is when all three species have approximately the same amount of data. (If the dataset is unbalanced, the model is far more likely to be biased toward the majority class.)

Let’s check this condition below.

dataset['Species'].value_counts().plot.pie(explode=[0.1, 0.1, 0.1], autopct='%1.1f%%', shadow=True, figsize=(10, 8))
plt.show()

Each species makes up one third of the rows, so it is a balanced dataset.

7. Preparing the data for model selection

In this step we split the dataset into two parts. The first part contains 95% of the data and is used for training and testing. The remaining 5% is held back and used to try out the final model we develop (this portion is called the unseen data).
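A minimal sketch of that split, assuming a simple random sample (the variable names data and data_unseen are reused in the later steps; the exact seed is an assumption):

# Take 95% of the rows for modeling; keep the remaining 5% as unseen data.
data = dataset.sample(frac=0.95, random_state=77)
data_unseen = dataset.drop(data.index)
data = data.reset_index(drop=True)
data_unseen = data_unseen.reset_index(drop=True)
print(data.shape, data_unseen.shape)  # roughly a 95% / 5% split of the rows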

8. Importing pycaret classification method

The AutoML library we are using here is pycaret. Using it is very easy: we pass our dataset as data and our target column as target. We can also set some other options, as shown below.

# Importing pycaret classification method
from pycaret.classification import *

# This is the first step of model selection.
# Here data is our dataset, target is the label column (the dependent
# variable), and session_id is a random number for future identification.
exp = setup(data = data, target = 'Species', session_id = 77)

Output we get

After setup completes, we get a list of the columns and their inferred types; just confirm they are correct, then hit enter.

9. Comparing the models

After confirming the column types, we can run our dataset through a range of ML algorithms and compare their performance, as shown below.

# This command compares different models on our dataset.
# The accuracy, F1, etc. of each model is listed in a table.
# Choose which model you want.
compare_models()

In the table above we can see many models with good accuracy.

The next step is to decide which algorithm to use and copy its code.

The codes for the different models are given below.

Logistic Regression: 'lr'
K Nearest Neighbours: 'knn'
Naive Bayes: 'nb'
Decision Tree: 'dt'
SVM (Linear): 'svm'
SVM (RBF): 'rbfsvm'
Gaussian Process: 'gpc'
Multi-Layer Perceptron: 'mlp'
Ridge Classifier: 'ridge'
Random Forest: 'rf'
Quadratic Disc. Analysis: 'qda'
AdaBoost: 'ada'
Gradient Boosting Classifier: 'gbc'
Linear Disc. Analysis: 'lda'
Extra Trees Classifier: 'et'
Extreme Gradient Boosting: 'xgboost'
Light Gradient Boosting: 'lightgbm'
CatBoost Classifier: 'catboost'

10. Creating the model

Now create the model using one of the codes above.

# With this command we are creating a Naive Bayes model.
# The code for Naive Bayes is 'nb'.
# fold is the number of cross-validation folds you want.
nb_model = create_model('nb', fold = 10)

This table shows the accuracy and other metrics for each of the 10 folds.

Next comes tuning of the hyperparameters. Tuning the hyperparameters can improve the accuracy and the other metrics.

For unbalanced datasets we mainly look at the F1 score; since our dataset is balanced, we can use accuracy.

For this dataset we are already getting 100% accuracy, so the model would work even without tuning the hyperparameters.

11. Tuning the hyper parameters

# Note: recent versions of pycaret expect the created model object here
# rather than the string 'nb'.
nb_tuned = tune_model(nb_model)
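tune_model also accepts options to steer the search. A short sketch (optimize and n_iter are parameters of pycaret's tune_model; the values here are illustrative assumptions):

# Optimize the search for a chosen metric and widen the random search.
nb_tuned = tune_model(nb_model, optimize = 'Accuracy', n_iter = 25)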

12. Plotting the ROC Curves

The closer the curve hugs the top-left corner (i.e. the larger the area under the curve), the better the performance.

plot_model(nb_tuned, plot = 'auc')

13. Confusion Matrix

plot_model(nb_tuned, plot = 'confusion_matrix')

Here we can see that every value is predicted accurately: all predictions fall on the diagonal of the matrix, so there are no misclassifications.

14. Predicting the accuracy using the test datasets

predict_model(nb_tuned);

We get an accuracy of 1 (i.e. 100% accuracy) on the held-out test split.

15. Checking with the unseen data

Initially we separated out part of the dataset as unseen data for checking the final model; below we perform that check. The result is a data frame with Label and Score as the last two columns, where Label is the predicted class and Score is the model's confidence in that prediction.

new_prediction = predict_model(nb_tuned, data=data_unseen)
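As a quick sanity check, we can compare the predicted labels against the true species (a sketch assuming the Label column described above alongside the original Species column):

# Fraction of unseen rows whose predicted Label matches the true Species.
unseen_accuracy = (new_prediction['Species'] == new_prediction['Label']).mean()
print(f"Accuracy on unseen data: {unseen_accuracy:.2%}")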

16. Summary

As we saw above, we got a model with 100% accuracy and no overfitting. AutoML is often preferred because it is less time consuming and gives very good results. Hyperparameter tuning is not easy for less experienced people, but it can make a huge difference in the performance of a model.

As no one is perfect, if anyone finds any errors or has suggestions, please feel free to comment below.
