Exploratory Data Analysis: Iris Dataset

Published in Analytics Vidhya · 7 min read · Apr 3, 2021

Hello there, Namaste!!

The Iris flower data set, also known as Fisher’s Iris data set, is one of the most famous multivariate data sets used for testing machine learning algorithms.

This is my version of EDA on the Iris dataset.

Data insights are given after each visualization step.

Importing relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()

Source Of Data

The data is stored in a CSV file named ‘iris.csv’.

Loading data

# Read the CSV into a DataFrame (plain quotes; curly quotes are a syntax error)
iris_data = pd.read_csv('iris.csv')
iris_data

Complete Iris dataset

Visual description of various features of Iris Species
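Before going further, it can help to confirm that the file really has the columns the rest of the analysis relies on. The following is a minimal sketch, not part of the original notebook: `load_iris` and `EXPECTED_COLUMNS` are hypothetical names, and the three rows are only illustrative values in the same layout as iris.csv.

```python
import io
import pandas as pd

# Illustrative rows in the same layout as iris.csv (not the full dataset).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

EXPECTED_COLUMNS = ["sepal_length", "sepal_width",
                    "petal_length", "petal_width", "species"]

def load_iris(source):
    """Hypothetical helper: read the CSV and fail early if a column is missing."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df

# io.StringIO stands in for the file; with the real data, pass 'iris.csv'.
sample = load_iris(io.StringIO(SAMPLE_CSV))
print(sample.shape)
```

With the real file, the same helper would be called as `load_iris('iris.csv')`.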

Gaining information from data

iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Data Insights:

1 None of the columns has null entries

2 Four columns are numerical (float64)

3 Only one column (species) is categorical
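The same three facts can also be checked programmatically instead of being read off the `info()` printout. This is a small sketch on a made-up three-row frame (the name `toy` and its values are assumptions for illustration); on the real data you would use `iris_data` in its place.

```python
import pandas as pd

# Made-up rows, only to show the calls; substitute iris_data for real use.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Split the columns by type: numeric vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print(numeric_cols, categorical_cols)
```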

Statistical Insight

iris_data.describe()

Data Insights:

For each numerical column, describe() reports (among other statistics):
Mean value
Standard deviation
Minimum value
Maximum value
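To see exactly what those rows contain, here is a tiny sketch on a made-up column (the frame `toy` is an assumption, not the Iris data). Note that the standard deviation reported by `describe()` is the sample standard deviation (it divides by n - 1).

```python
import pandas as pd

# A made-up column that is easy to verify by hand.
toy = pd.DataFrame({"petal_length": [1.0, 2.0, 3.0, 4.0]})

# Keep only the four statistics discussed above.
summary = toy.describe().loc[["mean", "std", "min", "max"]]
print(summary)
```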

Checking For Duplicate Entries

iris_data[iris_data.duplicated()]

There are 3 duplicate rows, so before deciding whether to drop them we must check whether the species are balanced in count.

Checking the balance

iris_data['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

The species are perfectly balanced, so we shouldn’t delete the duplicate entries: doing so might unbalance the data set and make it less useful for drawing insights.
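The effect of dropping duplicates on the class balance can be checked directly. The sketch below uses a made-up six-row frame (`toy` and its values are assumptions) in which one setosa row is repeated; with the real data, `iris_data` would take its place.

```python
import pandas as pd

# Made-up rows: the second setosa row is an exact copy of the first.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# duplicated() marks the second and later copies of a row as True.
dup_mask = toy.duplicated()

# Which species would lose rows if the duplicates were dropped?
dups_per_species = toy.loc[dup_mask, "species"].value_counts()

# Class counts before and after dropping duplicates.
counts_before = toy["species"].value_counts()
counts_after = toy.drop_duplicates()["species"].value_counts()

print(dups_per_species)
print(counts_before)
print(counts_after)
```

Comparing `counts_before` with `counts_after` shows which species would lose rows if the duplicates were removed.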

Data Visualization

Species count

plt.title('Species Count')
# Pass the column by keyword; recent seaborn versions expect x=/data=
sns.countplot(x='species', data=iris_data)

Data Insight:

The plot confirms that the species are well balanced
Each species (Iris virginica, setosa, versicolor) has a count of 50

Bi-variate Analysis (Scatter Plots)

Comparison between various species based on sepal length and width

plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
# x and y must be passed by keyword in recent seaborn versions
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris_data, s=50)

Data Insights:

Iris Setosa has smaller sepal lengths but larger sepal widths.
Versicolor lies roughly in the middle for both sepal length and sepal width.
Virginica has larger sepal lengths and smaller sepal widths.

Comparison between various species based on petal length and width

plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
# x and y must be passed by keyword in recent seaborn versions
sns.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris_data, s=50)

Data Insights

Setosa species have the smallest petal length as well as petal width
Versicolor species have average petal length and petal width
Virginica species have the highest petal length as well as petal width

Bi-variate Analysis

sns.pairplot(iris_data, hue='species', height=4)

Data Insights:

High correlation between petal length and petal width.
Setosa has both low petal length and low petal width.
Versicolor has both average petal length and average petal width.
Virginica has both high petal length and high petal width.
Sepal width for Setosa is high and sepal length is low.
Versicolor has average values for the sepal dimensions.
Virginica has small sepal width but large sepal length.

Checking Correlation

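plt.figure(figsize=(10,11))
# Correlation is only defined for numeric columns, so drop 'species' first
sns.heatmap(iris_data.drop(columns='species').corr(), annot=True)
plt.show()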

Data Insights:

Sepal length and sepal width are only weakly correlated with each other
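Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a sketch of what that number means, the snippet below computes it by hand with NumPy and compares it with pandas. The frame `toy` holds made-up values and `pearson` is a hypothetical helper, not something from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up measurements, only to illustrate the computation.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# pandas result on the numeric columns only.
corr_matrix = toy.drop(columns="species").corr()

# Manual result for one pair of columns.
r_manual = pearson(toy["petal_length"], toy["petal_width"])

print(corr_matrix)
print(r_manual)
```

A value close to +1 or -1 means a strong linear relationship, and a value close to 0 means a weak one.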

Checking Mean & Median Values for each species

# The DataFrame is named iris_data (not iris); aggregate every feature per species
iris_data.groupby('species').agg(['mean', 'median'])
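The result of `agg(['mean', 'median'])` has two-level column names (feature, statistic), which can be confusing at first. Here is a small sketch on made-up values (the frame `toy` is an assumption) showing how to read a single cell out of such a table.

```python
import pandas as pd

# Made-up values for two species and one feature.
toy = pd.DataFrame({
    "species": ["setosa"] * 3 + ["virginica"] * 3,
    "petal_length": [1.4, 1.3, 1.8, 6.0, 5.1, 5.4],
})

summary = toy.groupby("species").agg(["mean", "median"])

# Columns are (feature, statistic) pairs, so a cell is addressed with a tuple.
setosa_mean = summary.loc["setosa", ("petal_length", "mean")]
setosa_median = summary.loc["setosa", ("petal_length", "median")]
virginica_median = summary.loc["virginica", ("petal_length", "median")]

print(summary)
```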

Visualizing the distribution, mean and median using box plots & violin plots

Box plots to know about distribution

Box plots show how each of the four input variables is distributed within each category of the “species” feature

# One box plot per feature, arranged in a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1])
plt.show()

Data Insights:

Setosa has the smallest values for most features and the narrowest spread
Versicolor has intermediate values and an intermediate spread
Virginica has the largest values for most features and the widest spread
The line inside each box marks the median of that feature (sepal length & width, petal length & width); a box plot does not draw the mean by default
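To make the parts of a box plot concrete, the sketch below computes them with NumPy: the quartiles (box edges), the median (line inside the box), and the 1.5 × IQR fences beyond which points are drawn as outliers. The 1.5 × IQR rule is the common convention and matplotlib's default; `box_stats` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def box_stats(values):
    """Return the quantities a standard box plot is built from."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1                      # height of the box
    lower_fence = q1 - 1.5 * iqr       # points below this are outliers
    upper_fence = q3 + 1.5 * iqr       # points above this are outliers
    outliers = v[(v < lower_fence) | (v > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers.tolist(),
    }

# Made-up values with one obvious outlier.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

On a real column you could call it as `box_stats(iris_data['sepal_width'])`.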

Violin Plot for checking distribution

The violin plot shows the density of each length and width measurement within each species. The thinner parts indicate lower density, whereas the fatter parts indicate higher density

fig, axes = plt.subplots(2, 2, figsize=(16,10))
# inner='quartile' draws the quartiles as dashed lines inside each violin
sns.violinplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()

Data Insights:

Setosa has the narrowest spread, with its density concentrated in a small range, for petal length & width
Versicolor has intermediate values and an intermediate spread for petal length & width
Virginica has the widest spread, with large values, for sepal length & width
The high-density regions lie close to the mean/median values, for example: Iris Setosa has its highest density at about 5.0 cm (sepal length), which is also the median value (5.0) as per the table

Mean / Median Table for reference
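The width of a violin at a given height is a kernel density estimate (KDE) of the data at that value. As a rough sketch of the idea, the snippet below builds a Gaussian KDE by hand with NumPy and finds where the density peaks. The sample values and the bandwidth of 0.2 are assumptions chosen for illustration, and `gaussian_kde_1d` is a hypothetical helper (seaborn chooses its own bandwidth).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = scaled distance between grid point i and sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up sepal-length-like values clustered around 5.0.
samples = [4.8, 4.9, 5.0, 5.0, 5.0, 5.1, 5.2, 5.8]
grid = np.linspace(3.0, 8.0, 501)
density = gaussian_kde_1d(samples, grid, bandwidth=0.2)

peak = float(grid[np.argmax(density)])                # value with the highest density
area = float(density.sum() * (grid[1] - grid[0]))     # should be close to 1

print(peak, area)
```

The peak of the density is the most frequent value (the mode), which often sits near the median for roughly symmetric data but is not guaranteed to equal it.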

Plotting the Histogram & Probability Density Function (PDF)

Plotting the probability density function (PDF) of each feature: the feature value is on the X-axis, and its histogram and the corresponding kernel density estimate are on the Y-axis.

# One figure per feature; the DataFrame is named iris_data (not iris).
# histplot with kde=True replaces distplot, which is deprecated in seaborn.
for feature in ["sepal_length", "sepal_width", "petal_length", "petal_width"]:
    sns.FacetGrid(iris_data, hue="species", height=5) \
        .map(sns.histplot, feature, kde=True, stat="density") \
        .add_legend()
plt.show()

Plot 1 | Classification feature : Sepal Length

Plot 2 | Classification feature : Sepal Width

Plot 3 | Classification feature : Petal Length

Plot 4 | Classification feature : Petal Width

Data Insights:

Plot 1 shows a significant amount of overlap between the species on sepal length, so it is not an effective classification feature
Plot 2 shows even more overlap between the species on sepal width, so it is not an effective classification feature either
Plot 3 shows that petal length is a good classification feature, as it clearly separates the species. The overlap is very small (between Versicolor and Virginica), and Setosa is well separated from the other two
Just like Plot 3, Plot 4 shows that petal width is a good classification feature. The overlap is small (between Versicolor and Virginica), and Setosa is well separated from the other two
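The "amount of overlap" read off the plots can also be put into a number. One simple option, sketched below, is the histogram overlap coefficient: bin both groups on shared bin edges, normalise the counts, and sum the smaller of the two proportions in each bin. A result of 0 means the groups are fully separated and 1 means their histograms are identical. `histogram_overlap`, the choice of 10 bins, and the input values are assumptions for illustration.

```python
import numpy as np

def histogram_overlap(a, b, bins=10):
    """Overlap coefficient of two samples using shared, equal-width bins."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared bin edges covering the range of both samples.
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()   # proportions per bin
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())

# Made-up values: two clearly separated groups, and two identical groups.
separated = histogram_overlap([1.3, 1.4, 1.5, 1.6], [4.5, 4.7, 4.9, 5.1])
identical = histogram_overlap([5.0, 5.5, 6.0], [5.0, 5.5, 6.0])

print(separated, identical)
```

On the real data, the two inputs would be the same feature filtered by species, for example the petal length of Versicolor and of Virginica.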

Choosing Plot 3 (Classification feature: Petal Length) to distinguish among the species

Data Insights:

The PDF curve of Iris Setosa ends roughly at 2.1
If petal length < 2.1, then the species is Iris Setosa
The PDF curves of Versicolor and Virginica intersect roughly at 4.8
If petal length > 2.1 and petal length < 4.8, then the species is Iris Versicolor
If petal length > 4.8, then the species is Iris Virginica
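These rules translate directly into a small function. The sketch below is one possible way to write them; `classify_by_petal_length` is a hypothetical name, and the thresholds 2.1 and 4.8 are the approximate values read off the plot above. The rules do not say what happens exactly at a threshold, so assigning 2.1 to Versicolor and 4.8 to Virginica is an assumption.

```python
def classify_by_petal_length(petal_length, setosa_max=2.1, versicolor_max=4.8):
    """Rule-based guess of the species from petal length alone (in cm)."""
    if petal_length < setosa_max:
        return "setosa"
    if petal_length < versicolor_max:
        return "versicolor"
    # Boundary values (exactly 2.1 or 4.8) fall into the higher class by assumption.
    return "virginica"

# Made-up petal lengths, one per region.
for length in (1.4, 4.5, 6.0):
    print(length, classify_by_petal_length(length))
```

Because Versicolor and Virginica overlap slightly around 4.8, a rule like this will misclassify some flowers near that boundary.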

With this I finish this blog. In the next blog we will apply machine learning classification algorithms to predict the species.

Thank you so much for taking the time to read this blog. Feel free to point out any mistakes (I’m a learner after all), share your feedback, or leave a comment.

Dhanyvaad!!

Feedback:
Email: pranshu453@gmail.com

Written by Pranshu Sharma