Exploratory Data Analysis: Iris Dataset

Pranshu Sharma
Published in Analytics Vidhya
7 min read · Apr 3, 2021
Iris Flower
Photo by Mike on Unsplash

Hello there, Namaste!!

The Iris flower data set, or Fisher’s Iris data set, is one of the most famous multivariate data sets used for testing various machine learning algorithms.

This is my version of EDA on the Iris dataset.

Data insights are given for each visualization step.

Importing relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()

Source Of Data

The data is stored in a CSV file named 'iris.csv'.

Loading data

iris_data = pd.read_csv('iris.csv')
iris_data

Complete Iris dataset
Visual description of various features of Iris Species
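
A quick sanity check after loading can catch a misnamed or missing column early. Here is a minimal sketch: the inline CSV text is only a three-row stand-in for 'iris.csv' so the snippet runs on its own, and the names `csv_text`, `sample` and `expected_columns` are mine, not part of the original notebook.

```python
import io
import pandas as pd

# Three-row stand-in for iris.csv (illustrative only, not the full file).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Column names assumed from the info() output shown in this post.
expected_columns = ["sepal_length", "sepal_width", "petal_length",
                    "petal_width", "species"]

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Collect any expected column that did not make it into the frame.
missing = [col for col in expected_columns if col not in sample.columns]
print(sample.shape, missing)
```

With the real file you would pass 'iris.csv' to `pd.read_csv` instead of the `StringIO` object.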

Gaining information from data

iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Data Insights:

  1. No column has any null entries
  2. Four columns are numerical (float64)
  3. Only a single column (species) is categorical
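
The same facts can be checked programmatically instead of reading them off the `info()` printout. This is a small sketch on a made-up toy frame (with one missing value added on purpose so the check has something to find); `toy`, `null_counts`, `numeric_cols` and `categorical_cols` are names I chose for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative, not the real data).
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, np.nan],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Split the columns by type: numeric vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(null_counts.to_dict())
print(numeric_cols, categorical_cols)
```

On the Iris data, `null_counts` should be zero for every column, matching the "150 non-null" entries above.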

Statistical Insight

iris_data.describe()

Data Insights:

describe() summarizes each numerical column. Alongside the count and quartiles, it reports:

  1. Mean values
  2. Standard deviation
  3. Minimum values
  4. Maximum values
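
If you only need these four statistics, they can be requested directly. A minimal sketch on made-up numbers (the `toy` frame is hypothetical); note that pandas reports the sample standard deviation, i.e. it divides by n - 1.

```python
import pandas as pd

# Made-up values, chosen so the statistics are easy to verify by hand.
toy = pd.DataFrame({"petal_length": [1.0, 2.0, 3.0, 4.0, 5.0]})

# agg with a list of names returns a Series indexed by statistic.
summary = toy["petal_length"].agg(["mean", "std", "min", "max"])
print(summary)

# The same numbers appear in the corresponding rows of describe().
described = toy["petal_length"].describe()
```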

Checking For Duplicate Entries

iris_data[iris_data.duplicated()]

Duplicate Entries

There are 3 duplicates, so before deciding what to do with them we must check whether the data set is balanced, i.e. whether each species has the same number of entries.

Checking the balance

iris_data['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

Each species has exactly 50 entries. Therefore we shouldn’t delete the duplicate entries: removing them would unbalance the data set and make the comparisons between species less reliable.
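
One way to see the effect of deleting duplicates on the balance is to compare the species counts before and after `drop_duplicates()`. The sketch below uses a six-row made-up frame with one duplicated row; `toy`, `all_copies` and `lost` are illustrative names.

```python
import pandas as pd

# Made-up rows: the first two setosa rows are identical on purpose.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# keep=False flags every copy of a duplicated row, not only the later ones.
all_copies = toy[toy.duplicated(keep=False)]

# Species counts before and after dropping duplicates.
before = toy["species"].value_counts()
after = toy.drop_duplicates()["species"].value_counts()

# Rows each species would lose if duplicates were dropped.
lost = (before - after).to_dict()
print(lost)
```

A non-zero entry in `lost` means that species would end up with fewer rows than the others.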

Data Visualization

Species count

plt.title('Species Count')
sns.countplot(x=iris_data['species'])

Data Insight:

  1. This further confirms that the species are well balanced
  2. Each species (Iris virginica, setosa, versicolor) has a count of 50
Iris Flower Species

Bi-variate Analysis: Scatter Plots

Comparison between various species based on sepal length and width

plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'], hue=iris_data['species'], s=50)

Data Insights:

  1. Iris Setosa has smaller sepal lengths but larger sepal widths
  2. Versicolor lies roughly in the middle for both length and width
  3. Virginica has larger sepal lengths and smaller sepal widths

Comparison between various species based on petal length and width

plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'], hue=iris_data['species'], s=50)

Data Insights:

  1. Setosa species have the smallest petal length as well as petal width
  2. Versicolor species have average petal length and petal width
  3. Virginica species have the highest petal length as well as petal width

Bi-variate Analysis: Pair Plot

sns.pairplot(iris_data, hue="species", height=4)

Data Insights:

  1. High correlation between the petal length and petal width columns
  2. Setosa has both low petal length and width
  3. Versicolor has both average petal length and width
  4. Virginica has both high petal length and width
  5. Sepal width for Setosa is high and sepal length is low
  6. Versicolor has average values for the sepal dimensions
  7. Virginica has small sepal width but large sepal length

Checking Correlation

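plt.figure(figsize=(10,11))
# numeric_only=True skips the 'species' text column, which recent pandas versions refuse to correlate
sns.heatmap(iris_data.corr(numeric_only=True), annot=True)
plt.show()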

Heatmap

Data Insights:

  1. Sepal length and sepal width are only weakly correlated with each other
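
Instead of reading the heatmap by eye, the strongest pair can be pulled out of the correlation matrix directly. This is a rough sketch on six made-up rows, so the numbers it prints say nothing about the real data; `toy`, `pairs` and `strongest` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Six made-up rows (illustrative only).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Correlate the numeric columns only; 'species' is text.
corr = toy.select_dtypes(include="number").corr()

# Keep the upper triangle so each pair of features appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute Pearson correlation.
strongest = pairs.abs().idxmax()
print(strongest, round(pairs[strongest], 3))
```

Ranking by absolute value matters here, because a strong negative correlation is just as informative as a strong positive one.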

Checking Mean & Median Values for each species

iris_data.groupby('species').agg(['mean', 'median'])

mean and median outputs

Visualizing the distribution, mean and median using box plots & violin plots

Box plots to know about distribution

Box plots show how each of the four numerical features is distributed within each category of the "species" column

fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1])
plt.show()

Box Plots

Data Insights:

  1. Setosa generally has smaller feature values and a narrower spread
  2. Versicolor has average feature values and a moderate spread
  3. Virginica has larger feature values and a wider spread
  4. The line inside each box marks the median for that feature (sepal length & width, petal length & width); the mean is not drawn by default
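
To make the box plot less of a black box, here is a small sketch of the quantities it draws for one group: the quartiles, the interquartile range (IQR), and the points flagged as outliers. It assumes the usual 1.5 × IQR rule, which is the default in matplotlib and seaborn; `box_stats` is a helper I made up, and the numbers are not from the Iris data.

```python
import pandas as pd

def box_stats(values: pd.Series) -> dict:
    """Return the quantities a standard box plot draws for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points further than 1.5 * IQR from the box are drawn as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}

# Made-up values with one obvious outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])
stats = box_stats(toy)
print(stats)
```

To get per-species numbers, the same helper could be applied to each group, for example with `iris_data.groupby('species')['petal_length'].apply(box_stats)`.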

Violin Plot for checking distribution

The violin plot shows the density of each length and width measurement within a species. A thinner section means lower density, whereas a wider section means higher density.

fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()

Violin Plot

Data Insights:

  1. Setosa's petal length & width values fall in a narrow range
  2. Versicolor has average values and a moderate spread for petal length & width
  3. Virginica is widely spread, with larger feature values, in the case of sepal length & width
  4. The widest part of each violin lies close to the mean/median values; for example, Iris Setosa has its highest density at 5.0 cm (sepal length), which is also its median value (5.0) as per the table
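
The link between the widest part of a violin and the median can be checked numerically. The sketch below builds a simple Gaussian kernel density estimate by hand and compares its peak with the median. The sample values are made up, and the fixed bandwidth of 0.15 is my assumption (seaborn picks its own bandwidth), so treat this as an illustration rather than a reproduction of the plot.

```python
import numpy as np

def kde(grid, samples, bandwidth):
    """Gaussian kernel density estimate of samples, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up measurements clustered around 5.0 (not real Iris values).
samples = np.array([4.6, 4.8, 4.9, 5.0, 5.0, 5.0, 5.1, 5.2, 5.4])
grid = np.linspace(4.0, 6.0, 201)

density = kde(grid, samples, bandwidth=0.15)

# Location of the density peak vs. the sample median.
peak = grid[np.argmax(density)]
median = np.median(samples)
print(round(peak, 2), median)
```

For a roughly symmetric, single-peaked group the two values land close together; for a skewed group they can differ noticeably.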

Mean / Median Table for reference

Plotting the Histogram & Probability Density Function (PDF)

Plotting the probability density function (PDF) of each feature: the feature is on the X-axis, and its histogram and corresponding kernel density estimate are on the Y-axis.

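# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws the same histogram plus KDE curve
sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.histplot, "sepal_length", kde=True, stat="density") \
.add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.histplot, "sepal_width", kde=True, stat="density") \
.add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.histplot, "petal_length", kde=True, stat="density") \
.add_legend()

sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.histplot, "petal_width", kde=True, stat="density") \
.add_legend()
plt.show()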

Plot 1 | Classification feature : Sepal Length
Plot 2 | Classification feature : Sepal Width
Plot 3 | Classification feature : Petal Length
Plot 4 | Classification feature : Petal Width

Data Insights:

  1. Plot 1 shows a significant amount of overlap between the species on sepal length, so it is not an effective classification feature
  2. Plot 2 shows even higher overlap between the species on sepal width, so it is not an effective classification feature either
  3. Plot 3 shows that petal length is a good classification feature, as it clearly separates the species. The overlap is very small (only between Versicolor and Virginica), and Setosa is well separated from the other two
  4. Just like Plot 3, Plot 4 shows that petal width is a good classification feature. The overlap is small (only between Versicolor and Virginica), and Setosa is well separated from the other two
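
"Overlap" can also be put into a number. One crude way, sketched below, is to measure how much the [min, max] ranges of two species overlap on a feature; zero means the ranges are completely separate. This is my own simplification (it ignores the shape of the distributions), `range_overlap` is a made-up helper, and the rows are invented for illustration.

```python
import pandas as pd

def range_overlap(df, feature, species_a, species_b):
    """Length of the overlap between the [min, max] ranges of two species."""
    a = df.loc[df["species"] == species_a, feature]
    b = df.loc[df["species"] == species_b, feature]
    overlap = min(a.max(), b.max()) - max(a.min(), b.min())
    return max(0.0, overlap)  # disjoint ranges give a negative gap, so clip at 0

# Made-up petal lengths, three per species.
toy = pd.DataFrame({
    "petal_length": [1.3, 1.5, 1.7, 3.9, 4.5, 5.0, 4.8, 5.6, 6.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

setosa_vs_versicolor = range_overlap(toy, "petal_length", "setosa", "versicolor")
versicolor_vs_virginica = range_overlap(toy, "petal_length", "versicolor", "virginica")
print(setosa_vs_versicolor, versicolor_vs_virginica)
```

Running the same helper over all four features would give a rough ranking of how well each one separates a pair of species.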

Choosing Plot 3 (petal length as the classification feature) to distinguish among the species

Plot 3 | Classification feature : Petal Length

Data Insights:

  1. The PDF curve of Iris Setosa ends roughly at 2.1
  2. If petal length < 2.1, then the species is Iris Setosa
  3. The PDF curves of Versicolor and Virginica intersect roughly at 4.8
  4. If petal length > 2.1 and petal length < 4.8, then the species is Iris Versicolor
  5. If petal length > 4.8, then the species is Iris Virginica
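
These rules translate directly into a tiny rule-based classifier. The sketch below uses the two thresholds read off the plot (2.1 and 4.8); the function name is made up, and sending a value that falls exactly on a threshold to the higher class is my assumption.

```python
def classify_by_petal_length(petal_length: float) -> str:
    """Rule-of-thumb species guess from petal length (cm) alone."""
    if petal_length < 2.1:
        return "setosa"
    if petal_length < 4.8:
        return "versicolor"
    # Values of 4.8 and above, including the boundary itself (assumption).
    return "virginica"

# A few hand-picked lengths to exercise each branch.
for length in (1.4, 4.0, 5.5):
    print(length, classify_by_petal_length(length))
```

Because Versicolor and Virginica overlap around 4.8, a single threshold like this will misclassify some flowers near the boundary.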

With this I finish this blog. In the next blog we will apply machine learning classification algorithms to predict the species.

Thank you so much for taking your precious time to read this blog. Feel free to point out any mistakes (I’m a learner after all) and share your feedback or leave a comment.

Dhanyvaad!!

Feedback:
Email: pranshu453@gmail.com
