Exploratory Data Analysis : Iris Dataset
--
Hello there, Namaste!!
The Iris flower data set, also known as Fisher’s Iris data set, is one of the most famous multivariate data sets used for testing Machine Learning algorithms.
This is my version of EDA on the Iris dataset.
Data insights are given for each visualization step.
Importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()
Source Of Data
The data is stored in a CSV file named ‘iris.csv’.
Loading data
iris_data = pd.read_csv('iris.csv')
iris_data
Gaining information from data
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Data Insights:
1 No column has any null entries
2 Four columns are numerical (float64)
3 Only one column (species) is categorical
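The same three checks can also be done in code instead of reading them off the `info()` printout. Below is a minimal sketch, assuming a small made-up frame called `toy` with the same column names as `iris.csv`. The helper name `summarize_columns` and the values are for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only (not real measurements from iris.csv)
toy = pd.DataFrame({
    "sepal_length": [5.0, 6.1, 6.8],
    "sepal_width": [3.4, 2.9, 3.0],
    "petal_length": [1.5, 4.5, 5.6],
    "petal_width": [0.2, 1.4, 2.1],
    "species": ["setosa", "versicolor", "virginica"],
})

def summarize_columns(df):
    """Return null counts plus the numerical and categorical column names."""
    null_counts = df.isnull().sum().to_dict()
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return null_counts, numerical, categorical

null_counts, numerical, categorical = summarize_columns(toy)
print(null_counts)
print(numerical)
print(categorical)
```

On the real data you would call `summarize_columns(iris_data)` instead of passing the toy frame.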
Statistical Insight
iris_data.describe()
Data Insights:
describe() reports, for each numerical column:
- Mean values
- Standard deviation
- Minimum values
- Maximum values
- Count and quartiles (25%, 50%, 75%)
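If you only want a few of these statistics, the result of `describe()` is itself a DataFrame indexed by statistic name, so rows can be selected with `.loc`. A small sketch on made-up numbers (the `toy` frame is an assumption for illustration, not the iris data):

```python
import pandas as pd

# Toy values for illustration only (not real measurements from iris.csv)
toy = pd.DataFrame({
    "petal_length": [1.0, 2.0, 3.0, 4.0],
    "petal_width": [0.5, 0.5, 1.5, 1.5],
})

# describe() returns a DataFrame indexed by statistic name,
# so individual statistics can be selected with .loc
stats = toy.describe().loc[["mean", "std", "min", "max"]]
print(stats)
```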
Checking For Duplicate Entries
iris_data[iris_data.duplicated()]
There are 3 duplicate rows, so before removing them we must check whether the species are balanced in number.
Checking the balance
iris_data['species'].value_counts()
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
All three species are equally represented, so we shouldn’t delete the duplicate entries: doing so might unbalance the data set and make it less useful for drawing insights.
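To see how dropping duplicates could disturb the balance, here is a rough sketch that compares the class counts before and after `drop_duplicates()`. The `toy` frame is made up for illustration (one setosa row is deliberately repeated) and does not come from `iris.csv`.

```python
import pandas as pd

# Toy rows for illustration only; the second setosa row repeats the first
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.5, 4.7, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 1.5, 1.4, 1.9, 2.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

n_duplicates = int(toy.duplicated().sum())
before = toy["species"].value_counts().to_dict()
after = toy.drop_duplicates()["species"].value_counts().to_dict()

print(n_duplicates)
print(before)
print(after)
```

In this toy case the classes start out balanced and setosa loses a row after de-duplication, which is the kind of imbalance the paragraph above is worried about.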
Data Visualization
Species count
plt.title('Species Count')
sns.countplot(x='species', data=iris_data)
Data Insight:
- This confirms visually that the species are well balanced
- Each species (Iris virginica, setosa, versicolor) has 50 as its count
Bi-variate Analysis
Comparison between various species based on sepal length and width
plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris_data, s=50)
Data Insights:
- Iris Setosa has smaller sepal lengths but larger sepal widths
- Versicolor lies roughly in the middle for both length and width
- Virginica has larger sepal lengths and smaller sepal widths
Comparison between various species based on petal length and width
plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris_data, s=50)
Data Insights:
- Setosa species have the smallest petal length as well as petal width
- Versicolor species have average petal length and petal width
- Virginica species have the highest petal length as well as petal width
Multi-variate Analysis
sns.pairplot(iris_data, hue='species', height=4)
Data Insights:
- High correlation between the petal length and petal width columns.
- Setosa has both low petal length and width
- Versicolor has both average petal length and width
- Virginica has both high petal length and width.
- Sepal width for Setosa is high and sepal length is low.
- Versicolor has average values for the sepal dimensions.
- Virginica has small sepal width but large sepal length
Checking Correlation
plt.figure(figsize=(10,11))
# Drop the non-numeric species column before computing correlations
sns.heatmap(iris_data.drop(columns='species').corr(), annot=True)
plt.show()
Data Insights:
- Sepal length and sepal width are only weakly correlated with each other
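Instead of reading the heatmap by eye, the feature pairs can also be ranked by the strength of their correlation. This is only a sketch on a made-up `toy` frame (values for illustration, not taken from `iris.csv`); it keeps the upper triangle of the correlation matrix so that each pair shows up once.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only (not real measurements from iris.csv)
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 5.5, 6.3, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 2.3, 3.3, 3.0],
    "petal_length": [1.4, 1.4, 4.5, 4.0, 6.0, 5.9],
    "petal_width": [0.2, 0.2, 1.5, 1.3, 2.5, 2.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep numerical columns only, then compute Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Mask the diagonal and the lower triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and sort by absolute value
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

The first entry of `pairs` is the most strongly correlated pair and the last entry is the weakest one.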
Checking Mean & Median Values for each species
iris_data.groupby('species').agg(['mean', 'median'])
Visualizing the distribution, mean and median using box plots & violin plots
Box plots to know about distribution
Box plots show how each of the four input variables is distributed within each species
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1])
plt.show()
Data Insights:
- Setosa generally has smaller feature values and a narrower spread
- Versicolor has average feature values and a moderate spread
- Virginica generally has the largest feature values and the widest spread
- The line inside each box marks the median of that feature (sepal length & width, petal length & width) for each species
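The numbers behind a box plot can be computed directly: the box runs from the first quartile (Q1) to the third quartile (Q3), and the line inside it is the median. Here is a small sketch with a hypothetical helper `box_stats` on made-up values (not real iris measurements):

```python
import pandas as pd

# Toy values for illustration only (not real measurements from iris.csv)
toy = pd.DataFrame({
    "species": ["setosa"] * 5 + ["virginica"] * 5,
    "petal_length": [1.0, 1.2, 1.4, 1.6, 1.8, 4.0, 5.0, 6.0, 7.0, 8.0],
})

def box_stats(df, feature):
    """Quartiles and IQR per species: the numbers a box plot draws."""
    q = df.groupby("species")[feature].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    # Interquartile range = height of the box
    q["iqr"] = q["q3"] - q["q1"]
    return q

stats = box_stats(toy, "petal_length")
print(stats)
```

A taller box (larger IQR) means the middle half of the values is more spread out, which is what "highly distributed" refers to in the insights above.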
Violin Plot for checking distribution
The violin plot shows density of the length and width in the species. The thinner part denotes that there is less density whereas the fatter part conveys higher density
fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()
Data Insights:
- Setosa has the narrowest spread for petal length & width
- Versicolor has average values and a moderate spread for petal length & width
- Virginica has large values and a wide spread for sepal length & width
- The regions of highest density lie close to the mean/median values. For example, Iris Setosa has its highest density at about 5.0 cm for sepal length, which is also the median value (5.0) as per the table
Mean / Median Table for reference
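The idea that the fattest part of a violin sits near the median can be checked numerically. The sketch below builds a simple Gaussian kernel density estimate with NumPy and compares its peak with the median. The function name `kde_peak`, the bandwidth rule and the toy values are all assumptions of this sketch; seaborn's own bandwidth choice may differ, so treat it as an illustration rather than what `violinplot` does internally.

```python
import numpy as np

def kde_peak(values, grid_points=200):
    """Locate the peak of a Gaussian kernel density estimate."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Rule-of-thumb bandwidth (an assumption of this sketch)
    bandwidth = values.std(ddof=1) * n ** (-1 / 5)
    grid = np.linspace(values.min(), values.max(), grid_points)
    # Sum one Gaussian bump per observation at every grid point
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    return float(grid[np.argmax(density)])

# Toy sepal lengths for illustration only (not real measurements)
toy_lengths = [4.6, 4.8, 4.9, 5.0, 5.0, 5.0, 5.1, 5.2, 5.4]
peak = kde_peak(toy_lengths)
median = float(np.median(toy_lengths))
print(round(peak, 2), median)
```

For a roughly symmetric sample like this one the density peak and the median land close together; for a skewed sample they can drift apart.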
Plotting the Histogram & Probability Density Function (PDF)
Plotting the probability density function (PDF) of each feature: the feature value is on the X-axis, and its histogram and corresponding kernel density estimate are on the Y-axis.
# histplot with kde=True is used here in place of the deprecated distplot
for feature in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:
    sns.FacetGrid(iris_data, hue='species', height=5) \
       .map(sns.histplot, feature, kde=True, stat='density') \
       .add_legend()
plt.show()
Data Insights:
- Plot 1 shows a significant amount of overlap between the species on sepal length, so it is not an effective classification feature
- Plot 2 shows even higher overlap between the species on sepal width, so it is not an effective classification feature either
- Plot 3 shows that petal length is a good classification feature, as it clearly separates the species. The overlap is very small (between Versicolor and Virginica), and Setosa is well separated from the other two
- Just like Plot 3, Plot 4 shows that petal width is a good classification feature. The overlap is small (between Versicolor and Virginica), and Setosa is well separated from the other two
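"Overlap" can also be given a rough number. One simple option is the histogram overlap coefficient: put both samples on the same bins, turn the counts into proportions, and add up the smaller proportion in every bin. The function `histogram_overlap` and the toy petal lengths below are my own illustration, not values from the data set, and this is just one of several ways to quantify overlap.

```python
import numpy as np

def histogram_overlap(a, b, bins=10):
    """Overlap coefficient of two samples: 0 = disjoint, 1 = identical histograms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared bin edges so the two histograms are comparable
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    pa = np.histogram(a, bins=edges)[0] / a.size
    pb = np.histogram(b, bins=edges)[0] / b.size
    return float(np.minimum(pa, pb).sum())

# Toy petal lengths for illustration only (not real measurements)
setosa = [1.3, 1.4, 1.5, 1.5, 1.6]
versicolor = [4.0, 4.3, 4.5, 4.7, 4.9]
virginica = [4.9, 5.1, 5.5, 5.8, 6.0]

overlap_sv = histogram_overlap(setosa, versicolor)
overlap_vv = histogram_overlap(versicolor, virginica)
print(overlap_sv)
print(overlap_vv)
```

A value near 0 means the feature separates the two species well, while a value near 1 means the feature is of little use for telling them apart. The result depends on the number of bins, so it is best used for comparing features rather than as an absolute score.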
Choosing Plot 3 (petal length as the classification feature) to distinguish among the species
Data Insights:
- The pdf curve of Iris Setosa ends roughly at 2.1
- If petal length < 2.1, then species is Iris Setosa
- The point of intersection between pdf curves of Versicolor and Virginica is roughly at 4.8
- If petal length > 2.1 and petal length < 4.8 then species is Iris Versicolor
- If petal length > 4.8 then species is Iris Virginica
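These rules translate directly into a tiny rule-of-thumb classifier. The sketch below uses the two cut-offs read off the plot (2.1 and 4.8); the function name `classify_by_petal_length` is hypothetical, and how it treats values exactly equal to a cut-off is an arbitrary choice, since the rules above leave those cases open.

```python
def classify_by_petal_length(petal_length, setosa_cutoff=2.1, virginica_cutoff=4.8):
    """Rule-of-thumb classifier using the two cut-offs read off the PDF plot."""
    if petal_length < setosa_cutoff:
        return "setosa"
    if petal_length < virginica_cutoff:
        return "versicolor"
    # Everything at or above the second cut-off. Ties at exactly 2.1 or 4.8
    # go to the higher class here; that is an arbitrary choice of this sketch.
    return "virginica"

for length in (1.4, 4.0, 5.5):
    print(length, classify_by_petal_length(length))
```

This is only as good as the two thresholds, and flowers in the small Versicolor/Virginica overlap region will sometimes be misclassified.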
With this I finish this blog. In the next blog we will apply Machine Learning classification algorithms to predict the species.
Thank you so much for taking your precious time to read this blog. Feel free to point out any mistakes (I’m a learner after all), give feedback or leave a comment.
Dhanyvaad!!
Feedback:
Email: pranshu453@gmail.com