Exploratory Data Analysis: Uni-variate analysis of Iris Data set
How to perform Data analysis and visualization on Iris flower data set using Pandas, Matplotlib and Seaborn libraries of Python.
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is a process of analyzing data by using simple concepts from statistics & probability and presenting the results in easy-to-understand pictorial format.
In one sentence: being the Sherlock Holmes of data!
About Iris Data set
The Iris flower data set consists of 50 samples from each of three species of Iris flowers: Iris Setosa, Iris Virginica and Iris Versicolor. The data set was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems”.
Iris data is a multivariate data set. Four features are measured from each sample: sepal length, sepal width, petal length and petal width, all in centimeters.
The Iris data is publicly available and is one of the most widely used data sets, especially among beginners in Data Science & Machine Learning. It consists of 150 records with 5 attributes: sepal length, sepal width, petal length, petal width and the class label (species).
Download Iris data set here — University Of California, Irvine archives (open resources)
In Machine Learning terminology, the observed features (sepal length, sepal width, petal length and petal width) are called independent variables, while the class label that is to be determined is called the dependent variable.
What is our Objective?
Given the sepal length, sepal width, petal length and petal width, classify the Iris flower into one of the three species — Setosa, Virginica and Versicolor.
Basic Statistical Analysis — Central Tendency and Spread of Data
Import all necessary libraries of Python —
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
The Iris data set is stored in .csv format. ‘.csv’ stands for comma-separated values. It is easy to load .csv files into a Pandas data frame and perform various analytical operations on them.
Load Iris.csv into a Pandas data frame —
iris = pd.read_csv("iris.csv")
Determining the mean and median of the different species present in the data set —
iris.groupby('species').agg(['mean', 'median'])
For every species, the mean and median of each feature are found to be close to each other. This indicates that the data is nearly symmetrically distributed, with very few outliers. The box plot (explained later) is one of the best statistical tools for detecting outliers in the data.
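To see why a small gap between mean and median hints at few outliers, here is a minimal sketch using Python's built-in statistics module. The numbers are made up for illustration and are not taken from the Iris data; the helper name mean_median_gap is also hypothetical.

```python
import statistics

# Hypothetical petal lengths in cm, invented for illustration only.
symmetric = [1.3, 1.4, 1.5, 1.6, 1.7]
with_outlier = symmetric + [6.0]  # same values plus one extreme point


def mean_median_gap(values):
    """Absolute difference between the mean and the median."""
    return abs(statistics.mean(values) - statistics.median(values))


gap_symmetric = mean_median_gap(symmetric)
gap_outlier = mean_median_gap(with_outlier)

print("gap without outlier:", gap_symmetric)
print("gap with outlier:   ", gap_outlier)
```

A single extreme value pulls the mean much further than the median, so the gap widens noticeably in the second list.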
Computing the Standard deviation —
iris.groupby('species').std()
Standard deviation (the square root of the variance) indicates how widely the data is spread about the mean.
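The grouped statistics above can be tried on a tiny frame first. The sketch below uses a made-up six-row data frame (the values are hypothetical, not real Iris measurements) and assumes the same column names as the article, 'species' and 'petal_length'.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa",
                "virginica", "virginica", "virginica"],
    "petal_length": [1.3, 1.5, 1.7, 5.0, 5.5, 6.0],
})

# One row per species, one column per statistic.
# pandas computes the sample standard deviation (ddof=1) by default.
summary = toy.groupby("species")["petal_length"].agg(["mean", "median", "std"])
print(summary)
```

Each species gets one row, which makes it easy to compare central tendency and spread side by side.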
Box Plot and Violin Plot
Box Plot
A box plot, also known as a box and whisker plot, displays a summary of a large amount of data in five numbers: minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile) and maximum.
Plotting the box-plots using Seaborn library —
sns.set(style="ticks")
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.boxplot(x='species',y='sepal_length',data=iris)
plt.subplot(2,2,2)
sns.boxplot(x='species',y='sepal_width',data=iris)
plt.subplot(2,2,3)
sns.boxplot(x='species',y='petal_length',data=iris)
plt.subplot(2,2,4)
sns.boxplot(x='species',y='petal_width',data=iris)
plt.show()
The isolated points that can be seen in the box plots above are the outliers in the data. Since there are very few of them, they should not have any significant impact on our analysis.
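How a point ends up drawn as "isolated" can be sketched with the common 1.5 × IQR rule, which is also seaborn's default whisker setting. The values below are hypothetical and chosen only to produce one low and one high outlier.

```python
import numpy as np

# Hypothetical sepal lengths in cm, invented for illustration only.
values = np.array([4.3, 4.8, 5.0, 5.0, 5.1, 5.2, 5.4, 5.8, 7.9])

# Quartiles and interquartile range (IQR).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("quartiles:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```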
Violin Plot
A violin plot plays a similar role to a box and whisker plot. It shows the distribution of data across several levels of one (or more) categorical variables (flower species in our case) so that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual data points, the violin plot additionally shows a kernel density estimate of the underlying distribution.
Plotting Violin plots —
sns.set(style="whitegrid")
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.violinplot(x='species',y='sepal_length',data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='species',y='sepal_width',data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='species',y='petal_length',data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='species',y='petal_width',data=iris)
plt.show()
Violin plots are typically more informative than box plots, because they show the underlying distribution of the data in addition to the statistical summary.
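The kernel density estimate behind a violin's outline can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. This is a minimal illustration, not seaborn's implementation; the function name gaussian_kde_1d, the sample values and the bandwidth of 0.1 are all assumptions made here.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


# Hypothetical petal lengths in cm, invented for illustration only.
samples = [1.3, 1.4, 1.5, 1.5, 1.7]
grid = np.linspace(0, 3, 301)
density = gaussian_kde_1d(samples, grid, bandwidth=0.1)

# A density should integrate to roughly 1 over its support.
area = float(density.sum() * (grid[1] - grid[0]))
print("approximate area under the curve:", area)
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows individual points more closely.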
Probability Density Function (PDF) & Cumulative Distribution Function (CDF)
Uni-variate analysis, as the name suggests, is the analysis of one variable at a time. Our ultimate aim is to correctly identify the species of an Iris flower given its features: sepal length, sepal width, petal length and petal width. Which of the four features is most useful for distinguishing between the species? To answer this, we will plot the probability density function (PDF) of each feature, with the feature on the X-axis and its histogram and corresponding kernel density estimate on the Y-axis.
Before we begin further analysis, we need to split the Data Frame according to the 3 distinct class-labels — Setosa, Versicolor and Virginica.
iris_setosa = iris[iris["species"] == "setosa"]
iris_versicolor = iris[iris["species"] == "versicolor"]
iris_virginica = iris[iris["species"] == "virginica"]
Plotting the Histogram & PDF using Seaborn FacetGrid object —
# One histogram + density panel per feature, colored by species.
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True
# draws the same histogram and kernel density curve.
for feature in ["sepal_length", "sepal_width", "petal_length", "petal_width"]:
    grid = sns.FacetGrid(iris, hue="species", height=5)
    grid.map(sns.histplot, feature, kde=True, stat="density")
    grid.add_legend()
plt.show()
*All lengths are in centimeters.
The density plot of sepal length (Plot 1) reveals a significant amount of overlap between the species, so it wouldn’t be a good idea to use sepal length as the distinguishing feature in our uni-variate analysis.
With sepal width as the classification feature (Plot 2), the overlap is even greater than for sepal length in Plot 1. The spread of the data is also high. So, again, we cannot say much about the species of a flower given only its sepal width.
The density plot of petal length (Plot 3) looks promising for uni-variate classification. Setosa is well separated from Versicolor and Virginica. There is some overlap between Versicolor and Virginica, but less than in the two plots above.
The density plot of petal width (Plot 4) also looks good. There is a slight intersection between Setosa and Versicolor, while the overlap between Versicolor and Virginica is similar to that of petal length (Plot 3).
To summarize, if we have to choose one feature for classification, we will pick petal length (Plot 3) to distinguish among the species. If we have to select two features, we will choose petal width as the second one, but then it would be wiser to look at pair-plots (bi-variate and multivariate analysis) to determine which two features are most useful for classification.
We have already established above that petal length stands out as a useful metric for differentiating between the species of Iris flower. From our preliminary investigation, the following pseudo-code can be constructed:
(Note that this estimation is based on the kernel density smoothed probability distribution plots obtained from histograms)
If petal_length < 2.1
then species = ‘Setosa’
else if petal_length < 4.8
then species = ‘Versicolor’
else
species = ‘Virginica’
*all lengths are in centimeters.
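The rule above translates directly into a small Python function. This is a sketch only: the function name classify_by_petal_length is hypothetical, and it assumes that a flower sitting exactly on a threshold (2.1 or 4.8) falls into the upper class.

```python
def classify_by_petal_length(petal_length):
    """Guess the Iris species from petal length (cm) alone.

    Thresholds 2.1 and 4.8 are read off the density plots; values exactly
    on a threshold are assigned to the upper class (an assumption).
    """
    if petal_length < 2.1:
        return "setosa"
    elif petal_length < 4.8:
        return "versicolor"
    else:
        return "virginica"


# Example usage with hypothetical measurements.
for length in (1.4, 4.5, 5.9):
    print(length, "->", classify_by_petal_length(length))
```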
Although Setosa is clearly separated, there is a small overlap between the Versicolor and Virginica species. We picked the 4.8 mark to distinguish between Virginica and Versicolor because the density plot shows that most, though not all, Versicolor flowers have a petal length less than 4.8, while most Virginica flowers have a petal length greater than 4.8.
With this preliminary analysis, it is quite possible that some Versicolor flowers whose petal length is greater than 4.8 will be incorrectly classified as Virginica. Similarly, some Virginica flowers whose petal length happens to be less than 4.8 will be incorrectly classified as Versicolor.
Is there some way to measure what proportion of Versicolor and Virginica flowers will be incorrectly classified by the above analysis? That’s where cumulative distribution plots come into the picture!
Using Cumulative Distribution Function (CDF) plots to quantify the proportion of misclassified flowers in above analysis
The area under the PDF curve over an interval represents the probability that the random variable takes a value in that interval. In our analysis, petal length is the random variable.
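Written as a formula, with f denoting the PDF of petal length X (the symbols f, X, a and b are notation introduced here, not taken from the plots), the probability of landing in the interval from a to b is:

```latex
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx
```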
Mathematically, the CDF is the integral of the PDF from the lowest possible value of a continuous random variable up to a given point. The CDF evaluated at any point ‘x’ gives the probability that the random variable takes a value less than or equal to ‘x’.
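For a finite sample, this probability can be estimated as the fraction of observations that are less than or equal to ‘x’. The sketch below does exactly that; the helper name ecdf_at and the petal lengths are hypothetical, chosen only to show how a threshold translates into a misclassified share.

```python
import numpy as np


def ecdf_at(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values <= x))


# Hypothetical petal lengths in cm, invented for illustration only.
versicolor = [3.0, 3.9, 4.0, 4.2, 4.4, 4.5, 4.6, 4.7, 4.9, 5.1]
virginica = [4.8, 5.1, 5.2, 5.5, 5.6, 5.8, 5.9, 6.0, 6.1, 6.7]

threshold = 5.0
# Share of each species that falls at or below the threshold.
versicolor_below = ecdf_at(versicolor, threshold)
virginica_below = ecdf_at(virginica, threshold)

print("versicolor at or below threshold:", versicolor_below)
print("virginica at or below threshold: ", virginica_below)
```

If everything at or below the threshold is labelled Versicolor, the second number is the share of Virginica flowers that would be mislabelled.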
Plotting CDF and PDF for Iris Setosa, Versicolor and Virginica flowers for comparative analysis of the petal length —
plt.figure(figsize=(15, 10))

# For each species: bin the petal lengths, turn the bin heights into
# probabilities that sum to 1, then accumulate them to obtain the CDF.
for name, subset in [('Setosa', iris_setosa),
                     ('Versicolor', iris_versicolor),
                     ('Virginica', iris_virginica)]:
    counts, bin_edges = np.histogram(subset['petal_length'],
                                     bins=10, density=True)
    pdf = counts / sum(counts)
    cdf = np.cumsum(pdf)
    plt.plot(bin_edges[1:], pdf, label=name + ' PDF')
    plt.plot(bin_edges[1:], cdf, label=name + ' CDF')

plt.legend()
plt.show()
From the CDF plots above, it can be seen that 100% of the Setosa flowers have a petal length of 1.9 or less. About 95% of the Versicolor flowers have a petal length less than 5, while only about 10% of the Virginica flowers do. We can incorporate these insights into the earlier pseudo-code to construct a simple uni-variate ‘classification model’.
If petal_length <= 1.9
then species = ‘Setosa’
(accuracy = 100%)
else if petal_length > 3.2 and petal_length < 5
then species = ‘Versicolor’
(accuracy = 95%)…
…else if petal_length >= 5
then species = ‘Virginica’
(accuracy = 90%)
Thus, by using the cumulative distribution plot, we get a clearer picture and a more robust understanding of the distribution, which leads to a simple uni-variate classification model.
This was all about univariate analysis. In my subsequent article, I perform a full-fledged exploratory data analysis (univariate, bi-variate and multivariate) on Haberman’s Survival data set.