Exploratory Data Analysis of IRIS Data Set Using Python

Venkata Sai Reddy Avuluri
6 min read · May 13, 2019

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Source: https://en.wikipedia.org/wiki/Iris_flower_data_set

Images for the three flowers.

1. Iris Setosa
2. Iris Versicolor
3. Iris Virginica

Each of these flowers has 4 features.

  1. Petal Length
  2. Petal Width
  3. Sepal Length
  4. Sepal Width

A flower is classified as one of these three species based on the four features given.

We use the data set to analyze the features of the flowers and to say which category a flower belongs to.

Data set: https://github.com/saireddyavs/applied-ai/blob/master/iris.xlsx

1. Understanding the data set

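import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# The file has no header row, so the column names are supplied here.
col=['sepal_length','sepal_width','petal_length','petal_width','type']
# Despite the .xlsx name, read_csv only works if the file holds comma-separated text;
# for a genuine Excel workbook, pd.read_excel would be the call to use instead.
iris=pd.read_csv("iris.xlsx",names=col)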

We have loaded the data set.

Now let’s look at its shape, size, and so on.

print("First five rows")
print(iris.head())
print("*********")
print("columns",iris.columns)
print("*********")
print("shape:",iris.shape)
print("*********")
print("Size:",iris.size)
print("*********")
print("no of samples available for each type")
print(iris["type"].value_counts())
print("*********")
print(iris.describe())
******************************************************

Output:

First five rows
sepal_length sepal_width ... petal_width type
0 5.1 3.5 ... 0.2 Iris-setosa
1 4.9 3.0 ... 0.2 Iris-setosa
2 4.7 3.2 ... 0.2 Iris-setosa
3 4.6 3.1 ... 0.2 Iris-setosa
4 5.0 3.6 ... 0.2 Iris-setosa
[5 rows x 5 columns]
*********
columns Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type'], dtype='object')
*********
shape: (150, 5)
*********
Size: 750
*********
no of samples available for each type
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50

As we can see above, every class has the same number of data points (50), so Iris is a balanced data set.
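The balance check can also be done programmatically instead of by eye. The following is a minimal sketch; the helper name `is_balanced` and the tiny `demo` frame are assumptions made for illustration, and with the real data you would pass `iris` instead.

```python
import pandas as pd

def is_balanced(df, label_col="type"):
    """Return True when every class in label_col has the same number of rows."""
    counts = df[label_col].value_counts()
    # A single distinct count means all classes are equally represented.
    return bool(counts.nunique() == 1)

# Tiny stand-in frame (labels only), not the real Iris data.
demo = pd.DataFrame({
    "type": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3
})
print(is_balanced(demo))  # True: each class appears 3 times
```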

2. Analysis of data set

2.a: Univariate analysis

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so in other words your data has only one variable. It doesn’t deal with causes or relationships (unlike regression) and its major purpose is to describe; it takes data, summarizes that data and finds patterns in the data.

Distribution plots

Distribution plots are used to visually assess how the data points are distributed with respect to their frequency.
* Usually the data points are grouped into bins, and the height of the bar representing each bin increases with the number of data points that lie within it. (histogram)
* The Probability Density Function (PDF) describes how likely values near x are; the probability that the variable falls in a range is the area under the curve over that range. (smoothed version of the histogram)
* Kernel Density Estimate (KDE) is a way to estimate the PDF. The area under the KDE curve is 1.
* Here the height of each bar reflects the proportion of data points in the corresponding bin, normalized so that the total area of the bars is 1.
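To make the "area is 1" statements concrete, here is a small NumPy-only sketch. The measurements are made up for illustration (they are not taken from the Iris file), and `gaussian_kde_sketch` with its hand-picked `bandwidth` is a hypothetical helper; seaborn chooses the bandwidth for you and is not implemented this way line for line.

```python
import numpy as np

# Made-up measurements for illustration only (not taken from the Iris file).
x = np.array([1.2, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9])

# Histogram as a density: bar height times bin width sums to 1.
heights, edges = np.histogram(x, bins=5, density=True)
hist_area = np.sum(heights * np.diff(edges))

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample; the result estimates the PDF."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Evaluate the estimate on a grid that extends well past the data on both sides.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 500)
density = gaussian_kde_sketch(x, grid, bandwidth=0.1)
kde_area = np.sum(density) * (grid[1] - grid[0])  # Riemann sum under the curve

print(hist_area, kde_area)  # both values should be close to 1
```

A smaller bandwidth makes the curve follow the individual points more closely, while a larger one smooths it out; the area stays at 1 either way.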

For the univariate analysis here, I am using PDF and CDF graphs, a box plot, and a violin plot.

So now let’s divide our data set into three parts:

iris_setosa=iris.loc[iris["type"]=="Iris-setosa"]
iris_virginica=iris.loc[iris["type"]=="Iris-virginica"]
iris_versicolor=iris.loc[iris["type"]=="Iris-versicolor"]

Now we have the details of each flower separately.

So let’s plot the respective histograms for each flower.

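# seaborn >= 0.11: FacetGrid takes height (formerly size) and histplot replaces distplot.
# kde=True overlays the density estimate; stat="density" scales the bars to total area 1.
sns.FacetGrid(iris,hue="type",height=3).map(sns.histplot,"petal_length",kde=True,stat="density").add_legend()
sns.FacetGrid(iris,hue="type",height=3).map(sns.histplot,"petal_width",kde=True,stat="density").add_legend()
sns.FacetGrid(iris,hue="type",height=3).map(sns.histplot,"sepal_length",kde=True,stat="density").add_legend()
sns.FacetGrid(iris,hue="type",height=3).map(sns.histplot,"sepal_width",kde=True,stat="density").add_legend()
plt.show()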

Output of the histograms

Observations:

  • Using petal length, we can separate Iris setosa from the other two species.
  • Sepal length and sepal width are of little use on their own: the distributions overlap heavily, so we can’t separate the flowers.
  • In petal width, Iris setosa is not well distributed.
  • So we use petal length as the feature to separate at least Iris setosa.

Next, let’s plot the graph by finding the PDF and CDF.

Each feature has its own PDF and CDF; the code below uses petal length.

#for iris_setosa
counts,bin_edges=np.histogram(iris_setosa["petal_length"],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)

cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.show()

#for iris_virginica
counts,bin_edges=np.histogram(iris_virginica["petal_length"],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)

cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.show()

#for iris_versicolor
counts,bin_edges=np.histogram(iris_versicolor["petal_length"],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)

cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.show()

Output graphs of pdf and cdf:

iris_setosa
iris_virginica
iris_versicolor
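The three blocks above repeat the same steps, so they could be folded into one helper. This is a sketch under assumptions: the function name `binned_pdf_cdf` is hypothetical, and the sample values are made up rather than read from the Iris file. For equal-width bins, dividing the raw counts by their sum gives the same per-bin fractions as the `density=True` version used above.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf), where pdf holds the fraction of points per bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fractions that sum to 1
    cdf = np.cumsum(pdf)          # running total, ends at 1
    return bin_edges, pdf, cdf

# Made-up values for illustration only.
sample = np.array([1.0, 1.2, 1.3, 1.3, 1.4, 1.5, 1.5, 1.6, 1.7, 2.0])
edges, pdf, cdf = binned_pdf_cdf(sample, bins=5)
print(pdf)
print(cdf)
```

With the real frames, a call such as `binned_pdf_cdf(iris_setosa["petal_length"])` would stand in for each repeated block, and `plt.plot(edges[1:], cdf)` draws the curve as before.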

BoxPlot :

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Reference figure: how to read a box plot

Now let’s plot a box plot for our iris data set.

sns.boxplot(x="type",y="petal_length",data=iris)
plt.show()

output graph:

boxplot for iris data set taking petal_length as a feature

It shows which range of values each percentile band falls in.

You can also compute these values with a program.

Program to compute percentiles: https://github.com/saireddyavs/applied-ai/blob/master/percentiles%20quantiles.py
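As a rough illustration of the numbers a box plot is built from, the five-number summary can be computed with `np.percentile`. This is not the linked program, only a minimal sketch; the helper name `five_number_summary` and the sample values are assumptions for illustration.

```python
import numpy as np

def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum of a 1-D collection of numbers."""
    q = np.percentile(values, [0, 25, 50, 75, 100])
    return dict(zip(["min", "q1", "median", "q3", "max"], q.tolist()))

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
summary = five_number_summary(sample)
iqr = summary["q3"] - summary["q1"]   # height of the box in a box plot

print(summary)
print(iqr)  # 4.0
```

With the real data, `five_number_summary(iris_setosa["petal_length"])` would give the values for one species.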

violin plot:

Violin Plot is a method to visualize the distribution of numerical data of different variables. It is similar to Box Plot but with a rotated plot on each side, giving more information about the density estimate on the y-axis.
The density is mirrored and flipped over and the resulting shape is filled in, creating an image resembling a violin. The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot. On the other hand, the boxplot more clearly shows the outliers in the data.

Although violin plots hold more information than box plots, they are less popular. Because of this, their meaning can be harder to grasp for readers who are not familiar with the violin plot representation.

Now let’s plot a violin plot for our iris data set.

sns.violinplot(x="type",y="petal_length",data=iris)
plt.show()

output graph:

violin plot for iris data set on petal length

Violin plot = box plot + histogram

The inner bar is the box plot, and the white dot indicates the 50th percentile (the median).

The outer curves are nothing but the distribution curves.

2.b Bivariate analysis

Scatter plot

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis.

We can plot the scatter plot between any two features.

I am taking petal length and petal width as an example.

Scatter plot between petal length and petal width
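The article does not show the code for this scatter plot, so here is one way it could be drawn with `sns.scatterplot`. Treat it as a sketch: the helper name `plot_petal_scatter` and the `demo` frame with made-up values are assumptions, and the original figure may have been produced differently.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_petal_scatter(df):
    """Draw petal_length against petal_width, one color per value of the type column."""
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.scatterplot(data=df, x="petal_length", y="petal_width", hue="type", ax=ax)
    return ax

# Stand-in frame with made-up values; pass the real iris frame instead.
demo = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
    "petal_width": [0.2, 0.3, 1.4, 1.5, 2.1, 2.3],
    "type": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})
ax = plot_petal_scatter(demo)
```

Calling `plt.show()` afterwards displays the figure, as in the other examples.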

pair plot

Pair Plots are a really simple (one-line-of-code simple!) way to visualize relationships between each variable. It produces a matrix of relationships between each variable in your data for an instant examination of our data.

A pair plot gives a scatter plot for every pair of features.

Pair plot for the iris data set:

sns.set_style("whitegrid")
# height replaces the older size argument in current seaborn releases.
sns.pairplot(iris,hue="type",height=3)
plt.show()

output graph:

pair plot for iris data set

From the graph we can see the scatter plot between any two features, along with the distribution of each feature.

From the distributions above, petal length separates Iris setosa from the remaining species.

From the plot of petal length against petal width, we can separate the flowers easily.

Example (assumed observations from the graph):

if 0≤petal_length≤2 and 0≤petal_width≤0.7 then setosa

if 2≤petal_length≤5.2 and 1≤petal_width≤1.7 then versicolor

else virginica
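The rules above translate directly into a small function. This is only a sketch of those eyeballed thresholds, not a trained model. It reads the second condition of the versicolor rule as a bound on petal width, and the function name `classify_by_petal` is an assumption for illustration.

```python
def classify_by_petal(petal_length, petal_width):
    """Apply the eyeballed thresholds in order; anything unmatched is virginica."""
    if 0 <= petal_length <= 2 and 0 <= petal_width <= 0.7:
        return "Iris-setosa"
    if 2 <= petal_length <= 5.2 and 1 <= petal_width <= 1.7:
        return "Iris-versicolor"
    return "Iris-virginica"

# Made-up measurements, one per branch of the rule.
print(classify_by_petal(1.4, 0.2))  # Iris-setosa
print(classify_by_petal(4.5, 1.4))  # Iris-versicolor
print(classify_by_petal(6.0, 2.3))  # Iris-virginica
```

Because the final branch is a catch-all, any flower that misses the first two rules is labeled virginica, including points near the versicolor boundary, so some misclassification there is to be expected.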

Complete code for all of the above: https://github.com/saireddyavs/applied-ai

Venkata Sai Reddy Avuluri

student at Rajiv Gandhi University of Knowledge Technologies in Basar, Telangana, India.