Exploratory data analysis (EDA) on Iris Dataset using Python

Sep 7, 2019

By definition, exploratory data analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual techniques.

It is always a good idea to explore and compare a data set with multiple exploratory techniques. After exploratory data analysis you gain enough confidence in your data to be ready to apply a machine learning algorithm. Another benefit of EDA is that it helps you select the feature variables that will be used later for machine learning.
In this blog, we take the Iris dataset and perform various EDA techniques using Python, which are listed below:

Introduction to IRIS dataset
1D, 2D and 3D scatter plots
Pair plots
Histogram
Introduction to PDF (Probability Density Function)
Introduction to CDF (Cumulative Distribution Function)
Mean, Variance and Standard Deviation
Median and Quantiles
Box-plot and whisker
Violin Plots

You can easily download the dataset from the link below.

https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

Understanding the dataset

Here we can see that the 4 given features, i.e. sepal length, sepal width, petal length and petal width, determine whether a flower is Setosa, Versicolor or Virginica.

Sepal length, sepal width, petal length and petal width are called features/variables/input variables/independent variables.
Species is called the label/dependent variable/output variable/class/class label/response label.

Importing libraries and loading the file

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
iris = pd.read_csv("iris.csv")

Understanding Data

print(iris.shape) # prints no. of rows and columns
>(150,5)

print(iris.columns) # prints names of columns
>Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species'],dtype='object')

iris["species"].value_counts()
>setosa        50
 virginica     50
 versicolor    50
 Name: species, dtype: int64

As you can see after executing iris["species"].value_counts(), the data points are distributed equally among setosa, virginica and versicolor, so the Iris dataset is a balanced dataset (the number of data points for every class is 50).
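
The same balance check can be written without pandas. This is only a minimal sketch: the `labels` list is a made-up stand-in for the species column, and `is_balanced` is a hypothetical helper name that does not come from the original code.

```python
from collections import Counter

# Hypothetical stand-in for iris["species"]: 50 labels per class.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

def is_balanced(labels):
    """Return True when every class has the same number of data points."""
    counts = Counter(labels)
    return len(set(counts.values())) == 1

class_counts = Counter(labels)
print(class_counts)
print(is_balanced(labels))
```

If one class had noticeably fewer points than the others, the dataset would be imbalanced and the check would return False.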

1D Scatter plot

iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Plot petal length on the x-axis and put every point at y = 0
plt.plot(iris_setosa["petal_length"],np.zeros_like(iris_setosa["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"],np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"],np.zeros_like(iris_virginica["petal_length"]), 'o')
plt.grid()
plt.show()

Observation(s) | Conclusion

Green points are Virginica, orange points are Versicolor and blue points are Setosa.
Virginica and Versicolor overlap.
1D scatter plots are very hard to read and understand, because points with similar values are drawn on top of each other.

2D scatter plot

iris.plot(kind="scatter",x="sepal_length",y="sepal_width")
plt.show()

In the above figure, we plot sepal length on the x-axis and sepal width on the y-axis, and scatter all the points we have onto the plot. It is called a 2D plot because we are using 2 features, one on the x-axis and one on the y-axis.

In the above figure, we aren’t able to tell which flowers are Setosa, Versicolor or Virginica because all points have the same colour, so we cannot make much sense out of it.

So let’s try to plot a 2D scatter plot with a colour for each flower.

sns.set_style("whitegrid")
# height replaces the older size argument in seaborn 0.9 and later
sns.FacetGrid(iris,hue="species",height=4) \
    .map(plt.scatter,"sepal_length","sepal_width") \
    .add_legend()
plt.show()

Observation(s) | Conclusion

Blue points can be easily separated from red and green by drawing a line.
But red and green data points cannot be easily separated.
Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
Separating Versicolor from Virginica is much harder as they have considerable overlap.

3D Scatter Plot

import plotly.express as px
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width',color='species')
fig.show()

Here we use the plotly library for plotting. As you can see, we have used sepal length on the x-axis, sepal width on the y-axis and petal width on the z-axis.
A 3D plot can be used for three variables or dimensions. However, what should we do if we have more than 3 dimensions or features in our dataset, since we humans do not have the capability to visualize more than 3 dimensions?
One solution to this problem is pair plots.

Pair plots

A pairs plot allows us to see both distribution of single variables and relationships between two variables.
For example, let’s say we have four features ‘sepal length’, ‘sepal width’, ‘petal length’ and ‘petal width’ in our iris dataset. In that case, we will have 4C2 plots i.e. 6 unique plots. The pairs in this case will be :

sepal length, sepal width
sepal length, petal length
sepal length, petal width
sepal width, petal length
sepal width, petal width
petal length, petal width

So, instead of trying to visualize four dimensions, which is not possible, we will look at 6 2D plots arranged in the form of a matrix and try to understand the 4-dimensional data from them.
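
The six pairs listed above can also be generated programmatically. The following sketch assumes the column names used in the CSV file and relies only on the standard library.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Every unordered pair of distinct features: C(4, 2) = 6 pairs.
pairs = list(combinations(features, 2))
for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
```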

sns.set_style("whitegrid")
# height replaces the older size argument in seaborn 0.9 and later
sns.pairplot(iris,hue="species",height=3)
plt.show()

As Seen Above, The Pair Plots Can Be Divided Into Three Parts:

The diagonal plots, which show the distribution of a single variable as a histogram or density curve (an estimate of its PDF).
The upper triangle and the lower triangle, which show scatter plots of the relationship between each pair of features.
The upper and lower triangles are mirror images of each other.

A pair plot only plots numerical variables. Variables of string type are not plotted by default; if you want to plot one, you need to encode it as numerical first. A string column can still be passed as hue, as we do with species: Seaborn then assigns a colour and a legend label to each unique value.

Limitation of Pair plot:-

If you have d features, you will have a pair plot of d x d cells, where each cell is a plot between a pair of features. So pair plots are hard to use when we have high-dimensional data. For high-dimensional data we can use dimensionality reduction techniques such as PCA or t-SNE.
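
To get a feel for how quickly the grid grows, here is a small sketch that counts the cells and the unique feature pairs for a few values of d. The helper name `pair_plot_cells` is made up for this illustration.

```python
from math import comb

def pair_plot_cells(d):
    """Return (total grid cells, unique feature pairs) for d features."""
    return d * d, comb(d, 2)

for d in (4, 10, 100):
    cells, unique_pairs = pair_plot_cells(d)
    print(d, cells, unique_pairs)
```

With 100 features there would already be thousands of scatter plots to inspect, which is why pair plots are mostly practical for a handful of features.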

Observation(s) | Conclusion

petal length and petal width are the most useful features to identify various flower types.
While Setosa can be easily identified (linearly separable), virginica and Versicolor have some overlap (almost linearly separable).
We can find “lines” and “if-else” conditions to build a simple model to classify the flower types.
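
As a rough illustration of such “if-else” conditions, here is a minimal sketch of a rule-based classifier. The two thresholds (2.5 for petal length and 1.7 for petal width) are assumptions picked by eye for this example; they are not fitted values and they will misclassify some flowers in the region where Versicolor and Virginica overlap. The three test flowers are made-up measurements.

```python
def classify(petal_length, petal_width):
    """Classify a flower with two hand-picked thresholds (assumed values)."""
    if petal_length < 2.5:       # assumed cut-off: setosa petals are short
        return "setosa"
    elif petal_width < 1.7:      # assumed cut-off: rough versicolor/virginica boundary
        return "versicolor"
    else:
        return "virginica"

# Made-up measurements (petal length, petal width) in cm.
print(classify(1.4, 0.2))
print(classify(4.5, 1.5))
print(classify(6.0, 2.5))
```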

Histogram and Introduction of PDF

From Wikipedia

A histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable).To construct a histogram, the first step is to “bin” the range of values — that is, divide the entire range of values into a series of intervals — and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
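
The binning step described above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions: equal-width bins, a sample in which not all values are identical, and made-up petal lengths given in millimetres so that the bin edges are exact.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right so the maximum is counted.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts, edges

sample = [10, 12, 13, 14, 14, 15, 15, 15, 16, 19]  # made-up petal lengths in mm
counts, edges = histogram(sample, 3)
print(counts)
print(edges)
```

Every value lands in exactly one bin, so the counts always add up to the number of data points.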

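# distplot is deprecated in newer seaborn releases; it draws a histogram with a smoothed density curve
sns.FacetGrid(iris,hue="species",height=5) \
    .map(sns.distplot,"petal_length") \
    .add_legend()

plt.show()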

Here in the figure, the x-axis is the petal length and the y-axis shows how many points fall in the given range (distplot normalises these counts so that the bars form a density). Using this plot we can observe how many points there are in particular regions. A histogram basically represents how many points exist for each range of values on the x-axis.

The PDF is a smoothed version of the histogram.
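
One common way to obtain such a smooth curve is kernel density estimation (KDE): every data point contributes a small bell curve, and the curves are averaged. The sketch below uses a Gaussian kernel with a fixed bandwidth. Both the sample and the bandwidth of 0.1 are assumptions made for illustration; Seaborn chooses its own bandwidth, so its curve will not match this one exactly.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at x (Gaussian kernel)."""
    n = len(sample)

    def density(x):
        total = 0.0
        for xi in sample:
            z = (x - xi) / bandwidth
            total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return density

sample = [1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7]   # made-up petal lengths
estimate = gaussian_kde(sample, bandwidth=0.1)       # bandwidth is an assumed value

# A density integrates to 1; check with a simple Riemann sum over [0, 3].
step = 0.001
area = sum(estimate(i * step) * step for i in range(3000))
print(round(area, 3))
```

A smaller bandwidth follows the histogram more closely, while a larger one gives a smoother but less detailed curve.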

Univariate Analysis using PDF

Let’s do some univariate analysis using the PDF to find which of the 4 variables, i.e. sepal length, sepal width, petal length and petal width, is most useful to distinguish the flowers.

sns.FacetGrid(iris,hue="species",height=5) \
    .map(sns.distplot,"petal_width") \
    .add_legend()

plt.show()

Observation(s) | Conclusion

Comparing petal length with petal width, Setosa is more clearly separated by petal length than by petal width.
There is overlap between Versicolor and Virginica.

sns.FacetGrid(iris,hue="species",height=5) \
    .map(sns.distplot,"sepal_width") \
    .add_legend()

plt.show()

Observation(s) | Conclusion

And as we can see virginica and versicolor are fully overlapped.

sns.FacetGrid(iris,hue="species",height=5) \
    .map(sns.distplot,"sepal_length") \
    .add_legend()

plt.show()

Observation(s) | Conclusion

Here we can’t separate any class because all of them overlap.
From the above observations we can say that sepal length is worse than petal length and petal width for telling the species apart.

CDF(Cumulative distribution function)

Let’s Plot PDF and CDF using petal length

iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Histogram of setosa petal length with 10 bins
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)
# Normalise so that the values sum to 1 (fraction of points per bin)
pdf = counts/(sum(counts))
print(pdf)
>>>[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]
print(bin_edges)
>>>[1.   1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
# The running total of the PDF gives the CDF
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.show()

About 28% of setosa flowers have a petal length between 1.45 and 1.54 (the sixth bin); this per-bin fraction is what the PDF shows.
About 88% of setosa flowers have a petal length less than or equal to 1.6 (PL ≤ 1.6); this is the CDF value at the bin edge 1.63.
About 22% of setosa flowers have a petal length less than or equal to 1.3.

As you can see in the figure, the orange line represents the CDF and the blue line represents the PDF. The x-axis represents petal length and the y-axis represents the fraction of flowers.

How to get CDF coordinate?

Let’s suppose we have 50 setosa flowers. How many of them have a petal length of 1.6 or less? You have to count them. Suppose 44 setosa flowers have a petal length less than or equal to 1.6.

44 / 50 = 0.88 = 88% (y-axis)
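
The counting procedure can be written as a tiny function. This is a sketch on a made-up sample of 10 petal lengths (not the real setosa data), and `ecdf` is a hypothetical helper name.

```python
def ecdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Made-up sample of 10 petal lengths, not the real setosa data.
sample = [1.2, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9]
print(ecdf(sample, 1.6))
```

Evaluating this function at every x on the petal length axis traces out the CDF curve.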

Differentiating the CDF gives the PDF.
Integrating the PDF gives the CDF.
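
For binned data, integration becomes a running sum and differentiation becomes a difference between neighbouring values. The sketch below checks this on the per-bin fractions printed earlier for setosa petal length; rounding to two decimals is only there to hide floating-point noise.

```python
from itertools import accumulate

# Per-bin fractions printed earlier for setosa petal length.
pdf = [0.02, 0.02, 0.04, 0.14, 0.24, 0.28, 0.14, 0.08, 0.0, 0.04]

# "Integration": the running total of the PDF gives the CDF.
cdf = [round(c, 2) for c in accumulate(pdf)]

# "Differentiation": differences of consecutive CDF values give back the PDF.
recovered = [round(b - a, 2) for a, b in zip([0.0] + cdf[:-1], cdf)]

print(cdf)
print(recovered == pdf)
```

The last CDF value is 1.0 because all flowers have a petal length at or below the largest bin edge.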

Mean, Variance and Standard Deviation

Mean is average of a given set of data. Let us consider below example.

Ex - 2,4,4,4,5,5,7,9

These eight data points have the mean (average) of 5:
(2+4+4+4+5+5+7+9) / 8 = 5

Variance is the average of the squared differences between each number and the mean. For the above example, first calculate the deviation of each data point from the mean and square the result:

(2-5)^2 = 9                      (5-5)^2 = 0
(4-5)^2 = 1                      (5-5)^2 = 0
(4-5)^2 = 1                      (7-5)^2 = 4
(4-5)^2 = 1                      (9-5)^2 = 16

variance = (9+1+1+1+0+0+4+16)/8 = 4

Standard Deviation is square root of variance. It is a measure of the extent to which data varies from the mean.

Standard deviation -> Square root of 4 = 2
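
The worked example can be checked with Python’s standard `statistics` module. The sketch uses the population formulas (divide by the number of points), which is what the hand calculation above does and also what `np.std` does by default.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
# Population variance: average of squared deviations from the mean.
variance = statistics.pvariance(data)
# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(data)

print(mean, variance, std_dev)
```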

Let’s compute the mean, variance and standard deviation in code

print("Means:")
print(np.mean(iris_setosa["petal_length"]))
# Mean with an outlier (50) appended
print(np.mean(np.append(iris_setosa["petal_length"],50)))
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))

Output:

Means:
1.464
2.4156862745098038
5.5520000000000005
4.26

Std-dev:
0.17176728442867112
0.546347874526844
0.4651881339845203

Observation(s) | Conclusion

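Now we can say that Setosa has the shortest petals on average.
Virginica and Versicolor have mean petal lengths that are closer to each other.
Appending a single outlier (50) moves the Setosa mean from 1.464 to about 2.42, so the mean is sensitive to outliers.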

Median and Quantiles

Median:-

The Median is the “middle” of a sorted list of numbers.

How to Find the Median Value?

Find the median of 12,3,6
1. Put them in order - 3,6,12
2. The middle is 6, so the median is 6

Find the median of 12,6,8,4
1. Put them in order - 4,6,8,12
2. Here the middle is 6 and 8, so we take (6+8)/2 = 7. So the median is 7
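
Both examples can be verified with the standard `statistics` module. The last lines of this sketch also add a made-up extreme value (1000) to show that the median barely moves while the mean changes a lot.

```python
import statistics

print(statistics.median([12, 3, 6]))
print(statistics.median([12, 6, 8, 4]))

# One extreme value moves the mean a lot but barely moves the median.
values = [3, 6, 12]
with_outlier = values + [1000]  # 1000 is a made-up outlier
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```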

Quantiles:-

Any set of data, arranged in ascending or descending order, can be divided into parts, also known as partitions or subsets, by quantiles. Quantile is a generic term for the values that divide the sorted set into n partitions of equal size, so that each part represents 1/n of the set. Quartiles are the special case n = 4.

x={5,6,9,11,13,20,26}

first quartile, or Q1 = 6
second quartile, or Q2 = 11
third quartile, or Q3 = 20
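
These quartiles can be reproduced with `statistics.quantiles` (Python 3.8 or later). Note that this is only one convention: the function’s default method happens to agree with the values above, while other conventions, such as NumPy’s default linear interpolation, can give different quartiles for a small sample like this one.

```python
import statistics

x = [5, 6, 9, 11, 13, 20, 26]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(x, n=4)
print(q1, q2, q3)
```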

Box-plot and whisker

A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.
Box plot with whiskers: another, more intuitive way of visualising the 1D scatter plot.

sns.boxplot(x="species",y="petal_length", data=iris)
plt.show()
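
The numbers behind a box plot can be computed by hand. The sketch below returns the five-number summary and flags points that lie more than 1.5 times the interquartile range (IQR) away from the box, which is the usual whisker rule. The sample is made up, the helper names are hypothetical, and Seaborn computes its quartiles with a different interpolation method, so the exact box edges it draws may differ slightly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

def outliers(values, whis=1.5):
    """Points farther than whis * IQR from the box (the usual whisker rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < low or v > high]

sample = [5, 6, 9, 11, 13, 20, 26, 60]  # made-up data with one extreme value
print(five_number_summary(sample))
print(outliers(sample))
```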

Violin plots

From wikipedia

A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.

sns.violinplot(x="species",y="petal_length", data=iris)
plt.show()

Conclusion:

After doing EDA we understand the data and its important features much better, so we are ready to apply a machine learning model to it. You can get all of the above code from

rishavhack/Machine-Learning-Using-Python

github.com

Thanks for reading

Written by Rishav Srivastava