Exploratory data analysis (EDA) on Iris Dataset using Python
By definition, exploratory data analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual techniques.
It is always a good idea to explore a data set with multiple exploratory techniques and compare the results. After EDA you gain enough confidence in your data to engage a machine learning algorithm. Another benefit of EDA is that it helps you select the feature variables that will be used later for machine learning.
In this blog, we take the Iris dataset and perform the following EDA techniques using Python:
- Introduction to IRIS dataset
- 1D, 2D and 3D scatter plots
- Pair plots
- Histogram
- Introduction of PDF (Probability Density Function)
- Introduction of CDF (Cumulative Distribution Function)
- Mean, Variance and Standard Deviation
- Median and Quantiles
- Box-plot and whisker
- Violin Plots
You can download the dataset as a CSV file from the link below.
https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Understanding the dataset
The dataset has 4 features, i.e. sepal length, sepal width, petal length and petal width, which are used to determine whether a flower is Setosa, Versicolor or Virginica.
- Sepal length, sepal width, petal length and petal width are called features/variables/input variables/independent variables
- Species is called the label/dependent variable/output variable/class/class label/response label
Importing libraries and loading the file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
iris = pd.read_csv("iris.csv")
Understanding Data
print(iris.shape)  # prints no. of rows and columns
>(150, 5)
print(iris.columns)  # prints names of columns
>Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
iris["species"].value_counts()
>setosa 50
virginica 50
versicolor 50
Name: species, dtype: int64
As you can see from the output of iris["species"].value_counts(), the data points are equally distributed among setosa, virginica and versicolor, so the iris dataset is a balanced dataset (every class has 50 data points).
1D Scatter plot
# Split the data by species
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Plot petal length on the x-axis and put every point at y = 0
plt.plot(iris_setosa["petal_length"], np.zeros_like(iris_setosa["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"], np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"], np.zeros_like(iris_virginica["petal_length"]), 'o')
plt.grid()
plt.show()
Observation(s) | Conclusion
- Green points are Virginica, orange points are Versicolor and blue points are Setosa.
- Virginica and Versicolor overlap.
- 1D scatter plots are very hard to read because many points fall on top of each other.
2D scatter plot
iris.plot(kind="scatter",x="sepal_length",y="sepal_width")
plt.show()
In the above figure, we plot sepal length on the x-axis and sepal width on the y-axis and scatter all the points we have. It is called a 2D plot because it uses 2 features, one on each axis.
In this figure we aren't able to tell which flower is setosa, versicolor or virginica because all points have the same colour, so we cannot make much sense of it.
So let's plot a 2D scatter plot with a separate colour for each species.
sns.set_style("whitegrid")
# `height` replaces the older `size` argument (seaborn 0.9 and later)
sns.FacetGrid(iris, hue="species", height=4) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()
plt.show()
Observation(s) | Conclusion
- Blue points can be easily separated from red and green by drawing a line.
- But red and green data points cannot be easily separated.
- Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
- Separating Versicolor from Virginica is much harder as they have considerable overlap.
3D Scatter Plot
import plotly.express as px
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width',color='species')
fig.show()
Here we use the plotly library for plotting. As you can see, we have sepal length on the x-axis, sepal width on the y-axis and petal width on the z-axis.
A 3D plot can be used for three variables or dimensions. However, what would we do if we had more than 3 dimensions or features in our dataset, since we humans do not have the capability to visualize more than 3 dimensions?
One solution to this problem is pair plots.
Pair plots
A pair plot allows us to see both the distribution of single variables and the relationships between two variables.
For example, let's say we have four features, 'sepal length', 'sepal width', 'petal length' and 'petal width', in our iris dataset. In that case, we will have 4C2 = 6 unique plots. The pairs in this case will be:
- sepal length, sepal width
- sepal length, petal length
- sepal length, petal width
- sepal width, petal length
- sepal width, petal width
- petal length, petal width
So, instead of trying to visualize four dimensions, which is not possible, we look at 6 2D plots arranged in the form of a matrix and try to understand the 4-dimensional data.
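As a quick check of the count above, the following minimal sketch enumerates the unique feature pairs with itertools.combinations. The column names are assumed to match the header of the CSV file used in this post.

```python
from itertools import combinations
from math import comb

# Column names assumed to match the iris.csv header
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Every unordered pair of distinct features corresponds to one unique scatter plot
pairs = list(combinations(features, 2))
for x, y in pairs:
    print(x, "vs", y)

# Both expressions count the 4C2 unique pairs
print(len(pairs), comb(len(features), 2))
```

With d features the number of unique pairs is d(d-1)/2, which grows quickly as d increases.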
sns.set_style("whitegrid")
# `height` replaces the older `size` argument (seaborn 0.9 and later)
sns.pairplot(iris, hue="species", height=3)
plt.show()
As seen above, the pair plot can be divided into three parts:
- The diagonal plots, which showcase the histogram (or a smoothed density curve, depending on the seaborn version). They allow us to see the PDF/probability distribution of a single variable.
- The upper triangle and lower triangle, which show the scatter plots.
- The scatter plots show us the relationship between pairs of features. The upper and lower triangles are mirror images of each other.
A pair plot only plots numerical variables. Columns of string type, such as species, are not plotted as axes by default; if you want to plot them, you need to encode them as numbers first. Here the species column is passed as hue, so seaborn uses it only to colour the points and build the legend.
Limitation of pair plots:
If you have d features, the pair plot has d x d cells, where each cell is a plot between a pair of features. So pair plots are hard to use when we have high-dimensional data. For high-dimensional data we can use dimensionality reduction techniques such as PCA or t-SNE.
Observation(s) | Conclusion
- Petal length and petal width are the most useful features to identify the various flower types.
- While Setosa can be easily identified (linearly separable), Virginica and Versicolor have some overlap (almost linearly separable).
- We can find “lines” and “if-else” conditions to build a simple model to classify the flower types.
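To make the last point concrete, here is a minimal sketch of such an if-else model. The function name classify_flower and both thresholds (2.5 and 1.7) are assumptions chosen by eye for illustration, not values derived in this post; read your own cut-offs from the plots.

```python
def classify_flower(petal_length, petal_width):
    """Toy rule-based classifier; the thresholds are illustrative assumptions."""
    # Setosa has clearly shorter petals than the other two species
    if petal_length < 2.5:
        return "setosa"
    # Versicolor and virginica overlap, so this second rule will misclassify some flowers
    if petal_width < 1.7:
        return "versicolor"
    return "virginica"


print(classify_flower(1.4, 0.2))
print(classify_flower(4.5, 1.4))
print(classify_flower(6.0, 2.3))
```

Because of the overlap between Versicolor and Virginica, a rule like this can never be perfect; it only shows how far two simple cut-offs can go.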
Histogram and Introduction of PDF
From Wikipedia
A histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable).To construct a histogram, the first step is to “bin” the range of values — that is, divide the entire range of values into a series of intervals — and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
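# histplot with kde=True is used because distplot is deprecated in recent seaborn releases;
# `height` replaces the older `size` argument
sns.FacetGrid(iris, hue="species", height=5) \
    .map(sns.histplot, "petal_length", kde=True) \
    .add_legend()
plt.show()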
Here in the figure, the x-axis is the petal length and the y-axis is the count of points that fall in each range (bin). Using this plot we can observe how many points there are in particular regions. A histogram basically represents how many points exist for each interval of values on the x-axis.
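The binning step itself can be sketched with np.histogram. The values below are hypothetical measurements made up for illustration, not the real iris data, and the choice of 4 bins is arbitrary.

```python
import numpy as np

# Hypothetical measurements (not the real iris values)
values = np.array([1.0, 1.2, 1.4, 1.5, 1.6, 1.9, 2.0, 2.8, 3.0])

# Split the range [1.0, 3.0] into 4 equal-width bins and count the points in each
counts, bin_edges = np.histogram(values, bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} points")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.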
The PDF can be seen as a smoothed version of the histogram.
Univariate Analysis using PDF
Let's do some univariate analysis using the PDF to find which of the 4 variables, i.e. sepal length, sepal width, petal length and petal width, is most useful to distinguish the flowers.
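# histplot with kde=True is used because distplot is deprecated in recent seaborn releases
sns.FacetGrid(iris, hue="species", height=5) \
    .map(sns.histplot, "petal_width", kde=True) \
    .add_legend()
plt.show()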
Observation(s) | Conclusion
- Comparing petal length and petal width, setosa is better separated using petal length than using petal width.
- There is overlap between versicolor and virginica.
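# histplot with kde=True is used because distplot is deprecated in recent seaborn releases
sns.FacetGrid(iris, hue="species", height=5) \
    .map(sns.histplot, "sepal_width", kde=True) \
    .add_legend()
plt.show()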
Observation(s) | Conclusion
- As we can see, virginica and versicolor are fully overlapped.
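# histplot with kde=True is used because distplot is deprecated in recent seaborn releases
sns.FacetGrid(iris, hue="species", height=5) \
    .map(sns.histplot, "sepal_length", kde=True) \
    .add_legend()
plt.show()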
Observation(s) | Conclusion
- Here we can't separate any class because all of them overlap.
- From the above observations we can say that sepal length is worse than petal length and petal width.
CDF (Cumulative Distribution Function)
Let's plot the PDF and CDF using petal length.
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Histogram of setosa petal lengths in 10 equal-width bins
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)
# Normalise so the values sum to 1: the fraction of points in each bin
pdf = counts / sum(counts)
print(pdf)
>>>[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]
print(bin_edges)
>>>[1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
# The running sum of the bin fractions gives the CDF
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.show()
- About 25% of the points have a petal length between 1.5 and 1.6; this is what the PDF tells us.
- About 82% of setosa flowers have a petal length less than or equal to 1.6 (PL ≤ 1.6); this is what the CDF tells us.
- About 20% of setosa flowers have a petal length less than 1.3.
As you can see in the figure, the orange line represents the CDF and the blue line represents the PDF. The x-axis represents petal length and the y-axis represents the fraction of points.
How to get a CDF coordinate?
Suppose we have 50 setosa flowers and want the CDF value at a petal length of 1.6. We count how many flowers have a petal length less than or equal to 1.6. Suppose 41 setosa flowers do.
41 / 50 = 0.82 = 82% (y-axis)
Differentiating the CDF gives the PDF.
Integrating the PDF gives the CDF.
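The counting recipe can be written as a small sketch. The petal lengths below are hypothetical values made up for illustration (not the real iris data), and empirical_cdf is a helper name introduced only for this example.

```python
import numpy as np

# Hypothetical petal lengths (not the real iris values)
lengths = np.array([1.2, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9])


def empirical_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return np.mean(values <= x)


# 8 of the 10 values are <= 1.6
print(empirical_cdf(lengths, 1.6))

# Same idea from a histogram: bin fractions (a discrete PDF) and their running sum (the CDF)
counts, bin_edges = np.histogram(lengths, bins=5)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
```

The CDF never decreases and ends at 1, because every point is less than or equal to the largest value.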
Mean, Variance and Standard Deviation
Mean is the average of a given set of data. Let us consider the example below.
Ex - 2, 4, 4, 4, 5, 5, 7, 9
These eight data points have a mean (average) of 5:
(2+4+4+4+5+5+7+9) / 8 = 5
Variance is the average of the squared differences between each number and the mean. For the above example, first calculate the deviation of each data point from the mean and square the result:
(2-5)^2 = 9 (5-5)^2 = 0
(4-5)^2 = 1 (5-5)^2 = 0
(4-5)^2 = 1 (7-5)^2 = 4
(4-5)^2 = 1 (9-5)^2 = 16
variance = (9+1+1+1+0+0+4+16)/8 = 4
Standard deviation is the square root of the variance. It is a measure of the extent to which the data varies from the mean.
Standard deviation -> Square root of 4 = 2
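The worked example can be checked with NumPy. This is a minimal sketch; note that np.var and np.std divide by n by default (ddof=0), which is the convention used in the hand calculation.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(data)
# Default ddof=0 divides by n, matching the hand calculation
variance = np.var(data)
std_dev = np.std(data)
print(mean, variance, std_dev)

# The same variance written out step by step
squared_deviations = (data - mean) ** 2
manual_variance = squared_deviations.sum() / len(data)
print(manual_variance)
```

Passing ddof=1 would divide by n - 1 instead, which gives the sample variance and a slightly larger result.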
Let's start coding the mean and std-dev
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
# Mean after adding a single outlier (50) to the setosa petal lengths
print(np.mean(np.append(iris_setosa["petal_length"], 50)))
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))
Output:
Means:
1.464
2.4156862745098038
5.5520000000000005
4.26
Std-dev:
0.17176728442867112
0.546347874526844
0.4651881339845203
Observation(s) | Conclusion
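- Setosa has the shortest petal length on average.
- Virginica and Versicolor have petal lengths that are closer to each other.
- A single outlier (50) shifts the setosa mean from 1.464 to about 2.416, so the mean is sensitive to outliers.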
Median and Quantiles
Median:-
The median is the "middle" of a sorted list of numbers.
How to find the median value?
Find the median of 12, 3, 6:
1. Put them in order - 3, 6, 12
2. The middle is 6, so the median is 6
Find the median of 12, 6, 8, 4:
1. Put them in order - 4, 6, 8, 12
2. Here the middle values are 6 and 8, so the median is (6+8)/2 = 7
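Both cases can be checked with Python's standard statistics module. The last list is a hypothetical set of values with one extreme outlier, added only to show how the median and the mean react differently.

```python
import statistics

# Odd count: the middle value of the sorted list
odd_median = statistics.median([12, 3, 6])
# Even count: the average of the two middle values
even_median = statistics.median([12, 6, 8, 4])
print(odd_median, even_median)

# Hypothetical values with one extreme outlier:
# the mean moves a lot, the median barely does
values = [1.4, 1.5, 1.5, 1.6, 50]
print(statistics.mean(values), statistics.median(values))
```

This is why the median is often preferred over the mean when the data may contain outliers.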
Quantiles:-
Any set of data, arranged in ascending or descending order, can be divided into parts, also known as partitions or subsets, by quantiles. Quantile is a generic term for the values that divide the sorted set into n partitions of equal size, so that each part represents 1/n of the set. Quartiles are the special case n = 4.
x={5,6,9,11,13,20,26}
- first quartile, or Q1 = 6
- second quartile, or Q2 = 11
- third quartile, or Q3 = 20
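The quartiles above can be reproduced with statistics.quantiles from the standard library. This is a sketch under one assumption worth stating: there are several conventions for computing quantiles, and other tools (for example the default of np.percentile) can give slightly different values for small data sets.

```python
import statistics

x = [5, 6, 9, 11, 13, 20, 26]

# n=4 asks for the three cut points that split the data into four parts (quartiles)
q1, q2, q3 = statistics.quantiles(x, n=4)
print(q1, q2, q3)

# The second quartile is the median
print(statistics.median(x))
```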
Box-plot and whisker
A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.
Box plot with whiskers: another, more intuitive way of visualising the 1-D scatter plot.
sns.boxplot(x="species",y="petal_length", data=iris)
plt.show()
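The numbers behind a box plot can be computed by hand. The sketch below uses hypothetical measurements (not the iris data) and the common 1.5 x IQR rule for flagging outliers, which is also seaborn's default whisker setting.

```python
import numpy as np

# Hypothetical measurements with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Five-number summary: minimum, Q1, median, Q3, maximum
print(values.min(), q1, median, q3, values.max())
print(outliers)
```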
Violin plots
From Wikipedia
A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
sns.violinplot(x="species", y="petal_length", data=iris)
plt.show()
Conclusion:
After doing EDA we have a much better understanding of the data and its important features, so we are ready to apply a machine learning model to it. You can get all above code from
Thanks for reading