Exploratory Data Analysis of IRIS Dataset

Published in

The Startup

6 min readAug 12, 2020

Let’s explore one of the simplest datasets, The IRIS Dataset which basically is a data about three species of a Flower type in form of its sepal length, sepal width, petal length, and petal width. The data set consists of 50 samples from each of the three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Our objective is to classify a new flower as belonging to one of the 3 classes given the 4 features.

Download IRIS data from here.

Here I'm importing the libraries in ipython notebook using Anaconda Navigator(download: https://www.anaconda.com/products/individual). which can be useful in our exploratory data analysis like pandas, matplotlib, numpy and seaborn.

Importing the data from pandas through read_csv()

Here, IRIS is a balanced dataset because the number of data points for every class Setosa, Virginica, and Versicolor is 50. If the classes are having the different numbers of data points each then it’s an imbalanced dataset.

2D Scatter Plot

By using the pandas object we created before we can plot a simple 2D graph of the features we give as x and y parameters of the plot() method of pandas. Matplotlib method show() helps to actually plot the data.

But by Seaborn we can plot a more informative graph by color-coding by each flower type.

Here in the above graph notice that Blue Setosa points can be easily separated from Orange Versicolor and Green Verginica points by simply drawing a line but the Orange and Green points are still complex to be separated because they are overlapping. So by using sepal_length and sepal_width features of the data we can get this much information.

2D Scatter Plot: Pair Plot

Pair Plot by Seaborn is capable of drawing multiple 2D Scatter Plots for each possible combination of features in one go.

So here if we observe the pair plots then we can say petal_length and petal_width are the most essential features to identify various flower types. While Setosa can be easily linearly separable, Virnica and Versicolor have some overlap. So we can separate them by a line and some “if-else” conditions.

1D Scatter Plot, Histogram, PDF & CDF

As we can observe the graph, it's very hard to make sense as points are overlapping a lot. There are better ways to visualize the scatter plots. By Seaborn, we can plot a Probability Distribution Function cum Histogram.

Histogram : Histogram is the plot representing the frequency counts of each data window of the feature for which the plot is drawn (Bar shapes in the graph).
PDF : Probability Density Function is basically a smoothed histogram. Every point on the PDF represents the probability for that particular value in the data (bell shaped curve in the graph). PDF gets formatted using Kernel Density Estimation. For each value of the point on x-axis, y-axis value represents its probabily of occuring in the dataset. More the y value more of that value exists in the dataset.

Now from these graphs, we can observe that by using just one feature a simple model can be formed by if..else condition as if(petal_length) < 2.5 then flower type is Setosa.

Now, what if we need the percentage of Versicolor points having a petal_length of less than 5 ? here comes CDF in our rescue!

CDF: Cumulative Density Function is the cumulative sum of the PDF. Every point on the CDF curve represents integration of the PDF till that point of CDF. Below is the histogram of the Yield. Every point on the CDF represents how much percentage of the total points belong to below that point.

To construct a histogram, the first step is to “bin” the range of values — that is, divide the entire range of values into a series of intervals — and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size(for more information: https://www.datacamp.com/community/tutorials/histograms-matplotlib).

Now by plotting of CDF of petal_length for various types of flowers in a combined manner we can get an overall picture of the data.

Mean, Variance and Standard Deviation

Mean: https://en.wikipedia.org/wiki/Mean

Variance: https://en.wikipedia.org/wiki/Variance

Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation

Median, Percentile, Quantile, MAD, IQR

Median: https://en.wikipedia.org/wiki/Median

Percentile: https://en.wikipedia.org/wiki/Percentile

Quantile: https://en.wikipedia.org/wiki/Quantile

MAD: Median Absolute Deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation

IQR: Interquantile Range: https://en.wikipedia.org/wiki/Interquartile_range

Box Plots

Box plots with whiskers is another method for visualizing the 1D Scatter Plot more intuitively. The boxes in the graph represent Interquantile Range as the first horizontal line from the bottom of the box represents 25th percentile value, the middle line represents the 50th percentile and the top line represents the 75th percentile. The black lines outside of the boxes are called whiskers. It’s not fixed what whiskers represent but it might be the minimum value of the feature at below horizontal line and maximum value at the top horizontal line in some cases.

Violin Plots

Violin plot by Seaborn combine PDF and Box-Plot. As in the below plot, on all three colors, PDFs of petal_length are on the sides of the shape, and in the center in black, there is a representation of Box-Plots.

Multivariate Probability Density: Contour Plot

Seaborn provides jointplot() method for contours. The name is “jointplot” because it represents Contours as well as PDFs on the edges. More the darker the region the more the probability of occurring that value of features for which the graph is plotted.