Plotting for EDA: Using Iris Flower Data set

Deepak Jain
Applied Machine Learning
5 min read · May 24, 2020

In this article, we will cover the basics of plotting for Exploratory Data Analysis (EDA) using the “Hello World!!” project of the machine learning world: the Iris data set.

What is EDA?

Exploratory Data Analysis is the task of analyzing our data using concepts from statistics and linear algebra, plotting tools and other techniques, so that we understand what the data set is all about before we build a model on it.

This is an extremely important stage in the project life cycle. For any given data set, the first thing we do is perform EDA. “Exploratory” means that we are trying to explore and understand a totally unknown data set and find out what it is about.

Download link for Iris Flower data set here.

Now, let's understand the problem that we are trying to solve and some terminology that we will use throughout this article.

Objective (in layman's terms): The objective here is to classify a new flower as one of the 3 types given the 4 features. When doing data analysis, we always need to understand what the task/objective is, and the analysis should always be in line with that objective. Here we are performing a classification task: given certain features of a flower, we try to classify it into 1 of the 3 types.

Data set: The data set consists of 3 types of flowers: iris-setosa, iris-virginica and iris-versicolor. All 3 of them belong to the iris family. We have 4 features, i.e. Sepal Length, Sepal Width, Petal Length and Petal Width, and for each flower described by these 4 features we have a class label (“Species”) which tells us which type of flower it is.

Other terminologies: The 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width) are also called variables, attributes, independent variables or input variables.
Each record that contains the values of these features is called a data point, an observation or a vector.
The type of flower (“Species”) is called the class label, dependent variable, class, label, output variable or response variable.

Objective redefined (in machine learning terms): Given the 4 variables (Sepal Length, Sepal Width, Petal Length and Petal Width), we need to predict the class label (“Species”).

Now, one question that might come to your mind is: why use these 4 features to identify the type of flower? Why not use the color or the shape of the flower?

This is where domain knowledge is very important. You must identify which variables will be the most useful for predicting the outcome (in this case, the type of flower).

Now, let's go through some of the code and perform some EDA.

Importing the required libraries

Before we go further, it is always good to have a look at the various columns/attributes that are present in the data so that we get a rough idea of how to begin our EDA.
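A minimal sketch of this first look at the data. It assumes scikit-learn's bundled copy of the Iris data set in place of the downloaded file, and the snake_case column names (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `species`) are assigned by hand here. Note that this copy labels the species `setosa`, `versicolor` and `virginica`, without the `Iris-` prefix.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Map the integer class codes (0, 1, 2) to the species names.
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

print(iris.shape)             # number of rows and columns
print(iris.columns.tolist())  # the 4 features plus the class label
print(iris.head())            # first 5 rows
```

If you work from a CSV file instead, `pandas.read_csv` gives you the same kind of DataFrame, though the column names may differ.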

Now, you might wonder why, after having a look at the first 5 rows, we executed the “value_counts” function. The reason is to understand whether we have a balanced data set or not. Based on the output of “value_counts”, we can say that we have a balanced data set.
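The check can be sketched as follows, again assuming scikit-learn's bundled copy of the data and hand-assigned column names rather than the exact file used in this article.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

# Count how many data points belong to each class label.
counts = iris["species"].value_counts()
print(counts)
```

Each of the 3 species appears 50 times, which is what we mean by a balanced data set.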

What do we mean by balanced and imbalanced data sets?
Let me take a very simple example to make this concept clear.

Assume I have data points for 10,000 patients, of whom, say, 500 have diabetes. Our objective is to find out whether a given person has diabetes or not based on the features/attributes provided. Here, we can say that the data set is highly imbalanced because only 500 patients out of 10,000 (5%) have diabetes. If at least 4,000 of the 10,000 patients had diabetes, the data set could still be termed balanced. But just 500 out of 10,000 is a highly imbalanced data set.
What impact does a balanced vs. imbalanced data set have on the outcome of our model, and how can we deal with it? This is a topic for another blog at a later point in time. For now, in the case of the iris data set, we have a well balanced data set: 50 data points for each of the 3 types of flowers.
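To put numbers on the two cases, here is a small sketch that computes the share of each class. The `class_shares` helper and the label names are hypothetical, made up only for this illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of data points that falls in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels for the patient example: 500 of 10,000 have diabetes.
patients = ["diabetes"] * 500 + ["no_diabetes"] * 9500
# The iris data set: 50 flowers of each of the 3 types.
flowers = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

patient_shares = class_shares(patients)
flower_shares = class_shares(flowers)
print(patient_shares)  # one class holds only 5% of the data
print(flower_shares)   # every class holds one third of the data
```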

We'll start by visualizing various plots. We will write some code (it's alright if you don't understand the code right now) and write our observations based on it.

Scatter plot

In the above figure, we are plotting sepal_length vs sepal_width and color coding the points by type of flower. From the graph it is clearly visible that almost all the blue points (iris-setosa) can be separated from the other two species by drawing a straight line between the two groups.
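A figure like this one could be produced along the following lines. This is only a sketch using plain matplotlib, assuming scikit-learn's bundled copy of the data and hand-assigned column names; the colors come from matplotlib's default color cycle, in which the first group (setosa) is drawn in blue.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

fig, ax = plt.subplots()
# One scatter call per species so that each type gets its own color and legend entry.
for species, group in iris.groupby("species"):
    ax.scatter(group["sepal_length"], group["sepal_width"], label=species)
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.legend()
plt.show()
```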

Box plot
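A minimal sketch of a box plot with seaborn. The choice of petal_length as the feature is an assumption made for illustration, as are scikit-learn's bundled copy of the data and the hand-assigned column names.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

# One box per species: the box spans the 25th to 75th percentile,
# and the line inside it marks the median.
ax = sns.boxplot(data=iris, x="species", y="petal_length")
plt.show()
```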
Violin plot
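A minimal sketch of a violin plot with seaborn. As before, the choice of petal_length, scikit-learn's bundled copy of the data and the hand-assigned column names are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

# A violin shows a smoothed density of the values for each species,
# with a small box plot drawn inside it by default.
ax = sns.violinplot(data=iris, x="species", y="petal_length")
plt.show()
```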
Histogram

In the above plot, we have constructed a histogram of petal_length. It is very clear that the values of Iris-setosa do not overlap with those of the other two species, whereas there is some overlap in values between Iris-versicolor and Iris-virginica.
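A histogram of this kind can be sketched with plain matplotlib as shown here. The bin count and transparency are arbitrary choices, and the data again comes from scikit-learn's bundled copy with hand-assigned column names. The small table printed at the end lets us check the overlap claim numerically.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the downloaded file.
bunch = load_iris(as_frame=True)
iris = bunch.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = [str(bunch.target_names[code]) for code in bunch.target]

fig, ax = plt.subplots()
# Overlay one semi-transparent histogram per species.
for species, group in iris.groupby("species"):
    ax.hist(group["petal_length"], bins=10, alpha=0.6, label=species)
ax.set_xlabel("petal_length")
ax.set_ylabel("count")
ax.legend()
plt.show()

# Smallest and largest petal_length per species.
ranges = iris.groupby("species")["petal_length"].agg(["min", "max"])
print(ranges)
```

If the largest setosa value is below the smallest value of the other two species, the setosa histogram stands completely apart.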

The above graph is the pair plot. Here, we can clearly see that petal_length vs petal_width separates Iris-setosa from the other two species most cleanly.

To summarize what we have done so far: we have defined some terminology and plotted a few graphs to understand the distribution of the data and get familiar with the features.

In part 2 of this series (WIP), we will discuss concepts like PDF, CDF, mean, variance, standard deviation, etc.

Until then, Happy Learning!
