Exploratory Data Analysis: Iris Dataset

Hari Mittapalli
5 min read · Dec 24, 2018

Hello all. I know there are tons of repositories available for Exploratory Data Analysis (EDA) on the famous Iris dataset.

I am not an expert in data science, so I don’t intend to write all the algorithmic code myself; instead I will use the tools that have already been developed by others.

This is my version of EDA on the Iris dataset, since it is one of the most common datasets that I came across in my journey into the data science world.

There are many versions of this dataset, but I will be using the one from sklearn.datasets.

from sklearn.datasets import load_iris

This dataset has four numerical features (Sepal Length, Sepal Width, Petal Length and Petal Width) plus the Species type, which is the target.

Import the other libraries required for our analysis.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plt.figure/plt.show calls later on

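Now we need to create a pandas dataframe from the iris dataset.
load_iris is a function in sklearn.datasets which returns a Bunch (a dictionary-like object) holding the data, the target variable and a description of the dataset.
We use the DataFrame function in the pandas library to convert the array of data to a pandas dataframe with the columns "Sepal Length", "Sepal Width", "Petal length" and "Petal Width", and then create a new column "Species" with the target values from the dataset. The column names have to follow the order in which load_iris returns the measurements (sepal length, sepal width, petal length, petal width), otherwise the sepal and petal values end up under the wrong labels. I have used a lambda function to convert the target values, which are 0, 1, 2, to the corresponding target names ("setosa", "versicolor", "virginica") for better understanding.

dataset = load_iris()
# columns follow the order of dataset['feature_names']: sepal length, sepal width, petal length, petal width
data = pd.DataFrame(dataset['data'], columns=["Sepal Length", "Sepal Width", "Petal length", "Petal Width"])
data['Species'] = dataset['target']
# map the integer codes 0, 1, 2 to the species names
data['Species'] = data['Species'].apply(lambda x: dataset['target_names'][x])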
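A quick way to double-check the column order is to print the feature names that scikit-learn ships with the data. The sketch below does that and, as an alternative to the lambda, builds the Species column by indexing target_names with the target array. That alternative is only a suggestion; the result should be the same dataframe.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

# The order of these names is the order of the columns in dataset["data"]
print(dataset["feature_names"])

data = pd.DataFrame(
    dataset["data"],
    columns=["Sepal Length", "Sepal Width", "Petal length", "Petal Width"],
)

# NumPy fancy indexing: each target code (0, 1, 2) picks the matching name
data["Species"] = dataset["target_names"][dataset["target"]]

print(data.head())
```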

We use the data.head() function to see the first 5 records of the dataframe.

data.head()

Use the shape attribute to find the dimensions of the dataframe. It returns a tuple of (rows, columns). Note that shape is an attribute, not a function, so it is written without parentheses. In our dataset we have 150 rows/records and 5 columns (four features plus Species).

data.shape

Use the describe() function to see summary statistics of the numerical columns: count, mean, standard deviation, minimum, the 25%, 50% (median) and 75% quartiles, and maximum. It does not report the mode.

data.describe()
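Since describe() leaves out the mode and only shows the median as the 50% row, here is a small sketch of how those could be computed separately. The variable names are my own, and the dataframe is rebuilt so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
data = pd.DataFrame(
    dataset["data"],
    columns=["Sepal Length", "Sepal Width", "Petal length", "Petal Width"],
)
data["Species"] = dataset["target_names"][dataset["target"]]

# Work on the numerical columns only
numeric = data.drop(columns="Species")

summary = numeric.describe()      # count, mean, std, min, quartiles, max
medians = numeric.median()        # should match the "50%" row of describe()
modes = numeric.mode().iloc[0]    # first mode of each column (there can be ties)

# For the categorical column, counting the labels is more informative
species_counts = data["Species"].value_counts()

print(summary)
print(medians)
print(modes)
print(species_counts)
```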

We need to check the datatype of each feature, and the info() function does that. We can see that the four measurement columns are of float datatype, while Species holds text labels (shown as object datatype) and is categorical.

data.info()

Let’s see if there are any null values present in the dataset. If there are, we need to follow one of the steps below:
* Drop the records which have NA values
* Substitute the NA values with the mean of the column if the feature is numerical, or with the mode if the feature is categorical
* Fill the NA values with a placeholder such as “?” or -9999

data.isnull().sum()

Here we can clearly see that we don’t have any null values in our Dataset.
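The Iris data has no gaps, so the three options listed above never get used here. As a rough sketch of what they could look like, the snippet below uses a tiny made-up dataframe (the toy values are invented purely for illustration and are not part of the Iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing number and one missing label
toy = pd.DataFrame({
    "Petal Width": [0.2, np.nan, 1.8, 1.4],
    "Species": ["setosa", "setosa", None, "virginica"],
})

# Option 1: drop every record that has an NA value
dropped = toy.dropna()

# Option 2: mean for the numerical column, mode for the categorical one
filled = toy.copy()
filled["Petal Width"] = filled["Petal Width"].fillna(filled["Petal Width"].mean())
filled["Species"] = filled["Species"].fillna(filled["Species"].mode()[0])

# Option 3: a placeholder value that is easy to spot later
placeholder = toy["Petal Width"].fillna(-9999)

print(dropped)
print(filled)
print(placeholder)
```

Which option is right depends on the data: dropping rows is simplest but throws information away, while a placeholder like -9999 will distort statistics such as the mean unless it is filtered out again.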

Now we will plot our data to understand the relationships between the numerical features.
I have used the seaborn library for plotting; we can also use Python's matplotlib library to visualize the data.

There are different types of plots, like bar plots, box plots, scatter plots etc.
A scatter plot is very useful when we are analyzing the relationship between 2 features on the x and y axes.
In the seaborn library we have the pairplot function, which is very useful for drawing scatter plots of all pairs of features at once instead of plotting them individually.

sns.pairplot(data)

Now we will see how these features are correlated with each other using a heatmap from the seaborn library. With the columns labelled in scikit-learn's order, Sepal Length and Sepal Width are only weakly correlated with each other, while Petal Length and Petal Width are strongly correlated.

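plt.figure(figsize=(10,11))
# drop the text column first: corr() only makes sense for numerical columns
sns.heatmap(data.drop(columns="Species").corr(), annot=True)
plt.show()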
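The numbers behind the heatmap can also be printed directly, which is handy when checking a specific pair of features. This is a minimal sketch that rebuilds the dataframe so it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
data = pd.DataFrame(
    dataset["data"],
    columns=["Sepal Length", "Sepal Width", "Petal length", "Petal Width"],
)
data["Species"] = dataset["target_names"][dataset["target"]]

# Pearson correlation between the four numerical columns
corr = data.drop(columns="Species").corr()
print(corr.round(2))

# Look up a single pair of features
sepal_corr = corr.loc["Sepal Length", "Sepal Width"]
petal_corr = corr.loc["Petal length", "Petal Width"]
print("Sepal Length vs Sepal Width:", round(sepal_corr, 2))
print("Petal length vs Petal Width:", round(petal_corr, 2))
```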

Let’s see how our data is distributed based on Sepal Length and Width features using scatterplot.

sns.FacetGrid(data, hue="Species")\
.map(plt.scatter, "Sepal Length", "Sepal Width")\
.add_legend()
plt.show()

Similarly, here is a scatter plot of the data based on the Petal Length and Width features.

sns.FacetGrid(data, hue="Species")\
.map(plt.scatter, "Petal length", "Petal Width")\
.add_legend()
plt.show()

Now let’s visualize the data with violin plots of each input variable against the output variable, Species. A violin plot shows the density of the lengths and widths within each species: the thinner parts denote lower density, whereas the fatter parts convey higher density.

plt.figure(figsize=(12,10))
# one violin plot per measurement, arranged in a 2 x 2 grid
plt.subplot(2,2,1)
sns.violinplot(x="Species", y="Sepal Length", data=data)
plt.subplot(2,2,2)
sns.violinplot(x="Species", y="Sepal Width", data=data)
plt.subplot(2,2,3)
sns.violinplot(x="Species", y="Petal length", data=data)
plt.subplot(2,2,4)
sns.violinplot(x="Species", y="Petal Width", data=data)
plt.show()

Similarly, we can use box plots to see how each of the four input variables is distributed within each Species category.

plt.figure(figsize=(12,10))
# one box plot per measurement, arranged in a 2 x 2 grid
plt.subplot(2,2,1)
sns.boxplot(x="Species", y="Sepal Length", data=data)
plt.subplot(2,2,2)
sns.boxplot(x="Species", y="Sepal Width", data=data)
plt.subplot(2,2,3)
sns.boxplot(x="Species", y="Petal length", data=data)
plt.subplot(2,2,4)
sns.boxplot(x="Species", y="Petal Width", data=data)
plt.show()
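A box plot is drawn from the quartiles of each group, so the same information can be read as a table. The sketch below groups the data by Species and summarizes one column; I picked "Petal length" as the example, and the other columns work the same way.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()
data = pd.DataFrame(
    dataset["data"],
    columns=["Sepal Length", "Sepal Width", "Petal length", "Petal Width"],
)
data["Species"] = dataset["target_names"][dataset["target"]]

# min, 25%, 50%, 75% and max per species: the values a box plot is built on
per_species = data.groupby("Species")["Petal length"].describe()
print(per_species)

# Do the setosa values overlap with versicolor at all?
setosa_max = per_species.loc["setosa", "max"]
versicolor_min = per_species.loc["versicolor", "min"]
print("setosa max:", setosa_max, "| versicolor min:", versicolor_min)
```

When the largest setosa value is below the smallest versicolor value, the setosa box sits completely apart from the others in the plot, which suggests petal length alone would be enough to tell setosa from the other two species.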

With this I will end my first blog post. I will apply machine learning models to this dataset in the next post.
If you find a mistake (I’m a beginner after all) in any of the work in this piece, feel free to send feedback using the contact below or leave a comment.

Thank you so much for taking the time to read this piece!
Merry Christmas !!!!

Feedback:
Email: harimittapalli24@gmail.com
