EDA and ML with Kaggle Iris Datasets

Bipin Kumar Chaurasia
4 min read · Aug 16, 2020


While working with different datasets available on Kaggle and performing Exploratory Data Analysis (EDA) on them, I came across the Seaborn Python library for data visualization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

After importing the above libraries, we read the CSV file into a pandas DataFrame and then use its shape attribute to find the number of rows and columns in the DataFrame.
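The original post shows this step as a screenshot; here is a minimal sketch of it. Since the Kaggle Iris.csv file isn’t bundled with this article, the DataFrame is rebuilt from scikit-learn’s copy of the dataset with Kaggle-style column names (the filename and exact column set are assumptions — the Kaggle file also carries an Id column, so its shape would report one extra column):

```python
import pandas as pd
from sklearn.datasets import load_iris

# On Kaggle this would simply be: df = pd.read_csv("Iris.csv")
# Rebuilt here from scikit-learn's bundled copy so the snippet runs standalone.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

print(df.shape)  # → (150, 5): 150 rows, 4 features plus the Species column
```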

With the shape of the DataFrame known, let’s now look at the head of the DataFrame:
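A sketch of this step (the DataFrame setup stands in for reading the Kaggle CSV, as an assumption):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

print(df.head())  # first five rows of the DataFrame
```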

We can also inspect each column’s data type and confirm that every feature has 150 values with no null or NA entries:
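A sketch of the check (again rebuilding the DataFrame from scikit-learn’s copy in place of the Kaggle CSV):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

df.info()                 # dtypes and non-null counts: 150 non-null per column
print(df.isnull().sum())  # zero missing values in every column
```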

Likewise, we can compute summary statistics for the non-object (numeric) features of the Iris dataset:
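A sketch of the summary-statistics step (the DataFrame setup is a stand-in for the Kaggle CSV, as before):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

# count, mean, std, min, quartiles and max for each numeric column
print(df.describe())
```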

Now, let’s analyse the Species column with a count plot (the count kind of sns.catplot), which shows that each species appears exactly 50 times:
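A sketch of the count plot (DataFrame setup again stands in for the Kaggle CSV):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

sns.catplot(data=df, x="Species", kind="count")  # one bar per species
print(df["Species"].value_counts())              # 50 rows for each species
```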

Next, let’s do univariate analysis with histograms and KDE curves using distribution plots:

The variation shown in the histograms can also be verified with skewness and kurtosis values:
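A sketch of that check using pandas’ built-in skew and kurtosis (DataFrame setup stands in for the Kaggle CSV):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

numeric = df.select_dtypes("number")
print(numeric.skew())      # sample skewness per feature
print(numeric.kurtosis())  # excess kurtosis (negative = flatter than normal)
```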

Let’s analyse the relationship of each feature with the others using sns.pairplot:

From the above figure, we can see that PetalLengthCm and PetalWidthCm have an approximately linear relationship, which we can also verify with a heatmap, where their correlation value is 0.96:
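A sketch of the correlation heatmap (the colormap choice is an assumption):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

corr = df.select_dtypes("number").corr()       # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
print(round(corr.loc["PetalLengthCm", "PetalWidthCm"], 2))  # → 0.96
```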

Now, with boxplots for bivariate analysis, let’s plot Species against SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm:
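A sketch of the boxplots, one subplot per feature (the 2×2 grid layout is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.flat, features):
    sns.boxplot(data=df, x="Species", y=col, ax=ax)  # spread per species
plt.tight_layout()
```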

We can do the same bivariate analysis with violin plots, again plotting Species against SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm:
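A sketch of the violin plots, mirroring the boxplot layout above (the grid layout is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.flat, features):
    sns.violinplot(data=df, x="Species", y=col, ax=ax)  # density per species
plt.tight_layout()
```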

Now, let’s dig deeper with the different available machine learning algorithms and compare their accuracy, confusion matrices, etc.:

Below, we split the dataset into training and test sets, with 70% of the data for training and 30% for testing. We also set random_state to 42 so that the split stays the same whenever we repeat these steps.

If the random_state value varies, the distribution of the split varies too, and therefore the final accuracy percentage will differ as well.
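A sketch of the split described above, using scikit-learn’s train_test_split (the DataFrame setup stands in for the Kaggle CSV, as an assumption):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

X = df.drop(columns="Species")  # the four measurement features
y = df["Species"]               # the label to predict
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # fixed seed = reproducible split
print(X_train.shape, X_test.shape)  # → (105, 4) (45, 4)
```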

After splitting the data, we can use the different ML algorithms available in scikit-learn to build models:
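The post doesn’t list which algorithms it tried, so here is a sketch with three common scikit-learn classifiers (the specific model choices are assumptions), reporting accuracy and the confusion matrix for each:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for pd.read_csv("Iris.csv"); column names match the Kaggle file.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns=dict(zip(
    iris.frame.columns[:4],
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])))
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="Species"), df["Species"],
    test_size=0.3, random_state=42)

models = [LogisticRegression(max_iter=200),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=42)]
scores = []
for model in models:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    scores.append(accuracy_score(y_test, preds))
    print(type(model).__name__, scores[-1])
    print(confusion_matrix(y_test, preds))  # rows: true class, cols: predicted
```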

That’s it! Thank you so much for reading until the end of this blog. I’d appreciate it even more if you could share your opinions in the comments!
