One of the most famous classification problems in Machine Learning is the Iris flower classification problem. This blog post aims at understanding this problem and the underlying concepts of machine learning.
Given the sepal and petal lengths and widths, predict the class of Iris.
Data Description :
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Basic Data Analysis :
- The dataset provided has 150 rows
- Independent Variables (features) : Sepal Length, Sepal Width, Petal Length, Petal Width
- Dependent/Target Variable : Class
- Missing values : None
Code : data.info()
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length 150 non-null float64
sepal_width 150 non-null float64
petal_length 150 non-null float64
petal_width 150 non-null float64
class 150 non-null object
dtypes: float64(4), object(1)
The dataset is divided into train and test data with an 80:20 split ratio, where 80% of the data is used for training and 20% for testing.
Code : train, test = sklearn.model_selection.train_test_split(data, test_size=0.2, random_state=7)
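To make the split easy to reproduce, here is a minimal sketch. The post's own data file is not shown, so as an assumption this uses the copy of the Iris dataset bundled with scikit-learn and renames its columns to match the `data.info()` output above. The class labels in that copy are `setosa`, `versicolor` and `virginica`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the post's own data file is not shown, so this uses the copy of
# the Iris dataset bundled with scikit-learn, renamed to the post's columns.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

# 80:20 split, with the same test_size and random_state as in the post.
train, test = train_test_split(data, test_size=0.2, random_state=7)

print(train.shape, test.shape)  # (120, 5) (30, 5)
```

With 150 rows, a `test_size` of 0.2 leaves 120 rows for training and 30 for testing.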
Exploratory Data Analysis :
EDA can be done using one feature, known as univariate analysis, or multiple features, known as multivariate analysis.
Univariate Analysis :
1. Count Plot
A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
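A count plot of the class column could be drawn roughly like this. The post does not show its plotting code, so the use of seaborn's `countplot` and the scikit-learn copy of the dataset are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the post's file.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

# One bar per class; the bar height is the number of rows in that class.
fig, ax = plt.subplots()
sns.countplot(x="class", data=data, ax=ax)
ax.set_title("Number of samples per class")

# The same information as a table.
class_counts = data["class"].value_counts()
print(class_counts)
```

Call `plt.show()` (or `fig.savefig(...)`) to see the figure when running this as a script. Each class has 50 rows, so the three bars come out equal in height.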
2. Histograms with KDE
A histogram visualises the distribution of data over a continuous interval or a certain time period. Each bar in a histogram represents the tabulated frequency at each interval/bin.
Reading a count plot is easy. But how do you read and analyse histograms?
Histograms help give an estimate as to where values are concentrated, what the extremes are and whether there are any gaps or unusual values. They are also useful for giving a rough view of the probability distribution.
Multivariate Analysis :
Multivariate analysis involves the analysis of two or more variables.
A Box and Whisker Plot (or Box Plot) is a convenient way of visually displaying the distribution of data through its quartiles. Although a box plot on its own is univariate analysis, looking at the distribution of one variable with respect to another variable makes it multivariate analysis.
Here are the types of observations one can make from viewing a Box Plot:
What the key values are, such as the average, median, 25th percentile, etc.
If there are any outliers and what their values are.
Whether the data is symmetrical.
How tightly the data is grouped.
If the data is skewed and if so, in what direction.
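The observations listed above can be checked with a box plot of one feature per class, plus a table of the quartiles it is built from. This is only a sketch: the feature `petal_length` and the use of seaborn's `boxplot` are assumptions, not necessarily what the original notebook did.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the post's file.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

# One box per class: the box spans the 25th to 75th percentile,
# the line inside it is the median, and points beyond the whiskers are outliers.
fig, ax = plt.subplots()
sns.boxplot(x="class", y="petal_length", data=data, ax=ax)

# The key values behind each box.
summary = data.groupby("class")["petal_length"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

A median that sits off-centre inside its box hints at skew, and a short box means the values are tightly grouped.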
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
A simple heat map provides an immediate visual summary of information. There can be many ways to display heat maps, but they all share one thing in common: they use color to communicate relationships between data values that would be much harder to understand if presented numerically in a spreadsheet.
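A common use of a heat map in EDA is to colour the correlation matrix of the features. The post does not say which matrix it plotted, so the correlation matrix here is an assumption, as are the colour map and the scikit-learn copy of the data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the post's file.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

# Pairwise Pearson correlation between the four numeric features.
corr = data.drop(columns="class").corr()

# Each cell is coloured by its correlation value; annot=True prints the number too.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

print(corr.round(2))
```

Strongly coloured cells away from the diagonal point at feature pairs that move together, such as petal length and petal width.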
A pair plot draws high-level scatter plots to capture relationships between multiple variables within a dataframe.
Pair plots help us understand the relationships between the various independent features.
Data Modelling :
Iris class prediction is a multiclass classification problem where the target variable has three classes: Iris Setosa, Iris Versicolour, Iris Virginica.
What Is Multiclass Classification?
Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs.
I trained various classification algorithms on the data and got the following accuracies on the training data set:
Logistic Regression: 0.966667 (0.040825)
Decision Tree: 0.975000 (0.038188)
Linear Discriminant Analysis: 0.975000 (0.038188)
K Nearest Neighbours: 0.983333 (0.033333)
Naive Bayes: 0.975000 (0.053359)
Support Vector Machines: 0.991667 (0.025000)
It can be seen that Support Vector Machines give the best results on the training dataset.
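The "score (spread)" format of the numbers above looks like a mean and standard deviation from k-fold cross-validation, although the post does not say so. Under that assumption, a comparison could be sketched like this. The 10 stratified folds, the default hyperparameters and the scikit-learn copy of the data are all my choices, so the printed numbers will not necessarily match the ones above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the post's file.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

train, test = train_test_split(data, test_size=0.2, random_state=7)
X_train, y_train = train.drop(columns="class"), train["class"]

# Assumption: default hyperparameters, apart from a higher max_iter so that
# logistic regression converges without warnings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "K Nearest Neighbours": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machines": SVC(),
}

# Assumption: 10 stratified folds over the training data only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.6f} ({scores.std():.6f})")
```

Because the test data is never touched here, it stays available for a final, unbiased check of whichever model is chosen.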
On checking the performance of SVM on the test data, I got the following results :
Accuracy : 0.9333333333333333
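The test-set check can be sketched as below. The SVM settings used in the post are not given, so scikit-learn's default `SVC()` is an assumption, and the accuracy it prints may differ from the figure above. The confusion matrix is an extra I added to show where any mistakes fall.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for the post's file.
iris = load_iris()
data = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
data["class"] = iris.target_names[iris.target]

train, test = train_test_split(data, test_size=0.2, random_state=7)

# Fit on the training rows only (default SVC settings are an assumption).
model = SVC()
model.fit(train.drop(columns="class"), train["class"])

# Predict the 30 held-out rows and compare with their true classes.
predictions = model.predict(test.drop(columns="class"))
accuracy = accuracy_score(test["class"], predictions)

print("Accuracy :", accuracy)
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(test["class"], predictions))
```

With only 30 test rows, each misclassified flower moves the accuracy by about 3.3 percentage points, so small differences between runs are not very meaningful.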
You can find my code at this Link
Happy Learning !!