Cluster Analysis With Iris Data Set

Ahmed Yahya Khaled
Published in The Startup · Aug 22, 2020

Clustering with R


This article is a hands-on walk-through of cluster analysis (an unsupervised machine learning technique) in R, using the popular ‘Iris’ data set.

Let’s brush up on some concepts from Wikipedia:

Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a subset of Artificial Intelligence. Machine learning algorithms build a mathematical model based on sample data, in order to make predictions or decisions without being explicitly programmed to do so.

Supervised Learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

Unsupervised Learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.

Cluster Analysis or Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

About Iris Data set


The Iris flower data set was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is perhaps the best known data set in the pattern recognition literature. It gives the measurements in centimetres of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.
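
Since the iris data set ships with base R, a quick check confirms the structure described above:

# iris comes built-in with R : 150 rows, 4 measurements + the Species label
dim(iris)             # 150 5
table(iris$Species)   # 50 flowers per species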


So, let’s start now! You may like to download the Iris data set & the R script from my GitHub repository.

Hope you have R & RStudio installed for a hands-on experience with me :)

Objective

The objective is to segment the iris data (without labels) into three clusters (1, 2 & 3) by k-means clustering & compare these clusters with the actual species clusters: setosa, versicolor, and virginica.

Install and Load R Packages

Three R packages are required here: ‘tidyverse’, ‘cluster’ and ‘reshape2’. Install them if you haven’t already, then load them with the library function.

install.packages("tidyverse")   # for data work & visualization
install.packages("cluster")     # for cluster modeling
install.packages("reshape2")    # for melting data
# note : not required if already installed
library(tidyverse)
library(cluster)
library(reshape2)

Import the Iris Data set

We can import it from disk after setting the working directory to the folder where the csv file is:

setwd("E:/my_folder/work_folder")
mydata <- read.csv("iris.csv")

Or, get it from the built-in R datasets:

mydata <- iris

Explore the Data set

With the functions below we can inspect the data set before exploring it visually.

glimpse(mydata)
head(mydata)
View(mydata)

Let’s visualize the data now with ggplot2

Sepal-Length vs. Sepal-Width

ggplot(mydata)+
geom_point(aes(x = Sepal.Length, y = Sepal.Width), stroke = 2)+
facet_wrap(~ Species)+
labs(x = 'Sepal Length', y = 'Sepal Width')+
theme_bw()

Petal-Length vs. Petal-Width

ggplot(mydata)+
geom_point(aes(x = Petal.Length, y = Petal.Width), stroke = 2)+
facet_wrap(~ Species)+
labs(x = 'Petal Length', y = 'Petal Width')+
theme_bw()

Sepal-Length vs. Petal-Length

ggplot(mydata)+
geom_point(aes(x = Sepal.Length, y = Petal.Length), stroke = 2)+
facet_wrap(~ Species)+
labs(x = 'Sepal Length', y = 'Petal Length')+
theme_bw()

Sepal-Width vs. Petal-Width

ggplot(mydata)+
geom_point(aes(x = Sepal.Width, y = Petal.Width), stroke = 2)+
facet_wrap(~ Species)+
labs(x = 'Sepal Width', y = 'Petal Width')+
theme_bw()

Box plots

ggplot(mydata)+
geom_boxplot(aes(x = Species, y = Sepal.Length, fill = Species))+
theme_bw()
ggplot(mydata)+
geom_boxplot(aes(x = Species, y = Sepal.Width, fill = Species))+
theme_bw()
ggplot(mydata)+
geom_boxplot(aes(x = Species, y = Petal.Length, fill = Species))+
theme_bw()
ggplot(mydata)+
geom_boxplot(aes(x = Species, y = Petal.Width, fill = Species))+
theme_bw()

k-means Clustering

k-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre or centroid), which serves as a prototype of the cluster.
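
To make the idea concrete, here is a minimal sketch of the assignment step only (not the kmeans implementation itself); the starting centroids, rows 1, 51 and 101, are an arbitrary choice for illustration:

# a minimal sketch of the k-means assignment step (illustrative only)
x <- as.matrix(mydata[, -5])            # the four feature columns
centroids <- x[c(1, 51, 101), ]         # three hypothetical starting centroids
# squared Euclidean distance of every observation to every centroid
d <- sapply(1:nrow(centroids), function(j)
  rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2))
assignment <- apply(d, 1, which.min)    # each observation goes to its nearest centroid
table(assignment)

The kmeans function repeats this assignment step and a centroid-update step until the clusters stop changing.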

Find the optimal number of clusters by the Elbow Method

set.seed(123) # for reproducibility
wcss <- vector()
for (i in 1:10) wcss[i] <- sum(kmeans(mydata[, -5], i)$withinss)
plot(1:10,
wcss,
type = 'b',
main = paste('The Elbow Method'),
xlab = 'Number of Clusters',
ylab = 'WCSS'
)
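
As a side note, the kmeans object also reports the total within-cluster sum of squares directly (tot.withinss), and using nstart for multiple random starts usually makes the curve more stable; a compact variant of the same loop:

# alternative: use tot.withinss and multiple random starts for stability
set.seed(123)
wcss <- sapply(1:10, function(k) kmeans(mydata[, -5], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = 'b', xlab = 'Number of Clusters', ylab = 'WCSS')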

The elbow point: k (centers) = 3

Apply the kmeans function to the feature columns

set.seed(123)
km <- kmeans(x = mydata[, -5], centers = 3)
yclus <- km$cluster
table(yclus)

output :

> table(yclus)
yclus
1 2 3
50 62 38

kmeans has grouped the data into three clusters (1, 2 & 3) with 50, 62 & 38 observations respectively.
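
For a quick look at what each cluster represents, the fitted kmeans object also stores the cluster centroids and sizes (note that the cluster numbers themselves are arbitrary labels):

km$centers   # mean feature values of each cluster
km$size      # observations per cluster (50, 62, 38)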

Visualize the kmeans clusters

clusplot(mydata[, -5],
yclus,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 0,
plotchar = FALSE,
span = TRUE,
main = paste('Clusters of Iris Flowers')
)

Compare the clusters

mydata$cluster.kmean <- yclus
cm <- table(mydata$Species, mydata$cluster.kmean)
cm

output :

> cm
1 2 3
setosa 50 0 0
versicolor 0 48 2
virginica 0 14 36

[(50 + 48 + 36)/150] ≈ 89% of the k-means cluster assignments match the actual species clusters. versicolor (Cluster 2) & virginica (Cluster 3) have some overlapping features, which is also apparent from the cluster visualizations.
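
The same figure can be computed straight from the confusion table; a minimal sketch, assuming each species is matched to the cluster that holds most of its flowers:

# proportion of flowers falling in the dominant cluster of their species
sum(apply(cm, 1, max)) / sum(cm)    # (50 + 48 + 36) / 150 ≈ 0.893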

Tile plot: Species vs. kmeans clusters

mtable <- melt(cm)
ggplot(mtable)+
geom_tile(aes(x = Var1, y = Var2, fill = value))+
labs(x = 'Species', y = 'kmeans Cluster')+
theme_bw()

Scatter plots (to view Species & kmeans clusters)

Sepal-Length vs. Sepal-Width

mydata$cluster.kmean <- as.factor(mydata$cluster.kmean)
# Sepal-Length vs. Sepal-Width (Species)
ggplot(mydata)+
geom_point(aes(x = Sepal.Length, y = Sepal.Width,
color = Species) , size = 10)+
labs(x = 'Sepal Length', y = 'Sepal Width')+
ggtitle("Species")+
theme_bw()
# Sepal-Length vs. Sepal-Width (kmeans cluster)
ggplot(mydata)+
geom_point(aes(x = Sepal.Length, y = Sepal.Width,
color = cluster.kmean) , size = 10)+
labs(x = 'Sepal Length', y = 'Sepal Width')+
ggtitle("kmeans Cluster")+
theme_bw()

Petal-Length vs. Petal-Width

# Petal-Length vs. Petal-Width (Species)
ggplot(mydata)+
geom_point(aes(x = Petal.Length, y = Petal.Width,
color = Species) , size = 10)+
labs(x = 'Petal Length', y = 'Petal Width')+
ggtitle("Species")+
theme_bw()
# Petal-Length vs. Petal-Width (kmeans cluster)
ggplot(mydata)+
geom_point(aes(x = Petal.Length, y = Petal.Width,
color = cluster.kmean) , size = 10)+
labs(x = 'Petal Length', y = 'Petal Width')+
ggtitle("kmeans Cluster")+
theme_bw()

Hope you enjoyed the cluster analysis in R and the visualizations along the way. Feel free to comment with any remarks, suggestions or topics you would like me to write about.

This happens to be my first Medium article. Many more will come as I progress up the ladder of data science and machine learning. Keep calm and stay with me!


Let’s ace data !!
