Exploring the Basics of K-means Clustering in R with the Iris Dataset

Dima Diachkov
Published in Data And Beyond · 5 min read · Jul 9, 2023

In the vibrant world of data science, datasets serve as the canvas on which we paint our insights and discoveries. Earlier, we talked about dummy datasets and their suitability for specific purposes. One such dataset that has been a cornerstone for budding data enthusiasts and seasoned professionals alike is the Iris dataset.

This is part #29 of the “R for Applied Economics” guide, where we collectively explore various depths of R, data science, and financial/economic analysis. Today, we delve into this botanical wonderland, applying basic machine learning algorithms to cluster our floral friends.

The Iris dataset, a multivariate data set introduced by the British statistician and biologist Ronald Fisher, consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the lengths and the widths of the sepals and petals.

By the way, this is what an iris looks like:

Credits: Unsplash | Flavia Bon

Let’s start our journey by loading the dataset and exploring it, since you cannot cluster or classify data without exploring it first.

# Load the iris dataset 
data(iris)

# Display the first few rows of the dataset
head(iris)
Output for the code above
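
Beyond the first few rows, two base R helpers give a fuller picture of the data. This is a small optional addition to the walkthrough:

# Structure: 150 observations, four numeric features plus the Species factor
str(iris)

# Ranges, quartiles, and means for each measurement
summary(iris)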

We have the data, but we are not yet aware of what it looks like as a whole. Let’s plot it. Fortunately for us, this dummy dataset is very simple, so we can just ggplot it. However, we have 4 dimensions to explore, so I will try to put them all onto the same chart: the X and Y axes, point size, and color, while the species will be the point shape.

library(ggplot2)

# Create a scatter plot encoding all four measurements plus species
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                 color = Petal.Length, size = Petal.Width,
                 shape = Species)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Iris Dataset: Four-Dimensional Visualization",
       x = "Sepal Length", y = "Sepal Width",
       color = "Petal Length", size = "Petal Width") +
  theme_minimal()
Output for the code above
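
If juggling four aesthetics on one chart feels busy, a base R pairs() plot is a quick alternative view (a small aside of mine, not required for the clustering itself):

# Alternative: all pairwise scatter plots at once, colored by species
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19)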

We clearly see that “setosa” has formed a separate, easily distinguishable cluster (violet round dots), while “versicolor” and “virginica” overlap somewhat on Sepal.Length and Sepal.Width, but Petal.Length is definitely different for these two species. Why? As you can see, “virginica” has a larger petal length (approx. 5–6), while “versicolor” sits somewhere between 2.5 and 4.

We definitely see a pattern. Let’s not hesitate further and create our first clusters. How, you may ask? We will simply use k-means, your first friend here. K-means clustering is an unsupervised learning algorithm that partitions a dataset into ‘k’ distinct, non-overlapping subgroups or ‘clusters’, where each data point belongs to the cluster with the nearest mean. The ‘k’ in k-means is the number of clusters, which is specified by the user. The ‘means’ refers to the centroids, i.e., the arithmetic mean of the points in each cluster. Here we know that we have to find 3 species, so 50% of the work is already done 😉
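
In real-world problems, though, k is usually unknown. A common heuristic for choosing it is the elbow method: run k-means for a range of k values and look where the total within-cluster sum of squares stops dropping sharply. A minimal sketch (the range 1:8 is my arbitrary choice for illustration):

# Elbow method: total within-cluster sum of squares for k = 1..8
set.seed(20)
wss <- sapply(1:8, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)

# The "elbow" in this curve suggests a reasonable k
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")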

# Apply k-means clustering with k = 3 (for the three species of Iris).
# k-means uses Euclidean distance; all four features are in cm, so no rescaling is needed here
set.seed(20) # for reproducibility
iris_cluster <- kmeans(iris[, 1:4], centers = 3)

# Add the cluster assignments to the iris dataset
iris$Cluster <- as.factor(iris_cluster$cluster)

# Plot the clusters (ggplot2 is already loaded)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Cluster)) +
  geom_point() +
  labs(title = "K-means Clustering of Iris Dataset")
Output for the code above

This plot reveals how the k-means algorithm has partitioned the data into three distinct clusters, corresponding to the three species of Iris. Do you see how familiar that looks? Our cluster 1 is very similar to the distribution of the “setosa” cluster, while clusters 2 and 3 suspiciously repeat the distributions of “versicolor” and “virginica” respectively. But can we track how accurate we were in clustering?
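
Before building a formal confusion matrix, a quick cross-tabulation already shows how the clusters line up with the species:

# Cross-tabulate cluster assignments against the true species
table(iris$Cluster, iris$Species)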

However, there’s a catch: the k-means algorithm doesn’t know anything about the actual species of the flowers, so the cluster numbers it assigns are arbitrary. This means that you might need to “relabel” the clusters so that they match up with the actual species.

# Attach an additional library for the confusion matrix
library(caret)

# Relabeling: map cluster 1 -> "setosa", 2 -> "versicolor", 3 -> "virginica".
# This direct mapping only works because, with set.seed(20), the clusters
# happen to come out in that order (the cross-tabulation above confirms it)
iris$ClusterGuessedName <- factor(iris$Cluster, labels = c("setosa", "versicolor", "virginica"))

# Create a confusion matrix of guessed vs. actual species
confusionMatrix(iris$ClusterGuessedName, iris$Species)
Output for the code above

As you see here, we were accurate in 89.33% of cases. There are some discrepancies between the two species that overlap, as we noticed earlier. We made 16 errors and got 134 objects out of 150 right, and that is basically how the “Accuracy” score is derived. These are our errors.

Zoom-in of the confusion matrix for k-means
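
To double-check where the 89.33% comes from, we can derive it by hand: the correct assignments sit on the diagonal of the confusion table, and accuracy is simply their share of all observations:

# Accuracy by hand: correct assignments (diagonal) divided by the total
tab <- table(iris$ClusterGuessedName, iris$Species)
sum(diag(tab)) / sum(tab) # 134 / 150 ≈ 0.8933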

Pretty good result for our plain code, which took 5 minutes to create.
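
One caveat before we wrap up: the relabeling above relied on the cluster numbers happening to come out in species order for this seed. A more defensive sketch names each cluster after its most frequent true species, whatever numbers kmeans() assigns (in pathological cases it can still map two clusters to one species):

# Name each cluster after its most frequent true species
tab <- table(iris$Cluster, iris$Species)
majority <- colnames(tab)[apply(tab, 1, which.max)]
iris$ClusterGuessedName <- factor(majority[iris_cluster$cluster], levels = levels(iris$Species))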

Wrap up

Not bad for a start, right? At this stage, we will pause until the next article appears. So far, we have introduced the unsupervised ML algorithm k-means and the concept of accuracy (be careful with accuracy, by the way; it may be tricky…)

Is that all there is to say about k-means? No. Are there other algorithms? Of course! Next, we will delve into classification with decision trees and other techniques.

Please clap 👏 and subscribe if you want to support me. Thanks! ❤️‍🔥

