Exploring and Understanding Complex Data Sets with Cluster Analysis in R

Vicky · Published in 8bitDS · Feb 6, 2023

https://rpkgs.datanovia.com/factoextra/reference/fviz_cluster.html

Cluster analysis is an unsupervised machine learning technique that partitions a set of objects into clusters based on their similarity. It is a powerful tool for exploring and understanding complex data sets and for discovering patterns, trends and relationships in data.

Cluster analysis is implemented in R through a variety of packages, including “stats”, “cluster” and “factoextra”. In this tutorial, we will use the “factoextra” package, which provides a user-friendly interface to perform cluster analysis and to visualize the results.

To start, you need to install and load the “factoextra” package:

install.packages("factoextra")
library(factoextra)

Preparing the Data

The first step in cluster analysis is to prepare the data. The data must be transformed into a matrix or data frame that can be used as input for the clustering algorithms. In this tutorial, we will use the iris dataset, which is available in the base R installation.

data("iris")
head(iris)
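The iris measurements happen to be on comparable scales, but distance-based clustering is sensitive to variable scale in general, so it is common to standardize first. A minimal sketch (the rest of this tutorial uses the raw measurements):

```r
data("iris")

# Standardize each numeric column to mean 0 and standard deviation 1,
# so no single variable dominates the distance computations
iris_scaled <- scale(iris[, 1:4])

colMeans(iris_scaled)          # all approximately 0 after centering
apply(iris_scaled, 2, sd)      # all exactly 1 after scaling
```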

Determining the Number of Clusters

One of the key decisions in cluster analysis is determining the number of clusters to extract from the data. There are several methods to determine the number of clusters, including visual inspection of the dendrogram, the elbow method, and the silhouette method.

The elbow method involves plotting the total within-cluster sum of squared distances (WSS) against the number of clusters. The idea is to choose the number of clusters at the "elbow" of the curve: the point beyond which adding more clusters yields only a marginal decrease in WSS.

fviz_nbclust(iris[,1:4], kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)

In this example, the optimal number of clusters appears to be 3, which is consistent with the number of species in the iris dataset.

The silhouette method measures the similarity between an object and its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a higher score indicates a better match between the object and its own cluster.

fviz_nbclust(iris[,1:4], kmeans, method = "silhouette")
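The average silhouette width can also be computed directly, without a plot. A sketch, assuming the "cluster" package (shipped with most R installations) is available:

```r
library(cluster)

data("iris")
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# silhouette() needs the cluster labels and a dissimilarity matrix
sil <- silhouette(km$cluster, dist(iris[, 1:4]))

# average silhouette width across all 150 flowers; higher is better
mean(sil[, "sil_width"])
```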

K-Means Clustering

K-means is one of the most commonly used clustering algorithms and is based on the iterative assignment of objects to the closest centroid. The algorithm starts with a random selection of k centroids and then alternately updates the assignment of objects to clusters and the position of the centroids until convergence.

set.seed(123)
k <- 3
km.res <- kmeans(iris[,1:4], centers = k, nstart = 25)
km.res
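To make the alternating assignment/update idea concrete, here is a minimal from-scratch sketch of a Lloyd-style k-means loop. This is illustrative only (it runs a fixed number of iterations and ignores the empty-cluster edge case); kmeans() is what you should use in practice:

```r
simple_kmeans <- function(x, k, iters = 20) {
  # start from k randomly chosen data points as initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    m <- nrow(centers)
    # assignment step: distance from every point to every centroid,
    # then pick the nearest centroid for each point
    d  <- as.matrix(dist(rbind(centers, x)))[-(1:m), 1:m, drop = FALSE]
    cl <- max.col(-d)
    # update step: recompute each centroid as the mean of its members
    centers <- apply(x, 2, function(col) tapply(col, cl, mean))
  }
  list(cluster = cl, centers = centers)
}

data("iris")
set.seed(123)
res <- simple_kmeans(as.matrix(iris[, 1:4]), k = 3)
table(res$cluster)
```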

Visualizing the Clusters

The “factoextra” package provides several functions to visualize the results of the clustering analysis. The first step is to assign the cluster labels to the original data:

iris_cluster <- iris
iris_cluster$cluster <- as.factor(km.res$cluster)

The “fviz_cluster()” function is used to visualize the clusters in a scatter plot:

fviz_cluster(km.res, iris[,1:4])

This plot shows the distribution of the objects in each cluster and the position of the centroids. Another useful visualization is the scatter plot matrix, which shows the relationship between each pair of variables (drawn here with base R's pairs(), since "factoextra" does not provide a pairs function):

pairs(iris_cluster[,1:4], col = iris_cluster$cluster, pch = 19)

This plot shows the distribution of each variable within each cluster, and highlights the separation between the clusters.

Hierarchical Clustering

Hierarchical clustering is another popular clustering algorithm that builds a hierarchy of clusters by successively merging or splitting existing clusters. There are two main types of hierarchical clustering: Agglomerative and Divisive.

In Agglomerative hierarchical clustering, the algorithm starts with each object as its own cluster and then iteratively merges the closest pairs of clusters until all objects are in the same cluster.

hc.res <- hcut(iris[,1:4], k = 3)
iris_cluster$hc.cluster <- as.factor(hc.res$cluster)
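The hcut() function wraps the classic base R workflow; the equivalent steps with hclust() and cutree() look like this (a sketch using Ward linkage, which is also hcut()'s default):

```r
data("iris")

# distance matrix -> linkage tree -> cut the tree at k clusters
d        <- dist(iris[, 1:4])               # Euclidean distances between flowers
hc       <- hclust(d, method = "ward.D2")   # agglomerative merging, Ward linkage
clusters <- cutree(hc, k = 3)               # cut the dendrogram into 3 clusters

table(clusters)
```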

The “fviz_dend()” function is used to visualize the dendrogram, which shows the hierarchy of the clusters:

fviz_dend(hc.res, k = 3)

This dendrogram shows the order in which the clusters were merged, and can be useful for determining the optimal number of clusters.

Comparison of Clustering Algorithms

The choice of clustering algorithm depends on the data, the research question, and the computational resources. It is a good idea to compare the results of different algorithms to ensure that the conclusions are robust.

# fviz_cluster() returns a ggplot object, so plot the two solutions separately
fviz_cluster(km.res, iris[,1:4], ggtheme = theme_classic(), palette = "jco",
             ellipse.type = "convex", main = "K-Means")
fviz_cluster(list(data = iris[,1:4], cluster = hc.res$cluster),
             ggtheme = theme_classic(), palette = "jco",
             ellipse.type = "convex",
             main = "Agglomerative Hierarchical Clustering")

In this example, both K-means and Agglomerative Hierarchical Clustering produce similar results, with a clear separation between the clusters.
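Since iris carries known species labels, a simple way to check this agreement numerically is to cross-tabulate each clustering against the species. A sketch using only base R (in real unsupervised settings no such ground truth exists):

```r
data("iris")
set.seed(123)
km.res <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
hc.cl  <- cutree(hclust(dist(iris[, 1:4]), method = "ward.D2"), k = 3)

# rows: true species, columns: assigned cluster
table(iris$Species, km.res$cluster)
table(iris$Species, hc.cl)

# agreement between the two clusterings themselves; cluster labels are
# arbitrary, so look for one dominant cell per row rather than a diagonal
table(km.res$cluster, hc.cl)
```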

Conclusion

In this tutorial, we have demonstrated how to perform cluster analysis in R using the “factoextra” package. We have shown how to determine the number of clusters, how to perform K-means clustering and Agglomerative Hierarchical Clustering, and how to visualize the results. Cluster analysis is a powerful tool for exploring and understanding complex data sets and for discovering patterns, trends and relationships in data.
