K-means exercise in R language
As a novice in genomics data analysis, one of my goals is to benchmark how well a clustering method works. I ran across this k-means practice at R-exercises the other day and felt it would be a nice start: k-means is easy to run and conceptually simple enough that I can relate the results to what is happening behind the clustering machinery.
It starts with manipulating the built-in iris dataset as usual. (I would load ggplot2 and cowplot first.) Now, let's get some k-means done.
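For completeness, this is the setup I assume for the code below (the original exercise does not spell it out): ggplot2 and cowplot for the plots, plus dplyr for the select() calls used later in the Titanic section.
# Packages used in this exercise
library(ggplot2)  # plots
library(cowplot)  # plot_grid() for combining panels
library(dplyr)    # select() for picking columns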
# Set random seed = 1
set.seed(1)

# Do k-means with 3 centers (sepal length and width only)
iris_k <- kmeans(iris[, c(1, 2)], 3)

# Append cluster identity
iris_df <- iris
iris_df$cluster <- iris_k$cluster

species <- iris_df$Species
sepal_only <- iris_df$cluster

# Check species proportion in each cluster
table(species, sepal_only)
#             sepal_only
# species       1  2  3
#   setosa     50  0  0
#   versicolor  0 12 38
#   virginica   0 35 15
The cluster identities do not correspond well with the species identities, so the result is not very impressive. This is understandable: the species overlap considerably in the sepal dimensions, and kmeans() given only sepal measurements has a hard time telling them apart.
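To see the overlap for myself, a quick scatter plot of the sepal dimensions colored by species (a small sketch of my own, not part of the original exercise) makes the point:
# Sepal dimensions only: versicolor and virginica overlap heavily
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  ggtitle("Species overlap on sepal dimensions")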
# Include petal info into k-means
iris_k2 <- kmeans(iris[, c(1:4)], 3)
iris_df$cluster_sp <- iris_k2$cluster
s_and_p <- iris_df$cluster_sp

# Check the proportion of species in each cluster
table(species, s_and_p)
#             s_and_p
# species       1  2  3
#   setosa      0 50  0
#   versicolor  2  0 48
#   virginica  36  0 14
After taking petal length and width into account, the cluster identities agree much better with the "real" species identities.
Effect of linear transformation: What would happen if we multiply Petal.Width by 2?
# Multiply Petal.Width by 2 and do k-means again
iris_df$Petal.Width <- iris_df$Petal.Width * 2
iris_k_2pw <- kmeans(iris_df[, c(1:4)], 3)
iris_df$cluster_pw2 <- iris_k_2pw$cluster
doubled <- iris_df$cluster_pw2

# Estimate how much the cluster identities agree before
# and after doubling petal width
table(s_and_p, doubled)
#        doubled
# s_and_p  1  2  3
#       1 36  0  2
#       2  0 50  0
#       3  9  0 53
Though the clustering result did not change much, the linear transformation did alter some cluster identities. I was not expecting that at first, but then I realized that k-means only considers distance, and a linear transformation of a single dimension changes the relative distances between points. So, while rescaling one dimension influences the clustering result, multiplying every dimension by the same factor should not change the cluster identities.
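A tiny toy example (numbers made up for illustration, not part of the original exercise) shows why: doubling one coordinate can flip which of two points is closer to a third.
# Hypothetical points: b is closer to a than c_ is...
a  <- c(0, 0); b <- c(3, 0); c_ <- c(0, 4)
dist(rbind(a, b, c_))                  # d(a, b) = 3 < d(a, c_) = 4

# ...but after doubling the first dimension, c_ becomes the closer one
dbl <- function(p) c(p[1] * 2, p[2])
dist(rbind(dbl(a), dbl(b), dbl(c_)))   # d(a, b) = 6 > d(a, c_) = 4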
# Multiply everything by 2 and do k-means again
iris_df2 <- iris_df
iris_df2[, c(1:4)] <- iris_df[, c(1:4)] * 2
iris_k_db <- kmeans(iris_df2[, c(1:4)], 3)
all_doubled <- iris_k_db$cluster

# Estimate how much the cluster identities agree before
# and after multiplying every dimension by 2
table(all_doubled, doubled)
#            doubled
# all_doubled  1  2  3
#           1 45  0  0
#           2  0 50  0
#           3  0  0 55
And that’s definitely the case.
Scaling data
One way to think about it is that k-means works on Euclidean distance, so if we expand or shrink one dimension, the influence of that dimension on the distance changes accordingly. To mitigate this asymmetry of influence, scaling is a good option if we assume every dimension should have equal impact on the clustering.
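As a small sanity check of my own: with its default settings, scale() performs a column-wise z-transformation, subtracting each column's mean and dividing by its standard deviation, so a hand-rolled version should agree with it.
# scale() should agree with a manual z-transformation
z_manual <- apply(iris[, 1:4], 2, function(x) (x - mean(x)) / sd(x))
all.equal(z_manual, scale(iris[, 1:4]), check.attributes = FALSE)
# [1] TRUE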
# Doing z-transformation with scale() (iris_df still has Petal.Width doubled)
iris_df[, c(1:4)] <- scale(iris_df[, c(1:4)])
iris_k_z <- kmeans(iris_df[, c(1:4)], 3)
iris_df$cluster_kz <- iris_k_z$cluster
z_trans <- iris_df$cluster_kz

# Doing z-transformation with scale() on the original iris
iris_ori <- scale(iris[, c(1:4)])
iris_k_zo <- kmeans(iris_ori[, c(1:4)], 3)
z_before_dbl <- iris_k_zo$cluster

# Before and after doubling, both with z-transformation
table(z_before_dbl, z_trans)
#             z_trans
# z_before_dbl  1  2  3
#            1 47  0  0
#            2  0 50  0
#            3  0  0 53
Scaling does a good job of giving us a consistent result regardless of the linear transformation. It is thus advisable to scale the data in some way so that every dimension contributes fairly and the dimension with the largest numeric range does not dominate the whole clustering.
Yet another dataset
Now, we move on to another dataset, the Titanic training data from Kaggle.com. The author of this exercise suggested clustering passengers according to sex (Sex), number of siblings or spouses on board (SibSp), number of parents or children on board (Parch), and ticket fare (Fare), and then checking whether the clusters could predict survival.
# Playing with the training dataset from Titanic@Kaggle
train <- read.csv("train.csv", stringsAsFactors = F)

# Make Sex and Pclass dummy variables (female = 0, male = 1; Pclass 3 = 1, others = 0)
train$Sex <- ifelse(train$Sex == "male", 1, 0)
train$Pclass <- ifelse(train$Pclass == "3", 1, 0)

# K-means (k = 4) according to Sex, SibSp, Parch, Fare
tita_k <- kmeans(select(train, Sex, SibSp, Parch, Fare), centers = 4, nstart = 20)
train$cluster_1 <- tita_k$cluster

# Survival rate in each cluster
table(train$Survived, train$cluster_1)[2, ] /
  colSums(table(train$Survived, train$cluster_1))
#         1         2         3         4
# 0.7000000 0.2820037 0.4703390 0.6770833
It does look like the survival rate differs between clusters. To evaluate how large k should be to capture the most information from the dataset, we can calculate the variance explained. This is actually super easy because kmeans() already provides the between-cluster sum of squares (betweenss) and the total sum of squares (totss) in its output.
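For reference, the components of a kmeans() result can be listed directly, and the variance explained is just the ratio of two of them (illustrated here on the earlier iris fit):
# Components returned by kmeans() (see ?kmeans for details)
names(iris_k2)
# "cluster", "centers", "totss", "withinss", "tot.withinss", "betweenss", "size", ...

# Variance explained = between-cluster SoS / total SoS
iris_k2$betweenss / iris_k2$totss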
# Trying different k (2 - 20)
## Initiate a list to store the results
tita_km <- list()

for (i in c(2:20)) {
  index <- i - 1
  tita_km[[index]] <- kmeans(select(train, Sex, SibSp, Parch, Fare),
                             centers = i, nstart = 20)
}

# Calculate variance explained by each clustering
ve <- sapply(tita_km, function(x) x[["betweenss"]] / x[["totss"]])
plot(ve ~ c(2:20), xlab = "k",
     ylab = "SoS between clusters / Total SoS",
     col = c(rep("black", 3), "red", rep("black", 15)), pch = 16)
In the plot, the marginal gain in variance explained drops off after k = 5 (the red point), so the elbow method would suggest k = 5 as the preferable number of clusters.
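To make the marginal gain concrete, a quick sketch of my own: the increase in variance explained for each additional cluster, where the elbow is roughly where these increments level off.
# Marginal gain in variance explained when moving from k to k + 1
marginal_gain <- diff(ve)
names(marginal_gain) <- paste(2:19, "->", 3:20)
round(marginal_gain, 3)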
train$cluster_5 <- tita_km[[5]][["cluster"]]
table(train$Survived, train$cluster_5)[2, ] /
  colSums(table(train$Survived, train$cluster_5))
#          1          2          3          4          5          6
# 0.05882353 0.46351931 0.28927203 1.00000000 0.68888889 0.65384615
This time, the differences in survival between clusters become even sharper. Does more variance explained necessarily lead to a better correlation with the outcome of interest? To test this notion, I went back and did the same thing with the iris dataset.
# Testing different k in iris
iris_km <- list()

for (i in c(2:20)) {
  index <- i - 1
  iris_km[[index]] <- kmeans(select(iris, -Species), centers = i, nstart = 20)
}

# Calculate variance explained
ve_iris <- sapply(iris_km, function(x) x[["betweenss"]] / x[["totss"]])
plot(ve_iris ~ c(2:20), xlab = "k",
     ylab = "SoS between clusters / Total SoS",
     col = c(rep("black", 2), "red", rep("black", 16)), pch = 16)
This time, the "elbow" seems to be at k = 4.
# k = 4
iris$clus4 <- iris_km[[3]][["cluster"]]
k4 <- as.data.frame(table(iris$Species, iris$clus4))
k4p <- ggplot(k4, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(x = "Cluster", fill = "Species") +
  ggtitle("k = 4 for iris")

# k = 3
iris$clus3 <- iris_km[[2]][["cluster"]]
k3 <- as.data.frame(table(iris$Species, iris$clus3))
k3p <- ggplot(k3, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(x = "Cluster", fill = "Species") +
  ggtitle("k = 3 for iris")

plot_grid(k4p, k3p, labels = "AUTO")
It seems that versicolor and virginica are hard to tell apart with the information the iris dataset provides, but k = 4 did a better job of separating a subset of versicolor from the mixed population. In this case, the k indicated by the elbow method scores again.
Take home message
This exercise helped me learn the fundamental behavior of k-means and how R implements it, among other things. It was a pleasant surprise to discover scale(), so I no longer have to reinvent the wheel. The detailed results that kmeans() provides by default, beyond the cluster identities, not only make life easier but also remind me of the things to look at after clustering.
- K-means summarizes information across multiple dimensions to categorize observations.
- K-means is distance-based, so a linear transformation of a single dimension can change the clustering result.
- To prevent the dimensions with the largest numeric ranges from dominating the clustering, scaling is recommended before k-means if we assume every dimension is equally important.
- Besides the cluster identities, kmeans() also gives the total sum of squares, the between-cluster sum of squares, the within-cluster sums of squares, and other information by default (check ?kmeans). This makes the calculation of variance explained very intuitive.
- The elbow method is based on the marginal gain of variance explained as more and more clusters are added, and it can help us assess the preferable number of clusters.