Clustering techniques with Gene Expression Data for Acute Myeloid Leukemia

Fig 1. Scikit-learn cheat sheet

In this tutorial I will focus on different clustering techniques applied to gene expression data. I will use data from acute myeloid leukemia (AML), one of the most fatal malignancies. Clustering methods are unsupervised techniques that are widely used in machine learning and have proved to be useful tools in biomedical research. In contrast to supervised machine learning, clustering techniques explore the data to find clusters (or groups) in the feature space. Technically, clusters are areas of density in the feature space where the observations are closer to the members of their own cluster than to the members of other clusters. In simpler words, clusters are groups of elements that are similar with respect to their features. Clustering is also often an intermediate step in a data pipeline. Clustering techniques are commonly used to discover groups of patients that respond differently to therapies or show different survival. In biology, clustering is used for classification, prediction, sequence alignment, motif and pattern discovery (1), outlier detection, and phylogenetic trees. Clustering techniques are also widely used in the analysis of gene expression data. A word of caution: since there are many available methods, it is important to correctly choose the most appropriate algorithm for the intended task (2).

For more information:

On acute myeloid leukemia: here

On medical image analysis and AI: here

On artificial intelligence on biological data: here

Dataset is available: here

The dataset used in this tutorial is obtained from Warnat-Herresthal et al. (3), who collected and re-analyzed many leukemia datasets. The dataset and the microarray technique are presented in detail in the previous tutorial.

I wrote this tutorial in Python on Google Colab. As in the previous tutorial, I assume you have already uploaded the dataset to Google Drive.

%reload_ext autoreload
%autoreload 2
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Import the necessary libraries and the dataset:

#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import umap
#dataset
data = pd.read_table("/content/gdrive/My Drive/aml/201028_GSE122505_Leukemia_clean.txt", sep = "\t")

After loading the dataset and the necessary libraries, let's take a quick look at the disease conditions available in the dataset.

#table of the diseases
data.disease.value_counts()

Two categories dominate the dataset: acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL). For easier visualization we group some of the categories.

#removing some disease types
data["disease"] = np.where(data["disease"] == "Diabetes_Type_I", "Diabetes", data["disease"])
data["disease"] = np.where(data["disease"] == "Diabetes_Type_II", "Diabetes", data["disease"])
other = ['CML', 'clinically_isolated_syndrome', 'MDS', 'DS_transient_myeloproliferative_disorder']
data = data[~data.disease.isin(other)]
data.shape
target = data["disease"]
df = data.drop("disease", axis=1)
df = df.drop("GSM", axis=1)
df = df.drop("FAB", axis=1)
df.shape
target.value_counts()

We also filter out the features with low variance and scale the remaining features.

df = df.drop(df.var()[(df.var() < 0.3)].index, axis=1)
from scipy.stats import zscore
df = df.apply(zscore)
df.shape

Gene expression data are in general noisy, and clustering techniques are useful to gain insight. Generally, we consider the gene expression profiles (the patients) as observations and the genes as features. The goal is to find gene expression profiles (patients, or observations) that are similar within a cluster and dissimilar to the gene expression profiles in other clusters. Similarity among the observations in a cluster is measured by a distance (basically, two objects are in the same cluster if they are close, and this proximity is calculated with a specific distance metric). In other tasks, we can also cluster the genes to discover groups of genes that show a specific behavior (4). A prototypical example of clustering applied to cancer gene expression profiles is the work of Alizadeh et al. (5), where hierarchical clustering was used to divide diffuse large B-cell lymphoma into distinct subgroups.

Fig 2. Example of hierarchical clustering on gene expression data. Figure source: (5).

Clustering methods can be partial or complete: complete clustering allocates each observation to a cluster, while partial clustering does not. Partial clustering allows some genes not to be allocated to a well-defined cluster (since some genes can simply be noise). Forcing the incorporation of these noisy genes can modify the output, producing clusters with unrelated members. Clustering methods can also be classified as hard or overlapping: hard clustering assigns each observation to a unique cluster, while overlapping clustering assigns each observation degrees of membership in different clusters.

Fig 3. Examples of different clustering techniques used in bioinformatics. Figure source: (2).

Hierarchical clustering

Algorithms based on hierarchical clustering (HC) are among the earliest clustering algorithms used for gene expression data. The algorithm is sensitive to noise, has difficulty handling missing data, and does not indicate the required number of clusters. Hierarchical clustering (like many machine learning techniques) is sensitive to the scaling of the data, which is why we scaled the data above.

There are two types of HC: agglomerative and divisive. The agglomerative (or bottom-up) approach starts from the individual data points (considering each data point a single cluster) and then progressively joins them into clusters. The divisive (or top-down) approach starts by considering all the data points as members of a unique cluster and then divides it into sub-clusters. The agglomerative approach is the most used.

Fig 4. Agglomerative versus divisive clustering. In agglomerative clustering we start by considering each data point a cluster, and at each iteration we merge the closest clusters until only one remains. In divisive clustering all the data points start in a single cluster, and at each iteration we split off the farthest points until each cluster contains a single observation. Figure source: adapted from here

At the beginning each data point is a single cluster (if K is the number of clusters and N the number of observations, K = N). The dissimilarity (or distance) between all the N objects is calculated (under the hood, a distance matrix is computed). In the first step, the two closest data points are joined into a cluster (so K becomes N-1). In the following step, two clusters are joined (for example, the cluster formed by the first two data points and another close data point, so K becomes N-2). At each step, the two clusters separated by the shortest distance are merged. The process is repeated until one big cluster is formed. Once the big cluster is formed, the dendrogram (a tree-shaped representation of the clusters' distances) can be cut to split it into multiple clusters.
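To make this concrete, here is a small sketch (on a few made-up 2D points, not the AML data) showing how SciPy's linkage function records one merge per row of the linkage matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage

# five toy points in 2D; indices 0..4 are the initial singleton clusters
points = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]])

# each row of Z is one merge: [cluster_i, cluster_j, distance, size of the new cluster]
# clusters created by a merge receive the new indices N, N+1, ... (here 5, 6, ...)
Z = linkage(points, method="ward")
print(Z)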

Fig 5. The steps of hierarchical clustering. Figure source: adapted from here

There are different ways to measure the distance between data points and clusters (Euclidean or Manhattan distances are among the most used).

Some options for measuring the distance between two clusters (6), illustrated in the small sketch after the list:

  • Measure the distance between the closest points of the two clusters.
  • Measure the distance between the farthest points of the two clusters.
  • Measure the distance between the centroids of the two clusters. The centroid is the most representative point of a cluster, often the mean of the data point values in that cluster (7).
  • Measure all the distances between the data points of the two different clusters and then take the mean.
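Here is the small numeric sketch mentioned above (toy data, not the AML dataset), computing these four inter-cluster distances for two tiny clusters:

import numpy as np
from scipy.spatial.distance import cdist

cluster_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cluster_y = np.array([[4.0, 4.0], [5.0, 4.0]])

pairwise = cdist(cluster_x, cluster_y)  # all pairwise Euclidean distances
print("single (closest points):", pairwise.min())
print("complete (farthest points):", pairwise.max())
print("average (mean of all pairs):", pairwise.mean())
print("centroid distance:", np.linalg.norm(cluster_x.mean(axis=0) - cluster_y.mean(axis=0)))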

There are different linkage methods that determine which clusters should be merged:

  • Complete: computes all pairwise dissimilarities between the observations in cluster x and cluster y and uses the largest of them. Complete linkage tends to produce more compact clusters; it also tends to break large clusters and is biased towards globular clusters.
  • Single (or nearest neighbor): computes all pairwise dissimilarities between the observations in cluster x and cluster y and uses the smallest of them. The cluster distance is therefore calculated on a single pair of elements, the two elements (one in cluster x and one in cluster y) that are closest to each other. It tends to produce long, chain-like clusters. This approach can separate non-elliptical cluster shapes (if the gap between the two clusters is not too small).
  • Average: computes all pairwise dissimilarities between the observations in cluster x and cluster y and uses their average. This approach is also biased towards globular clusters.
  • Centroid: the dissimilarity between the centroid of cluster x and the centroid of cluster y.
  • Ward's method: minimizes the total within-cluster variance. At each iteration, the pair of clusters whose merge leads to the minimum increase in total within-cluster variance is merged. It tends to produce more compact clusters and is biased towards globular clusters. The distance between two clusters is defined through the sum of squared differences within the clusters.
  • Other methods exist, but these are the most used. A small sketch comparing Ward and single linkage on toy data follows the list.
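To see how much the linkage choice matters with non-globular clusters, here is a hedged sketch on scikit-learn's toy two-moons data (not our expression data), comparing Ward and single linkage:

from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_moons)
single_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X_moons)

# Ward tends to cut the two moons into compact halves, while single linkage
# can follow the elongated shapes and separate the two moons
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=ward_labels, s=10)
axes[0].set_title("Ward linkage")
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=single_labels, s=10)
axes[1].set_title("Single linkage")
plt.show()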
Fig 6. Differences between linkage methods. Figure source: adapted from here and here

As you can see in the figure, the choice of method deeply impacts the outcome. As a rule of thumb, complete and average linkage tend to generate more balanced trees and are the most commonly used (together with Ward's method). Complete linkage tends to break large clusters (which can misclassify some observations, as in the case of the Iris dataset), while Ward and single linkage are less vulnerable to this issue. Single linkage is sensitive to noisy data since it is based on the minimum distance between two clusters (you can notice this in the HTRU2 dataset). Average linkage is less sensitive to noisy data and handles large clusters better than complete linkage. Ward's method is generally the default, since it often performs well.

Fig 7. Differences in the clusters obtained with different methods on two datasets. Figure adapted from here
Fig 8. Different outcomes of different linkage methods. Figure source: here

Let's try some code now. First, we use PCA (see the previous tutorial) to visualize the data.

from sklearn.decomposition import PCA
pca = PCA(n_components=50)
X = pca.fit(df).transform(df)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(target)
y_lan = le.transform(target)
pca_df = pd.DataFrame()
pca_df["PCA1"] = X[:, 0]
pca_df["PCA2"] = X[:, 1]
pca_df["Disease"] = target.values
pca_df["label"] = y_lan
sns.set(style="whitegrid", palette="muted")
#sns.set_theme(style="whitegrid")
ax = sns.scatterplot(x="PCA1", y="PCA2", hue="Disease", data=pca_df)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.savefig("GSE122505_Leukemia_PCA.pdf", dpi = 300)
#plt.savefig("GSE122505_Leukemia_PCA.png")
#pca_df.to_csv("GSE122505_Leukemia_PCA.csv")
[Figure: PCA scatter plot of the samples colored by disease]

There are many libraries that include clustering algorithms. For agglomerative clustering we will use two libraries, SciPy and scikit-learn. We use the Ward method and the Euclidean distance (the metric used to measure the distance between clusters). We run the clustering and plot the dendrogram (the observations are listed on the X axis, the Euclidean distance is represented on the Y axis, and the dendrogram shows which cluster each observation belongs to).

#making the cluster
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
linked = linkage(df, 'ward')
#plotting the dendrogram
plt.figure(figsize=(100, 15))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.show()

The dendrogram is a hierarchical tree that records all the merges (or splits, in the case of divisive clustering). It also records the distances between the clusters, making all this information accessible in an easily interpretable graph.

[Figure: dendrogram of the hierarchical clustering]

As described before, the process iterates until there is a single big cluster. The dendrogram can then be cut at a certain height; cutting at a particular height divides the tree into different clusters.

As we see in the next figure, drawing a red line at that height separates the tree into 2 clusters.
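Instead of reading the cut off the figure, we can obtain flat cluster labels programmatically with SciPy's fcluster (a sketch reusing the linked matrix computed above; the height value below is only illustrative):

from scipy.cluster.hierarchy import fcluster

# cut at a chosen height (illustrative threshold, adjust it to your dendrogram)
labels_by_height = fcluster(linked, t=400, criterion="distance")

# or directly ask for a fixed number of clusters
labels_by_count = fcluster(linked, t=7, criterion="maxclust")
pd.Series(labels_by_count).value_counts()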

[Figure: dendrogram cut into 2 clusters]

Cutting at a different height gives a different number of clusters. For example, here we obtain 5 clusters.

[Figure: dendrogram cut into 5 clusters]

We will now use scikit-learn, which is one of the most widely used machine learning libraries.

Note that scikit-learn asks us to specify the number of clusters we want to retrieve; in this case we expect 7 clusters (the seven categories we selected or grouped before, see the previous tutorial for additional details). We also select Ward as the linkage method and the Euclidean distance.

Principal parameters:

  • n_clusters: the number of clusters we want to find. The default is 2.
  • affinity: the metric used to compute the linkage. The default is Euclidean (which is the most used). There are other options ("l1", "l2", "manhattan", "cosine", or "precomputed"). "precomputed" requires the user to provide a distance matrix (in practice, this lets you compute a distance matrix with a metric not present in the library). Note that in more recent scikit-learn versions this parameter has been renamed metric.
  • linkage: the method used to calculate the distance between clusters. The default is ward.
  • distance_threshold: you can also provide a distance threshold above which clusters will not be merged.
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=7, affinity='euclidean', linkage='ward')
cluster.fit_predict(df)

This returns a NumPy array with the cluster labels (in this case from 0 to 6). Note that the cluster numbers have no mathematical meaning; they are simply labels to distinguish the clusters.
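Since we kept the disease annotation aside in target, a quick sanity check (a sketch using the objects defined above) is to cross-tabulate the cluster labels against the known disease labels:

import pandas as pd

# rows: disease annotation, columns: hierarchical clustering labels
pd.crosstab(target.values, cluster.labels_, rownames=["disease"], colnames=["cluster"])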

We can also visualize a dendrogram with scikit-learn, but as you can see this wraps around the SciPy function. We first define a function called plot_dendrogram and then run the clustering. Note that we do not specify a number of clusters here, so that the full tree is computed. The plot is truncated to show only the first levels (to be more readable) (8).

import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(df)
plt.figure(figsize=(30, 10))
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
[Figure: truncated dendrogram plotted with scikit-learn]

Now we can plot our cluster labels on the PCA representation.

pca_df["HC_labels"] = cluster.labels_
pca_df['HC_labels'] = pca_df.HC_labels.astype('category')
sns.set(style="whitegrid", palette="muted")
#sns.set_theme(style="whitegrid")
ax = sns.scatterplot(x="PCA1", y="PCA2", hue="HC_labels", data=pca_df)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.savefig("GSE122505_Leukemia_HC.pdf", dpi = 300)
#plt.savefig("GSE122505_Leukemia_HC.png")
#pca_df.to_csv("GSE122505_Leukemia_HC.csv")
[Figure: PCA plot colored by hierarchical clustering labels]

Our dataset is composed of a large number of observations and many more features (the genes), which slows down the computation. One solution, which can also improve the result, is to compute the PCA first and run the hierarchical clustering on the PCA components (here we select 50 principal components, but the number is arbitrary).

#compute PCA
pca = PCA(n_components=50)
X = pca.fit(df).transform(df)
#conduct hierarchical clustering
cluster = AgglomerativeClustering(n_clusters=7, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)
#plot the labels
pca_df["HC_PCA_labels"] = cluster.labels_
pca_df['HC_PCA_labels'] = pca_df.HC_PCA_labels.astype('category')
sns.set(style="whitegrid", palette="muted")
ax = sns.scatterplot(x="PCA1", y="PCA2", hue="HC_PCA_labels", data=pca_df)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#save it
#plt.savefig("GSE122505_Leukemia_HC_PCA.pdf", dpi = 300)
#plt.savefig("GSE122505_Leukemia_HC_PCA.png")
#pca_df.to_csv("GSE122505_Leukemia_HC_PCA.csv")
[Figure: PCA plot colored by hierarchical clustering labels computed on the PCA components]

Some considerations:

  • The distance matrix stores all the possible pairwise distances between data points (affecting the RAM), and the cluster distances have to be updated at each iteration, making the algorithm computationally expensive for large datasets. It is therefore mostly used with small or medium datasets.
  • Another limitation is that each linkage method has its own drawbacks.
  • Cluster decisions do not change across iterations: once a merge step is completed, the clusters at the lower levels cannot be changed. If a misclassification occurs for any reason, it is carried over until the last step.
  • There is no clear objective function to be minimized or optimized (we will see later how this differs for k-means).
  • If we are not sure about the number of clusters, hierarchical clustering allows us to decide after the clustering how many clusters to select. The dendrogram is a hierarchical tree that makes it easy to choose the number of clusters based on the distance; moreover, it provides fine details about the relationships between the observations.

As a last step on hierarchical clustering, we will plot a heatmap similar to figure 2. Scientific articles often show heatmaps with patients on one axis and a selection of genes on the other. Here, instead of gene expression, we plot the PCA components (20 in this case) on the x axis.

import seaborn as sns; sns.set_theme(color_codes=True)
pca = PCA(n_components=20)
Y = pca.fit(df).transform(df)
Y = pd.DataFrame(Y, index=target.index)
k = ["green", "red", "orange", "black", "purple", "blue", "yellow"]
lut = dict(zip(target.unique(), k))
row_colors = target.map(lut)
g = sns.clustermap(Y, row_colors=row_colors)
[Figure: clustered heatmap of the PCA components with disease annotations as row colors]

k-means clustering

k-means clustering is one of the oldest and most approachable algorithms, even for beginners; indeed there are many libraries and easy tutorials on how to apply it in Python. K-means is an example of partitional clustering, a family of algorithms that divide the observations into nonoverlapping groups (a data point cannot be a member of more than one group, and each group (or cluster) must contain at least one observation). These algorithms require that you specify the number of clusters in advance (the number of clusters is also called the variable k). Many of these techniques work through an iterative process that assigns subsets of the observations to the k groups; k-means and k-medoids are examples of partitional clustering algorithms. Another difference from hierarchical clustering to keep in mind is that partitional clustering is not deterministic (in other words, with the same input it can produce different results in two separate runs). Partitional clustering works well with spherical clusters and scales easily. On the contrary, it is not well suited when the clusters have complex shapes and different sizes, and it breaks down when used with clusters of different densities.

There are different versions of k-means and derived algorithms; we will focus on the conventional version. The first step is to select the value of k (basically, the number of clusters you want to retrieve). Selecting k also means randomly selecting k centroids (as said before, the centroid represents the center of a cluster). Under the hood, the main process of the algorithm is called expectation-maximization. In the first part, k-means assigns each observation to its nearest centroid (expectation). In the second step, k-means calculates the mean of all the observations in a cluster and sets it as the new centroid for that cluster (maximization). This iterative process continues until the centroids stop moving.

k-means uses the sum of squared errors (SSE) as its error function. At each step it calculates the squared Euclidean distance of each observation to its closest centroid. The goal is to minimize this error and converge to a minimum.
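To make the expectation-maximization loop and the SSE concrete, here is a minimal from-scratch sketch (for illustration only: it skips edge cases such as empty clusters, and in the rest of the tutorial we rely on scikit-learn's KMeans):

import numpy as np

def toy_kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # expectation: assign each observation to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # maximization: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: the algorithm has converged
        centroids = new_centroids
    # SSE: sum of squared Euclidean distances of each point to its closest centroid
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse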

Fig 9. K-means algorithm. Figure source: here

Different initializations lead to different results (which also makes the algorithm nondeterministic). For this reason, practitioners normally run several initializations and then choose the one with the lowest SSE. You can appreciate a nice gif on that here.

Fig 10. Expectation-maximization algorithm. Random initialized centroids converge to cluster centers. Figure source: here

We will use the scikit-learn library to run k-means. Again, an important pre-processing step is feature scaling (k-means is sensitive to scaling), and we already scaled the data at the beginning. We used standardization (scaling the data to have a mean of 0 and a standard deviation of 1), which is the most common approach, but other methods exist.
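As a side note, the z-score standardization we applied at the beginning is equivalent to scikit-learn's StandardScaler; a minimal sketch of this alternative (applied to df only for illustration, since our data are already scaled):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)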

Principal parameters of k-means algorithm in scikit-learn:

  • init controls the initialization; we start by setting it to "random" (random initialization). Scikit-learn provides another option, "k-means++", which places the initial centroids in a smarter way in order to speed up convergence.
  • n_clusters sets the number of clusters (the variable k) and is the most important parameter.
  • n_init sets the number of initializations. Since the algorithm is nondeterministic, scikit-learn runs several initializations (the default is 10) and, after all these k-means runs, returns the result with the lowest SSE.
  • max_iter sets the maximum number of iterations for each run of the k-means algorithm. The default is 300.
  • random_state fixes the seed for centroid initialization (allowing reproducibility).

We expect 7 clusters, so we set n_clusters to 7 and leave the other parameters at their defaults. We run k-means on our dataset, but as seen before, we could also run the algorithm on the PCA components (which would speed up the calculations).

#install kneed on google colab
!pip install kneed
#import required libraries
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
kmeans = KMeans(init="random", n_clusters=7, n_init=10, max_iter=300, random_state=42)
kmeans.fit_predict(df)

k-means has different attributes we can look at:

# The lowest SSE value
kmeans.inertia_
# Final locations of the centroid
kmeans.cluster_centers_
# The number of iterations required to converge
kmeans.n_iter_

We also plot the obtained clusters on our PCA representation. Again, the cluster labels have no mathematical meaning; their order derives from the random initialization of the centroids.

pca_df["kmeans_labels"] = kmeans.labels_
pca_df['kmeans_labels'] = pca_df.kmeans_labels.astype('category')
sns.set(style="whitegrid", palette="muted")
ax = sns.scatterplot(x="PCA1", y="PCA2", hue="kmeans_labels", data=pca_df)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#save it
#plt.savefig("GSE122505_Leukemia_kmeans.pdf", dpi = 300)
#plt.savefig("GSE122505_Leukemia_kmeans.png")
#pca_df.to_csv("GSE122505_Leukemia_kmeans.csv")
[Figure: PCA plot colored by k-means labels]

We provided seven clusters as the variable k because we expected this number from a priori knowledge. There are different methods to select the appropriate number of clusters; the most used are the elbow method and the silhouette method (and they are often used as complementary methods).

For the elbow method we run k-means several times, incrementing the variable k, and store the different SSE values in a list.

First, we store in a dictionary the default arguments of k-means we want to keep, and then we run k-means. We use the dictionary unpacking operator **, which passes the dictionary entries to the function as keyword arguments (really useful for storing parameters).

kmeans_kwargs = { "init": "random",  "n_init": 10, "max_iter": 300, "random_state": 42}
sse = [] #our list
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df)
    sse.append(kmeans.inertia_)

We plot the number of clusters on the x axis and the SSE on the y axis. Normally, the more centroids you add (i.e. the higher the value of k), the smaller the distance between each data point and its centroid, leading to a lower SSE. The elbow point is a tradeoff between the error and the number of clusters (too many clusters are meaningless, even if the error is lower). The elbow point (or knee of the curve) is the point where the curve visibly bends from a high slope to a low slope (becoming flat or nearly flat). Conceptually, after a particular number of clusters the SSE no longer decreases much, and there is little benefit in adding more clusters. Here the elbow point is 5.

plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
[Figure: SSE versus number of clusters (elbow method)]

If you are not sure where the elbow point is, you can use a Python package called kneed to identify it.

kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
kl.elbow

The silhouette method measures cluster cohesion and separation: it quantifies how well an observation fits into its assigned cluster based on how close it is to the other observations in the same cluster and how far it is from the observations in other clusters.

The silhouette coefficient ranges between -1 and 1 (a value closer to 1 means that the data points are closer to their own clusters than to other clusters). A silhouette value close to 0 means that the clusters significantly overlap with each other.

# A list containing the average silhouette coefficient for each k
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df)
    score = silhouette_score(df, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
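Beyond the average score, the silhouette can also be inspected per observation; here is a sketch using scikit-learn's silhouette_samples on a k = 7 solution, where a low average for a cluster flags that it overlaps with others:

from sklearn.metrics import silhouette_samples

kmeans7 = KMeans(n_clusters=7, **kmeans_kwargs).fit(df)
sample_silhouettes = silhouette_samples(df, kmeans7.labels_)

# average silhouette per cluster
pd.Series(sample_silhouettes).groupby(kmeans7.labels_).mean()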
[Figure: average silhouette coefficient versus number of clusters]

Interestingly, we see that the silhouette coefficient is highest for 5 and 7 clusters. As we see in the next figure, when the clusters are not spherical, the elbow or silhouette method can return strange results (in disagreement with visual intuition). In this case, we see the results of two algorithms (k-means and DBSCAN, which will be discussed in detail in an upcoming tutorial) with different silhouette coefficients: k-means has a higher silhouette coefficient, but visually DBSCAN seems to do a better job.

Fig 11. Silhouette coefficient for two different algorithm results. Figure source: here

The elbow and silhouette methods do not use ground truth labels (ground truth labels are categorical values obtained by human or algorithmic assignment; most of the time you do not have them). In our case we can use the disease type, information that is already available. When ground truth labels are known, you can use different metrics. For example, the adjusted rand index (ARI) uses the true cluster assignments to measure the similarity between the true labels (the original groups) and the predicted labels (in this case from k-means). We can do the same for our hierarchical clustering results.

from sklearn.metrics import adjusted_rand_score
ari_kmeans = adjusted_rand_score(target, pca_df["kmeans_labels"])
clust_kmeans = adjusted_rand_score(target, cluster.labels_)
ari_kmeans, clust_kmeans

The ARI value ranges between -1 and 1, where 0 means a random assignment and a score close to 1 indicates that the two sets of labels are nearly identical.

There are many other scores you can use; scikit-learn provides an exhaustive list.
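For example (a sketch reusing the k-means labels computed above), a few of the other label-based scores available in scikit-learn:

from sklearn.metrics import normalized_mutual_info_score, homogeneity_score, completeness_score

print("NMI:", normalized_mutual_info_score(target, pca_df["kmeans_labels"]))
print("Homogeneity:", homogeneity_score(target, pca_df["kmeans_labels"]))
print("Completeness:", completeness_score(target, pca_df["kmeans_labels"]))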

Some considerations:

  • k-means works well with clusters that have a spherical shape, but it is not well suited for clusters with complex shapes. One of the assumptions of k-means is that data points are closer to their own cluster center than to the other clusters' centers, which means it is not effective when clusters have complicated shapes. The cluster boundaries are linear, which is not always appropriate for clusters with complex shapes (spectral clustering is a k-means-derived algorithm that uses a kernel projection to overcome this problem; in other words, it transforms the data so that linear boundaries can be drawn).
  • It can be slow for a large number of observations. K-means is faster than hierarchical clustering if k is small. For large datasets there is the mini-batch implementation, which uses a subset of the dataset to update the centroid positions at each step (see the sketch after this list).
  • It is a non-deterministic algorithm: the initial random seed can have a large impact on the result, and convergence to a local minimum may produce counterintuitive (wrong) results.
  • It is a scalable algorithm, and it is also easy to implement.
  • It can be difficult to predict the number of clusters.
  • It is sensitive to scale, so it is important to scale your data before using it.
  • In general, k-means clustering suffers from the curse of dimensionality (when there are many more features than observations): convergence becomes less effective at finding the right boundaries. A solution is to run k-means on PCA components, iterating to find the best number of components to use (for instance, spectral clustering applies a dimensionality-reduction step before the actual clustering).
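Here is the sketch of the mini-batch variant mentioned in the list: MiniBatchKMeans updates the centroids on small random subsets of the data, trading a little accuracy for a large speed-up on big datasets (the batch_size value is just an example):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=7, batch_size=256, random_state=42)
mbk_labels = mbk.fit_predict(df)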

In the following tutorial we will discuss further clustering algorithms.

Additional resources:

Here you can find the implementation of the algorithm from scratch: here

Code availability

You find the code for the tutorial and all the images here. You can use the provided Colab notebook without the need to have any package installed on your pc.

If you have found it interesting:

Please share it, you can also subscribe to get notified when I publish articles, you can also connect or reach me on LinkedIn. Thanks for your support!

Bibliography

1. Lakhani J, Chowdhary A, Harwani D. Clustering Techniques for Biological Sequence Analysis: A Review. Journal of Applied Information Science (2015) 3: doi:10.21863/jais/2015.3.1.003

2. Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, Achas M, Adebiyi E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform Biol Insights (2016) 10:237–253. doi:10.4137/BBI.S38316

3. Warnat-Herresthal S, Perrakis K, Taschler B, Becker M, Baßler K, Beyer M, Günther P, Schulte-Schrepping J, Seep L, Klee K, et al. Scalable Prediction of Acute Myeloid Leukemia Using High-Dimensional Machine Learning and Blood Transcriptomics. iScience (2019) 23: doi:10.1016/j.isci.2019.100780

4. Chandrasekhar T, Thangavel K, Elayaraja E. Effective Clustering Algorithms for Gene Expression Data. arXiv:12014914 [cs, q-bio] (2012) Available at: http://arxiv.org/abs/1201.4914 [Accessed February 11, 2021]

5. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511. doi:10.1038/35000501

6. Hierarchical Clustering with Python and Scikit-Learn. Stack Abuse Available at: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/ [Accessed February 12, 2021]

7. How to Find the Centroid in a Clustering Analysis. Sciencing Available at: https://sciencing.com/centroid-clustering-analysis-10070345.html [Accessed February 12, 2021]

8. Plot Hierarchical Clustering Dendrogram — scikit-learn 0.24.1 documentation. Available at: https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py [Accessed February 14, 2021]

Salvatore Raieli
Peter Moss Leukaemia MedTech Research CIC

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence