Clustering || Why and When || Hands-On || Conclusion

Atul Mishra
Published in Analytics Vidhya · 10 min read · Dec 3, 2020

Constructive, effective, beneficial, conclusive, definite: such words are synonyms of the word POSITIVE. Now, if someone asks you to place these words in one of two bags (POSITIVE or NEGATIVE), which one would you choose?

POSITIVE, right! And what would you call a BAG with such words? A cluster of WORDS with the same importance, the same root word, or even the same origin. This is where the concept of CLUSTERING and forming CLUSTERS comes in!

If the above explanation was not clear, let me be more concrete: grouping together ACTION movies in a PLAYLIST, assembling all the different types of TOOTHPASTE in one section and then dividing them by PRICE/SIZE, or clubbing together different types of ANIMALS under one SPECIES (for example, the CAT FAMILY).

The CAT Family

Clustering is the task of segregating the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters. Applications of Clustering:

  • Customer Segmentation
  • Data Analysis
  • Can be used as a Dimensionality Reduction technique
  • Outlier Detection
  • Search Engines and Image Segmentation

Now, let’s understand CLUSTERING from a BUSINESS perspective.

In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics. This facilitates the development of business strategies for enhanced customer relationship management. Moreover, consider a consulting company with a large number of projects. To improve project management, clustering can be applied to partition projects into categories based on similarity, so that project auditing and diagnosis (to improve project delivery and outcomes) can be conducted effectively.

Types of Clustering

In order to understand the types of clustering, it is important to understand how we judge a CLUSTERING to be a good fit:

  • high intra-class similarity
  • low inter-class similarity

This means that in order to get a good clustering result, the intra-cluster distance has to be low and the inter-cluster distance has to be high (equivalently, intra-cluster similarity is high and inter-cluster similarity is low).

Intra-Class:

Intra-cluster distance is defined as the distance between data points in the same cluster.

Inter-Class:

Inter-cluster distance is the distance between data points belonging to two different clusters.

Depiction
  1. Soft Clustering: A technique where a data point can belong to (be shared by) more than one cluster.
  2. Hard Clustering: A technique where every single data point/object belongs to one cluster only, i.e., each instance is assigned to a single cluster.
Hard vs Soft Clustering

3. Hierarchical Clustering: One of the major considerations in using k-means clustering is deciding the value of k beforehand, either by the silhouette score or the elbow-curve method. Hierarchical clustering doesn’t have this restriction. Here, the concept of Linkage is used to merge sub-clusters into bigger clusters, until all the points become part of a single cluster. There are two ways in which hierarchical clustering can be performed: Agglomerative (bottom-up) and Divisive (top-down). A short scikit-learn sketch covering these clustering types follows the figure below.

Agglomerative Hierarchical Clustering
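To make the three types above concrete, here is a minimal scikit-learn sketch (my own illustration on made-up 2-D data, not part of the original hands-on): hard labels from KMeans, soft membership probabilities from a Gaussian Mixture, and a bottom-up agglomerative clustering.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Toy data: two loose blobs of points
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Hard clustering: every point gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Soft clustering: every point gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
soft_memberships = gmm.predict_proba(X)  # shape (100, 2), each row sums to 1

# Hierarchical (agglomerative, bottom-up) clustering with Ward linkage
agg_labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X)

print(hard_labels[:5], soft_memberships[0].round(3), agg_labels[:5])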

Let’s talk about the Silhouette Score, the Elbow Curve, K-Means and the hands-on exercise which we will be doing!

Just have a look below at how many different types of and methods for unsupervised clustering there are, and when to use each:

Clustering Roadmap!

Silhouette Score: Silhouette analysis, or the Silhouette Coefficient, is a measure of how similar a data point is to its own cluster (COHESION) compared to other clusters (SEPARATION). To compute the Silhouette metric, we need to compare two measures, a(i) and b(i), where

  • a(i) is the average distance from own cluster (cohesion)
  • b(i) is the average distance from the nearest neighbor cluster (separation)

Now, for every data point i, a(i) should be as small as possible and b(i) should be as large as possible; a short code sketch for computing the score appears after the interpretation below.

Silhouette Formula: s(i) = (b(i) − a(i)) / max(a(i), b(i))
  • The value of the Silhouette Score ranges from -1 to +1.
  • A score closer to +1 indicates that the data point is very similar to the other data points in its cluster.
  • A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
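As a minimal illustration (my own sketch on toy data, not from the article’s hands-on), scikit-learn’s silhouette_score returns the mean s(i) over all points, so you can compare a few values of k directly:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])

# A higher mean silhouette score means better-separated, more cohesive clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))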

Elbow Curve: plots the sum of squared distances of samples from their respective cluster centers for each value of k. To choose the optimal number of clusters with the Elbow Method:

  • Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
  • For each k, calculate the total within-cluster sum of square (wss).
  • Plot the curve of wss according to the number of clusters k.
  • The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

Well, both of the above methods are used for selecting the optimal number of CLUSTERS in the K-MEANS clustering algorithm (we apply the elbow method in the hands-on section below)!

OK, so far we have seen,

  • What clustering is, and when we should perform clustering.
  • Different types of clustering.
  • How we can select the optimal number of clusters.

Now, let’s look at one of the clustering methods and how we go about implementing it.

K-Means Clustering: The algorithm needs to find the data points whose values are similar to each other, and therefore these points would belong to the same cluster.

The way any clustering algorithm goes about doing that is by using something called a “Distance Measure”. The distance measure used in K-Means clustering is the “Euclidean Distance”.

d is the Euclidean distance between two points: d = √((x2 − x1)² + (y2 − y1)²)
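In code, that distance is just the square root of the summed squared coordinate differences; a tiny NumPy illustration (my own, with made-up points):

import numpy as np

p1, p2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
d = np.sqrt(np.sum((p1 - p2) ** 2))  # same as np.linalg.norm(p1 - p2)
print(d)  # 5.0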

Mostly, when we perform clustering, we don’t have the labels and centroids. So how do we proceed?

  1. We start by placing the centroids randomly.
  2. Then we label each instance with the cluster of its nearest centroid (by Euclidean distance) and update the centroids; we keep labelling the instances and updating the centroids until the centroids stop moving, i.e., stop attaining new coordinates.

The algorithm is guaranteed to converge in a finite number of steps. That’s it: with just two steps, Assignment and Optimisation, the algorithm iteratively converges. It is the algorithm’s inner loop that iterates over these two steps (a rough NumPy sketch follows the gif below):

  • Assign each observation X(i) to the closest cluster centroid.
  • Update each centroid to the mean of the points assigned to it.
Gif representation that compares the sum of squares per cluster and alternately shows the cluster formation!
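As a rough NumPy sketch of that inner loop (my own illustration, not the article’s implementation), the assignment and update steps look like this:

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Start by placing the centroids randomly (here: k random points from X)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving -> converged
            break
        centroids = new_centroids
    return labels, centroids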

For more clarity, if there is still any doubt, one can refer to the StatQuest clustering video. This guy really explains well and sings well too!😛 And if you want to play with k-means, this can help you!

Finally, we are here, the part I was eager to reach!

Hands On!

Problem Statement: Use a clustering methodology for word clustering. We have a dataset of news tweets and we need to cluster similar words. How do we go about doing that?

Let me quote a few things about the dataset and approach:

  • It has features such as tweet_id, the date and time of the tweet, and tweet_text. These news tweets are BBC health tweets obtained from the UCI dataset repository. There are around 3929 tweets in total, and the data is free of any missing values.
  • Now, since we will be doing clustering on the feature tweet_text, we remove the features date_time and tweet_id.
  • Next, since this is textual data in a column, we will perform some basic text cleaning and text processing steps.
  • Then, after cleaning and processing, we will build a TF-IDF vocabulary from the tweet_text feature. Since clustering is performed on numerical values, we need to build a word vocabulary and transform the text into vectors.
  • Then, using the elbow-curve method with the help of the Yellowbrick library, we will find the optimal number of clusters and pass it to the final clustering model. The Yellowbrick library helps us fetch the optimal k every time the model/script is run, so we don’t need to hardcode the value.

The final results are not very precise, since this is unsupervised learning and we look for clusters in the range of 2 to 10; otherwise, the same word might appear in two different clusters due to the linguistic properties of sentences.

Read the dataset and explore a bit:

import pandas as pd

data = pd.read_csv('Health-Tweets/bbchealth.txt', sep="|", header=None)
data.columns = ["tweet_id", "date_time", "tweet_text"]
data.head()
data.info() # Checking columns and number of entries
print('-------------------------')
print(round(data.isnull().sum()/len(data),2)) # Checking for any null values

So, here we see the total number of rows and the column names. Also, none of the rows for the mentioned features/columns are empty. Next, since we only need tweet_text, we drop the remaining two columns!

X_train = data.drop(columns=['tweet_id','date_time']).reset_index(drop=True)
X_train.head()
print(X_train.shape) #(3929, 1)

Now comes processing the text in our feature:

First, we apply a simple regex filter to remove all the URLs (href links) present in our tweets, and then we build a function that is used as the analyzer when we start forming the TF-IDF vocabulary. This helps in applying the function sentence by sentence.

Our function takes in a string of text, then performs the following:
1. Removes all punctuation and digits
2. Removes all stopwords
3. Lemmatizes the remaining words
4. Returns the cleaned text as a list of words

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Strip URLs from the tweets before vectorising
X_train['tweet_text'] = X_train['tweet_text'].apply(lambda x: re.split('https:\/\/.*', str(x))[0])
X_train['tweet_text'] = X_train['tweet_text'].apply(lambda x: re.split('http:\/\/.*', str(x))[0])

# Our analyser function:
def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Removes all punctuation and digits
    2. Removes all stopwords
    3. Lemmatizes the remaining words
    4. Returns the cleaned text as a list of words
    '''
    lemmatizer = WordNetLemmatizer()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join([i for i in nopunc if not i.isdigit()])
    nopunc = [word.lower() for word in nopunc.split() if word not in stopwords.words('english')]
    return [lemmatizer.lemmatize(word) for word in nopunc]

Vectorization is nothing but creating a vector of words, called a vocabulary.
The TfidfVectorizer is used to create this vocabulary. TF-IDF is the product of how frequent a word is in a document and how unique the word is with respect to the entire corpus. The ngram_range parameter helps create a one-, two- or more-word vocabulary, depending on the requirement.
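To make the “frequency times uniqueness” idea concrete, here is a tiny hand-rolled illustration of the classic TF-IDF weighting on a toy corpus (note that scikit-learn’s TfidfVectorizer uses a smoothed IDF and L2-normalises rows, so its numbers will differ slightly):

import math

corpus = ["health news today", "health tips", "sports news"]
word, doc = "health", corpus[0].split()

tf = doc.count(word) / len(doc)                # how frequent the word is in this document
df = sum(word in d.split() for d in corpus)    # how many documents contain the word
idf = math.log(len(corpus) / df)               # how rare/unique the word is across the corpus
print(round(tf * idf, 4))                      # classic tf-idf weight for "health" in document 1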

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconvert = TfidfVectorizer(analyzer=text_process, ngram_range=(1,3)).fit(X_train.tweet_text)
X_transformed = tfidfconvert.transform(X_train.tweet_text)

# Checking the length of the vocabulary created
len(tfidfconvert.vocabulary_) # 3933

Deciding the number of clusters using the Elbow Curve:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Sum_of_squared_distances = []
K = range(1,20)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X_transformed)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Normal Elbow Curve Method

But let me show you the Yellowbrick library magic:

from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster.elbow import kelbow_visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,20))
# visualizer = kelbow_visualizer(KMeans(random_state=4), X_transformed, k=(2,50))
visualizer.fit(X_transformed) # Fit the data to the visualizer
visualizer.show(); # Finalize and render the figure
Yellow Brick Elbow Curve Graph

Just have a look at how beautiful it looks! We can see the attached score and the elbow at the chosen k value. It even has an axis for the fit time of every cluster computation! The next thing is:

optimal_k = visualizer.elbow_value_
print('Optimal K found is:', optimal_k) # 9

Using the elbow_value_ attribute from the Yellowbrick implementation, we can directly fetch the optimal k value in an automated way, and this goes into our final modelling.

Modelling:

Now, at this moment, the optimal number of clusters we are getting is around 9, and hence the same shall be passed forward.

# Clustering the training sentences with the K-means technique
modelkmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=100)
modelkmeans.fit(X_transformed)

After we have formed our model, we shall also visualize some data points to get a feel for it.

from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
reduced_features = pca.fit_transform(X_transformed.toarray())
# Reduce the cluster centers to 2D as well
reduced_cluster_centers = pca.transform(modelkmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=modelkmeans.predict(X_transformed));
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=150, c='b');

The crosses are the centroids chosen by the model. The final cluster shapes are highly non-uniform, and hence if we select a larger number of clusters, we might get more uniform clusters. Next, we also tried the MiniBatch KMeans algorithm on the same dataset as per our data science project approach!

Mini Batch KMEANS:

Now, MiniBatch KMeans is just an improvement over the traditional KMeans algorithm. Here, instead of using the full dataset at each iteration, the algorithm uses mini-batches, moving the centroids just slightly at each iteration.

This speeds up the algorithm by a factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in memory. As part of our experiment, we intend to use it and look at the final clustering results.

from sklearn.cluster import MiniBatchKMeans
import numpy as np

clusters = MiniBatchKMeans(n_clusters=optimal_k, init_size=1024, batch_size=2048, random_state=20).fit_predict(X_transformed)

# Looking at the words clustered out finally
def get_top_keywords(data, clusters, labels, n_terms):
    df = pd.DataFrame(data.todense()).groupby(clusters).mean()
    final_df = pd.DataFrame(columns=['Cluster','Cluster Words'])
    for i,r in df.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([labels[t] for t in np.argsort(r)[-n_terms:]]))
        x = i
        y = [labels[t] for t in np.argsort(r)[-n_terms:]]
        final_df = final_df.append({'Cluster' : x , 'Cluster Words' : y} , ignore_index=True)
    return final_df

# Print the top words for each cluster
final_df = get_top_keywords(X_transformed, clusters, tfidfconvert.get_feature_names(), 15)

Similarly, we accumulate this result into a DataFrame:

pd.set_option("max_colwidth", None)
final_df.head(20)

Conclusion:

  • So, we looked at the basic need for a clustering algorithm.
  • Different types of clustering techniques and the implementation of K-Means and MiniBatch KMeans.
  • We even explored some text processing/cleaning steps which can help in any problem statement where cleaning/processing of textual data is required.
  • Finally, there are some more clustering algorithms you can explore for yourself, like DBSCAN, BIRCH, Mean-Shift, Agglomerative, Affinity Propagation and Gaussian Mixture Models (a tiny DBSCAN sketch follows below).
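If you want to try one of these on the same TF-IDF matrix from the hands-on above, a minimal DBSCAN sketch could look like the following (my own illustration; eps and min_samples are placeholder values that would need tuning for this data):

from sklearn.cluster import DBSCAN

# DBSCAN needs no preset number of clusters; it groups dense regions and labels outliers as -1
db_labels = DBSCAN(eps=0.9, min_samples=5, metric='cosine').fit_predict(X_transformed)
print(set(db_labels))  # cluster ids found (-1 = noise)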

I hope this article helped readers in some way or another. More topics are on the way; for any special topic, do let me know in the comments section and, as always… Happy Learning!
