Leveraging Unsupervised Machine Learning for Text Clustering
Find correlation between clusters using an unsupervised Machine Learning algorithm for a given corpus
The jupyter notebook with all the code can be accessed here and the dataset is available at this link.
The objective of this blog is not just to apply an unsupervised machine learning algorithm to a given text corpus to cluster sentences and plot them on a cluster plot, but also to determine the correlation among clusters (whether they are dense or overlapping). Don’t worry if you are unaware of certain terms referred to here; they’ll be explained in subsequent sections.
Unsupervised Machine Learning
Unsupervised learning is a machine learning technique where a model learns on its own, by means of an algorithm, to discover patterns in unlabelled data. In such a learning process the machine is forced to build a compact internal representation of its world using an algorithm such as K-Means, DBSCAN or neural networks such as autoencoders, to name a few.
These algorithms exhibit self organization and capture patterns as probability densities. They interpret raw input and divide data objects into groups based on similarity and difference between them. Such algorithms find their application in clustering and anomaly detection.
Aspects of Text Clustering
With the volume of text generated today, it has become important to mine insights from it for analytics and cyber-crime management. For this, textual documents with contextual similarity are grouped together, which involves three aspects-
- Selection of a distance measure to identify proximity of feature vectors
- A criterion function to determine the best possible clusters
- A greedy algorithm for optimizing the chosen criterion function
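To make the first aspect concrete, here is a minimal sketch of two common measures for comparing feature vectors (the vectors below are made up for illustration; in practice they would be rows of a TF-IDF matrix):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 2.0]  # hypothetical document vectors
v2 = [1.0, 0.0, 2.0]
v3 = [0.0, 3.0, 0.0]

print(euclidean(v1, v2))          # identical documents -> distance 0.0
print(cosine_similarity(v1, v3))  # no shared terms -> similarity 0.0
```

Documents that share no terms are orthogonal in feature space, which is why their cosine similarity is zero.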
A day-to-day application of text clustering is the Google search engine. When a word/phrase is searched, Google by means of an algorithm breaks down unstructured data from web pages and tags them with keywords from search results.
Before implementing text clustering on our corpus (dataset), here is an insight on the basic difference between clustering and classification-
Classification: A supervised learning approach that maps an input to an output based on example input-output pairs. Here the classes are known in advance, such as positive or negative sentences. It predicts categorical values.
Clustering: An unsupervised learning approach that partitions a corpus into groups (clusters). Here the objective is to segregate the data so that similar points lie together in one cluster while dissimilar points lie in other clusters. It determines groupings in unlabelled data.
Now let’s delve deeper into the theoretical aspects of the algorithm shortlisted for the said task.
The K-Means algorithm
An example of a centroid-based, greedy algorithm where each cluster is associated with a centroid. The aim is to minimize the sum of distances between data points and the centroids of their clusters.
The algorithm takes an unlabelled corpus as input, divides it into k clusters (the value of k must be predetermined) and iterates until the centroids, and hence the cluster assignments, stop changing.
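The iteration described above can be sketched in plain Python. This is a toy illustration on made-up 2-D points, not the sklearn implementation used later in the post:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: pick k points as initial centroids, then
    alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved this iteration
            break
        centroids = new_centroids
    return centroids, clusters

# two obvious groups of points; k-means should separate them
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

On this tiny example the two tight groups of points end up in separate clusters within a few iterations.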
Implementation:
Having understood clustering and unsupervised learning, it is now time to apply the algorithm to the corpus to cluster the text into a certain number of groups. Like any text clustering approach, the implementation involves the following steps-
- Text pre-processing: to identify noise and extract a set of words that describe the matter (descriptors)
- Feature extraction: calculating frequency of descriptors using TF-IDF
- Clustering: applying the algorithm on generated features
Text pre-processing:
The objective of this stage is to reduce the text to a form that is predictable and analyzable for the task at hand. This step is important for the forthcoming stages to yield the best possible results. Before beginning it is important to identify the corpus’ structure.
Every summary is assigned a ‘score’, which will be used to assign labels (positive or negative) to the corpus, and elements of noise (HTML tags) can be observed, for example in the entry with Id 150529.
Steps involved in preprocessing the corpus include-
- Stop word removal: certain words such as the, any, in are not very discriminative and add little value in classification tasks, text clustering and information retrieval
- Stemming: in linguistic terms, the process of reducing words to their root form, e.g. [‘eating’, ‘eats’] -> eat. The PorterStemmer class from the Natural Language Toolkit (NLTK) will be used for this purpose
- Noise and punctuation marks removal: removing HTML tags and punctuation marks such as . , ? defined in the English language
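The three steps can be sketched together as a toy pipeline. Note the stemmer below is a naive suffix-stripper standing in for NLTK's PorterStemmer, and the stop-word list is a tiny illustrative subset; both are assumptions for this sketch:

```python
import re

STOP_WORDS = {"the", "any", "in", "a", "an", "is", "of"}  # tiny illustrative list

def toy_stem(word):
    """A toy suffix-stripper standing in for NLTK's PorterStemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"<.*?>", " ", text)   # noise removal: strip HTML tags
    text = re.sub(r"[.,?!]", " ", text)  # punctuation removal
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<br>The cat is eating any snacks."))
# -> ['cat', 'eat', 'snack']
```

The output keeps only the stemmed descriptors, which is exactly the form the feature-extraction stage expects.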
To perform the mentioned steps make use of the functions stated below-
import re

def clean_html(text):
    """
    function to clean text from html tags
    :param string text: the text with html tags
    :return: text without html tags
    """
    cleantext = re.sub(re.compile('<.*?>'), ' ', text)  # strip html tags
    return cleantext

def clean_punc(text):
    """
    function to clean text from punctuation marks or special characters
    :param string text: text containing punctuation marks
    :return: text without punctuations
    """
    cleantext = re.sub(r'[|/)("#]', r'', text)       # special characters
    cleantext = re.sub(r'[.,?!]', r' ', cleantext)   # punctuation marks
    return cleantext
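To sanity-check the cleaning step, the two functions can be chained on a sample review. They are restated in condensed form here so the snippet runs standalone (the review text is made up):

```python
import re

def clean_html(text):
    """Strip html tags, replacing each with a space."""
    return re.sub(re.compile('<.*?>'), ' ', text)

def clean_punc(text):
    """Drop special characters, then replace punctuation marks with spaces."""
    text = re.sub(r'[|/)("#]', '', text)
    return re.sub(r'[.,?!]', ' ', text)

raw = "<br>Great taste, would buy again!"  # hypothetical review
print(clean_punc(clean_html(raw)).strip())
# -> Great taste  would buy again
```

Punctuation is replaced by spaces rather than deleted so that words separated only by a comma or period do not get glued together.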
The functions for assigning scores are given in the jupyter notebook. After assigning scores and cleaning, the corpus should resemble-
Feature Extraction:
Machine Learning algorithms deal with numeric data while natural language is mainly text, hence it is important to map text data to real-valued vectors, a process called feature extraction. The objective of this step is to represent how important a given word is to a set of documents so that the encoded data can be used by an algorithm.
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates a word’s relevance to a document in a corpus. It increases the weight of a word that appears frequently in a document and lowers it when the word is common across many documents.
If a word is very common and appears in many documents it is assigned a score approaching 0; otherwise the score approaches 1. The TF-IDF score can be computed using the formula-
tf-idf(t, d) = tf(t, d) × log(N / df(t))
where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus and df(t) is the number of documents containing t.
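As a worked example, the plain (unsmoothed) form of the formula can be computed by hand on a toy corpus. Note that sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalises the rows, so its numbers will differ slightly:

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF: term frequency in the document times the log of
    (number of documents / number of documents containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

corpus = [
    ["good", "taste", "good", "value"],  # made-up tokenised reviews
    ["good", "dog", "food"],
    ["bad", "taste"],
]

# 'good' appears in 2 of 3 documents -> low idf pulls the score down
print(round(tf_idf("good", corpus[0], corpus), 3))   # -> 0.203
# 'value' appears in only 1 document -> higher idf, higher score
print(round(tf_idf("value", corpus[0], corpus), 3))  # -> 0.275
```

Even though 'good' occurs twice in the first review and 'value' only once, 'value' scores higher because it is rarer across the corpus.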
Once words are transformed to numbers, the scores are fed to the algorithm. An in-depth understanding of the terms involved can be referred to here. Let’s keep the heavy mathematics aside and implement this measure in Python using the sklearn library in two short steps.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(df['CleanedText'].values)
Clustering:
Having extracted features, now is the time to apply the KMeans algorithm on those features. As stated earlier, the number of clusters must be pre-determined, for which an elbow curve is plotted using the given code-
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(tfidf)
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares

plt.plot(range(1, 10), wcss)
plt.xlabel('No of clusters')
plt.ylabel('WCSS')
plt.show()
As per the curve, 7 was chosen as the value of k (the elbow, where the criterion function WCSS stops decreasing sharply). A model is then built, fed the TF-IDF scores and used to determine labels as elaborated below-
model = KMeans(n_clusters = 7, init = 'k-means++', random_state = 0)
labels = model.fit_predict(tfidf)
print(labels)
Labels are found to be an array of integers from 0 to 6 (one cluster id per review), which are assigned to the corpus
df['Cluster_Label'] = labels # assigning labels to text
df.head()
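A quick way to see how the reviews spread across the clusters is to count the labels. A made-up label array is used here so the snippet runs standalone; in the notebook, labels comes from model.fit_predict:

```python
from collections import Counter

labels = [0, 2, 2, 5, 0, 2, 6, 3]  # hypothetical cluster assignments
sizes = Counter(labels)
for cluster, count in sorted(sizes.items()):
    print(f"Cluster {cluster}: {count} reviews")
```

Heavily unbalanced counts can hint that k was chosen too large or too small.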
Having assigned labels, let’s look at a few reviews assigned to each cluster. The code for this step can be found in the jupyter notebook
It can be observed that each cluster has a theme. Cluster 0, for example, elaborates on possession of cats and plants, and Cluster 2 on the good taste of food items. However, there is some overlap, as observed in the case of Cluster 5: a review of chocolate bars highlights their good taste, which is the general theme of Cluster 2.
To infer closely, the top words in each cluster (descriptors) are observed.
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]  # term indices sorted by weight
terms = tfidf_vect.get_feature_names_out()
for i in range(7):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :7]:
        print(' %s' % terms[ind], end='')
    print()
As expected, the clusters overlap. Consider clusters 2 and 5 as an example: they overlap due to the presence of the word ‘good’. The same words describing food for dogs and cats are observed in clusters 3 and 6.
We have inferred programmatically that clusters overlap. In the final section below, the magnitude of cluster correlation is determined statistically and plotted graphically by means of a scatter plot.
Cluster evaluation
Clustering algorithms are evaluated statistically by means of the silhouette score or the Dunn index. The chosen corpus was evaluated on the silhouette score.
Silhouette score: This score ranges between -1 and 1 and indicates how well samples are clustered with respect to other samples. A score approaching 1 indicates dense, well-separated clusters while a score approaching 0 indicates overlapping clusters.
Mathematically, for a mean intra-cluster distance (a) and mean nearest-cluster distance (b) the score is computed as-
silhouette_score = (b-a)/max(a, b)
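Plugging a couple of hypothetical distance pairs into the formula shows how the score behaves at the two extremes:

```python
def silhouette(a, b):
    """silhouette = (b - a) / max(a, b) for one sample, where a is the mean
    intra-cluster distance and b the mean nearest-cluster distance."""
    return (b - a) / max(a, b)

# sample much closer to its own cluster than to the nearest other cluster
print(silhouette(0.2, 1.0))   # -> 0.8, a dense, well-separated cluster

# nearest cluster is almost as close as the sample's own cluster
print(silhouette(0.5, 0.55))  # close to 0, overlapping clusters
```

When b barely exceeds a, the numerator shrinks toward 0, which is exactly the overlapping-clusters signal observed on this corpus.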
The code to compute the silhouette score with the sklearn library, using euclidean distance as the distance metric, is shown below.
from sklearn.metrics import silhouette_score

s_score = silhouette_score(tfidf, labels, metric='euclidean')
print(s_score)
Bingo! The score approaches 0, which statistically confirms overlapping clusters. The scatter plot (shown below) followed the general trend and was a straight line parallel to the Y-axis.
import matplotlib.pyplot as plt

list_00 = tfidf[labels == 0][:, 0].toarray()[:, 0]
list_01 = tfidf[labels == 0][:, 1].toarray()[:, 0]
list_10 = tfidf[labels == 1][:, 0].toarray()[:, 0]
list_11 = tfidf[labels == 1][:, 1].toarray()[:, 0]

plt.scatter(list_00, list_01, s=5, c='red', label='0')
plt.scatter(list_10, list_11, s=5, c='blue', label='1')
plt.xlim([-0.1, 0.1])
plt.legend()
plt.show()
Conclusions:
- Text clustering is an unsupervised learning approach that groups similar points together in a single cluster
- For clustering documents with contextual similarity, euclidean distance served as the distance measure, WCSS as the criterion function and KMeans as a greedy, unsupervised, centroid-based algorithm
- The optimal value k=7 was determined by means of an elbow curve
- The silhouette score approached 0 and the scatter plot was a straight line parallel to the Y-axis, which indicates that the clusters overlap
References:
Centroid based clustering algorithms: https://ijcsit.com/docs/Volume%205/vol5issue06/ijcsit2014050688.pdf
Understanding TF-IDF for machine learning: https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/
The elbow method: https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/c71ea970-0f3c-4973-8d3a-b09a7a6553c1.xhtml
Dunn Index for K-Means clustering evaluation: https://python-bloggers.com/2022/03/dunn-index-for-k-means-clustering-evaluation/
All code can be accessed from this Github repo