Text Clustering with K-Means

Clustering national anthems with unsupervised learning

Lucas de Sá
8 min read · Dec 17, 2019

The Portuguese version of this article can be read here. The Jupyter notebook with all of the code can be found at this link, and you can check out the final map here!

An idea occurred to me while listening to a podcast. At one point the participants mentioned that national anthems could be divided into groups: some have more militaristic lyrics, such as France’s, while others are more patriotic and praise the country’s rich nature, such as Brazil’s, and so on. As no one’s got the time to classify a bunch of national anthems by hand, let’s see how this can be done with unsupervised learning.

This analysis is restricted to Europe, Latin America and North America. The code and dataset, both built by me, are available on my GitHub.

Introduction

The idea behind this article is to understand how to represent text data (in this case national anthems) so that it can be used as input for the K-Means algorithm. Since we are dealing with countries, it will also be nice to visualize the results in an interactive map with Folium.

The whole process can be divided into the following stages:

  1. Pre-processing of the Dataset
  2. Feature Extraction with TF-IDF
  3. Running K-Means and Cluster Analysis
  4. Cluster Visualization in a Map with Folium

Pre-processing of the Dataset

Before we pre-process the dataset, it’s important to know how it is structured:
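The dataframe itself isn’t reproduced here, but loading and inspecting it is straightforward. A minimal sketch, assuming the dataset lives in a CSV file (the file name below is illustrative):

import pandas as pd

# File name is an assumption; the dataset has one row per country,
# with the anthem text plus the country's ISO codes
data = pd.read_csv('anthems.csv')
data.head()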

It’s nice to notice that the columns ‘alpha-2’ and ‘alpha-3’ represent the different countries ISO-Codes. This will be useful later.

It’s also interesting to know how the anthems look like. Looking at Hungary’s national anthem it’s clear that the dataset contains noise. Anthems generaly contain mentions to different culture’s locations and names, which many times are not represented by UTF-8 characters.

From now on we’ll refer to our dataset as a corpus. This pre-processing stage is the series of treatments applied to the corpus so that it can be well represented by a statistical model before being fed to the K-Means algorithm. The pre-processing routine is divided as follows:

  1. Corpus tokenization, that is, splitting the different texts into individual words.
  2. Stop word removal. Stop words are common words (a, the, not, etc.) that contribute close to nothing to the semantic meaning of a text.
  3. Noise removal, meaning basically anything that can’t be recognized as an English word, such as words with non-ASCII symbols, words mixed with numbers, etc.
  4. Stemming, which reduces a word to its root, e.g. [consultant, consulting, consultants] -> consult.

Below is a list of auxiliary functions that remove a given list of words (such as stop words) from a text, apply stemming, and drop words with 2 letters or fewer or with 21 or more letters (the longest word in the English dictionary is ‘Incomprehensibilities’, with 21 letters).
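The original snippet isn’t embedded here, so what follows is a minimal sketch of what such helpers might look like, assuming NLTK’s SnowballStemmer:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def remove_words(tokens, words_to_remove):
    # Drop every token that appears in the removal list (e.g. stop words)
    removal_set = set(words_to_remove)
    return [t for t in tokens if t not in removal_set]

def filter_by_length(tokens):
    # Keep only words between 3 and 20 characters long
    return [t for t in tokens if 2 < len(t) < 21]

def apply_stemming(tokens):
    # Reduce each word to its root, e.g. 'consulting' -> 'consult'
    return [stemmer.stem(t) for t in tokens]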

Now we’ll define the main processing function, that uses regular expressions for noise removal and calls the functions defined above.

And after processing our corpus, this is what Hungary’s anthem looks like!

Feature Extraction with TF-IDF

The pre-processing phase was done so that this stage can yield the best possible results. Here we want to represent how important a word is to a set of documents, so that the encoded data is ready to be used by an algorithm. This process of mapping text data into real-valued vectors is known as feature extraction.

TF-IDF, short for Term Frequency-Inverse Document Frequency, is a numerical statistic that intends to reflect the importance of a word in a corpus: a term with a high weight is considered relevant.

This method works by increasing the weight of a word when it appears many times in a document, and lowering its weight when it is common across many documents.

This method is divided into two steps:

  • Term Frequency (TF)

The term frequency tf(t, d) measures how frequently a term t occurs in a document d, giving a higher weight to more frequent terms.

  • Inverse Document Frequency (IDF)

The inverse document frequency idf(t, D) measures the importance of a term t across a set of documents D. This statistic lowers the weight of terms that appear in many documents and raises the weight of rare terms that give more meaning to a text.

The product of tf(t, d) and idf(t, D) yields the tf-idf score for each term.
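Written out in their most common textbook form (sklearn applies a smoothed, normalized variant), with f(t, d) the raw count of term t in document d and N the number of documents:

tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d)

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

tf-idf(t, d, D) = tf(t, d) · idf(t, D)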

Even though we went through all the math above, it’s quite easy to apply TF-IDF using the sklearn library!
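A sketch with sklearn’s TfidfVectorizer, fitted on the processed anthems from the previous step:

from sklearn.feature_extraction.text import TfidfVectorizer

# Each anthem becomes a sparse vector of tf-idf weights,
# with one dimension per word in the vocabulary
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['anthem_processed'])
terms = vectorizer.get_feature_names_out()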

The words with the highest weights

Running K-Means and Cluster Analysis

K-Means is one of the simplest and most popular machine learning algorithms out there. It is an unsupervised algorithm, as it doesn’t use labelled data; in our case this means that no text is assigned in advance to a class or group. It is also a clustering algorithm, one that partitions a dataset into K clusters.

The idea behind this algorithm is that its clusters are defined by K centroids, where each centroid is a point that represents the center of a cluster. The algorithm works iteratively: initially each centroid is placed randomly in the vector space of the dataset, then moves to the center of the points closest to it. In each new iteration the distances between each centroid and the points are recalculated, and the centroids move again to the center of their closest points. The algorithm finishes when the assignments no longer change, or when the distance the centroids move falls below a pre-defined threshold.

Visual representation of K-Means. Image extracted from this article.

As we don’t know the ideal number of clusters beforehand, we’ll run the algorithm with 2 to 7 groups and then select the best result.
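A sketch of that loop, keeping every fitted model so the groupings can be compared afterwards (the random_state is an arbitrary choice for reproducibility):

from sklearn.cluster import KMeans

kmeans_results = {
    k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    for k in range(2, 8)
}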

With the results of all the different groupings we need to choose the best K, which can be a very subjective analysis. There are some ways to measure the performance of the algorithm, like the Elbow Method or the Silhouette Score, which bring insight into how well defined the groups are. I highly recommend this article written by a fellow colleague to dig deeper into those methods.
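As an illustration, the Silhouette Score of each grouping can be computed directly from the fitted models above:

from sklearn.metrics import silhouette_score

# Higher scores suggest better separated, more cohesive clusters
for k, model in kmeans_results.items():
    print(k, silhouette_score(X, model.labels_))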

To select the best number of clusters I went through all the clusters in each grouping, looking at graphs of the most dominant words in each group. I came to the conclusion that the groups made much more sense when divided into 5 clusters.

best_result = 5
kmeans = kmeans_results.get(best_result)
Most dominant words by cluster
Word Clouds
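The charts behind these captions aren’t reproduced here, but the dominant words can be recovered from the model itself; a sketch using the fitted vectorizer and the selected model from above:

import numpy as np

# Rank the vocabulary by each cluster center's tf-idf weights
# and print the ten heaviest terms per cluster
for i, centroid in enumerate(kmeans.cluster_centers_):
    top = np.argsort(centroid)[::-1][:10]
    print(f'Cluster {i}:', [terms[j] for j in top])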

Looking at the clusters it’s clear that the words in each one have a theme. In Cluster 0, for example, there are more positive words like “heart”, “beauti” and “mother”, while in Cluster 3 there are more conflict-related words such as “war”, “blood” and “glori”.

Now let’s group the labels from the K-Means results with each country.

labels = kmeans.labels_ 
data['label'] = labels
data.head()

Cluster Visualization in a Map with Folium

The idea now is to color each country on a map according to its cluster, which is known as a choropleth map. This can be done with the Folium library.

First let’s create palette of 5 colours, one for each cluster.

Now we’ll need to draw the coordinates of each country in a map. Here’s a Json file with the polygons of each country, identified by it’s ISO-Code.

Now it’s simple to merge the labels of each country with it’s respective polygon by it’s ISO-Code!

With the polygons and labels now associated, we can create a map instance with Folium, draw the polygons, and paint each country with the color of its respective label.
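A sketch of the map-building step; countries that aren’t in the dataset are left grey:

import folium

def style_function(feature):
    # Pick the cluster colour, falling back to grey for unlabelled countries
    label = feature['properties'].get('label')
    fill = palette[label] if label is not None else '#d3d3d3'
    return {'fillColor': fill, 'fillOpacity': 0.7,
            'color': 'black', 'weight': 0.5}

m = folium.Map(location=[30, 0], zoom_start=2)
folium.GeoJson(world_geo, style_function=style_function).add_to(m)
m.save('anthem_clusters.html')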

You can play around with the map here

The clusters are:

  • Cluster 0 — Red, with words that praise the motherland of each nation
  • Cluster 1 — Yellow, with words that praise liberty
  • Cluster 2 — Green, with more religious terms
  • Cluster 3 — Blue, with terms that are more related to war
  • Cluster 4 — Magenta, with generic terms that can be part of any cluster

Conclusion

Text clustering is a process that involves Natural Language Processing (NLP) and a clustering algorithm. This method of finding groups in unstructured text can be applied in many different segments, such as feedback analysis, research segmentation, etc.

In K-Means there are many variables that influence the algorithm’s result, such as the distance metric (Euclidean, Manhattan, cosine similarity, etc.) and the centroids’ initial positions, and the analysis of the groupings can be very subjective.

It’s also important to note that K-Means is not the only way of clustering data, there are other different methods, such as hierarchical clustering algorithms.


Lucas de Sá

Stockholm-based Data Engineer and Functional Programming Practitioner.