Topic Modeling

Introducing: Hierarchical Topic Modeling with non-parametric depth estimation

Employing a hybrid topic modeling algorithm to efficiently subdivide a pool of Steam tags: full code available on my repo

Michelangiolo Mazzeschi
Plain Simple Software

--

Topic Modeling is a technique for automatically dividing a pool of samples into groups. The approach can be extended by increasing the depth of the subdivision; this is commonly known as Hierarchical Topic Modeling.

Example of hierarchical clustering

Among the various Topic Modeling algorithms, I will focus on the ones that first pass each sample through an encoder to convert the unstructured data into vectors, and then apply a clustering algorithm to group samples with similar meaning.

Tackling the Sample Size disproportion

In most cases, the first topic modeling layer will use the entirety of the data, requiring a high number of clusters. As the depth increases, however, each cluster will contain a much smaller number of samples (as little as 1/100th of the previous level).

For example, if we were to work with 2,000 samples, we could estimate the first layer to be subdivided into 100–200 clusters. The rule of thumb in NLP is that as the number of samples increases, their variance does not increase proportionally. Therefore, a dataset of 20,000 samples will likely use a similar number of clusters on the first layer. Eventually, we will get to the point where each cluster contains only a few samples (e.g. 10–20).

Visualization of clustering with different depths. Depth 1 will have huge clusters, while from depth 2 all clusters will be proportionally smaller.

Given these two examples, their respective cluster structures will be roughly:
- 200 x 10
- 200 x 10 x 10
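As a rough back-of-the-envelope check (using the estimates above, nothing measured), the average cluster size shrinks quickly with depth:

# average cluster sizes implied by the 20,000-sample example
n_samples = 20_000
per_depth1_cluster = n_samples / 200            # ~100 samples per first-layer cluster
per_depth2_cluster = per_depth1_cluster / 10    # ~10 samples per second-layer cluster
print(per_depth1_cluster, per_depth2_cluster)   # 100.0 10.0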

There are several ways of conducting hierarchical topic modeling, but most of them will implement the same clustering strategy regardless of the depth of the subdivision.

Because of this disproportion, I will employ a hybrid clustering strategy: a parametric algorithm for the first layer (k-means), and a non-parametric algorithm for the following layers (affinity propagation).

Different clustering algorithms

Let us analyze different clustering algorithms and their usefulness in our case:

k-means

The first clustering algorithm that comes to mind is k-means. It is a parametric algorithm, meaning that we need to know the number of clusters a priori, which usually requires manual input from the user. This makes it rather difficult to automate but gives the developer the maximum level of control. I am proposing this solution assuming that the engineer employing this technique is conducting an EDA on the data for a single use case, rather than automating a pipeline.
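For reference, a minimal sketch of a parametric k-means call on the encoded vectors (the vectors and the number of clusters here are illustrative assumptions):

from sklearn.cluster import KMeans
import numpy as np

vectors = np.random.rand(2000, 384)   # placeholder for the encoded samples
kmeans = KMeans(n_clusters=200)       # the number of clusters must be chosen a priori
labels = kmeans.fit_predict(vectors)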

HDBSCAN

HDBSCAN is probably the most advanced non-parametric clustering algorithm that can be employed on a huge number of samples. However, it tends to exclude samples as noise, grouping them all under the same label (-1), which causes a disproportion in the labeled data.
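A minimal sketch of the behavior described above, with illustrative hyperparameters: the samples that HDBSCAN discards as noise all end up under the same -1 label.

import hdbscan
import numpy as np

vectors = np.random.rand(2000, 384)                        # placeholder for the encoded samples
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='euclidean')
labels = clusterer.fit_predict(vectors)
print((labels == -1).sum())                                # how many samples were labeled as noise (-1)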

Affinity Propagation

Affinity Propagation is a non-parametric clustering algorithm that works exceptionally well on a limited number of samples (< 100), being able to estimate the optimal number of clusters. However, on larger datasets it tends to create a huge number of clusters regardless of the hyperparameters we use to tune it. For example, on 10,000 samples it may estimate the number of clusters to be around 1,000, which is highly unrealistic. Another issue with this algorithm is the likelihood of exceeding the computational limits of our machine, causing it to break. On small groups of samples, however, it is perfect for our use case.
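A minimal sketch, assuming a small group of encoded samples (well under 100), where affinity propagation estimates the number of clusters on its own:

from sklearn.cluster import AffinityPropagation
import numpy as np

small_group = np.random.rand(50, 384)    # placeholder for one small cluster of embeddings
ap = AffinityPropagation()               # no number of clusters required
labels = ap.fit_predict(small_group)
print(len(np.unique(labels)))            # estimated number of clusters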

Steam use case: sampling tags

To showcase the algorithm in a practical use case, we can download the Steam dataset (source: Kaggle), which contains data on 60,000 games.

In this article, we are going to use Steam tags as our use case

Every game on Steam has one or more arbitrary tags that game publishers use to classify their products. Without focusing on the data preprocessing steps, I have already extracted all the game tags: there are over 2,000 tags that can be found on Steam.

Raw data: tags

import pandas as pd

df = pd.read_csv('steam_tags.csv')
df
Showcase of all the categories in a pandas DataFrame

Encoding

To proceed with Topic Modeling, we will vectorize each tag using a dedicated encoder for sentence similarity (there are also models specialized for clustering), so that we can group the ones that are most similar to each other. This simple code will create a new column containing the corresponding vector for each tag.

import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
tqdm.pandas()

model = SentenceTransformer('all-MiniLM-L6-v2')

#creates a new column with the encoded text
df['text_vector_'] = df['tags'].progress_apply(lambda x : model.encode(x).tolist())
df
Vector Encoding of Steam Tags
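As a side note, sentence-transformers can also encode the whole column in one batched call, which is usually faster than applying the model row by row; a minimal sketch using the model and DataFrame defined above:

# batch-encode all tags at once instead of calling the model once per row
embeddings = model.encode(df['tags'].tolist(), show_progress_bar=True)
df['text_vector_'] = [emb.tolist() for emb in embeddings]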

Hierarchical Topic Modeling

The code below is the full implementation of hierarchical topic modeling. Ideally, I would perform the hierarchical clustering with a dendrogram, but I would not have the same level of control over the clustering method. In addition, I have never been able to find code that converts a dendrogram into the corresponding set of clusters.

Contrary to common belief, an implementation of hierarchical clustering does not require recursion: it can be done with the simple groupby function available in pandas. Below is a minimal sketch of the idea, followed by the full implementation:
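The toy example below (with made-up labels rather than real clusters) only illustrates the groupby-and-concat pattern that replaces recursion:

import pandas as pd

# after depth 1, every row has a 'labels_1' column; grouping by it yields one
# sub-DataFrame per cluster, which can be clustered again and stitched back together
toy = pd.DataFrame({'labels_1': [0, 0, 1, 1, 1], 'value': [1, 2, 3, 4, 5]})
pieces = []
for label, sub_df in toy.groupby('labels_1'):
    sub_df = sub_df.copy()
    sub_df['labels_2'] = 0        # a real clustering call would assign sub-labels here
    pieces.append(sub_df)
toy = pd.concat(pieces)           # the same pattern is repeated for each extra depth level
print(toy)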

from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import numpy as np
import pandas as pd
import warnings
import hdbscan

warnings.simplefilter(action='ignore')  # suppress convergence warnings

def clustering_hdbscan(df):
    # non-parametric clustering; samples discarded as noise are labeled -1
    cluster = hdbscan.HDBSCAN(
        min_cluster_size=2,
        alpha=0.3,
        leaf_size=5,
        metric='euclidean',
        cluster_selection_method='eom'
    ).fit([x[0] for x in df.values])
    return cluster.labels_

def clustering_function_affinity(df):
    # non-parametric clustering: estimates the number of clusters on its own
    affinity_propagation = AffinityPropagation()
    cluster_labels = affinity_propagation.fit_predict([x[0] for x in df.values])
    return cluster_labels

def clustering_function_kmeans(df, n_clusters=5):
    # parametric clustering: k-means requires n_clusters <= number of samples
    vectors = [x[0] for x in df.values]
    kmeans = KMeans(n_clusters=min(n_clusters, len(vectors)))
    return kmeans.fit_predict(vectors)

def hierarchical_clustering(kmeans_depth, df, depth):
    """
    kmeans_depth is a list containing the number of clusters to use for k-means at each level of depth
    Ex. kmeans_depth[0] is the number of clusters used for depth 1
    Ex. kmeans_depth[1] is the number of clusters used for each subcluster in depth 2
    """

    df = pd.DataFrame(df)

    # drop the columns containing the word "label", keeping only the vectors
    def drop_columns_containing_label(df, keyword='label'):
        columns_to_drop = [col for col in df.columns if keyword.lower() in col.lower()]
        return df.drop(columns=columns_to_drop)

    # first layer: parametric clustering (k-means) on the full dataset
    depth_ = 1
    df[f'labels_{depth_}'] = clustering_function_kmeans(drop_columns_containing_label(df), n_clusters=kmeans_depth[depth_-1])
    # df[f'labels_{depth_}'] = clustering_hdbscan(drop_columns_containing_label(df))

    # following layers: non-parametric clustering (affinity propagation) within each cluster
    for _ in range(depth-1):
        list1 = list()
        for g in df.groupby([x for x in df.columns if 'labels' in x]):
            depth_ = len([x for x in df.columns if 'labels' in x]) + 1
            df_ = g[1].copy()
            df_[f'labels_{depth_}'] = clustering_function_affinity(drop_columns_containing_label(df_))
            list1.append(df_)
        df = pd.concat(list1)

    return df

Let us apply the algorithm to our dataset. We can estimate 200 to be the number of clusters for the first layer (a rule of thumb, nothing scientific, to avoid deviating from the scope of the article).

# clustering
df_topics = hierarchical_clustering(kmeans_depth=[200], df=df['text_vector_'], depth=3)

# attach the original tag text back to the clustered rows (row indexes are preserved)
df_topics['categories_eng'] = df.loc[df_topics.index, 'tags']

cols = ['labels_1', 'labels_2', 'labels_3']
df_topics

The hierarchical_clustering function will generate the clustering labels for each sample, as shown below:

Hierarchical Clustering with depth 3

Let us look at the generated nested clusters:

counter = 0
for df_ in df_topics.groupby(['labels_1', 'labels_2', 'labels_3']):
    display(df_[1])
    counter += 1
    if counter == 30: break
Sample of the output clusters

The results look very promising. As we can see, cluster 0 has 6 subclusters at the 2nd depth level, each subdivided into further clusters (usually up to 3) at the 3rd depth level.
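To check how the hierarchy branches out (for instance, how many depth-2 subclusters each depth-1 cluster contains), we can count the unique labels per group; a small sketch on the df_topics DataFrame produced above:

# number of depth-2 subclusters inside each depth-1 cluster
print(df_topics.groupby('labels_1')['labels_2'].nunique().head())

# number of depth-3 subclusters inside each (depth-1, depth-2) pair
print(df_topics.groupby(['labels_1', 'labels_2'])['labels_3'].nunique().head())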
