BERTopic: topic modeling as you have never seen it before

Elisa Distante
Data Reply IT | DataTech
Oct 20, 2022

NLP (Natural Language Processing) is one of the most complex fields of Artificial Intelligence.

Analyzing text data can be a challenging task, especially in some situations:

  • You have a large set of documents that you want to automatically categorize
  • You already have some manually identified categories but you want to discover new tags to organize your documents in a data-driven fashion
  • The categories describing your data can evolve over time

These are perfect use cases to apply Topic Modeling!

Topic modeling: main approaches

Topic modeling is the task of scanning a collection of documents and automatically identifying:

  • the set of topics that best describes the collection
  • the topics found in each document

It falls under the class of unsupervised machine learning (Spoiler: with a special package you can do it in a semi-supervised or guided way!)

There are different algorithms available for topic modeling. Some of the most well-known are LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis) and NMF (Non-Negative Matrix Factorization). They are probabilistic or linear-algebraic algorithms that require text preprocessing as a preliminary step (removing stopwords, numbers and abbreviations, performing lemmatization or stemming…). Moreover, you have to explicitly provide the number of topics that you want to identify.

But the main limitation of these methods is: they do not use the power of Transformers! (And you know, Attention is all you need…)

Transformers are the de-facto standard in modern NLP and they led to the development of pretrained language models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).

Using transformers for topic modeling allows us to build more sophisticated models that can capture semantic similarities between words.

Two topic models using transformers are BERTopic and Top2Vec. This article will focus on BERTopic, which includes many functionalities that I found really innovative and useful in a lot of projects.

BERTopic: the algorithm

BERTopic leverages BERT embeddings and the concept of c-TF-IDF (class-based TF-IDF) to create coherent and easily interpretable topics, described by automatically generated labels!

So how does it work?

BERTopic algorithm

BERTopic includes three main steps:

  • Document Embedding: first we need to create document embeddings. The default is to use sentence-transformers models, which are available for both English and multilingual documents.
  • Document Clustering: the document embeddings just created are high-dimensional, so before clustering them we need to perform dimensionality reduction with UMAP, which preserves both the local and global structure of the embeddings. Then we can apply a density-based clustering technique, such as HDBSCAN, to create our topic clusters and identify outliers where possible.
  • Topic Representation: to represent each topic (i.e. cluster of embeddings), we can modify the well-known TF-IDF score. When you apply TF-IDF to a set of documents, you are comparing the importance of words between documents. If we treat all documents in a cluster as a single document and compute TF-IDF, the result would be importance scores for words within that cluster. If we extract the most important words per cluster, we get descriptions of the topics! This technique is called class-based TF-IDF. A sketch of how these components fit together in code follows below.
c-TF-IDF
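These three steps map directly onto components you can pass to the BERTopic constructor. Here is a minimal sketch with a custom embedding, dimensionality-reduction and clustering model; the model name and hyperparameters below are illustrative choices, not tuned recommendations.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Step 1 - document embedding: a sentence-transformers model (illustrative choice)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - dimensionality reduction and clustering: UMAP followed by HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)

# Step 3 - the topic representation via c-TF-IDF is handled internally by BERTopic
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)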

BERTopic in action

Building a topic model with BERTopic is easy and fast. Given the list of input documents, we get as outputs the topic assigned to each document and the probabilities that document i contains topic j.

from bertopic import BERTopic 
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers','footers', 'quotes'))['data']
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

After generating topics, we can inspect them:

topic_model.get_topic_info()  
Topic Info

-1 is a “fake” cluster containing all outliers, so you should not take it into account. Each topic is assigned a representative name. The default name is a numeric ID followed by the 4 most representative keywords separated by underscores (but this can be customized).

We can get more info on a single topic, for example the most representative words and their corresponding scores:

topic_model.get_topic(0)
Topic 0 info

We can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()
Topic visualization

We can easily save a trained BERTopic model by calling the save method:

from bertopic import BERTopic 
topic_model = BERTopic()
topic_model.save("my_model")

Then, we can load the model when needed:

topic_model = BERTopic.load("my_model")

After having fit a model, we can use transform to predict topics on new documents:

topics, probs = topic_model.transform(docs)

The main advantage of using BERTopic is that it offers a high-level interface with default settings that help you get started easily and build powerful models in a few lines. At the same time, it provides many options to customize all the settings to fit your use case and get the most out of the package!

BERTopic top features

In the rest of the article I will point out some of my favourite BERTopic features (some of them have just been released)!

Preprocessing? No, thanks

When using transformer-based document embeddings, there is typically no need to preprocess the data, as the whole context is important for understanding the document. However, if your data contains a lot of noise, for example HTML tags, you can definitely remove it, as the semantic context would remain the same.

Embeddings for everyone

As the default embedding back-end, you can use a SentenceTransformer model. But you can also try out different back-ends: Gensim, Flair, spaCy… If you are familiar with these packages, you can check whether they fit your use case better. You can now also natively select a Hugging Face transformers model!
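As a minimal sketch, a Hugging Face feature-extraction pipeline can be passed as the embedding model (the model name here is just an example):

from bertopic import BERTopic
from transformers.pipelines import pipeline

# Any model exposed through a Hugging Face feature-extraction pipeline (example model)
embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)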

Visualize and validate

Visualizing BERTopic output is important in understanding the model.

Even though specific metrics can be used to evaluate a topic model, such as topic similarity, coherence and significance (I strongly recommend the OCTIS package), topic modeling can be quite a subjective field, making it difficult for users to validate a model.

In my experience, visualizing the topic hierarchy is extremely useful, as it helps to understand the semantic structure the model has captured.

Topic hierarchy
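A hierarchy like the one above can be plotted with a single call:

topic_model.visualize_hierarchy()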

Recently, a feature I had really wished for was introduced! You can display an enhanced version of the topic hierarchy, showing a representation of the “merged topics” at each level of the hierarchy.

Merged topics
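A rough sketch of how to obtain and display the merged-topic representations (the exact signatures may differ slightly across BERTopic versions):

# Compute the hierarchy of merged topics, then visualize it
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Plain-text view of which topics are merged at each level
print(topic_model.get_topic_tree(hierarchical_topics))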

The same hierarchical representation can also be applied to the document embeddings (after reducing their dimensionality to 2 to speed up the process) at different levels of the topic hierarchy.

Hierarchical documents and topics
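A sketch of this, pre-computing 2-dimensional embeddings to speed up plotting; the embedding model and UMAP parameters are illustrative:

from sentence_transformers import SentenceTransformer
from umap import UMAP

# Embed the documents once and reduce them to 2 dimensions for plotting
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=False)
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric="cosine").fit_transform(embeddings)

topic_model.visualize_hierarchical_documents(docs, hierarchical_topics,
                                             reduced_embeddings=reduced_embeddings)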

We can also compute the probabilities of the topics found in a document. In order to do so, we have to set calculate_probabilities to True (it is disabled by default because computing the full distribution can be quite expensive). Then, we use the probabilities variable returned by transform() or fit_transform() to understand how confident BERTopic is that certain topics can be found in a document:

Topic probability distribution
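For example, the distribution for the first document can be plotted with visualize_distribution:

# probs comes from fit_transform() (or transform()) with calculate_probabilities=True
topic_model.visualize_distribution(probs[0])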

Sometimes I have found it useful to extract the probabilities assigned to each topic in order to output, for each document, the list of topics whose probability is above a certain threshold, instead of just the single topic provided by default in the topics vector.
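A minimal sketch of this idea, assuming probs was returned with calculate_probabilities=True and using an arbitrary threshold of 0.05:

threshold = 0.05  # arbitrary cut-off, tune it for your use case

# For each document, keep every topic whose probability exceeds the threshold
topics_per_doc = [[topic_id for topic_id, prob in enumerate(doc_probs) if prob > threshold]
                  for doc_probs in probs]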

Find topics, reduce topics

BERTopic uses HDBSCAN for clustering the data, so you don’t need to specify the number of clusters. However, the resulting number of topics can sometimes be very high, for example because many fine-grained topics are recognizable. In these cases, we can try to reduce the number of topics that have been created. You basically have three options:

  1. Statically define the number of topics you want:
from bertopic import BERTopic 
topic_model = BERTopic(nr_topics=20)

2. Reduce the number of topics after having trained a BERTopic model. The advantage of doing so is that you can decide the number of topics after knowing how many are created.

topic_model = BERTopic() 
topics, probs = topic_model.fit_transform(docs)
# Reduce topics
new_topics, new_probs = topic_model.reduce_topics(docs, topics, nr_topics=30)

3. Merge selected topics after inspecting the hierarchy and the merged topics representation at each hierarchy level:

topics_to_merge = [1, 2] 
topic_model.merge_topics(docs, topics, topics_to_merge)

Describe your topics with the right words

The topics that are extracted from BERTopic are represented by words. But if you are not satisfied with the resulting representation you can use the function update_topics to update the topic representation with new parameters for c-TF-IDF.

You can play around with n_gram_range or use your own custom sklearn.feature_extraction.text.CountVectorizer and pass that instead:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)

You can also directly generate a new list of topic labels with generate_topic_labels and define the number of words, the separator, the word length, etc.:

topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=10,
                                                 separator=", ")

We can then pass our new topic_labels to set_topic_labels so that they can be used across most visualization functions:

topic_model.set_topic_labels(topic_labels)

The great advantage of passing custom labels to BERTopic is that you can use zero-shot classification models to fine-tune the labeling. For example, let’s say you have a set of potential topic labels that you want to use instead of the ones generated by BERTopic. You could use the bart-large-mnli model to find which user-defined labels best represent the BERTopic-generated labels:

from transformers import pipeline 
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# A selected topic representation
# 'god jesus atheists atheism belief atheist believe exist beliefs existence'
sequence_to_classify = " ".join([word for word, _ in topic_model.get_topic(1)])
# Our set of potential topic labels
candidate_labels = ['cooking', 'dancing', 'religion']
classifier(sequence_to_classify, candidate_labels)
#{'labels': ['cooking', 'dancing', 'religion'],
# 'scores': [0.086, 0.063, 0.850],
# 'sequence': 'god jesus atheists atheism belief atheist believe exist beliefs existence'}

(Semi)-supervised or guided, if possible

There is a variation of BERTopic that allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels you might already have.

In semi-supervised topic modeling, we only have labels for some of our documents. The documents for which we have labels are used to guide BERTopic towards extracting topics consistent with those labels. The documents for which we do not have labels are assigned a -1.

# categories and category_names come from the 20 newsgroups dataset
# (the 'target' and 'target_names' fields of fetch_20newsgroups)
labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
                 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                 'comp.windows.x']
indices = [category_names.index(label) for label in labels_to_add]
y = [label if label in indices else -1 for label in categories]
topic_model = BERTopic(verbose=True).fit(docs, y=y)

Instead, in supervised topic modeling, we have labels for all our documents.

topic_model = BERTopic(verbose=True).fit(docs, y=categories)

However, this does not mean that only topics for these categories will be found. BERTopic is likely to find more specific topics than those you have already defined. This allows you to discover previously unknown topics!

There is another option for feeding some preliminary knowledge into the topic model: guided topic modeling. It allows the user to define a set of seed topic representations that are likely to appear in the documents.

For example, imagine you have an IT Service Management System. You already know that many users have issued some tickets related to a login bug. We can create a seed topic representation containing the words bug, login, password, and username. By defining those words, a guided topic modeling approach will try to converge at least one topic to those words.
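A sketch of how such a seed topic would be passed to BERTopic, using the words from the example above:

from bertopic import BERTopic

# One seed topic describing the login issue; several seed lists can be provided
seed_topic_list = [["bug", "login", "password", "username"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)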

Topics evolve

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. For example, in 1995 people may talk differently about environmental awareness than those in 2022. Although the topic itself remains the same, environmental awareness, the exact representation of that topic might differ.

We first need to fit BERTopic in the standard way. Thus, a general topic model will be created. Then, for each topic and timestep, we calculate the c-TF-IDF representation. This will result in a specific topic representation at each timestep.
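A rough sketch, assuming you have one timestamp per document (the signature of topics_over_time has changed slightly across BERTopic versions):

# timestamps: one date (or any sortable value) per document
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

topics_over_time = topic_model.topics_over_time(docs, timestamps)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)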

Conclusions

In this article I just scratched the surface of all the possibilities that BERTopic offers. I have successfully used it in different projects and found it really impressive: its high-level interface lets you quickly run many different experiments, and you can then refine your models with all the available customizations and variations.

Moreover, new features are frequently released, and the author, Maarten Grootendorst, also provides great support in the Issues section of the GitHub repo.

I definitely recommend exploring BERTopic in your next NLP project that may need a topic modeling step!

References

BERTopic documentation

BERTopic GitHub repo

Transformers paper

BERT paper
