Automatic Topic Labelling Using NLP


Overview

Natural Language Processing (NLP) is a relatively new and rapidly evolving field in Machine Learning. It enables computers to interpret, manipulate, and generate human language, and we encounter it everywhere in our daily lives, from voice assistants and spam detection to autocompletion and more.

Among the many applications of NLP is topic modelling. It is an unsupervised machine learning approach that scans a group of text documents to reveal topics based on patterns in words, phrases or even semantics. This can be useful if you have a large text and want to know what it’s about without having to read through all of it!

In this article we will talk about how to extract and (more interestingly) automatically label topics from a large collection of news articles, using a combination of several common NLP techniques and state-of-the-art text models.

Modelling topics

To group articles into topics, we use the BERTopic topic modelling algorithm, which is an excellent package by Maarten Grootendorst:

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

… and continue from where his example ends, summarised by the code below:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# download the 20 Newsgroups dataset (~18,000 news articles)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# fit the topic model to the data, assigning a topic to each article
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

Here the BERTopic algorithm has separated the entire dataset of over 18,000 articles into topics, with topic sizes ranging from several thousand to a few dozen articles.
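To peek at the result, BERTopic provides get_topic_info() for an overview of the topics and their sizes, and get_topic() for the keywords of a single topic:

# overview of all topics, sorted by size
print(topic_model.get_topic_info().head(10))

# keywords (and their c-TF-IDF scores) for a specific topic
print(topic_model.get_topic(13))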

For the sake of this article, we focus on two different topics, represented by the following keywords:

Two topics and their associated keywords.

This keyword representation is useful, and a human could make a pretty good guess as to what each topic is about. For example:

  • Topic 13 = “The Solar System”
  • Topic 56 = “Catholicism”

… but we would need to skim through the articles to know for sure.

Let us leverage more of the NLP tool belt to see if we can automatically label these topics without human intervention!

Text Summarisation

Compressing a document into a single label is something journalists do every day when creating headlines. Using this text-label training data, scientists have been able to train large text models (such as Google’s “T5” transformer model) on the specific task of generating a one-line heading for a given article. We will use Michau’s popular headline generation model.
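A minimal sketch of how such a model can be loaded and called, assuming the Hugging Face checkpoint Michau/t5-base-en-generate-headline (which expects inputs prefixed with “headline: ”):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# assumed checkpoint id for Michau's headline model on the Hugging Face Hub
MODEL_NAME = "Michau/t5-base-en-generate-headline"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_headline(text):
    # the model is trained with a "headline: " task prefix
    inputs = tokenizer("headline: " + text, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4,
                                early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)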

However, there is one major challenge: Most text models are trained on sequences with a max length of 512 tokens!

This means that results are not guaranteed for longer sequences, and in practice you will generally see very high memory usage combined with poor results when generating headlines from too much text. The number of tokens roughly corresponds to the number of words, and the articles in the dataset are typically around 150 words in length, but can be much longer.

Number of articles of a given token length in the 20 Newsgroup dataset.
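The distribution above can be reproduced with a quick token count, a sketch assuming the same tokenizer as the headline model:

import matplotlib.pyplot as plt

# tokenise each article and count the tokens (no truncation, we want true lengths)
token_lengths = [len(tokenizer.encode(doc, truncation=False)) for doc in docs]

plt.hist(token_lengths, bins=100, range=(0, 2000))
plt.xlabel("Token length")
plt.ylabel("Number of articles")
plt.show()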

As a topic typically contains tens to hundreds of articles, it is not possible to generate a headline from the articles directly. We first need a method to somehow compress the information in the topics.

Strategy

Luckily, the aforementioned T5 model is already trained on the summarisation task (another common use for text-to-text models), so we propose the following method:

  1. Select articles at random from the topic
  2. Summarise the articles and combine their summaries (making sure the length of the combined summaries is less than the token limit)
  3. Generate a title from the combination of summaries

Selecting articles at random might sound a bit iffy, as we run the risk of missing out on valuable information. We only have “space” to summarise about 15 articles before we again break the 512-token limit.

However, placing trust in our topic model, we assume that the articles in the same topic contain roughly the same information, which will lead to roughly the same topic labels, independent of the random selection of articles.

… we will get back to this assumption shortly.

Label Generation

Let us try this approach to generate more concrete labels for our two topics, currently known as the “space-y topic 13” and the “biblical topic 56”.
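A sketch of the pipeline (t5-base as the summarisation model and the sampling details are assumptions; generate_headline is the helper sketched earlier, and topics maps each article to its topic):

import random
from transformers import pipeline

# summarisation model; t5-base is an assumption, the approach only requires a T5 model
summarizer = pipeline("summarization", model="t5-base")

def label_topic(topic_id, n_articles=15):
    # 1. select articles at random from the topic
    topic_docs = [doc for doc, t in zip(docs, topics) if t == topic_id]
    sample = random.sample(topic_docs, min(n_articles, len(topic_docs)))

    # 2. summarise each article; short summaries keep the combined
    #    text under the 512-token limit (15 articles x ~30 tokens)
    summaries = [summarizer(doc, max_length=30, min_length=5,
                            truncation=True)[0]["summary_text"]
                 for doc in sample]

    # 3. generate a one-line title from the combined summaries
    return generate_headline(" ".join(summaries))

print(label_topic(13))
print(label_topic(56))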

Running this gives the following output:

Automatically generated labels for two topics.

Not too bad! We have now generated concrete labels for our two topics, completely unsupervised! But how can we know if they are actually a good representation of the topics, or simply lucky guesses?

Iterative Labelling

To reiterate: we are selecting random articles from each topic to generate a label, based on the assumption that the articles within a topic are pretty similar in content in a label-generating sense. This is a bold assumption, and for the most part not true (especially for the larger 100+ article topics). In practice, the labels can vary quite a lot depending on the random selection of articles. However, we can use this to our advantage to gain an even deeper understanding of our topics.

Instead of generating a single topic label, we will generate a bunch to gauge the topic “uniformity” and select the best alternative.
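Generating candidates is just a matter of repeating the random sampling (the number of candidates here is an arbitrary choice):

# generate several candidate labels per topic by re-sampling articles
n_candidates = 10
labels_13 = [label_topic(13) for _ in range(n_candidates)]
labels_56 = [label_topic(56) for _ in range(n_candidates)]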

Label suggestions for topic 13.
Label suggestions for topic 56.

Clearly there is a lot of variation between the generated labels. While topic 56 seems pretty well defined, topic 13 is all over the place (although some of the labels raise some good questions).

The Solar System — Is it a Solar System?

Topic Uniformity

Let us now use the built-in sentence transformer from the BERTopic model to embed the labels into vector representations, and measure the cosine similarity between each pair. This creates a symmetric “similarity matrix” of our labels that can be used to gauge the overall topic similarity based on our list of labels, and gives us a method to select the best alternative.
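A sketch of this step; for simplicity it instantiates all-MiniLM-L6-v2 directly, which is BERTopic's default English sentence transformer:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_matrix(labels):
    # embed each label and compute pairwise cosine similarities
    embeddings = embedder.encode(labels)
    return cosine_similarity(embeddings)

sim_13 = similarity_matrix(labels_13)
sim_56 = similarity_matrix(labels_56)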

Matrices showing the cosine similarity between each pair of generated labels for the two topics.

We choose the top label based on which has the most in common with the others (i.e. the row/column in the similarity matrix with the biggest sum). As a result, the selected labels are often short and general, as opposed to long and sometimes oddly specific.
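In code, using the matrices from above:

# pick the label that is most similar to all the others
best_label = labels_13[sim_13.sum(axis=0).argmax()]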

Earlier we saw how the labels for the space topic varied a lot, while the biblical topic labels were very consistent. We can assume that the label we select will be better if all the labels are similar, as opposed to picking one from a list of completely different labels. Therefore we will assign a similarity score based on the label uniformity. We define the similarity score as the average similarity between each pair of labels, found by taking the mean of the upper/lower triangular part of the similarity matrix.
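With NumPy, this is the mean of the entries above the diagonal:

# mean pairwise similarity, excluding the diagonal (self-similarity)
upper = np.triu_indices_from(sim_13, k=1)
similarity_score = sim_13[upper].mean()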

Not only is this score a metric for how “well-defined” a topic is, but it also acts as a proxy for how much we can trust the selected label. Insight into the confidence of your predictions is as important as the predictions themselves.

Combining the keywords, best topic label, similarity score and a confidence indicator in a single plot:

Final result displaying all the extracted topic information.

… we get immediate insight into our two topics, and behold, they make a lot of sense!

To summarise…

Models exist that can generate headlines from a small amount of text. This is very cool, but real-world text data is usually way larger than these models allow, whether it is a real news article or a collection of smaller articles under a topic. For the latter, we have demonstrated an approach to navigate around the issue and have been able to consistently generate concise one-line labels.

These labels aren’t always perfect, but by also introducing a way of scoring them, we gain valuable insight into whether or not the labels can be trusted, and into the uniformity of the topics themselves.
