Labeling comments with unsupervised topic modeling methods

Bob Tourne
randstad tech & touch
6 min read · Sep 15, 2022

This blog post describes an exploratory method that we developed to automatically label text data with interpretable topics. The examples we show originate from a monthly employee survey, from which we collected 180,000 open answers over time.

Text data gathered through surveys (e.g., answers to open-ended questions) can be richer in information than closed-form answers (e.g., Likert scales), yet it is often used to a far lesser extent. This is because analyzing unstructured text data usually means manually reading comments and labeling each of them for later use in quantitative analysis: a tedious task if the number of answers runs into the hundreds, and a practically impossible one if it runs into the tens or hundreds of thousands. Luckily, we have computers to assist us with this task.

An issue with text data is that computers cannot understand the semantic meaning of words in the same way that humans do. More traditional text mining approaches use word counts to compare different bodies of text with each other. Methods such as TF-IDF or LDA use these word counts to assess which words uniquely describe an answer and may therefore be regarded as the topic of that answer. However, word count methods do not work well for corpora with short texts and a large number of topics.
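To give some intuition for the word-count approach, here is a minimal sketch using scikit-learn's TfidfVectorizer on a few made-up answers; the corpus is purely illustrative and not part of our dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# A few made-up answers, purely for illustration
answers = [
    "more salary and better compensation",
    "communication with my manager could be better",
    "the planning tool keeps crashing",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(answers)          # shape: (n_answers, vocab_size)
vocab = np.array(vectorizer.get_feature_names_out())

# The highest-weighted words per answer are the ones that describe it most uniquely
for i, answer in enumerate(answers):
    row = tfidf[i].toarray().ravel()
    top_words = vocab[row.argsort()[::-1][:3]]
    print(answer, "->", list(top_words))
```

For long, varied documents this works reasonably well, but for short answers the word counts are too sparse to separate a large number of topics.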

State-of-the-art techniques infer the relationships between words by looking at which words often appear next to each other. With these techniques we can represent the semantic relationships between words or entire documents as embeddings (i.e., vectors in a semantic space). The fact that the relationships among words and documents can be represented numerically makes embeddings useful for a wider range of ML techniques.

Embeddings for words and complete answers can be generated jointly. This implies that we can assess which words are most closely related to a certain answer. In the example below, the idea of joint word and document embeddings is visualized in two dimensions. We can see that two general topics appear in the data (deep learning and statistics) and that words as well as larger bodies of text with the same semantic meaning cluster together.

Image reprinted from https://github.com/ddangelov/Top2Vec
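As a rough sketch of how such joint word and document embeddings can be trained, the snippet below uses gensim's Doc2Vec (the model underlying Top2Vec); the preprocessing and hyperparameters are illustrative assumptions rather than our exact setup.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# `answers` is assumed to be a list of raw survey answers (strings)
corpus = [
    TaggedDocument(words=simple_preprocess(answer), tags=[i])
    for i, answer in enumerate(answers)
]

# Train word and document vectors in the same 300-dimensional space
model = Doc2Vec(corpus, vector_size=300, window=5, min_count=2, workers=4, epochs=40)

doc_vector = model.dv[0]          # embedding of the first answer
word_vector = model.wv["salary"]  # embedding of an individual word (if it is in the vocabulary)
```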

Note that by default these embeddings are 300-dimensional, which not only makes them impossible to visualize as nicely as the example above, but also makes the data sparse (i.e., most dimensions carry little information for any given point). It is therefore advisable to reduce the dimensionality of your embeddings with a method such as UMAP.
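A minimal sketch of that reduction step with the umap-learn package is shown below; the parameter values mirror common defaults for this kind of pipeline rather than our exact configuration.

```python
import numpy as np
import umap

# Continuing from the Doc2Vec sketch above: stack the document vectors into one array
doc_embeddings = np.vstack([model.dv[i] for i in range(len(corpus))])

# Reduce 300 dimensions to a handful while preserving local neighbourhood structure
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
reduced_embeddings = reducer.fit_transform(doc_embeddings)
```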

Finding topics in the embedding space

After creating the joint embeddings and performing dimensionality reduction, we could take two approaches.

  1. Unsupervised approach: use the answers as a starting point and infer topics for each answer based on the individual words that show high similarity to that answer.
  2. Domain expertise approach: first use domain expertise to pre-define a list of frequently occurring topics in the data, then use these topics as a starting point and match them with answers by searching the embedding space.

The unsupervised approach would be suitable in situations where: (i) we want to avoid any (subjective) human intervention, (ii) pre-defining a list of topics is impractical, or (iii) there is a stream of data that might contain new and unforeseen topics. The downside of this approach, however, is that the resulting topics are more ambiguous: they would merely be collections of single words that show similarity to the text, and they would need to be manually interpreted and processed at a later stage to be useful for further analysis.

The example below shows a randomly sampled answer from our dataset together with its 15 most similar words. As you can see, the words show similarity to the answer but are still ambiguous.
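With joint embeddings, retrieving those words comes down to a nearest-neighbour query in the shared space; a sketch using the gensim model from the earlier snippets:

```python
import random

# Find the 15 words closest to a randomly sampled answer in the shared embedding space
doc_id = random.randrange(len(corpus))
similar_words = model.wv.most_similar(positive=[model.dv[doc_id]], topn=15)

for word, score in similar_words:
    print(f"{word}: {score:.2f}")
```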

For our project, we chose the second approach: use domain expertise to pre-define a list of frequently occurring topics and match answers with these topics by searching the embedding space. We found the following topics to occur frequently in our dataset:

Matching answers with topics was done by measuring the cosine similarity between the embedding vectors of our topics and the embedding vectors of each of our answers. One problem that we encountered was that not all topics that we searched for yielded similarity scores within the same range, so a hard-coded threshold would be too strict for some topics and too liberal for others (to give some intuition for this, the maximum cosine similarity scores varied between 0.24 and 0.47).
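A sketch of this matching step is shown below, reusing the Doc2Vec model and document embeddings from the earlier snippets; the topic list is illustrative, and for simplicity the similarity is computed in the original embedding space (it could equally be computed on the reduced embeddings).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative topic list; in practice these came from domain expertise
topics = ["salary", "communication", "planning", "workload"]

# Embed each topic in the same space as the answers (here via Doc2Vec's infer_vector)
topic_embeddings = np.vstack([model.infer_vector(topic.split()) for topic in topics])

# similarity_matrix[i, j] = cosine similarity between topic i and answer j
similarity_matrix = cosine_similarity(topic_embeddings, doc_embeddings)

# Note how the score ranges differ per topic, which is why a single threshold fails
print(similarity_matrix.max(axis=1))
```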

The solution that we came up with was to rank all similarity scores between a topic and all answers and locate the knee point (i.e., the point of maximum curvature, where the scores start to drop off) on the resulting line, as presented in the image below. All scores above this point are distinguishably higher than the rest of the data, and those answers can therefore be safely labeled with the topic.
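The kneed package can locate such a point automatically; below is a sketch for a single topic, continuing from the similarity matrix above. The sensitivity parameter S is an assumed placeholder, not our actual setting.

```python
import numpy as np
from kneed import KneeLocator

# Rank the similarity scores for one topic from high to low
topic_idx = 0
scores = np.sort(similarity_matrix[topic_idx])[::-1]
ranks = np.arange(len(scores))

# The ranked curve is convex and decreasing; S controls how readily a knee is declared
knee = KneeLocator(ranks, scores, curve="convex", direction="decreasing", S=5.0)

# knee.knee is the rank at which the scores level off (None if no knee is found)
if knee.knee is not None:
    threshold = scores[knee.knee]                            # topic-specific cutoff
    labeled_mask = similarity_matrix[topic_idx] >= threshold # answers labeled with this topic
```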

With this approach (and conservative settings for locating the knee point) we were able to label roughly 15% of our dataset with an accuracy between 90% and 100% (estimated at face value). This may not seem impressive, but in essence we have just created labeled training data, which we can use to train a classifier that labels the topics of the remaining unlabeled answers.

Text classification for labeling unlabeled answers

With these labeled instances, a classifier can be trained to predict topics for the yet unlabeled comments. In our experiment, we trained a Naive Bayes classifier on the labeled data for each of our predefined topics. Each classifier was trained to predict the probability that a yet unlabeled comment belongs to its respective topic. The topic with the highest predicted probability, if that probability exceeded a threshold of 0.95, was assigned to the comment; otherwise the comment remained unlabeled.
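A sketch of this second stage is shown below, assuming TF-IDF features and scikit-learn's MultinomialNB with one binary classifier per topic; the post does not pin down the exact Naive Bayes variant or feature representation, so treat this as one plausible setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# `labeled_texts`, `labeled_topics` (one topic string per text) and `unlabeled_texts`
# are assumed to come from the knee-point labeling stage above
vectorizer = TfidfVectorizer(min_df=2)
X_labeled = vectorizer.fit_transform(labeled_texts)
X_unlabeled = vectorizer.transform(unlabeled_texts)

# One binary classifier per predefined topic
classifiers = {}
for topic in topics:
    y = np.array([t == topic for t in labeled_topics])
    classifiers[topic] = MultinomialNB().fit(X_labeled, y)

# Probability of belonging to each topic, for every unlabeled comment
probas = np.column_stack(
    [clf.predict_proba(X_unlabeled)[:, 1] for clf in classifiers.values()]
)

# Assign the most probable topic only when the classifier is confident enough
best = probas.argmax(axis=1)
predicted = [topics[b] if probas[i, b] > 0.95 else None for i, b in enumerate(best)]
```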

The table below shows the accuracy scores per topic, derived from 10-fold cross-validation. Note that these scores might not be fully representative of the model's accuracy on the yet unlabeled data, because only a small proportion of the total dataset could be used at this stage.
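Those per-topic scores can be reproduced with a short sketch like the following, with variable names carried over from the classifier sketch above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

for topic in topics:
    y = np.array([t == topic for t in labeled_topics])
    scores = cross_val_score(MultinomialNB(), X_labeled, y, cv=10, scoring="accuracy")
    print(f"{topic}: {scores.mean():.2f} (+/- {scores.std():.2f})")
```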

With this second iteration we were able to successfully label roughly 50% of our comments with a total accuracy of roughly 80% to 90% (again, estimated at face value). Keeping in mind that a significant proportion of the comments (around 25%) is not of interest to label at all, and that the predefined list of topics may not be exhaustive, these results are adequate to take this model further into production. Below are some examples of labeled answers.
