Topic modeling for surveys or how we solved the problem of text clustering with LDA

Natalia Tsarkova
Published in Exness Tech Blog
12 min read · Nov 10, 2022

Surveys are a powerful tool for gathering feedback and information. But what if you have hundreds or even thousands of responses given as free-text comments? In this article, we describe an ML approach we used at Exness to analyze the results of the Employee engagement survey.

Introduction and problem description

Surveys are a powerful tool for gathering feedback and information about customers’ and employees’ desires, needs, and opinions. Indeed, what could be better and more efficient than asking someone for honest feedback and using it to improve your business practices?

Surveys can be of different types and have a wide variety of goals to fulfill business needs. Sometimes surveys give respondents an opportunity to leave a free-text comment — a great source of valuable information. But what if you have dozens, hundreds, or even thousands of responses? How do you identify positive and negative feedback in this amount of data? How do you understand the topic and main message of each response? This is where machine learning comes to the rescue, allowing us to extract valuable insights.

There are ML approaches that don’t require complex neural networks or manual work, and one of them is described in this article. It can also be used for any kind of text data that has to be grouped by topic or analyzed semantically, not just survey responses.

Exness use case

Every year, the Exness People Division conducts an Employee engagement survey. The survey contained not only numerical questions but also text questions, which led to almost 4000 text responses divided into 8 sections. Each section represents a particular question, for instance, “What can the management team do differently?” or “What can Exness improve to make you more satisfied with working at the company?”. In this article, we use only one of the eight sections (questions) as an example.

Our data

The answers to these questions are raw text data. Most of the answers are presented in English, but a significant number (about 10% of the answers) are given in Russian. Also, 18 responses contained Chinese characters. For consistency, the answers in Russian and Chinese were automatically translated using Python tools via https://translate.google.com/.

Our approach

In order to research the data, the following methods and approaches were used:

  • Sentiment analysis for the text responses of the survey.
  • Word Clouds generation for each survey question separately and for the entire dataset.
  • Unsupervised machine learning applied to text data for each survey question separately.

Let’s have a look in more detail. There is an example of code in Python at the end of this article, so feel free to try it with your own data.

Sentiment analysis

Sentiment analysis is a machine learning tool that analyzes texts for polarity, from positive to negative. In simple words, a machine learning model provides a score of how positive or negative the overall mood of the text is. Our goal was to calculate the number of positive and the number of critical responses. The proportion of these two groups shows overall employee satisfaction. By tracking it over time with the next surveys we can understand changes in employees’ opinions and attitudes.

For the current survey, the People Division provided us with manually labeled data. This manual labeling contained two categories to which a response could belong: positive and criticizing. That allowed us to calculate the accuracy of the automatic sentiment analysis.

The dataset was too small for supervised techniques (fewer than 1000 observations), so we chose unsupervised sentiment analysis and used SentimentIntensityAnalyzer from VADER (available in the NLTK package).

We defined a custom threshold of a score due to the characteristics of the initial data. Usually, such tools are used to assess comments on social networks or reviews of films, which are direct and straightforward. When giving feedback at work, people tend to use more neutral and positive speech patterns.

  • Data: all responses translated into English
  • Preprocessing: none, raw text data in English
  • “True” values: manual labeling provided by the People Division

To evaluate the results, the confusion matrix below shows the percentages of manually and automatically labeled answers:

Result metrics:

  • accuracy = 0.644
  • precision = 0.650
  • recall = 0.644
  • f1-score = 0.646

According to the achieved values of the metrics, the result of automatic marking of the sentiment of the text can be considered quite good. The chosen threshold values can be used in the next surveys, which will help to understand the ratio of positive and critical responses. At the same time, it will help to avoid the need for manual markup.
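The threshold selection described above can be illustrated with a small sweep over candidate values. This is only a sketch under stated assumptions: the scores and labels are made-up toy data, accuracy is used as the selection criterion for simplicity, and the helper names accuracy_at and best_threshold are hypothetical rather than taken from the original code.

```python
# Hypothetical data: VADER-style "pos" scores and manual labels
# (1 = positive, 0 = criticizing). Not the survey data from the article.
scores = [0.05, 0.12, 0.18, 0.25, 0.31, 0.40, 0.08, 0.22, 0.35, 0.15]
labels = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]


def accuracy_at(threshold, scores, labels):
    """Share of responses whose thresholded score matches the manual label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest accuracy (first wins ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Candidate thresholds 0.05, 0.10, ..., 0.45
candidates = [round(0.05 * i, 2) for i in range(1, 10)]
chosen = best_threshold(scores, labels, candidates)
print(chosen, accuracy_at(chosen, scores, labels))
```

With real data, the same sweep could optimize F1 or another metric instead of accuracy, and the chosen value should be re-checked on every new survey.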

Topic modeling

The next step after sentiment analysis is to identify the key topics of the answers for each question. It can be done in many ways, mostly from the sphere of unsupervised learning. In this research, we used topic modeling to find the main topics in the responses.

Topic modeling lets us analyze large volumes of unlabeled text by grouping documents by topic. It can be thought of as a kind of text clustering. For topic modeling, we used LDA (Latent Dirichlet Allocation; Blei, Ng and Jordan, 2003) on the bag-of-words representation of the survey answers. Two main assumptions make LDA a great tool for topic modeling in terms of interpretability:

  1. Documents with similar topics use similar groups of words
  2. Topics of documents, which are called latent topics, can be found by searching for groups of words that frequently occur together in documents across the corpus.

And we can actually think of these two assumptions mathematically: we can say that documents are probability distributions over some underlying latent topics, and then topics themselves are probability distributions over words. So LDA represents documents as mixtures of topics that spit out words with certain probabilities.
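To make the "mixture of topics" idea concrete, here is a tiny worked example. It is an illustration only: the two topics, their word probabilities, and the document weights are invented numbers, and real LDA learns these distributions from the corpus instead of taking them as input.

```python
# Hypothetical topics: each one is a probability distribution over a tiny vocabulary.
topics = {
    "environment": {"support": 0.5, "culture": 0.4, "hybrid": 0.1},
    "work_mode": {"support": 0.1, "culture": 0.2, "hybrid": 0.7},
}

# Hypothetical document: a probability distribution over the topics.
doc_topic_weights = {"environment": 0.75, "work_mode": 0.25}


def word_probability(word, doc_topic_weights, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topic_weights.items()
    )


for word in ("support", "culture", "hybrid"):
    print(word, round(word_probability(word, doc_topic_weights, topics), 3))
```

A document that leans towards the "environment" topic is therefore more likely to contain "support" than "hybrid", which is exactly the signal LDA uses in reverse when it infers topics from observed words.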

Thus, an overall view of the approach is provided below.

Methodology reference

  1. Text preprocessing steps were applied:
       • Lower casing
       • Removal of punctuation
       • Removal of numbers
       • Tokenization
       • Lemmatization
       • Removal of stopwords, including a custom list of stopwords
  2. Word Clouds for visualization
  3. LDA for topic modeling
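The preprocessing steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the pipeline used in the article: the stopword list and the sample responses are made up, lemmatization is skipped because it needs a language model, and punctuation is replaced with spaces rather than deleted.

```python
import re
from collections import Counter

# Hypothetical stopword list; the article uses Gensim's list plus custom words.
STOPWORDS_DEMO = {"the", "and", "is", "i", "all", "yes", "good", "of", "a", "my"}


def preprocess(text, stopwords=STOPWORDS_DEMO):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)
    return [token for token in text.split() if token not in stopwords]


responses = ["I like the hybrid mode and my 2 teams!", "All good", "Yes"]
tokenized = [preprocess(response) for response in responses]

# Responses that consist only of stopwords become empty and are dropped.
non_empty = [tokens for tokens in tokenized if tokens]

# Bag-of-words representation: word counts per remaining response.
bags = [Counter(tokens) for tokens in non_empty]
print(non_empty)
```

The empty-response filter mirrors why the number of usable answers shrinks after preprocessing.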

Use case example

Let’s have a look at the results of the described approach, using a particular question as an example. The question was “What do you most like about working at Exness?”, and there were 597 responses. After text preprocessing, 568 answers were left because some of them contained only stop words, like “All good”, “Yes”, etc.

Then we built the word cloud based on single words as well as combinations of two words, called bigrams:

As can be seen, the most frequent phrases are: “hybrid mode”, “working environment”, “work-life balance”, “culture support”, “care”, and “multicultural environment”. This kind of visualization can bring some insights and information about the answers, but there’s still no segmentation of answers based on the main topics. To solve this problem, LDA was used.
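For intuition about what "single words as well as bigrams" means, the counts behind such a cloud can be approximated with plain counting. This is a rough sketch on invented token lists; the wordcloud library applies its own collocation detection, so its output will not match these raw counts exactly.

```python
from collections import Counter

# Hypothetical preprocessed responses (already tokenized).
docs = [
    ["hybrid", "mode", "flexibility"],
    ["hybrid", "mode", "working", "environment"],
    ["working", "environment", "culture"],
]


def bigrams(tokens):
    """Return adjacent word pairs of one response as space-joined strings."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


unigram_counts = Counter(token for doc in docs for token in doc)
bigram_counts = Counter(pair for doc in docs for pair in bigrams(doc))

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Bigrams are built per response, so the last word of one answer is never paired with the first word of the next.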

The LDA model was built with LdaModel from gensim and then visualized using the LDAvis system (LDAvis: A method for visualizing and interpreting topics, Sievert & Shirley, 2014). The result of the modeling is provided below and can be interpreted in the following way. The visualization has two basic pieces:

  • The left panel presents a global view of the topic model, and answers the questions “How prevalent is each topic?” and “How do the topics relate to each other?”. In this view, the topics are represented by circles in the two-dimensional plane whose centers are determined by computing the distance between topics. Each topic’s overall prevalence is encoded using the areas of the circles, where the topics are sorted in decreasing order of prevalence.
  • The right panel of the visualization depicts a horizontal bar chart whose bars represent the individual terms that are the most useful for interpreting the currently selected topic on the left, and allows users to answer the question, “What is the meaning of each topic?”.

A pair of overlaid bars represent both the corpus-wide frequency of a given term as well as the topic-specific frequency of the term. The left and right panels of the visualization are linked such that selecting a topic (on the left) reveals the most useful terms (on the right) for interpreting the selected topic. In addition, selecting a term (on the right) reveals the conditional distribution over topics (on the left) for the selected term. This kind of linked selection allows users to examine a large number of topic-term relationships in a compact manner.
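The "distance between topics" behind the left panel can be illustrated with Jensen-Shannon divergence, which the LDAvis paper describes as its default inter-topic distance before projecting the topics onto two dimensions. The sketch below only computes the pairwise divergence; the three topic distributions are invented, and the projection step is left out.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical topic-word distributions over the same four-word vocabulary.
topic_a = [0.5, 0.3, 0.2, 0.0]
topic_b = [0.4, 0.4, 0.2, 0.0]
topic_c = [0.0, 0.1, 0.1, 0.8]

print(js_divergence(topic_a, topic_b))  # similar topics: small value
print(js_divergence(topic_a, topic_c))  # dissimilar topics: larger value
```

Topics whose circles overlap on the plot are those with a small divergence, meaning they share much of their vocabulary.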

The last part of the visualization is the option to change the lambda value, which controls how terms are ranked: lowering lambda gives more weight to terms that are specific to the selected topic, while lambda = 1 ranks terms purely by their frequency within the topic.
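The effect of lambda is easier to see with the relevance formula from the LDAvis paper: relevance = lambda * log p(word | topic) + (1 - lambda) * log(p(word | topic) / p(word)). Below is a minimal sketch of that formula; the two terms and their probabilities are invented for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Term relevance as defined by Sievert & Shirley (2014)."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical terms: (p(word | topic), p(word) in the whole corpus).
terms = {
    "work": (0.10, 0.09),     # frequent everywhere
    "cyprus": (0.03, 0.004),  # rarer overall, but concentrated in this topic
}


def rank(terms, lam):
    """Order terms from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda term: relevance(*terms[term], lam), reverse=True)


print(rank(terms, 1.0))  # ranking by in-topic probability only
print(rank(terms, 0.0))  # ranking by how specific the term is to the topic
```

Intermediate lambda values blend the two rankings, which is why moving the slider reorders the bars on the right panel.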

Overall view of clusters and words
First cluster
Second cluster
Third cluster

For instance, the top 10 keywords by topic with lambda = 1 will be the following:

Thus, the three topics can be interpreted in the following way according to the plot:

  • For the first topic, the most specific words are care, support, benefit, flexibility, help, professional, grow, environment, culture, and others. This can be interpreted as the group of answers in which respondents liked the Exness environment: it is supportive, provides help and benefits, is flexible, and helps people grow professionally.
  • The second topic has specific words: hybrid, appreciate, Cyprus, multicultural, structure, package, reward, program, event, and others. That can be interpreted as a focus on different working modes, such as hybrid and remote options, relocation to Cyprus, and the overall multicultural world of Exness.
  • And for the third topic the words are colleague, position, relationship, inspire, trading, question, internal, and area. Thus the cluster can be described as relationships with colleagues and the trading sphere.

Code example

Imports and functions

Let’s move to the most practical part of the article: code examples with explanations. As always, we start with imports:

import pandas as pd

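# for sentiment analysis and text preprocessing
# (one-time setup: nltk.download('vader_lexicon') and nltk.download('punkt'))
import spacy
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer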

#for word clouds
from wordcloud import WordCloud, STOPWORDS

#LDA modeling
import gensim
from gensim import corpora
from gensim.models import LdaModel

#Visualisations
import seaborn as sns
import matplotlib.pyplot as plt
import pyLDAvis.gensim_models

#Settings, formatting, stopwords, and Spacy model
from tqdm.auto import tqdm
tqdm.pandas()
sns.set_palette("Accent")
pd.options.display.float_format = "{:,.3f}".format

nlp = spacy.load("en_core_web_sm")
stopwords = gensim.parsing.preprocessing.STOPWORDS

At the next step we define functions that are necessary for text preprocessing:

import re

def has_cyrillic(text):
    return bool(re.search("[\u0400-\u04FF]", text))

def has_chinese(text):
    return bool(re.search("[\u4e00-\u9fff]+", text))

def remove_numbers_and_punctuation(text):
    return re.sub(r'[^A-Za-z ]+', '', text)

def tokenize_words(text):
    return nltk.word_tokenize(text)

def lemmatize_words_spacy(text):
    text = nlp(str(text))  # create a Doc object
    return " ".join([token.lemma_ for token in text])

def remove_stopwords(text, stopwords=stopwords):
    return [word for word in text if word not in stopwords]

Also, there is a function for word clouds visualization:

import random

def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(1, 25)

def show_cloud(df_column, max_words=200, stopwords=STOPWORDS, title=None, suptitle=None, save_name=None, show=False):
    words = " ".join(df_column.tolist())
    wordcloud = WordCloud(
        width=1200,
        height=800,
        max_words=max_words,
        background_color='white',
        stopwords=stopwords,
        random_state=42,
        collocation_threshold=4,
        min_word_length=2,
        min_font_size=16,
    ).generate(words)
    plt.figure(figsize=(16, 8), facecolor=None)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
    plt.axis("off")
    if save_name is not None:
        plt.savefig(save_name)
    if show is True:
        plt.show()

And the last set of functions to define are the ones about LDA modeling:

def lda_preproc(df_column):
    # One token list per response; tokens are separated by spaces and/or commas
    docs = [[tok.strip(",") for tok in doc.split() if tok.strip(",")] for doc in df_column.tolist()]
    dictionary = corpora.Dictionary(docs)
    # Drop words that occur in more than 50% of the documents.
    # no_below is an absolute document count (not a share), so 1 keeps all remaining words.
    dictionary.filter_extremes(no_below=1, no_above=0.5)
    # Build the corpus after filtering so that word ids match the dictionary,
    # and skip responses that became empty after preprocessing
    corpus = [bow for bow in (dictionary.doc2bow(doc) for doc in docs) if bow]
    return corpus, dictionary

def lda_plot_results(corpus, id2word=None, num_topics=10, save_name=None, iterations=50, alpha='asymmetric'):
    lda = LdaModel(corpus=corpus,
                   id2word=id2word,
                   num_topics=num_topics,
                   random_state=42,
                   iterations=iterations,
                   passes=5,
                   alpha=alpha,
                   per_word_topics=False)
    lda_display = pyLDAvis.gensim_models.prepare(lda, corpus, id2word, sort_topics=False)
    if save_name is not None:
        pyLDAvis.save_html(lda_display, save_name + '.html')
    return pyLDAvis.display(lda_display), lda

Sentiment analysis

As can be seen from the code below, the preprocessing required only translating the responses that had non-English characters:

df = pd.read_csv('Engagement Survey.csv')
df['cyrillic'] = df['response'].apply(has_cyrillic)
df['chinese'] = df['response'].apply(has_chinese)

from google_trans_new import google_translator
translator = google_translator()
# 'zh-cn' is Google Translate's language code for Simplified Chinese
df.loc[df['chinese'], 'resp_eng'] = df.loc[df['chinese'], 'response'].apply(lambda x: translator.translate(x, lang_src='zh-cn', lang_tgt='en'))
df.loc[df['cyrillic'], 'resp_eng'] = df.loc[df['cyrillic'], 'response'].apply(lambda x: translator.translate(x, lang_src='ru', lang_tgt='en'))
df.loc[(~df['cyrillic'])&(~df['chinese']), 'resp_eng'] = df.loc[(~df['cyrillic'])&(~df['chinese']), 'response']

The resulting dataset looked as follows:

Then SentimentIntensityAnalyzer was used, with its output decomposed into separate scores. The threshold was tuned manually to divide the answers into positive and criticizing groups. Note that for this task the best metrics were achieved with the “positive” score (pos), but this choice, as well as the threshold, can vary depending on the dataset. Finally, classification_report makes it possible to compare the metrics against the manual labeling.

sid = SentimentIntensityAnalyzer()

df['scores'] = df['resp_eng'].apply(lambda response: sid.polarity_scores(response))

df['positive'] = df['scores'].apply(lambda score_dict: score_dict['pos'])
df['negative'] = df['scores'].apply(lambda score_dict: score_dict['neg'])
df['neutral'] = df['scores'].apply(lambda score_dict: score_dict['neu'])
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df_eng = df
threshold = 0.2
df_eng['estim'] = df_eng['positive'].apply(lambda x: 1 if x >= threshold else 0)
df_eng['manual_estim'] = df_eng['positive_or_not'].apply(lambda x: 1 if x =='Positive' else 0)

from sklearn.metrics import classification_report
pd.DataFrame(classification_report(df_eng['manual_estim'], df_eng['estim'], digits=100, output_dict=True)).transpose()

So the output of the code above is the following:

Word clouds and LDA

Preprocessing

At the first stage of preprocessing, punctuation and numbers were removed, words were lemmatized using the spaCy model, and all words were lowercased.

Then the responses were divided into tokens:

df_lda = df[df['question']=='What do you most like about working at Exness?'].copy()

df_lda['text'] = df_lda['resp_eng']
df_lda['text'] = df_lda['text'].progress_map(remove_numbers_and_punctuation)
df_lda['text'] = df_lda['text'].progress_map(lemmatize_words_spacy)
df_lda['text'] = df_lda['text'].str.lower()
df_lda['text'] = df_lda['text'].progress_map(tokenize_words)

Stop words removal

Stopwords from the Gensim stopword list were removed, as well as words shorter than four characters:

df_lda['text_stopwords_removed'] = df_lda['text'].progress_map(remove_stopwords)
# Drop short words before joining, so that the ", " separator does not count towards word length
df_lda['text_stopwords_removed'] = df_lda['text_stopwords_removed'].apply(lambda tokens: ', '.join(word for word in tokens if len(word) > 3))

An important preprocessing step is removing stopwords that are specific to this particular dataset:

from collections import Counter

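words = " ".join(df_lda['text_stopwords_removed'].tolist())
# Split on whitespace and strip commas so that words from neighboring responses are not glued together
words = [word.strip(",") for word in words.split()]
cnt = Counter(word for word in words if word)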
cnt = pd.DataFrame(dict(cnt), index=[0]).transpose().sort_values([0], ascending=False)
cnt.columns = ['count']

plt.figure(figsize=(8, 6))
sns.lineplot(y=cnt['count'], x=range(1, len(cnt)+1))
plt.xlim(0, 100)
plt.title('Words count')

To find out how many words have to be removed as custom stopwords, we used the plot above: Y-axis shows the frequency of a particular word and X-axis represents the order number of this word (sorted by frequency). As can be seen, the frequency starts to decrease after word number 20 approximately. Thus it was decided to remove the first 20 most frequent words as the most probable stop words.

custom_stopwords = list(cnt.head(20).index)
stopwords = set(stopwords).union(set(custom_stopwords))

df_lda['text_stopwords_removed'] = df_lda['text'].apply(lambda x: remove_stopwords(x, stopwords))
# Drop short words before joining, so that the ", " separator does not count towards word length
df_lda['text_stopwords_removed'] = df_lda['text_stopwords_removed'].apply(lambda tokens: ', '.join(word for word in tokens if len(word) > 3))

Word clouds

Finally, let’s build word clouds:

show_cloud(df_lda['text_stopwords_removed'], max_words=200, show=True)

And the output, which we have already seen:

Pay attention to the max_words parameter which defines the number of words to be shown.

LDA

And last but not least is the LDA model with visualization. The most important parameters of the function are:

  • corpus is a stream of document vectors or a sparse matrix of shape (num_documents, num_terms)
  • num_topics is the number of latent topics (clusters) to be extracted from the training corpus; it can be changed to find the optimal number of topics
  • id2word is a mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing

corpus, dictionary = lda_preproc(df_lda['text_stopwords_removed'])
vis, lda = lda_plot_results(corpus=corpus, id2word=dictionary, num_topics=3, save_name='LDA results')
vis

So finally there is a visualization of the model:

Overall view of clusters and words
First cluster
Second cluster
Third cluster

Conclusion

Surveys in all their variety, including free-form questions, are an important part of gathering feedback. However, with an increase in the number of answers, the problem of automating the analysis inevitably arises. At the same time, surveys are not the only source of textual data.

This article has described a number of approaches that make it possible to simplify and automate the analysis and breakdown of text sets into topics, in particular:

  • Word clouds to visualize the main words and phrases.
  • SentimentIntensityAnalyzer from VADER with an individually selected threshold to identify emotional coloring.
  • LDA modeling to find topics and topic keywords.

As mentioned above, this approach lets you analyze not only surveys but also other sets of texts that are too numerous to analyze in detail manually.
