Fernando Garcia
DNC Tech Team
Jan 9, 2024


Leveraging BERTopic Modeling to Discover User Insights


TLDR:

Given large, unstructured text data, we decided to use topic modeling with an NLP model.

Selected the BERTopic model because it can interpret sentences (capturing the semantic context of text), provides more human-comprehensible topic labels, and is customizable for the data team's use case.

Allowed the data team to conveniently integrate filtered user insights into their research and development of products and services.

During my time at the DNC as a data team intern, I was tasked with interpreting large, unstructured data extracted from our user support ticketing system. As my first internship experience, I was challenged with understanding a team's core objectives and creating an NLP model that best aligned with their use case. Ultimately, my textual analysis allowed the data team to easily gain insights on selected topics and incorporate them into their research and development of products and services.

Introduction

The Community Team at the DNC hears directly from our users through user-created tickets. We get a wide range of questions, from requests for tips on how to use our Democratic Support score to requests for better cell phone coverage of key groups. Following the extraction of messy, text-based data on user requests, we wanted to perform a textual analysis of the tickets to get a better understanding of our users and the kinds of questions, challenges, and remarks they bring to the DNC. Developing this strong understanding would not only improve transparency between the DNC and its users, but would also guide the research, implementation, and development of its services and data products.

What is Topic Modeling, and why?

In natural language processing, topic modeling is an unsupervised machine learning technique typically used as a data-mining tool to discover patterns and semantic structures within a corpus of documents. In essence, the model is used to scan words and phrases within text to uncover topics, grouping similar documents together through term frequency or semantic similarities.

For the DNC’s use case, I decided that topic modeling would be a suitable approach for a more in-depth analysis of the tickets, uncovering patterns and insights within user requests.

Methodology

There is a plethora of topic modeling algorithms to consider. The two I explored in depth were Latent Dirichlet Allocation (LDA) and BERTopic, both commonly used topic models for text-based data.

Many initial trials were conducted on the dataset in preparation for each model, but one thing that concerned me was that LDA requires a predefined number of topics as a parameter to train the model. Given the unexplored context and large size of our dataset, it was difficult to determine an expected number of topics before conducting the analysis. LDA’s probabilistic model also relies on a mixture-of-words approach, looking only at the distribution of words within documents for topic assignment. The relationships between words within their sentences are therefore not considered, failing to account for the semantic context of tickets.
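To make that limitation concrete, here is a minimal, purely illustrative LDA run using scikit-learn; the library choice and the toy ticket texts are my own placeholders rather than the exact setup I trialed. Note that the number of topics must be fixed before training, and the model only ever sees word counts:

```python
# Illustrative only: a tiny LDA run with scikit-learn. The topic count
# (n_components) must be chosen up front, and LDA sees only word frequencies.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical ticket texts, stand-ins for the real corpus.
docs = [
    "How do I interpret the Democratic Support score?",
    "Requesting better cell phone coverage for key groups",
    "Trouble exporting my voter contact list",
]

# Bag-of-words counts: word order and sentence context are discarded.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(docs)

# n_components is the number of topics and must be predefined.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic_dist = lda.fit_transform(doc_term_matrix)

# Each row is one document's distribution over the two predefined topics.
print(doc_topic_dist)
```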

Luckily for us, BERTopic’s hybrid pipeline leverages dimensionality reduction and clustering algorithms as well as Bidirectional Encoder Representations from Transformers (BERT) embedding models to capture the semantic context of words and phrases within documents when forming clusters. It additionally uses class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) and representation models to produce distinctive, human-readable labels for each cluster. Because every component in the BERTopic pipeline accepts a customizable model, I could create a topic model optimized for our use case.
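Below is a minimal sketch of what such a pipeline looks like in code. The specific embedding model, parameter values, and representation model are illustrative assumptions, not the exact configuration used on the DNC tickets:

```python
# A minimal sketch of a customizable BERTopic pipeline; models and parameters
# here are illustrative defaults, not the tuned DNC configuration.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

# 1. Embeddings: a sentence-transformer captures semantic context.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Dimensionality reduction of the embeddings.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

# 3. Clustering the reduced embeddings; no topic count required up front.
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        prediction_data=True)

# 4. c-TF-IDF tokenization and 5. a representation model for readable labels.
vectorizer_model = CountVectorizer(stop_words="english")
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    calculate_probabilities=True,
)

# `docs` would be the list of cleaned ticket texts (see Preprocessing below).
# topics, probs = topic_model.fit_transform(docs)
# topic_model.get_topic_info()
```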

Preprocessing

As I’ve hinted before, there was a lot of data cleaning and preparation required before training our model. Through exploration of the tickets, I quickly identified three key areas of pre-processing (a simplified code sketch follows the list):

  • Data prep: Unnesting and converting relevant features, specifically the subject line, body text, and ticket tags, into data types suitable for textual analysis, creating the input documents for our topic model.
  • Text cleaning: Lowercasing the entire corpus so that matching was case-insensitive.
  • Removing noisy text: Stripping web data, special characters, and emoticons, and writing functions to parse out forwarded messages and email signatures.
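The sketch below is a simplified, hypothetical version of that cleaning pipeline; the column names, regex patterns, and signature markers are placeholders rather than the exact functions used on the tickets:

```python
# A simplified sketch of the cleaning steps above; column names, regexes,
# and signature markers are hypothetical, not the exact DNC code.
import re
import pandas as pd

def clean_ticket_text(text: str) -> str:
    text = text.lower()                                   # case-insensitive corpus
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # strip URLs / web data
    text = re.split(r"-{2,}\s*forwarded message\s*-{2,}", text)[0]  # drop forwarded content
    text = re.split(r"\n(best|thanks|regards)[,\s]", text)[0]       # crude signature cut-off
    text = re.sub(r"[^a-z0-9\s.,?!']", " ", text)          # special characters / emoticons
    return re.sub(r"\s+", " ", text).strip()

# Combine the relevant features into a single input document per ticket.
tickets = pd.DataFrame({
    "subject": ["Re: Support score question"],
    "body": ["How do I read the score?\n--Forwarded message--\nold thread..."],
    "tags": [["support-score", "how-to"]],
})
docs = (tickets["subject"] + ". " + tickets["body"] + " "
        + tickets["tags"].map(" ".join)).map(clean_ticket_text)
print(docs.iloc[0])
```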

Parameter Tuning

Because each component is made up of one or more models, BERTopic offers comprehensive documentation on fine-tuning the topic model. I spent a particularly long time trialing different HDBSCAN and UMAP models to optimize the clusters and minimize the number of tickets classified as outliers. I ultimately tuned my model to produce distinct microclusters, with the intention of merging similar topics into larger groups while also shrinking the outlier cluster. Creating these microclusters gave me the freedom to consult with data team members about which topics they believed were interesting enough to warrant their own cluster, and which partially redundant ones could be grouped together.
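As a rough illustration of the knobs involved, the sketch below shows how HDBSCAN and UMAP parameters can be passed into BERTopic and how outliers can be reassigned after fitting; the parameter values are examples, not the final tuned ones, and `docs` refers to the cleaned ticket texts from the pre-processing step:

```python
# Illustrative tuning knobs (values are examples): a smaller min_cluster_size
# yields more, finer-grained microclusters; min_samples and n_neighbors trade
# cluster purity against how many tickets land in the -1 outlier cluster.
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5,
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)  # docs: cleaned ticket texts

# Optionally reassign outliers (topic -1) after fitting, then refresh the
# c-TF-IDF topic representations to reflect the new assignments.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)
```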

Result

The resulting model was trained on a subset of our corpus, filtered to tickets tagged with topics of interest to our team. As expected, the model returned many microclusters, some of which were very similar to other clusters in the output.

I worked on merging topics using similarity scores between clusters and BERTopic’s hierarchical clustering integration, which exposes the hierarchical structure of topics and shows how closely related they are.

(Similarity score heatmap matrix demonstrating relationships between topics.)
(Visualization of the hierarchical clustering on the original output of the topic model.)
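In code, that merging workflow looks roughly like the sketch below (continuing from the fitted model above); the topic IDs passed to merge_topics are placeholders standing in for the redundant microclusters identified from the heatmap and hierarchy views:

```python
# A sketch of how topics were inspected and merged; the topic IDs in
# merge_topics are placeholders chosen after reviewing the visualizations.
hierarchical_topics = topic_model.hierarchical_topics(docs)

# Inter-topic similarity heatmap and topic hierarchy (the figures above).
topic_model.visualize_heatmap()
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Merge microclusters judged to be redundant; each inner list collapses
# into a single topic.
topic_model.merge_topics(docs, topics_to_merge=[[3, 12], [5, 8, 21]])
```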

After merging topics, I reviewed the newly created clusters to ensure they were both interpretable and usable before accepting the final topic model. To prepare for the DNC data team’s usage, I joined the BERTopic document data, including each ticket’s assigned topic, top key terms, and classification probability score, with the original dataset. This allowed the data team to view all of the original metadata associated with each ticket alongside the features returned by the topic model, giving them a comprehensive view of each request.

Finally, I created interactive visualizations integrated with BERTopic, such as maps of the cluster groups and views of how topics evolve over time, for the team to gain more user insights. Mapping the clusters lets us see the separation between topics as well as their proximity to outlier tickets, which is useful for checking the effects of outlier-reduction strategies when tuning the model. Visualizing topics over time not only lets us distinguish topics that are still relevant from relics of past cycles, but also shows the key-term frequencies of each topic during different time periods.

(Visualization of topic clusters reduced to 2D pane.)
(Visualization of topics over time; the interface in the notebook also gives term frequencies at given points in time.)
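A rough sketch of that final step is shown below, continuing from the fitted model; the tickets dataframe and its created_at timestamp column are hypothetical stand-ins for the original ticket metadata:

```python
# A sketch of joining model output back onto the original tickets and
# building the interactive views; `tickets` and `created_at` are hypothetical.
doc_info = topic_model.get_document_info(docs)  # Topic, Name, Top_n_words, Probability, ...
enriched = tickets.reset_index(drop=True).join(doc_info)

# 2D map of ticket clusters and their proximity to outliers.
topic_model.visualize_documents(docs)

# Topic prevalence over time, with per-bin key-term frequencies.
topics_over_time = topic_model.topics_over_time(
    docs, tickets["created_at"].tolist(), nr_bins=20
)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
```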

Closing Thoughts

The topic modeling classification was, and will continue to be, integrated into user research components of projects at the DNC. Isolating topics of interest from a large dataset allows the data team to understand the feature requests and challenges faced by users and to set priorities for advancing their products.

A valuable asset of the project is its replicability and scalability. The DNC can pick up where the project left off to gain user insights on tickets filtered by areas of interest. Furthermore, individual models can be fine-tuned for various use cases, whether that calls for a more global or a more local view of user requests.

The BERTopic modeling approach was uncharted territory at the DNC. The project allowed the data team to see the potential of gaining valuable insights from a textual analysis of unstructured data. Making sense of the large text corpus by narrowing the team’s focus to relevant tickets lets the team easily integrate user insights into the research and development process for their products and services. The topic model will serve as a foundation for future work by the data team to enhance the DNC’s understanding of its campaign users and ultimately optimize its capabilities.
