TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Topic Modelling with BERTopic in Python

Hands-on tutorial on modeling political statements with a state-of-the-art transformer-based topic model

Petr Korab
Published in TDS Archive
5 min read · Apr 1, 2024

Photo by Harryarts on Freepik

Topic modeling (i.e., topic identification in a corpus of text data) has developed quickly since the Latent Dirichlet Allocation (LDA) model was published. This classic topic model, however, does not capture the relationships between words well because it is based on the bag-of-words representation, which ignores word order and context. The more recent embedding-based Top2Vec and BERTopic models address this drawback by using pre-trained language models to generate topics.
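To make the bag-of-words limitation concrete, here is a minimal sketch using only the Python standard library. The two sentences are made-up examples, not taken from the dataset used later. Sentences with opposite meanings end up with identical word counts, while counting adjacent word pairs (bigrams) recovers some of the local word order.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


def bag_of_bigrams(text: str) -> Counter:
    """Count adjacent token pairs, which keeps some local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Illustrative sentences (assumption: invented for this example).
a = "the senate blocks the president"
b = "the president blocks the senate"

same_unigrams = bag_of_words(a) == bag_of_words(b)
same_bigrams = bag_of_bigrams(a) == bag_of_bigrams(b)
print(same_unigrams, same_bigrams)  # True False
```

Embedding-based models go further than bigram counts by representing whole documents as dense vectors, but the example shows why a pure word-count view loses information.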

In this article, we’ll use Maarten Grootendorst’s (2022) BERTopic to identify the terms representing topics in political speech transcripts. It outperforms most traditional and modern topic models on standard topic modeling metrics across various corpora and has been used in companies, academia (Chagnon, 2024), and the public sector. In Python code, we’ll explore:

  • how to effectively preprocess data
  • how to create a bigram topic model
  • how to explore the most frequent terms over time.
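The three steps above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not necessarily the code used in the rest of the article: `basic_clean` and `fit_bigram_topic_model` are hypothetical helper names, the cleaning rules are illustrative, and restricting topic terms to bigrams through scikit-learn's `CountVectorizer(ngram_range=(2, 2))` is one possible way to build a bigram topic model with BERTopic.

```python
import re


def basic_clean(text: str) -> str:
    """Hypothetical light cleaning: drop URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def fit_bigram_topic_model(docs, timestamps):
    """Fit BERTopic with bigram topic terms and compute topics over time.

    Requires `pip install bertopic`. The imports live inside the function
    so that `basic_clean` can be used without the heavy dependencies.
    """
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    cleaned = [basic_clean(d) for d in docs]
    # ngram_range=(2, 2) makes the topic representations consist of bigrams.
    vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer)
    topics, _ = topic_model.fit_transform(cleaned)
    # One row per topic and time bin, with the top terms in that bin.
    over_time = topic_model.topics_over_time(cleaned, timestamps)
    return topic_model, topics, over_time
```

Calling `fit_bigram_topic_model(docs, timestamps)` with a list of speech texts and a matching list of dates would return the fitted model, the topic assignment per document, and a table of topic terms per time bin. BERTopic generally needs a reasonably large corpus to produce stable topics, so a handful of documents is unlikely to be enough.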

1. Example data

As an example dataset, we’ll use the Empoliticon: Political Speeches-Context & Emotion dataset, released under the…

Written by Petr Korab

Python engineer / NLP / data viz. Founder of Text Mining Stories (textminingstories.com).
