Faster Topic Modeling with BERTopic and RAPIDS cuML

Published in RAPIDS AI · Jan 19, 2023

Authors (alphabetical): Nick Becker (NVIDIA), Dante Gama Dessavre (NVIDIA), Maarten Grootendorst (BERTopic), and Corey Nolet (NVIDIA)

Introduction

BERTopic is a topic modeling framework that brings neural network embeddings and classical machine learning techniques together into a state-of-the-art solution. The library provides a friendly user interface for many different tasks, including guided, supervised, semi-supervised, hierarchical, and dynamic topic modeling (and more). It even supports topic visualization. To learn more about BERTopic, you can watch the brief video introduction linked from the BERTopic repository.

BERTopic has become a core tool for topic modeling, but workflows containing 50,000 or more documents present scaling challenges to practitioners. With expanded support for RAPIDS cuML in the BERTopic v0.13 release, it’s now possible to get results even faster and scale to larger datasets.

In this blog, we describe how you can go faster on CPUs by using lightweight pipeline components instead of the advanced pipeline BERTopic uses by default. Then, we demonstrate how NVIDIA GPUs accelerate the key computational bottlenecks of the default BERTopic pipeline. Based on benchmarks of two basic workflows on a corpus of 100,000 product reviews, using PyTorch and RAPIDS cuML with BERTopic on a GPU can provide a 10–15x or greater speedup to the default pipelines.

Speeding Up BERTopic on CPUs

BERTopic enables plugging in different algorithms for tasks like embedding creation, dimensionality reduction, and clustering to help scale to larger datasets. For example, you could get faster results on your CPU by replacing a neural network with a TF-IDF transformation in the embeddings step, or by replacing UMAP with TruncatedSVD in the dimensionality reduction step. The BERTopic documentation provides recommendations for lightweight, optimized CPU pipelines if you’d like to get faster results without a GPU.
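As a sketch of what such a lightweight configuration can look like (adapted from the BERTopic documentation; the SVD output dimensionality of 100 is illustrative):

from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# CPU-friendly alternative: TF-IDF vectors reduced with TruncatedSVD
# instead of neural network embeddings reduced with UMAP
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(100)
)

topic_model = BERTopic(embedding_model=pipe)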

This comes with a tradeoff, though: the default (computationally expensive) techniques like neural network embeddings and UMAP often provide excellent results. Ideally, we’d keep these more expensive techniques and simply go faster. If you don’t want to change the underlying pipeline techniques, GPUs are a great option.

Why GPUs? Where BERTopic Spends Time

Running the default BERTopic fit_transform on a corpus of 100,000 Amazon customer product reviews presents significant computational challenges. The BERTopic step in the following code took about 14 minutes on a system with dual Intel Xeon E5-2698 20-core CPUs (80 logical cores).

from bertopic import BERTopic
import pandas as pd

# PATH points to a line-delimited JSON file of Amazon product reviews
df = pd.read_json(PATH, lines=True, nrows=100000)
docs = df.reviewText.tolist()

# Fit the default pipeline: SentenceTransformers embeddings, UMAP, and HDBSCAN
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

Please note: for all of the following analyses, performance and the relative time spent in different operations will vary somewhat across runs.

By inspecting the profile of this workload with SnakeViz, two core bottlenecks jump out:

  • 88% of the total time is spent encoding the documents into neural network embeddings with SentenceTransformers, using PyTorch under the hood (733 seconds)
  • 10% is spent reducing dimensionality with UMAP (86 seconds)
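For reference, a profile like this can be generated with Python’s built-in cProfile and explored in SnakeViz (a minimal sketch, not necessarily the exact setup used for these benchmarks):

import cProfile

# Profile the fit and write the stats to a file...
cProfile.run("topic_model.fit_transform(docs)", "bertopic.prof")
# ...then explore it from a shell with: snakeviz bertopic.prof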

The workload gets even more time consuming when setting calculate_probabilities=True to perform soft clustering with HDBSCAN. This is often done to visualize topic distributions, to reduce HDBSCAN outliers with the “probabilities” strategy, or to generate topic-document probabilities.

from bertopic import BERTopic
import pandas as pd

# PATH points to a line-delimited JSON file of Amazon product reviews
df = pd.read_json(PATH, lines=True, nrows=100000)
docs = df.reviewText.tolist()

# calculate_probabilities=True triggers HDBSCAN soft clustering
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

Now, HDBSCAN soft clustering becomes an additional bottleneck, essentially doubling the total time required for topic modeling to more than 26 minutes:

  • 47% of the total time is spent on soft clustering with HDBSCAN (737 seconds)
  • 47% creating the embeddings with SentenceTransformers (742 seconds)
  • 5% reducing dimensionality with UMAP (88 seconds)

A 14 or 26 minute runtime for topic modeling is a major hurdle to any kind of exploratory analysis or parameter optimization. Things would only get worse if we had more reviews.

Accelerating BERTopic with NVIDIA GPUs

Fortunately, we can accelerate all three of these core bottlenecks by using NVIDIA GPUs. On a CUDA-enabled system, run the following and you’re ready to go:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install bertopic
pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com

PyTorch on GPUs

Let’s start by running the original code but using the CUDA-enabled version of PyTorch:

from bertopic import BERTopic
import pandas as pd

# PATH points to a line-delimited JSON file of Amazon product reviews
df = pd.read_json(PATH, lines=True, nrows=100000)
docs = df.reviewText.tolist()

# With the CUDA build of PyTorch installed, SentenceTransformers
# automatically encodes the documents on the GPU
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
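If you want to verify that the CUDA build of PyTorch is active before running, a quick check:

import torch

# Should print True on a properly configured CUDA system
print(torch.cuda.is_available())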

Using PyTorch on an A100 GPU significantly accelerates the document embedding step from 733 seconds to about 70 seconds. The overall total time required drops to about 190 seconds, already 4.5x faster than before!

With SentenceTransformers and PyTorch running on the GPU, the time elapsed breakdown is quite different:

  • 37% of the total time is spent creating the embeddings with SentenceTransformers on the GPU (70 seconds)
  • 53% spent reducing dimensionality with UMAP on the CPU (100 seconds)

This means we can get another large speedup by bringing UMAP onto the GPU.

RAPIDS

RAPIDS cuML provides a GPU-accelerated UMAP and HDBSCAN, which we can easily drop into this workflow using the example from the BERTopic documentation.

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd

df = pd.read_json(PATH, lines=True, nrows=100000)
docs = df.reviewText.tolist()

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True,
                        prediction_data=True)  # enables soft clustering later

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

Using cuML, we can run UMAP in about 5 seconds rather than 100 seconds, resulting in a further 2x speedup for the overall workload (84 seconds rather than 190 seconds).

Compared to the original 836 seconds for the standard fit_transform entirely on CPUs, using PyTorch and RAPIDS provides a full 10x speedup.

What About Calculating Probabilities?

From the CPU profiles, we saw that speeding up HDBSCAN isn’t critical unless we need to calculate probabilities. How much of an impact does running HDBSCAN on a GPU make when we do need to get those probabilities?

A massive impact, it turns out. HDBSCAN soft clustering took about 737 seconds on the CPU, but can be done in less than one second on the GPU. Running the same code above but setting calculate_probabilities=True, the workload takes about 85 seconds to complete rather than 1583 seconds (about 19x faster).
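Concretely, this run reuses the cuML models from the snippet above (this is where prediction_data=True on the HDBSCAN model comes into play) and only changes the BERTopic constructor:

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)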

At this point, almost the entire workload is running on the GPU.

Performance Best Practices

A full BERTopic workload on 100,000 reviews in 85 seconds is excellent, but being content isn’t our job. For exploratory analysis or when optimizing topic modeling, it’s common to use the same set of embeddings and vary other aspects of the workload such as UMAP or HDBSCAN parameters.

BERTopic enables you to create your neural network embeddings once and reuse them. As creating the embeddings is the most expensive step (taking up more than 75% of the 85 seconds), we recommend doing this once upfront to enable even quicker iteration.

In the examples above using 100,000 Amazon reviews, the profiles and benchmarks indicate that each topic modeling run would only take about 20 seconds when using already created neural network embeddings. The BERTopic documentation includes a simple example of how to do this.
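A sketch of that pattern, following the BERTopic documentation (all-MiniLM-L6-v2 is BERTopic’s default embedding model):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Compute the expensive neural network embeddings once...
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# ...then reuse them across runs while varying UMAP/HDBSCAN parameters
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs, embeddings)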

Conclusion

BERTopic is a powerful tool for topic modeling, and its composable library design makes it possible for you to get results faster on both CPUs and GPUs.

If you’re working on a CPU, you can swap out computationally expensive operations like neural network embeddings and UMAP by using a lightweight configuration. If you have an NVIDIA GPU, you can use PyTorch and RAPIDS cuML to get large speedups with the state-of-the-art BERTopic configuration used by default.

In the benchmark workload described above processing 100,000 product reviews, we saw 10–15x or greater speedups by using both PyTorch and RAPIDS. The expanded support for RAPIDS in BERTopic v0.13 makes it easier than ever before to get started. Just pip install BERTopic and cuML.

To learn more, visit the BERTopic and RAPIDS cuML documentation. Happy topic modeling!

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU acceleration to your project, please reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.
