Accelerating Topic Modeling with RAPIDS and BERT Models

Mar 15, 2022

By: Vibhu Jawa and Mayank Anand

In the time since this blog post was released, the BERTopic library has added initial support for cuML. We recommend using cuML directly with BERTopic, which you can do by following the example below drawn from the BERTopic documentation.

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
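# Note: `docs` is assumed to be a list (or pandas Series) of raw document strings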
topics, probs = topic_model.fit_transform(docs)

The following is the full, original blog.

TLDR: This blog covers “Topic modeling” using RAPIDS, Numba, CuPy, HuggingFace, and PyTorch to do text processing, Deep Learning-based embedding creation, dimensionality reduction, and clustering, all on NVIDIA GPUs in an order of magnitude less time than on CPUs.

Most of the text data in the wild is unlabelled and topic modeling is often the first tool that a data scientist deploys to make sense of it all. Topic modeling is an unsupervised machine learning technique that automatically analyzes text data to determine clusters of words and phrases based on their commonality.

The use cases of topic modeling run deep and wide: e-commerce companies use it for automatic product categorization, SaaS (Software as a Service) companies use it to automate issue-ticket mapping for more effective customer service, streaming companies use it to recommend better content, and bioinformatics researchers use it to extract hidden knowledge and relations from exponentially growing bioinformatics data. Check out this blog to learn more about the industrial applications of topic modeling.

This blog showcases how the GPU-accelerated Python ETL, Machine Learning, and Deep Learning ecosystems integrate together to accelerate the topic modeling workflows end-to-end. We’ll introduce the BERTopic algorithm for topic modeling, and compare the CPU vs GPU implementations of each of the algorithm stages. Using RAPIDS, Numba, CuPy, and PyTorch together, you’ll see how to accelerate your custom workflows and we’ll show that it is possible to reduce the labeling process from over an hour on CPU to just four minutes on GPU, on the canonical news headline data set.

BERTopic:

BERTopic is a topic modeling technique that leverages transformers and class-based TF-IDF to create dense clusters allowing for easily interpretable topics while keeping important words in the topic descriptions.

At a high level, the BERTopic algorithm embeds documents with a transformer model, reduces the embedding dimensionality with UMAP, clusters the reduced embeddings with HDBSCAN, and extracts topic keywords with class-based TF-IDF.

See this more detailed article and the associated library to learn more.

Speedup Snapshot:

Check out the snapshot below to see how each bit of the workflow is accelerated on GPUs.

Dataset: We use the news headlines dataset, containing 1,226,258 headlines published over an eighteen-year period.

Hardware: We use a single 32 GB V100 GPU and an Intel Xeon 2698 with 40 physical (80 virtual) cores for all reported execution times.

BERTopic CPU vs GPU Speedup

Workflow Steps

Now we’ll go into the details of each step of this workflow and show how GPUs can be easily plugged in, yielding big speedups along the way.

Step 1: Embedding Creation

The first step of the algorithm is to create document embeddings using a DL-based transformer model. Embedding creation has two stages:

  1. Tokenization
  2. Transformer model-based Deep Learning

Tokenization:

We first need to tokenize our strings to make them ingestible by the transformer-based deep learning model. Tokenization is accelerated using the SubwordTokenizer from RAPIDS cuDF.

The advantages of using the cuDF subword tokenizer include:

  • The tokenizer itself is about 324x faster than HuggingFace’s fast CPU-based Rust tokenizer, BertTokenizerFast.batch_encode_plus.
  • Tokens are extracted and kept in GPU memory and then used in subsequent tensors, all without leaving GPU memory, avoiding expensive copies and transfers.

End to end, this translates to 0.033s on the GPU compared to 18.7s with a multi-core CPU implementation. For the technical details, check out the embedding extraction code here.
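As a rough sketch of what GPU tokenization looks like in code (the vocabulary hash file path, example strings, and sequence-length settings below are placeholders, not the exact values used in our benchmark):

import cudf
from cudf.core.subword_tokenizer import SubwordTokenizer

# "vocab_hash.txt" is a placeholder: a hashed vocabulary file produced from the
# model's vocab.txt with cudf.utils.hash_vocab_utils.hash_vocab
tokenizer = SubwordTokenizer("vocab_hash.txt", do_lower_case=True)

headlines = cudf.Series(["rain expected across the state", "local team wins final"])
tokens = tokenizer(
    headlines,
    max_length=128,               # placeholder sequence length
    max_num_rows=len(headlines),
    padding="max_length",
    truncation=True,
    return_tensors="pt",          # hand the GPU tensors straight to PyTorch
)
# tokens["input_ids"] and tokens["attention_mask"] never leave GPU memory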

Transformer-Based Deep Learning Model:

We use all-MiniLM-L6-v2 from the HuggingFace library as an English language model trained specifically for semantic similarity tasks, which works well for most use cases. This is the default in BERTopic but can be switched to any model appropriate for your use case.

We get a 7.87x speedup using the GPU for model inference vs. a multi-threaded CPU implementation, cutting embedding creation from 496s to 63s.
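Continuing the tokenization sketch above, the tokenized batch can be fed straight into the transformer on the GPU and mean-pooled into one embedding per document. This is a simplified sketch of the idea; the linked implementation handles batching and other details:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").cuda().eval()

with torch.no_grad():
    out = model(input_ids=tokens["input_ids"].long(),
                attention_mask=tokens["attention_mask"].long())

# Mean-pool the token embeddings (ignoring padding) into one vector per document
mask = tokens["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)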

Step 2. Dimensionality Reduction with UMAP

Clustering algorithms suffer from the curse of dimensionality in high-dimensional spaces, so we first need to reduce the dimensionality of the embeddings we generated.

The BERTopic algorithm relies on the UMAP (Uniform Manifold Approximation and Projection) algorithm to reduce dimensionality. We use cuML’s UMAP to accelerate this on GPU.
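In code, this is a drop-in swap of cuML’s UMAP for the CPU implementation. A minimal sketch, with parameters mirroring the snippet at the top of this post and `embeddings` being the matrix from the previous step:

import cupy as cp
from cuml.manifold import UMAP

# Reduce the document embeddings (e.g. 384 dimensions) down to 5 dimensions
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
reduced_embeddings = umap_model.fit_transform(cp.asarray(embeddings))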

Compared to the multi-threaded CPU implementation, RAPIDS is 27.7x faster on GPU, reducing the computation time from 2718s to 98s.

To learn more about UMAP with cuML, check out this paper by the RAPIDS team!

Step 3. Clustering with HDBSCAN (4.1x faster with RAPIDS):

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm based on DBSCAN. It first performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and be more robust to parameter selection.

We use cuML’s HDBSCAN to accelerate it on the GPU. We are currently 4.1x faster when clustering the 5-dimensional dataset obtained from UMAP, reducing computation time from 382s to 92s. At higher dimensionality, GPUs can be up to 29x faster, and we are currently working on bringing similar speedups to lower dimensions too.
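A minimal sketch of this step (the min_cluster_size and min_samples values below are illustrative, not the benchmark settings):

from cuml.cluster import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=10)
cluster_labels = hdbscan_model.fit_predict(reduced_embeddings)  # -1 marks noise points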

To learn more about HDBSCAN with cuML, check out this blog.

Step 4. Topic Creation with Class-Based TF-IDF

So far, we have successfully reduced the dimensionality and clustered the documents, which is great, but how do we know what people are talking about in those documents? Or, a better question: how do we extract content-based topics from the clusters of documents given to us?

To do just this, BERTopic highlights a very useful approach: a variant of TF-IDF known as c-TF-IDF, or class-based TF-IDF. As described here, applying TF-IDF to a set of documents gives us the relative importance of words between documents, but if we group all documents with the same cluster ID, we get scores for words within each cluster. The words with the highest scores represent the theme of that cluster.

We accelerate class-based TF-IDF on GPUs by writing a thin wrapper on cuML’s TF-IDF vectorizer. This gives us a 7.9x speedup over scikit-learn’s CPU implementation, reducing the time from 15.2s to 1.92s.
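The idea can be sketched as follows (a simplified illustration, not the exact wrapper used in our benchmark): join each cluster’s documents into one “class document”, then run cuML’s TF-IDF vectorizer across the class documents so that each row of the resulting matrix scores words within one cluster.

from collections import defaultdict
import cudf
import cupy as cp
from cuml.feature_extraction.text import TfidfVectorizer

# `docs` is the original list of headline strings; bring the cluster labels to host
labels = cp.asnumpy(cp.asarray(cluster_labels))
grouped = defaultdict(list)
for doc, label in zip(docs, labels):
    grouped[int(label)].append(doc)
class_docs = cudf.Series([" ".join(grouped[c]) for c in sorted(grouped)])

# One TF-IDF row per cluster: high-scoring words characterize that cluster's theme
vectorizer = TfidfVectorizer()
ctfidf_matrix = vectorizer.fit_transform(class_docs)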

Check out this blog to learn more about TF-IDF at scale with cuML.

Step 5. Topic Representation with Numba and CuPy

We could be combing through billions of documents and could end up with millions of words in a cluster, but from the user’s perspective, only the top few words are needed to get an idea of what’s being talked about in a particular cluster. Keeping that in mind, we choose the words with the highest c-TF-IDF scores for each topic.

The implementation essentially boils down to getting the top-n values from each row of the matrix we obtain in the topic creation step. We do this by leveraging CuPy and Numba together (check out the code at the link). This reduces the computation time from 8.2s to 0.602s, giving a 12.5x speedup.
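A simplified, CuPy-only sketch of the idea (the linked implementation uses Numba kernels and avoids densifying the matrix and fully sorting each row):

import cupy as cp

n_top = 10
scores = ctfidf_matrix.toarray()                        # (n_clusters, vocab_size)
top_idx = cp.argsort(scores, axis=1)[:, ::-1][:, :n_top]
# Each row of top_idx indexes that topic's most representative vocabulary terms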

To learn more about Numba and how it can accelerate UDFs on GPUs, check out the docs here, and learn how CuPy accelerates standard array functions here.

Putting It All Together:

When we put it all together, we get a 12.5x speedup using GPUs end to end vs. using the GPU only for deep learning.

Compared with the end-to-end CPU run (Intel Xeon 2698 with 40 physical cores, 80 virtual cores), the end-to-end GPU run (32 GB NVIDIA V100) is 14.2x faster. This translates to the workflow taking only 4.2 minutes on the GPU vs. over an hour on the CPU.

The documentation and library code are at this link. Give it a go.

The internal and external APIs of our implementation closely follow the BERTopic API, so the default workflow should only require a one-line change, and extending it to your use case should be straightforward too.

Wrap Up

We’ve demonstrated that NVIDIA GPUs can accelerate the various parts of a topic modeling workflow, a mix of ETL, traditional machine learning, and deep learning, by at least an order of magnitude. So whether your workflow is pure ETL, machine learning, deep learning, or a mix of these, you will probably see performance gains by using GPUs.

Furthermore, this workflow is an excellent example of how many open-source libraries like HuggingFace Transformers, PyTorch, CuPy, and Numba integrate seamlessly with the NVIDIA RAPIDS ecosystem, enabling acceleration of diverse workflows on GPUs.

If RAPIDS has caught your eye, please check out the GitHub page or visit the website. Is there something you want to try out with RAPIDS but are unsure how to begin? Just ask us! Contributions in the form of questions, issues, feature requests, and pull requests are always welcome!
