Multilingual GPU-Powered Topic Modelling at Scale

Introducing Buzzwords — Bumble’s Open-Source GPU Topic Modelling Library

Stephen O' Farrell
Published in Bumble Tech
6 min read · Aug 30, 2022

With the abundance of text data available nowadays, it is becoming increasingly important for companies to be able to gain insights from free-form text datasets. Over the past couple of years, Topic Modelling has been quickly rising in prominence as a major field of research in this space. At Bumble Inc. — the parent company that operates Badoo, Bumble and Fruitz — we’re embracing this and have been successfully using topic modelling to better understand our members and their needs.

With millions of people across the globe using our apps, we needed a solution that would allow us to build and deploy topic models at a massive scale. To this end, we developed Buzzwords: a GPU-powered topic modelling tool able to train and deploy topic models at rapid speeds. We’ve been using Buzzwords internally with great results for the past year and are thrilled to announce that we’re now releasing it as a fully open-sourced project!

What is Topic Modelling?

Text is categorised into various relevant topics through topic modelling

Topic modelling is a fairly simple concept: it’s the process of clustering groups of text together based on the topics that they relate to. It’s an unsupervised machine learning technique that can provide powerful insights into a dataset of text documents. For instance, if you send out a survey and get a significant amount of feedback from your users, how do you get an idea of the main topics being discussed? Manually sifting through thousands of responses can be very time-consuming, especially if the survey spans multiple languages.

Topic modelling solves this problem by automatically providing insights into the most prevalent topics, allowing you to quickly analyse large datasets of text without the need for manual labelling.

How does the algorithm work?

Buzzwords is based on BERTopic — a library for topic modelling developed back in 2020. The algorithm powering BERTopic is quite easy to understand:

  1. Text converted to embeddings with SentenceTransformers
  2. Embedding dimensionality reduced with UMAP
  3. Embeddings clustered with HDBSCAN
  4. Keywords gathered with a class-based TF-IDF implementation, with MMR for relevance
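
To make step 4 more concrete, here is a minimal pure-Python sketch of class-based TF-IDF. It assumes the formulation described in BERTopic's documentation (term frequency within a topic, weighted by log(1 + A / f), where A is the average number of words per topic and f is the word's frequency across all topics). The function names and the toy documents are illustrative only, and the actual Buzzwords implementation may differ.

```python
import math
from collections import Counter


def class_tf_idf(docs_per_topic):
    """Score words per topic, treating each topic's documents as one big document."""
    # Term frequencies per topic (all documents of a topic are merged).
    tf = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of every word across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per topic (the "A" in the formula).
    avg_words = sum(total.values()) / len(tf)

    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
    return scores


def top_keywords(scores, k=2):
    """Return the k highest-scoring words per topic (ties broken alphabetically)."""
    return {
        topic: [w for w, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for topic, s in scores.items()
    }


# Toy data: documents already grouped by the topic they were clustered into.
docs_per_topic = {
    0: ["the app keeps crashing", "app crashing on login", "the login screen is crashing"],
    1: ["new filters are great", "love the new filters", "great filters in this update"],
}
keywords = top_keywords(class_tf_idf(docs_per_topic), k=2)
```

Words that are frequent inside one topic but rare elsewhere score highest, which is why a word like "the" that appears in both topics falls behind the topic-specific words.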

The beauty of this is the flexibility of the SentenceTransformers module: you can build a multilingual topic model simply by choosing a different pre-trained model or even plug in a custom embedding model of your own. With BERTopic it’s easy to build a BERT-based multilingual topic model with topics that are described very well by a selection of keywords most relevant to each topic.
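
The MMR (Maximal Marginal Relevance) part of step 4 can be sketched in the same spirit. The idea is to pick keywords greedily, rewarding similarity to the topic and penalising similarity to keywords that were already picked. The snippet below is a simplified illustration with hand-made 2-D vectors standing in for real embeddings; the `diversity` parameter name and the toy vectors are assumptions for the example, not Buzzwords' API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """Greedily pick keywords relevant to the topic but dissimilar to each other."""
    selected = []

    def score(word):
        relevance = cosine(candidates[word], topic_vec)
        # Highest similarity to any keyword chosen so far (0 if none chosen yet).
        redundancy = max(
            (cosine(candidates[word], candidates[s]) for s in selected),
            default=0.0,
        )
        return (1 - diversity) * relevance - diversity * redundancy

    remaining = set(candidates)
    while remaining and len(selected) < top_n:
        best = max(sorted(remaining), key=score)  # sorted() keeps ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical embeddings: "crash" and "crashing" are near-duplicates.
topic_vec = [1.0, 0.2]
candidates = {
    "crash": [1.0, 0.1],
    "crashing": [1.0, 0.12],
    "login": [0.6, 0.8],
}
diverse = mmr(topic_vec, candidates, top_n=2, diversity=0.5)
relevance_only = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
```

With `diversity=0.0` the two near-duplicate words are both selected; with `diversity=0.5` the second slot goes to "login" instead, giving a more informative keyword set.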

Why didn’t we just use BERTopic?

The issue with BERTopic is that only step 1 of 4 is optimised for GPU, so it struggles when you try to train on larger datasets. We trained a BERTopic model on 1 million sentences and it took a total of ~40 minutes to complete. To test the prediction speeds, we also trained a model on 1 million sentences and used it to predict on a further 1 million; prediction added a further 30 minutes to the total runtime. When you're dealing with the scale of data that Bumble deals with, this needs to be much faster.

What we did

Fortunately for us, RAPIDS' cuML library had recently added HDBSCAN support to go with its existing UMAP implementation. RAPIDS is NVIDIA's GPU-powered suite of tools for data science algorithms. By replacing the CPU-powered implementations of UMAP and HDBSCAN found in BERTopic with their GPU-powered counterparts in cuML, we moved steps 2 and 3 of the algorithm onto the GPU and sped up the computation massively.

But there was a catch: cuML’s HDBSCAN had no way of predicting new data with a trained model. The CPU-powered HDBSCAN offers approximate_predict for this, but there is no equivalent in cuML. So, what was the next obvious step? Build our own!

To allow for GPU-powered inference, we combine two versions of HDBSCAN with FAISS

We did this by building our own custom HDBSCAN module, combining elements from both implementations into one powerful class. We stripped down the approximate_predict function from the CPU-based hdbscan library (part of scikit-learn-contrib) and replaced the KDTree search with a FAISS index. This allowed us to recreate the original library's prediction function but with the more computation-heavy sections optimised for GPU. Because the FAISS index can be recreated easily from a trained model, we drop it when saving the model and rebuild it whenever the model is loaded for inference. Now we were able to train and deploy GPU-powered topic models at rapid speeds!
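
The "drop the index on save, rebuild it on load" idea can be sketched as follows. This is a heavily simplified stand-in, not the real approximate_predict logic: it labels a new point with the label of its nearest training point, or -1 if nothing is close enough. A brute-force search plays the role of the FAISS index so the example runs without a GPU, and all class, method and parameter names here are hypothetical.

```python
import json
import math


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class BruteForceIndex:
    """Stand-in for a FAISS index: exact nearest-neighbour search."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, query):
        """Return (distance, position) of the stored vector closest to query."""
        return min((euclidean(query, v), i) for i, v in enumerate(self.vectors))


class NearestNeighbourPredictor:
    """Hypothetical predictor: label new points from their nearest training point."""

    def __init__(self, points, labels, max_distance):
        self.points = [list(p) for p in points]
        self.labels = list(labels)
        self.max_distance = max_distance
        # Derived data: never saved, always rebuilt from the points.
        self.index = BruteForceIndex(self.points)

    def predict(self, query):
        distance, position = self.index.search(query)
        # Points far from every training point are treated as outliers (-1).
        return self.labels[position] if distance <= self.max_distance else -1

    def save(self):
        """Serialise everything except the index."""
        return json.dumps(
            {"points": self.points, "labels": self.labels, "max_distance": self.max_distance}
        )

    @classmethod
    def load(cls, payload):
        """Recreate the model; the constructor rebuilds the index."""
        return cls(**json.loads(payload))


model = NearestNeighbourPredictor(
    points=[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    labels=[0, 0, 1],
    max_distance=1.0,
)
payload = model.save()
restored = NearestNeighbourPredictor.load(payload)
```

The saved payload holds only the training points, labels and threshold, so the model file stays small and portable while the search structure is recreated on whatever hardware loads it.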

What makes Buzzwords so good?

The primary benefit is the speed of the library. This increase in speed allows you to iterate more quickly and tune your parameters much more effectively. Maybe your first run had too many outliers due to your parameters, maybe you want larger topics in your final set, or maybe you want to make the model more efficient by using a smaller, monolingual embedding. With Buzzwords, the faster training speeds allow you to tweak things multiple times and still deliver quickly.

Training times are significantly reduced, especially when dealing with large-scale datasets

But the improved speeds aren’t only found at training time. The addition of our custom HDBSCAN inference means that the prediction times are also much faster. This improved efficiency allows you to deploy at a much bigger scale.

Another issue that plagues this algorithm is the high number of outliers that you can encounter when training. It's not uncommon to have ~30% of your datapoints disregarded as outliers. When carrying out an analysis, this can be acceptable, but when deploying to production it can be a huge problem. To counteract this, Buzzwords is implemented using Matryoshka models, where you can train models recursively on those outliers.

The final model has N+M+L topics with reduced outliers vs the basic model with just N topics

This approach reduces outliers by training subsequent topic models on the outliers of previous models. This allows you to get far more granular results and prevents production models from disregarding significant chunks of your data. All of the models are combined into a single model for inference, so there is no difference between deploying a Buzzwords model with no recursions and one with multiple recursions.
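
As a rough illustration of the Matryoshka idea, the sketch below uses a toy 1-D clusterer in place of UMAP + HDBSCAN and refits it on whatever the previous level left as outliers, shifting the topic ids so they stay unique in the combined result. The function names are hypothetical, and using looser parameters at each level is an assumption made so the toy example finds something new on each pass; it is not a description of how Buzzwords configures its recursions.

```python
def toy_cluster(values, eps, min_size):
    """Toy 1-D clusterer (stand-in for UMAP + HDBSCAN); -1 marks outliers."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Chain neighbouring points whose gap is at most eps into groups.
    groups, current = [], [order[0]]
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] <= eps:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)

    # Groups that are too small are left as outliers.
    labels = [-1] * len(values)
    next_label = 0
    for group in groups:
        if len(group) >= min_size:
            for idx in group:
                labels[idx] = next_label
            next_label += 1
    return labels


def fit_matryoshka(values, levels):
    """Fit one toy model per level, each on the outliers left by the previous one."""
    labels = [-1] * len(values)
    offset = 0
    for eps, min_size in levels:
        outliers = [i for i, label in enumerate(labels) if label == -1]
        if not outliers:
            break
        sub_labels = toy_cluster([values[i] for i in outliers], eps, min_size)
        for i, sub in zip(outliers, sub_labels):
            if sub != -1:
                labels[i] = offset + sub  # shift so topic ids stay unique
        offset += max(sub_labels) + 1
    return labels


values = [1.0, 1.1, 1.2, 1.3, 5.0, 5.2, 5.4, 9.0]
basic = toy_cluster(values, eps=0.15, min_size=3)
nested = fit_matryoshka(values, levels=[(0.15, 3), (0.25, 2)])
```

Here the basic model finds one topic and leaves four points as outliers, while the nested model adds a second topic from those outliers and leaves only one point unassigned. Because the labels live in one flat id space, downstream code does not need to know how many recursions were used.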

And this is just a taste. Buzzwords offers much more, such as alternative keyword extractors, support for image topic models and word lemmatising, with more to come!

Conclusion

Buzzwords has unlocked a lot of potential for us in its various forms: production APIs using KServe, ad-hoc analyses on our internal datasets, and scheduled batch training pipelines on GCP. The increased speeds make it very easy for us to employ topic modelling across a wide range of use cases, and we're excited to see what uses the wider industry can find for the library with this open-source release.

This is the first library that the Data Science team at Bumble has released publicly but it certainly won’t be the last. We’re trying to use ML to create a world where all relationships are healthy and equitable. If this sounds like something you’d be interested in then please check out our Careers page or reach out to us for more information!
