Multilingual message content moderation at scale
Part 1: introduction, model’s design and production infrastructure
Words are powerful. They can lift people up, but they can also be used to cause harm. At Bumble Inc., the parent company that operates Badoo and Bumble, two of the world’s highest-grossing dating apps with millions of users worldwide, we believe abuse in any form, online or offline, is unacceptable. That’s why we’ve introduced the Rude Message Detector on the Badoo app: a fully multilingual, machine learning-based toxicity detector designed to protect our community from harmful messaging.
In this post, we will cover some of the technical aspects and challenges of the year-long project that allowed us to build a reliable and scalable engine that powers the Rude Message Detector and other critical internal integrity products within our apps.
Operating across 150 countries and in 50 different languages, we needed a system that is fully and natively multilingual, delivers comparable performance in every supported language, and works without knowing the language of each message beforehand. This relaxes the requirements on other steps of the moderation pipeline, e.g. language detection.
Transformers-based architectures and Foundation models
Since their introduction in 2017, Transformer-based architectures have been the de-facto standard for state-of-the-art Natural Language Processing (NLP) applications. Like Recurrent Neural Networks (RNNs), they are designed to handle sequential data (inputs where sample order affects meaning, e.g. text), but they differ in four main ways: they dispense with recurrence and convolutions entirely; they remove the need to process data in order; they parallelise far better; and they train in less time.
This is all achieved thanks to a (multi-headed) self-attention mechanism that can compute the relative importance of words in a sentence while receiving it as a whole as input. This differs from classical RNNs that compute mutual importance and derive context in a serialised fashion one word after the other.
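To make the mechanism concrete, here is a minimal, dependency-free sketch of (single-headed) scaled dot-product self-attention over toy vectors. It is purely illustrative, not the production implementation: note how every token's scores against the whole sequence are computed in one pass, with no recurrence over positions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over a whole sequence at once.

    queries/keys/values: lists of token vectors. Each output token is a
    weighted mix of all value vectors, where the weights encode the
    relative importance of every token for the current one.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Relative importance of each token for this query token.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted mix of value vectors.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs
```

In the real multi-headed version this computation is batched matrix algebra repeated across several heads, but the core idea is the same.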
The Transformer architecture made it possible to train massive language models (e.g. BERT, GPT-2, GPT-3) that outperform the previous state of the art when fine-tuned on a wide range of downstream NLP tasks. Their availability to the AI community drove a huge paradigm shift in what it means to do ML at scale in industry. It is now possible to leverage pre-trained language models that need no more than design decisions and fine-tuning for the specific task at hand, given that their context-aware embeddings already provide a state-of-the-art representation of the raw input features.
All these topics were recently covered in an impressive piece of work from the Center for Research on Foundation Models (CRFM) at Stanford. After introducing the definition of Foundation Models, it highlights the impacts and risks of this trend, mainly due to the possible emergent capabilities of these large, opaque models with complex architectures, all trained on datasets of unprecedented size.
Design and multilingual validation
XLM-RoBERTa (XLM-R) is a perfect example of a foundation model, with its 270+ million parameters and its training set of 2TB of data in 100 languages. Originally trained on 500 GPUs in a self-supervised fashion (as a Masked Language Model), it learns to predict masked tokens in the input sentence. It has been shown to outperform mBERT and to perform reliably in both high- and low-resource languages. The absence of language embeddings in the input made it a perfect choice for the problem we wished to solve.
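The Masked Language Model objective is easy to sketch: corrupt a fraction of the input tokens and train the model to recover them. The toy function below (an illustration, not XLM-R's actual preprocessing, which also applies replacement heuristics) shows how a training example pairs masked inputs with their prediction targets:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Build a toy Masked Language Model training example.

    Returns (inputs, targets): a random fraction of tokens is replaced
    by <mask> in the inputs, and targets records what the model must
    predict at those positions (None elsewhere, so those positions do
    not contribute to the loss).
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)   # model is trained to recover this
        else:
            inputs.append(tok)
            targets.append(None)  # ignored by the loss
    return inputs, targets
```

Because the labels come from the text itself, this objective scales to the 2TB corpus with no human annotation at all.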
After iterating on different options both from a business and performance perspective, we carefully adapted the original architecture for a 3-class multilabel classification problem, namely: Sexual, Insults and Identity Hate. The most recent dataset is composed of 15+ languages (including English, Portuguese, Russian and Thai) and a total of more than 3M messages, collected and labelled over a period of several months with an outstanding internal process of multiple checks over the same message to ensure data quality and concept reliability. The latter proved to be crucial in delicate NLP tasks. We learned that leveraging multiple human moderators can also help in setting realistic baselines for the task at hand: if multiple agents have a hard time agreeing on a case, how can we expect a machine learning model to make a confident decision?
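Multilabel here means the three classes are not mutually exclusive: each gets its own sigmoid score and its own decision threshold, so one message can trigger several classes at once, or none. A minimal sketch of that decision step (label names and thresholds are illustrative, not our production values):

```python
import math

LABELS = ("sexual", "insult", "identity_hate")

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_decision(logits, thresholds):
    """Turn raw per-class logits into independent yes/no flags.

    Unlike softmax classification, each label gets its own sigmoid
    score, so the scores need not sum to one and a message can be
    flagged for several classes simultaneously.
    """
    scores = {lbl: sigmoid(z) for lbl, z in zip(LABELS, logits)}
    flags = {lbl: scores[lbl] >= thresholds[lbl] for lbl in LABELS}
    return scores, flags
```

Per-class thresholds are what downstream moderation processes tune against their own precision/recall constraints.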
All the languages have been added incrementally, starting with English and Portuguese, covering more regions of our global user base every week, constantly monitoring performance on the previous languages and adapting hyperparameters (batch size, learning rate and decay) to the ever-expanding internal dataset. This agile approach has also proven very successful for managing AI projects at this scale, unlocking significant business impact from the very early stages while allowing expectations to be managed and facilitating prioritisation of the new languages to be added.
Being able to safely release periodic updates to the model has required considerable effort. We had to design a reliable offline validation pipeline able to replicate and predict the actual impact of a new version in production, especially at later stages, with millions of users and tens of thousands of messages per second being processed by the service. In our case, on top of the usual challenges arising from user-facing machine learning projects, new levels of complexity came from the inherently multilingual setting and the heavy imbalance of our target concepts. While samples of the real production load are crucial for monitoring the predicted percentage of positive examples, they cannot be used to assess model performance in any stable and reliable way, because of the extremely low percentage of toxic messages and edge cases (luckily for us!).
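In practice this pushes evaluation onto a curated, positive-enriched test set, sliced per language so a regression in one language cannot hide behind an aggregate number. A hedged sketch of that slicing (the metric names are standard; the function itself is illustrative, not our validation pipeline):

```python
def per_language_metrics(examples):
    """Precision / recall per language on a curated evaluation set.

    examples: iterable of (language, y_true, y_pred) with boolean labels.
    A curated, positive-enriched set is needed because in raw production
    traffic toxic messages are far too rare for stable estimates.
    """
    counts = {}
    for lang, y_true, y_pred in examples:
        tp, fp, fn = counts.setdefault(lang, [0, 0, 0])
        counts[lang] = [
            tp + (y_true and y_pred),          # true positive
            fp + ((not y_true) and y_pred),    # false positive
            fn + (y_true and (not y_pred)),    # false negative
        ]
    metrics = {}
    for lang, (tp, fp, fn) in counts.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[lang] = {"precision": precision, "recall": recall}
    return metrics
```

Comparing these per-language slices between a candidate and the productionised version is what makes "comparable performance in all supported languages" a testable release criterion rather than a hope.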
To deal with all these challenges, we designed and implemented an ad-hoc TensorBoard-based validation routine, able to quickly point to potential improvements and possible weaknesses of the currently trained model when compared with historical and productionised versions. After some initial setbacks and real-world mess-ups (🙂, it happens) we came up with a restricted set of metrics that needed to be monitored before giving the model a green light. These were tightly linked to Integrity KPIs and to the requirements and constraints of the downstream processes acting on the model’s predictions.
Production infrastructure and business impact
Thanks to a successful partnership with our Engineering team, in the Data Science team we have been developing machine learning solutions at scale for a while. We now have an internal suite of services for deploying and monitoring TensorFlow-based deep learning models, guaranteeing both high performance and full observability. Building our NLP solution on the back of this track record and real-world experience allowed us to make educated design decisions, and reduced the gap between experimentation and production deployment, a common pain point for machine learning projects.
Like all Transformer-derived models (and state-of-the-art NLP systems in general), XLM-R requires its input in tokenised form, using a slightly modified version of the well-known SentencePiece tokeniser. To use our internal inference engine, we had to re-implement the tokeniser as a layer in the TensorFlow graph, so that the two parts now execute together during inference.
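The payoff of fusing tokenisation into the served graph is that clients can send raw strings instead of token ids. To illustrate the kind of transformation involved, here is a deliberately simplified greedy longest-match subword tokeniser; the real SentencePiece algorithm is more sophisticated (unigram/BPE segmentation), so treat this purely as a stand-in:

```python
def greedy_subword_tokenise(text, vocab, unk="<unk>"):
    """Greedy longest-match subword tokenisation (illustrative only).

    At each position, take the longest vocabulary piece that matches,
    falling back to an <unk> token for a single unmatched character.
    This mirrors the shape of the real graph layer: raw string in,
    subword pieces (then ids) out.
    """
    pieces = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            piece = text[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            pieces.append(unk)             # no vocabulary piece matched
            i += 1
    return pieces
```

Running this inside the graph rather than on the client also removes a whole class of client/server version-skew bugs, since the tokeniser and model weights are deployed as a single artefact.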
The Rude Message Detector engine runs on several GPU nodes in two different zones, optimising resource usage thanks to dynamic batching of inputs. If needed, we can spin nodes up and down to cope with the ever-increasing load that the service has to deal with.
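Dynamic batching trades a small, bounded latency budget for much better GPU utilisation: requests are accumulated until the batch is full or a deadline expires, whichever comes first. The sketch below captures the policy on a pre-recorded stream of timestamped requests; a real server would do this concurrently with queues and timers, and the parameter values here are illustrative, not our production settings:

```python
def dynamic_batches(requests, max_batch=4, max_wait_ms=10):
    """Group a stream of timestamped requests into inference batches.

    requests: list of (arrival_ms, payload), sorted by arrival time.
    A batch is flushed when it is full, or when the next request would
    arrive more than max_wait_ms after the batch was opened.
    """
    batches, current, opened_at = [], [], None
    for t, payload in requests:
        # Flush the open batch if it is full or has waited too long.
        if current and (len(current) == max_batch or t - opened_at > max_wait_ms):
            batches.append(current)
            current, opened_at = [], None
        if not current:
            opened_at = t
        current.append(payload)
    if current:
        batches.append(current)
    return batches
```

Tuning `max_batch` and `max_wait_ms` is the knob that balances per-message latency against GPU throughput as load grows.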
All the machine learning systems come natively with a shared logging system for centralised monitoring and real-time anomaly detection. Since the very early releases, we have been able to verify in production the performance of our model and to roll back versions if they didn’t behave as expected on a particular set of metrics. The centralised machine learning dashboard is a very interesting place for data scientists and product managers to check models’ performances and to possibly prioritise potential improvements or add new languages to extend the original scope.
Conclusion and next steps
Our mission is to create a world where all relationships are healthy and equitable. This has its challenges, especially when it comes to machine learning and NLP. We embarked on a journey to ship a native multilingual engine, able to keep our users safe whatever language they use to send messages on our platforms. Luckily in recent years, academic researchers have developed impressive architectures and methods that have been extremely helpful in paving the way for us to develop the engine that powers the Rude Message Detector and numerous other internal solutions.
This paradigm shift in AI and the infinite possibilities it unlocks for the industry might come with some additional challenges arising from the emerging capabilities of these huge deep learning models trained on uncurated text crawled from the web. In Part 2 of this article, we will dive into some of the inner representations of our model for insights into how it makes decisions and any potential biases in its reasoning process.
Thanks to Marc Garcia, Kevin O’Callaghan, Mikhail Masin, Alexander Purtov and Gleb Vazhenin for their crucial contributions to this project.
Thanks for reading! If you enjoyed hearing about our Data Science team’s work, we’re always welcoming new people to the team. Find out more here.