Multilingual message content moderation at scale

Part 2: Embeddings space analysis and cross-lingual representations

Massimo Belloni
Published in Bumble Tech
8 min read · May 24, 2022

In Part 1 of this series, we introduced the machine learning engine that powers the Rude Message Detector, the Badoo app’s fully-multilingual toxicity detector designed to protect our community from harmful messaging.

When a member receives a message within our app that could be harmful or hurtful to the reader, the model lets us check in with the member in real time through a pop-up message.

In the previous post, we covered three main topics: the high-level technical aspects of our NLP model; the approach we followed to fine-tune it; and the lessons we learned along the way. In this second post, we are interested in opening the machine learning ‘black box’ of the model, as we attempt to gain a better understanding of its inner workings and how it deals with different languages and their internal representations. We also explore some possible further use cases for such a powerful deep learning model as this one.

To perform these experiments, we leveraged some general-purpose, open-source multilingual datasets available through HuggingFace, an NLP library currently used by some of the biggest AI organisations in the world, Bumble Inc. included. Even though our model is trained on internal data, we decided to run these experiments on text that was as general as possible, highlighting features of the original model that can be leveraged in other downstream tasks (e.g. cross-lingual similarity and language detection). In particular, we ask two questions: are similar sentences in different languages represented close to each other? Are multilingual embeddings a good proxy for language detection?

Introduction and recap

As already mentioned, our Rude Message Detector is an XLM-RoBERTa model at heart, originally supporting 100 languages and fine-tuned on an internal dataset of ~3M messages in 15+ languages (including English, Portuguese, Russian, Thai and others). Even if we were to redesign some parts of the architecture, the foundation model powering our system would definitely remain the key to its success, thanks to its emergent capabilities and internal representations. One of the most impressive features of XLM-R is its native understanding of ~100 languages, all of which share the same deep embedding spaces. As stated in Part 1, XLM-R is a state-of-the-art Transformer-based language model that is trained, during pre-training, to predict masked tokens in multilingual sentences from a newly created CommonCrawl dataset (2.5TB in total). Each batch (8,192 samples) contains sentences in a single language, and the rate of sampling between different languages (α) is set experimentally by the researchers to achieve the best validation performance in a setting where, clearly, some languages (English, Russian, and Hindi) are more represented than others.
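The role of the sampling rate α can be illustrated with a short sketch. It assumes the exponentiated sampling scheme that the XLM-R paper (arXiv:1911.02116) adopts, in which a language holding a share p_i of the corpus is drawn with probability proportional to p_i^α, with α = 0.3 reported there. The function name and the corpus sizes below are hypothetical placeholders, not the real CC-100 figures.

```python
import numpy as np


def language_sampling_probs(sentence_counts, alpha=0.3):
    """Probability of drawing a batch from each language.

    p_i is the language's share of the corpus and q_i is proportional to
    p_i ** alpha. alpha = 1 keeps the raw shares; smaller values up-weight
    under-represented languages.
    """
    counts = np.asarray(list(sentence_counts.values()), dtype=float)
    shares = counts / counts.sum()
    weights = shares ** alpha
    return dict(zip(sentence_counts, weights / weights.sum()))


# Hypothetical corpus sizes, for illustration only (not the CC-100 figures).
toy_counts = {"en": 1_000_000, "ru": 500_000, "sw": 10_000}

raw = language_sampling_probs(toy_counts, alpha=1.0)
smoothed = language_sampling_probs(toy_counts, alpha=0.3)
print({lang: round(float(p), 3) for lang, p in smoothed.items()})
```

With α below 1 the rare language is sampled far more often than its raw share would suggest, at the expense of the dominant ones, which is the trade-off the researchers tune experimentally.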

A comparison between CC-100 (dataset used for pre-training XLM-R) and Wikipedia (used for mBERT). The bar chart clearly shows which languages are represented the most in the dataset (arXiv:1911.02116 [cs.CL]).

XLM-RoBERTa (base) has 12 hidden layers with 12 self-attention heads each, for a total of ~270M parameters. Each token in the input sentence (N tokens, including the BOS, beginning-of-sentence, and EOS, end-of-sentence, tokens) is first mapped to a context-unaware 768-dimensional embedding (one per token, Nx768). These embeddings then go through the hidden layers (H=12), and from each layer we can retrieve the Nx768 context-aware embeddings (H x N x 768 overall) produced by the self-attention mechanism. What is usually considered the output of XLM-R is the set of Nx768 embeddings of the last hidden layer, together with the 768-sized pooled output, built by condensing the Nx768 representations into a single vector that can be used as a sentence embedding for tasks such as text classification. In our case, we decided to use the 768-sized embedding of the BOS token (token 0) instead, which led to higher validation performance on our problem.
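The shapes described above can be checked with a minimal sketch. It assumes the public `xlm-roberta-base` checkpoint and the HuggingFace `transformers` API rather than our internal fine-tuned model, and the helper names `load_hidden_states` and `bos_embeddings` are hypothetical. The heavy imports live inside the loader so that the weights are only downloaded when it is actually called.

```python
import numpy as np


def load_hidden_states(text, model_name="xlm-roberta-base"):
    """Return an (H+1, N, 768) array of per-layer token embeddings.

    Assumption: the public checkpoint stands in for the fine-tuned model.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # The tokenizer adds the BOS (<s>) and EOS (</s>) tokens for us.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states is a tuple of H+1 tensors shaped (1, N, 768):
    # the context-unaware embeddings followed by the H hidden layers.
    return np.stack([layer[0].numpy() for layer in outputs.hidden_states])


def bos_embeddings(hidden_states, last_k=5):
    """Pick the BOS (token 0) vector from each of the last_k layers."""
    return hidden_states[-last_k:, 0, :]


# Usage (downloads the checkpoint on first call):
#   states = load_hidden_states("Hello world")   # shape (13, N, 768)
#   bos = bos_embeddings(states)                 # shape (5, 768)
```

Keeping the slicing in its own function makes it easy to swap the BOS vector for another sentence representation when comparing layers.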

Analysing the (H+1) x N x 768 embeddings gives insight into how the fine-tuned model stores the semantics of words, and why it does so in that way, which is particularly relevant given the multilingual nature of the underlying XLM-R model.

Sentence embeddings and cross-lingual similarity

In this first experiment, we try to assess whether sentence embeddings from any of the model's layers are a good proxy for multilingual sentence similarity, or whether sentences in different languages tend to have similar embeddings regardless of their meaning.

To perform this experiment, we retrieved two machine translation datasets from HuggingFace:

  • ted_iwlst2013 is a collection of TED talks in multiple languages, where each record is a sentence extracted from a speech in English, with the corresponding translation in the target language.
  • opus_books is collected with a very similar approach, but all the sentences come from copyright-free books, with each sentence translated into multiple languages.

In this experiment we focused on English, French, Portuguese, Dutch, Italian and Russian, all of which are supported by our final model and present in the internal datasets used for fine-tuning. All the pairs from the two datasets are built with English as the source language and one of the other languages as the target.

High-level description of the experimentation strategy: negative samples are collected at random from the target language (N-2).

The experiment consists of 2,500 samples per dataset-language pair (2 datasets, 5 language pairs per dataset: en-fr, en-pt, en-nl, en-it, en-ru). For each combination (e.g. ted_iwlst2013, en-fr) we collected the positive pair (an English sentence and its corresponding translation) and randomly selected N other sentences from those in the target language (the 1/6 baseline quoted with the results implies N = 5). We retrieved the model's embeddings of the BOS token for the last 5 hidden layers (out of H=12) and the pooled output. We then measured the cosine similarity, over the full embedding size (768), between the source sentence in English and each of the candidates: the positive target sentence and the N negative sentences in the target language. Success means that the anchor sentence is most similar to its translation (the positive pair), while failure means that it is more similar to one of the N negatives sampled at random.
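The success criterion can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the description above, not our production code: the function names are hypothetical, and the vectors in the usage example are random stand-ins for real BOS embeddings.

```python
import numpy as np


def cosine_similarities(anchor, candidates):
    """Cosine similarity between one vector and each row of a matrix."""
    anchor = anchor / np.linalg.norm(anchor)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ anchor


def translation_wins(anchor, positive, negatives):
    """True when the translation is closer to the anchor than every negative."""
    candidates = np.vstack([positive, negatives])  # row 0 is the positive
    return int(np.argmax(cosine_similarities(anchor, candidates))) == 0


def success_rate(trials):
    """Share of (anchor, positive, negatives) trials won by the translation."""
    return float(np.mean([translation_wins(a, p, n) for a, p, n in trials]))


def random_baseline(n_negatives):
    """Chance of picking the positive among N + 1 candidates at random."""
    return 1.0 / (n_negatives + 1)


# Synthetic stand-ins for embeddings: the "translation" is a noisy copy
# of the anchor, the negatives are unrelated random vectors.
rng = np.random.default_rng(0)
anchor = rng.normal(size=768)
positive = anchor + 0.1 * rng.normal(size=768)
negatives = rng.normal(size=(5, 768))
print(translation_wins(anchor, positive, negatives), random_baseline(5))
```

Running the same check once per layer (and once for the pooled output) gives one success rate per layer, to be compared against the 1/(N+1) chance level.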

The success rate of the experiment for the last 5 hidden layers of the model (+ pooling) on all the languages of interest. The random baseline is 1/(N+1) = ⅙ ≈ 17%.

The results show good evidence of cross-lingual semantic embeddings in all the model's layers, with performance decreasing the closer we get to the final classification head. This points towards a more task-specific encoding the closer we get to the output. The best performing layers in this experiment were the penultimate layer and the 5th-from-last hidden layer, independently of the dataset used or the model's token size. There is no overwhelming evidence of some languages performing significantly better than others, but this observation has to be validated separately by checking the number of samples coming from each language both in the internal training set and in the CC-100 training set used for XLM-R.

To get even more information on the actual content of the hidden layers' embeddings, we also ran a modified version of the experiment outlined above. For each sentence, the candidate pool now contains the positive target, the N random sentences in the destination language and an additional N random sentences in English. As above, we define success as the case where the embedding with the highest cosine similarity is the positive target, and failure as the case where it is one of the 2N negative sentences (now both in English and in the destination language). For each miss, we also record the language of the selected sentence (English or destination language), to determine whether English embeddings are inherently closer to one another no matter the input sentence.
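A sketch of this modified setup, under the same assumptions as the description above, only needs to label each candidate with the pool it came from. The helper names are hypothetical and the embeddings are assumed to be NumPy arrays.

```python
from collections import Counter

import numpy as np


def closest_candidate(anchor, candidates):
    """Index of the candidate row with the highest cosine similarity."""
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor)
    return int(np.argmax((candidates @ anchor) / norms))


def run_trial(anchor, positive, target_negatives, english_negatives):
    """Return 'hit', or the pool ('target' or 'en') of the wrong pick."""
    pool = np.vstack([positive, target_negatives, english_negatives])
    labels = (
        ["hit"]
        + ["target"] * len(target_negatives)
        + ["en"] * len(english_negatives)
    )
    return labels[closest_candidate(anchor, pool)]


def summarise(outcomes):
    """Success rate, plus the share of misses that landed on English."""
    counts = Counter(outcomes)
    misses = counts["target"] + counts["en"]
    return {
        "success_rate": counts["hit"] / len(outcomes),
        "english_share_of_misses": (
            counts["en"] / misses if misses else float("nan")
        ),
    }
```

With N negatives from each pool there are 2N + 1 candidates, so the chance level drops to 1/(2N+1), and an English share of misses well above one half would suggest that same-language embeddings attract each other.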

The average success rate of the experiment in the original setting above (yellow) and when adding N English sentences to the sample set (purple). The results are lower but still outperform the baseline, which is now 1/11 ≈ 9% (a harder experiment).

Despite the drop in performance, the percentage of successes is still well above the baseline, which is itself lower in the second experiment than in the first (1/(2N+1) vs 1/(N+1)). It is also informative to look at how many of the sentences wrongly selected as the right translation are in English and how many are in the target language.

In pink, the absolute difference between the performance in the original setting and in the new one (yellow minus purple in the chart above); in light blue, the percentage of misses for which the closest embedding is in English rather than in the destination language. Since English misses are clearly the majority, English embeddings are probably closer to each other along some dimensions of the embedding space.

This first set of experiments provides clear evidence of some degree of cross-lingual semantics in the final fine-tuned XLM-R embeddings at any depth, with performance fluctuating from one layer to the next, probably in correlation with the distance from the final classification head. Adding English sentences (the source language) to the KNN pool also points to language-specific components in the embeddings, suggesting that the model could be a good proxy for language detection.

Dimensionality reduction for language detection

Using the same information and datasets we can design a slightly modified experiment, trying to visualise the embeddings for a corpus of random sentences in different languages. The goal here is to qualitatively assess if embeddings in different languages can be easily separated in a lower-dimensional space.

As in the experiment above, 768-sized embeddings are computed for all the layers, here for 10k random samples per language drawn from the pairs in all the dataset-language combinations (e.g. 10k English and French sentences from ted_iwlst2013 en-fr, 10k English and Russian sentences from opus_books en-ru, etc.). Sentences are selected at random because in this experiment we are not interested in their meaning, only in the languages they are written in.

After retrieving the embeddings, we tried different dimensionality reduction techniques in order to plot them in two- or three-dimensional spaces. The one that led to the best and most meaningful visualisations was t-SNE, applied after first reducing the 768 dimensions to 50 with PCA, as is commonly recommended for high-dimensional inputs. Used alone, PCA was unable to create relevant partitions for the problem at hand.
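The two-stage reduction can be sketched as follows. The PCA step is written out with NumPy's SVD to keep it explicit, while the t-SNE step relies on scikit-learn's `TSNE`, which is an assumption since the text does not name the implementation used. Both function names are hypothetical, and the t-SNE settings are defaults rather than the values behind the plots.

```python
import numpy as np


def pca_reduce(embeddings, n_components=50):
    """Project embeddings onto their top principal components."""
    centred = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes, sorted by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def tsne_project(reduced, n_dims=2, seed=0):
    """Embed PCA-reduced vectors in 2D or 3D for plotting.

    Assumption: scikit-learn's TSNE with default perplexity (30), which
    requires more than 30 input samples.
    """
    from sklearn.manifold import TSNE

    tsne = TSNE(n_components=n_dims, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)


# Usage, with `embeddings` an (n_sentences, 768) array for one layer:
#   points = tsne_project(pca_reduce(embeddings, 50), n_dims=2)
#   # then scatter-plot `points`, coloured by language
```

Running PCA first removes most of the noise dimensions and makes t-SNE considerably cheaper on 10k-sentence samples.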

Graphical representation of English and Russian embeddings in a t-SNE 2D space. For some layers more than for others, the clouds look decently separable.

The results change little when using 3 dimensions, but finding the right number of embedding dimensions or the right network architecture for a possible language detection task requires more in-depth research.

Graphical representation of English and French embeddings in a t-SNE 3D space. Similarly to the 2D case, some layers more than others have nicely separable shapes.


The two experiments point quite clearly in two different but coexistent directions: 1) XLM-R fine-tuned embeddings can be leveraged for cross-lingual semantic similarity and 2) the same embeddings (possibly reduced) can be used for language detection: adding the right classification head on top at the right depth should suffice to reliably detect the language of an input sentence, even though XLM-R has not been specifically trained on this task.
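As a rough illustration of what such a classification head could look like, the sketch below fits a single linear layer with a softmax on frozen sentence embeddings, using plain gradient descent in NumPy. It is a hypothetical probe rather than something we trained: the function names are made up, and the usage example runs on synthetic Gaussian clusters standing in for embeddings of two languages.

```python
import numpy as np


def train_language_head(embeddings, labels, n_languages, lr=0.1, epochs=200):
    """Fit a linear layer + softmax on frozen embeddings (cross-entropy)."""
    n_samples, dim = embeddings.shape
    weights = np.zeros((dim, n_languages))
    bias = np.zeros(n_languages)
    one_hot = np.eye(n_languages)[labels]
    for _ in range(epochs):
        logits = embeddings @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy with respect to the logits.
        grad = (probs - one_hot) / n_samples
        weights -= lr * embeddings.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict_language(embeddings, weights, bias):
    """Index of the most likely language for each embedding."""
    return np.argmax(embeddings @ weights + bias, axis=1)


# Synthetic stand-in data: two well-separated clusters in 16 dimensions.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(1.0, 1.0, (100, 16)),
                     rng.normal(-1.0, 1.0, (100, 16))])
train_y = np.array([0] * 100 + [1] * 100)
test_x = np.vstack([rng.normal(1.0, 1.0, (50, 16)),
                    rng.normal(-1.0, 1.0, (50, 16))])
test_y = np.array([0] * 50 + [1] * 50)

weights, bias = train_language_head(train_x, train_y, n_languages=2)
accuracy = float(np.mean(predict_language(test_x, weights, bias) == test_y))
```

On real embeddings the interesting comparison would be the held-out accuracy of the same head attached at different layers.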

Thanks to Nicola Ghio for his contribution to this project.

Thanks for reading! If you enjoy hearing about our Data Science team projects and are interested in the sort of problems we're tackling, we're always welcoming new people to join us. Find out more here.


