Multilingual message content moderation at scale

Part 2: Embeddings space analysis and cross-lingual representations

When a member receives a message in our app that could be harmful or hurtful, the model lets us check in with them in real time through a pop-up message

Introduction and recap

As already mentioned, our Rude Message Detector is an XLM-RoBERTa model at heart, originally supporting 100 languages and fine-tuned on an internal dataset of ~3M messages in 15+ languages (including English, Portuguese, Russian, Thai and others). Even if we were to redesign some parts of the architecture, the foundation model powering our system would definitely remain the key to its success, thanks to its emergent capabilities and internal representations. One of the most impressive features of XLM-R is its native understanding of ~100 languages, which it keeps together in its deep embedding spaces. As stated in Part 1, XLM-R is a state-of-the-art Transformer-based language model which is trained during pre-training to predict masked tokens in multilingual sentences from a newly-created CommonCrawl-based dataset (CC-100, 2.5TB in total). Each batch (8,192 samples) contains sentences in a single language, and the rate of sampling between different languages (α) is set experimentally by the researchers to achieve the best validation performance in a setting where, clearly, some languages (English, Russian, and Hindi) are more represented than others.
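To make that sampling mechanism concrete, here is a minimal Python sketch (not Bumble's internal code) of the exponentially-smoothed language sampling described in the XLM-R paper: each language's sampling probability is its corpus share raised to the power α (the paper uses α = 0.3), then renormalised, which up-weights low-resource languages relative to their raw frequency. The corpus sizes below are made-up placeholders, not the real CC-100 numbers.

```python
# Illustrative sketch of XLM-R's language-sampling scheme: probabilities are
# proportional to each language's corpus share raised to the power alpha.
# Corpus sizes below are placeholders, not the real CC-100 figures.

def sampling_probs(sizes: dict, alpha: float = 0.3) -> dict:
    """Exponentially-smoothed multinomial over languages
    (alpha < 1 boosts low-resource languages relative to their raw share)."""
    total = sum(sizes.values())
    smoothed = {lang: (size / total) ** alpha for lang, size in sizes.items()}
    norm = sum(smoothed.values())
    return {lang: q / norm for lang, q in smoothed.items()}

corpus_gb = {"en": 300.0, "ru": 270.0, "hi": 20.0}  # placeholder sizes
print(sampling_probs(corpus_gb))
# With alpha = 0.3, "hi" gets a far larger share than its raw ~3% of the data.
```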

A comparison between CC-100 (dataset used for pre-training XLM-R) and Wikipedia (used for mBERT). The bar chart clearly shows which languages are represented the most in the dataset (arXiv:1911.02116 [cs.CL]).

Sentence embeddings and cross-lingual similarity

In this first experiment, we try to assess whether sentence embeddings taken from any of the model's layers are a good proxy for multilingual sentence similarity, or whether sentences in different languages tend to have similar embeddings regardless of their meaning.

  • opus_books is collected with a very similar approach, but all the sentences come from copyright-free books, with each sentence translated into multiple languages.
High-level description of the experimentation strategy: negative samples are collected at random from the target language (N-2).
The success rate of the experiment for the last 5 hidden layers of the model (+ pooling) across all the languages of interest. The baseline can be considered to be 1/6 ≈ 16%.
The average success rate of the experiment in the original setting above (yellow) and when adding N English sentences to the candidate set (purple). The results are significantly lower but still outperform the baseline, which is now around 1/11 ≈ 9% (a harder experiment).
In pink, the absolute difference between the performance in the original setting and in the new one (yellow minus purple, as per the chart above); in light blue, the percentage of times that the closest embedding is in English (rather than the target language) when the model misses the target. Since English accounts for the clear majority of misses, English embeddings are probably closer to each other along some dimensions of the embedding space.
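As a rough illustration of this retrieval setup, the sketch below embeds a source sentence and a candidate set (the true translation plus random target-language negatives) with a vanilla `xlm-roberta-base` checkpoint from Hugging Face, mean-pools a chosen hidden layer, and checks whether the translation is the nearest neighbour by cosine similarity. The model name, pooling strategy and example sentences are assumptions for illustration, not our internal setup.

```python
# Hedged sketch of the retrieval experiment: is the true translation the
# nearest neighbour of the source sentence among random negatives?
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)
model.eval()

def embed(sentences: list, layer: int = -1) -> torch.Tensor:
    """Mean-pool token embeddings from a chosen hidden layer."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer]
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

source = "I love reading books."                # English source sentence
candidates = [
    "J'adore lire des livres.",                # true French translation
    "Il pleut beaucoup aujourd'hui.",          # random French negatives
    "Le train part à huit heures.",
]
sims = torch.nn.functional.cosine_similarity(embed([source]), embed(candidates))
print("hit" if sims.argmax().item() == 0 else "miss", sims.tolist())
```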

Dimensionality reduction for language detection

Using the same information and datasets, we can design a slightly modified experiment and try to visualise the embeddings for a corpus of random sentences in different languages. The goal here is to qualitatively assess whether embeddings in different languages can be easily separated in a lower-dimensional space.
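For reference, a projection like the ones shown below can be produced with a few lines of scikit-learn. This is a hedged sketch: the random arrays are placeholders standing in for the pooled per-layer embeddings, which in practice would come from a helper like `embed` in the previous sketch.

```python
# Minimal sketch: project two languages' embeddings to 2D with t-SNE and
# colour them by language to eyeball separability.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data; real inputs would be (n_sentences, hidden_size) arrays
# of pooled XLM-R embeddings for English and Russian sentences.
en_embs = np.random.randn(200, 768)
ru_embs = np.random.randn(200, 768)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([en_embs, ru_embs])
)
plt.scatter(points[:200, 0], points[:200, 1], s=8, label="en")
plt.scatter(points[200:, 0], points[200:, 1], s=8, label="ru")
plt.legend()
plt.show()
```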

Graphical representation of English and Russian embeddings in a t-SNE 2D space. For some layers more than for others, the clouds look decently separable.
Graphical representation of English and French embeddings in a t-SNE 3D space. Similarly to the 2D case, some layers more than others have nicely separable shapes.

Conclusion

The two experiments point quite clearly in two different but compatible directions: 1) XLM-R fine-tuned embeddings can be leveraged for cross-lingual semantic similarity, and 2) the same embeddings (possibly dimensionality-reduced) can be used for language detection. Adding the right classification head at the right depth should suffice to reliably detect the language of an input sentence, even though XLM-R has not been specifically trained on this task.
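As a closing illustration of that second point, here is a hedged sketch of a language-detection probe: pooled XLM-R embeddings (reusing the `embed` helper from the retrieval sketch above) fed to a lightweight scikit-learn classifier. The sentences, layer choice and classifier are assumptions for illustration only, not our production head.

```python
# Hypothetical language-ID probe on top of frozen XLM-R embeddings;
# `embed` is the mean-pooling helper defined in the retrieval sketch above.
from sklearn.linear_model import LogisticRegression

en = ["How are you today?", "The weather is nice.", "I will call you later."]
fr = ["Comment vas-tu aujourd'hui ?", "Il fait très beau.", "Je t'appellerai plus tard."]
X = embed(en + fr).numpy()              # (n_sentences, hidden_size)
y = ["en"] * len(en) + ["fr"] * len(fr)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(embed(["Où est la gare ?"]).numpy()))  # expected: ['fr']
```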


We’re the tech team behind social networking apps Bumble and Badoo. Our products help millions of people build meaningful connections around the world.
