Published in Bumble Tech

Multilingual message content moderation at scale

Part 2: Embeddings space analysis and cross-lingual representations

When a member receives a message in our app that could be harmful or hurtful to read, the model lets us check in with the member in real time through a pop-up message.

Introduction and recap

A comparison between CC-100 (dataset used for pre-training XLM-R) and Wikipedia (used for mBERT). The bar chart clearly shows which languages are represented the most in the dataset (arXiv:1911.02116 [cs.CL]).

Sentence embeddings and cross-lingual similarity

  • ted_iwlst2013 is a collection of TED talks in multiple languages, where each record is a sentence extracted from a speech in English, paired with the corresponding translation in the target language.
  • opus_books is collected with a very similar approach, but all the sentences come from copyright-free books, with each sentence translated into multiple languages.
High-level description of the experimentation strategy: negative samples are collected at random from the target language (N-2).
The success rate of the experiment for the last 5 hidden layers of the model (+ pooling) on all the languages of interest. The baseline can be taken as 1/6 ≈ 17%.
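The success rate in this chart can be computed with a simple nearest-neighbour check. The snippet below is a minimal sketch rather than the code used for the article: it assumes cosine similarity as the closeness measure, uses toy 2-D vectors in place of real sentence embeddings, and the names `cosine` and `success_rate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def success_rate(trials):
    """Fraction of trials where the true translation is the nearest candidate.

    Each trial is (query, candidates, target_index): `query` is the source
    sentence embedding, `candidates` holds the translation plus the negative
    samples, and `target_index` marks the translation's position.
    """
    hits = 0
    for query, candidates, target_index in trials:
        scores = [cosine(query, c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += best == target_index
    return hits / len(trials)


# Toy 2-D vectors standing in for real sentence embeddings (assumption).
trials = [
    # The translation (index 0) points the same way as the query: a hit.
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], 0),
    # A negative (index 1) is closer than the translation (index 2): a miss.
    ([0.0, 1.0], [[1.0, 0.0], [0.1, 0.9], [0.5, 0.5]], 2),
]
rate = success_rate(trials)
```

In a real run, each trial would hold one translation and several random negatives from the target language, and the rate would be computed separately per hidden layer and per language.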
The average success rate of the experiment in the original setting above (yellow) and when N English sentences are added to the sample set (purple). The results are significantly lower but still outperform the baseline, which is now around 1/11 ≈ 9% (a harder experiment).
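The two baseline figures follow from the chance of picking the right candidate at random. As a short worked calculation, assuming every candidate is equally likely under random guessing and that the fractions quoted in the captions correspond to 6 and 11 candidates in total:

```latex
\[
p_{\text{random}} = \frac{1}{K},
\qquad
K = 6 \;\Rightarrow\; p_{\text{random}} = \frac{1}{6} \approx 16.7\%,
\qquad
K = 11 \;\Rightarrow\; p_{\text{random}} = \frac{1}{11} \approx 9.1\%
\]
```

Here K is the number of candidates the query embedding is compared against, so adding more candidates lowers the chance level and makes the task harder.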
In pink, the absolute difference between the performance in the original setting and in the new one (yellow minus purple in the chart above); in light blue, the percentage of misses in which the closest embedding is in English (per destination language). Since these cases are clearly the majority, English embeddings are probably closer to each other in some dimensions of the embedding space.
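The light-blue measurement can be expressed as a simple count over the misses. The sketch below is illustrative only: the function name `miss_language_share` and the input format are assumptions, and the outcomes are made up rather than taken from the article's results.

```python
from collections import Counter


def miss_language_share(nearest):
    """Share of misses per language of the nearest (wrong) candidate.

    `nearest` is a list of (is_target, language) pairs, one per trial,
    describing the candidate closest to the query embedding.
    """
    missed = [language for is_target, language in nearest if not is_target]
    if not missed:
        return {}
    counts = Counter(missed)
    return {language: n / len(missed) for language, n in counts.items()}


# Made-up outcomes for illustration only, not results from the article.
outcomes = [(True, "fr"), (False, "en"), (False, "en"), (False, "fr"), (True, "fr")]
shares = miss_language_share(outcomes)
```

A high English share among the misses is what would suggest that English sentences sit closer to one another than to their translations.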

Dimensionality reduction for language detection

Graphical representation of English and Russian embeddings in a t-SNE 2D space. For some layers more than for others, the clouds look decently separable.
Graphical representation of English and French embeddings in a t-SNE 3D space. As in the 2D case, some layers produce more cleanly separable shapes than others.
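A projection like the ones in these figures can be produced with scikit-learn's `TSNE`. This is a minimal sketch under stated assumptions, not the article's pipeline: the embeddings are synthetic Gaussian clouds standing in for real sentence embeddings of two languages, and the perplexity and other settings are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for sentence embeddings (assumption): two Gaussian
# clouds with different means, one per language.
dim = 32
english = rng.normal(loc=0.0, scale=1.0, size=(40, dim))
russian = rng.normal(loc=3.0, scale=1.0, size=(40, dim))

embeddings = np.vstack([english, russian])
labels = np.array(["en"] * 40 + ["ru"] * 40)

# Project to 2 dimensions; set n_components=3 for the 3D variant.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)
```

Plotting `points` coloured by `labels` gives one cloud per language; repeating the projection for each hidden layer shows which layers keep the languages apart.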

Conclusion
