Cross-Lingual Models (XLM-R)

aman anand
6 min read · Sep 2, 2020


A deep dive into XLM-R

Introduction:

XLM-R is a new model from Facebook AI, where the 'R' stands for RoBERTa. The name makes it tempting to assume that XLM-R is simply XLM built on top of RoBERTa instead of BERT, but that would be inaccurate. XLM-R differs from XLM in that it drops the TLM (Translation Language Model) objective and instead trains RoBERTa on a huge multilingual dataset at a very large scale: around 2.5TB of unlabeled text in roughly 100 languages, extracted from CommonCrawl. XLM-R is trained only with the Masked Language Model (MLM) objective, in the RoBERTa fashion.

XLM-R is a transformer-based multilingual masked language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling, and question answering.

It achieves state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data.

What are Cross-Lingual Models:

The XLM paper proposes an unsupervised method for learning cross-lingual representations using cross-lingual language modeling, investigates two monolingual pretraining objectives, and introduces a new supervised learning objective that improves cross-lingual pretraining when parallel data is available.

Cross-lingual language model pretraining uses either CLM (Causal Language Modeling), MLM (Masked Language Modeling), or MLM in combination with TLM. For the CLM and MLM objectives, the model is trained with batches of 64 streams of continuous sentences, each composed of 256 tokens.
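
As a rough illustration of that data pipeline, the sketch below packs a continuous stream of token ids into 256-token sequences, groups 64 of them into a batch, and applies simple MLM masking. The mask id, the 15% masking rate, and the omission of BERT's 80/10/10 replacement scheme are simplifying assumptions for brevity, not details taken from this article.

```python
import random

SEQ_LEN = 256      # tokens per stream, as described above
BATCH_SIZE = 64    # streams per batch
MASK_ID = 250001   # hypothetical id of the <mask> token
MASK_PROB = 0.15   # standard BERT/RoBERTa masking rate (assumption)

def pack_streams(token_ids, seq_len=SEQ_LEN):
    """Split one long stream of token ids into fixed-length sequences."""
    return [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]

def mask_tokens(seq, mask_prob=MASK_PROB):
    """Randomly replace a fraction of tokens with <mask>; return inputs and labels."""
    inputs, labels = [], []
    for tok in seq:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)   # the model must recover the original token
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)      # -100 = ignored by the loss (PyTorch convention)
    return inputs, labels

# Toy example: a "corpus" of 100k fake token ids packed into one batch of 64 streams.
corpus = [random.randrange(5, 250000) for _ in range(100_000)]
streams = pack_streams(corpus)
batch = [mask_tokens(s) for s in streams[:BATCH_SIZE]]
print(len(batch), len(batch[0][0]))   # -> 64 256
```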

Background:

In 2017, the Transformer was introduced. It uses an attention mechanism that processes the entire text input at once to learn contextual relations between words. The Transformer has two parts: an encoder that reads the text input and generates a latent representation of it (e.g. a vector for each word), and a decoder that produces the translated text from that representation. More details on Transformers can be found here.

The BERT model was introduced in 2018. It uses the Transformer's encoder to learn a language model by masking some of the words and then trying to predict them, allowing it to use the entire context, i.e. words to the left and right of a masked word. More details on BERT can be found here.

Multilingual BERT was trained on around 100 languages, but it was not optimized for multilingual tasks. To overcome this, XLM modifies BERT in the following ways:

  • XLM uses Byte Pair Encoding (BPE), which increases the shared vocabulary between languages, whereas BERT uses words or characters as model input.
  • In XLM, each training sample contains the same text in two languages, so the model can use the context from one language to predict masked tokens in the other. BERT, in contrast, predicts masked tokens using context from a single language only (each sample is built from one language).

One of XLM's major limitations is that it requires parallel examples, which can be difficult to obtain at sufficient scale, whereas XLM-R follows a purely self-supervised approach on monolingual data.
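
Both XLM and XLM-R rely on a shared subword vocabulary across languages, which is what lets related languages share tokens and embeddings. A quick way to see this is to tokenize sentences in two languages with XLM-R's SentencePiece tokenizer; the snippet below is a minimal sketch assuming the Hugging Face transformers library and the xlm-roberta-base checkpoint, neither of which is mentioned in this article.

```python
from transformers import AutoTokenizer

# Load XLM-R's SentencePiece tokenizer (a single ~250k shared subword vocabulary).
# The checkpoint name is an assumption for illustration.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

print(tok.tokenize("The weather is nice today."))
print(tok.tokenize("Das Wetter ist heute schön."))
# Both sentences are segmented into pieces drawn from the same shared vocabulary,
# so related languages end up sharing many subword units and embeddings.
```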

Architecture:

We can say that XLM-R follows the same approach as XLM, only introducing changes that improve performance at scale. XLM-R is a scaled-up version of XLM-100.

The main training objective of XLM-R is the masked language model; as the 'R' (for RoBERTa) suggests, XLM-R is trained in the RoBERTa fashion, i.e. with the masked language model objective only.

XLM-R uses a Transformer model trained with the multilingual MLM objective using only monolingual data. It samples streams of text from each language and trains the model to predict the masked tokens in the input. The raw text is tokenized into subwords with SentencePiece using a unigram language model, and batches from different languages are drawn using the same sampling distribution, with α = 0.3.
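
The sampling distribution referred to here is the exponentially smoothed one from the XLM/XLM-R papers: each language is sampled with probability proportional to its share of the corpus raised to the power α. The minimal sketch below (with made-up corpus sizes) shows how α = 0.3 up-samples low-resource languages compared with proportional sampling.

```python
def language_sampling_probs(sizes, alpha=0.3):
    """Exponentially smoothed sampling distribution over languages.

    sizes: dict mapping language -> corpus size (sentences or tokens).
    With alpha < 1, low-resource languages are up-sampled relative to
    their raw share of the corpus; alpha = 1 recovers proportional sampling.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Toy corpus sizes (made up for illustration, not the real CommonCrawl counts).
sizes = {"en": 300_000_000, "fr": 60_000_000, "sw": 1_000_000}

for alpha in (1.0, 0.3):
    probs = language_sampling_probs(sizes, alpha)
    print(alpha, {k: round(v, 3) for k, v in probs.items()})
# alpha = 1.0 -> Swahili is almost never sampled;
# alpha = 0.3 -> its sampling probability increases substantially.
```

This same α is the hyperparameter whose trade-off between high-resource and low-resource performance is discussed further below.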

The main idea of MLM is to mask one or more words in a sentence and have the model predict the masked tokens given the other words in the sentence. The TLM objective extends MLM to pairs of parallel sentences: to predict a masked English word, the model can attend to both the English sentence and its French translation, which encourages it to align English and French representations. The model can therefore leverage the French context when the English context alone is not sufficient to infer the masked English words.
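
To make the MLM objective concrete, a pretrained XLM-R checkpoint can be queried directly for masked-token predictions in several languages. The fill-mask pipeline and the xlm-roberta-base model name below are assumptions on top of this article (which does not mention Hugging Face); note also that XLM-R itself drops TLM and is pretrained with monolingual MLM only.

```python
from transformers import pipeline

# Masked-token prediction with a pretrained XLM-R checkpoint (name is an assumption).
fill = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R uses <mask> as its mask token; the same model handles many languages.
print(fill("The capital of France is <mask>.")[0]["token_str"])
print(fill("La capitale de la France est <mask>.")[0]["token_str"])
```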

What’s new in XLM-RoBERTa ?

One of the biggest updates XLM-R offers is the increased amount of training data. XLM-R was trained on 2.5TB of newly created, cleaned Common Crawl data in 100 languages. It outperforms previously released multilingual models like mBERT and XLM on tasks such as classification, sequence labeling, and question answering.

Figure: the increase in the size of the Common Crawl dataset over Wikipedia, per language (from the XLM-R paper)
  • In the figure above, the blue bars show the size of the dataset used to train XLM-R, while the orange bars show the size of the Wikipedia data used to train XLM-100 and multilingual BERT; each bar represents one language.
  • XLM-R was trained on Common Crawl, which increases the amount of data for low-resource languages by two orders of magnitude on average, and low-resource languages benefit from the other languages. It uses one Common Crawl dump for English and twelve dumps for all other languages, which significantly increases the dataset size for low-resource languages like Burmese and Swahili.
  • Another figure in the paper plots XNLI performance against the number of languages the model is pretrained on: overall XNLI accuracy decreases from 71.8% to 67.7% when going from XLM-7 to XLM-100.
  • A further figure shows the importance of scaling the model size along with the number of languages. For a given model size, scaling up the shared vocabulary can also improve the performance of the multilingual model on downstream tasks.

The figure above also shows that the higher the value of α, the better the performance on high-resource languages, and vice versa. Considering overall performance, 0.3 was found to be the optimal value for α.

Results on cross-lingual classification:

Table: report of accuracy on each of the 15 XNLI languages and the average accuracy.

The table shown above reports accuracy on each of the 15 XNLI languages along with the average accuracy, together with the dataset D used for pretraining, the number of models #M, and the number of languages #lg each model handles.

XLM-R sets a new state of the art on XNLI. On cross-lingual transfer, XLM-R obtains 80.9% accuracy, outperforming the XLM-100 and mBERT open-source models by 10.2% and 14.6% average accuracy, respectively. On the low-resource languages Swahili and Urdu, XLM-R outperforms XLM-100 by 15.7% and 11.4%, and mBERT by 23.5% and 15.8%.

XLM-R obtains a new state of the art on XNLI of 83.6% average accuracy.

Application:

XLM-R can be fine-tuned on just one language and then applied zero-shot to other languages, as sketched below.

XLM-R handles 100 languages and still remains competitive with monolingual counterparts.
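
The sketch below illustrates this zero-shot transfer pattern: a classification head on top of XLM-R is fine-tuned on a couple of English examples and then applied directly to French input. The checkpoint name, the toy dataset, and the hyperparameters are placeholders, not details from this article.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Zero-shot cross-lingual transfer: fine-tune on English labels only,
# then run the same classifier on another language with no extra training data.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

english_train = [("I loved this movie.", 1), ("This film was terrible.", 0)]

model.train()
for text, label in english_train:   # a real run would loop over a full labeled dataset
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optim.step()
    optim.zero_grad()

# Zero-shot: evaluate directly on French input with the same fine-tuned model.
model.eval()
with torch.no_grad():
    logits = model(**tok("Ce film était fantastique.", return_tensors="pt")).logits
print(logits.softmax(-1))
```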

Conclusion:

XLM-R is the new state-of-the-art multilingual masked language model, trained on 2.5TB of newly created, cleaned Common Crawl data in 100 languages. It provides strong improvements over previous multilingual models like mBERT and XLM on classification, sequence labeling, and question answering. At the same time, the study exposes the limitations of multilingual MLMs, in particular the trade-off between high-resource and low-resource languages and the importance of key hyperparameters.

XLM-R is the first multilingual model to outperform traditional monolingual baselines that rely on pretrained models. Earlier models such as multilingual BERT and XLM are limited in their ability to learn useful representations for low-resource languages. XLM-R demonstrates the effectiveness of multilingual models over monolingual ones, showing strong improvements on low-resource languages.
