XLMs Pretraining Model | Towards AI

Cross-lingual Language Model

Discussing XLMs and unsupervised cross-lingual word embedding by multilingual neural language models

Edward Ma
Jul 15 · 5 min read

Pre-trained models have repeatedly been shown to improve downstream tasks. Lample and Conneau propose two new training objectives for training cross-lingual language models (XLM). Their approach achieves state-of-the-art results on Cross-lingual Natural Language Inference (XNLI). Separately, Wada and Iwata propose a way to learn cross-lingual text representations without any parallel data, which they name Multilingual Neural Language Models.

This story will discuss Cross-lingual Language Model Pretraining (Lample and Conneau, 2019) and Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models (Wada and Iwata, 2018).

The following will be covered:

  • Data
  • Cross-lingual Language Model Architecture
  • Multilingual Neural Language Models Architecture
  • Experiment


Data

Lample and Conneau use the following parallel data:

  • MultiUN (Ziemski et al., 2016): French, Spanish, Russian, Arabic and Chinese
  • IIT Bombay corpus (Kunchukuttan et al., 2018): Hindi
  • OPUS (Tiedemann, 2012): German, Greek, Bulgarian, Turkish, Vietnamese, Thai, Urdu and Swahili

Wada and Iwata use the News Crawl 2012 monolingual corpus for every language except Finnish, for which they use News Crawl 2014.

Cross-lingual Language Model Architecture

Input Representation

Besides subword embeddings, XLM also feeds position embeddings (representing each token's position within the sentence) and language embeddings (identifying the language) into the Language Model (LM) to learn text representations. The LM objectives are:

  • Causal Language Modeling (CLM)
  • Masked Language Modeling (MLM)
  • Translation Language Modeling (TLM)
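As a sketch of how these three embeddings combine, here is a minimal numpy version of the input representation: the subword, position and language embeddings are summed per token. The dimensions and random initialization below are illustrative assumptions, not the paper's values.

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size, n_positions, n_langs, d_model = 100, 16, 2, 8
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, d_model))  # subword embeddings
pos_emb = rng.normal(size=(n_positions, d_model))   # position embeddings
lang_emb = rng.normal(size=(n_langs, d_model))      # language embeddings

def xlm_input(token_ids, lang_id):
    """Sum subword, position and language embeddings per token."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions] + lang_emb[lang_id]

x = xlm_input([5, 17, 42], lang_id=0)  # shape (3, 8): one vector per token
```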

Causal Language Modeling (CLM)
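CLM is the standard autoregressive objective: predict each subword from its left context, i.e. maximize P(w_t | w_1, …, w_{t-1}). As a minimal numpy sketch of the per-token loss (toy logits standing in for a real Transformer's output, an assumption for brevity):

```python
import numpy as np

def clm_loss(logits, token_ids):
    """Average negative log-likelihood of each token given its left context.
    logits[t] holds the model's scores for the token at position t+1."""
    # Shift by one: position t predicts token t+1.
    logits, targets = logits[:-1], np.asarray(token_ids[1:])
    # Log-softmax over the vocabulary dimension.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# With uniform (all-zero) logits the loss is log(vocab_size).
loss = clm_loss(np.zeros((5, 10)), [3, 1, 4, 1, 5])
```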

Masked Language Modeling (MLM)

The differences from the MLM objective of Devlin et al. (2018) are:

  • Using text streams of an arbitrary number of sentences, rather than only pairs of sentences
  • Subsampling high-frequency subwords
MLM Architecture (Lample and Conneau, 2019)
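The masking itself follows BERT's 80/10/10 scheme, while position selection downweights very frequent subwords. A sketch with the standard library, assuming positions are sampled with weights proportional to the inverse square root of subword frequency (the exact weighting is an assumption based on Lample and Conneau, 2019):

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, freqs, n_mask, vocab, rng):
    """Pick n_mask positions (weighted against frequent subwords),
    then apply BERT-style 80% [MASK] / 10% random / 10% keep."""
    weights = [freqs[t] ** -0.5 for t in tokens]  # subsample frequent subwords
    picked = set()
    while len(picked) < n_mask:
        picked.add(rng.choices(range(len(tokens)), weights)[0])
    inputs, targets = list(tokens), [None] * len(tokens)
    for i in picked:
        targets[i] = tokens[i]       # only picked positions are predicted
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK             # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: random subword
        # remaining 10%: keep the original subword
    return inputs, targets

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = {"the": 100, "cat": 5, "sat": 5, "on": 50, "mat": 5}
inputs, targets = mlm_mask(tokens, freqs, n_mask=2, vocab=list(freqs), rng=rng)
```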

Translation Language Modeling (TLM)

A parallel sentence pair is concatenated, and subwords are randomly masked in both the source and the target sentence. To predict a masked subword, the model can leverage context from either language, which encourages it to align representations across the two languages.

TLM Architecture (Lample and Conneau, 2019)
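Building a TLM training example can be sketched as follows: concatenate the parallel pair, reset the position ids for the target sentence, and tag each token with its language id (language ids 0/1 here are illustrative placeholders):

```python
def tlm_example(src_tokens, tgt_tokens):
    """Concatenate a parallel sentence pair for TLM: positions are
    reset for the target sentence, and each token carries a language id,
    so masked subwords in either sentence can attend to both languages."""
    tokens = src_tokens + tgt_tokens
    positions = list(range(len(src_tokens))) + list(range(len(tgt_tokens)))
    langs = [0] * len(src_tokens) + [1] * len(tgt_tokens)
    return tokens, positions, langs

tokens, positions, langs = tlm_example(["take", "a", "seat"], ["prenez", "place"])
```

Masking is then applied to `tokens` exactly as in MLM; the language embedding lookup uses `langs` and the position embedding lookup uses the reset `positions`.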

Multilingual Neural Language Models Architecture

The following figure shows the architecture of this model, where:

  • f: shared forward and backward LSTM networks
  • E_BOS: the shared embedding fed as the initial input
  • W_EOS: the shared weight indicating how likely the next word is the end of the sentence
  • E_l: word embeddings of language l
  • W_l: linear projection for language l, used to calculate the probability distribution of the next word
The architecture of the multilingual neural language model (Wada and Iwata 2018)
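The key design choice is parameter sharing: the sequence model f, E_BOS and W_EOS are shared across languages, while only the embeddings E_l and projections W_l are language-specific, which pushes all languages into a common representation space. A toy numpy sketch of a forward pass, with a simple tanh recurrence standing in for the shared LSTM (an assumption for brevity; dimensions are illustrative):

```python
import numpy as np

d, vocab = 4, 6  # toy hidden size and per-language vocabulary size
rng = np.random.default_rng(0)

W_h = rng.normal(size=(d, d))      # shared recurrent weights (stand-in for f)
E_BOS = rng.normal(size=d)         # shared initial input embedding
E = {l: rng.normal(size=(vocab, d)) for l in ("en", "fr")}  # language-specific E_l
W = {l: rng.normal(size=(d, vocab)) for l in ("en", "fr")}  # language-specific W_l

def next_word_probs(token_ids, lang):
    """Run the shared recurrence over language-specific embeddings,
    then project with W_l to get a next-word distribution."""
    h = np.tanh(W_h @ E_BOS)
    for t in token_ids:
        h = np.tanh(W_h @ (h + E[lang][t]))
    logits = h @ W[lang]
    return np.exp(logits) / np.exp(logits).sum()

p = next_word_probs([1, 3], "en")  # a distribution over the English vocabulary
```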


Experiment

XLM results compared with other models (Lample and Conneau, 2019)

Wada and Iwata focus on scenarios where only a small amount of monolingual data is available, or where the domains of the monolingual corpora differ across languages. They vary the dataset size to measure its effect on performance. The following figure shows that their model outperforms the other models when the dataset is small.

The comparison result of multilingual neural language model (Wada and Iwata 2018)

Take Away

  • CLM does not scale to the cross-lingual scenario.
  • XLM may not fit low-resource languages, as it requires parallel data (TLM) to boost performance. Multilingual Neural Language Models are designed to overcome this limitation.



Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Written by Edward Ma

Focused on Natural Language Processing and Data Science Platform Architecture. https://makcedward.github.io/
