Pre-trained models have been shown to improve downstream tasks. Lample and Conneau propose two new training objectives to train cross-lingual language models (XLM), an approach that achieves state-of-the-art results on Cross-lingual Natural Language Inference (XNLI). Wada and Iwata, on the other hand, propose a way to learn cross-lingual text representations without parallel data, which they call Multilingual Neural Language Models.
This story will discuss Cross-lingual Language Model Pretraining (Lample and Conneau, 2019) and Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models (Wada and Iwata, 2018).
The following will be covered:
- Cross-lingual Language Model Architecture
- Multilingual Neural Language Models Architecture
Lample and Conneau use Wikipedia dumps as monolingual data, while the cross-lingual (parallel) data come from:
- MultiUN (Ziemski et al., 2016): French, Spanish, Russian, Arabic and Chinese
- IIT Bombay corpus (Anoop et al., 2018): Hindi
- OPUS (Tiedemann, 2012): German, Greek, Bulgarian, Turkish, Vietnamese, Thai, Urdu and Swahili
Wada and Iwata use the News Crawl 2012 monolingual corpus for every language except Finnish, for which they use News Crawl 2014.
Cross-lingual Language Model Architecture
To handle out-of-vocabulary (OOV) tokens and the cross-lingual setting, the byte pair encoding (BPE) subword algorithm is applied to split words into subwords. Instead of learning a different set of subwords per language, a single vocabulary is shared across all languages, so that the same alphabets, digits, special tokens, and proper nouns improve the alignment of embedding spaces across languages.
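To make the shared vocabulary concrete, here is a minimal from-scratch BPE sketch that learns one merge table over sentences pooled from several languages. The corpora, merge count, and helper names are illustrative assumptions, not the authors' actual fastBPE setup.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

def learn_shared_bpe(corpora, num_merges=50):
    """Learn one BPE merge table over sentences pooled from every language,
    so identical strings (digits, proper nouns, ...) get the same subwords."""
    vocab = Counter()
    for sentences in corpora.values():          # pool all languages together
        for sentence in sentences:
            for word in sentence.split():
                vocab[" ".join(word) + " </w>"] += 1
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy usage: "Paris" appears in both corpora, so the same merges cover it in both.
corpora = {
    "en": ["Paris is the capital of France"],
    "fr": ["Paris est la capitale de la France"],
}
print(learn_shared_bpe(corpora, num_merges=20)[:5])
```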
Besides subword embeddings, XLM also feeds position embeddings (representing the position of each token in the sentence) and language embeddings (representing the language of the input) into the Language Models (LMs) used to learn text representations (a sketch of the combined input follows the list below). Those LMs are:
- Causal Language Modeling (CLM)
- Masked Language Modeling (MLM)
- Translation Language Modeling (TLM)
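A minimal PyTorch sketch of how the three embeddings can be summed before the Transformer. The vocabulary size, number of languages, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class XLMInputEmbedding(nn.Module):
    """Sum of token, position and language embeddings fed to the Transformer."""
    def __init__(self, vocab_size=95000, n_langs=15, max_len=256, dim=1024):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.position = nn.Embedding(max_len, dim)
        self.language = nn.Embedding(n_langs, dim)

    def forward(self, token_ids, lang_ids):
        # token_ids, lang_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        return self.token(token_ids) + self.position(positions) + self.language(lang_ids)

# Toy usage: a batch of 2 sentences, 5 subwords each, all in language 0.
emb = XLMInputEmbedding()
token_ids = torch.randint(0, 95000, (2, 5))
lang_ids = torch.zeros(2, 5, dtype=torch.long)
print(emb(token_ids, lang_ids).shape)   # torch.Size([2, 5, 1024])
```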
Causal Language Modeling (CLM)
CLM consists of a Transformer that learns text representations by predicting the next word given the previous words; the hidden states of the previous batch are provided to the current batch so the model has context beyond the batch boundary.
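Below is a small, self-contained PyTorch sketch of the CLM idea: a causal attention mask restricts each position to the tokens before it, and the loss compares position t's prediction with token t+1. The sizes are toy values, and carrying hidden states across batches (mentioned above) and the language embedding are omitted.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Minimal CLM sketch (not the XLM architecture or sizes): a Transformer
    encoder with a causal mask so position t only attends to positions <= t."""
    def __init__(self, vocab_size=1000, dim=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.position = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.embed(token_ids) + self.position(pos)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        hidden = self.encoder(x, mask=causal_mask)   # True entries are blocked
        return self.lm_head(hidden)

# Next-word loss: predictions at position t are scored against the token at t+1.
model = TinyCausalLM()
tokens = torch.randint(0, 1000, (2, 10))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
print(loss.item())
```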
Masked Language Modeling (MLM)
Lample and Conneau follow the Devlin et al. (2018) approach: 15% of subwords are picked at random and replaced by the reserved token [MASK] 80% of the time, by a random subword 10% of the time, and left unchanged the remaining 10% of the time (see the masking sketch after the list below).
The differences from Devlin et al. (2018) are:
- Using text streams of an arbitrary number of sentences, rather than only pairs of sentences
- Subsampling high-frequency subwords
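A sketch of the 80/10/10 corruption rule in PyTorch; the function and argument names are my own, and the frequency-based subsampling of the 15% selection is noted but not implemented.

```python
import torch

def mask_for_mlm(token_ids, mask_id, vocab_size, select_prob=0.15):
    """MLM corruption sketch: select ~15% of subword positions, then replace a
    selected position by [MASK] 80% of the time, by a random subword 10% of
    the time, and keep it unchanged the remaining 10% of the time.
    (XLM additionally skews the selection toward rarer subwords via
    frequency-based subsampling; that weighting is omitted here.)"""
    selected = torch.rand(token_ids.shape) < select_prob
    targets = token_ids.clone()
    targets[~selected] = -100                          # ignored by cross-entropy loss

    corrupted = token_ids.clone()
    decision = torch.rand(token_ids.shape)
    corrupted[selected & (decision < 0.8)] = mask_id   # 80%: [MASK]
    random_ids = torch.randint(0, vocab_size, token_ids.shape)
    replace = selected & (decision >= 0.8) & (decision < 0.9)
    corrupted[replace] = random_ids[replace]           # 10%: random subword
    return corrupted, targets                          # remaining 10%: unchanged

# Toy usage with a 100-subword vocabulary whose [MASK] id is 99.
tokens = torch.randint(0, 99, (2, 10))
corrupted, targets = mask_for_mlm(tokens, mask_id=99, vocab_size=100)
print(corrupted, targets, sep="\n")
```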
Translation Language Modeling (TLM)
CLM and MLM are designed for monolingual data, while TLM targets cross-lingual (parallel) data. BERT uses segment embeddings to distinguish the different sentences in a single input sequence; XLM replaces them with language embeddings to represent the different languages.
Subwords are randomly masked in the sentences of both languages, so the model can leverage subwords from either language to predict any [MASK] token.
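A plain-Python sketch of how a TLM training example can be assembled from a translation pair: the two sentences are concatenated, language ids mark each half, and the positions of the target sentence are reset. The ids and the function name are illustrative; MLM-style masking would then be applied to both halves.

```python
def build_tlm_input(src_ids, tgt_ids, src_lang, tgt_lang):
    """TLM input sketch: concatenate a translation pair into one sequence,
    with language ids marking which sentence each subword belongs to and the
    positions of the target sentence reset to 0. A [MASK] in one language can
    then be predicted from context in the other."""
    token_ids = src_ids + tgt_ids
    lang_ids = [src_lang] * len(src_ids) + [tgt_lang] * len(tgt_ids)
    positions = list(range(len(src_ids))) + list(range(len(tgt_ids)))
    return token_ids, lang_ids, positions

# Toy usage with made-up subword ids for an English (lang 0) / French (lang 1) pair.
tokens, langs, positions = build_tlm_input([11, 12, 13], [21, 22, 23, 24], src_lang=0, tgt_lang=1)
print(tokens)     # [11, 12, 13, 21, 22, 23, 24]
print(langs)      # [0, 0, 0, 1, 1, 1, 1]
print(positions)  # [0, 1, 2, 0, 1, 2, 3]
```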
Multilingual Neural Language Models Architecture
Wada and Iwata noticed that relying on parallel data is not suitable for low-resource languages: when a model cannot learn text representations from (scarce) parallel data, subword embeddings will not be aligned across languages. Instead, they share a bidirectional LSTM across languages to learn multilingual word embeddings. Since the architecture is shared across languages, Wada and Iwata expect the model to learn similar embeddings for words that play the same role in different languages.
The following figure shows the architecture of this model (a code sketch follows the notation list), where:
- f: forward and backward LSTM networks, shared across languages
- E_BOS: initial input embedding, shared across languages
- W_EOS: linear projection indicating how likely it is that the next word is the end of the sentence, shared across languages
- E_l: word embeddings of language l
- W_l: linear projection of language l, used to calculate the probability distribution of the next word
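The sketch below shows one way to read this setup in PyTorch: a single LSTM shared by all languages, with per-language embeddings E_l and projections W_l, plus a shared E_BOS input and W_EOS projection. The layer sizes and the single-direction simplification are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultilingualLM(nn.Module):
    """Sketch of the shared-architecture idea: one LSTM f is shared by every
    language, while word embeddings E_l and output projections W_l are
    language-specific; E_BOS and W_EOS are shared."""
    def __init__(self, vocab_sizes, dim=300, hidden=300):
        super().__init__()
        self.shared_lstm = nn.LSTM(dim, hidden, batch_first=True)   # f (one direction shown)
        self.e_bos = nn.Parameter(torch.zeros(1, 1, dim))           # shared E_BOS
        self.w_eos = nn.Linear(hidden, 1)                           # shared W_EOS
        self.embed = nn.ModuleDict({l: nn.Embedding(v, dim) for l, v in vocab_sizes.items()})  # E_l
        self.proj = nn.ModuleDict({l: nn.Linear(hidden, v) for l, v in vocab_sizes.items()})   # W_l

    def forward(self, lang, token_ids):
        x = self.embed[lang](token_ids)                                    # language-specific embeddings
        x = torch.cat([self.e_bos.expand(x.size(0), 1, -1), x], dim=1)     # prepend shared BOS input
        hidden, _ = self.shared_lstm(x)                                    # shared recurrence
        word_logits = self.proj[lang](hidden)                              # language-specific next-word scores
        eos_logits = self.w_eos(hidden)                                    # shared end-of-sentence score
        return word_logits, eos_logits

# Toy usage: two languages share the LSTM but keep their own embeddings and projections.
model = MultilingualLM({"en": 1000, "fr": 1200})
logits, eos = model("en", torch.randint(0, 1000, (2, 7)))
print(logits.shape, eos.shape)   # torch.Size([2, 8, 1000]) torch.Size([2, 8, 1])
```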
Overall, XLM (MLM + TLM) achieves good results across languages. Since the authors noticed that CLM does not scale to the cross-lingual setting, they do not include the CLM training objective in the following model comparison.
Wada and Iwata focus on scenarios where only a small amount of monolingual data is available, or where the domains of the monolingual corpora differ across languages. They therefore vary the data set size to study performance; the following figure shows that their model outperforms the other models when the data set is small.
- BERT uses segment embeddings (representing different sentences) while XLM uses language embeddings (representing different languages).
- CLM does not scale to a cross-lingual scenario.
- XLM may not fit low-resource languages, as it requires parallel data (TLM) to boost performance. Meanwhile, Multilingual Neural Language Models are designed to overcome this limitation.
- Bidirectional Encoder Representations from Transformers (BERT)
- 3 subword algorithms help to split a word
- Generative Pre-Training
- XLM Implementation (PyTorch)
- G. Lample and A. Conneau. Cross-lingual Language Model Pretraining. 2019
- J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018
- T. Wada and T. Iwata. Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models. 2018