Multilingual Transformers
Why BERT is not the best choice for multilingual tasks
Last year, we saw rapid improvements in transformer architectures. With the GLUE benchmark serving as the main reference point for the state of the art in language understanding tasks, most research efforts have focused on English data. BERT, RoBERTa, DistilBERT, XLNet — which one to use? provides an overview of recent transformer architectures and their pros and cons.
It is challenging to keep track of the GLUE leaderboard because progress on language understanding tasks is so fast-paced. Every month a different team takes the top position.
At the same time, transformer architectures have been applied to multilingual tasks. To evaluate these tasks, the approaches discussed here use the Cross-lingual Natural Language Inference (XNLI) corpus, which consists of labelled sentence pairs in 15 languages. Each data point consists of a Premise and a Hypothesis, and every pair is labelled for textual entailment: whether the Hypothesis is entailed by, contradicts, or is neutral with respect to the Premise.
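To make the structure of the corpus concrete, here is a minimal sketch of an XNLI-style data point in Python. The field names, toy sentences, and label mapping below are illustrative assumptions for this sketch rather than the exact XNLI schema.

```python
# A minimal sketch of an XNLI-style data point.
# The label mapping follows the common NLI convention
# (entailment / neutral / contradiction); the exact schema
# depends on how you load the corpus.

from dataclasses import dataclass

LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

@dataclass
class XNLIExample:
    language: str    # one of the 15 XNLI languages, e.g. "fr"
    premise: str
    hypothesis: str
    label: int       # index into LABELS

# Toy example (not taken from the actual corpus):
example = XNLIExample(
    language="en",
    premise="The cat sat on the mat.",
    hypothesis="An animal is resting on the mat.",
    label=0,  # the hypothesis is entailed by the premise
)

print(LABELS[example.label])  # -> "entailment"
```

If the Hugging Face datasets library is installed, the corpus itself can typically be loaded with load_dataset("xnli", "fr"), which exposes the same premise, hypothesis, and label fields per language.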