CMU, DeepMind & Google’s XTREME Benchmarks Multilingual Model Generalization Across 40 Languages

Synced
Apr 15, 2020

An essential step toward more general-purpose natural language processing (NLP) models is achieving a certain level of multilingual competence. But because most of the world's estimated 6,900 languages lack the data needed to train robust models, existing NLP research and methods still tend to focus on English-language tasks.

Junjie Hu is a PhD student at the Language Technologies Institute of Carnegie Mellon University. Hu previously interned at Google, where he observed it was difficult to train cross-lingual models due to a lack of established and comprehensive tasks and environments for evaluating and comparing model performance on cross-lingual generalization.

Although recent multilingual approaches like mBERT and XLM have shown impressive results in learning general-purpose multilingual representations, a fair comparison between these models remains difficult as most evaluations focus on different sets of tasks designed for similar languages.

Hu and another CMU researcher together with DeepMind’s Sebastian Ruder and researchers from Google recently published a study introducing XTREME, a multi-task benchmark that evaluates cross-lingual generalization capabilities of multilingual representations across 40 languages and nine tasks. Hu told Synced “Hopefully XTREME can encourage more research efforts in building multilingual NLP models and effective human curations for multilingual resources.”


Based on the researchers' analysis, the cross-lingual transfer performance of current models varies significantly both across tasks and across languages. To maximize language diversity, the team selected, from the 100 languages with the most Wikipedia articles, 40 languages spanning diverse language families and written scripts.

“We also made sure to cover languages with low, medium, and high resource, in other words, find a balance between language diversity and resource availability,” Hu said. The research included under-studied languages such as the Dravidian language Tamil spoken in southern India, Sri Lanka, and Singapore; as well as Niger-Congo languages Swahili and Yoruba.

Hu says there is an ongoing effort to extend XTREME to cover up to 100 languages.

Each task covers a subset of the 40 languages, so to succeed on the XTREME benchmark a model must learn multilingual representations that summarize linguistic information at different levels and generalize across the diverse set of cross-lingual transfer tasks.

XTREME focuses on the zero-shot cross-lingual transfer scenario, in which annotated training data is provided in English but none in the target languages. To be evaluated with XTREME, a model is first pretrained on multilingual text using objectives that encourage cross-lingual learning, then fine-tuned on task-specific English data. XTREME then measures zero-shot cross-lingual transfer performance on the remaining languages, for which no task-specific training data was provided.
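The three-stage recipe above can be sketched in Python. This is a minimal illustration of the evaluation protocol's structure only; the function names, toy language codes, and placeholder scores are assumptions for illustration and are not part of the actual XTREME codebase or any real model:

```python
def pretrain(corpora):
    """Stage 1: pretrain on multilingual text (e.g. with a masked-LM objective)."""
    return {"stage": "pretrained", "languages": sorted(corpora)}

def finetune(model, english_task_data):
    """Stage 2: fine-tune on task-specific labeled data in English only."""
    return dict(model, stage="finetuned", task_languages=["en"])

def zero_shot_eval(model, target_languages):
    """Stage 3: evaluate directly on target languages that got no task data."""
    assert model["stage"] == "finetuned"
    return {lang: None for lang in target_languages}  # placeholder scores

# Pretrain on several languages, fine-tune only on English,
# then evaluate zero-shot on the other languages.
model = pretrain({"en", "sw", "ta", "yo"})
model = finetune(model, english_task_data=["labeled English examples"])
scores = zero_shot_eval(model, ["sw", "ta", "yo"])
print(sorted(scores))  # -> ['sw', 'ta', 'yo']
```

The key point the sketch encodes is that task-specific supervision touches only English (`task_languages == ["en"]`), so any performance on Swahili, Tamil, or Yoruba must come from the multilingual representations learned in pretraining.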

In experiments with state-of-the-art pretrained multilingual models such as mBERT, XLM, XLM-R, and M4, the researchers found that performance differences were most pronounced on syntactic and sentence-retrieval tasks. While the multilingual models approached human-level performance on many tasks in English and did reasonably well on Indo-European languages, they struggled with Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.

“Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer,” Ruder and co-author Melvin Johnson wrote in a Google blog post on the paper.

Advanced techniques developed for English-language applications have dominated most of the recent and impressive NLP breakthroughs. Building on cross-lingual deep contextual representations, the researchers believe this new work can contribute to improving NLP performance for the 80 percent of humans who speak languages other than English.

The paper XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization is on arXiv, and the code is on GitHub.

Journalist: Yuan Yuan | Editor: Michael Sarazen






