FILTER: Understand Foreign Languages Better

A technique that promotes a deeper understanding between different languages

Rohit Pillai
The Startup
5 min read · Sep 16, 2020


Photo by GRÆS Magazine on Unsplash

Based on sources from across the internet, only somewhere between 1 in 6 and 1 in 7 people in the world speak English. Despite this underwhelming minority of the world's population speaking English, the vast majority of natural language understanding and generation datasets, like the Stanford Question Answering Dataset (SQuAD) and GLUE, as well as the large-scale pretrained models like BERT, RoBERTa and ALBERT that have revolutionized the NLP world, are based solely on the English language.

However, there has been a recent focus on other languages, with the creation of multi-lingual large-scale pretrained models like XLM and XLM-RoBERTa and the introduction of complex multi-lingual tasks like question answering, document classification, information retrieval and more. Most recently, XTREME and XGLUE are two collections of multi-lingual datasets that require models to be good at several tasks in order to perform well on their leaderboards.

Now let's dive a little deeper into how these multi-lingual models are created. There are two main schools of thought on how to approach this:

  1. Learning an embedding that is common across all the languages in the world (or at least whatever's available to train with). This can be done either by feeding a huge amount of text in a multitude of languages to a large language model like XLM or XLM-RoBERTa to learn an implicit embedding (a word has a different embedding based on its context), or by using a simple neural network to learn an explicit embedding like Word2Vec (a word always maps to the same embedding). The sketch right after this list contrasts the two.
  2. Translating the data either at training time (translate-train) or at test time (translate-test). Translate-train translates the English training data into a foreign language and adds the translated text to the training dataset. Translate-test converts the foreign-language data to English at test time and has the model make its prediction on this English data; for tasks like question answering or span selection the prediction is then translated back into the foreign language, while for classification the predicted class needs no translation.
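
To make that contrast concrete, here is a minimal sketch of the two embedding styles in Python, using the Hugging Face transformers library and a public XLM-RoBERTa checkpoint (my choices for illustration, not code from any of the papers): a contextual model gives the same word a different vector in every sentence, while a Word2Vec-style table always returns the same vector.

```python
# Minimal sketch: contextual (implicit) vs. static (explicit) embeddings.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

# The same word ("bank") appears in two different contexts.
sentences = ["I sat by the river bank.", "I deposited cash at the bank."]

with torch.no_grad():
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state  # shape: (2, seq_len, 768)

# Every token now has its own context-dependent vector, so "bank" ends up
# with a different embedding in each sentence (the implicit/contextual case).

# A static, Word2Vec-style embedding is just a fixed lookup table instead:
# the same word always maps to the same vector, regardless of context.
static_embeddings = {"bank": torch.randn(300), "river": torch.randn(300)}
```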

Even though the dataset in the translate-train method has been augmented with translations of the English sentences, each input example is still in only one language. Nowhere in the translate-train architecture do multiple languages ever interact. Thus, translate-train is more of a data augmentation method than one that promotes an understanding of multiple languages.
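
As a rough illustration, translate-train boils down to something like the snippet below. The translate function is a hypothetical stand-in for whatever machine translation system you have access to, and the (text, label) format is made up for the example.

```python
# Rough sketch of translate-train as data augmentation (illustrative only).

def translate(text: str, target_lang: str) -> str:
    """Placeholder for an external machine translation system or API."""
    raise NotImplementedError

def translate_train_augment(english_data, target_lang):
    """english_data is an iterable of (text, label) pairs."""
    augmented = list(english_data)  # keep the original English examples
    for text, label in english_data:
        # Add the translated copy with the same label.
        augmented.append((translate(text, target_lang), label))
    return augmented
```

Note that every example in the augmented dataset is still monolingual: the two languages never appear together in a single input.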

The translate-train architecture. The streams on the left and the right go through the same set of transformers independently

Inspired by the translate-train method, but with a desire to make the model understand the relationships between two languages, the researchers at Microsoft Dynamics 365 AI Research propose a new way to train a multi-lingual model, called FILTER. The input to FILTER is the same as for any translate-train model: a sentence or paragraph in English (or any other source language) (E), along with its corresponding translation in the target foreign language (F) you want to train on.

FILTER is also a three-stage architecture like translate-train, but that's where the similarities stop. Where translate-train uses either (E) or (F) as input, FILTER uses both (E) and (F). Both sentences go through two copies (one for each language) of a “local” transformer, which has m layers, to learn embeddings specific to each language.

The output from both these “local” transformers is then fed into a cross-lingual “fusion” transformer with k layers. Here FILTER tries to glean information and learn relationships across (E) and (F).

Finally, there are two “domain” transformers (which are once again copies of each other) that have 24 − k − m layers and are both task- and language-specific. The label provided to each “domain” transformer is the corresponding language's label.

The numbers m and k are hyperparameters that can be tuned based on the task as well.

The FILTER architecture. The two “local” transformers (m layers) and the two “domain” transformers (24 − k − m layers) share parameters between them, i.e. they are copies of one another.
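
To make the three stages easier to picture, here is a schematic PyTorch sketch of how a 24-layer encoder could be split into local, fusion and domain blocks. The layer counts, the weight sharing across the two streams, and the way the fusion stage concatenates the two sequences follow my reading of the description above, not the authors' released code.

```python
# Schematic sketch of the three FILTER stages described above (PyTorch).
# The split of 24 layers into m local, k fusion and 24 - k - m domain layers,
# and the concatenation in the fusion stage, are assumptions from the prose.
import torch
import torch.nn as nn

class FilterSketch(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, total_layers=24, m=1, k=1):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.local = nn.ModuleList([make_layer() for _ in range(m)])    # shared by E and F
        self.fusion = nn.ModuleList([make_layer() for _ in range(k)])   # cross-lingual fusion
        self.domain = nn.ModuleList([make_layer() for _ in range(total_layers - k - m)])

    def forward(self, emb_e, emb_f):
        # Stage 1: the same "local" layers are applied to each language separately.
        for layer in self.local:
            emb_e, emb_f = layer(emb_e), layer(emb_f)
        # Stage 2: concatenate the two sequences so the languages can interact.
        fused = torch.cat([emb_e, emb_f], dim=1)
        for layer in self.fusion:
            fused = layer(fused)
        # Stage 3: split back and run the shared "domain" layers on each stream.
        out_e, out_f = fused.split([emb_e.size(1), emb_f.size(1)], dim=1)
        for layer in self.domain:
            out_e, out_f = layer(out_e), layer(out_f)
        return out_e, out_f
```

In practice the layers would presumably be initialized from a pretrained cross-lingual encoder such as XLM-RoBERTa rather than trained from scratch, and m and k are the tunable hyperparameters mentioned above.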

For simple tasks like classification, the labels remain the same in both languages and it's easy to train the final language-specific layers. But what about tasks like question answering, entity recognition or part-of-speech tagging? There, the labels may not be directly applicable to the translated target text because of the way different languages structure their sentences. How do you train the target-language part of the model?

FILTER has a solution for this too, using knowledge distillation. First, train a teacher model using FILTER with just the source-language (generally English) labels. Once you have this teacher model, train a student model whose target-language labels are the outputs of the teacher's task-specific target transformer. This way, the student also learns all the hidden knowledge that the teacher has.
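
In loss terms, that self-teaching step might look roughly like the snippet below, where the teacher's target-language predictions act as soft labels for the student. The KL-divergence formulation and the temperature are illustrative choices on my part, not necessarily the exact recipe from the paper.

```python
# Condensed sketch of a distillation loss for tasks whose labels do not
# transfer directly across languages (e.g. span selection or tagging).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the teacher's soft predictions on the
    target-language input and the student's predictions on the same input."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (t * t)

# Outline of the two-step training:
#   1. Train a teacher FILTER model with source-language (English) labels only.
#   2. Run the teacher on the target-language inputs to get soft predictions,
#      then train the student on the English labels plus
#      distillation_loss(student_target_logits, teacher_target_logits).
```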

Since the final transformer is task-specific, FILTER can easily be applied to a variety of tasks by just changing the final transformer. To prove the generalizability of FILTER, the researchers applied it to the multi-task benchmarks XTREME and XGLUE. FILTER performed well on a multitude of tasks, like multi-lingual question answering, sentence retrieval, sentence-pair classification, named entity recognition and more, to achieve the #1 position on both of these complex leaderboards!

Here's a link to the paper if you want to know more about the FILTER model, and click here to see more of our publications and other work.

References

  1. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; and Johnson, M. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In International Conference on Machine Learning.
  2. Liang, Y.; Duan, N.; Gong, Y.; et al. 2020. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation. arXiv preprint arXiv:2004.01401.
  3. Fang, Y.; Wang, S.; Gan, Z.; Sun, S.; and Liu, J. 2020. FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding. arXiv preprint arXiv:2009.05166.
  4. Translation tool to convert text from English to a foreign language and vice versa.


I'm an engineer at Microsoft Dynamics 365 AI Research and I'll post our new NLP, CV and multimodal research. Check out https://medium.com/@rohit.rameshp