A Brief Introduction to Cross-Lingual Information Retrieval

Rui Zhang
LILY Lab
Mar 7, 2019

CLIR and its Motivation

Cross-Lingual Information Retrieval (CLIR) is the task of retrieving relevant information when the document collection is written in a different language from the user query. Figure 1 below shows the typical architecture of a CLIR system. There are many situations where CLIR becomes essential because the information a user needs is simply not available in their native language.

Figure 1. Typical architecture of a CLIR system [1].

Translation Approaches

CLIR requires the ability to represent and match information in the same representation space even when the query and the document collection are in different languages. The fundamental problem in CLIR is to match terms in different languages that describe the same or a similar meaning. The mapping between different language representations is usually performed by machine translation. In CLIR, this translation can be done in several ways.

  • Document translation [2] maps the document representation into the query representation space, as illustrated in Figure 2.
  • Query translation [3] maps the query representation into the document representation space, as illustrated in Figure 3.
  • Pivot language or interlingua [4,5] maps both the document and query representations into a third space.
Figure 2. Document Translation for CLIR [8].
Figure 3. Query Translation for CLIR [8].

Query translation is generally considered the most appropriate approach. The query is short and thus faster to translate than the document, and the approach is more flexible and allows more interaction with users if the user understands the translation. However, query translation can suffer from translation ambiguity, and this problem is even more pronounced for short query text due to the limited context. By contrast, document translation can provide more accurate translations thanks to the richer context. Document translation also has the advantage that once a translated document is retrieved, the user can understand it directly, whereas query translation still requires a post-retrieval translation step. That said, several experiments found no clear evidence favoring one approach over the other when the same machine translation system is used [6]; effectiveness depends more on the translation direction [7].
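
To make this concrete, here is a minimal sketch of the query translation approach in Python. The translate() function is a hypothetical stand-in for any machine translation system, and the monolingual retrieval step uses the rank_bm25 package; neither is tied to a specific system from the papers cited here.

# Query translation for CLIR: translate the query into the document
# language, then perform ordinary monolingual retrieval.
from rank_bm25 import BM25Okapi

def translate(query, src, tgt):
    # Hypothetical stand-in: plug in any machine translation system here.
    raise NotImplementedError

def retrieve_query_translation(query_fr, docs_en, top_k=10):
    query_en = translate(query_fr, src="fr", tgt="en")
    tokenized_docs = [doc.lower().split() for doc in docs_en]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query_en.lower().split())
    # Return indices of the top_k documents, best match first.
    ranked = sorted(range(len(docs_en)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

Document translation is the mirror image: translate every document into the query language offline, then index and search as usual.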

Conferences and Data Sets

There are several data sets available for CLIR. The first comes from TREC (Text REtrieval Conference), organized by the National Institute of Standards and Technology (NIST). It started with English-to-Spanish retrieval, and more languages were added over time, including French, German, Italian, Dutch, Chinese, and Arabic. The second data set is from CLEF (Cross-Language Evaluation Forum). It focuses on European languages: the first experiments included English, German, French, and Italian documents with queries in Dutch, English, French, German, Italian, Spanish, Swedish, and Finnish. The third is the NTCIR series of workshops organized by the National Institute of Informatics (NII) of Japan, which emphasizes Asian languages such as Japanese, Chinese, Korean, Vietnamese, and Mongolian.

Recent Progress

Very recently, cross-lingual word embeddings and neural-network-based information retrieval systems have become increasingly popular. Cross-lingual word embeddings represent words from different languages in the same vector space by learning a mapping from monolingual embeddings, sometimes with no bilingual supervision at all. Neural information retrieval can build better representations for documents and queries and learn to rank directly from relevance labels. Here we briefly discuss three recent papers in this direction.

DUET

This is the paper Learning to Match using Local and Distributed Representations of Text for Web Search, WWW 2017 by Bhaskar Mitra, Fernando Diaz, and Nick Craswell.

In traditional information retrieval approaches, we build a local representation from the discrete terms in the text, and the relevance of a document is based on exact matches of query terms in the body text. On the other hand, models such as latent semantic analysis and latent Dirichlet allocation learn low-dimensional vector representations of terms, so the query and the document are matched in the latent semantic space. In this work, the authors propose a document ranking model consisting of two separate deep neural network sub-models: the first matches the query and the document using a local representation of the text, while the second learns distributed representations for queries and documents before matching them. The overall architecture is shown in Figure 4.

Figure 4. The DUET architecture [9].
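
To illustrate the idea, here is a much-simplified PyTorch sketch of a duet-style model: one sub-network scores an exact-match interaction matrix (the local representation), another scores learned embeddings (the distributed representation), and the final relevance score is their sum. The layer sizes and pooling choices are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn

class DuetSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, q_len=10, d_len=200):
        super().__init__()
        # Local sub-model: reads a binary query-by-document exact-match matrix.
        self.local = nn.Sequential(
            nn.Flatten(),
            nn.Linear(q_len * d_len, 128), nn.ReLU(),
            nn.Linear(128, 1))
        # Distributed sub-model: matches learned text embeddings.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.distributed = nn.Sequential(
            nn.Linear(2 * emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, q_ids, d_ids):
        # (batch, q_len, d_len) matrix: 1 where query and document terms
        # are identical (padding handling omitted for brevity).
        match = (q_ids.unsqueeze(2) == d_ids.unsqueeze(1)).float()
        local_score = self.local(match)
        # Mean-pooled embeddings as crude query/document vectors.
        q_vec = self.embed(q_ids).mean(dim=1)
        d_vec = self.embed(d_ids).mean(dim=1)
        dist_score = self.distributed(torch.cat([q_vec, d_vec], dim=1))
        return local_score + dist_score

Training both sub-models jointly lets the network rely on exact matches for rare terms such as names, while the distributed part handles synonymy and paraphrase.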

MUSE

The second paper is Word translation without parallel data, ICLR 2018, by Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou.

The paper studies cross-lingual word embeddings, where the word embeddings of two languages are aligned in the same representation space (Figure 5). State-of-the-art methods for cross-lingual word embeddings rely on bilingual supervision such as dictionaries or parallel corpora. Recent studies try to reduce the need for bilingual supervision by using character-level information and iterative training, but they do not achieve performance on par with supervised methods. This work proposes to learn a mapping that aligns monolingual word embedding spaces in a fully unsupervised way, without any parallel data. The experiments also demonstrate that their method even outperforms existing supervised methods on some language pairs.

Check out their implementation and multilingual word embeddings at MUSE.

Figure 5. Mapping between Word Embedding Spaces [10].
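
The adversarial training itself is too involved for a short example, but the Procrustes refinement step that MUSE applies once a (possibly induced) seed dictionary is available has a well-known closed form: given paired source/target embedding matrices X and Y, the orthogonal map W minimizing ||XW - Y|| is recovered from an SVD. A minimal NumPy sketch:

import numpy as np

def procrustes(X, Y):
    # X, Y: (n_pairs, dim) embeddings of source/target translation pairs.
    # Orthogonal Procrustes: W = U V^T, where U S V^T is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Map the whole source vocabulary into the target space:
# aligned = source_embeddings @ procrustes(X, Y)

Constraining W to be orthogonal preserves distances and angles within the source space, which is one reason the refined mapping generalizes beyond the seed pairs.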

Unsupervised CLIR

The third paper is Unsupervised Cross-Lingual Information Retrieval using Monolingual Data Only, SIGIR 2018, by Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulić.

They propose a fully unsupervised CLIR framework. To this end, they leverage shared cross-lingual word embedding spaces induced solely from monolingual corpora in two languages through an iterative process based on adversarial neural networks. Retrieval is then performed by computing semantic similarity directly in the cross-lingual embedding space, requiring no bilingual supervision and no document relevance labels.
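
As a rough sketch of what retrieval in a shared space looks like: represent the query and each document as an average of their words' cross-lingual embeddings and rank by cosine similarity. The paper also studies IDF-weighted aggregation and term-by-term translation variants, which this sketch omits.

import numpy as np

def text_vector(tokens, emb):
    # emb: dict mapping a word to its cross-lingual embedding vector.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else None

def rank_documents(query_tokens, docs_tokens, emb):
    q = text_vector(query_tokens, emb)
    scored = []
    for i, doc in enumerate(docs_tokens):
        d = text_vector(doc, emb)
        if q is None or d is None:
            scored.append((i, float("-inf")))
            continue
        cos = float(q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))
        scored.append((i, cos))
    # Highest cosine similarity first.
    return sorted(scored, key=lambda s: s[1], reverse=True)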

Awesome-CLIR

Finally, we created a curated list of resources for CLIR. Please check out Awesome-CLIR!

References

[1] Jian-Yun Nie. “Cross-language information retrieval”. In: Synthesis Lectures on Human Language Technologies 3.1 (2010), pp. 1–125.

[2] Douglas W. Oard and Paul Hackett. “Document translation for cross-language text retrieval at the University of Maryland”. In: TREC. 1997, pp. 687–696.

[3] Gao, Jianfeng, et al. “Improving query translation for cross-language information retrieval using statistical models.” Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2001.

[4] Ruiz, Miguel, et al. “CINDOR conceptual interlingua document retrieval: TREC-8 evaluation.” TREC. 1999.

[5] Kishida, Kazuaki, and Noriko Kando. “Hybrid Approach of Query and Document Translation with Pivot Language for Cross-Language Information Retrieval.”

[6] Franz, Martin, J. Scott McCarley, and Salim Roukos. “Ad hoc and multilingual information retrieval at IBM.” NIST Special Publication (1999): 157–168.

[7] McCarley, J. Scott. “Should we translate the documents or the queries in cross-language information retrieval?” Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1999.

[8] http://www.ccs.neu.edu/home/jaa/IS4200.12S/Handouts/cross_language.pdf

[9] https://www.microsoft.com/en-us/research/uploads/prod/2018/04/NeuralIR-Nov2017.pdf

[10] https://code.fb.com/ai-research/unsupervised-machine-translation-a-novel-approach-to-provide-fast-accurate-translations-for-more-languages/
