From text unlabelled data to multilingual representations
Note: This article was originally published in 2016.
This post reviews scientific papers that Heuritech’s R&D department has focused on. Each may be a brand new model, an improvement over an earlier one, an empirical or theoretical analysis of a phenomenon, or anything else we find interesting. Our main topics of interest here are representation learning, language modeling, multimodal learning and models of alignment.
Learning Distributed Representations of Sentences from Unlabelled Data (Felix Hill, Kyunghyun Cho, Anna Korhonen)
Link to paper on arxiv.
This article compares different neural network architectures that learn sentence representations in an unsupervised fashion, much as word2vec learns word embeddings. It contrasts methods that learn directly from raw unstructured text (such as Skip-thought vectors and Paragraph Vectors) with ones that learn from structured resources (aligned multilingual sentences, words linked with their definitions, texts linked with their images). The authors also present two new models:
- Sequential Denoising Autoencoder: blending the LSTM encoder-decoder paradigm with the denoising autoencoder approach, the model predicts the original sentence given a corrupted version of it
- FastSent: can be seen as a Skip-thought model whose encoder is a sum over word embeddings, and which tries to predict the words of the previous and following sentences independently of one another.
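To make the two ideas concrete, here is a minimal NumPy sketch of a sum-of-embeddings encoder and of a sentence corruption step. It is an illustration rather than the authors' implementation: the toy vocabulary, the random embeddings, the function names `encode_bow` and `corrupt`, and the `p_drop` / `p_swap` parameters are all assumptions made for this example, and the exact noise function used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with randomly initialised embeddings (hypothetical values).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embeddings = rng.normal(size=(len(vocab), 8))


def encode_bow(sentence):
    """Sum-of-embeddings encoder: the sentence vector ignores word order."""
    ids = [word_to_id[w] for w in sentence]
    return embeddings[ids].sum(axis=0)


def corrupt(sentence, p_drop=0.1, p_swap=0.1, rng=rng):
    """Assumed noise function: drop words at random, then swap adjacent pairs."""
    kept = [w for w in sentence if rng.random() >= p_drop]
    i = 0
    while i < len(kept) - 1:
        if rng.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
            i += 2  # skip past the swapped pair so swaps never overlap
        else:
            i += 1
    return kept


sentence = ["the", "cat", "sat", "on", "mat"]
sentence_vec = encode_bow(sentence)   # would be the input to a FastSent-like model
noisy_sentence = corrupt(sentence)    # would be the input to a denoising encoder
```

Because the encoder is a plain sum, any permutation of the words yields the same vector, which is exactly the order insensitivity the paper's conclusion about word order refers to.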
All of these representations are benchmarked thoroughly, with a distinction between supervised and unsupervised evaluation. The results show that some models must be followed by a learned linear model to be expressive, while others yield an embedding space where cosine similarity is meaningful enough to measure the semantic relatedness between sentences directly.
One of the conclusions of this article is that “the role of word order is unclear”. As the authors say, models that are sensitive to word order (RNN-based ones) tend to perform worse on average than those that are not, in both supervised and unsupervised benchmarks. It would be interesting to extend this analysis to tasks like automatic translation and summarization, to see whether bag-of-words encoders can be powerful enough to learn how to sequentially decode a sentence.
Correlational Neural Networks
At Heuritech, we want distributed vector representations of words for more than 10 different languages, including English, French, Spanish and Chinese. The goal is a unified multilingual embedding that satisfies two properties:
- similar words in one language (such as “cat” and “dog”) should have a cosine similarity close to 1,
- words that have the same meaning across different languages (such as “cat” in English, “chat” in French and “gato” in Spanish) should have close representations.
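Both properties can be checked with cosine similarity. The snippet below is only an illustration: the three-dimensional vectors are hand-picked, hypothetical values, not embeddings from our models.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical vectors in a shared multilingual space, for illustration only.
cat_en = np.array([0.90, 0.10, 0.30])
dog_en = np.array([0.80, 0.20, 0.35])
chat_fr = np.array([0.88, 0.12, 0.28])
car_en = np.array([-0.20, 0.90, -0.40])

sim_cat_dog = cosine_similarity(cat_en, dog_en)    # related words, same language
sim_cat_chat = cosine_similarity(cat_en, chat_fr)  # translations across languages
sim_cat_car = cosine_similarity(cat_en, car_en)    # unrelated words
```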
We have tuned our own word2vec models (Mikolov et al., 2013), which induce one embedding space per language we deal with. Now, we need to project all our monolingual embedding spaces into one common multilingual space. Given a word and its translation in another language, let us consider the embeddings of these two words in their respective spaces as two views of the same thing. We use Common Representation Learning (CRL) to project these two views into the same space. Two popular paradigms are Canonical Correlation Analysis (CCA) based approaches and AutoEncoder (AE) based approaches.
- CCA (Thompson, 2005) maximizes the correlation of the projected word embeddings on the common subspace: this method has been extended to the nonlinear case with deep CCA (Andrew et al., 2013)
- AE (Lauly et al., 2014) learns a common representation to perform self-reconstruction and cross-reconstruction. The main drawback of AE is that the views are not guaranteed to be projected to the same “part” of the common subspace
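As a point of reference for the CCA side, here is a minimal sketch of plain linear CCA in NumPy, computed by whitening each view and taking an SVD of the cross-covariance. The function names `inv_sqrt` and `fit_cca`, the ridge term `reg` and the synthetic two-view data are assumptions made for this example; this is not the code we run in production.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def fit_cca(X, Y, k, reg=1e-6):
    """Linear CCA: returns projections A, B, the top-k correlations and the means."""
    n = X.shape[0]
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    # Covariances, with a small ridge term (an assumption) for numerical stability.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k], mean_x, mean_y


# Synthetic "two views of the same thing": both views mix a shared 2-d latent signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))
Y = Z @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(500, 4))

A, B, corrs, mean_x, mean_y = fit_cca(X, Y, k=2)
proj_x = (X - mean_x) @ A  # view 1 in the common subspace
proj_y = (Y - mean_y) @ B  # view 2 in the common subspace
```

In the word embedding setting, the rows of `X` and `Y` would be the embeddings of a word and of its translation, so that translation pairs end up highly correlated in the common subspace.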
The Correlational Network (Chandar et al., 2015) combines the CCA and AE approaches: it induces correlation between the projected views while also enabling self-reconstruction and cross-reconstruction. It is a neural network architecture whose objective function is composed of two terms. The first is an autoencoder-like term: it aims at reconstructing each of the views from itself and from the other. The second is a CCA-like term: it forces the hidden representations of the two views to be highly correlated. A deep Correlational Network has also been developed to improve accuracy.
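The two-term objective can be sketched as a forward pass in NumPy. This is a loose illustration of the description above, not the reference implementation: the single sigmoid hidden layer, the linear decoder, the squared-error reconstruction loss, the parameter layout in `params` and the weight `lam` are all assumptions, and the exact formulation in the paper may differ.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def corrnet_loss(X, Y, params, lam=1.0):
    """Returns (total, reconstruction term, correlation term) for two views X, Y."""
    Wx, Wy, b, Wx_dec, Wy_dec, bx, by = params

    def encode(x, y):
        # Shared hidden layer; a missing view is fed in as zeros.
        return sigmoid(x @ Wx + y @ Wy + b)

    def reconstruction_error(h):
        # Decode both views from the hidden code and compare with the originals.
        x_hat = h @ Wx_dec + bx
        y_hat = h @ Wy_dec + by
        return np.mean(((x_hat - X) ** 2).sum(axis=1) + ((y_hat - Y) ** 2).sum(axis=1))

    h_x = encode(X, np.zeros_like(Y))   # hidden code from view 1 only
    h_y = encode(np.zeros_like(X), Y)   # hidden code from view 2 only
    h_xy = encode(X, Y)                 # hidden code from both views

    # Autoencoder-like term: self- and cross-reconstruction.
    recon = (reconstruction_error(h_xy)
             + reconstruction_error(h_x)
             + reconstruction_error(h_y))

    # CCA-like term: per-dimension correlation between the two single-view codes.
    hx_c = h_x - h_x.mean(axis=0)
    hy_c = h_y - h_y.mean(axis=0)
    corr = np.sum((hx_c * hy_c).sum(axis=0)
                  / (np.sqrt((hx_c ** 2).sum(axis=0) * (hy_c ** 2).sum(axis=0)) + 1e-8))

    return recon - lam * corr, recon, corr


# Random data and parameters, only to show the shapes involved.
rng = np.random.default_rng(0)
n, dx, dy, k = 64, 10, 10, 4
X = rng.normal(size=(n, dx))
Y = rng.normal(size=(n, dy))
params = (rng.normal(scale=0.1, size=(dx, k)), rng.normal(scale=0.1, size=(dy, k)),
          np.zeros(k),
          rng.normal(scale=0.1, size=(k, dx)), rng.normal(scale=0.1, size=(k, dy)),
          np.zeros(dx), np.zeros(dy))
total, recon, corr = corrnet_loss(X, Y, params, lam=1.0)
```

Minimizing `total` with any gradient-based optimizer pushes the reconstruction errors down and the correlation up at the same time, with `lam` trading one off against the other.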
To deal with more than two languages, we use Bridge Correlational Neural Networks (Rajendran et al., 2015), which only require one pivot language acting as a bridge between all the others. This means that we can project all of our monolingual spaces into one multilingual embedding, given a dataset containing aligned views between a pivot language (e.g. English) and every other language.
These alignments across word embeddings enable us to:
- Match corresponding items across views, i.e. establish transliteration equivalence: given an English word, we can find the most similar words in any other language
- Improve single view performance: our models are more accurate on our different tests in the aligned space than they are in their respective monolingual spaces
- Transfer learning: we could learn a classifier on an English dataset (for which we find labeled data more easily), and then apply this classifier to an Esperanto dataset
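The first use case, matching items across views, boils down to a nearest-neighbour search in the aligned space. Here is a minimal sketch, assuming the embeddings have already been projected into the common space; the function name `nearest_neighbours` and the tiny hand-made French vocabulary are hypothetical.

```python
import numpy as np


def nearest_neighbours(query_vec, target_matrix, target_words, top_k=3):
    """Rank the words of a target language by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    T = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    scores = T @ q
    best = np.argsort(-scores)[:top_k]
    return [(target_words[i], float(scores[i])) for i in best]


# Hypothetical aligned embeddings, hand-picked for illustration only.
french_words = ["chat", "chien", "voiture"]
french_vecs = np.array([[0.9, 0.1, 0.2],
                        [0.7, 0.3, 0.4],
                        [-0.3, 0.9, -0.1]])
cat_en = np.array([0.88, 0.12, 0.22])

result = nearest_neighbours(cat_en, french_vecs, french_words, top_k=2)
```

The same shared space is what makes the transfer learning scenario possible: a classifier trained on English vectors receives inputs from the same space when applied to another language.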
Alexandre Ramé, Hedi Ben Younes & Charles Ollion
G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, pages 1247–1255, 2013.
S. Chandar, M. M. Khapra, H. Larochelle, and B. Ravindran. Correlational neural networks. Neural Computation, 2015.
S. Lauly, H. Larochelle, M. Khapra, B. Ravindran, V. C. Raykar, and A. Saha. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861, 2014.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
J. Rajendran, M. M. Khapra, S. Chandar, and B. Ravindran. Bridge correlational neural networks for multilingual multimodal representation learning. arXiv preprint arXiv:1510.03519, 2015.
B. Thompson. Canonical correlation analysis. Encyclopedia of Statistics in Behavioral Science, 2005.
Originally published at https://lab.heuritech.com.