# Cross-Lingual Word Embeddings — What Are They?

## Reference Paper — A Strong Baseline for Learning Cross-Lingual Word Embeddings from Sentence Alignments

### Word embedding

Word embeddings are representations of words that can be easily understood by computers (i.e., words mapped to real numbers). Suppose you have a vocabulary of five words — “cat, rat, dog, sparrow, eagle”. You can easily assign them numbers like:

```
cat -> 1
eagle -> 2
rat -> 3
sparrow -> 4
dog -> 5
```

But can we have a better representation, like this:

```
cat -> 1
dog -> 2
rat -> 3
sparrow -> 4
eagle -> 5
```

See that the birds are grouped together now, and so are the mammals (isn’t this similar to how you associate {cat, dog, rat} and {sparrow, eagle} in your mind?). This is better than the previous assignment, but `dog - cat = 1` while `rat - cat = 2`: we are not really able to control the distances within the same class (mammals). If we could somehow choose the distances between the word embeddings, that would be awesome, right? (Why?)

So now let’s increase the dimension and give them the following embeddings:

```
cat     -> ( 1,  0)
dog     -> ( 0,  1)
rat     -> ( 1,  1)
sparrow -> (-1,  0)
eagle   -> ( 0, -1)
```
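These distances can be checked in a few lines (a minimal sketch in Python, using the toy vectors above):

```python
import math

# The toy 2-D embeddings from above
emb = {
    "cat": (1, 0), "dog": (0, 1), "rat": (1, 1),
    "sparrow": (-1, 0), "eagle": (0, -1),
}

def dist(a, b):
    """Euclidean distance between two embedded words."""
    return math.dist(emb[a], emb[b])

print(round(dist("cat", "dog"), 2))      # 1.41, i.e. sqrt(2)
print(round(dist("cat", "sparrow"), 2))  # 2.0
```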

Yeah! Now members of the same group are close (for example, `cat` and `dog` are `sqrt(2)` apart), while members of different groups are a little farther away. But how do we assign such vectors? Is there a way?
Yes! There are algorithms like word2vec and GloVe. They assume that if two words co-occur, then they are somewhat similar. So, based on co-occurrence frequencies (this is just the tip of the iceberg) and some more tricks, they create word vectors in a high-dimensional space that capture a lot of semantic similarity. Ah! There is a famous example:
`King - man + woman ≈ Queen`
Other resources: word vectors, word2vec, GloVe

### Basics

Word-Feature Matrices:
Our aim is to represent words with real-valued vectors such that we can compute some “semantic” similarity using vector-similarity metrics like cosine similarity. To proceed, we start with sparse word-feature matrices, used either as-is or after reducing their dimension.
Example:

```
He loves Maths.
She loves English.
He is a hunter.
She is a hunter.
```

Now give them the following vector representations:

```
vocab ->   (he, loves, maths, she, English, is, a, hunter)
He ->      ( 2,     1,     1,   0,       0,  1, 1,      1)
loves ->   ( 1,     2,     1,   1,       1,  0, 0,      0)
Maths ->   ( 1,     1,     1,   0,       0,  0, 0,      0)
She ->     ( 0,     1,     0,   2,       1,  1, 1,      1)
English -> ( 0,     1,     0,   1,       1,  0, 0,      0)
is ->      ( 1,     0,     0,   1,       0,  2, 2,      2)
a ->       ( 1,     0,     0,   1,       0,  2, 2,      2)
hunter ->  ( 1,     0,     0,   1,       0,  2, 2,      2)
```

We have used the co-occurrence counts between these words as features. Now if we look for the words most similar to `he`, we get the following cosine similarities:

```
He - He      -> 1.0
He - loves   -> 0.59
He - Maths   -> 0.77
He - she     -> 0.45
He - English -> 0.20
He - is      -> 0.71
He - a       -> 0.71
He - hunter  -> 0.71
```
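The matrix and these scores can be reproduced with a short script (a sketch; it lower-cases everything and, for each pair of words, counts the sentences in which both appear):

```python
import math

sentences = [
    "he loves maths",
    "she loves english",
    "he is a hunter",
    "she is a hunter",
]
vocab = ["he", "loves", "maths", "she", "english", "is", "a", "hunter"]

# Co-occurrence feature: for each pair of words, count the sentences
# in which both appear (the diagonal is each word's sentence count).
cooc = {w: {v: 0 for v in vocab} for w in vocab}
for sent in sentences:
    words = set(sent.split())
    for w1 in words:
        for w2 in words:
            cooc[w1][w2] += 1

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

he = [cooc["he"][v] for v in vocab]
for w in vocab:
    row = [cooc[w][v] for v in vocab]
    print(f"he - {w} -> {cosine(he, row):.2f}")
```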

We see that the most similar words to `he` include `Maths` and `hunter`.
But can we do better? Is there a matrix that better reflects the association between each word and each feature?

There are three common association metrics:
1. L1 Row Normalization
2. Inverse Document Frequency
3. Point-wise Mutual Information
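As an illustration of the third metric, here is a sketch of positive PMI (PPMI) over toy word-context counts (the words and numbers are made up for this example):

```python
import math

# Toy word-context co-occurrence counts (made-up numbers)
counts = {
    ("cat", "purr"): 8, ("cat", "the"): 20,
    ("dog", "purr"): 0, ("dog", "the"): 22,
}
total = sum(counts.values())

word_tot = {}
ctx_tot = {}
for (w, c), n in counts.items():
    word_tot[w] = word_tot.get(w, 0) + n
    ctx_tot[c] = ctx_tot.get(c, 0) + n

def ppmi(w, c):
    """Positive PMI: max(0, log(P(w, c) / (P(w) * P(c))))."""
    n = counts.get((w, c), 0)
    if n == 0:
        return 0.0
    return max(0.0, math.log(n * total / (word_tot[w] * ctx_tot[c])))

print(round(ppmi("cat", "purr"), 2))  # informative context -> positive score
print(ppmi("cat", "the"))             # co-occurs with everything -> 0.0
```

Note how the very frequent context `the` gets no credit, while the discriminative context `purr` does; that is exactly the kind of re-weighting a raw count matrix lacks.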

Furthermore, we can reduce the dimension of these matrices with algorithms like SVD or negative sampling.
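For instance, a truncated SVD of the word-feature matrix from the earlier example gives each word a dense low-dimensional vector (a sketch; `k = 2` is an arbitrary choice here):

```python
import numpy as np

# The 8x8 co-occurrence matrix from the example above
# (rows/columns: he, loves, maths, she, english, is, a, hunter)
M = np.array([
    [2, 1, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 2, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 2, 2, 2],
    [1, 0, 0, 1, 0, 2, 2, 2],
    [1, 0, 0, 1, 0, 2, 2, 2],
], dtype=float)

# Truncated SVD: keep only the k largest singular values
U, S, Vt = np.linalg.svd(M)
k = 2
dense = U[:, :k] * S[:k]  # one k-dimensional vector per word

print(dense.shape)  # (8, 2)
```

Words with identical rows in `M` (like `is`, `a`, `hunter`) end up with identical dense vectors, as expected.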

But what can we do with these?
We can answer analogy queries like `'A' is to 'B' as 'C' is to ?`, or run searches based on content rather than exact string match. It would be very cool if someone made an Android app that builds word vectors for the text in your inbox: you could then search for `college` instead of `institute` to fetch a message containing `Indian Institute of Technology Kharagpur would be closed from April 29th to July 14th`. (Note that the word `college` is not used even once in the message.)

### Cross Lingual Word Embeddings

It basically means projecting the words of different languages into the same space. That is, you can query for “Hochschule” (a German word for “college”) and still retrieve the message mentioned above.

Previous Approaches:
We try to learn the embeddings by training on a corpus. These approaches can be classified into three main categories:

• Word-level alignment of the two languages → But this is very hard. Many words do not have a single-word translation in the other language, or some meaning is lost in translation. In Hindi, “Tumhari” means “Yours” and “Aapki” also means “Yours”, but with “Aapki” the speaker is showing respect to the listener. Also, sizable bilingual dictionaries are not available for many languages.
• Document-level alignment of the two languages → For example, Wikipedia articles on the same topic in two different languages. This set of algorithms needs a massive amount of data to make up for the lack of lower-level alignments.
• Sentence-level alignment of the two languages → These algorithms follow the middle path and take the best of both extremes. Here we get to use some alignment, while the bottleneck of word-to-word translation is removed.

How do we build on sentence-aligned data?

• Source+Target: Represent each word by all the words that appeared with it in the same sentence, in both the source language and the target language, and maintain the corresponding co-occurrence matrix. A variant restricts the context in the source sentence to a fixed window around the word.
• Sentence ID: Represent each word by the IDs of the sentences in which it appeared, indifferent to the number of times it appeared in each one.
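A sketch of the sentence-ID representation over a tiny made-up parallel corpus (real corpora such as Europarl provide the aligned sentence pairs):

```python
# A tiny made-up English-German parallel corpus (sentence-aligned)
parallel = [
    ("the cat sleeps", "die katze schlaeft"),
    ("the dog barks",  "der hund bellt"),
    ("the cat eats",   "die katze frisst"),
]

# Sentence-ID features: each word maps to the set of sentence IDs
# it occurs in, regardless of how often it occurs there.
sent_ids = {}
for sid, (src, tgt) in enumerate(parallel):
    for word in set(src.split()) | set(tgt.split()):
        sent_ids.setdefault(word, set()).add(sid)

print(sent_ids["cat"])    # {0, 2}
print(sent_ids["katze"])  # {0, 2}: same features, so likely translations
```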

In the paper they have shown that:

algorithms based on the sentence-ID feature space perform consistently better than those using source+target words

Reason: They suggest that the source+target features might be covering more information than is actually needed for translation, such as topical similarity. Sentence-ID features, on the other hand, are simpler, and might therefore contain a cleaner translation-oriented signal.

But then they have shown that among all the algorithms built on sentence IDs, be it neural networks (Bilingual Autoencoders), matrix factorization (Inverted Index), or Expectation Maximization (IBM Model 1, a traditional model), the overall difference in performance is rather marginal.

This suggests that the main performance factor is not the algorithm, but the feature space: sentence IDs

### Proposed Model

In the paper they use the Dice coefficient and draw a parallel between it and the dot product of two L1-normalized sentence-ID word vectors. They then use the sentence-ID matrix as the word-feature matrix and reduce its dimension using negative sampling, the method Mikolov used in word2vec (SGNS → Skip-Gram with Negative Sampling).
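For reference, the Dice coefficient over sentence-ID sets is just `2 * |A ∩ B| / (|A| + |B|)`; a sketch with made-up sentence-ID sets:

```python
def dice(a_ids, b_ids):
    """Dice coefficient of two sentence-ID sets:
    2 * |A & B| / (|A| + |B|)."""
    return 2 * len(a_ids & b_ids) / (len(a_ids) + len(b_ids))

# Made-up sentence-ID sets for three words
cat_en = {0, 2, 5}    # English "cat" appears in sentences 0, 2, 5
katze_de = {0, 2, 7}  # German "Katze" appears in sentences 0, 2, 7
hund_de = {1, 3}      # German "Hund" appears in sentences 1, 3

print(round(dice(cat_en, katze_de), 2))  # 0.67: likely translations
print(dice(cat_en, hund_de))             # 0.0: unrelated
```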

Since they use sentence IDs as features, the method can easily be extended to multiple languages. That is, several languages can be represented together in the same matrix. Their experiments show that dimensionality reduction over a multilingual matrix produces better results than over a bilingual matrix.
This is because signals from multiple languages help the bilingual translation.

TL;DR
In a nutshell, they have shown that modern methods are not able to outperform traditional models based on sentence IDs, but that introducing information from multiple languages helped them achieve a better result (+4.69%).

Another awesome post on Cross-Lingual Word Embedding models by Sebastian Ruder.