<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Satya Vasanth Tumati on Medium]]></title>
        <description><![CDATA[Stories by Satya Vasanth Tumati on Medium]]></description>
        <link>https://medium.com/@satyavasanth_57235?source=rss-29013e7fca26------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*Ieu0-HE2zJybecIa.</url>
            <title>Stories by Satya Vasanth Tumati on Medium</title>
            <link>https://medium.com/@satyavasanth_57235?source=rss-29013e7fca26------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 14:51:07 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@satyavasanth_57235/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[GloVe: Global Vectors for Word Representation]]></title>
            <link>https://medium.com/@satyavasanth_57235/glove-global-vectors-for-word-representation-6b526c59c919?source=rss-29013e7fca26------2</link>
            <guid isPermaLink="false">https://medium.com/p/6b526c59c919</guid>
            <category><![CDATA[stanford]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Satya Vasanth Tumati]]></dc:creator>
            <pubDate>Fri, 20 Apr 2018 15:54:15 GMT</pubDate>
            <atom:updated>2018-04-20T15:55:04.148Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong>The Problem:</strong></blockquote><p>For any unsupervised model, the statistics of the data are the primary source of information. Skip-gram, CBOW and related models capture semantic information but do not take advantage of the co-occurrence statistics. Matrix decomposition methods do use these statistics, but they fail to capture the semantic information: the dimensions of meaning cannot be observed in these methods.</p><blockquote><strong>The Solution:</strong></blockquote><p>Two words can be better distinguished with respect to a context word by the ratio of their co-occurrence probabilities. This function, which distinguishes two word vectors based on a context word vector, needs to be formalized so that it is symmetric between words and context words. Such a model weighs all co-occurrences equally; to adjust for this, the cost function is a weighted least-squares regression (a short code sketch of this objective appears below). The weighting function is chosen to be non-decreasing, while very frequent co-occurrences are not overweighted.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MAbJWYEP-ZO8JtMpUbvG9Q.png" /><figcaption>src : <a href="https://nlp.stanford.edu/projects/glove/">https://nlp.stanford.edu/projects/glove/</a></figcaption></figure><p>The paper formulates the cost function of the Skip-gram model, i.e. the sum of log probabilities of all words in all contexts, as a weighted sum of the cross-entropy between two distributions: the co-occurrence distribution P and the softmax distribution Q. This model with the new objective function is interpreted as a global Skip-gram model.</p><p>The cross-entropy measure models long-tailed distributions poorly because it gives too much weight to unlikely events. To bound the cross-entropy, the values of the distribution Q have to be normalized, which is computationally expensive since |V| exponentials must be computed for each word. Instead of the weighted cross-entropy, the weighted squared differences between the unnormalized values of the two distributions are used. But the unnormalized counts of the co-occurrence matrix can dominate those of the softmax; to address this, their logarithms are used instead. The weighting factor is also adjusted by using the weighting function mentioned above.</p><p>The computational complexity of the model depends on the number of non-zero entries of the co-occurrence matrix, since the weighting function is 0 otherwise. |V|² does not give a useful bound, as it exceeds the size of most corpora, which are on the order of billions of tokens. A tighter bound on the number of non-zero elements of the co-occurrence matrix is obtained by modelling the matrix entries as a power-law function of the frequency rank of the word pair.</p><blockquote><strong>Experiments and Results:</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FKPKArAgktuNrjnEtcGoCQ.png" /><figcaption>src: GloVe: Global Vectors for Word Representation, Jeffrey Pennington et al.</figcaption></figure><p>The model is evaluated on the word analogy task and on the CoNLL-2003 benchmark dataset for named entity recognition. It achieved around 70% overall accuracy on the analogy task, and its accuracy on the semantic questions is about 80% with a symmetric context. The experiments used the word2vec tool, which uses negative sampling instead of hierarchical softmax. The model’s performance is compared against SVD-based models and also against Mikolov’s CBOW.</p>
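<p>Before turning to the comparisons below, here is a minimal sketch of the weighted least-squares objective described above, showing one term of the sum over non-zero co-occurrence counts. This is not the authors’ released implementation; the values x_max = 100 and alpha = 0.75 are the choices reported in the paper.</p><pre>import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij): grows as a power law up to x_max, then saturates at 1."""
    return min(1.0, (x / x_max) ** alpha)

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """One weighted least-squares term of the cost, for a non-zero co-occurrence count X_ij."""
    diff = np.dot(w_i, w_j_tilde) + b_i + b_j_tilde - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2</pre>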
<p>On almost all of the datasets, in both the word analogy and NER tasks, this model performed better than the existing models. The GloVe model, with a varying number of training iterations, is compared against CBOW and Skip-gram with a varying number of negative samples. The accuracy of the GloVe model increases with the number of iterations, but the improvements diminish after 15 iterations. In contrast, the accuracy of Skip-gram increases only slightly with the number of negative samples, while the CBOW model performs best with 10 negative samples and degrades beyond that.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ECcUJhw6UQiJuxzNzoYgEg.png" /><figcaption>src: GloVe: Global Vectors for Word Representation, Jeffrey Pennington et al.</figcaption></figure><blockquote><strong>My take:</strong></blockquote><p>The paper has taken the best of the matrix factorization models and the local context window models and presented a computationally simple yet efficient model for learning rich distributed word representations. The abstract claims that the model properties needed for such regularities to emerge in word vectors are analyzed, but there is no explicit discussion of this in the paper.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6b526c59c919" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tomas Mikolov’s ‘Distributed Representations of Words and Phrases and their Compositionality’ in 500…]]></title>
            <link>https://medium.com/@satyavasanth_57235/tomas-miklovs-distributed-representations-of-words-and-phrases-and-their-compositionality-in-500-7f1ca13f2b50?source=rss-29013e7fca26------2</link>
            <guid isPermaLink="false">https://medium.com/p/7f1ca13f2b50</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Satya Vasanth Tumati]]></dc:creator>
            <pubDate>Wed, 11 Apr 2018 02:12:59 GMT</pubDate>
            <atom:updated>2018-04-11T02:12:59.904Z</atom:updated>
            <content:encoded><![CDATA[<h3>Tomas Mikolov’s ‘Distributed Representations of Words and Phrases and their Compositionality’ in 500 words</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AQoIWS06gNKSCau4Lt9EWg.png" /><figcaption>src: Tomas Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality</figcaption></figure><p>The Skip-gram model, introduced shortly before this paper, proved efficient at learning high-quality word representations. These representations encapsulate a large number of semantic and syntactic relations between words. The paper introduces extensions to the existing Skip-gram model that improve both its computational complexity and its results.</p><p>The Skip-gram model finds word representations that are useful for predicting the surrounding words. Each word has an input representation (used when the word is the center word) and an output representation (used when the word is a context word). The accuracy of the model can be improved by considering a larger context, but at the expense of training time.</p><p>Distributed representations are used in statistical language modeling, speech recognition, and machine translation, but the striking feature of the Skip-gram model is that it doesn’t involve dense matrix multiplications, which makes training extremely fast.</p><p><strong><em>Hierarchical Softmax:<br></em></strong>The Skip-gram model uses a softmax to compute p(w_{t+j} | w_t). This is quite expensive, as the cost of computing the gradient of these probabilities is proportional to the vocabulary size W, which is often 10^5 to 10^7 terms.<br>A hierarchical softmax is a good, computationally efficient approximation. It uses a binary tree representation of the output layer: every word is a leaf of the tree, and each internal node n represents the relative probabilities of its child nodes. To compute the probability P(w_O | w_I), only the nodes appearing on the path from the root to w_O are considered.<br>This brings the complexity down to log W instead of W.<br>For faster training, a binary Huffman tree is used, since it assigns shorter codes to frequent words.</p><p><strong><em>Negative Sampling:<br></em></strong>Noise Contrastive Estimation (NCE) can be used in place of the hierarchical softmax to avoid the expensive computation of the softmax denominator. In general, it uses a logistic regressor to distinguish data from noise. The task is to distinguish the target word from draws of a noise distribution using logistic regression, with k negative samples for each data sample.<br> Negative Sampling (NEG) is a simplification of NCE that uses only samples from the noise distribution, without needing its numerical probabilities.</p><p><strong><em>Subsampling of frequent words:<br></em></strong>The idea of this extension is that the vector representations of frequent words do not change significantly after training on several million examples. Each word in the training set is kept with a probability inversely proportional to the square root of its frequency, so only words whose frequency exceeds a chosen threshold are discarded with non-zero probability (sketched in code below).<br>As the model is linear, the learned vectors lend themselves to linear analogical reasoning.</p>
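<p>A minimal sketch (not the authors’ code) of the subsampling rule described above, assuming freq is a word’s relative frequency in the corpus and using the threshold t = 1e-5 suggested in the paper:</p><pre>import math
import random

def keep_word(freq, t=1e-5):
    """Return True if one occurrence of a word with relative frequency `freq`
    should be kept; frequent words are discarded with probability 1 - sqrt(t / freq)."""
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq))
    return random.random() >= p_discard</pre>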
<p>Among the given extensions, NEG performs best, with over 61% test accuracy on the analogy test set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AGuugJ-k0u2IndPD4EkbBQ.png" /><figcaption>Compositionality observed in the learned word vectors | src: Tomas Mikolov’s ‘Distributed Representations of Words and Phrases and their Compositionality’</figcaption></figure><p><strong><em>Learning Phrases:<br></em></strong>Another important extension is the ability to find representations for phrases. We first find words that appear frequently together, and infrequently in other contexts. A score function, compared against a threshold, determines whether a bigram should be treated as a phrase. In this way phrases can be represented without greatly increasing the vocabulary size.<br>The hierarchical softmax version of this model performed well when the frequent words are subsampled. The accuracy on the phrase analogy task reached up to 72% when the model was trained on a dataset of 32 billion words, and only 47% with 1 billion words, suggesting that a large amount of data gives better results.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7f1ca13f2b50" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yoshua Bengio’s A Neural Probabilistic Language Model in 500 words]]></title>
            <link>https://medium.com/@satyavasanth_57235/yoshua-bengios-a-neural-probabilistic-language-model-in-500-words-665b6e64ade6?source=rss-29013e7fca26------2</link>
            <guid isPermaLink="false">https://medium.com/p/665b6e64ade6</guid>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[research-paper]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Satya Vasanth Tumati]]></dc:creator>
            <pubDate>Wed, 11 Apr 2018 02:02:06 GMT</pubDate>
            <atom:updated>2018-04-11T02:02:06.879Z</atom:updated>
            <content:encoded><![CDATA[<p>Hello World, welcome to my first blog post. In this and upcoming blog posts, I will be posting short summaries of some of the most popular research papers in the area of NLP. Suggestions and constructive criticism are most welcome.</p><blockquote><strong>The Problem:</strong></blockquote><p>The fundamental problem for probabilistic language modeling is that the joint distribution of a large number of discrete variables requires an exponentially large number of free parameters. This is called the ‘curse of dimensionality’. It motivates modeling with continuous variables, where generalization is more easily achieved. The learned function is then locally smooth, and every point (n-gram sequence) carries significant information about a combinatorial number of neighboring points.</p><blockquote><strong>The Solution:</strong></blockquote><p>The paper presents an effective and computationally efficient probabilistic modeling approach that overcomes the curse of dimensionality. It also handles sequences never observed in the training data. A neural network model is developed whose parameter set contains both the vector representation of each word and the parameters of the probability function. The objective of the model is to find the parameters that minimize the perplexity of the training dataset. The model thus jointly learns a distributed representation of each word and the probability function of a sequence expressed in terms of these representations. The neural model has a hidden layer with tanh activation and a softmax output layer. For each input of (n-1) previous word indices, the output of the model is a probability for each of the |V| words in the vocabulary (a small code sketch appears below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EqKiy4-6tuLSoPP_kub33Q.png" /><figcaption>src: Yoshua Bengio et al., A Neural Probabilistic Language Model</figcaption></figure><blockquote><strong>The Significance:</strong></blockquote><p>This model is capable of taking advantage of longer contexts. Some traditional n-gram based models slightly mitigate the problem of unseen sequences by gluing together overlapping shorter sequences, but they can only account for short contexts. With a continuous representation, where each word has a vector representation, it is now possible to estimate probabilities for sequences unseen in the training corpus. The probability function uses a number of parameters that grows only linearly with the size of the vocabulary and linearly with the dimension of the vector representation. The curse of dimensionality is avoided, since an exponential number of free parameters is no longer needed. An extension of this work presents an architecture that outputs an energy function instead of probabilities and also takes care of out-of-vocabulary words.</p><blockquote><strong>Experimentation and Results:</strong></blockquote><p>The two corpora selected are standard and reasonably large. The Brown corpus is a collection of English texts, while the AP News corpus contains news articles from 1995 and 1996. The models compared against are modified back-off n-gram models, which perform better than the standard models. The neural network’s test perplexity was about 24% lower on the Brown corpus and about 8% lower on the AP News corpus.</p>
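<p>A minimal sketch of the architecture described above, written in present-day PyTorch purely for illustration (the paper predates such libraries, so this is not the authors’ implementation); the layer sizes are placeholders, and the optional direct input-to-output connections discussed in the paper are omitted.</p><pre>import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Bengio-style neural probabilistic language model:
    (n-1) previous word indices -> shared word features -> tanh hidden layer -> softmax over |V|."""
    def __init__(self, vocab_size, n_context, embed_dim=60, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # the learned word feature vectors C
        self.hidden = nn.Linear(n_context * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                            # context: (batch, n-1) word indices
        x = self.embed(context).flatten(start_dim=1)       # concatenate the (n-1) word vectors
        h = torch.tanh(self.hidden(x))
        return F.log_softmax(self.out(h), dim=-1)          # log-probabilities over the vocabulary</pre>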
<p>The best performance is observed with 10 hidden nodes in the MLP of the neural model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-Iar9Xe0yyYRhFwxurt-Ng.png" /><figcaption>src: Yoshua Bengio et al., A Neural Probabilistic Language Model</figcaption></figure><blockquote><strong>My Take:</strong></blockquote><p>This paper takes the best ideas from several earlier lines of work, such as learning a statistical model, exploiting word similarities, using a distributed vector representation for each word, and using neural networks, and puts them together into an elegant solution to the problem of statistical language modeling. Besides presenting an elegant model, the paper also describes how to take advantage of present-day computational resources to carry out the task quickly and efficiently, giving descriptions of data-parallel and parameter-parallel implementations. The idea of mixing this model with a trigram model, and the authors’ attribution of the resulting performance gain to the neural model and the trigram model ‘making errors in different places’, is remarkable. The part where using direct connections to the output layer implies that the number of hidden layers becomes 2 is not clearly explained. The work also presents clear future directions in terms of understanding the word representations, introducing prior knowledge, and representing the conditional probability as a tree structure.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=665b6e64ade6" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>