A Comprehensive Introduction to Word Vector Representations

Esteban Vargas · Published in AI Society · Feb 9, 2017

Making a computer mimic the human cognitive function of understanding text is a really hot topic nowadays. Applications range from sentiment analysis to text summarization and language translation, among others. We call this field of computer science and artificial intelligence Natural Language Processing, or NLP (gosh, please don’t confuse it with Neuro-Linguistic Programming).

Bag of Words

The ‘Bag of Words’ model was an important insight that made NLP thrive. It consists of taking a list of labeled text corpora, counting the words in each corpus, and determining how frequently each word (or morpheme [1], to be more precise) appears for every given label. After that, Bayes’ Theorem is applied to an unlabeled corpus to test which label it most probably belongs to (the positive or negative label of a sentiment analyzer, perhaps), based on those morpheme frequencies.
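To make this concrete, here’s a minimal sketch of that pipeline using scikit-learn’s CountVectorizer and MultinomialNB; the tiny labeled corpus is made up purely for illustration.

```python
# A minimal Bag of Words + Naive Bayes sketch (scikit-learn assumed installed).
# The tiny labeled corpus below is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "I loved this movie, great acting",      # positive
    "what a wonderful, fun film",            # positive
    "terrible plot and boring acting",       # negative
    "I hated it, a complete waste of time",  # negative
]
train_labels = ["positive", "positive", "negative", "negative"]

# Count how frequently each word appears in each corpus (the 'bag of words').
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Apply Bayes' Theorem over those counts to score each label.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Predict the label of an unlabeled corpus.
X_test = vectorizer.transform(["the acting was terrible and boring"])
print(classifier.predict(X_test))  # expected to lean towards 'negative'
```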

Even though decent (>90%) test scores can be achieved with this method, it has two problems:

  1. Syntactic and semantic accuracy isn’t as high as it should be, because context is king. For instance, ‘Chicago’ means one thing and ‘Bulls’ means another, but ‘Chicago Bulls’ means something else entirely. Counting word frequencies doesn’t take this into account.
  2. For more practical use cases, we need to understand that real-life data tends to be unlabeled, so moving from a supervised to an unsupervised learning method yields greater utility.

Simple Co-occurrence Vectors

Analyzing the context in which a word is used is the key insight for attacking this problem. Taking a word’s neighboring words into account is what has made NLP take a quantum leap in recent years.

We will set a parameter ‘m’ which stands for the window size. In this example we’ll use a size of 1 for educational purposes, but 5–10 tends to be more common. This means that each word will be defined by its neighboring word to the left as well as the one to the right. We model this mathematically by accumulating the counts from every window into a co-occurrence matrix. Let’s look at the following example:

I love Programming. I love Math. I tolerate Biology.

Here the word ‘love’ is defined by the words ‘I’ and ‘Programming’, meaning that we increment the value both for the ‘I love’ and the ‘love Programming’ co-occurrence. We do that for each window and obtain the full co-occurrence matrix for the corpus.
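Here’s a minimal sketch that builds this matrix in Python, assuming a window of size 1 that doesn’t cross sentence boundaries and that punctuation is simply dropped:

```python
# Build a window-size-1 co-occurrence matrix for the toy corpus
# (assuming windows don't cross sentence boundaries and punctuation is dropped).
corpus = "I love Programming. I love Math. I tolerate Biology."
sentences = [s.split() for s in corpus.split(".") if s.strip()]

vocab = sorted({word for sentence in sentences for word in sentence})
index = {word: i for i, word in enumerate(vocab)}
m = 1  # window size

# counts[i][j] = how many times vocab[j] appears within m words of vocab[i]
counts = [[0] * len(vocab) for _ in vocab]
for sentence in sentences:
    for pos, word in enumerate(sentence):
        for offset in range(-m, m + 1):
            neighbour = pos + offset
            if offset != 0 and 0 <= neighbour < len(sentence):
                counts[index[word]][index[sentence[neighbour]]] += 1

for word, row in zip(vocab, counts):
    print(f"{word:12}", row)
# 'Programming' and 'Math' end up with identical rows: each co-occurs
# exactly once with 'love' and with nothing else.
```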

Once we have the co-occurrence matrix filled, each row becomes a word vector that we can plot in a multi-dimensional space. Since ‘Programming’ and ‘Math’ share the same co-occurrence values, they land on the same point, meaning that in this context they mean the same thing (or ‘pretty much’ the same thing). ‘Biology’ would be among the closest words to these two, meaning ‘it has a close meaning but not the same one’, and so on for every word. The semantic and syntactic relationships generated by this technique are really powerful, but it is computationally expensive since each vector has as many dimensions as there are words in the vocabulary. Therefore, we need a technique that reduces dimensionality for us with the least possible loss of information.
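Continuing the sketch above, each row of the matrix is already a word vector, and a plain distance between rows captures the similarity we just described:

```python
# Continuing the sketch above: each row of `counts` is a word vector.
import numpy as np

vectors = {word: np.array(row) for word, row in zip(vocab, counts)}

def distance(a, b):
    """Euclidean distance between two word vectors."""
    return np.linalg.norm(vectors[a] - vectors[b])

print(distance("Programming", "Math"))     # 0.0 -> identical contexts
print(distance("Programming", "Biology"))  # small, but non-zero
print(distance("Programming", "love"))     # noticeably larger
```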

Singular Value Decomposition

The idea here is to store only the most ‘important’ information, so that each word gets a dense vector (eliminating as many zeros as possible and keeping only the relevant values) with a low number of dimensions. We do this by applying a technique borrowed from Linear Algebra called Singular Value Decomposition [2], which generalizes the eigendecomposition of a square matrix to matrices of any shape. Keeping only the largest singular values gives the best low-rank approximation of the co-occurrence matrix, and therefore the least possible loss of information.
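Here’s a minimal sketch of that reduction with NumPy, reusing the co-occurrence matrix built above and keeping only the two largest singular values:

```python
# Reduce the co-occurrence matrix above to 2 dimensions with a truncated SVD.
import numpy as np

X = np.array(counts, dtype=float)  # the co-occurrence matrix built earlier
U, S, Vt = np.linalg.svd(X)        # full singular value decomposition

k = 2                              # number of dimensions to keep
dense_vectors = U[:, :k] * S[:k]   # each row is now a dense 2-d word vector

for word, vec in zip(vocab, dense_vectors):
    print(f"{word:12}", np.round(vec, 2))
# Words used in identical contexts ('Programming' and 'Math') get identical
# dense vectors, while most of the zeros of the original matrix disappear.
```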

This approach generates really interesting semantic and syntactic relationships. Semantically, we can visualize things such as ‘San Francisco’ and ‘New York’ sitting at the highest level of similarity, with ‘Toronto’ at the next level and ‘Tokyo’ at the one after that. Syntactically, words cluster around their respective morphemes; for example, ‘write’, ‘wrote’ and ‘writing’ end up clustered together, while farther away there’s another cluster with ‘cook’, ‘cooking’ and ‘cooked’. Dimensionality has indeed been reduced with this approach; however, the cost of computing the SVD scales quadratically with the smaller dimension of the matrix (O(mn²) flops for an n×m matrix with n ≤ m), which is not very desirable. Let us then introduce a model that solves this computational complexity issue:

GloVe

The way we are finally going to solve our computational complexity issue is by predicting the surrounding words of every word instead of building the full co-occurrence matrix and decomposing it. This method is not only more computationally efficient, but it also makes it viable to add new words to the model, which means the model scales with corpus size. There are various prediction models, but we’re going to talk about one in particular that combines this predictive objective with the global co-occurrence statistics we saw earlier and generates really powerful word relationships: GloVe, Global Vectors for Word Representation. [3]

The way these models predict surrounding words is by maximizing the probability of a context word occurring given a center word; in GloVe this boils down to a weighted least-squares regression on the co-occurrence counts, i.e. to minimizing a cost function over the word vectors. Review Convex Optimization [4] if the idea of minimizing a cost function doesn’t sound familiar. Our cost function is the following:
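Written out in the notation of the GloVe paper [3], it sums over all word pairs i, j in a vocabulary of size V:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

Here X_{ij} counts how often word j appears in the context of word i, w_i and \tilde{w}_j are the center and context word vectors, b_i and \tilde{b}_j are bias terms, and f is a weighting function that caps the influence of extremely frequent pairs (the paper uses x_{\max} = 100 and \alpha = 3/4):

f(x) = (x / x_{\max})^{\alpha} \text{ if } x < x_{\max}, \quad 1 \text{ otherwise}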

Then something mind-blowing happens. The multi-dimensional plot (represented in 2 dimensions here) understands that what Dollar is to Peso, USA is to Colombia, as well as that what Dollar is to USA, Peso is to Colombia. What’s most impressive isn’t that cognitive intelligence assessments test how well a human can build these kinds of relations; it’s that the semantic relation between words turns into a mathematical one. For instance, if you perform the vector operation Peso - Dollar + USA, you will get Colombia as a result. This happens because these words tend to appear in the same contexts. Imagine we are training on a corpus of economic news; you’ll often find fragments such as “The {Country} {Currency} appreciated” or “Firms that import from {Country1} to {Country2} are worried because the {Currency2} has depreciated with respect to the {Currency1}.”
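If you want to try this at home, gensim ships pretrained GloVe vectors through its downloader. Treat the following as a sketch: the model name is one of gensim’s standard downloads, its vocabulary is lowercase, and whether the peso/dollar analogy lands exactly on ‘colombia’ depends on the corpus the vectors were trained on.

```python
# Sketch of the vector-arithmetic analogy using pretrained GloVe vectors
# from gensim's downloader. Assumes gensim is installed and the listed tokens
# exist in this model's (lowercase) vocabulary; exact neighbours depend on
# the training corpus.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first run

# peso - dollar + usa : "what the dollar is to the USA, the peso is to ...?"
print(vectors.most_similar(positive=["peso", "usa"], negative=["dollar"], topn=5))

# The classic sanity check: king - man + woman should land near 'queen'.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```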

This first tutorial has been mostly about the mathematical background behind modern deep learning techniques for NLP. With this foundation we can now crack some code to perform sentiment analysis, which we’ll do in our next tutorial.

Happy hacking!

[1] The smallest meaningful unit of a word. For example, ‘reading’, ‘read’ and ‘readable’ share the morpheme ‘read’. Python libraries such as nltk let you run algorithms that reduce each word in a corpus to its morpheme in only a few lines of code.
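A rough illustration with nltk’s Porter stemmer (strictly speaking it produces stems rather than true morphemes, but that’s the usual few-lines-of-code approximation):

```python
# Reduce inflected forms to a common stem with nltk's Porter stemmer
# (a stem rather than a true morpheme, but close enough in practice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["read", "reads", "reading"]])  # all map to 'read'
```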

[2] Here’s a comprehensive tutorial from MIT OCW: https://www.youtube.com/watch?v=cOUTpqlX-Xs. If you need a Linear Algebra refresher, it’s worth watching.

[3] https://pdfs.semanticscholar.org/b397/ed9a08ca46566aa8c35be51e6b466643e5fb.pdf

[4] http://cs229.stanford.edu/section/cs229-cvxopt.pdf

Thanks to Juan C. Saldarriaga, Ana M. Gómez and Melissa M. Argote for revising the drafts of this text.
