Neural Networks & Word Embeddings

Virtually all modern NLP techniques use neural networks as their statistical architecture. Word embeddings are mathematical representations of words, sentences, and (sometimes) whole documents. Embeddings allow us to compare text, visualize it, and improve the accuracy of newer models. Stanford’s Natural Language Processing with Deep Learning course is a great place to start, and I’ll follow it here [videos 1–5 and the associated readings]. For Assignment 2, I’ll complete the coding and written parts, implementing Word2vec in PyTorch.

Representing words by their context

[Lecture 1 — Stanford NLP with Deep Learning — Introduction and Word Vectors].

Distributional semantics

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by. In other words “You shall know a word by the company it keeps”.

Word Embeddings (a.k.a. Word vectors)

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts [here’s a visual video snippet]. They are a distributed representation. In practice the dimensionality is at least 50; 300 (a common choice on a laptop), 1000, 2000, and 4000 are typical vector sizes.

Word2vec

Word2vec is a framework for learning word vectors. It’s a very simple and scalable way of learning vector representations of words. Ultimately, we want a model that gives a reasonably high probability estimate to all words that occur in the context (fairly often).

Main Idea/Summary:

  1. Iterate through each word of the whole corpus
  2. Predict surrounding words using word vectors
  3. Update vectors so you can predict well

More Detailed Steps:

  • Start with a large corpus of text (with actual sentences).
  • Every word in a fixed vocabulary is represented by a vector. Start with a random vector for each word.
  • Go through each position t in the text, which has a center word c and outside context words o.
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa).
  • Keep adjusting the word vectors to maximize this probability.

The question now is how do we calculate the probability of the center word and the context given the center word? [Video Snippet]

We will use two vector representations per word w: one vector (v) for when it is the center word and we are predicting other words, and a second vector (u) for when it is a context word.
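In symbols, this is the standard word2vec softmax, where the u’s are the “outside” (context) vectors, the v’s are the center vectors, and V is the vocabulary:

P(o | c) = exp(u_o · v_c) / Σ_{w ∈ V} exp(u_w · v_c)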

https://youtu.be/8rXD5-xhemo?t=2871

NOTE: Generally we may be using Word2vec to predict a context of, say, 10 words or so, which makes it a “loose” model: assigning even a 5% probability to one of those context words counts as doing well. So while we should not expect something like 97% accuracy, the goal is, for example, to capture that the word “withdrawal” is likely to occur near “bank” rather than near an unrelated word like “football”.

Word2vec the Skip-Gram Model

Here’s an excellent article that covers the skip-gram neural network architecture for Word2vec.

The skip-gram model proposes a simple single-layer architecture based on the inner product between two word vectors. The objective is to predict a word’s context given the word itself.

The intuition of skip-gram is:

  1. Treat the target word and a neighboring context word as positive examples.
  2. Randomly sample other words in the lexicon to get negative samples.
  3. Use logistic regression to train a classifier to distinguish those two cases.
  4. Use the regression weights as the embeddings.

[Source: Jurafsky, SLP, Ch 6.8]

Imagine a sentence with a target word apricot, and assume we’re using a window of ±2 context words (the two words before and the two words after apricot).

Our goal is to train a classifier such that, given a tuple (t, c) of a target word t paired with a candidate context word c (for example (apricot, jam), or perhaps (apricot, aardvark)) it will return the probability that c is a real context word (true for jam, false for aardvark):

The probability that c is a real context word.

How does the classifier compute the probability P? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product).

Of course, the dot product t·c is not a probability. Cosine isn’t a probability either.

To turn the dot product into a probability, we’ll use the logistic or sigmoid function σ(x), the fundamental core of logistic regression:

The probability that word c is a real context word for target word t is thus computed as:
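P(+ | t, c) = σ(t · c) = 1 / (1 + e^(−t·c))

(here t and c are the embeddings of the target and candidate context word, following the notation of SLP Ch. 6.8)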

The sigmoid function just returns a number between 0 and 1, so to make it a probability we’ll need to make sure that the total probability of the two possible events (c being a context word, and c not being a context word) sums to 1.

The probability that word c is not a real context word for t is thus:
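P(− | t, c) = 1 − P(+ | t, c) = σ(−t · c)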

The equation above gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:
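P(+ | t, c1:k) = Π_{i=1..k} σ(t · c_i)

or equivalently, in log space:

log P(+ | t, c1:k) = Σ_{i=1..k} log σ(t · c_i)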

In summary, skip-gram trains a probabilistic classifier that, given a test target word t and its context window of k words c1:k , assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target word and context word in the vocabulary. Let’s now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
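Here is a minimal numpy sketch of that classifier, assuming we already have an embedding for the target word and one for each context word (the variable names are mine, not the assignment’s):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_real_context(t_vec, c_vecs):
    """Probability that every word in the window is a real context word
    for the target, assuming the context words are independent."""
    # sigmoid of each dot product t . c_i, multiplied over the window
    return np.prod([sigmoid(t_vec @ c) for c in c_vecs])

# toy 4-dimensional embeddings
t = np.array([0.2, -0.1, 0.4, 0.05])         # target word, e.g. "apricot"
context = [np.array([0.1, -0.2, 0.5, 0.0]),  # e.g. "jam"
           np.array([0.3, 0.0, 0.2, 0.1])]   # e.g. "tablespoon"
print(p_real_context(t, context))
```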

Learning skip-gram embeddings

Read chapter 6.8, page 20 of Jurafsky’s Vector Semantics and Embeddings book for a great explanation of how Word2vec learns embeddings.

Efficient Estimation of Word Representations in Vector Space [Paper]

In this Google paper, Efficient Estimation of Word Representations in Vector Space, it was observed that it is possible to train high-quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set. Using the DistBelief distributed framework, it should be possible to train the CBOW (Continuous Bag-of-Words) and Skip-gram models even on corpora with one trillion words, for a basically unlimited vocabulary size.

Distributed Representations of Words and Phrases and their Compositionality [Paper]

http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

In this Google paper they found that subsampling frequent words yields a significant speedup and also produces more regular word representations. They also describe a simple alternative to the hierarchical softmax called negative sampling. And because the meanings of “Canada” and “Air” cannot easily be combined to obtain “Air Canada”, they present a simple method for finding phrases in text. Top highlights from the paper are:

  • The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.
  • While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality.
  • The results show that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than Noise Contrastive Estimation.
  • Subsampling with a threshold of 10⁻⁵ can result in faster training and can also improve accuracy, at least in some cases.
  • Negative sampling is an extremely simple training method that learns accurate representations, especially for frequent words.

Assignment 1 — Explore Word2Vec [GitHub][Code]

I completed Assignment 1 from the Stanford CS224N course.
In this assignment I explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.

Below is my Jupyter Notebook:

Word Vectors

Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.

Note on Terminology: The terms “word vectors” and “word embeddings” are often used interchangeably. The term “embedding” refers to the fact that we are encoding aspects of a word’s meaning in a lower dimensional space.

Part 1: Count-Based Word vectors

Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words.

  • Create a Co-occurrence Matrix: A co-occurrence matrix counts how often words co-occur within some window (their shared environment) across the corpus.
  • Dimensionality Reduction: Use sklearn.decomposition.TruncatedSVD to construct a method that performs dimensionality reduction on the matrix, taking the top _k_ singular components and producing a new matrix of k-dimensional embeddings (see the sketch after this list).
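A minimal sketch of that reduction step with scikit-learn (the function name and defaults here are my own choices, not necessarily the assignment’s):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """Reduce a (vocab_size x vocab_size) co-occurrence matrix M
    to (vocab_size x k) embeddings using truncated SVD."""
    svd = TruncatedSVD(n_components=k, n_iter=10, random_state=0)
    return svd.fit_transform(M)

# toy co-occurrence matrix for a 4-word vocabulary
M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
embeddings = reduce_to_k_dim(M, k=2)   # shape (4, 2)
```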

Part 2: Prediction-Based Word Vectors

Here I explore the embeddings produced by word2vec.

  • Dimensionality Reduction: Reduce dimensionality of Word2Vec Word Embeddings using truncated SVD.
  • Synonyms & Antonyms: Demonstrate counter-intuitive examples and explore why they happen.
  • Solving Analogies with Word Vectors: I use gensim to solve analogies. For example, for the analogy “man:king :: woman:x”, what is x? (See the sketch after this list.)
  • Finding Analogies: Use gensim’s most_similar function to find analogies, e.g. Austin is to Texas as Atlanta is to Georgia.
  • Explore Incorrect Analogies: Gensim doesn’t always produce correct results. Here I explore that.
  • Analyze Bias in Word Vectors: Use the most_similar function to find another case where bias is exhibited by the vectors. For example, I discovered a bias in which “mother” plus “doctor” correlates with “nurse”, while “father” plus “doctor” correlates with “physician”.
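A rough gensim sketch of the analogy queries above (the pretrained model name is just one readily available option, not necessarily the one used in the assignment):

```python
import gensim.downloader as api

# load a pretrained word-embedding model (this download is fairly large)
wv = api.load("glove-wiki-gigaword-200")

# man : king :: woman : x  ->  x should come out near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# austin : texas :: atlanta : x  ->  x should come out near "georgia"
print(wv.most_similar(positive=["texas", "atlanta"], negative=["austin"], topn=3))
```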

NOTE: In machine learning, the ideal algorithm has low bias and can accurately model the true relationship. And it has low variability, by producing consistent predictions across different datasets. [This video explains it clearly].

Word Vectors 2 and Word Senses

Stanford CS224N: NLP with Deep Learning | Lecture 2 — Word Vectors and Word Senses
  • Don’t over-trust PCA 2-dimensional displays. We projected 100-D vectors down to 2-D. That captures some of the major geometry of the space but loses a huge amount of information. So if two words look close together in the 2-D space, they might have been close in the original space, or that closeness might be an artifact of the projection, because other, dominant patterns were chosen as the first two principal components.
  • We have 2 matrices. For the center words, we have a matrix V with one vector per word in our vocabulary (represented as rows); in the example the V matrix has 6 words. For the outside words we have a matrix U with a second vector for each word. So if the center word is word 4, we take the dot product of v4 with each row of U, which gives us a vector of dot products between U and v4. We then apply a softmax to that vector, which gives us a probability distribution over words in the context.
  • CAUTION: These 2-dimensional pictures are exceedingly misleading. For example, “Nokia” is next to “Samsung”, so you might expect it to be far away from words at the bottom of the plot; or you might actually want the effect that Nokia is close to Finland, for some reason. You cannot have both in a 2-D vector space. Most properties of high-dimensional vector spaces are very unintuitive, because in a high-dimensional space a word can be close to lots of other words in different directions.

Skip-grams (SG) versus Continuous Bag of Words (CBOW)

In this lecture he presents the skip-grams model. However, there are two different Word2vec model variants:

  1. Skip-grams — Predict the context (“outside”) words (position independent) given center word.
  2. Continuous Bag of Words (CBOW) — Predict the center word from the (bag of) context words.

Disadvantage: Both SG and CBOW do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.

For additional efficiency in training, use negative sampling: in practice the naive softmax (which normalizes over the entire vocabulary) is far too slow.

Negative Sampling

Probability of a word in the context (o), given a center word (c).

The idea of negative sampling is that we are going to train binary logistic regressions. We want to give a high probability to the word that was actually observed in the context of the center word, so we do one binary logistic regression for the actual observed word (in the numerator). Then we randomly sample a bunch of other words (the negative samples) and try to give them as low a probability as possible, because they were not the words actually seen. In other words:

Main idea: train binary logistic regressions for a true pair (center word and a word in its context window) versus several noise pairs (the center word paired with a random word).
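In symbols, the standard skip-gram negative-sampling loss for one center/outside pair, where o is the observed outside word, c the center word, and K the set of sampled negative words:

J_neg-sample(u_o, v_c) = −log σ(u_o · v_c) − Σ_{k ∈ K} log σ(−u_k · v_c)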

GloVe — Global Vectors for Word Representations. Combining the best of both worlds.

GloVe is a new global log-bilinear regression model that combines the advantages of the two major model families: global matrix factorization (like LSA) and local context window methods (like skip-gram). Why?

Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts. [Source: GloVe paper]

  • NOTE: The main problem with word-word co-occurrence matrices is that the number of times two words co-occur with “the” or “and”, for example, will have a large effect on their similarity despite conveying little about semantic relatedness!

Crucial insight: ratios of co-occurrence probabilities can encode meaning components. The goal is: how can we keep the benefits of the neural-net methods while working with a count matrix?

Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.

If we have a word like “ice” how often are things going to co-occur with it? Solids should co-occur a lot and “gas” shouldn’t. “Water” should occur a lot while another random word won’t occur much. The interesting thing lies in the difference between the two co-occurrence probabilities P(x|ice) versus P(x|steam). The ratio is a “dimension of meaning”. Notice how in “water” and “fashion”, they cancel out to ~1. However, there is a “dimension of meaning” for “solid” and “gas”. The ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better able to discriminate between the two relevant words.

Question: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space? Answer: make the dot products equal to the log of the co-occurrence probability, so that vector differences correspond to logs of ratios of co-occurrence probabilities.

The GloVe model:

GloVe model: J is the objective function with a squared loss. The dot product of wi and wj should be as close as possible to the log of the co-occurrence count, so the model incurs loss to the extent that they differ. Bias terms bi and bj are added because a word may be very common or very uncommon. The f-function in front caps the effect that very common word pairs can have on the objective.
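Written out, the GloVe objective (in the paper’s notation, where X_ij is the co-occurrence count of words i and j, and w̃ and b̃ are the context-word vectors and biases):

J = Σ_{i,j=1..V} f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)²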

GloVe: Global Vectors for Word Representation [Paper]


Improving Distributional Similarity with Lessons Learned from Word Embeddings [Paper]

https://www.aclweb.org/anthology/Q15-1016.pdf

In this paper they reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, they show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, they observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.

They tuned the following hyperparameters:

  • window size (win)
  • dynamic context window (dyn)
  • subsampling (sub)
  • deleting rare words (del)
  • shifted PMI (neg)
  • context distribution smoothing (cds)
  • adding context vectors (w+c)
  • eigen weighting (eig)
  • vector normalization (nrm)

Evaluation methods for unsupervised word embeddings [Paper]

https://www.aclweb.org/anthology/D15-1036.pdf

We present a comprehensive study of evaluation methods for unsupervised embedding techniques that obtain meaningful representations of words from text. Different evaluations result in different orderings of embedding methods, calling into question the common assumption that there is one single optimal vector representation. We present new evaluation techniques that directly compare embeddings with respect to specific queries. These methods reduce bias, provide greater insight, and allow us to solicit data-driven relevance judgments rapidly and accurately through crowdsourcing.

Intrinsic Evaluations

Intrinsic evaluations directly test for syntactic or semantic relationships between words. There are 4 broad categories: Relatedness, analogy, categorization and selectional preference.

Intrinsic evaluations measure the quality of word vectors by directly measuring the correlation between semantic relatedness and geometric relatedness.

Extrinsic Tasks

Extrinsic evaluations measure the contribution of a word embedding model to a specific task.

  • We find that extrinsic evaluations, although useful for highlighting specific aspects of embedding performance, should not be used as a proxy for generic quality.
  • Different tasks favor different embeddings.

Word Window Classification, Neural Networks, and Matrix Calculus

https://www.youtube.com/watch?v=8CWyBNX6eDo&feature=youtu.be

This lecture covers:

  • Classification
  • Neural networks introduction
  • Named Entity Recognition
  • (Model) Binary true vs. corrupted word window classification model
  • Matrix calculus

Cross entropy loss

  • We are assuming there’s a true probability distribution p.
  • Let’s say we computed a probability distribution q from our softmax regression, for example.
  • And we’d like to have a measure of whether our estimated probability distribution is a good one.
Cross entropy loss formula
  • We go through the classes, and say what’s the probability of the class according to the true model. Using that weighting, we then work out the log of the probability according to our estimated model. We sum those up and negate it and that gives our cross entropy measure.
  • In general this gives us a measure of estimation between distributions.
  • But in our particular binary text classification case, in each example, we have a piece of labeled training data. So in our example, where p = [0,…,0,1,0,…,0] the correct answer is class 7. Therefore our true distribution is class 7 has probability of 1 and probability 0 over everything else.
  • In the cross entropy formula above, p(c) will then be 1 when c = 7 and 0 everywhere else, which leaves only the negative log of the estimated probability of the true class, −log q(7).
  • Because of one-hot p, the only term left is the negative log probability of the true class (see the sketch below).
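A minimal numpy sketch of that calculation (toy distributions and my own variable names):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c)."""
    return -np.sum(p * np.log(q))

# one-hot true distribution: the correct answer is class 7 (of 10 classes)
p = np.zeros(10)
p[7] = 1.0

# some estimated distribution q from a softmax
q = np.full(10, 0.05)
q[7] = 0.55

# with a one-hot p, this is exactly -log q[7]
print(cross_entropy(p, q), -np.log(q[7]))
```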

Classification difference with word vectors

Compared to conventional machine learning classification, classification with word vectors is an unusual idea. Rather than saying we only have the parameters W, we are saying that all of the word representations x are also parameters of our model. So in NLP deep learning we actually change the representations of words to allow our classifiers to do better! (In other words, we simultaneously learn both the weights and the word representations.) The word vectors re-represent the one-hot vectors (moving them around in an intermediate-layer vector space) for easy classification with a (linear) softmax classifier, via a layer x = Le. Here L is the lexicon (embedding) matrix with one column of word-vector values per word in your vocabulary, and e is a one-hot vector; multiplying L by e has the effect of pulling out one column of L.
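A tiny numpy illustration of x = Le (toy numbers; in practice L is learned):

```python
import numpy as np

# L: embedding matrix, one d-dimensional column per vocabulary word (d=3, |V|=4)
L = np.array([[0.1, 0.5, -0.2, 0.0],
              [0.4, -0.3, 0.2, 0.1],
              [0.0, 0.2, 0.7, -0.5]])

# e: one-hot vector selecting word 2 of the vocabulary
e = np.array([0.0, 0.0, 1.0, 0.0])

x = L @ e          # picks out column 2 of L: [-0.2, 0.2, 0.7]
print(x)
```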

Really you can think of word vector embedding as having a model with one more neural network layer. See below:

Non-linear activation functions and why they’re needed

Without non-linearities, deep neural networks can’t do anything more than a linear transform. With more layers, they can approximate more complex functions! For example, in the classification problem below, non-linearities allow for better decision boundaries in more complex scenarios resulting in the second graph.

Left: Linear decision boundaries. Right: Non-linear decision boundaries.

Maximum margin objective function

Like most machine learning models, neural networks also need an optimization objective, a measure of error or goodness which we want to minimize or maximize respectively.

The maximum margin objective is a popular error metric (commonly associated with SVMs).

Let’s look at this example where we want to perform Named Entity Recognition on this sentence:

“Museums in Paris are amazing”

The maximum margin formula is:

minimize J = max(Δ + s_c − s, 0)
  • The intuition behind doing this is that we only care that the “true” data point has a higher score than the “false” data point; the rest does not matter. Thus, we want our error to be (s_c − s) if s_c > s, else 0.
  • Each window with an NER location at its center should have a score +1 higher than any window without a location at its center
  • ∆ is a safety margin. We would want the “true” labeled data point to score higher than the “false” labeled data point by some positive margin ∆.
  • s is the score computed for “true” labeled data points. s = score(museums in Paris are amazing)
  • s_c is the score of the “corrupt” window, e.g. s_c = score(Not all museums in Paris). (See the sketch after this list.)
  • [source: CS224N class slides]
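A quick sketch of that objective, assuming we already have a scoring function for a window (the scores below are stand-in numbers, not the actual model’s):

```python
def max_margin_loss(s_true, s_corrupt, delta=1.0):
    """J = max(delta + s_corrupt - s_true, 0): zero loss once the true
    window outscores the corrupt window by at least the margin delta."""
    return max(delta + s_corrupt - s_true, 0.0)

# e.g. score("museums in Paris are amazing") = 2.3,
#      score("Not all museums in Paris")     = 1.8
print(max_margin_loss(2.3, 1.8))   # 0.5: still inside the margin
print(max_margin_loss(3.5, 1.8))   # 0.0: margin satisfied, no loss
```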

Assignment 2: Implementing Word2Vec [Jupyter Notebook][Github]

In the coding part I implement the word2vec model and train my own word vectors with stochastic gradient descent (SGD).

  • (a) Implement the sigmoid function so that it applies element-wise to an input vector. In the same file, fill in the implementations of the softmax and negative-sampling loss and gradient functions (naiveSoftmaxLossAndGradient, getNegativeSamples, and negSamplingLossAndGradient), and then the skip-gram model itself (skipgram). (A rough sketch of the sigmoid and naive softmax loss appears after this list.)
  • (b) Complete the implementation for your SGD optimizer: sgd.
  • (c) Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. The script will finish and a visualization for your word vectors will appear.
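To give a flavor of part (a), here is a rough numpy sketch of the sigmoid and the naive softmax loss for one (center, outside) pair; it is not the assignment’s solution code, just the idea:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_softmax_loss(center_vec, outside_idx, outside_vectors):
    """-log P(o | c), where P(o | c) is the softmax over u_w . v_c."""
    scores = outside_vectors @ center_vec          # (|V|,) dot products
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    return -np.log(probs[outside_idx])

# toy example: 5-word vocab, 3-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 3))   # outside vectors (rows)
v_c = rng.normal(scale=0.1, size=3)      # center word vector
print(naive_softmax_loss(v_c, outside_idx=2, outside_vectors=U))
```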

Written part of CS224n Assignment 2

Here are my solutions to the written part, from the first half of Assignment 2. The goal of the written portion is to ensure that you understand word2vec. Here I compute partial derivatives of the naive softmax loss function as well as the negative sampling loss (which is an alternative to the naive softmax loss).

Backpropagation & Computation Graphs + Stuff You Should Know: Regularization, Vectorization, Nonlinearities, Learning Rates, Initialization, Optimizers

Backpropagation is recursively applying the chain rule along a computation graph. Be clever to minimize computation by re-using shared stuff.

What’s great is the big O() complexity is the same as Forward Propagation!

Here’s a visual of how backpropagation works in a neural network:

A node receives an upstream gradient.


This is a single node example. The goal is to pass on the correct downstream gradient. Each node has a local gradient. Downstream gradient = (upstream gradient) x (local gradient).

Downstream gradient = (upstream gradient) x (local gradient)

Multiple inputs example

Algorithm:

  1. Do Forward Propagation: compute results of operations and save intermediate values.
  2. Backprop: Initialize output gradient = 1.
  3. Visit nodes in reverse order: compute the gradient with respect to each node using the gradients with respect to its successors (see the worked example below).
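A tiny worked example of that algorithm on the graph f = (x * y) + y, done by hand in Python (my own toy example):

```python
# forward pass: compute and save intermediate values
x, y = 2.0, 3.0
a = x * y          # intermediate node
f = a + y          # output

# backward pass: start with df/df = 1 and visit nodes in reverse order
df_df = 1.0
df_da = df_df * 1.0             # "+" node: local gradient w.r.t. a is 1
df_dy_from_plus = df_df * 1.0   # "+" node: local gradient w.r.t. y is 1
df_dx = df_da * y               # "*" node: da/dx = y
df_dy_from_mult = df_da * x     # "*" node: da/dy = x
df_dy = df_dy_from_plus + df_dy_from_mult   # sum gradients from both paths

print(df_dx, df_dy)             # 3.0, 3.0  (since df/dx = y and df/dy = x + 1)
```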

Things You Should Know about Deep Learning — Regularization, Vectorization, Nonlinearities, Learning Rates, Initialization, Optimizers

Regularization

The general idea is if you have a lot of parameters in your model, those parameters can memorize what’s in the data that you trained it with. The model becomes very good at predicting the answers to the data that you trained it on (overfitting) but might not perform well on different examples in the real world.

The training loss will keep going down, approaching zero, because there are so many parameters. To the right of the grey line, the model is just memorizing whatever is in the training data, not learning something that will perform well on other data.

This problem is especially acute for deep learning because DL models typically have a vast number of parameters.

Non-linearities

The logistic (sigmoid) function is no longer the default for building deep networks. It has been found to work quite poorly for neural networks (unless you specifically want an output between 0 and 1).

Tanh is still reasonably widely used in neural networks, but the downside is that the expensive exponential calculations tend to slow things down.

Hard tanh was birthed from the desire for cheaper computation. Hard tanh works pretty well in neural networks.

ReLU (rectified linear unit), rect(z) = max(z, 0), is essentially the simplest nonlinearity you can have. It turns out that it works brilliantly! It’s by far the default when people are building feed-forward neural networks, and such networks train very quickly. So try ReLU first.

ReLU has slope 0 in the negative regime and slope 1 in the positive regime, where it acts like an identity function.
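For reference, the nonlinearities mentioned above in a few lines of numpy (with hard tanh written as the usual clipped identity):

```python
import numpy as np

def logistic(z):   return 1.0 / (1.0 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def hard_tanh(z):  return np.clip(z, -1.0, 1.0)   # -1 below -1, identity in between, 1 above
def relu(z):       return np.maximum(z, 0.0)      # slope 0 for z < 0, slope 1 for z > 0

z = np.linspace(-3, 3, 7)
print(relu(z))
```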

Parameter Initialization

When we have matrices of parameters in our model, it is very important to initialize the weights with small random values, because we are trying to avoid symmetry where every unit computes the same thing and moves the same way (which prevents learning/specialization). Xavier initialization is a common scheme: it scales the initialization based on the number of inputs and outputs so that the values don’t get too big (for tanh, for example, we want the pre-activations to stay in the roughly linear region near zero).
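A minimal sketch of Xavier (Glorot) initialization for a weight matrix, using the common Var(W) = 2 / (n_in + n_out) rule:

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Draw weights with variance 2 / (n_in + n_out), so activations
    neither blow up nor shrink to zero as they pass through layers."""
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W1 = xavier_init(300, 100)   # e.g. 300-d word-window input -> 100 hidden units
print(W1.std())              # close to sqrt(2/400) ~ 0.07
```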

Optimizers

With SGD you will have to hand tune the learning rate. But there are other sophisticated “adaptive” optimizers which can determine which parameters to move more or less depending on the sensitivities of the parameter. So where things are flat you can move fairly quickly but if things are bouncing around a lot you can move just a little so that you don’t overshoot. Adam is a fairly reliable one that many people use. See below.

Learning Rates

Start around 10^-3 (0.001) and try powers of 10. In general you want to use the fastest learning rate that doesn’t make training unstable.
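In PyTorch (which the assignments use), picking an optimizer and learning rate looks roughly like this; the model here is just a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 5)   # placeholder model: 300-d features -> 5 classes

# plain SGD: you usually have to hand-tune (and often decay) the learning rate
sgd = torch.optim.SGD(model.parameters(), lr=1e-3)

# Adam: adapts per-parameter step sizes, a reliable default for many people
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```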

Dependency Parsing

https://www.youtube.com/watch?v=nC9_RfjYwqA&feature=youtu.be

Why do we need sentence structure? Machines need to know what is connected to what in order to understand meaning.

It turns out you can use a simple neural network to make the decision about which transition to take next, which achieves greater accuracy and speed than Nivre’s MaltParser. The neural net is a very simple classifier: we extract a set of tokens based on the stack/buffer positions, take distributed representations of each word, its part-of-speech tag, and its dependency label, convert them to vector embeddings, and concatenate them!
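A rough PyTorch sketch of that classifier in the spirit of Chen & Manning’s parser: look up embeddings for selected stack/buffer words, their POS tags, and arc labels, concatenate them, and feed them through a small feed-forward network. The sizes and names below are illustrative only, not the exact architecture from the lecture:

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Predict the next parser action (e.g. shift / left-arc / right-arc)
    from embeddings of selected stack/buffer words, POS tags, and labels."""
    def __init__(self, n_words, n_tags, n_labels, dim=50, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.tag_emb = nn.Embedding(n_tags, dim)
        self.label_emb = nn.Embedding(n_labels, dim)
        # 18 word + 18 tag + 12 label features, concatenated into one vector
        self.hidden = nn.Linear((18 + 18 + 12) * dim, hidden)
        self.out = nn.Linear(hidden, 3)

    def forward(self, word_ids, tag_ids, label_ids):
        # word_ids: (batch, 18), tag_ids: (batch, 18), label_ids: (batch, 12)
        x = torch.cat([self.word_emb(word_ids).flatten(1),
                       self.tag_emb(tag_ids).flatten(1),
                       self.label_emb(label_ids).flatten(1)], dim=1)
        return self.out(torch.relu(self.hidden(x)))
```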

http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture05-dep-parsing.pdf#page=39
