Social Behavior Analytics — 201

Introduction to natural language processing with word2vec

James Chen
Nov 9, 2016

This series of posts will first focus on introducing natural language processing, with emphasis on the word2vec algorithm.

In this post, we will introduce the methodology behind the algorithm, which requires some knowledge of calculus and linear algebra. In the second post, we will move on to the actual implementation in R with text data from the Discovery and National Geographic Facebook fan pages. In the final post, we will visualize post titles from these two fan pages.

Background

As mentioned in the previous post, we will introduce the word2vec algorithm this time to uncover insights from Facebook post titles. The algorithm was developed by a team of researchers led by Tomas Mikolov at Google in 2013. The original paper can be found here. The underlying idea is that words that frequently appear together tend to have similar meanings. As a result, if we transform words into vectors based on the order and frequency of their occurrences, similar words end up closer in the vector space. What is interesting, and somewhat surprising, is that many of the resulting patterns can be represented as linear translations. For example, the result of the vector calculation vec(“King”) minus vec(“Man”) plus vec(“Woman”) is closer to vec(“Queen”) than to any other word vector!

This is undoubtedly a promising approach for interpreting the vast amount of text data created every day on social platforms.

Methodology

Xin Rong at the University of Michigan wrote a nice piece explaining in detail the mechanics behind word2vec here. It is recommended to study it and review the important concepts before proceeding further. We will still cover the most important terms below (most of the content and figures are also covered in his paper, in greater detail and with intuitive explanations).

Neural Network

Before we start, let us look at neural networks. Extensive research has been done on how to mimic neural networks artificially. The simplest artificial neuron behaves as follows: it fires (outputs “1”) once its activation threshold is reached, and otherwise it does not fire (outputs “0”), as explained in the example below.

A bird will try to eat brown and round objects, such as seeds, but not a key. We can represent this behavior as:

Screenshot of the neuron activity example

Let us model this behavior with a neuron. If we set the activation threshold to 1.5, the neuron fires only when both the brown AND round properties are satisfied, because only then does the sum of the inputs, 2 (1+1), exceed the activation threshold of 1.5.

The figure below shows an artificial neuron, where {x1, · · · , xK} are input values; {w1, · · · , wK} are weights; y is a scalar output; and f is the link function (also called activation/decision/transfer function).

Screenshot of an artificial neuron

Basically the neuron works in this way:

y = f(u), which returns y = 1 if u > 0 and y = 0 otherwise

where u is the scalar net input, u = x1*w1 + x2*w2 + … + xK*wK.
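
To make this concrete, here is a minimal R sketch of such a threshold neuron, applied to the bird example above; the function name and the 0/1 encoding of the brown and round properties are my own choices:

# A threshold neuron: fires (returns 1) when the weighted sum of its
# inputs exceeds the activation threshold, and returns 0 otherwise.
neuron <- function(x, w, threshold) {
  u <- sum(x * w)                  # weighted sum of the inputs
  if (u > threshold) 1 else 0
}

w <- c(brown = 1, round = 1)       # both properties weighted equally

neuron(c(1, 1), w, threshold = 1.5)  # a seed: brown and round -> fires (1)
neuron(c(0, 0), w, threshold = 1.5)  # a key: neither brown nor round -> 0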

A neuron with the step link function above is also called a perceptron, and it comes with an update equation:

w(new) = w(old) - n*(y-t)*x

where:
t is the target output, or label
n is the learning rate (always greater than 0)
y is the predicted output
the term (y-t) is called the error

The purpose of the update equation is to adjust the weights w by comparing the predicted output y with the label, or target output, t. If we run the update equation a number of times, the weights stabilize. This repeated process is called iteration. As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, or collection of texts (for example, the text of Wikipedia), the effects on the vectors accumulate.
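
As a rough illustration, here is a minimal R sketch of the perceptron update rule applied iteratively to a small made-up data set (an AND-like rule, with a bias input so the threshold can be learned as a weight); the data and learning rate are illustrative:

# Perceptron training loop using the update rule w(new) = w(old) - n*(y-t)*x.
perceptron_step <- function(x, t, w, n = 0.1) {
  u <- sum(w * x)                 # net input
  y <- if (u > 0) 1 else 0        # predicted output
  w - n * (y - t) * x             # adjust weights by the error (y - t)
}

# Illustrative data: an AND-like rule; the third input is a constant bias.
X <- rbind(c(0, 0, 1), c(0, 1, 1), c(1, 0, 1), c(1, 1, 1))
t <- c(0, 0, 0, 1)

w <- c(0, 0, 0)
for (iter in 1:25) {              # iterate until the weights stabilize
  for (i in 1:nrow(X)) {
    w <- perceptron_step(X[i, ], t[i], w)
  }
}
w                                 # stabilized weights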

Of course, a more complicated neuron will use a smoother logistic link function (as opposed to the step function above), defined as

σ(u) = 1/(1 + e^(−u))

The function has two properties that can be used in subsequent derivations:

σ(−u) = 1 − σ(u)
dσ(u)/du = σ(u)*σ(−u)

Stochastic gradient descent is used as the learning algorithm of this model, in order to minimize the error function E:

E = (1/2)*(t − y)^2

We know from calculus that to find a minimum of E, we need its derivative with respect to each weight wi, which we obtain with the chain rule:

∂E/∂wi = (∂E/∂y)*(∂y/∂u)*(∂u/∂wi)

where:
∂E/∂y = (y-t)
∂y/∂u = y(1-y) because y = σ(u) = 1/(1 + e^(−u))
∂u/∂wi = xi, because u = x1*w1 + … + xK*wK and only the term wi*xi depends on wi

Combining the above, the gradient of the error with respect to wi is (y-t)*y*(1-y)*xi.

And we can rewrite the update equation as:

w(new) = w(old) - n*[(y-t)*y*(1-y)]*x
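
Below is a minimal R sketch of this update rule for a single logistic neuron, trained on a small made-up data set (the same AND-like rule with a bias input); the data, learning rate, and number of iterations are arbitrary choices for illustration:

# Single logistic neuron trained by stochastic gradient descent,
# using the update rule derived above.
sigmoid <- function(u) 1 / (1 + exp(-u))

sgd_step <- function(x, t, w, n = 0.5) {
  y <- sigmoid(sum(w * x))              # predicted output
  grad <- (y - t) * y * (1 - y) * x     # gradient of E = (1/2)*(t - y)^2
  w - n * grad                          # w(new) = w(old) - n*(y-t)*y*(1-y)*x
}

# Illustrative data: an AND-like rule, with a bias input of 1 appended.
X <- rbind(c(0, 0, 1), c(0, 1, 1), c(1, 0, 1), c(1, 1, 1))
t <- c(0, 0, 0, 1)

set.seed(1)
w <- rnorm(3, sd = 0.1)
for (iter in 1:10000) {
  i <- sample(nrow(X), 1)               # stochastic: one example at a time
  w <- sgd_step(X[i, ], t[i], w)
}
round(sigmoid(X %*% w), 2)              # predictions move toward (0, 0, 0, 1)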

Back-propagation

In reality, we might have a more complicated multi-layer neural network.

Screenshot of a multi-layer neural network with one hidden layer

The above neural network has the following characteristics:

Input layer {xk} = {x1,…,xK}
Hidden layer {hi} = {h1,…,hN}
Output layer {yj} = {y1,…,yM}

The hidden layer has an output (input-hidden) defined as:

hi = σ(ui) = σ( Σ wki*xk, k=1 to K)

The output layer has an output (hidden-output) defined as:

yj = σ(u’j) = σ( Σ w’ij*hi, i=1 to N)

where u’ is simply separate notation for the net input of the output layer; the prime does not denote a transpose of u

The error function is given as:

E(x,t,W,W’) = (1/2)*(Σ(yj-tj)², j=1 to M)

where:
W = {wki}, a K by N weight matrix (input-hidden)
W’ = {w’ij}, an N by M weight matrix (hidden-output)
t = {t1,…,tM}, an M-dimensional vector, which represents the target output

In order to define the update equations for W and W’, we need the derivatives of the error function E with respect to the weights. To keep the derivation straightforward, we start with the right-most layer (the output layer in this case) and then move left. For each layer, we first compute the derivative of E with respect to the layer’s output, then with respect to its net input, and finally with respect to its weights. Repeating the same three-step process for the previous layer gives its update equation, which is essentially the idea of back-propagation.
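
As a rough sketch of this three-step process, here is one back-propagation update in R for a network with one hidden layer and sigmoid units in both layers; the dimensions, initial weights, and learning rate are illustrative assumptions, not anything specified in the post:

sigmoid <- function(u) 1 / (1 + exp(-u))

# One back-propagation update for a network with one hidden layer,
# using the squared-error function E = (1/2)*sum((y - t)^2).
backprop_step <- function(x, target, W, W_prime, n = 0.1) {
  # Forward pass
  h <- sigmoid(t(W) %*% x)                # hidden layer output (N x 1)
  y <- sigmoid(t(W_prime) %*% h)          # output layer output (M x 1)

  # Backward pass: output layer first, then the hidden layer
  delta_out <- (y - target) * y * (1 - y)             # dE/d(net input), output layer
  delta_hid <- (W_prime %*% delta_out) * h * (1 - h)  # propagated back to the hidden layer

  # Gradient descent updates for both weight matrices
  list(W       = W       - n * (x %*% t(delta_hid)),   # K x N
       W_prime = W_prime - n * (h %*% t(delta_out)))   # N x M
}

# Illustrative sizes: K = 4 inputs, N = 3 hidden units, M = 2 outputs.
set.seed(1)
K <- 4; N <- 3; M <- 2
W <- matrix(rnorm(K * N, sd = 0.1), K, N)
W_prime <- matrix(rnorm(N * M, sd = 0.1), N, M)
x <- matrix(c(1, 0, 1, 0), K, 1)
target <- matrix(c(1, 0), M, 1)
updated <- backprop_step(x, target, W, W_prime)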

The complete derivation can be found on pages 18–20 of the paper by Rong. Next we will introduce the two models from the word2vec algorithm.

Continuous Bag-of-Word Model (CBOW)

The CBOW model is equivalent to asking “given this set of context words, what missing word is likely to also appear at the same time?”.

We will start from the simplest version of the CBOW model, as shown below.

Screenshot of a simple CBOW model with only one word in the context

The above model has the following:

V is the vocabulary size
N is the size of the hidden layer
W is the weight between the input layer and the hidden layer

W is also a V by N matrix, where each row of W is the N-dimensional vector representation vw of the corresponding word in the input layer.

In a simple setup where the input is a one-hot encoded vector (only 1 out of V units is 1, rest are 0), we can assume xk = 1 and xk’ = 0 for k’ != k to obtain:

h = w1*x1 + w2*x2 + … + wV*xV = 0 + 0 + … + wk*1 + … + 0 = wk = vw

(note that the hidden layer here is linear: no activation function is applied)

Basically, this copies the kth row of W, or vw (the vector representation of the input word), to h.

From the hidden layer to the output layer, we can compute a score uj for each word in the vocabulary:

uj = v’wj · h (a dot product of the two vectors)

where:
v’wj is the j-th column of the matrix W’

In subsequent analysis, vw is defined as the input vector and v’w as the output vector of the word w.
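
A quick way to see the row copy and the scores above in action, with toy dimensions and random weights (purely illustrative):

# With a one-hot input, t(W) %*% x simply selects the k-th row of W.
set.seed(42)
V <- 5; N <- 3                         # toy vocabulary and hidden-layer sizes
W <- matrix(rnorm(V * N), V, N)        # input-hidden weights (V x N)
W_prime <- matrix(rnorm(N * V), N, V)  # hidden-output weights (N x V)

x <- c(0, 0, 1, 0, 0)                  # one-hot vector for the 3rd word
h <- as.vector(t(W) %*% x)             # hidden layer output
all.equal(h, W[3, ])                   # TRUE: h is the input vector vw of word 3

u <- as.vector(t(W_prime) %*% h)       # score uj = v'wj . h for every word j
u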

In a multi-word setup, when computing the hidden layer output, instead of directly copying the input vector of a single context word, the CBOW model averages the one-hot vectors of the input context words and multiplies the average by the input-hidden weight matrix, which amounts to averaging the input vectors of the context words:

Screenshot of the CBOW model in multi-word setting

h = (1/C)*(vw1 + vw2 +…+ vwC)
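
Continuing with the same toy sizes, here is a sketch of the multi-word hidden layer and the resulting output probabilities; the context word indices are made up, and the softmax at the end is the standard output layer of the CBOW model:

# CBOW with C context words: average the context words' input vectors,
# then score every vocabulary word and apply a softmax.
set.seed(42)
V <- 5; N <- 3
W <- matrix(rnorm(V * N), V, N)        # input-hidden weights (V x N)
W_prime <- matrix(rnorm(N * V), N, V)  # hidden-output weights (N x V)

context_ids <- c(1, 3, 4)              # indices of the C = 3 context words
h <- colMeans(W[context_ids, ])        # h = (1/C)*(vw1 + ... + vwC)

u <- as.vector(t(W_prime) %*% h)       # score uj for every word j
p <- exp(u) / sum(exp(u))              # softmax: probability of each word
which.max(p)                           # most likely "missing" word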

Skip-Gram Model

The skip-gram model is the opposite of the CBOW model. The target word is now at the input layer, and the context words are at the output layer.

The model is essentially asking “given this single word, what are the other words that are likely to appear near it at the same time?”.

Screenshot of the skip-gram model
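
To give a feel for what the skip-gram model trains on, here is a small R sketch that generates (target, context) pairs from a sentence with a given window size; the sentence, window size, and function name are made up for illustration:

# Generate (target, context) training pairs for the skip-gram model:
# every word is paired with the words within `window` positions of it.
skipgram_pairs <- function(tokens, window = 2) {
  pairs <- list()
  for (i in seq_along(tokens)) {
    lo <- max(1, i - window)
    hi <- min(length(tokens), i + window)
    for (j in lo:hi) {
      if (j != i) {
        pairs[[length(pairs) + 1]] <- c(target = tokens[i], context = tokens[j])
      }
    }
  }
  do.call(rbind, pairs)
}

skipgram_pairs(c("lions", "hunt", "in", "the", "savanna"), window = 2)

Each pair becomes one training instance: the target word is the input, and the context word is the expected output.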

Next, we will briefly introduce two methods that improve computational efficiency when the amount of input text is large.

Hierarchical Softmax

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. The V words are the leaf units of the tree, and the tree has V-1 inner nodes. As explained by Rong, for each leaf unit there exists a unique path from the root to that unit, and this path is used to estimate the probability of the word the leaf represents.

Screenshot of the binary tree in the hierarchical softmax model

This method improves the efficiency of the update equations for both the CBOW and skip-gram models. Instead of computing an output probability for every one of the V words, we only evaluate the inner nodes along the path from the root to the target leaf, so the computational complexity per training instance per context word drops from V to log(V). The number of parameters stays roughly the same, since V-1 inner-node vectors replace the original V output vectors.
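
As a minimal sketch of the path-probability idea: the probability of a word is a product of sigmoid decisions taken at the inner nodes on its path from the root. The tree itself is not built here; the node vectors and the left/right signs are made-up inputs for illustration:

# Hierarchical softmax: p(word) is a product of sigmoid "left or right"
# decisions along the path from the root to the word's leaf.
sigmoid <- function(u) 1 / (1 + exp(-u))

hs_probability <- function(h, path_vectors, signs) {
  # path_vectors: one row per inner node on the path to the word's leaf
  # signs: +1 when the path turns one way at that node, -1 when it turns the other
  prod(sigmoid(signs * as.vector(path_vectors %*% h)))
}

set.seed(7)
N <- 3
h <- rnorm(N)                               # hidden layer output
path_vectors <- matrix(rnorm(3 * N), 3, N)  # 3 inner nodes on the path
hs_probability(h, path_vectors, signs = c(1, -1, 1))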

Negative Sampling

As explained clearly by Rong, the idea of negative sampling is a more straightforward approach than hierarchical softmax. In order to deal with the difficulty of having too many output vectors that need to be updated per iteration, we only update a sample of them instead.

Clearly, the output word (that is, the ground truth, or positive sample) should be kept in our sample and get updated, and we also need to sample a few other words as negative samples (hence “negative sampling”).
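
Here is a minimal sketch of which output vectors get touched in one training step under negative sampling; for simplicity the negative words are drawn uniformly here, whereas the released word2vec code draws them from a smoothed unigram distribution:

# Negative sampling: instead of updating all V output vectors, update only
# the positive (true) output word plus a handful of sampled negative words.
set.seed(7)
V <- 10000                      # vocabulary size
positive_id <- 42               # index of the true output word
k <- 5                          # number of negative samples

negative_ids <- sample(setdiff(1:V, positive_id), k)
to_update <- c(positive_id, negative_ids)
to_update                       # only these k + 1 output vectors are updated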

Hooray! This covers most of the basic terminology and methodology used in the word2vec algorithm. The next post will come shortly with a step-by-step guide on how to implement the algorithm in R.

Questions, comments, or concerns?
jchen6912@gmail.com


James Chen

Engineer by training. Analytics by passion. R and Python addict who hacks and decodes data for marketers.