Notes for deep learning on NLP

Frank Chung
DeepQ Research Engineering Blog
5 min read · Dec 30, 2016

Deep learning is gradually playing a major role in NLP (Natural Language Processing). Here I note some of the technical evolution behind NLP problems.

N-gram Model

A continuous text sequence “to be or not to be” can be modelled by:

1-gram (unigram): to, be, or, not, to, be

2-gram (bigram): to be, be or, or not, not to, to be

3-gram (trigram): to be or, be or not, or not to, not to be

An n-gram model can solve the problem of next-word prediction. For example, 6-gram occurrence counts give the probability that the next word is “be” when the previous words are “to be or not to”:

P(be|to be or not to) = C(to be or not to be) / C(to be or not to)
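A minimal sketch of this count-based estimate in Python; the one-sentence corpus is only for demonstration:

```python
from collections import Counter

tokens = "to be or not to be".split()

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

c6 = ngram_counts(tokens, 6)
c5 = ngram_counts(tokens, 5)

context = ("to", "be", "or", "not", "to")
# P(be | to be or not to) = C(to be or not to be) / C(to be or not to)
print(c6[context + ("be",)] / c5[context])  # 1.0 in this tiny corpus
```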

Term Frequency–Inverse Document Frequency (TF-IDF)

TF-IDF indicates the importance of a word to a document within a collection of documents.

The term frequency of a word is the number of its occurrences divided by the total number of words in the document:

TF(“cow” in document) = C(“cow” in document)/C(all words in document)

The inverse document frequency of a word is the logarithm of the total number of documents divided by the number of documents that contain the word:

IDF(“cow”) = log(C(all documents)/C(documents containing “cow”))

For example, if “cow” occurs 4 times in document 1, and document 1 contains 100 words, the term frequency of “cow” in document 1 is 0.04. If “cow” appears in 100 out of 10,000 documents, its inverse document frequency is log(10000/100) = 2 (base-10 log). The TF-IDF is therefore 0.04 * 2 = 0.08.
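A minimal sketch of this arithmetic in Python, using the numbers from the example above:

```python
import math

def tf(term_count, total_words):
    """Term frequency: occurrences of the word over all words in the document."""
    return term_count / total_words

def idf(total_docs, docs_with_term):
    """Inverse document frequency, base-10 log as in the example above."""
    return math.log10(total_docs / docs_with_term)

tf_cow = tf(4, 100)          # 0.04
idf_cow = idf(10000, 100)    # log10(100) = 2.0
print(tf_cow * idf_cow)      # 0.08
```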

Latent Semantic Analysis (LSA)

LSA applies TF-IDF to calculate the relations between words and documents.

m words and n documents

Let X be an m×n matrix whose element (i, j) is the TF-IDF of word i in document j. Then the correlations between words are given by the matrix product XXᵀ, and the correlations between documents by the matrix product XᵀX.
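A small sketch of these products with NumPy; the matrix X here is random, standing in for a real TF-IDF matrix of m words by n documents:

```python
import numpy as np

m, n = 5, 3                   # toy sizes: 5 words, 3 documents
X = np.random.rand(m, n)      # element (i, j) would be the TF-IDF of word i in document j

word_correlations = X @ X.T   # m x m: relations between pairs of words
doc_correlations = X.T @ X    # n x n: relations between pairs of documents
print(word_correlations.shape, doc_correlations.shape)  # (5, 5) (3, 3)
```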

Word Vector

With neural networks, the idea is to train a shared matrix C that projects each word into a feature vector, and to feed that vector into a neural network trained on the main task.

Suppose the dimension of the feature space is M and the vocabulary size is |V|: the matrix C is |V|*M, and the mapping result C(Wt) is a 1*M vector.

The input layer of the network contains the N-1 previous words of an N-gram model, each encoded in a 1-of-|V| representation and projected through C. The output layer consists of |V| probabilities, one for each word in the vocabulary.
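Since the input is a 1-of-|V| (one-hot) vector, the projection is just a row lookup in C. A sketch with made-up sizes:

```python
import numpy as np

V, M = 10, 4                 # toy vocabulary size and feature dimension
C = np.random.rand(V, M)     # shared |V| x M projection matrix

w_t = 7                      # index of word w_t in the vocabulary
one_hot = np.zeros(V)
one_hot[w_t] = 1.0

word_vector = one_hot @ C    # a 1 x M word vector, identical to row C[w_t]
assert np.allclose(word_vector, C[w_t])
```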

Word Embedding

Word2vec is a two-layer neural network that extracts the feature vector of a word, called its “word embedding”. Instead of training on next-word prediction, word2vec uses either the CBOW or the Skip-gram architecture to train the projection matrix.

Continuous bag of words (CBOW) trains the projection by feeding the surrounding words into the input layer to predict the central word, while Skip-gram trains it by using the central word to predict its surrounding words. Either approach yields a projection matrix from which word embeddings can be extracted.
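A sketch of how the two objectives pair inputs and targets, using a made-up sentence and a context window of one word on each side:

```python
tokens = "the quick brown fox".split()
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))       # surrounding words -> central word
    for ctx in context:
        skipgram_pairs.append((center, ctx))   # central word -> each surrounding word

print(cbow_pairs[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs[:2])  # [('the', 'quick'), ('quick', 'the')]
```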

Word Similarity

Words related to “Sweden”

To find the words related to a target word, we can compute the cosine similarity (distance) between the target word’s vector and those of all other words in the vocabulary.
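A minimal sketch of the similarity lookup; the embeddings below are random placeholders rather than trained word2vec vectors:

```python
import numpy as np

vocab = ["sweden", "norway", "denmark", "banana"]
embeddings = {w: np.random.rand(50) for w in vocab}   # stand-in 50-d word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "sweden"
scores = {w: cosine(embeddings[target], embeddings[w]) for w in vocab if w != target}
print(sorted(scores, key=scores.get, reverse=True))   # most similar words first
```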

Vector Offset

Given China->Beijing, find Japan->?

Given the word vectors of words a, b, and c, we can find d such that a : b = c : d by computing V𝐝 = V𝐜 + (V𝐛 - V𝐚), then selecting the vocabulary word whose vector has the smallest cosine distance to V𝐝.
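A sketch of the offset computation; with trained embeddings, “china : beijing = japan : ?” should return “tokyo”, but the vectors here are random placeholders:

```python
import numpy as np

vocab = ["china", "beijing", "japan", "tokyo", "paris"]
embeddings = {w: np.random.rand(50) for w in vocab}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    v_d = embeddings[c] + (embeddings[b] - embeddings[a])    # Vd = Vc + (Vb - Va)
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], v_d))

print(analogy("china", "beijing", "japan"))   # "tokyo" with real trained vectors
```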

Word relationships found by vector offset

Recursive Neural Network

Since both the CBOW and Skip-gram models take the N adjacent words into consideration when training the network, they do not scale well when N becomes very large. To overcome this drawback, the recursive neural network (RNN) was proposed:

Traditional neural network

Consider the traditional neural network above for an NLP task: each word is fed through the projection matrix W to obtain a word vector, and the vectors are converted to the output S by the hidden layer R. In this case, the input size of R is fixed at 5*M (assuming the word vector has M features), and R cannot be reused if N changes to 6.

Recursive neural network

In a recursive neural network, instead of training a hidden layer R over all N words, we train a hidden layer A that combines only two vectors at a time and recursively merges pairs of words. In this case, the input size of A is always 2*M.
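A sketch of the recursive merge with a single shared layer A; for simplicity the words are merged left to right here, whereas a real recursive network usually follows a parse tree:

```python
import numpy as np

M = 4
A = np.random.rand(M, 2 * M)              # shared merge weights: 2*M inputs, M outputs

def merge(left, right):
    """Combine two M-dimensional vectors into one M-dimensional phrase vector."""
    return np.tanh(A @ np.concatenate([left, right]))

word_vectors = [np.random.rand(M) for _ in range(6)]   # six word vectors
phrase = word_vectors[0]
for w in word_vectors[1:]:
    phrase = merge(phrase, w)             # the input size to A is always 2*M
print(phrase.shape)                       # (4,)
```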

Recurrent Neural Network

Since a recursive neural network is not easy to implement, the recurrent neural network (RNN) was proposed to handle the same problem.

A recurrent network takes only one word as input at a time and feeds the output of the hidden layer back as part of the next input. The whole input size is |V| + M, where M is the number of hidden features carried over from the previous words and |V| is the vocabulary size.

The above is an example of next-character prediction: “h” is fed into the hidden layer to obtain its vector, and this vector is fed back as part of the input together with “e”. After the whole sequence has been processed, the parameters of the hidden layer (the projection matrix) are learned.
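A minimal sketch of one recurrent step for next-character prediction in the spirit of the “h” → “e” example; the weights are random placeholders and no training is shown:

```python
import numpy as np

vocab = list("helo")
V, M = len(vocab), 8                      # 1-of-|V| input plus M recurrent features

W_xh = np.random.randn(M, V) * 0.1        # input-to-hidden weights
W_hh = np.random.randn(M, M) * 0.1        # hidden-to-hidden (recurrent) weights
W_hy = np.random.randn(V, M) * 0.1        # hidden-to-output weights

def step(char, h_prev):
    x = np.zeros(V)
    x[vocab.index(char)] = 1.0                    # 1-of-|V| encoding of the input character
    h = np.tanh(W_xh @ x + W_hh @ h_prev)         # new hidden state, fed back next step
    y = W_hy @ h
    p = np.exp(y) / np.exp(y).sum()               # probabilities over the next character
    return h, p

h = np.zeros(M)
for ch in "hell":                                 # feed "h", "e", "l", "l" in turn
    h, p = step(ch, h)
print(vocab[int(np.argmax(p))])                   # arbitrary here, since nothing is trained
```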

Case Studies

Table 1: Syntactic regularity for word vectors trained by different methods (percent correct)
Table 2: Semantic regularity for word vectors trained by different methods (percent correct)

Tables 1 and 2 indicate two findings: 1. Word vectors from the RNN are more accurate than those from LSA in both the syntactic and semantic experiments. 2. With the RNN, accuracy grows as the dimension of the word vector increases.

Table 3: Microsoft sentence completion challenge

Table 3 shows that test accuracy improves when a skip-gram network is used to pre-train the projection matrix of an existing RNN.

Table 4: Comparison of CBOW and Skip-gram

Table 4 shows that Skip-gram training performs better than CBOW.

Benchmarks

SemEval: semantic relations among words.

Microsoft sentence completion challenge: fill in the missing word in a sentence.
