From word2vec to doc2vec: an approach driven by Chinese restaurant process

Yingjie Miao
Published in Keep It Up
Mar 17, 2014


Google’s word2vec project has generated a lot of interest in the text mining community. It’s a neural network language model that is “both supervised and unsupervised”. Unsupervised in the sense that you only have to provide a big corpus, say the English wiki. Supervised in the sense that the model cleverly generates supervised learning tasks from the corpus. How? Two approaches, known as Continuous Bag of Words (CBOW) and Skip-Gram (see Figure 1 in this paper). CBOW forces the neural net to predict the current word from its surrounding words, and Skip-Gram forces the neural net to predict the surrounding words of the current word. Training is essentially classic back-propagation with a few optimization and approximation tricks (e.g. hierarchical softmax).
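For instance, with the gensim library such a model can be trained in a few lines. The sketch below is illustrative only; parameter names differ across gensim versions, and corpus.txt is a hypothetical file with one tokenized sentence per line.

# Minimal sketch: train a word2vec model with gensim (not the exact setup used in this post).
from gensim.models.word2vec import Word2Vec, LineSentence

sentences = LineSentence("corpus.txt")  # hypothetical tokenized corpus, one sentence per line
# sg=1 selects Skip-Gram (sg=0 would be CBOW); hs=1 enables hierarchical softmax.
model = Word2Vec(sentences, size=100, window=5, min_count=5, sg=1, hs=1)
print(model["chicken"])  # the 100-dimensional vector for "chicken"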

Word vectors generated by the neural net have nice semantic and syntactic behaviors. Semantically, “iOS” is close to “Android”. Syntactically, “boys” minus “boy” is close to “girls” minus “girl”. One can check out more examples here.

Although this provides high quality word vectors, there is still no clear way to combine them into a high quality document vector. In this article, we discuss one possible heuristic, inspired by a stochastic process called the Chinese Restaurant Process (CRP). The basic idea is to use a CRP to drive a clustering process and then sum the word vectors in the right cluster.

Imagine we have a document about a chicken recipe. It contains words like “chicken”, “pepper”, “salt”, “cheese”. It also contains words like “use”, “buy”, “definitely”, “my”, “the”. The word2vec model gives us a vector for each word. One could naively sum up every word vector as the doc vector, but this clearly introduces lots of noise. A better heuristic is to use a weighted sum, based on other information like idf or Part of Speech (POS) tags.
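For example, an idf-weighted sum might look like the sketch below (illustrative only; model is a trained word2vec model, and idf is a hypothetical dict mapping words to idf weights):

import numpy as np

def doc_vector(words, model, idf, dim=100):
    # Naive doc vector: an idf-weighted sum of word vectors.
    vec = np.zeros(dim)
    for w in words:
        try:
            vec += idf.get(w, 1.0) * model[w]
        except KeyError:
            continue  # skip words unknown to the model
    return vec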

The question is: could we be more selective when adding terms? If this is a chicken recipe document, I shouldn’t even consider words like “definitely”, “use”, “my” in the summation. One can argue that idf-based weights significantly reduce the noise of boring words like “the” and “is”. However, for words like “definitely” and “overwhelming”, the idfs are not necessarily as small as you would hope.

It’s natural to think that if we first group words into clusters, words like “chicken” and “pepper” may end up in one cluster, while the “junk” words fall into other clusters. If we can identify the “relevant” clusters and only sum up word vectors from those clusters, we should get a good doc vector.

This boils down to clustering the words in the document. One can of course use off-the-shelf algorithms like K-means, but most of these algorithms require a distance metric. Word2vec behaves nicely under cosine similarity, but this doesn’t necessarily mean it behaves as well under Euclidean distance (even after projection to the unit sphere, it’s perhaps best to use geodesic distance).

It would be nice if we could work directly with cosine similarity. We have done a quick experiment on clustering words driven by a CRP-like stochastic process, and it has worked surprisingly well so far.

Now let’s explain the CRP. Imagine you go to a (Chinese) restaurant. There are already n tables with different numbers of people. There is also an empty table. The CRP has a hyperparameter r > 0, which can be regarded as the “imagined” number of people at the empty table. You go to one of the (n+1) tables with probability proportional to the number of people already at that table (for the empty table, that number is r). If you go to one of the n existing tables, you are done. If you decide to sit down at the empty table, the restaurant will automatically create a new empty table. In that case, the next customer who comes in will choose from (n+2) tables (including the new empty table).
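In code, seating one customer in the standard CRP looks roughly like this (a minimal sketch):

import random

def crp_seat(table_sizes, r):
    # table_sizes: number of people at each occupied table.
    # r: "imagined" number of people at the empty table.
    # Returns the index of the chosen table; len(table_sizes) means the empty table.
    weights = table_sizes + [r]
    u = random.uniform(0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(table_sizes)

tables = []
for _ in range(20):                  # seat 20 customers
    choice = crp_seat(tables, r=1.0)
    if choice == len(tables):
        tables.append(1)             # the customer opens a new table
    else:
        tables[choice] += 1
print(tables)                        # a random partition of the 20 customers, e.g. [9, 5, 4, 1, 1]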

Inspired by the CRP, we tried the following variations to include the similarity factor. The common setup is this: we are given M vectors to be clustered. We maintain two things: the cluster sum (not the centroid!) and the vectors in each cluster. We iterate through the vectors. For the current vector V, suppose we have n clusters already. We find the cluster C whose cluster sum is most similar to the current vector, and call this score sim(V, C).

Variant 1: V creates a new cluster with probability 1/(1 + n). Otherwise V goes to cluster C.

Variant 2: If sim(V, C) > 1/(1 + n), V goes to cluster C. Otherwise, with probability 1/(1 + n) it creates a new cluster, and with probability n/(1 + n) it goes to C.

In either variant, if V goes to an existing cluster, we update that cluster’s sum and membership.

There is one distinct difference from the traditional CRP: if we don’t go to the empty table, we deterministically go to the “most similar” table.

In practice, we find that these variants produce similar results. One difference is that variant 1 tends to have more, smaller clusters, while variant 2 tends to have fewer but larger clusters. The examples below are from variant 2.

For example, for a chicken recipe document, the clusters look like this:

  • ‘cayenne’, ‘taste’, ‘rating’, ‘blue’, ‘cheese’, ‘raved’, ‘recipe’, ‘powdered’, ‘recipe’, ‘dressing’, ‘blue’, ‘spicier’, ‘spoon’, ‘cup’, ‘cheese’, ‘cheese’, ‘blue’, ‘blue’, ‘dip’, ‘bake’, ‘cheese’, ‘dip’, ‘cup’, ‘blue’, ‘adding’, ‘mix’, ‘crumbled’, ‘pepper’, ‘oven’, ‘temper’, ‘cream’, ‘bleu’, ……
  • ‘the’, ‘a’, ‘that’, ‘in’, ‘a’, ‘use’, ‘this’, ‘if’, ‘scant’, ‘print’, ‘was’, ‘leftovers’, ‘bring’, ‘a’, ‘next’, ‘leftovers’, ‘with’, ‘people’, ‘the’, ‘made’, ‘to’, ‘the’, ‘by’, ‘because’, ‘before’, ‘the’, ‘has’, ‘as’, ‘amount’, ‘is’, ……
  • ‘stars’, ‘big’, ‘super’, ‘page’, ‘oct’, ‘see’, ‘jack’, ‘photos’, ‘extras’, ‘see’, ‘video’, ‘one’, ‘page’, ‘f’, ‘jun’, ‘stars’, ‘night’, ‘jul’, ……

Apparently, the first cluster is the most relevant. Now let’s take the cluster sum vector (the sum of all vectors from this cluster) and test whether it really preserves the semantics. Below is a snippet from a Python console. We trained word vectors using the C implementation on a fraction of the English Wiki and read the model file using the Python library gensim.models.word2vec. c[0] below denotes cluster 0.

>>> similar(c[0], model["chicken"])
0.95703287846549179
>>> similar(c[0], model["recipe"] + model["chicken"])
0.95602993446153006
>>> similar(c[0], model["recipe"] + model["fish"])
0.7678791380788017
>>> similar(c[0], model["computer"])
0.0069432409372725294
>>> similar(c[0], model["scala"])
0.061027248018988116
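The similar function is not defined in the snippet; presumably it is just cosine similarity, e.g.:

import numpy as np

def similar(a, b):
    # Cosine similarity between two vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))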

Looks like the semantics are preserved well. It’s convincing that we can use this as the doc vector.

The recipe document seems easy. Now let’s try something more challenging, like a news article. News articles tend to tell stories and thus have less concentrated “topic words”. We tried the clustering on this article, titled “Signals on Radar Puzzle Officials in Hunt for Malaysian Jet”. We got 4 clusters:

  • ‘have’, ‘when’, ‘time’, ‘at’, ‘when’, ‘part’, ‘from’, ‘from’, ‘in’, ‘show’, ‘may’, ‘or’, ‘now’, ‘on’, ‘in’, ‘back’, ‘be’, ‘turned’, ‘for’, ‘on’, ‘location’, ‘mainly’, ‘which’, ‘to’, ‘also’, ‘from’, ‘including’, ‘as’, ‘to’, ‘had’, ‘was’ ……
  • ‘radar’, ‘northwest’, ‘radar’, ‘sends’, ‘signal’, ‘signals’, ‘aircraft’, ‘data’, ‘plane’, ‘search’, ‘radar’, ‘saturated’, ‘handles’, ‘search’, ‘controlled’, ‘detection’, ‘data’, ‘nautical’, ‘patrol’, ‘detection’, ‘detected’, ‘floating’, ‘blips’, ‘plane’, ‘objects’, ‘jets’, ‘kinds’, ‘signals’, ‘air’, ‘plane’, ‘aircraft’, ‘radar’, ‘passengers’, ‘signal’, ‘plane’, ‘unidentified’, ‘aviation’, ‘pilots’, ‘ships’, ‘signals’, ‘satellite’, ‘radar’, ‘blip’, ‘signals’, ‘radar’, ‘signals’ ……
  • ‘of’, ‘the’, ‘of’, ‘of’, ‘of’, ‘the’, ‘a’, ‘the’, ‘senior’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘of’, ‘the’, ‘of’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘its’, ……
  • ‘we’, ‘authorities’, ‘prompted’, ‘reason’, ‘local’, ‘local’, ‘increasing’, ‘military’, ‘inaccurate’, ‘military’, ‘identifying’, ‘force’, ‘mistaken’, ‘expanded’, ‘significance’, ‘military’, ‘vastly’, ‘significance’, ‘force’, ‘surfaced’, ‘military’, ‘quoted’, ‘showed’, ‘military’, ‘fueled’, ‘repeatedly’, ‘acknowledged’, ‘declined’, ‘authorities’, ‘emerged’, ‘heavily’, ‘statements’, ‘announced’, ‘authorities’, ‘chief’, ‘stopped’, ‘expanding’, ‘failing’, ‘expanded’, ‘progress’, ‘recent’, ……

Again, this looks decent. Note that this is a simple one-pass clustering process and we don’t have to specify the number of clusters! That could be very helpful for latency-sensitive services.

There is still a missing step: how do we find the relevant cluster(s)? We haven’t yet done extensive experiments on this part. A few heuristics to consider:

  • idf weights
  • POS tags. We don’t have to tag every single word in the document. Empirically, word2vec vectors tend to group syntactically as well, so we only have to sample a few words from each cluster and tag those.
  • compare cluster sum vectors with the title vector (although this depends on the quality of the title); see the sketch after this list
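As a hypothetical sketch of the third heuristic, one could rank clusters by the cosine similarity between each cluster sum and the title vector, reusing the similar and doc_vector helpers sketched earlier (cluster_sums and title_words are assumed inputs):

def rank_clusters_by_title(cluster_sums, title_vec):
    # Rank cluster indices by similarity of their sum vectors to the title vector.
    return sorted(range(len(cluster_sums)),
                  key=lambda i: similar(cluster_sums[i], title_vec),
                  reverse=True)

title_vec = doc_vector(title_words, model, idf={})  # unweighted sum of title word vectors
relevant = rank_clusters_by_title(cluster_sums, title_vec)[0]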

There are other problems to think about: 1) How do we merge clusters? Based on similarity among cluster sum vectors, or by averaging similarities between cluster members? 2) What is the minimal set of words that can reconstruct the cluster sum vector (in the sense of cosine similarity)? This could be used as a semantic keyword extraction method.

Conclusion: Google’s word2vec provides powerful word vectors. We are interested in using these vectors to generate high quality document vectors in an efficient way. We tried a strategy based on a variant of the Chinese Restaurant Process and obtained interesting results. There are some open problems to explore, and we would like to hear what you think.

Appendix: Python-style pseudo-code for the similarity-driven CRP
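Below is a sketch of variant 2 as described above, assuming the input vectors are numpy arrays (e.g. the word2vec vectors of a document’s tokens) and reusing the similar helper from earlier. It is a hedged reconstruction, not necessarily the exact code we used.

import random

def crp_cluster(vectors):
    # Single-pass, similarity-driven clustering (variant 2).
    # Maintains a running sum vector and a member list for each cluster.
    cluster_sums = []   # one sum vector per cluster
    clusters = []       # indices of the vectors assigned to each cluster
    for idx, v in enumerate(vectors):
        n = len(cluster_sums)
        if n == 0:
            cluster_sums.append(v.copy())
            clusters.append([idx])
            continue
        # Find the cluster C whose sum is most similar to the current vector.
        sims = [similar(v, s) for s in cluster_sums]
        best = max(range(n), key=lambda i: sims[i])
        p_new = 1.0 / (1 + n)
        if sims[best] > p_new or random.random() >= p_new:
            # Join the most similar cluster and update its sum and membership.
            cluster_sums[best] += v
            clusters[best].append(idx)
        else:
            # Otherwise, with probability 1/(1 + n), open a new cluster.
            cluster_sums.append(v.copy())
            clusters.append([idx])
    return clusters, cluster_sums

Applied to a document, vectors would be the word2vec vectors of its tokens in order, and each returned cluster sum can then be compared against query vectors exactly as in the console snippet above.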

We wrote this post while working on Kifi — Connecting people with knowledge. Learn more.

Originally published at eng.kifi.com on March 17, 2014.
