In a math lecture a few weeks ago, I learned how words can be converted into high-dimensional vectors to solve analogy questions. An analogy question asks you to find the relationship between a pair of words and apply it to another pair. For instance, the answer to "man is to woman what king is to ____" is "queen", since you have to use the relationship in the first word pair to fill in the blank. This seemed really interesting to me, since relationships between words are quite complex to encode. I was curious to see whether high-dimensional vectors could truly capture the meaning of a word and its relationship to another word.
The scheme described in class was to first convert each word into a high-dimensional vector, then subtract the first word's vector from the second word's vector in the first word pair, and add the result to the vector of the first word in the second pair. The word whose vector is closest to the resulting vector is the solution. For the example above, the answer would be the word closest to king + woman - man, which would hopefully be queen.
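As a minimal sketch of that scheme (using made-up 3-dimensional toy vectors rather than real trained embeddings), the arithmetic and nearest-neighbor lookup look like this:

```python
import numpy as np

# Toy 3-dimensional embeddings -- real Word2Vec vectors are learned from data
# and have far more dimensions. These values are invented so that the analogy
# "man : woman :: king : queen" works out.
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def solve_analogy(a, b, c):
    """Return the word whose vector is closest to c + (b - a)."""
    target = vectors[c] + (vectors[b] - vectors[a])
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # skip the question words
            continue
        # cosine similarity between the candidate and the target vector
        sim = vec.dot(target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(solve_analogy("man", "woman", "king"))  # -> queen
```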
The first step in solving this problem was finding a good dataset to train the Word2Vec model (a model that converts words into high-dimensional vectors). I decided to use a compilation of all the hotel reviews in the OpinRank dataset. I found a concatenated list of all the hotel reviews in a single file (reviews_data.txt.gz) here: https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec. There were a total of 255,403 reviews in the dataset.
I used Gensim, a Python toolkit for vector-space and topic modeling, to create a model that solves analogies. I trained the Word2Vec model using Gensim's implementation, passing in a list of tokenized reviews. While I don't know the technicalities of the Word2Vec implementation, the adage "the meaning of a word can be found from the company it keeps" captures at a high level how it works: for each word, the model examines its neighboring words and uses them to infer its meaning. I configured each word's vector to be 150-dimensional. I used the following tutorial to make sure I had the correct syntax: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W7HQUhNKjBK
Once the model was trained, I implemented exactly the scheme proposed in lecture. I manually chose three words, the first two being the word pair, and converted each of them to a vector. I then applied the formula 4th word = 3rd word + (2nd word - 1st word) and found the word closest to the resulting vector. I referred to the Gensim documentation for syntax questions: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.wmdistance
I found that the analogy solver was able to solve simple analogies comfortably but struggled with more complex ones. Here is the list of analogies I tried, along with the top candidate words for each solution. The number next to each word is the similarity between that word's vector and the vector computed by the formula above.
Note: I had to manually filter out some of the results that gave an answer that repeated a word in the question.
1. Man is to Woman what King is to ___?
2. Boy is to Girl what Man is to ___?
3. Water is to ice what liquid is to ___?
4. Bad is to good what sad is to ___?
5. Doctor is to hospital what teacher is to ___?
6. USA is to pizza what Japan is to ___?
7. Human is to house what bird is to ___?
8. Grass is to Green what Sky is to ___?
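The manual filtering mentioned in the note above is straightforward to automate; a sketch of the idea (the candidate list here is invented for illustration):

```python
def filter_candidates(candidates, question_words):
    """Drop candidates that merely repeat a word from the question."""
    banned = {w.lower() for w in question_words}
    return [(word, score) for word, score in candidates
            if word.lower() not in banned]

# Hypothetical raw output: (word, similarity) pairs from the model.
raw = [("woman", 0.81), ("queen", 0.74), ("king", 0.70), ("princess", 0.65)]
print(filter_candidates(raw, ["man", "woman", "king"]))
# -> [('queen', 0.74), ('princess', 0.65)]
```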
It seems as though the scheme of converting words to high-dimensional vectors works and gives reasonably good results. There is room for improvement, and perhaps that can be achieved with a bigger dataset. I'm going to experiment with different datasets, and I'll also try different implementations of Word2Vec.
I'm curious to see how this model would perform on old SAT multiple-choice questions (the test formerly included analogies). Since the questions are multiple choice, the model would only have to choose which of the four candidate words' vectors is closest to the predicted vector. I'm going to compare this model's accuracy against that of humans taking the test!
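Restricting the task to multiple choice makes it easier: instead of searching the whole vocabulary, the model only has to score the four answer choices against the predicted vector. A sketch with invented toy vectors (a real version would look the words up in the trained model instead):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pick_choice(predicted, choices):
    """Return the answer choice whose vector is closest to the predicted one."""
    return max(choices, key=lambda word: cosine(vectors[word], predicted))

# Invented 3-dimensional vectors, for illustration only.
vectors = {
    "queen":  np.array([1.0, 1.0, 1.0]),
    "prince": np.array([1.0, 0.2, 1.0]),
    "castle": np.array([0.0, 0.1, 1.0]),
    "crown":  np.array([0.5, 0.0, 0.9]),
}

predicted = np.array([1.0, 1.0, 1.0])  # e.g. king + woman - man
print(pick_choice(predicted, ["queen", "prince", "castle", "crown"]))  # -> queen
```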
On the whole, I'm very impressed with how accurate the model was, and that, for the most part, the results made sense. It means that converting a word into a 150-dimensional vector can encapsulate the word's meaning, and that the relationship between two words can also be represented as a 150-dimensional vector; these two facts in combination gave the model its accuracy.