Evaluating Pretrained Word2Vec Vectors

Alperen Çakın
May 19, 2019

Introduction

Word embeddings are a relatively recent way of representing words in a low-dimensional space. They provide a vector representation of words that captures their semantic and syntactic relationships.

In this story, efficient evaluation of the word2vec vectors trained by Google is discussed. The story is derived from my report for an NLP assignment given by Necva Bölücü.

Loading Vectors

The pretrained word vectors, distributed by Google in a binary file format, are loaded with the KeyedVectors.load_word2vec_format() method of the gensim module. A limit parameter is passed to this method, causing it to ignore the embeddings of less frequent words. This is necessary because storing both the normalized vectors (explained in the next section) and the raw vectors in memory raises a MemoryError on a device with insufficient physical memory.
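
A minimal sketch of this loading step, assuming the standard GoogleNews-vectors-negative300.bin file and a placeholder limit value:

from gensim.models import KeyedVectors

# Path to the pretrained Google News vectors (placeholder path).
VECTORS_PATH = "GoogleNews-vectors-negative300.bin"

# limit keeps only the first N (most frequent) vectors to avoid a MemoryError.
word_vectors = KeyedVectors.load_word2vec_format(
    VECTORS_PATH, binary=True, limit=500000
)

print(word_vectors.vector_size)  # 300
print(len(word_vectors.vocab))   # 500000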

L2 Normalization

L2 normalization can be defined as dividing a vector by its length. For a vector v, the normalized vector is computed as v / ||v||, where ||v|| is the Euclidean (L2) norm of v.

This operation is performed for each vector in the given dataset by calling the .init_sims() method of the Word2VecKeyedVectors class. The purpose is to make the cosine similarity calculations faster.
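
As a sketch, assuming the word_vectors object loaded above and the gensim 3.x attribute names (vectors_norm is the newer alias of syn0norm):

import numpy as np

# Precompute the L2 normalized vectors; replace=False also keeps the raw vectors.
word_vectors.init_sims(replace=False)

# The normalized matrix is stored in word_vectors.vectors_norm (alias of syn0norm).
v = word_vectors["king"]
idx = word_vectors.vocab["king"].index
assert np.allclose(word_vectors.vectors_norm[idx], v / np.linalg.norm(v))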

Calculating the Target Vector

Before calculating cosine similarities, a target vector needs to be calculated in order to execute the given analogy tests.

For given words a, b, c and their word vectors va, vb, vc obtained from the dataset, the target vector is calculated as vtarget = vb - va + vc.

This vector is also L2 normalized, allowing the software to calculate the cosine similarity between it and every vector in the dataset with just a dot product, since for L2 normalized vectors cos(vtarget, vi) = vtarget · vi.

After calculating the target vector, the vector in the dataset that is most similar to it should correspond to the expected analogy word. For example, for the analogy test "Abuja Nigeria Accra Ghana", the vector closest to vNigeria - vAbuja + vAccra is expected to be vGhana.
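
A minimal sketch of the target vector computation, assuming the word_vectors object loaded earlier (whether the raw or the pre-normalized input vectors are combined is my assumption; this sketch uses the raw ones):

import numpy as np

def target_vector(kv, a, b, c):
    """Compute the analogy target vb - va + vc and L2 normalize the result."""
    target = kv[b] - kv[a] + kv[c]
    return target / np.linalg.norm(target)

# Example from the analogy tests: Abuja : Nigeria :: Accra : ?
target = target_vector(word_vectors, "Abuja", "Nigeria", "Accra")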

Calculating Cosine Similarities

Cosine similarity for two given vectors v1 and v2 can be calculated using the following formula: cos(v1, v2) = (v1 · v2) / (||v1|| ||v2||).

The idea is that once all vectors in the dataset are L2 normalized, both norms in the denominator are equal to 1, so calculating the cosine similarity of any two vectors reduces to a single dot product.
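
A small numpy illustration of this equivalence (the example words are arbitrary):

import numpy as np

v1 = word_vectors["Paris"]
v2 = word_vectors["France"]

# Full cosine similarity formula.
full = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Dot product of the pre-normalized vectors gives the same value.
fast = np.dot(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2))

assert np.isclose(full, fast)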

Getting a List of All Similarities

The Word2VecKeyedVectors class stores the loaded word vectors in the form of a matrix (a list of vectors): the syn0 attribute holds the raw vectors, while the syn0norm attribute holds the L2 normalized vectors after .init_sims() is called.

Considering this, the cosine similarities between the target vector and all the other word vectors in the dataset can be calculated at once, as a list, by taking the dot product of the syn0norm matrix with the target vector.
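
A sketch of this step, assuming the normalized target vector from the previous section (vectors_norm is the gensim 3.x name for syn0norm):

import numpy as np

# One matrix-vector product yields the cosine similarity of the target
# vector with every word vector in the (limited) vocabulary.
similarities = np.dot(word_vectors.vectors_norm, target)

print(similarities.shape)  # (vocabulary_size,)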

This algorithm was inspired by inspecting the source code of the Word2VecKeyedVectors class' .most_similar() method.

There are some differences between this algorithm and the way that class calculates cosine similarities, but both rely on the same principle: obtaining cosine similarities from precalculated L2 normalized vectors.

Picking the Most Similar Word Vector

After creating a list of cosine similarities, the problem of picking the most similar vector arises. In gensim's code, this problem is solved by sorting the whole list. However, a full sort is expensive.

As an alternative to sorting, the problem can be reduced to picking the k largest elements of this list. Once the word indices of the k = 4 largest elements of the similarity list are known, the most similar word vector can be chosen. The word indices are needed because they are looked up in Word2VecKeyedVectors' .index2word mapping to retrieve the actual words, both for filtering and for output.

k must be at least 4 because each analogy test supplies 3 input words, and at most 3 of their vectors may appear among the k largest elements. If a smaller value were chosen, the following scenario might occur for input words a, b, c:

In this scenario, the method would fail to find the index of the most similar word vector: the indices of the words a, b, c would be filtered out of the k largest elements, and the method would end up with an empty list.

The problem of picking the k largest elements is solved with numpy's .argpartition() method, which works in linear time and is therefore more efficient than the .argsort() based approach used in gensim's code.

After picking the indices of the 4 largest elements, the most similar word is chosen by discarding the indices that belong to the three input words and returning the remaining word with the highest similarity.
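
A sketch of this selection step in Python, assuming the similarities array from the previous section (the function and variable names are my own):

import numpy as np

def most_similar_word(kv, similarities, input_words, k=4):
    """Pick the most similar word while ignoring the three analogy input words."""
    # Indices of the k largest similarities, found in linear time.
    top_k = np.argpartition(similarities, -k)[-k:]

    # Map indices back to words and drop the input words themselves.
    candidates = [
        (similarities[i], kv.index2word[i])
        for i in top_k
        if kv.index2word[i] not in input_words
    ]

    # Return the remaining candidate with the highest similarity.
    return max(candidates)[1]

predicted = most_similar_word(
    word_vectors, similarities, {"Abuja", "Nigeria", "Accra"}
)
print(predicted)  # expected analogy answer: Ghana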

Hardware Optimizations

Along with the software optimizations mentioned in previous sections, some optimizations to use the hardware more efficiently are possible.

In this implementation, threads are used as an optimization. Threads allow the software to use more than one CPU core at a time, lowering the execution time. The built-in threading module is used, and the analogy tests are divided between the threads with numpy's .array_split() method, as sketched below.
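
A sketch of how the tests might be split across threads, assuming an evaluate() function that runs a chunk of analogy lines (its name and signature are my own):

import threading
import numpy as np

NUM_THREADS = 4

def run_in_threads(analogy_lines, num_threads=NUM_THREADS):
    """Split the analogy tests into chunks and evaluate them in parallel."""
    chunks = np.array_split(analogy_lines, num_threads)
    threads = [
        threading.Thread(target=evaluate, args=(chunk,))  # evaluate() is assumed
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Since the heavy work is numpy's matrix-vector product, which releases the GIL, the threads can run on multiple cores despite Python's global interpreter lock.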

Another way to decrease the execution time would be using a GPU for matrix multiplication purposes while calculating cosine similarity. However, this method is not implemented in this assignment.

Calculating the Accuracy

After executing the given analogy tests, the accuracy of the pretrained model is calculated as the number of correctly answered analogy tests divided by the total number of tests.

Results

Running Time of the Most Similar Method

The line "Abuja Nigeria Accra Ghana" from the analogy tests was executed 1000 times using both my implementation and the default .most_similar() method of the Word2VecKeyedVectors class. The mean execution times, grouped into ranges of runs, are shown in the plot above.

As can be observed from the plot above, my implementation performed slightly better than the default .most_similar() implementation in gensim's code. This test was conducted with a vector limit of 500000 and no threads. The overall mean execution times, without grouping into ranges, are shown in the figure below:

Accuracy and Total Running Time

All of the analogy tests are executed using a vector limit of 2000000 (because of the physical memory constraints of my device) and 4 threads.

Execution times include loading time for the pretrained vectors. Tests are executed on a system with the following specifications:

  • CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
  • Memory: 8192 MB, DDR4-2133, Samsung M471A1G43DB0-CPB
  • Drive: HGST HTS721010A9E630 (7200 rpm)
  • OS: Ubuntu 17.10, 64 bit

The following categories from the analogy tests are included in evaluation:

  • capital-world
  • currency
  • city-in-state
  • family
  • gram1-adjective-to-adverb
  • gram2-opposite
  • gram3-comparative
  • gram6-nationality-adjective

The following results were obtained without any exceptions indicating that a test word was missing from the loaded vectors:

As you can see in the table above, my method performed slightly better in execution time. However, its accuracy is lower than that of gensim's default implementation. This is not caused by a bug in the software; it occurs because gensim calculates the target vector in a different way.

Gensim builds the target vector differently from the approach described in the previous sections; since the two methods produce different target vectors, the accuracy results differ as well.
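
Based on my reading of the gensim 3.x source for .most_similar(), gensim unit-normalizes each input vector, averages the positively and negatively weighted inputs, and unit-normalizes the mean. The sketch below approximates that behaviour; treat it as an approximation rather than the exact implementation:

import numpy as np
from gensim import matutils

def gensim_style_target(kv, positive, negative):
    """Approximate gensim's target vector: mean of unit-normalized inputs."""
    weighted = [matutils.unitvec(kv[w]) for w in positive]
    weighted += [-1.0 * matutils.unitvec(kv[w]) for w in negative]
    return matutils.unitvec(np.array(weighted).mean(axis=0))

# For the analogy a : b :: c : ?, gensim uses positive=[b, c] and negative=[a].
target_gensim = gensim_style_target(
    word_vectors, positive=["Nigeria", "Accra"], negative=["Abuja"]
)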

Source Code

The source code for the whole assignment, including this story and the other one, can be found on GitHub:

References

Pretrained vectors trained on part of Google News dataset:

Analogy tests:

http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt

Solution to memory error when using gensim module:

L2 normalization:

Gensim’s source code for the .most_similar() method:

Argpartition method:
