Is the Stanford Rare Word Similarity dataset a reliable evaluation benchmark?

Rare word representation is an active area in lexical semantics that deals with inducing embeddings for rare and unseen words, i.e., words for which no or very few occurrences have been observed in the training corpus.

Since its creation, the Stanford Rare Word (RW) Similarity dataset has been regarded as a standard evaluation benchmark for rare word representation techniques. The dataset contains 2034 word pairs, selected so that the first word of each pair has a low occurrence frequency in Wikipedia, each rated on a [0,10] similarity scale.

Created by Minh-Thang Luong, Richard Socher, and Christopher D. Manning (2013), the RW dataset is one of the many recent word similarity datasets that acquire their similarity judgements through crowdsourcing. In this case, Amazon Mechanical Turk workers provided up to ten scores for each word pair. The raters were restricted to US-based workers in order to increase the chance that they were fluent English speakers. Additionally, the authors asked the raters to self-certify, i.e., to indicate whether they “knew” each word; these responses were used to discard unreliable pairs.

However, a quick analysis of the RW dataset clearly indicates that the above measures have not been able to guarantee quality annotations, casting serious doubt over its reliability as an evaluation benchmark for rare word representation and similarity.

I’ll jump directly into the discussion with the help of an example:

Example 1: the word bluejacket is paired with submariner in the dataset. According to WordNet, a submariner (a member of the crew of a submarine) is a bluejacket (a serviceman in the navy; a navy_man, sailor), i.e., the two words are in a hypernymy relationship (also known as an is_a relationship, e.g., cat is_a feline or car is_a vehicle). One would expect a word to have high semantic similarity with its hypernym. However, the gold score for this pair is just 0.43 on the dataset's [0,10] scale, a score close to the completely dissimilar end of the scale.

There are numerous such examples in the dataset:

Example 2: tricolor (a flag having three colored stripes — especially the French flag) is paired with flag with a similarity score of 0.71.

Example 3: untruth (a false statement) and statement with a similarity of 1.22.

This raises the question of whether Turkers (even when restricted to US-based workers) are sufficiently knowledgeable for such semantic annotations. Moreover, the self-certification measure, apart from not being a rigorous check, does not verify whether the annotator knew all possible meanings of a word.

Example 4: decompositions can refer to the analysis of vectors in algebra. When paired with algebra, the assigned score is only 0.75!

Example 5: depressor has three senses in WordNet; the first is defined as “any skeletal muscle that draws a body part down”. When paired with muscle, the gold score is only 1.86.

Such examples clearly indicate that the annotators were not aware of specific meanings of a word (in this case, the algebraic meaning of decomposition and the anatomical meaning of depressor), despite “knowing” the words.
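These observations are easy to check programmatically. The snippet below is a minimal sketch using NLTK's WordNet interface (it assumes the WordNet data has been fetched via nltk.download('wordnet')); the hard-coded pairs and gold scores are simply the examples discussed above, not the dataset itself:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def is_hypernym_pair(word1, word2):
    """True if, according to WordNet, some sense of word1 has some sense of
    word2 among its (transitive) hypernyms, i.e., word1 is_a word2."""
    for s1 in wn.synsets(word1):
        ancestors = {h for path in s1.hypernym_paths() for h in path}
        if any(s2 in ancestors for s2 in wn.synsets(word2)):
            return True
    return False

# The example pairs discussed above, with their RW gold similarity scores.
examples = [("submariner", "bluejacket", 0.43),
            ("tricolor", "flag", 0.71),
            ("untruth", "statement", 1.22)]

for w1, w2, gold in examples:
    print(f"{w1} is_a {w2}: {is_hypernym_pair(w1, w2)}  (gold: {gold})")

# Listing the senses shows the meanings the annotators seem to have missed
# (the algebraic sense of decomposition, the anatomical sense of depressor).
for word in ("decomposition", "depressor"):
    for synset in wn.synsets(word):
        print(f"{word}: {synset.definition()}")
```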


Distribution of the data:

To construct the dataset, the authors first gathered a set of low-frequency words (word1) from Wikipedia, making sure that they appear in WordNet’s vocabulary (to discard misspellings or non-words). The pairing word (word2) for each of these was selected as follows:

First, a WordNet synset of word1 is randomly selected, and a set of candidate pairing words is constructed from the synsets connected to it through various relations, e.g., hypernymy, hyponymy, holonymy, meronymy, and attributes.
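For concreteness, here is a rough reconstruction of that pairing step with NLTK's WordNet interface. This is my own sketch of the described procedure, not the authors' code; the exact set of relations, and how a candidate is finally sampled from them, may differ from what was actually used:

```python
import random
from nltk.corpus import wordnet as wn

def candidate_pair_words(word1, seed=None):
    """Sketch of the pairing step: randomly pick a synset of word1 and collect
    the words reachable from it through a handful of WordNet relations."""
    rng = random.Random(seed)
    synsets = wn.synsets(word1)
    if not synsets:
        return set()
    synset = rng.choice(synsets)
    related = (synset.hypernyms() + synset.hyponyms()
               + synset.member_holonyms() + synset.part_holonyms()
               + synset.substance_holonyms()
               + synset.member_meronyms() + synset.part_meronyms()
               + synset.substance_meronyms()
               + synset.attributes())
    return {lemma.name() for s in related for lemma in s.lemmas()}

print(candidate_pair_words("bluejacket", seed=0))
```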

Due to this questionable selection strategy (which I will not discuss here), the distribution of pairs over the similarity scale is biased towards the upper bound (there are more semantically similar pairs than dissimilar ones).

According to my estimate, 78% of the 2034 word pairs in the dataset are in a hypernymy or similar_to relationship. One would expect most of these (semantically similar) pairs to have been assigned high similarity scores, close to the upper bound of the [0,10] similarity scale. Surprisingly, however, these pairs are spread across the entire scale, from complete dissimilarity (lower bound) to identical meaning: respectively, 60%, 68%, 65%, and 56% of the pairs in the first to fourth quartiles are in a hypernymy relationship.

[Figure] Left: the distribution of pairs with a hypernymy relation is almost uniform across the four quartiles of the dataset, whereas one would expect many more pairs in the top quartiles (4 and 3), given the high semantic similarity of hypernym-hyponym pairs. Right: the dataset is biased towards the upper bound, i.e., high similarity judgements.
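For anyone who wants to reproduce this estimate, a sketch along the following lines can be used. It assumes the dataset file rw.txt in its standard tab-separated format (word1, word2, average score, followed by the individual scores), treats quartiles as quartiles of the gold score, and uses my own approximate test for the hypernymy/similar_to relations, so the exact percentages may come out slightly differently:

```python
import numpy as np
from nltk.corpus import wordnet as wn

def hypernym_closure(synset):
    """All (transitive) hypernyms of a synset, including the synset itself."""
    return {h for path in synset.hypernym_paths() for h in path}

def hypernymy_or_similar(word1, word2):
    """Approximate test: some pair of senses is linked by hypernymy
    (in either direction) or by the similar_to relation."""
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            if s1 in hypernym_closure(s2) or s2 in hypernym_closure(s1):
                return True
            if s2 in s1.similar_tos():
                return True
    return False

# Assumed rw.txt format: word1 <tab> word2 <tab> average score <tab> raw scores.
pairs = []
with open("rw.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        pairs.append((fields[0], fields[1], float(fields[2])))

flags = np.array([hypernymy_or_similar(w1, w2) for w1, w2, _ in pairs])
scores = np.array([avg for _, _, avg in pairs])
print(f"pairs in a hypernymy/similar_to relation: {flags.mean():.0%}")

# Break the estimate down by quartile of the gold similarity score.
edges = np.quantile(scores, [0.25, 0.5, 0.75])
quartiles = np.digitize(scores, edges)          # 0 = lowest-scored quartile
for q in range(4):
    share = flags[quartiles == q].mean()
    print(f"quartile {q + 1}: {share:.0%}")
```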

Additionally, the dataset suffers from inconsistent annotations. For instance, consider the two almost identical pairs in the dataset:

tricolour : flag
tricolor : flag

The assigned scores for the pairs are 5.80 and 0.71, respectively. This inconsistency is also reflected by high variances across annotators’ scores.

Inter-annotator agreement (IAA):

Up to 10 annotations were collected for each of the 2034 word pairs in the dataset, each annotation being a similarity score in the [0,10] range. More precisely, 214 of the pairs have fewer than 10 scores, with the minimum number of scores for a pair being 7.

Given that the number of annotations is not the same for every pair, it is not straightforward to apply standard correlation-based measures for computing IAA (leaving aside the question of whether an IAA computed over different sets of annotators is meaningful at all).

Given the high variance across annotators’ scores, the authors pruned the scores to only those within one standard deviation of the mean. This results in an even more unbalanced set of scores, making the computation of IAA more challenging.

However, according to a rough calculation, the averaged pairwise Spearman correlation on seven overlapping sets of the original annotations is 0.41, a strikingly low figure compared to other existing word similarity datasets (e.g., SimVerb-3500 reports 0.84).
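For reference, a rough calculation of this kind can be done as follows. Since the raw file does not identify individual annotators, the sketch below simply treats the first seven score columns (the number available for every pair) as pseudo-annotators; this is exactly why the resulting figure should be read as a rough proxy rather than a proper IAA. The rw.txt format is assumed to be word1, word2, average score, then the raw scores, tab-separated:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

# Assumed rw.txt format: word1 <tab> word2 <tab> average <tab> score_1 ... score_n
raw_scores = []
with open("rw.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        raw_scores.append([float(s) for s in fields[3:]])

# Keep the number of columns available for every pair (7 in the RW dataset),
# so that each pseudo-annotator column is fully populated.
n_common = min(len(scores) for scores in raw_scores)
matrix = np.array([scores[:n_common] for scores in raw_scores])

# Averaged pairwise Spearman correlation between the pseudo-annotator columns.
pairwise = [spearmanr(matrix[:, i], matrix[:, j]).correlation
            for i, j in combinations(range(n_common), 2)]
print(f"averaged pairwise Spearman: {np.mean(pairwise):.2f}")
```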

All in all, I believe that the RW dataset is not a reliable benchmark for the evaluation of rare word representation (or any other task). As a matter of fact, the state-of-the-art on this dataset has been unable to surpass the 0.45 ceiling, a performance level that can already be achieved using standard pre-trained embeddings (such as the Google News Word2vec) without employing any sophisticated rare word representation procedure.
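As a sanity check, such a baseline can be computed with a few lines of gensim and scipy, measuring the Spearman correlation between the cosine similarities of pre-trained vectors and the gold scores. This is only a sketch: the file paths are illustrative, and scoring out-of-vocabulary pairs as 0 is just one of several common choices, each of which affects the final figure:

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

# Pre-trained Google News embeddings (300-dim word2vec); the path is illustrative.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

gold, predicted = [], []
with open("rw.txt", encoding="utf-8") as f:
    for line in f:
        w1, w2, avg = line.split("\t")[:3]
        gold.append(float(avg))
        if w1 in vectors and w2 in vectors:
            predicted.append(float(vectors.similarity(w1, w2)))
        else:
            predicted.append(0.0)  # crude handling of out-of-vocabulary words

print(f"Spearman correlation with gold scores: {spearmanr(gold, predicted).correlation:.2f}")
```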

Suggestions:

To my knowledge, the RW dataset is the only existing dataset dedicated to rare word similarity. Given its importance and the need for such an evaluation benchmark, it would be reasonable to attempt to create improved benchmarks, either by building new similarity datasets (or other benchmarks) or by improving the Stanford RW dataset itself. Several solutions are possible, including:

  1. Re-score the pairs in the RW dataset, either with the help of expert annotators or by providing the Turkers with dictionary definitions of the words and employing more rigorous evaluation measures.
  2. Create a new dataset. The RW dataset was created with a focus on morphological variations. As a result, it contains many inflected words and plurals (approximately a third of the rare words are either plural or -ed forms; a crude way to reproduce this estimate is sketched below). Hence, it does not include many single-morpheme or domain-specific rare words, which might be more challenging to handle: the embedding for a plural word (e.g., consequences) or an inflected one (e.g., untracked) can be induced relatively easily from its singular form (consequence) or other morphologically related forms (untrack or track), whereas single-morpheme words (e.g., the medical term afebrile) or exocentric compounds (whose meanings cannot be inferred from their sub-word components; e.g., honeymoon has little to do with honey or moon) offer no such variation in the training data.
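The estimate about plural and -ed forms mentioned in point 2 can be approximated with a simple heuristic based on WordNet's lemmatizer (morphy); this is my own crude check, and the exact figure depends on the heuristic used:

```python
from nltk.corpus import wordnet as wn

# Collect the rare words (word1) from the dataset; assumed rw.txt format.
rare_words = set()
with open("rw.txt", encoding="utf-8") as f:
    for line in f:
        rare_words.add(line.split("\t")[0])

def looks_inflected(word):
    """Crude heuristic: the word ends in -s or -ed and WordNet's lemmatizer
    (morphy) maps it to a different base form."""
    if not (word.endswith("s") or word.endswith("ed")):
        return False
    lemma = wn.morphy(word)
    return lemma is not None and lemma != word

share = sum(looks_inflected(w) for w in rare_words) / len(rare_words)
print(f"{share:.0%} of the rare words look like plural or -ed forms")
```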