Toward a Deeper Understanding of Word Embeddings

A walk through two 2020 NLP conference papers.

Kian Kenyon-Dean
BMO AI
Dec 17, 2020


This blog post will walk you through two of my conference papers: one from EMNLP 2020 and one from COLING 2020.

Word embedding algorithms such as Word2vec and GloVe were foundational for Natural Language Processing (NLP). At the time of writing, the 3 papers proposing Word2vec and GloVe have accumulated over 60,000 citations combined in the past 7 years.

Their popularity emerged from their general-purpose usefulness across a plethora of NLP tasks, from sentiment analysis and machine translation to POS-tagging and information extraction. This is because word embeddings solve one of the most challenging and ubiquitous problems in NLP — feature extraction.

Indeed, in NLP we often face arbitrarily sized sequences of words. Word embeddings turn these frustratingly discrete words into meaningful fixed-length vectors of numbers. In turn, this sequence of vectors can easily be encoded into a document representation via established techniques, from simple averaging to deep long short-term memory networks (LSTMs), deep transformers, and more.
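
To make the fixed-length point concrete, here is a minimal sketch (my own toy example, not code from either paper) of collapsing a variable-length sentence into a single document vector by averaging its word embeddings:

```python
import numpy as np

# Toy pre-trained embeddings; in practice these would be loaded from
# Word2vec or GloVe files. The values here are made up for illustration.
embeddings = {
    "the":   np.array([0.1, 0.3, -0.2]),
    "movie": np.array([0.7, -0.1, 0.4]),
    "was":   np.array([0.0, 0.2, 0.1]),
    "great": np.array([0.9, 0.5, -0.3]),
}

def document_vector(tokens, embeddings):
    """Average the word vectors of a variable-length token sequence
    into one fixed-length document representation."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

doc = document_vector("the movie was great".split(), embeddings)
print(doc.shape)  # (3,) -- fixed length, regardless of sentence length
```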

Despite the current dominance of new contextualized embedding models in NLP, it is valuable to understand what made word embeddings so useful for so many tasks. I posit that a deeper theoretical understanding of what they actually learn (i.e., what causes them to possess a certain notion of meaning) may provide insights into new structures, inputs, and priors that deep contextualized models are lacking.

In this blog post I will attempt to shed light on the meaning latent in word embeddings and the value they can deliver by summarizing my two latest accepted conference papers: Deconstructing Word Embedding Algorithms (EMNLP 2020) and Learning Task-Efficient Meta-Embeddings with Word Prisms (COLING 2020). Note, of course, these works would not have been possible without the patience, dedication, and hard work of all coauthors involved.

Deconstructing Word Embedding Algorithms

Many NLP practitioners have experimented with using word embeddings in their models. Yet, most are not aware that (1) half of their embeddings are missing; and (2) their embeddings were trained to approximate pointwise mutual information (PMI) statistics of their training corpus.

PMI is a measure of how much two events impact each other. For example, the PMI between the event of a coin landing on heads and a die rolling a six is 0 because these two events are completely independent of each other. In NLP, we find that the PMI between the event of seeing the word “costa” and seeing another word “rica” is much greater than 0, because the presence of one of those words strongly increases the probability of seeing the other.
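
To make this concrete, here is a toy sketch (my own illustration, not code from the paper) of estimating PMI from co-occurrence counts; the corpus and the sentence-level counting scheme are made up purely for demonstration:

```python
import math
from collections import Counter
from itertools import combinations

# Tiny toy corpus; real PMI statistics come from millions of sentences.
corpus = [
    "costa rica is in central america".split(),
    "visit costa rica for the beaches".split(),
    "roll the die and flip the coin".split(),
]

word_counts = Counter()
pair_counts = Counter()
for sent in corpus:
    word_counts.update(sent)
    # Count co-occurrences within the same sentence (one simple context choice).
    pair_counts.update(frozenset(p) for p in combinations(set(sent), 2))

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log [ P(w1, w2) / (P(w1) * P(w2)) ]."""
    p_joint = pair_counts[frozenset((w1, w2))] / total_pairs
    p_w1 = word_counts[w1] / total_words
    p_w2 = word_counts[w2] / total_words
    return math.log(p_joint / (p_w1 * p_w2))

print(pmi("costa", "rica"))  # strongly positive: each word predicts the other
```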

The table below, from my EMNLP paper, summarizes these points for 6 algorithms (FastText is not shown, as it is approximately equivalent to SGNS, aka Word2vec; see the paper). While the theoretical proof for each of these algorithms required a fair amount of notation, algebra, and matrix calculus, the fundamental finding can be summarized as follows:

Every popular word embedding algorithm learns two sets of embeddings (vectors and covectors, or, input and output vectors) such that the dot product between a covector i and a vector j aims to approximate the PMI between word i and word j from the original training corpus.
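
In symbols (my notation, which may differ slightly from the papers'), with c_i the covector of word i and w_j the vector of word j, the shared objective is:

```latex
% Shared learning objective of the embedding algorithms (my notation):
\langle \mathbf{c}_i, \mathbf{w}_j \rangle \;\approx\; \mathrm{PMI}(i, j)
  \;=\; \log \frac{P(i, j)}{P(i)\, P(j)}
```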

This is a surprising result — despite all of the different ways each algorithm samples from the corpus, and despite some being shallow neural networks and others using least-squares optimization — they all converge to one central learning objective: approximating PMI.

Another way to explain this finding is to say that each algorithm is implicitly factorizing a matrix of PMI statistics, as Levy & Goldberg proved for Word2vec (leveraging assumptions we were able to relax). My other blog posts, which you can find elsewhere, describe the relationship between Word2vec and matrix factorization.
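
For readers who prefer code to algebra, here is a minimal sketch of the explicit version of this idea: build a (positive) PMI matrix from co-occurrence counts and factorize it with truncated SVD. This is a generic Levy & Goldberg-style construction, not the exact procedure from the paper, and the co-occurrence matrix below is a random stand-in:

```python
import numpy as np

# Suppose `cooc` is a (V x V) word-context co-occurrence count matrix built
# from a corpus; here it is a small random stand-in for illustration only.
rng = np.random.default_rng(0)
cooc = rng.integers(0, 5, size=(1000, 1000)).astype(float)

total = cooc.sum()
p_w = cooc.sum(axis=1, keepdims=True) / total   # P(word)
p_c = cooc.sum(axis=0, keepdims=True) / total   # P(context)
p_wc = cooc / total                             # P(word, context)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)  # positive PMI, a common practical variant

# Explicit factorization with truncated SVD: rows of U * sqrt(S) act as
# word vectors, rows of V * sqrt(S) act as covectors.
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
d = 300
vectors = U[:, :d] * np.sqrt(S[:d])
covectors = Vt[:d, :].T * np.sqrt(S[:d])
print(vectors.shape, covectors.shape)  # (1000, 300) (1000, 300)
```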

Ways to improve & augment word embeddings

Hundreds of works have built upon the baseline word embeddings of Word2vec and GloVe. FastText employed subword information to train new subword embeddings (with word embeddings as the covectors) in order to resolve the out-of-vocabulary problem. Other techniques learn subword embeddings after training, via subword decomposition. I employed a variant of this technique to produce useful subword-based embeddings in the banking NLP context.

Other techniques augment word embeddings to possess certain semantic properties. Retrofitting introduced a simple, graph-based approach to augment pre-trained word embeddings to capture semantic relationships such as synonymy. This opened up an interesting question for the community: what kind of semantic relationships should my word embeddings possess?

For example, should the embeddings for “east” and “west” be highly similar or highly dissimilar? If dissimilar, should they be orthogonal, or 180 degrees apart (and thus collinear)? On the one hand, they are direct opposites; on the other hand, they are both cardinal directions. I would argue that one cannot designate a priori whether certain semantic properties are desirable; rather, we should allow the downstream task to make this determination.

While we understand that different algorithms (Word2vec, GloVe, SVD, etc.) will all produce embeddings that model PMI, there are still many degrees of freedom in the training process, including the context window size and the training corpus. Given these degrees of freedom, the different possible subword-based modifications, and all of the ways to augment embeddings’ semantic properties, the space of possible word embeddings one can employ is astronomical. Moreover, there is no evidence that a single set of word embeddings will be the most performant for every NLP task. So, what is to be done?

Meta-embeddings and Word Prisms

A somewhat inconspicuous body of NLP literature has been developing since at least 2016 on the subject of meta-embeddings. The aim is to solve the problem posed above by answering as follows:

Indeed, there are a plethora of possible embeddings to choose from and we do not have knowledge a priori of which will be best at the downstream task. In fact, it is likely that different word embeddings could very well complement each other. Thus, let us combine multiple sets of word embeddings together as an ensemble — we need not limit ourselves to one algorithm and one notion of meaning.

Perhaps an NLP practitioner seeks to build a specialized sentiment analysis system. They are confronted with a plethora of different embeddings to choose from — GloVe Common Crawl embeddings, FastText for subword information, the pre-retrofitted ConceptNet Numberbatch embeddings, etc. The practitioner need not experiment with each set of embeddings separately; rather, they can employ a meta-embedding function to leverage the useful linguistic and semantic properties possessed by each set of embeddings.

One can define many functions to perform this meta-embedding operation, as visualized below.

Averaging. The simplest function is just averaging them all together — this is very cheap because you don’t need to train any new model, and it is also geometrically well-justified when you are using just a few different sets of embeddings. Moreover, the final dimensionality of the meta-embedding will only be that of the largest embedding set in your group (the rest are zero-padded to match).

Therefore, averaging can be implemented very efficiently at inference time, since you can precompute your full set of meta-embeddings in advance at a reasonable dimensionality. However, we will see that averaging causes considerable information loss and mixing of the vector space, leading to a large deterioration in performance once you combine more than 2 or 3 sets of embeddings.
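
A minimal sketch of the averaging approach (my own toy example, with made-up embedding sources and dimensionalities):

```python
import numpy as np

def average_meta_embedding(vectors):
    """Average several embeddings of the same word, zero-padding each
    vector to the dimensionality of the largest one so the shapes agree."""
    max_dim = max(v.shape[0] for v in vectors)
    padded = [np.pad(v, (0, max_dim - v.shape[0])) for v in vectors]
    return np.mean(padded, axis=0)

# Hypothetical embeddings of "apple" from three different sources.
glove_apple = np.random.randn(300)
fasttext_apple = np.random.randn(300)
numberbatch_apple = np.random.randn(200)

meta = average_meta_embedding([glove_apple, fasttext_apple, numberbatch_apple])
print(meta.shape)  # (300,) -- the size of the largest input, not the sum
```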

Concatenation. The diametrical counterpart to averaging is concatenation. This is, again, a very simple meta-embedding function. For example, by concatenating four 300-dimensional embeddings for the word “apple” you will have a resulting dimensionality of 1200 for your new “apple” meta-embedding.

The advantage here is that no information is lost in this meta-embedding process. However, from an efficiency perspective, this technique will be completely infeasible when there are more than a few sets of embeddings, as the dimensionality of the resulting vector will be too memory-intensive for downstream models to ingest (also, this high dimensionality may make the downstream model more prone to overfitting).

While one can use techniques such as SVD to compress such large concatenated meta-embeddings, the problem is that such an approach ignores the downstream task at hand. It is more desirable to allow the downstream task to select which features are most useful for solving the problem, rather than doing so in an unsupervised manner, which causes information loss.
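
The following sketch (again a toy example of my own) shows the dimensionality blow-up of concatenation and the kind of unsupervised SVD compression described above; note that the SVD step selects directions purely by variance, with no knowledge of the downstream task:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

vocab_size, n_sources, dim = 20000, 4, 300
rng = np.random.default_rng(0)

# Stand-in embedding matrices from four sources (e.g., GloVe, FastText, ...).
sources = [rng.standard_normal((vocab_size, dim)) for _ in range(n_sources)]

# Concatenation: nothing is lost, but dimensionality grows to 4 * 300 = 1200.
concatenated = np.hstack(sources)
print(concatenated.shape)  # (20000, 1200)

# Unsupervised compression back down to 300 dimensions with truncated SVD.
svd = TruncatedSVD(n_components=300, random_state=0)
compressed = svd.fit_transform(concatenated)
print(compressed.shape)  # (20000, 300)
```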

Word Prisms

My 2020 COLING paper proposes a new technique for producing meta-embeddings that are dynamically adapted to the downstream task at hand: word prisms.

Word prisms address two issues with meta-embeddings: (1) the parameters of the word prism model are learned during training on the downstream task, allowing the model to dynamically determine the relative importance of each set of embeddings; (2) the final meta-embeddings are only as large as those produced by averaging, making them very efficient at inference time, while the information they contain is as useful as what concatenation would have produced. While other works address problem (1) or (2) separately, ours is the first to address both simultaneously.

The figure below illustrates the process of learning word prisms during training for a task called supersense tagging, which is like POS-tagging but for semantics. On the left, we see how the word prisms enter the sequence labelling model as the first layer of feature representation. On the right, we observe the internal workings of the word prism. In this example, 5 sets of input embeddings (facets) are displayed — we can see that each embedding set provides a different notion of meaning for the word “apple”.

The top embedding sets understand the word “apple” as a type of fruit, while the latter embedding sets understand “apple” as relating to technology and, finally, the corporation. Indeed, the meaning of “apple” depends on its context, so the word prism allows the downstream model to refer to whatever notion of meaning is relevant to the task at hand. Whereas in certain situations “apple” is just a fruit, in this case the model seeks to predict that “apple” in fact refers to the corporation “apple inc”.

Yet, the word prism is just a single vector, as we see in the equation prism(“apple”). The structure of this equation and the learned model parameters (α_f, P_f, and b_f) allow the downstream model to probe into this prismatic structure. Recalling that by “facet” we mean an individual set of embeddings (e.g., the GloVe embeddings), the meaning of this equation is as follows:

The word prism for a word w is a linear combination of the embeddings of w along each of its composite facets. During training on the downstream task, the word prism model learns 3 parameters for each facet via backpropagation: α_f, a scalar importance factor; P_f, a square orthogonal matrix that transforms the facet’s embedding of w (written w_f) to a new region of the vector space; and b_f, a bias vector. The sum of the orthogonally transformed and scaled facet embeddings is a new vector: the word prism for w.
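
Based purely on the description above, a word-prism layer could be sketched roughly as follows in PyTorch. The equation it implements, prism(w) = Σ_f α_f (P_f w_f + b_f), along with the parameter names, shapes, and the assumption that all facets share one dimensionality, reflects my reading of the post rather than the paper’s actual implementation:

```python
import torch
import torch.nn as nn

class WordPrism(nn.Module):
    """Sketch of a word-prism layer: prism(w) = sum_f alpha_f * (P_f @ w_f + b_f),
    with each P_f kept orthogonal during training."""

    def __init__(self, n_facets: int, dim: int):
        super().__init__()
        # One scalar importance factor alpha_f per facet.
        self.alpha = nn.Parameter(torch.ones(n_facets))
        # One square linear map (P_f, b_f) per facet; the orthogonal
        # parametrization constrains each weight matrix to stay orthogonal.
        self.maps = nn.ModuleList(
            [nn.Linear(dim, dim, bias=True) for _ in range(n_facets)]
        )
        for layer in self.maps:
            nn.utils.parametrizations.orthogonal(layer, "weight")

    def forward(self, facet_embeddings: torch.Tensor) -> torch.Tensor:
        # facet_embeddings: (n_facets, dim), the frozen embeddings of one word
        # drawn from each source; returns a single (dim,) meta-embedding.
        transformed = torch.stack(
            [m(e) for m, e in zip(self.maps, facet_embeddings)]
        )
        return (self.alpha.unsqueeze(1) * transformed).sum(dim=0)

prism = WordPrism(n_facets=5, dim=300)
apple_facets = torch.randn(5, 300)  # hypothetical embeddings of "apple"
print(prism(apple_facets).shape)    # torch.Size([300])
```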

By using linear orthogonal transformations for each component facet, the model is able to disentangle the various sets of embeddings and place them into different regions of the vector space, therefore allowing the final summation to occur without any information loss. Moreover, because the downstream model parameters are learned simultaneously with these linear transformations, it will be aware of each of these separate regions in the space. We experimentally verified that this is indeed true by finding that the orthogonally transformed facets are dramatically more naturally clusterable (a desirable quality in representation learning) than they would be otherwise, as visualized below.

High-quality results for Word Prisms

We evaluated word prisms on 6 different downstream NLP tasks, including supersense tagging, POS-tagging, named entity recognition, sentiment analysis, and natural language inference. We experimented with combining different sets of embeddings — either just two sets (FastText and GloVe) or 13 sets (see the paper for the full list).

We compared the technique to averaging, concatenation, and another technique called dynamic meta-embeddings (DMEs), which also allows the meta-embedding function to be fine-tuned to the downstream task at hand (but is far less efficient at inference time).

Our major findings were:

  • Word prisms perform better than or on par with the concatenation baseline, unlike averaging and DMEs, which are almost always worse than concatenation. This is remarkable because the resultant word prisms were only 300-dimensional, while the concatenated vectors had 3900 dimensions! This is equivalent to a 92% compression ratio with little-to-no information loss (and oftentimes, information gain).
  • Word prisms perform very well regardless of how many input facets there are; meanwhile, averaging does OK with 2 facets, but dramatically deteriorates with 13.
  • Using word prisms to combine various sets of embeddings always performs better than just using one set of embeddings (see Table 3 in the paper).
  • Word prisms are highly efficient because, after training the downstream model, the meta-embeddings can be precomputed for each word in your vocabulary. Thus, the first layer of orthogonal transformations can be removed from the model at inference time, meaning you can employ word prisms with the same inference-time efficiency as a single set of embeddings (as sketched below).
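
For instance, a trained prism can be baked into an ordinary lookup table along these lines (a hypothetical helper of my own, not code from the paper):

```python
import torch

@torch.no_grad()
def precompute_prisms(prism, vocab, facet_tables):
    """Bake a trained prism function into one static embedding matrix so that
    inference needs only an ordinary embedding lookup.
    `prism` maps a (n_facets, dim) tensor to a (dim,) meta-embedding;
    `facet_tables[f]` is facet f's (vocab_size, dim) embedding matrix."""
    rows = []
    for word, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        facets = torch.stack([table[idx] for table in facet_tables])
        rows.append(prism(facets))
    return torch.stack(rows)  # (vocab_size, dim): a plain lookup table

# Toy usage with a stand-in prism (a simple sum over facets).
vocab = {"apple": 0, "orange": 1, "ibm": 2}
facet_tables = [torch.randn(3, 300) for _ in range(5)]
table = precompute_prisms(lambda f: f.sum(dim=0), vocab, facet_tables)
print(table.shape)  # torch.Size([3, 300])
```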

Concluding remarks

Word prisms offer a marked improvement over existing meta-embedding algorithms both in terms of accuracy and efficiency. It would not have been possible to develop this technique without the theoretical groundwork provided by the deconstruction of major word embedding algorithms. Indeed, the fundamental intuition behind using orthogonal projections in word prisms geometrically derives from the dot product learning objective used to create word embeddings.

Some may argue “this is all well and good, but we have contextualized models now, and your word embeddings will never be able to outperform them, so why bother with this analysis and meta-embeddings?”

I answer:

  1. Word embeddings formed the foundation of contextualized embedding models (note that ELMo was initialized with GloVe embeddings in the first layer).
  2. A deep theoretical analysis of word embeddings can provide the theoretical basis for further investigation into the properties of these much more complex contextualized models, which will in turn lead to new practical architectures.
  3. Meta-embedding techniques are not limited to words; in fact, recent work has shown that using meta-embeddings to combine an ensemble of contextualized models obtains better results than using any single model on its own.
  4. Word embeddings (and word prisms) are much more efficient than large contextualized deep neural networks and can be effectively used on CPUs, so they are a viable low-resource alternative to expensive, resource-intensive contextualized models.

-Kian Kenyon-Dean, AI Developer for the BMO AI Capabilities Team. Reach out at kian.kenyon-dean@bmo.com
