# Embeddings: A Matrix of Meaning

How can we model the meaning of a word? When we think of the verb “run”, it evokes a feeling of exertion or maybe a picture of a marathon. Can we represent a word as a node in a graph or model it as an image? For centuries, the question of how to model meaning was left to philosophers, then, in the last hundred years, to psychologists and linguists. In the last twenty years, that question has become crucial to the world of AI and computer science.

In this context we get to explore the meanings of words empirically. The concept of the embedding matrix has proven to be useful both for understanding linguistics and for solving real-world problems in the realm of natural language processing.

### What is an Embedding Matrix?

Take three similar phrases:

• … when the worker left …
• … when the fisherman left …
• … when the dog left …

Now, imagine we don’t know what “worker,” “fisherman,” and “dog” mean. In fact, we don’t know any of the words, but can easily tell that the phrases are identical except for one word. Since the contexts are identical, we could extrapolate that some similarity exists between “worker,” “fisherman,” and “dog”. By applying this idea to an entire corpus, we could define the relationships between words. Then we come to the question of how best to represent those relationships in a general way.

The concept of an embedding matrix is an attempt to solve this relationship representation problem. To begin with, we pick a dimensionality of meaning — this can be somewhat arbitrary. Say we decide all meaning can map to some abstract space of three dimensions. Conceptually, that would mean that all words would exist as singular points in a 3D space and any word could be uniquely defined by their position within that space described by three numbers (x, y, z). But, in reality, meaning is too complicated to fit well into three dimensions. Typically, we use something like 300 dimensions, and all words map to some point in this 300D hyperspace and are defined by 300 numbers. We call the 300 numbers that identify a given word an embedding for that word.

Now relationships between words can be represented by comparing the words’ embeddings. For instance, “worker” maps to W, which is a 300-length vector, and “fisherman” maps to F, which is a different 300-length vector. The relationship between the two words can be described as their difference: W–F. Note that this is not commutative (x–y != y–x), but words’ relationships to each other aren’t commutative either. The relationship between “person” and “human” is not the same as the other way around. The direction of the relationship actually contains meaning in its own right — what can be done to “person” to transform it into “human”?

An embedding matrix is a list of all words and their corresponding embeddings.

A few things to keep in mind:

• Thinking in higher dimensions is hard. Don’t get caught up in the dimensions. The same concept works (albeit not nearly as well) in three dimensions.
• In the real world, each dimension has an obvious physical meaning — like the XYZ axes of our 3D world. In an embedding space, we typically don’t define the axes — most of the time we don’t even know what a particular axis represents in our embeddings hyperspace. Instead, we use other tools to decipher utility from the matrix.
• Note that in the embedding matrix above, each row corresponds to a word and each column corresponds to a dimension (axis). Typically, we store this in a dense fashion, where we have a list of words and row ID’s which map to the corresponding row of the matrix. For the above example, we’d have the following list in addition to the matrix:
{ hello: 0, there: 1, texas: 2, world: 3, … }
• We obviously can’t have infinite rows in this matrix, so they’re typically cut off at, say, 50,000 — these should be the 50,000 most common words. If we only ever see a word in one context, it’s hard to determine its meaning (e.g., it’s difficult to learn some words that we have no reason to use frequently). Any word in a dataset that isn’t in our list gets tagged “UNK,” which means that we won’t see it enough times to be useful. UNK words are so rare that we can usually ignore them.
• Embedding matrices are extremely large! If we have 50,000 words and 300 dimensions, that means we have 50,000 x 300 individual numbers. If these numbers are floats (4 bytes), we would need 50,000 x 300 x 4 bytes — 60MB for one matrix!

If we could generate these 50,000 x 300 numbers, then, theoretically, we would have a “dictionary” of all 50,000 words encoded in numbers. This would be very useful for designing machine learning models, as they wouldn’t have to deal with messy strings anymore.

### Training an Embedding Matrix

Like any other training task, you need some sort of dataset to train embeddings. The difficulty is that, unlike most deep learning cost functions, we don’t have any “real” embeddings to calculate a useful loss (error). Thus, we have no better recourse than to train for some other goal that we can test and, hopefully, create the embeddings as byproducts — parameters — along the way.

One methodology is to design the model to guess the context of a word from the word itself. This is useful because it allows us to always verify if the model guessed correctly, whereas it’s impossible to verify that an embedding is correct.

The embedding matrix is randomly initialized and set as parameters to this context-guessing model. A cost can be calculated by seeing how closely the model guessed the context embedding, then the whole model can be trained using gradient descent. This is the ‘continuous-bag-of-words’ model proposed by Mikolov et al in 2013.

### Evaluation

Once an embedding matrix has been trained from a dataset, it would be nice to examine the embeddings to see if they make sense. There are really no guarantees of this because any number of things could go wrong during the training process:

• The dataset might not be general enough (maybe a few words’ embeddings are good but not all 50,000).
• Maybe the dataset isn’t large enough to train the embeddings at all. In this case, any training method would memorize the dataset, producing garbage embeddings.
• We could have chosen the wrong dimensionality for our embeddings.
• Perhaps the training model we chose has a flaw such that it doesn’t converge properly.
• One of our hyperparameters (parameters that are part of the model but not trained by the model, which we set before the training process) could lead to it not converging properly.

So it’s definitely necessary to evaluate the model, but it isn’t quite as simple to evaluate embeddings as it is to evaluate other machine learning tasks, because the model is only learning the embeddings as a byproduct of guessing some other task (such as a word’s meaning based on its context). We could check to see how well the model does at guessing context words, but this isn’t all that useful in and of itself.

A few other techniques can be used:

• For a given word embedding, we can examine the words closest to that word. If we think of words as just points in space, then for any two words there is a distance between them. If our embedding matrix was successfully trained, synonyms should show up closer together.
• Word relationships:
King” is to “Queen” as “Man” is to X
When we have embeddings, we can translate the above relationship into an equation:
King–Queen~Man–X
• In a semantic dataset that has many relationships like these, if the embeddings are trained correctly, X above should be what we’d expect: “Woman.”
• Finally, there are other machine learning models that are designed to take word embeddings as inputs. If our embeddings seem correct, we could use them as inputs to another model that is training something else, like sentiment analysis, for example.

### Distributing Embedding Matrices

The ability to correctly train embedding matrices is a prerequisite for most NLP machine learning models, but working with embedding matrices brings up some engineering challenges. The matrices are quite large and they don’t follow the same assumptions that tensor-based frameworks were designed for. In image processing models, parameters are typically dense (the entire matrix is trained and updated at every step), however, when training with embedding matrices, we typically only train the rows (words) specific to the current context window in the training dataset. The vast majority of the matrix isn’t altered during a given training step.

In a single node case, since we still need to store the whole matrix, this is of little consequence, but when distributing a training model, it has significant implications. Generally, in deep learning, we distribute models across a cluster by using data parallelism, a technique where each worker node in the cluster runs through the same model, but with different data. They then coordinate the updates they will make to parameters — the minimal information required to work together. When working with dense parameters, this coordination simply means that each worker copies its parameters to each other worker. However, sparse matrices like the embedding matrix can be made much more efficient if they only transfer the rows that changed. Most of the parameters wouldn’t need to be copied.

So, how can we transfer certain rows of a matrix over a network efficiently? The following rules of thumb are based on our research:

• When copying data, it’s always more efficient to do it in bulk instead of one-at-a-time. For instance, if we have 100 rows to transfer, it is better to do one transfer with them all coalesced than to do 100 individual transfers. This also holds true for memory and GPU transfers.
• When coalescing, you should do it in-place (without using a new buffer) to save RAM overhead.
• After coalescing rows, you need to communicate where the original rows came from. This can be done in two ways: with an array of indices or a bitmap. The optimal way depends on the percentage of rows that need to be copied and both ways can be compressed.
• In actuality, some rows show up more frequently than others: regardless of language, words follow a zipfian distribution, not a uniform distribution. This has implications if you intend to shard your communication (split your embedding matrix into multiple sections). For instance, if the embedding matrix is sorted in order of most frequent words to least frequent and we want to shard this matrix into four sections, the first section will be updated far more frequently than the last. Ideally, we would randomize the rows of the matrix so that splitting the matrix into sections doesn’t lead to certain sections being updated more often than others.

Designing systems for machine learning always involves a tradeoff. The more powerful the framework, the harder it is to use. An embedding matrix could fit into the standard tensor mould like any other deep learning parameter, but if we treated it like any other parameter, we would lose the ability to take advantage of its sparsity (the fact that it doesn’t need to be updated all at once).

If we only transfer the minimum modified information — the rows of the embedding matrix that were updated for a given step — we can distribute the entire job while communicating less information. This effectively increases the efficiency of parallelizing the task as a whole.

Embedding matrices are not the only special-case parameters in machine learning — sparsity can come in many forms. In general, you can think of parameters as falling into three categories: stored and communicated sparsely (akin to hashmaps), stored densely and communicated sparsely (embedding matrices), or stored and communicated densely (traditional tensors). When designing systems to solve large families of machine learning algorithms, it is necessary to support all three families.

#### References

A few models that are very similar to the “Continuous Bag of Words”:

Once word embeddings are trained they can be used for higher level tasks:

Embedding matrices can also be used to represent phrases instead of words: https://arxiv.org/pdf/1406.1078.pdf