Consolidating these models simplifies our deployment process for new models and provides a potentially more scalable way to add new language support in the future. We considered several possible research avenues and ultimately chose to investigate ways to merge the pre-trained embedding lookup tables of each language. In this post, we’ll talk about what happens to the information contained in word embeddings if they are naïvely merged.
For the interested reader who does not come from a technical background, the remainder of the series is somewhat technical. However, you don’t need to understand code in order to follow the posts and you are cordially invited to continue reading. For those with technical prowess, all code you see is written in Python.
To begin our discussion, we’ll give a brief background on word embeddings. If you’re already familiar with word embeddings, feel free to skip to the following two sections.
A brief review of word embeddings
Word embeddings are vectors (i.e. lists) of real numbers that represent words. For a quality explanation of what word embeddings are and how they are produced, I recommend reading the TensorFlow tutorial on word embeddings.
In practice, we want to produce embedding vectors such that their organization in n-dimensional space encodes semantic or linguistic relationships between the words they represent. One of the most interesting consequences of learning these vectors is that the vectors themselves tend to exhibit linear relationships that correspond to semantic relationships despite not having been trained to do so³. For example, Word2Vec, one of the most well-known techniques for creating word embeddings, isn’t trained with the specific objective of producing linear relationships between the embedding vectors. Linear relationships can nevertheless be found after training.
The following is a common example of what these relationships might look like.
*Word math* : (king - male) + female = queen
Each of the words in this bit of word math can be replaced with a vector from the embedding lookup table. If the linear relationship between the vectors for king, male, female, and queen closely approximates their semantic relationship, then the formula above should hold true when substituting the vectors for the words.
*Vector math* : (V_king - V_male) + V_female = V_queen
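To make the vector math concrete, here is a minimal sketch using made-up three-dimensional vectors. The values, the axis meanings, and the helper names (`toy_embeddings`, `cosine_similarity`, `nearest_word`) are assumptions for illustration only; real embeddings have hundreds of dimensions and come from a trained lookup table.

```python
import numpy as np

# Hypothetical hand-picked vectors; the assumed axes are
# [royalty, maleness, femaleness]. These are NOT real embeddings.
toy_embeddings = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "male":   np.array([0.1, 0.9, 0.1]),
    "female": np.array([0.1, 0.1, 0.9]),
    "apple":  np.array([0.0, 0.2, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(vector, embeddings, exclude=()):
    """Return the word whose vector points most nearly in the same direction."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# (V_king - V_male) + V_female
result = toy_embeddings["king"] - toy_embeddings["male"] + toy_embeddings["female"]

# Exclude the input words so they cannot be returned as the answer.
closest = nearest_word(result, toy_embeddings, exclude=("king", "male", "female"))
print(closest)
```

With these toy values the result lands closest to the vector for queen, but only approximately, which mirrors the caveat that the relationship is not always strictly true.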
This relationship isn’t always strictly true, and it depends on how the word embeddings were created. For a great environment to create your own word embeddings and explore the relationships between words, check out this Colab Notebook made available by the TensorFlow Team.
One key thing to understand about a word embedding is that it is a vector, and a vector is a list of numbers that describes both the orientation and the length (i.e. magnitude) of a line.
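As a small illustration of what length and orientation mean, the sketch below computes both for an arbitrary two-number vector. The vector `[3, 4]` is an assumption chosen only because the arithmetic is easy to check by hand.

```python
import numpy as np

# An arbitrary 2-dimensional vector (x = 3, y = 4), chosen for illustration.
v = np.array([3.0, 4.0])

# Length (magnitude): the Euclidean norm, sqrt(3^2 + 4^2).
magnitude = np.linalg.norm(v)

# Orientation: the angle between the vector and the positive x-axis, in degrees.
angle_deg = np.degrees(np.arctan2(v[1], v[0]))

print(magnitude, angle_deg)
```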
A vector can be plotted on a graph (if there are 3 or fewer numbers in the list). The following is a vector with two numbers (for the x and y coordinates).
A collection of vectors that all start from the same position will define a shape.
The techniques that produce these embedding vectors do so by incrementally altering the direction and length of the vector until all of the vectors have an optimal orientation and length, which is determined by whatever learning task is used to train the vectors. The end result is that vectors that are semantically similar are either positioned with small angles between them or are related through linear combinations with other vectors. Together, the collection of vectors defines a space with a shape and orientation.
Note: Since the surface is not defined by an infinite number of vectors (because the number of words is not infinite), the surface should not be thought of as a solid continuous surface, but instead as a collection of points.
The shape defined by this collection of vectors is important since models that consume these vectors typically either traverse the surface of this shape (as is the case with recurrent neural networks such as the LSTM) or perform correlation analysis between components that define the shape (as is the case with convolutional neural networks). In other words, the information exploited by these models is embedded in the shape and structure of the space. If we change the shape of the space, then we alter the information embedded in that space.
Returning now to the problem at hand, merging embedding tables for different languages introduces the problem we call word collisions. Any two languages (so long as they use a similar or the same alphabet) will very likely contain at least some overlapping words, and these words may or may not have the same semantic meaning.
We’ve taken publicly available MUSE word embeddings for eight languages and computed the frequencies of word collisions across different combinations and found that as much as 22.5% of the total combined lexicon may collide when merging embeddings.
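One way such a collision frequency could be computed is sketched below. The exact definition behind the figure quoted above is not spelled out here, so this assumes the rate is the number of words that appear in more than one vocabulary divided by the size of the combined lexicon. The function name `collision_rate` and the tiny vocabularies are hypothetical.

```python
from collections import Counter

def collision_rate(*vocabularies):
    """Fraction of the combined lexicon found in more than one vocabulary.

    Assumed definition: colliding words / distinct words across all inputs.
    """
    # Count, for every distinct word, how many vocabularies contain it.
    counts = Counter(word for vocab in vocabularies for word in set(vocab))
    if not counts:
        return 0.0
    colliding = sum(1 for c in counts.values() if c > 1)
    return colliding / len(counts)

# Tiny made-up vocabularies; real lookup tables hold many thousands of words.
english = {"son", "champ", "vent", "fox", "fence"}
french = {"son", "champ", "vent", "renard", "fleurs"}

rate = collision_rate(english, french)
print(f"{rate:.1%} of the combined lexicon collides")
```

With real embedding files, the vocabularies would simply be the sets of words read from each language's lookup table.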
The high frequency of word collisions is problematic and raises the question: why are the rates so high? The following sections identify three contributing reasons.
The first is interlingual homographs: words that are spelled the same but have different meanings in different languages. Take for example the words son, champ, and vent. In English, these words are readily identified along with their meanings: a son is the male offspring of a parent, a champ is a victorious individual, and a vent is a passage through which air may flow. In French, however, these words take on very different meanings.
The important characteristic we are concerned with here is that the usage of the words between the languages will be different. This is important because the strategies that are employed to compute their word embeddings largely depend on how these words are used. Another way to think about this is that even though they are spelled the same, these words will regularly be found in completely different types of sentences between English and French.
For example, in English, you might read something like,
“The champ took down all of their opponents with ease.”
But in French, we might read:
“Le champ était rempli de fleurs” (The field was filled with flowers)
The words surrounding the keyword champ are very different, and these words are generally those used to determine what the embedding vector should be for champ.
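The difference in surrounding words can be shown with a short sketch that collects the neighbors of champ in each example sentence. The helper `context_words` and the window size of two are assumptions for illustration; techniques such as Word2Vec slide a context window like this over very large corpora rather than over single sentences.

```python
def context_words(sentence, keyword, window=2):
    """Collect the words within `window` positions of each keyword occurrence."""
    tokens = sentence.lower().rstrip(".").split()
    contexts = set()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            # Words before the keyword plus words after it.
            contexts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return contexts

english_context = context_words(
    "The champ took down all of their opponents with ease.", "champ")
french_context = context_words("Le champ était rempli de fleurs", "champ")

# Context words that the two usages have in common.
shared = english_context & french_context
print(english_context, french_context, shared)
```

The two context sets have nothing in common, which is why the embedding learned for champ in one language carries no information about its usage in the other.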
This phenomenon can also occur within the same language to produce homographs. However, this problem is dealt with during the training process for the language-specific embeddings.
The second is loanwords. A loanword is a word that is directly borrowed from another language. Many languages commonly borrow from English, and loanwords appear frequently in non-English text, both internally at Zendesk and in public datasets such as the MUSE word embeddings.
The third is unwanted data. When building language-specific datasets at scale, unwanted data will nearly always find its way in. Unwanted data may contain words and phrases that are not actually part of the specified language, such as brand names or highly recognizable phrases. At Zendesk, we might find things like English text wrapping non-English text in an email as part of a standard email signature, or parts of non-English emails that use small amounts of English for one reason or another. For example:
Bonjour,
Nous vous remercions de votre confiance et espérons que vous reviendrez.
Merci,
Le Français
Standard Confusing Disclaimer: Le Français does not condone jiggery-pokery nor acquiesce the ukase of illegitimate or unlawful businesses. The views and opinions expressed in these musings are those of the author and do not necessarily reflect the official policy and positions of Le Français.
Coping with information loss
In the final section of this post, we’ll discuss the problem of information loss when merging the embedding tables. In the next post, we’ll discuss how we approached preserving information.
Word collisions cannot be ignored if we intend to merge embedding tables, since the colliding words potentially encode a great deal of important information. There are a number of ways to deal with this problem, but they all boil down to deciding how to resolve the word-to-vector mapping back to a 1:1 relationship. The most naïve approach, and the one we chose to perform experiments with, is to simply average the colliding vectors together.
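A minimal sketch of this averaging approach follows. It assumes each embedding table is a plain dictionary mapping a word to a NumPy vector of the same dimension; the function name `merge_by_averaging` and the two-dimensional toy vectors are hypothetical and not taken from our production code.

```python
import numpy as np

def merge_by_averaging(*tables):
    """Merge word -> vector tables; colliding words get the mean of their vectors."""
    collected = {}
    for table in tables:
        for word, vector in table.items():
            # Gather every vector that any table assigns to this word.
            collected.setdefault(word, []).append(np.asarray(vector, dtype=float))
    # Words seen once keep their vector; words seen more than once are averaged.
    return {word: np.mean(vectors, axis=0) for word, vectors in collected.items()}

# Toy tables with made-up 2-dimensional vectors; "champ" collides.
english = {"champ": np.array([1.0, 0.0]), "fox": np.array([0.9, 0.2])}
french = {"champ": np.array([0.0, 1.0]), "renard": np.array([-0.5, 0.5])}

merged = merge_by_averaging(english, french)
print(merged["champ"])
```

The merged table is back to one vector per word, but the vector for champ now sits at a position that neither language's embedding assigned to it.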
Given this approach, there are two important questions:
1. What happens to the shape of the space when we combine collections of embedding vectors from different languages?
2. What happens to the information when we average the vector collisions together?
Word embeddings for different languages are often pre-trained separately from the models they are used with. At Zendesk, the six languages that are supported by Answer Bot are supported through the serving of six individual models, each shipped with its own pre-trained word embeddings. Thus, for each individual model, the shape and orientation of the embeddings are unique to that model.
Ignoring the collisions for a moment, we can first consider what happens when we combine the embeddings into a single table. The models that consume this combined embedding space will need to deal with two sets of points that each have their own relationships encoded in their shape. In practice, this is ok, since only certain subsets of words will be consumed at a time. For example, consider the following two sentences:
The fox jumped over the fence. (English)
El zorro saltó la cerca. (Spanish translation)
If the English and Spanish word embedding tables are combined for this set of words, there are no collisions, so the model may as well consume their vectors from two independent word embedding tables. Furthermore, the words from these sentences don’t interact with one another, so the vectors for one language are only ever combined with other vectors from the same language.
The words The and fox will only ever be consumed together, as is the case for El and zorro. Typically (and this holds at Zendesk), we would not observe sequences like The zorro saltó the fence. In this way, the underlying model need only adapt to consuming vectors from each independent space. In this case, there is no information loss; just a hurdle for the model to overcome.
If we add the collisions back, we encounter a different problem. When averaging two or more points used to define the shape of the embedding space, we modify the very thing that contains the information the model needs.
Visualizing the information lost when averaging collisions
We can visualize the consequence of merging independently learned embeddings by creating simulated embeddings and merging together randomly selected common indices. In the following code, we’ve used NumPy’s multivariate normal sampling function to create blobs of points that mimic embedding manifolds in two dimensions.
Note: These are not real embeddings, and the space between them has been exaggerated.
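import numpy as np

def make_blob(mean, cov, size):
    # Draw `size` 2-D points from a multivariate normal distribution.
    x, y = np.random.multivariate_normal(mean, cov, size).T
    # Stack the coordinates back into a (size, 2) array of points.
    return np.hstack((x.reshape(-1, 1), y.reshape(-1, 1)))

# Component means for each blob (mirrored across the y-axis).
means1 = [[-10, 10], [-15, 5], [-9, 3]]
means2 = [[10, 10], [15, 5], [9, 3]]

# One covariance matrix per component.
covs1 = [
    [[4, 0], [0, 3]],
    [[2, 0], [0, 3]],
    [[2, 0], [0, 3]]]
covs2 = [
    [[4, 0], [0, 3]],
    [[3, 0], [0, 3]],
    [[2, 0], [0, 5]]]

# Each blob combines three 500-point samples.
blob_1 = np.vstack(
    [make_blob(m, c, 500) for m, c in zip(means1, covs1)])
blob_2 = np.vstack(
    [make_blob(m, c, 500) for m, c in zip(means2, covs2)])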
This snippet of code takes three independent samples, each with a slightly different mean and covariance matrix, and combines them to create a blob of points that somewhat resembles typical word embeddings flattened to two dimensions.
The green and blue blobs are hypothetical embeddings, where the positions of all the individual points define their shape and structure, and thus their information. Each white point represents a word collision, where a white point amongst the green blob has a sister white point somewhere amongst the blue blob. Notice that in this case, the blobs don’t overlap, and their orientations are not the same. These are independent hypothetical embeddings.
If we wish to merge these two embedding spaces, we can see the effect by observing the movement of the white points after merging.
In the case of merging these hypothetical embeddings by averaging, the common white points are pulled from their respective green and blue blobs and placed in some arbitrary space between. When these points move out of their original positions, the information encoded in that positioning is lost. With word embeddings, that information corresponds to critical semantic information necessary for natural language understanding.
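The movement of the colliding points can also be quantified. The sketch below is a self-contained approximation of the experiment: it samples two single-component blobs (simpler than the three-component blobs used earlier), picks 50 random indices to act as shared words, averages them, and compares how far they moved against the spread of a blob. The seed, blob parameters, and number of collisions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Two hypothetical, non-overlapping 2-D "embedding" blobs.
blob_a = rng.multivariate_normal([-10, 8], [[4, 0], [0, 3]], size=500)
blob_b = rng.multivariate_normal([10, 8], [[4, 0], [0, 3]], size=500)

# Pretend 50 randomly chosen row indices are words shared by both tables.
shared = rng.choice(500, size=50, replace=False)

# Naive merge: each colliding word gets the midpoint of its two vectors.
averaged = (blob_a[shared] + blob_b[shared]) / 2

# Distance each colliding point travelled from its original position.
moved_a = np.linalg.norm(averaged - blob_a[shared], axis=1)
moved_b = np.linalg.norm(averaged - blob_b[shared], axis=1)

# Mean distance of blob A's points from its own centroid, for scale.
spread_a = np.linalg.norm(blob_a - blob_a.mean(axis=0), axis=1).mean()

print(f"mean displacement: {moved_a.mean():.2f}, blob spread: {spread_a:.2f}")
```

In this exaggerated setup the averaged points travel several times farther than the typical spread of a blob, so they end up well outside either original shape.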
This naïve approach, while mathematically convenient, is destructive to the semantic organization of the embedding vectors. In the following post, we’ll introduce methods we employed to mitigate the loss of information when averaging points together.