The science behind consolidating Answer Bot production models: Part 2

Paul Gradie
Oct 4, 2019

In the previous post (Part 1), we introduced the challenge of consolidating the six production models that provide multilingual support for Answer Bot at Zendesk into a single model.

Consolidating these models both simplifies our deployment process for new models and provides a potentially more scalable path for adding new language support in the future. We considered several possible research avenues and ultimately chose to investigate ways to merge the pre-trained embedding lookup tables of each language. In this post, we’ll talk about what happens to the information contained in the word embeddings if they are naïvely merged.

A note for interested readers who do not come from a technical background: the remainder of the series is somewhat technical. However, you don’t need to understand code in order to follow the posts, and you are cordially invited to continue reading. For those with technical prowess, all of the code you’ll see is written in Python.

To begin our discussion, we’ll briefly review word embeddings. If you’re already familiar with them, feel free to skip ahead to the two sections that follow.

A brief review of word embeddings

In practice, we want to produce embedding vectors whose organization in n-dimensional space encodes the semantic or linguistic relationships between the words they represent. One of the most interesting consequences of learning these vectors is that they tend to exhibit linear relationships that correspond to semantic relationships, despite not having been trained to do so³. For example, Word2Vec, one of the most well-known techniques for creating word embeddings, is not trained with the specific objective of producing linear relationships between the embedding vectors; such relationships can nevertheless be found after training.

The following is a common example of what these relationships might look like.

*Word math*: (king - male) + female = queen

Each of the words in this bit of word math can be replaced with its vector from the embedding lookup table. If the linear relationships between the vectors closely approximate the semantic relationships between the words, then the formula above should hold true when substituting the vectors for the words.

*Vector math*: (V_king - V_male) + V_female = V_queen
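To make this concrete, here is a minimal sketch of the vector math using a hypothetical four-dimensional lookup table (real embedding tables are learned and have hundreds of dimensions; the values below are invented purely for illustration):

import numpy as np

# Hypothetical 4-dimensional embeddings. Real lookup tables map tens of
# thousands of words to learned vectors with hundreds of dimensions.
embeddings = {
    'king':   np.array([0.9, 0.8, 0.1, 0.3]),
    'queen':  np.array([0.9, 0.1, 0.8, 0.3]),
    'male':   np.array([0.1, 0.9, 0.0, 0.1]),
    'female': np.array([0.1, 0.2, 0.7, 0.1]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# (V_king - V_male) + V_female should land closest to V_queen
target = embeddings['king'] - embeddings['male'] + embeddings['female']
nearest = max(embeddings, key=lambda w: cosine_similarity(embeddings[w], target))
print(nearest)  # queen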

This relationship isn’t always strictly true, and it depends on how the word embeddings were created. For a great environment to create your own word embeddings and explore the relationships between words, check out this Colab Notebook made available by the TensorFlow Team.

Embedded Information

A vector can be plotted on a graph (provided it contains three or fewer numbers). The following is a vector with two numbers (its x and y coordinates).
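For instance, a minimal sketch of plotting the vector (3, 2) as an arrow from the origin, assuming matplotlib is available:

import matplotlib.pyplot as plt

# Draw the vector (3, 2) as an arrow starting at the origin
plt.quiver(0, 0, 3, 2, angles='xy', scale_units='xy', scale=1)
plt.xlim(0, 4)
plt.ylim(0, 3)
plt.show()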

A single vector plotted in 2D

A collection of vectors that all start from the same position will define a shape.

A collection of vectors define a shape

The techniques that produce these embedding vectors do so by incrementally altering the direction and length of the vector until all of the vectors have an optimal orientation and length, which is determined by whatever learning task is used to train the vectors. The end result is that vectors that are semantically similar are either positioned with small angles between them or are related through linear combinations with other vectors. Together, the collection of vectors defines a space with a shape and orientation.
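The notion of a “small angle” between semantically similar vectors can be computed directly. A minimal sketch with invented 2D vectors:

import numpy as np

def angle_between(v1, v2):
    # Angle in degrees between two vectors, from the cosine formula
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(angle_between(np.array([1.0, 2.0]), np.array([1.1, 2.1])))   # small angle: similar
print(angle_between(np.array([1.0, 2.0]), np.array([-2.0, 1.0])))  # 90 degrees: unrelated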

Note: Since the surface is not defined by an infinite number of vectors (because the number of words is not infinite), the surface should not be thought of as a solid continuous surface, but instead as a collection of points.

Vectors can represent words

The shape defined by this collection of vectors is important because models that consume these vectors typically either traverse the surface of this shape (as is the case with recurrent neural networks such as the LSTM) or perform correlation analysis between components that define the shape (as is the case with convolutional neural networks). In other words, the information exploited by these models is embedded in the shape and structure of the space. If we change the shape of the space, we alter the information embedded in that space.

Word collisions

We’ve taken publicly available MUSE word embeddings for eight languages and computed the frequencies of word collisions across different combinations and found that as much as 22.5% of the total combined lexicon may collide when merging embeddings.
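Counting collisions amounts to intersecting vocabularies. A minimal sketch, assuming the standard MUSE text format (a header line, then one word and its vector per line); the file names here are illustrative:

def load_vocab(path):
    # Collect the vocabulary from a MUSE-style text file
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the "count dimension" header line
        return {line.split(' ', 1)[0] for line in f}

english = load_vocab('wiki.multi.en.vec')
french = load_vocab('wiki.multi.fr.vec')

collisions = english & french
total = len(english | french)
print(f'{len(collisions)} collisions ({100 * len(collisions) / total:.1f}% of the combined lexicon)')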

Collision counts resulting from embedding table merging

The high frequency of word collisions is problematic and raises the question: why are the rates so high? The following sections identify three contributing reasons.

Interlingual Homographs

Interlingual homographs are words that are spelled identically in two or more languages but carry different meanings in each⁵.

The important characteristic we are concerned with here is that the usage of these words differs between languages. This matters because the strategies employed to compute word embeddings depend largely on how the words are used. Another way to think about this is that even though they are spelled the same, these words will regularly be found in completely different types of sentences in English and in French.

For example, in English, you might read something like,

“The champ took down all of their opponents with ease.”

But in French, we might read:

“Le champ était rempli de fleurs” (The field was filled with flowers)

The words surrounding the keyword champ are very different, and these words are generally those used to determine what the embedding vector should be for champ.
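A minimal sketch of the context words a Word2Vec-style trainer would pair with champ in each sentence (a deliberately simplified tokenizer, for illustration only):

def context_words(sentence, keyword, window=2):
    # Gather the words within `window` positions of the keyword: the
    # contexts a skip-gram style trainer would pair it with.
    tokens = sentence.lower().replace('.', '').split()
    i = tokens.index(keyword)
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

english = 'The champ took down all of their opponents with ease.'
french = 'Le champ était rempli de fleurs.'

print(context_words(english, 'champ'))  # ['the', 'took', 'down']
print(context_words(french, 'champ'))   # ['le', 'était', 'rempli']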

This phenomenon can also occur within the same language to produce homographs. However, this problem is dealt with during the training process for the language-specific embeddings.

Loanwords

Loanwords are words adopted directly from one language into another⁶. Because a loanword such as rendezvous belongs to the vocabulary of both the source language and the borrowing language, it produces a collision when the two embedding tables are merged.

Contamination

Training corpora for one language frequently contain fragments of another. Consider, for example, a French email that ends with an English boilerplate disclaimer:

Bonjour,
Nous vous remercions de votre confiance et espérons que vous reviendrez.
Merci,
Le Français

(Hello, We thank you for your trust and hope that you will return. Thank you, Le Français)

Standard Confusing Disclaimer: Le Français does not condone jiggery-pokery nor acquiesce to the ukase of illegitimate or unlawful businesses. The views and opinions expressed in these musings are those of the author and do not necessarily reflect the official policy and positions of Le Français.

Every English word in that disclaimer is a candidate for inclusion in a French vocabulary, and therefore a candidate collision when the French and English embedding tables are merged.

Coping with information loss

Word collisions cannot be ignored if we intend to merge embedding tables, since they potentially encode a great deal of important information. There are a number of ways to deal with this problem, but they all boil down to deciding how to resolve the word-to-vector mapping back to a 1:1 relationship. The most naïve approach, and the one we chose to perform experiments with, is to simply average the colliding vectors together.

Given this approach, there are two important questions:

1. What happens to the shape of the space when we combine collections of embedding vectors from different languages?

2. What happens to the information when we average the vector collisions together?

Word embeddings for different languages are often pre-trained separately from the models they are used with. At Zendesk, the six languages that are supported by Answer Bot are supported through the serving of six individual models, each shipped with its own pre-trained word embeddings. Thus, for each individual model, the shape and orientation of the embeddings are unique to that model.

Ignoring the collisions for a moment, we can first consider what happens when we combine the embeddings into a single table. The models that consume this combined embedding space will need to deal with two sets of points, each with its own relationships encoded in its shape. In practice, this is OK, since only certain subsets of words will be consumed at a time. For example, consider the following two sentences:

The fox jumped over the fence. (English)
El zorro saltó la cerca. (Spanish translation)

If the English and Spanish word embedding tables are combined for this set of words, there are no collisions, so the model may as well consume their vectors from two independent word embedding tables. Furthermore, the words from these sentences don’t interact with one another, so the vectors for one language are only ever combined with other vectors from the same language.

The words The and fox will only ever be consumed together, as is the case for El and zorro. Typically (and certainly at Zendesk), we would not observe mixed sequences like The zorro saltó the fence. In this way, the underlying model need only adapt to consuming vectors from each independent space. In this case, there is no information loss, just a hurdle for the model to overcome.

If we add the collisions back, we encounter a different problem. When averaging two or more points used to define the shape of the embedding space, we modify the very thing that contains the information the model needs.
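A minimal sketch of this naïve merge, assuming two hypothetical word-to-vector dictionaries: unique words are copied over unchanged, while colliding words receive the average of their two vectors:

import numpy as np

def merge_embeddings(table_a, table_b):
    # Naive merge: keep unique words as-is, average the collisions
    merged = {**table_a, **table_b}
    for word in table_a.keys() & table_b.keys():
        merged[word] = (table_a[word] + table_b[word]) / 2
    return merged

# Hypothetical 2D embeddings; 'champ' collides across the two tables
english = {'fox': np.array([-10.0, 10.0]), 'champ': np.array([-9.0, 3.0])}
french = {'fleur': np.array([10.0, 10.0]), 'champ': np.array([9.0, 3.0])}

merged = merge_embeddings(english, french)
print(merged['champ'])  # [0. 3.] -- pulled away from both original positions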

Visualizing the information lost when averaging collisions

Note: These are not real embeddings, and the space between them has been exaggerated.

import numpy as np

def make_blob(mean, cov, size):
    # Sample `size` 2D points from a multivariate normal distribution
    x, y = np.random.multivariate_normal(mean, cov, size).T
    return np.hstack((x.reshape(-1, 1), y.reshape(-1, 1)))

# Means and covariance matrices for the two hypothetical embedding tables
means1 = [[-10, 10], [-15, 5], [-9, 3]]
means2 = [[10, 10], [15, 5], [9, 3]]

covs1 = [
    [[4, 0], [0, 3]],
    [[2, 0], [0, 3]],
    [[2, 0], [0, 3]]]

covs2 = [
    [[4, 0], [0, 3]],
    [[3, 0], [0, 3]],
    [[2, 0], [0, 5]]]

# Stack three samples per table to form each blob of points
blob_1 = np.vstack(
    [make_blob(m, c, 500) for m, c in zip(means1, covs1)])
blob_2 = np.vstack(
    [make_blob(m, c, 500) for m, c in zip(means2, covs2)])

This snippet of code draws three independent samples per blob, each with a slightly different mean and covariance matrix, and stacks them to create blobs of points that somewhat resemble typical word embeddings flattened to two dimensions.

Two hypothetical embedding tables, white dots are simulated collisions

The green and blue blobs are hypothetical embeddings; the positions of all the individual points define their shape and structure, and thus their information. Each white point represents a word collision: a white point amongst the green blob has a sister white point somewhere amongst the blue blob. Notice that in this case the blobs don’t overlap, and their orientations are not the same. These are independent hypothetical embeddings.

If we wish to merge these two embedding spaces, we can see the effect by observing the movement of the white points after merging.
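A sketch of how that movement can be visualized, reusing blob_1 and blob_2 from the snippet above and treating the first 50 points of each blob as the simulated collisions (matplotlib assumed):

import matplotlib.pyplot as plt

# Average the simulated collision pairs, as in the naive merge
k = 50
averaged = (blob_1[:k] + blob_2[:k]) / 2

plt.scatter(blob_1[:, 0], blob_1[:, 1], s=4, c='green', alpha=0.4)
plt.scatter(blob_2[:, 0], blob_2[:, 1], s=4, c='blue', alpha=0.4)
plt.scatter(averaged[:, 0], averaged[:, 1], s=12, c='white', edgecolors='black')
plt.title('Averaged collisions land between the two embedding spaces')
plt.show()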

The result of averaging vectors for simulated collisions for two 2D hypothetical embedding tables.

In the case of merging these hypothetical embeddings by averaging, the common white points are pulled out of their respective green and blue blobs and placed in some arbitrary space between them. When these points move out of their original positions, the information encoded in that positioning is lost. With word embeddings, that information corresponds to critical semantic information necessary for natural language understanding.

This naïve approach, while mathematically convenient, is destructive to the semantic organization of the embedding vectors. In the following post, we’ll introduce methods we employed to mitigate the loss of information when averaging points together.

References

  1. https://en.wikipedia.org/wiki/Word2vec
  2. https://arxiv.org/pdf/1901.09813.pdf
  3. https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb
  4. https://github.com/facebookresearch/MUSE
  5. https://en.wikipedia.org/wiki/Interlingual_homograph
  6. https://en.wikipedia.org/wiki/Loanword
