A wordvector is a vector representing the semantic meaning of a word. A relationship between words (such as similarity) can be defined numerically using wordvectors. Your average wordvector these days is trained by looking at the context of the word you are trying to represent. In other words, the vector of a target word should be close to the vectors of the words commonly surrounding the target word. However, wordvectors generated by these means do not take into account latent hierarchical relationships between words. In this post, I’ll explain how wordvectors can encode hierarchical information through Poincaré embeddings.
(This post is based on a research paper from Facebook AI Research: https://arxiv.org/pdf/1705.08039.pdf)
Representing Hierarchical Information in Hyperbolic Space
Let’s look at a small domain of words: crimson, aquamarine, color, blue, scarlet, cerulean, red. There’s a noticeable hierarchy to these words. Let’s visualize this hierarchy as a tree:
We want to associate a vector with each of these words, along with a distance metric between the vectors. Intuitively, a word in a category is similar to the category it belongs to, and words in different categories are not similar to each other. Therefore, the embedding should place 1) a child's vector close to its parent's vector and 2) vectors under different parents far away from each other. Following these rules, here is a geometric representation of the example's wordvectors on a 2D Euclidean surface:
If we place ‘Color’ at the origin, here are the vectors we could use to represent the example domain: Color: (0, 0), Red: (-1, 0), Blue: (1, 0), Crimson: (-2, 1), Scarlet: (-2, -1), Cerulean: (2, 1), Aquamarine: (2, -1).
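We can sanity-check that these hand-picked coordinates satisfy both rules. A minimal sketch (the `vectors` dict and `euclidean` helper are just illustrative names, not from the paper):

```python
import math

# Toy 2D Euclidean embedding from the example above.
vectors = {
    "Color":      (0, 0),
    "Red":        (-1, 0),
    "Blue":       (1, 0),
    "Crimson":    (-2, 1),
    "Scarlet":    (-2, -1),
    "Cerulean":   (2, 1),
    "Aquamarine": (2, -1),
}

def euclidean(u, v):
    return math.dist(vectors[u], vectors[v])

# Rule 1: a child sits close to its parent...
print(euclidean("Red", "Crimson"))       # ≈ 1.41
# Rule 2: ...while words under different parents are farther apart.
print(euclidean("Crimson", "Cerulean"))  # 4.0
```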
But, how can we construct these vectors generally? Let’s try representing a hierarchy with k levels where each non-leaf node has b branches on a 2D Euclidean surface.
For just a graph with a branching factor of 4 and 2 levels, we can see that we are already running out of space at the leaf nodes. Leaves of different branches are getting close to each other, which we don't want! Even if we increase the dimension of the Euclidean surface from 2D to 3D, it won't take much for us to run out of space again. The amount of space we have at each level in 2D is analogous to the circumference of a circle: it grows linearly with the level (circumference = 2πr). In 3D, the amount of space grows with the square of the level (surface area of a sphere = 4πr²). In N-dimensional Euclidean space, the room available at level L grows proportionally to Lᴺ⁻¹. However, the number of nodes at each level is not polynomial, it's exponential: there are bᴸ nodes at level L, where b is the branching factor. So on a Euclidean surface, the number of nodes at each level eventually outgrows the amount of space available at that level.
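A quick numeric sketch makes the mismatch concrete (the branching factor and dimension are assumed values, chosen just for illustration):

```python
# Exponential tree growth vs. polynomial Euclidean "room".
b, N = 4, 3  # branching factor, Euclidean dimension (assumed for illustration)

for L in range(1, 9):
    nodes = b ** L        # nodes at level L: exponential
    space = L ** (N - 1)  # surface room at radius L: polynomial
    print(f"level {L}: {nodes} nodes vs. space ~{space}")
```

By level 8 the tree has 65,536 nodes competing for room that has only grown to ~64 units, so crowding is inevitable no matter how the vectors are arranged.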
So let's throw Euclidean geometry away. We want a surface on which distances grow exponentially the farther we get from the center. This is exactly the Poincaré ball from hyperbolic geometry. The figure below shows a hierarchy embedded in a 2D Poincaré ball.
In the disk above, each of the line segments has the same Poincaré distance. Distances grow exponentially as we move away from the center, so no finite chain of segments ever reaches the circumference of the circle. Therefore, we can fit an arbitrary number of levels in hyperbolic space with significantly fewer dimensions than in Euclidean space.
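The distance function for the Poincaré ball is the standard one used in the paper: d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). A minimal sketch, using only the standard library:

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball
    (Poincaré ball model of hyperbolic space)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v)))

# The same Euclidean gap costs far more hyperbolic distance near the rim:
print(poincare_distance((0.0, 0.0), (0.1, 0.0)))    # near the center
print(poincare_distance((0.85, 0.0), (0.95, 0.0)))  # same gap near the rim
```

This is the "exponential room" property in action: points near the boundary can be packed densely in Euclidean terms while remaining far apart hyperbolically, which is where the leaves of a deep hierarchy end up.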
Optimizing Embeddings in Hyperbolic Space
Now that we know the surface on which we are defining our wordvectors, let’s define how to evaluate the quality of these vectors. We use a variant on the log softmax function to define the loss function:
Θ is our set of wordvectors. D is the set of observed hierarchical relationships (e.g. (Color, Blue) or (Red, Crimson)). N(u) is a set of k randomly sampled words (k is a constant parameter) that have no observed relationship with u (e.g. (Blue, Crimson) and (Blue, Red)). A good set of wordvectors minimizes this loss function, because words that have hierarchical relationships should be close to each other (the loss function's numerator) and words that don't should be farther away from each other (the loss function's denominator). This technique of weighing observed pairs against random nonexistent pairs is called negative sampling.
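The loss can be sketched directly from that description. This is a simplified reading of the paper's objective (the `pairs` and `negatives` structures are hypothetical names for D and N(u); the related word itself is included in the denominator, as a softmax over candidates):

```python
import math

def poincare_distance(u, v):
    # Standard Poincaré-ball distance (redefined here for self-containment).
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v)))

def loss(theta, pairs, negatives):
    """Negative-sampling loss: reward small distance for observed pairs
    (u, v) in D, relative to the sampled unrelated words in N(u)."""
    total = 0.0
    for u, v in pairs:
        num = math.exp(-poincare_distance(theta[u], theta[v]))
        den = num + sum(
            math.exp(-poincare_distance(theta[u], theta[w]))
            for w in negatives[u]
        )
        total -= math.log(num / den)  # minimized when num dominates den
    return total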
To minimize a loss function, we usually use gradient descent. In essence, gradient descent initializes a random input x for the function F(x) and repeatedly updates x in the direction of the negative gradient, where the gradient is the vector of partial derivatives with respect to each dimension. After updating x over many iterations, we arrive at an x that (locally) minimizes F(x). However, we can't just use the Euclidean gradient on hyperbolic space. Think of the gradient as living on a tangent plane: since the tangent plane is different at every point of a hyperbolic surface, the Riemannian metric tensor tells us how to relate vectors on the tangent plane to the hyperbolic surface. Using this metric tensor, we can rescale the Euclidean gradient to get the Riemannian gradient. The updates on the initial vectors will be in the following form.
θ is a wordvector, η is our step size, and L is our loss function. We'll start by associating every word in our domain with an initial vector whose elements are drawn uniformly at random from (−0.001, 0.001). From there, we update the vectors with the negative Riemannian gradient of the loss function. After substituting the Riemannian gradient with the rescaled Euclidean gradient, the full formula looks like:
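One update step can be sketched as follows. The rescaling factor (1 − ‖θ‖²)² / 4 is the inverse of the Poincaré ball's metric tensor, and the projection step (keeping vectors strictly inside the unit ball after an update) follows the paper; the function names are my own:

```python
import math

def project(theta, eps=1e-5):
    """If an update pushed theta outside the unit ball, pull it back
    just inside the boundary (the distance formula needs ||theta|| < 1)."""
    norm = math.sqrt(sum(x * x for x in theta))
    if norm >= 1:
        theta = tuple(x * (1 - eps) / norm for x in theta)
    return theta

def riemannian_update(theta, euclidean_grad, lr=0.01):
    """One step of Riemannian SGD: rescale the Euclidean gradient by
    (1 - ||theta||^2)^2 / 4, step against it, then project into the ball."""
    sq_norm = sum(x * x for x in theta)
    scale = (1 - sq_norm) ** 2 / 4
    new_theta = tuple(t - lr * scale * g
                      for t, g in zip(theta, euclidean_grad))
    return project(new_theta)
```

Note that the scale factor shrinks toward zero near the boundary, so points deep in the hierarchy (near the rim) move cautiously, while points near the center can move freely.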
A bunch of update iterations later, ta-da: we have our Poincaré embeddings!
I didn't fully derive the math in the end, but I hope you all understood the intuition behind Poincaré embeddings. Words in a domain often have an underlying hierarchy, and we can add this hierarchical information to the numerical representation of a word. Hierarchies are often trees, where the number of nodes grows exponentially with the level, so we want a space whose distances grow exponentially as we move away from the "origin". The Poincaré ball has this property, so we can define our vectors in this space. We define a loss function that rewards clustering vectors in the same hierarchy and spreading out vectors in different hierarchies. Finally, we find vectors that minimize our loss function using Riemannian gradient descent and get a good numerical representation of our words.