Seeing Meaning

Will Reed · Published in Cisco Emerge · Aug 14, 2017

How we encoded semantic representations of text as images and what that means for understanding language

The text of this blog post, encoded as an image

Background
Embeddings encode the meaning of a word as a list of numbers. Those numbers represent a point in a high-dimensional space, and similar words sit near each other in that space. We’ve had some success using these embeddings as inputs when training models for our NLP tasks.
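For a concrete feel, here is a minimal sketch of that idea using Gensim’s packaged 50-dimensional GloVe vectors (an illustrative choice, not necessarily the exact setup we used):

```python
# Each word maps to a point in a high-dimensional space, and nearby
# points tend to carry similar meanings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained 50-dim GloVe

print(vectors["fox"][:5])                   # first few of the 50 numbers
print(vectors.similarity("king", "queen"))  # high: nearby in the space
print(vectors.similarity("king", "hedge"))  # low: far apart
```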

Objective
We wanted to see if we could get a deeper insight into why this works by encoding these embeddings as images. So, given a block of text, generate an image that encodes the meaning of that text. In principle, this would turn a semantic understanding problem into an object recognition problem. We considered that this might not be much different than simply training on a sequence of embeddings. It could turn out there is no advantage to transforming floats in an embedding vector into pixels in an image — but we wanted to find out. Sometimes looking at a problem from a different angle can lead to important insights.

Experiment
So we ran a quick experiment. The first constraint is that text of any length should be encodable as a fixed-size image. This is something like adding zero padding to a sequence when training a neural network. The second requirement is that we should be able to decode the text from the image at a later time.

We took a straightforward approach and pictured the sentence as an M x N matrix, where M is the number of words and N is the number of dimensions in the embedding vector. In a future experiment, we might try generating the sentence vectors using an autoencoder, but we wanted to try something simple first. So given a sentence like “the fox jumped over the hedge” and a 50-dimensional word vector, we end up with a 6 x 50 matrix.
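In code, that matrix is just the word vectors stacked row by row (a sketch, reusing the `vectors` loaded above):

```python
import numpy as np

sentence = "the fox jumped over the hedge".split()

# One row per word, one column per embedding dimension: 6 x 50.
matrix = np.stack([vectors[w] for w in sentence])
print(matrix.shape)  # (6, 50)
```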

If we stuck with that method, we’d end up with as many matrix shapes as we had sentence lengths in our training set, which fails to satisfy our first requirement. To solve this, we interpolated vectors between the words so that we would always end up with a fixed-size matrix. This creates a kind of semantic topology, where the height of the terrain corresponds to the value of a particular embedding dimension.
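A sketch of the interpolation step; the target of 32 rows is an arbitrary illustrative choice, since any fixed size works:

```python
import numpy as np

def to_fixed_size(matrix, target_rows=32):
    """Resample an (n_words x n_dims) matrix to (target_rows x n_dims)
    by linearly interpolating along the word axis."""
    n_words, n_dims = matrix.shape
    old_x = np.linspace(0.0, 1.0, n_words)
    new_x = np.linspace(0.0, 1.0, target_rows)
    # Interpolate each embedding dimension (column) independently.
    return np.stack(
        [np.interp(new_x, old_x, matrix[:, d]) for d in range(n_dims)],
        axis=1)
```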

Semantic topology for a sentence

We can represent this topology as a 2D image. Here is what that looks like for “the fox jumped over the hedge” (using the pre-trained 50-dimensional GloVe embeddings). The colors represent the intensity of the weight at each dimension in the embedding vector.

“The fox jumped over the hedge”
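Rendering the matrix as an image takes only a couple of lines; a sketch continuing from the snippets above (the color map is an arbitrary choice):

```python
import matplotlib.pyplot as plt

img = to_fixed_size(matrix)  # the fixed-size semantic "terrain"

plt.imshow(img.T, aspect="auto", cmap="viridis")
plt.xlabel("position in sentence (interpolated)")
plt.ylabel("embedding dimension")
plt.colorbar(label="embedding weight")
plt.show()
```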

Questions
This is sort of interesting on its own, but it made us ask the question — “What words, if any, do those interpolated vectors correspond to?” Another way to think of it is to picture the higher dimensional space as stars embedded in a galaxy. Interpolating between two words is like traveling from one star to another and creating new stars at various points along the path. You can then ask, “What kind of stars are these?” — or, in our case, “What words do these points correspond to?”

We used Gensim to return the nearest word to the interpolated vector. Our hypothesis was that it would return a word that was semantically somewhere between the original points. So, if the word pair was king and duke, we might encounter a word like prince. At 300 dimensions, we actually found that the interpolated vectors were always nearest either king or duke. This wasn’t what we expected. Our best guess was that the hyperspace was too sparse to see this effect. Going back to the star analogy, in a real galaxy it is highly unlikely that you would pass through an intervening star on your way from star A to star B. The average density of a typical galaxy is just too low.
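A sketch of that lookup; `words_along_path` is an illustrative helper of ours, and `similar_by_vector` is Gensim’s nearest-neighbour query:

```python
import numpy as np

def words_along_path(vectors, start, end, steps=7):
    """Print the nearest vocabulary word at evenly spaced points
    between two word vectors."""
    a, b = vectors[start], vectors[end]
    for t in np.linspace(0.0, 1.0, steps):
        point = (1.0 - t) * a + t * b
        word, score = vectors.similar_by_vector(point, topn=1)[0]
        print(f"t={t:.2f}: {word} ({score:.3f})")

words_along_path(vectors, "king", "duke")
```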

First, we tried to fix this by linearly scaling down the embedding coordinates. This didn’t help: no matter how far we condensed the embedding hyperspace, the points kept the same relative distances. What did work was lowering the dimensionality of the embedding space. Instead of 300 dimensions, we tried 50. Sure enough, we started seeing the effect we expected, at least between very different words. For example, you pass through timid on your way from dog to reticent. We then decomposed the 50-dimensional vectors further to 10 dimensions using t-SNE, and the effect became more apparent. Between dog and prince, you pass through idol, legendary, famous and king.
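A sketch of the dimensionality reduction step; note that scikit-learn’s default Barnes-Hut t-SNE only supports up to three output dimensions, so the exact method is needed for ten. The vocabulary subset size is an illustrative choice to keep the quadratic exact method tractable:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# Reduce a vocabulary subset from 50 to 10 dimensions.
vocab = list(vectors.index_to_key[:5000])
low_dim = TSNE(n_components=10, method="exact").fit_transform(
    np.stack([vectors[w] for w in vocab]))

# Nearest-word lookups now happen in the reduced space.
nn = NearestNeighbors(n_neighbors=1).fit(low_dim)
a, b = low_dim[vocab.index("dog")], low_dim[vocab.index("prince")]
for t in np.linspace(0.0, 1.0, 7):
    _, idx = nn.kneighbors([(1 - t) * a + t * b])
    print(vocab[idx[0][0]])
```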

Interpolated points between word embeddings

We see the opposite effect when we give the algorithm a long block of text. If we pass the Wikipedia article on Cisco to the algorithm, the interpolation performs a kind of compression.

Excerpt from the Wikipedia article on Cisco Systems

The real test is to see if these images are actually encoding the meaning of the words in a way that can be decoded later. If it’s working, we should be able to train a model that can decode a new sentence from an image, even if that sentence was never encountered in the training set. The model should even be able to see word similarities. It should learn that prince looks a lot like king.

Results
We approached this as an object recognition problem. As a proof of concept, we trained a Convolutional Neural Net (CNN) to recognize the word emerge in a corpus.
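For flavour, here is a hedged sketch of what such a proof of concept might look like in Keras; the layer sizes, the 32 x 50 input shape, and the binary “word present” target are illustrative assumptions, not the exact architecture we trained:

```python
import tensorflow as tf

# Binary classifier: does this encoded-text image contain the target word?
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           input_shape=(32, 50, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(images[..., None], labels, ...)  # images: encoded text matrices
```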

Object recognition of “emerge” in the encoded image

The results are very preliminary, but the model was able to find the word in images in the test set. This is just an experiment. We would need more time and data to explore the idea fully, but the fact that it works at all is pretty remarkable.

About Cisco Emerge

At Cisco Emerge, we are using the latest machine learning technologies to advance the future of work.
Find out more on our website.
