Paper Summary: DeViSE: A Deep Visual-Semantic Embedding Model

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/21, with better formatting.

DeViSE: A Deep Visual-Semantic Embedding Model (2013) Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, Tomas Mikolov

When I first heard about this paper in Jeremy Howard’s course it kind of blew my mind because of the freewheeling way it mixes words and images. One of the most inexplicable and definitively human experiences — yet so common as to be part of the background, like water for fish — is the integration and solidity of experiencing a thing through multiple senses. We see a ripe strawberry and can imagine the sweetness and tanginess, the floral aroma, the feeling of the seeds on the tongue. Then when we bite in, we actually taste it — sweeter than expected, more earthy — and the strawberry image is somehow reinforced. These sensory streams are consolidated into a single experience of a single object and simultaneously an instance of a class of objects. This is not what this paper is about. But there is a little hint of this multi-sensory magic that suggests more magic to come. An existence proof of a kind of artificial synesthesia.

The problem the paper addresses is really quite straightforward. Existing image classifiers are restricted to a fixed set of output categories. These categories are either entirely discrete or, if they are interrelated, interrelated only in rigidly defined ways. Classification is brittle and the errors are often nonsensical and arbitrary. The idea in this paper is to find a way to embed the ImageNet labels into a much larger, semantically and syntactically structured space. Fortunately we now have such spaces at our disposal thanks to word2vec and friends.

The deep visual-semantic embedding (DeViSE) model is trained in three phases. First, a standard skip-gram word2vec model is trained on Wikipedia (5.4B words). Concurrently and separately, a standard softmax ImageNet classifier (the “visual model”) is trained following Krizhevsky 2012. Finally, the two networks are combined into the DeViSE model (see diagram).

I’ve talked about the first two phases elsewhere and they’re pretty standard, so I’ll focus on the third. Take the visual model layers, remove the softmax, and add a projection layer that maps the 4096-d last-layer activations down to the 500-d or 1000-d word vector dimensions. Also bring in the embedding layer from the language model. Training data consists of image-label pairs from ImageNet 1K. Each image is passed into the layers from the visual model and the label is passed into the embedding layer. The label embedding and the transformed visual activations now have the same dimensions and can be compared directly with a similarity metric used as a loss function.
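To make the shapes concrete, here is a minimal NumPy sketch of the projection step. The random activations and the randomly initialized matrix M are stand-ins I made up for illustration; in the real model the activations come from the trained visual network and M is learned.

```python
import numpy as np

rng = np.random.default_rng(0)

VISUAL_DIM = 4096   # size of the visual model's last hidden layer
EMBED_DIM = 500     # word-vector size (the paper also uses 1000)

# Hypothetical stand-ins: a batch of 8 visual activations and a projection matrix M.
visual_features = rng.standard_normal((8, VISUAL_DIM))
M = rng.standard_normal((EMBED_DIM, VISUAL_DIM)) * 0.01

def project(features, M):
    """Map visual activations into word-embedding space (one row per image)."""
    return features @ M.T

projected = project(visual_features, M)
print(projected.shape)
```

After this step each image is a vector of the same size as a label's word vector, so the two can be compared with a dot product.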

For the loss function the authors use cosine similarity (detail: they actually unit-norm the label embeddings and use dot-product similarity, which ranks the labels for a given image the same way cosine similarity would) combined with a hinge rank loss. This is like an SVM: the idea is that we want the similarity to the correct label to exceed the similarity to every other label by a tunable margin (they used 0.1).
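As I read the paper, the per-example loss has the following form, written here in LaTeX with the same symbols used in the discussion that follows:

```latex
\[
\mathrm{loss}(\mathrm{image}, \mathrm{label}) =
  \sum_{j \neq \mathrm{label}}
  \max\left[\, 0,\;
    \mathrm{margin}
    - \vec{t}_{\mathrm{label}} \, M \, \vec{v}(\mathrm{image})
    + \vec{t}_{j} \, M \, \vec{v}(\mathrm{image})
  \,\right]
\]
```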

(M is the projection matrix, so Mv(image) is the output of the visual side of the network. The t are label embeddings.) If you stare at the equation for a bit, you’ll notice that the gradients are only non-zero when a false label violates the margin — that is, is too similar to the transformed image compared to the true label. This has the effect of pushing the false label similarities down and the true label similarity up. (Another detail: they found it was sufficient to stop considering subsequent false labels once any false label was found to violate the margin.) They also tried L2 loss, but this consistently didn’t work as well.
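A small NumPy sketch of the hinge rank loss for a single image may help. The function name, the toy vectors, and the choice to sum over every false label are my own assumptions for illustration; this version does not include the authors' shortcut of stopping at the first margin violation.

```python
import numpy as np

def hinge_rank_loss(projected, label_embeddings, true_idx, margin=0.1):
    """Sum of margin violations over all false labels for one image.

    projected:        (d,) vector, the transformed image M v(image)
    label_embeddings: (n_labels, d) array of unit-norm label vectors t_j
    true_idx:         index of the correct label
    """
    sims = label_embeddings @ projected           # dot-product similarity to every label
    violations = margin - sims[true_idx] + sims   # one hinge argument per label
    violations[true_idx] = 0.0                    # the true label is excluded from the sum
    return np.maximum(0.0, violations).sum()

# Toy example: three unit-norm 2-d label vectors (hypothetical values).
labels = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Label 1 is almost as similar as the true label 0, so it violates the margin.
loss_violating = hinge_rank_loss(np.array([0.9, 0.85]), labels, true_idx=0)

# Here the true label wins by more than the margin, so the loss is zero.
loss_clean = hinge_rank_loss(np.array([1.0, 0.0]), labels, true_idx=0)

print(loss_violating, loss_clean)
```

In the second case every hinge term is clamped to zero, which is why no gradient would flow for that example.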

Initially during training only the projection weights are updated. After some unspecified number of epochs the layers from the visual model are unfrozen. The authors note that the embedding layer could also be unfrozen, but in this case you’d need to keep the language model around as an auxiliary objective so that the embedding layer maintains semantic coherence.

At test time we have images without labels. So the image goes into the visual model → projection, producing an output vector in latent word representation space. The prediction is then found by a nearest-neighbor search over the label embeddings (which can be done in sublinear time with trees or hashing).
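The lookup can be sketched as a brute-force search, which is linear in the number of labels rather than the sublinear tree or hashing approach mentioned above. The label names and vectors are invented for illustration.

```python
import numpy as np

def predict_labels(projected, label_embeddings, label_names, k=5):
    """Return the k label names whose embeddings are most similar to the image vector."""
    sims = label_embeddings @ projected   # dot product against unit-norm label vectors
    top = np.argsort(-sims)[:k]           # indices of the k highest similarities
    return [label_names[i] for i in top]

# Hypothetical 2-d unit-norm label embeddings.
names = ["strawberry", "raspberry", "tractor"]
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

image_vec = np.array([0.9, 0.2])          # pretend output of visual model + projection
top2 = predict_labels(image_vec, embeddings, names, k=2)
print(top2)
```

Because the candidate set is just a table of word vectors, it can include labels that never appeared in the image training data, which is what makes the zero-shot experiments possible.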

For evaluation the plain visual model + softmax was used as a baseline. Additionally a DeViSE model with randomized word embeddings was trained as a point of comparison, to validate that any benefits in the full model were in fact coming from information in the embedding layer (this was indeed the case). Qualitatively, DeViSE works as intended, returning more sensible guesses than the baseline (see figure above). Interestingly, DeViSE did a little bit worse than baseline on flat “hit@k” metrics (the probability of returning the true label in the top k predictions). To capture the qualitative benefits quantitatively, the authors used a hierarchical “precision@k” metric that accepts results from an expanded list of valid labels derived from ImageNet’s label hierarchy. On this metric DeViSE did up to 7% better than baseline.
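For reference, the flat hit@k metric is easy to sketch in NumPy. The score matrix below is made up, and this covers only flat hit@k, not the hierarchical precision@k variant, which also needs the ImageNet label hierarchy.

```python
import numpy as np

def hit_at_k(similarities, true_indices, k):
    """Fraction of images whose true label is among the k most similar labels.

    similarities: (n_images, n_labels) score matrix
    true_indices: (n_images,) index of the correct label for each image
    """
    top_k = np.argsort(-similarities, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_indices)[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical scores for three images over three labels.
scores = np.array([
    [0.9, 0.5, 0.1],   # true label 0 is ranked first
    [0.2, 0.3, 0.8],   # true label 1 is ranked second
    [0.7, 0.6, 0.1],   # true label 2 is ranked last
])
truth = [0, 1, 2]

for k in (1, 2, 3):
    print(k, hit_at_k(scores, truth, k))
```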

The zero-shot results are even more impressive. They tested this by giving DeViSE images from ImageNet 21K, which had labels the model hadn’t trained on. This isn’t strictly zero-shot, since the 21K labels did show up in the word embeddings training, but the model had never seen image-label pairs for these labels. The results are far from perfect, but the model picks out the correct label far better than chance. For the 1589 easiest labels, the model’s first guess (hit@1) is right 6% of the time and its hit@20 is 36.4%. On precision@k for high k on the full ImageNet 21K dataset, DeViSE is as much as 55% better than baseline.

This tells us something interesting about image-label space: that images with semantically similar labels are themselves similar (in some sense). Which is probably necessary for us, as humans, to enjoy the kinds of magic multi-sensory gestalt experiences we do. I reckon these ML models have some catching up to do.