Imagenet + Wordnet = magic?

Motivation: I was looking to understand how the deVISE model worked.

Outlined in this paper

I think why it was unique was that it was a multi-model approach. It took a model trained in image recognition. It then trained a model on words (finding semantic similarities between words). Then it blended these two models together.

Why i think this is powerful is i think it becomes more generalizable.


imagine searching for a pair of shoes on a fashion site ( or text to image search)

for me, i would probably search for running shoes but imagine how many diff types of running shoes there are. It could be cross trainers, could be shoes meant more for sprinting, more for long distance.

where it becomes powerful is taking my search query of running shoes and finding semantically close words associated with it.

running shoes >>> cross trainers

running shoes >>> long distance running shoes

Then taking cross trainers and long distance running shoes and find appropiate images.

It could then do it the opposite way to do. Given an image of a shoes find similar images too.


Here’s how I think the architecture works

Words and Images are totally different. it’s like blending meat and fruits together their taste/texture is different.

How can we get them to a common currency.

That’s where word vectors come in handy.

Imagenet has labels which we can convert to word vectors.

Wordnet has words which we can convert to word vectors. ** one thing im wondering here is wordnet would probably have coccurrence of words, why not use their cooccurence score here.

Here’s how i believe the architecture works

(1) Image recognition model : We train a resnet model to take an image and predict the word vectors

(2) Word model: We using kmeans clustering to cluster similar nouns together. similar in the sense that they may co-occur.

(3) Prediction: Pass in an image and the image model gives you a word vector. Use the word vector to predict which cluster it would belong to. Iterate through the cluster to find similar images.