Visual Question Answering with Various Feature Combinations

Extensions of Visual Question Answering

Siddharth Narayanan
Analytics Vidhya
7 min read · Sep 6, 2019


Photo by Jon Tyson on Unsplash

My work at present involves language modelling, segmentation and decomposition using NLP. Considering the significant role semantics and meaning play in understanding language, I wanted to revisit some previous work in this space, especially one as interesting as Visual Question Answering (VQA), which combines computer vision, natural language understanding, and deep learning. This article provides an overview of the VQA model along with its extensions. For a more detailed read on the implementation and results, please refer to the paper here and code here. The VQA team also maintains a comprehensive source of information, resources and software, including the latest talks and papers.

Among the many problems in Artificial Intelligence (AI), image/video captioning, which combines Computer Vision, Natural Language Processing, and Knowledge Representation and Reasoning, has been tackled by plenty of research groups. However, a significant gap still remains between how well machines and humans interpret images. VQA has emerged as an interesting field that sits at the intersection of these problem areas: given an image, a Visual Question Answering algorithm helps a machine answer free-form, open-ended, natural-language questions about that image. This is accomplished by measuring similarity between the two modalities (text and image) in a semantic space, based on Microsoft’s Deep Multimodal Similarity Model (DMSM) [1]. The basic VQA model itself has a number of potential real-world use cases, such as automatic tagging of large image sets, image retrieval systems, and integration into vast social media and e-commerce databases. As graduate students in the vision lab at Virginia Tech, Jinwoo Choi and I also wanted to experiment with extending the basic VQA model [2] to other combinations of inputs.

Figure 1: The four models are illustrated here. The first is the basic VQA model: given an image and a question about it, find the correct answer. In the second model, given an image, we retrieve the corresponding question-and-answer (QA) pair. The third is an image retrieval model: given a question-and-answer pair, we retrieve the images most relevant to it. The last is the Jeopardy model: given an image and an answer, it tries to find the corresponding question.

To perform any of these VQA tasks, we measure the similarity between the input modalities: (image & sentence) or (image + sentence & sentence). The DMSM maps the input vectors to a common semantic space and measures the cosine similarity between the resulting embedding vectors. The DMSM is a multimodal extension of the unimodal Deep Structured Semantic Model (DSSM), which measures the similarity between text queries and documents, and like the DSSM it uses a pair of neural networks.

The basic VQA model extracts features from the image and concatenates them with the question features to generate (image+question) features. These are fed through one network of the DMSM, while the answer features are fed through the other network to train the model.
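To make this two-tower setup concrete, here is a minimal NumPy sketch of the idea (not the original C# DMSM implementation): two small feed-forward networks map the concatenated (image+question) vector and the answer vector into a shared semantic space, and their cosine similarity is the model's score. The layer sizes, tanh activations, and random weights are illustrative assumptions.

```python
import numpy as np

def mlp(x, weights):
    """Pass a feature vector through a small feed-forward tower (tanh layers)."""
    h = x
    for W in weights:
        h = np.tanh(W @ h)
    return h

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)

# Illustrative dimensions: 4096-d VGG image features, 3000-d letter-trigram
# question/answer features, 300-d shared semantic space.
img_feat = rng.standard_normal(4096)
q_feat = rng.standard_normal(3000)
a_feat = rng.standard_normal(3000)

# Tower 1: concatenated (image + question) features -> semantic space.
tower1 = [rng.standard_normal((1024, 4096 + 3000)) * 0.01,
          rng.standard_normal((300, 1024)) * 0.01]
# Tower 2: answer features -> the same semantic space.
tower2 = [rng.standard_normal((1024, 3000)) * 0.01,
          rng.standard_normal((300, 1024)) * 0.01]

iq_embedding = mlp(np.concatenate([img_feat, q_feat]), tower1)
ans_embedding = mlp(a_feat, tower2)

print("similarity score:", cosine_similarity(iq_embedding, ans_embedding))
```

The extension models in Figure 2 keep this structure and simply swap which feature combinations go into which tower.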

Figure 2: For the QA retrieval model, extracted image features are fed into one network and the concatenated (question+answer) features into the other. For the image retrieval model, the concatenated (question+answer) features are fed into one network and extracted image features into the other network of the DMSM. For the Jeopardy model, we concatenate image features with answer features: the concatenated (image+answer) features are fed into one network and question features into the other.

Prediction in the DMSM returns cosine similarity scores between input 1 (the image+question features) and input 2 (the candidate answer features), ranging from -1 to 1. Our dataset had 18 multiple-choice candidates per question, from which the highest-scoring answer was selected. To do this, we rank the scores in descending order and select the top-K most similar results.
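As a sketch of this selection step, assuming we already have one (image+question) embedding and embeddings for the 18 candidate answers, the predicted answer is simply the candidate with the highest cosine score; sorting the scores in descending order also yields the top-K list. The embedding dimension and random vectors below are placeholders.

```python
import numpy as np

def rank_candidates(query_emb, candidate_embs, k=3):
    """Score candidates by cosine similarity and return the top-k indices and scores."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = candidate_embs / (np.linalg.norm(candidate_embs, axis=1, keepdims=True) + 1e-8)
    scores = c @ q                  # one cosine score per candidate, in [-1, 1]
    order = np.argsort(-scores)     # descending order of similarity
    return order[:k], scores[order[:k]]

rng = np.random.default_rng(1)
iq_embedding = rng.standard_normal(300)              # (image+question) embedding
answer_embeddings = rng.standard_normal((18, 300))   # 18 multiple-choice candidates

top_idx, top_scores = rank_candidates(iq_embedding, answer_embeddings, k=3)
print("predicted answer index:", top_idx[0])
print("top-3 candidates:", top_idx, top_scores)
```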

The dataset used in this work was the original VQA dataset [2]. During the development of our model, it contained 82,783 training images and 40,504 validation images from the Microsoft COCO dataset [3]. Furthermore, the VQA dataset contains approximately 3 ground-truth question-answer pairs per training/validation image. The VQA dataset provides two modalities for answering the questions: (1) open-answer and (2) multiple-choice. In this work, only multiple-choice answers were used for the VQA model. Note that the VQA team has since released newer datasets, so be sure to check them out before you begin developing your own models.

Implementation Summary

We used the training images and the corresponding training questions and answers to train our models, and the validation images with their corresponding questions and answers to generate predictions and calculate accuracy.

  1. We first generate feature sets for the required combination, align them, and concatenate them to obtain a sparse vector representation.
  2. Caffe was used to extract the image features, which are activations from VGGNet [4]; for the question and answer features, bag-of-words representations of letter-trigram count vectors are used (see the sketch after this list).
  3. Both feature sets are fed into the DMSM to train it and update its weight matrices.
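Item 2 relies on the DSSM-style letter-trigram ("word hashing") representation of text. The sketch below shows the idea, assuming a trigram vocabulary built from the training questions and answers; the tiny corpus and the helper names are illustrative, not the original feature extraction code.

```python
import numpy as np

def letter_trigrams(word):
    """Letter trigrams of a word with boundary markers, e.g. 'cat' -> #ca, cat, at#."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_trigram_vocab(texts):
    """Collect every letter trigram seen in the corpus into an index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            for tri in letter_trigrams(word):
                vocab.setdefault(tri, len(vocab))
    return vocab

def trigram_count_vector(text, vocab):
    """Bag-of-words letter-trigram count vector for a question or answer."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word in text.split():
        for tri in letter_trigrams(word):
            idx = vocab.get(tri)
            if idx is not None:
                vec[idx] += 1.0
    return vec

# Tiny illustrative corpus standing in for the VQA training questions and answers.
corpus = ["what is the man doing", "skateboarding", "is the cat on the table", "yes"]
vocab = build_trigram_vocab(corpus)
q_vec = trigram_count_vector("what is the cat doing", vocab)
print("vocabulary size:", len(vocab), "non-zero counts:", int(q_vec.sum()))
```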

Using Torch was slow, taking ~2–3 hours per epoch, so our implementation used the DMSM C# reference code, which took ~70–100 minutes per epoch. For 100 epochs, we therefore needed approximately 5–7 days to train a model.

To test, we measured the similarity between the two inputs as the cosine similarity between their embedding vectors. For example, we computed the embedding for a given image and used the multimodal cosine similarity score to find the nearest (question+answer) embedding for that image. This demonstrates the ability to discover what questions can be asked about an image and to obtain answers to those automatically generated questions.
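To illustrate this evaluation, the sketch below (with made-up embedding matrices) computes the full cosine similarity matrix between the image embeddings from one tower and the (question+answer) embeddings from the other. Reading a row ranks QA pairs for a query image, and reading a column ranks images for a query QA pair, which mirrors how the QA-retrieval and image-retrieval extensions are scored.

```python
import numpy as np

def normalize_rows(x):
    """L2-normalize each row so plain dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

rng = np.random.default_rng(2)
image_embs = normalize_rows(rng.standard_normal((500, 300)))   # image tower outputs
qa_embs = normalize_rows(rng.standard_normal((1500, 300)))     # (question+answer) tower outputs

# Full cosine similarity matrix: rows index images, columns index QA pairs.
sim = image_embs @ qa_embs.T

# QA retrieval: best QA pairs for query image 0 (read along a row).
top_qa_for_image0 = np.argsort(-sim[0])[:3]

# Image retrieval: best images for query QA pair 7 (read along a column).
top_images_for_qa7 = np.argsort(-sim[:, 7])[:3]

print("top-3 QA pairs for image 0:", top_qa_for_image0)
print("top-3 images for QA pair 7:", top_images_for_qa7)
```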

Results

The experiments showed promising results for the VQA extension models. Some example results of the original VQA model are shown in Figure 3.

Figure 3: VQA model example results: the top row shows successful cases, the bottom row shows failure cases. As the top row shows, in some cases machines can correctly answer questions about an image. Machines can answer abstract-concept questions (third example) as well as simple ones (fourth example).

QA pair retrieval example results are depicted in Figure 4. Given a query image, the machine retrieves the corresponding QA pairs and ranks them by similarity score in descending order. Even though the correct answer is not always ranked very highly, the retrieved answers are relevant to the query image. In the first example, all top-3 ranked QA pairs contain "table" and "fruit", both of which appear in the query image. In the third example, two of the top-3 ranked QA pairs contain "cat", while the other misclassifies the cat in the image as a dog.

Figure 4: QA pair retrieval model example results.

Examples of the image retrieval results are shown in Figure 5. Given a query QA pair, the machine retrieves the corresponding images and ranks them by similarity score. Similar to the QA retrieval results, the top-3 retrieved images are reasonably relevant to the query QA pair. For example, the second row in Figure 5 has "Are the elephants swimming" and "no" as the QA pair: all top-3 retrieved images contain elephants, and none of them are swimming. The third row has "Where is this person cooking this meal" and "oven" as the QA pair: all top-3 retrieved images show a cooking scene, and two of them contain an oven.

Figure 5: Image retrieval model example results.

Jeopardy model example results are depicted in Figure 6. Given an image and an answer, the machine tries to find the corresponding question, and this model also retrieves relevant questions reasonably well. In the first example, the query image is a restroom and the query answer is "tile"; all top-3 retrieved questions contain "floor" or "ceiling". In the third example, the query image shows a man skateboarding and the answer is "skateboarding"; all top-3 retrieved questions have the form "What is someone doing". However, the top-ranked question is "what is the fireman doing", which is clearly incorrect. This is likely due to the blurry portion of the image: the blurred lights may have confused the image feature extraction module.

Figure 6: Jeopardy model example results.

Conclusion

The VQA extensions help explore new potential applications such as generic object recognition, holistic scene understanding, narrating information and stories from images, or developing interactive educational applications that ask questions about images. Although current predictions under-perform compared to human judgments, newer and larger datasets, along with adoption across more platforms and devices, would enable computers to understand data far more intuitively and change the way we search and interact with it. Within the scope of this research, an interesting next step would be to apply transfer learning to the three extension models using the weights of the original VQA model; this may reduce their training time and increase model accuracy.

References

[1] Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh K., Deng, Li, Dollar, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John C., Lawrence Zitnick, C., and Zweig, Geoffrey. From captions to visual concepts and back. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[2] Antol, Stanislaw, Agrawal, Aishwarya, Lu, Jiasen, Mitchell, Margaret, Batra, Dhruv, Zitnick, C. Lawrence, and Parikh, Devi. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.

[3] Chen, Xinlei, Fang, Hao, Lin, Tsung-Yi, Vedantam, Ramakrishna, Gupta, Saurabh, Dollar, Piotr, and Zitnick, C Lawrence. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[4] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
