A “Visual Turing Test” for modern AI systems

Frank Merwerth
6 min read · Aug 13, 2020


Visual Question Answering (VQA) is a fascinating research field at the intersection of computer vision and language understanding.

In this post we will look at existing datasets, examine potential approaches and applications, and present a prototype in which the user can choose an image the algorithm has not seen before and ask questions about it.

What is VQA?

Visual Question Answering approaches are designed to handle the following task: given an image and a natural language question about the image, the VQA model needs to provide an accurate natural language answer.

This is by nature a multi-disciplinary research problem. It combines the following fields:
· Computer Vision (CV)
· Natural Language Processing (NLP)
· Knowledge Representation & Reasoning

That’s why some authors refer to Visual Question Answering as a “Visual Turing Test” for modern AI systems.

This screenshot from my prototype illustrates how a VQA system works. Note that the user has chosen an image the algorithm has not seen during training and asks questions about it.

Prototype screenshot

Datasets

Most of the existing datasets contain triples made of an image, a question and its correct answer. Some publicly available datasets additionally provide extra information such as image captions, image regions represented as bounding boxes, or multiple-choice candidate answers.

The available VQA datasets can be categorized based on three factors:
· type of images (natural, clip-art, synthetic)
· question–answer format (open-ended, multiple-choice)
· use of external knowledge

The following table shows an overview of the available datasets:

Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)

For our prototype we make use of the VQA dataset with natural images and open-ended questions. It is one of the most popular datasets and is also used for the annual VQA competition. The dataset we use consists of 443,757 image-question pairs for training and 214,354 pairs for validation. It can be downloaded here.

Example of an annotated image-question-pair

One special characteristic of the VQA dataset is that the annotations, i.e. the answers provided for a specific image-question pair, are not unique. The answers were collected via Amazon Mechanical Turk, and for each image-question pair ten answers are supplied, which may all be identical or may differ from one another. The screenshot on the left shows an example.
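To make the structure of these triples more concrete, here is an illustrative annotation for a single image-question pair. The field names loosely follow the official VQA annotation files, but the concrete values (image id, answers) are made up for illustration:

```python
# Illustrative example of one annotated image-question pair.
# Treat this as a sketch of the structure, not the exact official schema.
annotation = {
    "image_id": 262148,                        # hypothetical COCO image id
    "question": "What color is the umbrella?",
    "answers": [
        {"answer": "red",      "answer_confidence": "yes",   "answer_id": 1},
        {"answer": "red",      "answer_confidence": "yes",   "answer_id": 2},
        {"answer": "dark red", "answer_confidence": "maybe", "answer_id": 3},
        # ... seven more answers, one per Mechanical Turk worker
    ],
}
```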

Approaches & Architectures

The basic architecture as shown below consists of three main elements:
· Image feature extraction
· Question Feature extraction
· Fusion model + classifier to merge the features

Source: Visual Question Answering: Datasets, Algorithms, and Future Challenges https://arxiv.org/abs/1610.01465

Image feature extraction

Image feature extraction describes the process of transforming an image into a numerical vector that enables further computational processing.

Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)

Convolutional neural networks (CNNs) have established themselves as the state-of-the-art approach. VQA architectures generally rely on pre-trained CNN models via transfer learning. The chart shows the utilization rates of different architectures across several VQA research papers.

In the prototype we use the VGG16 architecture that uses 224 × 224 pixel images as input and outputs a 4096-dimensional vector.
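As an illustration, a minimal feature-extraction sketch with Keras might look as follows. This is not necessarily the exact prototype code, and the image file name is just a placeholder:

```python
# Minimal sketch: extracting a 4096-dimensional image feature vector with a
# pre-trained VGG16, cut off after the second fully connected layer ("fc2").
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                      # full ImageNet classifier
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer("fc2").output)  # 4096-d output

img = image.load_img("example.jpg", target_size=(224, 224))      # placeholder file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = feature_extractor.predict(x)               # shape: (1, 4096)
```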

Question feature extraction

Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)

To extract question features, multiple approaches have been developed, ranging from count-based methods such as one-hot encoding and bag-of-words to text embedding methods such as the long short-term memory (LSTM) and the gated recurrent unit (GRU). The diagram below illustrates the utilization rate of these approaches in the research literature.

For our prototype we use the most popular approach, an LSTM fed with Word2Vec representations of the individual words. The LSTM model outputs a 512-dimensional vector.
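A minimal Keras sketch of such a question encoder could look like this. Vocabulary size, sequence length and embedding dimension are assumptions; the pre-trained Word2Vec matrix is only indicated in a comment:

```python
# Minimal sketch of the question encoder: token ids are mapped to Word2Vec-style
# embeddings and fed through an LSTM whose final state is the question vector.
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

VOCAB_SIZE = 10000   # assumption: size of the question vocabulary
MAX_LEN = 25         # assumption: questions padded/truncated to 25 tokens
EMBED_DIM = 300      # typical Word2Vec dimensionality

question_in = Input(shape=(MAX_LEN,), dtype="int32")
embedded = Embedding(VOCAB_SIZE, EMBED_DIM,
                     # weights=[word2vec_matrix],  # pre-trained vectors go here
                     mask_zero=True)(question_in)
question_vec = LSTM(512)(embedded)                   # 512-d question feature

question_encoder = Model(question_in, question_vec)
```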

Fusion model + classifier

To fuse the two feature vectors, several basic approaches exist, including point-wise multiplication or addition and concatenation. More advanced architectures use Canonical Correlation Analysis (CCA) or end-to-end models with a Multimodal Compact Bilinear Pooling (MCB) layer.

Coverage of questions by most frequent answers

In our prototype we use simple concatenation followed by a softmax classifier over the 1,000 most common answers. This approach is suitable because more than 95% of the questions have at least one annotated answer among the 1,000 most common answers (see graph on the left).
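Putting the pieces together, a minimal sketch of such a fusion model could look like this. The input and output dimensions follow the components described above, while the hidden layer size and dropout rate are assumptions:

```python
# Minimal sketch of the fusion model: concatenate the image and question
# feature vectors and classify over the 1,000 most common answers.
from tensorflow.keras.layers import Input, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model

image_feat = Input(shape=(4096,))      # VGG16 fc2 output
question_feat = Input(shape=(512,))    # LSTM output

merged = Concatenate()([image_feat, question_feat])
hidden = Dropout(0.5)(Dense(1024, activation="relu")(merged))   # assumed hidden size
answer_probs = Dense(1000, activation="softmax")(hidden)        # 1,000 answer classes

vqa_model = Model([image_feat, question_feat], answer_probs)
vqa_model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
```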

More advanced approaches

In the recent past more sophisticated architectures have been developed, with attention-based approaches being the most popular. Here, the idea is to focus the algorithm on the most relevant parts of the input. For example, if the question is “What is the color of the ball?”, the region of the image containing the ball is more relevant than the others. Concerning the question, “color” and “ball” are more informative than the rest of the words.

The most common choice in VQA is to use spatial attention to generate region-specific features for training the convolutional neural network.

Two common methods to obtain spatial attention are to either project a grid over the image and let the question determine the relevance of each grid region, or to automatically generate bounding boxes in the image and use the question to weight the features of each box.

The use of an attention-based approach goes beyond the scope of our prototype.
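Although attention is not part of the prototype, the grid-based idea can be sketched in a few lines. Everything below (tensor shapes, the single projection layer) is an illustrative assumption rather than a reference implementation:

```python
# Illustrative sketch of grid-based soft attention: a relevance weight is
# computed for every spatial cell of the CNN feature map, conditioned on the
# question vector, and the cells are combined as a weighted sum.
import tensorflow as tf

def soft_spatial_attention(grid_feats, question_vec):
    """grid_feats: (batch, H*W, d_img) CNN features per grid cell
       question_vec: (batch, d_q) encoded question"""
    d_img = grid_feats.shape[-1]
    q = tf.keras.layers.Dense(d_img)(question_vec)         # project question into image space
    scores = tf.einsum("bnd,bd->bn", grid_feats, q)        # relevance score per cell
    weights = tf.nn.softmax(scores, axis=-1)               # attention distribution over cells
    return tf.einsum("bn,bnd->bd", weights, grid_feats)    # attended image feature
```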

Evaluation

Due to the variety of datasets, it is not surprising that multiple approaches exist to evaluate the performance of the algorithms. In a multiple-choice setting there is just a single right answer for every question, so the assessment can easily be quantified by the mean accuracy over test questions. In an open-ended setting, though, several answers to a particular question could be correct due to synonyms and paraphrasing.

In such cases, metrics that measure how much a predicted answer differs from the ground truth in semantic meaning can be used; the Wu-Palmer Similarity (WUPS) is one example.

Because the VQA dataset works with very short answers, a consensus metric is used instead, defined as Accuracy_VQA = min(n/3, 1), where n is the number of annotators who gave exactly the predicted answer; i.e., 100% accuracy is achieved when the predicted answer matches at least 3 of the 10 annotated answers.
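Expressed as code, the metric is a small helper function (the example answers below are made up):

```python
# Consensus metric as defined above: full credit if at least 3 of the 10
# human annotators gave exactly the predicted answer.
def vqa_accuracy(predicted_answer, annotated_answers):
    n_matches = sum(a == predicted_answer for a in annotated_answers)
    return min(n_matches / 3.0, 1.0)

# Example: 2 of 10 annotators said "blue" -> accuracy 2/3 ≈ 0.67
print(vqa_accuracy("blue", ["blue", "blue", "dark blue"] + ["navy"] * 7))
```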

The diagram shows the accuracy as defined above for the different question types:

Evaluation results on validation set

Potential applications of VQA

VQA systems offer a vast number of potential applications. One of the most socially relevant and direct applications is helping blind and visually impaired users interact with pictures. Furthermore, VQA can be integrated into image retrieval systems, which can be used commercially on e-commerce sites to attract customers by returning more precise results for their search queries. Incorporating VQA may also increase the popularity of online educational services by allowing learners to interact with images. Another application lies in data analysis, where VQA can help analysts summarize the available visual data.

Closing thoughts

VQA is a research field that requires the understanding of both text and vision. The current performance of these systems still lags behind human performance, but since deep learning techniques are improving significantly in both Natural Language Processing and Computer Vision, we can reasonably expect VQA to achieve ever higher accuracy. Progress will be further driven by contests like the VQA challenge hosted on visualqa.org.

If you would like to dive deeper into this topic, you can find the code of the prototype in my GitHub repo here. Any feedback on the approach or the code is highly appreciated.

Further recommended reading includes:
· Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)
· VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
