Visual Question Answering With Hierarchical Question-Image Co-Attention
Table of Contents:
- Introduction
- VQA Dataset
- Exploratory Data Analysis
- Mapping the problem into a classification problem
- Creating Image Features using VGG19
- Creating Question Vectors
- Baseline Model
- Modeling with Hierarchical Question-Image Co-Attention
- Error Analysis
1. Introduction:
Visual Question Answering (VQA) is a computer vision task where a system is given a text-based question about an image, and it must infer the answer. Questions can be arbitrary and they encompass many sub-problems in computer vision, e.g.,
- Object recognition — What is in the image?
- Object detection — Are there any cats in the image?
- Attribute classification — What color is the cat?
- Scene classification — Is it sunny?
- Counting — How many cats are in the image?
Beyond these, many more complex questions can be asked, such as questions about the spatial relationships among objects (What is between the cat and the sofa?) and common sense reasoning questions (Why is the girl crying?).
While seemingly an easy task for humans, VQA poses several challenges to AI systems, spanning the fields of natural language processing, computer vision, knowledge representation, and reasoning.
2. VQA Dataset:
The VQA dataset contains open-ended as well as multiple-choice questions. The dataset is large; the training images alone take up 12.6 GB.
The VQA v2 dataset contains:
- 82,783 training images from the COCO (Common Objects in Context) dataset
- 40,504 validation images and 81,434 test images
- 443,757 question-answer pairs for the training images
- 214,354 question-answer pairs for the validation images
The dataset also contains abstract cartoon scenes. Each image has at least 3 questions (5.4 questions on average), and each question has 10 answers collected from unique annotators.
3. Exploratory Data Analysis:
To understand the types of questions asked and answers provided, we analyze the questions and answers in the VQA training set.
3.1 Questions
The image below shows the distribution of question lengths. We can see that most questions range from four to ten words.
Types of questions: Given the structure of questions in the English language, questions can be categorized into different types based on the words that start them. There is a variety of question types, including “What is...”, “Is there...”, “How many...”, and “Does the...”. A particularly interesting type is the “What is...” question, since it has a diverse set of possible answers.
As we can see from the image above, “how many”, “is the”, “what”, “what color is the”, and “what is the” are the most frequent question types. Among 65 different question types, “how many” questions cover 9.5% of the entire training data.
3.2 Answers
The image below shows the distribution of answers for several question types. A number of question types, such as “Is the...”, “Are...”, and “Does...”, are typically answered with “yes” or “no”. Questions such as “What is...” and “What type...” have a rich diversity of responses, while question types such as “What color...” or “Which...” have more specialized responses, such as colors or “left” and “right”.
Lengths: Most answers consist of a single word; the proportions of answers containing one, two, or three words are 89.32%, 6.91%, and 2.74%, respectively.
“Yes/no” answers cover 38% of the entire data. Among these yes/no questions, there is a bias towards “yes” (about 59%).
4. Mapping the problem into a classification problem:
- We pose the problem as a K-class classification problem. We choose the top K = 1000 most frequent answers as possible outputs; this set of answers covers 87.51% of the training answers.
- Performance Metric:
accuracy = min( (# annotator answers matching the generated answer) / 3, 1 )
The intuition behind this metric is as follows (a minimal code sketch is given below).
If a system-generated answer matches one produced by at least 3 of the 10 unique annotators, it gets the maximum score of 1 for producing a popular answer. If the generated answer is not present among the 10 candidates, it gets a score of 0, and it receives a fractional score if the answer it produces is rare. If the denominator 3 were lowered, wrong and noisy answers (often present due to annotation noise) would receive high credit. Conversely, if it were raised towards 10, a system producing the right answer might only receive partial credit when the answer choices consist of synonyms or happen to contain a few noisy answers.
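A minimal sketch of this metric in Python, assuming the 10 annotator answers for a question are available as a list of strings (the official evaluation additionally normalizes answers and averages over subsets of annotators, which is omitted here):

```python
def vqa_accuracy(predicted_answer, annotator_answers):
    """Accuracy = min(#annotators that gave the predicted answer / 3, 1)."""
    matches = sum(1 for ans in annotator_answers if ans == predicted_answer)
    return min(matches / 3.0, 1.0)

# Example: 5 of the 10 annotators answered "2", so the prediction "2" scores 1.0
print(vqa_accuracy("2", ["2", "2", "3", "2", "two", "2", "3", "4", "3", "2"]))
```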
5. Creating Image Features using VGG19
Images are one of the inputs to our model, so before feeding them to the model we need to convert every image into a fixed-size vector that can be fed to the neural network.
For this we use the pre-trained VGG-19 model. VGG-19 is trained on the ImageNet dataset to classify an image into one of 1000 classes. Our task here is not to classify the image but to extract the bottleneck features from the last convolutional block. Each image is scaled to 224×224, so we get a 7×7×512 dimensional representation (bottleneck features) for each image.
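A minimal sketch of this step using tf.keras (the image path in the usage line is hypothetical; we drop the classifier head, so the output for a 224×224 input is a 7×7×512 feature map):

```python
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image

# VGG-19 convolutional base only: the output is the last conv block's feature map
feature_extractor = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

def extract_image_features(img_path):
    """Return the 7x7x512 bottleneck features for one image."""
    img = image.load_img(img_path, target_size=(224, 224))    # scale to 224x224
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))            # VGG-specific preprocessing
    return feature_extractor.predict(x)[0]                     # shape: (7, 7, 512)

# Hypothetical usage:
# features = extract_image_features("train2014/COCO_train2014_000000000009.jpg")
```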
6. Creating Question Vectors
Just as we converted images into fixed-size vectors, we also need to convert questions into a fixed-size vector representation.
Question vectors are extracted with the following steps:
- Tokenizing the text
- Sequence Padding
After pre-processing, each question is represented as a padded token sequence of shape [24, ], where 24 is the sequence length.
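A minimal sketch of these two steps with tf.keras utilities (the example questions are placeholders; the padded length is fixed to 24 as stated above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_QUESTION_LEN = 24  # sequence length after pre-processing

# Placeholder questions; in practice this is the full list of training questions
train_questions = [
    "what color is the cat",
    "how many people are in the image",
]

tokenizer = Tokenizer()                        # builds a word -> integer-index vocabulary
tokenizer.fit_on_texts(train_questions)

sequences = tokenizer.texts_to_sequences(train_questions)
question_vectors = pad_sequences(sequences, maxlen=MAX_QUESTION_LEN, padding="post")
print(question_vectors.shape)                  # (num_questions, 24)
```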
7. Baseline Model
The image below shows the high-level architecture of the baseline VQA system that we implemented. We do not feed the raw image directly into the model. The image is scaled to 224×224 and fed into a convolutional neural network (CNN) such as VGG-19, which outputs a feature vector encoding the contents of the image, referred to as the image embedding. The question is fed into an embedding layer, resulting in a question embedding.
These embedding vectors, which compactly represent the image and question contents, have different dimensions. They are therefore first projected into the same number of dimensions using corresponding fully connected layers (a linear transformation) and then combined using pointwise multiplication (multiplying values at corresponding dimensions).
The final stage of the VQA model is a multilayer perceptron with a softmax non-linearity at the end that outputs a score distribution over the top K (1000) answers. Casting the answers as a K-way classification task allows us to train the VQA model with a cross-entropy loss between the predicted answer distribution and the ground truth.
The image backbone is initialized with weights obtained from a network such as VGG-19, trained on the ImageNet classification dataset. The image representations for the entire training dataset can be pre-computed and stored on disk, which results in less memory consumption while training the VQA model.
The images below help visualize the neural network architecture of the baseline model.
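A sketch of this baseline in tf.keras, under a few assumptions not stated above: the pre-computed 7×7×512 VGG features are flattened, an LSTM pools the word embeddings into a single question vector, and the vocabulary size and common embedding dimension are placeholder values:

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000        # assumed question vocabulary size
MAX_QUESTION_LEN = 24
COMMON_DIM = 1024         # assumed size of the shared embedding space
NUM_ANSWERS = 1000        # top-K answer classes

# Image branch: pre-computed VGG-19 bottleneck features, flattened and projected
image_input = layers.Input(shape=(7 * 7 * 512,), name="image_features")
image_embedding = layers.Dense(COMMON_DIM, activation="tanh")(image_input)

# Question branch: embedding layer + LSTM, projected to the same dimension
question_input = layers.Input(shape=(MAX_QUESTION_LEN,), name="question_tokens")
q = layers.Embedding(VOCAB_SIZE, 300)(question_input)
q = layers.LSTM(512)(q)
question_embedding = layers.Dense(COMMON_DIM, activation="tanh")(q)

# Fuse by pointwise multiplication, then an MLP with a softmax over the top-K answers
fused = layers.Multiply()([image_embedding, question_embedding])
hidden = layers.Dense(1000, activation="relu")(fused)
output = layers.Dense(NUM_ANSWERS, activation="softmax")(hidden)

baseline = Model([image_input, question_input], output)
baseline.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```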
8. Modeling with Hierarchical Question-Image Co-Attention
In addition to modeling ‘where to look’ or ‘visual attention’, it is equally important to model ‘what words to listen to’ or ‘question attention’.
Co-Attention: The paper proposes a novel mechanism that jointly reasons about visual attention and question attention, referred to as co-attention. More specifically, the image representation is used to guide the question attention, and the question representation(s) are used to guide the image attention.
Question Hierarchy: This model has a hierarchical architecture that co-attends to the image and question at three levels: (1) word level, (2) phrase level, and (3) question level. At the word level, we embed the words into a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks are used to capture the information contained in unigrams, bigrams, and trigrams. Specifically, we convolve word representations with temporal filters of varying support and then combine the n-gram responses by pooling them into a single phrase-level representation (see the sketch below). At the question level, we use recurrent neural networks to encode the entire question. For each level of the question representation in this hierarchy, we construct joint question and image co-attention maps, which are then combined recursively to ultimately predict a distribution over the answers.
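A minimal sketch of the phrase-level step, following the paper's formulation: 1-D convolutions with window sizes 1, 2, and 3 over the word embeddings, with "same" padding so the sequence length is preserved, followed by an element-wise max over the three n-gram responses at every word position (the embedding dimension here is illustrative):

```python
from tensorflow.keras import layers, Model

EMBED_DIM = 512
word_embeddings = layers.Input(shape=(24, EMBED_DIM))   # (question length, embedding dim)

# Unigram, bigram, and trigram responses over the word representations
unigram = layers.Conv1D(EMBED_DIM, 1, padding="same", activation="tanh")(word_embeddings)
bigram  = layers.Conv1D(EMBED_DIM, 2, padding="same", activation="tanh")(word_embeddings)
trigram = layers.Conv1D(EMBED_DIM, 3, padding="same", activation="tanh")(word_embeddings)

# Max-pool across the n-gram responses at each word position
phrase_features = layers.Maximum()([unigram, bigram, trigram])  # (batch, 24, 512)

phrase_encoder = Model(word_embeddings, phrase_features)
```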
The paper proposes two co-attention mechanisms: parallel co-attention and alternating co-attention. Here, we implement the parallel co-attention model.
Parallel co-attention attends to the image and question simultaneously. We connect the image and question by computing the similarity between image and question features at all pairs of image locations and question locations. The image and question attention vectors are calculated as follows.
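For reference, these are the parallel co-attention equations from the paper: V and Q are the image and question feature matrices, C is the affinity matrix, W_b, W_v, W_q, w_hv, w_hq are learnable parameters, and v-hat, q-hat are the attended image and question vectors:

```latex
C = \tanh(Q^{T} W_{b} V)

H^{v} = \tanh(W_{v} V + (W_{q} Q)\,C), \qquad
H^{q} = \tanh(W_{q} Q + (W_{v} V)\,C^{T})

a^{v} = \mathrm{softmax}(w_{hv}^{T} H^{v}), \qquad
a^{q} = \mathrm{softmax}(w_{hq}^{T} H^{q})

\hat{v} = \sum_{n=1}^{N} a^{v}_{n}\, v_{n}, \qquad
\hat{q} = \sum_{t=1}^{T} a^{q}_{t}\, q_{t}
```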
Parallel co-attention is performed at each level of the hierarchy, and the co-attended image and question features from all three levels are then combined recursively to ultimately predict the answer.
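A minimal NumPy sketch of one level of parallel co-attention, following the equations above (random weights stand in for the learned parameters; d, k, N, and T denote the feature dimension, hidden dimension, number of image locations, and question length):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, k, N, T = 512, 256, 49, 24          # feature dim, hidden dim, image locations, question length
V = np.random.randn(d, N)              # image features, one column per spatial location
Q = np.random.randn(d, T)              # question features, one column per word

# Learnable parameters (randomly initialised here; trained in the real model)
W_b = np.random.randn(d, d)
W_v, W_q = np.random.randn(k, d), np.random.randn(k, d)
w_hv, w_hq = np.random.randn(k), np.random.randn(k)

C = np.tanh(Q.T @ W_b @ V)                        # affinity matrix, shape (T, N)
H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)            # (k, N)
H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)          # (k, T)
a_v = softmax(w_hv @ H_v)                         # image attention weights, shape (N,)
a_q = softmax(w_hq @ H_q)                         # question attention weights, shape (T,)

v_hat = V @ a_v                                   # attended image vector, shape (d,)
q_hat = Q @ a_q                                   # attended question vector, shape (d,)
```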
For this model, the image is scaled to 448×448 and the image features are computed again. While training, we used only 48k images due to time and resource constraints. After 30 epochs, we achieved an accuracy of 49.49% on the test data. Below are a few predictions of the co-attention model.
9. Error Analysis
The image below shows the error analysis of the model. The co-attention model performs very well (>80% accuracy) for question types like ‘what room’, ‘what sport’, and ‘what animal’. Most questions of types such as ‘could’, ‘are’, ‘is’, and ‘do you’ have only two possible answers (yes/no); for these question types, accuracy is around 60%. The model does not perform well (less than 20% accuracy) for question types like ‘what number’, ‘why’, ‘how’, and ‘where are the’.
As the error analysis shows, question types with only a few possible answers work very well (e.g. ‘what room’, ‘what sport’), while question types like ‘why’ and ‘how’ admit a vast range of possible answers, so the model does not perform as well on them. Here we use VGG, which produces 512 filters in its final convolutional layer. A backbone with more filters can capture more patterns and may improve performance (e.g. the final conv layer of ResNet has 2048 filters). Also, instead of using pre-trained models, training the whole architecture end-to-end on the COCO images and VQA questions could let the model learn more complex patterns.
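As a small illustration of the backbone suggestion, swapping VGG for ResNet-50 in tf.keras yields 2048-channel bottleneck features (with a 448×448 input the spatial grid is 14×14):

```python
from tensorflow.keras.applications.resnet50 import ResNet50

# ResNet-50 convolutional base: the final feature map has 2048 channels
resnet_extractor = ResNet50(weights="imagenet", include_top=False, input_shape=(448, 448, 3))
print(resnet_extractor.output_shape)   # (None, 14, 14, 2048)
```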
Reference
- VQA: Visual Question Answering — https://arxiv.org/pdf/1505.00468.pdf
- Hierarchical Question-Image Co-Attention for Visual Question Answering — https://arxiv.org/pdf/1606.00061v5.pdf
- https://www.appliedaicourse.com/
To view the entire work, you can visit my GitHub repository: https://github.com/harsha977/Visual-Question-Answering-With-Hierarchical-Question-Image-Co-Attention
LinkedIn profile:
https://www.linkedin.com/in/harshavardhan-reddy-27380352/