Visual question answering with multimodal transformers

PyTorch implementation of VQA models using text and image transformers from Hugging Face

What is VQA?

VQA is a multimodal task in which, given an image and a natural language question about the image, the objective is to produce a correct natural language answer as output.

Source: OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge on arxiv.org.

VQA with multimodal fusion models

Multimodal models can take various forms to capture information from the text and image modalities, along with some cross-modal interaction. In fusion models, the information from the text and image encoders is fused into a combined representation to perform the downstream task.

  • Featurization of image and question: We need to extract features from the image and obtain the embeddings of the question after tokenization. The question can be featurized using simple embeddings (like GloVe), sequence models (like LSTMs), or transformers. Similarly, the image features can be extracted using simple CNNs (convolutional neural networks), the early layers of object detection or image classification models, or image transformers.
  • Feature fusion: Since VQA involves a comparison of the semantic information present in the image and the question, there is a need to jointly represent the features from both modalities. This is usually accomplished through a fusion layer that allows cross-modal interaction between image and text features to generate a fused multimodal representation.
  • Answer generation: Depending on how the VQA task is modelled, the answers could either be generated purely using natural language generation (for longer, descriptive answers) or predicted by a simple classifier (for one-word/phrase answers drawn from a fixed answer space).
Types of multimodal data fusion. Image created by the author.

Preliminaries

Installing required packages

We need to create a virtual environment and install the required packages:

datasets==1.17.0
nltk==3.5
pandas==1.3.5
Pillow==9.0.0
scikit-learn==0.23.2
torch==1.8.2+cu111
transformers==4.14.0

Setting up the environment

To set up the environment for training our multimodal VQA model, we need to import the required modules and set the appropriate device for PyTorch.
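A minimal sketch of such a setup, assuming the package versions listed above, might look like this:

import torch
from transformers import AutoTokenizer, AutoFeatureExtractor, AutoModel

# Use a GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")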

Data preparation

For the VQA model training, we use the full DAtaset for QUestion Answering on Real-world images (DAQUAR), which contains approximately 12,500 question-answer pairs based on images from the NYU-Depth V2 dataset.

Sample images, questions, and answers from the DAQUAR Dataset. Source: Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. ICCV’15 (Poster).

Preprocessing the dataset

The raw dataset contains the actual images separately in the images/ directory. All the question-answer pairs are present on consecutive lines in a .txt file, as shown below:

what is on the desk and behind the black cup in the image4 ?
bottle
what is in front of the monitor in the image6 ?
keyboard
...
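A simple way to turn this file into tabular form is to pair up consecutive lines and pull the image ID out of each question. The sketch below is one possible implementation; the input and output paths are placeholders:

import re
import pandas as pd

def parse_qa_file(path):
    # Each question is followed by its answer on the next line.
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]

    records = []
    for question, answer in zip(lines[0::2], lines[1::2]):
        # The image ID (e.g. "image4") is embedded in the question text.
        match = re.search(r"image\d+", question)
        records.append({
            "question": question,
            "answer": answer,
            "image_id": match.group(0) if match else None,
        })
    return pd.DataFrame(records)

train_df = parse_qa_file("data/qa_train.txt")        # placeholder input path
train_df.to_csv("data/data_train.csv", index=False)  # placeholder output path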

Loading the data

Now we are set to load the processed dataset. For this, we use the datasets library from Hugging Face. Since we model this task as a multiclass classification problem, we need to assign a label to every answer; these labels are derived from the indices of the answers in the answer space.
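One way to do this with the datasets library is sketched below; the CSV file names are placeholders for the preprocessed splits:

from datasets import load_dataset

# Load the preprocessed CSV splits (file names are placeholders).
dataset = load_dataset(
    "csv",
    data_files={"train": "data/data_train.csv", "test": "data/data_eval.csv"},
)

# Build the answer space from all answers in the data, then map every
# answer to its index to obtain a classification label.
answer_space = sorted(set(dataset["train"]["answer"]) | set(dataset["test"]["answer"]))

dataset = dataset.map(lambda example: {"label": answer_space.index(example["answer"])})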

A random entry from the training dataset after loading and creating labels from the answer-space.

Defining a multimodal collator for data

Up to this point, we have only loaded the questions, answers, and corresponding image IDs, along with the labels. To feed batches of questions and the corresponding images into our multimodal model, we need to define a data collator.
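A possible collator, assuming a tokenizer for the questions, a feature extractor for the images, and an image directory of PNG files named after the image IDs, is sketched below:

import os
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from PIL import Image

@dataclass
class MultimodalCollator:
    tokenizer: Any           # tokenizes the questions
    feature_extractor: Any   # preprocesses the images
    image_dir: str = "data/images"  # placeholder image directory

    def __call__(self, batch: List[Dict]) -> Dict[str, torch.Tensor]:
        text_inputs = self.tokenizer(
            [item["question"] for item in batch],
            padding=True, truncation=True, return_tensors="pt",
        )
        images = [
            Image.open(os.path.join(self.image_dir, f"{item['image_id']}.png")).convert("RGB")
            for item in batch
        ]
        image_inputs = self.feature_extractor(images=images, return_tensors="pt")
        return {
            "input_ids": text_inputs["input_ids"],
            "attention_mask": text_inputs["attention_mask"],
            "pixel_values": image_inputs["pixel_values"],
            "labels": torch.tensor([item["label"] for item in batch], dtype=torch.int64),
        }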

Defining the multimodal VQA model architecture

As mentioned previously, we use the idea of late fusion to define our multimodal model (a code sketch follows the list below), comprising:

  • A text transformer to encode the question and generate embeddings
  • An image transformer to encode the image and generate features
  • A reasonably simple fusion layer that concatenates the textual and image features and passes them through a linear layer to generate an intermediate output
  • A classifier, which is a fully connected network whose output dimension equals the size of the answer space
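The sketch below illustrates one way to implement this late-fusion architecture; the fusion dimension and dropout rate are illustrative choices:

import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalVQAModel(nn.Module):
    def __init__(self, text_model, image_model, num_labels, intermediate_dim=512):
        super().__init__()
        # Text and image transformer encoders.
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_encoder = AutoModel.from_pretrained(image_model)
        # Fusion: concatenate the pooled text and image features and project them.
        self.fusion = nn.Sequential(
            nn.Linear(
                self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size,
                intermediate_dim,
            ),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        # Classifier over the fixed answer space.
        self.classifier = nn.Linear(intermediate_dim, num_labels)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, pixel_values, labels=None):
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        image_out = self.image_encoder(pixel_values=pixel_values)
        # Use the [CLS]-token representation from both encoders.
        fused = self.fusion(
            torch.cat([text_out.last_hidden_state[:, 0], image_out.last_hidden_state[:, 0]], dim=-1)
        )
        logits = self.classifier(fused)
        out = {"logits": logits}
        if labels is not None:
            out["loss"] = self.criterion(logits, labels)
        return out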

Pretrained models for textual encoding

Pretrained text transformers for experimentation to provide textual features.

Pretrained models for image encoding

Pretrained image transformers for experimentation to provide visual features.
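For reference, the following Hugging Face Hub checkpoints are one possible set of choices covering the models compared later in this article:

# Candidate encoder checkpoints from the Hugging Face Hub.
TEXT_ENCODERS = {
    "bert": "bert-base-uncased",
    "roberta": "roberta-base",
    "albert": "albert-base-v2",
}
IMAGE_ENCODERS = {
    "vit": "google/vit-base-patch16-224-in21k",
    "beit": "microsoft/beit-base-patch16-224-pt22k-ft22k",
}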

Creating the collator and multimodal model

Because we aim to experiment with multiple combinations of text and image transformers, it is convenient to implement a single function that creates the corresponding collator and model for a given pair of checkpoints.
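A sketch of such a helper, building on the collator, model class, and checkpoint dictionaries defined above, might look like the following:

from transformers import AutoTokenizer, AutoFeatureExtractor

def createMultimodalVQACollatorAndModel(text="bert", image="vit", num_labels=None):
    # Resolve the short names to Hub checkpoints.
    text_ckpt = TEXT_ENCODERS[text]
    image_ckpt = IMAGE_ENCODERS[image]

    tokenizer = AutoTokenizer.from_pretrained(text_ckpt)
    feature_extractor = AutoFeatureExtractor.from_pretrained(image_ckpt)

    collator = MultimodalCollator(tokenizer=tokenizer, feature_extractor=feature_extractor)
    model = MultimodalVQAModel(
        text_model=text_ckpt, image_model=image_ckpt, num_labels=num_labels
    ).to(device)
    return collator, model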

Evaluation metrics

We approach the VQA task as a multiclass classification problem in this article. Hence, accuracy and macro F1 score are straightforward choices for evaluating the performance of our model. However, because these metrics can be too restrictive, penalizing almost correct answers (‘tree’ versus ‘plant’) as heavily as clearly incorrect ones (‘tree’ versus ‘table’), we use the WUPS score, which considers the semantic similarity between the predicted answer and the ground truth, as our primary evaluation metric.

Wu and Palmer Similarity (WUPS) Score

One option to evaluate open-ended natural language answers is to perform exact string matching. However, it is too stringent and cannot capture the semantic relatedness between the predicted answer and the ground truth. This prompts the use of other metrics that capture the semantic similarity of strings effectively. One such commonly used metric is the Wu and Palmer Similarity (WUPS) Score.
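For single-word answers, a simplified WUPS can be computed with WordNet from nltk: take the best Wu-Palmer similarity over all synset pairs and, following the common WUPS@0.9 convention, down-weight weak matches. The sketch below also wraps accuracy, macro F1, and this score into a compute_metrics function for the Trainer:

from nltk.corpus import wordnet  # requires nltk.download("wordnet")
from sklearn.metrics import accuracy_score, f1_score

def wup_measure(word_a, word_b, threshold=0.9):
    synsets_a, synsets_b = wordnet.synsets(word_a), wordnet.synsets(word_b)
    if not synsets_a or not synsets_b:
        return float(word_a == word_b)  # fall back to exact matching
    # Best Wu-Palmer similarity over all candidate synset pairs.
    best = max((a.wup_similarity(b) or 0.0) for a in synsets_a for b in synsets_b)
    # WUPS@0.9: similarities below the threshold are down-weighted by 0.1.
    return best if best >= threshold else 0.1 * best

def batch_wups(predictions, references):
    return sum(wup_measure(p, r) for p, r in zip(predictions, references)) / len(references)

def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = logits.argmax(axis=-1)
    return {
        "wups": batch_wups([answer_space[p] for p in preds], [answer_space[l] for l in labels]),
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }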

Training the multimodal VQA model

We finally come to the part where we use the previously defined functions to initialize our multimodal model and train it. The Trainer from Hugging Face abstracts away most of the code required for setting up a PyTorch training loop; hyperparameters such as the number of training epochs and batch size are passed to it through the TrainingArguments.

Hyperparameters used for training the multimodal model.
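A sketch of the training setup for one combination (RoBERTa + BEiT), with illustrative hyperparameter values, is shown below:

from transformers import Trainer, TrainingArguments

collator, model = createMultimodalVQACollatorAndModel(
    text="roberta", image="beit", num_labels=len(answer_space)
)

training_args = TrainingArguments(
    output_dir="checkpoints/roberta_beit",  # placeholder checkpoint directory
    seed=12345,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=5,                     # illustrative values
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    load_best_model_at_end=True,
    metric_for_best_model="wups",
    remove_unused_columns=False,            # the collator needs the raw dataset columns
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()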

Making inferences using the trained model

To use any of the saved model checkpoints for inference, the question must be tokenized and the image features extracted appropriately (as done in the collator). These serve as input to the model, whose weights are loaded from the trained checkpoint. The label predicted by the model is then mapped back to the actual answer via its index in the answer space.
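A sketch of such an inference helper, reusing the tokenizer, feature extractor, and answer space defined earlier (the checkpoint path is a placeholder), is shown below:

import torch
from PIL import Image

def answer_question(image_path, question, model, tokenizer, feature_extractor):
    # Prepare the question and image exactly as the collator does.
    text_inputs = tokenizer(question, return_tensors="pt")
    image = Image.open(image_path).convert("RGB")
    image_inputs = feature_extractor(images=[image], return_tensors="pt")

    model.eval()
    with torch.no_grad():
        output = model(
            input_ids=text_inputs["input_ids"].to(device),
            attention_mask=text_inputs["attention_mask"].to(device),
            pixel_values=image_inputs["pixel_values"].to(device),
        )
    # Map the predicted label back to the answer via its index in the answer space.
    predicted_label = output["logits"].argmax(dim=-1).item()
    return answer_space[predicted_label]

# Weights can be restored from a saved checkpoint (path is a placeholder):
# model.load_state_dict(torch.load("checkpoints/roberta_beit/<checkpoint>/pytorch_model.bin"))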

Examples of answers predicted by our multimodal VQA model on certain evaluation instances.

Comparing the performance of various models

A similar approach is followed to train VQA models with various combinations of text and image transformers by changing the text and image arguments while calling the createMultimodalVQACollatorAndModel(...) function.
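For instance, all combinations could be trained in a loop along these lines (reusing the training arguments and metrics defined above):

results = {}
for text in TEXT_ENCODERS:
    for image in IMAGE_ENCODERS:
        collator, model = createMultimodalVQACollatorAndModel(
            text=text, image=image, num_labels=len(answer_space)
        )
        training_args.output_dir = f"checkpoints/{text}_{image}"  # separate checkpoints per combination
        trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=collator,
            train_dataset=dataset["train"],
            eval_dataset=dataset["test"],
            compute_metrics=compute_metrics,
        )
        trainer.train()
        results[f"{text}+{image}"] = trainer.evaluate()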

  • RoBERTa + BEiT performs the best in terms of both WUPS and accuracy.
  • RoBERTa-based models generally perform better than the rest. This can be attributed to the larger number of trainable parameters and the embeddings generated through more robust pre-training.
  • ALBERT-based models are expected to have lower performance because ALBERT is much smaller compared to BERT and RoBERTa. Yet, the ALBERT + ViT model can achieve scores comparable to the BERT + ViT model, despite having only around half the number of parameters.
  • For BERT and RoBERTa-based text transformers, the best results are achieved using BEiT as the image transformer. However, it does not perform up to the mark with ALBERT. This could indicate that higher quality textual embeddings are required to complement the image embeddings generated by BEiT.

Concluding remarks

In summary, we successfully implemented, trained, and evaluated a late fusion type of multimodal transformer model in PyTorch for visual question answering using the DAQUAR dataset. We also learned how to use the model weights from a trained checkpoint to answer questions related to an image. Last, we compared the performance of several models using different text and image transformers to featurize the question and image before performing fusion.
