Self-Attention Based Visual Dialogue

Divyansh Jha
Intel Student Ambassadors
Feb 1, 2019 · 5 min read

The quest for artificial general intelligence is long and difficult, but researchers take small steps toward it every day. One such step is the task of Visual Dialogue, which lies at the intersection of Computer Vision and Natural Language Processing (NLP). In this task, an AI agent must hold a meaningful dialogue with humans, in natural language, about visual content, specifically images. Given an image (I), the current question (Q), and a history of question-answer pairs (H), the agent should generate an answer to the current question. The image is not shown to the questioner, who can ask any open-ended question about it over multiple rounds, and the questions may contain pronouns that refer back to earlier rounds. For example: Questioner: “What is he doing?”, AI Bot: “He is playing tennis.” This kind of scenario occurs frequently over multiple rounds of dialogue.

Dataset

The dataset used is VisDial v0.9. It contains over a million question-answer pairs on over 100k COCO images. The dataset was collected by the MLP lab at Georgia Tech using a chat interface on Amazon Mechanical Turk: the answerer sees the image, while the questioner, who cannot see it, asks open-ended questions about it. This collection technique made the questions open-ended and varied.

VisDial and VQA dataset comparison (Source: Visual Dialogue Paper)

Data Handling

Handling this type of dataset is quite complex and cumbersome, as each image comes with ten rounds of questions and answers. To meet the deadlines I set for myself, I extracted VGG16 relu7 features once and used them as input to the language-processing models. The data was organized so that for each COCO image id we get its image features and all of its questions to iterate over. This strategy greatly simplified handling such a large and complex dataset.
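Concretely, the per-image lookup can be sketched as a small PyTorch Dataset. This is only an illustrative sketch: the class and field names are hypothetical, and it assumes the VGG16 relu7 features and the tokenised dialogue rounds have already been precomputed.

```python
# A minimal sketch of the per-image lookup described above (names and the
# feature layout are hypothetical, not from the original code).
import torch
from torch.utils.data import Dataset

class VisDialDataset(Dataset):
    def __init__(self, image_ids, img_feats, dialogs):
        # img_feats: dict mapping COCO image id -> precomputed VGG16 relu7
        #            feature tensor of shape (4096,)
        # dialogs:   dict mapping COCO image id -> list of 10 (question, answer)
        #            rounds, already tokenised and padded to a fixed length
        self.image_ids = image_ids
        self.img_feats = img_feats
        self.dialogs = dialogs

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        img_id = self.image_ids[idx]
        feats = self.img_feats[img_id]           # (4096,)
        rounds = self.dialogs[img_id]            # 10 rounds of (Q, A)
        questions = torch.stack([q for q, _ in rounds])
        answers = torch.stack([a for _, a in rounds])
        return feats, questions, answers
```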

Approach

For this problem, the image feature extraction has already been handled. To process language, a sequence-to-sequence model is needed, since questions and answers can be of varying lengths. A sequence-to-sequence model contains an encoder and a decoder, both of which are a kind of RNN called a Long Short-Term Memory (LSTM) network; they process each question and answer token by token. Unlike a traditional encoder, our encoder has to incorporate the image features and the history of questions and answers. Several methods have been proposed, each with its own pros and cons. Considering implementation difficulty and the accuracy of the results, I decided to go with the Late Fusion Encoder discussed in this paper. In this encoder, we treat the history (H) as one long string with all previous rounds concatenated. The current question (Q) and H are encoded separately using two different LSTMs. The image features (I), the encoded H, and the encoded Q are then concatenated and linearly transformed into a joint representation of the desired size.

Late Fusion Encoder (Source: Visual Dialogue Paper)
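To make the fusion step concrete, here is a minimal sketch of a late-fusion encoder. The layer names and sizes are illustrative assumptions, not the exact implementation used in the project.

```python
# Late fusion sketch: encode Q and H with separate LSTMs, concatenate with
# the image features, and project to a joint representation.
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size=300, hidden_size=512,
                 img_feat_size=4096, joint_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.q_lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.h_lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        # fuse image (I), history (H) and question (Q) into one joint vector
        self.fusion = nn.Linear(img_feat_size + 2 * hidden_size, joint_size)

    def forward(self, img_feats, question, history):
        # question: (B, Lq) token ids; history: (B, Lh) concatenated rounds
        _, (q_state, _) = self.q_lstm(self.embed(question))
        _, (h_state, _) = self.h_lstm(self.embed(history))
        fused = torch.cat([img_feats, q_state[-1], h_state[-1]], dim=1)
        return torch.tanh(self.fusion(fused))    # (B, joint_size)
```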

There are also different types of decoders, namely generative and discriminative. According to the results reported in the paper, the discriminative decoder performs better. It outputs a probability distribution over a fixed set of candidate answer options. The discriminative decoder scores better on the metrics, but the generative decoder is more realistic, since it is not restricted to a predefined list of answers.
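As a rough illustration of the discriminative decoder, the sketch below encodes each of the 100 candidate answers with an LSTM and scores it by a dot product with the encoder's joint representation. The names and sizes are assumptions; in training, a cross-entropy loss would be applied over these scores against the index of the ground-truth option.

```python
# Hedged sketch of a discriminative decoder over 100 candidate answers.
import torch
import torch.nn as nn

class DiscriminativeDecoder(nn.Module):
    def __init__(self, vocab_size, emb_size=300, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.opt_lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, joint_repr, options):
        # joint_repr: (B, hidden_size) output of the encoder
        # options:    (B, 100, L) token ids of the candidate answers
        B, num_opts, L = options.shape
        _, (state, _) = self.opt_lstm(self.embed(options.view(B * num_opts, L)))
        opt_repr = state[-1].view(B, num_opts, -1)            # (B, 100, hidden)
        scores = torch.bmm(opt_repr, joint_repr.unsqueeze(2)).squeeze(2)
        return scores                                          # (B, 100) logits
```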

While doing error analysis, I found that the model used very little information from the image: the magnitude of the gradient flowing back to the image features during backpropagation was very low. This was primarily because we were not fine-tuning VGG16 for the task; fine-tuning would have let the complete model learn which parts of the image are important. I therefore integrated a Self-Attention module inspired by the SAGAN paper. Training the complete encoder-decoder network with self-attention stabilized training considerably, decreased the loss further, and improved performance on the chosen metric on the validation set. The gradients now travelled all the way back to the image, showing that the image information was actually being utilized.

Self-Attention Module (Source: SAGAN Paper)
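The block below is a minimal sketch of a SAGAN-style self-attention layer applied to convolutional feature maps. It assumes spatial (conv) features are available rather than the flattened relu7 vector, and the layer names are illustrative.

```python
# SAGAN-style self-attention: 1x1 convs produce query/key/value maps, a
# softmax attention map mixes spatial positions, and a learned gamma
# (initialized to zero) gradually blends attention into the features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W).permute(0, 2, 1)  # (B, HW, C//8)
        k = self.key(x).view(B, -1, H * W)                      # (B, C//8, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)               # (B, HW, HW)
        v = self.value(x).view(B, -1, H * W)                    # (B, C, HW)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(B, C, H, W)
        return self.gamma * out + x
```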

Evaluation Metric

One fundamental challenge in dialogue systems is evaluation; judging the quality of free-form answers is an open problem in NLP. Metrics such as BLEU and ROUGE exist, but they are known to correlate poorly with human judgement. To address this issue, I used the retrieval-based evaluation from the paper, i.e., evaluating individual responses. At test time, the model is given an image (I), the history (H), the current question (Q), and a set of 100 candidate answers. The model ranks the candidates and is then evaluated on the Mean Reciprocal Rank (MRR) of the human response (higher is better) and Recall@k, i.e., whether the human response appears in the top-k ranked responses (also higher is better).
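The helper below is an illustrative sketch (not the official evaluation script) of how MRR and Recall@k can be computed once the rank of the human response among the 100 candidates is known.

```python
# Compute MRR and Recall@k from the 1-based rank of the ground-truth answer.
import torch

def visdial_metrics(gt_ranks, ks=(1, 5, 10)):
    # gt_ranks: 1-based rank of the human response for each question
    gt_ranks = gt_ranks.float()
    metrics = {"mrr": (1.0 / gt_ranks).mean().item()}
    for k in ks:
        metrics[f"recall@{k}"] = (gt_ranks <= k).float().mean().item()
    return metrics

# Example: ranks of the human response for three questions
print(visdial_metrics(torch.tensor([1, 3, 12])))
```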

Intel Development Tools Used

The project was done as part of the Intel Early Innovation project, through which Intel funded me to buy research equipment. The project made use of the Intel® Math Kernel Library, which accelerated data preprocessing to a great extent. Intel® Xeon® Scalable processors were used to train the Visual Dialogue model on 12 nodes of the Intel® AI DevCloud. The open-source PyTorch library was used to build the various nuts and bolts of the model. The Intel AI DevCloud and the local hardware provided by Intel helped me debug quickly and iterate over the development process; they handled one of the more compute-intensive trainings of a Visual Dialogue model with ease and minimal maintenance. The support on the Intel AI forums and blogs was pivotal to the success of this project.

The App
