Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Ariel Persiko
3 min read · May 14, 2018


Visual attention mechanisms are important components of modern computer vision systems, and are an integral part of state-of-the-art results in almost every area: object detection, image captioning, and more. This recent paper takes the concept of visual attention one step further and reaches state-of-the-art performance on both the 2017 COCO image captioning task and the 2017 VQA (Visual Question Answering) challenge.

Until now, attention mechanisms could generally be divided into two types:

1) Detection proposals, such as Faster R-CNN's RPN proposals. The ROI-pooling operation is an attention mechanism that lets the second stage of the detector attend only to the relevant features. The disadvantage of this approach is that it ignores information outside the proposal, which can be very important for classifying it correctly in the second stage.

2) Global attention mechanisms, which re-weight the entire feature map according to a learned attention “heat map”. The disadvantage of this approach is that it doesn’t use the information about the objects in the image to generate the attention map.
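To make the second type concrete, here is a minimal sketch (not from the paper) of global soft attention: a learned score per spatial location is turned into a heat map that re-weights every channel of a feature map. The feature map and scores below are random stand-ins for what a real network would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
features = rng.standard_normal((C, H, W))   # stand-in CxHxW feature map

# In a real model these scores come from a small learned network;
# here they are random stand-ins.
scores = rng.standard_normal((H, W))
heat_map = np.exp(scores) / np.exp(scores).sum()   # softmax over all H*W cells

# Re-weight every channel by the attention heat map.
attended = features * heat_map[None, :, :]

print(heat_map.sum())   # the heat map sums to 1: a convex re-weighting
print(attended.shape)
```

Note that the heat map is computed per grid cell, with no notion of objects, which is exactly the disadvantage described above.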

This paper combines the two approaches into one, and thus mitigates their disadvantages. This is done by generating the attention map over the proposals produced by the RPN, rather than over the global feature map. This is a very powerful mechanism, as the qualitative examples in the paper illustrate.

To implement this approach, they use Faster R-CNN to generate the top 36 proposals and ROI-pool each proposal into a 2048-d feature vector (using average pooling).
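The pooling step above can be sketched as follows. This is a hedged illustration, not the paper's code: the per-proposal 7x7 ROI feature grids are random stand-ins for what a ResNet backbone would extract.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 36, 2048                              # proposals, feature channels

# Stand-in for per-proposal ROI feature grids (K x D x 7 x 7).
roi_features = rng.standard_normal((K, D, 7, 7))

# Average pooling over the spatial dimensions:
# one 2048-d feature vector per proposal.
pooled = roi_features.mean(axis=(2, 3))      # shape (36, 2048)
print(pooled.shape)
```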

These pooled feature vectors are averaged into a single image feature and fed into the attention LSTM. The output of the attention LSTM is a weight vector of size 36 (one weight per proposal).

The next stage of the process is to compute the attended feature vector by summing all of the pooled feature vectors, weighted by their predicted attention weights. This attended feature can then be used as input to a second network that performs the actual task. In the paper, it is fed into a second LSTM that generates one word of the caption at each timestep.
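The two steps above, scoring the 36 proposals and summing them by their weights, can be sketched as follows. The additive scoring function and parameter shapes here are illustrative assumptions, with random stand-ins for the attention LSTM's hidden state and the learned weight matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, H = 36, 2048, 512                     # proposals, feature dim, hidden dim
pooled = rng.standard_normal((K, D))        # 36 pooled proposal vectors
hidden = rng.standard_normal(H)             # stand-in attention-LSTM hidden state

# Additive attention scores (hypothetical parameter shapes).
W_v = rng.standard_normal((H, D)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
w_a = rng.standard_normal(H) * 0.01
scores = np.tanh(pooled @ W_v.T + hidden @ W_h.T) @ w_a   # shape (36,)

# Softmax over the 36 proposals -> one weight per proposal.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Attended feature: weighted sum of the pooled proposal vectors.
attended = weights @ pooled                 # shape (2048,)
print(attended.shape)
```

The weights sum to 1, so the attended feature is a convex combination of the proposal features, attention over objects rather than over grid cells.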

This attention mechanism can be very valuable in many difficult domains. For example, in medical imaging there are many possible use cases for it. In brain CT scan analysis, if there is a proposal for a brain hemorrhage in the right hemisphere and no such proposal on the other side, that significantly increases the probability that the proposal is a genuine abnormality. If a proposal exists in both hemispheres, it significantly decreases the probability of it being a hemorrhage.

Since the brain is far from a perfectly symmetric structure, it is not possible to solve this kind of problem by simply mirroring the image. This kind of mechanism can attend to each proposal and use the relevant objects in the image as context for making such decisions.
