Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Jason Benn
Published in Paper Club · Jul 9, 2017 · 10 min read

First of all, here’s a link to the paper. I chose to review this one because it has achieved state-of-the-art results on visual question answering and visual grounding tasks, which are just plain cool as far as AI datasets go.

Big question

How do we best identify salient areas of and answer questions about images?

Visual question answering: “What is the woman feeding the giraffe?” Correct answer: “Carrot”
Visual grounding: “A tattooed woman (red) with a green dress (blue) and yellow backpack (green) holding a water bottle (pink) is walking across the street.” The bounding boxes drawn would be excellent answers to this task.

Background summary

We can create good representations of images using convolutional nets and good representations of phrases and sentences using RNNs, so it makes sense that a “visual question answering” (VQA) task or a “visual grounding” task should involve combining these representations somehow. Typical approaches combine them with vector concatenation, element-wise summing, or element-wise multiplication. The authors suspect these approaches are too simplistic to fully capture the relationships between images and text (makes sense), but are reluctant to use bilinear pooling (AKA a full outer product of vectors; see definitions of terms at the bottom of this article), because the number of resulting parameters is too high to be practical: in this case, our image and text vectors are of length 2048, so the resulting matrix would have 2048² elements, and we’d need to fully connect that matrix to 3,000 classes, resulting in ~12.5 billion learnable parameters. Instead, the authors apply a recently invented technique called multimodal compact bilinear pooling (MCB), which captures most of the discriminating power of bilinear pooling with an output of only 16,000 dimensions while remaining differentiable, so it can still be trained with backpropagation.
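To make that parameter count concrete, here’s the back-of-the-envelope arithmetic behind the ~12.5 billion figure (my own sanity check, following the dimensions quoted above):

```python
# Back-of-the-envelope weight counts: 2048-d image and text vectors, 3,000 answer classes.
d_img, d_txt, n_classes, d_mcb = 2048, 2048, 3000, 16000

full_bilinear = d_img * d_txt * n_classes   # flattened outer product, fully connected to the classes
compact = d_mcb * n_classes                 # MCB output, fully connected to the classes

print(f"full bilinear pooling: {full_bilinear:,} weights")   # 12,582,912,000 (~12.5 billion)
print(f"compact (MCB, d=16,000): {compact:,} weights")       # 48,000,000 (~48 million)
```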

I have a theory that the term “multimodal compact bilinear pooling” was chosen specifically to maximize the impressiveness of this paper. It’s a mouthful, but it isn’t such a complicated concept. “Bilinear pooling” simply means the outer product of the two input vectors; this is what happens when you multiply every element of one vector of length N by every element of the other vector of length M, resulting in a matrix of size N×M. This yields a TON of values, making it powerful but very expensive to connect to anything downstream. “Compact” bilinear pooling means the authors applied dimensionality reduction techniques to get almost the same level of power with way fewer dimensions. I don’t know exactly how it works (I didn’t read the paper where compact bilinear pooling was invented), and in particular I don’t see how the approximation holds up as parameters change during training; in my head I’m imagining something similar to principal components analysis. “Multimodal”, as far as I can tell, is a term that describes this specific use case but doesn’t really mean anything different: compact bilinear pooling could be done on any two vectors, and practically speaking it doesn’t matter at all that one vector happens to represent one mode (images) while the other represents another (text).
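For what it’s worth, here’s my rough understanding of the trick, based on the Count Sketch idea that compact bilinear pooling builds on: each input vector is randomly hashed down into a d-dimensional “sketch”, and multiplying the FFTs of the two sketches element-wise approximates the sketch of the full outer product. A minimal NumPy illustration (the hash sizes, seeds, and function names are mine, purely for illustration):

```python
import numpy as np

def count_sketch(v, h, s, d):
    # Scatter-add each (signed) element of v into one of d buckets chosen by the hash h.
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def mcb(x, y, d=16000, seed=0):
    # Compact bilinear pooling: the count sketch of the outer product x ⊗ y is approximated by
    # the circular convolution of the individual count sketches, computed here via FFT.
    rng = np.random.default_rng(seed)
    h_x = rng.integers(0, d, size=x.size)
    h_y = rng.integers(0, d, size=y.size)
    s_x = rng.choice([-1, 1], size=x.size)
    s_y = rng.choice([-1, 1], size=y.size)
    fx = np.fft.rfft(count_sketch(x, h_x, s_x, d))
    fy = np.fft.rfft(count_sketch(y, h_y, s_y, d))
    return np.fft.irfft(fx * fy, n=d)   # d-dimensional approximation of the flattened outer product
```

The random hash indices and signs are sampled once and then frozen, so the whole operation stays differentiable with respect to x and y; as far as I can tell, that’s the answer to my question about how this survives training.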

Specific question(s)

Can we improve the state of the art in visual question answering and visual grounding tasks with this clever approach to bilinear pooling?

Approach

The authors developed a model that uses MCB at three different points where visual and textual representations need to be combined:
1. to predict spatial attention in the image
2. to predict an answer to the question
3. to relate the encoded multiple-choice answers to the combined question+image space (this only applies to the multiple choice problem).

They also use attention maps and additional training data, which feels a bit like sacrificing experimental purity to achieve state-of-the-art results.

Methods

> Like, is it a vector of dimensionality 2048, or a vector with 2048 elements, where each element has 3000 dimensions? Is that a 2048-dimensional vector, a 3000-dimensional vector, or a 6,144,000-dimensional vector? Why don’t we call this thing a matrix instead of a vector??

Here’s the architecture breakdown:
Visual question answering: Inputs are an image and a question, and the output is one of 3k classes.

Wait, the “open-ended” VQA is just a 3,000-way classification problem? Are these answers all one word long? How was this answer set generated — were the answers supplied first, then a set of questions and images created that were answerable by one of the 3,000?

The image representations come from the 152-layer ResNet model, pretrained on ImageNet. They threw out the final 1,000-way fully connected layer and L2-normalized the output of the second-to-last layer (the “pool5” layer); this results in a 2048-dimensional vector.
The questions are tokenized into words, one-hot encoded, and passed through a learned embedding layer followed by a 2-layer LSTM; each LSTM layer outputs a 1024-dimensional vector, and the two are concatenated to form a 2048-dimensional vector.
The two pieces are passed through a 16,000-dimensional MCB and fully connected to the 3,000 top answers.

Open-ended VQA architecture.
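To keep the shapes straight, here’s a rough NumPy walkthrough of that answer-prediction path. It reuses the mcb function from the sketch above, and every weight here is a random placeholder rather than anything from the authors’ code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Image path: ResNet-152 "pool5" output, L2-normalized (random placeholder values).
img_feat = rng.standard_normal(2048)
img_feat /= np.linalg.norm(img_feat)

# Question path: outputs of the two LSTM layers (1024-d each), concatenated to 2048-d.
lstm_out_1 = rng.standard_normal(1024)
lstm_out_2 = rng.standard_normal(1024)
txt_feat = np.concatenate([lstm_out_1, lstm_out_2])

# Fuse with MCB (the mcb function sketched earlier), then fully connect to the 3,000 answers.
fused = mcb(img_feat, txt_feat, d=16000)        # 16,000-dimensional joint representation
W = rng.standard_normal((3000, 16000)) * 0.01   # placeholder fully connected layer
logits = W @ fused
predicted_answer = int(np.argmax(logits))       # index into the 3,000 candidate answers
```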

Attention and visual grounding: inputs are a phrase and an image, and the output is a set of bounding boxes. I need to read one or two of these papers to understand this bit:
1. Xu et al., 2015 “Show, attend and tell: Neural image caption generation with visual attention”
2. Xu and Saenko, 2016: “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering”
3. Yang et al., 2015: “Stacked attention networks for image question answering”
But I’ll take a stab at explaining it anyway: certain layers of the conv net preserve “spatial grid locations” (a grid of feature vectors, one per region of the image); the technique merges the visual representation at each grid location with the language representation and predicts an “attention weight” for each location. From the paper: “Predicting attention maps… allows the model to effectively learn how to attend to salient locations based on both the visual and language representations.”

Magic, man.

Architecture for visual grounding task.
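Here’s my guess at the mechanics in code form: fuse the question vector with the visual feature at every spatial grid location, score each location, softmax over the grid, and pool the visual features with those weights. A toy NumPy version follows; the paper uses MCB for the per-location fusion and learned 1×1 convolutions for the scoring, and I’m substituting simpler stand-ins with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 14x14 grid of 2048-d visual features from a late conv layer (random placeholders).
H, W, D = 14, 14, 2048
visual = rng.standard_normal((H, W, D))
question = rng.standard_normal(D)   # the 2048-d question representation, shared across the grid

# Fuse the question with the visual feature at every grid location.
# (The paper uses MCB here; a plain element-wise product keeps this sketch short.)
fused = visual * question                        # (14, 14, 2048)

# Two learned 1x1-convolution-like projections collapse each location to a single score.
w1 = rng.standard_normal((D, 512)) * 0.01
w2 = rng.standard_normal(512) * 0.01
scores = np.maximum(fused @ w1, 0) @ w2          # (14, 14) attention logits

# Softmax over all grid locations, then pool the visual features with those weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = (visual * weights[..., None]).sum(axis=(0, 1))   # 2048-d attended visual vector
```

As far as I can tell, that attended 2048-d vector is what then gets fed (along with the question representation) into the second MCB to predict the answer.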

Multiple choice answering: answers are encoded with a word embedding and LSTM layers, with the LSTM weights shared across all the answers. These are then merged with MCB and fully connected to N classes, where N is the number of multiple choice answers.

Architecture for multiple-choice VQA.

Results

Visual question answering: a significant improvement over the previous year’s state of the art (7.9%) and a moderate improvement over this year’s 2nd-place competing model (<1%), but most of this improvement is due to incorporating attention maps (and presumably some is due to the additional training data). Thankfully, they also ran ablations that omit parts of the model, so we can see that MCB itself accounts for about 3% of the improvement.
Visual grounding: a moderate improvement over the state of the art, slightly under 1% on each of the two visual grounding datasets.

The authors acknowledge that their approach might beat the state of the art simply because their model involves more parameters. They compensate for this confounding variable by tacking fully connected layers onto the end of the non-bilinear-pooled models until the total number of parameters is comparable, then comparing the performance of these models. But I don’t know if I buy that this proves MCB is the winner, rather than just proving that more capacity at the point where the vector representations are combined is the difference maker. Maybe a better control would have been a random selection of dimensions from an outer product of the vectors, such that the total dimensionality was the same as MCB’s (16,000)? That would show whether MCB provides better results than randomly selected outer-product entries at the point where the vectors meet. Further, how do they know that 16,000 is the optimal dimensionality at that point? If that is the information bottleneck, where a representation needs to be highly complex to capture the relationship between image and language, then why not experiment with more or fewer dimensions at that location?
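To be concrete about the control I have in mind, something like this (entirely hypothetical; not an experiment from the paper):

```python
import numpy as np

def random_outer_subsample(x, y, d=16000, seed=0):
    # Hypothetical baseline: compute the full outer product, then keep a fixed random
    # subset of d entries (the same subset for every example, sampled once up front).
    rng = np.random.default_rng(seed)
    idx = rng.choice(x.size * y.size, size=d, replace=False)
    return np.outer(x, y).ravel()[idx]
```

It would be wasteful (you still materialize the full 2048 × 2048 outer product), but comparing it to MCB at the same output size would show whether MCB’s sketching actually beats plain random selection.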

Conclusion

The authors were trying to win the competition; thankfully, they also broke down their results by technique. It seems like MCB is worth considering in pretty much any model that involves combining similarly sized vectors: it’s an improvement over the standard techniques of concatenation, addition, or multiplication because it captures more information about the relationship between the multimodal representations. As an added benefit, because the fusion boils down to element-wise multiplication (of the sketches, in the frequency domain), any number of similarly sized vectors can be combined efficiently.

Viability as a Project

Is the data available? Yes.
How much computation? Not so much — they rely on pre-trained ImageNet weights for their convnet, and presumably used pretrained word embeddings as well.
Can the problem be scaled down? Yes, you could test the viability of alternate approaches using subsets of the Visual7W/Visual Genome/VQA datasets.
How much code development? Their code is available on GitHub! Nice!
How much work to turn this paper into a concrete and useful application? Well, if we’re specifically talking about improving the state of the art for the tasks discussed in the paper:

  • Visual question answering: How far can you get with the ability to answer questions about an image, provided your answers are already within a set of 3,000 single-word answers? What subset of useful common questions and answers could this represent? I’m not sure. However, it’s not hard to see how this is a step towards speaking with AI assistants — how long until we have the computational power to encode not just 3,000 short answers, but a much larger set of possible answers, perhaps one that encompasses the most common 80% of answers that people want? Alexa or Google Home probably rely on similar technology.
  • Visual grounding, on the other hand, could be used to automate cleaning up images, or for video-related tasks such as identifying pedestrians or another driver signaling to you in self-driving car applications. Improving the state of the art here is obviously useful.

But the larger principle (that MCB is a superior approach to combining different types of information, if you buy their methodology) is useful for any mixed-modality task.

The trickiest part about all this is that MCB requires the vectors to be the same length. But that might be impractical: imagine a self-driving plane, which combines visual data from a video camera on the wings with a much simpler instrument, perhaps a wind sensor on the nose. Ideally you would be able to combine these representations efficiently (i.e., without using an outer product) but without requiring that your wind sensor output a representation as high-dimensional as your video camera’s. I understand why the authors chose to test MCB on a VQA task: language and imagery are similarly complex and so warrant output vectors of similar dimensionality, making it a perfect application. A technique for efficiently combining vectors of uneven dimensionality would be a good line of research.

What do other researchers say?

I didn’t find many reviews of this paper, but I did find reviews of a later one criticized for being highly similar to this one, and the reviewer’s comments are relevant enough to be worth mentioning here:

  • “Impressive results on standard benchmarks show progress. While the novelty is limited, accepting the paper will help others build on the state-of-the-art results.”
  • “However, this paper used many pretrained models and embeddings, so it would make the paper better if all these effects are better analyzed.”

Words I didn’t know

  • outer product of… vectors: fancy name for multiplying two vectors together into a matrix (multiplying each element of vector A by each element of vector B). [1, 2] ⊗ [3, 4] = [[3, 4], [6, 8]]. (See the short NumPy demo after this list for this and the other products.)
  • vs cartesian product: this is technically set theory, but it’s a similar concept to the outer product; instead of multiplying the elements, you’re pairing them up. Similar to the nested loops join algorithm in databases. [1, 2] × [3, 4] = {(1, 3), (1, 4), (2, 3), (2, 4)}. (Even the symbols hint at the relationship: ⊗ is × wrapped in a circle.)
  • vs element-wise multiplication: two vectors of the same length making a third vector of the same length. [1, 2] x [3, 4] = [3, 8].
  • vs Hadamard product: same as above!
  • vs dot product: element-wise multiplying two vectors of the same length, then summing the result. [1, 2] ⋅ [3, 4] = 11.
  • vs inner product: this is a more general form of the dot product; an inner product can be defined on any vector space, including infinite-dimensional or complex (i.e., including imaginary numbers) ones, whereas the dot product usually refers to finite vectors of real numbers.
  • vs tensor product: for two vectors, just another name for the outer product (more generally, it applies to tensors of any rank).
  • vs cross product: finds a vector perpendicular to both of the given vectors. Useful for physics and engineering and lots of other things… not so much neural nets :)
  • Bilinear pooling: fancier name for the outer product of two vectors.
  • “Ablations without MCB”: by ablations, they mean that they tried omitting MCB from each location in turn, and replacing it with something comparable (in this case, vector concatenation followed by multiple fully connected layers of similar total dimensionality), to test the added benefit of MCB.
  • Visual grounding: finding the bounding box of a phrase’s location in an image.
  • L2 normalization: scaling a vector so that its Euclidean (L2) length is 1; here it keeps the image features on a consistent scale. (Not to be confused with L2 regularization, which penalizes large weights.)
  • Attention maps, soft attention: will have to read some more papers :)
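Since these are easy to mix up, here’s a tiny NumPy demo of the products that come up in this paper (my own example values):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.outer(a, b))   # outer product / bilinear pooling: [[3 4] [6 8]]
print(a * b)            # element-wise (Hadamard) product: [3 8]
print(a @ b)            # dot product: 11

# Cross products only make sense for 3-dimensional vectors:
print(np.cross([1, 0, 0], [0, 1, 0]))   # [0 0 1], perpendicular to both inputs
```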
