The lure of the outer product

Bilinear pooling and its approximations for Visual Question Answering

Aniruddha Kembhavi
Jun 16, 2019 · 12 min read

Vanilla architectures for Visual Question Answering (VQA) represent the image by a feature vector known as an image embedding, embed the question into a question embedding and then combine the two modalities using simple element-wise operations. This aggregated vector is then provided to a multilayer perceptron to obtain a distribution over the set of possible answers.

A common extension to these vanilla models is to add an attention module that computes an affinity referred to as an attention score between the question and each region in the image. The resulting attention vector dictates which regions of the image the model should focus on, in order to answer the given question. This affinity computation is usually done using element-wise operations between the two modalities. Learn more about attention modules for VQA in a previous post.

The key element in both the above cases is the nature of interaction between a question embedding and an image embedding. Modeling these interactions as element-wise operations works well but only models a subset of the interactions that are possible between two vectors. A far more powerful interaction is the outer product of the two vectors. In this post, I will discuss the outer product interaction applied to VQA, the practical difficulties in modeling this and the approximations used instead.

The outer product

Given an image embedding v_I ∈ ℝ^(dx1) and a question embedding v_Q ∈ ℝ^(dx1), a simple way to combine them is using a Hadamard product, i.e. an element-wise multiplication. This allow elements in the same position within each embedding to interact in a multiplicative fashion, resulting in the aggregated vector z ∈ ℝ^(dx1).

z =v_I ⊙ v_Q

Another method is to learn this affinity via a parametric neural network. This network typically projects the embeddings into a lower dimensional space and then combines them using an element-wise multiplication operation.

𝕫 = ( W_I × v_I+b_I ) ⊙ ( W_Q × v_Q + b_Q )

z = W_z × 𝕫 + b_z

where W_I ∈ ℝ^(k x d) and W_Q ∈ ℝ^(k x d) project the embeddings into a k dimensional space before the element-wise multiplicative interaction and then W_z ∈ ℝ^(k x θ) is used to project the resulting vector into the desired dimension.

The above methods of interaction only allow elements in the same position within the two input embeddings to interact in a multiplicative fashion. A far more powerful mechanism results when each element in the first embedding interacts multiplicatively with every element in the second embedding. The outer product of the two embeddings, represented as ⊗ does precisely this; and results in a matrix P ∈ ℝ^(d x d).

P = v_I ⊗ v_Q = v_I × v^T_Q

Note: In the above equation, ^T refers to the matrix transpose and × refers to the regular matrix multiplication.

A common mechanism to incorporate the outer product into an interaction between the two embeddings is via the mechanism of bilinear pooling. This is essentially an outer product

P = v_I ⊗ v_Q

followed by a reshaping of the matrix P into ∈ ℝ^(d² x 1)

= reshape(P)

followed by a linear projection of the result to obtain the output of the interaction z ∈ ℝ^x 1)

z = W_P × ℙ

Bilinear pooling using the outer product is an incredibly powerful way to model the interaction between two input vectors but it requires learning the large parameter matrix W_P. For instance, if the dimensionality d of the input embeddings is 1024 and the dimensionality θ of the output embedding is 2000, W_P has 1024 * 1024 * 2000 = 2 billion parameters! Representing this matrix would require a large amount of memory and learning these parameters would require a lot of training data and would be very slow. This is clearly infeasible and impractical.

Below, I will outline a few methods that have been proposed over the last couple of years to approximate this outer product for VQA, so as to harness most of its power while still keeping the memory footprint manageable.

Multimodal Compact Bilinear Pooling

One of the first successful uses of the outer product (or rather, an approximation of it) in VQA was Multimodal Compact Bilinear Pooling (MCB). At the heart of the MCB technique is a probabilistic data structure called the Count Sketch. To understand MCB, we need to digress into the world of Sketching.

Sketches are data structures that allow you to represent a very large stream of numbers (or a vector) in a highly compact fashion. These data structures are lossy and not reversible, in that the original data cannot be exactly recovered from them. However, they are capable of answering a class of queries about the stream of data (for example: Counts of unique objects present in the input stream) while providing accuracy guarantees.

Let’s say we wanted to maintain the counts of numbers seen in the large array named input_stream. A trivial and lossless method to do this would be using a dictionary.

// Using a dictionary to count itemscount_store = dict()def process_stream(input_stream):
for element in input_stream:
if element not in count_store:
count_store[element] = 1
count_store[element] += 1
def query_element_count(element):
if element in count_store:
return count_store[element]
return 0

If the number of unique elements η in the stream is very large, this dictionary (which stores the hash of each item and its count) becomes very large and may become infeasible in many applications.

Instead, let’s initialize an array (counter_array)with κ counters, each one initially set to 0. As we read each element in input_stream, we map it to a position in counter_arrayvia a hash function, and then increment the corresponding counter by 1. Now, when we want to query the count of an entity, we simply map it to a location using the same hash function and read out the value stored in the counter at that location.

If κ = η, this is equivalent to the dictionary method of storing counts and is infeasible. However, if κ << η, the memory footprint of this storage system gets much smaller and becomes manageable. The problem with this scenario is that multiple elements in input_streamwill begin mapping to the same position within counter_array (also known as hash collisions). This will lead to over estimating the counts of elements, since a given counter will actually return a sum of the counts for a few different elements. This is clearly problematic and won’t work.

To alleviate this, one uses N counter arrays, each with its corresponding hash function mapping elements to positions. When an element is encountered within input_stream, it gets mapped to a unique location within each counter array and the counters at those location gets incremented by 1. Now, when we want to query the count of an entity we use the same hash functions to retrieve a set of values, one from each counter array; and return the minimum value. While there will still be errors in the count, using multiple arrays provides far better results with theoretical guarantees. This simple algorithm is known as the Count-Min Sketch.

// Count_Min algorithm# N = number of counter arrays
# kappa = number of counters in each array
counter_matrix = np.zeros([N, kappa])
def process_stream(input_stream):
for element in input_stream:
for counter_id in range(N):
position = hash_function(counter_id, element)
counter_matrix[counter_id, position] += 1
def query_element_count(element):
counts = np.zeros(N)
for counter_id in range(N):
position = hash_function(counter_id, element)
counts[counter_id] = counter_matrix[counter_id, position]
return np.min(counts)

The Count Sketch is very similar to Count-Min Sketch with 2 differences: (a) Each time an element is processed, two hash functions are invoked: One determines the position within the counter array. The other chooses between incrementing the counter by 1 vs decrementing the counter by 1. (b) At query time, given an element, we first retrieve the set of counts as before, but we now use the median instead of the min.

The counter arrays produced by these algorithms are very compact and yet quite informative. They can be thought of as a compact lossy representation of the input stream that can be used in other downstream tasks.

MCB employs Count Sketch to represent its question and image embeddings and then processes these representations to obtain the answer. The input stream described above is a series of integers with repetitions where as the question and image embeddings are vectors with floating point entires. This requires modifying the Count Sketch algorithm slightly, resulting in Tensor Sketch. The main difference is that counters are now updated by the magnitude of the floating point element instead of by 1.

However, one key question remains unanswered: Why does MCB use Count Sketch as a representation and not some other ?

An interesting property of the Count Sketch representation, proven by Pham et al. is that the count sketch (represented by ℂ) of the outer product of two vectors is equivalent to the convolution (denoted as *) of their count sketches.

ℂ(v_I ⊗ v_Q) = ℂ(v_I) * ℂ(v_Q)

Furthermore, convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. This results in:

ℂ(v_I ⊗ v_Q) = IFFT( FFT(ℂ(v_I)) ⊙ FFT(ℂ(v_Q)) )

As a result of this, one can now allow v_I and v_Q to interact via an outer product by simply computing their Count Sketch vectors which is straightforward, followed by 2 FFT transforms, 1 IFFT transform and a point-wise multiplication.

The entire VQA pipeline with the MCB module enabling a rich interaction between the input embeddings is shown below. This pipeline is quite powerful and it outperforms models that only employ point-wise multiplication as an interaction mechanism between the two modalities.

However, MCB does have a critical shortcoming: As described above, the errors introduced by the Count Sketch operation are a result of hash collisions. Larger the counter array, smaller the probability of a collision. It turns out that for MCB to be effective, the length of the Count Sketch vector needs to be very large. In the MCB paper, they employed a dimensionality of 16000, resulting in a very large memory footprint. While still smaller than the memory required to perform a full outer product based interaction, this large footprint is cumbersome, especially when multiple MCB modules might need to be inserted into a single neural pipeline.

Learn more about Tensor Sketch here:

Fast and Scalable Polynomial Kernels via Explicit Feature Maps

Ninh Pham, Rasmus Pagh

KDD 2013

Learn more about MCB here:

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach

EMNLP 2016

Multimodal Factorized Bilinear Pooling

Another mechanism to exploit the power of bilinear pooling without a surge in the memory footprint of the model is to employ some matrix factorization tricks, as was done in work named Multimodal Factorized Bilinear Pooling (MFB).

Given an image embedding v_I ∈ ℝ^(dx1) and a question embedding v_Q ∈ ℝ^(dx1), the outer product described above can be concisely written as:

z_i = v^T_I × W_i × v_Q

where z_i is the i-th element of z ∈ ℝ^x 1), × denotes regular matrix multiplication and ^T refers to a matrix transpose. W_i ∈ ℝ^(d x d) and we need to learn θ of them which results in learning d x d x θ parameters, clearly a daunting task.

This can be simplified by factorizing W_i into two low rank matrices L and M.

z_i = v^T_I × (L_i × M^T_i) × v_Q

where L_i ∈ ℝ^(d x k) and M_i ∈ ℝ^(d x k) with k representing the latent dimensionality of these matrices. A lower value of k leads to fewer parameters but adds more constraints on the resulting W_i and thus may lead to a less powerful model.

This can further be rewritten as

z_i = 1^T ( (L^T_i × v_I) ⊙ (M^T_i × v_Q) )

The number of parameters now reduces from the enormous d x d x θ to a more manageable d x k x θ for L^T and d x k x θ for M^T.

In fact MFB with a fewer number of parameters than MCB, performs better than MCB on standard benchmarks.

Another take: Looking at the above equation, it’s interesting to note that MFB starts to look quite different from an outer product, and in fact starts resembling a point-wise operation between the two modalities. The L_i and M_i can be thought of as simple projection matrices to project the input embeddings into a k dimensional space. This is followed by a point-wise operation and an element summation operation. But this only results in a single element in the output vector z. This is repeated θ number of times. Hence one can think of this technique as a multi-head point-wise operation.

Learn more about Multimodal Factorized Bilinear Pooling here:

Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

Zhou Yu, Jun Yu, Jianping Fan, Dacheng Tao

ICCV 2017

Multimodal Tucker Fusion

Another method to incorporate the outer product operation without leading to a huge increase in parameters is to use the Tucker decomposition of the parameter matrix, as was done in the work Multimodal Tucker Fusion (MUTAN).

The full bilinear pooling operation described above between the image and question embedding can be written as

z = ( W ×:1 v_Q ) ×:2 v_I

where ×: i refers to the i-mode product between a tensor and a matrix. W ∈ ℝ^(d x d x θ) is the large parameter matrix that needs to be learnt.

W can be decomposed into a tensor ψ ∈ ℝ^(k x k x o) and 3 factor matrices W_q ∈ ℝ^(d x k), W_i ∈ ℝ^(d x k) and W_z ∈ ℝ^x o) using a Tucker decomposition.

W = ((ψ ×:1 W_q) ×:2 W_i) ×:3 W_z

The bilinear pooling operation can then be rewritten as:

z = ( (ψ ×:1 (𝕧_Q)) ×:2 𝕧_I )^T W_z

where 𝕧_Q = v^T_Q × W_q is the projection of v_Q using W_q and 𝕧_I = v^T_I × W_i is the projection of v_I using W_i. The above equation represents a full bilinear pooling operation between 𝕧_Q and 𝕧_I followed by a projection using W_z. The number of parameters reduces from the enormous d x d x θ to a more manageable d x k for W_q , d x k for W_i, o x θ for W_z and k x k x o for ψ.

To summarize the MUTAN method, the question and image embeddings are projected onto lower dimensions followed by a full bilinear pooling operation and then projected into the desired output dimension.

MUTAN also outperforms the expensive MCB method but results on standard benchmarks show that it does not outperform MFB. It is still an interesting method as it incorporates a full bilinear pooling, albeit at lower dimensions.

Learn more about MUTAN here:

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

Hedi Ben-younes, Rémi Cadene, Matthieu Cord, Nicolas Thome

ICCV 2017

Incorporating bilinear pooling into attention mechanisms

The bilinear pooling operations described above enable rich interactions between two input embeddings and generate a resultant output embedding. Hence they can be applied in different parts of a VQA architecture, wherever a multimodal interaction takes place. In particular, there are two points in an attention based VQA architecture where bilinear pooling is usually employed.

  1. The core part of VQA attention is interaction between a question embedding and different image region embeddings. This is a perfect match for bilinear pooling. The result of the pooling operation is a tensor with an embedding for each attention. This tensor is then passed through convolution layers to finally result in the attention weight vector.
  2. After the attention weights are used to linearly combine image embeddings into a final image embedding, this embedding undergoes an interaction with the question embedding to form the aggregated information vector. This interaction is also modeled by bilinear pooling.

These applications of bilinear pooling to the attention based VQA architecture are shown below.


To summarize, an outer product between two vectors is an extremely rich and powerful mechanism to combine them, in contrast to simple element wise operations. However, the raw outer product is prohibitively expensive and impractical to incorporate into VQA models.

To overcome this deficiency and yet extract some power from it, several versions of bilinear pooling have been proposed over the past few years. These have led to accuracy improvements over simpler interaction methods. I discussed some of the more prominent ones in this post.

Follow AI2 on Twitter @allen_ai | Follow Ani Kembhavi @anikembhavi

AI2 Blog

AI for the common good.

Aniruddha Kembhavi

Written by

Computer vision researcher at the Allen Institute for AI. Twitter: @anikembhavi

AI2 Blog

AI2 Blog

AI for the common good.

More From Medium

More from AI2 Blog

More from AI2 Blog

Green AI

Roy Schwartz
Feb 14 · 9 min read


More from AI2 Blog

More from AI2 Blog

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade