The lure of the outer product
Bilinear pooling and its approximations for Visual Question Answering
Vanilla architectures for Visual Question Answering (VQA) represent the image by a feature vector known as an image embedding, embed the question into a question embedding and then combine the two modalities using simple element-wise operations. This aggregated vector is then provided to a multilayer perceptron to obtain a distribution over the set of possible answers.
A common extension to these vanilla models is to add an attention module that computes an affinity referred to as an attention score between the question and each region in the image. The resulting attention vector dictates which regions of the image the model should focus on, in order to answer the given question. This affinity computation is usually done using element-wise operations between the two modalities. Learn more about attention modules for VQA in a previous post.
The key element in both the above cases is the nature of interaction between a question embedding and an image embedding. Modeling these interactions as element-wise operations works well but only models a subset of the interactions that are possible between two vectors. A far more powerful interaction is the outer product of the two vectors. In this post, I will discuss the outer product interaction applied to VQA, the practical difficulties in modeling this and the approximations used instead.
The outer product
Given an image embedding v_I ∈ ℝ^(dx1) and a question embedding v_Q ∈ ℝ^(dx1), a simple way to combine them is using a Hadamard product, i.e. an element-wise multiplication. This allow elements in the same position within each embedding to interact in a multiplicative fashion, resulting in the aggregated vector z ∈ ℝ^(dx1).
z =v_I ⊙ v_Q
Another method is to learn this affinity via a parametric neural network. This network typically projects the embeddings into a lower dimensional space and then combines them using an element-wise multiplication operation.
𝕫 = ( W_I × v_I+b_I ) ⊙ ( W_Q × v_Q + b_Q )
z = W_z × 𝕫 + b_z
where W_I ∈ ℝ^(k x d) and W_Q ∈ ℝ^(k x d) project the embeddings into a k dimensional space before the element-wise multiplicative interaction and then W_z ∈ ℝ^(k x θ) is used to project the resulting vector into the desired dimension.
The above methods of interaction only allow elements in the same position within the two input embeddings to interact in a multiplicative fashion. A far more powerful mechanism results when each element in the first embedding interacts multiplicatively with every element in the second embedding. The outer product of the two embeddings, represented as ⊗ does precisely this; and results in a matrix P ∈ ℝ^(d x d).
P = v_I ⊗ v_Q = v_I × v^T_Q
Note: In the above equation, ^T refers to the matrix transpose and × refers to the regular matrix multiplication.
A common mechanism to incorporate the outer product into an interaction between the two embeddings is via the mechanism of bilinear pooling. This is essentially an outer product
P = v_I ⊗ v_Q
followed by a reshaping of the matrix P into ℙ ∈ ℝ^(d² x 1)
ℙ = reshape(P)
followed by a linear projection of the result to obtain the output of the interaction z ∈ ℝ^(θ x 1)
z = W_P × ℙ
Bilinear pooling using the outer product is an incredibly powerful way to model the interaction between two input vectors but it requires learning the large parameter matrix W_P. For instance, if the dimensionality d of the input embeddings is 1024 and the dimensionality θ of the output embedding is 2000, W_P has 1024 * 1024 * 2000 = 2 billion parameters! Representing this matrix would require a large amount of memory and learning these parameters would require a lot of training data and would be very slow. This is clearly infeasible and impractical.
Below, I will outline a few methods that have been proposed over the last couple of years to approximate this outer product for VQA, so as to harness most of its power while still keeping the memory footprint manageable.
Multimodal Compact Bilinear Pooling
One of the first successful uses of the outer product (or rather, an approximation of it) in VQA was Multimodal Compact Bilinear Pooling (MCB). At the heart of the MCB technique is a probabilistic data structure called the Count Sketch. To understand MCB, we need to digress into the world of Sketching.
Sketches are data structures that allow you to represent a very large stream of numbers (or a vector) in a highly compact fashion. These data structures are lossy and not reversible, in that the original data cannot be exactly recovered from them. However, they are capable of answering a class of queries about the stream of data (for example: Counts of unique objects present in the input stream) while providing accuracy guarantees.
Let’s say we wanted to maintain the counts of numbers seen in the large array named
input_stream. A trivial and lossless method to do this would be using a dictionary.
// Using a dictionary to count itemscount_store = dict()def process_stream(input_stream):
for element in input_stream:
if element not in count_store:
count_store[element] = 1
count_store[element] += 1def query_element_count(element):
if element in count_store:
If the number of unique elements η in the stream is very large, this dictionary (which stores the hash of each item and its count) becomes very large and may become infeasible in many applications.
Instead, let’s initialize an array (
counter_array)with κ counters, each one initially set to 0. As we read each element in
input_stream, we map it to a position in
counter_arrayvia a hash function, and then increment the corresponding counter by 1. Now, when we want to query the count of an entity, we simply map it to a location using the same hash function and read out the value stored in the counter at that location.
If κ = η, this is equivalent to the dictionary method of storing counts and is infeasible. However, if κ << η, the memory footprint of this storage system gets much smaller and becomes manageable. The problem with this scenario is that multiple elements in
input_streamwill begin mapping to the same position within
counter_array (also known as hash collisions). This will lead to over estimating the counts of elements, since a given counter will actually return a sum of the counts for a few different elements. This is clearly problematic and won’t work.
To alleviate this, one uses N counter arrays, each with its corresponding hash function mapping elements to positions. When an element is encountered within
input_stream, it gets mapped to a unique location within each counter array and the counters at those location gets incremented by 1. Now, when we want to query the count of an entity we use the same hash functions to retrieve a set of values, one from each counter array; and return the minimum value. While there will still be errors in the count, using multiple arrays provides far better results with theoretical guarantees. This simple algorithm is known as the Count-Min Sketch.
// Count_Min algorithm# N = number of counter arrays
# kappa = number of counters in each array
counter_matrix = np.zeros([N, kappa])def process_stream(input_stream):
for element in input_stream:
for counter_id in range(N):
position = hash_function(counter_id, element)
counter_matrix[counter_id, position] += 1def query_element_count(element):
counts = np.zeros(N)
for counter_id in range(N):
position = hash_function(counter_id, element)
counts[counter_id] = counter_matrix[counter_id, position]
The Count Sketch is very similar to Count-Min Sketch with 2 differences: (a) Each time an element is processed, two hash functions are invoked: One determines the position within the counter array. The other chooses between incrementing the counter by 1 vs decrementing the counter by 1. (b) At query time, given an element, we first retrieve the set of counts as before, but we now use the median instead of the min.
The counter arrays produced by these algorithms are very compact and yet quite informative. They can be thought of as a compact lossy representation of the input stream that can be used in other downstream tasks.
Back to the VQA world
MCB employs Count Sketch to represent its question and image embeddings and then processes these representations to obtain the answer. The input stream described above is a series of integers with repetitions where as the question and image embeddings are vectors with floating point entires. This requires modifying the Count Sketch algorithm slightly, resulting in Tensor Sketch. The main difference is that counters are now updated by the magnitude of the floating point element instead of by 1.
However, one key question remains unanswered: Why does MCB use Count Sketch as a representation and not some other ?
An interesting property of the Count Sketch representation, proven by Pham et al. is that the count sketch (represented by ℂ) of the outer product of two vectors is equivalent to the convolution (denoted as *) of their count sketches.
ℂ(v_I ⊗ v_Q) = ℂ(v_I) * ℂ(v_Q)
Furthermore, convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. This results in:
ℂ(v_I ⊗ v_Q) = IFFT( FFT(ℂ(v_I)) ⊙ FFT(ℂ(v_Q)) )
As a result of this, one can now allow v_I and v_Q to interact via an outer product by simply computing their Count Sketch vectors which is straightforward, followed by 2 FFT transforms, 1 IFFT transform and a point-wise multiplication.
The entire VQA pipeline with the MCB module enabling a rich interaction between the input embeddings is shown below. This pipeline is quite powerful and it outperforms models that only employ point-wise multiplication as an interaction mechanism between the two modalities.
However, MCB does have a critical shortcoming: As described above, the errors introduced by the Count Sketch operation are a result of hash collisions. Larger the counter array, smaller the probability of a collision. It turns out that for MCB to be effective, the length of the Count Sketch vector needs to be very large. In the MCB paper, they employed a dimensionality of 16000, resulting in a very large memory footprint. While still smaller than the memory required to perform a full outer product based interaction, this large footprint is cumbersome, especially when multiple MCB modules might need to be inserted into a single neural pipeline.
Learn more about Tensor Sketch here:
Fast and Scalable Polynomial Kernels via Explicit Feature Maps
Ninh Pham, Rasmus Pagh
Learn more about MCB here:
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
Multimodal Factorized Bilinear Pooling
Another mechanism to exploit the power of bilinear pooling without a surge in the memory footprint of the model is to employ some matrix factorization tricks, as was done in work named Multimodal Factorized Bilinear Pooling (MFB).
Given an image embedding v_I ∈ ℝ^(dx1) and a question embedding v_Q ∈ ℝ^(dx1), the outer product described above can be concisely written as:
z_i = v^T_I × W_i × v_Q
where z_i is the i-th element of z ∈ ℝ^(θ x 1), × denotes regular matrix multiplication and ^T refers to a matrix transpose. W_i ∈ ℝ^(d x d) and we need to learn θ of them which results in learning d x d x θ parameters, clearly a daunting task.
This can be simplified by factorizing W_i into two low rank matrices L and M.
z_i = v^T_I × (L_i × M^T_i) × v_Q
where L_i ∈ ℝ^(d x k) and M_i ∈ ℝ^(d x k) with k representing the latent dimensionality of these matrices. A lower value of k leads to fewer parameters but adds more constraints on the resulting W_i and thus may lead to a less powerful model.
This can further be rewritten as
z_i = 1^T ( (L^T_i × v_I) ⊙ (M^T_i × v_Q) )
The number of parameters now reduces from the enormous d x d x θ to a more manageable d x k x θ for L^T and d x k x θ for M^T.
In fact MFB with a fewer number of parameters than MCB, performs better than MCB on standard benchmarks.
Another take: Looking at the above equation, it’s interesting to note that MFB starts to look quite different from an outer product, and in fact starts resembling a point-wise operation between the two modalities. The L_i and M_i can be thought of as simple projection matrices to project the input embeddings into a k dimensional space. This is followed by a point-wise operation and an element summation operation. But this only results in a single element in the output vector z. This is repeated θ number of times. Hence one can think of this technique as a multi-head point-wise operation.
Learn more about Multimodal Factorized Bilinear Pooling here:
Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Zhou Yu, Jun Yu, Jianping Fan, Dacheng Tao
Multimodal Tucker Fusion
Another method to incorporate the outer product operation without leading to a huge increase in parameters is to use the Tucker decomposition of the parameter matrix, as was done in the work Multimodal Tucker Fusion (MUTAN).
The full bilinear pooling operation described above between the image and question embedding can be written as
z = ( W ×:1 v_Q ) ×:2 v_I
where ×: i refers to the i-mode product between a tensor and a matrix. W ∈ ℝ^(d x d x θ) is the large parameter matrix that needs to be learnt.
W can be decomposed into a tensor ψ ∈ ℝ^(k x k x o) and 3 factor matrices W_q ∈ ℝ^(d x k), W_i ∈ ℝ^(d x k) and W_z ∈ ℝ^(θ x o) using a Tucker decomposition.
W = ((ψ ×:1 W_q) ×:2 W_i) ×:3 W_z
The bilinear pooling operation can then be rewritten as:
z = ( (ψ ×:1 (𝕧_Q)) ×:2 𝕧_I )^T W_z
where 𝕧_Q = v^T_Q × W_q is the projection of v_Q using W_q and 𝕧_I = v^T_I × W_i is the projection of v_I using W_i. The above equation represents a full bilinear pooling operation between 𝕧_Q and 𝕧_I followed by a projection using W_z. The number of parameters reduces from the enormous d x d x θ to a more manageable d x k for W_q , d x k for W_i, o x θ for W_z and k x k x o for ψ.
To summarize the MUTAN method, the question and image embeddings are projected onto lower dimensions followed by a full bilinear pooling operation and then projected into the desired output dimension.
MUTAN also outperforms the expensive MCB method but results on standard benchmarks show that it does not outperform MFB. It is still an interesting method as it incorporates a full bilinear pooling, albeit at lower dimensions.
Learn more about MUTAN here:
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
Hedi Ben-younes, Rémi Cadene, Matthieu Cord, Nicolas Thome
Incorporating bilinear pooling into attention mechanisms
The bilinear pooling operations described above enable rich interactions between two input embeddings and generate a resultant output embedding. Hence they can be applied in different parts of a VQA architecture, wherever a multimodal interaction takes place. In particular, there are two points in an attention based VQA architecture where bilinear pooling is usually employed.
- The core part of VQA attention is interaction between a question embedding and different image region embeddings. This is a perfect match for bilinear pooling. The result of the pooling operation is a tensor with an embedding for each attention. This tensor is then passed through convolution layers to finally result in the attention weight vector.
- After the attention weights are used to linearly combine image embeddings into a final image embedding, this embedding undergoes an interaction with the question embedding to form the aggregated information vector. This interaction is also modeled by bilinear pooling.
These applications of bilinear pooling to the attention based VQA architecture are shown below.
To summarize, an outer product between two vectors is an extremely rich and powerful mechanism to combine them, in contrast to simple element wise operations. However, the raw outer product is prohibitively expensive and impractical to incorporate into VQA models.
To overcome this deficiency and yet extract some power from it, several versions of bilinear pooling have been proposed over the past few years. These have led to accuracy improvements over simpler interaction methods. I discussed some of the more prominent ones in this post.