Foreword: The author is Hedi Ben Younes, former PhD student at LIP6 / Heuritech. Multimodal fusion of text and image information is an important topic at Heuritech, as most of the media on the internet is composed of images, videos, and text. The challenging Visual Question Answering task is an excellent benchmark for the fusion of text and image.
This blog post presents work by Hédi Ben-Younes*, Rémi Cadène*, Matthieu Cord and Nicolas Thome. The paper was accepted at the International Conference on Computer Vision (ICCV) 2017 and was presented at the poster session. To delve deeper into our work, see the MUTAN paper listed in the references at the end of this post.
Visual Question Answering
The goal of Visual Question Answering (VQA) is to build a system that can answer questions about images.
Solving VQA would significantly broaden what human-machine interfaces can do, making it possible to dynamically extract the information needed from a picture.
In the shorter term, and as pointed out in the foreword, VQA provides a benchmark for multimodal representation methods. We can use this task to develop methods for problems where the inputs are intrinsically multimodal and where the output depends heavily on the combination of modalities.
To solve this problem, precise image and text models (monomodal representations) are required. More importantly, high-level interactions between these two modalities have to be carefully built into the model in order to provide the correct answer.
This projection from the monomodal spaces to a multimodal space is supposed to model the relevant correlations between the two spaces. In addition, the model must be able to understand the full scene, focus its attention on the relevant visual regions, and discard information that is irrelevant to the question.
We cast visual question answering as a classification problem. Given a question q about an image v, we want the predicted answer â to match the correct one a*.
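As a formula, this classification view can be sketched as follows. The notation is an assumption made for this sketch: A denotes the set of candidate answers and Θ the parameters of the model.

```latex
\hat{a} = \arg\max_{a \in \mathcal{A}} \; p(a \mid v, q ; \Theta)
```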
To this end, we first represent the image and the question using powerful monomodal embeddings: ResNet-152 produces the image embedding v, and a GRU yields the question embedding q. The supervision is given by a vector defined over the candidate answers.
We explore the problem of multimodal embedding: how do we learn a multimodal embedding from a composition of monomodal ones? In other words, if the model computes its prediction as f(q, v), what should we put in f?
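Bilinear models are an appealing solution to the fusion problem since they encode a fully parametrized interaction between the two embedding spaces. In its general form, a bilinear projection between q and v computes each output component as a weighted sum over all pairwise products of one component of q and one component of v.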
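Written out, a full bilinear model of this kind takes the standard form below. The symbols are assumptions for this sketch: y is the vector of answer scores, d_q and d_v are the dimensions of q and v, |A| is the size of the answer vocabulary, and the subscripted product signs denote mode-1 and mode-2 tensor-vector products.

```latex
y = (T \times_1 q) \times_2 v,
\qquad
y_k = \sum_{i=1}^{d_q} \sum_{j=1}^{d_v} T_{ijk} \, q_i \, v_j,
\qquad
T \in \mathbb{R}^{d_q \times d_v \times |\mathcal{A}|}
```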
We note that this full bilinear model introduces a third-order tensor T, with one weight for every combination of a question dimension, an image dimension, and an answer. For an answer vocabulary of realistic size, T becomes far too large to store, let alone to learn. To reduce the number of parameters, we add structure to the tensor T using the Tucker decomposition (Kolda and Bader). It consists of writing the very large bilinear interaction T as a smaller bilinear interaction Tc between projections of the input representations.
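To get a feel for the orders of magnitude, here is a small back-of-the-envelope sketch. The sizes are illustrative assumptions, not figures taken from this post: 2048 is the usual size of ResNet-152 pooled features, while the question dimension (2400), the vocabulary size (2000) and the Tucker dimensions (200 each) are placeholder values. The function names are hypothetical.

```python
def full_bilinear_params(d_q, d_v, n_answers):
    """Number of entries in the full tensor T (d_q x d_v x n_answers)."""
    return d_q * d_v * n_answers


def tucker_params(d_q, d_v, n_answers, t_q, t_v, t_o):
    """Entries in the three factor matrices plus the core tensor Tc."""
    factors = d_q * t_q + d_v * t_v + n_answers * t_o
    core = t_q * t_v * t_o
    return factors + core


# Illustrative sizes (assumptions, see the paragraph above).
d_q, d_v, n_answers = 2400, 2048, 2000
t_q = t_v = t_o = 200

full = full_bilinear_params(d_q, d_v, n_answers)
tucker = tucker_params(d_q, d_v, n_answers, t_q, t_v, t_o)
print(f"full bilinear: {full:,} parameters")
print(f"Tucker:        {tucker:,} parameters")
print(f"reduction:     {full / tucker:.0f}x")
```

With these placeholder sizes, the full tensor has roughly 9.8 billion entries, against roughly 9.3 million for the Tucker form.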
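When we constrain T to have low Tucker ranks, we can rewrite the model in three steps: q and v are projected into spaces of dimensions Tq and Tv, the two projections are fused through the core tensor Tc into a vector Z̃ of dimension To, and Z̃ is projected onto the answer space.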
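A minimal NumPy sketch of this factorized model, with small illustrative dimensions, may help. All names (W_q, W_v, W_o, T_c, tucker_fusion) are assumptions made for the example, and nonlinearities, dropout and training are left out. The check at the end rebuilds the full tensor T from its Tucker factors and compares both forms on the same inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative dimensions (assumptions for the example).
d_q, d_v, n_answers = 12, 10, 7   # input and output sizes
t_q, t_v, t_o = 4, 3, 5           # Tucker dimensions

W_q = rng.standard_normal((d_q, t_q))        # question projection
W_v = rng.standard_normal((d_v, t_v))        # image projection
W_o = rng.standard_normal((t_o, n_answers))  # output projection
T_c = rng.standard_normal((t_q, t_v, t_o))   # core tensor


def tucker_fusion(q, v):
    """Project q and v, fuse them through the core tensor, project to answers."""
    q_tilde = q @ W_q                                    # (t_q,)
    v_tilde = v @ W_v                                    # (t_v,)
    z = np.einsum("i,j,ijk->k", q_tilde, v_tilde, T_c)   # (t_o,)
    return z @ W_o                                       # (n_answers,)


q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
y_tucker = tucker_fusion(q, v)

# Rebuild the full tensor T from its factors and apply the full bilinear form.
T_full = np.einsum("ai,bj,ijk,kc->abc", W_q, W_v, T_c, W_o)
y_full = np.einsum("a,b,abc->c", q, v, T_full)

print(np.allclose(y_tucker, y_full))
```

The factorized form never materializes T, which is the point: only the three projection matrices and the small core tensor are stored.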
However, the construction of this decomposition restricts the dimensions Tq, Tv and To to be relatively small (≃ 200), which might cause a bottleneck in the modeling. To reach higher dimensions, and thus reduce the bottleneck, we explore adding more structure to the tensor T. More precisely, we force each slice of the core tensor Tc along its third (output) mode to have a fixed rank R.
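Applying this structural constraint on Tc simplifies the bilinear interaction in the Tucker decomposition. The expression of Z̃ becomes a sum of R terms, each one being the elementwise product of the projected question multiplied by a matrix Mr and the projected image multiplied by a matrix Nr, where Mr and Nr are of size Tq × To and Tv × To.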
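The rank constraint can be sketched in the same spirit. In this illustrative NumPy example (names and sizes are assumptions), each of the R terms is the elementwise product of two linear projections. The check compares the result with a bilinear interaction through a core tensor built so that each slice T_c[:, :, k] is a sum of R outer products, hence of rank at most R.

```python
import numpy as np

rng = np.random.default_rng(0)

t_q, t_v, t_o, R = 6, 5, 4, 3  # illustrative sizes (assumptions)

# One pair of matrices (M_r, N_r) per rank r, stacked along the first axis.
M = rng.standard_normal((R, t_q, t_o))
N = rng.standard_normal((R, t_v, t_o))


def low_rank_fusion(q_tilde, v_tilde):
    """Sum over r of the elementwise product (q_tilde @ M_r) * (v_tilde @ N_r)."""
    q_proj = np.einsum("i,rik->rk", q_tilde, M)   # (R, t_o)
    v_proj = np.einsum("j,rjk->rk", v_tilde, N)   # (R, t_o)
    return (q_proj * v_proj).sum(axis=0)          # (t_o,)


q_tilde = rng.standard_normal(t_q)
v_tilde = rng.standard_normal(t_v)
z_low_rank = low_rank_fusion(q_tilde, v_tilde)

# Equivalent core tensor: slice k is sum_r outer(M_r[:, k], N_r[:, k]).
T_c = np.einsum("rik,rjk->ijk", M, N)
z_core = np.einsum("i,j,ijk->k", q_tilde, v_tilde, T_c)

print(np.allclose(z_low_rank, z_core))
slice_ranks = [np.linalg.matrix_rank(T_c[:, :, k]) for k in range(t_o)]
print(slice_ranks)
```

In this form the core tensor is never built explicitly: the parameters are the R pairs of matrices, which is what allows larger values of Tq, Tv and To.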
Unifying state-of-the-art VQA models
We show that the framework of Tucker decompositions can be used to express some of the state-of-the-art fusion strategies for VQA, namely MLB (Kim et al.) and MCB (Fukui et al.). We invite the interested reader to read the article for details on this point.
Adding visual attention
As in previous work, we integrate our fusion strategy into a multi-glimpse attention mechanism.
We represent the image as a set of region vectors. We then use a MUTAN block to merge each region vector with the question embedding, which yields a score for each region. These scores are used as weights to pool the region vectors by weighted sum, giving an attended visual embedding. This vector is then fused with the question embedding by another MUTAN block to produce the answer.
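One glimpse of this attention scheme can be sketched as follows. This is a simplified illustration under stated assumptions: make_fusion returns a hypothetical stand-in for a MUTAN block (a plain low-rank bilinear fusion with random weights), a softmax over regions is assumed for normalizing the scores, and the multi-glimpse case, which would produce several sets of weights, is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

n_regions, d_v, d_q, n_answers, R = 9, 8, 6, 5, 3  # illustrative sizes


def make_fusion(d_q, d_v, d_out, rank):
    """Return a hypothetical stand-in for a MUTAN block with random weights."""
    M = rng.standard_normal((rank, d_q, d_out)) * 0.1
    N = rng.standard_normal((rank, d_v, d_out)) * 0.1

    def fuse(q, v):
        # Sum over ranks of elementwise products of linear projections.
        return np.einsum("i,rik,j,rjk->k", q, M, v, N)

    return fuse


score_fusion = make_fusion(d_q, d_v, 1, R)            # one score per region
answer_fusion = make_fusion(d_q, d_v, n_answers, R)   # answer scores


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attend_and_answer(q, regions):
    """Score each region against q, pool the regions, then fuse again."""
    scores = np.array([score_fusion(q, r)[0] for r in regions])
    weights = softmax(scores)        # attention weights over regions
    v_att = weights @ regions        # weighted-sum pooling, shape (d_v,)
    return answer_fusion(q, v_att), weights


q = rng.standard_normal(d_q)
regions = rng.standard_normal((n_regions, d_v))
answer_scores, weights = attend_and_answer(q, regions)
print(weights.round(3), answer_scores.shape)
```

The two fusion blocks have separate parameters: the first one only has to rank regions, while the second one produces the answer scores from the pooled visual vector.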
In short, an ensemble of 3 MUTAN models reaches the performance of an ensemble of 7 MLB models, which was the previously published state of the art. We further improve on this result with an ensemble of 5 models. Please read the article for more details on the comparison with other methods.
Impact of rank sparsity
Besides comparing our model to the state-of-the-art, we are interested in understanding how it behaves. More precisely, we focus our study on understanding what the rank constraint can bring.
We carry out these experiments on a MUTAN model without attention. As we reduce the rank, we can increase the output dimension and limit the bottleneck effect. The plot shows that, for a fixed number of parameters in the fusion, performance is better when we reduce the rank and increase the output dimension.
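The trade-off between rank and output dimension at a fixed parameter budget can be illustrated with simple arithmetic. This sketch counts only the Mr and Nr matrices, that is R × (Tq + Tv) × To parameters; the budget and the projection sizes are placeholder assumptions, not the settings used in the experiments.

```python
def max_output_dim(budget, rank, t_q, t_v):
    """Largest t_o such that rank * (t_q + t_v) * t_o fits in the budget."""
    return budget // (rank * (t_q + t_v))


budget = 4_000_000  # illustrative parameter budget (assumption)
t_q = t_v = 200     # illustrative projection sizes (assumption)

dims = {rank: max_output_dim(budget, rank, t_q, t_v) for rank in (20, 10, 5)}
for rank, t_o in dims.items():
    print(f"rank {rank:>2}: output dimension up to {t_o}")
```

With these placeholder numbers, halving the rank doubles the output dimension that fits in the same budget.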
Introducing the rank constraint implies that we write Z̃ as a sum of R projections Zr.
We want to assess what kind of information these different projections have learned. Are these projections complementary, redundant, specialized,…?
We train a MUTAN without attention, with R = 20, and measure its performance on the validation set. We then set all the Zr except one to 0 and measure the performance of this ablated system.
We do so for each one of the 20 projections. We compare the full system to the R ablated systems on different question types. In the plots below, the dotted line represents the performance of the full system, and each bar is the performance obtained when we keep only the corresponding projection.
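The ablation protocol can be sketched as follows. Everything here is a toy illustration with random weights and random labels, so the printed accuracies carry no meaning; the names (project, predict, W_o) and the linear output layer are assumptions. The point is only the masking step: all Zr except one are set to zero before the output projection.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, t_q, t_v, t_o, n_answers, R = 50, 6, 5, 4, 3, 20

M = rng.standard_normal((R, t_q, t_o))
N = rng.standard_normal((R, t_v, t_o))
W_o = rng.standard_normal((t_o, n_answers))

Q = rng.standard_normal((n_samples, t_q))        # projected questions
V = rng.standard_normal((n_samples, t_v))        # projected images
labels = rng.integers(0, n_answers, n_samples)   # random toy labels


def project(Q, V):
    """Return all Z_r, shape (n_samples, R, t_o)."""
    return np.einsum("ni,rik->nrk", Q, M) * np.einsum("nj,rjk->nrk", V, N)


def predict(Z_r, keep=None):
    """Sum the Z_r (all of them, or only index `keep`) and pick an answer."""
    if keep is not None:
        mask = np.zeros(Z_r.shape[1])
        mask[keep] = 1.0
        Z_r = Z_r * mask[None, :, None]   # zero every projection but one
    return (Z_r.sum(axis=1) @ W_o).argmax(axis=1)


Z_r = project(Q, V)
full_acc = (predict(Z_r) == labels).mean()
ablated_acc = [(predict(Z_r, keep=r) == labels).mean() for r in range(R)]
print(f"full: {full_acc:.2f}")
print("single projections:", np.round(ablated_acc, 2))
```

Grouping the samples by question type before computing the accuracies would give one bar chart per question type, as in the plots described above.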
Depending on the question type, we observe 3 different behaviors of the ranks.
For questions starting with “Is there”, whose answer is almost always “yes” or “no”, we observe that each rank has learned enough to reach almost the same accuracy as the full system.
Other question types require information from all the latent projections, as in the case of “What is the man”. This leads to cases where all projections perform equally, and significantly worse when taken individually than when combined in the full model.
Finally, we observe that specific projections contribute more than others depending on the question type. For example, latent variable 16 performs well on “what room is” and is less informative for questions starting with “what sport is”. The opposite behavior is observed for latent variable 17.
The Tucker decomposition framework helps us understand which kind of structure is imposed on a bilinear model.
We have successfully applied it in the context of VQA, combining it with a low-rank constraint on the core tensor of the decomposition. It could be interesting to explore other kinds of structure in the elements of the decomposition, while delving deeper into tensor analysis to get a more precise understanding of the expressivity that a given decomposition allows.
We would also like to apply the methods developed for VQA to other tasks requiring multimodal representations.
Hedi Ben-younes, Rémi Cadène, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. arXiv preprint arXiv:1705.06676. Accepted at ICCV 2017.
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. ICCV 2015.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev. 51(3):455–500, Aug. 2009.
J.-H. Kim, K.-W. On, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard Product for Low-rank Bilinear Pooling. ICLR 2017.
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP 2016.
Originally published at https://lab.heuritech.com.