Paper Summary — Referring Image Segmentation via Cross-Modal Progressive Comprehension

Aditya Chinchure · Published in Technonerds · 4 min read · May 27, 2021

Referring Image Segmentation (RIS) is a multimodal task where, given an image and a text phrase describing an object in the image, the goal is to segment that object in the image. The Cross-Modal Progressive Comprehension (CMPC) model (Huang et al. 2020) builds on the Recurrent Refinement Network (RRN) model (Li et al. 2018), which used a ConvLSTM to refine multimodal features. CMPC adds a fully-connected spatial graph to localize the object in the image.

A summary of the method

The CMPC method is quite complex, but it can be divided into three separate parts. First, image and text features are encoded independently, using a CNN backbone and an LSTM respectively. Each word is assigned probabilities indicating whether it acts as an entity, attribute, or relationship word. The entity and attribute words are then combined with image features from three levels of the CNN to obtain three sets of multimodal features.
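To make this first stage more concrete, here is a minimal sketch in PyTorch. It is not the authors' code; the module names, the three-way softmax over word types, and the simple concatenation-based fusion are my own illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of stage one: predict
# per-word type probabilities and fuse text with one level of CNN features.
import torch
import torch.nn as nn

class WordTypeClassifier(nn.Module):
    """Assigns each word a soft probability of being an entity, attribute, or relation word."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 3)  # 3 word types (assumed split)

    def forward(self, word_feats):            # (B, T, H) per-word LSTM outputs
        return torch.softmax(self.proj(word_feats), dim=-1)  # (B, T, 3)

def fuse_text_with_visual(visual_feats, word_feats, word_probs):
    """Combine entity/attribute-weighted text with one level of CNN features."""
    # Weight each word by its entity + attribute probability, then pool.
    weights = word_probs[..., 0:2].sum(-1, keepdim=True)        # (B, T, 1)
    text_vec = (word_feats * weights).sum(1)                    # (B, H)
    B, C, H, W = visual_feats.shape
    # Broadcast the pooled text vector over every spatial location.
    text_map = text_vec[:, :, None, None].expand(B, text_vec.size(1), H, W)
    return torch.cat([visual_feats, text_map], dim=1)           # multimodal features
```

In the full model this fusion is repeated for each of the three CNN levels, giving the three sets of multimodal features mentioned above.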

Second, a fully-connected graph is formed for each set of multimodal features, where the multimodal features are the vertices (they essentially represent spatial regions in the image, combined with text information) and the relationship words form the edges. Using this, graph convolution is applied to localize the object, i.e. select the vertices that represent the object to be segmented.
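The sketch below shows one way such a graph-reasoning step could look, under my own simplifying assumptions: every spatial location is a vertex, the fully-connected adjacency comes from feature similarity gated by a pooled relation-word vector, and a single graph-convolution layer passes messages between vertices. It is an illustration of the idea, not the paper's exact formulation.

```python
# Rough sketch of graph convolution over multimodal spatial vertices
# (illustrative only; module names and the gating scheme are assumptions).
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge_proj = nn.Linear(dim, dim)   # projects vertices before building edges
        self.node_proj = nn.Linear(dim, dim)   # updates vertices after message passing

    def forward(self, mm_feats, relation_vec):
        # mm_feats: (B, N, D) multimodal vertices (N = H*W spatial regions)
        # relation_vec: (B, D) pooled relation-word feature that gates the edges
        gated = mm_feats * relation_vec.unsqueeze(1)                 # inject relation info
        adj = torch.softmax(
            self.edge_proj(gated) @ gated.transpose(1, 2), dim=-1)  # (B, N, N) edge weights
        messages = adj @ mm_feats                                    # aggregate neighbours
        return mm_feats + torch.relu(self.node_proj(messages))      # residual vertex update
```

Vertices that belong to the referred object end up reinforcing each other through these weighted edges, which is what "localizing" the object amounts to here.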

Third, a feature-exchange step lets information flow between the three sets of refined multimodal features, with the text guiding this interaction. The three sets are then combined using a ConvLSTM to produce the output segmentation map.
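Here is a toy illustration of that final fusion, again under my own simplifications: the three refined levels are assumed to share a spatial size, they are fed to a ConvLSTM as three "time steps", and the last hidden state is decoded into a mask by a small head. PyTorch has no built-in ConvLSTM, so a minimal cell is written out explicitly.

```python
# Toy sketch (not the paper's exact architecture) of fusing three refined
# feature levels with a ConvLSTM and decoding the result into a mask.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

def fuse_levels(feature_levels, cell, mask_head):
    """feature_levels: list of three (B, C, H, W) maps, assumed same spatial size."""
    B, _, H, W = feature_levels[0].shape
    h = torch.zeros(B, cell.hid_ch, H, W)
    c = torch.zeros_like(h)
    for feats in feature_levels:       # treat each level as one ConvLSTM step
        h, c = cell(feats, (h, c))
    return mask_head(h)                # e.g. a 1x1 conv producing the segmentation map
```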

What I found most interesting

The use of graph convolution is very interesting. Mathematically speaking, any CNN or transformer model can be written as a graph model with certain constraints. Graph models are very general and very flexible, and in this case the graph is used to localize the object in the image.

I also found the use of probabilities to represent what each word means quite interesting. These probabilities act like weights that determine where a word has the most impact: in forming the multimodal features that represent objects in the image, or in the graph edges, where they control how information flows through the graph.

Why should you be excited (and why am I excited)?

Having worked on RIS models in my research, including modifying the CMPC model itself, I am excited to see this task tackled in many more papers. While some approaches first detect objects in the image with Fast R-CNN, this method does not; it produces a pixel-level segmentation directly from the image.

When this model was released, it was SOTA for the task. Still, on most datasets it only reaches an IoU of around 0.60, which shows how much room for improvement remains. With better graph-based learning and some refinement architectures, I think the score can be improved significantly. Using vision transformers to encode the multimodal features before applying graph convolution could be a very exciting direction as well.

I am writing a series of summaries of papers that I have been reading, mostly involving multimodal computer vision and NLP tasks. These summaries are in layman’s terms, and not detailed. You can find all the papers I have summarized here.

I am a student researcher at The University of British Columbia working on Vision and NLP tasks. If you are interested in these topics as well, let’s get in touch!
