Pruning and Merging of Tokens for Efficient VL Models: A Review

Vedant Palit
9 min read · Aug 9, 2023


In computer vision and NLP tasks, computationally expensive and memory-intensive processing is often a hindrance to faster model inference and smaller memory footprints.

The Paper “PuMer: Pruning and Merging of Tokens for Efficient Vision-Language Models” published in ACL 2023 proposes an effective method of reducing the time and space complexity of cross-modal interactions.

The main methodology of the paper involves pruning (removing) image tokens that are irrelevant to the text input, while also merging similar textual and visual tokens using lightweight token reducers spread across the cross-modal layers. The paper evaluates the PuMer method for two VL models — METER and ViLT — on four downstream tasks: Image-Text Retrieval, Visual Question Answering, Visual Entailment and Natural Language for Visual Reasoning.

Before moving on to the details of the method, here is a brief overview of ViLT and METER as well as the downstream tasks:

ViLT (Vision-and-Language Transformer Without Convolution or Region Supervision) — The transformer uses a 12-layer encoder for cross-modal fusion over the concatenation of BERT-style text embeddings and linearly projected image patches.

ViLT architecture compared with other architectures
The METER Framework

METER (Multimodal End-to-end TransformER) — The transformer uses a RoBERTa-based text encoder, a CLIP-based image encoder and 12 BERT-like cross-attention layers to fuse the image and text modalities. It has around 330 million parameters.

The Problem and The Proposed Solution

Deep vision-language models are usually computationally inefficient because they process all input image and text tokens, even though the images often contain information that is irrelevant to a given text input.

The image shows the image tokens relevant to a particular question input in a VQA task

The proposed solution adapts unimodal token pruning and merging to the multimodal setting, while avoiding the large information loss that pruning can cause and the model confusion that modality-unaware merging can cause.

The method introduces —

  • Text-informed image token pruning — This removes image tokens that are irrelevant to the text input (for example, the image tokens of the four footballers for the question what sport are they playing?).
  • Modality-aware token merging — This merges semantically redundant image and text tokens independently within each modality (for example, combining the tokens for the four footballers for the question how many people are playing?).

Using these two methods, lightweight non-parametric token reducers are placed along the layers of the model, with progressively more token reduction in the deeper layers so that the early layers retain the full information.

Related Works

Works on Token Pruning: DynamicViT and A-ViT perform unimodal pruning of input image tokens, removing content deemed uninformative and keeping only the salient features.

The major limitation here is the lack of contextualisation with the text input, which prevents these methods from extending to the vision-language domain.

On the language side, PoWER-BERT reduces input text tokens to speed up computation; however, text length is not the major bottleneck in VL tasks.

Works on Token Merging: SPViT and EViT select the uninformative tokens and merge them into one. GroupViT combines semantically similar image tokens through cross-attention.

ToMe, TokenLearner and Token Pooling combine tokens rather than pruning them, achieving better throughput-accuracy trade-offs.

The Token Reduction Framework is placed in the Image Text Cross-Modal Encoder of a VL Architecture

The Token Reduction Framework

Token Reducers in Cross-Modal Encoder

Considering a cross-modal encoder with n layers, the token reducer removes k% of the image tokens in each layer l, where l lies in (f, n). Here f is a chosen layer before which no token reducers are placed, i.e. no reducers are placed in layers (0, f). The reason is that reducing tokens in the early layers, despite the gain in processing efficiency, causes major information and performance loss.

Additionally, in layer l the token reducer also merges r% of the image tokens and t% of the text tokens.

Every token reducer consists of two sequential, non-parametric modules: a Text-Informed Pruner (TIP) and a Modality-Aware Merger (MAM). TIP prunes image tokens that are not contextually relevant to the text input, whereas MAM compresses semantically similar tokens into a smaller number of tokens.
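To make the placement concrete, here is a minimal PyTorch-style sketch of how such a reducer could be slotted into the cross-modal layers. It assumes each layer also exposes its text-to-image cross-attention scores and self-attention keys; the class name is my own, and the two helper functions are sketched in the Algorithm section below.

```python
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Cross-modal encoder with non-parametric token reducers in layers f..n-1 (a sketch)."""

    def __init__(self, layers, f, prune_k, img_merge_r, txt_merge_t):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # the existing cross-modal layers
        self.f = f                            # first layer index that gets a reducer
        self.k, self.r, self.t = prune_k, img_merge_r, txt_merge_t

    def forward(self, img_tokens, txt_tokens):
        for l, layer in enumerate(self.layers):
            # assumed: each layer also returns its text-to-image cross-attention
            # scores A and the self-attention keys of both modalities
            img_tokens, txt_tokens, A, img_keys, txt_keys = layer(img_tokens, txt_tokens)
            if l >= self.f:
                # TIP: drop image tokens the text does not attend to
                img_tokens, img_keys = text_informed_prune(img_tokens, img_keys, A, self.k)
                # MAM: merge similar tokens separately within each modality
                img_tokens = modality_aware_merge(img_tokens, img_keys, self.r)
                txt_tokens = modality_aware_merge(txt_tokens, txt_keys, self.t)
        return img_tokens, txt_tokens
```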

The Algorithm

Input: image token vector V, text token vector T, text-to-image cross-attention scores A, prune ratio k, image merge ratio r, text merge ratio t

Working:

1) Using the text-to-image cross-attention scores A computed in the cross-modal attention layers, the text-saliency score, i.e. the correlation between each image token and the text tokens, is calculated. Roughly, s(v) = (1/|T|) Σ_t A[t, v], the average cross-attention from all text tokens t to the v-th image token, where s(v) is the text-saliency score for the v-th image token in the image token vector V.

A rough understanding of the text-saliency score s(v)

2) After obtaining the vector S containing all the s(v) values, the indices of the top k' scores are selected, where k' is the number of image tokens to keep, governed by the prune ratio k as k' = (1 - k)|V|. (TIP)

3) A new image token vector V(p) = V[idx] is formed.
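A minimal sketch of steps 1)–3), i.e. the TIP, assuming A is a (|T|, |V|) tensor of text-to-image cross-attention scores (averaged over heads) and that the saliency is the mean over text tokens; the function name and signature are my own, not the authors' code:

```python
import torch

def text_informed_prune(img_tokens, img_keys, cross_attn, k):
    """Keep the (1 - k) fraction of image tokens most attended to by the text.

    img_tokens: (|V|, d) image token states
    img_keys:   (|V|, d) their self-attention keys (reused by the merger below)
    cross_attn: (|T|, |V|) text-to-image cross-attention scores A
    k:          prune ratio
    """
    saliency = cross_attn.mean(dim=0)                  # s(v) for every image token
    k_keep = int(round((1 - k) * img_tokens.size(0)))  # k' = (1 - k)|V|
    idx = saliency.topk(k_keep).indices                # indices of the top-k' tokens
    return img_tokens[idx], img_keys[idx]              # V(p) = V[idx]
```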

4) The text tokens T and the pruned image tokens V(p) are then each merged within their own modality using bipartite soft matching followed by concatenation, which works as follows. (MAM)

A rough visualisation of the bipartite soft matching and merging algorithm

The similarity score S(p) is computed through the dot product of the key vectors (calculated in the self-attention layers): for two tokens t1 and t2, S(p) = K(t1) · K(t2).

Using the top edges from P(r), the matched tokens from the two sets O and E are extracted and merged by averaging the pair, (O + E)/2, to form OE.

Finally, the unmerged O and E tokens are gathered alongside OE, to form the new merged text tokens T(m) and pruned-merged image tokens V(m).
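Below is a rough ToMe-style sketch of the MAM step as described above: the tokens are split into two alternating sets O and E, each O token is linked to its most similar E token via key dot products, the strongest fraction of links is merged by averaging, and the rest are kept. The function name and details are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def modality_aware_merge(tokens, keys, merge_ratio):
    """Merge the most similar token pairs within a single modality.

    tokens: (N, d) token states, keys: (N, d) their self-attention keys
    merge_ratio: fraction of tokens to merge away (r for image, t for text)
    """
    o_tok, e_tok = tokens[0::2], tokens[1::2]   # alternating split into sets O and E
    o_key, e_key = keys[0::2], keys[1::2]
    sim = o_key @ e_key.T                       # S(p): key dot-product similarities
    best_sim, best_e = sim.max(dim=1)           # each O token's closest E token

    n_merge = min(int(merge_ratio * tokens.size(0)), o_tok.size(0))
    merge_o = best_sim.topk(n_merge).indices    # the strongest edges, i.e. P(r)

    merged = (o_tok[merge_o] + e_tok[best_e[merge_o]]) / 2   # OE = (O + E) / 2
    keep_o = torch.ones(o_tok.size(0), dtype=torch.bool)
    keep_o[merge_o] = False
    keep_e = torch.ones(e_tok.size(0), dtype=torch.bool)
    keep_e[best_e[merge_o]] = False

    # unmerged O and E tokens gathered alongside OE (token order is not
    # preserved in this simplified sketch)
    return torch.cat([o_tok[keep_o], e_tok[keep_e], merged], dim=0)
```

Calling this once on V(p) with ratio r and once on T with ratio t gives the pruned-merged image tokens V(m) and the merged text tokens T(m).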

Training and Inference

The PuMer framework adds no parameters, so the fine-tuning setup is the same as for the original VL model. Accuracy drops are further limited by a knowledge distillation loss, which minimises the distance between the feature activations of the teacher model (the original, non-PuMer model) and the student model (the PuMer model). The only configurable hyperparameters of the framework are the prune ratio k and the merge ratios r and t.
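Since the distillation term is only described as minimising the distance between teacher and student feature activations, here is one plausible sketch of how it could be combined with the task loss, assuming both models expose pooled (e.g. [CLS]) features so the shapes match; the MSE distance and the weight alpha are my own assumptions:

```python
import torch
import torch.nn.functional as F

def pumer_finetune_loss(student_feats, teacher_feats, logits, labels, alpha=1.0):
    """Task loss plus a feature-distillation term pulling the PuMer (student)
    model's pooled features toward the original VL (teacher) model's features."""
    task_loss = F.cross_entropy(logits, labels)
    # the teacher is the frozen original model, so its activations are detached
    kd_loss = F.mse_loss(student_feats, teacher_feats.detach())
    return task_loss + alpha * kd_loss
```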

Experimental Setup and the Results

The PuMer framework is tested with the previously mentioned ViLT and METER models across four VL tasks —

  • Image-Text Retrieval: This consists of two subtasks — image-to-text retrieval and text-to-image retrieval and is tested on Flickr30k.
  • Visual Question Answering: This task is tested on the VQAv2 dataset and consists of questions about images from the MSCOCO dataset as well as real-world scenes.
  • Visual Entailment (VE): This is a visual inference task, i.e. predicting whether an image premise entails a text hypothesis, tested on a dataset constructed from statements of the Stanford Natural Language Inference corpus paired with Flickr30k images.
  • Natural Language for Visual Reasoning (NLVR): This is a task to predict whether a given sentence is true about two input images. The NLVR2 corpus contains 100K linguistically diverse English sentences written by humans and grounded in pairs of visually complex images.

The Baseline models include —

  • DynamicViT: It consists of prediction modules parametrised by MLPs that predict prune-worthy image tokens in vision transformers.
  • ToMe: It utilises Token Merging to reduce the number of tokens in transformers.
  • Resolution Reduction: Another baseline is the reduction of the input image resolution which automatically improves computational efficiency.
Table 1 — Results

The Metrics of measurement are —

  • Accuracy Metrics: For VQAv2 it is the VQA accuracy, i.e. whether the model output matches the ground-truth answer; for visual entailment and natural language visual reasoning it is classification accuracy; and for image-text retrieval it is top-1 recall.
  • Efficiency Metrics: Throughput increase and memory reduction, which the authors argue reflect real-world efficiency better than the standard FLOPs (floating-point operations) count; a rough way to measure both is sketched below.
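For reference, a simple measurement of these two metrics in PyTorch might look like the sketch below. It assumes the model takes preprocessed image and text tensors and runs on a CUDA device, and is only meant to illustrate wall-clock throughput and peak-memory measurement, not the authors' benchmarking code.

```python
import time
import torch

@torch.no_grad()
def measure_throughput_and_memory(model, batches, device="cuda"):
    """Rough examples/sec throughput and peak GPU memory (GB) for one model."""
    model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start, n_examples = time.perf_counter(), 0
    for images, texts in batches:            # batches of preprocessed tensors
        model(images.to(device), texts.to(device))
        n_examples += images.size(0)
    torch.cuda.synchronize(device)
    throughput = n_examples / (time.perf_counter() - start)
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1e9
    return throughput, peak_mem_gb
```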

Results

PuMer compared with baselines DynamicViT and ToMe

Table 1 demonstrates that the PuMer-integrated ViLT and METER models outperform the original METER and ViLT in throughput and memory while maintaining competitive accuracy on the tasks. Overall there is a 1.85x speedup and a 46% reduction in memory utilisation, with accuracy staying within a 1% margin.

The figure above shows the comparison of PuMer with baselines DynamicViT and ToMe on the VQA task for different pruning and merging ratios to observe the throughput versus accuracy trade-offs.

For the same value of throughput (represented by the vertical dotted line in the figure), PuMer demonstrates a higher accuracy on the VQA task. Similarly, for the same value of accuracy drop (represented by the horizontal dotted line in the figure), PuMer demonstrates a higher throughput increase.

Table 2 — PuMer compared against smaller resolution baselines

Table 2 shows the accuracy, throughput increase and memory reduction for PuMer compared with the original METER model run at smaller image resolutions (below 384x384), along with a smaller-resolution (320x320) PuMer-integrated METER model.

At the same reduced resolution (320x320), PuMer has slightly lower accuracy, but its throughput increase is almost 1.76 times (2.86/1.62) that of the original 320x320 METER model. Additionally, the memory reduction of the 320x320 PuMer model is almost 22% greater than that of the original 320x320 model.

Ablation Study

Study on TIP, MAM and KDL

To identify how each component, i.e. Text-Informed Pruning, Modality-Aware Merging and the Knowledge Distillation Loss, contributes to throughput, memory consumption and accuracy, the three are ablated in turn.

Table 3 — Ablation Study Observations
  • Without TIP, the accuracy drop is smaller, but the throughput gain is about 0.24x lower.
  • Without MAM, the throughput gain is about 0.3x lower.
  • Without KDL, the throughput is almost unchanged, but the accuracy stays slightly below PuMer-ViLT.

Hence, all three components aid in closing the accuracy gaps while at the same time allowing a larger throughput increase.

Study on Token Reduction Design Choices

Table 4 — Token Reduction Design Choice Results

The above are the results on the SNLI-VE task when the prune and merge ratios, the number of reduction layers and their locations are varied. The final PuMer design choices are reduction in layers 2, 4, 6 and 8 for an even spread across the model, prune ratio k = 0.1, image merge ratio r = 0.3 and text merge ratio t = 0.2, which yields an accuracy of 75.6 (0.4 below the original model) and almost double the throughput.
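For quick reference, the reported final configuration can be summarised as a small config dictionary; the field names are my own, the values are those reported above.

```python
# Final PuMer design choices on SNLI-VE, as reported above (Table 4)
PUMER_CONFIG = {
    "reduction_layers": [2, 4, 6, 8],   # evenly spread across the cross-modal encoder
    "prune_ratio_k": 0.1,               # fraction of image tokens pruned per reducer
    "image_merge_ratio_r": 0.3,         # fraction of image tokens merged per reducer
    "text_merge_ratio_t": 0.2,          # fraction of text tokens merged per reducer
}
```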

Conclusion

Efficient deep learning is becoming crucial at a time when computation resources remain limited, as it allows widespread use of these multi-million or even multi-billion parameter models. Although token reduction inside the vision encoders of these multimodal transformers could improve efficiency further, the reduction strategy would have to vary with the task, since the vision encoder does not have access to the text input.

All figures are from the paper, except the ViLT and METER figures which are from the respective transformer papers — “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”, “An Empirical Study of Training End-to-End Vision-and-Language Transformers”.

The handwritten pages are my own rough work to get a better understanding of the concepts involved.

Thank You :)

