AI Acceleration

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

A brief review of the research paper co-authored by Jiwon Song*, Kyungseok Oh*, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim, published February 2024.

Semin Cheon
SqueezeBits Team Blog

--

Paper at: https://arxiv.org/abs/2402.09025

Pruning methods

Despite their proficiency in NLP tasks, large language models (LLMs) carry enormous numbers of parameters, which makes deploying them in real-world services difficult: the models demand heavy memory and computation. SLEB is a newly introduced pruning method that aims to optimize LLMs without compromising their linguistic capability. It is distinct from previous methods in that it adopts the transformer block as the fundamental unit of pruning, identifying and eliminating redundant blocks to speed up inference and reduce memory usage.

Challenges in Previous Methods

Conventional pruning methods, such as 2:4 structured pruning and channel-wise pruning, have difficulty achieving the desired inference speedup. Numerous factors, including batch size, model size, and hardware support, prevent 2:4 structured pruning from translating into faster inference. Channel-wise pruning techniques, in turn, require extensive fine-tuning, and the resources this retraining demands make them impractical for extremely large-scale models.

Techniques based on the Early Exit approach, which skip transformer blocks at inference time, require dynamic decision-making or extensive training. Analysis of this type of strategy shows that it degrades LLMs' linguistic capabilities, and it faces the following challenges in particular:

  • limited acceleration (different processing paths across samples in multi-batch settings complicate execution and diminish efficiency)
  • limited memory savings
  • resource-heavy training (around 90% of transformer blocks still need processing)

SLEB, a novel and efficient pruning approach, addresses these challenges and streamlines LLMs by identifying and removing redundant transformer blocks.

Redundancy Verification and Elimination

The paper analyzes the cosine similarity between the outputs of transformer blocks and reveals that neighboring blocks in the LLM architecture consistently produce highly similar outputs. The Early Exit approach overlooks this point: it bypasses a continuous sequence of transformer blocks, including blocks that are essential for preserving the LLM's linguistic integrity.

Cosine similarity between the outputs of two transformer blocks, Block i and Block j
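As a rough illustration (not the authors' exact implementation), this block-level redundancy can be measured by computing the cosine similarity between the hidden states produced at two depths, token by token, and averaging the result. The short PyTorch sketch below assumes the two hidden-state tensors have shape (num_tokens, hidden_dim):

    import torch
    import torch.nn.functional as F

    def block_output_similarity(hidden_i: torch.Tensor, hidden_j: torch.Tensor) -> float:
        # Token-wise cosine similarity between the outputs of Block i and Block j,
        # averaged over all tokens; values close to 1.0 indicate redundancy.
        sims = F.cosine_similarity(hidden_i, hidden_j, dim=-1)
        return sims.mean().item()

A high average similarity between a block's input and its output suggests that the block barely changes the hidden representation, which is exactly the kind of redundancy SLEB looks for.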

To identify each block's significance and select redundant ones for elimination, the paper presents three metrics that measure the impact of removing a block. Of the three, Metric3 is used in the SLEB algorithm, since LLMs streamlined with Metric3 maintain their perplexity scores consistently across target models.

Using Metric3, SLEB streamlines the model by iteratively identifying the most redundant block and eliminating it. Because each decision accounts for the current state of the already-pruned LLM, the method avoids removing consecutive blocks and selects the blocks to remove according to the characteristics of the particular model.
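To make the iterative procedure concrete, the following is a minimal sketch of such a greedy loop, not the paper's actual code: at each step it tentatively drops every remaining block, scores the resulting model by measuring perplexity on calibration data (the idea behind Metric3), and permanently removes the block whose absence hurts perplexity the least. The calibration_perplexity callable is a hypothetical stand-in for whatever evaluation routine you already have.

    def sleb_style_prune(blocks, calibration_perplexity, num_blocks_to_remove):
        # blocks: list of transformer blocks making up the current model.
        # calibration_perplexity: hypothetical callable that builds a model from
        #   a block list, runs it on calibration data, and returns its perplexity.
        blocks = list(blocks)
        for _ in range(num_blocks_to_remove):
            best_idx, best_ppl = None, float("inf")
            for idx in range(len(blocks)):
                # Metric3-style scoring: evaluate the model with this block removed.
                candidate = blocks[:idx] + blocks[idx + 1:]
                ppl = calibration_perplexity(candidate)
                if ppl < best_ppl:
                    best_idx, best_ppl = idx, ppl
            # Keep the removal that degrades perplexity the least, then repeat
            # on the already-pruned model.
            del blocks[best_idx]
        return blocks

Because every removal decision is re-evaluated on the already-pruned model, the blocks that end up being dropped depend on the specific LLM and are not forced to be adjacent.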

Key Results

SLEB outperforms previous pruning methods

Table 1 shows the perplexity of the OPT-66B and LLaMA-2-70B models on the C4 dataset. SLEB not only preserves perplexity well in all cases but also outperforms the other pruning methods: compared with 2:4 pruning and channel-wise pruning (Wanda, SliceGPT), SLEB achieves the lowest perplexity on both OPT-66B and LLaMA-2-70B.

What’s more, LLMs pruned with SLEB perform well on zero-shot tasks. At a 10% sparsity level they show superior accuracy, surpassing the other approaches and closely preserving the original models’ accuracy.

SLEB delivers superior speed improvements

Tables 3 and 4 present the latency and throughput improvements of each pruning method on LLaMA-2-70B using two NVIDIA A100 GPUs, covering both prompt processing and token generation. In this experiment, an input sequence of 2048 tokens is processed, and 128-token outputs are generated with a batch size of 64.

For 2:4 pruning (Wanda) and channel-wise pruning (SliceGPT), the sparsity levels are 50% and 25%, respectively. Although SLEB’s sparsity is lower at 20%, it achieves the largest latency speedup and the greatest improvement in token throughput. This advantage comes from adopting the transformer block as the basic unit of pruning: by removing entire transformer blocks, SLEB accelerates end-to-end LLM inference and provides speedup benefits across various serving scenarios.

The perplexity of LLMs pruned with SLEB (target sparsity: 20%) vs. those further compressed using 4-bit AWQ

Moreover, AWQ, a popular 4-bit post-training quantization algorithm, can be applied orthogonally to LLMs that have first undergone SLEB, further improving memory efficiency and inference speed. The results show no discernible impact on perplexity when AWQ is additionally applied to SLEB-streamlined models.
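The composition itself is straightforward to express. The sketch below only illustrates the ordering, with the pruning and quantization steps passed in as callables; both are placeholders for an actual SLEB pruner and AWQ quantizer rather than real library APIs.

    def compress(model, calibration_data, prune_step, quantize_step):
        # prune_step: placeholder for a SLEB-style block pruner (e.g. 20% sparsity).
        # quantize_step: placeholder for a 4-bit AWQ post-training quantizer.
        pruned = prune_step(model, calibration_data)
        return quantize_step(pruned, calibration_data)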

In conclusion, by identifying and removing up to 20% of redundant transformer blocks, SLEB achieves substantial inference speedup in LLMs while having minimal impact on linguistic performance, both in language modeling and on zero-shot tasks.

Our Mission

As a front line of innovation in AI compression, SqueezeBits actively contributed to this collaborative research and publication. By publishing this brief review of the SLEB method, we hope to offer more insight into compressing your AI models. We always strive to lead industry and academia by challenging conventional ways of thinking and finding new, better solutions for AI model optimization.

Paper Source: https://arxiv.org/abs/2402.09025

For more information on how we apply various state-of-the-art pruning techniques, contact us at info@squeezebits.com
