Photo by Google DeepMind on Unsplash

Efficient Streaming LLM Using Intel Extension for Transformers

Enabling Continuous LLM Inference on CPUs

Intel(R) Neural Compressor
3 min read · Oct 27, 2023


Bo Dong, Zhenwei Liu, Zhentao Yu, Yi Ding, Hanwen Chang, and Haihao Shen, Intel Corporation

Intel Extension for Transformers provides an efficient inference runtime for large language models (LLMs) on Intel platforms through state-of-the-art model compression techniques. It has the following features:

  • Modular design to support new models
  • Optimized kernels
  • AMX, VNNI and AVX512F instruction set usage
  • Support for x86 CPUs and Intel GPUs
  • Support for 4- and 8-bit quantization
  • Indirect access KV cache
  • Streaming LLM
  • Tensor parallelization for distributed inference/training on multi-node and multi-socket systems
  • Multiple post-processing methods: top-k sampling and beam search
Intel Extension for Transformers Runtime Architecture

While developing the Intel Extension for Transformers runtime, we identified two issues that LLMs may encounter in chat scenarios:

  • Limited Output Length: LLMs are pretrained on sequences of limited length, so accuracy degrades once the sequence length exceeds the attention window size used during pretraining.
  • Inefficiency: In the decoding phase, Transformer-based LLMs cache the Key and Value states (KV) of all prior tokens, resulting in excessive memory usage and higher decoding latency (see the rough estimate below).
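
To make the memory cost concrete, here is a rough back-of-the-envelope estimate of KV cache growth. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) is assumed purely for illustration and is not a measurement from the runtime:

# Rough KV cache size estimate; the model shape below is assumed for illustration.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(seq_len):
    # Key and Value tensors (2x) are cached at every layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len

for seq_len in (2048, 16384, 131072):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 2**30:.1f} GiB")

# 2048 tokens -> 1.0 GiB, 16384 -> 8.0 GiB, 131072 -> 64.0 GiB: the cache grows
# linearly with sequence length, which is exactly what Streaming LLM keeps bounded.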

To address these problems, we integrated Streaming LLM into Intel Extension for Transformers, bringing substantial improvements in memory usage and inference latency. Unlike traditional KV cache algorithms, our approach incorporates an Attention Sink (the four initial tokens) to stabilize attention computation, while the Rolling KV Cache retains the most recent tokens, which are crucial for language modeling. This design is remarkably flexible and integrates seamlessly into autoregressive language models that use relative positional encoding, such as RoPE and ALiBi.

(Image source: Efficient Streaming Language Models with Attention Sinks)
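
The cache policy itself is simple. Below is a minimal, framework-agnostic sketch of the idea in the figure above; the class and variable names are our own illustration and do not reflect the runtime's internal implementation:

# Minimal sketch of the Streaming LLM cache policy: always keep the first few
# "attention sink" tokens plus a rolling window of the most recent tokens.
# Names are illustrative, not Intel Extension for Transformers internals.
from collections import deque

class RollingKVCache:
    def __init__(self, ctx_size, n_sink=4):
        self.n_sink = n_sink                  # initial tokens kept as attention sinks
        self.sink = []                        # KV entries of the first n_sink tokens
        self.recent = deque()                 # rolling window of recent KV entries
        self.max_recent = ctx_size - n_sink   # budget left for recent tokens

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)        # fill the attention sinks first
        else:
            if len(self.recent) == self.max_recent:
                self.recent.popleft()         # evict the oldest non-sink entry
            self.recent.append(kv_entry)

    def entries(self):
        # Attention is computed over the sinks plus the most recent tokens only,
        # so memory stays bounded no matter how long generation runs.
        return self.sink + list(self.recent)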

This video shows the results with and without Streaming LLM on the Intel Extension for Transformers runtime.

Without Streaming LLM, the Intel Extension for Transformers runtime slows down and eventually runs out of memory. With Streaming LLM, it handles effectively unbounded inference by keeping memory usage bounded while delivering excellent performance. Moreover, we enhanced the Streaming LLM strategy with the parameters n_keep and n_discard: the former specifies the number of tokens to retain in the KV cache, and the latter determines the number of generated tokens to discard when the context length threshold is reached. By default, the runtime discards half of the recent tokens in the KV cache to strike a balance between performance and accuracy.

You can enable Streaming LLM with Intel Extension for Transformers runtime as follows:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

# Recommended: n_keep=4 keeps the attention sinks (four initial tokens); n_discard=-1 drops half of the recent tokens when the context length threshold (ctx_size) is reached
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, ctx_size=100, n_keep=4, n_discard=-1)

Next, we will add Streaming LLM to the MHA fusion pattern to further improve its performance. Stay tuned for updates. We encourage you to try Intel Extension for Transformers and run efficient LLM inference on Intel platforms!
