
Accelerating Qwen2 Models with Intel Extension for Transformers

High Performance WOQ INT4 Inference on Intel Xeon Processors

Intel(R) Neural Compressor
Jun 6, 2024


Bo Dong, Jun Lin, Zhenzhong Xu, Yu Luo, Wenhua Cheng, Hanwen Chang, and Haihao Shen, Intel Corporation

Qwen2 is the new series of Qwen large language models (LLMs). For Qwen2, Alibaba released a number of base and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a mixture-of-experts model. In this article, we use Intel Extension for Transformers to accelerate low-bit Qwen2 inference on Intel Xeon processors. First, we compress the models to 4 bits with Intel Extension for Transformers, applying the innovative AutoRound algorithm to Qwen2-1.5B and Qwen2-7B to generate INT4, weight-only quantized (WOQ) models.

Table 1 compares the BF16 and INT4 accuracy of the two models.

Table 1. Qwen2 1.5B/7B AutoRound BF16/INT4 accuracy comparison
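
As a rough way to reproduce such a comparison, the BF16 baseline and the INT4 checkpoint can each be scored with the lm-evaluation-harness; the model id and task choice below are illustrative assumptions, not the exact setup behind Table 1.

# pip install lm-eval
import lm_eval

# Score the BF16 baseline on one task; repeat with the INT4 checkpoint
# to fill in the other column of the comparison.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",  # illustrative model id
    tasks=["lambada_openai"],                                # illustrative task choice
)
print(results["results"]["lambada_openai"])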

You can access the INT4 models directly from Hugging Face.

The following sample code shows how to quantize the model and run inference using Neural Speed, the Intel Extension for Transformers backend:

# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

from auto_round import AutoRound

# 4-bit, group-size-128, symmetric weight-only quantization
bits, group_size, sym = 4, 128, True
device = "cpu"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym, device=device)
autoround.quantize()
output_dir = "./qwen2_1.5b_autoround"
autoround.save_quantized(output_dir)

The quantized model can then be loaded for INT4 inference through Neural Speed:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Load the AutoRound-quantized INT4 model
model_name = "./qwen2_1.5b_autoround"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In Qwen2, all instruction-tuned models were trained on 32K-token contexts and extrapolated to longer context lengths using techniques like YaRN or Dual Chunk Attention. Intel Extension for Transformers achieves less than 50 ms next-token latency at input lengths up to 32K tokens.
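
As a rough, back-of-the-envelope way to check next-token latency on a long input, one can time a generate() call on the INT4 model; the prompt construction and token counts below are illustrative and not the measurement methodology behind the numbers above.

import time
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "./qwen2_1.5b_autoround"  # INT4 model produced earlier
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Build a long prompt (on the order of 20K tokens) to exercise long-context decoding
long_prompt = "Once upon a time, a little girl lived in a quiet village. " * 1500
inputs = tokenizer(long_prompt, return_tensors="pt").input_ids

new_tokens = 32
start = time.time()
outputs = model.generate(inputs, max_new_tokens=new_tokens)
elapsed = time.time() - start

# Crude average over prompt processing plus decoding; steady-state next-token
# latency excludes the one-time prefill and will be lower.
print(f"~{elapsed / new_tokens * 1000:.1f} ms per generated token (rough average)")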

Intel Extension for Transformers provides several ways to minimize token latency while maintaining acceptable accuracy. The user-friendly Transformers-like API requires only minimal code changes. There are also more advanced features like streaming LLM, tensor parallelism, and low-bit compression (as low as 1 bit), so we encourage you to take a look at Intel Extension for Transformers.
