Highly-efficient LLM Inference on Intel Platforms

Leadership performance yet compatible with llama.cpp

Intel(R) Neural Compressor
Oct 20, 2023

Team Intel Extension for Transformers, Intel Corporation

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. We provide 4-bit weight-only quantization inference on Intel® Xeon® Scalable processors, especially the 4th Gen (code-named Sapphire Rapids).
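As a quick illustration, the sketch below follows the toolkit's Transformers-style Python API to load a model with INT4 weight-only quantization and generate text. The model name and the `load_in_4bit` argument are shown as typical usage and may differ across releases; treat this as a minimal sketch rather than the exact benchmark code.

```python
# Minimal sketch (not the benchmark code): INT4 weight-only quantization
# inference with the Transformers-style API of Intel Extension for Transformers.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"   # example checkpoint; any causal LM works
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit triggers INT4 weight-only quantization on the CPU backend.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```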

We conducted a performance comparison with llama.cpp on an Intel® Xeon® Platinum 8480+ system. System details: @3.8 GHz, 56 cores/socket, HT on, Turbo on, total memory 256 GB (16x16GB DDR5 4800 MT/s), BIOS 3A14.TEL2P1, microcode 0x2b0001b0, CentOS Stream 8.
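To make the measurement setup concrete, a rough sketch of how such a latency run can be reproduced is shown below: a fixed-length input, a fixed number of new tokens, and greedy search (beam=1). The model name and arguments are illustrative assumptions; this is not the exact benchmark harness behind the numbers reported here.

```python
# Illustrative latency sketch, matching the "input 32 / output 32 / beam=1"
# configuration reported below. Random token IDs are used only to fix the
# prompt length; the generated text itself is meaningless.
import time
import torch
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"          # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

input_ids = torch.randint(0, tokenizer.vocab_size, (1, 32))  # input size 32

start = time.time()
model.generate(input_ids, max_new_tokens=32, num_beams=1, do_sample=False)
elapsed = time.time() - start
print(f"32-in / 32-out greedy generation took {elapsed:.2f}s "
      f"({elapsed / 32 * 1000:.1f} ms per output token on average)")
```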

Here is the inference performance of input size 32, output size 32, beam=1:

Here is the inference performance of input size 1024, output size 32, beam=1:

The advantages of Intel-optimized llama.cpp extend to client CPUs as well, with enhanced support for AVX2 instructions. We conducted a performance comparison with llama.cpp on an Intel® Core™ i9-12900 system. System details: @2.4 GHz, 24 cores/socket, HT on, Turbo on, total memory 32 GB (4x8GB DDR5 4800 MT/s), BIOS ADLSFWI1.R00.2257.A01.2106221245, microcode 0x2e, Ubuntu 22.04.1 LTS. Here is the inference performance of input size 1024, output size 32, beam=1:

Up to 7.27x Performance Speedup on Client CPU

Note that llama.cpp is measured using the default code base. Please drop us a note if you see potential improvements with additional settings.

We encourage you to try Intel Extension for Transformers and run LLM inference efficiently on Intel platforms!

We have also validated INT4 accuracy with the lambada_openai, piqa, winogrande, and hellaswag datasets. We compute the average score across these tasks and compare it with FP32.
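A sketch of how such an accuracy check can be scripted with EleutherAI's lm-evaluation-harness is shown below. The model identifiers are placeholders, and task names and result keys vary slightly between harness versions, so treat this as an outline of the procedure rather than the exact validation code.

```python
# Hypothetical sketch: INT4 vs. FP32 average accuracy with lm-evaluation-harness.
# Model identifiers are placeholders; the result key ("acc" vs. "acc,none") and
# model type name ("hf" vs. "hf-causal") differ between harness versions.
from lm_eval import evaluator

TASKS = ["lambada_openai", "piqa", "winogrande", "hellaswag"]

def average_accuracy(model_args):
    results = evaluator.simple_evaluate(model="hf", model_args=model_args, tasks=TASKS)
    accs = [results["results"][task]["acc"] for task in TASKS]
    return sum(accs) / len(accs)

fp32_avg = average_accuracy("pretrained=meta-llama/Llama-2-7b-hf,dtype=float32")
int4_avg = average_accuracy("pretrained=<path-to-int4-model>")  # placeholder path
print(f"FP32 avg: {fp32_avg:.4f}  INT4 avg: {int4_avg:.4f}  "
      f"relative drop: {(fp32_avg - int4_avg) / fp32_avg:.2%}")
```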

Intel Extension for Transformers also supports fine-tuning on CPU. We measured fine-tuning time for both single-node and multi-node configurations on Xeon® Platinum 8480+, fine-tuning llama2-7b on the Alpaca dataset with 1, 2, and 4 nodes. As the table shows, fine-tuning time decreases as the number of nodes increases. Stock PT refers to official PyTorch, while PPN stands for processes per node.

llama2 fine-tuning on Platinum 8480+
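The sketch below outlines what a single fine-tuning process can look like with the Hugging Face Trainer: llama2-7b, the Alpaca dataset, and CPU-only training. In the multi-node runs, a launcher starts PPN such processes on each node and the Trainer picks up the distributed environment. The checkpoint name, dataset id, and argument names here are assumptions for illustration, not the exact recipe behind the reported timings.

```python
# Hypothetical sketch of one CPU fine-tuning process for llama2-7b on Alpaca.
# In multi-node runs, a launcher starts PPN copies of this script per node.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"            # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # assumed dataset id

def to_features(example):
    # Concatenate instruction, input, and output into one causal-LM sample.
    text = f"{example['instruction']}\n{example['input']}\n{example['output']}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="llama2-alpaca-cpu",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    use_cpu=True,        # older transformers releases use no_cuda=True instead
    ddp_backend="ccl",   # oneCCL backend is typical for multi-node CPU training
)
Trainer(model=model, args=args, train_dataset=tokenized).train()
```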
