Accelerating GGUF Models with Transformers

Improving Performance and Memory Usage on Intel Platforms

Intel(R) Neural Compressor
Intel Analytics Software
May 31, 2024


Bo Dong, Jun Lin, Zhentao Yu, Zhenzhong Xu, Yu Luo, Hanwen Chang, and Haihao Shen, Intel Corporation

GGUF (GPT-Generated Unified Format) is a new binary format that allows quick inspection of tensors and metadata within the file (Figure 1). It represents a substantial leap in language model file formats, optimizing the efficiency of storing and processing large language models (LLMs) like GPT. Converting a PyTorch model to the GGUF format is straightforward.

Figure 1. GGUF format (source: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gguf-spec.png)

HuggingFace Transformers recently added support for GGUF via a Transformers PR. Transformers dequantizes GGUF models to FP32 before running inference with PyTorch. It's easy to use: just pass the gguf_file parameter to from_pretrained, as shown in Figure 2.

Figure 2. Using GGUF models in HuggingFace Transformers
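For reference, here is a minimal sketch of what that looks like in code; the repository and GGUF file names below are illustrative examples, not taken from the original figure:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # example GGUF repository
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"    # example GGUF file within it

# Transformers dequantizes the GGUF weights to FP32 before running PyTorch inference.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)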

Intel Extension for Transformers accelerates low-bit LLM inference. It extends HuggingFace Transformers to enhance performance on Intel platforms and supports GGUF inference on a broad range of models. Its usage is equally straightforward (Figure 3).

Figure 3. Using GGUF models in Intel Extension for Transformers
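A rough sketch of the loading step follows; it assumes the model_file argument points at the GGUF file inside the repository, and the names are illustrative rather than taken from the original figure:

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Loading a GGUF model only requires swapping the import; the API mirrors Transformers.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",           # example GGUF repository
    model_file="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # example GGUF file within it
)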

Currently, Transformers only supports GGUF inference for Llama and Mistral, whereas Intel Extension for Transformers supports over 50 popular LLMs (Table 1):

Table 1. Supported LLMs

Setup is easy. First, install Neural Speed:

git clone https://github.com/intel/neural-speed.git
cd neural-speed
python setup.py install

Second, install Intel Extension for Transformers:

git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers
python setup.py install

Finally, install Transformers:

pip install transformers

The code snippet in Figure 4 shows how to generate text with just a few lines of code.

Figure 4. Example code to generate text
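A minimal end-to-end sketch along the same lines (the model and tokenizer names are illustrative, and the GGUF loading arguments follow the assumptions above):

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # example GGUF repository
tokenizer_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # original model, used for the tokenizer
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    model_file="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # example GGUF file within the repository
)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128)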

We conducted a performance comparison with Transformers on an Intel Core Ultra 5 125H system (CPU max frequency 4,400 MHz, min 400 MHz; 32 GB total memory, 8 x 4 GB LPDDR5 at 7,467 MT/s; Ubuntu, GCC 13.2.0-9ubuntu1). Our comparison was limited to TinyLlama-1.1B-Chat because of the limited memory on this system. Intel Extension for Transformers delivered lower latency and memory usage than Transformers (Table 2).

Table 2. Performance, memory usage, and accuracy comparisons

Intel Extension for Transformers is a great way to minimize token latency on memory-constrained systems while requiring only small code changes. Its kernels support quantization from 1 to 8 bits, and it offers more advanced features like StreamingLLM and tensor parallelism, so we encourage you to take a look at Intel Extension for Transformers.
