Run LLMs on Intel GPUs Using llama.cpp

Taking Advantage of the New SYCL Backend

Intel(R) Neural Compressor
Intel Analytics Software
4 min read · Mar 22, 2024


Zhang Jianyu, Meng Hengyu, Hu Ying, Luo Yu, Duan Xiaoping, and Majumder Abhilash, Intel Corporation

The open-source project llama.cpp is a lightweight LLM framework that is gaining popularity. Its high performance and customizability have turned the project into a thriving and dynamic community of developers, researchers, and hobbyists. Approximately one year since launch, the GitHub project has more than 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks. Thanks to recent code merges, llama.cpp now supports more hardware, including Intel GPUs across server and consumer products. Intel's GPUs join existing hardware support for CPUs (x86 and ARM) and GPUs from other vendors.

The original implementation was created by Georgi Gerganov. The project is mainly for educational purposes and serves as the main playground for developing new features for the ggml library, a tensor library for machine learning. With the recent updates, Intel is bringing AI everywhere to more users by enabling inference on far more devices. Llama.cpp is fast because it's written in plain C/C++, and it has several other attractive features:

  • 16-bit float support
  • Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
  • No third-party dependencies
  • Zero memory allocations during runtime
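
As one example of the quantization support, llama.cpp ships a quantize tool that converts a full-precision GGUF model into one of these integer formats. A typical invocation looks roughly like this (file names are placeholders):

# Quantize an FP16 GGUF model to 4-bit (Q4_0); input and output names are illustrative.
./quantize models/llama-2-7b.f16.gguf models/llama-2-7b.Q4_0.gguf Q4_0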

SYCL Backend for Intel GPUs

ggml has several backends that support and optimize for different hardware. We chose SYCL (a direct programming language) and oneMKL (a high-performance BLAS library) from oneAPI to develop the SYCL backend because SYCL supports GPUs from different vendors. SYCL is a programming model that improves productivity on hardware accelerators. It is a single-source, embedded, domain-specific language based on pure C++17.

The SYCL backend supports all Intel GPUs. We have verified with:

  • Intel Data Center GPU Max and Flex Series
  • Intel Arc Discrete GPU
  • Built-in Intel Arc GPU in Intel Core Ultra CPU
  • iGPU in Intel 11th, 12th, and 13th Gen Core CPUs

With llama.cpp now supporting Intel GPUs, millions of consumer devices are capable of running inference on Llama. Compared to the OpenCL (CLBlast) backend, the SYCL backend delivers a significant performance improvement on Intel GPUs. It also supports more devices, like CPUs, and will support other processors with AI accelerators in the future. Please refer to the guide, llama.cpp for SYCL, to learn how to use the SYCL backend.

Run LLM on Intel GPU Using the SYCL Backend

A detailed guide is available in llama.cpp for SYCL. It can run on all Intel GPUs supported by SYCL and oneAPI. Server and cloud users can run on Intel Data Center GPU Max and Flex Series GPUs. Client users can try it out on their Intel Arc GPU or on the iGPU of an Intel Core CPU. We have tested the iGPUs of 11th Gen Core and newer; older iGPUs may work, but with poor performance.

The only limitation is memory. The iGPU uses host shared memory, while the dGPU uses its own dedicated memory. For the llama-2-7b Q4 model, we recommend an iGPU with 80 or more EUs (11th Gen Core and newer) and more than 4.5 GB of shared memory: a 7B-parameter model at roughly 4 bits per weight needs about 3.5-4 GB for the weights alone, plus room for the KV cache and activations. In practice this means 16 GB or more of total host memory, since up to half of it can be allocated to the iGPU.

Install the Intel GPU Driver

Both Linux and Windows (WSL2) are supported. For Linux, we recommend Ubuntu 22.04, which was used for development and testing.

Linux:

sudo usermod -aG render username
sudo usermod -aG video username
sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

or

Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
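
Note that the commands above only add your user to the GPU device groups and install the clinfo utility; the GPU compute runtime itself comes from Intel's graphics package repository. As a rough sketch (the package names below are assumptions and may vary by release; follow Intel's GPU driver documentation for the authoritative steps):

# Assumes Intel's graphics repository has already been added on Ubuntu 22.04.
# Package names are illustrative and may differ between releases.
sudo apt install intel-opencl-icd intel-level-zero-gpu level-zero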

Windows: Install Intel GPU Drivers.

Enable the oneAPI Runtime

First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:

  • Linux: source /opt/intel/oneapi/setvars.sh
  • Windows: "C:\Program Files (x86)\Intel\oneAPI\setvars.bat\" intel64

Run sycl-ls to confirm that there are one or more Level Zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
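
For reference, sycl-ls output looks roughly like the following (device names, IDs, and driver versions will differ on your system; these lines are illustrative):

[opencl:cpu:0] Intel(R) OpenCL, Intel(R) Core(TM) CPU ...
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics ...
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics ...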

Build by one-click:

  • Linux: ./examples/sycl/build.sh
  • Windows: examples\sycl\win-build-sycl.bat

Note that the scripts above include the command to enable the oneAPI runtime.
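
If you prefer to build by hand, the one-click script roughly boils down to the following on Linux (the flag and compiler names follow the llama.cpp SYCL guide at the time of writing and may change in newer versions; add -DLLAMA_SYCL_F16=ON to enable FP16 if desired):

source /opt/intel/oneapi/setvars.sh
mkdir -p build && cd build
# Configure with the SYCL backend and the oneAPI compilers (icx/icpx).
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release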

Run an Example by One-Click

Download llama-2-7b.Q4_0.gguf and save it to the models folder (a download sketch follows the list below), then run:

  • Linux: ./examples/sycl/run-llama2.sh
  • Windows: examples\sycl\win-run-llama2.bat
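
If you do not already have the model file, it can be fetched from Hugging Face. The URL below points to TheBloke's GGUF conversion and is given as an illustration; mirrors and file names may change:

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf -P models/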

Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, please change the device ID in the script. To list the device ID:

  • Linux: ./build/bin/ls-sycl-device or ./build/bin/main
  • Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
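
For reference, the run scripts essentially call the main binary with the model path and a GPU offload setting. A minimal manual equivalent on Linux might look like the sketch below; the prompt, token count, and layer count are only examples (-ngl 33 offloads all layers of a 7B model to the GPU):

./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 400 -e -ngl 33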

Summary

The SYCL backend in llama.cpp brings all Intel GPUs to LLM developers and users. Check whether your laptop has an Intel iGPU, your gaming PC has an Intel Arc GPU, or your cloud VM has Intel Data Center GPU Max or Flex Series GPUs. If so, enjoy the magic of running LLMs on Intel GPUs with llama.cpp. We welcome developers to try the SYCL backend and contribute more features and optimizations for Intel GPUs. It's also a good project for learning Intel oneAPI and cross-platform development.
