BigDL-LLM: Easily Optimize Your Large Language Model on Intel® Arc™ GPUs

Published in Intel Tech · Oct 24, 2023

Take advantage of open source BigDL-LLM’s recent support for GPUs


Authors: Guoqiong Song and Yang Wang

We’re entering a new era of artificial intelligence (AI) driven by the emergence of large language models (LLMs), which play an increasingly important role in powerful applications such as customer service, virtual assistants, content creation, and programming assistance. Yet as LLMs continue to grow in size, inference becomes noticeably slower, and the demand for efficient acceleration has never been more pressing.

To address this, Intel recently released BigDL-LLM*: an open source, low-bit optimization and acceleration library for LLMs on Intel® XPUs (from laptops to GPUs to the cloud). It’s designed to help AI developers and researchers accelerate performance and improve the user experience of LLMs on Intel® platforms. This post shows you how to enable BigDL-LLM on an Intel® Arc™ GPU and provides a short demo showcasing the real-time performance of a LLaMa 2* LLM accelerated by BigDL-LLM, running on a server equipped with an Intel® Arc™ A770 GPU.

BigDL-LLM overview

BigDL-LLM is part of the open source BigDL Project*, released under the Apache* 2.0 license. Optimized for Intel® platforms, BigDL-LLM is designed to run LLMs with low-bit optimizations (INT3/NF3/INT4/NF4/INT5/INT8) at very low latency, and it is built on top of technologies such as llama.cpp, GPTQ, bitsandbytes, and QLoRA. Leveraging the acceleration technologies built into Intel® hardware and integrating the latest software optimizations, BigDL-LLM enables LLMs to run more efficiently and faster on Intel® platforms.

With BigDL-LLM, users can build and run LLM applications for both inference and fine-tuning using standard PyTorch APIs (e.g., Hugging Face Transformers and LangChain). Meanwhile, a wide range of models (such as LLaMA/LLaMA 2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly/Dolly-v2, Bloom, StarCoder, Whisper, InternLM, Baichuan, QWen, MOSS, etc.) has already been verified and optimized on BigDL-LLM.

The Intel BigDL team recently expanded LLM support to Intel® Arc™ Graphics, the Intel® Data Center GPU Flex Series, and the Intel® Data Center GPU Max Series. Mirroring the seamless functionality of the BigDL-LLM APIs on CPUs, this new development supports models built on standard PyTorch APIs on Intel® GPUs with a single line of code change.

To help developers get started quickly, BigDL-LLM on GPUs provides accelerated examples of commonly used LLMs; read more in the GitHub README and the official documentation.

Demo setup: How to enable BigDL-LLM on an Intel® Arc™ GPU

Enabling BigDL-LLM on Intel® Arc™ graphics is a crucial step in unlocking their full potential. To do so, you’ll need the following:

  1. Choose an Intel® Arc™ GPU: Ensure your server is equipped with an Intel® Arc™ GPU, such as the Intel® Arc™ A770 GPU, or request one from the Intel® Developer Cloud. These GPUs are designed to handle complex AI workloads and are ideal for accelerating LLMs.
  2. Prepare your environment: Review the recommended requirements, install the Intel® oneAPI Base Toolkit, and configure the oneAPI environment variables as well as the other required environment variables.
  3. Install the BigDL-LLM library: BigDL-LLM can be easily installed by executing the following command (a quick sanity check to verify the installation follows this list):
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
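
After installation, you can verify that PyTorch sees the Arc GPU. Below is a minimal sketch, assuming the intel-extension-for-pytorch package pulled in by bigdl-llm[xpu] is present:

# Sanity check: confirm PyTorch can see the Arc GPU ("xpu" device).
# Assumes intel-extension-for-pytorch was installed along with bigdl-llm[xpu].
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

print(torch.xpu.is_available())      # should print True
print(torch.xpu.get_device_name(0))  # e.g., the A770 device name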

Accelerating large models with BigDL-LLM on a GPU is as straightforward as using it on an Intel® laptop; read more about it here, or see an example of a LLaMa 2 model. The BigDL-LLM Transformers-style API only changes how the model is loaded; the subsequent usage is identical to the native Transformers API. To load a model with the BigDL-LLM API, users only need to change the import statement and set load_in_4bit=True in the from_pretrained call. You can also use the load_in_low_bit parameter to select other low-bit types, as shown in the second snippet below.

# Load a Hugging Face Transformers model with INT4 optimizations
# and move it to the Intel GPU ("xpu" device) for inference.
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True).to("xpu")
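
Other precisions are selected the same way via load_in_low_bit. Below is a minimal sketch, assuming NF4 is among the supported low-bit strings (the full list is in the BigDL-LLM documentation):

# Illustrative: select another low-bit format via load_in_low_bit
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="nf4").to("xpu")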

During model loading, BigDL-LLM converts the model into the requested low-bit precision (4-bit in the first example) and optimizes its execution with various software and hardware acceleration techniques in the subsequent inference process.

# Run generation on the GPU, then move the output tensor back to the CPU
output = model.generate(input_ids.to("xpu"))
output = output.cpu()
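
Putting it all together, here is a minimal, illustrative end-to-end script; the model path, prompt, and generation parameters are placeholders:

# End-to-end sketch: tokenize on the CPU, generate on the Arc GPU,
# then move the output back to the CPU for decoding.
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained('/path/to/model/')

input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))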

Demo: Running a LLaMA 2 model on an Intel® Arc™ GPU

The following image shows inference on a 13-billion-parameter LLaMa 2 model running on a server equipped with an Intel® Arc™ A770 GPU. BigDL-LLM substantially accelerates inference tasks, making them more responsive and efficient. Watch the full example here.

BigDL-LLM provides substantial speedups to a LLaMa 2 model

Get started

BigDL-LLM unlocks the full potential of Intel® Arc™ GPUs, accelerating your LLM workloads and opening the door to exciting possibilities in the world of AI. Head to our GitHub repo to enable BigDL-LLM and stay up to date on the latest developments as we continue to explore the cutting-edge technologies shaping the future of computing.

· Visit the GitHub repo

About the authors

Guoqiong Song: An AI Frameworks Engineer at Intel who works on AI and big data analytics and builds end-to-end AI solutions.

Yang Wang: An AI Frameworks Engineer at Intel, specializing in AI and Big Data systems for heterogeneous platforms within the AI Software Engineering department.

For more open source content from Intel, check out open.intel
