Simplify LLM Inference on Your Laptop with BigDL-LLM and LLaMa

Published in Intel Tech · 5 min read · Aug 9, 2023

In just three steps with a new SDK.


Authors: Ezequiel Lanza, Guoqiong Song

The remarkable scale of large language models (LLMs), now exceeding 100 billion parameters, has transformed human-language applications. Nonetheless, AI developers and researchers often encounter obstacles stemming from the massive size and latency of these models, challenges that can hamper collaboration and hinder progress in building robust applications.

To tackle this issue, Intel recently introduced BigDL-LLM. It combines a number of optimizations that enable large language models to run on laptops, making them more accessible to developers.

In this post, we’ll show you how to use BigDL-LLM to run inference on an LLM in your environment in three simple steps, using the LLaMa model as an example.

LLaMa

Meta* launched LLaMa in February 2023. The release introduced a language model with an impressive sixty-five billion parameters. Meta also offered reduced versions of LLaMa, featuring seven billion, thirteen billion, and thirty-three billion parameters, that can be downloaded and used for research purposes. Like many other language models, LLaMa operates auto-regressively and harnesses the power of the transformer architecture (see this article for more details).

The key benefit of LLaMa lies in its optimized transformer architecture, which improves performance when the model is deployed on CPU devices (for details, see this research paper). This optimization allows for efficient use of computational resources. To shrink the model even further, additional optimizations like quantization can be applied: quantization reduces the model’s size with minimal impact on accuracy.
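To see why this matters on a laptop, here’s a quick back-of-the-envelope calculation (our own illustration, not from the BigDL-LLM documentation) of the memory needed just to hold the weights of a seven-billion-parameter model at different precisions:

# Rough weight-only memory footprint of a 7B-parameter model
# (ignores activations, KV cache, and runtime overhead)
params = 7_000_000_000
bytes_per_param = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision}: ~{params * nbytes / 1024**3:.1f} GiB")

At INT4, the weights shrink to roughly 3.3 GiB, small enough to fit comfortably in the RAM of a typical laptop.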

BigDL-LLM

BigDL-LLM, recently open sourced by Intel, is a software development kit (SDK) focused on running large language models (LLMs) on Intel XPUs. Using BigDL-LLM and INT4 support on compatible Intel XPUs, developers can optimize their LLMs for edge devices, enabling efficient execution, better memory utilization, and improved computational performance.

(It builds on the foundational work of llama.cpp, gptq, ggml, llama-cpp-python, gptq_for_llama, bitsandbytes, redpajama.cpp, gptneox.cpp, bloomz.cpp, and others.)

Users can employ BigDL-LLM to:

  • Convert models to lower precision (INT4).
  • Use transformers-like APIs to run model inference.
  • Integrate the model with a LangChain pipeline.

BigDL-LLM currently supports:

  • Precision: INT4
  • Model Family: llama, gptneox, bloom, starcoder
  • Platform: Ubuntu* 20.04 or later, CentOS* 7 or later, Windows* 10/11
  • Device: Intel CPU
  • Python*: 3.9 (recommended) or later

Install and Run

There are three standard steps for using the library:

  1. Download the model from the Hugging Face* Hub.
  2. Convert the model from Hugging Face format to GGML format.
  3. Run inference using llm-cli, the transformers-like API, or LangChain.

Before running the first step, you need to install the library. Because the model conversion procedure relies on some third-party libraries, add the [all] option to the installation to prepare the environment:

pip install --pre --upgrade bigdl-llm[all]
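To verify the installation succeeded, a quick import check works (our own sanity check, not a step from the original instructions):

# Should run without errors after installation
from bigdl.llm import llm_convert
print("bigdl-llm installed successfully")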

If you’re familiar with Hugging Face, run the code below to download the desired model. Here we’ll be using the seven-billion-parameter version, LLaMA 7B.

from transformers import AutoModelForCausalLM

model_name = "decapoda-research/llama-7b-hf"  # example model name or identifier
output_folder = "/path/to/Llama-7b"

# Pull the pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_name)

# Save the model to the specified folder
model.save_pretrained(output_folder)
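Depending on your setup, you may also want to save the tokenizer next to the weights so the model folder is self-contained; this optional step is our addition, not part of the original walkthrough:

from transformers import AutoTokenizer

# Download and save the matching tokenizer alongside the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_folder)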

The next step is to convert the downloaded model to an optimized version. For conversion, BigDL-LLM offers a transformers-like API.

Note: This BigDL-optimized version is currently only available for the llama/bloom/gptneox/starcoder model families. For other model families, use the native INT4 version provided by Hugging Face.

from bigdl.llm import llm_convert

# Convert the downloaded pth-format model to an INT4 GGML binary
llm_convert(model="/path/to/llama-7b-hf/",
            outfile="/path/to/llama-7b-int4/",
            model_format="pth",
            model_family="llama")

After conversion, you’ll have a .bin file in the folder you selected as “outfile”, where the INT4 model is stored. Now run the inference. There are multiple options:

  1. BigDL-LLM command line
llm-cli -t 16 -x llama -m "/path/to/llama-7b-int4/bigdl_llm_llama_q4_0.bin" -p 'Once upon a time,'

Here -t sets the number of threads, -x the model family, -m the path to the converted model, and -p the prompt.

2. Transformers-like API

# Load the converted model
from bigdl.llm.transformers import BigdlForCausalLM

llm = BigdlForCausalLM.from_pretrained("/path/to/llama-7b-int4/")

prompt = "Once upon a time,"

# Run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, max_length=50)
output = llm.batch_decode(output_ids)
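Since batch_decode returns the decoded text (a list of strings, assuming it mirrors the Hugging Face API), a final print displays the generation:

# Display the generated text
print(output)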

3. LangChain API

llama-cpp-python has become a popular Python binding for the llama.cpp program. bigdl-llm adopts this API, familiar to many users, and extends it to encompass other model families such as gptneox and bloom.

from bigdl.llm.langchain.llms import TransformersLLM
from bigdl.llm.langchain.embeddings import TransformersEmbeddings
from langchain.chains.question_answering import load_qa_chain

embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
bigdl_llm = TransformersLLM.from_model_id(model_id=model_path, …)
doc_chain = load_qa_chain(bigdl_llm, …)
output = doc_chain.run(…)
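To make the elided arguments concrete, here’s a minimal sketch of one way to invoke such a chain; the document contents and question below are our own illustrative examples, not from the original:

from langchain.docstore.document import Document

# An illustrative in-memory document; in practice, load your own corpus
docs = [Document(page_content="BigDL-LLM lets you run INT4 LLMs on Intel laptops.")]

# Ask a question over the documents using the BigDL-backed chain
answer = doc_chain.run(input_documents=docs, question="What does BigDL-LLM do?")
print(answer)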

You’ll get the generated text, along with the time the execution took.
Here’s an example of text generated by the model with a “once upon a time” prompt:

Once upon a time, I got to be a part of a very cool event for the release of a book called “The Little Engine That Could.” It was actually one of my first jobs ever in television and it was a lot of fun.

We went to the railroad museum in El Monte, CA where they have an old engine (pictured above) that has been used as their mascot at the museum for years. I got to interview some of the people who made the museum, and even rode on top of the little red engine that could from a real caboose! It was fun, goofy, and definitely memorable.

Learn More

Intel hopes to help AI experts work smarter with LLMs by offering optimizations and tools through BigDL-LLM. Developers can now collaborate, experiment, and innovate more easily with this open source project and overcome the challenges posed by the size and latency of LLMs. Plus, it lets them run LLMs on their personal laptops instead of on expensive, specialized hardware.

BigDL-LLM represents a substantial leap in addressing the obstacles encountered by AI developers and researchers. This breakthrough boosts collaboration and paves the way for wider use of large language models across diverse applications.

Ready to do more? Check out the demos, tutorials, and documentation on the BigDL-LLM GitHub repo.

Intel Developer Cloud

You can also use the Intel Developer Cloud, where developers anywhere in the world can test their software examples and models, including testing before moving into production, with access to the latest Intel CPUs, GPUs, FPGAs, and software.

Go to Intel Developer Cloud to learn more and sign up.

For more open source content from Intel, check out open.intel and follow us on Twitter.

About the Authors

Ezequiel Lanza, Open Source Evangelist at Intel. Passionate about helping people discover the exciting world of artificial intelligence, he’s a frequent AI conference presenter and the creator of use cases, tutorials, and guides that help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza

Guoqiong Song is a Senior Software Engineer at Intel Corporation specializing in machine learning, artificial intelligence, numerical modeling, data science, and big data.
