Reduce LLM Footprint with OpenVINO™ Toolkit Weight Compression

Published in OpenVINO™ toolkit · Jul 2, 2024

Large language models (LLMs) enable conversational AI, giving rise to powerful chatbots and personal assistants with the potential to boost worker productivity. However, LLMs are massive, often comprising tens to hundreds of billions of parameters, and they’re only getting larger. To help solve this challenge, weight compression using the OpenVINO™ toolkit can reduce the data storage and memory/GPU video random-access memory (vRAM) footprint of an LLM to as little as one-eighth the original size. The result is leaner models that can run in more environments, with lower latency and less strain on system resources.

In this post, we’ll explore two weight compression methods, one for Hugging Face LLMs using the Optimum Intel API and another using the OpenVINO Neural Network Compression Framework (NNCF). This post is excerpted and abridged from an extensive white paper on using OpenVINO for LLMs, which you can read in full here.

The size of LLMs

Addressing the size challenge will be key to making AI simpler to deploy and more practical for everyday use, including deployment to client devices. The total file size of an LLM can range from 2 GB for small models to over 300 GB for large models. A 70-billion-parameter model like Llama 2-70B takes around 140 GB of storage space at FP16 precision (roughly two bytes per parameter). Generally, systems running LLMs should have at least as much random-access memory (RAM) as the size of the model file, so the system can load the full model into memory.

With insufficient RAM, the system may resort to using disk storage as swap space for memory, and inference will run slowly, or the system may crash when memory is exhausted. The system GPU should also have as much vRAM as the model size, with additional memory padding to run other operating system (OS) tasks. Naturally, these high hardware requirements create a high barrier to entry for deploying LLMs, requiring powerful servers and a significant investment in memory and GPUs.

Weight compression reduces model size

The good news is it’s possible to dramatically reduce these requirements using weight compression, a key function of OpenVINO. Compressing a model’s weights from FP32 to int8 reduces the model size to one-fourth of the original, and converting the model to int4 format reduces the model size to one-eighth the original. For example, the Zephyr 7B beta model in FP32 format is 28 GB, but compressing the model to int4 using OpenVINO™ NNCF reduces the size to 4 GB while maintaining similar accuracy.
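
These ratios follow directly from the number of bytes used per weight. A quick back-of-the-envelope sketch (the sizes are approximations; real model files also carry metadata, and some layers may stay at higher precision):

# Rough model-size math: parameter count x bytes per weight.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def approx_size_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9

for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"7B  @ {precision}: ~{approx_size_gb(7e9, precision):.1f} GB")
    print(f"70B @ {precision}: ~{approx_size_gb(70e9, precision):.1f} GB")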

By reducing the model size, we also reduce the amount of RAM and GPU vRAM required to run the model. A client device with a mid-to-high-end processor and 16 GB of RAM can consistently run OpenVINO-quantized 7B models such as Llama-7B or Zephyr 7B beta. This can be helpful to developers and providers who want to offer increasingly complex and capable LLMs but are mindful of end users’ hardware limitations.

What is OpenVINO?

Weight compression is just one feature enabled by OpenVINO, an open source AI toolkit that makes it easier to “write once, deploy anywhere” for AI models. For LLMs, OpenVINO provides a flexible and efficient runtime environment with advantages in deployment size, speed, and the ability to run on a variety of hardware. It is also officially supported by Intel, so users will continue to benefit from updates and optimizations over time.

OpenVINO is also a self-contained, lean package with only hundreds of megabytes of dependencies, compared to the several gigabytes of dependencies required for Hugging Face, PyTorch, and other machine learning frameworks. Read the white paper to learn more about performing LLM inference using OpenVINO.

The benefits of weight compression with OpenVINO

Unlike full-model quantization, where both weights and activations are quantized, weight compression using OpenVINO NNCF targets only the model’s weights. With this approach, activations remain floating-point numbers, which helps preserve model accuracy. Developers can also skip calibrating the range of activation values, so it is easier and more efficient than full quantization.

Weight compression data types

OpenVINO NNCF supports three types of weight compression: 8-bit (int8), 4-bit symmetric (int4_SYM), and 4-bit asymmetric (int4_ASYM). Each type has its advantages and drawbacks:

  • int8, 8-bit weight quantization: This default compression method quantizes weights to an 8-bit integer data type, which balances model size reduction and accuracy, making it a versatile option for a broad range of applications.
  • int4_SYM, 4-bit symmetric weight quantization: int4 symmetric mode quantizes weights to an unsigned 4-bit integer symmetrically around a fixed zero point of eight (i.e., the midpoint between 0 and 15). Inference is faster than with int8 precision, making this mode a good fit when speed matters more than a small loss in accuracy; the sketch after this list illustrates the fixed zero point.
  • int4_ASYM, 4-bit asymmetric weight quantization: int4 asymmetric mode also quantizes weights to unsigned 4-bit integers, but does so asymmetrically, with a zero point derived from the data rather than fixed. This mode gives up a little speed in exchange for better accuracy than the symmetric mode, while still running faster than int8.
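
To make the difference between the two int4 modes concrete, here is a minimal NumPy sketch. It is an illustration only, not NNCF’s actual implementation: it quantizes a handful of made-up weights once with a fixed zero point of eight and once with a zero point derived from the data.

import numpy as np

# Toy weights standing in for one group of a layer's weights (values invented).
w = np.array([-0.62, -0.31, 0.0, 0.05, 0.4, 0.9], dtype=np.float32)

# Symmetric int4: fixed zero point of 8, scale chosen from the largest magnitude.
scale_sym = np.abs(w).max() / 7                    # levels 0..15 centered on 8
q_sym = np.clip(np.round(w / scale_sym) + 8, 0, 15)
deq_sym = (q_sym - 8) * scale_sym

# Asymmetric int4: scale and zero point derived from the group's min and max.
scale_asym = (w.max() - w.min()) / 15
zero_point = np.round(-w.min() / scale_asym)
q_asym = np.clip(np.round(w / scale_asym) + zero_point, 0, 15)
deq_asym = (q_asym - zero_point) * scale_asym

print("max error, symmetric :", np.abs(w - deq_sym).max())
print("max error, asymmetric:", np.abs(w - deq_asym).max())

Because the asymmetric zero point tracks the actual range of the weights, its reconstruction error is typically lower, at the cost of storing an extra zero point per group.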

Performing weight compression with OpenVINO

There are two primary methods of performing weight compression with OpenVINO:

  • Use the Optimum Intel API with a Hugging Face LLM
  • Use the command-line interface (CLI) to convert the Hugging Face model to the OpenVINO intermediate representation (IR) format and then perform weight compression using OpenVINO NNCF

For both methods, developers should perform weight compression offline rather than in a real-time application. The LLM can be compressed and exported in a development environment and then used in a deployment environment.

Requirements

To get started, set up a Python virtual environment for OpenVINO by following the OpenVINO installation instructions. Once the environment is created and activated, install Optimum Intel, OpenVINO, NNCF, and their dependencies by issuing:

pip install optimum[openvino]

Weight compression on a Hugging Face model with Optimum Intel

The example below shows how to perform weight compression on a model from Hugging Face. In the example, a Zephyr 7B beta model is loaded from Hugging Face using Optimum Intel. When the model is loaded, it is automatically compressed to the specified compression type using NNCF.

The compression type is specified in the OVModelForCausalLM.from_pretrained method using the compression_option="<option>" argument. When this option is not specified, int8 weight compression is enabled by default for decoder models with more than 1 billion parameters. The argument accepts any of these options:

  • "int8": int8 compression using NNCF
  • "int4_sym_g128": symmetric int4 compression with a group size of 128
  • "int4_asym_g128": asymmetric int4 compression with a group size of 128
  • "int4_sym_g64": symmetric int4 compression with a group size of 64
  • "int4_asym_g64": asymmetric int4 compression with a group size of 64

For more information on group size, see the weight compression page in the OpenVINO documentation. In this example, compression_option="int8" indicates that int8 weight compression should be performed.

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load model from Hugging Face and compress its weights to int8 on export
model_id = "HuggingFaceH4/zephyr-7b-beta"
model = OVModelForCausalLM.from_pretrained(model_id, export=True, compression_option="int8")

# Inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
phrase = "The weather is"
results = pipe(phrase)
print(results)

# Save compressed model and tokenizer for later use
model.save_pretrained("zephyr-7b-beta-int8-sym-ov")
tokenizer.save_pretrained("zephyr-7b-beta-int8-sym-ov")

Note that at the end of the example, the compressed model and its tokenizer are saved so they can be imported for use in a future session. This avoids recompressing the model every time it is used in a new session.
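
For reference, a later session can load the compressed model straight from the saved directory, skipping both export and compression. A minimal sketch, assuming the files were saved under zephyr-7b-beta-int8-sym-ov as in the example above:

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load the previously compressed OpenVINO model and tokenizer from disk
model = OVModelForCausalLM.from_pretrained("zephyr-7b-beta-int8-sym-ov")
tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta-int8-sym-ov")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("The weather is"))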

Converting a Hugging Face model to OpenVINO IR format to use with NNCF

Before we can run NNCF, we must convert the Hugging Face model to IR format using the optimum-cli tool, which lets us convert models without writing a Python script.

The command to perform this conversion is structured as follows:

optimum-cli export openvino --model <MODEL_NAME> <NEW_MODEL_NAME>

--model <MODEL_NAME>: This part of the command specifies the name of the model to be converted. Replace <MODEL_NAME> with the actual model name from Hugging Face.

<NEW_MODEL_NAME>: Here, you specify the name you want to give to the new model in the OpenVINO IR format. Replace <NEW_MODEL_NAME> with your desired name.

For example, to convert the Llama 2-7B model from Hugging Face (whose full name is meta-llama/Llama-2-7b-chat-hf) to an OpenVINO IR model and name it “ov_llama_2”, use the following command:

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf ov_llama_2

In this example, meta-llama/Llama-2-7b-chat-hf is the Hugging Face model name, and ov_llama_2 is the new name for the converted OpenVINO IR model.

Additionally, when exporting your model with the CLI, you can specify the --weight-format argument to apply 8-bit or 4-bit weight quantization. Here is an example command applying 8-bit quantization to the model gpt2:

optimum-cli export openvino --model gpt2 --weight-format int8 ov_gpt2_model
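
A 4-bit export works the same way. A hedged example (the output directory name ov_gpt2_model_int4 is arbitrary, and the exact set of accepted --weight-format values depends on your Optimum Intel version):

optimum-cli export openvino --model gpt2 --weight-format int4 ov_gpt2_model_int4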

Performing weight compression with NNCF

Now that we have the OpenVINO IR model, we can run NNCF. In the example, a model in OpenVINO IR format is read using the read_model method of an ov.Core instance. The nncf.compress_weights method then quantizes the model weights to the specified data type.

The compression type used by the nncf.compress_weights method is set using the mode argument, which accepts one of three options:

  • mode=CompressWeightsMode.INT8
  • mode=CompressWeightsMode.INT4_SYM
  • mode=CompressWeightsMode.INT4_ASYM

In this example, the INT4_SYM mode is used to apply 4-bit symmetric quantization.

from nncf import compress_weights, CompressWeightsMode
import openvino as ov

# Read an OpenVINO IR model
core = ov.Core()
model = core.read_model("model.xml")

# Compress weights to 4-bit symmetric (INT4_SYM)
model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)

# Save compressed model for later use
ov.save_model(model, "model-int4-sym.xml")

NNCF also allows for configuring the group_size and ratio compression parameters, which can be used to tweak the size and inference speed of the compressed model. For more information, see the weight compression page in OpenVINO documentation.
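
As a rough sketch of how those parameters are passed (the group_size and ratio values below are arbitrary examples, not recommendations):

from nncf import compress_weights, CompressWeightsMode
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")

# group_size: number of weights that share one quantization scale.
# ratio: fraction of layers compressed to int4; the remainder stays at int8.
model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    group_size=64,  # example value
    ratio=0.8,      # example value
)

ov.save_model(model, "model-int4-sym-g64.xml")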

Benchmarking LLMs after compression

After weight compression, you may want to benchmark your LLMs to see the impact compression has on performance. The OpenVINO GenAI repository contains an LLM benchmarking tool, which provides a unified approach to estimating performance for LLMs based on pipelines provided by Optimum Intel. You can use this tool to estimate performance for PyTorch and OpenVINO models by following these directions:

Install benchmarking dependencies using requirements.txt

pip install -r requirements.txt

Note: You can specify which OpenVINO version to install through pip install.

# e.g. 
pip install openvino==2024.0.0

Use these commands to test the performance of an LLM:

python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>

# e.g.
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2

Command parameters:

  • -m ‒ model path
  • -d ‒ inference device (default=cpu)
  • -r ‒ report csv
  • -f ‒ framework (default=ov)
  • -p ‒ interactive prompt text
  • -pf ‒ path of JSONL file, including interactive prompts
  • -n ‒ number of benchmarking iterations; if the value is greater than 0, the first iteration is excluded (default=0)

python ./benchmark.py -h  # for more information

Measuring LLM accuracy

The lm-evaluation-harness tool is a third-party test harness for measuring LLM accuracy. It recently added support for OpenVINO. Visit the repository for more information on how to use it to measure model accuracy.

Enabling leaner LLMs to fit hardware constraints or accommodate more parameters

As LLM developers and providers look for ways to both simplify deployment and increase their model accuracy with more data and more parameters, weight compression can be a useful tool to support these goals while keeping model size and memory footprint in check. OpenVINO offers many other benefits to LLM inference, including speed, flexibility, and hardware support. Continue the journey by trying OpenVINO for yourself.

Author attribution

This post is based on the solution white paper, “Optimizing Large Language Models with the OpenVINO™ toolkit,” by Ria Cheruvu, Intel AI evangelist, and Ryan Loney, Intel OpenVINO product manager. Additional credits: Ekaterina Aidova, Alexander Kozlov, Helena Kloosterman, Artur Paniukov, Dariusz Trawinski, Ilya Lavrenov, Nico Galoppo, Jan Iwaszkiewicz, Sergey Lyalin, Adrian Tobiszewski, Jason Burris, Ansley Dunn, Michael Hansen, Raymond Lo, Yury Gorbachev, Adam Tumialis, and Milosz Zeglarski.

Notices and disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel® technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
