Deployment of Llama3 on Your AI PC with OpenVINO™

OpenVINO™ toolkit · Apr 24, 2024

The development speed of large language models (LLMs) is astonishing. On April 18, 2024, Meta officially announced the new generation of its Llama series, Llama3, setting a new milestone in the field. Llama3 not only inherits the powerful capabilities of previous models but also makes qualitative leaps in multimodal understanding, long-text processing, and language generation through technological innovations. The openness and flexibility of Llama3 also provide developers with unprecedented convenience: whether fine-tuning the model or integrating it into existing systems, Llama3 shows great adaptability and ease of use.

Moreover, when it comes to deploying the Llama3 model, aside from cloud deployment, local deployment of the model enables developers to achieve high efficiency and privacy in data processing and large model inferencing without relying on cloud computing resources. Deploying Llama3 locally using OpenVINO™, such as on an AI PC, not only means faster response time and lower operational costs but also effectively protects data security and prevents the leakage of sensitive information.

This article briefly introduces the Llama3 model and focuses on how to use OpenVINO™ to optimize it, accelerate inference, and deploy it on an AI PC for faster, smarter AI inference.

General Introduction of Llama3

Llama3 offers models at various parameter scales, such as 8B and 70B parameter models. The core features and main advantages can be summarized as follows:

● Advanced capabilities and performance: Delivers state-of-the-art performance in reasoning, language generation, and code generation, setting new industry standards for LLMs.

● Enhanced efficiency: Utilizes a decoder-only transformer architecture with grouped query attention (GQA), optimizing both language encoding efficiency and computational resource usage, making it suitable for large-scale AI tasks.

● Comprehensive training and fine-tuning: Pretrained on over 15 trillion tokens and enhanced with innovative instruction fine-tuning techniques such as SFT and PPO, Llama3 excels in handling complex, multilingual tasks and diverse AI applications.

● Open-source community focus: Released as part of Meta’s open-source initiative, Llama3 encourages community engagement and innovation, supporting an ecosystem where developers can easily access and contribute to its development.

Optimization, Acceleration, and Deployment on AI PC with OpenVINO™

As mentioned above, deploying the Llama3 model to a local AI PC not only means faster response times and lower operating costs but also effectively protects data security. This is particularly important in applications that need to handle highly sensitive data, such as healthcare, finance, and personal assistants.

The process of optimizing, accelerating inference, and deploying Llama-3-8B-Instruct on an AI PC consists of the following steps, using the llm-chatbot code example from the OpenVINO™ Notebooks GitHub repository, where detailed information and the complete source code can be found.

Starting From Prerequisite Package Installations

The detailed installation guide for running the OpenVINO™ Notebooks repository is available in the repository itself. To run the llm-chatbot code example, the prerequisite dependencies need to be installed first.
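As a rough reference, a minimal set of dependencies can be installed from within the notebook with a command along the following lines; the authoritative package list and version pins live in the notebook itself, so treat this as illustrative:

# Illustrative install command (Jupyter pip magic); consult the notebook's own requirements for exact versions.
%pip install -q "openvino>=2024.1.0" "nncf>=2.9.0" "optimum[openvino]" "transformers" "gradio" "ipywidgets"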

Select a Model for Inference

In our Jupyter Notebook demonstration, we provide a set of LLMs supported by OpenVINO™ in multiple languages. You can first select a language from the dropdown box. For Llama3, we choose English.

Next, select “llama-3-8b-instruct” to run the remaining optimization and inference acceleration steps for this model. Of course, it is easy to switch to any other model listed in the dropdown box.
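Under the hood, the selection is simply a pair of ipywidgets dropdowns; a minimal sketch (the option lists are abbreviated and the variable names are illustrative, not the notebook's exact ones) looks like this:

from IPython.display import display
import ipywidgets as widgets

# Abbreviated stand-ins for the notebook's language and model dropdowns.
model_language = widgets.Dropdown(
    options=["English", "Chinese", "Japanese"],
    value="English",
    description="Language:",
)
model_id = widgets.Dropdown(
    options=["llama-3-8b-instruct", "llama-2-chat-7b", "mistral-7b"],
    value="llama-3-8b-instruct",
    description="Model:",
)
display(model_language, model_id)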

Convert Model Using Optimum-CLI

Optimum Intel serves as the interface between the Hugging Face Transformers and Diffusers libraries and OpenVINO™, designed to accelerate end-to-end pipelines on Intel® architectures. It provides an easy-to-use CLI (command-line interface) for exporting models into the OpenVINO™ Intermediate Representation (IR) format. The model export can be completed with the following command:

optimum-cli export openvino --model <model_id_or_path> --task <task> <out_dir>

Here, the --model argument is the model id from the Hugging Face Hub or a local directory containing the model (saved using the .save_pretrained method), and --task is one of the supported tasks that the exported model should solve. For LLMs, it will be text-generation-with-past. If model initialization requires remote code, the --trust-remote-code flag should additionally be passed.
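For example, exporting Llama3 directly from the Hugging Face Hub could look roughly like this; the model id below is Meta's public repository name (which requires accepting the model license and logging in to the Hub), and the output directory name is illustrative:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --task text-generation-with-past llama-3-8b-instruct-ov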

Compress Model Weights

Although LLMs like “Llama-3-8B-Instruct” are becoming increasingly powerful and complex in understanding and generating human-like text, managing and deploying these models presents key challenges in terms of computing resources, memory footprint, and inference speed, especially for client devices such as AI PCs. Weight compression algorithms compress the model’s weights and can be used to optimize the footprint and performance of large models whose weights are much larger than their activations, as is the case for LLMs. Compared to INT8 compression, INT4 compression further reduces the model size and improves text generation performance, at the cost of a slight decrease in prediction quality. Therefore, here we choose to compress the model weights to INT4 precision.
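Weight compression in OpenVINO™ is implemented by NNCF, which is also what the Optimum-CLI command in the next section invokes for you. As a minimal sketch of the underlying API, with illustrative file paths, an exported IR model can be compressed to INT4 like this:

import openvino as ov
import nncf

core = ov.Core()
# Read the IR produced by the export step; the path is illustrative.
model = core.read_model("llama-3-8b-instruct/openvino_model.xml")

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # symmetric INT4, matching the "sym": True setting below
    group_size=128,
    ratio=0.8,  # ~80% of weight layers in INT4, the rest kept in INT8
)
ov.save_model(compressed_model, "llama-3-8b-instruct-int4/openvino_model.xml")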

Model Compression Using Optimum-CLI

The Optimum-CLI tool facilitates the export of models with options for applying FP16, INT8, or INT4 weight compression to linear, convolutional, and embedding layers. This functionality is crucial for optimizing model size and inference speed, enhancing the model’s operational efficiency. The method is straightforward: set the --weight-format option to fp16, int8, or int4 respectively. This type of optimization reduces memory usage and inference latency. By default, the quantization scheme for INT8/INT4 is asymmetric; you can add the --sym option if you need symmetric compression.

For INT4 quantization, we specify the following parameters for “Llama-3-8B-Instruct”:

compression_configs = {
    "llama-3-8b-instruct": {
        "sym": True,
        "group_size": 128,
        "ratio": 0.8,
    },
}

The group_size parameter (--group-size on the command line) defines the group size used for quantization.

The ratio parameter (--ratio) controls the ratio between 4-bit and 8-bit quantization. In this case, a ratio of 0.8 means that 80% of the layers will be quantized to INT4, while the remaining 20% will be quantized to INT8.

The model compression using Optimum-CLI could be performed with the following code:

optimum-cli export openvino --model "llama-3-8b-instruct" --task text-generation-with-past \
    --weight-format int4 --group-size 128 --ratio 0.8 --sym <out_dir>

After model compression, we can see that the size of this 8B-parameter model has been reduced to around 5 GB.

Select Device for Inference and Model Variant

Since OpenVINO™ can easily be deployed across a range of hardware devices, a dropdown box is also provided for you to select the device on which to run inference. Considering the model size and performance requirements, here we choose the GPU of an AI PC equipped with the Intel® Core™ Ultra 7 155H processor as the inference device.
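If you are unsure which targets your machine exposes, OpenVINO™ can enumerate them directly; a quick check looks like this:

import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'] on a Core Ultra based AI PC
for device in core.available_devices:
    # Print the human-readable name of each inference device.
    print(device, core.get_property(device, "FULL_DEVICE_NAME"))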

Instantiate Model Using Optimum Intel

Optimum Intel can be used to load models that have been optimized with weight compression and saved locally, and to create pipelines that run inference with the OpenVINO Runtime through the Hugging Face APIs. In practice, this means we simply replace the AutoModelForXxx class with the corresponding OVModelForXxx class to set up and run the inference pipeline for “Llama-3-8B-Instruct”.
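A minimal sketch of this step, assuming the INT4 model was exported to a local directory as described above (the directory name is illustrative), looks roughly like this:

from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_dir = "llama-3-8b-instruct-int4"  # illustrative path to the compressed IR model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# OVModelForCausalLM replaces AutoModelForCausalLM and compiles the model for the selected device.
ov_model = OVModelForCausalLM.from_pretrained(model_dir, device="GPU")

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = ov_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))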

Run the Chatbot in Llama3 with OpenVINO™

Everything is ready! To make this Llama3-based chatbot easy to use, we also provide a user-friendly interface based on Gradio. Now let’s chat.
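The notebook ships a complete Gradio UI with streaming output and generation controls; a stripped-down sketch of the same idea, reusing the tokenizer and ov_model objects from the previous code sketch, could look like this:

import gradio as gr

def chat_fn(message, history):
    # Format the user message with the model's chat template and generate a reply.
    # (The history argument is ignored in this sketch; the real demo keeps the full conversation.)
    messages = [{"role": "user", "content": message}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = ov_model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens and return only the newly generated text.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(chat_fn, title="Llama3 chatbot with OpenVINO").launch()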

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

