How to run Llama 3.1 locally with OpenVINO™
Authors: Raymond Lo, Zhuo Wu, Dmitriy Pastushenkov
Introduction
With the release of Llama 3.1, the latest advancements in AI models are now more accessible than ever. Thanks to the seamless integration of OpenVINO™ and Optimum Intel, you can compress, optimize, and run this powerful model locally on your Intel hardware. In this guide, we’ll walk you through the entire process, from setup to execution, enabling you to harness the full potential of Llama 3.1 with minimal effort.
Table of Contents
- Download the OpenVINO GenAI Sample Code
- Install the Latest Build and Dependencies
- Download and Export Llama 3.1 with NNCF
- Run the Model
- Conclusion
Step 0: Prepare your machine for development!
If this is your first time setting up a development environment, we recommend following the basic setup steps (1, 2, and 3) in the wiki.
Step 1: Download the OpenVINO GenAI Sample Code
The simplest way to get Llama 3.1 running is by using the OpenVINO GenAI API on Windows. We’ll walk you through setting it up using the sample code provided.
Start by cloning the repository:
git clone https://github.com/openvinotoolkit/openvino.genai.git
Inside the repository, you’ll find a Python sample called chat_sample. This concise sample lets you run Llama 3.1 with a user prompt in under 40 lines of code, making it a straightforward way to start exploring the model’s capabilities.
Here’s a preview of the sample code:
#!/usr/bin/env python3
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import openvino_genai


def streamer(subword):
    print(subword, end='', flush=True)
    # Return flag corresponds to whether generation should be stopped.
    # False means continue generation.
    return False


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_dir')
    args = parser.parse_args()

    device = 'CPU'  # GPU can be used as well
    pipe = openvino_genai.LLMPipeline(args.model_dir, device)

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 100

    pipe.start_chat()
    while True:
        prompt = input('question:\n')
        if 'Stop!' == prompt:
            break
        pipe.generate(prompt, config, streamer)
        print('\n----------')
    pipe.finish_chat()


if '__main__' == __name__:
    main()
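If you want to experiment beyond the default greedy decoding, openvino_genai.GenerationConfig also exposes common sampling knobs. Here is a minimal sketch, assuming the attribute names below are available in the GenAI build you install in the next step:

# Sketch: enable sampling instead of greedy decoding
# (attribute names assumed to match your openvino_genai build).
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 256
config.do_sample = True    # sample from the distribution instead of taking the argmax
config.temperature = 0.7   # soften the token distribution
config.top_p = 0.9         # nucleus sampling cutoff
pipe.generate(prompt, config, streamer)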
Next, let’s set up the environment to handle the model downloading, conversion, and execution.
Step 2: Install the Latest Build and Dependencies
To avoid dependency conflicts, it’s recommended to create a separate virtual environment:
python -m venv openvino_venv
Activate the environment:
openvino_venv\Scripts\activate
Now, install the necessary dependencies:
python -m pip install --upgrade pip
pip install -U --pre openvino-genai openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
pip install --extra-index-url https://download.pytorch.org/whl/cpu "git+https://github.com/huggingface/optimum-intel.git" "git+https://github.com/openvinotoolkit/nncf.git" "onnx<=1.16.1"
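As an optional sanity check before moving on, you can confirm the nightly wheels were picked up by importing the packages and printing the OpenVINO version:

# Optional sanity check: confirm the nightly wheels are installed and importable.
import openvino as ov
import openvino_genai  # raises ImportError if the GenAI package is missing

print(ov.get_version())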
Step 3: Download and Export Llama 3.1 with NNCF
Before exporting the model from Hugging Face, ensure you’ve accepted the usage agreement here.
Then, use the following command to download and export the model:
optimum-cli export openvino --model meta-llama/Meta-Llama-3.1-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 1.0 --sym llama-3.1-8b-instruct/INT4_compressed_weights
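If you prefer to stay in Python rather than the CLI, the same INT4 weight compression should also be possible directly through Optimum Intel. The following is a sketch only, assuming the OVWeightQuantizationConfig parameters mirror the CLI flags above; the CLI command remains the path this guide uses.

# Sketch: export + INT4 weight compression from Python via Optimum Intel.
# Parameter names are assumed to mirror the optimum-cli flags above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
output_dir = "llama-3.1-8b-instruct/INT4_compressed_weights"

quant_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=1.0)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quant_config)
model.save_pretrained(output_dir)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(output_dir)

If you go this route, double-check that the converted OpenVINO tokenizer files that LLMPipeline expects end up in the output directory.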
Step 4: Run the Model
Before running the model, note that the tokenizer_config.json file inside llama-3.1-8b-instruct/INT4_compressed_weights is missing the proper chat_template value. Add the patched version below to get this demo working:
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
You’re now ready to run the model inference with OpenVINO. Use this command:
python chat_sample.py ./llama-3.1-8b-instruct/INT4_compressed_weights
By default, the sample runs on the CPU. To switch to the GPU, simply update the device parameter in chat_sample.py:
device = 'GPU' # GPU can be used as well
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
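If you are unsure which device names are valid on your machine, OpenVINO can enumerate them for you:

# Quick check of the device names OpenVINO sees on this machine
# (e.g. ['CPU', 'GPU'] on a system with an integrated GPU).
import openvino as ov

print(ov.Core().available_devices)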
Finally, here’s a snapshot of the inference running on the integrated GPU of my AI PC!
Update #1: I’ve also tested this setup on Ubuntu, and it works well. Here is a pip freeze file you can use as a golden-path reference.
Conclusion
Running Llama 3.1 locally with OpenVINO™ provides a robust and efficient solution for developers looking to maximize AI performance on Intel hardware. With this setup, you can enjoy faster inference times, lower latency, and reduced resource consumption — all with minimal setup and coding effort. We hope this guide helps you get started quickly and effectively. Happy coding!
Read More
Explore more on AI and OpenVINO™ with these related guides:
- Build Agentic-RAG with OpenVINO™ and LlamaIndex: a comprehensive guide to building advanced AI systems using OpenVINO™ and LlamaIndex.
- How to Build Faster GenAI Apps with Fewer Lines of Code using OpenVINO™ GenAI API: learn how to build faster GenAI applications with minimal code.
- Running Llama2 on CPU and GPU with OpenVINO: run Llama 2 on CPU with optimized performance using OpenVINO.
Additional Resources
OpenVINO Documentation
Jupyter Notebooks
Installation and Setup
Product Page
About the Authors & Editors:
Notices & Disclaimers
Intel technologies may require enabled hardware, software, or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.