How to run Llama 3.1 locally with OpenVINO™
Authors: Raymond Lo, Zhuo Wu, Dmitriy Pastushenkov
Introduction
With the release of Llama 3.1, the latest advancements in AI models are now more accessible than ever. Thanks to the seamless integration of OpenVINO™ and Optimum Intel, you can compress, optimize, and run this powerful model locally on your Intel hardware. In this guide, we’ll walk you through the entire process, from setup to execution, enabling you to harness the full potential of Llama 3.1 with minimal effort.
Table of Contents
- Download the OpenVINO GenAI Sample Code
- Install the Latest Build and Dependencies
- Download and Export Llama 3.1 with NNCF
- Run the Model
- Conclusion
Step 0: Prepare your machine for development!
If this is your first time setting up a development environment, we recommend following the basic setup steps (1, 2, and 3) in the wiki.
Step 1: Download the OpenVINO GenAI Sample Code
The simplest way to get Llama 3.1 running is by using the OpenVINO GenAI API on Windows. We’ll walk you through setting it up using the sample code provided.
Start by cloning the repository:
git clone https://github.com/openvinotoolkit/openvino.genai.git
Inside the repository, you’ll find a Python sample called chat_sample. This concise sample lets you run Llama 3.1 with a user prompt in under 40 lines of code, making it a straightforward way to start exploring the model’s capabilities.
Here’s a preview of the sample code:
#!/usr/bin/env python3
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import openvino_genai


def streamer(subword):
    print(subword, end='', flush=True)
    # Return flag corresponds to whether generation should be stopped.
    # False means continue generation.
    return False


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_dir')
    args = parser.parse_args()

    device = 'CPU'  # GPU can be used as well
    pipe = openvino_genai.LLMPipeline(args.model_dir, device)

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 100

    pipe.start_chat()
    while True:
        prompt = input('question:\n')
        if 'Stop!' == prompt:
            break
        pipe.generate(prompt, config, streamer)
        print('\n----------')
    pipe.finish_chat()


if '__main__' == __name__:
    main()
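If you want to experiment beyond the default greedy decoding, openvino_genai.GenerationConfig also exposes common sampling knobs. Here is a minimal sketch, assuming the attribute names below are available in the GenAI build you install in the next step:

# Sketch: enable sampling instead of greedy decoding
# (attribute names assumed to match your openvino_genai build).
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 256
config.do_sample = True    # sample from the distribution instead of taking the argmax
config.temperature = 0.7   # soften the token distribution
config.top_p = 0.9         # nucleus sampling cutoff
pipe.generate(prompt, config, streamer)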
Next, let’s set up the environment to handle the model downloading, conversion, and execution.
Step 2: Install the Latest Build and Dependencies
To avoid dependency conflicts, it’s recommended to create a separate virtual environment:
python -m venv openvino_venv
Activate the environment:
openvino_venv\Scripts\activate
Now, install the necessary dependencies:
python -m pip install --upgrade pip
pip install -U --pre openvino-genai openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
pip install --extra-index-url https://download.pytorch.org/whl/cpu "git+https://github.com/huggingface/optimum-intel.git" "git+https://github.com/openvinotoolkit/nncf.git" "onnx<=1.16.1"
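As an optional sanity check before moving on, you can confirm the nightly wheels were picked up by importing the packages and printing the OpenVINO version:

# Optional sanity check: confirm the nightly wheels are installed and importable.
import openvino as ov
import openvino_genai  # raises ImportError if the GenAI package is missing

print(ov.get_version())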
Step 3: Download and Export Llama 3.1 with NNCF
Before exporting the model from Hugging Face, ensure you’ve accepted the usage agreement here.
Then, use the following command to download and export the model:
optimum-cli export openvino --model meta-llama/Meta-Llama-3.1-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 1.0 --sym llama-3.1-8b-instruct/INT4_compressed_weights
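If you prefer to stay in Python rather than the CLI, the same INT4 weight compression should also be possible directly through Optimum Intel. The following is a sketch only, assuming the OVWeightQuantizationConfig parameters mirror the CLI flags above; the CLI command remains the path this guide uses.

# Sketch: export + INT4 weight compression from Python via Optimum Intel.
# Parameter names are assumed to mirror the optimum-cli flags above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
output_dir = "llama-3.1-8b-instruct/INT4_compressed_weights"

quant_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=1.0)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quant_config)
model.save_pretrained(output_dir)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(output_dir)

If you go this route, double-check that the converted OpenVINO tokenizer files that LLMPipeline expects end up in the output directory.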
Step 4: Run the Model
Before running the model, note that the tokenizer_config.json file inside llama-3.1-8b-instruct/INT4_compressed_weights is missing the proper chat_template value. Add the patched version below to get this demo working:
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
You’re now ready to run the model inference with OpenVINO. Use this command:
python chat_sample.py ./llama-3.1-8b-instruct/INT4_compressed_weights
By default, the sample runs on the CPU. To switch to the GPU, simply update the device parameter in chat_sample.py:
device = 'GPU' # GPU can be used as well
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
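If you are unsure which device names are valid on your machine, OpenVINO can enumerate them for you:

# Quick check of the device names OpenVINO sees on this machine
# (e.g. ['CPU', 'GPU'] on a system with an integrated GPU).
import openvino as ov

print(ov.Core().available_devices)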
Finally, here’s a snapshot of the inference running on the integrated GPU of my AI PC!
Update #1: I’ve also tested this setup on Ubuntu, and it works well. Here is a pip freeze file you can use as a golden-path reference.
Conclusion
Running Llama 3.1 locally with OpenVINO™ provides a robust and efficient solution for developers looking to maximize AI performance on Intel hardware. With this setup, you can enjoy faster inference times, lower latency, and reduced resource consumption — all with minimal setup and coding effort. We hope this guide helps you get started quickly and effectively. Happy coding!
Read More
Explore more on AI and OpenVINO™ with these related guides:
- Build Agentic-RAG with OpenVINO™ and LlamaIndex: a comprehensive guide to building advanced AI systems using OpenVINO™ and LlamaIndex.
- How to Build Faster GenAI Apps with Fewer Lines of Code using OpenVINO™ GenAI API: learn how to build faster GenAI applications with minimal code.
- Running Llama2 on CPU and GPU with OpenVINO: run Llama 2 on CPU with optimized performance using OpenVINO.
Additional Resources
OpenVINO Documentation
Jupyter Notebooks
Installation and Setup
Product Page
About the Authors & Editors:
Notices & Disclaimers
Intel technologies may require enabled hardware, software, or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.