How to run Llama 3.1 locally with OpenVINO™

Raymond Lo, PhD
Published in OpenVINO-toolkit · 5 min read · Aug 22, 2024

Authors: Raymond Lo, Zhuo Wu, Dmitriy Pastushenkov

Introduction

With the release of Llama 3.1, the latest advancements in AI models are now more accessible than ever. Thanks to the seamless integration of OpenVINO™ and Optimum Intel, you can compress, optimize, and run this powerful model locally on your Intel hardware. In this guide, we’ll walk you through the entire process, from setup to execution, enabling you to harness the full potential of Llama 3.1 with minimal effort.

Running Llama 3.1 on AI PC’s iGPU.

Table of Contents

  1. Download the OpenVINO GenAI Sample Code
  2. Install the Latest Build and Dependencies
  3. Download and Export Llama 3.1 with NNCF
  4. Run the Model
  5. Conclusion

Step 0: Prepare your machine for development!

If this is your first time, we recommend following the basic setup steps (1, 2, and 3) in the wiki.

Setting up the machine and getting ready =).

Step 1: Download the OpenVINO GenAI Sample Code

The simplest way to get Llama 3.1 running is by using the OpenVINO GenAI API on Windows. We’ll walk you through setting it up using the sample code provided.

Start by cloning the repository:

git clone https://github.com/openvinotoolkit/openvino.genai.git

Inside the repository, you’ll find a Python sample called chat_sample. This concise sample enables you to execute Llama 3.1 with a user prompt in under 40 lines of code. It’s a straightforward way to start exploring the model’s capabilities.

The chat sample in OpenVINO GenAI’s Python samples

Here’s a preview of the sample code:

#!/usr/bin/env python3
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import openvino_genai


def streamer(subword):
    print(subword, end='', flush=True)
    # Return flag corresponds to whether generation should be stopped.
    # False means continue generation.
    return False


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_dir')
    args = parser.parse_args()

    device = 'CPU'  # GPU can be used as well
    pipe = openvino_genai.LLMPipeline(args.model_dir, device)

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 100

    pipe.start_chat()
    while True:
        prompt = input('question:\n')
        if 'Stop!' == prompt:
            break
        pipe.generate(prompt, config, streamer)
        print('\n----------')
    pipe.finish_chat()


if '__main__' == __name__:
    main()

Next, let’s set up the environment to handle the model downloading, conversion, and execution.

Step 2: Install the Latest Build and Dependencies

To avoid dependency conflicts, it’s recommended to create a separate virtual environment:

python -m venv openvino_venv

Activate the environment:

openvino_venv\Scripts\activate

Now, install the necessary dependencies:

python -m pip install --upgrade pip
pip install -U --pre openvino-genai openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
pip install --extra-index-url https://download.pytorch.org/whl/cpu "git+https://github.com/huggingface/optimum-intel.git" "git+https://github.com/openvinotoolkit/nncf.git" "onnx<=1.16.1"
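To confirm the packages installed correctly, you can run a quick sanity check that imports OpenVINO and lists the inference devices it can see; the exact device names it reports depend on your hardware and drivers:

# Minimal sanity check: import OpenVINO and list detected devices.
import openvino as ov
import openvino_genai  # verifies the GenAI package imports cleanly

core = ov.Core()
print("OpenVINO version:", ov.get_version())
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU']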

Step 3: Download and Export Llama 3.1 with NNCF

Before exporting the model from Hugging Face, ensure you’ve accepted the usage agreement here.

Then, use the following command to download and export the model:

optimum-cli export openvino --model meta-llama/Meta-Llama-3.1-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 1.0 --sym llama-3.1-8b-instruct/INT4_compressed_weights
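If you prefer to drive the export from Python rather than the command line, the sketch below shows roughly the same INT4 weight compression through Optimum Intel. This is a hedged sketch, not the exact equivalent of the CLI: the class and argument names follow the current optimum-intel API and may change between releases, and the CLI export also produces the OpenVINO tokenizer files the GenAI pipeline expects.

# Hedged sketch: export and INT4-compress Llama 3.1 via Optimum Intel from Python.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
out_dir = "llama-3.1-8b-instruct/INT4_compressed_weights"

# Mirror the CLI flags: int4 weights, group size 128, ratio 1.0, symmetric quantization.
q_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=1.0, sym=True)

model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=q_config)
model.save_pretrained(out_dir)

# Save the tokenizer alongside the model so downstream tooling can find it.
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)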

Step 4: Run the Model

Before we execute the model, note that the tokenizer_config.json file inside llama-3.1-8b-instruct/INT4_compressed_weights is missing the proper chat_template variable. Here is a patched version that gets this demo working.

Replace the chat_template value in that file with the one below.
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",

You’re now ready to run the model inference with OpenVINO. Use this command:

python chat_sample.py ./llama-3.1-8b-instruct/INT4_compressed_weights

By default, the sample runs on the CPU. To switch to the GPU, simply update the device parameter in chat_sample.py:

device = 'GPU'  # GPU can be used as well
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
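If you're not sure which accelerators your machine exposes, you can let the script prefer a GPU when OpenVINO reports one and fall back to the CPU otherwise. This is a small hedged variation on the sample rather than part of the original chat_sample.py, and it reuses the sample's args.model_dir:

# Hedged variation: prefer a GPU when OpenVINO reports one, otherwise use the CPU.
import openvino as ov
import openvino_genai

available = ov.Core().available_devices  # e.g. ['CPU', 'GPU'] or ['CPU', 'GPU.0']
device = 'GPU' if any(d.startswith('GPU') for d in available) else 'CPU'
pipe = openvino_genai.LLMPipeline(args.model_dir, device)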

Finally, here’s a snapshot of the inference running on the integrated GPU of my AI PC!

Running Llama 3.1 on AI PC’s iGPU

Update #1: I’ve tested the setup on Ubuntu and it works well. Here is a pip freeze file you can use as a golden-path reference.

Conclusion

Running Llama 3.1 locally with OpenVINO™ provides a robust and efficient solution for developers looking to maximize AI performance on Intel hardware. With this setup, you can enjoy faster inference times, lower latency, and reduced resource consumption — all with minimal setup and coding effort. We hope this guide helps you get started quickly and effectively. Happy coding!

Read More

Explore more on AI and OpenVINO™ with these related guides:

  1. Build Agentic-RAG with OpenVINO™ and LlamaIndex
    - Comprehensive Guide to Building Advanced AI Systems Using OpenVINO™ and LlamaIndex
  2. How to Build Faster GenAI Apps with Fewer Lines of Code using OpenVINO™ GenAI API
    - Learn how to build faster GenAI applications with minimal code.
  3. Running Llama2 on CPU and GPU with OpenVINO
    - Run Llama 2 on CPU with optimized performance using OpenVINO.

Additional Resources

OpenVINO Documentation
Jupyter Notebooks
Installation and Setup

Product Page

About the Authors & Editors:

Zhuo Wu, who has her PhD in electronics, is an AI evangelist at Intel specializing in the OpenVINO™ toolkit. Her work covers deep learning, 5G wireless communication, computer vision, edge computing, and IoT systems. She has delivered AI solutions across various industries and has conducted significant research in 4G-LTE and 5G systems. Previously, she was a research scientist at Bell Labs (China) and an associate professor at Shanghai University, where she led several research projects and filed multiple patents.
Raymond Lo, currently based in Silicon Valley, is the global lead of Intel’s AI evangelist team, focusing on the OpenVINO™ Toolkit. With a diverse background that includes founding the augmented reality company Meta, Raymond has also held key roles at Samsung NEXT and Google Cloud AI. His work spans startup entrepreneurship and enterprise innovation, with a strong presence in global conferences like TED Talks and SIGGRAPH.
Dmitriy Pastushenkov is an AI PC Evangelist at Intel Germany with over 20 years of experience in industrial automation, IIoT, real-time operating systems, and AI. He has held roles in software development, architecture, and technical management. Since joining Intel in 2022 as a Software Architect, he has focused on optimizing AI and real-time workloads at the smart edge. Currently, he advocates for OpenVINO and the AI PC Software Stack. Dmitriy holds a Master’s degree in Computer Science from Moscow Power Engineering Institute.
Stephanie Maluso is a product marketer and analyst for Intel, specializing in the OpenVINO™ Toolkit. With over three years of experience on the team, beginning as an intern, she has developed a deep passion for creating impactful content around the innovative AI products and tools she supports.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

