Why and How to Use OpenVINO™ Toolkit to Deploy Faster, Smaller LLMs

OpenVINO™ toolkit · Published in OpenVINO-toolkit · Jul 2, 2024 · 12 min read

Large language models (LLMs) are driving breakthrough applications in conversational AI, comprehension, and translation. As businesses and users continue to explore the potential of LLMs, the OpenVINO™ Toolkit is a leading solution for optimizing and deploying LLMs that are lean, fast, and flexible. You can use OpenVINO to compress LLMs, integrate them into AI-assistant applications, and deploy them on edge devices or in the cloud with powerful performance.

In this post we’ll explore the benefits of OpenVINO-enabled LLMs and how to load and deploy LLMs using OpenVINO. This information is excerpted and abridged from an extensive white paper on using OpenVINO for LLMs, which you can read in full here.

What is OpenVINO?

OpenVINO provides an efficient runtime environment for deploying LLMs that offers key advantages over other frameworks:

· Slim deployment: OpenVINO is a self-contained package that requires only a few hundred megabytes of dependencies compared to gigabytes of dependencies with Hugging Face and PyTorch.

· Speed: OpenVINO provides optimized inference performance for LLMs and continues to receive updates for consistent improvement, rivaling or exceeding the performance of third-party solutions, while also providing full C/C++ and Python APIs.

· Official Intel support: As the official AI framework distributed by Intel, OpenVINO is fully supported with patches, upgrades, feature updates, and access to Intel field application engineers for Q&A.

· Flexibility: OpenVINO supports a wide range of models and architectures, enabling multimodal application development and deployment, from LLMs to computer vision, image generation, text-to-speech, data classification, and more.

· Hardware support: OpenVINO supports a wide range of x86/x64 and ARM-based hardware targets including CPUs, integrated GPUs, and discrete GPUs, enabling deployment across high-powered servers and compact edge devices.

How to deploy LLMs with OpenVINO

There are two options for optimizing and deploying LLMs with OpenVINO. In both cases, the OpenVINO runtime is used as the backend for inference, and OpenVINO tools are used for model optimization. The main differences are in ease of use, footprint size, and customizability:

1. Hugging Face: Use OpenVINO as a backend for the Hugging Face Transformers API through the Optimum Intel extension. The API is easy to learn and provides a simpler interface, but it pulls in more dependencies and therefore produces a larger deployment footprint. There are also fewer options for customization, as much of the complexity is hidden beneath abstraction layers.

2. Native OpenVINO: Use OpenVINO native APIs (C++ and Python) with custom pipeline code. This requires fewer dependencies, which minimizes application footprint, but requires explicit implementation of the text generation loop, tokenization, and scheduler functions. There is a steeper learning curve as well as greater potential for customization.

We’ll explore both options now.

Requirements

To get started with OpenVINO, set up a Python virtual environment for OpenVINO by following the OpenVINO Installation Instructions. Once the environment is created and activated, install Optimum Intel, OpenVINO, NNCF and their dependencies in a Python environment by issuing:

pip install optimum[openvino]

If you are deploying an LLM in a native OpenVINO environment, you will also need to install OpenVINO Tokenizers for tokenization. It is supported on Linux, macOS, and Windows operating systems. A list of supported tokenizer types can be found in the “Supported Tokenizer Types” section of the OpenVINO Tokenizers documentation.

In the same Python virtual environment that was set up above, install OpenVINO™ Tokenizers by issuing:

pip install openvino-tokenizers[transformers]
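
Once the packages are installed, a quick sanity check can confirm the environment is ready. The short snippet below is a minimal sketch (it only verifies that the packages import and prints the installed OpenVINO version; the exact version string depends on your release):

import openvino
import openvino_tokenizers
import optimum.intel

# Prints the installed OpenVINO runtime version string
print(openvino.get_version())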

Inference with Hugging Face and Optimum Intel

For AI text generation applications, LLM inference consists of four stages, which look very different depending on whether you use Hugging Face or the native OpenVINO API:

1. Load model

2. Tokenize input text

3. Execute inference loop

4. Process output tokens

Loading Hugging Face LLMs into OpenVINO

The easiest way to use an LLM in OpenVINO is to load a model from the Hugging Face Hub using Optimum Intel. Models loaded with Optimum Intel are optimized for OpenVINO while being compatible with the Hugging Face Transformers API. The OVModelForCausalLM class takes a model name, downloads it from Hugging Face, and initializes it as an object in memory.

To initialize a model from Hugging Face, use the OVModelForCausalLM.from_pretrained method shown in the snippet below. Setting the parameter export=True converts the model to OpenVINO intermediate representation (IR) format on the fly.

from optimum.intel import OVModelForCausalLM
model_id = "HuggingFaceH4/zephyr-7b-beta"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

Saving and loading models

Once a model has been converted to IR format using Optimum Intel, it can be saved and exported to use in a future session or deployment environment. The conversion process takes a while, so it’s preferable to convert the model to IR format once, save it, and then load the compressed model later for faster time to first inference.

To save and export a model and its tokenizer, use model.save_pretrained("your-model-name") and tokenizer.save_pretrained("your-model-name").

# Save model and tokenizer for faster loading later
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)  # matching tokenizer for the model above
model.save_pretrained("zephyr-7b-beta-ov")
tokenizer.save_pretrained("zephyr-7b-beta-ov")

The model will be exported in OpenVINO IR format (openvino_model.bin, openvino_model.xml) and saved to a new folder in the specified directory. The tokenizer will also be saved to the directory. To load the model and tokenizer in a future session, use OVModelForCausalLM.from_pretrained("your-model-name") and AutoTokenizer.from_pretrained("your-model-name").

# Load a saved model
model = OVModelForCausalLM.from_pretrained("zephyr-7b-beta-ov")
tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta-ov")

Initializing the model

Hugging Face’s high-level Transformers API provides a simple option for initializing the model and running inference. It is wrapped with the Optimum Intel extension, which converts the LLM to OpenVINO IR format and sets OpenVINO runtime as the backend.

Here is a quick example that uses Hugging Face Transformers and Optimum Intel to set up and run a simple text generation pipeline.

from optimum.intel import OVModelForCausalLM
# new imports for inference
from transformers import AutoTokenizer, pipeline

# load the model
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

# inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "What is OpenVINO?"
results = pipe(prompt)

In the example, three key classes and methods are used:

  • OVModelForCausalLM.from_pretrained from Optimum Intel: Loads the LLM from Hugging Face, converts it to OpenVINO IR format, and compiles it on a target device using OpenVINO as the inference backend.
  • AutoTokenizer from Hugging Face Transformers: Initializes a text tokenizer for the LLM.
  • pipeline from Hugging Face Transformers: Handles the bulk of text generation, including tokenizing the inputs, executing the inference loop, and processing the outputs.

These classes provide a simple interface for setting up text generation. Each class or method has more parameters that can be used to further configure the LLM. The Transformers API also has other features that give more control over inference parameters, such as the model.generate() method. To learn more, see Hugging Face Transformers documentation.
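
For example, here is a minimal sketch of calling model.generate() directly instead of relying on the pipeline helper; the prompt and generation parameters below are illustrative, and model and tokenizer are the objects created in the snippet above:

# Generate with explicit decoding parameters instead of the pipeline defaults
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,   # cap the number of generated tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))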

Selecting devices to compile the LLM in Hugging Face

There are two options to select which device (CPU, iGPU, GPU, etc.) the LLM is compiled on:

1. Specify the device parameter in the .from_pretrained() call. For example, use OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU.0") to run the model on the GPU. See the Device Query documentation for more information.

2. Use the model.to method after the model has been loaded and pass in the name of the target device. For example, use model.to("GPU.0") to run the model on the GPU. Both options are shown in the short sketch below.
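
The following sketch combines both options. The device names are examples; you can list what is available on your machine with the available_devices property of the OpenVINO Core object:

import openvino as ov
from optimum.intel import OVModelForCausalLM

print(ov.Core().available_devices)  # e.g. ['CPU', 'GPU.0'] depending on your hardware

model_id = "HuggingFaceH4/zephyr-7b-beta"

# Option 1: pick the device when the model is loaded and compiled
model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU.0")

# Option 2: move an already-loaded model to a different device
model.to("CPU")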

While Hugging Face APIs greatly simplify the code for implementing text generation, one drawback is that they cannot be implemented in C/C++. In contrast, the native OpenVINO API supports building solutions with C/C++.

Inference with native OpenVINO API in Python

Inference can also be run on LLMs using the native OpenVINO API. An inference pipeline for a text generation LLM can be set up in the following stages:

1. Read and compile the model

2. Tokenize text and set model inputs

3. Run token generation loop

4. De-tokenize outputs

This section provides code snippets showing how to implement each stage with the native OpenVINO Python API. These snippets implement a stateful model technique to increase the memory efficiency of LLMs. With this technique, the model's context, that is, its internal state (the KV cache), is shared among multiple iterations of inference. The KV cache that belongs to a particular text sequence is accumulated inside the model during the generation loop. The stateful model implementation supports both greedy search and beam search (preview) for LLMs.

Before you get started, make sure you’ve installed all required extensions and APIs as directed under the “Requirements” section of this post.

Convert Hugging Face tokenizer and model to OpenVINO IR Format

Before an LLM and its tokenizer can be used with the native OpenVINO API, they must be converted to OpenVINO IR format. OpenVINO Tokenizers comes with a command line interface (CLI) tool, convert_tokenizer, that converts tokenizers from the Hugging Face Hub to OpenVINO IR format:

convert_tokenizer HuggingFaceH4/zephyr-7b-beta --with-detokenizer -o openvino_tokenizer

The example above transforms the HuggingFaceH4/zephyr-7b-beta tokenizer from the Hugging Face Hub. The --with-detokenizer argument tells the command to also convert the detokenizer. The -o argument specifies the name of the output directory where the converted objects will be saved (openvino_tokenizer, in this case).

Next, convert the LLM itself to OpenVINO IR format using the optimum-cli tool. This is helpful for converting models without using a Python script.

The command to perform this conversion is structured as follows:

optimum-cli export openvino --model <MODEL_NAME> <NEW_MODEL_NAME>

--model <MODEL_NAME>: This part of the command specifies the name of the model to be converted. Replace <MODEL_NAME> with the actual model name from Hugging Face.

<NEW_MODEL_NAME>: Here, you specify the name you want to give to the new model in the OpenVINO IR format. Replace <NEW_MODEL_NAME> with your desired name.

For example, to convert the Llama 2 7B model from Hugging Face (formally named meta-llama/Llama-2-7b-chat-hf) to an OpenVINO IR model and name it "ov_llama_2", use the following command:

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf ov_llama_2

In this example, meta-llama/Llama-2-7b-chat-hf is the Hugging Face model name, and ov_llama_2 is the new name for the converted OpenVINO IR model.

Additionally, you can specify the --weight-format argument to apply 8-bit or 4-bit weight quantization when exporting your model with the CLI. An example command applying 8-bit quantization to the model gpt2 is below:

optimum-cli export openvino --model gpt2 --weight-format int8 ov_gpt2_model

The converted tokenizer and model are now saved in their output folders (openvino_tokenizer and ov_llama_2 in the examples above). The snippets below assume the resulting .xml and .bin files are collected in a single model directory.

Step 1: Read and compile the model

Now that the model and tokenizer have been converted to OpenVINO IR format, they can be read and compiled using the openvino.compile_model method.

import numpy as np 
from pathlib import Path
import openvino_tokenizers
from openvino import compile_model, Tensor
model_dir = Path("path/to/model/directory")

# Compile the tokenizer, model, and detokenizer using OpenVINO. These files are XML representations of the models optimized for OpenVINO
tokenizer = compile_model(model_dir / "openvino_tokenizer.xml")
detokenizer = compile_model(model_dir / "openvino_detokenizer.xml")
infer_request = compile_model(model_dir / "openvino_model.xml").create_infer_request()

The model and tokenizer are now compiled and ready to be used for inference.

Step 2: Tokenize input text

Input text must be tokenized and set up in the structure expected by the model before running inference. Tokenization converts the input text into a sequence of numbers ("tokens"), which is the format the model can understand and process.

Figure 1. An example phrase broken into tokens, where each token has its own numerical value. [Source]

The compiled tokenizer can be used to convert the input text string into tokens, as shown here.

text_input = [" What is OpenVINO?"]
model_input = {name.any_name: Tensor(output) for name, output in tokenizer(text_input).items()}

Step 3: Run token generation loop

The core of text generation lies in the inference and token selection loop. In each iteration of this loop, the model runs inference on the input sequence, generates and selects a new token, and appends it to the existing sequence.

if "position_ids" in (input.any_name for input in infer_request.model_inputs):
model_input["position_ids"] = np.arange(model_input["input_ids"].shape[1], dtype=np.int64)[np.newaxis, :]

# no beam search, set idx to 0
model_input["beam_idx"] = Tensor(np.array(range(len(text_input)), dtype=np.int32))

# end of sentence token - the model signifies the end of text generation
# for now can be obtained from the original tokenizer `original_tokenizer.eos_token_id`
eos_token = 2

tokens_result = [[]]

# reset KV cache inside the model before inference
infer_request.reset_state()
max_infer = 10

for _ in range(max_infer):
infer_request.start_async(model_input)
infer_request.wait()

# use greedy decoding to get most probable token as the model prediction
output_token = np.argmax(infer_request.get_output_tensor().data[:, -1, :], axis=-1, keepdims=True)
tokens_result = np.hstack((tokens_result, output_token))

if output_token[0][0] == eos_token:
break

# Prepare input for new inference
model_input["input_ids"] = output_token
model_input["attention_mask"] = np.hstack((model_input["attention_mask"].data, [[1]]))
model_input["position_ids"] = np.hstack(
(model_input["position_ids"].data, [[model_input["position_ids"].data.shape[-1]]])
)

Step 4: De-tokenize outputs

The final step in the process is de-tokenization, where the sequence of token IDs generated by the model is converted back into human-readable text. The compiled detokenizer is used to convert the output token IDs back into a string of text.

# Decode the model output back to string
text_result = detokenizer(tokens_result)["string_output"]
print(f"Prompt:\n{text_input[0]}")
print(f"Generated:\n{text_result[0]}")

Here is the resulting output from running this example:

[' <s> OpenVINO is an open-source toolkit for building and optimizing deep learning applications using Intel® hardware.']

Inference with native OpenVINO API in C++

The previous example can also be implemented in C++, leveraging the stateful model technique. The following program from the OpenVINO GenAI GitHub repository loads a tokenizer, a detokenizer, and a model in OpenVINO IR format into the OpenVINO runtime. A prompt is tokenized and passed to the model, which greedily generates tokens one by one until the special end-of-sequence (EOS) token is produced. The predicted tokens are converted to characters and printed in a streaming fashion.

// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <openvino/openvino.hpp>

namespace {
std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::string&& prompt) {
    constexpr size_t BATCH_SIZE = 1;
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {BATCH_SIZE}, &prompt});
    tokenizer.infer();
    return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};
}

std::string detokenize(ov::InferRequest& detokenizer, std::vector<int64_t>& tokens) {
    constexpr size_t BATCH_SIZE = 1;
    detokenizer.set_input_tensor(ov::Tensor{ov::element::i64, {BATCH_SIZE, tokens.size()}, tokens.data()});
    detokenizer.infer();
    return detokenizer.get_output_tensor().data<std::string>()[0];
}

// The following reasons require TextStreamer to keep a cache of previous tokens:
// detokenizer removes starting ' '. For example detokenize(tokenize(" a")) == "a",
// but detokenize(tokenize("prefix a")) == "prefix a"
// 1 printable token may consist of 2 token ids: detokenize(incomplete_token_idx) == "�"
struct TextStreamer {
    ov::InferRequest detokenizer;
    std::vector<int64_t> token_cache;
    size_t print_len = 0;

    void put(int64_t token) {
        token_cache.push_back(token);
        std::string text = detokenize(detokenizer, token_cache);
        if (!text.empty() && '\n' == text.back()) {
            // Flush the cache after the new line symbol
            std::cout << std::string_view{text.data() + print_len, text.size() - print_len};
            token_cache.clear();
            print_len = 0;
        }
        if (text.size() >= 3 && text.compare(text.size() - 3, 3, "�") == 0) {
            // Don't print incomplete text
            return;
        }
        std::cout << std::string_view{text.data() + print_len, text.size() - print_len} << std::flush;
        print_len = text.size();
    }

    void end() {
        std::string text = detokenize(detokenizer, token_cache);
        std::cout << std::string_view{text.data() + print_len, text.size() - print_len} << '\n';
        token_cache.clear();
        print_len = 0;
    }
};
}  // namespace

int main(int argc, char* argv[]) try {
    if (argc != 3) {
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT>'");
    }
    // Compile models
    ov::Core core;
    core.add_extension(USER_OV_EXTENSIONS_PATH);  // USER_OV_EXTENSIONS_PATH is defined in CMakeLists.txt
    // tokenizer and detokenizer work on CPU only
    ov::InferRequest tokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_tokenizer.xml", "CPU").create_infer_request();
    auto [input_ids, attention_mask] = tokenize(tokenizer, argv[2]);
    ov::InferRequest detokenizer = core.compile_model(
        std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
    // The model can be compiled for GPU as well
    ov::InferRequest lm = core.compile_model(
        std::string{argv[1]} + "/openvino_model.xml", "CPU").create_infer_request();
    // Initialize inputs
    lm.set_tensor("input_ids", input_ids);
    lm.set_tensor("attention_mask", attention_mask);
    ov::Tensor position_ids = lm.get_tensor("position_ids");
    position_ids.set_shape(input_ids.get_shape());
    std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + position_ids.get_size(), 0);
    constexpr size_t BATCH_SIZE = 1;
    lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
    lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
    lm.infer();
    size_t vocab_size = lm.get_tensor("logits").get_shape().back();
    float* logits = lm.get_tensor("logits").data<float>() + (input_ids.get_size() - 1) * vocab_size;
    int64_t out_token = std::max_element(logits, logits + vocab_size) - logits;

    lm.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
    position_ids.set_shape({BATCH_SIZE, 1});
    TextStreamer text_streamer{std::move(detokenizer)};
    // There's no way to extract special token values from the detokenizer for now
    constexpr int64_t SPECIAL_EOS_TOKEN = 2;
    while (out_token != SPECIAL_EOS_TOKEN) {
        lm.get_tensor("input_ids").data<int64_t>()[0] = out_token;
        lm.get_tensor("attention_mask").set_shape({BATCH_SIZE, lm.get_tensor("attention_mask").get_shape().at(1) + 1});
        std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), lm.get_tensor("attention_mask").get_size(), 1);
        position_ids.data<int64_t>()[0] = int64_t(lm.get_tensor("attention_mask").get_size() - 2);
        lm.start_async();
        text_streamer.put(out_token);
        lm.wait();
        logits = lm.get_tensor("logits").data<float>();
        out_token = std::max_element(logits, logits + vocab_size) - logits;
    }
    text_streamer.end();
    // Model is stateful which means that context (kv-cache) which belongs to a particular
    // text sequence is accumulated inside the model during the generation loop above.
    // This context should be reset before processing the next text sequence.
    // While it is not required to reset context in this sample as only one sequence is processed,
    // it is called for education purposes:
    lm.reset_state();
} catch (const std::exception& error) {
    std::cerr << error.what() << '\n';
    return EXIT_FAILURE;
} catch (...) {
    std::cerr << "Non-exception object thrown\n";
    return EXIT_FAILURE;
}

Run lean LLMs with weight compression optimization in OpenVINO

Before running your LLM, you can reduce its storage size and memory footprint using OpenVINO weight compression, which can reduce LLMs to 1/4th or 1/8th the original size with similar accuracy. Read this post to learn more.
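
As a quick illustration, weight compression can be enabled directly when exporting through Optimum Intel. This is a minimal sketch under the assumption that the load_in_8bit parameter is available in your Optimum Intel version; the --weight-format CLI option shown earlier achieves the same result at export time:

from optimum.intel import OVModelForCausalLM

# Export the model with 8-bit weight compression applied during conversion
model_id = "HuggingFaceH4/zephyr-7b-beta"
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
model.save_pretrained("zephyr-7b-beta-ov-int8")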

Discover a smarter, faster way to deploy LLMs with OpenVINO today

Whether deployed in Python, C++, or in Hugging Face using the Optimum Intel API, LLMs in OpenVINO deliver the advanced functionality of conversational AI with the added benefits of speed, flexibility, and official Intel support. Continue the journey by trying OpenVINO for yourself.

Author attribution

This post is based on the solution white paper “Optimizing Large Language Models with the OpenVINO™ Toolkit” by Ria Cheruvu, Intel AI evangelist, and Ryan Loney, Intel OpenVINO product manager. Additional credits: Ekaterina Aidova, Alexander Kozlov, Helena Kloosterman, Artur Paniukov, Dariusz Trawinski, Ilya Lavrenov, Nico Galoppo, Jan Iwaszkiewicz, Sergey Lyalin, Adrian Tobiszewski, Jason Burris, Ansley Dunn, Michael Hansen, Raymond Lo, Yury Gorbachev, Adam Tumialis, and Milosz Zeglarski.

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
