How to Build Faster GenAI Apps with Fewer Lines of Code using OpenVINO™ GenAI API

Raymond Lo, PhD · Published in OpenVINO-toolkit · Jul 9, 2024

Authors: Raymond Lo, Dmitriy Pastushenkov, Zhuo Wu

The new OpenVINO GenAI API gives developers simpler and clearer code to maintain. With it, OpenVINO has also evolved from a computer vision and AI acceleration and optimization library into a GenAI enabler for developers.

Generative Pre-trained Transformers (GPTs) are becoming a new household name among developers as the rise of chatbots such as ChatGPT is taking the world by storm. The development of Generative AI (GenAI), especially the advancement of large language models and chatbots, is fast and ever-changing, and it is difficult to predict what breakthrough will come next and what developers should focus on. We know that GenAI is here to stay, and developers would love to see clean and easy ways to develop, maintain, and deploy AI applications locally.

Despite the excitement around GenAI, running inference for these models presents significant challenges, particularly on edge devices and AI PCs.

A live demo of the GenAI API in action, running the Llama3-8B-Instruct model on an AI PC's CPU or GPU.

The Current State of the Art for GenAI on Intel Hardware

Today, to get the best GenAI performance on Intel hardware, developers can run GenAI models using the Hugging Face pipeline optimized with Optimum Intel and the OpenVINO back-end. OpenVINO enables optimization on the CPU, GPU, or NPU, which can significantly reduce latency and increase efficiency. Furthermore, we can take advantage of model optimization techniques such as quantization and weight compression to minimize the memory footprint (2x-3x less memory usage). Memory is often the major bottleneck in model deployment, as client and edge devices typically come with 32 GB of RAM or less.
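For context, here is a minimal sketch of that Optimum-Intel route (the model name and prompt are illustrative, and optimum-intel with its OpenVINO extra is assumed to be installed): the standard Hugging Face text-generation pipeline runs on the OpenVINO back-end, with INT4 weight compression applied at export time.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Convert the Hugging Face model to OpenVINO IR on the fly and compress weights to INT4.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A regular Hugging Face pipeline, now backed by the OpenVINO runtime.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("The Sun is yellow because", max_new_tokens=50)[0]["generated_text"])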

Figure 1: With the new OpenVINO GenAI API, we can do even better on the coding side! As you can see below, the inference code is reduced to three lines. This new workflow gives developers a much lower learning curve to start on the GenAI app development journey.

Checking the installation of the OpenVINO GenAI library, we can see that not only is the amount of code reduced, but also only a few dependencies are installed, resulting in a neat and compact environment for running GenAI inference that takes up only 216 MB!

Figure 2: Deploying solutions using the OpenVINO GenAI API not only reduces disk usage but also simplifies the dependency requirements for building generative AI apps. This is often one of the biggest challenges when developers start maintaining GenAI applications.
Table 1: A comparison of the OpenVINO GenAI API vs. Optimum-Intel packages

Compared with Optimum-Intel, the GenAI API integrates only the most commonly used sampling methods, including greedy and beam search. Developers can also customize sampling parameters, such as top-k or temperature, through multinomial decoding, as shown in the sketch below.
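As a hedged illustration (the parameter values are arbitrary and the model directory is assumed to be the exported OpenVINO IR folder used later in this article), multinomial sampling is enabled and tuned through the generation configuration:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100
config.do_sample = True   # switch from greedy decoding to multinomial sampling
config.top_k = 50         # sample only from the 50 most likely tokens
config.top_p = 0.9        # nucleus sampling threshold
config.temperature = 0.7  # soften or sharpen the token distribution

print(pipe.generate("The Sun is yellow because", config))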

For scenarios with multiple users, the GenAI API natively implements continuous batching and paged attention. During text generation, these techniques help improve performance and optimize memory consumption when inferencing with multiple batches.

Since Hugging Face's tokenizer works only in Python, the GenAI API tokenizes the input text and detokenizes the output vector by running two separate OpenVINO models, which keeps everything aligned with the input/output tensor formats of the OpenVINO C++ runtime. Beforehand, developers can use the Optimum-Intel CLI to convert the Hugging Face tokenizer into OpenVINO IR models.
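As a hedged sketch of what that conversion produces (assuming the openvino-tokenizers package, which Optimum-Intel uses under the hood, is installed), the Hugging Face tokenizer is turned into a tokenizer model and a detokenizer model in OpenVINO IR:

from transformers import AutoTokenizer
from openvino_tokenizers import convert_tokenizer
import openvino as ov

hf_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# One OpenVINO model for tokenization and one for detokenization.
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)

# Save them next to the exported LLM so the GenAI pipeline can load all three models.
ov.save_model(ov_tokenizer, "TinyLlama-1.1B-Chat-v1.0/openvino_tokenizer.xml")
ov.save_model(ov_detokenizer, "TinyLlama-1.1B-Chat-v1.0/openvino_detokenizer.xml")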

So far we have highlighted some of the key benefits of using the new OpenVINO GenAI API. In the next section, let's dig deeper into how to run a demo step by step.

Lightweight Gen AI with the OpenVINO GenAI API

Installation

Setting up the new OpenVINO™ GenAI API for running inference on generative AI models and LLMs is designed to be simple and straightforward. The installation can be performed either through PyPI or by downloading an archive, giving you the flexibility to choose the method that best suits your needs. For example, you can use the following command for a PyPI installation, which is part of our latest OpenVINO 2024.2 release:

python -m pip install openvino-genai

More information on the installation can be found here.

Running inference

Once you have installed OpenVINO GenAI, you can start running inference on your GenAI and LLM models. By leveraging this API, you can load a model, pass a context to it, and receive a response with just a few lines of code. Internally, OpenVINO handles the tokenization of the input text, executes the generation loop on your selected device, and delivers the final response. Let's explore this step-by-step process in both Python and C++, based on the chat_sample provided in the openvino.genai repository.

Step 1: An LLM model must be downloaded and exported into OpenVINO IR format using Hugging Face Optimum-Intel (in this example, we use a chat-tuned TinyLlama). For this step, it is recommended to create a separate virtual environment to avoid any dependency conflicts. For example,

python -m venv openvino_venv

activate it (on Windows),

openvino_venv\Scripts\activate

or on Linux/macOS,

source openvino_venv/bin/activate

and install the dependencies necessary for the model export process. These requirements are available here in the openvino.genai repository:

python -m pip install --upgrade-strategy eager -r requirements.txt

To download and export the model, please use the following command.

optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0

For improved performance during LLM inference, we recommend using lower precision for model weights, such as INT4. You can compress weights using the Neural Network Compression Framework (NNCF) during the model export process, as demonstrated below.

optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0 --weight-format int4

The virtual environment and the dependencies installed in this step are no longer needed once the model has been exported. Feel free to remove this virtual environment from your disk.

Step 2: Run text-generation inference for the LLM via the Python or C++ API.

Set up the pipeline via the new Python API:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
print(pipe.generate("The Sun is yellow because"))

Set up the pipeline via the new C++ API:

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");  // target device is CPU
    std::cout << pipe.generate("The Sun is yellow because");  // input context
}

As you can see, building an LLM generation pipeline now requires just a few lines of code. This simplicity is due to the model exported from Hugging Face Optimum-Intel already containing all the necessary information for execution, including the tokenizer/detokenizer and generation config, ensuring results consistent with Hugging Face generation. We offer both C++ and Python APIs to run LLMs, with minimal dependencies and additions to your application.
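For reference, an exported model folder typically looks roughly like the following (exact file names can vary with the model and the Optimum-Intel version):

TinyLlama-1.1B-Chat-v1.0/
├── openvino_model.xml / .bin          # the LLM in OpenVINO IR
├── openvino_tokenizer.xml / .bin      # tokenizer model
├── openvino_detokenizer.xml / .bin    # detokenizer model
├── config.json
├── generation_config.json
├── tokenizer.json
└── tokenizer_config.json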

The provided code works on the CPU, but it is easy to make it run on the GPU by replacing the device name with “GPU”:

pipe = ov_genai.LLMPipeline(model_path, "GPU")

To create more interactive UIs for generation, we have added support for streaming model output tokens, so that output words can be displayed as soon as the model generates them. Token generation can also be stopped at any time by returning True from the streamer.

What’s more, stateful models run internally during text-generation inference, resulting in faster generation speed and reduced overhead from data representation conversion. Maintaining the KV cache across inputs is therefore beneficial. The chat-specific methods start_chat and finish_chat are used to mark a conversation session, as you can see in the following example.

In Python:

import openvino_genai


def streamer(subword):
    print(subword, end='', flush=True)
    # Return flag corresponds to whether generation should be stopped.
    # False means continue generation.
    return False


model_path = 'TinyLlama-1.1B-Chat-v1.0'
device = 'CPU'  # GPU can be used as well
pipe = openvino_genai.LLMPipeline(model_path, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()
while True:
    prompt = input('question:\n')
    if 'Stop!' == prompt:
        break
    pipe.generate(prompt, config, streamer)
    print('\n----------')
pipe.finish_chat()

In C++:

#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) try {
if (2 != argc) {
throw std::runtime_error(std::string{"Usage: "} + argv[0]
+ " <MODEL_DIR>");
}
std::string prompt;
std::string model_path = argv[1];

std::string device = "CPU"; // GPU can be used as well
ov::genai::LLMPipeline pipe(model_path, "CPU");

ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
std::function<bool(std::string)> streamer = [](std::string word) {
std::cout << word << std::flush;
// Return flag corresponds to whether generation should be stopped.
// false means continue generation.
return false;
};

pipe.start_chat();
for (;;) {
std::cout << "question:\n";

std::getline(std::cin, prompt);
if (prompt == "Stop!")
break;

pipe.generate(prompt, config, streamer);

std::cout << "\n----------\n";
}
pipe.finish_chat();
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
} catch (...) {
std::cerr << "Non-exception object thrown\n";
return EXIT_FAILURE;
}

Finally, here’s what we got when running the above example on an AI PC:

Figure 3: Live demo of a Llama-based chatbot running locally on an AI PC.

In summary, the GenAI API includes the following components, enabling lightweight deployment and coding:

  • generation_config — configuration for customizing the generation process, such as the maximum length of the generated text, whether to ignore end-of-sentence tokens, and the specifics of the decoding strategy (greedy, beam search, or multinomial sampling).
  • llm_pipeline — provides classes and utilities for text generation, including a pipeline for processing inputs, generating text, and managing outputs with configurable options.
  • streamer_base — an abstract base class for creating custom streamers (see the sketch after this list).
  • tokenizer — the tokenizer class for text encoding and decoding.
  • visibility — controls the visibility of the GenAI library.
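
For completeness, here is a hedged sketch of how a custom streamer might be built on top of streamer_base (the class name PrintStreamer and its per-token decoding are illustrative, and exact method signatures may differ between releases):

import openvino_genai as ov_genai


class PrintStreamer(ov_genai.StreamerBase):
    """Prints each token as soon as the model produces it."""

    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer

    def put(self, token_id):
        # Decode the single token id and print it immediately.
        print(self.tokenizer.decode([token_id]), end='', flush=True)
        return False  # return True to stop generation early

    def end(self):
        print()  # finish the line once generation is complete


pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.generate("The Sun is yellow because", config, PrintStreamer(pipe.get_tokenizer()))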

Conclusion

The new OpenVINO™ GenAI API from the latest OpenVINO 2024.2 release offers significant benefits and features, making it a powerful tool for developers creating GenAI and LLM applications. With its simple setup process and minimal dependencies, the API reduces code complexity, enabling you to quickly build efficient GenAI inference pipelines with only a few lines of code. Additionally, the support for streaming model output tokens facilitates the creation of interactive UIs, enhancing the user experience.

We welcome you to try out the new GenAI API and explore its capabilities in your projects! Together, we can push the boundaries of what generative AI can achieve via open-source libraries!

References:

https://huggingface.co/docs/transformers/pipeline_tutorial

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

Special thanks to all of our contributors & editors:

Ria Cheruvu, Paula Ramos, Ryan Loney, Stephanie Maluso

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
