Supercharge AI PC Inferencing with Google Gemma and OpenVINO™

Published in OpenVINO-toolkit · Mar 19, 2024

Author: Zhuo Wu, Intel AI Software Evangelist

Large language models (LLMs) are rapidly growing more powerful and efficient, resulting in increasingly sophisticated understanding and generation of human-like text across a wide range of applications. For example, LLMs are crucial for chatbot applications, where they provide natural language understanding, contextual awareness, and conversational depth.

Google’s Gemma, a new family of lightweight, state-of-the-art open-source models, is at the forefront of this LLM innovation with its ability to perform text generation tasks, such as question answering, summarization, and reasoning.

Named after the Latin word “gemma”, which means “precious stone,” these models are text-to-text, decoder-only LLMs, currently available in English. They are built with the same research and technology used in Google’s generative AI Gemini models, and come with open weights, pre-trained variants, and instruction-tuned variants.

The Gemma model family, including the Gemma-2B and Gemma-7B sized models, represents a tiered approach to deep learning model scalability and performance.

But the pursuit of even faster and more intelligent inference goes beyond developing advanced models. It extends into the realm of optimization and deployment technologies, where the OpenVINO™ toolkit emerges as a powerful tool. This blog post explores how to optimize Google’s Gemma models for a chatbot solution and accelerate inference with OpenVINO™ on AI PCs, computer systems designed specifically for AI experiences. This combination transforms these machines into high-performance engines capable of faster and smarter inference.

For this post, we will focus on optimizing and accelerating inference of Gemma-7B-it, the instruction-tuned 7-billion-parameter variant, on an AI PC.

Streamlining Performance: Optimization and Inference Acceleration with OpenVINO™

The process of optimizing, accelerating, and deploying the Gemma-7B-it model involves the following steps, using the LLM-powered chatbot notebook from our OpenVINO Notebooks repository.

1. Start with the prerequisites installation:

Instructions for setting up the environment to run OpenVINO™ notebooks can be found in our installation guide. Running the specific notebook for LLMs may also require some prerequisite packages to be installed first, as sketched below.
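The exact package list lives in the notebook itself; as a rough, illustrative sketch (the package names and version pins below are assumptions, not the notebook's authoritative requirements), the installation cell looks something like this:

```python
# Illustrative prerequisites cell; the notebook contains the authoritative list.
%pip install -q "openvino>=2024.0" "nncf>=2.8" "optimum[openvino]" transformers gradio
```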

2. Select a model for inference:

The Jupyter notebook demo supports a range of LLMs that work with OpenVINO, so you can simply select “Gemma-7B-it” from the dropdown box to run the remaining optimization and inference acceleration steps for this model. The list includes additional models, such as Gemma-2B-it, that you can easily switch to if needed.
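For illustration only, a minimal stand-in for that dropdown could be built with ipywidgets; the option list below is abbreviated and hypothetical, whereas the real notebook populates it from its supported-model configuration:

```python
import ipywidgets as widgets

# Hypothetical model selector mirroring the notebook's dropdown.
model_selector = widgets.Dropdown(
    options=["gemma-2b-it", "gemma-7b-it"],  # abbreviated list for illustration
    value="gemma-7b-it",
    description="Model:",
)
model_selector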

3. Instantiate the model using Optimum Intel:

Optimum Intel is the interface between the Hugging Face Transformers and Diffusers libraries and OpenVINO to accelerate end-to-end pipelines on Intel architectures. We use Optimum Intel to load optimized models from the Hugging Face Hub and create pipelines to run inference with OpenVINO Runtime using Hugging Face APIs. In this case, it means we need to replace only the AutoModelForCausalLM class with the corresponding OVModelForCausalLM class.
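As a minimal sketch of that swap (the prompt text is arbitrary, and downloading google/gemma-7b-it from the Hugging Face Hub requires accepting Gemma’s license terms):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The only change versus a plain Transformers pipeline: OVModelForCausalLM
# replaces AutoModelForCausalLM; export=True converts the checkpoint to
# OpenVINO IR on the fly.
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = ov_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```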

4. Perform weights compression:

Although LLMs like Gemma-7B-it are becoming more powerful and sophisticated in understanding and generating human-like text, managing and deploying these models poses critical challenges in terms of compute resources, memory footprint, and inference speed, especially on client devices like AI PCs. The weight compression algorithm compresses the model’s weights and can be used to optimize the model’s footprint and performance.

Both INT8 and INT4 weight compression are provided in our Jupyter notebook, using Optimum Intel and the Neural Network Compression Framework (NNCF). Compared to INT8 compression, INT4 compression improves performance further at the cost of a minor drop in prediction quality. We’ll select INT4 compression here, as sketched below.
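A minimal sketch of INT4 weight compression through Optimum Intel (NNCF runs underneath) might look like the following; the group_size and ratio values are illustrative, not necessarily the settings used in the notebook:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Illustrative INT4 weight-compression settings.
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    group_size=128,
    ratio=0.8,  # share of weights compressed to INT4; the remainder stays INT8
)

ov_model_int4 = OVModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    export=True,
    quantization_config=quantization_config,
)
ov_model_int4.save_pretrained("gemma-7b-it-int4")  # hypothetical output directory
```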

We can also compare the model size before and after weight compression.
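For example, since the compressed weights end up in the IR’s .bin file, a rough comparison can be made by summing file sizes on disk; the directory names below are the hypothetical ones from the sketches above:

```python
from pathlib import Path

def dir_size_mb(model_dir: str) -> float:
    # Sum the OpenVINO IR weight files (.bin) in a saved model directory.
    return sum(f.stat().st_size for f in Path(model_dir).glob("*.bin")) / 1024**2

print(f"FP16 weights: {dir_size_mb('gemma-7b-it-fp16'):.1f} MB")
print(f"INT4 weights: {dir_size_mb('gemma-7b-it-int4'):.1f} MB")
```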

5. Select the device for inference and model variant:

Since OpenVINO enables deployment across a range of hardware devices, a dropdown box is also provided for you to choose the device to run inference on. In this case, we’ll choose GPU, which is the integrated GPU on the AI PC.
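A sketch of loading the compressed model onto the integrated GPU, assuming the hypothetical gemma-7b-it-int4 directory saved earlier:

```python
from optimum.intel import OVModelForCausalLM

# "GPU" targets the AI PC's integrated GPU; "CPU" or "AUTO" are also valid choices.
ov_model = OVModelForCausalLM.from_pretrained("gemma-7b-it-int4", device="GPU")
```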

6. Run the chatbot:

With everything ready, we can now run the Gemma-7B-it-based chatbot. A user-friendly Gradio interface is provided for interacting with it. Now, let’s chat!
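As a stripped-down stand-in for the notebook’s UI (the real notebook adds streaming output, generation controls, and multi-turn history), a single-turn Gradio chat sketch reusing the tokenizer and ov_model from the earlier snippets could look like this:

```python
import gradio as gr

def chat_fn(message, history):
    # For simplicity this sketch sends only the latest user message to the model.
    conversation = [{"role": "user", "content": message}]
    input_ids = tokenizer.apply_chat_template(
        conversation, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = ov_model.generate(input_ids, max_new_tokens=256)
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(chat_fn, title="Gemma-7B-it on OpenVINO").launch()
```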

By leveraging OpenVINO’s optimization techniques, you’ve unlocked the full potential of your Gemma models. We can’t wait to see what other AI PC applications you come up with using Gemma’s newfound efficiency. Be sure to check out our other OpenVINO notebook tutorials and start bringing your innovative solutions to life.

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
