Generative AI Fundamentals: Deploying LLMs with OpenVINO™

Published in OpenVINO-toolkit · Jul 23, 2024 · 4 min read

With the rise of generative AI, large language models (LLMs) are revolutionizing everything from work and research to e-commerce and entertainment. However, achieving success with LLMs and generative AI is often difficult because of their size, their cost, and the complexity of training, optimizing, and deploying them.

Notably, over the past year, key advancements have made it easier to accelerate and deploy these models.

In this post, we discuss the fundamentals of generative AI, and outline how to achieve better performance and flexibility on any platform with the OpenVINO toolkit. To get a full code tutorial on running LLMs locally and to learn even more, be sure to catch this on-demand webinar where AI evangelists walk through real-life demos and discuss the future of AI.

Unlocking the Potential of LLMs with OpenVINO™

The great thing about using OpenVINO is that it gives AI developers the ability to write once and deploy anywhere. A trained model is converted to the Intermediate Representation (IR) format and optimized for the task at hand; inference can then run, without code changes, across a wide range of hardware, from ordinary CPUs to GPUs and other accelerators.
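
To make that flow concrete, here is a minimal sketch using the OpenVINO Python API; the file names and device names are illustrative assumptions, not part of the article's demos.

```python
import openvino as ov

# Convert a trained model (ONNX here; PyTorch and TensorFlow also work) to OpenVINO IR.
# "model.onnx" is a placeholder path for illustration.
ov_model = ov.convert_model("model.onnx")
ov.save_model(ov_model, "model.xml")  # writes the IR: model.xml + model.bin

# The same IR can then be compiled for any supported device without code changes.
core = ov.Core()
compiled_cpu = core.compile_model("model.xml", device_name="CPU")
compiled_gpu = core.compile_model("model.xml", device_name="GPU")  # if a GPU is available
```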

The OpenVINO team constantly refines and updates the toolkit to keep up with the latest technology trends and innovations. A case in point is the OpenVINO 2024.2 release, which brings LLM performance improvements such as more efficient processing, lower computational overhead, higher throughput, and lower latency.

OpenVINO also leverages the Neural Network Compression Framework (NNCF) to optimize the IR model and the inference process through weight compression, key-value caching, and stateful transformations. In addition, since many real-world scenarios require connecting several different models, developers can take advantage of the OpenVINO Model Server (OVMS) to easily deploy generative AI models on various architectures.
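
As one example of what NNCF weight compression looks like in practice, the sketch below assumes an LLM that has already been exported to IR; the file path and the 4-bit symmetric mode are illustrative choices, not prescribed settings.

```python
import openvino as ov
import nncf

core = ov.Core()
model = core.read_model("llm.xml")  # an LLM already exported to OpenVINO IR (placeholder path)

# Compress weights (here to symmetric INT4) to shrink the memory footprint and speed up generation.
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_SYM)
ov.save_model(compressed, "llm_int4.xml")
```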

Actual implementation and deployment of a model with these techniques can take as little as five lines of code, thanks to Hugging Face Optimum Intel tools that make handling AI models with OpenVINO a matter of a few simple API calls.
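
A minimal sketch of that workflow is shown below; the model ID is only an example, and the exact calls may differ from the demos referenced in this post.

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model, not prescriptive
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # exports to OpenVINO IR on the fly
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("What is OpenVINO?", max_new_tokens=50)[0]["generated_text"])
```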

To prove this point, our OpenVINO DevCon video series shows how a chatbot and an image generator, set up in just a few seconds on a local PC, can summarize complex text or create images on request with quality comparable to that of much bigger models.

In the GenAI Fundamentals with OpenVINO webinar, you can also see how OpenVINO optimizes a pipeline of three different AI models: one for embedding, one for ranking, and then an LLM.
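
The webinar's exact code isn't reproduced here, but a rough sketch of such a three-model pipeline, with all stages accelerated by OpenVINO via Optimum Intel, might look like this (the model IDs are examples only):

```python
from optimum.intel import (
    OVModelForFeatureExtraction,       # embedding model
    OVModelForSequenceClassification,  # ranking (reranker) model
    OVModelForCausalLM,                # LLM for generation
)

# Export each model to OpenVINO IR on load; model IDs are illustrative assumptions.
embedder = OVModelForFeatureExtraction.from_pretrained("BAAI/bge-small-en-v1.5", export=True)
reranker = OVModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-base", export=True)
llm = OVModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", export=True)

# Typical flow: embed the query and documents, rerank the retrieved candidates,
# then pass the top results to the LLM as context for generation.
```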

Where to Run GenAI LLMs

As we aim to develop better alternatives to huge, general-purpose GenAI models, the first question you may ask yourself is: “Where should, or could, GenAI services run?”

While cloud platforms are among the most popular choices for managing very large GenAI workloads in a centralized way, they are not always the best option. For instance, sending sensitive data to the cloud is unacceptable in many mission-critical applications, and for some users even the lack of subscription portability across different clouds can be a problem.

Edge servers are a better location for smaller models that need less computational power but very low latency, or for any service that is based on distributed processing.

But cloud and edge servers aren’t the only solutions anymore. The idea that only huge LLMs running in the cloud can yield good results has run its course. As elaborated in the DevCon video, tools like OpenVINO make it easy to optimize AI applications for whatever requirements you may have (low power, high throughput, limited bandwidth, or low latency) and to run them wherever they need to be, including on personal computers.

This is great news as the demand for small but powerful GenAI services continues to rise.

Why? Because GenAI applications can fulfill their promise of greatly boosting personal creativity and productivity in only one way: by enabling services that are always available, even offline, and that never expose sensitive data to the Internet. Even more important, those services must be as tailor-made as possible for the few but very specific tasks each user actually needs, whether that is summarizing a memo on a plane or planning the next steps of a vacation in a remote area.

These needs for ubiquitous availability, minimal resource consumption, data protection, and customization, all of which demand optimized local inference, lead to a clear conclusion: there is plenty of room in the future for very specialized GenAI models. Built for specialized tasks (e.g., text summarization in a specific field) from training through optimization, these models can run on edge servers as well as on local AI PCs.

To learn more about how to run LLMs locally, or to explore the new era of AI PCs, model compression, and other AI and GenAI topics, explore the live and on-demand webinars available in our OpenVINO Workshop series.

Additional Resources

OpenVINO Documentation
Jupyter Notebooks
Installation and Setup
Product Page

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
