Build Better GenAI Applications with the Right Techniques

Published in OpenVINO™ toolkit · Mar 1, 2024

Author: Ria Cheruvu — Intel AI SW Architect and Evangelist

Generative AI (GenAI) is the fastest-moving field in AI technology today. If you’re new to GenAI or transitioning over from traditional machine learning, keeping up can seem daunting.

As an AI software architect, my role challenges me to think about the bigger picture around today’s research trends, including ones that are hyped up.

With many industries considering adopting large language models (LLMs) and GenAI tools, it's important to look beyond the models themselves to the techniques powering these GenAI experiences, along with their capabilities, limitations, and potential!

A key element behind these applications is the right data, along with the techniques that bring specialization, contextualization, and multimodality into training pipelines. Let's look at each.

Training Your Own Specialized GenAI

Users starting out on language-based AI projects typically choose from several pre-trained advanced LLMs, including GPT-4, Llama 2, Mistral 7B, and of course ChatGPT. Each has its pros and cons, but one feature they have in common (at least in their base form) is that they are all intentionally trained on broad, general data sets that provide linguistic capabilities but can lack focus and specificity. These kinds of models are known as foundation models: large AI models capable of performing multiple tasks, making them beneficial for a wide range of downstream applications.
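
For instance, here is a minimal sketch of loading one of these foundation models for text generation with the Hugging Face transformers library. The model ID and prompt are just examples (and a 7B-parameter model needs a sizable download and plenty of memory):

```python
# A minimal sketch: loading a general-purpose foundation model for
# text generation. The model ID is an example; any causal LLM on the
# Hugging Face Hub works the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1")
result = generator("Generative AI is", max_new_tokens=40)
print(result[0]["generated_text"])
```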

But you can also train your own model. And doing so comes with major advantages:

  1. Data Privacy: Avoid exposing sensitive or proprietary information to a third party.
  2. Better Performance: Optimize for specific tasks to produce better results at lower cost.
  3. Content Control: Train a model to align with specific values or standards.
  4. Limiting Bias: Curate the training data set to achieve better fairness and neutrality.

The downside of training your own model from scratch is that it requires considerable effort and expertise that not everyone has. That's why fine-tuning is emerging as the future of GenAI model optimization.

Fine-Tuning GenAI with Human and Machine Feedback

Fine-tuning uses a pre-trained model as a starting point and adapts it to new, task-specific data sets. This set of techniques dramatically accelerates development time and reduces costs. It's hard to overstate the savings: fine-tuning an LLM can cost orders of magnitude less than training the same model from scratch, simply because you shortcut the bulk of the training process.

Just as there are many LLMs, there are also many approaches to fine-tuning. All of them involve exposing a pre-trained model to new data sets (a minimal supervised fine-tuning sketch follows the list):

  1. Repurposing: Adapts the model for a new but related task.
  2. Full model fine-tuning: Adjusts all the parameters to take on a new and significantly different task.
  3. Instruction fine-tuning: Trains a model to follow specific guidelines, constraining its behavior.
  4. Supervised fine-tuning: Uses labeled data sets to optimize for tasks where desired outcomes are clearly defined.
  5. Reinforcement learning from human feedback (RLHF): Uses human evaluation to provide nuanced feedback for complex tasks.
  6. Parameter-efficient fine-tuning (e.g., Low-Rank Adaptation (LoRA)): Adjusts only part of the model, helping overcome the challenges of tuning large models.
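
As a concrete illustration of the supervised flavor, here is a hedged fine-tuning sketch using the Hugging Face Trainer API. The base model, data file, and hyperparameters are placeholders, not a recipe from this article:

```python
# A hedged sketch of supervised fine-tuning with Hugging Face
# transformers. "train.txt" and the hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "gpt2"  # a small base model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# One text file with task-specific examples, one example per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```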

RLHF has received the most attention due to its ability to deliver human-like reasoning and decision-making. In one example, a team at OpenAI found that RLHF resulted in a model that customers preferred over GPT-3 despite having at least 100x fewer parameters. The downside to RLHF is that it still requires significant human and computational resources. This brings me to LoRA, a newer and more specialized technique designed for efficiency.

LoRA focuses on the transformer attention and feed-forward blocks of a model. Unlike fine-tuning methods that adjust the model weights directly, LoRA freezes those values and injects small trainable low-rank layers instead. These added layers require far less compute to train, yet the results are comparable to full model fine-tuning. Our team recently demonstrated the potential of LoRA with a pipeline that combines Stable Diffusion + ControlNet with OpenVINO™ optimization to generate images in different styles.
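
To make that concrete, here is a hedged sketch of LoRA using the Hugging Face PEFT library. The rank, scaling factor, and target modules are illustrative values, and module names vary by architecture:

```python
# A minimal LoRA sketch with the PEFT library: the base weights are
# frozen and small low-rank adapter matrices are injected into the
# attention projections. All hyperparameters here are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the total
```

The trainable-parameter count printed at the end is what makes LoRA attractive: only the injected adapters learn, while the frozen base model can be shared across tasks.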

Stable Diffusion v1.5 + runtime LoRA safetensors weights + ControlNet

Optimization and Decision-Making in Fine-Tuning

Optimization is a critical consideration because it determines not only the cost but also the flexibility of GenAI. By optimizing the precision of model parameters (e.g., INT8, FP16, FP32), developers can considerably improve the speed, memory footprint, and scalability of a model.
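
As a simple illustration, OpenVINO can compress a converted model's weights to FP16 at save time. The source model path below is a placeholder:

```python
# A minimal sketch of precision optimization with OpenVINO: saving a
# converted model with FP16 weight compression roughly halves its
# memory footprint. "model.onnx" is a placeholder path.
import openvino as ov

ov_model = ov.convert_model("model.onnx")
ov.save_model(ov_model, "model_fp16.xml", compress_to_fp16=True)
```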

Although LoRA greatly limits the training required, it also raises questions about which parameters to freeze. New APIs and abstractions like the popular libraries from Hugging Face give developers an "off-the-shelf" path to optimization. Intel has collaborated with Hugging Face to advance the democratization of AI by optimizing Hugging Face models with OpenVINO. Through OpenVINO, developers can leverage pre-optimized libraries, allowing them to run models locally (such as on a laptop equipped with Intel® Arc™ Graphics) and in the cloud on Intel® Xeon® processors.
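
In practice, this integration lives in the optimum-intel package. Here is a hedged example (the model ID is illustrative) of swapping the standard transformers class for its OpenVINO counterpart:

```python
# A hedged example of the Hugging Face + OpenVINO integration via
# optimum-intel: export=True converts the model to OpenVINO IR on load.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "mistralai/Mistral-7B-v0.1"  # example model ID
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("OpenVINO lets you", max_new_tokens=30)[0]["generated_text"])
```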

Multimodal Approaches

The other big change coming to GenAI is the move to multiple data sources. This can be seen in the multimodal capabilities familiar to any user of LLMs like ChatGPT. Here, text-based capabilities are complemented by the ability to ingest other data types, such as images or sound.

Now the focus has shifted to data representation, with the goal of integrating different modalities into a single data set. This will enable models to process diverse data types simultaneously, leading to more sophisticated and capable AI systems that can serve as assistants.

One challenge for multimodal models is that introducing new data structures can affect performance and accuracy. But with OpenVINO, developers can easily accelerate inference and benchmarking for visual and other complex data.

For example, we recently explored the creation of a virtual assistant using LLaVA and OpenVINO, a multimodal system that accepts both text and image inputs. After compressing our model weights (to 4 and 8 bits) using OpenVINO's Neural Network Compression Framework (NNCF), we ran inference on our interactive virtual assistant to ask it questions about an image.
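
The weight compression step looks roughly like the following NNCF sketch. The model path, INT4/INT8 split, and group size are illustrative, not the exact settings from the demo:

```python
# A hedged sketch of NNCF weight compression: most weights go to 4 bits,
# with a fraction kept at 8 bits to preserve accuracy. The path and
# parameters are illustrative placeholders.
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("llava.xml")  # placeholder path to the converted model

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=128,  # weights are quantized in groups of 128 channels
    ratio=0.8,       # ~80% of the layers in INT4, the rest in INT8
)
ov.save_model(compressed, "llava_int4.xml")
```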

In-Context Learning

The other way models can tap into multiple data sources is through in-context learning. This technique pairs an LLM with a database or other data repository. This approach does not modify the model itself; instead, it adds data from the repository to user queries, giving the LLM better context for its responses. It is complementary to fine-tuning and can be paired with techniques like LoRA.
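
Here is a minimal retrieval-augmented generation (RAG) sketch of that idea. The documents and embedding model are examples, and generate() stands in for whichever LLM you pair it with:

```python
# A minimal RAG sketch: embed the documents once, retrieve the passages
# closest to the query, and prepend them to the prompt. The documents
# and embedding model are examples; generate() is a hypothetical LLM call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "OpenVINO accelerates deep learning inference on Intel hardware.",
    "LoRA freezes base weights and injects small trainable adapters.",
    "In-context learning adds retrieved data to the user's query.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # With normalized vectors, cosine similarity is just a dot product.
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vectors @ query_vector)[::-1][:k]
    return [documents[i] for i in top]

query = "How does LoRA cut training costs?"
context = "\n".join(retrieve(query))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
# response = generate(prompt)  # hypothetical call to your chosen LLM
```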

Below is an example of an LLM using retrieval-augmented generation to pull text from uploaded documents and efficiently answer questions in an interactive interface, from our llm-chatbot notebook in the OpenVINO notebooks repository.

Using Large Language Models such as Mistral-7B and Zephyr-7B, with Retrieval Augmented Generation

Accelerating the Future of GenAI

The GenAI revolution is driving a rapid evolution in model training and tuning techniques, as well as the fusion of different AI disciplines. I’m excited to see how the industry will use these advances to enable new levels of intelligence!

To get started on your own journey, I recommend checking out the OpenVINO notebooks, many of which include new generative AI applications. The possibilities are endless, and I hope this blog inspires you to bring your own ideas to life.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
