Getting started with Generative AI: 4 ways to customize your models

Georgian
Georgian Impact Blog

By: Azin Asgarian

Generative AI models such as GPT-4, Bard and LLaMA have become increasingly popular due to their remarkable capabilities in various tasks, including text generation, summarization and categorization. While big tech companies have the resources to train and maintain these powerful models, smaller businesses and organizations may need to explore cost-effective ways to adapt and customize existing models to address their specific requirements.

Selecting the right model and customizing it can be a complex endeavor, particularly for companies new to this field. When weighing the appropriate model and customization approach, you need to consider factors such as the intended application, available computing power, task difficulty, dataset size and quality, scalability and the expertise required for successful implementation.

In this blog post, we present our insights from working with growth-stage software companies and experimenting with generative AI models and technologies. Specifically, we’ll discuss and compare various types of generative AI models and outline four techniques for tailoring these models to an organization’s unique needs. From fine-tuning to prompt engineering, prompt optimization and reinforcement learning from human feedback (RLHF), this guide aims to lay out some considerations for determining the most suitable approach for adapting GenAI models to specific use cases.

What is Generative AI?

Generative AI has taken the world by storm, with models like GPT-4, ChatGPT and Stable Diffusion transforming human-machine interactions. These models empower users to generate inventive visual and textual content, engage in life-like conversations and more. [1]

What sets generative AI models apart from previous deep learning models is their complexity, size and extensive training on massive datasets. As a result, they exhibit advanced capabilities that were unattainable with the previous generation of AI models, such as generating new content from prompts, engaging in logical reasoning, solving mathematical problems and passing human-like tests.

In contrast to traditional deep learning models that rely on supervised or semi-supervised methods for specific tasks, generative AI models employ unsupervised techniques, making them suitable for a wide array of downstream applications. This versatility has cemented their status as foundational models in the field.

What are the different types of Generative AI models?

Currently, there are four main types of generative AI models:

  1. Generative Adversarial Networks (GANs): Introduced in 2014, GANs consist of two neural networks: a generator that creates new examples and a discriminator that distinguishes between real and generated content. As the generator improves content quality, the discriminator becomes better at identifying generated content, resulting in a continuous cycle of improvement. GANs are known for producing high-quality samples quickly but often lack sample diversity, making them better suited for domain-specific data generation.
  2. Variational Autoencoders (VAEs): VAEs consist of two neural networks, the encoder and decoder. The encoder compresses input data into a dense representation, while the decoder reconstructs the original input data. This process allows for the generation of novel data by sampling new latent representations that are mapped through the decoder. Although VAEs generate outputs faster than diffusion models, they often lack detail in comparison.
  3. Diffusion Models: Diffusion models, also known as denoising diffusion probabilistic models (DDPMs), have become the new state-of-the-art in image generation, outperforming previous approaches like GANs. These models involve a two-step process: forward diffusion, which adds random noise to training data, and reverse diffusion, which reconstructs the data samples by reversing the noise. By starting the reverse de-noising process from random noise, novel data can be generated. Although diffusion models are considered foundational models due to their high-quality outputs and flexibility, their reverse sampling process can be slow and time-consuming.
Photo Credit: Nvidia

Diffusion models are closely related to GANs and have largely replaced them in many applications. Moreover, they share similarities with de-noising autoencoders, to the extent that some researchers have even contended that diffusion models are a specific type of autoencoder. The performance of diffusion models in various applications, such as image synthesis, video generation and molecule design, has captured the attention of the machine learning community, turning it into a burgeoning research field.

For a comprehensive literature review on diffusion models, please refer to this survey paper.

Credit: Diffusion Models: A Comprehensive Survey of Methods and Applications
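
To make the two-step process more concrete, here is a minimal PyTorch sketch of the forward (noising) step, assuming a standard linear beta schedule as in the original DDPM formulation; the reverse step would train a network to predict the added noise so that new samples can be generated by denoising from pure noise. The variable names are illustrative, not taken from any particular library.

```python
import torch

# A minimal sketch of the DDPM forward (noising) process, assuming a
# standard linear beta schedule. Variable names are illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products, one per timestep

def forward_diffusion(x0: torch.Tensor, t: int):
    """Sample a noised version x_t of clean data x_0 in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise   # the noise is the training target for the denoising network

# Reverse diffusion trains a network to predict `noise` from (x_t, t) and then
# generates new samples by iteratively denoising, starting from pure Gaussian noise.
```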

4. Transformer-based models: Transformers, which were introduced in the groundbreaking 2017 paper “Attention is All You Need,” are a class of deep learning models that have transformed natural language processing.

These models, also called Large Language Models (LLMs), are distinct in that they process sequential input data in a non-sequential manner and rely on self-attention and positional encoding mechanisms. Self-attention determines the importance of different parts of the input, while positional encoding captures the order of input words.

Transformer networks consist of multiple layers, such as self-attention, feed-forward and normalization layers, and are excellent at deciphering and predicting tokenized data, making them highly effective for text-based generative AI applications. For more info about the attention mechanism and transformer architecture, you can read this article or watch this video.
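
As a concrete illustration, here is a minimal, single-head sketch of the scaled dot-product self-attention described above, written in PyTorch. It omits multi-head projections, masking and normalization layers; the shapes and names are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings, assumed to already include
       positional encodings.
    w_q, w_k, w_v: learned projection matrices of shape (d_model, d_k).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])   # how strongly each token attends to every other
    weights = F.softmax(scores, dim=-1)
    return weights @ v                          # weighted sum of value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings.
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)          # shape (4, 8)
```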

In the last few years, dozens of models belonging to the transformer family have emerged, each with a quirky name that may not be self-explanatory.

For a comprehensive but straightforward overview of these models, check out this survey paper.

Photo Credit: Transformer Models: An introduction and catalog

To evaluate and compare LLMs, benchmarks such as HELM from Stanford University are available (36 models, 57 metrics, and 42 scenarios).

Photo credit: HELM

Survey papers, such as this paper from the University of China, also provide useful overviews and comparisons of LLMs.

Four ways to customize pre-trained models

As alluded to earlier, established players have access to the data and intellectual property needed for training large-scale foundation models.

In contrast, numerous small to medium-sized startups typically do not possess the resources (such as expertise, computational power, and funding) to build GPT-like models from scratch.

To address this, we present four alternative methods for customizing pre-trained models, which may help startups to achieve differentiation and deliver a personalized experience to their customers without the burden of training massive models from the ground up [1, 2, 3, 4]:

  1. Fine-tuning

Fine-tuning is a popular approach that involves updating some of a pre-trained model’s parameters with additional labeled data to adapt a general-purpose model for a specific task. This technique enables the fine-tuned model to retain knowledge from its pre-training process while becoming more proficient in a specialized domain. However, fine-tuning carries the risk of overfitting, where the model becomes excessively specialized in one area, losing its ability to perform well in other tasks.

A recent variation of fine-tuning, particularly for LLMs, is called instruction tuning. Instruction tuning refines a pre-trained language model on a mixture of tasks formulated as instructions. As shown in this paper by Google, this method can enhance the model’s performance on specific tasks that require intricate instructions.
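
As a rough illustration of what such data can look like (the field names below are our own, not a specific dataset schema), instruction tuning simply reformulates ordinary tasks as natural-language instructions paired with the desired responses:

```python
# Illustrative only: a mixture of tasks reformulated as instruction/response
# pairs. The field names are assumptions, not a specific dataset schema.
instruction_data = [
    {
        "instruction": "Classify the sentiment of the following review as positive or negative.",
        "input": "The plot was thin, but the acting kept me hooked.",
        "output": "positive",
    },
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Generative AI models are trained on massive datasets and can be adapted to many downstream tasks...",
        "output": "Large generative models learn broad capabilities that can be adapted to many tasks.",
    },
]
```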

There are several common fine-tuning strategies, including:

a. Updating the embedding layer(s): This strategy focuses on refining the layers responsible for converting input data into meaningful representations. For T5-large, this strategy involves updating approximately 32 million parameters.

b. Updating the language modeling head: This approach targets the components of the model responsible for generating output predictions. In the case of T5-large, this strategy would involve updating approximately 32 million parameters.

c. Updating all parameters of the model: This comprehensive strategy involves updating the entire model, including both the aforementioned components and other layers. For T5-large, this strategy means refining approximately 770 million parameters.

Photo Credit: Google
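
The snippet below is a rough sketch of strategies (a) to (c), using the Hugging Face Transformers library to freeze all of T5-large's parameters and then selectively unfreeze the embedding layer or the language-modeling head. The attribute names match T5 in recent versions of the library, but treat this as illustrative rather than a complete training script.

```python
from transformers import T5ForConditionalGeneration

# A sketch of strategies (a) to (c) using Hugging Face Transformers.
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def freeze_all(m):
    for p in m.parameters():
        p.requires_grad = False

# (a) Update only the embedding layer(s).
freeze_all(model)
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True

# (b) Update only the language-modeling head (note: T5 may tie this weight
# to the input embeddings, depending on the model config).
# freeze_all(model)
# for p in model.lm_head.parameters():
#     p.requires_grad = True

# (c) Update all parameters: leave every requires_grad set to True (the default).

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```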

2. Prompt engineering

Prompt engineering offers an alternative to fine-tuning large language models for specific tasks. This approach emerged with GPT-3, which convincingly demonstrated that a frozen model could be conditioned to perform various tasks through “in-context” learning.

With prompt engineering, users condition the model for a particular task by hand-crafting a text prompt that includes a description (zero-shot) or examples of the task (few-shot). For instance, to prime a model for sentiment analysis, one could use the prompt, “Is the following movie review positive or negative?” followed by the input sequence, “This movie was amazing!” Researchers found that specific prompt engineering methods, such as few-shot prompting or chain-of-thought prompting, can significantly enhance output quality without the need for fine-tuning.
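
For illustration, the snippet below constructs a zero-shot and a few-shot prompt for the sentiment example above; the `complete()` call is a stand-in for whichever LLM API or local model you use, not a real library function.

```python
# Zero-shot: a task description only.
zero_shot_prompt = (
    "Is the following movie review positive or negative?\n"
    "Review: This movie was amazing!\n"
    "Answer:"
)

# Few-shot: the same task description plus a handful of labeled examples.
few_shot_prompt = (
    "Is the following movie review positive or negative?\n\n"
    "Review: I walked out halfway through.\nAnswer: negative\n\n"
    "Review: A masterpiece from start to finish.\nAnswer: positive\n\n"
    "Review: This movie was amazing!\nAnswer:"
)

# `complete` is a placeholder for whichever LLM API or local model you use to
# generate a continuation from the frozen model; it is not a real library call.
# answer = complete(few_shot_prompt)
```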

This approach simplifies the process of serving a model for multiple downstream tasks by allowing a single frozen pre-trained language model to be shared across tasks. Unlike fine-tuning, it eliminates the need to store and serve separate tuned models for each downstream task.

However, based on these studies and research papers, there may be some drawbacks. Text prompts require manual effort to design, and their performance often falls short compared to fine-tuned models. For example, a frozen GPT-3 model with 175 billion parameters scores 5 points lower on the SuperGLUE benchmark than a fine-tuned T5 model with 800 times fewer parameters.

Photo Credit: Google

3. Prompt-tuning or optimization

The research community soon realized that by moving beyond prompt engineering (manually crafted prompts) and treating prompts as tunable parameters, they could optimize LLM performance for specific tasks.

Prompt-tuning serves as a compromise between fine-tuning and prompt engineering, as it only tunes the prompt parameters for a specific task without retraining the model or updating its other parameters. This method conserves resources compared to fine-tuning, delivers higher-quality outputs than prompt engineering and reduces the need for manual labor in crafting prompts.

For instance, in the case of T5-large, prompt-tuning involves adjusting approximately 50,000 parameters, as opposed to the 32 million to 770 million parameters that would be updated during fine-tuning [1].

Photo Credit: Google

In particular, the automatically-induced prompts can be further separated into two groups: 1) discrete or hard prompts, where the prompt is an actual text string and 2) continuous or soft prompts, where the prompt is instead described directly in the embedding space of the underlying LLM [1].

We will briefly explain both hard and soft prompts below:

Discrete or hard prompts

Discrete prompts, also known as hard prompts, involve automatically searching for templates in a discrete space, usually corresponding to natural language phrases.

Various methods have been proposed for discrete prompts [1] including:

a. Prompt Mining: Automatically finds templates using a large text corpus, searching for frequent middle words or dependency paths between inputs and outputs.
b. Gradient-based Search: Iteratively searches for short sequences that trigger pre-trained language models to generate desired predictions by applying gradient-based search over tokens.
c. Prompt Generation: Treats prompt creation as a text generation task and uses pre-trained models like T5 or reinforcement learning techniques to generate prompts.

Continuous or soft prompts

Similar to hand-crafted prompts, soft prompts are combined with input text. However, instead of using existing vocabulary tokens, soft prompts consist of learnable vectors that can be optimized end-to-end across a training dataset. While soft prompts are not immediately interpretable, they intuitively extract evidence on how to perform a task from labeled datasets.

This approach serves the same purpose as manually written text prompts but without the constraints of discrete language. It eliminates the need for manual prompt creation and allows the soft prompt to efficiently process and condense information from extensive datasets containing thousands or even millions of examples. In contrast, engineered prompts typically face limitations due to model input length constraints, accommodating fewer than 50 examples.

In a recent paper, Google researchers demonstrated that it is possible to create a soft prompt for a specific task by initially configuring the prompt as a fixed-length sequence of vectors (for example, 20 tokens long). These vectors are appended to the beginning of each embedded input, and the combined sequence is then fed into the model. The model’s prediction is compared to the target in order to calculate the loss, and the error is back-propagated to compute gradients. However, only the learnable vectors are updated, while the core model remains unchanged.
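
A minimal PyTorch sketch of this idea is shown below: a learnable matrix of prompt vectors is prepended to the embedded input, and only those vectors receive gradient updates while the language model stays frozen. The `frozen_lm` and `embed` names are placeholders for your own pre-trained model and its embedding layer, not real library objects.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to embedded inputs (illustrative sketch)."""

    def __init__(self, prompt_length: int = 20, d_model: int = 1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, d_model) * 0.02)

    def forward(self, embedded_input: torch.Tensor) -> torch.Tensor:
        # embedded_input: (batch, seq_len, d_model)
        batch = embedded_input.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, embedded_input], dim=1)

# Training sketch: freeze the language model, optimize only the prompt vectors.
# soft_prompt = SoftPrompt()
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
# for batch in dataloader:
#     inputs = soft_prompt(embed(batch["input_ids"]))
#     loss = frozen_lm(inputs_embeds=inputs, labels=batch["labels"]).loss
#     loss.backward()        # gradients flow only into the prompt vectors
#     optimizer.step()
#     optimizer.zero_grad()
```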

Since this discovery, numerous soft prompt optimization techniques have been developed, including:

a. Prefix Tuning: Prepends a sequence of continuous task-specific vectors to the input while keeping the language model parameters frozen. This method is more sensitive to different initialization in low-data settings than discrete prompts.
b. Tuning Initialized with Discrete Prompts: Initializes the search for a continuous prompt using a prompt created or discovered via discrete prompt search methods. This approach can provide a better starting point for the search process.
c. Hard-Soft Prompt Hybrid Tuning: Inserts tunable embeddings into a hard prompt template, combining learnable and fixed elements. One example is P-tuning, which represents prompt embeddings as the output of a BiLSTM and introduces task-related anchor tokens within the template. Another example is Prompt Tuning with Rules (PTR), which uses manually crafted sub-templates and logic rules to compose a complete template, inserting virtual tokens with tunable embeddings for enhanced representation ability.

Left: With model tuning, incoming data are routed to task-specific models. Right: With prompt tuning, examples and prompts from different tasks can flow through a single frozen model in large batches, better utilizing serving resources. Photo Credit: Google.

Prompt-tuning, which originated with large language models, has since been extended to other domains such as audio and video due to its practical efficiency. Prompts can take the form of text snippets, speech streams or pixel blocks in still images or videos. For more information, refer to this comprehensive survey paper.

4. Reinforcement learning from human feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a method used to improve language models by aligning them with human values that are not easily quantifiable (e.g., funny, helpful). It can help models avoid erroneous answers, reducing bias and hallucinations.

The process consists of three main steps:

a. Pre-training a Language Model (LM): This step involves using an existing language model as a starting point. This pre-trained model has already been trained on a large corpus of text data and can generate text based on given prompts. Examples include smaller versions of GPT-3 or DeepMind’s Gopher. The choice of the initial model depends on the specific application and is not yet standardized.

b. Gathering Data and Training a Reward Model (RM): In this step, human feedback is collected to create a dataset that assigns a scalar reward to the text based on human preferences. Prompts are passed through the initial language model to generate new text, which is then ranked by human annotators. Various ranking methods can be used to convert these human rankings into a scalar reward signal for training. The reward model aims to represent human preferences numerically, and it is used to guide the fine-tuning of the language model in the next step.

c. Fine-tuning the Language Model with Reinforcement Learning (RL): The final step involves optimizing the initial language model using reinforcement learning. The objective is to update the model parameters to maximize the reward metrics obtained from the reward model. The fine-tuning process involves formulating the task as a Reinforcement Learning (RL) problem, defining the policy, action space, observation space and reward function, and then using an RL algorithm (e.g., Proximal Policy Optimization (PPO) or Advantage Actor-Critic (A2C)) to optimize the model based on these components.

Photo Credit: Hugging Face
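
For step (b), reward models are commonly trained with a pairwise (Bradley-Terry style) loss on human preference rankings. The sketch below shows that loss in PyTorch, with `reward_model` standing in for a hypothetical network that scores a prompt and response with a single scalar.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: the response humans ranked higher should
    receive a larger scalar reward than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage, where `reward_model` is a hypothetical network that maps
# a (prompt, response) pair to a single scalar:
# r_chosen = reward_model(prompt, chosen_response)
# r_rejected = reward_model(prompt, rejected_response)
# loss = reward_model_loss(r_chosen, r_rejected)
#
# The trained reward model then provides the reward signal that an RL algorithm
# such as PPO maximizes while fine-tuning the language model in step (c).
```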

Overall, RLHF is an exciting and complex research area that aims to improve language models by directly optimizing them using human feedback. The method has been successfully implemented in models like GPT-4 and ChatGPT but requires further exploration to optimize the design space and understand its full potential. For more technical details please check these blog posts (1 and 2) from Hugging Face.

Comparison between the four methods

Here we briefly describe the differences between these four methods by looking at four dimensions and examining how each method ranks on each:

Photo Credit: Georgian
  1. Data Volume: When it comes to data volume, in our view, prompt engineering stands out as the most efficient method.
    Prompt engineering can be executed with just a task description (zero examples) or a task description accompanied by as many examples as the model allows. Due to model input size limitations, most models can accommodate fewer than 100 examples. If there are more examples than the model’s input can handle, prompt optimization methods should be considered, and soft prompts can be utilized for higher performance if explainability is not a concern. Prompt optimization methods demand significantly less data compared to fine-tuning, as they update far fewer parameters. Depending on the size of your model, you may need a few thousand to a few hundred thousand examples for fine-tuning or RLHF. The choice between fine-tuning and RLHF will depend on the nature of the objective. In our view, it would be efficient to opt for fine-tuning if the objective is easily quantifiable, such as improving performance for a specific task, or choose RLHF if the objective is not easily quantifiable, like making a chat assistant more ethical and harmless.
  2. Training Costs: There are no training costs for prompt engineering, and the costs remain low if you want to experiment with various prompts and examples to optimize performance. Prompt optimization methods also have relatively low costs, as they involve tuning a small number of parameters. In comparison, fine-tuning can incur higher costs, as it involves updating some or all of the model’s parameters.
    RLHF has relatively high costs, as not only does it require tuning some or all of the model’s parameters, but it also necessitates paying human annotators to create high-quality data. Note that data quality is vital for achieving strong performance and should not be compromised.
  3. Technical Difficulty: In our view, among the four approaches, prompt engineering and fine-tuning are the easiest to implement, as numerous platforms and tools support execution with just a few lines of code. In contrast, prompt optimization methods are more recent, and there is a lack of sufficient tools or platforms to assist the user. The same applies to RLHF. While there have been significant advancements, such as the emergence of platforms that simplify collecting human feedback (e.g., Surge.AI) and support from key players like Hugging Face (code example), we believe that this area is still relatively new and requires time to mature.
  4. Performance Gain: Although prompt engineering methods can improve performance, particularly when providing the model with a few examples, they may still lag behind other approaches.
    Prompt optimization methods have proven to be more effective than prompt engineering methods, but they only achieve parity with fine-tuning for models with 10 billion or more parameters [1]. Both fine-tuning and RLHF require a substantial amount of data, but they can lead to the most significant performance gains. The choice between fine-tuning and RLHF ultimately depends on the specific goal and whether it is quantifiable or not.

Takeaways

In this post, we discussed various types of generative AI models and four approaches to customize them for a specific use case. We believe that it is important to have a good understanding of what is available, choose the relevant models for the use case, and further tailor the models using appropriate techniques and relevant data.
