Model Adaptation & Fine Tuning + Optimal Generative AI Model Deployment: Part #1

Madhur Prashant
9 min read · Aug 28, 2023


Purpose

The purpose of this document is to convey the importance of Parameter-Efficient Fine-Tuning (PEFT), or in some cases instruction fine-tuning, along with aligning generative AI models using RLHF (Reinforcement Learning from Human Feedback). We will then look at optimizing the model, deploying it cost-effectively with low-latency inference, and building external applications with these deployed models on SageMaker, specifically using AWS Inferentia2 and AWS Trainium to make sure these fine-tuned generative AI models can serve their user base with high performance and low latency.

Generative AI Product Example

Humans go through several developmental stages in life, and the trajectory of that development differs from stage to stage. Let's focus on learning and building knowledge about subjects, for example. We all study through high school and college, or whatever degree we set out to accomplish, and really keep learning over the course of a lifetime. Ironically, the trajectory of our learning is steepest (almost exponential) when we have no knowledge at all. That's right: our brain's plasticity, its flexibility to incorporate and learn new information, is highest from birth to two or three years of age. If you have ever wondered how children pick up new languages so quickly, that is the perfect window for it. View the brain plasticity chart below and how the trajectory changes:

Now, generative AI is used for several natural language processing use cases, whether it be text summarization, modern LLM use cases in education, or easing manual procedures in software development life cycles. With the launch of GPT and other LLMs (BERT, BLOOM, etc.), we have seen an exponential number of startups launch 'AI-powered tutors' for students of various age groups; how effective are these, exactly? In this blog, we will choose a model, scope our problem, and then dive deeply into how we can fine-tune the model we use to really teach such children at a young age, depending on their progress and self-development in language, and in other subjects such as mathematical reasoning (especially since ML models do not truly reason about math, they only operate on symbols), history, and general knowledge. We will go over these concepts and take an interesting stretch toward creating a model that aids self-development and education for children aged 1–5, where brain plasticity is high and reliance on a model for a steady learning pace becomes essential.

Generative AI Product Idea: Tailored Educator for Young Children (Ages 1–5) [IDEA SCOPE]

Our scope for this model is to fine-tune it for a use case where it acts as a single source that helps young children (with high brain plasticity) learn basic concepts, language, mathematical reasoning, history, and general knowledge, and that tailors itself from student to student, with personalized learning and checkpoints, to get the most out of each child's attention. After training the model efficiently, we will focus on deploying it for optimal cost, high performance, and low latency on AWS Inferentia2 or AWS Trainium. Here, we will focus on choosing a large pre-trained language model and fine-tuning it efficiently, using various techniques, and then tailoring it to our idea across the following subjects:

1. Language processing and learning for young children
2. Basic, tailored mathematical reasoning for each student, supporting logical-reasoning progression
3. Learning in basic subjects, such as history, general knowledge, and situational scenarios to develop soft skills

Generative AI Model Selection: Tailored to our Product Idea

For the purpose of our idea and scope, we will use a decoder-only model for natural language processing, serving our idea as a general modern LLM trained for our use case. We will use a BLOOM-based model and make sure it is sufficient to teach a developing human brain basic subjects and self-development. Let's take a look at the fine-tuning process for our use case.
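To make the starting point concrete, here is a minimal sketch, not a prescribed setup, of loading a BLOOM checkpoint with the Hugging Face transformers library; bloom-560m is assumed here only because it is small enough to experiment with locally, and larger BLOOM variants follow the same pattern:

```python
# A minimal sketch: load a small BLOOM checkpoint and generate a completion.
# "bigscience/bloom-560m" is an illustrative assumption; any BLOOM size works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Sanity check with a simple teaching-style prompt.
prompt = "Explain to a young child why the sky is blue:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```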

PEFT'ing our Model (Parameter-Efficient Fine-Tuning our BLOOM Model)

Before we get started and get into the nitty-gritty of training our model, we need to clarify that our model will be trained on data covering the subjects below:

1. Language processing and learning for young children
2. Basic, tailored mathematical reasoning for each student, supporting logical-reasoning progression
3. Learning in basic subjects, such as history, general knowledge, and situational scenarios to develop soft skills

We need to use a multi-task fine-tuning technique rather than single-task instruction fine-tuning.

→ This way, we avoid catastrophic forgetting, and our BLOOM model can perform these multiple tasks, tailored specifically to the children in their learning process.

  • Now, let's start top-down and look at fine-tuning our model manually, using in-context learning along with examples of prompts and completions in the datasets, to support our multiple tasks.
  • We will then look at some of the limitations and keep moving forward until we reach an approach that lets us efficiently train our BLOOM model on all of the data.
  • Next, we will set metrics and benchmarks to make sure our model performs better and better.
  • Finally, we will establish an automated pipeline for model training, evaluation, and improvement, and then look at optimization and deployment techniques for high performance and low cost.

1. Multi-Task Instruction Fine-Tuning our BLOOM Model

Now that we know our BLOOM model is supposed to perform different tasks, we can create datasets to fine-tune our pre-trained BLOOM model in the following manner:

Here, we can create one large dataset, or break it into chunks of data with instructions and example completions, targeting the various subjects the model aims to teach the students.
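As a hedged illustration of what those chunks could look like, here is a minimal sketch of assembling a multi-task prompt/completion dataset; the subject names, templates, and items are assumptions invented for this example, and real completions would be written or reviewed by educators:

```python
# A minimal sketch of a multi-task instruction dataset (illustrative only).
# Templates and items below are invented assumptions, not a real curriculum.
instruction_templates = {
    "language": "Teach a young child the word '{item}' using a simple example sentence.",
    "math": "Walk a young child through solving {item}, one small step at a time.",
    "general_knowledge": "Explain to a young child, in one short paragraph: {item}",
}

raw_items = {
    "language": ["apple", "share"],
    "math": ["2 + 3", "5 - 1"],
    "general_knowledge": ["Why do we see the moon at night?"],
}

# Flatten into prompt/completion records across all tasks, so a single
# fine-tuning run sees every subject (mixing tasks in one dataset is what
# helps avoid catastrophic forgetting on any one task).
records = []
for task, items in raw_items.items():
    for item in items:
        records.append({
            "task": task,
            "prompt": instruction_templates[task].format(item=item),
            "completion": "<educator-written answer goes here>",
        })

print(f"{len(records)} training records, e.g.: {records[0]}")
```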

→ Now, let's look at a couple of limitations of using a multi-task instruction fine-tuning technique with data categories built from instructions and examples:

  1. Too narrow for a large language model with vast use cases (in our case, BLOOM): Our language model is meant to scale and grow as it teaches children, so defining instructions for every use case will never end, and we would keep circling through a never-ending cycle of learning improvements. Moreover, every user has their own way of learning, which we cannot predefine, so instruction fine-tuning, even with a multi-task approach, will not be the most optimal way to accomplish what our product needs.
  2. Ambiguity in instructions to cover all possible use cases: Use cases grow as users grow, assuming each user has their own way of learning every subject. Even if we start off with 1,000 users learning 5 new subjects, and assume each user approaches learning differently, we would need 1,000 * 5 = 5,000 instructions, and that is assuming we already know every learning approach. A predefined solution helps with basic questions and fixed curricula, but since we are targeting a younger user base whose learning approaches are ambiguous and flexible, starting with a predefined set of multi-task instructions will slow our model down and narrow our product goal.
  3. Data size, model size, and compute power: In the worst case, if we use thousands of instructions and keep adding them to our data:
    1. From the image above, we know that our model consumes not only the data we feed in; the dataset grows rapidly as we keep adding instructions (which is bound to happen for a use case as open-ended as ours), so the data will increase, the model footprint will grow out of bounds, and we end up with higher costs, manual scaling, and a model with very low performance efficiency.
    2. Iterations over models and data sizes: Iterating over the model after each performance evaluation, adding more and more data and instructions, will increase the memory footprint and decrease memory efficiency. For our use case, training BLOOM will be much more efficient with PEFT (Parameter-Efficient Fine-Tuning); see the sketch at the end of this post.
    3. An ambiguous, broad model != instruction fine-tuning: To round off the pain points of multi-task instruction fine-tuning, I want to reiterate that for models that must excel at ambiguous tasks, such as teaching preschoolers with basic language understanding and the ability to interactively teach mathematics, history, and general-knowledge concepts, instruction fine-tuning is NOT the way to go. Put yourself in the shoes of the student: you are in this evolving world of AI, and instead of learning the way your brain wants to, you are handed predefined instructions to learn from. How would that go? You might learn the concepts, but are you developing yourself, or is the AI developing you? Maybe that's a debate for another time. The point is that we can train our models more flexibly for tasks that require thousands of learning iterations, without taking up a lot of memory, while retaining performance and serving users with low latency.
  4. Evaluations: Now that we have made the case that multi-task instruction fine-tuning is not an optimal way to train a broad model, let's strengthen that argument by looking at some metrics you might consider if you did use multi-task fine-tuning, and how you could ensure the model functions effectively and performs better with every iteration as more instructions are added:

Evaluation Metrics: Multi-Task Fine Tuning

In our case, let's establish the main challenges in evaluating what our BLOOM LLM might generate:

  1. There are many languages, and the ways users understand and learn from different phrases in different contexts are vague and endlessly varied.
  2. At the end of the day, psycholinguistics and perception vary from user to user, so the same context, or the same completion to a prompt, may be understood and evaluated differently by different users in different scenarios.

Now that we have established that this will always be a challenge, I personally feel that, since our brains contain on the order of 100 billion neurons and perception and psycholinguistics develop only through personal, tailored human experience, no matter how big a model is, even at a trillion parameters, it is a stretch to evaluate every possible metric for model performance improvement.

So, now that we have talked a little about cognition and psychology, let's jump back to the metrics we could use with a multi-task instruction fine-tuned model (if we had used one for this scenario):

  1. ROUGE metric (string evaluator): We can use string evaluators to track how the generated texts improve. ROUGE is suited to text summarization; say we want our product to summarize basic explanations of a language task or of historical and general-knowledge examples, we can score those summaries against references with ROUGE.
  2. BLEU metric (string evaluator): BLEU, by contrast, is suited to text translation tasks. For example, we can use it where children translate or follow sequence-to-sequence conversions of basic language tasks, broken down into parts for linguistic understanding.

This would be beneficial in scenarios where the model needs to teach students basic, foundational values that are human-aligned and can be phrased in different ways, tailored to different students. A sketch of computing both metrics follows below.
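As a sketch, both metrics can be computed with the Hugging Face evaluate library; the prediction and reference strings below are invented purely for illustration:

```python
# A hedged sketch of scoring model outputs with ROUGE (summarization-style)
# and BLEU (translation-style). Example strings are invented.
import evaluate

predictions = ["The pyramids were built in ancient Egypt by many workers."]
references = ["The pyramids were built long ago in Egypt by skilled workers."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bleu = evaluate.load("bleu")
# BLEU supports multiple references per prediction, hence the nested list.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```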

Other evaluators (comparison, trajectory evaluators): You can also use other evaluators to cover the remaining evaluation modes, for instance the comparison or trajectory classes of evaluators, to judge your model's performance. For more details, see: https://python.langchain.com/docs/guides/evaluation/
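For a taste of that interface, here is a hedged sketch using LangChain's load_evaluator with one of its built-in criteria evaluators; note that criteria evaluators grade outputs with an LLM behind the scenes (by default an OpenAI chat model, which requires an API key), and the strings here are invented:

```python
# A hedged sketch of LangChain's evaluator interface (see the docs linked above).
# Requires an LLM backend; by default this uses an OpenAI chat model via OPENAI_API_KEY.
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="helpfulness")
result = evaluator.evaluate_strings(
    prediction="Two plus three equals five. Try counting it out on your fingers!",
    input="Teach a young child what 2 + 3 is.",
)
print(result)  # typically includes a score, a verdict, and the grader's reasoning
```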

Now, a big challenge, and the transition to the next part of this blog, is that for vast LLM use cases, such as our BLOOM model teaching students different subjects, evaluation metrics like these are not enough. Next, we will use the most efficient training technique for our model, PEFT, followed by setting benchmarks for the model and establishing a model-improvement pipeline. We will then dive into model deployment and how to make it highly performant and cost-optimal using AWS Inferentia2 and AWS Trainium on multi-model endpoints on none other than Amazon SageMaker.
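As a preview of that PEFT step (referenced earlier in the limitations list), here is a minimal LoRA sketch using the Hugging Face peft library; the checkpoint and hyperparameters are illustrative assumptions, not tuned values:

```python
# A minimal LoRA sketch with the Hugging Face `peft` library (illustrative values).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection layer
)

peft_model = get_peft_model(base_model, lora_config)
# Only the small LoRA adapters train; the base model's weights stay frozen.
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
```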

