Model Adaptation & Fine Tuning + Optimal Generative AI Model Deployment: Part #2

Madhur Prashant
6 min read · Aug 28, 2023

--

Continued from Part #1…

Now, a big challenge, and what transitions us into this part of the blog, is that for vast LLMs, such as our BLOOM model being trained to teach students different subjects, evaluation metrics like these are not enough on their own. In this part, we will train our model with the most efficient techniques using PEFT, then set benchmarks for the model, and then establish a model improvement pipeline. After that, we will dive into model deployment and how to make it highly performant and cost optimal using AWS Inferentia2 and AWS Trainium with Multi-Model Endpoints on none other than Amazon SageMaker.

Parameter Efficient Fine-Tuning Our BLOOM Model

From Part #1, if you remember, a model’s memory footprint depends not only on the data you store in it but also on the following:

Memory is consumed by the number of parameters, which grows as the model scales, as well as the gradients, optimizer states, and weights tracked during training. Now that we know we do not want to instruction fine-tune the entire model, an efficient way to preserve both performance and memory while training is to use parameter efficient fine-tuning (PEFT) techniques on our BLOOM model rather than tuning all of its weights.
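To make the memory point concrete, here is a rough, back-of-the-envelope sketch in Python. The bytes-per-parameter figures are common rules of thumb for fp32 training with the Adam optimizer, not exact BLOOM numbers, and activation memory is ignored entirely:

```python
# Back-of-the-envelope memory estimate for training. The per-parameter byte
# counts (fp32 weights + gradients + two Adam moments) are rough rules of
# thumb, not exact BLOOM figures, and activation memory is not included.
def full_finetune_memory_gb(num_params: float) -> float:
    bytes_per_param = 4 + 4 + 8  # weights + gradients + Adam optimizer states
    return num_params * bytes_per_param / 1e9

def peft_memory_gb(num_params: float, trainable_fraction: float) -> float:
    frozen_weights = num_params * 4  # base weights stay frozen, no gradients needed
    trainable = num_params * trainable_fraction * (4 + 4 + 8)
    return (frozen_weights + trainable) / 1e9

print(full_finetune_memory_gb(176e9))  # ~2816 GB for a 176B-parameter model
print(peft_memory_gb(176e9, 1e-4))     # ~704 GB, dominated by the frozen base weights
```

The exact numbers shift with precision and optimizer choice, but this gap is exactly why freezing the base weights and training only a tiny set of new parameters matters so much.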

→ Here, we can keep all of the original model weights frozen, add a small layer of trainable weights (the only parameters that are updated during training), and end up with an updated model whose task-specific weights amount to no more than a couple of MBs. Instead of instruction tuning for every task, we can add a new trainable set of parameters and train our model on those. For example,

  1. We can add a trainable set of weights for language understanding, with examples that teach basic vocabulary.
  2. In the same way, we can add examples and data from history and mathematics datasets, along with examples of effective ways to teach, without being bound by them, so the model can use reinforcement learning to build on what works best for the user and stay optimal for each user’s learning progression and interactivity.

This way, we can focus on training the model on specific subjects for our students, then have the model evaluate itself against certain benchmarks and iterate to improve its performance. Now, there are a couple of different ways to use PEFT on your model:

You can select only a subset of your initial LLM parameters and train those, add new trainable parameters (prompt tuning), or re-parameterize your model with low-rank adaptation techniques such as LoRA.

  1. In our case, we can use soft prompts (prompt tuning) to add a small set of trainable vectors.
  2. This way, we will only be training a small subset of parameters (on the order of 10k-100k).
  3. We will target four different tasks: historical concepts, mathematical reasoning, basic language understanding, and general knowledge.
  4. We can create a separate task for each use case and apply soft prompting to train our model on it.
  5. We could use LoRA, but since we are adding new data, we will focus on prompt tuning while retaining the performance and efficiency of the model. Take a look below at how we can utilize prompt tuning rather than tuning our entire model.
  6. You could also use QLoRA if you want to combine low-rank adaptation with quantization, as a way to reduce model size and keep memory stable without impacting model performance.
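Below is a minimal sketch of what that soft-prompt setup could look like with the Hugging Face transformers and peft libraries. The smaller bigscience/bloom-560m checkpoint, the initialization text, and the 20 virtual tokens are illustrative assumptions for the sketch, not fixed choices for our use case:

```python
# Minimal prompt tuning sketch with Hugging Face peft. The model name, init
# text, and number of virtual tokens below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_model_name = "bigscience/bloom-560m"  # a small BLOOM variant for the sketch

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Soft prompt: a handful of trainable virtual tokens prepended to every input.
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Teach this concept to a student in simple terms:",
    num_virtual_tokens=20,
    tokenizer_name_or_path=base_model_name,
)

# The base model weights stay frozen; only the virtual tokens are trained.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
# e.g. roughly 20k trainable parameters out of ~559M, well under one percent
```

You would create one of these soft prompts per task (history, mathematics, language understanding, general knowledge) and swap them in at inference time while reusing the same frozen base model.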

Now, since our model is so vast, we need to set benchmarks for it and establish an automated pipeline to re-evaluate, iterate, and make our BLOOM model perform better and more efficiently every single time.

Setting Benchmarks — BLOOM LLM Model

Sometimes, when a model or LLM covers a large amount of data and several tasks, the evaluators specified before are not enough to assess its performance. In this case, use specific benchmarks such as GLUE, SuperGLUE, HELM, and MMLU for comprehensive model evaluation purposes.

  • Use GLUE or SuperGLUE for your model if it performs natural language tasks, with SuperGLUE being the more meaningful choice for more challenging tasks.
  • Use HELM (Holistic Evaluation of Language Models) if you want your model evaluated across multiple metrics (7 metrics covering fairness, bias, toxicity, and more) and several scenarios. This is perfect if your model performs several tasks and you want to test it ‘holistically’.
  • Use MMLU (Massive Multitask Language Understanding) if your model acts as a modern LLM and draws on knowledge across several domains, such as history, law, and education.

→ **In the case of our model, we will focus on using SuperGLUE for natural and basic language understanding and processing. We can also use HELM to make sure that our model is human-aligned and that the vastness of the model is evaluated on important criteria such as toxicity, bias, fairness in teaching, and helpfulness. Lastly, we can make use of MMLU to make sure all of the basic concepts of history, general knowledge, and mathematical reasoning are communicated clearly and transparently.**
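As a minimal sketch of what scoring a model against a SuperGLUE task could look like, assuming the Hugging Face evaluate library is available; the BoolQ config and the toy predictions below are purely illustrative:

```python
# Minimal sketch: scoring predictions against a SuperGLUE task with the
# Hugging Face `evaluate` library. The BoolQ config and the hard-coded
# predictions/references are illustrative assumptions only.
import evaluate

super_glue_metric = evaluate.load("super_glue", "boolq")

# 1 = yes, 0 = no; in practice these come from your model's outputs and the
# benchmark's labeled validation set.
predictions = [1, 0, 1, 1]
references = [1, 0, 0, 1]

results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)  # e.g. {'accuracy': 0.75}
```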

Now that we have certain benchmarks for our model, let’s make sure we have established an automated pipeline for model evaluation and iteration as the model versions improve with time and as our user base scales:

Establishing an Evaluation Automated Pipeline

While establishing our BLOOM model performance pipeline, we can focus on the most important aspects of evaluating model accuracy and performance efficiency: aggregating and selecting data for your use case, choosing evaluation metrics that fit your model’s task and data type, setting benchmarks for your model or LLM, and maintaining a continued chain of evaluation and iteration. Let’s look at an implementation plan we can use for this specific use case.

Implementation Plan

  • Assessing and Aggregating Data: Start your pipeline with the data selection process for the task you want the model to complete as part of your use case. Identify and collect all of the diverse data that would be meaningful to your model’s performance and use case.
  • Evaluation Metric Definition: Once the data has been collected, analyze the functionalities of the model and the metrics that are meaningful for tracking its progress, accuracy, and performance in the given use case.
  • For example, use specific evaluators (string, trajectory, or comparison evaluators) for your given use case, and then define the metrics you want to use for your given model and use case.
  • If your model performs text summarization, use a string evaluator with the ROUGE metric (which is specifically designed for summarization and compares model outputs with baseline reference examples). If your model performs text translation, use BLEU instead (see the sketch after this list).
  • Setting Benchmarks: If our model performs vast tasks, for example natural language processing as a modern LLM that draws on several scenarios and knowledge sources, we can use benchmarks.
  • Use benchmarks such as GLUE/SuperGLUE, HELM, and MMLU for your given use case to track the performance and accuracy of your model holistically.
  • Set up an iterative self-evaluation process flow for the model, focusing on making it more human-aligned and compliant for the tasks it performs and the users it interacts with.
  • Recurring Feedback Aggregation: If your model performs several tasks and interacts with many users, make sure to adopt a recurring feedback loop in which people label the model’s prompt completions against criteria based on the tasks you want to accomplish (for example, “how honest and factually correct is this response?”).

  • Documentation, Human Alignment, and Iteration: Make sure to document the performance and accuracy of your model, set up human-aligned training for it, and iterate for continuous improvements in performance and accuracy.
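Here is the metric-selection step from the list above as a minimal sketch, again assuming the Hugging Face evaluate library; the example summaries and translations are made up purely for illustration:

```python
# Minimal sketch: ROUGE for summarization-style outputs, BLEU for
# translation-style outputs, via the Hugging Face `evaluate` library.
# All strings below are made-up examples.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Summarization: compare a model summary against a reference summary.
model_summary = ["The French Revolution began in 1789 and reshaped France."]
reference_summary = ["The French Revolution started in 1789 and transformed French society."]
print(rouge.compute(predictions=model_summary, references=reference_summary))

# Translation: BLEU takes a list of reference translations per prediction.
model_translation = ["The cat sits on the mat."]
reference_translations = [["The cat is sitting on the mat."]]
print(bleu.compute(predictions=model_translation, references=reference_translations))
```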

Here, we can use a human in the loop on a recurring basis to rank and label each portion of the data we get from the LLM for teaching the students, and then, in a recurring manner, train a reward model that keeps updating our actual BLOOM model. In this way, we can make sure that our model pipeline is automated, secure, and human-aligned.
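As a purely illustrative sketch in plain Python (no specific RLHF library is implied, and the prompt and completions are made up), the human rankings described above can be turned into chosen/rejected preference pairs, the usual input format for training a reward model:

```python
# Illustrative only: convert a human labeler's ranking of completions into
# (chosen, rejected) preference pairs for reward-model training.
def build_preference_pairs(prompt: str, ranked_completions: list) -> list:
    """ranked_completions is ordered best-first by the human labeler."""
    pairs = []
    for better in range(len(ranked_completions)):
        for worse in range(better + 1, len(ranked_completions)):
            pairs.append({
                "prompt": prompt,
                "chosen": ranked_completions[better],
                "rejected": ranked_completions[worse],
            })
    return pairs

pairs = build_preference_pairs(
    "Explain fractions to a 10-year-old.",
    [
        "A fraction is a way of splitting something into equal parts...",  # ranked best
        "Fractions are numbers like 1/2.",
        "Fractions are hard, look them up.",                               # ranked worst
    ],
)
print(len(pairs))  # 3 pairs from 3 ranked completions
```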

Part #3 will contain optimal model deployment on SageMaker strategies! Stay tuned!

--

Madhur Prashant

Learning is my passion, so is the intersection of technology & strategy. I am passionate about product. I work @ AWS but these are my own personal thoughts!