Generative AI Project Life Cycle
From Conception to Launch with Integrated LLM Capabilities
The basic project life cycle of a generative AI application has four core stages:
- Scope: define the problem you want to solve with an LLM and determine how the model should work in your application
- Select: choose an existing pretrained model or pretrain your own from scratch
- Adapt and align model: adapt the model through prompt engineering and fine-tuning, align it with human feedback, and evaluate the output
- Application integration: optimize and deploy the model for inference, and build LLM-powered applications
An LLM encodes a deep statistical representation of language, learned from huge text corpora. During pretraining it internalizes the patterns in that text, and its weights are updated to minimize the loss of the training objective for each token, which requires substantial GPU compute.
- Define the scope accurately and narrow down the use case
- Select a model or build one from scratch, and estimate feasibility
- Evaluate performance, and carry out additional training if necessary
- In-context learning, if required
- Supervised learning to fine-tune the model
- Fine-tune the model to align with human preferences
- Reinforcement Learning with human feedback
- How to align to your preferences
- highly iterative
- additional infrastructure requirements
Model hubs let you browse model cards, which help you find the model that suits your use case. Each model card describes the tasks the model can carry out in your application and how the model was trained. A typical card includes:
- Model Details
- Uses
- Bias, Risks and Limitations
- Training Details
- Evaluation
How Are Large Language Models Trained?
- Pretraining an LLM:
- The model learns from unstructured textual data collected through web scraping, various data sources, and corpora assembled for training language models; datasets usually range from gigabytes to petabytes
- During training, the model weights are updated to minimize the loss of the training objective
- How many patterns the model can capture depends on its architecture
- The dataset needs to be processed to increase its quality, address bias, and remove harmful content before it is used for training
- Data quality filter: often only 1-3% of the original tokens end up being used for pretraining
- Model architecture and pretraining objectives: there are three variants of the Transformer model: encoder-only, decoder-only, and encoder-decoder. Each variant is trained with a different objective and suits different tasks, so it helps to build intuition for which variant to use for which task.
- Autoencoding models: encoder-only LLMs
- Pretrained using Masked Language Modeling (MLM)
- Objective: reconstruct text ("denoising")
- Builds a bidirectional representation with the full context of the input; ideally suited to tasks that benefit from bidirectional context
- Good use cases:
- Sentiment analysis
- Named entity recognition
- Word classification
For example:
- BERT
- RoBERTa
- Autoregressive models: decoder-only LLMs
- Pretrained using Causal Language Modeling (CLM)
- Objective: predict the next token from the preceding sequence of tokens; researchers sometimes call this full language modeling
- Context is unidirectional: the model sees only the tokens before the one being predicted
- Good use cases:
- Text generation
- Other emergent behaviour, depending on model size
For example:
- GPT
- BLOOM
- Sequence-to-sequence models: encoder-decoder LLMs
- Pretrained using span corruption
- Random spans of input tokens are masked
- Each masked span is replaced by a sentinel token, a special token that does not correspond to any actual word
- Good use cases:
- Translation
- Text summarisation
- Question answering
For Example :
- Flan-T5
- BART
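To make the three pretraining objectives concrete, here is a minimal sketch of how one sentence could be turned into training examples for each. It works on whitespace-split words rather than real subword tokens, and the helper names, the `<MASK>` and `<X>` strings, and the 15% masking rate are illustrative assumptions, not the API of any particular library.

```python
import random

MASK = "<MASK>"    # hypothetical mask token used for MLM
SENTINEL = "<X>"   # hypothetical sentinel token used for span corruption


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide random tokens; the target is the original text."""
    rng = random.Random(seed)
    inputs = [MASK if rng.random() < mask_prob else t for t in tokens]
    return inputs, list(tokens)


def clm_examples(tokens):
    """Causal language modeling: predict each token from the tokens before it only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def span_corruption_example(tokens, start, length):
    """Span corruption: replace one span with a sentinel; the target restores the span."""
    inputs = tokens[:start] + [SENTINEL] + tokens[start + length:]
    target = [SENTINEL] + tokens[start:start + length]
    return inputs, target


tokens = "the teacher teaches the student".split()

masked_inputs, mlm_target = mlm_example(tokens)
clm_pairs = clm_examples(tokens)
corrupted_inputs, span_target = span_corruption_example(tokens, start=1, length=2)

print(masked_inputs, "->", mlm_target)
print(clm_pairs[0])                       # (['the'], 'teacher')
print(corrupted_inputs, "->", span_target)
```

Note how the encoder-only objective sees context on both sides of a masked token, while each causal example exposes only the prefix.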
Things to keep in mind when choosing a model:
Larger models are generally more capable of carrying out tasks well, often without additional in-context learning or further training.
Comparison of model architectures and pretraining objectives:
Increasing model size and the number of parameters increases capability.
Growth has been powered by:
- Introduction of the highly scalable transformer architecture
- Access to massive datasets
- More powerful compute resources
This growth has led some to hypothesize a new Moore's law for LLMs:
"Keep adding parameters to make models smarter"
However, training models with very large numbers of parameters is difficult and expensive.
- Computational challenges of training LLMs
Training LLMs often fails with the error message:
"CUDA out of memory"
CUDA (Compute Unified Device Architecture) is a collection of libraries and tools developed for NVIDIA GPUs.
PyTorch and TensorFlow use CUDA to boost the performance of matrix multiplication and other operations common in deep learning.
- LLMs need tons of memory to store and train all of their parameters
Calculating the approximate GPU RAM needed to store 1B parameters
This calculation helps develop intuition for the scale of the problem:
- 1 parameter = 4 bytes (32-bit float)
- 1B parameters = 4 x 10^9 bytes = 4 GB at 32-bit full precision
- This figure covers storing the model weights only
- Training requires additional memory on top of the weights:
- Model parameters (weights): 4 bytes per parameter
- Adam optimiser (2 states): +8 bytes per parameter
- Gradients: +4 bytes per parameter
- Activations and temporary memory (variable size): +8 bytes per parameter (high-end estimate)
Total: 4 bytes per parameter for the weights + 20 extra bytes per parameter = 24 bytes per parameter
Memory needed to train the model is therefore about 6 times the memory needed to store it
Memory requirement: a 4 GB model (1B parameters at 32-bit precision) needs roughly 24 GB of GPU RAM to train
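The per-parameter figures listed above can be turned into a quick estimator. This is a rough sketch that assumes 32-bit precision throughout, takes 1 GB as 10^9 bytes, and uses the high-end activation estimate; the function and dictionary names are made up for illustration.

```python
# Bytes per parameter at 32-bit full precision, taken from the list above.
BYTES_PER_PARAM = {
    "weights": 4,
    "adam_states": 8,   # two optimiser states, 4 bytes each
    "gradients": 4,
    "activations": 8,   # variable in practice; high-end estimate
}


def storage_gb(num_params, bytes_per_weight=4):
    """GPU RAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_weight / 1e9


def training_gb(num_params):
    """Approximate GPU RAM needed to train: weights plus all training overhead."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


one_billion = 1_000_000_000
print(storage_gb(one_billion))    # 4.0
print(training_gb(one_billion))   # 24.0
print(training_gb(one_billion) / storage_gb(one_billion))  # 6.0
```

Real training runs vary with batch size, sequence length, and optimiser, so treat the result as an order-of-magnitude guide rather than a measurement.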
Quantisation: reduce the memory required to store the model weights by lowering their precision from 32-bit floating point (FP32) to 16-bit floating point (FP16 or BFLOAT16) or 8-bit integer (INT8).
A 32-bit floating point number requires 4 bytes of memory to store one value; a 16-bit floating point number requires 2 bytes, and an 8-bit integer requires 1 byte.
- Quantising a 1B-parameter model from 32-bit full precision to 8-bit precision shrinks it from 4 GB to 1 GB, at the cost of some precision
- The ultimate goal is to reduce the memory needed to store and train the model
- Quantisation statistically projects the original 32-bit floating point numbers into a lower-precision space using scaling factors; quantisation-aware training (QAT) learns the scaling factors during training
- BFLOAT16 is a popular choice: it halves the memory footprint, which helps fit the model into GPU memory, by halving the number of bits per parameter (the number of parameters is unchanged)
- Full-precision model, 16-bit quantised model, 8-bit quantised model: 4 GB, 2 GB, 1 GB for 1B parameters
- 32-bit full precision -> 16-bit half precision -> 8-bit precision: each step halves the storage, and the same degree of savings applies to training memory
- Even with quantisation, models much larger than 1B parameters outgrow a single GPU, so training them requires multiple GPUs
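A small sketch can show what the precision loss looks like. It uses only Python's standard `struct` module to round a value to FP32 and FP16, emulates BFLOAT16 by keeping the top 16 bits of the FP32 bit pattern (simple truncation, which is an assumption; real hardware usually rounds), and applies a basic symmetric INT8 scheme with a single scaling factor. The helper names are illustrative, not taken from any quantisation library.

```python
import math
import struct

BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def to_fp32(x):
    """Round a Python float (64-bit) to 32-bit floating point."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x):
    """Round to IEEE 16-bit half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate BFLOAT16 by truncating the FP32 bit pattern to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def quantise_int8(values):
    """Symmetric INT8 quantisation; assumes at least one non-zero value."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale


def dequantise_int8(quantised, scale):
    return [q * scale for q in quantised]


def model_size_gb(num_params, dtype):
    return num_params * BYTES_PER_VALUE[dtype] / 1e9


print(to_fp32(math.pi), to_fp16(math.pi), to_bf16(math.pi))
weights = [0.8, -0.31, 0.05, -1.2]
q, scale = quantise_int8(weights)
print(q, dequantise_int8(q, scale))
for dtype in ("fp32", "fp16", "int8"):
    print(dtype, model_size_gb(1_000_000_000, dtype), "GB")
```

BFLOAT16 keeps the full exponent range of FP32 and gives up mantissa bits, which is one reason it tends to be stable for training.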
Fine tuning your model
Data parallelism allows for the use of multiple GPUs to process different parts of the same data simultaneously, speeding up training time.
Scaling model training across multiple GPUs speeds up training, provided the scaling is done efficiently.
When the model fits on a single GPU, batches of data can be processed in parallel using PyTorch DDP (Distributed Data Parallel):
1. Copies the model to each GPU
2. Sends a different batch of data to each GPU
3. Each GPU runs the forward and backward pass on its batch in parallel
4. A synchronisation step combines the gradients from all GPUs, and the combined result updates the identical model copy on each GPU
5. Each GPU must hold the full weights, gradients, and optimiser states, so the model and its training state have to fit on a single GPU
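The steps above can be imitated in plain Python without any GPUs. The sketch below is a toy simulation, not the PyTorch DDP API: the "model" is a single weight `w` fitting `y = w * x`, each list entry in `replicas` stands in for one GPU's copy, and `all_reduce_mean` is a hypothetical stand-in for the gradient synchronisation step.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the synchronisation step: average the gradients of all workers."""
    return sum(values) / len(values)


def ddp_step(replicas, shards, lr):
    """One data-parallel step: local backward passes, gradient sync, identical update."""
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * synced for w in replicas]


# Toy dataset generated by the "true" weight 3.0, split across two simulated GPUs.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]
replicas = [0.0, 0.0]   # every GPU starts from the same copy of the model

for _ in range(50):
    replicas = ddp_step(replicas, shards, lr=0.05)

print(replicas)  # both copies stay identical and approach 3.0
```

Because every replica applies the same averaged gradient, the copies never drift apart, which is the property the synchronisation step exists to guarantee.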
Model Sharding
FSDP: Fully Sharded Data Parallel
FSDP is motivated by ZeRO (Zero Redundancy Optimizer). Of all the memory components required for training LLMs, the largest is the optimizer states, which take up twice as much space as the weights, followed by the weights themselves and the gradients. The goal of ZeRO is to optimize memory by distributing, or sharding, the model states across GPUs with zero data overlap.
Sharded components: weights, optimizer states, gradients
- DDP keeps a full model copy on each GPU, which leads to redundant memory consumption
- ZeRO distributes the model parameters, gradients, and optimiser states across GPUs instead of keeping a copy on each GPU
- ZeRO has three stages (1, 2, 3); the third shards the model parameters as well
- DDP: Distributed Data Parallel, for models that fit on a single GPU
- FSDP: Fully Sharded Data Parallel, for models too big to fit on a single chip; shards across GPUs and nodes
- With FSDP, the sharded weights are collected from all GPUs before each forward and backward pass
- This is a performance versus memory trade-off
FSDP:
- Helps reduce overall GPU memory utilization
- Supports offloading to CPU if needed
- Lets you configure the level of sharding via the sharding factor
- Sharding factor of 1: full replication, no sharding (equivalent to DDP)
- Sharding factor equal to the number of GPUs: full sharding, with the largest memory savings but more communication between GPUs; smaller models perform better with less sharding
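As a rough illustration of sharding, the sketch below splits a flat list of parameters across several simulated GPUs and then gathers them back, which is the collection step that has to happen before a forward or backward pass. It is a toy model of the idea only; `shard` and `all_gather` are hypothetical helpers, not functions from PyTorch FSDP.

```python
def shard(params, num_gpus):
    """Split a flat parameter list into num_gpus contiguous shards."""
    size = -(-len(params) // num_gpus)  # ceiling division
    return [params[i * size:(i + 1) * size] for i in range(num_gpus)]


def all_gather(shards):
    """Collect every shard so a GPU temporarily sees the full parameter list."""
    return [p for s in shards for p in s]


params = [0.1 * i for i in range(10)]   # ten toy "weights"
shards = shard(params, num_gpus=4)

for gpu_id, s in enumerate(shards):
    print(f"GPU {gpu_id} permanently stores {len(s)} of {len(params)} parameters")

full = all_gather(shards)
print(full == params)  # True: gathering restores the complete model
```

Each GPU holds only its own shard between steps, so the resting memory footprint shrinks with the number of GPUs, while the gather step is the communication cost paid for that saving.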
Data parallelism is a strategy that splits the training data across multiple GPUs. Each GPU processes a different subset of the data simultaneously, which can greatly speed up the overall training time
Scaling laws and compute-optimal models
Goal: maximize model performance on the learning objective by minimizing the loss
Scaling choices: increase the dataset size (number of tokens) or the model size (number of parameters)
Constraint: the compute budget (GPUs, training time, cost)
Unit of compute
A petaFLOP/s-day is the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day. Note that one petaFLOP per second corresponds to one quadrillion floating point operations per second.
The compute budget is in turn limited by the project timeline and the financial budget
Impact of training dataset size and model size on performance
As the volume of training data increases, performance improves.
As model size increases, test loss decreases, meaning performance improves.
Compute-optimal models
Chinchilla paper: compute-optimal training of LLMs
The paper looks for the optimal number of parameters and volume of training data for a given compute budget. Its findings suggest that many very large models may be over-parameterized and under-trained, and that a smaller model can match the performance of a larger one if it is trained on a larger dataset.
Strategically developing smaller models trained on larger datasets leads to better resource utilization and lower processing time for the overall model
Such compute-optimal models achieve similar, if not better, results than larger models trained in a non-optimal way
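The unit is easy to work out by hand, and a short sketch makes the arithmetic explicit. The function names are illustrative, and the hardware throughput passed in is whatever sustained rate you assume for your own setup.

```python
PETA = 10**15                    # one quadrillion
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def petaflop_s_days_to_flops(days_at_one_petaflop):
    """Total floating point operations in the given number of petaFLOP/s-days."""
    return days_at_one_petaflop * PETA * SECONDS_PER_DAY


def training_days(total_flops, sustained_flops_per_second):
    """Days needed to perform total_flops at a given sustained throughput."""
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


one_unit = petaflop_s_days_to_flops(1)
print(one_unit)                             # 86400000000000000000, i.e. 8.64e19
print(training_days(2 * one_unit, PETA))    # 2.0
```

So one petaFLOP/s-day is 8.64 x 10^19 floating point operations, regardless of how many chips or days are used to reach it.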
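A back-of-the-envelope sketch of compute-optimal sizing follows. Two figures here are assumptions that do not appear in these notes: the commonly cited reading of the Chinchilla results that the optimal training set is roughly 20 tokens per parameter, and the usual approximation that training costs about 6 floating point operations per parameter per token. Both are rules of thumb rather than exact laws.

```python
TOKENS_PER_PARAM = 20            # assumed rule of thumb from the Chinchilla results
FLOPS_PER_PARAM_PER_TOKEN = 6    # assumed approximation: compute ~ 6 * N * D


def compute_optimal_tokens(num_params):
    """Approximate number of training tokens for a compute-optimal model."""
    return TOKENS_PER_PARAM * num_params


def training_flops(num_params, num_tokens):
    """Approximate total training compute in floating point operations."""
    return FLOPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


def is_undertrained(num_params, num_tokens):
    """True if the model saw fewer tokens than the rule of thumb suggests."""
    return num_tokens < compute_optimal_tokens(num_params)


params_70b = 70_000_000_000
tokens = compute_optimal_tokens(params_70b)
print(tokens)                                # 1400000000000 (1.4 trillion)
print(training_flops(params_70b, tokens))    # 588 * 10**21, about 5.9e23
print(is_undertrained(175_000_000_000, 300_000_000_000))  # True
```

Under these assumptions, a 175B-parameter model trained on 300B tokens would count as under-trained, which is the kind of case the notes describe as over-parameterized.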
ZeRO offers three optimization stages.
ZeRO Stage 1 shards only the optimizer states across GPUs; this can reduce your memory footprint by up to a factor of four.
ZeRO Stage 2 also shards the gradients across chips. When applied together with Stage 1, this can reduce your memory footprint by up to eight times.
Finally, ZeRO Stage 3 shards all components, including the model parameters, across GPUs.
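The "factor of four" and "eight times" figures can be reproduced with a small calculation. The byte counts below are an assumption: they follow the mixed-precision accounting used in the ZeRO paper (2 bytes for FP16 weights, 2 bytes for FP16 gradients, 12 bytes of optimiser state per parameter) and so differ from the full 32-bit figures used earlier in these notes. The function name is made up for illustration.

```python
# Assumed mixed-precision accounting per parameter (from the ZeRO paper, not these notes).
PARAM_BYTES = 2    # FP16 weights
GRAD_BYTES = 2     # FP16 gradients
OPTIM_BYTES = 12   # FP32 weight copy plus two Adam states


def zero_bytes_per_param(stage, num_gpus):
    """Per-GPU bytes per model parameter for ZeRO stage 0 (no sharding) to 3."""
    params, grads, optim = PARAM_BYTES, GRAD_BYTES, OPTIM_BYTES
    if stage >= 1:
        optim /= num_gpus    # Stage 1: shard optimiser states
    if stage >= 2:
        grads /= num_gpus    # Stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus   # Stage 3: also shard the weights
    return params + grads + optim


baseline = zero_bytes_per_param(0, 64)
for stage in (1, 2, 3):
    per_gpu = zero_bytes_per_param(stage, 64)
    print(f"stage {stage}: {per_gpu} bytes/param, {baseline / per_gpu:.1f}x smaller")
```

With 64 GPUs the reductions come out just under 4x and 8x for Stages 1 and 2, approaching those limits as the GPU count grows, while Stage 3 scales linearly with the number of GPUs.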
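The last model shown on the slide is BloombergGPT
Pretraining for domain adaptation
Starting from an existing LLM gets you to a working prototype faster, but when the target domain uses highly idiosyncratic language, you may need to pretrain on a domain-specific dataset to get good model performance.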