Generative AI Project Life Cycle

9 min read · Jul 13, 2023

From Conception to Launch with Integrated LLM Capabilities

The basic life cycle of a generative AI project has four core stages:

Scope: define the problem you want to solve with an LLM and decide how the model should work in your application
Select: choose an existing pretrained model, or pretrain your own from scratch
Adapt and align model: adapt the model with prompt engineering and fine-tuning, align it with human feedback, and evaluate the output
Application integration: optimize and deploy the model for inference, and build LLM-powered applications

An LLM is a deep statistical representation of language, learned from huge text corpora. During pre-training the model internalizes the patterns in the text while its weights are updated to minimize the training loss over the tokens, which takes a large amount of GPU compute.

Define the scope accurately and narrow down the use case
Select an existing model or build one from scratch, and estimate feasibility
Evaluate performance, and carry out additional training if necessary
Use in-context learning (prompt engineering) if required
Fine-tune the model with supervised learning
Align the model with human preferences
Use reinforcement learning from human feedback (RLHF) for alignment
The adapt-and-align stage is highly iterative
Deployment may bring additional infrastructure requirements

Model hubs let you browse model cards, which help you find the model that suits your use case. Each model card describes the tasks the model can carry out in your application, how the model was trained, and its known limitations. A typical card includes:
- Model Details
- Uses
- Bias, Risks and Limitations
- Training Details
- Evaluation

How Are Large Language Models Trained?

Pre-training an LLM:
- The model learns from unstructured textual data collected by web scraping and from other data sources and corpora assembled for training language models. Datasets usually range from gigabytes to terabytes or even petabytes.
- During training, the model weights are updated to minimize the loss of the training objective.
- How many of the patterns in the data the model can capture depends on its architecture.
- The raw dataset has to be curated to raise its quality, address bias, and remove harmful content before it is used for training.
- After the data quality filter, often only about 1–3% of the original tokens are used for pre-training.
Model architecture and pre-training objectives: there are three variants of the Transformer model: encoder-only, decoder-only, and encoder-decoder. Each variant is trained with a different objective and therefore suits different tasks, so it helps to build an intuition for which variant to use for which task.
Autoencoding models: encoder-only LLMs
- Pretrained using Masked Language Modeling (MLM)
- Objective: reconstruct the masked text ("denoising")
- Builds a bidirectional representation from the full context of each token, so it is ideally suited to tasks that benefit from bidirectional context
- Good use cases:
- Sentiment analysis
- Named entity recognition
- Word classification
For example:
- BERT
- RoBERTa
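
To make the masking objective concrete, here is a minimal sketch of how an MLM training example could be built. The function name `mask_tokens`, the `[MASK]` string, and whitespace tokenization are illustrative assumptions; real pipelines such as BERT's use subword tokenizers and a more elaborate replacement recipe.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed name)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_input, labels) for one masked-language-modeling example.

    labels holds the original token at masked positions and None elsewhere,
    so a loss would only be computed on the tokens the model must reconstruct.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # ...and remember it as the target
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the teacher teaches the student".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the tokens on both sides of every mask, the representation it learns is bidirectional.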
Autoregressive models: decoder-only LLMs
- Pretrained using Causal Language Modeling (CLM)
- Objective: predict the next token based on the previous sequence of tokens, which researchers also call full language modeling
- Context is unidirectional: the model builds its statistical representation only from the tokens that come before
- Good use cases:
- Text generation
- Other emergent behavior, depending on the model size
For example:
- GPT
- BLOOM
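
The next-token objective can be sketched as follows. This is only an illustration of how one sentence yields several training pairs; the helper name `causal_lm_examples` and the whitespace tokenization are assumptions.

```python
def causal_lm_examples(tokens):
    """Build (context, next_token) pairs for causal language modeling.

    Each context contains only the tokens to the left of the target,
    which is what makes the objective unidirectional.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the teacher teaches the student".split()
examples = causal_lm_examples(tokens)
for context, target in examples:
    print(context, "->", target)
```

A sentence of n tokens gives n - 1 prediction targets, so every position in the text contributes to the training loss.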
Sequence-to-sequence models: encoder-decoder LLMs
- Pretrained using span corruption
- Spans of random input tokens are masked
- Each masked span is replaced by a sentinel token, a special token that does not correspond to any actual word
- Good use cases:
- Translation
- Text summarization
- Question answering
For example:
- Flan-T5
- BART
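
Span corruption can be sketched like this. The spans are chosen by hand here instead of being sampled at random, and the helper name `span_corrupt` is hypothetical. The `<extra_id_N>` sentinel naming follows the convention of the T5 tokenizer, and the sketch omits the closing sentinel that T5 appends to the target.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (encoder_input, decoder_target). spans must be sorted and
    non-overlapping; end is exclusive.
    """
    enc, dec = [], []
    pos = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        enc.extend(tokens[pos:start])  # keep the text before the span
        enc.append(sentinel)           # the span collapses to one sentinel
        dec.append(sentinel)           # the target names the sentinel...
        dec.extend(tokens[start:end])  # ...followed by the hidden tokens
        pos = end
    enc.extend(tokens[pos:])
    return enc, dec

tokens = "the teacher teaches the student".split()
enc, dec = span_corrupt(tokens, spans=[(1, 3)])
print(enc)  # ['the', '<extra_id_0>', 'the', 'student']
print(dec)  # ['<extra_id_0>', 'teacher', 'teaches']
```

The encoder reads the corrupted input and the decoder is trained to generate the target sequence, reconstructing the masked spans autoregressively.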

Brief overview of model architectures and pre-training objectives

Things to keep in mind when choosing a model:
Larger models are generally able to carry out more tasks well, often without additional in-context learning or further training.

Comparison of model architectures and pre-training objectives:
Increasing the model size and the number of parameters tends to increase the model's capabilities.

Growth powered by:

Introduction of the highly scalable Transformer architecture
Access to massive datasets
More powerful compute resources

Researchers have hypothesized the existence of a new Moore's law for LLMs:

“Adding parameters to make models smarter”

However, training ever larger models with ever more parameters is difficult and expensive.

Computational challenges of training LLMs

Training LLMs often ends in an error message:
“CUDA out of memory”
CUDA (Compute Unified Device Architecture) is a collection of libraries and tools developed for NVIDIA GPUs to perform large calculations.
PyTorch and TensorFlow use CUDA to boost the performance of matrix multiplication and other operations common in deep learning.
- LLMs need tons of memory to store and train all of their parameters

Calculating the approximate GPU RAM needed to store 1B parameters
To develop an intuition for the scale of the problem:
- 1 parameter = 4 bytes (32-bit float)
- 1B parameters = 4 × 10^9 bytes = 4 GB at 32-bit full precision
- This figure covers storing the model weights only
- Training requires additional memory on top of the weights

Model parameters (weights): 4 bytes per parameter
Adam optimizer (2 states): +8 bytes per parameter
Gradients: +4 bytes per parameter
Activations and temporary memory (variable size): +8 bytes per parameter (high-end estimate)

Total: 4 bytes per parameter for the weights + 20 extra bytes per parameter = 24 bytes per parameter

Memory needed to train the model ≈ 6 × the memory needed to store the model

Memory requirement: a 1B-parameter model with 4 GB of weights needs roughly 24 GB of GPU RAM to train at 32-bit full precision
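
The per-parameter figures listed above (4 + 8 + 4 + 8 bytes) can be summed in a few lines. This is a back-of-the-envelope sketch: the dictionary keys and the helper `memory_gb` are illustrative names, and the activation figure is the high-end estimate from the list, so real usage will vary.

```python
# Bytes per parameter at 32-bit full precision, taken from the list above.
BYTES_PER_PARAM = {
    "weights": 4,
    "adam_states": 8,
    "gradients": 4,
    "activations_temp": 8,  # high-end estimate
}

def memory_gb(num_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for a given per-parameter byte cost."""
    return num_params * bytes_per_param / 1e9

num_params = 1_000_000_000
weights_gb = memory_gb(num_params, BYTES_PER_PARAM["weights"])
training_gb = memory_gb(num_params, sum(BYTES_PER_PARAM.values()))
print(f"weights only: {weights_gb:.0f} GB")
print(f"training:     {training_gb:.0f} GB")
```

With these numbers the sum is 24 bytes per parameter, so training a 1B-parameter model takes about 24 GB, six times the 4 GB needed for the weights alone.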

Quantization reduces the memory needed to store the model weights by lowering their precision from 32-bit floating point (FP32) to 16-bit floating point (FP16 or BFLOAT16) or to 8-bit integers (INT8).

A 32-bit floating point number takes 4 bytes of memory to store one value. A 16-bit floating point number takes 2 bytes, and an 8-bit integer takes 1 byte.

As we quantize, both the model size and the precision are reduced.

Quantizing from 32-bit full precision to 8-bit precision shrinks the 4 GB model to 1 GB, at the cost of some loss of precision.
The ultimate goal is to reduce the memory required to store and train the model.
Quantization statistically projects the original 32-bit floating point numbers into a lower-precision space using scaling factors. Quantization-aware training (QAT) learns the scaling factors during training.
BFLOAT16 is a popular choice: it halves the memory footprint by storing each parameter in half as many bits, which helps fit the model into GPU memory. The number of parameters stays the same.
Full precision model → 16-bit quantized model → 8-bit quantized model: 4 GB → 2 GB → 1 GB.
32-bit full precision → 16-bit half precision → 8-bit precision gives the same degree of savings when training.
Even so, models much larger than 1B parameters at 32-bit precision need more memory than a single GPU provides, so training them has to be spread across multiple GPUs.
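
As a rough illustration of both ideas, the sketch below computes the weight footprint per format and applies a simple symmetric 8-bit quantization with a single scaling factor. The function names are hypothetical, and symmetric per-tensor scaling is just one simple scheme chosen for clarity, not necessarily what any particular library does.

```python
BITS = {"FP32": 32, "FP16": 16, "BFLOAT16": 16, "INT8": 8}

def weights_gb(num_params, fmt):
    """Memory in GB (10^9 bytes) to store the weights in the given format."""
    return num_params * BITS[fmt] / 8 / 1e9

def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Project the integers back to (approximate) floats."""
    return [x * scale for x in q]

for fmt in BITS:
    print(fmt, weights_gb(1_000_000_000, fmt), "GB")

values = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(values)
restored = dequantize(q, scale)
print(q, restored)
```

The restored values differ slightly from the originals; that rounding error is the loss of precision traded for the smaller footprint.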

Fine tuning your model

Data parallelism allows for the use of multiple GPUs to process different parts of the same data simultaneously, speeding up training time.

Scale model training across multiple GPUs to speed up training, and do so in an efficient way.

When the model fits on a single GPU, batches of data can be processed in parallel with PyTorch DDP (Distributed Data Parallel):

1. Copies the model to each GPU

2. Sends a different batch of data to each GPU

3. Each GPU runs the forward and backward pass on its batch in parallel

4. A synchronization step combines the gradients from all GPUs

5. Every copy of the model is updated identically with the combined gradients

Because each GPU holds a full copy of the model, every GPU also needs room for the gradients and the optimizer states.
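
The data-parallel steps above can be simulated without any GPU. The sketch below is plain Python, not PyTorch DDP itself: a one-weight linear model stands in for the network, list slices stand in for per-GPU batches, and a simple average stands in for the gradient synchronization. All names are illustrative.

```python
def grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def ddp_step(w, data, num_gpus, lr=0.01):
    """One simulated data-parallel step with equal-sized shards.

    Assumes len(data) is divisible by num_gpus.
    """
    shard = len(data) // num_gpus
    shards = [data[i * shard:(i + 1) * shard] for i in range(num_gpus)]
    local_grads = [grad(w, s) for s in shards]  # each replica uses its own shard
    avg_grad = sum(local_grads) / num_gpus      # stands in for the synchronization
    return w - lr * avg_grad                    # identical update on every replica

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
w_parallel = ddp_step(0.0, data, num_gpus=4)
w_single = 0.0 - 0.01 * grad(0.0, data)
print(w_parallel, w_single)
```

With equal-sized shards the averaged gradient matches the gradient over the whole batch, which is why the replicas stay in sync while each one only processes a fraction of the data.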

Model Sharding

FSDP: Fully Sharded Data Parallel

ZeRO stands for Zero Redundancy Optimizer: zero data overlap between GPUs. Of all the memory components required for training LLMs, the largest is the optimizer states, which take up twice as much space as the weights, followed by the weights themselves and the gradients. The goal of ZeRO is to optimize memory by distributing (sharding) the model states across GPUs with zero data overlap.

Model states: weights, optimizer states, gradients
- DDP (Distributed Data Parallel) keeps a full model copy on each GPU, which leads to redundant memory consumption
- FSDP (Fully Sharded Data Parallel) distributes the model parameters, gradients, and optimizer states across GPUs instead of keeping a copy on each GPU, so it also works for models too big to fit on a single chip
- ZeRO Stages 1, 2, and 3 shard progressively more of the model states, with Stage 3 including the model parameters
- Before each forward and backward pass, the sharded weights are collected from all GPUs
- This is a performance vs. memory trade-off
- FSDP helps reduce overall GPU memory utilization
- FSDP supports offloading to the CPU if needed
- The level of sharding is configured via the sharding factor
- A sharding factor of 1 means full replication with no sharding
- Sharding across GPUs saves memory at the cost of more communication, so smaller models tend to perform better with less sharding
Data parallelism is a strategy that splits the training data across multiple GPUs. Each GPU processes a different subset of the data simultaneously, which can greatly speed up the overall training time

Scaling laws and compute-optimal models

Goal: maximize model performance on the learning objective by minimizing the loss

Scaling choice 1: dataset size (number of tokens)

Scaling choice 2: model size (number of parameters)

Constraint: compute budget (GPUs, training time, cost)

Unit of compute

A petaFLOP per second day (petaFLOP/s-day) is the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day. Note that one petaFLOP per second corresponds to one quadrillion (10^15) floating point operations per second.
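
The unit is easy to convert by hand. The sketch below also includes a rough training-cost estimate of 6 × parameters × tokens; that rule of thumb is a commonly used approximation and an assumption here, not a figure from this article. Function names are illustrative.

```python
PETA = 1e15                      # one quadrillion
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Floating point operations in one petaFLOP/s-day.
FLOPS_PER_PFLOPS_DAY = PETA * SECONDS_PER_DAY

def pflops_days(total_flops):
    """Convert a total FLOP count to petaFLOP/s-days."""
    return total_flops / FLOPS_PER_PFLOPS_DAY

def approx_training_flops(num_params, num_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token (assumed)."""
    return 6 * num_params * num_tokens

flops = approx_training_flops(1_000_000_000, 20_000_000_000)
print(f"{FLOPS_PER_PFLOPS_DAY:.3e} FLOPs per petaFLOP/s-day")
print(f"{pflops_days(flops):.2f} petaFLOP/s-days")
```

Expressing a training run in petaFLOP/s-days makes it possible to compare compute budgets independently of the specific hardware used.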

The compute budget is bounded by the project timeline and the financial budget.

Impact of training dataset size and model size on performance

As the volume of training data increases, performance improves.

As model size increases, test loss decreases, that is, performance improves.

Compute-optimal models for pretraining

Chinchilla paper: compute-optimal performance of LLMs

The paper asks what the optimal number of parameters is for a given compute budget. It suggests that many very large models may actually be over-parameterized and under-trained: a smaller model can reach the same performance as a larger one if it is trained on a larger dataset.

Teams have therefore started to develop smaller models trained on larger datasets, which makes better use of resources and reduces the processing time of the overall model.

These models achieve similar, if not better, results than larger models trained in a non-optimal way.
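
A rough sketch of what compute-optimal sizing could look like is given below. It assumes the rule of thumb commonly attributed to the Chinchilla paper, about 20 training tokens per parameter, together with the approximation that training costs about 6 × parameters × tokens FLOPs. Both are assumptions for illustration rather than figures from this article, and the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # rough rule of thumb (assumed)

def compute_optimal_tokens(num_params):
    """Training tokens suggested for a model of the given size."""
    return TOKENS_PER_PARAM * num_params

def compute_optimal_split(total_flops):
    """Split a FLOP budget C into (params, tokens).

    Assumes C ~ 6 * N * D and D = 20 * N, so N = sqrt(C / 120).
    """
    n = math.sqrt(total_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

print(compute_optimal_tokens(70_000_000_000))  # tokens for a 70B model
params, tokens = compute_optimal_split(5.88e23)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

Under these assumptions a 70B-parameter model calls for about 1.4 trillion training tokens, which is far more data per parameter than many earlier large models were trained on.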

ZeRO offers three optimization stages.

ZeRO Stage 1 shards only the optimizer states across GPUs. This can reduce your memory footprint by up to a factor of four.

ZeRO Stage 2 also shards the gradients across chips. When applied together with Stage 1, this can reduce your memory footprint by up to eight times.

Finally, ZeRO Stage 3 shards all components, including the model parameters, across GPUs.
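
The "up to four" and "up to eight" figures can be reproduced with a small calculation. Note the assumption: this sketch uses a mixed-precision accounting of 2 bytes for weights, 2 bytes for gradients, and 12 bytes for optimizer states per parameter, which differs from the 32-bit full-precision byte counts used earlier in this article. Activations are ignored, and the function names are illustrative.

```python
# Assumed bytes per parameter under mixed-precision Adam training.
PARAM_BYTES = 2    # 16-bit weights
GRAD_BYTES = 2     # 16-bit gradients
OPTIM_BYTES = 12   # 32-bit weight copy + two Adam states

def bytes_per_param_per_gpu(stage, num_gpus):
    """Per-GPU bytes per parameter for ZeRO stage 0 (none) to 3."""
    params, grads, optim = PARAM_BYTES, GRAD_BYTES, OPTIM_BYTES
    if stage >= 1:
        optim /= num_gpus   # Stage 1: shard the optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: also shard the gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: also shard the parameters
    return params + grads + optim

def reduction_factor(stage, num_gpus):
    """How many times smaller the per-GPU footprint is than with no sharding."""
    return bytes_per_param_per_gpu(0, num_gpus) / bytes_per_param_per_gpu(stage, num_gpus)

for stage in (1, 2, 3):
    print(stage, round(reduction_factor(stage, num_gpus=64), 2))
```

With 64 GPUs the factors come out just under 4 and 8 for Stages 1 and 2, while the Stage 3 saving grows in proportion to the number of GPUs.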

The last model shown on the slide is BloombergGPT, a large language model pretrained for the finance domain.

Pretraining for domain adaptation

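Starting from an existing model gets you to a working prototype faster. When the target domain uses highly idiosyncratic language, however, pretraining on a dataset from that domain may be needed for good model performance.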

Written by kanika adik