Generative AI Project Life Cycle

kanika adik
9 min read · Jul 13, 2023


Conception to Launch with Integrated LLM Capabilities

The basic project life cycle of a generative AI application has four core stages:

  • Scope: define the problem you want to solve with an LLM and determine how the model should work in your application
  • Select: choose an existing pretrained model, or pretrain your own from scratch
  • Adapt and align model: adapt the model with prompt engineering and fine-tuning, align it with human preferences, and evaluate the output
  • Application integration: optimize and deploy the model for inference, and build LLM-powered applications
Generative AI Project Life Cycle

An LLM is a deep statistical representation of language, learned from huge text corpora. During pretraining the model internalizes the patterns in that text, and its weights are updated to minimize the training loss on each token, which takes a large amount of GPU compute.

  • Define the scope accurately and narrow down the use case
  • Select an existing model or build one from scratch, and estimate feasibility
  • Evaluate Performance, carry out additional training if necessary
  • Use in-context learning (prompt engineering) if required
  • Fine-tune the model with supervised learning
  • Fine-tune the model to align with human preferences
  • Reinforcement learning from human feedback (RLHF)
  • Decide how to align the model to your preferences
  • The adapt-and-align stage is highly iterative
  • Plan for additional infrastructure requirements at deployment

Model hubs let you browse model cards, which help you work out which model suits your use case. Each model card describes the tasks the model can carry out in your application and how the model was trained, typically under these headings:
- Model Details
- Uses
- Bias, Risks and Limitations
- Training Details
- Evaluation

How Are Large Language Models Trained?

  • Pre-training an LLM
    - The model learns from unstructured textual data collected by web scraping and from other data sources and corpora assembled for training language models; these datasets usually run from gigabytes to terabytes or petabytes
    - During training, the model weights are updated to minimize the loss of the training objective
    - How many patterns the model can capture depends on its architecture
    - The dataset must be filtered to raise data quality before training, to address bias and remove harmful content
    - After data-quality filtering, often only about 1–3% of the original tokens are used for pretraining
  • Model architecture and pretraining objectives: the Transformer comes in three variants: encoder-only, decoder-only, and encoder-decoder. Each variant is pretrained with a different objective and so suits different tasks, which gives you an intuition for which kind of model to use for which task.
  • Autoencoding models: encoder-only LLMs
    - Pretrained using Masked Language Modeling (MLM)
    - Objective: reconstruct the original text (“denoising”)
    - Builds a bidirectional representation with the full context of each token, so it suits tasks that benefit from bidirectional context
    - Good use cases:
      - Sentiment analysis
      - Named entity recognition
      - Word classification
    - For example:
      - BERT
  • Autoregressive models: decoder-only LLMs
    - Pretrained using Causal Language Modeling (CLM)
    - Objective: predict the next token from the previous sequence of tokens; researchers sometimes call this full language modeling
    - Context is unidirectional: the model sees only the tokens that come before the one it predicts
    - Good use cases:
      - Text generation
      - Other emergent behaviour, depending on model size
    - For example:
      - GPT
      - BLOOM
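  • Sequence-to-sequence models: encoder-decoder LLMs
    - Pretrained using span corruption
    - Random spans of input tokens are masked
    - Each masked span is replaced by a sentinel token, a special token that does not correspond to any actual word
    - Objective: reconstruct the masked spans
    - Good use cases:
      - Translation
      - Text summarisation
      - Question answering
    - For example:
      - FLAN-T5
      - BART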
Summary of model architectures and pretraining objectives
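
To make the three pretraining objectives concrete, here is a minimal sketch that builds one training example of each kind from the same sentence. It is plain Python with no model involved; the function names, the `<MASK>` and `<X>` token strings, and the example sentence are assumptions for illustration, not the vocabulary of any particular model.

```python
# Illustrative only: shows how inputs and targets are arranged for each
# pretraining objective. Token strings such as "<MASK>" and "<X>" are
# hypothetical placeholders, not real model vocabulary.

SENTENCE = "the teacher teaches the student".split()


def masked_lm_example(tokens, mask_index):
    """Encoder-only (MLM): hide one token; the model sees context on both sides."""
    inputs = list(tokens)
    inputs[mask_index] = "<MASK>"
    return inputs, tokens[mask_index]


def causal_lm_examples(tokens):
    """Decoder-only (CLM): predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def span_corruption_example(tokens, start, length):
    """Encoder-decoder: replace a span with a sentinel token; the decoder
    target is the sentinel followed by the tokens that were hidden."""
    sentinel = "<X>"
    inputs = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length]
    return inputs, target


mlm_inputs, mlm_target = masked_lm_example(SENTENCE, mask_index=2)
clm_pairs = causal_lm_examples(SENTENCE)
s2s_inputs, s2s_target = span_corruption_example(SENTENCE, start=1, length=2)

print("MLM:", mlm_inputs, "->", mlm_target)
for context, next_token in clm_pairs:
    print("CLM:", context, "->", next_token)
print("Span corruption:", s2s_inputs, "->", s2s_target)
```

In a real pipeline the masked positions and spans are chosen at random and the tokens come from the model's tokenizer; fixed indices are used here only so the output is easy to follow.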

Things to keep in mind when choosing a model:
Larger models are generally better at carrying out tasks well.

Comparison of model architecture and pretraining objectives:
Models with more parameters tend to be more capable, often without additional in-context learning or further training.

This growth in model size has been powered by:

  • Introduction of the highly scalable Transformer architecture
  • Access to massive datasets
  • More powerful compute resources

This growth has led researchers to hypothesize the existence of a new Moore’s law for LLMs.

“Adding parameters to make models smarter”

Training models with huge numbers of parameters is difficult and expensive, so continuously training ever-larger models may not be feasible.

  • Computational challenges of training LLMs

Training LLMs often ends in the error message:
“CUDA out of memory”
CUDA (Compute Unified Device Architecture) is a collection of libraries and tools developed for Nvidia GPUs.
PyTorch and TensorFlow use CUDA to boost performance on matrix multiplication and other operations common in deep learning.
LLMs need tons of memory to store and train all of their parameters.

Calculating the approximate GPU RAM needed to store 1B parameters
To develop an intuition for the scale of the problem:
- 1 parameter = 4 bytes (32-bit float)
- 1B parameters = 4 × 10^9 bytes = 4 GB at 32-bit full precision
- This figure covers storing the model weights only
- Training requires additional memory on top of the weights

  • Model parameters (weights): 4 bytes per parameter
  • Adam optimiser (2 states): +8 bytes per parameter
  • Gradients: +4 bytes per parameter
  • Activations and temporary memory (variable size): +8 bytes per parameter (high-end estimate)

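Total = 4 bytes per parameter + 20 extra bytes per parameter

Training needs far more memory than storing the model: the itemized figures above come to about 6× the weights, and the rule of thumb used here is about 20×.

Memory requirement: 4 GB of model weights requires up to about 80 GB of GPU RAM to train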
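
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation using the per-parameter figures listed in this post; the dictionary keys and the helper name are made up for the example. Summing the itemized figures gives a smaller number than the 80 GB quoted above, and activation memory in particular varies with batch size and sequence length, so treat the result as a rough floor and not as a measurement.

```python
# Back-of-the-envelope GPU memory estimate using the per-parameter byte
# counts listed above. Names are illustrative; real usage depends on batch
# size, sequence length, and framework overhead.

BYTES_PER_PARAM = {
    "weights": 4,                # 32-bit float
    "adam_optimizer_states": 8,  # two states per parameter
    "gradients": 4,
    "activations_and_temp": 8,   # high-end estimate; highly variable
}


def gigabytes(num_params, bytes_per_param):
    """Convert a parameter count and per-parameter byte cost to GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 1_000_000_000  # a 1B-parameter model

weights_gb = gigabytes(num_params, BYTES_PER_PARAM["weights"])
itemized_training_gb = gigabytes(num_params, sum(BYTES_PER_PARAM.values()))

print(f"Weights only:       {weights_gb:.0f} GB")
print(f"Itemized training:  {itemized_training_gb:.0f} GB")
```

Changing `num_params` shows how quickly the requirement outgrows a single GPU as models get larger.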

Quantisation reduces the memory needed to store the model weights by lowering their precision from 32-bit floating point (FP32) to 16-bit floating point (FP16 or BFLOAT16), or further to 8-bit integers (INT8).

A 32-bit floating point number takes 4 bytes of memory to store one value; a 16-bit floating point number takes 2 bytes, and an 8-bit integer takes 1 byte.

As we quantise, both the model size and the precision are reduced
  • Quantising from 32-bit full precision to 8-bit precision shrinks a 4 GB model to 1 GB, at the cost of some precision
  • The ultimate goal is to reduce the memory needed to store and train the model
  • Quantisation statistically projects the original 32-bit floating point numbers into a lower-precision space using scaling factors; quantisation-aware training learns these scaling factors during training
  • BFLOAT16 is a popular choice: it halves the memory footprint, which helps fit the model into GPU memory. The saving comes from halving the bytes per parameter, not the number of parameters
  • Full-precision model, 16-bit quantised model, 8-bit quantised model: 4 GB, 2 GB, 1 GB
  • 32-bit full precision -> 16-bit half precision -> 8-bit precision: the same degree of savings applies to the memory needed for training
  • All of these figures are for a 1B-parameter model at 32-bit precision as the starting point; larger models need more GPU memory to train
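
Here is a small sketch of what the loss of precision looks like in practice, using only Python's standard library. The helper names are invented for the example. Two simplifications are assumed: the BFLOAT16 conversion simply truncates the low 16 bits of the FP32 encoding (real libraries usually round), and the INT8 step uses a simple symmetric scale taken from the largest absolute value, which is one common scheme among several.

```python
import struct

# Illustrative helpers; not the API of any deep learning library.


def to_fp16(x):
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Approximate BFLOAT16 by keeping the top 16 bits of the FP32 encoding.
    Assumption: truncation is used here; real implementations usually round."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def quantize_int8(values):
    """Symmetric INT8 quantisation: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale


def dequantize_int8(quantized, scale):
    """Recover approximate floating point values from INT8 codes."""
    return [q * scale for q in quantized]


pi = 3.141592
print("FP16:", to_fp16(pi))
print("BF16:", to_bf16(pi))

weights = [0.5, -1.27, 0.01, 0.8]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print("INT8 codes:", codes)
print("Restored:  ", restored)

# Memory for a 1B-parameter model at each precision, in GB (10^9 bytes).
BYTES = {"FP32": 4, "FP16": 2, "BFLOAT16": 2, "INT8": 1}
memory_gb = {name: 1_000_000_000 * size / 1e9 for name, size in BYTES.items()}
print(memory_gb)
```

The restored INT8 values differ from the originals by at most half a quantisation step, which is the precision traded away for the 4× smaller footprint.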

Fine tuning your model

Data parallelism allows for the use of multiple GPUs to process different parts of the same data simultaneously, speeding up training time.

Scaling model training across multiple GPUs speeds up training, provided it is done efficiently.

When the model fits on a single GPU, batches of data can be processed in parallel with PyTorch DDP (Distributed Data Parallel):

  1. Copies the model to each GPU

  2. Sends a different batch of data to each GPU

  3. Each GPU runs the forward and backward pass on its batch in parallel

  4. A synchronisation step combines the gradients from all GPUs, and every copy of the model is updated identically

  5. Each GPU must also hold the gradients and optimiser states needed for training, in addition to the weights
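
The idea behind these steps can be illustrated without any GPUs. The sketch below is a plain-Python simulation of synchronous data parallelism, not PyTorch's DDP API: the one-weight linear model, the toy batch, and the function names are all assumptions made for the example. It shows that averaging the gradients computed on equal-sized shards gives the same update as one pass over the whole batch.

```python
# Simulation of synchronous data parallelism on a toy model y = w * x.
# Each "worker" stands in for one GPU holding an identical copy of w.


def gradient(w, shard):
    """Gradient of mean squared error for y = w * x over one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, batch, num_workers, lr):
    """One training step: shard the batch, compute local gradients,
    average them (the role an all-reduce plays), and apply the same update.
    Assumes len(batch) is divisible by num_workers."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local_grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w0, lr = 0.0, 0.01

w_parallel = data_parallel_step(w0, batch, num_workers=2, lr=lr)
w_single = w0 - lr * gradient(w0, batch)

print("Two workers:", w_parallel)
print("One worker: ", w_single)
```

Because every replica applies the same averaged gradient, the copies of the model stay identical after each step, which is what lets the work be split across devices.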

Last model shown on the slide is BloombergGPT

Pretraining for domain adaptation

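Starting from an existing LLM gets you to a working prototype faster. When your domain uses highly idiosyncratic language, however, pretraining on a domain-specific dataset may be needed for good model performance.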

