Generative AI — An intro for product managers

Shiv Viswanathan
4 min read · Dec 8, 2023


What is Generative AI?

Generative artificial intelligence (also generative AI or GenAI[1]) is artificial intelligence capable of generating text, images, or other media, using generative models. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics. Source: Wikipedia

Generative AI applications are multimodal in nature, spanning image, text, audio, video, and even decision-based algorithms. They all follow the same Prompt -> Response pattern: the user supplies a prompt, and the model generates a response.

Source — Introduction to Generative AI — Google Cloud — Youtube

Attention Is All You Need! — An Overview of Text Generation

High-level system diagram of encoder-decoder-based transformer architecture

The now-famous 2017 paper from Google Brain introduced the concept of Transformers. The Transformer’s encoder-decoder architecture, featuring self-attention and multi-head attention mechanisms, revolutionized natural language processing. The encoder processes input sequences, while the decoder generates output sequences. Attention mechanisms weigh the importance of tokens, enabling the model to capture context and dependencies, making it highly effective for tasks like translation and text generation. Unlike earlier models built on RNNs or CNNs, the Transformer relies solely on self-attention, so each word’s importance relative to the rest of the sentence is considered when encoding it. This approach sped up training and enabled larger models on bigger datasets, resulting in major NLP advancements in text generation.
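The core of this is scaled dot-product attention, which can be sketched in a few lines. In the sketch below, random vectors stand in for learned query/key/value projections, so the numbers are illustrative only:

```python
# A minimal sketch of scaled dot-product self-attention, the heart of the
# Transformer. Q, K, V follow the paper's naming; sizes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))             # 4 tokens, 8-dim embeddings
out = self_attention(X, X, X)           # self-attention: Q = K = V = X
print(out.shape)                        # (4, 8)
```

Each output row is a blend of all token vectors, weighted by how relevant each token is to the one being encoded — that weighting is what lets the model capture long-range context.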

OpenAI published the paper “Improving Language Understanding by Generative Pre-Training,” in which they defined a decoder-only architecture to create GPT.

GPT-Decoder Only Architecture

How do you build a chatbot like ChatGPT? The pre-trained GPT model is refined through additional training stages, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to get a stable product like ChatGPT. Let’s explore what each of these stages means.

Source: OpenAI-Ouyang, Long, et al. “Training language models to follow instructions with human feedback.”

Supervised Fine-Tuning — As shown in the above diagram, SFT is nothing but providing the unsupervised pre-trained LLM with a smaller set of labeled data in order to fine-tune the results. This labeled data may come either from humans or from another fine-tuned LLM.
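Under the hood, SFT is ordinary supervised learning: a cross-entropy loss against the human-labeled target tokens. A toy sketch, where a hand-built probability table stands in for a real model’s per-token output:

```python
# A minimal sketch of the SFT objective: cross-entropy against labeled tokens.
# The "model output" here is a hypothetical table, not a real LLM.
import numpy as np

def cross_entropy(probs, target_ids):
    # average negative log-likelihood of the labeled target tokens
    return -np.mean(np.log([p[t] for p, t in zip(probs, target_ids)]))

vocab = 5
probs = np.full((3, vocab), 0.1)        # 3 positions over a 5-token vocab
probs[[0, 1, 2], [2, 0, 4]] = 0.6       # model favors the labeled tokens
target_ids = [2, 0, 4]                  # tokens from a human-written response
loss = cross_entropy(probs, target_ids)
print(loss)
```

Fine-tuning simply pushes this loss down on the labeled examples, which is why performance depends so heavily on how many human-written examples are available (a point the article returns to below).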

Reinforcement Learning from Human Feedback — This introduces a further training stage built around a reward model that prefers certain answers over others. Human labelers rank candidate responses, and those preferences are used to train the reward model to score one answer above another. The LLM is then fine-tuned with reinforcement learning to maximize that reward, while also making sure its responses stay close to the policy pre-defined by the humans.
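The reward model is typically trained on pairwise comparisons: given a human-preferred and a rejected answer, it should score the preferred one higher. A minimal sketch of that preference loss, with hypothetical scalar scores standing in for real reward-model outputs:

```python
# A minimal sketch of the pairwise preference loss used to train an RLHF
# reward model. The scores below are hypothetical, not from a real model.
import math

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the ranking is right
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# reward-model scores for a human-preferred vs. a rejected answer
print(preference_loss(2.0, 0.5))  # correct ranking: small loss
print(preference_loss(0.5, 2.0))  # inverted ranking: large loss
```

Minimizing this loss over many labeled comparisons teaches the reward model the human preference signal that the reinforcement-learning step then optimizes against.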

What is Retrieval Augmented Generation (RAG)? RAG is nothing but providing LLMs with a set of additional source data from which they can retrieve information before generating an answer for the user. This improves the robustness and transparency of the model.
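The retrieval step can be sketched simply: embed the question, find the closest source passage, and prepend it to the prompt. The bag-of-words “embedding” below is a toy stand-in for a real embedding model:

```python
# A minimal sketch of RAG retrieval: pick the passage most similar to the
# question and build an augmented prompt. Bag-of-words cosine similarity
# stands in for a real embedding model; the docs are toy examples.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

docs = [
    "the transformer uses self attention",
    "stable diffusion adds noise then denoises",
]
question = "how does stable diffusion work"
best = max(docs, key=lambda d: cosine(embed(question), embed(d)))
prompt = f"Context: {best}\nQuestion: {question}\nAnswer:"
print(prompt)
```

The LLM then answers from the retrieved context rather than from memory alone, which is what makes the response both more grounded and more auditable.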

The Problems — Supervised fine-tuning performance plateaus after a few thousand human-written examples. Also, the amount of data needed to efficiently train the reward model or the PPO policy under RLHF is much higher than for supervised fine-tuning, which means a higher computation cost.

Evolution of LLMs

Andrej Karpathy predicts that LLMs will evolve beyond chatbots and text generators into an operating system with access to I/O, memory, tools, and the Internet.

Source: Andrej Karpathy — Youtube

How do Generative Image models work? — Similar to a paint drop diffusing in water!

The most common model is the Generative Adversarial Network (GAN), which is built around a Generator and a Discriminator. Random noise (e.g. from a uniform distribution) is fed to the Generator to produce a sample, and the Discriminator compares that sample against real images. Training tracks two loss functions: the Generator tries to reduce its loss by producing samples the Discriminator mistakes for real, while the Discriminator tries to reduce its own loss by telling real from fake. Through backpropagation, the generated images become closer and closer to real images. The trained model can then be used to generate new images from input images or text.

Source: Google Developer training — GAN
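The tug-of-war between the two losses can be sketched numerically. The discriminator outputs below are hypothetical probabilities, not a trained network:

```python
# A minimal sketch of the two GAN losses. D(x) is the discriminator's
# probability that a sample is real; the numbers are hypothetical outputs.
import math

def d_loss(d_real, d_fake):
    # discriminator wants d_real -> 1 and d_fake -> 0
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    # generator wants the discriminator fooled: d_fake -> 1
    return -math.log(d_fake)

# early in training: fakes are easy to spot
print(d_loss(0.9, 0.1), g_loss(0.1))
# later: fakes look more real, so the generator's loss falls
print(d_loss(0.6, 0.4), g_loss(0.4))
```

As the Generator improves, its loss drops while the Discriminator finds it harder to separate real from fake, which is exactly the adversarial dynamic described above.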

Stable Diffusion, on the other hand, works by adding noise to the original image in multiple steps and then training a model to de-noise it back to the original image. The noise predictor is a U-Net, and the encoder-decoder that maps images to and from latent space is a VAE (Variational Autoencoder).

Source: https://stable-diffusion-art.com/how-stable-diffusion-work/
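The forward (noising) half of this process is exactly the paint-drop analogy from the heading: keep blending in small amounts of Gaussian noise until the structure dissolves. A toy sketch, with a 1-D array standing in for image pixels and an illustrative noise schedule:

```python
# A minimal sketch of the forward diffusion process: repeatedly blend a
# little Gaussian noise into an "image" until only noise remains. The 1-D
# array and the fixed beta are toy stand-ins for real pixels and schedules.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.05                      # noise fraction per step (illustrative)
x = np.linspace(-1.0, 1.0, 8)    # toy "image"

for _ in range(50):
    noise = rng.normal(size=x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise

print(x)                         # structure destroyed, shape preserved
```

The U-Net is trained to predict the noise added at each step, so image generation runs this process in reverse: start from pure noise and de-noise step by step.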
