Stories by Soumyadeep Saha on Medium

Embeddings Explained: From Sparse Representations to Transformer-Based Semantic Spaces

Soumyadeep Saha — Wed, 18 Feb 2026 05:22:17 GMT

Introduction: Why Embeddings Matter

Every modern AI system — from Google Search to ChatGPT, from recommendation engines to facial recognition — relies on a single powerful idea:

Represent complex objects as vectors in a continuous space.

This idea is called embedding.

But what does that actually mean?

The Core Problem

Computers do not understand:

· Words

· Images

· Graphs

· Users

· Products

They understand numbers.

If we want machines to reason about:

· The similarity between “dog” and “puppy”

· The relationship between “king” and “queen”

· Whether two images depict the same object

· Whether two users have similar preferences

We must convert these objects into numbers.

Not just any numbers — but numbers arranged in a way that preserves meaning.

A Simple Thought Experiment

Imagine we want to represent three words:

dog
cat
car

We want:

· dog close to cat

· dog far from car

If we place them randomly in space, this structure is lost.

But if we map them carefully into a geometric space:

Now distance encodes similarity.

This geometric representation is an embedding.

Why Geometry?

Geometry gives us:

· Distance → similarity

· Direction → relationships

· Clusters → semantic groups

· Linear transformations → analogies

For example:

king - man + woman ≈ queen

This works because embeddings transform symbolic relationships into geometric operations.

Meaning becomes direction in space.
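
To make the analogy concrete, here is a minimal sketch with hand-made 2-D vectors. The numbers are invented for illustration and do not come from a trained model: the first coordinate loosely stands for "royalty" and the second for "gender".

```python
import math

# Toy vectors invented for illustration: [royalty, gender]
vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [1.0,  0.8],
    "car":    [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

In real embeddings the relationship holds only approximately, which is why the nearest neighbour of the result is taken rather than an exact match.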

The Big Idea

An embedding is a function:

f : X → ℝ^d

that maps complex objects into a low-dimensional vector space such that:

“similar objects” ⇒ “nearby vectors”

This idea appears everywhere:

· NLP (Word2Vec, BERT, GPT)

· Computer Vision (CNN features, ViT embeddings)

· Graph Learning (node2vec, GCN)

· Recommender Systems (user/item embeddings)

· Multimodal systems (text–image alignment)

How Did We Get Here?

The journey to modern embeddings evolved through several major phases:

1. Sparse symbolic representations (one-hot, TF-IDF)

2. Matrix factorization (LSA)

3. Predictive neural embeddings (Word2Vec, GloVe)

4. Contextual embeddings (ELMo, BERT, GPT)

5. Contrastive and multimodal embeddings

6. Graph and manifold-based representations

Each stage improved:

· Scalability

· Semantic richness

· Context awareness

· Transfer learning ability

Note: We will discuss only a few of them.

What This Article Will Do

In this article, we will:

· Define embeddings formally

· Explain every major category

· Derive the mathematics behind each approach

· Compare their geometric intuition

· Understand why Transformers became dominant

· Explore modern embedding paradigms

This is not just a tutorial — it is a conceptual and mathematical journey through how machines learn meaning.

We will move from intuition → math → architecture → geometry → modern systems.

Before We Begin

Keep this mental model in mind:

Embeddings turn meaning into geometry.

Once you understand that, everything else becomes a refinement of that core idea.

Now let’s begin the deep dive.

1) Definition: What Is an Embedding?

An embedding is a learned mapping from objects (tokens/words, sentences, images, users/items, nodes in a graph, etc.) into a continuous vector space ℝ^d such that geometric relationships in that space correspond to meaningful relationships in the original domain.

Formally, for a set of objects X, an embedding is a function

f : X → ℝ^d

where d is typically much smaller than the size/complexity of the original representation.

Why embeddings matter

Embeddings turn “symbolic” or high-dimensional inputs into vectors where we can:

· compare items via distance/similarity (nearest neighbors),

· use vector operations in ML models (linear layers, dot products),

· generalize across similar items (shared structure).

Common similarity measures:
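
Three measures that are commonly used for this are the dot product, cosine similarity, and Euclidean distance (listed here as typical choices, not necessarily the exact set the author had in mind). A minimal sketch in plain Python:

```python
import math

def dot(a, b):
    """Dot product: large when vectors point the same way and are long."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length

print(dot(u, v))
print(cosine_similarity(u, v))
print(euclidean_distance(u, v))
```

The two vectors point in the same direction, so cosine similarity treats them as identical even though their Euclidean distance is not zero. Which measure fits best depends on whether vector length carries meaning in the embedding space at hand.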

2) Types of embeddings (major categories)

A useful way to classify embeddings is what they embed and how they behave.

A. By what is embedded (data modality / object type)

1. Discrete symbol embeddings (categorical / token / word embeddings)

2. Subword / character-aware embeddings

· Embed morphemes, byte-pair units, or characters; sometimes compose them (CNN/RNN/Transformer over characters or subword tokens).

· Helps with rare words and morphology.

3. Sentence / document embeddings

· Produce one vector for a span of text.

· Either aggregate token embeddings (mean/max pooling) or use a special token like [CLS] (Transformers).

4. Graph / network embeddings

· Nodes (and sometimes edges/subgraphs) mapped to ℝ^d.

· Preserve graph proximity (random-walk context) or message-passing structure.

5. Knowledge graph embeddings

· Embed entities and relations to model triples (h, r, t).

· Often use scoring functions like translation: h + r ≈ t

6. Vision embeddings

· Images mapped to vectors (e.g., CNN/ViT features).

· Often derived from patch tokens (ViT) or pooled CNN activations.

7. Audio/speech embeddings

· Represent speakers (speaker ID), phonetic content, or general audio semantics.

8. Multimodal embeddings

· Put different modalities in a shared space (e.g., text and images aligned so matching pairs are close). This is central to contrastive models like CLIP-style training.

9. User–item / recommendation embeddings

· Users and items embedded so interactions are predictable (matrix factorization, neural recommenders).

B. By behavior: static vs contextual

1. Static embeddings

· Each token has one vector regardless of context (e.g., classic word2vec/GloVe).

· Limitation: “bank” (river vs finance) can’t change.

2. Contextual embeddings

· A token’s representation depends on surrounding context:

· This became the standard for modern NLP because it resolves polysemy and yields richer features.

C. By learning supervision: unsupervised/self-supervised/supervised

· Unsupervised / self-supervised: learn from raw structure (co-occurrence, reconstruction, masked prediction, contrastive).

· Supervised: learn embeddings that optimize a downstream label objective (classification, ranking).

· Metric learning: explicitly structure distances via pairs/triplets.

3) How embeddings are calculated (major historical approaches)

Below are the main families of methods, their math intuition, and how they compare.

Pre-Embedding Era: Sparse Vector Representations

Before dense embeddings were introduced, words and documents were represented using high-dimensional sparse vectors.

These methods did not learn latent meaning — they encoded surface-level statistics only.

We will cover:

1. One-Hot Encoding

2. Bag-of-Words (BoW)

3. TF-IDF

4. Why these methods fail to capture semantics

1. One-Hot Encoding

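Definition

For a vocabulary V of size |V|, the i-th word is represented by a vector e_i ∈ {0,1}^|V| with a 1 at position i and 0 everywhere else.

Example

Vocabulary: V = {“cat”, “dog”, “apple”, “car”}

Then:

cat = (1, 0, 0, 0)
dog = (0, 1, 0, 0)
apple = (0, 0, 1, 0)
car = (0, 0, 0, 1)

Geometric Property:

e_i · e_j = 0 for i ≠ j, so all word vectors are mutually orthogonal.

This means: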

cat and dog similarity = 0
cat and apple similarity = 0

The model believes all words are equally unrelated.

Problem

There is:

· No notion of semantic similarity

· No relationship between similar words

· Very high dimensionality

· No compression of meaning

This motivated better representations.
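
The orthogonality of one-hot vectors is easy to check directly. A minimal sketch using the four-word vocabulary from the example (the helper names are illustrative):

```python
vocabulary = ["cat", "dog", "apple", "car"]

def one_hot(word, vocab):
    """Return a vector with 1 at the word's index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

cat, dog, apple = (one_hot(w, vocabulary) for w in ("cat", "dog", "apple"))

print(cat)
print(dot(cat, dog))    # similarity between two different words
print(dot(cat, apple))  # same value: no word is "closer" than another
```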

2. Bag-of-Words (BoW)

Now instead of representing a single word, we represent a document.

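Definition

A document d is represented by a vector x_d of length |V| whose i-th entry counts how many times word i occurs in d.

Example

Vocabulary: V = {“cat”, “dog”, “apple”, “car”}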

Document: “cat dog dog”

Vector: (1,2,0,0)

Geometrically: Each document becomes a point in high-dimensional space.

Similarity Between Documents

Usually cosine similarity:

cos(θ) = (x · y) / (‖x‖ ‖y‖)

If two documents share many words → angle small → high similarity.

Problems

1. Order is lost:

“dog bites man”

“man bites dog”
Same vector.

2. Very sparse.

3. No latent semantics.

4. Large vocabulary → huge dimensionality.
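
A short sketch of Bag-of-Words counting that reproduces the example vector and shows the loss of word order. The function name and the whitespace tokenizer are simplifications chosen for illustration:

```python
from collections import Counter

vocabulary = ["cat", "dog", "apple", "car"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

example = bag_of_words("cat dog dog", vocabulary)
print(example)

# Word order is discarded: both sentences map to the same vector.
order_vocab = ["dog", "bites", "man"]
a = bag_of_words("dog bites man", order_vocab)
b = bag_of_words("man bites dog", order_vocab)
print(a == b)
```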

3. TF-IDF (Improved BoW)

Bag-of-Words treats all words equally.

But common words (“the”, “is”) are not informative.

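So we weight words:

tfidf(t, d) = tf(t, d) × idf(t), with idf(t) = log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents that contain t.

Rare words → high IDF
Common words → low IDF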
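
A minimal sketch of one common TF-IDF variant, tf × log(N / df). Libraries often add smoothing or normalization, so exact numbers differ between implementations; the three-document corpus below is made up for illustration.

```python
import math
from collections import Counter

documents = [
    "the dog barks",
    "the cat meows",
    "the dog runs",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

def idf(term):
    """log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    """Relative term frequency in the document times the term's IDF."""
    tf = Counter(tokens)[term] / len(tokens)
    return tf * idf(term)

print(idf("the"))                      # appears in every document
print(idf("cat") > idf("dog"))         # "cat" is rarer than "dog"
print(tf_idf("cat", tokenized[1]))
```

A word that appears in every document gets an IDF of zero, so it contributes nothing to similarity regardless of how often it occurs.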

Geometric View of Sparse Methods

All these methods share this property:

High-Dimensional Space (|V| dimensions)

Each word = one axis

Each document = sparse vector

Key characteristics:

· Dimensionality = vocabulary size (often 50k–1M)

· Vectors are sparse (mostly zeros)

· No latent compression

· No semantic structure

Why These Methods Fail Semantically

Consider:

Document A: “dog puppy bark”
Document B: “canine pet bark”

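BoW vectors over the vocabulary (dog, puppy, bark, canine, pet):

A = (1, 1, 1, 0, 0)
B = (0, 0, 1, 1, 1)

Only one exact word (“bark”) is shared → low similarity (cosine = 1/3).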

But semantically → very similar.

Sparse models fail because: They operate in surface word space, not meaning space.

Visual Comparison: Sparse vs Dense

Sparse Representation

Dimension: 50,000
Vector: [0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...]
Mostly zeros

Dense Embedding (Modern)

Dimension: 300
Vector: [0.12, -0.87, 0.44, 0.09, ...]
All values meaningful
Encodes latent semantics

What Is LSA?

Latent Semantic Analysis (LSA) is a technique that:

Uses word co-occurrence statistics and matrix factorization (SVD) to discover hidden (“latent”) semantic structure in text.

It was one of the first methods to convert words and documents into dense vector representations.

Build Word–Document Matrix

Suppose we have documents:

D1: dog barks loudly
D2: cat meows loudly
D3: dog runs fast

Vocabulary:

dog, cat, barks, meows, runs, fast, loudly

Construct matrix X:

            D1   D2   D3
dog         1    0    1
cat         0    1    0
barks       1    0    0
meows       0    1    0
runs        0    0    1
fast        0    0    1
loudly      1    1    0

This is a count matrix.
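
The count matrix above can be built in a few lines. This sketch uses plain Python and only covers the matrix construction step:

```python
docs = {
    "D1": "dog barks loudly",
    "D2": "cat meows loudly",
    "D3": "dog runs fast",
}
vocabulary = ["dog", "cat", "barks", "meows", "runs", "fast", "loudly"]

# Rows are words, columns are documents, entries are raw counts.
matrix = {
    word: [text.split().count(word) for text in docs.values()]
    for word in vocabulary
}

print(f"{'':<8}", list(docs))
for word, row in matrix.items():
    print(f"{word:<8}", row)
```

LSA would then factorize this matrix with a truncated SVD (for example via `numpy.linalg.svd`) and keep only the top few singular directions as the dense representation.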

Predictive Embeddings — Word2Vec

Instead of:

“Count how often words appear together” (like LSA),

Word2Vec says: “Learn vectors that are good at predicting nearby words.”

So embeddings are learned as parameters of a predictive model.
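
As a rough illustration of what "predicting nearby words" means in the skip-gram variant of Word2Vec, the sketch below generates the (center word, context word) training pairs for a sentence. The window size and function name are illustrative choices; the neural model that consumes these pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the dog barked loudly".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

Training adjusts the word vectors so that each center word assigns high probability to its observed context words, which pulls words with similar contexts toward each other.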

4. Neural Contextual Embeddings (ELMo → BERT → GPT)

This section introduces the major conceptual breakthrough in embedding research:

A word does not have one fixed vector.
Its vector depends on the sentence it appears in.

We will explain:

1. Why static embeddings fail

2. ELMo (BiLSTM contextual embeddings)

3. Transformer architecture

4. Self-attention mathematically

5. BERT (Masked LM)

6. GPT (Causal LM)

7. Full architecture diagrams

8. Why contextual embeddings became dominant

Why Static Embeddings Fail

Consider the word: “bank”

Sentence A: I deposited money in the bank.

Sentence B: The river overflowed its bank.

Word2Vec assigns: e_bank = “same vector in both cases”

But meaning differs.

We need:

e_bank(Sentence A) ≠ e_bank(Sentence B)

This is the motivation for contextual embeddings.

ELMo (2018) — First Major Contextual Embedding

ELMo = Embeddings from Language Models

Instead of learning one vector per word type, it learns representations from a bidirectional language model (BiLSTM).

ELMo Architecture Diagram

Final ELMo Representation

So representation depends on:

✔ Left context
✔ Right context

Thus:

Limitations of ELMo

· Sequential processing (slow)

· LSTMs struggle with long-range dependencies

· Hard to parallelize

This led to Transformers.

5. Transformer-Based Contextual Embeddings

Transformers replace recurrence with self-attention.

Core idea: Each word directly attends to all other words.

Transformer Input Representation:

Input Diagram

Sentence:  The   dog   barked   loudly

Token Embeddings:
   E(The)
   E(dog)
   E(barked)
   E(loudly)

Positional Embeddings:
   P1
   P2
   P3
   P4

Final Input:
   X1 = E(The) + P1
   X2 = E(dog) + P2
   X3 = E(barked) + P3
   X4 = E(loudly) + P4

Stacked into matrix

Self-Attention Mechanism

Self-Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, with Q = XW_Q, K = XW_K, V = XW_V

Self-Attention Diagram

For word “dog” in The dog barked loudly

dog attends to:

The
dog
barked
loudly

Visualization:

Each arrow weight determined by:

α_ij = softmax_j(q_i · k_j / √d_k)

Final representation:

z_i = Σ_j α_ij v_j

So each word becomes a weighted mixture of all words in the sentence.
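
The "weighted mixture" can be sketched with the standard scaled dot-product attention, softmax(QKᵀ / √d_k) V, in plain Python. The 2-D vectors are made up, and Q = K = V = X is a simplification: a real Transformer first applies learned projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted mixture of all value vectors.
        mixed = [sum(w * v[dim] for w, v in zip(weights, values))
                 for dim in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for "The dog barked loudly" (invented values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weights = self_attention(X, X, X)

print([round(w, 3) for w in weights[1]])   # attention weights for "dog"
print([round(x, 3) for x in outputs[1]])   # new representation of "dog"
```

Each row of weights sums to 1, so every output vector is a convex combination of the value vectors.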

Transformer Block

Each layer contains:

Stacked L times.

Final contextual embedding:

BERT — Masked Language Model (Bidirectional)

Training objective:

Randomly mask tokens:

The dog [MASK] loudly

Model predicts masked word.

Loss:

Because model sees both left and right context, it learns deep bidirectional representations.

GPT — Causal Language Model

BERT vs GPT Diagram

BERT (Bidirectional)

GPT (Causal)
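
One way to picture the difference is as an attention mask, where entry [i][j] is 1 if token i may look at token j. This is a simplified sketch with illustrative function names; real implementations usually add large negative values to the attention scores rather than storing 0/1 flags.

```python
def bidirectional_mask(n):
    """BERT-style: every token may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["The", "dog", "barked", "loudly"]

print("Bidirectional:")
for row in bidirectional_mask(len(tokens)):
    print(row)

print("Causal:")
for row in causal_mask(len(tokens)):
    print(row)
```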

Why Contextual Embeddings Dominated NLP

They combine:

✔ Context sensitivity
✔ Large-scale self-supervised learning
✔ Deep semantic modeling
✔ Transfer learning

Instead of training task-specific models, we:

1. Pretrain large LM

2. Fine-tune on downstream tasks

This drastically improved:

· Question answering

· Translation

· Classification

· Named entity recognition

· Summarization

Geometric Interpretation

Unlike Word2Vec: One word → one fixed point.

In contextual embeddings: Each occurrence → different point.

So “bank” forms multiple clusters:

Autoencoders & Variational Autoencoders

Instead of predicting context (Word2Vec) or next token (GPT), we train a model to:

Compress input → then reconstruct it.

The compressed representation becomes the embedding.

1. Basic Autoencoder

Mathematical Formulation

Architecture Diagram:

Visually:

This is called a bottleneck architecture.

Why This Produces Embeddings

The model is forced to:

· Compress D-dimensional input

· Into d-dimensional latent vector

· Without losing important information

So: Z becomes a compressed representation of meaning.

Geometric Intuition

Suppose input data lies near a low-dimensional manifold:

Autoencoder learns:

→ A nonlinear projection onto that manifold.

Latent space:

So embedding = coordinate in learned manifold.

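Linear Autoencoder = PCA

If the encoder and decoder are both linear (no activation functions) and the loss is squared reconstruction error, then minimizing reconstruction error is equivalent to:

projecting the data onto the subspace spanned by the top d principal components.

So autoencoders generalize Principal Component Analysis (PCA) to nonlinear embeddings.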

2. Variational Autoencoder (VAE)

Regular autoencoder:

· Deterministic encoding

VAE introduces probability.

Probabilistic Formulation

Two terms:

1. Reconstruction loss

2. KL divergence regularization

VAE Diagram

This forces latent space to:

✔ Be smooth
✔ Be continuous
✔ Be structured

Autoencoders vs Word2Vec

Autoencoders are general-purpose embedding learners

Graph Embeddings

Now we move to structured data: graphs

A graph: G = (V , E)

Nodes = entities
Edges = relationships

Goal: learn a mapping f : V → ℝ^d so that connected nodes are close.

One example we will study is DeepWalk / node2vec.

DeepWalk / node2vec

Core idea: Treat random walks like sentences.

Step 1: Random Walk

Example graph:

Step 2: Apply Skip-Gram

Diagram

Intuition

Nodes appearing in similar walks → similar embeddings.

Captures:

✔ Community structure
✔ Graph proximity
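
The random-walk step can be sketched as follows. The graph is a hypothetical toy example, and the walk is uniform as in DeepWalk; node2vec additionally biases the walk with its return and in-out parameters, which this sketch leaves out.

```python
import random

# Hypothetical toy graph as an undirected adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

def random_walk(graph, start, length, rng):
    """Uniform random walk of `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # fixed seed so runs are repeatable
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(2)]

for walk in walks[:3]:
    print(" ".join(walk))
```

Each walk plays the role of a sentence and each node the role of a word, so the walks can be passed to a skip-gram model unchanged.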

Designing Scalable RAG Systems Using VectorDB: A Hands-On Walkthrough with ChromaDB

Soumyadeep Saha — Wed, 11 Feb 2026 17:57:29 GMT

In my previous blog, I provided a comprehensive overview of vector stores, vector databases, and the internal workings of RAG:

https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc

This article, however, is specifically focused on the practical implementation side — offering a hands-on view of how vector databases and vector stores, such as Chroma, actually work in real-world scenarios.

What is RAG?

RAG (Retrieval Augmented Generation) is an architecture that adds external knowledge to a Large Language Model (LLM) at query time.

Instead of relying only on what the model was trained on, RAG:

retrieves relevant documents
injects them into the prompt
then generates the answer

Why RAG is Needed?

LLMs alone have problems:

Hallucination
Outdated knowledge
Cannot access private/local data

RAG solves this by grounding answers in real documents.

Core Components of RAG?

RAG Working (Step-by-Step)

Prepare documents
Split documents into chunks
Convert chunks to embeddings
Store embeddings in vector database
User asks a question
Question converted to embedding
Most similar chunks retrieved
Chunks injected into LLM prompt
LLM generates grounded answer
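
The retrieval part of these steps can be sketched without any external libraries. In this toy version, word counts stand in for a real embedding model and a Python list stands in for the vector database; the final LLM call is left out. All names are illustrative.

```python
import math
from collections import Counter

chunks = [
    "RAG stands for Retrieval Augmented Generation.",
    "RAG improves accuracy by grounding answers in documents.",
    "Time moves slower near massive objects like black holes.",
]

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = text.lower().replace(".", "").replace("?", "").split()
    return Counter(words)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Embed the chunks and keep them in an in-memory "vector store".
store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=2):
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=2):
    """Inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

print(build_prompt("What does RAG stand for?"))
```

Swapping `embed` for a sentence-embedding model and `store` for ChromaDB gives the same flow as the implementations later in this article.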

RAG vs Fine-Tuning (Very Important)

RAG is a technique that combines information retrieval with language generation to produce context-aware and factually grounded responses.

Implementation 1:

RAG + Chroma is a good way to understand modern LLM apps locally on a MacBook.

I’ll assume:

· macOS

· Python 3.9+

· No paid APIs (we’ll use local embeddings)

· Simple text files as knowledge base

What You’ll Build (Big Picture)

You’ll build a local RAG pipeline:

Your documents → Embeddings → ChromaDB (vector store)

User question → Retrieve relevant chunks → Send to LLM → Answer

Step 1: Install Ollama (Mac)

Install Ollama

brew install ollama

Start Ollama service

ollama serve

(Leave this running in one terminal)

ollama serve starts the Ollama background service that:

· Loads LLM models (LLaMA, Mistral, Phi, etc.)

· Exposes them via a local HTTP API

Pull a model (lightweight + good)

ollama pull llama3

# Optional: ollama run llama3 opens an interactive chat; it sends its requests to the server started by ollama serve.

You now have a local LLM running at: http://localhost:11434

Step 2: Install Python Libraries

python3.11 -m venv rag_venv

source rag_venv/bin/activate

python3 -m pip install chromadb sentence-transformers langchain langchain-community langchain-text-splitters langchain-ollama

Dependency Usage Table (RAG + Ollama Setup)

Step 3: Sample Documents

Create folder:

mkdir data

data/rag.txt

RAG stands for Retrieval Augmented Generation.
It combines retrieval with language models.
RAG improves accuracy by grounding answers in documents.

data/time_dilation.txt

The concept of "time dilation" in physics is fascinating. 
According to Einstein's theory of relativity, time is not universal; 
it stretches and compresses based on speed and gravity. 
If you were to travel in a spaceship at near-light speed for a few years, 
you would return to Earth to find decades or even centuries had passed. 
Similarly, time moves slower near massive objects like black holes due to 
extreme gravity. This means astronauts aboard the International Space Station 
actually age slightly slower than people on Earth. Time is not constant; 
it is flexible, making the universe far stranger than it appears.

Step 4: Build Vector Store (ChromaDB)

Create build_chroma.py with the below code

# Import OS module to work with files and directories
import os

# Import loader to read text files
from langchain_community.document_loaders import TextLoader

# Import text splitter to break text into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Import embedding model for converting text → vectors
from langchain_community.embeddings import HuggingFaceEmbeddings

# Import Chroma vector database
from langchain_community.vectorstores import Chroma

# Create empty list to store loaded documents
documents = []

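# Loop through all text files inside the data directory
for file in sorted(os.listdir("data")):
    # Skip anything that is not a .txt file (e.g. macOS .DS_Store)
    if not file.endswith(".txt"):
        continue
    # Load each text file
    loader = TextLoader(os.path.join("data", file))
    # Add loaded document to the list
    documents.extend(loader.load())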

# Initialize text splitter
# With chunk_overlap=20, consecutive chunks share up to 20 characters
# (roughly: chunk 1 covers characters 0 to 200, chunk 2 starts near 180),
# which helps preserve context across chunk boundaries.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # Maximum size of each text chunk (in characters by default)
    chunk_overlap=20     # Number of characters shared between consecutive chunks
)

# Split documents into smaller chunks
chunks = text_splitter.split_documents(documents)

# Load local embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"  # Small and fast embedding model
)

# Create Chroma vector database from document chunks
vectorstore = Chroma.from_documents(
    documents=chunks,             # Text chunks
    embedding=embedding_model,    # Embedding function
    persist_directory="chroma_db" # Folder to save the DB
)

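# Save vector DB to disk
# (with Chroma 0.4+ data is written automatically when persist_directory is set;
# this call matters only on older versions and may print a deprecation warning)
vectorstore.persist()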

# Confirmation message
print("ChromaDB created successfully")

Run:

python build_chroma.py
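
To see what chunk_size=200 and chunk_overlap=20 mean, here is a simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally: that splitter prefers to cut at separators such as paragraph breaks, newlines, and spaces, so its real chunk boundaries will differ. The function name is illustrative.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=20):
    """Cut text into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 500 characters of dummy text: "0123456789" repeated.
text = "".join(str(i % 10) for i in range(500))
chunks = sliding_chunks(text)

for n, chunk in enumerate(chunks):
    print(n, len(chunk))

# The last 20 characters of one chunk are the first 20 of the next.
print(chunks[0][-20:] == chunks[1][:20])
```

A larger overlap reduces the chance that a sentence is cut off from the context needed to understand it, at the cost of storing more redundant text.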

Step 5: RAG with Ollama (Retrieval + Generation)

Create rag_ollama.py with the below code

# Import embedding model for converting text → vectors
from langchain_community.embeddings import HuggingFaceEmbeddings

# Import Chroma vector database
from langchain_community.vectorstores import Chroma

# Import Ollama LLM wrapper

from langchain_ollama import OllamaLLM

# Load the same embedding model used during indexing
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Load the existing Chroma vector database
db = Chroma(
    persist_directory="chroma_db",     # Path to stored DB
    embedding_function=embedding_model # Embedding function
)

# User question
query = "What is RAG?"

# Perform similarity search to retrieve relevant chunks
retrieved_docs = db.similarity_search(
    query,  # User question
    k=2     # Number of top matching chunks
)

# Combine retrieved document text into one context string
context = "\n".join([doc.page_content for doc in retrieved_docs])

# Initialize Ollama with LLaMA 3 model
# Temperature controls the randomness of generated text (typically 0 to 2).
# Lower values (about 0.0 to 0.4) give focused, repeatable output suited to
# factual tasks; higher values (about 0.7 to 1.5+) give more diverse, creative
# output at a higher risk of hallucination.
llm = OllamaLLM(
    model="llama3",   # Name of the model pulled via `ollama pull`
    temperature=0.2   # Lower = more factual answers
)

# Create RAG prompt
prompt = f"""
You are a helpful assistant.
Answer the question using ONLY the context below.

Context:
{context}

Question:
{query}

Answer:
"""

# Send prompt to Ollama LLM and get response
response = llm.invoke(prompt)

# Print final answer
print("\nFinal Answer:")
print(response)

Run:

python rag_ollama.py
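
The effect of the temperature setting used in the script can be illustrated with a temperature-scaled softmax, which is the usual way sampling temperature is described. The logits are hypothetical, and this sketch makes no claim about how Ollama implements sampling internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before applying softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)

print([round(p, 3) for p in low])    # sharply peaked on the top token
print([round(p, 3) for p in high])   # probability spread more evenly
```

At low temperature almost all probability mass lands on the highest-scoring token, which is why the output becomes more repeatable.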

Implementation 2:

FastAPI RAG Server (Ollama + Chroma)

FastAPI is a modern Python web framework used to build APIs quickly and efficiently.

It is mainly used for:

Building REST APIs
Serving machine learning models
Creating backend services
Microservices
AI applications (like your RAG app)

Why Is It Called “Fast”?

FastAPI is fast because:

Built on Starlette (async framework)
Uses Pydantic (fast data validation)
Supports async/await
Very low overhead

FastAPI’s documentation describes its performance as on par with Node.js and Go.

Now let’s convert your RAG pipeline into an API.

What You Will Build (End Goal)

A local RAG system with:

Ollama (LLaMA 3) → LLM
ChromaDB → Vector database
FastAPI → Backend server
Local text files → Knowledge base

You’ll end with an API:

POST /ask

that answers questions using your documents.

Step 1: Prerequisites (One Time)

Install Python (if not installed)

brew install python

Check:

python3 --version

Step 2: Install & Set Up Ollama (LLM)

Install Ollama

brew install ollama

Start Ollama Service

ollama serve

Keep this terminal running.

Download Model

Open another terminal:

ollama pull llama3

Ollama now runs locally at:

http://localhost:11434

Create Virtual Environment (Recommended)

# Select the interpreter from the Command Palette

python3.11 -m venv rag_venv_fastapi

source rag_venv_fastapi/bin/activate

Install Python Dependencies

python3 -m pip install fastapi uvicorn chromadb sentence-transformers langchain langchain-community langchain-text-splitters langchain-ollama

Step 3: Project Structure (From Scratch)

Create a folder:

mkdir rag_db_fastapi

cd rag_db_fastapi

Inside it:

rag_db_fastapi/
│
├── data/
│   ├── rag.txt
│   └── time_dilation.txt
│
├── build_chroma.py
└── main.py

Step 4: Create Knowledge Documents

Create data/ folder

mkdir data

Create rag.txt

RAG stands for Retrieval Augmented Generation.
It combines document retrieval with language models.
RAG reduces hallucinations by grounding answers in data.

Create time_dilation.txt

The concept of "time dilation" in physics is fascinating. 
According to Einstein’s theory of relativity, time is not universal; 
it stretches and compresses based on speed and gravity. 
If you were to travel in a spaceship at near-light speed for a few years, 
you would return to Earth to find decades or even centuries had passed. 
Similarly, time moves slower near massive objects like black holes due to 
extreme gravity. This means astronauts aboard the International Space Station 
actually age slightly slower than people on Earth. Time is not constant; 
it is flexible, making the universe far stranger than it appears.

Step 5: Build Vector Database (ChromaDB)

Create build_chroma.py with the below code

# Import OS utilities to read files
import os

# Import loader to read text files
from langchain_community.document_loaders import TextLoader

# Import text splitter to break text into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Import embedding model for converting text → vectors
from langchain_community.embeddings import HuggingFaceEmbeddings

# Import Chroma vector database
from langchain_community.vectorstores import Chroma


# List to store all loaded documents
documents = []

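# Loop through each text file in data directory
for file in sorted(os.listdir("data")):
    # Skip non-text files (e.g. macOS .DS_Store)
    if not file.endswith(".txt"):
        continue
    # Load each text file
    loader = TextLoader(os.path.join("data", file))
    documents.extend(loader.load())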

# Split documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # Max characters per chunk
    chunk_overlap=20  # Overlap to preserve context
)

chunks = text_splitter.split_documents(documents)

# Load embedding model (local, fast)
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=chunks,              # Text chunks
    embedding=embedding_model,     # Embedding function
    persist_directory="chroma_db"  # Folder to store vectors
)

# Save vector DB to disk
# (automatic with Chroma 0.4+; this call may print a deprecation warning)
vectorstore.persist()

print("ChromaDB created from scratch")

Run Vector DB Creation

python build_chroma.py

You will now see:

chroma_db/

This is your knowledge base.

Step 6: FastAPI RAG Server (From Scratch)

Create main.py

# FastAPI framework
from fastapi import FastAPI

# Request body validation
from pydantic import BaseModel

# Embedding model
from langchain_community.embeddings import HuggingFaceEmbeddings

# Chroma vector store
from langchain_community.vectorstores import Chroma

# Ollama LLM wrapper
from langchain_ollama import OllamaLLM

# Create FastAPI app
app = FastAPI(title="Local RAG API")

# Load embedding model (same as used for indexing)
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Load ChromaDB from disk
vector_db = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_model
)

# Initialize Ollama LLM
llm = OllamaLLM(
    model="llama3",
    temperature=0.2  # Low temperature for factual answers
)

# Request schema
class QuestionRequest(BaseModel):
    question: str

# Health check endpoint
@app.get("/")
def health():
    return {"status": "RAG server running"}

# Main RAG endpoint
@app.post("/ask")
def ask_question(request: QuestionRequest):

    # Step 1: Retrieve relevant documents
    docs = vector_db.similarity_search(
        request.question,
        k=2
    )

    # Step 2: Combine retrieved text as context
    context = "\n".join([doc.page_content for doc in docs])

    # Step 3: Construct RAG prompt
    prompt = f"""
    Answer the question using only the context below.

    Context:
    {context}

    Question:
    {request.question}

    Answer:
    """

    # Step 4: Generate answer using Ollama
    answer = llm.invoke(prompt)

    # Step 5: Return response
    return {
        "question": request.question,
        "answer": answer,
        "context_used": context
    }

Step 7: Run Everything

Start FastAPI

python -m uvicorn main:app --reload

Running Uvicorn through python -m, with the virtual environment activated, guarantees:

· Uses venv Python

· Uses venv packages

· No global conflicts

The --reload flag in the uvicorn main:app --reload command enables auto-reloading, which automatically restarts the server whenever code changes are detected in your project. It is specifically designed for local development, eliminating the need to manually stop and restart the server every time a code modification is made.

In the command uvicorn main:app, main refers to the Python module (the file main.py), and app refers to the specific application object (e.g., a FastAPI instance) created within that file.

Extra Notes:

Default Host and Port (Uvicorn)

When you run FastAPI with Uvicorn:

uvicorn main:app

Default values: Host: 127.0.0.1 (localhost), Port: 8000

Change Host + Port

uvicorn main:app --host 0.0.0.0 --port 9000

With Auto Reload (Development)

uvicorn main:app --reload --port 7000

Change Port Programmatically (Less Common)

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="127.0.0.1",
        port=5050,
        reload=True
    )

Step 8: Open Swagger UI

Open in browser: http://127.0.0.1:8000/docs

Let’s start testing:

Response is:
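
Outside Swagger UI, the same endpoint can be exercised from a small script. This is a sketch using only the standard library. It assumes the server is running on Uvicorn's default host and port, and it relies on the `question` field of `QuestionRequest` and the `answer` key returned by `/ask` as defined in main.py; the helper names are illustrative.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/ask"  # default Uvicorn host and port

def build_ask_request(question, url=API_URL):
    """Build a POST request whose JSON body matches QuestionRequest."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question):
    """Send the question to the running server and return the parsed JSON reply."""
    with urllib.request.urlopen(build_ask_request(question)) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (needs both `ollama serve` and the FastAPI server running):
#   reply = ask("What is RAG?")
#   print(reply["answer"])
```

The reply also contains the `context_used` field, which is handy for checking which chunks the answer was grounded in.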

Complete RAG Flow

Text Files
   ↓
Chunking
   ↓
Embeddings
   ↓
ChromaDB
   ↓
Query Embedding
   ↓
Similarity Search
   ↓
Context Injection
   ↓
Ollama LLM
   ↓
Answer

You do NOT need to install ChromaDB separately as a service.
In our example, ChromaDB is running inside your Python process.

ChromaDB can work in two modes:

In Our RAG Example: Which One Are We Using?

Mode 1: Persistent mode. This is the mode we are using, because we wrote:

Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="chroma_db"
)

and later:

Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_model
)

This means:

Vectors are stored on disk
Data survives restarts
No re-embedding needed every time

So this is NOT purely in-memory.

Do You Need to Install ChromaDB Separately?

No separate installation
No server to run
No Docker required

You only install the Python library:

pip install chromadb

That’s it.

Chroma runs embedded inside your app, like SQLite.

Mode 2: When Is Chroma In-Memory?

If you do this:

Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)

(no persist_directory)

Then:

Data is stored in RAM

Lost when Python stops

Good for testing only

ChromaDB is like SQLite for vectors.
It runs inside your app unless you choose a server-based DB.

When Do You Need a Separate Vector DB Server?

Only when:

Huge data (millions of vectors)

Multi-user access

High availability

Then you switch to:

Pinecone

Weaviate

Qdrant

Milvus

Chroma isn’t the only game in town
Here’s a clear, practical list of vector databases similar to Chroma, grouped by how they’re used, so you know when to pick what.

1. FAISS (Most Common Alternative)

Best mental model: NumPy + vectors

What it is

Facebook AI Similarity Search
Library, not a database server

Key points

In-memory by default
Extremely fast
No metadata filtering (basic)
No built-in persistence (manual save/load)

When to use

Local experiments
Research
Single-machine apps

2. Qdrant (Closest to “Production Chroma”)

Best mental model: Postgres for vectors

What it is

Vector DB with server mode
Can also run embedded (local)

Key points

REST & gRPC APIs
Strong metadata filtering
Disk-backed
Scales well

When to use

Medium to large RAG systems
Production APIs

3. Weaviate

Best mental model: Search engine + vectors

What it is

Full vector search platform
Cloud + self-hosted

Key points

GraphQL API
Built-in embedding support
Multi-tenant
Heavy but powerful

When to use

Enterprise apps
Complex schemas

4. Milvus

Best mental model: Big data vector warehouse

What it is

High-scale vector DB
Kubernetes-friendly

Key points

Handles billions of vectors
Needs more infra
Used by big companies

When to use

Massive datasets
High throughput systems

5. Pinecone (Managed / Cloud)

Best mental model: AWS for vectors

What it is

Fully managed vector DB

Key points

No infra management
Paid
Very reliable

When to use

Production without infra headache
SaaS RAG apps

6. Redis Vector Search

Best mental model: Redis + vectors

What it is

Redis with vector indexing

Key points

Super fast
Good for real-time apps
Limited vector-specific features

When to use

Low latency use cases
Already using Redis

Although ChromaDB itself is a vector database library, there are tools that give you an application-like interface to browse collections, view documents, inspect embeddings and metadata, and run queries without writing Python code yourself.

1. Chroma Explorer (Desktop GUI) — macOS app

A native desktop client for visualizing ChromaDB:

Browse collections
View documents in each collection
See embeddings & metadata
Run semantic search with natural language
Inspect similarity scores

Built for macOS with a visual UI — great if you want a regular app experience rather than coding.

From your project folder:

chroma run --path ./chroma_db --port 8001

You should see something like:

Running Chroma server on http://localhost:8001

On a Mac, install the application from its .dmg file.

Now connect using the settings below:

It will look like this

I have demonstrated only this database administration and visualization tool (GUI client) for ChromaDB. Try the other tools listed below if you are interested.

2. ChromaDB Viewer (Gradio UI)

A Python-based lightweight web interface that runs locally with Gradio:

Connect to any local ChromaDB
Browse all collections
See vector distances and embeddings
Query the database interactively via browser

To use:

Install Python dependencies
Run the viewer script
Open a browser at a local URL

Useful if you want a simple browser-based tool without desktop installation.

A simple Python server that shows your local ChromaDB in a browser.

3. Chromadb-UI (Web UI)

A community-built web application for managing ChromaDB:

Browse and filter results
Visual interface instead of coding
Run locally via Docker or dev server

You can clone and run it locally to interact with ChromaDB through a UI.

Unlike SQL databases (e.g., MySQL Workbench, pgAdmin), vector databases store high-dimensional embeddings rather than structured rows. But the tools above let you view:

Stored text content
Embedding vectors
Metadata fields
Query results
Distance / similarity scores

This gives a feel similar to inspecting a regular database, but tailored for vector data.

Positional Encoding Explained Simply

Soumyadeep Saha — Wed, 11 Feb 2026 14:45:33 GMT

I’ve already covered the fundamentals of vector stores, vector databases, and the internal workings of RAG in detail in my previous blog:

https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc

This article, however, focuses specifically on a deeper and more detailed exploration of positional encoding — its intuition, mathematical foundation, and how it works internally within Transformer architectures.

Let’s Start

When we read a sentence, word order matters.
“The cat chased the mouse” means something very different from “The mouse chased the cat.”

For humans, understanding word order is natural. But for machine learning models — especially Transformers — this is not automatic.

Traditional sequence models like RNNs and LSTMs process text one word at a time, so they naturally capture order. However, Transformer models process all words in parallel using a mechanism called self-attention. While this makes them extremely powerful and efficient, it also creates a challenge:

How does a Transformer know which word comes first, second, or last?

This is where Positional Encoding comes in.

Positional Encoding is a technique used to inject information about the position of each word directly into its embedding. By adding positional information, Transformers can understand the structure and order of sequences — allowing them to correctly interpret meaning.

We are going to dive deep into Positional Encoding in Transformers.

Why transformers need positional encoding (what self-attention can’t do)

What self-attention is great at: Self-attention creates contextual embeddings.

Meaning:

The word “bank” in “river bank” becomes different from “bank account”
Because attention uses surrounding words to update the representation

Also it’s parallel:

RNNs read tokens one-by-one (slow)
Transformers process all tokens together (fast)

The big problem (order blindness)

Self-attention, by itself, doesn’t naturally know word order.

If you shuffle the tokens, attention can still compute relationships… but it doesn’t have a built-in “this came before that” signal.

So:

“dog bites man”
“man bites dog”

contain the same words, but mean different things.
Without position info, the model can get confused.

Positional Encoding is the mathematical “hack” that fixes this.

1. The Evolution of the Solution (First Principles)

Let’s take a “First Principles” approach and trace the logic of how researchers arrived at the final solution.

Attempt 1: Just Count (Integers)

Why not just number the words?

“The” = 1
“bear” = 2
“ate” = 3
…

The Problem: These numbers are unbounded. If you have a document with 5,000 words, the last word has a value of 5,000. A value that large dwarfs the word embedding it is combined with and hurts the “Numerical Stability” of the Neural Network (gradients can explode).

Attempt 2: Normalize (0 to 1)

Okay, let’s divide by the sentence length so everything is between 0 and 1.

“The” = 0.1
“bear” = 0.2
…

The Problem: The “step size” changes depending on sentence length.

In a 10-word sentence, the distance between words is 0.1.
In a 100-word sentence, the distance is 0.01. The model gets confused because “next door neighbor” means different things in different sentences.
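The changing step size is easy to check numerically. This is a minimal sketch; the helper name `normalized_positions` is made up for illustration.

```python
def normalized_positions(seq_len):
    """Position index (starting at 1) divided by the sentence length."""
    return [(i + 1) / seq_len for i in range(seq_len)]

short_sentence = normalized_positions(10)    # 0.1, 0.2, ..., 1.0
long_sentence = normalized_positions(100)    # 0.01, 0.02, ..., 1.0

# Gap between the first two words in each sentence
step_short = short_sentence[1] - short_sentence[0]
step_long = long_sentence[1] - long_sentence[0]

print(f"10-word sentence:  neighbours differ by {step_short:.2f}")
print(f"100-word sentence: neighbours differ by {step_long:.2f}")
```

The same "one word apart" relationship ends up with two different numeric gaps, which is exactly the inconsistency described above.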

Attempt 3: One-hot position vectors

Position 3 = [0,0,0,1,0,0,…]

The Problem: no smoothness
Neural nets like smooth, continuous signals.
One-hot doesn’t tell the model that position 3 is closer to 4 than to 97.
It’s all equally “different”.
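A quick way to see this, as a small sketch with a hypothetical `one_hot` helper and a sequence length of 100 chosen only for illustration:

```python
import numpy as np

def one_hot(position, length):
    """Vector of zeros with a single 1 at the given position."""
    vec = np.zeros(length)
    vec[position] = 1.0
    return vec

p3, p4, p97 = (one_hot(p, 100) for p in (3, 4, 97))

# Euclidean distance between position 3 and its neighbour vs. a distant position
near = np.linalg.norm(p3 - p4)
far = np.linalg.norm(p3 - p97)
print(near, far)
```

Both distances come out identical (the square root of 2), so the encoding carries no notion of "nearby" versus "far away".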

So we want:

bounded values (not exploding)
smooth / continuous change
something that helps model learn relative positions

Attempt 4: The Sine/Cosine Solution (The Winner)

We need a system that is:

Bounded: Values stay between -1 and 1.
Consistent: The distance between Position 1 and 2 is always the same.
Deterministic: No random numbers.

This is where Waves come in.

PE(pos) = sin(pos)

But there is a problem: The periodicity problem → Sine repeats.

So:

sin(0) = 0
sin(2π) ≈ 0
sin(4π) ≈ 0

Different positions can produce the same value → the model might think they are the same position.

Fix periodicity (part 1): use sine AND cosine together

Instead of encoding a position as one number, encode it as a 2D vector:

PE(pos) = [sin(pos), cos(pos)]

That pair behaves like a point on a circle.

That’s what you’re seeing in this image:

This helps because:

even if sine repeats, cosine won’t match at the same time (except full cycle)
together they give a stronger signature

Still periodic, but improved uniqueness.
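Here is a small sketch of that effect. The function name `pe_pair` is made up, and the positions 0, π and 2π are chosen only because they make the collision easy to see (real token positions are integers).

```python
import math

def pe_pair(pos):
    """Encode a position as a (sine, cosine) pair, i.e. a point on the unit circle."""
    return (math.sin(pos), math.cos(pos))

a = pe_pair(0.0)
b = pe_pair(math.pi)
c = pe_pair(2 * math.pi)

# Sine alone cannot tell position 0 from position pi ...
print(abs(a[0] - b[0]) < 1e-9)
# ... but the cosine component separates them
print(abs(a[1] - b[1]) > 1.9)
# After a full cycle the pair repeats, so one pair is still periodic
print(abs(a[0] - c[0]) < 1e-9 and abs(a[1] - c[1]) < 1e-9)
```

All three checks print True: the pair removes the collisions inside one cycle, but not the repetition after a full cycle.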

Fix periodicity (part 2): don’t use just one sine/cos pair — use MANY (different frequencies)

Now comes the “real” transformer positional encoding idea:

Make the positional encoding a vector the same size as the token embedding (e.g., 128, 512, 768 dims)
Use many sine/cos pairs
Each pair uses a different frequency (some change fast, some change slow)

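The classic sinusoidal positional encoding formula

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here pos is the token position, i indexes the sine/cosine pair, and d_model is the embedding size.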

The Math: Frequencies and Wavelengths

Imagine the Positional Encoding as a set of many dials or clocks, each spinning at a different speed.

Low Dimensions: Spin very fast (like a second hand).
High Dimensions: Spin very slow (like an hour hand).

By looking at the combination of all these hands, you can tell exactly what time (position) it is.

2. Visualising with Python Code

Let’s write the code to visualize this “wobbly” matrix

import numpy as np
import matplotlib.pyplot as plt

def get_positional_encoding(seq_len, d_model):
    """
    Generates the Positional Encoding Matrix.
    seq_len: Number of words in sentence (e.g., 100)
    d_model: Dimensionality of the embedding (e.g., 512)
    """
    # 1. Initialize the matrix
    pe = np.zeros((seq_len, d_model))

    # 2. Create the position indices (0, 1, 2, ..., seq_len-1)
    position = np.arange(seq_len)[:, np.newaxis]

    # 3. Create the division term (the "10000^..." part)
    # We use a trick with log space for numerical stability
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    # 4. Apply Sine to even indices
    pe[:, 0::2] = np.sin(position * div_term)

    # 5. Apply Cosine to odd indices
    pe[:, 1::2] = np.cos(position * div_term)

    return pe

# Generate and Visualize
seq_length = 100
d_model = 128
pe_matrix = get_positional_encoding(seq_length, d_model)

plt.figure(figsize=(10, 6))
plt.imshow(pe_matrix, cmap='RdBu', aspect='auto')
plt.title("Positional Encoding Matrix")
plt.xlabel("Embedding Dimension (Depth)")
plt.ylabel("Word Position (Sequence Length)")
plt.colorbar(label="Value (-1 to 1)")
plt.show()

What you would see:

· Left side (Low dimensions): Rapid flickering (High Frequency).

· Right side (High dimensions): Slow, smooth changes (Low Frequency).

· This pattern is unique for every single row (word position).

3. The “Relative Position” Magic

This is the coolest part of the math.

Why did we choose Sine and Cosine? Because of this trigonometric identity:

sin(x + k) = sin(x)cos(k) + cos(x)sin(k)

In simple words:

If the model knows the position of word A (at pos) and wants to look at word B (at pos+k), it doesn’t need to “re-learn” the position. For each sine/cosine pair, the encoding at pos+k is a Rotation (a linear matrix multiplication) of the encoding at pos, and that rotation depends only on the offset k, not on pos.

This allows the Transformer to easily learn concepts like “pay attention to the word 3 steps behind me” regardless of whether “me” is at the start or end of the sentence.
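The identity can be verified numerically for a single sine/cosine pair. This is a sketch under stated assumptions: the helper names `pair` and `shift_matrix` are made up, and the frequency `w` is taken for pair index i = 3 with d_model = 128 purely as an example.

```python
import numpy as np

def pair(pos, w):
    """One (sin, cos) pair of the positional encoding at frequency w."""
    return np.array([np.sin(w * pos), np.cos(w * pos)])

def shift_matrix(k, w):
    """2x2 matrix that moves a (sin, cos) pair forward by k positions."""
    return np.array([
        [np.cos(w * k), np.sin(w * k)],
        [-np.sin(w * k), np.cos(w * k)],
    ])

w = 1.0 / 10000 ** (2 * 3 / 128)  # example frequency (assumed i=3, d_model=128)
k = 3                             # "3 steps ahead"
M = shift_matrix(k, w)            # built from k only, never from pos

# The same matrix works at the start, middle and end of a sequence
max_error = max(
    np.abs(M @ pair(pos, w) - pair(pos + k, w)).max()
    for pos in (0, 7, 50, 400)
)
print(max_error < 1e-12)
```

The first row of `M` is just the identity sin(x + k) = sin(x)cos(k) + cos(x)sin(k) written as a matrix row; the second row is the matching identity for cosine.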

4. Final Architecture: Addition

How do we combine this with the word meaning? We simply Add them.

# Pseudo-code for the final step in a Transformer
word_embeddings = embedding_layer(input_words)  # Shape: [batch, seq_len, 512]
pos_encodings = get_positional_encoding(seq_len, 512)  # Shape: [seq_len, 512]

# Crucial Step: Direct Addition (the same encodings are added to every sentence in the batch)
final_input = word_embeddings + pos_encodings

We add them because the “Word Meaning” is like the content, and “Positional Encoding” is like the timestamp. Adding them allows the model to separate “What” (content) from “Where” (position) using its internal math.
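To make the addition concrete, here is a runnable NumPy sketch. It repeats the encoding function from section 2 in compact form under the name `sinusoidal_pe`; the random `embedding_table`, the vocabulary size and the token ids are stand-ins invented for illustration, not a trained embedding layer.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encoding matrix of shape (seq_len, d_model)."""
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16
# Stand-in for a learned embedding layer: one random vector per token id
embedding_table = rng.normal(size=(vocab_size, d_model))

# Batch of 2 sentences with 3 tokens each; token id 4 appears twice in sentence 0
token_ids = np.array([[4, 9, 4], [1, 2, 3]])

word_embeddings = embedding_table[token_ids]                # (2, 3, 16)
pos_encodings = sinusoidal_pe(token_ids.shape[1], d_model)  # (3, 16)

# Direct addition; NumPy broadcasts the encodings over the batch dimension
final_input = word_embeddings + pos_encodings

same_before = np.allclose(word_embeddings[0, 0], word_embeddings[0, 2])
same_after = np.allclose(final_input[0, 0], final_input[0, 2])
print(final_input.shape, same_before, same_after)
```

The repeated token has identical embeddings before the addition and different input vectors afterwards, which is how the model can tell the two occurrences apart.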

RAG, Vector Stores And Positional Encoding Explained Simply And With A Practical Guide

Soumyadeep Saha — Sun, 25 Jan 2026 17:57:43 GMT

RAG And Vector Stores Explained Simply And With A Practical Guide

Large Language Models feel magical — they recommend movies you’ll love, understand context in long conversations, and answer questions using your own documents.

But under the hood, none of this is magic.

Three core ideas make these systems work:
Vector Stores, Positional Encoding, and Retrieval-Augmented Generation (RAG).

In this article, we’ll build an intuitive understanding of all three — starting from first principles and moving toward practical implementations. We’ll see:

why keyword matching fails and how vector embeddings let machines understand meaning
why transformers are naturally order-blind, and how positional encoding mathematically injects sequence information
how RAG combines vector search with LLMs to reduce hallucinations and unlock private, up-to-date knowledge

Along the way, we’ll use simple analogies, diagrams, and Python examples with tools like LangChain and Chroma — no hand-waving, no unnecessary math.

If you’ve ever wondered how modern AI systems actually retrieve information, understand word order, and ground their answers in facts, this article will connect the dots.

Vector Stores

We will focus on moving away from old-school “keyword matching” toward “semantic understanding” (understanding meaning).

1. The Problem: Keywords vs. Meaning

The “keyword” way (old-school)

A basic recommender might do:

You liked a movie → take its plot text
Find other plots that share similar words
Recommend those

Problem: words don’t always mean what you want.

Example:

Movies like Kabhi Alvida Naa Kehna and My Name is Khan can feel similar in theme/emotion…
…but their plots may not share many exact keywords.
So keyword matching can miss good recommendations.

The “meaning” way (semantic)

Instead of comparing words, we compare meaning.

To do that, we convert each plot into a vector (a list of numbers) called an embedding.

Then:

Similar meaning → vectors end up near each other
Different meaning → vectors are far apart

Mini diagram: keyword vs semantic

2. The Solution: Embeddings (Vectors)

Please note: There are numerous approaches to embeddings, and the most widely adopted one is Transformer-based. If you want to learn about them in detail, please go through my blog https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df.

An embedding model converts text into numbers:

“movie plot text” → [0.12, -0.44, 0.98, …] (hundreds/thousands of numbers)

To compare “plots” or “meanings,” computers need numbers. We convert text (like a movie plot) into a list of numbers called a Vector or Embedding.

What is it? A long list of floating-point numbers (e.g., [0.12, -0.98, 0.55…]).
How it works: Similar concepts end up close to each other in mathematical space. The vector for “King” will be mathematically close to “Queen” and “Royalty.”

This allows us to perform Semantic Search. We are no longer matching words; we are matching meanings.

3. What is a Vector Store?

Once we convert our movie plots into vectors, we need a place to save them. You cannot efficiently store and search these complex vectors in a normal Excel sheet or SQL database. You need a Vector Store.

Key features:

Storage:

In-Memory: Fast but temporary (data is lost when the computer turns off). Good for testing.
On-Disk: Slower but permanent. Good for production.

Indexing: This is the “secret sauce” for speed. Instead of comparing your query to every single movie (which takes too long), the store uses an index to quickly find the closest match.
Clustering: The store breaks the data into groups (clusters).

Imagine a library. You don’t look at every book. You go to the “Sci-Fi” section.
In a Vector Store, we calculate a Centroid (the center point of a cluster). If your search query is far from that Centroid, we ignore that whole cluster. This makes searching massive datasets very fast.

A vector store is basically a system that can:

1. Store vectors (embeddings)

2. Store metadata (like title, year, genre, etc.)

3. Quickly retrieve the most similar vectors when you query

So for movie recommendations:

· Each movie plot → embedding vector

· Store vectors in a vector store

· User gives a movie / preference → embed that too

· Do a similarity search → return closest movies

Core idea: “Recommendation = nearest neighbors in embedding space.”

4. Vector Store vs. Vector Database

There is a distinction between a simple “Store” and a full “Database.”

Chroma DB, as an example, bridges this gap. It is lightweight and open-source but offers features like “Collections” (similar to tables in SQL) and persistent storage.

Similarity: how do we decide “close”?

A very common metric is cosine similarity:

Think of vectors like arrows
Cosine similarity asks: “Are these arrows pointing in the same direction?”
It focuses on direction (meaning) more than length

The definition of cosine similarity is standard: the dot product of the two vectors divided by the product of their magnitudes.
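As a small sketch of that definition, here it is in NumPy. The three-dimensional vectors are toy values made up for illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

arrow = [1.0, 2.0, 3.0]
longer_arrow = [2.0, 4.0, 6.0]      # same direction, twice the length
opposite_arrow = [-1.0, -2.0, -3.0]  # exactly the opposite direction
other_arrow = [3.0, 0.0, -1.0]       # perpendicular to `arrow`

same_direction = cosine_similarity(arrow, longer_arrow)
opposite = cosine_similarity(arrow, opposite_arrow)
perpendicular = cosine_similarity(arrow, other_arrow)
print(same_direction, opposite, perpendicular)
```

Doubling the length of a vector leaves the score unchanged, which is the sense in which cosine similarity looks at direction rather than length.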

Tiny diagram (vector similarity intuition)

Why we need indexing (otherwise it gets painfully slow)

If you have N movies, a naive search compares your query to every movie vector.

1,000 movies → okay-ish
1,000,000 movies → not okay

So vector stores use indexes (special data structures) to search faster.

The clustering/centroid idea

You can cluster vectors into groups.
Each cluster has a centroid (the “center vector”).

Query time:

Find nearest centroid(s)
Search inside those clusters only, instead of searching everything.

This is basically the idea behind IVF-style indexing (inverted file indexes) where vectors are assigned to clusters, and search probes a subset of clusters using something like nprobe.
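The sketch below illustrates the idea with plain NumPy. It is a toy, not how any particular library implements IVF: the function names `build_ivf` and `ivf_search`, the tiny k-means loop, and the random data are all assumptions made for the example.

```python
import numpy as np

def build_ivf(vectors, n_clusters, n_iters=10, seed=0):
    """Cluster the vectors with a few k-means steps; return centroids and per-cluster id lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    inverted_lists = [np.where(assign == c)[0] for c in range(n_clusters)]
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, k=1, nprobe=1):
    """Probe only the nprobe clusters whose centroids are closest to the query."""
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(centroid_dists)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    cand_dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:k]]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 16))   # pretend these are 1,000 movie embeddings
query = rng.normal(size=16)

centroids, inverted_lists = build_ivf(vectors, n_clusters=10)
approx_ids = ivf_search(query, vectors, centroids, inverted_lists, k=3, nprobe=2)
exact_ids = np.argsort(np.linalg.norm(vectors - query, axis=1))[:3]
full_probe_ids = ivf_search(query, vectors, centroids, inverted_lists, k=3, nprobe=10)

print("probed 2 of 10 clusters:", approx_ids)
print("brute force:            ", exact_ids)
```

With a small `nprobe` the search touches only a fraction of the vectors and may miss a true neighbour; probing every cluster gives the same answer as brute force, at brute-force cost. That trade-off between speed and recall is what `nprobe` controls.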

Interpretation:

A vector database does not understand words like humans do.

It differentiates between “happy” and “enjoy” by:

Converting them into numerical vectors and measuring their distance in high-dimensional space.

If the vectors are close → meanings are similar.
If far apart → meanings are different.

Step 1: Words Become Vectors

Before storing in a vector DB, text goes through an embedding model.

Example:

“happy” → [0.12, -0.45, 0.88, …] (384 dimensions)

“enjoy” → [0.10, -0.40, 0.85, …]

“sad” → [-0.60, 0.22, -0.90, …]

Each word becomes a point in high-dimensional space.

Step 2: Semantic Similarity = Distance

Vector DBs use math like:

Cosine similarity
Euclidean distance
Dot product

If two vectors are:

Very close → similar meaning
Far apart → different meaning

Why Are “happy” and “enjoy” Close?

Because embedding models are trained on massive text corpora.

They learn patterns like:

· “I am happy”

· “I enjoy this”

· “She felt happy”

· “She enjoyed the event”

The model statistically learns that:

happy ≈ enjoy ≈ joyful ≈ delighted

So their vectors end up near each other.

Important: The Vector DB Does NOT Understand Meaning

The embedding model does the semantic learning.

The vector DB only does:

Store vectors
+
Compute distance

Why This Works Better Than Keyword Search

Keyword search:

Search: enjoy

Document: happy

→ No match ❌

Vector search:

Search: enjoy

Document: happy

→ Similar vector → Match ✅

5. Practical Implementation: Coding with LangChain & Chroma

Let’s look at how to build this in Python. We will use LangChain to manage the logic and Chroma as our database.

Step A: Setup and Ingestion

pip install -U langchain-chroma langchain-openai langchain-core chromadb

First, we need to import our tools and set up the “Embedding Function” (the brain that turns text into numbers).

# Import necessary libraries (these match the packages installed above)
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# 1. Initialize the Embedding Model
# This converts text like "A story about love" into [0.01, 0.45, ...]
# (needs the OPENAI_API_KEY environment variable to be set)
embeddings = OpenAIEmbeddings()

# 2. Prepare our Movie Data (The "Documents")
movie_plots = [
    Document(page_content="A man embarks on a journey to find his lost love across borders.", metadata={"title": "Movie A", "id": 1}),
    Document(page_content="Space rangers fight an alien invasion on Mars.", metadata={"title": "Movie B", "id": 2}),
    Document(page_content="A romantic drama about a couple separating.", metadata={"title": "Movie C", "id": 3})
]

# 3. Create the Vector Store (Chroma)
# We tell it where to save data (persist_directory) so we don't lose it.
# Explicit ids let us update or delete individual movies later.
vector_db = Chroma.from_documents(
    documents=movie_plots,
    embedding=embeddings,
    ids=["1", "2", "3"],
    persist_directory="./chroma_db_storage"
)

print("Movies stored successfully!")

Step B: Similarity Search

Now, let’s find a recommendation. If a user likes “heartbreak stories,” we query the database.

# The user's query
query = "sad love story about separation"

# Perform Similarity Search
# k=1 means "give me the top 1 most similar movie"
docs = vector_db.similarity_search(query, k=1)

print(f"Recommended Movie: {docs[0].metadata['title']}")
print(f"Plot Summary: {docs[0].page_content}")

# Output should be "Movie C" because the meaning matches "separation" and "sad love".

Step C: CRUD Operations (Update & Delete)

Managing data (CRUD) is vital.

Updating a Document: In Chroma, updating often requires the Document ID.

# Updating the plot of Movie A
updated_movie = Document(
    page_content="A man travels to find his lost brother, not his love.",
    metadata={"title": "Movie A", "id": 1}
)

# Use the update function provided by the DB wrapper
vector_db.update_document(document_id="1", document=updated_movie)
print("Movie A updated.")

Deleting a Document: If a movie is removed from the catalog, we delete its vector.

# Delete the movie with ID 2 (The space movie)
vector_db.delete(ids=["2"])
print("Movie B deleted.")

Where RAG fits into this (quick connection)

Even though this particular section is recommendation-focused, vector stores are also the main “Retrieval” piece of RAG.

RAG = Retrieval-Augmented Generation

Retrieve relevant documents (using vector store)
Give them to the LLM as context
LLM answers using those docs

Retrieval Augmented Generation (RAG) | What is RAG | How does RAG Work

RAG is a technique that reduces AI “hallucinations” (making things up) and gives the model access to your private data. It is an architecture that adds external knowledge to a Large Language Model (LLM) at query time.

Instead of relying only on what the model was trained on, RAG:

retrieves relevant documents
injects them into the prompt
then generates the answer

1. What is RAG? (The “Open Book Exam” Analogy)

Imagine you are taking a very hard history exam.

Standard LLM (ChatGPT): You have to answer purely from memory. If you studied 2 years ago, you won’t know about events that happened yesterday. You might also “guess” if you aren’t sure.
RAG: You are allowed to take a textbook into the exam. When a question comes up, you Retrieve the relevant page, read it, and then Generate your answer.

RAG stands for:

· Retrieval: Find the right data.

· Augmentation: Add that data to the user’s prompt.

· Generation: Let the AI write the answer using that data.

2. Why do we need it? (The Problems)

Standard LLMs have three major problems:

Knowledge Cut-off: They don’t know recent news (e.g., “Who won the game last night?”).
Private Data: They don’t know your company’s internal emails or documents.
Hallucination: They confidently lie when they don’t know the answer.

Solution: RAG addresses all three by making the model look at retrieved facts before answering.

3. RAG vs. Fine-Tuning

A common question is: “Why not just train (fine-tune) the model on my data?”

4. How RAG Works: The 4 Steps

This is the core technical part. RAG is a pipeline.

Step 1: Ingestion & Indexing (Preparing the Data)

Before we can search our documents, we need to prepare them.

1. Load: Read PDF, Text, or Webpage.
2. Split (Chunking): LLMs can’t read a 500-page book at once. We cut the text into small “chunks” (e.g., 500 words each).
3. Embed: Convert those text chunks into numbers (Vectors), just like we learned in the previous lesson!
4. Store: Save them in a Vector Database.

Please note: There are numerous approaches to embeddings, and the most widely adopted one is Transformer-based. If you want to learn about them in detail, please go through my blog https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df.

Python Code for Step 1:

# Also needs: pip install langchain-community langchain-text-splitters
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# 1. Load the data
loader = TextLoader("my_private_document.txt")
documents = loader.load()

# 2. Split the text (Chunking)
# We split into chunks of 1000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# 3. & 4. Embed and Store
# This creates the Vector Database automatically
db = Chroma.from_documents(docs, OpenAIEmbeddings())

print("Data stored in Vector Database!")

Step 2: Retrieval

When a user asks a question (e.g., “What is our refund policy?”), the system does a Semantic Search. It compares the numbers (vector) of the user’s question with the numbers of all the saved chunks and picks the top-k most similar chunks (for example, the top 3).

Types Of Retrieval:

1. Retrieve Multiple Chunks (Top-K Retrieval) — Most Common

Instead of sending one chunk to the LLM, we retrieve multiple relevant chunks.

Example: Retriever → top_k = 5

So the prompt may contain:

Chunk 2
Chunk 5
Chunk 9
Chunk 11
Chunk 14

Even if these chunks are far apart in the document, the LLM can combine them and reconstruct the global meaning.

Example in LangChain:

retriever = vectorstore.as_retriever(search_kwargs={"k":5})

2. Larger Chunk Size

Use bigger chunks so each chunk contains more context.

Example:

chunk_size = 800
chunk_overlap = 100

Pros: More context inside each chunk

Cons: Fewer precise matches in retrieval
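To show what `chunk_size` and `chunk_overlap` do, here is a simplified character-window splitter. It is only a sketch: the function `chunk_text` is made up for illustration and is much cruder than LangChain's RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence boundaries.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 2,000 characters of dummy text
text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(text, chunk_size=800, chunk_overlap=100)

print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap is what keeps a sentence that straddles a chunk boundary from being cut off from its local context.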

3. Parent–Child Chunking (Hierarchical Retrieval)

This is a very powerful RAG technique.

Process:

Split document into large parent chunks

Split parents into smaller child chunks

Retrieval happens on child chunks

When retrieved → return the full parent chunk

Example:

Parent chunk: 1500 tokens
Child chunks: 200 tokens

So retrieval finds precise pieces, but the LLM receives the larger parent context.

LangChain example concept: ParentDocumentRetriever
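The mechanics can be sketched without any library. Everything in this example is an assumption made for illustration: the two sample paragraphs, the helper names, and above all the word-overlap score, which stands in for real vector similarity. It is not the ParentDocumentRetriever API.

```python
def build_children(parents):
    """Split each parent paragraph into sentences; remember which parent each came from."""
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in parent.split(". "):
            children.append((sentence, parent_id))
    return children

def retrieve_parent(query, parents, children):
    """Match the query against small child chunks, but return the full parent chunk."""
    query_words = set(query.lower().split())

    def score(item):
        child_text, _ = item
        # Toy relevance score: shared words (a real system would compare embeddings)
        return len(query_words & set(child_text.lower().split()))

    best_child, parent_id = max(children, key=score)
    return best_child, parents[parent_id]

parents = [
    "Shipping takes five days. Orders over fifty dollars ship free.",
    "Refunds are issued within thirty days. Contact support to start a refund.",
]
children = build_children(parents)

best_child, parent_context = retrieve_parent("how are refunds issued", parents, children)
print("Matched child: ", best_child)
print("Sent to the LLM:", parent_context)
```

The match happens on the short sentence, yet the LLM also receives the follow-up sentence about contacting support, because the whole parent is returned.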

4. Document-Level Metadata

Store metadata with chunks.

Example:

chunk
├─ text
├─ document_id
├─ section
└─ page_number

When a chunk is retrieved, the system can also fetch: All chunks from the same section

This helps reconstruct global context.

5. Sliding Window Retrieval

When one chunk is retrieved, also return neighbor chunks.

Example:

Retrieved chunk: 7
Also include: 6 and 8

So the final context becomes:

Chunk 6
Chunk 7
Chunk 8

This expands context automatically.
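A minimal sketch of the neighbor expansion, assuming chunks are stored in document order and identified by their index (the function name and the `window` parameter are made up for illustration):

```python
def expand_with_neighbors(hit_indices, num_chunks, window=1):
    """Add `window` chunks on each side of every retrieved chunk, without leaving the document."""
    selected = set()
    for idx in hit_indices:
        start = max(0, idx - window)
        end = min(num_chunks - 1, idx + window)
        selected.update(range(start, end + 1))
    return sorted(selected)

single_hit = expand_with_neighbors([7], num_chunks=20)
first_chunk = expand_with_neighbors([0], num_chunks=20)
adjacent_hits = expand_with_neighbors([7, 8], num_chunks=20)

print(single_hit)     # [6, 7, 8]
print(first_chunk)    # [0, 1]
print(adjacent_hits)  # [6, 7, 8, 9]
```

Using a set means overlapping windows from adjacent hits are merged, so no chunk is sent to the LLM twice.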

Python Code for Step 2:

query = "What is the refund policy?"

# Search the DB for the 2 most relevant chunks
relevant_docs = db.similarity_search(query, k=2)

print(f"Found snippet: {relevant_docs[0].page_content}")

Step 3: Augmentation

We take the User Query and stick the Retrieved Data right next to it. We create a “Mega Prompt” behind the scenes.

The Prompt looks like this:

“You are a helpful assistant. Answer the user’s question using ONLY the context provided below.

Context: [The refund policy is 30 days…] (This is the retrieved chunk)

User Question: What is the refund policy?”
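Assembling that prompt is plain string formatting. The sketch below uses a hypothetical `build_prompt` helper and a made-up one-sentence chunk as the retrieved context.

```python
def build_prompt(question, retrieved_chunks):
    """Combine the retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "You are a helpful assistant. Answer the user's question using ONLY "
        "the context provided below.\n\n"
        f"Context: {context}\n\n"
        f"User Question: {question}"
    )

# Placeholder chunk standing in for whatever the retriever returned
retrieved_chunks = ["The refund policy is 30 days."]
prompt = build_prompt("What is the refund policy?", retrieved_chunks)
print(prompt)
```

When several chunks are retrieved, they are simply joined with blank lines inside the context section.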

Step 4: Generation

The LLM reads the Mega Prompt. Because the answer is right there in the context, it can generate an answer grounded in the retrieved facts instead of guessing.

Python Code for Step 3 & 4 (The Full Chain):

# RetrievalQA is LangChain's classic question-answering chain (marked legacy in newer releases)
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

# Initialize the LLM
llm = OpenAI()

# Create the RAG Chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "Stuff" simply means stuffing the context into the prompt
    retriever=db.as_retriever()
)

# Run the chain
response = qa_chain.invoke({"query": "What is the refund policy?"})

print(response["result"])

In RAG systems, global context means the model can understand relationships across the entire document, not just neighboring chunks. Since simple chunk overlap only preserves local context, several techniques are used to recover global context.

Embeddings themselves do NOT solve the neighboring vs global context problem.

Embeddings only convert text → vectors so that similar pieces of text are close in vector space.

The global context problem is mainly solved by retrieval strategies, not by the embedding type alone.

However, different embedding models capture semantic relationships better, which helps retrieve relevant chunks from anywhere in the document, indirectly helping global context.

Below is a clear list of the main embedding types you will encounter in RAG systems, with code examples and when to use them.

1. OpenAI Embeddings

Most commonly used in production.

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

Usage

RAG systems

semantic search

chatbots with private data

Pros

High accuracy
Optimized for retrieval
No local GPU needed

Cons

Paid API
Requires internet

Working

Uses large transformer models trained on massive datasets.

Converts text into high-dimensional dense vectors (~1536 dimensions).

Similar meaning → vectors close in vector space.

How it helps global context

High semantic understanding.

Even if information is far apart in the document, similar meaning vectors allow retrieval of relevant chunks.

Example:

Document:
Chunk1 → Introduction to AI
Chunk10 → Applications of AI

Query:
"What are AI applications?"

Embedding similarity retrieves Chunk10 even if far away.

2. HuggingFace Embeddings (Local Models)

Useful for local testing and study.

# pip install langchain-huggingface sentence-transformers
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

Usage

local RAG systems

private data

offline applications

Pros

Free
Runs locally
Many models available

Cons

Slightly lower performance than large APIs

Examples of HF embedding models:

all-MiniLM-L6-v2
all-mpnet-base-v2
bge-large
e5-large

Working

Based on BERT-style sentence transformers.

Uses contrastive learning:

Similar sentences → closer vectors

Different sentences → farther vectors.

Example training idea:

Sentence A: "Cat is an animal"
Sentence B: "Dog is an animal"
→ embeddings placed close

Global context benefit

Retrieves semantically similar chunks, even if wording differs.

Example

Query:
"Neural network training"

Chunk:
"Backpropagation is used to train deep learning models"
Embedding similarity connects them.

3. Cohere Embeddings

Another cloud embedding provider.

# pip install langchain-cohere
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(
    model="embed-english-v3.0"
)

Usage

enterprise search

semantic similarity

document clustering

Pros

High-quality embeddings
good multilingual support

Cons

Paid API

Working

Trained specifically for:

semantic search

clustering

retrieval tasks

Global context benefit

Better semantic similarity scoring, enabling retrieval of relevant chunks from anywhere.

4. Instructor Embeddings

Instruction-based embeddings.

from langchain_community.embeddings import HuggingFaceInstructEmbeddings
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large"
)

Usage

Used when embeddings must adapt to different tasks.

Example:

Instruction: Represent the document for retrieval
Text: "Machine learning is..."

These models embed instruction + text together, improving retrieval performance.

Pros

Task-aware embeddings
better semantic understanding

Working

Embeddings include an instruction + text.

Example

Instruction: Represent document for retrieval
Text: Machine learning models learn patterns

The model learns to create vectors specific to the task.

Global context benefit

Better task-specific embeddings improve retrieval accuracy across the document.

Example tasks:

search
clustering
question answering

5. Sentence Transformer Embeddings

A popular family of models based on BERT architecture.

Example models:

all-MiniLM

mpnet

sentence-t5

Sentence transformers generate sentence-level embeddings for similarity tasks.

Example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Hello world")

Usage

semantic search

document similarity

recommendation systems

6. Google / Gemini Embeddings

Google provides embeddings through its AI APIs.

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001"
)

Usage

Google ecosystem

large-scale enterprise search

7. BGE Embeddings (BAAI)

Very strong open-source embeddings.

Example models:

bge-small

bge-large

bge-m3

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

Usage

high-quality retrieval

multilingual search

These models perform extremely well on the MTEB benchmark for embedding evaluation.

The Massive Text Embedding Benchmark (MTEB) is a standardized framework and public leaderboard used to evaluate the performance of text embedding models across a wide range of tasks and languages. It is widely used when selecting embedding models for applications like Retrieval-Augmented Generation (RAG) and semantic search.

Working

Uses contrastive learning optimized for retrieval.

Training objective:

Query → relevant document closer
Query → irrelevant document farther

Example training pair:
Query: "capital of France"
Positive: "Paris is the capital of France"
Negative: "Python is a programming language"

Global context benefit

Strong query-document matching, which improves retrieving correct chunks anywhere in the document.
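
The training objective above can be sketched as a loss function. The source does not name the exact loss, so this assumes an InfoNCE-style contrastive loss, a common choice for retrieval models. The 2-D vectors and the `temperature` value are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the query scores the positive above all negatives."""
    pos_score = math.exp(dot(query, positive) / temperature)
    neg_scores = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos_score / (pos_score + neg_scores))

# Hypothetical unit-length embeddings for the training example above
query = [1.0, 0.0]           # "capital of France"
paris_doc = [0.8, 0.6]       # "Paris is the capital of France"
python_doc = [0.0, 1.0]      # "Python is a programming language"

# Correct pairing: the relevant document is the positive
loss_well_aligned = contrastive_loss(query, paris_doc, [python_doc])
# Wrong pairing: pretend the irrelevant document were the positive
loss_misaligned = contrastive_loss(query, python_doc, [paris_doc])

print(loss_well_aligned < loss_misaligned)
```

Training nudges the encoder to lower this loss, which pulls queries toward relevant documents and pushes them away from irrelevant ones.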

8. Self-Hosted Embeddings

You can host your own embedding models on servers or GPUs.

from langchain.embeddings import SelfHostedEmbeddings

Usage

enterprise security

large-scale private deployments

9. Fake Embeddings (Testing Only)

Used only for testing pipelines.

from langchain.embeddings import FakeEmbeddings

Usage:

testing

debugging RAG pipeline

10. Multilingual Embeddings

Special models for multiple languages.

Examples:

multilingual-e5-large

LaBSE

bge-m3

Usage:

cross-language search

global products

If you’re interested in a step-by-step working example of RAG, check out my detailed blog post.

https://medium.com/@saha.soumyadeep90/designing-scalable-rag-systems-using-vectordb-a-hands-on-walkthrough-de9f1eac768d

Positional Encoding in Transformers

How Positional Encoding Depends on Embeddings

Positional Encoding is not independent — it works together with word embeddings. In fact, its design is tightly connected to how embeddings are represented in Transformers.

Let’s break this down clearly.

1. Same Dimension as Embeddings

Every token in a Transformer is first converted into a word embedding vector of size d_model.

For example:

If d_model = 512, each word becomes a 512-dimensional vector.

Positional Encoding is also created with the same dimension (512).

Why?

Because positional encoding is added directly to the embedding vector:

Input to Transformer = Word Embedding + Positional Encoding

If the dimensions didn’t match, this addition wouldn’t be possible.

So positional encoding is structurally dependent on embedding size.

2. It Modifies the Embedding Space

Embeddings capture semantic meaning:

“king” and “queen” are close in embedding space.
“dog” and “table” are far apart.

Positional encoding shifts these embeddings slightly to encode position.

Example:

Embedding(“cat”) at position 1
Embedding(“cat”) at position 5

They start with the same semantic embedding, but after adding positional encoding, they become different vectors.

This allows the model to distinguish:

The same word appearing in different positions.

So positional encoding does not replace embeddings — it augments them.
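
Both points can be seen in a few lines of code. The text above does not say which positional scheme is used, so this sketch assumes the sinusoidal encoding from the original Transformer paper (learned positional embeddings are another common option). The embedding values for “cat” are invented, and d_model is shrunk to 8 for readability.

```python
import math

def sinusoidal_position(pos, d_model):
    """Positional encoding for one position: sin on even indices, cos on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def transformer_input(embedding, pos):
    """Element-wise sum: word embedding + positional encoding."""
    pe = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

d_model = 8                       # tiny for readability; 512 in the example above
cat_embedding = [0.5] * d_model   # hypothetical embedding for "cat"

cat_at_1 = transformer_input(cat_embedding, 1)
cat_at_5 = transformer_input(cat_embedding, 5)

print(len(cat_at_1) == d_model)   # the sum keeps the embedding dimension
print(cat_at_1 != cat_at_5)       # same word, different position, different vector
```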

Please go through my article on positional encoding:

https://medium.com/@saha.soumyadeep90/positional-encoding-explained-simply-9c6b88b5d8ff

Natural Language Processing: From Beginner to Advanced

Soumyadeep Saha — Sat, 24 Jan 2026 11:41:55 GMT

We are going to explore the fascinating world of Natural Language Processing (NLP)

Introduction to NLP

I have broken this down into 5 key modules. For each, I will provide a simple explanation, a visual aid, and Python code to show you how it works in the real world.

Module 1: What is Natural Language Processing (NLP)?

Simple Explanation: Imagine you are trying to teach a dog to understand English. You can teach it simple commands (“Sit”, “Stay”), but it can’t understand a Shakespeare poem or a complex joke. Computers are similar; they understand 0s and 1s, not words.

NLP is the bridge that helps computers understand, interpret, and generate human language. It is a mix of three fields:

Linguistics: The rules of language (grammar, syntax).
Computer Science: The programming and algorithms.
Artificial Intelligence (AI): The “brain” that learns from data.

The goal (in simple words)

Humans talk using messy, flexible language. NLP tries to make machines handle that mess.

Humans: “Can you please remind me tomorrow?”
Machine must infer: “Set a reminder at tomorrow’s date/time.”

Code Example: The First Step (Tokenization) Before a machine can understand a sentence, it must break it down into small pieces called “tokens” (words).

# We use a popular library called NLTK (Natural Language Toolkit)
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer data
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')

text = "NLP helps machines understand humans."

# Break the text into words (tokens)
tokens = word_tokenize(text)

print(tokens)
# Output: ['NLP', 'helps', 'machines', 'understand', 'humans', '.']

Why NLP is important?

The lecture’s main point: language is how humans transfer knowledge. If machines can work with language, machines become way more useful.

NLP matters because:

We communicate constantly via text: emails, chats, reviews, posts
There is too much text for humans to read manually
Businesses want automation: support tickets, moderation, analytics, search

Module 2: Major Applications of NLP

Simple Explanation: Why do we care about NLP? Because it powers the apps you use every day. The three main uses:

Smart Reply & Translation: Like Gmail suggesting “Sounds good!” or Google Translate converting English to Spanish.
Content Moderation: Automatically hiding “hate speech” or bullying comments on social media.
Sentiment Analysis: Figuring out the “mood” of a text. Companies use this during elections or product launches to see if people are happy (positive) or angry (negative).

A) Sentiment Analysis……..

What it is: Detect the emotion/opinion in text.

Positive: “This phone is amazing!”
Negative: “Worst service ever.”
Neutral: “The package arrived today.”

Where used

Product reviews
Election/public opinion analysis (as the lecture mentions)
Brand monitoring (“Are people angry about us today?”)

Code Example: Sentiment Analysis Let’s write a simple program to detect if a review is positive or negative.

from textblob import TextBlob

# A user review
review = "I absolutely love this new phone! It's amazing."

# Analyze sentiment
blob = TextBlob(review)
sentiment_score = blob.sentiment.polarity

# Polarity ranges from -1 (Negative) to +1 (Positive)
if sentiment_score > 0:
    print("This is a POSITIVE review.")
elif sentiment_score < 0:
    print("This is a NEGATIVE review.")
else:
    print("This is NEUTRAL.")

# Output: This is a POSITIVE review.

B) Text Classification……..

What it is: Assign a label/category to text.

Examples:

Email → spam / not spam
News → sports / politics / tech

Support tickets → “refund”, “delivery”, “payment”

C) Smart Reply (like Gmail suggestions)……….

What it is: Given a message, the model suggests short replies.

Incoming email: “Can we meet at 5?”
Smart replies: “Sure”, “Can we do 6?”, “Sounds good”

This is basically a context-based text generation problem.

D) Content Moderation (toxic/hate/inappropriate filtering)………

What it is: Detect harmful content and flag/remove it.

Why it’s hard:

People hide meaning using slang, sarcasm, spelling tricks
Cultural context matters
False positives can censor harmless content

E) Language Detection + Translation………

Language detection: detect language automatically
Translation: convert one language to another

Example:

“Bonjour” → French
Translate speech/text instantly (like Google Translate)

F) Question Answering + Knowledge Graphs………

A knowledge graph is like a giant network of facts.

Example query: “Who is the CEO of X?”
Google often uses structured data (entities + relationships).

Diagram: tiny knowledge graph

This structure helps machines answer faster than searching raw text every time.
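
A minimal sketch of the idea, assuming the graph is stored as (subject, relation, object) triples. “X” is the placeholder company from the example query; the other names and relations are invented for illustration.

```python
# A tiny knowledge graph stored as (subject, relation, object) triples.
triples = [
    ("X", "has_ceo", "Alice"),
    ("X", "headquartered_in", "Paris"),
    ("Alice", "born_in", "Lyon"),
]

# Index the triples once so a lookup does not scan the whole list each time
index = {}
for subject, relation, obj in triples:
    index[(subject, relation)] = obj

def answer(subject, relation):
    """Return the object for a (subject, relation) pair, or None if unknown."""
    return index.get((subject, relation))

print(answer("X", "has_ceo"))       # "Who is the CEO of X?"
print(answer("X", "founded_in"))    # a fact the graph does not contain
```

The lookup is a single dictionary access, which is why structured facts can be answered faster than searching raw text.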

G) Text Summarization………

What it is: Reduce a long text into a shorter version while keeping key meaning.

Two styles:

Extractive: pick important sentences from the original
Abstractive: generate a new shorter version (more human-like)
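
The extractive style is simple enough to sketch by hand. The version below assumes a basic word-frequency heuristic: a sentence scores higher when its words are common across the whole text. The function name and sample text are illustrative; production systems use stronger scoring.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the chosen sentences in their original order
    return [s for s in sentences if s in top]

article = (
    "Embeddings map words to vectors. "
    "Vectors make words comparable. "
    "I had coffee today."
)
summary = extractive_summary(article)
print(summary)
```

Abstractive summarization cannot be done this way, because it has to generate new sentences rather than select existing ones.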

Module 3: The Evolution of NLP Techniques

Simple Explanation: We didn’t just wake up with smart AI like ChatGPT. NLP went through three broad stages:

Rule-Based (The Old Way): Programmers wrote strict manual rules.

Example: “If the sentence has the word ‘bad’, label it ‘Negative’.”
Problem: It fails on sentences like “Not bad” (which is actually positive).

Machine Learning (1990s): Computers started using statistics. Instead of rules, we fed them thousands of documents and let them calculate the probability of which words appear together.
Deep Learning (2010s): We built “Neural Networks”, loosely inspired by the human brain. These can learn very complex patterns from data.

Code Example: Rule-Based vs. Machine Learning

1. The Old Rule-Based Way (Brittle):

def simple_sentiment(text):
    # Naive rule: any occurrence of "bad" means the text is negative
    if "bad" in text:
        return "Negative"
    return "Positive"

print(simple_sentiment("This movie is not bad"))
# Output: Negative (INCORRECT! 'Not bad' is good, but the rule failed.)

2. The Modern Way (Concept): In modern ML, we don’t look for specific keywords; we train a model on millions of sentences so it learns that “not” flips the meaning of “bad”.

Module 4: Deep Learning & Transformers

Before Transformers, most language models read sentences one word at a time, from left to right. They often forgot the beginning of a long sentence by the time they reached the end.

Transformers changed this. They can look at the entire sentence at once. They use a mechanism called “Attention”. Imagine reading a sentence and highlighting the most important words that relate to each other, even if they are far apart.
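
A rough sketch of what “Attention” computes, assuming the standard scaled dot-product form and a single query word. The 2-D vectors are invented, and real models learn separate projections for queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-D vectors for three words in a sentence
tokens = ["river", "bank", "money"]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = keys                 # simplification: reuse the keys as values
query = [3.0, 0.0]            # vector for the word currently being encoded

weights, output = attention(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

The weights are the “highlighting”: every word gets a share of attention, and the largest share goes to the word whose key best matches the query, no matter how far apart the words are.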

Code Example: Using a Transformer (BERT) We can use the transformers library by Hugging Face to use these powerful models easily.

from transformers import pipeline

# Load a pre-trained transformer model for sentiment analysis
# (with no model name given, the library picks a default English sentiment model)
classifier = pipeline("sentiment-analysis")

# The model takes the whole sentence into account
result = classifier("The food was okay, but the service was terrible.")

print(result)
# Example output (the exact score will vary): [{'label': 'NEGATIVE', 'score': 0.99}]
# Here "terrible service" outweighs "okay food".

Module 5: Challenges (Ambiguity & Sarcasm)

Simple Explanation: Human language is messy. This highlights why machines still struggle:

Ambiguity: One word can have multiple meanings.

Example: “I went to the bank.” (River bank or Money bank?) Humans know from context; machines struggle.

Sarcasm: Saying the opposite of what you mean.

Example: “Oh, great! Another flat tire.” (The machine sees “Great” and thinks you are happy).

Idioms: Phrases that don’t make sense literally.

Example: “Break a leg.” (Machine thinks you want to hurt someone; you actually mean “Good Luck”).

Code Example: Disambiguation (The “Bank” Problem) This example shows how we use a method called “Word Sense Disambiguation” to tell the difference.

import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

nltk.download('wordnet')  # Lesk looks word senses up in WordNet

# Context 1: Money
sentence1 = "I went to the bank to deposit money."
sense1 = lesk(word_tokenize(sentence1), 'bank')
print(f"Context 1 meaning: {sense1.definition()}")
# Ideal answer: a financial institution...

# Context 2: River
sentence2 = "I sat on the bank of the river."
sense2 = lesk(word_tokenize(sentence2), 'bank')
print(f"Context 2 meaning: {sense2.definition()}")
# Ideal answer: sloping land (especially the slope beside a body of water)

# Note: Lesk is a simple word-overlap heuristic, so the sense it actually
# returns can differ from the ideal answers above.

End to End NLP Pipeline

In this lesson, we are going to walk through the 5-Step End-to-End NLP Pipeline. Think of this pipeline as an assembly line in a factory. You start with raw materials (messy text), process them, and end up with a finished product (a working AI model).

Here is the roadmap we will follow:

Step 1: Data Acquisition (Gathering the Raw Material)

Simple Explanation: Before you can cook, you need ingredients. In NLP, your “ingredients” are text data. You can get data in three ways:

Available Data: You already have it (e.g., company emails).
Public Data: You download it from the internet (e.g., Kaggle datasets).
Scraping: You write a bot to “read” websites and save the text (e.g., copying product reviews from Amazon).

Code Example: Web Scraping Let’s say we want to grab some text from a webpage using a library called BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# The URL we want to scrape (placeholder address)
url = "https://example.com/reviews"

# Get the page content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all paragraphs (simulating grabbing reviews)
reviews = [p.text for p in soup.find_all('p')]

print(reviews[:2])
# Illustrative output on a real review page: ['This product is great!', 'I did not like the service.']

Step 2: Text Preparation (Cleaning the Ingredients)

Simple Explanation: Raw text is messy. It has emojis, HTML tags (such as <br>), and weird symbols. If we feed this to the computer, it will get confused. This step involves:

Cleaning: Removing HTML tags, emojis, and punctuation.
Tokenization: Chopping sentences into words.
Stop Word Removal: Deleting boring words like “is”, “the”, “at” that don’t add much meaning.

Code Example: Cleaning & Tokenizing

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Needs the NLTK 'stopwords' and 'punkt' data (nltk.download(...) once)

# Raw dirty text
text = "The movie was AMAZING!!! 😃 I loved it."

# 1. Remove HTML tags (none in this sample, but scraped text usually has them)
clean_text = re.sub('<.*?>', '', text)

# 2. Remove special characters (keep only letters)
clean_text = re.sub('[^a-zA-Z]', ' ', clean_text)

# 3. Tokenize (split into words)
words = word_tokenize(clean_text.lower())

# 4. Remove Stop Words ("the", "it", "was")
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words]

print(filtered_words)
# Output: ['movie', 'amazing', 'loved']

Step 3: Feature Engineering (Translating for the Computer)

Simple Explanation: Computers cannot read words; they only understand numbers. Feature Engineering is the process of converting your text into a list of numbers (vectors) that represent the meaning.

Two common ways to do this:

Bag of Words (BoW): We count how many times each word appears.
TF-IDF: A smarter way that gives more importance to rare, unique words and less importance to common words.

Code Example: Bag of Words (CountVectorizer)

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love coding",
    "Coding is fun"
]

# Create the vectorizer
vectorizer = CountVectorizer()

# Convert text to numbers
X = vectorizer.fit_transform(documents)

# Show the numbers (The "features")
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Output:
# ['coding' 'fun' 'is' 'love']
# [[1 0 0 1]    <- "I love coding" (1 'coding', 0 'fun', 0 'is', 1 'love')
#  [1 1 1 0]]   <- "Coding is fun"
# Note: the default tokenizer lowercases text and drops one-letter words such as "I".

Step 4: Modeling (The Brain)

Simple Explanation: Now that we have numbers, we can train a “Model”.

Machine Learning (ML): We use algorithms like Naive Bayes or Support Vector Machines. These are great when you have less data. You have to tell the model what to look for (manual feature engineering).
Deep Learning (DL): We use Neural Networks. These are better for huge amounts of data. They figure out the features automatically, but they are “Black Boxes” (hard to explain why they made a decision).

Code Example: Training a Simple Classifier

from sklearn.naive_bayes import MultinomialNB

# X is our numbers from Step 3, y is our labels (1=Positive, 0=Negative)
y = [1, 1] # Both our previous sentences were positive

# Train the model
# (with only one class in y, the model can only ever predict 1;
#  real training data needs both positive and negative examples)
model = MultinomialNB()
model.fit(X, y)

# Predict a new sentence
test_sentence = vectorizer.transform(["Coding is amazing"])
prediction = model.predict(test_sentence)

print(f"Prediction: {prediction[0]}")
# Output: Prediction: 1 (Positive)

Step 5: Deployment (Going Live)

Simple Explanation: A model sitting on your laptop is useless. Deployment means putting your model on a server so other people can use it (like via a website or app).

Monitoring: Once live, you must watch it. If people start using new slang words that your model doesn’t know, it will stop working.
Updating: You need to retrain the model periodically with new data to keep it smart.
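
One simple monitoring signal for the slang problem is the share of incoming words the model has never seen. This is a sketch under stated assumptions: the function name, the sample requests, and the 0.2 threshold are all hypothetical, and the vocabulary is the one produced by the CountVectorizer in Step 3.

```python
def out_of_vocabulary_rate(texts, known_vocabulary):
    """Fraction of incoming tokens the deployed model has never seen."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in known_vocabulary)
    return unknown / len(tokens)

known_vocabulary = {"coding", "fun", "is", "love"}   # vocabulary from Step 3
recent_requests = ["coding is fun", "coding is bussin fr"]

rate = out_of_vocabulary_rate(recent_requests, known_vocabulary)
RETRAIN_THRESHOLD = 0.2   # hypothetical cutoff; tune it for your application

print(f"OOV rate: {rate:.2f}")
print("Time to retrain" if rate > RETRAIN_THRESHOLD else "Model vocabulary still fits")
```

A rising rate over time suggests the language in production has drifted away from the training data.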

Code Example: A Mock API Endpoint (Flask) This is how a web server might look when you deploy your model.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    text = data['text']

    # Preprocess and Predict (using the vectorizer and model from the previous steps)
    text_vector = vectorizer.transform([text])
    result = model.predict(text_vector)[0]

    return jsonify({'sentiment': 'Positive' if result == 1 else 'Negative'})

# If you ran this, you could send a text to the server and get a prediction back!

Text Preprocessing

We are now diving deep into Step 2 of the NLP Pipeline: Text Preprocessing.

If you feed dirty text (with HTML tags, emojis, and typos) into a model, you get “Garbage In, Garbage Out.” We need to scrub it clean.

Module 1: Basic Cleaning (The “Janitorial” Work)

Simple Explanation: Before we look at the meaning of words, we need to standardize the format.

Lowercasing: “Apple” and “apple” should be treated as the same word.
Removing HTML: If you scrape data from the web, it comes with invisible tags like <br> or <p> that confuse the machine.
Removing URLs & Punctuation: Links (http://...) and symbols (!?,.) often add noise without adding sentiment.

Code Example: Cleaning with Regex We use Python’s re (Regular Expressions) library for this.

import re
import string

def clean_text(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove HTML tags (anything between < and >)
    text = re.sub(r'<.*?>', '', text)

    # 3. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # 4. Remove Punctuation
    # We replace punctuation with an empty string
    text = text.translate(str.maketrans('', '', string.punctuation))

    return text

raw_tweet = "Watch this MOVIE! <br> It's GR8. https://movie.com"
print(clean_text(raw_tweet))
# Output: watch this movie  its gr8
Module 2: Advanced Cleaning (Emojis & Spelling)

Simple Explanation: This highlights two specific “noise” types:

Emojis: :) or 🔥. Sometimes we want to keep them (for sentiment), but often we want to remove them or translate them to text (e.g., convert :) to “happy”).
Chat Speak & Typos: Users write “u” instead of “you” or “luv” instead of “love.” We use a dictionary mapping to fix these.

Code Example: Handling Emojis & Chat Speak

# A simple dictionary for chat speak
chat_words = {
    "u": "you",
    "gr8": "great",
    "luv": "love",
    "r": "are"
}

def clean_chat_speak(text):
    words = text.split()
    new_words = []
    for w in words:
        if w in chat_words:
            new_words.append(chat_words[w])
        else:
            new_words.append(w)
    return " ".join(new_words)

# Removing Emojis (by dropping non-ASCII characters)
def remove_emojis(text):
    # Encoding to ASCII with errors ignored drops every non-ASCII character.
    # This covers emojis, but it also strips accented letters.
    return text.encode('ascii', 'ignore').decode('ascii')

input_text = "u r gr8 😃"
clean = clean_chat_speak(input_text)
clean = remove_emojis(clean)
print(clean)
# Output: you are great

Module 3: Tokenization

Simple Explanation: Tokenization is the act of chopping text into pieces.

Sentence Tokenization: Splitting a paragraph into sentences.
Word Tokenization: Splitting a sentence into individual words.

Why? Because the computer analyzes text one “token” (unit) at a time.

Code Example: NLTK Tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is fun. I am learning fast!"

# Sentence Tokenization
print(sent_tokenize(text))
# Output: ['NLP is fun.', 'I am learning fast!']

# Word Tokenization
print(word_tokenize(text))
# Output: ['NLP', 'is', 'fun', '.', 'I', 'am', 'learning', 'fast', '!']

Module 4: Stemming vs. Lemmatization

Simple Explanation:

In English, words change shape (inflection): “run,” “running,” “ran,” “runs.” To a computer, these look like 4 different words. We want to reduce them to their root concept: “RUN.”

Let’s compare two methods:

Stemming (Fast but dumb): It just chops off the end of the word.

Example: “Changing” → “Chang” (Not a real word, but usually good enough).

Lemmatization (Slow but smart): It uses a dictionary (like WordNet) to find the actual root word.

Example: “Better” → “Good” (It understands the meaning).

Code Example: PorterStemmer vs. WordNetLemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer
# The lemmatizer needs the WordNet data: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"

print("Stemming:", stemmer.stem(word))
# Output: run

word2 = "better"
print("Stemming:", stemmer.stem(word2))
# Output: better (Stemmer doesn't know grammar)

print("Lemmatization:", lemmatizer.lemmatize(word2, pos='a'))
# Output: good (Lemmatizer knows 'better' is an adjective form of 'good')

Module 5: The Project (Movie Classification)

Simple Explanation: Let’s conclude with an assignment: building a dataset of movie reviews to classify them. This brings all the steps together. You scrape the data, create a DataFrame, and apply the cleaning functions we just wrote.

Code Example: The Complete Pipeline Function Here is how you would apply all these steps to a list of movie reviews in a real project.

import re
import string
import pandas as pd

# 1. Create the Dataset (Simulating the 'Data Acquisition' step)
data = {
    'review': [
        "The movie was <br> AMAZING!!! 😃",
        "worst. movie. ever. don't watch it.",
        "I luv the acting, u should see it."
    ],
    'sentiment': ['positive', 'negative', 'positive']
}
df = pd.DataFrame(data)

# 2. Define the Master Preprocessing Function
def master_clean(text):
    text = text.lower()                     # Lowercase
    text = re.sub(r'<.*?>', '', text)       # Remove HTML
    text = text.encode('ascii', 'ignore').decode('ascii') # Remove Emojis
    text = text.translate(str.maketrans('', '', string.punctuation)) # Remove Punctuation
    text = clean_chat_speak(text)           # Fix "luv" -> "love" (function from Module 2)
    return text

# 3. Apply to the DataFrame
df['cleaned_review'] = df['review'].apply(master_clean)

print(df[['review', 'cleaned_review']])

Output:

                                review                       cleaned_review
0     The movie was <br> AMAZING!!! 😃                the movie was amazing
1  worst. movie. ever. don't watch it.       worst movie ever dont watch it
2   I luv the acting, u should see it.  i love the acting you should see it

Text Representation | Bag of Words | Tf-Idf | N-grams, Bi-grams and Uni-grams

We have arrived at a critical juncture: Step 3 of the Pipeline — Feature Engineering (Text Representation).

In simple terms: How do we turn words into numbers so the machine can understand them?

This section covers three major techniques, from simple to advanced. I will explain each with a diagram, code, and a clear “Pros & Cons” list.

Module 1: One-Hot Encoding (The Simplest Approach)

Simple Explanation:

Imagine you have a vocabulary of 5 words: [“I”, “love”, “NLP”, “coding”, “hate”].

To represent the word “NLP”, you create a list of zeros and put a 1 in the slot where “NLP” sits.

“I” → [1, 0, 0, 0, 0]
“NLP” → [0, 0, 1, 0, 0]

The Problem:

If your vocabulary has 50,000 words (a modest size for real English text), every single word becomes a list with 49,999 zeros and one 1. This is called Sparsity. It wastes memory and compute power.

from sklearn.preprocessing import OneHotEncoder

# Our vocabulary is implicitly created from this data
data = [['I'], ['love'], ['NLP']]
encoder = OneHotEncoder(sparse_output=False)  # sparse_output needs scikit-learn 1.2 or newer

# Convert to One-Hot
one_hot = encoder.fit_transform(data)

print(encoder.get_feature_names_out())
# Output: ['x0_I' 'x0_NLP' 'x0_love']

print(one_hot)
# Output:
# [[1. 0. 0.]   <- "I"
#  [0. 0. 1.]   <- "love"
#  [0. 1. 0.]]  <- "NLP"

Module 2: Bag of Words (BoW) & N-Grams

Simple Explanation: Instead of marking just one word, we count all the words in a sentence.

Sentence: “I love NLP and I love coding.”
BoW Vector: {“I”: 2, “love”: 2, “NLP”: 1, “and”: 1, “coding”: 1}

The Problem with BoW: It loses order. “dog bites man” and “man bites dog” look exactly the same because they share the same words.

The Solution: N-Grams Instead of counting single words (Unigrams), we count pairs (Bigrams) or triplets (Trigrams).

Bigrams: “I love”, “love NLP”, “NLP and”… In a review, a bigram such as “not bad” is then treated as a single unit, preserving meaning.
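
The word-order problem and the bigram fix can be checked with the standard library alone. This is a minimal sketch with a plain whitespace tokenizer; the helper names `unigrams` and `bigrams` are illustrative.

```python
from collections import Counter

def unigrams(text):
    """Bag of Words: count single words, ignoring order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Count adjacent word pairs, which keeps local word order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a, b = "dog bites man", "man bites dog"

print(unigrams(a) == unigrams(b))   # True: the word counts are identical
print(bigrams(a) == bigrams(b))     # False: the word pairs differ
print(bigrams("this movie is not bad"))
```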

Code Example: CountVectorizer (BoW & Bigrams)

from sklearn.feature_extraction.text import CountVectorizer

text = ["I love NLP and I love coding"]

# 1. Standard Bag of Words (Unigrams)
# (the default tokenizer lowercases and drops one-letter words such as "I")
cv = CountVectorizer()
vector = cv.fit_transform(text)
print("BoW:", cv.get_feature_names_out())
# Output: ['and', 'coding', 'love', 'nlp']

# 2. Bigrams (Pairs of words)
cv_bigram = CountVectorizer(ngram_range=(2, 2))
vector_bi = cv_bigram.fit_transform(text)
print("Bigrams:", cv_bigram.get_feature_names_out())
# Output: ['and love', 'love coding', 'love nlp', 'nlp and']

Module 3: TF-IDF (Term Frequency — Inverse Document Frequency)

Simple Explanation: In Bag of Words, common words like “the” or “is” might appear 100 times, making them seem most important. But they are useless! TF-IDF fixes this by:

TF (Term Frequency): How often a word appears in this document. (Rewards frequent words).
IDF (Inverse Document Frequency): How rare the word is across all documents. (Punishes common words like “the”).

Result: Words like “Netflix” or “Quantum” get high scores. Words like “the” get near-zero scores.

What is TF-IDF?

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a statistical technique that quantifies the importance of a word in a document based on how often it appears in that document and in a given collection of documents (corpus).

The intuition is this: if a word occurs frequently in a document, it is likely more important and relevant than words that appear fewer times, so it should get a high score (TF). But if a word also appears in many other documents, it is probably not a meaningful word, so it should get a lower score (IDF).

The relevance of a word is proportional to the amount of information it gives about its context (a sentence, a document, or a full dataset). The most relevant words help us understand the document without reading it completely. They are not necessarily the most frequent words, since stopwords like “the”, “of”, or “a” occur very often in many documents but carry little information. TF-IDF is widely used in Information Retrieval and Text Mining.

The TF-IDF score of a term t in a document d, with respect to a corpus of N documents, is built from two parts:

TF (Term Frequency) Score

How often a term appears inside a document.

Common version:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

IDF (Inverse Document Frequency)

How rare the term is across documents.

Common version:

IDF(t) = log(N / df(t)), where df(t) is the number of documents that contain t

TF‑IDF

TFIDF(t, d) = TF(t, d) × IDF(t)

Example 1:

"i love nlp"
"i love love deep learning"
"nlp is fun"

Here:

N = 3
df("love") = 2 (in doc1 and doc2)
df("deep") = 1 (only doc2)

Compute IDF (natural log):

IDF(love) = log(3/2) ≈ 0.405
IDF(deep) = log(3/1) ≈ 1.099

In document 2 ("i love love deep learning", total words = 5):

TF(love) = 2/5 = 0.4 → TFIDF(love) = 0.4 × 0.405 ≈ 0.162
TF(deep) = 1/5 = 0.2 → TFIDF(deep) = 0.2 × 1.099 ≈ 0.220

Example 2:

TF-IDF Example

In order to fully understand how TF-IDF works, I will give you a concrete example. Let’s assume that we have a collection of four documents as follows:

d1 : “The sky is blue.”
d2 : “The sun is bright today.”
d3 : “The sun in the sky is bright.”
d4 : “We can see the shining sun, the bright sun.”

Task: Determine the tf-idf scores for each term in each document.

Step 1: Filter out the stopwords. After removing the stopwords, we have

d1 : “sky blue”

d2 : “sun bright today”

d3 : “sun sky bright”

d4 : “can see shining sun bright sun”

Step 2: Compute TF. Build the document-word count matrix, then normalize each row so it sums to 1.

Step 3: Compute IDF. Count the number of documents in which each word occurs, df(t), then apply IDF(t) = log(N / df(t)) with N = 4.

Step 4: Compute TF-IDF. Multiply the TF and IDF scores.
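
The four steps can be run by hand on the filtered documents. This sketch assumes the same formulas as Example 1 (TF as count divided by document length, IDF as the natural log of N/df); other TF-IDF variants, including the smoothed one scikit-learn uses, give different numbers.

```python
import math
from collections import Counter

# Step 1 result: documents after stopword removal
docs = {
    "d1": "sky blue",
    "d2": "sun bright today",
    "d3": "sun sky bright",
    "d4": "can see shining sun bright sun",
}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Step 2: TF = term count / document length (each row sums to 1)
tf = {name: {t: c / len(words) for t, c in Counter(words).items()}
      for name, words in tokenized.items()}

# Step 3: IDF = ln(N / df), where df = number of documents containing the term
df = Counter(term for words in tokenized.values() for term in set(words))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# Step 4: TF-IDF = TF * IDF (rounded for display)
tfidf = {name: {t: round(v * idf[t], 3) for t, v in scores.items()}
         for name, scores in tf.items()}

for name, scores in tfidf.items():
    print(name, scores)
```

Note how “sun” scores low in d4 even though it appears twice there: it occurs in three of the four documents, so its IDF is small.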

Code Example: TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love the movie",
    "The movie was boring",
    "I love popcorn"
]

# Create TF-IDF
tfidf = TfidfVectorizer()
output = tfidf.fit_transform(corpus)

# Let's see the score for "movie" vs "popcorn"
feature_names = tfidf.get_feature_names_out()
print(feature_names)
print(output.toarray())

# You will notice 'popcorn' has a higher score in sentence 3
# than 'movie' does in sentence 1, because 'movie' appears in multiple sentences (less unique).

Word2vec | CBOW and Skip-gram

Welcome to Step 4: Word Embeddings (Word2Vec)

Please Note: Word2Vec is one form of embedding. There are numerous approaches, and the current standard approach is Transformer-based. Please go through my blog https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df if you want to learn in detail.

In the previous section (TF-IDF/Bag of Words), we treated words as just counts. The computer knew that “Apple” appeared 5 times, but it didn’t know that “Apple” is a fruit, or that it’s similar to “Orange”.

Word2Vec changes everything. It turns words into Dense Vectors (lists of numbers) where similar words are placed close together in space.

Module 1: The Core Concept (Semantic Meaning)

Simple Explanation: Imagine a giant 2D map.

We want to put the word “King” at coordinate [5, 5].
We want to put “Queen” nearby at [5, 6].
We want to put “Apple” far away at [90, 10].

Word2Vec figures out these coordinates automatically by reading millions of sentences. The logic is simple: “You shall know a word by the company it keeps.” If “Apple” and “Orange” both appear next to “juice” often, they must be related.

The Magic Calculation: Because words are now numbers, we can do math with them! The most famous example is:

King - Man + Woman ≈ Queen
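
The arithmetic can be tried on toy data. The 2-D vectors below are invented so that one axis loosely means “royalty” and the other “gender”; real Word2Vec vectors are learned from text and have many more dimensions, and the `analogy` helper is illustrative.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, as analogy lookups usually do
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))
```

The result is the nearest word to the computed point, which is why the relation is written with ≈ rather than =.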

Module 2: How it Learns (The Sliding Window)

Simple Explanation:

To teach the computer, we play a “Fill in the Blank” game. We slide a “Window” over a sentence to create training examples.

Sentence: “The quick brown fox jumps.”

Window Size: 2 (Look 2 words back and 2 words forward).

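We create pairs of input/output words. For the input “brown”, the window covers “The”, “quick”, “fox”, and “jumps”, which gives pairs such as:

Input: “brown” → Target: “quick”
Input: “brown” → Target: “fox”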

This creates a “Dummy Problem” for a Neural Network to solve. We don’t actually care about the prediction; we care about the Weights of the neural network — these weights become our Vectors!
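
The sliding window itself takes only a few lines. This sketch generates the (input, target) pairs for the example sentence with a plain whitespace tokenizer; `training_pairs` is an illustrative helper, not a gensim function.

```python
def training_pairs(tokens, window=2):
    """Return (input_word, target_word) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = training_pairs(tokens, window=2)

# All pairs where "brown" is the input word
print([p for p in pairs if p[0] == "brown"])
print(len(pairs), "training pairs in total")
```

Words near the edges of the sentence get fewer pairs because part of their window falls outside the text.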

Module 3: Two Architectures (CBOW vs. Skip-gram)

There are two ways to train this model.

1. CBOW (Continuous Bag of Words):

Task: I give you the context (surrounding words), you guess the middle word.
Example: “The quick ____ fox.” → You guess “brown”.
Best for: Faster training; works well for frequent words.

2. Skip-gram:

Task: I give you the middle word, you guess the context.
Example: “____ brown ____” → You guess “quick” and “fox”.

Best for: Smaller datasets and rare words, at the cost of slower training.
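
The difference between the two architectures is only in how each training example is framed. A minimal sketch, assuming a window of one word on each side; the function names are illustrative.

```python
def cbow_examples(tokens, window=1):
    """CBOW: the surrounding words together predict the middle word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=1):
    """Skip-gram: the middle word predicts each surrounding word separately."""
    return [(target, ctx)
            for context, target in cbow_examples(tokens, window)
            for ctx in context]

tokens = "the quick brown fox".split()

print(cbow_examples(tokens)[2])
print([ex for ex in skipgram_examples(tokens) if ex[0] == "brown"])
```

CBOW produces one example per word, while Skip-gram produces one per (word, neighbor) pair, which is part of why Skip-gram trains more slowly.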

Module 4: Coding Word2Vec (Game of Thrones Edition)

Simple Explanation: A fun way to try Word2Vec is to train a model on the Game of Thrones books. Since we can’t process the whole series here, I will simulate it with a small dataset so you can see the code structure. We use the gensim library.

pip install gensim nltk

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the tokenizer data once (newer NLTK versions look for 'punkt_tab')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# 1. Prepare Data (Simulating the GoT text)
got_text = """
Jon Snow is a member of the Night's Watch.
Daenerys Targaryen consists of fire and blood.
Tyrion Lannister is a dwarf and a clever man.
Arya Stark has a sword named Needle.
The King in the North is Jon Snow.
"""

# 2. Preprocessing (Tokenization)
# We need a list of lists: [['jon', 'snow', ...], ['arya', 'stark', ...]]
sentences = []
for sent in sent_tokenize(got_text):
    words = [w.lower() for w in word_tokenize(sent)]
    sentences.append(words)

# 3. Train the Model
# min_count=1 means "keep words that appear at least once" (usually set to 5 for big data)
# vector_size=100 means "create a list of 100 numbers for each word"
# window=5 means "look 5 words left and right"
model = Word2Vec(sentences, min_count=1, vector_size=100, window=5)

# 4. Use the Model
# Find the vector for "jon"
vector_jon = model.wv['jon']
print(f"Vector for Jon (first 5 numbers): {vector_jon[:5]}")

# Find similarity (cosine similarity between the two word vectors)
similarity = model.wv.similarity('jon', 'stark')
print(f"Similarity between Jon and Stark: {similarity}")

# Find most similar words (Won't be great on this tiny text, but works on big data)
print("Most similar to Daenerys:", model.wv.most_similar('daenerys'))

Module 5: Visualization (PCA)

Simple Explanation: Our vectors have 100 dimensions (a list of 100 numbers). Humans can only see 2D or 3D. To visualize them, we use a technique called PCA (Principal Component Analysis). It squashes the 100 dimensions down to 2, keeping the most important information, so we can plot them on a scatter chart.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Get all word vectors from the model
X = model.wv[model.wv.index_to_key]

# Compress to 2D
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# Plot each word at its 2D position
plt.scatter(result[:, 0], result[:, 1])
words = list(model.wv.index_to_key)
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

Text Classification | Average Word2Vec

Welcome to Text Classification, one of the most useful skills you will learn in Machine Learning.

If you have ever wondered how Gmail knows an email is “Spam” or how a support ticket system automatically sends billing questions to the “Finance” team, the answer is Text Classification.

I have broken this lecture down into clear modules with diagrams and code.

Module 1: What is Text Classification?

Simple Explanation: Imagine you are a librarian with a huge pile of unorganized books. Your job is to read the title of each book and throw it into the correct bin.

Binary Classification: You have only two bins (e.g., “Spam” vs. “Not Spam”).
Multi-Class Classification: You have many bins (e.g., “Sports”, “Politics”, “Tech”).
Multi-Label Classification: A book can go into multiple bins at once (e.g., A movie can be both “Action” and “Comedy”).

The Goal: To build a machine (Model) that can look at the text and predict the label automatically.

Module 2: The Classification Pipeline

Simple Explanation: You don’t just “throw data at an algorithm.” You must follow a pipeline.

Preprocessing: Clean the text (Lowercasing, remove HTML).
Feature Engineering: Convert text to numbers (Bag of Words, TF-IDF, or Word Vectors).
Modeling: Train an algorithm (Naive Bayes, Random Forest, etc.) to recognize patterns.
Prediction: Give it new text and get a label.
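
As a small illustration of the Preprocessing step, here is a sketch that removes HTML and lowercases the text using only the standard library. The function name `preprocess` and the regular expressions are assumptions for this example; real projects often use a proper HTML parser instead.

```python
import html
import re

def preprocess(text):
    """Clean one raw document before feature engineering."""
    text = html.unescape(text)            # turn "&amp;" back into "&"
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = text.lower()                   # lowercase everything
    text = re.sub(r"\s+", " ", text)      # collapse repeated whitespace
    return text.strip()

print(preprocess("<p>WIN a <b>FREE</b> iPhone &amp; more!</p>"))
# win a free iphone & more!
```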

Module 3: Code Example (Building a Spam Filter)

Let’s build a real working Spam Classifier using the classic Naive Bayes algorithm (which is great for text). We will use a Pipeline to keep our code clean.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. The Dataset (Simulating emails)
data = {
    'text': [
        "Win a free iPhone now! Click here.",
        "Hey, are we still meeting for lunch?",
        "URGENT: Your bank account is locked.",
        "Project deadline is tomorrow. Please review.",
        "Free cash prize winner!!! claim now"
    ],
    'label': ['Spam', 'Ham', 'Spam', 'Ham', 'Spam']  # 'Ham' means Not Spam
}
df = pd.DataFrame(data)

# 2. Split Data (Training vs Testing)
# With 5 emails and test_size=0.2, only 1 email is held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)

# 3. Build the Pipeline
# Step A: Convert text to numbers (Bag of Words)
# Step B: Apply the Classifier (Naive Bayes)
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# 4. Train the Model
pipeline.fit(X_train, y_train)

# Accuracy on the held-out email (not meaningful with a single test example)
print(f"Test accuracy: {accuracy_score(y_test, pipeline.predict(X_test))}")

# 5. Predict on New Data
new_emails = ["Meeting at 5pm?", "Free money click link!"]
predictions = pipeline.predict(new_emails)

print(f"Predictions: {predictions}")
# Intended output: Predictions: ['Ham' 'Spam']
# With only 4 training emails the actual result may differ; a real dataset is needed for reliable labels.

Module 4: Advanced Technique (Averaging Word Vectors)

Simple Explanation:

A simple technique for converting a whole sentence into a single vector is to combine the Word2Vec vectors of its words.

Since Word2Vec gives us a vector for each word, how do we get one vector for the sentence?

We take the Average.

Analogy: If you mix a drop of Red paint (“Apple”) and a drop of Yellow paint (“Banana”), you get Orange (The average color).
Math:

import numpy as np

# Imagine these are Word2Vec vectors (simplified to 2D for this example)
word_vectors = {
    "king": np.array([5, 5]),
    "rule": np.array([3, 3])
}

sentence = "king rule"
tokens = sentence.split()

# Calculate Average
vectors = [word_vectors[word] for word in tokens]
sentence_vector = np.mean(vectors, axis=0)

print(f"Sentence Vector: {sentence_vector}")
# Calculation: ([5,5] + [3,3]) / 2  = [4, 4]
# Output: Sentence Vector: [4. 4.]
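
In practice, a sentence may contain words the model has never seen. The sketch below shows one common way to handle that: skip unknown words and fall back to a zero vector if nothing is known. The helper `average_vector` is hypothetical, and it uses plain Python lists so it runs without extra dependencies.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]

word_vectors = {"king": [5.0, 5.0], "rule": [3.0, 3.0]}

print(average_vector("king rule wisely".split(), word_vectors, dim=2))  # [4.0, 4.0]
print(average_vector("unknown words".split(), word_vectors, dim=2))     # [0.0, 0.0]
```

Averaging ignores word order, so “dog bites man” and “man bites dog” get the same sentence vector. That is the main limitation of this technique.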

Part of Speech (POS) Tagging | Hidden Markov Models | Viterbi Algorithm in NLP

We are now entering the world of grammar and structure with Part of Speech (POS) Tagging.

If you have ever wondered how a computer knows that “Book a flight” uses “Book” as a Verb, but “Read a book” uses “Book” as a Noun, the answer is POS Tagging.

1) What is POS tagging?

POS (Part of Speech) tagging = assigning a grammatical label to each word/token in a sentence.

Examples of POS tags:

NOUN: dog, city, movie
VERB: run, eat, is
ADJ (adjective): beautiful, big
ADV (adverb): quickly, very
PRON (pronoun): I, you, he
DET (determiner): the, a, an
ADP (preposition): in, on, to

Simple example

Sentence:

“The cat sleeps.”

POS tags:

The → DET
cat → NOUN
sleeps → VERB

Why we do this: once you know the role of each word, it’s easier for a machine to understand the structure of the sentence.

Why POS tagging is important (applications)

POS tagging is a “support skill” that boosts many NLP systems:

A) Information Retrieval (Search)

If you search: “best camera for travel”, POS tags can help identify:

“camera” = main noun
“best” = adjective modifying it
So search can weight the important words better.

B) Question Answering systems

Question: “Who invented the telephone?”
POS helps find:

“Who” (question pronoun)
“invented” (verb)
“telephone” (noun)

C) Disambiguation (same word, different meaning)

Example:

“I will book a cab.” → book = VERB
“I read a book.” → book = NOUN
Sentence A: “I will park the car.” (“Park” is an Action/Verb).
Sentence B: “I walked in the park.” (“Park” is a Place/Noun).

Without POS tags, the computer thinks “park” means the same thing in both.

D) Chatbots / intent understanding

A chatbot often needs to identify:

actions (verbs)
entities (nouns)
modifiers (adjectives/adverbs)

Module 2: Doing it the Easy Way (SpaCy)

The spaCy library is the modern, fast way to do tagging without writing complex algorithms from scratch.

import spacy

# Load the small English model
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "I will google the answer."

# Process the text
doc = nlp(text)

# Print the token and its POS tag
for token in doc:
    print(f"{token.text} --> {token.pos_} ({token.tag_})")

# Output:
# I --> PRON (PRP)
# will --> AUX (MD)
# google --> VERB (VB)  <-- Look! It knew 'google' was a verb here!
# the --> DET (DT)
# answer --> NOUN (NN)
# . --> PUNCT (.)

Module 3: The Algorithm Behind It (Hidden Markov Models)

Simple Explanation:

How does the computer figure this out? It uses probability. Specifically, a Hidden Markov Model (HMM).

The HMM looks at two types of probabilities:

1. Transition Probability (Tag → Tag):

How likely is a Noun to follow a Determiner? (e.g., “The cat” → Very likely).
How likely is a Verb to follow a Determiner? (e.g., “The run” → Very unlikely).

2. Emission Probability (Tag → Word):

If the tag is Verb, how likely is the word “run”? (High).
If the tag is Noun, how likely is the word “run”? (Low, but possible, like “A long run”).

The Math Logic:

The model scores a tagged sentence by multiplying the transition and emission probabilities along it. For “Dogs run” tagged as Noun, Verb:

P(Sequence) = P(Start → Noun) × P(Noun → “dogs”) × P(Noun → Verb) × P(Verb → “run”)
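
The product can be sketched in code by pairing each tag transition with the emission of the word at that position. All probabilities below are made-up numbers chosen only for illustration, and `sequence_probability` is a hypothetical helper.

```python
# Hypothetical probabilities, not estimated from any real corpus.
transition = {("START", "NOUN"): 0.5, ("NOUN", "VERB"): 0.6}
emission = {("NOUN", "dogs"): 0.1, ("VERB", "run"): 0.2}

def sequence_probability(words, tags):
    """Multiply transition and emission probabilities along the sentence."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        # Unseen transitions or emissions count as probability 0 here.
        prob *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return prob

p = sequence_probability(["dogs", "run"], ["NOUN", "VERB"])
print(round(p, 4))  # 0.006  (0.5 * 0.1 * 0.6 * 0.2)
```

Real taggers usually add up log-probabilities instead of multiplying, because products of many small numbers underflow quickly.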

Module 4: The “Viterbi” Algorithm (Optimization)

The Problem: If you have a long sentence, checking every single possible combination of Nouns and Verbs would take forever (Exponential complexity).

“I saw the man with the telescope.”
Is “saw” a noun (tool) or verb (action)? Is “man” a verb (to man a station) or noun?

The Solution (Viterbi): Instead of scoring every complete path at the end, the Viterbi algorithm works word by word and throws away bad partial paths immediately. At each word it keeps only the best path that ends in each possible tag, so the work grows linearly with sentence length instead of exponentially.

Simple Analogy: Imagine you are driving from New York to LA. You don’t compare every possible route across the USA. For each city you could pass through, you only remember the best route that reaches it, and you build the next leg of the trip on top of those.
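
Here is a minimal sketch of the Viterbi idea on a toy three-tag model. The tag set and every probability are assumptions invented for this example; a real tagger would estimate them from a tagged corpus and work with log-probabilities.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            # Keep only the best way of arriving at this tag.
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(word, 0.0), prev)
                for prev in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET":  {"the": 0.9},
    "NOUN": {"dog": 0.5, "runs": 0.1},
    "VERB": {"dog": 0.05, "runs": 0.5},
}

print(viterbi(["the", "dog", "runs"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

For a sentence of n words and T tags this does about n × T × T multiplications, instead of scoring all T^n possible tag sequences.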

Code Concept: Calculating Transition Probability. Here is a simplified Python snippet to show how we calculate “How often does a Verb follow a Pronoun?”.

# A dummy dataset of tagged sentences
# (Word, Tag)
corpus = [
    [("I", "PRON"), ("love", "VERB"), ("code", "NOUN")],
    [("He", "PRON"), ("runs", "VERB"), ("fast", "ADV")]
]

# Calculate Transition: P(Tag B | Tag A)
def calculate_transition(tag_a, tag_b, data):
    count_a = 0
    count_a_followed_by_b = 0

    for sentence in data:
        for i in range(len(sentence) - 1):
            current_tag = sentence[i][1]
            next_tag = sentence[i + 1][1]

            if current_tag == tag_a:
                count_a += 1
                if next_tag == tag_b:
                    count_a_followed_by_b += 1

    # Avoid dividing by zero when tag_a never appears before another tag
    if count_a == 0:
        return 0.0
    return count_a_followed_by_b / count_a

# Probability that a VERB follows a PRON (pronoun)
prob = calculate_transition("PRON", "VERB", corpus)
print(f"Probability(VERB | PRON): {prob}")
# Output: 1.0 (100% in this tiny dataset)
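
The emission side can be counted the same way. This is a small sketch on the same toy corpus; the helper `emission_probability` is hypothetical and applies no smoothing, so unseen words get probability 0.

```python
from collections import Counter

corpus = [
    [("I", "PRON"), ("love", "VERB"), ("code", "NOUN")],
    [("He", "PRON"), ("runs", "VERB"), ("fast", "ADV")],
]

# How often each tag occurs, and how often each (tag, word) pair occurs
tag_counts = Counter(tag for sentence in corpus for _, tag in sentence)
pair_counts = Counter((tag, word.lower()) for sentence in corpus for word, tag in sentence)

def emission_probability(word, tag):
    """Estimate P(word | tag) as count(tag, word) / count(tag)."""
    if tag_counts[tag] == 0:
        return 0.0
    return pair_counts[(tag, word.lower())] / tag_counts[tag]

print(emission_probability("runs", "VERB"))  # 0.5
print(emission_probability("runs", "NOUN"))  # 0.0
```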

I’ve tried to keep the explanation detailed while staying concise.
If you’d like to explore any of the topics in more depth, don’t hesitate to reach out — I’ll be glad to assist.

Foundation Models and Generative AI

Soumyadeep Saha — Tue, 23 Dec 2025 04:24:13 GMT

In the last few years, AI stopped being a collection of narrow tools and started feeling like a general-purpose helper — something that can write, summarize, explain, plan, generate images, and even assist in research. This shift isn’t magic. It comes from foundation models: huge models trained on “oceans” of data (text, images, code, and more) so they learn broad, reusable skills. When these models are used to create new content — sentences, pictures, programs, designs — we call it generative AI.

This article breaks down what’s really happening in simple terms: why self-supervised learning was the breakthrough, how modern models learn meaning from context and relationships, and why one strong base model can be adapted to dozens of tasks instead of building one model per task. We’ll also connect the technology to the real world: how businesses use these systems for “unified intelligence,” how biology and medicine benefit from learning patterns in massive datasets, and why ethics, safety, bias, and regulation matter when a single model can influence decisions at scale.

Introduction

Big picture:

· Foundation Models (like GPT, Claude, etc.) are huge AI systems trained on oceans of text, images, code, audio, and more. They learn general skills (language, vision, reasoning) that you can adapt to many jobs, instead of training a new small model for each job.

· Generative AI makes new things (text, images, code, molecules) by learning the patterns and relationships in data.

· The course explores how modern AI learns (especially self‑supervised learning), where it’s used (science, business), and what it means for how we design systems in a messy, chaotic world.

1) Why the recent AI shift matters

Breakthroughs: Systems like ChatGPT showed that one general model can write, reason, summarize, plan, and even help with robotics or genomics tasks once adapted a bit.
Economics: Investment has surged because these models can be reused everywhere (customer support, coding, research, creative work).
Autonomous agents: On top of foundation models, “agents” try to plan multi-step tasks — like a smart assistant that can break a goal into steps and act.
AGI discussion: The course touches on artificial general intelligence — systems that can do most cognitive tasks a human can. The “when” is debated, but understanding the pathways (especially learning from raw data) is essential.

2) How do machines “learn” like people do?

Humans aren’t born with libraries in our heads. We learn patterns from experience and context.

Supervised learning: Learn from examples with correct answers (images labelled “dog”, “cat”).
Reinforcement learning (RL): Learn by acting, getting rewards/punishments (like learning to ride a bike).
Generative / self‑supervised learning: Learn by predicting the missing parts of raw data itself (next word in a sentence, hidden patch in a picture).

The philosophical angle: context matters. Meaning comes from relationships (how things connect) more than from isolated labels.

Three Ways Machines Learn

3) Meaning comes from relationships

When a child learns “dog,” they don’t just memorize the word + a picture. They notice relations:

Dogs fetch balls, bark, live with people.
Cats chase mice, sleep on sofas.

These networks of relations make words meaningful. Generative models work similarly: they see millions of “cat–mouse–cheese” type patterns and internalize the structure, so they can write or draw sensible new combinations.

“Context builds meaning” (diagram: a small relational map)

4) Two ways to run organizations

Top‑down: Leaders decide → analysts set metrics → processes roll out → frontline executes.

Strength: clarity and consistency.

Risk: misses on‑the‑ground nuance.

Bottom‑up: Frontline learns from customers → teams experiment → patterns bubble up → leaders align and scale.

Strength: grounded in real customer reality.

Risk: can get noisy without coordination.

The lecture connects this to philosophy (Plato’s allegory of the cave) and science: our world is partly orderly (good for top‑down math) and partly chaotic (needs intuition and adaptation). Great orgs blend both perspectives.

Top‑down vs Bottom‑up (diagram)

5) The coding reality gap & classic ML limits

Computers like precise math and explicit goals. But the world is messy. That creates friction:

Supervised learning

Pro: precise, works well with clear labels.
Con: labels are expensive and sometimes unclear (“What counts as ‘kind customer service’?”).

Reinforcement learning

Pro: learns sequences of actions to reach goals.
Con: feedback is delayed (you drive for 2 hours, then find out you took the wrong turn); unsafe to “trial‑and‑error” in real life.

Blank slate problem

Starting from nothing makes exploration slow and risky. We need priors or representations that already make sense of the world.

6) Why self‑supervised learning (SSL) was a breakthrough

Self‑supervision learns from raw, unlabeled data by setting make‑believe tasks:

Text: predict the next word or the masked word.
Images: predict missing patches or align multiple views of the same scene.
Audio/Video: predict the next frame/sound.

This makes models learn general representations (a sense of “how the world is organized”), which you can later specialize for many tasks. It’s safer than RL (no risky real‑world exploration) and cheaper than supervised learning (no labels needed).

Self‑supervised learning from image crops

7) How SSL detects relationships

Images: Take two crops of the same photo → train the model to put their internal vectors close together, and different images far apart. Over time, it clusters related concepts (e.g., “wheels”, “fur”, “sky”).
Genomics: Predict the next DNA base from the previous ones; the model internalizes motifs and can help find genes or regulatory elements.
Retail: Look at what people view/click/buy; SSL can learn “this and that often go together,” improving recommendations without hand‑built user profiles.

8) Using it in business

Know your customers & products deeply: Use self‑supervised patterns from behavioural data (clickstreams, sequences) to improve search, ranking, recommendations, demand forecasts, and even product design.
Learn from observation first: It’s cheaper and safer to learn from historical data before trying interactive learning in the wild.
Blend order & chaos: Keep the top‑down strategy (safety, compliance, KPIs) but let bottom‑up signals (frontline/customer data) shape decisions.

Concrete examples to make it stick

Self‑supervised text

Task: hide the word “mouse” in “The cat chased the ___.”
The model guesses “mouse” because it has seen that cat–chase–mouse pattern repeatedly.

Self‑supervised images

Task: crop two parts of a single dog photo.
The model learns both crops are “the same thing.” Later, it recognizes dogs even in new poses or lighting.

Retail bundles

Customers who buy running shoes also often look at socks and phone armbands.
SSL learns this bundle — no one had to label “these three go together.”

Genomics

DNA has recurring “motifs.” Predicting the next base forces the model to internalize these motifs, which helps spot genes.

How these pieces fit together

Foundation models get powerful because SSL lets them soak up the structure of the world from raw data.
Once they have that structure, a little supervised learning (fine‑tuning) or RL (to align behavior with goals) goes a long way.
In organizations, mirror the same idea: collect rich bottom‑up signals (customer interactions), then guide with top‑down objectives (safety, strategy).

Practical tips if you’re building with this

Start with self‑supervised pretraining on all the unlabelled data you can legally and ethically use.
Add task‑specific fine‑tuning with small labelled sets.
For sequential tasks (e.g., routing, pricing policies), use RL carefully in simulations or sandboxes first.
Measure both accuracy and robustness (does it still work when conditions shift?).
Keep human oversight for safety and fairness; raw data can encode biases.

How Does It Work?

What this covers (at a glance)

Foundation models & Generative AI learn general skills from unlabelled data using self‑supervised learning (SSL) — a big leap that makes AI scalable and versatile.
How language models learn meaning from context (masked vs causal next‑word prediction), and why text‑to‑text framing (like T5) simplifies everything.
Contrastive learning for text and images, diffusion for image generation, and classic autoencoders & GANs for compression and synthesis.
Why language is a universal interface for robots and autonomous agents (plan → act → check → improve), and how tool use (calculator, web) expands capability.

1) Self‑supervised learning: the breakthrough

Idea in one line: Make up a small puzzle from raw data (mask a word, crop an image, add noise), train the model to solve it, and the model is forced to learn useful general patterns.

No labels required: We don’t need humans to tag every example.
Scales beautifully: Tons of unlabelled text/images/audio exist.
Reusable knowledge: After pretraining, you can fine‑tune or prompt for many tasks (classification, search, QA, coding, robotics).

SSL overview → pretrain, then reuse

2) How models understand words (vectors + context)

Meanings are relational: Cat is close to kitten (species/age relation).
Context disambiguates: “bank” (money) vs “bank” (river) depends on nearby words.
Masked language modeling (MLM): Hide a word and force the model to predict it using both left and right context → strong understanding.

Context and vector space (cat–kitten, bank)

3) Pretraining → fine‑tuning, plus the text‑to‑text shift

Pretraining gives broad language sense.
Fine‑tuning nudges the model for a specific job (e.g., Amazon review sentiment).
Text‑to‑text framing (T5): Represent every task as input text → output text (e.g., “translate: …”, “summarize: …”).

Why it’s great: One uniform interface, less ad‑hoc engineering, easy to read and debug.

Pretrain → World sense → Fine‑tune; Text‑to‑text examples

4) Two training styles for language models

Masked LM (BERT‑like)

Training: “The cat sat on the [MASK].” Predict the mask using both sides.
Strength: Strong internal representations (great for understanding).
Limit: Not naturally a left‑to‑right generator.

Causal LM (GPT‑like)

Training: “The cat sat on the …” Predict the next word using only the left side.
Strength: Fluent, open‑ended generation.
Limit: No right context; may “guess” and sometimes hallucinate.
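
The difference between the two styles comes down to which positions each token is allowed to look at. A minimal sketch of the two visibility patterns, with hypothetical helper names:

```python
def causal_mask(n):
    """Row i marks the positions token i may attend to: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every position, left and right."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In a masked LM the attention pattern is fully open, and the puzzle comes from replacing some input tokens with [MASK]. In a causal LM the triangular pattern itself hides the future.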

Masked vs Causal (pros & cons)

5) Contrastive learning (images & sentences)

Core idea: Pull similar things together in the model’s space; push dissimilar things apart.

Images: Two crops of the same photo → positive pair (close embeddings). Unrelated images → negative (far apart).
→ Improves image representations and classification.
Text: Paraphrases or augmented sentences → positive pairs. Helps models encode meaning consistently.
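
One common way to turn “pull together, push apart” into a number is an InfoNCE-style loss. The source does not name a specific loss, so treat this as an illustrative assumption. The vectors and the temperature value are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Low when the anchor is closer to its positive than to any negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Cross-entropy with the positive as the correct "class" (stable log-sum-exp).
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]        # embedding of crop 1
same_photo = [0.9, 0.1]    # embedding of crop 2 of the same photo
other_photo = [0.0, 1.0]   # embedding of an unrelated image

good = info_nce(anchor, same_photo, [other_photo])
bad = info_nce(anchor, other_photo, [same_photo])
print(good < bad)  # True
```

Training lowers this loss by moving the embeddings, which is what gradually clusters related images or sentences.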

Contrastive learning for images & sentences

6) Diffusion models: generate by denoising

Forward process: Gradually add noise to an image until it’s nearly pure noise.
Reverse process: Train a model to remove a little noise at a time, walking back to a clean image.
Why it works: The model learns to reconstruct structure from noise → powerful, controllable image generation.
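
The forward (noising) process has a simple closed form in the DDPM-style formulation, which is assumed here: x_t = sqrt(a_t) · x_0 + sqrt(1 - a_t) · noise, where a_t shrinks from 1 toward 0 as t grows. The linear schedule and its constants below are commonly used defaults, not values taken from the source.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t noising steps (linear beta schedule)."""
    product = 1.0
    for step in range(t):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        product *= 1.0 - beta
    return product

def add_noise(x0, t, rng):
    """Jump straight to noise level t: mix the clean values with Gaussian noise."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
pixels = [0.2, 0.8, -0.5, 1.0]  # a tiny stand-in for an image
for t in (0, 100, 500, 1000):
    print(t, round(alpha_bar(t), 4), [round(v, 2) for v in add_noise(pixels, t, rng)])
```

At t = 0 the image is untouched, and by the last step almost no signal is left. The denoising model is trained to undo a small part of this mixing at each step.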

Diffusion intuition (noise → denoise)

7) Autoencoders & GANs

Autoencoder:

Encoder → bottleneck → decoder.
Learns compact representations; good for compression, denoising, feature learning.

GAN (Generative Adversarial Network):

Generator (artist) makes fakes; Discriminator (critic) tries to spot fakes.
Training is a competition → increasingly realistic images.

Autoencoders & GANs in one view

8) Language as a universal interface (robots & agents)

Language standardizes knowledge: Easy to write, read, and share instructions (“Pick up the red mug…”).

Robotics: The LM turns instructions into plans (steps), checks constraints, and sequences actions more clearly than low‑level numbers alone.
Agents with tools: The model plans, uses tools (calculator, browser, database), self‑checks, retries, and learns from memory/logs.
Why tools matter: Offload heavy math or retrieval to reliable tools → better accuracy and less user burden.

Language → plan → execute, with tools & self‑check loop

ChatGPT & LLMs

1) What’s special about ChatGPT & foundation models?

Self‑supervised pretraining: The model learns general language patterns by predicting the next word on huge amounts of text. No manual labels are needed.
Transformers: The architecture that makes training fast and effective by processing tokens in parallel with self‑attention, unlike older sequential RNNs.
Scaling: More data, parameters, and compute typically lower error — up to practical limits — making models more capable.
Engineering details matter: Beyond big ideas, stability tricks, data curation, and training pipelines drive real‑world quality and robustness.
Beyond text: The lecture mentions stable diffusion (images) and other emerging models — showing these foundations generalize across modalities.

Next‑token prediction (the core pretraining loop)

2) How next‑word prediction actually trains a model

Feed the prompt tokens (e.g., “The cat sat on the”).
The model outputs a probability for every word in its vocabulary.
Compare with the true next word (“mat”) → compute loss.
Update the model so it assigns higher probability to the correct word next time.
Repeat billions of times with diverse text → the model internalizes grammar, facts, and patterns.
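
The “compare and compute loss” step is usually a cross-entropy loss over the vocabulary. A minimal sketch with a three-word vocabulary and made-up scores (logits):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, vocab, true_word):
    """Cross-entropy: small when the true next word gets high probability."""
    probs = softmax(logits)
    return -math.log(probs[vocab.index(true_word)])

vocab = ["mat", "dog", "moon"]
before = next_token_loss([0.1, 0.2, 0.0], vocab, "mat")  # model is unsure
after = next_token_loss([2.0, 0.2, 0.0], vocab, "mat")   # model favours "mat"
print(after < before)  # True
```

Each training update nudges the model's parameters so that the score of the correct word rises and this loss falls.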

This single training objective is surprisingly powerful — because language encodes knowledge about the world.

3) Why Transformers changed the game

Self‑attention lets each word look at all other words at once to pull relevant context; the model runs many tokens at once (matrix math), not one‑by‑one.
Direct long‑distance connections: The word at the start can directly attend to something at the end; RNNs struggle with long memories.
Positional encodings provide order information so the model knows which word came first, even though it processes them in parallel.
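
A stripped-down sketch of self-attention follows. As a simplifying assumption it uses the token vectors directly as queries, keys, and values; a real Transformer first applies learned projection matrices and runs several heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Each token becomes a weighted mix of all tokens, weighted by similarity."""
    d = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Scaled dot product between this token and every token (including itself)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, x)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up 2-D token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Every row of weights is computed independently of the others, which is why the whole thing can run as one matrix multiplication on parallel hardware.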

Transformer block (attention → MLP)

Multi‑head attention (different heads learn different relations)

Positional encodings (how order is represented)

Illustration — Toy sinusoidal signals for positions
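
One widely used scheme is the sinusoidal encoding, where each position gets a vector of sines and cosines at different frequencies. The sketch below assumes that scheme; many models use learned or rotary position embeddings instead.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions shares a frequency; higher pairs oscillate more slowly.
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])
```

This vector is added to the token's embedding, so two copies of the same word at different positions look different to the model.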

4) Scaling laws: why “more” often helps

As you increase data, model size, and compute, loss tends to drop smoothly — until you hit practical limits (data quality, overfitting, etc.). That’s why foundation models keep getting stronger when scaled correctly.

Scaling (schematic)

5) From raw models to helpful dialogue: SFT and RLHF

SFT (Supervised Fine‑Tuning): Teach the model the style of helpful answers using curated instruction‑response pairs.
Preference data: Humans compare two model replies and pick the better one — capturing quality beyond token‑by‑token accuracy.
Reward model: A model trained to predict human preferences gives a score to a candidate reply.
RLHF (Reinforcement Learning from Human Feedback): Optimize the policy (the chat model) to maximize the reward model’s score (often via PPO). This improves helpfulness, harmlessness, and robustness over long responses, even though feedback is delayed until the end.
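
A reward model is commonly trained on preference pairs with a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)). The source does not spell out the loss, so this is an assumption; the reward values below are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Small when the reply humans preferred gets the higher reward score."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))

agrees = preference_loss(2.0, 0.0)     # reward model agrees with the human choice
disagrees = preference_loss(0.0, 2.0)  # reward model prefers the rejected reply
print(agrees < disagrees)  # True
```

Once trained, the reward model scores new replies, and the RL step (for example PPO) adjusts the chat model toward replies that score higher.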

RLHF loop (prompt → candidates → preference → reward → PPO update)

6) Reinforcement learning challenges in dialogue

Delayed feedback: You don’t know if the final answer is good until the end — hard for credit assignment.
Exploration vs exploitation:
Exploit: stick to what works now.
Explore: try new phrasing/structures that may be better.
Best: targeted exploration — sample promising but uncertain options.
Robustness: RL must not overfit to the reward model or game the metric. We add safeguards (consistency checks, tool‑verified steps, rule constraints).

7) Feedback that grows with the model

As the model improves, we also raise the bar for feedback:

Token‑level losses (pretraining).
SFT demonstration quality.
Pairwise preferences (ranking).
Rule‑based checks (format, safety, citations).
Tool‑verified answers (calculators, retrieval) and self‑check steps.

Feedback curriculum (simple → sophisticated)

8) Practical takeaways for builders

Engineering intuition matters: Stability tricks, data filtering, careful validation — these “details” create the leap from demo to dependable.
Attention & causality: ChatGPT uses causal (left‑to‑right) attention for generation.
Guardrails: Balance factuality, bias reduction, and sensitivity. Use tool‑use (calculators/browsers), structured prompts, citations, and post‑processing checks.

Quick glossary

Self‑supervised learning: Learn from raw data by solving puzzles like “predict the next word.”
Transformer: Architecture that uses self‑attention to combine context efficiently in parallel.
SFT: Supervised fine‑tuning on instruction data to teach helpful outputs.
RLHF: Use human preference judgments to train a reward model and optimize the chat policy.

Data and Stable Diffusion

1) Why data is the power source for modern AI

Data > tech (over time): Better, larger, cleaner, and properly licensed data usually beats fancy tricks. Models are only as good as what they learn from.
Access matters: Whoever can legally access and refresh high‑quality datasets can retrain and keep improving (e.g., new styles, trends, vocabulary).
Ethics & copyright: Datasets must respect creators’ rights. The legal landscape affects what data can be used — and therefore what models can learn.
We are data creators: Our texts, images, and interactions become the “lessons” models learn from.

Data → Pretraining → Foundation Model → Apps

2) Stable Diffusion in plain English

What it does: Turns a text prompt (e.g., “a watercolor fox in a forest”) into an image by starting from random noise and gradually removing that noise in a series of small steps.

Key pieces:

Text encoder: Converts your prompt into a vector (a numerical summary of meaning).
Latent space: Images are compressed into a smaller grid (a “latent”) so generation is much faster and cheaper.
Denoiser (U‑Net): Learns to remove a bit of noise at each step. After many steps, the latent looks like a clean picture representation.
Decoder (VAE): Transforms the final latent back into a full‑resolution image.

Stable Diffusion pipeline (overview)

3) Why randomness is needed

Without randomness, models would keep producing the same output.
A seed controls the initial noise. Change the seed → different starting point → different image. Keep the seed → reproducible image.
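
The role of the seed can be shown with Python's standard random module. The helper `initial_noise` is a stand-in for the starting latent; real pipelines draw a much larger tensor from their own random generator.

```python
import random

def initial_noise(seed, size=4):
    """Draw the starting noise from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

print(initial_noise(42) == initial_noise(42))  # True: same seed, same starting point
print(initial_noise(42) == initial_noise(43))  # False: new seed, new starting point
```

Since every denoising step builds on the previous one, a different starting point leads to a different final image even with the same prompt.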

Seeds change the result

4) The “iterative improvement” idea

Think of an artist sketching: rough → refine → detail.
The model does the same: many small denoising steps (from heavy noise at the start to almost none at the end) until the picture emerges.
Text guidance nudges each step toward what the prompt asked for.

From pure noise to image over steps

5) Why it’s cost‑efficient: work in latent space

Training and generating on compressed latents (smaller grids) is much faster than working on full‑resolution pixels.
The encoder compresses; the decoder reconstructs. Good encoders/decoders preserve important details while dropping redundancy.

Latent compression with VAE

6) How models know an image is “good”: losses

To train an image model, we need a way to measure how good the output is relative to a target. Common ingredients:

Pixel / reconstruction loss (MSE/MAE): Simple and stable, but can look slightly blurry.
Perceptual loss: Compare features from a vision net; pushes toward images humans find sharp and natural.
Adversarial (GAN/patch critic): A small critic network checks local patches for realism; great for texture, but training can be tricky.

Often we combine these to get both sharp details and global correctness.

Losses: pixel vs perceptual vs adversarial

7) Local detail + global structure

Patch critic rewards locally realistic textures (fur, bark, fabric).
Global similarity keeps the overall composition (shapes, layout) coherent.
Balancing both makes images look realistic up close and as a whole.

Patch critic & global similarity together

8) How text aligns with images (contrastive learning)

Models learn that the image of a fox in a forest should be close (in embedding space) to the caption “a red fox in a forest,” and far from unrelated captions.
This alignment helps prompts steer image generation in the intended direction.
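Here is a minimal sketch of the contrastive idea with made-up 2-D embeddings. The loss shown is a one-directional (image to caption) softmax loss; CLIP-style training is usually described as symmetric, so treat this as a simplification. All vectors and the temperature value are assumptions.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Row i of each matrix is a matching (image, caption) pair; values are made up.
image_emb = normalize(np.array([[1.0, 0.1], [0.0, 1.0]]))
text_emb = normalize(np.array([[0.9, 0.2], [0.1, 1.0]]))

similarity = image_emb @ text_emb.T   # rows: images, columns: captions

def contrastive_loss(sim, temperature=0.1):
    """Cross-entropy that rewards high similarity on the diagonal (true pairs)."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

loss_matched = contrastive_loss(similarity)
print(similarity.round(3), loss_matched)
```

Training lowers this loss by pulling matching image and caption vectors together and pushing mismatched ones apart.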

9) Training with different noise levels

During training, the model sees the same image at many noise levels and learns to denoise appropriately.
This “curriculum of noise” provides directional feedback: at each step it learns how to move a little closer to the real image.
Over many iterations, it can navigate from noisy inputs to realistic outputs.
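The sketch below shows one common way to produce "the same image at many noise levels". It assumes the DDPM-style mixing rule, where `alpha_bar` near 1 means almost clean and near 0 means almost pure noise; the article itself does not specify the formula, so take this as one standard choice.

```python
import numpy as np

def add_noise(x0, noise, alpha_bar):
    """Mix a clean sample with Gaussian noise at the level given by alpha_bar."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
x0 = np.ones(1000)                    # stand-in for a clean latent
noise = rng.standard_normal(1000)     # the noise the model is trained to predict

for alpha_bar in (0.99, 0.5, 0.01):
    noisy = add_noise(x0, noise, alpha_bar)
    print(alpha_bar, round(float(noisy.mean()), 2), round(float(noisy.std()), 2))
```

During training, the model receives the noisy sample and the noise level, and is scored on how well it recovers `noise`.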

Stable Diffusion: Components & Trade‑offs (interactive table)

Data is king: high‑quality, licensed, diverse data drives better models.

Stable Diffusion generates images by iteratively denoising a compressed latent, guided by your text prompt.

Randomness (seed) gives diversity; latent space makes it fast and cheap; combined losses ensure realistic detail and coherent structure.

Contrastive alignment ties text and images together so prompts steer results effectively.

AI. ECOSYSTEM

Foundational models learn relational meaning — they understand concepts from how things connect (across text, images, audio, behaviors). Companies that unify their data and ML into a single intelligence layer compound advantages across search, recommendations, marketing, pricing, risk, support, and more.

1) What are “foundation models” and why are they a big shift?

A foundation model is a large AI model trained on a huge amount of data (text, images, etc.) so it learns general patterns. After that, you can adapt it to many tasks like:

answering questions
summarizing documents
recommending products
classifying items
extracting info from text

Earlier AI was usually one model per task (one for translation, one for search ranking, one for sentiment…).

Now, a foundation model can become a single base engine that supports many tasks.

2) The key idea: “Relational meaning” (how AI really understands concepts)

What “relational meaning” means

A word or concept doesn’t have meaning in isolation. It gets meaning from its relationships with other concepts.

Example:

You understand “dog” not only by a dictionary definition
but also by its links to bark, pet, leash, park, bite, fur, vet, cute.

Foundation models learn like this by observing massive data:

which words appear near which words
which images match which captions
which actions follow which actions in user behavior logs

This is why they can often “understand” things they were never explicitly taught.

3) Why the section talks about “how humans learn” (and why it matters)

The speakers highlight that humans learn a lot without formal teaching:

kids learn language mostly by exposure, trial, correction, context
people learn meaning from real-world experience and repeated patterns

This connects to modern AI training called self-supervised learning:

the model teaches itself from data patterns (no human labeling for every example)

So the message is:

If you want better AI, learn from how humans build understanding: mostly from exposure + relationships + experience.

4) From “isolated task models” to “unified intelligence”

Old style

Separate AI for search
Separate AI for recommendations
Separate AI for customer support
Separate AI for marketing analytics

Problem: these systems don’t “talk” to each other well, so the business acts like it has multiple small brains.

New style (unified intelligence)

Build a central intelligence layer that understands:

customers
products
context (season, location, trends)
business constraints (inventory, delivery, margins)

Then different applications connect to it.

5) Why businesses need their own central model (competitive advantage)

If everyone uses the same public model (same API, same general training),
then everyone has the same intelligence.

So where does advantage come from?

Your proprietary data:

customer clicks, searches, carts, purchases
returns and complaints
store inventory and supply chain signals
product catalogs and attributes
domain rules (retail logic, policies, constraints)

When you combine foundation models with your unique data, you get:

better personalization
better predictions
better product understanding
better decision-making

That’s hard for competitors to copy.

6) “Many foundation models will exist” (not just one model to rule all)

The future described is not: one single AI does everything best.

Instead:

some models are great at language
some at images/video
some at search/ranking
some at code
some at reasoning
some at a specific industry (health, retail, finance)

So companies will likely use a portfolio of models:

an internal “core” model for their business brain
external models for specialized skills
tools like databases, search engines, workflow systems


7) Multi-modal learning (like human senses)

Humans don’t learn only from text:

vision, sound, touch, movement, memory, emotion all contribute

Similarly, foundation models are evolving to handle:

text + images + audio + video + user behavior + structured data

This creates “synergy”:

the model can connect how something looks with how it’s described
and how people behave around it (click, buy, return)

8) System 1 vs System 2 thinking (intuition vs conscious reasoning)

The section mentions a psychological idea:

System 1 (fast, automatic)

quick judgments
intuition
habits
pattern recognition
Most day-to-day decisions happen here.

System 2 (slow, deliberate)

careful reasoning
step-by-step logic
conscious effort
Used less often.

Why this matters for AI:

many useful business predictions are more like System 1
(pattern-based, probabilistic, fast)
not everything needs heavy “reasoning” to be valuable

Example:

predicting a customer is likely to abandon a purchase doesn’t require a proof
it requires recognising patterns from behavior

9) Retail “deep intelligence” (focus on understanding customers, not just solving one task)

The section argues that the biggest win in retail is not only:

“answer questions”
“fix tickets”
“automate emails”

…but building a model that understands:

customer intent
product meaning
shopping journey
preferences and constraints

That enables a better experience:

better navigation
better recommendations
fewer frustrating searches
more trust

10) Why “expert labelling” can miss what customers actually care about

Traditional product tagging might say:

Category: “wall art”
Style: “landscape”
Color: “orange”

But customers might actually be reacting to something else:

the feeling
the scene (example mentioned: “sunsets”)
mood, aesthetic, cultural meaning

AI can learn this from behavior:

what people click after viewing it
what they compare it with
what they save
what they return
what they search before they buy

11) Multilingual + cultural nuance (harder than it looks)

The section points out that in real markets:

people mix languages (code-switching)
meanings differ by culture
translations aren’t literal

So a retail intelligence system must understand:

blended language queries
local synonyms
culturally specific product interpretations

Example style of problem:

one region’s “slippers” might be another region’s “flip-flops”
product descriptions may need adaptation, not direct translation

12) Predictive workforce modelling (reducing attrition cost)

Instead of only relying on surveys (“Are you happy at work?”),
AI can learn patterns from behaviour signals like:

schedule changes
overtime spikes
repeated shift conflicts
performance changes
transfer requests
absentee patterns

Then it can estimate attrition risk, so managers can intervene early:

better scheduling
coaching
career development
workload balancing

(Important note in real life: this should be handled carefully with privacy, fairness, and transparency — otherwise it can create mistrust.)

AI. BIOLOGY

1) AI is changing biology and medicine: what’s the big change?

Earlier, medicine progressed mainly by:

observing something in patients,
making a hypothesis (“maybe X causes Y”),
testing it on small experiments.

Now, we can collect huge amounts of data (genetics + hospital records + medical images + lab results), and AI helps us:

find patterns we didn’t notice,
discover hidden subtypes of disease,
propose new drug targets,
personalize treatment per person.

So the professor’s main message is:

Medicine is shifting from “guess first” to “measure a lot first.”

2) Hypothesis-driven vs data-driven research (simple comparison)

Hypothesis-driven (older style)

Scientist guesses an explanation.
Runs a small targeted experiment.
Confirms or rejects it.
Repeats.

Good when we already have strong ideas.
Limited because we might miss unexpected causes.

Data-driven (new style)

Collect big datasets (genetics, EHR, images).
Use AI to find patterns and relationships.
Generate many candidate explanations.
Test the best candidates in lab/clinical experiments.

Great at discovering surprises and hidden mechanisms.
Needs careful design to avoid false patterns and bias.

3) Moving from “correlation” to “causation” (why genetics is powerful)

Correlation

Correlation means:

“X and Y occur together”

Example:

People with a certain marker often have Alzheimer’s.

But correlation does not prove cause:

Maybe X is just a side effect, not the real driver.

Causation

Causation means:

“X actually produces Y”

Genetics helps because it gives mechanistic clues:

If a gene variant increases disease risk, it’s often closer to a real cause (not always, but it’s a stronger clue).

Then researchers test causation by interventions, like:

editing genes in cells,
switching gene circuits on/off,
checking if the disease-related outcome changes.

If changing the gene changes the outcome → stronger evidence of cause → better drug target.

4) Deep learning in biology: why it’s useful

Biology data is messy and complex:

thousands of genes interact,
proteins fold in complicated ways,
disease isn’t one thing (it can have subtypes),
data comes from many sources.

Deep learning helps because it can learn patterns from high-dimensional data like:

gene expression profiles,
microscopy images,
pathology slides,
multi-step patient histories.

A key idea in the summary:

AI can predict outcomes, then scientists validate those predictions by doing real experiments on cells.

So AI doesn’t replace experiments; it helps choose which experiments to do first.

5) Genetic mechanisms → new therapies (examples in the summary)

The section summary gives examples of using genetic understanding like “rewiring circuits”:

A) Obesity / metabolic disorders

Human fat cells can behave in different modes:

fat-storing mode
fat-burning mode

If we can “switch” the gene circuit controlling that behavior, it may become possible to shift metabolism in a healthier direction.

(Important: conceptually powerful, but real therapies must be safe and proven in humans.)

B) Alzheimer’s (APOE4 example)

APOE4 is a genetic variant linked to higher Alzheimer’s risk.
The summary says:

fixing a specific biological function (cholesterol transport) improved myelination and cognition in that context.

The big idea:

Find the mechanism a risky gene disrupts, then target that mechanism.

C) Cancer immunotherapy + recurrence

If we understand the genetic circuits that let cancer return, we can:

predict recurrence risk,
design therapies to prevent relapse,
personalize follow-up and treatment intensity.

6) Integrating genetics + EHR (health records) for deeper understanding

What is the goal?

To connect:

genetic variation (differences in DNA)
with
phenotypes (what we observe: symptoms, lab values, diagnoses, disease progression)

If you map many patients, you can find:

which gene patterns connect to which disease patterns,
subtypes of diseases that look “same” clinically but differ biologically.

This is especially useful for complex diseases like Alzheimer’s.

7) How LLMs can help with medical notes (EHR text)

EHRs contain lots of unstructured text:

doctor notes
discharge summaries
radiology reports

Large Language Models (LLMs) can:

extract meaning from that text,
standardize messy descriptions,
detect patterns across huge populations (carefully, with privacy and bias control).

This is not magic — LLMs help turn text into structured signals that can be combined with labs, images, and genetics.

8) AI in pathology imaging (tumor detection)

Pathology slides are images of tissue.
A pathologist checks them for:

tumor presence,
tumor grade,
margins,
cell patterns.

AI image models can:

highlight likely tumor regions,
detect subtle patterns,
speed up screening,
assist diagnosis (as a support tool, not a replacement).

This improves:

accuracy
speed
consistency (especially when workload is high)

9) Graph Neural Networks (GNNs) for molecules and drug design

Molecules are naturally graphs:

atoms = nodes
bonds = edges

A GNN learns chemical behavior from structure, helping:

predict molecule properties,
suggest new molecules,
support synthetic chemistry planning.

This matters for pharmaceutical development because it can shorten the search for promising candidates.
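To make "atoms = nodes, bonds = edges" concrete, here is a tiny sketch that encodes a water molecule as a graph and runs one round of neighbor aggregation. A real GNN adds learned weights and nonlinearities on top of this step; the feature encoding here is an assumption for illustration.

```python
import numpy as np

atoms = ["O", "H", "H"]          # nodes
bonds = [(0, 1), (0, 2)]         # edges: each O-H bond

# Node features: one-hot element type [is_oxygen, is_hydrogen].
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 1.0]])

adjacency = np.zeros((len(atoms), len(atoms)))
for i, j in bonds:
    adjacency[i, j] = adjacency[j, i] = 1.0   # bonds are undirected

def message_passing_step(adj, h):
    """Each atom adds its neighbors' features to its own (sum aggregation)."""
    return (adj + np.eye(len(h))) @ h

updated = message_passing_step(adjacency, features)
print(updated)
# [[1. 2.]
#  [1. 1.]
#  [1. 1.]]
```

After one step the oxygen "knows" it is bonded to two hydrogens, and each hydrogen "knows" it is bonded to one oxygen. Stacking such steps lets information travel further across the molecule.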

10) Multi-modal embeddings: the “one common space” idea

What is an embedding (simple)?

An embedding is like turning complex data into a point on a map, so that:

similar things are close,
different things are far.

Multi-modal embedding

Means combining many types of data into one representation:

genetics + labs + images + notes

So each patient becomes a “point” in a big patient map.

Then you can:

find similar patients (“neighbors”),
predict risk/progression,
select the best treatment based on similar outcomes.
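A toy version of the "patient map" lookup, assuming each patient has already been turned into a small vector. The patient IDs and numbers are invented; real embeddings would come from a trained multi-modal model and have many more dimensions.

```python
import math

# Hypothetical patient embeddings (3 dimensions for readability).
patient_embeddings = {
    "patient_a": [0.9, 0.1, 0.3],
    "patient_b": [0.8, 0.2, 0.4],
    "patient_c": [0.1, 0.9, 0.2],
}

def nearest_neighbors(query_id, embeddings, k=1):
    """Return the k closest other patients by Euclidean distance."""
    query = embeddings[query_id]
    others = [(pid, math.dist(query, vec))
              for pid, vec in embeddings.items() if pid != query_id]
    return sorted(others, key=lambda item: item[1])[:k]

print(nearest_neighbors("patient_a", patient_embeddings))
```

Here `patient_b` comes back as the closest neighbor of `patient_a`, so its history would be the first place to look for comparable outcomes.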

11) “Google Maps for knowledge” (papers & concepts navigation)

The summary mentions a navigation system like Google Maps:

instead of streets, you have concepts and papers,
instead of physical distance, you have “meaning distance” (embedding similarity).

This helps researchers:

see clusters of related work,
find gaps (“no one connected these two ideas yet”),
explore literature faster than manual reading.

12) Bias in medical data: NMAR (Not Missing At Random, also written MNAR)

What NMAR means (simple)

Medical tests are not collected randomly.

Doctors order tests because they suspect something is wrong.

So the dataset becomes skewed:

lots of abnormal cases have tests,
healthy people often don’t.

If an AI model learns from that directly, it may become biased:

it may treat “missing test” as a strong signal,
or overestimate risk because it mostly saw sick/testing cases.

How AI can help

AI can model the process of testing:

who got tested, when, and why (age, sex, symptoms, access, doctor practice)

Then it can do counterfactual analysis:

“What would we predict if this person had been tested?”
“What if they had not been treated?”

This reduces bias and improves predictions.
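One standard way to "model the process of testing" is inverse-probability weighting, sketched below with invented numbers. It assumes we know (or can estimate) each tested person's probability of being tested and that this probability depends only on observed factors such as symptoms. That is a strong assumption, and the technique is not named in the summary itself.

```python
# Each tested person: (abnormal_result, probability_of_being_tested).
# Symptomatic people are tested 80% of the time, others only 10% (made-up values).
tested = (
    [(1, 0.8)] * 8 + [(0, 0.8)] * 8 +    # symptomatic people who were tested
    [(1, 0.1)] * 1 + [(0, 0.1)] * 7      # asymptomatic people who were tested
)

# Naive estimate: treat tested people as if they represented everyone.
naive_rate = sum(result for result, _ in tested) / len(tested)

# Weighted estimate: each tested person stands in for 1 / p similar people.
weighted_sum = sum(result / p for result, p in tested)
total_weight = sum(1 / p for _, p in tested)
weighted_rate = weighted_sum / total_weight

print(naive_rate, weighted_rate)  # 0.375 vs about 0.2
```

The naive rate overestimates how common abnormal results are because sick-looking people are over-represented among those tested; the weighting undoes that over-representation under the stated assumptions.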

AI + massive biological/clinical data + genetics + experiments → mechanism discovery + personalized medicine + faster therapy design, but we must handle bias, causality, and validation carefully.


AI AUTONOMY

1) What are “autonomous agents” in simple words?

A normal chatbot answers questions.

An autonomous agent goes further:

It can take actions, not just talk.
It can use tools (search, code, databases, apps).
It works in a loop until the task is done.

Example tasks:

“Find the best sources and summarize them.”
“Check my logs and identify security alerts.”
“Plan a trip and create an itinerary.”
“Write code, run tests, fix errors, and repeat.”

Core idea: “Think + Use tools + Learn from results”

Instead of giving one-shot answers, the agent:

decides the next step
uses a tool
reads the result
updates the plan
repeats
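The loop above can be sketched in a few lines. This is a toy, not a real agent framework: `decide_next_step` is a rule-based stand-in for the LLM's decision, and `add_tool` is a hypothetical tool.

```python
def add_tool(a, b):
    """Hypothetical tool the agent can call."""
    return a + b

def decide_next_step(remaining, total):
    """Stand-in for the LLM: choose a tool call, or finish when nothing is left."""
    if remaining:
        return ("add", total, remaining[0])
    return ("finish", total, None)

def run_agent(numbers, max_steps=10):
    remaining, total, log = list(numbers), 0, []
    for _ in range(max_steps):                               # bounded loop
        action, a, b = decide_next_step(remaining, total)    # decide the next step
        if action == "finish":
            break
        total = add_tool(a, b)                               # use a tool, read the result
        remaining = remaining[1:]                            # update the plan
        log.append((action, a, b, total))
    return total, log

total, log = run_agent([2, 3, 4])
print(total, len(log))  # 9 3
```

The `max_steps` cap is a common safety habit: it keeps a confused agent from looping forever.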

2) Why is there “confusion in AI terminology”?

AI is evolving fast, so people mix words that sound similar but mean different things. Here are the common ones:

LLM (Large Language Model)

A model trained on lots of text that predicts the next token (word piece).
It’s great at language tasks: writing, explaining, summarizing, Q&A.

GPT

GPT (Generative Pre-trained Transformer) is a family of LLMs from OpenAI. People also say “GPT” casually to mean “an LLM chatbot,” which adds confusion.

“Model” vs “Application”

Model = the engine (like an AI brain)
Application = the product built using that engine (chatbot, copilot, agent)

Agent

An application that uses an LLM plus tools, memory, and an action loop to get tasks done.

3) What is AGI, and why is current AI not AGI?

AGI (Artificial General Intelligence)

In simple terms, AGI would be:

a system that can do any intellectual task a human can (learn new things, adapt, plan, interact with the world).

Why current AI isn’t AGI (as the lecture hints)

Today’s LLMs:

don’t truly “live” in an environment like humans do
don’t automatically form long-term goals on their own
may struggle with reliable planning and real-world adaptation
can be brittle outside their training patterns

So the lecture compares LLMs to a powerful component (like a part of the brain), not the whole “complete intelligence system.”

4) How agents evolved: from “deep learning only” to “reasoning + tools”

Older AI systems were often:

a single neural network that outputs a prediction (classify spam / detect fraud / translate text)

Modern agents include:

LLM reasoning
tool use
memory
execution loops
sometimes multiple agents cooperating

That’s why agents feel more “useful” in real work: they can do things, not just answer.

5) Chain-of-thought (thinking step-by-step) — what it really means

“Chain-of-thought prompting” means encouraging the model to:

break a problem into smaller steps
reason through them iteratively

Why it helps:

complex tasks often fail if the model jumps directly to the final answer
step-by-step reasoning reduces mistakes (especially in multi-step logic)

Important note (simple):
Even without seeing the full internal steps, the key benefit is that the model is guided to be more systematic.

6) Reinforcement Learning with Feedback (RLHF / RLAIF)

Big idea

Instead of only training a model to imitate text, we also train it to prefer better answers.

How it works (simple):

1. model generates multiple answers (A, B, C…)
2. a judge picks the best:

RLHF (Reinforcement Learning from Human Feedback): humans judge
RLAIF (Reinforcement Learning from AI Feedback): AI judges (with rules), sometimes mixed with humans

3. model is updated to produce more “preferred” answers next time

This improves:

helpfulness
safety
style consistency
“what users actually want”
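A common way to turn "the judge preferred A over B" into a training signal is a pairwise preference loss on a reward model's scores. The sketch below assumes that formulation (often called a Bradley-Terry style loss); the lecture summary does not say which loss is used, and the reward values are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Low when the chosen answer scores higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # chosen answer already scores higher: small loss
print(preference_loss(0.0, 0.0))  # no preference learned yet: log(2), about 0.693
print(preference_loss(0.0, 2.0))  # model ranks them the wrong way: large loss
```

Minimizing this loss pushes the reward model to score preferred answers higher; the language model is then tuned to produce answers that the reward model scores well.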

7) RAG (Retrieval Augmented Generation) — why it’s a big trend

LLMs can “hallucinate” because they generate text from learned patterns.
RAG reduces this by letting the model look things up first.

RAG flow (simple)

user asks a question
system retrieves relevant documents/snippets (from internal files or web)
model answers using those retrieved snippets

Benefits:

more accurate
up-to-date (if the source is current)
can cite sources
very useful in enterprise knowledge bases
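Here is a toy version of the RAG flow using word-overlap similarity instead of learned embeddings. The documents, the question, and the prompt wording are all made up; a production system would use an embedding model and a vector index, and would send the prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base.
documents = {
    "returns": "You can return an item within 30 days if you have the receipt.",
    "shipping": "Standard shipping takes about one week.",
    "warranty": "Electronics carry a one year warranty.",
}

def bag_of_words(text):
    """Lowercase word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, k=1):
    """Step 2 of the flow: find the k most similar documents."""
    query = bag_of_words(question)
    scored = [(name, cosine(query, bag_of_words(text)))
              for name, text in documents.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

def build_prompt(question):
    """Step 3 of the flow: hand the retrieved text to the model with the question."""
    context = "\n".join(documents[name] for name, _ in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How many days do I have to return an item?"
print(retrieve(question))
print(build_prompt(question))
```

Because the answer is grounded in the retrieved snippet, the system can also show which document it relied on.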

8) The planning problem: why agents sometimes fail at “simple tasks”

Even strong LLMs can struggle with:

long multi-step planning
keeping track of constraints
not getting distracted mid-way
executing a full 20-step plan reliably

A practical solution discussed: “act first”

Instead of making a huge plan, agents do:

first best action
observe results
adjust next step
repeat

This is closer to how humans work in real life: start, see what happens, then correct course.

9) “Muscle memory” for agents (automation of common behaviours)

The lecture uses an idea like “muscle memory”:

humans don’t consciously plan every tiny movement
we learn reliable automatic routines

Similarly, agents can become better if they learn reusable skills like:

“how to search properly”
“how to debug”
“how to write a report”
“how to follow security playbooks”

This can come from:

learning from demonstrations (imitation learning)
reinforcement learning
storing successful workflows as reusable patterns

10) Imitation learning + video understanding (why it matters)

For physical or environment-based tasks (robots, self-driving, navigation):

the agent must understand sequences of observations (often video)
it must map perception → action

So efficient video processing and learning from demonstrations can make agents:

more robust in real environments
better at navigation and interaction

11) Collective intelligence: many specialized agents

Instead of one general agent doing everything, you can split work:

Research agent finds information
Builder agent writes code
QA agent tests and finds bugs
Manager agent coordinates

This can be faster and more reliable — if they share a workspace and coordinate properly.

12) Human oversight is essential (especially for high-stakes actions)

The future described is not “AI replaces humans.”
It’s more like:

AI does 80% of the work fast, humans approve critical decisions.

Where humans should stay in control:

cybersecurity actions (blocking accounts, deleting resources)
financial actions (payments, purchases)
production deployments
sending sensitive emails
healthcare or legal decisions

Common safe design:

agent proposes
safety/risk checks run
human approves/edits
system logs everything
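The "propose, check, approve, log" pattern can be sketched as a small gate in front of the agent's actions. The action names, the `HIGH_RISK_ACTIONS` set, and the `human_approves` callback are hypothetical placeholders for whatever a real system would use.

```python
# Hypothetical list of actions that always need a human decision.
HIGH_RISK_ACTIONS = {"delete_resource", "send_payment", "deploy_production"}

def run_with_oversight(proposed_actions, human_approves, audit_log):
    """Execute low-risk actions directly; ask a human before high-risk ones."""
    executed = []
    for action in proposed_actions:                      # agent proposes
        risky = action in HIGH_RISK_ACTIONS              # safety/risk check runs
        approved = human_approves(action) if risky else True   # human approves
        audit_log.append({"action": action, "risky": risky, "approved": approved})
        if approved:
            executed.append(action)                      # stand-in for real execution
    return executed

audit_log = []
executed = run_with_oversight(
    ["summarize_logs", "delete_resource"],
    human_approves=lambda action: False,   # reviewer rejects everything in this demo
    audit_log=audit_log,
)
print(executed)        # ['summarize_logs']
print(len(audit_log))  # 2
```

Note that rejected actions are still logged, which is what makes later audits possible.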

AI ETHICS

1) Why ethics + regulation matter for foundation models

Foundation models (big models that can do many tasks) and generative AI (systems that create text/images/video) are powerful because they can influence:

what people believe (information, persuasion)
who gets opportunities (jobs, loans, admissions)
safety and security (fraud, phishing, cyberattacks)
privacy (learning patterns from personal data)
society at scale (jobs, culture, politics)

So the key question becomes:

“Who is responsible when an AI system causes harm?”

That’s where ethics and regulation come in.

2) Accountability: don’t treat AI like a “person”

The lecture warns against anthropomorphizing AI — meaning we talk like:

“the AI decided”
“the AI wanted”
“the AI is lying”

This can be dangerous because it shifts blame away from real people.

The simple truth

AI systems are built and deployed by:

companies
engineers
product teams
leaders who choose goals and incentives

So accountability should point to real stakeholders:

who built it?
who deployed it?
who profits?
who failed to add safeguards?

3) Transparency: what it means (more than “open the code”)

People say “AI should be transparent,” but transparency has layers.

A simple way to understand transparency

Data transparency:
Where did training/deployment data come from? What’s missing?
Model transparency:
What can it do well? Where does it fail? What are known risks?
Decision transparency:
Why did it output this? What evidence did it use?
Governance transparency:
Who is accountable? Is there auditing, logging, escalation?

4) Real-world vs “ideal world” ethics

The lecture highlights a classic gap:

How the world should work: fair, truthful, calm decision-making
How it actually works: incentives, competition, manipulation, conflict

So ethical discussions become more useful when they ask:

“What happens when bad actors use the tech?”
“What happens when companies optimize for profit/engagement?”
“What happens when governments use it for defence or influence?”

This is why the talk discusses warfare, media manipulation, and urgency in national defence.

5) Misinformation and manipulation: why generative AI raises the risk

Why the risk grows

Generative AI makes it cheaper and easier to create:

realistic fake images/video (“deepfakes”)
persuasive fake text at huge scale
impersonation (voice, writing style, video)

This can be used for:

political manipulation
scams and fraud
identity theft and blackmail
social unrest (spreading distrust)

Why democracy is vulnerable

Democracy depends on people agreeing on shared facts.
If people stop trusting anything (“everything might be fake”), then:

it becomes easier to manipulate crowds
it becomes harder to hold anyone accountable

6) “Information overload” → skepticism → need for critical thinking

The summary says we’re overwhelmed with information, and now we doubt:

news
images
video
even “direct evidence”

So individuals and societies need stronger habits like:

checking sources
cross-verifying
understanding incentives (who benefits if I believe this?)
resisting emotionally-triggering content designed to provoke fast reactions

This is not only a tech issue — it’s a human thinking issue.

7) Privacy risk: AI can learn your “psychology” from data

The lecture warns that AI can learn from massive personal data:

what you click
what you watch
what makes you angry or happy
what convinces you

This can lead to hyper-personalized persuasion:

ads that push your exact emotional buttons
political messaging tailored to your fears
manipulation that feels like “your own idea”

So privacy is not just about “my name and phone number.”
It’s also about:

“Can someone model my mind and influence my decisions?”

8) Bias and fairness: it’s not a simple on/off switch

Why fairness is difficult

Bias can enter at multiple stages:

Data bias:
Online content is not a perfect mirror of society. It’s selective.
Measurement bias:
What gets recorded? Who gets labeled? Who is missing?
Decision bias (use bias):
Even a “good” model can cause harm if used wrongly (over-trusting it, no appeals, no human review).

Also, fairness often has trade-offs:

improving one fairness metric can worsen another
different groups can be affected differently

So fairness is more like a continuum than a binary “fair/unfair.”
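A tiny worked example of such a trade-off, with invented labels and predictions. Both groups are selected at the same rate (one fairness metric looks fine), yet their false positive rates differ (another metric does not).

```python
def selection_and_fpr(labels, preds):
    """Return (share of people selected, false positive rate) for one group."""
    selection_rate = sum(preds) / len(preds)
    preds_on_negatives = [p for y, p in zip(labels, preds) if y == 0]
    false_positive_rate = sum(preds_on_negatives) / len(preds_on_negatives)
    return selection_rate, false_positive_rate

# labels: 1 = truly qualified, preds: 1 = selected by the model (made-up data)
group_a = selection_and_fpr(labels=[1, 1, 0, 0], preds=[1, 1, 0, 0])
group_b = selection_and_fpr(labels=[1, 0, 0, 0], preds=[1, 1, 0, 0])

print(group_a)  # (0.5, 0.0)
print(group_b)  # selection rate 0.5, false positive rate about 0.33
```

When groups have different base rates, equalizing one metric tends to shift another, which is why a fairness goal has to be chosen explicitly rather than assumed.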

9) Risk of “one algorithm dominating society”

The lecture warns: if one algorithm (or a few models) become the default decision-maker for many systems (courts, hiring, credit, education), then:

any bias becomes system-wide
mistakes scale to millions of people
society becomes dependent on a small number of model owners

This is why governance and diversity of systems matter.

10) Social impact: jobs, leisure, and isolation

AI can automate parts of work, which might lead to:

productivity gains
more leisure time for some people

But the lecture also points to risks:

job displacement
inequality (some benefit more than others)
changes in human relationships (less interaction, more isolation)
loss of meaning (if work is a major source of identity)

So the impact is not only economic — it’s psychological and cultural too.

11) Rapid change causes unrest: lessons from history

The lecture connects fast technological change to social instability:

when people feel uncertain, they look for someone to blame
fear can beat creativity
conflict becomes more likely when systems change too fast

This is the idea behind warning against accepting change blindly.

12) Antifragility + “time tests” (how to deploy responsibly)

Because predicting the future is uncertain, the lecture suggests building systems that:

can fail safely
learn from failures
improve over time

“Time tests” (simple meaning)

Instead of rolling out a powerful system everywhere:

test it on a small scale
run it for a longer period
observe failures early (cheaply)
only then scale up

Antifragile thinking

Antifragile systems don’t just survive shocks — they improve because of them:

monitoring + alerts
fallback modes
human override
red-teaming (trying to break it)
post-incident learning

13) Why regulation is hard (and why regions differ)

Regulation is tough because it must balance:

safety (reduce harm)
innovation (don’t freeze progress)

Different regions emphasize different levers:

EU: transparency, risk categories, human oversight, societal impact
US: guidance + sector-by-sector rules, benchmarks, risk mitigation while keeping innovation
China: algorithm registration, operational standards, content and deployment controls

14) The “regulation vs big corporations” concern

A realistic issue raised:

compliance is expensive
big companies can pay for audits/lawyers/processes
small innovators may struggle

So regulation can unintentionally:

entrench big players
reduce competition

Good policy tries to protect people without making it impossible for smaller companies to build responsibly.

AI PANEL

1) Centralization vs decentralization: what does it mean?

Centralization in AI

This means a few big companies or platforms control most of:

the strongest models
the data pipelines
the distribution (apps, cloud, APIs)

Why it’s attractive

cheaper (economies of scale)
faster rollout
standardization (same tools everywhere)

Why it’s risky

“single point of failure” (one big system breaks → many people affected)
too much power in few hands
systemic bias (one model’s blind spots spread everywhere)
less competition → slower innovation over time

Decentralization in AI

This means many different models and builders exist:

open-source models
regional / domain-specific models
multiple platforms competing

Why it’s good

diversity (more approaches, more creativity)
resilience (if one fails, others still work)
innovation (new ideas appear faster)
better local fit (language, culture, domain needs)

Why it’s hard

tougher to monitor everyone
inconsistent quality
more coordination needed (standards, interoperability)

Panel’s main idea

We need a balance: strong innovation + reduced concentration of risk.

2) Why diversity is a big theme in the discussion

The panel says diversity is not only a “social value” — it’s an engineering advantage.

How diversity improves outcomes (simple)

Different people bring different:

assumptions
problem-solving styles
priorities (safety vs speed, fairness vs accuracy, etc.)

This reduces blind spots and increases creativity.

Example (simple):

If everyone designing AI has the same background, they may miss how the system affects other communities.

In AI development, diversity matters in:

the team building it
the data used to train it
the evaluation (who tests it and what tests they run)

3) Can AI reduce human bias?

The panel highlights a hopeful point:

AI can be designed to challenge stereotypes and detect unfair patterns.

But there is a catch:

AI learns from data, and data often includes society’s past unfairness.
So AI can either:
reduce bias, if carefully designed and tested, or
amplify bias, if trained/deployed carelessly.

So the “anti-bias” outcome is not automatic — it requires:

clear fairness goals
careful dataset choices
testing across groups
transparency and monitoring

4) “Community of models” (why AI systems aren’t one single brain)

The panel suggests systems like ChatGPT can be thought of as multiple components working together, such as:

a main language model (generates text)
safety filters (reduce harmful output)
retrieval/tools (look up information or run actions)
coordination logic (decides which component to use)

Why this matters for “diversity”:

multiple models can give richer, more robust outcomes
you can swap/upgrade parts without rebuilding everything
failures can be contained (one part fails, not the entire system)

5) Change, disruption, and the “evolution” analogy

The panel mentions evolution and disruption:

dinosaurs went extinct → mammals expanded
big changes create space for new forms of life/innovation

The message applied to AI:

disruption can be painful,
but it can also enable new industries and new kinds of work.

They also emphasize:

AI has no “will” or “desire.”
Risks usually come from human misuse, incentives, or poor governance.

6) AI as “fire”: a tool that can help or harm

The panel compares AI to fire:

Fire enables cooking and progress, but can also destroy.
AI can increase productivity, but also create risks.

Benefits

faster learning
higher productivity
new discoveries and services
better tutoring and personalization

Risks

scams and manipulation
bias in important decisions
concentration of power
job disruption

So the outcome depends on:

who controls it
what incentives exist
what safeguards and accountability exist

7) AI in education: “liberating force” for students

The panel suggests AI can act like a personal tutor:

helps struggling students (step-by-step explanations)
challenges advanced students (keeps them engaged)
offers practice, feedback, and personalized pacing

This can reduce inequality in education if access is broad.

But risks include:

over-reliance (students stop practicing thinking)
misinformation (AI may be wrong)
fairness issues (if only some students can afford it)

8) Future of work: jobs will change fast (5–10 years)

The panel predicts:

many current jobs will be transformed (tasks automated)
some roles will shrink
new roles will appear (AI operators, evaluators, safety, data stewards, tool builders)

Important nuance:

It may not be “jobs disappear everywhere.”
Often it’s “tasks move and skills shift,” and opportunities appear in different places.

So society needs:

reskilling and upskilling
support for workers during transitions
redesigning jobs so humans + AI work together

9) Regulation: focus on self-regulation + flexibility (but with safeguards)

The panel suggests heavy regulation can sometimes:

slow innovation
push power toward big companies that can afford compliance

This effect is often discussed under the label regulatory capture, where rules end up favoring large incumbents:

large players handle paperwork easily
small innovators struggle
competition drops, centralization increases

So the panel leans toward:

self-regulating systems (local governance, quick iteration)
plus standards and external checks for serious risks

A practical compromise:

light rules for low-risk uses
strong oversight for high-risk uses (health, finance, elections, justice)

10) Creativity, originality, and copyright concerns

The panel mentions experiments suggesting:

some creative patterns (like melodies) may be finite,
which raises questions:
What counts as “original”?
Who owns AI-generated content?
Is it remixing existing work too closely?

This is why copyright law and creative ethics will become more important as AI content becomes widespread.

· Centralization gives efficiency but increases systemic risk.

· Decentralization boosts diversity and resilience but needs standards.

· Diversity (people + models + evaluation) improves creativity and safety.

· AI is a tool (like fire): outcomes depend on governance and incentives.

· Education may improve, jobs will shift fast, and regulation must avoid stifling innovation while preventing harms.

I know many of you will not have the patience or time to go through the full explanation, so here is the gist of the above lecture in a crisp format.

Crisp Explanation Of The Subject

1) The big idea (simple words)

What is a foundation model?

A foundation model is a very large AI model trained on a huge amount of data (text, images, code, audio, etc.). Because it saw so much data, it learns general abilities (language, writing, summarizing, reasoning patterns). Then you can reuse that same model for many tasks instead of building a separate AI for each task.

What is generative AI?

Generative AI means AI that can create new content — like writing text, generating images, writing code, or proposing new designs — by learning patterns from the data it trained on.

Why this became a “recent shift”

Earlier AI was often “one model for one job.” The shift is that now one strong general model can do many jobs (write, summarize, plan, help with science/business tasks) once you adapt it slightly (prompting, fine-tuning, adding tools).

2) Why the recent AI shift matters (simple + real-world meaning)

Simple view

Foundation models are like a person who got a very broad education (read tons of books). After that, you can train them quickly for a specific job (like law, medicine, customer support) with much less extra training than starting from scratch.

Why companies care (economics)

Because one model can be reused across many products (support, search, coding help, analytics), it becomes a “platform” investment — so funding and adoption surged.

Agents (a step beyond chat)

A normal chatbot answers. An agent tries to do multi-step work: break a goal into steps, use tools, check results, retry, and finish a task loop (plan → act → check → improve).

3) The most important concept: “How do machines learn?”

The notes describe three main learning styles. Here’s the simplest way to remember them.

A) Supervised learning (learning with answer keys)

You show examples + correct labels
Example: many images labeled “dog” or “cat.”
The model learns to map input → correct output.

Downside: Labels are expensive and sometimes unclear (“what counts as ‘good customer service’?”).

B) Reinforcement learning (learning by trial-and-error)

The model takes actions, gets reward/punishment.
Example: learning to play a game by scoring points.

Downside: Feedback can be delayed and trial-and-error can be unsafe in the real world.

C) Self-supervised / generative learning (learning from “puzzles” made from raw data)

This is the breakthrough: the model learns without human labels by solving “make-believe tasks,” like:

In text: predict the next word, or fill in a missing word
In images: predict missing patches or match two crops of the same image

Why it mattered: There’s an ocean of unlabelled data. Self-supervised learning lets models learn from it cheaply and at massive scale.

4) “Meaning comes from relationships” (simple explanation)

A key point in the notes is that concepts get meaning from how they relate to other things, not from isolated labels.

Simple example

A child learns “dog” not only from a picture + the word “dog,” but from a network of relations:

dogs bark, fetch, have leashes, go to parks, live with people
cats meow, chase mice, sleep on sofas

What the model is learning

Generative models see millions/billions of examples of words appearing together (and images with captions, etc.). Over time they build an internal “map” of:

what tends to go with what
what is similar / different
what comes next in sequences

That internal map is what people loosely call “understanding.”

5) The core engine: self-supervised pretraining → then reuse

This “two-stage” idea is the backbone of foundation models.

Stage 1: Pretraining (learn general world/language patterns)

Train on massive raw data
Task looks simple (predict missing/next parts), but it forces learning deep patterns.

Stage 2: Adaptation (make it useful for specific tasks)

After pretraining, you specialize using:

Prompting (tell it what you want in words)
Fine-tuning (train a bit more on task data)
RLHF (align behavior with human preferences)
RAG (let it look things up in documents/web)
Tools (calculator, database, code runner, etc.)

6) Two main ways language models are trained (important)

The notes contrast two training styles: Masked LM vs Causal LM.

A) Masked Language Model (BERT-style)

You hide a word: “The cat sat on the [MASK].”
Model predicts the missing word using both left and right context.

Strength: strong “understanding” representations (good for classification/search).
Limit: not naturally built to generate long text left-to-right.

B) Causal Language Model (GPT-style)

You give: “The cat sat on the”
Model predicts the next word using only the left context.

Strength: great at generating fluent text.
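Limit: it only sees the words so far, never the words that come after, so it must commit to each word without look-ahead; this kind of fluent guessing is one reason it may hallucinate.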
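To make the difference concrete, here is a minimal sketch of how training examples could be built from one sentence under each style. The function names are hypothetical, and real systems work on subword tokens rather than whole words.

```python
def causal_examples(tokens):
    """(left context, next token) pairs: the model only sees the past."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, position, mask="[MASK]"):
    """Hide one token; the model sees context on both sides of the gap."""
    hidden = tokens[position]
    visible = tokens[:position] + [mask] + tokens[position + 1:]
    return visible, hidden

tokens = "the cat sat on the mat".split()

# GPT-style: one prediction task per position, using only the left context.
for context, target in causal_examples(tokens):
    print(context, "->", target)

# BERT-style: predict the hidden word from everything around it.
print(masked_example(tokens, 2))
```

Both styles create their own "labels" from raw text, which is what makes them self-supervised.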

7) How GPT-like models learn (step-by-step, simple)

The “next-token prediction loop” works like this:

Input prompt tokens: “The cat sat on the”
Model outputs probabilities for the next token
Compare with the true next token (“mat”)
Compute loss (how wrong it was)
Update weights so it’s more likely to predict correctly next time
Repeat billions of times

Even though the task is “just next word,” language contains huge amounts of world structure, so the model indirectly learns grammar, style, and many facts/patterns.
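The loop above can be sketched with a deliberately tiny "model": three numbers (one score per candidate word) for a single fixed prompt. The vocabulary, learning rate, and step count are assumptions for illustration; a real LLM has billions of weights and a neural network in place of this score list.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["mat", "dog", "moon"]
target = vocab.index("mat")   # the true next token for "The cat sat on the"
logits = [0.0, 0.0, 0.0]      # the "weights" of this toy model
learning_rate = 0.5

losses = []
for step in range(50):
    probs = softmax(logits)              # model outputs probabilities
    loss = -math.log(probs[target])      # cross-entropy: how wrong it was
    losses.append(loss)
    # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot(target)).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [w - learning_rate * g for w, g in zip(logits, grads)]

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss starts at the "pure guessing" level and falls as the model becomes more confident in the correct word.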

8) Why Transformers mattered (simple but accurate)

Transformers are the architecture that made modern LLMs work well at scale.

The key trick: attention

Attention means: while processing a word, the model can “look at” other words in the sentence and decide which ones matter most right now.
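Here is a minimal sketch of scaled dot-product attention in plain Python. It is simplified on purpose: real Transformers first pass each token through learned query, key, and value projections, while this toy version uses the token vectors directly, and the three 2-number "tokens" are made up.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How relevant is each other token to this one?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to those relevance weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up token vectors
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token "looks at" every token in the sequence, including itself.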

Why that’s a big deal

It handles long-range relationships better than older RNNs
It runs efficiently on GPUs because it processes many tokens in parallel
It uses positional encodings so word order still matters

9) “Scaling laws” (why bigger often gets better)

The notes describe that, in general, as you increase:

model size (parameters),
data,
compute,

the prediction error tends to drop smoothly (up to practical limits like data quality). That’s why scaling has been such a powerful strategy.
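The notes give no formula, so the following is an assumption: one commonly cited empirical form is a power law in each resource, holding when the other two are not the bottleneck. Here L is the prediction loss, N the parameter count, D the amount of data, C the compute, and the constants with subscript c and the exponents α are values fitted from experiments.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Because the exponents are small positive numbers, each doubling of a resource shaves off a steady fraction of the loss, which matches the "drops smoothly" description above.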

10) From raw model → helpful ChatGPT: SFT + RLHF (deep but clear)

Pretraining makes a model capable, but not necessarily helpful or safe. The notes explain two main steps used to make chat models behave better.

A) SFT (Supervised Fine-Tuning)

Humans provide example “good answers” to prompts.
The model learns the style: helpful, structured, polite, etc.

B) RLHF (Reinforcement Learning from Human Feedback)

Model generates multiple candidate replies
Humans rank which reply is better
Train a reward model to predict those preferences
Optimize the chat model to get higher reward scores

Why RLHF is tricky: feedback is delayed (you judge the whole answer at the end), and the model can “game” the reward model if you’re not careful — so guardrails and validation matter.
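The "train a reward model" step is often done with a pairwise preference loss. The sketch below assumes a Bradley-Terry style loss, which is a common choice but is not stated in the notes; the reward numbers are made up.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The reward model agrees with the human ranking: small loss.
print(round(preference_loss(2.0, 0.0), 4))
# The reward model prefers the reply humans rejected: large loss.
print(round(preference_loss(0.0, 2.0), 4))
```

Minimizing this loss pushes the reward model to score human-preferred replies higher than rejected ones.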

11) Other key learning/generative methods in the notes

A) Contrastive learning (learn “what matches what”)

Core idea: bring “related things” close in embedding space, push unrelated far away.

Images: two crops of the same photo should be close
Text: paraphrases should be close

This builds strong representations for retrieval/search and multimodal alignment.
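A minimal sketch of this idea, assuming an InfoNCE-style loss with cosine similarity (one popular formulation; the notes do not name a specific loss). The 2-number vectors and the temperature value are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Low when the anchor is closer to its positive than to the negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]
# Related item is nearby, unrelated items point elsewhere: small loss.
good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# The "related" item is far away while an unrelated one is close: large loss.
bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(round(good, 4), round(bad, 4))
```

Training lowers this loss by moving matching pairs together and pushing everything else apart.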

B) Diffusion models (generate by denoising)

Diffusion is like sculpting from noise:

Add noise to images during training (forward process)
Train a model to remove noise step-by-step (reverse process)
To generate: start from random noise → denoise gradually into an image

This is the core idea behind Stable Diffusion-style generation described in the notes.
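The forward (noising) half can be sketched in a few lines. The linear schedule from 1e-4 to 0.02 over 1000 steps follows a widely used DDPM-style setup and is an assumption here, as is treating a short list of numbers as the "image".

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noisy version: sqrt(a)*x0 + sqrt(1-a)*noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise   # the denoiser is trained to predict `noise` from `xt`

alpha_bars = make_alpha_bars(1000)
image = [1.0, -1.0, 0.5]          # stand-in for pixel values
rng = random.Random(0)
for t in (0, 499, 999):
    noisy, _ = add_noise(image, alpha_bars[t], rng)
    print(t, round(alpha_bars[t], 5), [round(v, 3) for v in noisy])
```

Early steps barely change the input, while by the last step almost none of the original signal is left, which is why generation can start from pure noise.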

C) Autoencoders (compress then reconstruct)

Encoder compresses input into a bottleneck (a small code)
Decoder reconstructs the original
Useful for compression/denoising/feature learning.

D) GANs (generator vs critic)

Generator makes fake samples
Discriminator tries to detect fakes
They compete, pushing realism up — though training can be unstable and can collapse to low diversity.

12) Stable Diffusion pipeline (explained simply)

The notes explain Stable Diffusion as: text → image by denoising in a compressed “latent space.”

Components in plain words

Text encoder: turns your prompt into numbers representing meaning
Latent space: a compressed version of the image (faster than full pixels)
U-Net denoiser: removes noise step-by-step guided by the text
VAE decoder: turns the final latent into a real image

Why “seed” matters

Seed controls the starting noise. Same seed → reproducible. Different seed → different image variations.
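A tiny sketch of why this works, using Python's standard random generator as a stand-in for the image generator's noise source (the function name and the size are hypothetical):

```python
import random

def starting_noise(seed, size=4):
    """Draw the initial noise from a generator fixed by `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

print(starting_noise(42) == starting_noise(42))   # same seed, same noise
print(starting_noise(42) == starting_noise(43))   # new seed, new noise
```

Since the denoising steps are driven by this starting noise, fixing the seed (along with the prompt and settings) is what makes a result repeatable.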

13) Why “the world is messy” matters (the “coding reality gap”)

The notes emphasize a practical truth:

Computers like precise rules, but the real world is full of ambiguity and chaos — so systems that rely only on explicit labels and rigid rules struggle.

Self-supervised learning helps because it learns from real data patterns (how people speak, what users click, what happens over time) rather than depending only on perfect labels.

14) Business view: “unified intelligence” vs many small models

Old way

Separate AI systems for search, recommendations, support, etc.
Problem: it’s like having multiple small brains that don’t share knowledge well.

New way

A central intelligence layer (foundation model + company data) can improve many systems at once. Competitive advantage comes from your proprietary data (clicks, purchases, inventory, product catalog, domain rules), not only from using the same public model as everyone else.

15) Biology/medicine: why foundation models matter there

The notes describe a major shift: medicine is becoming more data-driven, using huge datasets (genetics, images, health records) to find patterns and disease subtypes.

Key ideas explained:

Correlation vs causation: genetics can provide stronger causal clues than pure correlations, but still needs validation.
LLMs for medical notes: convert messy clinical text into structured signals (with privacy and bias controls).
GNNs for molecules: molecules are graphs (atoms=nodes, bonds=edges), so graph neural nets fit naturally.
MNAR bias (Missing Not At Random): medical tests are ordered for reasons, so missing data isn’t random, and models must handle that carefully.

16) Agents + RAG (retrieval) in simple terms

Agents

An agent is an LLM plus:

tool use,
memory/logs,
an action loop (do → observe → adjust).
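The three ingredients above can be sketched as a small loop. Everything here is hypothetical: `scripted_policy` stands in for the LLM's decisions, and `calculator` is a toy tool, so this shows the shape of the loop and not any particular agent framework.

```python
def calculator(expression):
    """Toy tool: adds integers written as 'a+b+c'."""
    return sum(int(part) for part in expression.split("+"))

def scripted_policy(goal, memory):
    """Stand-in for the LLM: picks the next action from the log so far."""
    if not memory:
        return ("calculator", "2+2+x")    # first attempt contains a mistake
    if str(memory[-1]["observation"]).startswith("error"):
        return ("calculator", "2+2")      # adjust after seeing the failure
    return None                           # result looks fine, so stop

def run_agent(goal, tools, policy, max_steps=5):
    memory = []                           # memory/logs of what happened
    for _ in range(max_steps):
        action = policy(goal, memory)
        if action is None:
            break
        tool_name, tool_input = action
        try:
            observation = tools[tool_name](tool_input)    # do
        except ValueError as error:
            observation = f"error: {error}"               # observe failure
        memory.append({"tool": tool_name, "input": tool_input,
                       "observation": observation})
    return memory

log = run_agent("add 2 and 2", {"calculator": calculator}, scripted_policy)
for entry in log:
    print(entry)
```

The `max_steps` cap is a simple safeguard so a confused agent cannot loop forever.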

RAG (Retrieval Augmented Generation)

LLMs can hallucinate because they “generate from patterns.” RAG reduces this by:

retrieving relevant documents/snippets
answering grounded in those snippets

That’s why RAG is big in enterprise settings (internal knowledge bases).
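A minimal sketch of the two RAG steps. Word overlap is used here as a simple stand-in for the embedding search that real systems typically use, and the documents and prompt wording are made up.

```python
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, snippets):
    """Put the retrieved snippets in front of the question for the LLM."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

documents = [
    "Refunds are processed within 5 business days.",
    "The office is closed on public holidays.",
    "Password resets require a verified email address.",
]
question = "How long do refunds take?"
top = retrieve(question, documents, k=1)
print(build_prompt(question, top))
```

The resulting prompt would then be sent to the language model, which answers from the supplied snippet instead of from memory alone.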

17) Ethics & regulation (simple but serious)

The notes highlight why ethics matters: these models can influence beliefs, opportunities, safety, privacy, and society at scale.

Main points explained simply:

Accountability: don’t blame “the AI”; humans/organizations choose goals, data, deployment.
Transparency has layers: data transparency, model limits, decision explanations, governance/auditing.
Misinformation risk: generative AI makes mass creation of persuasive fake content cheaper.
Privacy risk: not just identity — AI can learn what persuades you (“model your psychology”).
Antifragility: deploy carefully, test at small scale, monitor, add human overrides, learn from failures (“time tests”).

18) A clean mental model to remember everything

Think of modern AI as a stack:

Data (raw text/images/behavior logs)
Self-supervised pretraining (learn general patterns)
Foundation model (general-purpose engine)
Adaptation (prompting / fine-tuning / RLHF)
Grounding & tools (RAG, calculators, databases)
Applications (chatbot, copilot, recommender, scientist assistant, agent)
Governance (safety, privacy, fairness, monitoring, accountability)

19) Quick glossary (simple definitions)

Foundation model: big reusable model trained broadly, adapted to many tasks.
Generative AI: AI that creates new content from learned patterns.
Self-supervised learning: learning from raw data by solving prediction “puzzles.”
Embedding: turning things (words/images/users) into points in a space where “close = similar.”
Transformer/attention: architecture that lets tokens “look at each other” to use context efficiently.
SFT: fine-tuning on curated instruction → answer example pairs.
RLHF: aligning a model using preference feedback and reinforcement learning.
RAG: retrieval + generation so answers can be grounded in documents.
Agent: LLM + tools + memory + action loop to complete tasks.

A2A vs MCP: Comparing Google’s Agent-to-Agent Protocol with Anthropic’s Model Context Protocol

Soumyadeep Saha — Thu, 23 Oct 2025 17:39:29 GMT

In AI agent development, there are two main types of protocols that help different systems work together.
One type lets agents connect with tools and resources.
The other type allows agents to work and communicate with each other.
The Model Context Protocol (MCP) and the Agent2Agent (A2A) Protocol are designed to handle these two different but complementary functions.

Model Context Protocol (MCP) — Simplified Explanation

The Model Context Protocol (MCP) sets the rules for how an AI agent connects to and uses different tools or resources — like databases or APIs.

Here’s what it does:

Creates a standard method for AI models and agents to connect with tools, APIs, and other outside systems.
Provides a clear structure for describing what each tool can do, much like how function calling works in large language models (LLMs).
Handles data flow — it sends inputs to tools and receives organized, structured outputs in return.
Supports common tasks, such as:

An LLM using an external API,

An agent querying a database, or

An agent working with built-in functions that are already defined.
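To give a feel for "describing what each tool can do," here is a simplified sketch. The descriptor uses the `name`, `description`, and `inputSchema` fields found in MCP tool definitions, but the weather tool, its canned data, and the `call_tool` helper with its return shape are all hypothetical and do not use an MCP SDK.

```python
weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "inputSchema": {                      # JSON Schema describing the inputs
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city):
    """Hypothetical implementation returning canned data."""
    return {"city": city, "condition": "sunny", "temperature_c": 21}

def call_tool(descriptor, implementation, arguments):
    """Check required inputs, run the tool, return a structured result."""
    required = descriptor["inputSchema"]["required"]
    missing = [key for key in required if key not in arguments]
    if missing:
        return {"ok": False, "error": f"missing arguments: {missing}"}
    return {"ok": True, "result": implementation(**arguments)}

print(call_tool(weather_tool, get_weather, {"city": "Kolkata"}))
print(call_tool(weather_tool, get_weather, {}))
```

The description and schema are what let a model decide when to use a tool and how to fill in its arguments.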

Agent2Agent Protocol — Simplified Explanation

The Agent2Agent (A2A) Protocol is designed to help different AI agents work together to reach a shared goal.

Here’s what it does:

Creates a standard way for independent AI agents to talk to and cooperate with each other as equals.
Defines rules for communication, allowing agents to find one another, set up how they’ll work together, share tasks, and exchange both conversations and complex data.
Supports common situations, such as:

A customer service agent passing a question to a billing agent, or

A travel agent working with flight, hotel, and activity agents to plan a trip.

Why Different Protocols?

Both the Model Context Protocol (MCP) and the Agent2Agent (A2A) Protocol are important for creating advanced AI systems. Each serves a different purpose based on what the AI agent is interacting with.

1. Tools and Resources (MCP Domain)

What they are: Simple, clearly defined systems that take inputs and return specific outputs.
Examples: A calculator, a database query API, or a weather service.
Purpose: Agents use these tools to get information or perform small, focused tasks.
Nature: These tools are usually stateless — they don’t remember past interactions.

2. Agents (A2A Domain)

What they are: Independent, intelligent systems that can reason, plan, and hold longer conversations.
Examples: A customer support agent, a travel booking agent, or a scheduling agent.
Purpose: Agents work together to solve bigger, more complex problems that may require multiple steps or tools.
Nature: These agents often maintain state — they remember past interactions and use them to guide future actions.

A2A ❤️ MCP: How They Work Together

In an agentic system, both protocols play different but connected roles:

The A2A Protocol is used for communication between agents — it helps them share information, coordinate tasks, and work together toward a goal.
Inside each agent, MCP is used to connect with tools and resources, allowing the agent to access data, run functions, or use APIs to get things done.

In simple terms:

Agents talk to each other using A2A,
and each agent talks to its tools using MCP.

Together, they form a complete system where agents collaborate effectively while still being able to perform specific tasks through their tools.

An agentic application might use A2A to communicate with other agents, while each agent internally uses MCP to interact with its specific tools and resources.

Example Scenario: The Smart Hospital 🏥

Let’s imagine a smart hospital run by AI “medical staff” agents.
Each agent specializes in a certain role, and they all use A2A and MCP to work together efficiently.

1. Patient Interaction (User-to-Agent using A2A)

A patient uses A2A to talk to the hospital’s Receptionist Agent.
For example, the patient might say:

“I’ve been feeling dizzy and have a fever.”

The Receptionist Agent collects basic information and assigns the case to a Doctor Agent.

2. Doctor’s Consultation (Agent-to-Agent using A2A)

The Doctor Agent uses A2A to coordinate with other agents in the hospital.

For example, the doctor might ask the Nurse Agent:

“Please take the patient’s temperature and blood pressure.”

Then the Doctor Agent might tell the Lab Agent:

“Run a blood test for infection markers.”

Here, A2A allows smooth, multi-turn communication between multiple agents (Doctor, Nurse, Lab) — just like human teamwork.

3. Using Internal Tools (Agent-to-Tool using MCP)

Now, each agent uses MCP to interact with specialized hospital tools and databases.

Nurse Agent (MCP call):
use_device(device="thermometer", patient_id="P001")
use_device(device="blood_pressure_monitor", patient_id="P001")
Lab Agent (MCP call):
run_test(test_type="CBC", sample_id="S123")
Doctor Agent (MCP call):
query_medical_database(symptoms=["fever", "dizziness"])

These MCP calls connect the agents to hospital systems — tools with structured inputs and outputs — to gather and analyze data.
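To show what "structured inputs and outputs" could look like, the sketch below turns the three calls above into runnable stubs. The function bodies, the readings, and the `call_tool` router are hypothetical, and the values are canned, made-up numbers rather than real clinical data.

```python
def use_device(device, patient_id):
    """Stub device tool returning canned readings."""
    canned = {
        "thermometer": {"temperature_c": 38.6},
        "blood_pressure_monitor": {"systolic": 118, "diastolic": 76},
    }
    return {"patient_id": patient_id, "device": device,
            "reading": canned[device]}

def run_test(test_type, sample_id):
    """Stub lab tool: pretends the test has finished."""
    return {"sample_id": sample_id, "test_type": test_type,
            "status": "completed"}

def query_medical_database(symptoms):
    """Stub lookup: echoes the symptoms with no matches."""
    return {"symptoms": sorted(symptoms), "matches": []}

# Registry that maps a tool name to the function implementing it.
TOOLS = {tool.__name__: tool
         for tool in (use_device, run_test, query_medical_database)}

def call_tool(name, **arguments):
    """Route a named tool call to its implementation."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

print(call_tool("use_device", device="thermometer", patient_id="P001"))
print(call_tool("run_test", test_type="CBC", sample_id="S123"))
print(call_tool("query_medical_database", symptoms=["fever", "dizziness"]))
```

Each agent only needs the tool name and its arguments; how the device or database works stays hidden behind the tool.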

4. Pharmacy Interaction (Agent-to-Agent using A2A)

After diagnosis, the Doctor Agent prescribes medication and communicates via A2A with the Pharmacy Agent:

“Please prepare an antibiotic for patient P001, 500mg twice daily.”

The Pharmacy Agent checks inventory and confirms:

“Medicine available. Ready for pickup in 10 minutes.”

This shows A2A’s role in managing cooperative, goal-oriented dialogues among agents.

5. Summary — How A2A and MCP Work Together

Why Both Matter

MCP gives agents the ability to use tools efficiently — for clear, structured tasks like data lookups, device operations, or computations.
A2A enables conversation, coordination, and teamwork among agents — like doctors, nurses, and pharmacies working together to treat a patient.

Together, A2A and MCP form the backbone of an intelligent, cooperative AI ecosystem — one that mirrors how humans use both tools and teamwork to solve complex problems.

Agent2Agent Protocol: Building the Language of AI Collaboration

Soumyadeep Saha — Thu, 23 Oct 2025 10:34:28 GMT

Building Collaborative Systems with ADK and the Agent-to-Agent (A2A) Protocol

The Agent Development Kit (ADK) empowers developers to create sophisticated multi-agent systems, where multiple agents work together seamlessly. Through the Agent-to-Agent (A2A) Protocol, these agents can communicate, collaborate, and coordinate their actions efficiently and securely.

This guide walks you through the fundamentals of ADK’s A2A features — helping you design intelligent, interconnected agents that operate as a cohesive system. Explore the sections below to unlock the full potential of ADK’s A2A capabilities.

Introduction to A2A

Start with the basics. This guide walks you through building your first multi-agent system, complete with a root agent, a local sub-agent, and a remote A2A agent. You’ll learn how they interact, exchange data, and collaborate to perform complex tasks.

A2A Quickstart (Exposing)

Already have an agent running? Learn how to expose it so that other agents can discover and use it through the A2A protocol. This is your first step toward turning your agent into a service that others can rely on.

A2A Quickstart (Consuming)

On the other side of the equation, this guide teaches you how to connect your ADK agent to a remote agent via A2A. You’ll see how to securely consume data and services from other agents to extend your system’s capabilities.

Official Website

For more details, documentation, and the latest updates, check out the official A2A website: https://a2a-protocol.org/. It’s your go-to resource for deep-diving into A2A concepts and best practices.

Introduction to A2A

As your systems grow in complexity, you’ll quickly realize that a single agent can only do so much. Real-world problems often demand multiple specialized agents, each handling a different part of the solution. That’s where the Agent-to-Agent (A2A) Protocol comes in.

The A2A Protocol acts as a common language for agents — enabling them to communicate, share insights, and collaborate effectively. With A2A, agents don’t just coexist; they work together intelligently, forming a coordinated network capable of tackling challenges far beyond the reach of any single agent.

When to Use A2A vs. Local Sub-Agents

· Local Sub-Agents: These are agents that run within the same application process as your main agent. They are like internal modules or libraries, used to organize your code into logical, reusable components. Communication between a main agent and its local sub-agents is very fast because it happens directly in memory, without network overhead.

· Remote Agents (A2A): These are independent agents that run as separate services, communicating over a network. A2A defines the standard protocol for this communication.

Consider using A2A when:

· The agent you need to talk to is a separate, standalone service (e.g., a specialized financial modeling agent).

· The agent is maintained by a different team or organization.

· You need to connect agents written in different programming languages or agent frameworks.

· You want to enforce a strong, formal contract (the A2A protocol) between your system’s components.

When Not to Use A2A (Prefer Local Sub-Agents)

Sometimes, using A2A is unnecessary and can even slow things down. In these cases, local sub-agents or simple modules are the better choice:

Internal Code Organization:

If you’re just breaking a big task into smaller parts inside one agent — like a DataValidator that cleans input before processing — use local sub-agents. It’s faster and simpler.

Performance-Critical Tasks:

For operations that need speed and low latency, such as a RealTimeAnalytics sub-agent handling live data, keep everything inside the same app. A2A’s network calls would only add delay.

Shared Memory or Context:

When agents need to share the same memory or state, local sub-agents work best. A2A adds extra overhead from network communication and data conversion.

Simple Helper Logic:

If it’s just a small, reusable function that doesn’t need to run separately — like a utility or helper class — don’t create an A2A agent. A simple local module is enough.

The A2A Workflow in ADK: A Simplified View

Agent Development Kit (ADK) simplifies the process of building and connecting agents using the A2A protocol. Here’s a straightforward breakdown of how it works:

1. Making an Agent Accessible (Exposing): You start with an existing ADK agent that you want other agents to be able to interact with. The ADK provides a simple way to “expose” this agent, turning it into an A2AServer. This server acts as a public interface, allowing other agents to send requests to your agent over a network. Think of it like setting up a web server for your agent.

2. Connecting to an Accessible Agent (Consuming): In a separate agent (which could be running on the same machine or a different one), you’ll use a special ADK component called RemoteA2aAgent. This RemoteA2aAgent acts as a client that knows how to communicate with the A2AServer you exposed earlier. It handles all the complexities of network communication, authentication, and data formatting behind the scenes.

From your perspective as a developer, once you’ve set up this connection, interacting with the remote agent feels just like interacting with a local tool or function. The ADK abstracts away the network layer, making distributed agent systems as easy to work with as local ones.
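The exposing and consuming pattern can be illustrated without any framework. This is a conceptual sketch only: `expose`, `RemoteAgentProxy`, and `CatalogAgent` are hypothetical names, not ADK's API, and a plain function call stands in for the network hop that a real A2A setup would make.

```python
import json

class CatalogAgent:
    """Hypothetical agent that would normally live in another service."""
    def handle(self, text):
        return f"catalog result for: {text}"

def expose(agent):
    """Server side: wrap the agent behind a JSON-in, JSON-out handler."""
    def handle_request(raw_request):
        request = json.loads(raw_request)
        reply = agent.handle(request["text"])
        return json.dumps({"text": reply})
    return handle_request

class RemoteAgentProxy:
    """Client side: hides serialization and transport from the caller."""
    def __init__(self, transport):
        self._transport = transport   # in a real system, a network call

    def ask(self, text):
        raw_reply = self._transport(json.dumps({"text": text}))
        return json.loads(raw_reply)["text"]

server = expose(CatalogAgent())               # "exposing"
remote = RemoteAgentProxy(transport=server)   # "consuming"
print(remote.ask("blue running shoes"))
```

The calling code only sees `remote.ask(...)`, which is the sense in which a remote agent can feel like a local function.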

Visualizing the A2A Workflow

To understand how the A2A workflow actually works, let’s look at what happens before and after you expose your agent — and how everything fits together in a connected system.

Exposing an Agent

Before Exposing

At first, your agent is just a standalone program. It runs by itself and can’t be accessed by other agents over a network.

After Exposing

When you integrate your agent with ADK’s A2A Server, it becomes accessible to other agents remotely. The A2A Server acts like a gateway, allowing network communication between your agent and others.

Consuming an Agent

Just like exposing an agent makes it available for others, consuming an agent means your own agent is set up to connect to and use a remote one. Let’s see how this works.

Before Consuming

Your Root Agent (the main agent you’re building) can’t yet talk to any remote agents. It’s isolated and has no built-in way to communicate over the network.

After Consuming

Once you add RemoteA2aAgent (an ADK component) to your setup, it acts as a client-side proxy that connects your Root Agent to the remote agent. The communication now flows smoothly over the network.

In short:

· Before consuming, your Root Agent can’t reach remote services.

· After consuming, the RemoteA2aAgent handles all the network details, making communication with external agents as simple as calling a local function.

Final System (Combined View)

Here’s how everything fits together — the consuming and exposing sides form a complete A2A system.
This setup shows how agents communicate seamlessly through ADK’s A2A components.

Full A2A Architecture

Concrete Use Case: Customer Service and Product Catalog Agents

Let’s consider a practical example: a Customer Service Agent that needs to retrieve product information from a separate Product Catalog Agent.

Before A2A

Initially, your Customer Service Agent might not have a direct, standardised way to query the Product Catalog Agent, especially if it’s a separate service or managed by a different team.

After A2A

By using the A2A Protocol, the Product Catalog Agent can expose its functionality as an A2A service. Your Customer Service Agent can then easily consume this service using ADK’s RemoteA2aAgent.

In this setup, first, the Product Catalog Agent needs to be exposed via an A2A Server. Then, the Customer Service Agent can simply call methods on the RemoteA2aAgent as if it were a tool, and the ADK handles all the underlying communication to the Product Catalog Agent. This allows for clear separation of concerns and easy integration of specialized agents.

A2A Protocol Internal Working

The following is drawn from the official documentation for the Agent2Agent (A2A) Protocol, an open standard designed to enable seamless communication and collaboration between AI agents.

Originally developed by Google and now donated to the Linux Foundation, A2A provides the definitive common language for agent interoperability in a world where agents are built using diverse frameworks and by different vendors.

How A2A and MCP Work Together

Agent-to-Agent (A2A) and Model Context Protocol (MCP) are complementary standards that form the backbone of modern, multi-agent ecosystems. Together, they make it possible for intelligent agents to communicate, collaborate, and access tools seamlessly.

Model Context Protocol (MCP) — Agent-to-Tool Communication

The MCP standard defines how an agent connects to its tools, APIs, and data sources to retrieve or process information.
Think of it as the bridge between an agent and its environment — standardizing how agents interact with resources like databases, APIs, or third-party services.

🌐Agent-to-Agent Protocol (A2A) — Agent-to-Agent Communication

The A2A Protocol focuses on how different agents talk to each other.
It serves as a universal, decentralized network — almost like the “public internet” for AI agents — allowing them to interoperate, share knowledge, and collaborate, regardless of which framework or platform they’re built on.

In short:

MCP connects agents to tools.

A2A connects agents to each other.
Together, they make scalable, intelligent, and interconnected agentic systems possible.

Why Use the A2A Protocol?

A2A addresses key challenges in AI agent collaboration. It provides a standardized approach for agents to interact. This section explains the problems A2A solves and the benefits it offers.

Problems that A2A Solves

Consider a user request for an AI assistant to plan an international trip. This task involves orchestrating multiple specialized agents, such as:

· A flight booking agent

· A hotel reservation agent

· An agent for local tour recommendations

· A currency conversion agent

Without A2A, integrating these diverse agents presents several challenges:

· Agent Exposure: Developers often wrap agents as tools to expose them to other agents, similar to how tools are exposed through the Model Context Protocol (MCP). However, this approach is inefficient because agents are designed to negotiate directly, and wrapping them as tools limits their capabilities. A2A allows agents to be exposed as they are, without requiring this wrapping.

· Custom Integrations: Each interaction requires custom, point-to-point solutions, creating significant engineering overhead.

· Slow Innovation: Bespoke development for each new integration slows innovation.

· Scalability Issues: Systems become difficult to scale and maintain as the number of agents and interactions grows.

· Interoperability: This approach limits interoperability, preventing the organic formation of complex AI ecosystems.

· Security Gaps: Ad hoc communication often lacks consistent security measures.

The A2A protocol addresses these challenges by establishing interoperability for AI agents to interact reliably and securely.

A2A Example Scenario

This section walks through an example scenario to illustrate the benefits of using the A2A (Agent2Agent) protocol for complex interactions between AI agents.

A User’s Complex Request

A user interacts with an AI assistant, giving it a complex prompt like “Plan an international trip.”

Need for Collaboration

The AI assistant receives the prompt and realizes it needs to call upon multiple specialized agents to fulfill the request. These agents include a Flight Booking Agent, a Hotel Reservation Agent, a Currency Conversion Agent, and a Local Tours Agent.

The Interoperability Challenge

The core problem: the agents cannot work together because each one was developed and deployed in its own bespoke way.

Without a standardized protocol, these agents cannot discover what the others can do, let alone collaborate with them. The individual agents (Flight, Hotel, Currency, and Tours) remain isolated.

The “With A2A” Solution

The A2A Protocol provides standard methods and data structures for agents to communicate with one another, regardless of their underlying implementation. The same agents can then operate as an interconnected system, communicating through the standardized protocol.

The AI assistant, now acting as an orchestrator, receives the cohesive information from all the A2A-enabled agents. It then presents a single, complete travel plan as a seamless response to the user’s initial prompt.

Core Benefits of A2A

Implementing the A2A protocol offers significant advantages across the AI ecosystem:

· Secure collaboration: Without a standard, it’s difficult to ensure secure communication between agents. A2A uses HTTPS for secure communication and maintains opaque operations, so agents can’t see the inner workings of other agents during collaboration.

· Interoperability: A2A breaks down silos between different AI agent ecosystems, enabling agents from various vendors and frameworks to work together seamlessly.

· Agent autonomy: A2A allows agents to retain their individual capabilities and act as autonomous entities while collaborating with other agents.

· Reduced integration complexity: The protocol standardizes agent communication, enabling teams to focus on the unique value their agents provide.

· Support for LRO: The protocol supports long-running operations (LRO) and streaming with Server-Sent Events (SSE) and asynchronous execution.

Understanding the Agent Stack: A2A, MCP, Agent Frameworks and Models

A2A is situated within a broader agent stack, which includes:

· A2A: Standardizes communication among agents deployed in different organizations and developed using diverse frameworks.

· MCP: Connects models to data and external resources.

· Frameworks (like ADK): Provide toolkits for constructing agents.

· Models: Fundamental to an agent’s reasoning, these can be any Large Language Model (LLM).

[Figure: Agent Stack: A2A, MCP, Agent Frameworks and Models]

A2A and MCP

A2A and ADK

The Agent Development Kit (ADK) is an open-source agent development toolkit developed by Google. It is a flexible and modular framework for developing and deploying AI agents. While optimized for Gemini and the Google ecosystem, ADK is model-agnostic, deployment-agnostic, and built for compatibility with other frameworks. A2A, by contrast, is a communication protocol that enables inter-agent communication regardless of the framework used to build the agents (e.g., ADK, LangGraph, or CrewAI).

Core Actors in A2A Interactions

· User: The end user, which can be a human operator or an automated service. The user initiates a request or defines a goal that requires assistance from one or more AI agents.

· A2A Client (Client Agent): An application, service, or another AI agent that acts on behalf of the user. The client initiates communication using the A2A protocol.

· A2A Server (Remote Agent): An AI agent or an agentic system that exposes an HTTP endpoint implementing the A2A protocol. It receives requests from clients, processes tasks, and returns results or status updates. From the client’s perspective, the remote agent operates as an opaque (black-box) system, meaning its internal workings, memory, or tools are not exposed.

Fundamental Communication Elements

The following table describes the fundamental communication elements in A2A:

Interaction Mechanisms

The A2A Protocol supports various interaction patterns to accommodate different needs for responsiveness and persistence. These mechanisms ensure that agents can exchange information efficiently and reliably, regardless of the task’s complexity or duration:

· Request/Response (Polling): Clients send a request and the server responds. For long-running tasks, the client periodically polls the server for updates.

· Streaming with Server-Sent Events (SSE): Clients initiate a stream to receive real-time, incremental results or status updates from the server over an open HTTP connection.

· Push Notifications: For very long-running tasks or disconnected scenarios, the server can actively send asynchronous notifications to a client-provided webhook when significant task updates occur.
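
The polling pattern can be sketched as a small loop. This is a minimal illustration, not the A2A SDK: `get_task` is a hypothetical callable standing in for whatever request the client uses to read a task's status, `fake_get_task` is a made-up stand-in for a remote agent, and `"working"` is assumed here as the in-progress state name.

```python
import time
from typing import Callable

# States after which polling should stop: terminal states plus states
# where the task is paused until the client supplies something.
STOP_STATES = {"completed", "canceled", "rejected", "failed",
               "input-required", "auth-required"}


def poll_until_done(get_task: Callable[[], dict],
                    interval_s: float = 1.0,
                    max_polls: int = 30) -> dict:
    """Call get_task() repeatedly until the task stops or max_polls is hit."""
    for _ in range(max_polls):
        task = get_task()
        if task["status"]["state"] in STOP_STATES:
            return task
        time.sleep(interval_s)
    raise TimeoutError("task did not finish within the polling budget")


# Hypothetical stand-in for a remote agent: "working" twice, then "completed".
_states = iter(["working", "working", "completed"])


def fake_get_task() -> dict:
    return {"id": "task-123", "status": {"state": next(_states)}}


result = poll_until_done(fake_get_task, interval_s=0.0)
print(result["status"]["state"])
```

Polling is the simplest option to implement but costs one round trip per check, which is why streaming or push notifications tend to suit longer tasks better.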

The Role of the Agent Card

The Agent Card is a JSON document that serves as a digital “business card” for an A2A Server (the remote agent). It is crucial for agent discovery and interaction. The key information included in an Agent Card is as follows:

· Identity: Includes name, description, and provider information.

· Service Endpoint: Specifies the url for the A2A service.

· A2A Capabilities: Lists supported features such as streaming or pushNotifications.

· Authentication: Details the required schemes (e.g., "Bearer", "OAuth2").

· Skills: Describes the agent’s tasks using AgentSkill objects, including id, name, description, inputModes, outputModes, and examples.

Client agents use the Agent Card to determine an agent’s suitability, structure requests, and ensure secure communication.
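
As a rough sketch of that suitability check, a client could parse the card and inspect its capabilities and skills. The card below is made up for illustration, and the helper names `supports` and `skills_with_tag` are hypothetical, not part of any A2A library.

```python
import json

# Illustrative card only; the values are invented for this sketch.
CARD_JSON = """
{
  "name": "image_agent",
  "description": "Generates images from text prompts.",
  "url": "http://localhost:8001",
  "capabilities": {"streaming": true, "pushNotifications": false},
  "skills": [
    {"id": "generate_image", "name": "Image generation",
     "description": "Create an image from a prompt.",
     "tags": ["image", "generation"]}
  ]
}
"""


def supports(card: dict, capability: str) -> bool:
    """True if the card declares the given capability flag as enabled."""
    return bool(card.get("capabilities", {}).get(capability, False))


def skills_with_tag(card: dict, tag: str) -> list[str]:
    """Return the ids of skills whose tags include the given tag."""
    return [s["id"] for s in card.get("skills", []) if tag in s.get("tags", [])]


card = json.loads(CARD_JSON)
print(supports(card, "streaming"))
print(skills_with_tag(card, "image"))
```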

Discovery Strategies

The following sections detail common strategies used by client agents to discover remote Agent Cards:

1. Well-Known URI

This approach is recommended for public agents or agents intended for broad discovery within a specific domain.
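
A minimal sketch of how a client might derive the card location from an agent's base URL. The path is taken from the quickstart URL shown later in this article (http://localhost:8001/.well-known/agent-card.json); treat it as an assumption for other deployments. The helper name `agent_card_url` is hypothetical.

```python
# Path assumed from the quickstart URL used later in this article.
AGENT_CARD_PATH = "/.well-known/agent-card.json"


def agent_card_url(base_url: str) -> str:
    """Append the well-known card path to an agent's base URL."""
    return base_url.rstrip("/") + AGENT_CARD_PATH


print(agent_card_url("http://localhost:8001"))
print(agent_card_url("http://localhost:8001/a2a/check_prime_agent/"))
```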

2. Curated Registries (Catalog-Based Discovery)

This approach is employed in enterprise environments or public marketplaces, where Agent Cards are often managed by a central registry. The curated registry acts as a central repository, allowing clients to query and discover agents based on criteria like “skills” or “tags”.

3. Direct Configuration / Private Discovery

This approach is used for tightly coupled systems, private agents, or development purposes, where clients are directly configured with Agent Card information or URLs.

Life of a Task

In the Agent2Agent (A2A) Protocol, interactions can range from simple, stateless exchanges to complex, long-running processes. When an agent receives a message from a client, it can respond in one of two fundamental ways:

· Respond with a Stateless Message: This type of response is typically used for immediate, self-contained interactions that conclude without requiring further state management.

· Initiate a Stateful Task: If the response is a Task, the agent will process it through a defined lifecycle, communicating progress and requiring input as needed, until it reaches an interrupted state (e.g., input-required, auth-required) or a terminal state (e.g., completed, canceled, rejected, failed).
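
The distinction between interrupted and terminal states can be sketched with a small enum. This covers only the states named above, plus `"working"` as an assumed in-progress state; the `next_step` helper and its return labels are hypothetical.

```python
from enum import Enum


class TaskState(str, Enum):
    WORKING = "working"                  # assumed in-progress state
    INPUT_REQUIRED = "input-required"
    AUTH_REQUIRED = "auth-required"
    COMPLETED = "completed"
    CANCELED = "canceled"
    REJECTED = "rejected"
    FAILED = "failed"


# Interrupted: the task is paused until the client supplies something.
INTERRUPTED = {TaskState.INPUT_REQUIRED, TaskState.AUTH_REQUIRED}
# Terminal: the task will not change state again.
TERMINAL = {TaskState.COMPLETED, TaskState.CANCELED,
            TaskState.REJECTED, TaskState.FAILED}


def next_step(state: str) -> str:
    """Map a reported task state to what the client should do next."""
    s = TaskState(state)
    if s in TERMINAL:
        return "done"
    if s in INTERRUPTED:
        return "needs-client-action"
    return "keep-waiting"


print(next_step("input-required"))
print(next_step("completed"))
```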

Agent Response: Message or Task

The choice between responding with a Message or a Task depends on the nature of the interaction and the agent's capabilities:

· Messages for Trivial Interactions: Message objects are suitable for transactional interactions that don't require long-running processing or complex state management. An agent might use messages to negotiate the acceptance or scope of a task before committing to a Task object.

· Tasks for Stateful Interactions: Once an agent maps the intent of an incoming message to a supported capability that requires substantial, trackable work over an extended period, the agent responds with a Task object.

Example Follow-up Scenario

The following example illustrates a typical task flow with a follow-up:

1. Client sends a message to the agent:

{
  "jsonrpc": "2.0",
  "id": "req-001",
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "parts": [
        {
          "kind": "text",
          "text": "Generate an image of a sailboat on the ocean."
        }
      ],
      "messageId": "msg-user-001"
    }
  }
}

2. Agent responds with a boat image (completed task):

{
  "jsonrpc": "2.0",
  "id": "req-001",
  "result": {
    "id": "task-boat-gen-123",
    "contextId": "ctx-conversation-abc",
    "status": {
      "state": "completed"
    },
    "artifacts": [
      {
        "artifactId": "artifact-boat-v1-xyz",
        "name": "sailboat_image.png",
        "description": "A generated image of a sailboat on the ocean.",
        "parts": [
          {
            "kind": "file",
            "file": {
              "name": "sailboat_image.png",
              "mimeType": "image/png",
              "bytes": "base64_encoded_png_data_of_a_sailboat"
            }
          }
        ]
      }
    ],
    "kind": "task"
  }
}

3. Client asks to color the boat red. This refinement request refers to the previous taskId and uses the same contextId.

{
  "jsonrpc": "2.0",
  "id": "req-002",
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "messageId": "msg-user-002",
      "contextId": "ctx-conversation-abc",
      "referenceTaskIds": [
        "task-boat-gen-123"
      ],
      "parts": [
        {
          "kind": "text",
          "text": "Please modify the sailboat to be red."
        }
      ]
    }
  }
}

4. Agent responds with a new image artifact (new task, same context, same artifact name): The agent creates a new task within the same contextId. The new boat image artifact retains the same name but has a new artifactId.


{
  "jsonrpc": "2.0",
  "id": "req-002",
  "result": {
    "id": "task-boat-color-456",
    "contextId": "ctx-conversation-abc",
    "status": {
      "state": "completed"
    },
    "artifacts": [
      {
        "artifactId": "artifact-boat-v2-red-pqr",
        "name": "sailboat_image.png",
        "description": "A generated image of a red sailboat on the ocean.",
        "parts": [
          {
            "kind": "file",
            "file": {
              "name": "sailboat_image.png",
              "mimeType": "image/png",
              "bytes": "base64_encoded_png_data_of_a_RED_sailboat"
            }
          }
        ]
      }
    ],
    "kind": "task"
  }
}
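
The linkage between steps 2 and 3 can be sketched in a few lines: the follow-up message copies the contextId from the previous task and lists that task's id in referenceTaskIds. This is an illustrative helper rather than SDK code; the function name `build_follow_up_message` and the uuid-based messageId scheme are assumptions.

```python
import uuid


def build_follow_up_message(previous_task: dict, text: str) -> dict:
    """Build a user message that refines the output of a previous task."""
    return {
        "role": "user",
        "messageId": f"msg-{uuid.uuid4()}",          # hypothetical id scheme
        "contextId": previous_task["contextId"],     # stay in the same conversation
        "referenceTaskIds": [previous_task["id"]],   # point at the task being refined
        "parts": [{"kind": "text", "text": text}],
    }


previous_task = {"id": "task-boat-gen-123", "contextId": "ctx-conversation-abc"}
message = build_follow_up_message(previous_task, "Please modify the sailboat to be red.")
print(message["contextId"], message["referenceTaskIds"])
```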

Enterprise Implementation of A2A

The Agent2Agent (A2A) protocol is designed with enterprise requirements at its core. Rather than inventing new, proprietary standards for security and operations, A2A aims to integrate seamlessly with existing enterprise infrastructure and widely adopted best practices. This approach allows organizations to use their existing investments and expertise in security, monitoring, governance, and identity management.

A key principle of A2A is that agents are typically opaque because they don’t share internal memory, tools, or direct resource access with each other. This opacity naturally aligns with standard client-server security paradigms, treating remote agents as standard HTTP-based enterprise applications.

Want to know more about A2A and MCP? Please go through my blog: https://medium.com/@saha.soumyadeep90/a2a-vs-mcp-comparing-googles-agent-to-agent-protocol-with-openai-s-model-context-protocol-6798491cc87e

Quickstart: Exposing a remote agent via A2A

This quickstart is the perfect starting point for any developer asking:

“I already have an agent — how do I make it accessible so other agents can use it via A2A?”

Exposing your agent is a key step in building multi-agent systems, where different agents can collaborate, share data, and interact intelligently.

In this example, you’ll learn how to expose an ADK agent so it can be accessed and used by other agents through the Agent-to-Agent (A2A) Protocol.

There are two main ways to expose an ADK agent via A2A:

1. Using to_a2a(root_agent)

This is the simplest and fastest method.

· Use this function to convert an existing agent into an A2A-compatible one.

· You can then expose it via a server using uvicorn, instead of deploying with adk deploy api_server.

· This approach gives you more control over what gets exposed, making it great for production environments.

· The best part: the to_a2a() function automatically generates an agent card (a metadata file that describes your agent).

2. Using adk api_server --a2a with a Custom Agent Card

This method is ideal when you want more flexibility or when you’re managing multiple agents.

· You create your own agent card (agent.json) and host it using the ADK API server.

· It integrates smoothly with ADK Web, making it easier to test, debug, and visualize your agents.

· You can also specify a folder containing multiple agents, and those with agent cards will automatically be exposed via the same server.

· To create agent cards manually, follow the A2A Python tutorial (https://google.github.io/adk-docs/a2a/quickstart-consuming/).

The sample consists of:

· Remote Hello World Agent (remote_a2a/hello_world/agent.py): This is the agent that you want to expose so that other agents can use it via A2A. It is an agent that handles dice rolling and prime number checking. It becomes exposed using the to_a2a() function and is served using uvicorn.

· Root Agent (agent.py): A simple agent that just calls the remote Hello World agent.

Exposing the Remote Agent with the to_a2a(root_agent) function

You can take an existing agent built using ADK and make it A2A-compatible by simply wrapping it using the to_a2a() function. For example, if you have an agent like the following defined in root_agent:

# Your agent code here
root_agent = Agent(
    model='gemini-2.0-flash',
    name='hello_world_agent',

    <...your agent code...>
)

Then you can make it A2A-compatible simply by using to_a2a(root_agent):

from google.adk.a2a.utils.agent_to_a2a import to_a2a

# Make your agent A2A-compatible
a2a_app = to_a2a(root_agent, port=8001)

The to_a2a() function also auto-generates an agent card in memory, behind the scenes, by extracting skills, capabilities, and metadata from the ADK agent. The well-known agent card is therefore available as soon as the agent endpoint is served using uvicorn.

You can also provide your own agent card by using the agent_card parameter. The value can be an AgentCard object or a path to an agent card JSON file.

Example with an AgentCard object:

from google.adk.a2a.utils.agent_to_a2a import to_a2a
from a2a.types import AgentCard

# Define A2A agent card
my_agent_card = AgentCard(
    name="file_agent",
    url="http://example.com",
    description="Test agent from file",
    version="1.0.0",
    capabilities={},
    skills=[],
    defaultInputModes=["text/plain"],
    defaultOutputModes=["text/plain"],
    supportsAuthenticatedExtendedCard=False,
)
a2a_app = to_a2a(root_agent, port=8001, agent_card=my_agent_card)

Example with a path to a JSON file:

from google.adk.a2a.utils.agent_to_a2a import to_a2a

# Load A2A agent card from a file
a2a_app = to_a2a(root_agent, port=8001, agent_card="/path/to/your/agent-card.json")

Now let’s dive into the sample code:

1. Getting the Sample Code

First, make sure you have the necessary dependencies installed:

pip install "google-adk[a2a]"

You can clone the repository and navigate to the a2a_root sample (https://github.com/google/adk-python/tree/main/contributing/samples/a2a_root):

git clone https://github.com/google/adk-python.git

As you’ll see, the folder structure is as follows:

Root Agent (a2a_root/agent.py)

· root_agent: A RemoteA2aAgent that connects to the remote A2A service

· Agent Card URL: Points to the well-known agent card endpoint on the remote server

Remote Hello World Agent (a2a_root/remote_a2a/hello_world/agent.py)

· roll_die(sides: int): Function tool for rolling dice with state management

· check_prime(nums: list[int]): Async function for prime number checking

· root_agent: The main agent with comprehensive instructions

· a2a_app: The A2A application created using to_a2a() utility

2. Start the Remote A2A Agent server

You can now start the remote agent server, which will host the a2a_app within the hello_world agent:

# Ensure current working directory is adk-python/
# Start the remote agent using uvicorn
uvicorn contributing.samples.a2a_root.remote_a2a.hello_world.agent:a2a_app --host localhost --port 8001

Once executed, you should see something like:

INFO:     Started server process [10615]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8001 (Press CTRL+C to quit)

3. Check that your remote agent is running

You can check that your agent is up and running by visiting the agent card that was auto-generated earlier as part of your to_a2a() function in a2a_root/remote_a2a/hello_world/agent.py: http://localhost:8001/.well-known/agent-card.json

You should see the contents of the agent card, which should look like:

{
    "capabilities": {},
    "defaultInputModes": [
        "text/plain"
    ],
    "defaultOutputModes": [
        "text/plain"
    ],
    "description": "hello world agent that can roll a dice of 8 sides and check prime numbers.",
    "name": "hello_world_agent",
    "protocolVersion": "0.2.6",
    "skills": [
        {
            "description": "hello world agent that can roll a dice of 8 sides and check prime numbers. \n      I roll dice and answer questions about the outcome of the dice rolls.\n      I can roll dice of different sizes.\n      I can use multiple tools in parallel by calling functions in parallel(in one request and in one round).\n      It is ok to discuss previous dice roles, and comment on the dice rolls.\n      When I are asked to roll a die, I must call the roll_die tool with the number of sides. Be sure to pass in an integer. Do not pass in a string.\n      I should never roll a die on my own.\n      When checking prime numbers, call the check_prime tool with a list of integers. Be sure to pass in a list of integers. I should never pass in a string.\n      I should not check prime numbers before calling the tool.\n      When I are asked to roll a die and check prime numbers, I should always make the following two function calls:\n      1. I should first call the roll_die tool to get a roll. Wait for the function response before calling the check_prime tool.\n      2. After I get the function response from roll_die tool, I should call the check_prime tool with the roll_die result.\n        2.1 If user asks I to check primes based on previous rolls, make sure I include the previous rolls in the list.\n      3. When I respond, I must include the roll_die result from step 1.\n      I should always perform the previous 3 steps when asking for a roll and checking prime numbers.\n      I should not rely on the previous history on prime results.\n    ",
            "id": "hello_world_agent",
            "name": "model",
            "tags": [
                "llm"
            ]
        },
        {
            "description": "Roll a die and return the rolled result.\n\nArgs:\n  sides: The integer number of sides the die has.\n  tool_context: the tool context\nReturns:\n  An integer of the result of rolling the die.",
            "id": "hello_world_agent-roll_die",
            "name": "roll_die",
            "tags": [
                "llm",
                "tools"
            ]
        },
        {
            "description": "Check if a given list of numbers are prime.\n\nArgs:\n  nums: The list of numbers to check.\n\nReturns:\n  A str indicating which number is prime.",
            "id": "hello_world_agent-check_prime",
            "name": "check_prime",
            "tags": [
                "llm",
                "tools"
            ]
        }
    ],
    "supportsAuthenticatedExtendedCard": false,
    "url": "http://localhost:8001",
    "version": "0.0.1"
}
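
Instead of opening the URL in a browser, the same check can be scripted. The sketch below uses only the standard library; `fetch_agent_card` and `summarize` are hypothetical helper names, and the fetch call is left commented out because it only works while the server from step 2 is running. The trimmed card is a shortened copy of the output above.

```python
import json
from urllib.request import urlopen


def fetch_agent_card(url: str, timeout_s: float = 5.0) -> dict:
    """Download and parse an agent card from the given URL."""
    with urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


def summarize(card: dict) -> str:
    """One-line summary: agent name, endpoint, and skill names."""
    skill_names = ", ".join(skill["name"] for skill in card.get("skills", []))
    return f'{card["name"]} at {card["url"]} with skills: {skill_names}'


# With the remote agent running, fetch the live card:
# card = fetch_agent_card("http://localhost:8001/.well-known/agent-card.json")

# Trimmed copy of the card shown above, so the sketch runs offline.
sample_card = {
    "name": "hello_world_agent",
    "url": "http://localhost:8001",
    "skills": [{"name": "model"}, {"name": "roll_die"}, {"name": "check_prime"}],
}
print(summarize(sample_card))
```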

4. Run the Main (Consuming) Agent

Now that your remote agent is running, you can launch the dev UI and select “a2a_root” as your agent.

# In a separate terminal, run the adk web server
adk web contributing/samples/

To open the adk web server, go to: http://localhost:8000.

Example Interactions

Once both services are running, you can interact with the root agent to see how it calls the remote agent via A2A:

Simple Dice Rolling: This interaction uses a local agent, the Roll Agent:

User: Roll a 6-sided die

Bot: I rolled a 4 for you.

Prime Number Checking:

This interaction uses a remote agent via A2A, the Prime Agent:

User: Is 7 a prime number?

Bot: Yes, 7 is a prime number.

Combined Operations:

This interaction uses both the local Roll Agent and the remote Prime Agent:

User: Roll a 10-sided die and check if it's prime

Bot: I rolled an 8 for you.

Bot: 8 is not a prime number.

Quickstart: Consuming a remote agent via A2A

This quickstart focuses on another common developer question:

“There’s a remote agent — how can my ADK agent connect to and use it via A2A?”

This step is essential when building multi-agent systems, where different agents need to communicate, share data, and collaborate to complete complex tasks.

In this example, you’ll explore how the Agent-to-Agent (A2A) Protocol works inside the Agent Development Kit (ADK). It shows how multiple agents can connect and work together as one system.

The sample demonstrates a simple but powerful concept:
an agent that can roll dice and check if numbers are prime — with these tasks handled by different agents working together.

By the end, you’ll understand how your ADK agent can consume a remote agent, call its functions over the network, and act as if those features were built in locally.

The A2A Basic sample consists of:

· Root Agent (root_agent): The main orchestrator that delegates tasks to specialized sub-agents

· Roll Agent (roll_agent): A local sub-agent that handles dice rolling operations

· Prime Agent (prime_agent): A remote A2A agent that checks if numbers are prime; this agent runs on a separate A2A server

Exposing Your Agent with the ADK Server

The ADK comes with a built-in CLI command, adk api_server --a2a, to expose your agent using the A2A protocol.

In the a2a_basic example, you will first need to expose the check_prime_agent via an A2A server, so that the local root agent can use it.

1. Getting the Sample Code

First, make sure you have the necessary dependencies installed:

pip install "google-adk[a2a]"

You can clone and navigate to the a2a_basic sample here:

git clone https://github.com/google/adk-python.git

As you’ll see, the folder structure is as follows:

Main Agent (a2a_basic/agent.py)

· roll_die(sides: int): Function tool for rolling dice

· roll_agent: Local agent specialized in dice rolling

· prime_agent: Remote A2A agent configuration

· root_agent: Main orchestrator with delegation logic

Remote Prime Agent (a2a_basic/remote_a2a/check_prime_agent/)

· agent.py: Implementation of the prime checking service

· agent.json: Agent card of the A2A agent

· check_prime(nums: list[int]): Prime number checking algorithm

2. Start the Remote Prime Agent server

To show how your ADK agent can consume a remote agent via A2A, you’ll first need to start a remote agent server, which will host the prime agent (under check_prime_agent).

# Start the remote a2a server that serves the check_prime_agent on port 8001
adk api_server --a2a --port 8001 contributing/samples/a2a_basic/remote_a2a

Once executed, you should see something like:

INFO:     Started server process [56558]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

3. Look out for the required agent card (agent-card.json) of the remote agent

The A2A Protocol requires each agent to have an agent card that describes what it does.

If someone else has already built the remote A2A agent that you are looking to consume in your agent, then you should confirm that they have an agent card (agent-card.json).

In the sample, the check_prime_agent already has an agent card provided:

a2a_basic/remote_a2a/check_prime_agent/agent-card.json

{
  "capabilities": {},
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["application/json"],
  "description": "An agent specialized in checking whether numbers are prime. It can efficiently determine the primality of individual numbers or lists of numbers.",
  "name": "check_prime_agent",
  "skills": [
    {
      "id": "prime_checking",
      "name": "Prime Number Checking",
      "description": "Check if numbers in a list are prime using efficient mathematical algorithms",
      "tags": ["mathematical", "computation", "prime", "numbers"]
    }
  ],
  "url": "http://localhost:8001/a2a/check_prime_agent",
  "version": "1.0.0"
}

4. Run the Main (Consuming) Agent

# In a separate terminal, run the adk web server
adk web contributing/samples/

How it works

The main agent uses the RemoteA2aAgent class to consume the remote agent (prime_agent in our example). As you can see below, RemoteA2aAgent requires the name, description, and the URL of the agent card.

a2a_basic/agent.py

from google.adk.agents.remote_a2a_agent import AGENT_CARD_WELL_KNOWN_PATH
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card=(
        f"http://localhost:8001/a2a/check_prime_agent{AGENT_CARD_WELL_KNOWN_PATH}"
    ),
)

<...code truncated>

Then, you can simply use the RemoteA2aAgent in your agent. In this case, prime_agent is used as one of the sub-agents in the root_agent below:

a2a_basic/agent.py

from google.adk.agents.llm_agent import Agent
from google.genai import types

root_agent = Agent(
    model="gemini-2.0-flash",
    name="root_agent",
    instruction="""
      You delegate rolling dice tasks to the roll_agent and prime checking tasks to the prime_agent.
      Follow these steps:
      1. If the user asks to roll a die, delegate to the roll_agent.
      2. If the user asks to check primes, delegate to the prime_agent.
      3. If the user asks to roll a die and then check if the result is prime, call roll_agent first, then pass the result to prime_agent.
      Always clarify the results before proceeding.
    """,
    global_instruction=(
        "You are DicePrimeBot, ready to roll dice and check prime numbers."
    ),
    sub_agents=[roll_agent, prime_agent],
    tools=[example_tool],
    generate_content_config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(  # avoid false alarm about rolling dice.
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.OFF,
            ),
        ]
    ),
)

Example Interactions

Once both your main and remote agents are running, you can interact with the root agent to see how it calls the remote agent via A2A:

Simple Dice Rolling: This interaction uses a local agent, the Roll Agent:

User: Roll a 6-sided die
Bot: I rolled a 4 for you.

Prime Number Checking:

This interaction uses a remote agent via A2A, the Prime Agent:

User: Is 7 a prime number?
Bot: Yes, 7 is a prime number.

Combined Operations:

This interaction uses both the local Roll Agent and the remote Prime Agent:

User: Roll a 10-sided die and check if it's prime
Bot: I rolled an 8 for you.
Bot: 8 is not a prime number.

Unsupervised Learning with PCA: Theory, Math, and Practical Implementation

Soumyadeep Saha — Wed, 15 Oct 2025 10:15:22 GMT

Unsupervised Learning with Principal Component Analysis (PCA): Theory, Math, and Practical Implementation

Principal Component Analysis (PCA) is a dimensionality reduction technique.
It helps you take high-dimensional data (data with many features or variables) and represent it using fewer dimensions, while keeping as much useful information (variance) as possible.

Why PCA

The Problem: Too Many Features

1. Predictive Modeling Issue — Multicollinearity

When you have a lot of features that are strongly related to each other, they cause a problem called multicollinearity.
This makes it hard for models (like regression or machine learning models) to understand which feature is actually important.

· If two features say almost the same thing, the model gets confused about which one to trust.

· To fix it, we sometimes remove features one by one — but that’s slow and may throw away useful information.

Simple example:
If you have both “height in cm” and “height in inches” as features, they’re perfectly correlated — keeping both is redundant.
But when there are hundreds of such correlated features, manually deciding which to drop becomes messy.

2. Visualization Issue — Too Many Dimensions

We humans can only visualize up to 3 dimensions easily (2D plots or 3D graphs).
If your data has 10 or 100 features, you can’t directly plot it to see patterns or clusters.

That means it’s hard to notice relationships or groupings among data points just by looking at graphs.

Example:
If each customer has 20 attributes (age, income, spending, habits, etc.), you can’t make a 20-D plot to see which customers are similar.

How PCA Helps

PCA helps in both these cases:

· It reduces the number of features by combining correlated ones into new “principal components.”
→ This fixes multicollinearity and keeps most of the important information.

· It compresses many dimensions into 2 or 3 that capture most of the data’s variation.
→ This makes visualization possible again.

In the image above, a data set with N dimensions is approximated by a smaller data set with k dimensions. In this module, you will learn how this reduction is done. This simple manipulation helps in several ways:

· Data visualisation and EDA.

· Creating uncorrelated features that can be input to a prediction model: With a smaller number of uncorrelated features, the modelling process is faster and more stable.

· Finding latent themes in the data: If you have a data set containing the ratings given to different movies by Netflix users, PCA would be able to find latent themes like genre and, consequently, the ratings that users give to a particular genre.

· Noise reduction.

Core Idea of PCA — In Simple Words

Principal Component Analysis (PCA) is a dimensionality reduction technique.
That means it takes a dataset with many variables (columns, or dimensions) and reduces it to a smaller number of variables — while keeping most of the important information (patterns, variation).

What “Dimension” Means

· Each dimension represents a feature or variable in your data.
For example, in a dataset of students:

Marks in math → one dimension

Marks in science → another dimension

Marks in English → a third dimension

So, if you have 3 features, your data lies in 3D space.
If you have 100 features, your data lies in 100D space — something we can’t visualize directly.

What PCA Does

PCA finds new axes (directions) in this high-dimensional space that:

1. Capture most of the variance (the spread or meaningful information in the data), and

2. Are fewer in number — typically just 2 or 3 principal components.

This way, we can work with a smaller dataset that’s easier to visualize and analyze, without losing much information.

What Is PCA

Dimensionality reduction means reducing the number of variables (features) in a dataset.
The goal is to keep only the useful information and remove redundant or unimportant data.

How You’ve Already Done It Before

You’ve already performed dimensionality reduction manually in earlier topics:

· In Exploratory Data Analysis (EDA):
You removed columns that were mostly empty (nulls) or duplicated.

· In Linear/Logistic Regression:
You removed features with high p-values (not statistically significant) or high VIF scores (causing multicollinearity).

What PCA Does (and How It’s Different)

Instead of dropping features directly, PCA creates new ones — called principal components — by combining the old features in a smart mathematical way.

These new features:

· Capture most of the useful information (variance) from the original data

· Are uncorrelated with each other

· Allow you to easily decide how many to keep (based on how much total information each one holds)

PCA is a statistical procedure to convert observations of possibly correlated variables to ‘principal components’ such that:

They are uncorrelated with each other.
They are linear combinations of the original variables.
They help in capturing maximum information in the data set.

So PCA doesn’t just throw away columns — it rebuilds them into a smaller, more powerful set of new variables that represent the same data better.

Here we’ll learn about two of the most important building blocks of PCA — basis and change of basis. But before that, we’ll go through a brief refresher on basic linear algebra concepts.

Vectorial Representation of Data

Before we understand how PCA works, we need to be comfortable with some basic linear algebra concepts — because PCA relies heavily on matrix and vector operations.

To summarise what you’re going to learn in this segment here’s a handy checklist:

Vectors and their properties
Vector operations (addition, scaling, linear combination and dot product)
Matrices
Matrix operations (matrix multiplication and matrix inverses)

Consider the following data set containing the height and weight of five patients.

The height and weight information can be represented in the form of a matrix as follows:

with each row representing a particular patient’s data and each column representing the original variable. Geometrically, these patients can be represented as shown in the following image.

A vector is just a mathematical way to represent data points — basically a list of numbers that describe one observation.

The vector associated with the first patient is given by the values (165, 55). This vector can be written in the following ways:

1. As a column containing the values along the rows. This is also known as the column-vector representation.

2. Horizontally, as the transpose of a row vector: (165, 55)ᵀ. It is the same column vector, just written on one line.

3. In terms of the basis vectors
This is something that you’ll learn in detail in later segments. To give a brief idea, the vector (165,55) can also be written as 165i +55j, where i and j are the unit vectors along X and Y respectively and are the basis vectors used to represent all vectors in the 2-D space.

Vector Representation for n-Dimensional Data

If you have more variables (or features), the vector just gets longer.
For example:

· If you add age = 22 to the data,
→ The vector becomes (165, 55, 22) → a 3D vector.

If your dataset has 10 variables, then each data point is a 10-dimensional vector — written as (x1 , x2 , x3 ,…, x10)

Even though we can’t visualize more than 3D, math can handle n dimensions easily, and PCA uses that math to simplify things.

Vector Operations

Now that you’ve understood what vectors are, let’s go ahead and learn about some vector properties and a few associated operations.

1. Vectors Have Direction and Magnitude

A vector represents both:

· A direction (where it points), and

· A magnitude (how long it is).

Think of a vector as an arrow drawn from the origin (0, 0) to a point in space (x, y).

Example (2D):
For the vector (2, 3):

· The direction is the arrow from (0, 0) → (2, 3).

· The magnitude (length) is calculated using the Pythagorean theorem: √(2² + 3²) = √13 ≈ 3.61

Example (3D):
For vector (2, –3, 4):

· The direction goes from (0, 0, 0) → (2, –3, 4).

· The magnitude is √(2² + (–3)² + 4²) = √29 ≈ 5.39

So, magnitude tells you how long the vector is, and direction tells you where it points.

2. Vector Addition

When you add two vectors, you add each component individually (element by element).

Example:

Let V1=(2,3), V2=(1,2), Then V1 + V2 = (2+1,3+2) = (3,5)

So you’re just adding the X parts together and the Y parts together.

In i, j form (where i = x-axis, j = y-axis): V1 = 2i+3j, V2 = i+2j,

Then V1 + V2 = (2+1)i+(3+2)j = 3i+5j

It’s the same concept — just written in a different notation.

Geometrically:
When you add vectors, it’s like placing one arrow after another (head to tail). The result is the diagonal (the new arrow connecting start to end).

3. Scalar Multiplication

If you multiply a vector by a number (scalar), the direction stays the same, but the length (magnitude) changes.

Example:

Let V = (2,3) and Scalar = 2

Then 2 × V = (4,6)

The vector points in the same direction but becomes twice as long.

If the scalar is negative, say –2,

Then -2 × V = (-4,-6)

The direction reverses but the length doubles.
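The three operations above can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the helper names `magnitude`, `add` and `scale` are made up for illustration and reuse the numbers from the examples above.

```python
import math

def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

def add(v1, v2):
    """Component-wise sum of two vectors of equal length."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of components")
    return tuple(a + b for a, b in zip(v1, v2))

def scale(scalar, v):
    """Multiply every component of v by the scalar."""
    return tuple(scalar * c for c in v)

print(magnitude((2, 3)))       # sqrt(13), about 3.61
print(magnitude((2, -3, 4)))   # sqrt(29), about 5.39
print(add((2, 3), (1, 2)))     # (3, 5)
print(scale(2, (2, 3)))        # (4, 6)
print(scale(-2, (2, 3)))       # (-4, -6)
```

Note that scaling by –2 doubles the magnitude even though every component changes sign.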

Why This Matters for PCA

All PCA does is transform data using these vector operations:

· It treats each data point as a vector.

· It uses vector addition and scaling to create new axes (principal components).

· It uses magnitude and direction to find which directions explain the most variation in the data.

So these simple operations are the building blocks of PCA math.

Matrix Multiplication

Apart from the vector operations that we learnt previously, we need some knowledge of matrix operations as well.

Matrix multiplication is straightforward: each entry of the product is obtained by multiplying a row of the first matrix with a column of the second, element by element, and adding up the results. The one key rule is that when you multiply two matrices, say A and B, the number of columns of A must equal the number of rows of B. The following image illustrates the idea.

As shown in the example, the first matrix has 4 columns and the second has 4 rows, so matrix multiplication is possible, and the resultant matrix has a shape of 5 x 6 (the rows of the first by the columns of the second).

The row-by-column multiplication followed by addition can be seen in the following example.
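To make the rule concrete, here is a small pure-Python sketch. The function `matmul` and the matrices are hypothetical (they are not the ones from the figures); in practice you would normally use NumPy's `@` operator instead.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    if cols_a != rows_b:
        raise ValueError("columns of A must equal rows of B")
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(cols_a)) for j in range(cols_b)]
        for i in range(rows_a)
    ]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))        # [[19, 22], [43, 50]]

# Shape rule: (5 x 4) times (4 x 6) gives (5 x 6).
P = [[1] * 4 for _ in range(5)]
Q = [[1] * 6 for _ in range(4)]
R = matmul(P, Q)
print(len(R), len(R[0]))   # 5 6
```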

What Is a Matrix Inverse?

Just like in normal arithmetic, where the reciprocal (or inverse) of a number “undoes” its multiplication,
the inverse of a matrix does the same thing in matrix algebra.

Analogy with Regular Numbers

In regular math: a × (1/a) = 1 for any non-zero number a

Similarly, in matrix math: A × A⁻¹ = A⁻¹ × A = I

Here, I is the identity matrix — it acts like the number “1” in normal arithmetic.

Example

Given two matrices:

If you multiply them as B × A, you get:

This resulting matrix is the Identity Matrix (I) — notice it has 1s on the diagonal and 0s elsewhere.

What the Identity Matrix Does

The identity matrix works just like multiplying a number by 1: A × I = A

It doesn’t change the matrix — it’s the “do-nothing” element in matrix multiplication.

So, if A × B = I and B × A = I, then A and B are inverses of each other.

In Simple Words

The inverse of a matrix is another matrix that “undoes” its effect when multiplied — just like dividing by a number or multiplying by its reciprocal.
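A quick way to see this is to let NumPy compute an inverse and multiply it back. The matrix below is a hypothetical example chosen for easy arithmetic; it is not the matrix from the figure above.

```python
import numpy as np

# Hypothetical invertible 2 x 2 matrix (assumption, chosen so the determinant is 1).
A = np.array([[2.0, 1.0],
              [5.0, 3.0]])

B = np.linalg.inv(A)   # candidate inverse of A
I = np.eye(2)          # 2 x 2 identity matrix

print(B)                       # close to [[3, -1], [-5, 2]]
print(np.allclose(A @ B, I))   # True
print(np.allclose(B @ A, I))   # True
print(np.allclose(A @ I, A))   # True: multiplying by I changes nothing
```

`np.allclose` is used instead of `==` because floating-point results can differ from the exact values by tiny rounding errors.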

Why This Matters for PCA

In PCA (and other algorithms like regression), we often need to:

· Undo transformations or

· Normalize data mathematically.

Matrix inverses help us reverse matrix operations — for example, when solving systems of equations or projecting data back to its original space after transformation.

Basis

1. What Is a Basis (Intuitively)?

Think of a basis as a set of building blocks (reference directions) that help you describe every point (or vector) in space.

You can think of it like units:

· For measuring length, the unit (basis) is meter or centimeter.

· For measuring weight, the unit (basis) is kilogram or gram.

Similarly, when you describe a vector, you need a unit direction — or basis vectors — in which to express it.

2. Representing a Vector Using Basis Vectors

In 2D space, we usually use two standard basis vectors:

· î (i-hat) represents 1 unit in the x-direction.

· ĵ (j-hat) represents 1 unit in the y-direction.

Any vector a can then be written as a = aₓ î + aᵧ ĵ. That means:

· Move aₓ units in the x direction (î)

· Then move aᵧ units in the y direction (ĵ)

and you’ll reach the point (aₓ, aᵧ).

Example

Basis in Higher Dimensions

· In 2D, the basis is {î, ĵ}.

· In 3D, the basis is {î, ĵ, k̂} — with k̂ = [0, 0, 1].

· In general, for n-dimensional data, the basis consists of n standard unit vectors, each pointing along one axis.

So any data point (or vector) in that space can be expressed as a combination of these basis vectors.

Why Basis Matters in PCA

The basis defines the coordinate system you use to describe data.

PCA works by finding a new basis — new directions (called principal components) that:

· Capture the maximum variance (spread) of data.

· Are uncorrelated with each other (the data’s coordinates along different components have zero covariance).

So PCA is like rotating the coordinate system to a new, smarter set of basis vectors that explain the data better.

Change of Basis: Introduction

1. Same Data, Different Basis

Just like you can describe a person’s weight in kilograms or pounds,
you can describe a vector (or data point) using different sets of basis vectors.

Example:

This means:

· The first basis vector represents a 1-unit change in height (ft) but no change in weight.

· The second basis vector represents a 1-unit change in weight (lbs) but no change in height.

2. Changing the Basis

You could instead measure:

· Height in centimeters instead of feet, and

· Weight in kilograms instead of pounds.

Now, your basis vectors become different — but the underlying information (the person’s size) remains the same.

So, we’re expressing the same vector (same data point) in a new coordinate system — just like converting from one unit to another.

The following table summarises the results you get when you make the change.

3. Why This Matters in PCA

This is exactly what PCA does — it changes the basis of the data.

· The old basis: your original features (like height and weight).

· The new basis: the principal components — new axes that best explain how your data varies.

So PCA doesn’t change your data’s meaning — it just re-expresses it in a more efficient coordinate system (the one aligned with maximum variance).

In the previous segment, you saw a demonstration on how the change of basis led to dimensionality reduction. Let’s go ahead and understand the elegant way of doing the same calculations.

1. From Scalar to Matrix Transformation

Earlier, when dealing with one-dimensional data (like converting meters to centimeters), you could simply multiply by a number (scalar) — for example:

length in cm = 100 × length in meters

But when you have multi-dimensional data (e.g., height and weight together), conversion involves more than one variable, so the transformation must be done using a matrix rather than a single number.

The equation keeps the same form: y = Mx

M was just a number before; now M becomes a matrix that describes how each original variable contributes to each new one.

2. The Example Explained

The transformation matrix here is M = [[30.48, 0], [0, 0.45]], written row by row. This means:

· The x-values (like height) are scaled by 30.48 — for example, converting feet to centimeters.

· The y-values (like weight) are scaled by 0.45 — for example, converting pounds to kilograms.

3. But What If You Want to Go Back?

If you want to convert from the new basis back to the old basis,
you can’t just “divide” or take a simple reciprocal (because M is a matrix, not a number).

To reverse a matrix transformation, you use the matrix inverse.

That is: x = M⁻¹y
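As a rough illustration, the conversion and its reversal can be written with NumPy. The scaling factors 30.48 and 0.45 come from the example above; the person's measurements are made up for the sketch.

```python
import numpy as np

# Conversion matrix from (feet, pounds) to (centimetres, kilograms).
M = np.array([[30.48, 0.0],
              [0.0, 0.45]])

# Hypothetical person (assumption): 5.5 ft tall, 150 lbs.
old = np.array([5.5, 150.0])

new = M @ old                    # representation in the new units
back = np.linalg.inv(M) @ new    # undo the transformation with the inverse

print(new)    # approximately [167.64, 67.5]
print(back)   # approximately [5.5, 150.0]
```

Because M is diagonal here, its inverse simply holds the reciprocals 1/30.48 and 1/0.45, but the same two lines work for any invertible matrix.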

4. Why This Matters for PCA

PCA works by transforming data from the original coordinate system (your features) to a new one (the principal components).
To do that:

· It uses a transformation matrix made of eigenvectors.

· If you want to go back to your original space, you use the inverse of that matrix (for an orthonormal set of eigenvectors, the inverse is simply the transpose).

So this concept — of using a matrix and its inverse to switch between coordinate systems — is at the heart of PCA math.

Change of Basis: Solved Examples

Understanding How to Move Between Different Bases

The main equation that helps us move from one set of basis vectors to another is:

New Basis Representation = M × Old Basis Representation

Our goal is to express M (the transformation matrix) in terms of the old basis vectors and the new basis vectors.

Notation Setup

To make this easier:

· B₁ → represents the old basis, and v₁ is the old basis representation.

· B₂ → represents the new basis, and v₂ is the new basis representation.

So, the above equation can now be written as: v2 = M × v1

Let’s call this Equation 1.

Relating Old and New Bases

When switching between multiple bases, the following relationship always holds:

B1 × v1 = B2 × v2

This means that the same vector (point) can be represented using either the old or new basis — the vector itself doesn’t change, only the coordinate system does.

Deriving the Transformation Matrix

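Since B₂ is a basis, it is invertible. Multiplying both sides of the relationship above by the inverse of B₂ gives:

v2 = B2⁻¹ × B1 × v1

Let’s call this Equation 2.

Comparing Equation 1 and Equation 2

Both equations express v2 in terms of v1, so the transformation matrix must be:

M = B2⁻¹ × B1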

In Simple Words

To convert coordinates from one basis to another, we multiply by a transformation matrix M.
That matrix is found by multiplying the inverse of the new basis by the old basis: M = B2⁻¹ × B1

When we move between different sets of basis vectors, it is important to remember that the point’s position in space doesn’t change. Its representation may differ from one basis to another, but it always describes the same point.
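Here is a small numerical check of that rule. It is only a sketch: the new basis B2 and the vector v1 are hypothetical values picked for easy arithmetic, and the basis vectors are stored as the columns of each matrix.

```python
import numpy as np

# Columns are basis vectors. B1 is the standard basis; B2 is a hypothetical new basis.
B1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
B2 = np.array([[1.0, 1.0],
               [1.0, -1.0]])

M = np.linalg.inv(B2) @ B1    # inverse of the new basis times the old basis

v1 = np.array([3.0, 1.0])     # coordinates in the old basis
v2 = M @ v1                   # coordinates in the new basis

print(v2)                              # [2. 1.]
print(np.allclose(B1 @ v1, B2 @ v2))   # True: same point, different coordinates
```

Reading the result back: 2 × (1, 1) + 1 × (1, –1) = (3, 1), which is the original point.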

Change of Basis: Solved Python

Code : https://drive.google.com/file/d/1KJZ7yei5x4rrRoeGLLeT82CYH-lfAQkO/view?usp=sharing

Practice running the above code and explore how it works. If you get stuck or have questions, let me know in the comments — we’ll figure it out together!

Understanding the Next Step — Variance as Information

In the previous session, you learned the first key idea behind PCA — the concept of a basis and how changing the basis can help simplify or reduce dimensions in data.
You also saw that the same dataset can be represented using different basis vectors (or coordinate systems).

However, we didn’t yet answer the most important question:

“How do we find the best or ideal basis vectors that summarize the data most effectively?”

The Missing Ingredient: Variance

This session introduces that missing piece — variance.

· In earlier approaches, we decided which features (columns) to remove using:

Missing values (nulls)

Irrelevant or duplicate information

Statistical measures like p-values or VIF scores

But PCA uses a different and more powerful metric — variance — to decide what’s important.

· Variance measures how much the data spreads out or varies.

· Features with higher variance carry more information about differences between observations.

· Features with low variance are often less informative and can be reduced or removed.

In Simple Words:

PCA identifies which directions (basis vectors) capture the maximum variance — meaning, where the data changes the most.
These directions become the new principal components, helping reduce dimensions while keeping the most meaningful information.

Directions of Maximum Variance

1. Unequal Variance → Easy Reduction

When one feature (column) has much less variance than another, it’s clear that it contributes less information.

· For example, if Height varies a lot but Weight hardly changes, you can safely remove Weight without losing much information.

· PCA (or even basic feature selection) can easily reduce dimensions in such cases.

2. Similar Variances → Not So Easy

Now look at the graph above — each red dot shows a data point (Height vs Weight).
You can see that:

· The data is spread similarly along both axes — height and weight have almost the same variance.

· So, you can’t easily decide which variable (axis) is more informative.

In this case, both features carry similar levels of variation, and neither axis seems clearly better for reduction.

3. What PCA Does in This Case

When variances are similar, PCA does something smarter —
it changes the coordinate system (the basis vectors).

Instead of keeping the original X (Weight) and Y (Height) axes, PCA:

· Rotates the axes to find new directions where data spreads the most.

· These new directions are called principal components.

· The first principal component (PC1) is the direction of maximum variance — the line along which the data points spread out the most.

This new basis (set of principal components) captures maximum information in fewer dimensions.

Directions of Maximum Variance

Basically, the steps of PCA for finding the principal components can be summarised as follows.

First, it finds the basis vector which is along the best-fit line that maximises the variance. This is our first principal component or PC1.
The second principal component is perpendicular to the first principal component and contains the next highest amount of variance in the dataset.
This process continues iteratively, i.e. each new principal component is perpendicular to all the previous principal components and should explain the next highest amount of variance.
If the dataset contains n independent features, then PCA will create n Principal components.

For a 2-D dataset that has the representation as shown in the image below.

The principal components can be visually represented as shown in the image below.

Also, once the principal components are found, PCA assigns a percentage of variance to each PC: the fraction of the total variance of the dataset explained by that PC. This helps in understanding which principal component is more important than another and by how much. This is shown in the images below.

Original Dataset

PCA Modified Dataset

Since 100% of the total variance or information of the entire dataset is present in only one of the columns (PC1) we can safely drop PC2 and still be assured of losing no information.
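The situation described above, where one component holds all the variance, can be reproduced with a tiny made-up dataset in which the second column is an exact multiple of the first. The sketch below assumes NumPy and takes the variance shares from the eigenvalues of the covariance matrix, the method named later in this article; the data is hypothetical and is not the dataset from the images.

```python
import numpy as np

# Hypothetical data: the second column is exactly twice the first,
# so every point lies on a single straight line.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])

cov = np.cov(X, rowvar=False)            # 2 x 2 covariance matrix of the columns
eigenvalues = np.linalg.eigvalsh(cov)    # returned in ascending order
eigenvalues = eigenvalues[::-1]          # largest variance first

explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 4))            # PC1 share is 1 and PC2 share is 0, up to rounding
```

With real data the columns are rarely perfectly related, so PC2 usually keeps a small but non-zero share and dropping it loses a little information.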

The Workings of PCA

Let’s once again summarise the steps of PCA

· Find n new features
Choose a different (non-standard) set of n basis vectors. These basis vectors are the directions of maximum variance and are called principal components.

· Express the original dataset using these new features
Transform the dataset from the original basis to this PCA basis.

· Perform dimensionality reduction
Choose only a certain number k (where k < n) of the PCs to represent the data. Remove the PCs that have less variance (explain less information) than the others.

PCA acts as a pre-processing tool in the ML pipeline, predominantly used for dimensionality reduction to improve model performance.

The approach is as follows:

Note — The number of principal components is the same as the number of columns in the dataset. PCs are sorted in descending order of information content.

Before we end this session…

· The methodology or the algorithm by which PCA maximises the variance and obtains the new basis vectors is the process of eigendecomposition of the covariance matrix.

· Using the eigendecomposition method, you’ll be able to obtain the new basis vectors that will function as the Principal Components numerically. These new basis vectors are also called eigenvectors.

· For example, in the roadmap case, the following PCs are obtained using the eigendecomposition of the covariance matrix of the original dataset.
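For readers who want to see the eigendecomposition route end to end, here is a minimal NumPy sketch. It is not the code from the linked notebook: the function name `pca` is made up, and the height and weight values are hypothetical (only the first row, 165 and 55, appears earlier in this article).

```python
import numpy as np

def pca(X, k):
    """Return (components, explained_ratio, projected) for the top-k PCs."""
    X_centered = X - X.mean(axis=0)             # PCA works on mean-centred data
    cov = np.cov(X_centered, rowvar=False)      # feature-by-feature covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh is meant for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]       # sort by variance, descending
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]       # columns are the principal components
    explained_ratio = eigenvalues / eigenvalues.sum()
    components = eigenvectors[:, :k]
    projected = X_centered @ components         # data expressed in the PCA basis
    return components, explained_ratio, projected

# Hypothetical height (cm) and weight (kg) values for five patients.
X = np.array([[165.0, 55.0],
              [170.0, 62.0],
              [175.0, 70.0],
              [180.0, 76.0],
              [185.0, 85.0]])

components, ratio, Z = pca(X, k=2)
print(ratio)                                    # descending shares that sum to 1
print(np.round(components.T @ components, 6))   # identity: the PCs are perpendicular
print(np.round(np.cov(Z, rowvar=False), 6))     # off-diagonals near 0: the PCs are uncorrelated
```

Calling `pca(X, k=1)` instead keeps only PC1, which is the dimensionality reduction step. Library implementations such as scikit-learn's `PCA` use a singular value decomposition internally, so the signs of the components may differ from this sketch.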

Implement PCA in Python

Code and Data zipped : https://drive.google.com/file/d/1zsd8lO5HrzzvphaeHMGtnZfH2qQj4Dzp/view?usp=sharing

You learnt some important shortcomings of PCA:

PCA is limited to linear combinations of the original features; non-linear techniques such as t-SNE can be used when that is too restrictive.
PCA needs the components to be perpendicular, though in some cases that may not be the best solution. An alternative technique is Independent Component Analysis (ICA).
PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with a high class imbalance).

Unsupervised Learning and Clustering Explained with Python Examples

Soumyadeep Saha — Wed, 15 Oct 2025 08:21:51 GMT

In the previous blogs, you have learnt supervised learning techniques such as regression and classification. These methods rely on a training set with labels to teach the algorithm, which can then be applied to make predictions on new, unseen data.

In this module, we shift focus to unsupervised learning, where the data has no predefined labels. Instead, the algorithm tries to discover hidden patterns and structures directly from the data.

In This Session

· You will begin by learning about clustering, an unsupervised learning technique that groups data points based on similarity.

· A case study will demonstrate how clustering is applied in real-world industry problems.

· You will then explore the two most widely used clustering algorithms:

K-Means Clustering

Hierarchical Clustering

· You’ll also learn how to implement these algorithms in Python.

· Finally, we’ll discuss segmentation — how it differs from clustering, and where it is applied.

Practical Applications Of Clustering

Customer Insight: Say a retail chain with many stores across locations wants to manage its stores better and increase sales and performance. Cluster analysis can give the chain insights into customer demographics, purchase behaviour and demand patterns across locations. These insights help with assortment planning, planning promotional activities and store benchmarking for better performance and higher returns.
Marketing: In the field of marketing, cluster analysis can help in market segmentation and positioning, and in identifying test markets for new product development.
Social Media: In the areas of social networking and social media, cluster analysis is used to identify similar communities within larger groups.
Medical: Cluster analysis has also been widely used in biology and medical science, for example in human genetic clustering, sequencing into gene families, building groups of genes, and clustering of organisms by species.

Segmentation: Key Requirements

For segmentation to be meaningful and useful, the segments formed must be stable.

· This means that the same person should not fall into different segments if the data is segmented using the same criteria.

· Additionally, good segmentation requires:

Intra-segment homogeneity → members within the same segment should be similar to each other.

Inter-segment heterogeneity → different segments should be clearly distinct from one another.

Later in the module, you’ll see how these ideas can be expressed mathematically.

Types of Market Segmentation

Now, let’s look at the most commonly used types of market segmentation that are applied in real-world business contexts.

Types of Customer Segmentation

In practice, three main types of customer segmentation are commonly used:

1. Behavioural Segmentation

Based on the actual patterns of behavior displayed by consumers.

Examples: purchase frequency, product usage, brand loyalty.

2. Attitudinal Segmentation

Based on the beliefs, values, or intentions of customers, even if these do not always translate into actual actions.

Example: customers who express a preference for eco-friendly products, even if they do not consistently purchase them.

3. Demographic Segmentation

Based on a customer’s profile information, such as:

Age

Gender

Income

Education

Location

Session Summary

In this session, you were introduced to the basics of unsupervised learning and got an initial understanding of how clustering works. You learned that clustering groups data points based on similarities, without relying on predefined labels.

What’s Next

In the upcoming sessions, we will dive deeper into clustering and explore two of the most widely used clustering algorithms:

· K-Means Clustering — a partition-based method that groups data into k clusters.

· Hierarchical Clustering — a tree-based method that builds clusters step by step in a hierarchy.

Welcome to the Session on K-Means Clustering

In this session, we’ll take a deeper look at one of the most widely used clustering algorithms: K-Means Clustering. This algorithm is simple, intuitive, and highly effective, making it one of the first choices for many clustering tasks in practice.

Clustering with Euclidean Distance

The concept of a distance measure is quite intuitive:

If two observations are close to each other, they will have a low Euclidean distance.
If two observations are far apart, they will have a high Euclidean distance.

So, how does clustering use this?

In the clustering process (like in K-Means):

1. The algorithm begins with a set of cluster centers (also called centroids).
2. Each observation in the dataset is assigned to the cluster whose centroid is closest to it, based on Euclidean distance.
3. Once all points are assigned, the centroids are recalculated as the mean of all the points in that cluster.
4. Steps 2 and 3 are repeated until the centroids stop moving significantly (or a maximum number of iterations is reached).

In short, clustering with Euclidean distance groups observations such that points within a cluster are close to each other, while points in different clusters are farther apart.
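The distance-based assignment can be sketched with the standard library's `math.dist` (available from Python 3.8). The helper `nearest_centroid`, the points and the centroids below are all hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical 2-D observations and two starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]

print(math.dist((0, 0), (3, 4)))                          # 5.0
print([nearest_centroid(p, centroids) for p in points])   # [0, 0, 1, 1]
```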

Centroid

A crucial concept in clustering is the centroid.

From high school geometry, you may remember that the centroid is the center point of a triangle. Similarly, in clustering, the centroid is the center point of a cluster — it represents the “average location” of all the points belonging to that cluster.

Why Do We Need a Centroid?

Take the example shown above, where students’ marks in Mathematics and Biology form four distinct clusters:

· Cluster 1: High Biology, Low Maths

· Cluster 2: Average in both Biology and Maths

· Cluster 3: High in both Biology and Maths

· Cluster 4: High Maths, Low Biology

From the visual, we can clearly see how groups are formed. But suppose you want to compare two clusters — say Cluster 1 and Cluster 2:

· By how many marks do students in Cluster 1 outperform those in Cluster 2 in Biology?

· By how much do they underperform in Maths?

You cannot answer this precisely just by looking at the plot. That’s where the centroid becomes useful.

Calculating a Centroid

As explained, a centroid is the cluster center, representing the average of all observations in that cluster. To compute it, you simply take the mean of each column (dimension) across the observations in the cluster.

Step 1: Compute Mean for Each Feature

Step 2: Form the Centroid

Thus, the centroid of this group of observations is (173.75, 83.75, 23.75). This single point summarizes the cluster’s average height, weight, and age.

Key takeaway: Centroids provide a numerical summary of clusters, making it possible to compare groups quantitatively rather than just visually.
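The column-wise mean is easy to compute in Python. The observations below are hypothetical and are not the ones behind the centroid quoted above; `centroid` is an illustrative helper name.

```python
from statistics import mean

def centroid(observations):
    """Mean of each column (feature) across the observations in a cluster."""
    return tuple(mean(column) for column in zip(*observations))

# Hypothetical cluster of four people: (height in cm, weight in kg, age in years).
cluster = [
    (170.0, 80.0, 22.0),
    (175.0, 85.0, 24.0),
    (172.0, 82.0, 23.0),
    (179.0, 89.0, 27.0),
]

print(centroid(cluster))   # (174.0, 84.0, 24.0)
```

`zip(*observations)` regroups the rows into columns, so each feature is averaged separately.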

Steps of the Algorithm: K-Means Algorithm with a Simple Example

Let’s understand how the K-Means algorithm works step by step using a very simple scenario.

Suppose you have the data of 10 students with their marks in Biology (y-axis) and Math (x-axis). You want to divide these students into 2 clusters so that you can see the types of students in the class.

Imagine two groups forming — one colored red and the other yellow. The question is: how will the algorithm decide which student belongs to which group?

Step 1: Recall the Centroid

Step 2: How K-Means Uses the Centroid

End Result:
The 10 students will be divided into 2 clusters, with each cluster containing students who have similar performance in Math and Biology.

K-Means Cost Function

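The goal of K-Means is to form clusters where points within a cluster are as close as possible to their cluster centroid. To measure this closeness, K-Means uses a cost function (also called the objective function or distortion function): J = Σₖ Σ_(xᵢ ∈ Cₖ) ‖xᵢ − μₖ‖², where Cₖ is the set of points assigned to cluster k and μₖ is the centroid of that cluster.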

What Does It Mean?

· The formula calculates the squared Euclidean distance between each data point and its assigned cluster centroid.

· The total cost is the sum of these squared distances across all clusters.

· The K-Means algorithm works by minimizing this cost function through iterative updates:

1. Assignment step — assign each point to the nearest centroid.

2. Update step — recalculate centroids as the mean of their assigned points.

With each iteration, the cost function decreases or stays the same, and the algorithm converges when the assignments no longer change significantly.
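The cost is simple to compute directly. In this sketch the function name `kmeans_cost` and all of the numbers are hypothetical; the two centroid sets show that centroids placed at the cluster means give a lower cost for the same assignments.

```python
def kmeans_cost(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Hypothetical points, cluster assignments and two candidate centroid sets.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
mean_centroids = [(1.0, 0.0), (11.0, 0.0)]    # the cluster means
other_centroids = [(0.0, 0.0), (12.0, 0.0)]   # centroids sitting on end points

print(kmeans_cost(points, labels, mean_centroids))    # 4.0
print(kmeans_cost(points, labels, other_centroids))   # 8.0
```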

The Two Steps of K-Means

The K-Means algorithm runs in an iterative loop of two key steps: Assignment and Optimization.

1. Assignment Step

In this step, each data point is assigned to the cluster whose centroid is closest to it.

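Formally, each data point xᵢ is assigned to the cluster k that minimizes the squared distance ‖xᵢ − μₖ‖², where μₖ is the centroid of cluster k.

This ensures every point is grouped with the centroid it is closest to.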

2. Optimization Step

Once all points have been assigned, the centroids of the clusters are recalculated.

For cluster k, the new centroid μₖ is the average of all points assigned to it: μₖ = (1/|Cₖ|) × Σ xᵢ, summed over the points xᵢ in the cluster Cₖ.

This moves the centroid to the “center of mass” of its cluster.

Repeat Until Convergence

The algorithm repeats the assignment step and the optimization step until:

· The cluster assignments no longer change, or

· The improvement in the cost function becomes negligible.

At this point, K-Means is said to have converged, producing stable clusters.
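Putting the two steps into a loop gives the whole algorithm. The following is a bare-bones sketch in pure Python, not a production implementation: the function names, the data and the starting centroids are hypothetical, and an empty cluster simply keeps its previous centroid, which is one possible convention among several.

```python
import math

def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Optimization step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:                  # keep an empty cluster's centroid in place
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:         # converged: nothing moved
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Hypothetical data and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

The loop stops on unchanged assignments; stopping when the cost improvement becomes negligible, the second criterion listed above, would need a cost calculation inside the loop.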

K Means++ Algorithm

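In the previous segment, you learned how the K-Means algorithm alternates between two steps, assignment and optimization, in order to minimize the cost function J = Σₖ Σ_(xᵢ ∈ Cₖ) ‖xᵢ − μₖ‖², the sum of squared distances between each point xᵢ and the centroid μₖ of its cluster Cₖ.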

However, one limitation of standard K-Means is its sensitivity to the choice of initial centroids. If the starting centroids are chosen poorly (e.g., too close to each other), the algorithm may converge to suboptimal clusters.

To address this, the K-Means++ algorithm was introduced as a smarter initialization strategy.

How K-Means++ Works

1. Choose the first centroid randomly from the dataset.

2. Compute distances: For each remaining data point , calculate its distance from the nearest chosen centroid.

3. Select the next centroid: Pick the next centroid from the data points with a probability proportional to the square of its distance from the nearest chosen centroid.

Intuitively, points that are farther away from existing centroids are more likely to be chosen as new centroids.

4. Repeat Steps 2 and 3 until K centroids are chosen.

5. Once the initialization is complete, proceed with the standard K-Means algorithm (assignment + optimization).
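
The seeding steps above can be sketched in a few lines of plain Python. The function name `kmeans_pp_init`, its `seed` parameter, and the sample points are hypothetical; the sketch also assumes the data contains at least K distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids using the K-Means++ seeding rule."""
    rng = random.Random(seed)
    # Step 1: the first centroid is a uniformly random data point.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Step 3: sample the next centroid with probability proportional to d2.
        # Already-chosen points have weight 0, so they are not picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)]
print(kmeans_pp_init(points, k=2))
```

The returned centroids would then be handed to the usual assignment and optimization loop in place of a purely random start.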

Why K-Means++ is Better

· Tends to spread the initial centroids out, reducing the risk of poor clustering.

· Typically leads to faster convergence.

· Usually produces better-quality clusters than random initialization.

Key takeaway: K-Means++ is not a new algorithm but an improved initialization procedure that makes the standard K-Means more robust and efficient.

Let’s see the K-Means algorithm in action using a visualisation tool, which can be found on naftaliharris.com. Play around with the different options available to get an intuitive feel for the K-Means algorithm.

Upon trying the different options, you may have noticed that the final clusters vary depending on many factors, such as the choice of the initial cluster centres and the value of K, i.e. the number of clusters that you want. You will look at these factors and other practical considerations for the K-Means algorithm in more detail in the next segment.

Let’s solve an exercise now

Data (X1, X2):
1:(1,4) 2:(1,3) 3:(0,4) 4:(5,1) 5:(6,2) 6:(4,0)

Iteration 0 — initial assignment

Iteration 1 — recompute centroids
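
Since the worked figures for this exercise are not reproduced here, the sketch below runs the two iterations in plain Python. The starting centroids are an assumption: points 1 and 4 are used purely for illustration, and the cluster names "A" and "B" are hypothetical labels.

```python
import math

# Data from the exercise, keyed by point number.
data = {1: (1, 4), 2: (1, 3), 3: (0, 4), 4: (5, 1), 5: (6, 2), 6: (4, 0)}

# ASSUMPTION: the starting centroids are not given in the text,
# so points 1 and 4 are used here purely for illustration.
centroids = {"A": data[1], "B": data[4]}

# Iteration 0: assign every point to its nearest centroid.
assignment = {
    i: min(centroids, key=lambda name: math.dist(p, centroids[name]))
    for i, p in data.items()
}

# Iteration 1: recompute each centroid as the mean of its assigned points.
new_centroids = {}
for name in centroids:
    members = [data[i] for i, c in assignment.items() if c == name]
    new_centroids[name] = tuple(sum(axis) / len(members) for axis in zip(*members))

print(assignment)
print(new_centroids)
```

Under that assumed start, points 1 to 3 fall into cluster A and points 4 to 6 into cluster B, and the recomputed centroids are roughly (0.67, 3.67) and (5, 1). A further assignment step leaves the clusters unchanged, so the algorithm has converged.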

Practical Considerations in the K-Means Algorithm

Before applying K-Means clustering, you must be aware of some practical issues that can affect the quality of clusters:

1. Number of Clusters (K):
The value of K must be chosen before running the algorithm. A wrong choice leads to poor clustering.

2. Initial Cluster Centers:
The starting centroids influence the final clusters. Poor initialization can give different or unstable results.

3. Outliers:
K-Means is sensitive to outliers, as they can shift centroids away from their true positions.

4. Feature Scaling:
Since Euclidean distance is used, all features should be on the same scale. Standardization is usually required.

5. Categorical Data:
K-Means does not work well with categorical variables — it is mainly for numerical data.

6. Convergence:
The algorithm may not converge within a fixed number of iterations, so always check if clusters have stabilized.
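
For the feature-scaling point in the list above, here is a minimal sketch of standardization (z-scoring) using only the standard library. The helper name `standardize` and the customer figures are hypothetical, and the sketch assumes no column is constant.

```python
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; a constant column would divide by zero.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical customers: (annual income, number of store visits).
customers = [(30_000, 2), (50_000, 8), (70_000, 5)]
scaled = standardize(customers)
print(scaled)
```

Without this step the income column would dominate every Euclidean distance simply because its numbers are larger, and the number of visits would barely influence the clusters.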

Silhouette Analysis in K-Means

Once you understand the basic ways of choosing the value of K, another useful method to know is Silhouette Analysis (also called the Silhouette Coefficient).

· It is a measure that shows how well a data point fits within its assigned cluster.

· It compares cohesion (similarity of a point to its own cluster) with separation (difference from other clusters).

· A higher silhouette score means the data point is well-clustered, while a lower or negative score indicates poor clustering.

Computing Silhouette Metric

To calculate the silhouette score for a data point, we need two measures:

1. Cohesion (a):
The average distance of the point from all other points in its own cluster.

2. Separation (b):
The average distance of the point from all points in the nearest neighboring cluster.

Once we have these two values, the silhouette coefficient (s) for each point is calculated as:

$$s = \frac{b - a}{\max(a, b)}$$

· If s is close to +1, the point is well clustered.

· If s is close to 0, the point lies between two clusters.

· If s is negative, the point may be assigned to the wrong cluster.
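
As a rough sketch, the silhouette coefficient of a single point can be computed as follows. The function name `silhouette`, the sample points, and the labels are hypothetical; returning 0 for a point that sits alone in its cluster is a common convention adopted here as an assumption.

```python
import math


def silhouette(index, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for one point."""
    own = labels[index]
    p = points[index]

    def mean_dist(cluster):
        # Average distance from p to the other members of `cluster`.
        others = [
            q for j, q in enumerate(points)
            if labels[j] == cluster and j != index
        ]
        return sum(math.dist(p, q) for q in others) / len(others)

    # Assumed convention: a point alone in its cluster gets s = 0.
    if labels.count(own) == 1:
        return 0.0
    a = mean_dist(own)  # cohesion
    b = min(mean_dist(c) for c in set(labels) if c != own)  # separation
    return (b - a) / max(a, b)


points = [(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)]
labels = [0, 0, 0, 1, 1, 1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
print(scores, sum(scores) / len(scores))
```

Averaging the per-point scores gives a single number for the whole clustering, which is what you would compare across different values of K.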

K-Means Clustering: Cluster Tendency

Before applying any clustering algorithm, it is important to check whether the data actually has meaningful clusters or not. This ensures that the data is not just random. The process of evaluating whether data is suitable for clustering is called assessing cluster tendency.

As discussed earlier, clustering algorithms like K-Means will always return K clusters, even if no natural clusters exist in the data. Hence, we should not blindly apply clustering methods. Instead, we must first check the cluster tendency.

One common way to do this is the Hopkins Test. This test checks whether the data distribution is significantly different from a uniform (random) distribution in the multidimensional space. If the data is truly random, clustering will not give meaningful results.
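
One way to get a feel for the Hopkins test is to compute the statistic by hand. The sketch below is an illustration under stated assumptions: the function name `hopkins`, the sample size `m`, the bounding-box sampling, and the synthetic data are all hypothetical, and definitions of the statistic vary between sources. With the convention used here, values around 0.5 suggest uniform (random) data, while values close to 1 suggest clustered data.

```python
import math
import random


def hopkins(points, m=None, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, near 1 when clustered."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    m = m or max(1, n // 10)
    lows = [min(p[d] for p in points) for d in range(dim)]
    highs = [max(p[d] for p in points) for d in range(dim)]

    # w: nearest-neighbour distance for m real points (excluding the point itself).
    sample_idx = rng.sample(range(n), m)
    w = [
        min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        for i in sample_idx
    ]
    # u: distance from m uniform random points in the bounding box to the data.
    u = []
    for _ in range(m):
        r = tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        u.append(min(math.dist(r, q) for q in points))

    su = sum(x ** dim for x in u)
    sw = sum(x ** dim for x in w)
    return su / (su + sw)


rng = random.Random(42)
# Two tight synthetic blobs versus points scattered uniformly in the unit square.
clustered = [
    (rng.gauss(c, 0.05), rng.gauss(c, 0.05)) for c in (0, 10) for _ in range(50)
]
uniform = [(rng.random(), rng.random()) for _ in range(200)]
print(hopkins(clustered), hopkins(uniform))
```

The idea is that in clustered data, real points have much closer neighbours than randomly placed points do, which pushes the ratio towards 1.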

Session Summary: K-Means Clustering

In this session, we started by intuitively understanding K-Means through the example of grouping 10 random points into 2 clusters.

· The algorithm begins by selecting K random cluster centers.

· Then, two steps — Assignment (assigning points to the nearest cluster) and Optimization (updating cluster centers) — are repeated until the clusters stop changing.

· The result is a set of clusters with low intra-cluster distance (points within a cluster are close together) and, as a consequence, good separation between clusters. Keep in mind that K-Means only guarantees a locally optimal solution, not the best possible one.

We also discussed several practical issues to keep in mind when applying K-Means:

1. Choosing K: You must decide the number of clusters before running the algorithm.

2. Non-deterministic nature: K-Means can give different results on the same dataset because outcomes depend on the choice of initial cluster centers.

3. Outliers: Outliers can distort the clusters, leading to poor results.

4. Feature scaling: Since Euclidean distance is commonly used, all attributes need to be brought to the same scale using standardization.

5. Categorical data: K-Means cannot be directly applied to categorical variables; specialized algorithms like K-Modes or K-Prototypes are used instead.

K-Means in Python

Data: https://drive.google.com/file/d/1BPagY_u3059RA7wFi-0NCoY-44lawUjd/view?usp=sharing

Code: https://drive.google.com/file/d/16QQio0oAI7F6nVcBiUd71tvxEDL15AHg/view?usp=sharing

Practice running the above code and explore how it works. If you get stuck or have questions, let me know in the comments — we’ll figure it out together!

Hierarchical Clustering

Hierarchical Clustering vs K-Means

· K-Means Limitation: You must decide the number of clusters K in advance.

· Hierarchical Clustering Advantage: No need to specify K beforehand.

Output Difference

· K-Means: Produces fixed clusters by assigning data to centroids and refining them.

· Hierarchical Clustering: Produces a dendrogram (an inverted tree structure) showing how data points merge step by step.

Process of Hierarchical Clustering

1. Compute an N×N distance (dissimilarity) matrix between all items.

2. Initially, treat each item as a separate cluster (N clusters).

3. Merge the two closest clusters into one.

4. Repeat merging and updating distances until all items form one cluster.

5. The final output is a dendrogram, which shows the hierarchy of merges and the distance at which they happened.
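
The merge procedure above can be sketched naively in plain Python. The function name `agglomerative`, its `linkage` parameter, and the sample points are hypothetical, and distances are recomputed on demand rather than stored in a matrix, which keeps the code short but is slow on anything beyond toy data.

```python
import math
from itertools import combinations


def agglomerative(points, linkage=min):
    """Naive agglomerative clustering; returns the list of merges in order.

    `linkage` reduces all pairwise distances between two clusters to one
    number: `min` gives single linkage, `max` gives complete linkage.
    """
    # Every item starts as its own cluster (stored as a tuple of indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def cluster_dist(a, b):
        # Distances are computed on demand instead of from a stored matrix.
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    # Keep merging the two closest clusters until only one remains.
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        merges.append((a, b, cluster_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


points = [(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)]
merges = agglomerative(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The list of merges and their distances is exactly the information a dendrogram draws: each merge is a join in the tree, and its distance is the height of that join.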

Interpreting the Dendrogram

The result of the cluster analysis is shown by a dendrogram, which starts with each data point as its own cluster and indicates at what level of dissimilarity any two clusters were joined.

As you saw, the y-axis of the dendrogram is some measure of the dissimilarity or distance at which clusters join.

In the dendrogram shown above, samples 4 and 5 are the most similar and join to form the first cluster, followed by samples 1 and 10. The last two clusters to fuse together to form the final single cluster are 3–6 and 4–5–2–7–1–10–9–8.

Determining the number of groups in a cluster analysis is often the primary goal. Typically, one looks for natural groupings defined by long stems. Here, by observation, you can identify that there are 3 major groupings: 3–6, 4–5–2–7 and 1–10–9–8.

You also saw that hierarchical clustering can proceed in 2 ways: agglomerative and divisive. If you start with n distinct clusters and iteratively merge them until only 1 cluster remains, it is called agglomerative clustering. On the other hand, if you start with 1 big cluster and keep partitioning it until you reach n clusters, each containing 1 element, it is called divisive clustering.

The dendrogram helps you decide at which level to “cut the tree” to obtain the desired number of clusters.

You learnt about Hierarchical Clustering as another clustering method.
Unlike K-Means, it does not require pre-defining the number of clusters.
It produces a dendrogram, which shows how clusters are formed step by step.
The main drawback is that it requires computing the distance between every pair of points, making it time-consuming and computationally expensive for large datasets.

Expert Insights: K-Means vs Hierarchical Clustering

· The choice between K-Means and Hierarchical Clustering depends mainly on:

1. Hardware/Computing power — Hierarchical clustering is more resource-intensive since it requires pairwise distance calculations.

2. Data Size and Nature — K-Means works well on large datasets, while Hierarchical clustering is better for smaller datasets or when you want to explore natural groupings.

Summary Flow

A Practical Hack for Segmentation

Instead of relying on only one method, you can combine both approaches:

Step 1: Use Hierarchical clustering to understand the data structure and estimate the likely number of clusters (by reading the dendrogram).

Step 2: Use this number of clusters as input to K-Means to perform efficient clustering on larger datasets.

This way, you get the best of both:

· Interpretability from Hierarchical clustering,

· Scalability from K-Means clustering.

Comparison of Linkages in Hierarchical Clustering

1. Single Linkage (Minimum Distance)

Defines cluster distance by the closest pair of points.

Often causes a chaining effect → clusters become long and loose.

Dendrogram is not very well-structured.

2. Complete Linkage (Maximum Distance)

Defines cluster distance by the farthest pair of points.

Produces compact, well-separated clusters.

Dendrogram is cleaner and easier to interpret.

3. Average Linkage (Mean Distance)

Defines cluster distance as the average of all pairwise distances between clusters.

Balances the advantages of single and complete linkage.

Gives reasonably well-structured dendrograms.
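
To see how the three definitions differ on the same data, the sketch below computes all three linkage distances between two small clusters. The cluster contents are hypothetical sample points chosen for illustration.

```python
import math
from statistics import fmean

cluster_a = [(1, 4), (1, 3), (0, 4)]
cluster_b = [(5, 1), (6, 2), (4, 0)]

# All pairwise distances between a point in A and a point in B.
pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]

single = min(pairwise)      # single linkage: closest pair
complete = max(pairwise)    # complete linkage: farthest pair
average = fmean(pairwise)   # average linkage: mean over all pairs

print(round(single, 3), round(average, 3), round(complete, 3))
```

Single linkage always reports the smallest of the three values and complete linkage the largest, which is why single linkage merges clusters more readily and tends to chain them together.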

Which is Best?

· Complete Linkage generally gives the most well-separated dendrogram because it forces clusters to be compact and distinct.

· Advantages:

Easier to see clear groupings.

Better interpretability.

More reliable for business decisions where distinct segments are needed.

Play around with various linkages and numbers of clusters. You will be able to see the number of natural clusters from the dendrogram itself. If you want, you can change the scale as well. Which combination of parameters gives you the best result?

Choosing the Number of Clusters

By looking at the dendrogram and applying general knowledge about Indian states (e.g., southern states being more educated, BIMARU states having lower literacy), the following clusters make logical sense:

1. High literacy and higher education states → Kerala, Tamil Nadu, Delhi, Chandigarh.

2. Moderately literate, growing education states → Maharashtra, Gujarat, Karnataka, Punjab.

3. Low literacy, education-challenged states → Bihar, Uttar Pradesh, Jharkhand, Rajasthan, Madhya Pradesh.

Cutting the dendrogram at around 3–4 clusters with complete or average linkage gives the most meaningful results.

That wraps up our journey through Unsupervised Learning and Clustering! If you have any questions or doubts, feel free to drop them in the comments — I’m always happy to help.