<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Soumyadeep Saha on Medium]]></title>
        <description><![CDATA[Stories by Soumyadeep Saha on Medium]]></description>
        <link>https://medium.com/@saha.soumyadeep90?source=rss-53767639011e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*1IiynLs0NQsIvWX0lappJg.jpeg</url>
            <title>Stories by Soumyadeep Saha on Medium</title>
            <link>https://medium.com/@saha.soumyadeep90?source=rss-53767639011e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 23 May 2026 07:11:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@saha.soumyadeep90/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Embeddings Explained: From Sparse Representations to Transformer-Based Semantic Spaces]]></title>
            <link>https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/4defcf1d78df</guid>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[attention]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[embedding]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 05:22:17 GMT</pubDate>
            <atom:updated>2026-02-18T09:44:51.675Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction: Why Embeddings Matter</h3><p>Every modern AI system — from Google Search to ChatGPT, from recommendation engines to facial recognition — relies on a single powerful idea:</p><p><strong>Represent complex objects as vectors in a continuous space.</strong></p><p>This idea is called <strong>embedding</strong>.</p><p>But what does that actually mean?</p><h4>The Core Problem</h4><p>Computers do not understand:</p><blockquote>· Words</blockquote><blockquote>· Images</blockquote><blockquote>· Graphs</blockquote><blockquote>· Users</blockquote><blockquote>· Products</blockquote><p>They understand numbers.</p><p>If we want machines to reason about:</p><blockquote>· The similarity between “dog” and “puppy”</blockquote><blockquote>· The relationship between “king” and “queen”</blockquote><blockquote>· Whether two images depict the same object</blockquote><blockquote>· Whether two users have similar preferences</blockquote><p>We must convert these objects into numbers.</p><p>Not just any numbers — but numbers arranged in a way that preserves meaning.</p><h4>A Simple Thought Experiment</h4><p>Imagine we want to represent three words:</p><pre>dog<br>cat<br>car</pre><p>We want:</p><blockquote>· dog close to cat</blockquote><blockquote>· dog far from car</blockquote><p>If we place them randomly in space, this structure is lost.</p><p>But if we map them carefully into a geometric space:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/540/1*q_LyExnQ0j8WBMKf32FfuQ.png" /></figure><p>Now distance encodes similarity.</p><p>This geometric representation is an embedding.</p><h4>Why Geometry?</h4><p>Geometry gives us:</p><blockquote>· Distance → similarity</blockquote><blockquote>· Direction → relationships</blockquote><blockquote>· Clusters → semantic groups</blockquote><blockquote>· Linear transformations → analogies</blockquote><p>For example:</p><pre>king - man + woman ≈ queen</pre><p>This works because embeddings 
transform symbolic relationships into geometric operations.</p><p>Meaning becomes direction in space.</p><h4>The Big Idea</h4><p>An embedding is a function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/362/1*qDA-mOpId8KsPcQ406qjPg.png" /></figure><p>That maps complex objects into a low-dimensional vector space such that:</p><blockquote><strong>“similar objects”⇒”nearby vectors”</strong></blockquote><p>This idea appears everywhere:</p><blockquote>· NLP (Word2Vec, BERT, GPT)</blockquote><blockquote>· Computer Vision (CNN features, ViT embeddings)</blockquote><blockquote>· Graph Learning (node2vec, GCN)</blockquote><blockquote>· Recommender Systems (user/item embeddings)</blockquote><blockquote>· Multimodal systems (text–image alignment)</blockquote><h4>How Did We Get Here?</h4><p>The journey to modern embeddings evolved through several major phases:</p><blockquote>1. Sparse symbolic representations (one-hot, TF-IDF)</blockquote><blockquote>2. Matrix factorization (LSA)</blockquote><blockquote>3. Predictive neural embeddings (Word2Vec, GloVe)</blockquote><blockquote>4. Contextual embeddings (ELMo, BERT, GPT)</blockquote><blockquote>5. Contrastive and multimodal embeddings</blockquote><blockquote>6. 
Graph and manifold-based representations</blockquote><p>Each stage improved:</p><blockquote>· Scalability</blockquote><blockquote>· Semantic richness</blockquote><blockquote>· Context awareness</blockquote><blockquote>· Transfer learning ability</blockquote><p><strong>Note: </strong>We will discuss only a few of them in this article.</p><h4>What This Article Will Do</h4><p>In this article, we will:</p><p>· Define embeddings formally</p><p>· Explain every major category</p><p>· Derive the mathematics behind each approach</p><p>· Compare their geometric intuition</p><p>· Understand why Transformers became dominant</p><p>· Explore modern embedding paradigms</p><p>This is not just a tutorial; it is a conceptual and mathematical journey through how machines learn meaning.</p><p>We will move from intuition → math → architecture → geometry → modern systems.</p><h4>Before We Begin</h4><p>Keep this mental model in mind:</p><blockquote><strong>Embeddings turn meaning into geometry.</strong></blockquote><p>Once you understand that, everything else becomes a refinement of that core idea.</p><p>Now let’s begin the deep dive.</p><h3>1) Definition: What Is an Embedding?</h3><p>An <strong>embedding</strong> is a <strong>learned mapping</strong> from objects (tokens/words, sentences, images, users/items, nodes in a graph, etc.) 
into a <strong>continuous vector space <em>R<sup>d</sup></em> </strong>such that <strong>geometric relationships</strong> in that space correspond to <strong>meaningful relationships</strong> in the original domain.</p><p>Formally, for a set of objects X, an embedding is a function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/566/1*VhkfxG8gbFeqLCYoGho_MQ.png" /></figure><p>where d is typically <strong>much smaller</strong> than the size/complexity of the original representation.</p><h4>Why embeddings matter</h4><p>Embeddings turn “symbolic” or high-dimensional inputs into vectors where we can:</p><blockquote>· compare items via <strong>distance</strong>/<strong>similarity</strong> (nearest neighbors),</blockquote><blockquote>· use vector operations in ML models (linear layers, dot products),</blockquote><blockquote>· generalize across similar items (shared structure).</blockquote><p>Common similarity measures:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*51kAtX5To7l8Gk8C6MVMpA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v823rj1XebjSHpdQ5aZFFA.png" /></figure><h3>2) Types of embeddings (major categories)</h3><p>A useful way to classify embeddings is by <strong>what they embed</strong> and <strong>how they behave</strong>.</p><h3>A. By what is embedded (data modality / object type)</h3><p>1. <strong>Discrete symbol embeddings (categorical / token / word embeddings)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_uhQrpb95pTab0T2li-Hjg.png" /></figure><p>2. <strong>Subword / character-aware embeddings</strong></p><blockquote>· Embed morphemes, byte-pair units, or characters; sometimes compose them (CNN/RNN/Transformer over characters or subword tokens).</blockquote><blockquote>· Helps with rare words and morphology.</blockquote><p>3. 
<strong>Sentence / document embeddings</strong></p><blockquote>· Produce one vector for a span of text.</blockquote><blockquote>· Either <strong>aggregate</strong> token embeddings (mean/max pooling) or use the representation of a special token like [CLS] (Transformers).</blockquote><p>4. <strong>Graph / network embeddings</strong></p><blockquote>· Nodes (and sometimes edges/subgraphs) are mapped to vectors.</blockquote><blockquote>· Preserve graph proximity (random-walk context) or message-passing structure.</blockquote><p>5. <strong>Knowledge graph embeddings</strong></p><blockquote>· Embed entities and relations to model triples (h, r, t): head entity, relation, tail entity.</blockquote><blockquote>· Often use scoring functions like translation: h + r ≈ t</blockquote><p>6. <strong>Vision embeddings</strong></p><blockquote>· Images mapped to vectors (e.g., CNN/ViT features).</blockquote><blockquote>· Often derived from patch tokens (ViT) or pooled CNN activations.</blockquote><p>7. <strong>Audio/speech embeddings</strong></p><blockquote>· Represent speakers (speaker ID), phonetic content, or general audio semantics.</blockquote><p>8. <strong>Multimodal embeddings</strong></p><blockquote>· Put different modalities in a <strong>shared space</strong> (e.g., text and images aligned so matching pairs are close). This is central to CLIP-style contrastive training.</blockquote><p>9. <strong>User–item / recommendation embeddings</strong></p><blockquote>· Users and items embedded so interactions are predictable (matrix factorization, neural recommenders).</blockquote><h3>B. By behavior: static vs contextual</h3><p>1. <strong>Static embeddings</strong></p><blockquote>· Each token has <strong>one vector</strong> regardless of context (e.g., classic word2vec/GloVe).</blockquote><blockquote>· Limitation: the vector for “bank” is the same whether it means a river bank or a financial institution.</blockquote><p>2. 
<strong>Contextual embeddings</strong></p><blockquote>· A token’s representation depends on surrounding context:</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AZLtd8J3lI2gPqryNO53zA.png" /></figure><blockquote>· This became the standard for modern NLP because it resolves polysemy and yields richer features.</blockquote><h3>C. By learning supervision: unsupervised/self-supervised/supervised</h3><blockquote>· <strong>Unsupervised / self-supervised:</strong> learn from raw structure (co-occurrence, reconstruction, masked prediction, contrastive).</blockquote><blockquote>· <strong>Supervised:</strong> learn embeddings that optimize a downstream label objective (classification, ranking).</blockquote><blockquote>· <strong>Metric learning:</strong> explicitly structure distances via pairs/triplets.</blockquote><h3>3) How embeddings are calculated (major historical approaches)</h3><p>Below are the <strong>main families</strong> of methods, their <strong>math intuition</strong>, and how they compare.</p><h3>Pre-Embedding Era: Sparse Vector Representations</h3><p>Before dense embeddings were introduced, words and documents were represented using <strong>high-dimensional sparse vectors</strong>.</p><p>These methods did not learn latent meaning — they encoded surface-level statistics only.</p><p>We will cover:</p><blockquote>1. One-Hot Encoding</blockquote><blockquote>2. Bag-of-Words (BoW)</blockquote><blockquote>3. TF-IDF</blockquote><blockquote>4. Why these methods fail to capture semantics</blockquote><h3>1. 
One-Hot Encoding</h3><h4>Definition</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lF3J9jMeRjFMW1ldfbeyFw.png" /></figure><h4>Example</h4><p>Vocabulary: V={“cat”,”dog”,”apple”,”car”}</p><p>Then:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*twitkef9xwWNAIAb7UslEw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/736/1*0IwbsGCl4e-MakMra5O2xg.png" /></figure><h4>Geometric Property:</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yrDRmlWnqNoPq1BJU06ujw.png" /></figure><p>This means:</p><blockquote>cat and dog similarity = 0<br>cat and apple similarity = 0</blockquote><p>The model believes all words are equally unrelated.</p><h4>Problem</h4><p>There is:</p><blockquote>· No notion of semantic similarity</blockquote><blockquote>· No relationship between similar words</blockquote><blockquote>· Very high dimensionality</blockquote><blockquote>· No compression of meaning</blockquote><p>This motivated better representations.</p><h3>2. Bag-of-Words (BoW)</h3><p>Now instead of representing a single word, we represent a <strong>document</strong>.</p><h4>Definition</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1Cp92Qh00d9Dpdb9DzJIcA.png" /></figure><h4>Example</h4><p>Vocabulary: V={“cat”,”dog”,”apple”,”car”}</p><p>Document: “cat dog dog”</p><p>Vector: (1,2,0,0)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/774/1*9dqXqB0cRIdOsEK2ukXf7g.png" /></figure><p><strong>Geometrically:</strong> Each document becomes a point in high-dimensional space.</p><h4>Similarity Between Documents</h4><p>Usually cosine similarity:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/774/1*RFdo8oe3WLWXcYFLayfGEw.png" /></figure><p>If two documents share many words → angle small → high similarity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ni6177fw96BrUJRg013F4w.png" /></figure><h4>Problems</h4><p>1. 
Order is lost:</p><blockquote>“dog bites man”</blockquote><blockquote>“man bites dog”<br> Same vector.</blockquote><p>2. Very sparse.</p><p>3. No latent semantics.</p><p>4. Large vocabulary → huge dimensionality.</p><h3>3. TF-IDF (Improved BoW)</h3><p>Bag-of-Words treats all words equally.</p><p>But common words (“the”, “is”) are not informative.</p><p>So we weight words.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wPgAungWVM-MLqsMsfuYFg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vmd6DTjH5frRLN2qsPRc-A.png" /></figure><p>Rare words → high IDF<br>Common words → low IDF</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2icV0qaA8dQHahnmxIV55A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*AuC_RlNBCq57cMLE90xBww.png" /></figure><h3>Geometric View of Sparse Methods</h3><p>All these methods share this property:</p><p>High-Dimensional Space (|V| dimensions)</p><p>Each word = one axis</p><p>Each document = sparse vector</p><p>Key characteristics:</p><blockquote>· Dimensionality = vocabulary size (often 50k–1M)</blockquote><blockquote>· Vectors are sparse (mostly zeros)</blockquote><blockquote>· No latent compression</blockquote><blockquote>· No semantic structure</blockquote><h3>Why These Methods Fail Semantically</h3><p>Consider:</p><blockquote>Document A: “dog puppy bark”<br>Document B: “canine pet bark”</blockquote><p>BoW vectors:</p><blockquote>No shared exact words → low similarity.</blockquote><blockquote>But semantically → very similar.</blockquote><p><strong>Sparse models fail because:</strong> They operate in <strong>surface word space</strong>, not meaning space.</p><h3>Visual Comparison: Sparse vs Dense</h3><h4>Sparse Representation</h4><pre>Dimension: 50,000<br>Vector: [0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,...]<br>Mostly zeros</pre><h4>Dense Embedding (Modern)</h4><pre>Dimension: 300<br>Vector: [0.12, -0.87, 0.44, 0.09, ...]<br>All values 
meaningful<br>Encodes latent semantics</pre><h3>What Is LSA?</h3><p><strong>Latent Semantic Analysis (LSA)</strong> is a technique that:</p><blockquote><em>Uses word co-occurrence statistics and matrix factorization (SVD) to discover hidden (“latent”) semantic structure in text.</em></blockquote><p>It was one of the first methods to convert words and documents into dense vector representations.</p><h4>Build Word–Document Matrix</h4><p>Suppose we have documents:</p><pre>D1: dog barks loudly<br>D2: cat meows loudly<br>D3: dog runs fast</pre><p>Vocabulary:</p><pre>dog, cat, barks, meows, runs, fast, loudly</pre><p>Construct the matrix <em>X</em>, with one row per word and one column per document:</p><pre>            D1   D2   D3<br>dog         1    0    1<br>cat         0    1    0<br>barks       1    0    0<br>meows       0    1    0<br>runs        0    0    1<br>fast        0    0    1<br>loudly      1    1    0</pre><p>This is a <strong>count matrix</strong>: each entry is the number of times the word appears in the document.</p><h3>Predictive Embeddings — Word2Vec</h3><p>Instead of:</p><p>“Count how often words appear together” (like LSA),</p><p>Word2Vec says: “Learn vectors that are good at predicting nearby words.”</p><p>So embeddings are learned as <strong>parameters of a predictive model</strong>.</p><h3>4. Neural Contextual Embeddings (ELMo → BERT → GPT)</h3><p>This section introduces the <strong>major conceptual breakthrough</strong> in embedding research:</p><p>A word does not have one fixed vector.<br>Its vector depends on the sentence it appears in.</p><p>We will explain:</p><blockquote>1. Why static embeddings fail</blockquote><blockquote>2. ELMo (BiLSTM contextual embeddings)</blockquote><blockquote>3. Transformer architecture</blockquote><blockquote>4. Self-attention mathematically</blockquote><blockquote>5. BERT (Masked LM)</blockquote><blockquote>6. GPT (Causal LM)</blockquote><blockquote>7. Full architecture diagrams</blockquote><blockquote>8. 
Why contextual embeddings became dominant</blockquote><h3>Why Static Embeddings Fail</h3><p>Consider the word: “bank”</p><p>Sentence A: I deposited money in the bank.</p><p>Sentence B: The river overflowed its bank.</p><p>Word2Vec assigns: e_bank = ”same vector in both cases”</p><p><strong>But meaning differs.</strong></p><p>We need:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/1*V1-LAjCKJwSNrCmm2mgJag.png" /></figure><p>This is the motivation for contextual embeddings.</p><h3>ELMo (2018) — First Major Contextual Embedding</h3><p>ELMo = <strong>Embeddings from Language Models</strong></p><p>Instead of learning one vector per word type, it learns representations from a <strong>bidirectional language model (BiLSTM)</strong>.</p><p><strong>ELMo Architecture Diagram</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*b2zXLZXa6ggFjz0zPh0K5Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QQjGMzX98M9lk2ez8raulQ.png" /></figure><p><strong>Final ELMo Representation</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Zav70NwZXOdoBsDdkVxOeQ.png" /></figure><p>So representation depends on:</p><p>✔ Left context<br> ✔ Right context</p><p>Thus:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/588/1*4AMgfSFQvSoYkUwK53-kOw.png" /></figure><h4>Limitations of ELMo</h4><blockquote>· Sequential processing (slow)</blockquote><blockquote>· LSTMs struggle with long-range dependencies</blockquote><blockquote>· Hard to parallelize</blockquote><p><strong>This led to Transformers.</strong></p><h3>5. 
Transformer-Based Contextual Embeddings</h3><p>Transformers replace recurrence with <strong>self-attention</strong>.</p><p><strong>Core idea:</strong> Each word directly attends to all other words.</p><p><strong>Transformer Input Representation:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IvRrjKCQnb9EivkH2Ck-Xw.png" /></figure><h4>Input Diagram</h4><pre>Sentence:  The   dog   barked   loudly</pre><pre>Token Embeddings:<br>   E(The)<br>   E(dog)<br>   E(barked)<br>   E(loudly)</pre><pre>Positional Embeddings:<br>   P1<br>   P2<br>   P3<br>   P4</pre><pre>Final Input:<br>   X1 = E(The) + P1<br>   X2 = E(dog) + P2<br>   X3 = E(barked) + P3<br>   X4 = E(loudly) + P4</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/312/1*nFbB--dmx-qkabbUu-sMkg.png" /><figcaption>Stacked into matrix</figcaption></figure><h4>Self-Attention Mechanism</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5SpHEOdxoXEiJwn0iS7sVQ.png" /></figure><h4>Self-Attention Formula</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XvUjI2x9GZd-ksXFBgJtHA.png" /></figure><h4>Self-Attention Diagram</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xvmKjXF0Sr3Ottz9wRaQlg.png" /></figure><p>For word “dog” in The dog barked loudly</p><pre>dog attends to:<br><br>The<br>dog<br>barked<br>loudly</pre><p><strong>Visualization:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/734/1*RBrfI4RsJLFV1tN3Y0F5qw.png" /></figure><p>Each arrow weight determined by:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*-6FFvj-Y1eDb38uIBfL-zA.png" /></figure><p>Final representation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/496/1*VgGhTU3NKJEALW_0Ep2QLg.png" /></figure><p>So each word becomes: <strong>Weighted mixture of all words in sentence.</strong></p><h4>Transformer Block</h4><p>Each layer contains:</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/792/1*zp9F2IwVu7J7JRcj_Q2jsg.png" /></figure><p>Stacked L times.</p><p>Final contextual embedding:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/190/1*TFY04LjS6E0THXruuYgqmA.png" /></figure><h3>BERT — Masked Language Model (Bidirectional)</h3><p>Training objective:</p><p>Randomly mask tokens:</p><pre>The dog [MASK] loudly</pre><p>Model predicts masked word.</p><p>Loss:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*aNjJOPk76EGM6g3OQfTtSA.png" /></figure><p>Because model sees both left and right context, it learns <strong>deep bidirectional representations</strong>.</p><h3>GPT — Causal Language Model</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a3obN_OxsV-7IhT8PEiUZA.png" /></figure><h3>BERT vs GPT Diagram</h3><p><strong>BERT (Bidirectional)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ICWquiU4PGfKSWVWYlb9_Q.png" /></figure><p><strong>GPT (Causal)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/1*_1tQH_h-mG0Ty8nNO9tfdA.png" /></figure><h4>Why Contextual Embeddings Dominated NLP</h4><p>They combine:</p><blockquote>✔ Context sensitivity<br> ✔ Large-scale self-supervised learning<br> ✔ Deep semantic modeling<br> ✔ Transfer learning</blockquote><p>Instead of training task-specific models, we:</p><blockquote>1. Pretrain large LM</blockquote><blockquote>2. 
Fine-tune on downstream tasks</blockquote><p>This drastically improved:</p><blockquote>· Question answering</blockquote><blockquote>· Translation</blockquote><blockquote>· Classification</blockquote><blockquote>· Named entity recognition</blockquote><blockquote>· Summarization</blockquote><h4>Geometric Interpretation</h4><p>Unlike Word2Vec: <strong>One word → one fixed point.</strong></p><p>In contextual embeddings: <strong>Each occurrence → different point.</strong></p><p>So “bank” forms multiple clusters:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LODZ9qCzSoHDxkfeVQxVsw.png" /></figure><h3>Autoencoders &amp; Variational Autoencoders</h3><p>Instead of predicting context (Word2Vec) or next token (GPT), we train a model to:</p><blockquote>Compress input → then reconstruct it.</blockquote><p>The compressed representation becomes the <strong>embedding</strong>.</p><h3>1. Basic Autoencoder</h3><p>Mathematical Formulation</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a4G41yrVskUrYtQIJlOrOw.png" /></figure><p><strong>Architecture Diagram:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/730/1*5BNw1H7skS3j95Qb49y_cA.png" /></figure><p><strong>Visually:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/590/1*Xt-KCAmWRE6EBJmRAIBpcA.png" /></figure><p>This is called a <strong>bottleneck architecture</strong>.</p><h4>Why This Produces Embeddings</h4><p>The model is forced to:</p><blockquote>· Compress D-dimensional input</blockquote><blockquote>· Into d-dimensional latent vector</blockquote><blockquote>· Without losing important information</blockquote><p>So: <strong>Z</strong> becomes a <strong>compressed representation of meaning</strong>.</p><h3>Geometric Intuition</h3><p>Suppose input data lies near a low-dimensional manifold:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*dkvsyzbvYpVXUap0SSb02Q.png" /></figure><p>Autoencoder learns:</p><blockquote>→ 
A nonlinear projection onto that manifold.</blockquote><p>Latent space:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/414/1*QiLWr8FCWz3HMKvh0plEKA.png" /></figure><p>So the embedding is a coordinate on the learned manifold.</p><h4>Linear Autoencoder = PCA</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UQL2n9JrAZWWRLfpSaqn1Q.png" /></figure><p>Then minimizing reconstruction error is equivalent to PCA: the learned latent space spans the same subspace as the top d principal components.</p><p>So autoencoders generalize Principal Component Analysis (PCA) to nonlinear embeddings.</p><h3>2. Variational Autoencoder (VAE)</h3><p>Regular autoencoder:</p><blockquote>· Deterministic encoding</blockquote><p>VAE introduces probability.</p><p><strong>Probabilistic Formulation</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8yW4G7tuT60_7iU1ODX7aw.png" /></figure><p>Two terms:</p><blockquote>1. Reconstruction loss</blockquote><blockquote>2. KL divergence regularization</blockquote><p><strong>VAE Diagram</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*Qk_3le6gRLNqq7YxV9WsDg.png" /></figure><p>This forces the latent space to:</p><blockquote>✔ Be smooth<br> ✔ Be continuous<br> ✔ Be structured</blockquote><h4>Autoencoders vs Word2Vec</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DyodVIvcFUSI_0oruGKJ5g.png" /><figcaption>Autoencoders are <strong>general-purpose embedding learners</strong></figcaption></figure><h3>Graph Embeddings</h3><p>Now we move to structured data: graphs.</p><p>A graph: G = (V, E)</p><p>Nodes = entities<br>Edges = relationships</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*NTEgEsOzNrYOqQWF3m48xw.png" /><figcaption>So that connected nodes are close.</figcaption></figure><p>One example we will study is DeepWalk / node2vec.</p><h3>DeepWalk / node2vec</h3><p>Core idea: Treat random walks like sentences.</p><h4>Step 1: Random Walk</h4><p>Example graph:</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/744/1*d_y94pdsbYPfyenGJBFONg.png" /></figure><h4>Step 2: Apply Skip-Gram</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C9JJM2k28bY7KoBFbgdDZw.png" /></figure><h4>Diagram</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*IhfDUj8ZP9L7e9-XLBiQ9w.png" /></figure><h3>Intuition</h3><p>Nodes appearing in similar walks → similar embeddings.</p><p>Captures:</p><blockquote>✔ Community structure<br> ✔ Graph proximity</blockquote>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Designing Scalable RAG Systems Using VectorDB: A Hands-On Walkthrough with ChromaDB]]></title>
            <link>https://medium.com/@saha.soumyadeep90/designing-scalable-rag-systems-using-vectordb-a-hands-on-walkthrough-de9f1eac768d?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/de9f1eac768d</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[vector-store]]></category>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 17:57:29 GMT</pubDate>
            <atom:updated>2026-02-11T20:01:09.012Z</atom:updated>
            <content:encoded><![CDATA[<p>In my previous blog post, I provided a comprehensive overview of vector stores, vector databases, and the internal workings of RAG:</p><p><a href="https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc">https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc</a></p><p>This article focuses on the practical implementation side, offering a hands-on view of how vector databases and vector stores, such as Chroma, actually work in real-world scenarios.</p><h4><strong>What is RAG?</strong></h4><p><strong>RAG (Retrieval-Augmented Generation)</strong> is an architecture that <strong>adds external knowledge</strong> to a Large Language Model (LLM) at <em>query time</em>.</p><p>Instead of relying only on what the model was trained on, RAG:</p><ul><li><strong>retrieves relevant documents</strong></li><li><strong>injects them into the prompt</strong></li><li><strong>then generates the answer</strong></li></ul><h4><strong>Why Is RAG Needed?</strong></h4><p>LLMs alone have problems:</p><ul><li>Hallucination</li><li>Outdated knowledge</li><li>Cannot access private/local data</li></ul><p>RAG mitigates these problems by grounding answers in <strong>real documents</strong>.</p><h4><strong>Core Components of RAG</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ztOUZw88-3zxb0-v0LWKNA.png" /></figure><h4><strong>RAG Working (Step-by-Step)</strong></h4><ol><li>Prepare documents</li><li>Split documents into chunks</li><li>Convert chunks to embeddings</li><li>Store embeddings in a vector database</li><li>User asks a question</li><li>Question is converted to an embedding</li><li>Most similar chunks are retrieved</li><li>Chunks are injected into the LLM prompt</li><li>LLM generates a grounded answer</li></ol><h4><strong>RAG vs Fine-Tuning (Very 
Important)</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZPuLaGIOmyIhpxOpUnutEQ.png" /><figcaption>RAG is a technique that combines information retrieval with language generation to produce context-aware and factually grounded responses.</figcaption></figure><h3>Implementation 1:</h3><h4><strong>RAG + Chroma</strong> is a <em>good</em> way to understand modern LLM apps locally on a MacBook</h4><p>I’ll assume:</p><p>· macOS</p><p>· Python 3.9+</p><p>· No paid APIs (we’ll use <strong>local embeddings</strong>)</p><p>· Simple text files as the knowledge base</p><p><strong>What You’ll Build (Big Picture)</strong></p><p>You’ll build a <strong>local RAG pipeline</strong>:</p><blockquote>Your documents → Embeddings → ChromaDB (vector store)</blockquote><blockquote>User question → Retrieve relevant chunks → Send to LLM → Answer</blockquote><h3><strong>Step 1: Install Ollama (Mac)</strong></h3><blockquote><strong>Install Ollama</strong></blockquote><pre>brew install ollama</pre><blockquote><strong>Start Ollama service</strong></blockquote><pre>ollama serve</pre><p>(Leave this running in one terminal)</p><p>ollama serve <strong>starts the Ollama background service</strong> that:</p><p>· Loads LLMs (LLaMA, Mistral, Phi, etc.)</p><p>· Exposes them via a <strong>local HTTP API</strong></p><blockquote><strong>Pull a model (lightweight + good)</strong></blockquote><pre>ollama pull llama3</pre><p>To chat with the model from the terminal, you can run ollama run llama3; that command sends its requests to the server started by ollama serve.</p><p>You now have a <strong>local LLM</strong> running at: <a href="http://localhost:11434">http://localhost:11434</a></p><h3><strong>Step 2: Install Python Libraries</strong></h3><pre>python3.11 -m venv rag_venv<br><br>source rag_venv/bin/activate<br><br>python3 -m pip install chromadb sentence-transformers langchain langchain-community langchain-text-splitters langchain-ollama</pre><h4>Dependency Usage Table (RAG + Ollama Setup)</h4><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*q14MKi4OgjMl6gh4bZIixg.png" /></figure><h3><strong>Step 3: Sample Documents</strong></h3><p>Create folder:</p><pre>mkdir data</pre><p><strong>data/rag.txt</strong></p><pre>RAG stands for Retrieval Augmented Generation.<br>It combines retrieval with language models.<br>RAG improves accuracy by grounding answers in documents.</pre><p><strong>data/time_dilation.txt</strong></p><pre>The concept of &quot;time dilation&quot; in physics is fascinating. <br>According to Einstein&#39;s theory of relativity, time is not universal; <br>it stretches and compresses based on speed and gravity. <br>If you were to travel in a spaceship at near-light speed for a few years, <br>you would return to Earth to find decades or even centuries had passed. <br>Similarly, time moves slower near massive objects like black holes due to <br>extreme gravity. This means astronauts aboard the International Space Station <br>actually age slightly slower than people on Earth. 
Time is not constant; <br>it is flexible, making the universe far stranger than it appears.</pre><h3><strong>Step 4: Build Vector Store (ChromaDB)</strong></h3><p>Create <strong>build_chroma.py </strong>with the below code</p><pre># Import OS module to work with files and directories<br>import os<br><br># Import loader to read text files<br>from langchain_community.document_loaders import TextLoader<br><br># Import text splitter to break text into chunks<br>from langchain_text_splitters import RecursiveCharacterTextSplitter<br><br># Import embedding model for converting text → vectors<br>from langchain_community.embeddings import HuggingFaceEmbeddings<br><br># Import Chroma vector database<br>from langchain_community.vectorstores import Chroma<br><br># Create empty list to store loaded documents<br>documents = []<br><br># Loop through all files inside the data directory<br>for file in os.listdir(&quot;data&quot;):<br>    # Load each text file<br>    loader = TextLoader(f&quot;data/{file}&quot;)<br>    # Add loaded document to the list<br>    documents.extend(loader.load())<br><br># Initialize text splitter<br>text_splitter = RecursiveCharacterTextSplitter(<br>    chunk_size=200,      # Maximum size of each text chunk (in characters by default)<br>    chunk_overlap=20     # Overlap between chunks, Chunk 1: Characters 0–200, Chunk 2: Characters 180–380, 20 characters overlap, Helps preserve context<br>)<br><br># Split documents into smaller chunks<br>chunks = text_splitter.split_documents(documents)<br><br># Load local embedding model<br>embedding_model = HuggingFaceEmbeddings(<br>    model_name=&quot;all-MiniLM-L6-v2&quot;  # Small and fast embedding model<br>)<br><br># Create Chroma vector database from document chunks<br>vectorstore = Chroma.from_documents(<br>    documents=chunks,             # Text chunks<br>    embedding=embedding_model,    # Embedding function<br>    persist_directory=&quot;chroma_db&quot; # Folder to save the DB<br>)<br><br># Save vector DB 
# to disk<br>vectorstore.persist()<br><br># Confirmation message<br>print(&quot;ChromaDB created successfully&quot;)</pre><p>Run:</p><pre>python build_chroma.py</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*EXMlJ9F3SXZ1iF7RjTfspQ.png" /></figure><h3><strong>Step 5: RAG with Ollama (Retrieval + Generation)</strong></h3><p>Create <strong>rag_ollama.py</strong> with the code below</p><pre># Import embedding model for converting text → vectors<br>from langchain_community.embeddings import HuggingFaceEmbeddings<br><br># Import Chroma vector database<br>from langchain_community.vectorstores import Chroma<br><br># Import Ollama LLM wrapper<br>from langchain_ollama import OllamaLLM<br><br># Load the same embedding model used during indexing<br>embedding_model = HuggingFaceEmbeddings(<br>    model_name=&quot;all-MiniLM-L6-v2&quot;<br>)<br><br># Load the existing Chroma vector database<br>db = Chroma(<br>    persist_directory=&quot;chroma_db&quot;,     # Path to stored DB<br>    embedding_function=embedding_model # Embedding function<br>)<br><br># User question<br>query = &quot;What is RAG?&quot;<br><br># Perform similarity search to retrieve relevant chunks<br>retrieved_docs = db.similarity_search(<br>    query,  # User question<br>    k=2     # Number of top matching chunks<br>)<br><br># Combine retrieved document text into one context string<br>context = &quot;\n&quot;.join([doc.page_content for doc in retrieved_docs])<br><br># Initialize Ollama with LLaMA 3 model<br># Temperature (typically 0 to 2) controls the randomness of the generated text:<br># lower settings (0.0–0.4) give precise, near-deterministic output suited to factual tasks;<br># higher settings (0.7–1.5+) give more diverse, creative output with a higher risk of hallucination.<br>llm = OllamaLLM(<br>    model=&quot;llama3&quot;,   # Name of the model pulled via Ollama<br>    temperature=0.2  # Lower = more factual answers<br>)<br><br># Create RAG prompt<br>prompt = f&quot;&quot;&quot;<br>You are a helpful assistant.<br>Answer the question using ONLY the context below.<br><br>Context:<br>{context}<br><br>Question:<br>{query}<br><br>Answer:<br>&quot;&quot;&quot;<br><br># Send prompt to Ollama LLM and get response<br>response = llm.invoke(prompt)<br><br># Print final answer<br>print(&quot;\nFinal Answer:&quot;)<br>print(response)</pre><p>Run:</p><pre>python rag_ollama.py</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*178NTQeBb2A04WgblTJQCw.png" /></figure><h3>Implementation 2:</h3><h4><strong>FastAPI RAG Server (Ollama + Chroma)</strong></h4><p>FastAPI is a modern Python web framework used to build APIs quickly and efficiently.</p><p>It is mainly used for:</p><ul><li>Building REST APIs</li><li>Serving machine learning models</li><li>Creating backend services</li><li>Microservices</li><li>AI applications (like your RAG app)</li></ul><p><strong>Why Is It Called “Fast”?</strong></p><p>FastAPI is fast because:</p><ol><li>Built on <strong>Starlette</strong> (async framework)</li><li>Uses <strong>Pydantic</strong> (fast data validation)</li><li>Supports async/await</li><li>Very low overhead</li></ol><p>According to the FastAPI documentation, its performance is on par with <strong>Node.js and Go</strong>.</p><p>Now let’s <strong>convert your RAG pipeline into an API</strong>.</p><p><strong>What You Will Build (End Goal)</strong></p><p>A <strong>local RAG system</strong> with:</p><ul><li><strong>Ollama (LLaMA 3)</strong> → LLM</li><li><strong>ChromaDB</strong> → Vector database</li><li><strong>FastAPI</strong> → Backend server</li><li><strong>Local text files</strong> → Knowledge base</li></ul><p>You’ll end with an API:</p><blockquote>POST /ask</blockquote><p>that answers questions using your 
documents.</p><h4>Step 1: <strong>Prerequisites (One Time)</strong></h4><p>Install Python (if not installed)</p><pre>brew install python</pre><p>Check:</p><pre>python3 --version</pre><h4>Step 2: <strong>Install &amp; Set Up Ollama (LLM)</strong></h4><p><strong>Install Ollama</strong></p><pre>brew install ollama</pre><p><strong>Start Ollama Service</strong></p><pre>ollama serve</pre><p>Keep this terminal <strong>running</strong>.</p><p><strong>Download Model</strong></p><p>Open another terminal:</p><pre>ollama pull llama3</pre><p>Ollama now runs locally at:</p><p><a href="http://localhost:11434">http://localhost:11434</a></p><p><strong>Create Virtual Environment (Recommended)</strong></p><pre># Select the interpreter from the Command Palette<br><br>python3.11 -m venv rag_venv_fastapi<br><br>source rag_venv_fastapi/bin/activate</pre><p><strong>Install Python Dependencies</strong></p><pre>python3 -m pip install fastapi uvicorn chromadb sentence-transformers langchain langchain-community langchain-text-splitters langchain-ollama</pre><h4>Step 3: <strong>Project Structure (From Scratch)</strong></h4><p>Create a folder:</p><pre>mkdir rag_db_fastapi<br><br>cd rag_db_fastapi</pre><p>Inside it:</p><pre>rag_db_fastapi/<br>│<br>├── data/<br>│   ├── rag.txt<br>│   └── time_dilation.txt<br>│<br>├── build_chroma.py<br>└── main.py</pre><h4>Step 4: <strong>Create Knowledge Documents</strong></h4><p>Create the <strong>data/ folder</strong></p><pre>mkdir data</pre><p>Create <strong>rag.txt</strong></p><pre>RAG stands for Retrieval Augmented Generation.<br>It combines document retrieval with language models.<br>RAG reduces hallucinations by grounding answers in data.</pre><p>Create <strong>time_dilation.txt</strong></p><pre>The concept of &quot;time dilation&quot; in physics is fascinating. <br>According to Einstein’s theory of relativity, time is not universal; <br>it stretches and compresses based on speed and gravity. 
<br>If you were to travel in a spaceship at near-light speed for a few years, <br>you would return to Earth to find decades or even centuries had passed. <br>Similarly, time moves slower near massive objects like black holes due to <br>extreme gravity. This means astronauts aboard the International Space Station <br>actually age slightly slower than people on Earth. Time is not constant; <br>it is flexible, making the universe far stranger than it appears.</pre><h4>Step 5: <strong>Build Vector Database (Chromadb)</strong></h4><p>Create <strong>build_chroma.py </strong>with the below code</p><pre># Import OS utilities to read files<br>import os<br><br># Import loader to read text files<br>from langchain_community.document_loaders import TextLoader<br><br># Import text splitter to break text into chunks<br>from langchain_text_splitters import RecursiveCharacterTextSplitter<br><br># Import embedding model for converting text → vectors<br>from langchain_community.embeddings import HuggingFaceEmbeddings<br><br># Import Chroma vector database<br>from langchain_community.vectorstores import Chroma<br><br><br># List to store all loaded documents<br>documents = []<br><br># Loop through each file in data directory<br>for file in os.listdir(&quot;data&quot;):<br>    # Load each text file<br>    loader = TextLoader(f&quot;data/{file}&quot;)<br>    documents.extend(loader.load())<br><br># Split documents into smaller chunks<br>text_splitter = RecursiveCharacterTextSplitter(<br>    chunk_size=200,    # Max characters per chunk<br>    chunk_overlap=20  # Overlap to preserve context<br>)<br><br>chunks = text_splitter.split_documents(documents)<br><br># Load embedding model (local, fast)<br>embedding_model = HuggingFaceEmbeddings(<br>    model_name=&quot;all-MiniLM-L6-v2&quot;<br>)<br><br># Create Chroma vector store<br>vectorstore = Chroma.from_documents(<br>    documents=chunks,              # Text chunks<br>    embedding=embedding_model,     # Embedding function<br>    
persist_directory=&quot;chroma_db&quot;  # Folder to store vectors<br>)<br><br># Save vector DB to disk<br>vectorstore.persist()<br><br>print(&quot;ChromaDB created from scratch&quot;)</pre><p><strong>Run Vector DB Creation</strong></p><pre>python build_chroma.py</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*ipIJofj7sJc4ggdFi6o66w.png" /></figure><p>You will now see:</p><p><strong>chroma_db/</strong></p><p>This is your <strong>knowledge base</strong>.</p><h4>Step 6: <strong>Fastapi Rag Server (From Scratch)</strong></h4><p>Create <strong>main.py</strong></p><pre># FastAPI framework<br>from fastapi import FastAPI<br><br># Request body validation<br>from pydantic import BaseModel<br><br># Embedding model<br>from langchain_community.embeddings import HuggingFaceEmbeddings<br><br># Chroma vector store<br>from langchain_community.vectorstores import Chroma<br><br># Ollama LLM wrapper<br>from langchain_ollama import OllamaLLM<br><br># Create FastAPI app<br>app = FastAPI(title=&quot;Local RAG API&quot;)<br><br># Load embedding model (same as used for indexing)<br>embedding_model = HuggingFaceEmbeddings(<br>    model_name=&quot;all-MiniLM-L6-v2&quot;<br>)<br><br># Load ChromaDB from disk<br>vector_db = Chroma(<br>    persist_directory=&quot;chroma_db&quot;,<br>    embedding_function=embedding_model<br>)<br><br># Initialize Ollama LLM<br>llm = OllamaLLM(<br>    model=&quot;llama3&quot;,<br>    temperature=0.2  # Low temperature for factual answers<br>)<br><br># Request schema<br>class QuestionRequest(BaseModel):<br>    question: str<br><br># Health check endpoint<br>@app.get(&quot;/&quot;)<br>def health():<br>    return {&quot;status&quot;: &quot;RAG server running&quot;}<br><br># Main RAG endpoint<br>@app.post(&quot;/ask&quot;)<br>def ask_question(request: QuestionRequest):<br><br>    # Step 1: Retrieve relevant documents<br>    docs = vector_db.similarity_search(<br>        request.question,<br>        k=2<br>    )<br><br>    # Step 2: Combine 
    # retrieved text as context<br>    context = &quot;\n&quot;.join([doc.page_content for doc in docs])<br><br>    # Step 3: Construct RAG prompt<br>    prompt = f&quot;&quot;&quot;<br>    Answer the question using only the context below.<br><br>    Context:<br>    {context}<br><br>    Question:<br>    {request.question}<br><br>    Answer:<br>    &quot;&quot;&quot;<br><br>    # Step 4: Generate answer using Ollama<br>    answer = llm.invoke(prompt)<br><br>    # Step 5: Return response<br>    return {<br>        &quot;question&quot;: request.question,<br>        &quot;answer&quot;: answer,<br>        &quot;context_used&quot;: context<br>    }</pre><h4>Step 7: <strong>Run Everything</strong></h4><p><strong>Start FastAPI</strong></p><pre>python -m uvicorn main:app --reload</pre><p>Running Uvicorn through <strong>python -m</strong> ensures that, with the virtual environment activated, the command:</p><p>· Uses the venv’s Python</p><p>· Uses the venv’s packages</p><p>· Avoids conflicts with globally installed packages</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*GFfdNAshO2cbSxq6i6wYKQ.png" /></figure><p>The --reload flag in the uvicorn main:app --reload command enables <strong>auto-reloading</strong>, which automatically restarts the server whenever code changes are detected in your project. 
It is specifically designed for local development, eliminating the need to manually stop and restart the server every time a code modification is made.</p><p>In the command uvicorn main:app, <strong>main</strong> refers to the Python <strong>module</strong> (the file main.py), and <strong>app</strong> refers to the specific <strong>application object</strong> (e.g., a FastAPI instance) created within that file.</p><h4>Extra Notes:</h4><p><strong>Default Port of FastAPI</strong></p><p>When you run FastAPI with <strong>Uvicorn</strong>:</p><pre>uvicorn main:app</pre><p>Default values:<strong> Host:</strong> 127.0.0.1 (localhost), <strong>Port:</strong> <strong>8000</strong></p><p><strong>Change Host + Port</strong></p><pre>uvicorn main:app --host 0.0.0.0 --port 9000</pre><p><strong>With Auto Reload (Development)</strong></p><pre>uvicorn main:app --reload --port 7000</pre><p><strong>Change Port Programmatically (Less Common)</strong></p><pre>import uvicorn<br><br>if __name__ == &quot;__main__&quot;:<br>    uvicorn.run(<br>        &quot;main:app&quot;,<br>        host=&quot;127.0.0.1&quot;,<br>        port=5050,<br>        reload=True<br>    )</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*25rmlvdpIjdEXfc3JxBbtw.png" /></figure><h4>Step 8: <strong>Open Swagger UI</strong></h4><p>Open in browser: <a href="http://127.0.0.1:8000/docs">http://127.0.0.1:8000/docs</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*K3QfyxKA-2stBpnU9bQsJw.png" /></figure><p>Let’s start testing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*DR8SMwUr1j3AHOrj8feR2w.png" /></figure><p>Response is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*v9r0UnaHp9_KvgKOs-9u_w.png" /></figure><h4><strong>Complete Rag Flow</strong></h4><pre>Text Files<br>   ↓<br>Chunking<br>   ↓<br>Embeddings<br>   ↓<br>ChromaDB<br>   ↓<br>Query Embedding<br>   ↓<br>Similarity Search<br>   ↓<br>Context Injection<br>   ↓<br>Ollama 
LLM<br>   ↓<br>Answer</pre><blockquote><strong>You do NOT need to install ChromaDB separately as a service.</strong><br> In our example, <strong>ChromaDB is running inside your Python process</strong>.</blockquote><p>ChromaDB can work in <strong>two modes</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BimZdeBuxB-9o0A8dO9Tkw.png" /></figure><p><strong>In Our RAG Example: Which One We Are Using?</strong></p><p><strong>Mode 1: You are using PERSISTENT MODE: </strong>Because we wrote</p><pre>Chroma.from_documents(<br>    documents=chunks,<br>    embedding=embedding_model,<br>    persist_directory=&quot;chroma_db&quot;<br>)</pre><p>and later:</p><pre>Chroma(<br>    persist_directory=&quot;chroma_db&quot;,<br>    embedding_function=embedding_model<br>)</pre><p><strong>This means:</strong></p><ul><li>Vectors are stored on disk</li><li>Data survives restarts</li><li>No re-embedding needed every time</li></ul><p>So <strong>this is NOT purely in-memory</strong>.</p><p><strong>Do You Need to Install ChromaDB Separately?</strong></p><blockquote><strong>No separate installation</strong><br> <strong>No server to run</strong><br> <strong>No Docker required</strong></blockquote><p>You only install the <strong>Python library</strong>:</p><pre>pip install chromadb</pre><p>That’s it.</p><p>Chroma runs <strong>embedded inside your app</strong>, like SQLite.</p><h4>Mode 2: <strong>When Is Chroma In-Memory?</strong></h4><p>If you do this:</p><pre>Chroma.from_documents(<br>    documents=chunks,<br>    embedding=embedding_model<br>)</pre><p>(no persist_directory)</p><p>Then:</p><blockquote>Data is stored in RAM</blockquote><blockquote>Lost when Python stops</blockquote><blockquote>Good for testing only</blockquote><p><strong>ChromaDB is like SQLite for vectors.</strong><br> It runs inside your app unless you choose a server-based DB.</p><h4><strong>When Do You Need a Separate Vector DB Server?</strong></h4><p>Only when:</p><blockquote>Huge data (millions 
of vectors)</blockquote><blockquote>Multi-user access</blockquote><blockquote>High availability</blockquote><p>Then you switch to:</p><blockquote>Pinecone</blockquote><blockquote>Weaviate</blockquote><blockquote>Qdrant</blockquote><blockquote>Milvus</blockquote><p>Chroma isn’t the only game in town.<br> Here’s a <strong>clear, practical list of vector databases similar to Chroma</strong>, with notes on <strong>how they’re used</strong>, so you know <em>when</em> to pick what.</p><h4><strong>1. FAISS (Most Common Alternative)</strong></h4><p><strong>Best mental model:</strong> <em>NumPy + vectors</em></p><p><strong>What it is</strong></p><ul><li>Facebook AI Similarity Search</li><li><strong>Library</strong>, not a database server</li></ul><p><strong>Key points</strong></p><ul><li>In-memory by default</li><li>Extremely fast</li><li>No built-in metadata filtering</li><li>No built-in persistence (manual save/load)</li></ul><p><strong>When to use</strong></p><ul><li>Local experiments</li><li>Research</li><li>Single-machine apps</li></ul><h4><strong>2. Qdrant (Closest to “Production Chroma”)</strong></h4><p><strong>Best mental model:</strong> <em>Postgres for vectors</em></p><p><strong>What it is</strong></p><ul><li>Vector DB with <strong>server mode</strong></li><li>Can also run embedded (local)</li></ul><p><strong>Key points</strong></p><ul><li>REST &amp; gRPC APIs</li><li>Strong metadata filtering</li><li>Disk-backed</li><li>Scales well</li></ul><p><strong>When to use</strong></p><ul><li>Medium to large RAG systems</li><li>Production APIs</li></ul><h4><strong>3. 
Weaviate</strong></h4><p><strong>Best mental model:</strong> <em>Search engine + vectors</em></p><p><strong>What it is</strong></p><ul><li>Full vector search platform</li><li>Cloud + self-hosted</li></ul><p><strong>Key points</strong></p><ul><li>GraphQL API</li><li>Built-in embedding support</li><li>Multi-tenant</li><li>Heavy but powerful</li></ul><p><strong>When to use</strong></p><ul><li>Enterprise apps</li><li>Complex schemas</li></ul><h4><strong>4. Milvus</strong></h4><p><strong>Best mental model:</strong> <em>Big data vector warehouse</em></p><p><strong>What it is</strong></p><ul><li>High-scale vector DB</li><li>Kubernetes-friendly</li></ul><p><strong>Key points</strong></p><ul><li>Handles billions of vectors</li><li>Needs more infra</li><li>Used by big companies</li></ul><p><strong>When to use</strong></p><ul><li>Massive datasets</li><li>High throughput systems</li></ul><h4><strong>5. Pinecone (Managed / Cloud)</strong></h4><p><strong>Best mental model:</strong> <em>AWS for vectors</em></p><p><strong>What it is</strong></p><ul><li>Fully managed vector DB</li></ul><p><strong>Key points</strong></p><ul><li>No infra management</li><li>Paid</li><li>Very reliable</li></ul><p><strong>When to use</strong></p><ul><li>Production without infra headache</li><li>SaaS RAG apps</li></ul><h4>6. 
<strong>Redis Vector Search</strong></h4><p><strong>Best mental model:</strong> <em>Redis + vectors</em></p><p><strong>What it is</strong></p><ul><li>Redis with vector indexing</li></ul><p><strong>Key points</strong></p><ul><li>Super fast</li><li>Good for real-time apps</li><li>Limited vector-specific features</li></ul><p><strong>When to use</strong></p><ul><li>Low latency use cases</li><li>Already using Redis</li></ul><p>Although <strong>ChromaDB itself is a vector database library</strong>, there are tools that give you an <strong>application-like interface</strong> to <strong>browse collections, view documents, inspect embeddings and metadata, and run queries</strong> without writing Python code yourself.</p><h4><strong>1. Chroma Explorer (Desktop GUI) — macOS app</strong></h4><p>A native desktop client for <strong>visualizing ChromaDB</strong>:</p><blockquote>Browse collections<br>View documents in each collection<br>See embeddings &amp; metadata<br>Run semantic search with natural language<br>Inspect similarity scores</blockquote><p>Built for macOS with a visual UI — great if you want a <strong>regular app experience</strong> rather than coding.</p><p>The GUI connects to a running Chroma server, so first serve the persisted database. From your project folder:</p><pre>chroma run --path ./chroma_db --port 8001</pre><p>You should see something like:</p><pre>Running Chroma server on http://localhost:8001</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*4ODmTTch9QnaHF1LBFcdLA.png" /></figure><p>Install the application on macOS from its .dmg file.</p><p>Now connect using the settings below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*nkoBhK3SEZo0NNNSU8X1UQ.png" /></figure><p>It will look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*7T6l0ZJjcMBvJG_brJg1kA.png" /></figure><p>I have given a practical demo only of this desktop administration and visualization tool (GUI client) for ChromaDB. Please try the other tools listed below if you are interested.</p><h4><strong>2. 
ChromaDB Viewer (Gradio UI)</strong></h4><p>A Python-based <strong>lightweight web interface</strong> that runs locally with Gradio:</p><blockquote>Connect to any local ChromaDB<br>Browse all collections<br>See vector distances and embeddings<br>Query the database interactively via browser</blockquote><p>To use:</p><ul><li>Install Python dependencies</li><li>Run the viewer script</li><li>Open a browser at a local URL</li></ul><p>Useful if you want a simple browser-based tool without desktop installation.</p><p>A <strong>simple Python server</strong> that shows your local ChromaDB in a browser.</p><h4><strong>3. Chromadb-UI (Web UI)</strong></h4><p>A community-built <strong>web application</strong> for managing ChromaDB:</p><blockquote>Browse and filter results<br>Visual interface instead of coding<br>Run locally via Docker or dev server</blockquote><p>You can clone and run it locally to interact with ChromaDB through a UI.</p><p>Unlike SQL databases, which come with mature GUI clients (e.g., MySQL Workbench, pgAdmin), vector databases store high-dimensional embeddings rather than structured rows. But the tools above let you view:</p><blockquote>Stored text content<br>Embedding vectors<br>Metadata fields<br>Query results<br>Distance / similarity scores</blockquote><p>This gives a <strong>feel similar to inspecting a regular database</strong>, but tailored for vector data.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Positional Encoding Explained Simply]]></title>
            <link>https://medium.com/@saha.soumyadeep90/positional-encoding-explained-simply-9c6b88b5d8ff?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/9c6b88b5d8ff</guid>
            <category><![CDATA[positional-encoding]]></category>
            <category><![CDATA[transformers]]></category>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 14:45:33 GMT</pubDate>
            <atom:updated>2026-02-11T17:27:51.533Z</atom:updated>
            <content:encoded><![CDATA[<p>I’ve already covered the fundamentals of vector stores, vector databases, and the internal workings of RAG in detail in my previous blog:</p><p><a href="https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc">https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc</a></p><p>This article, however, focuses specifically on a deeper and more detailed exploration of <strong>positional encoding</strong> — its intuition, mathematical foundation, and how it works internally within Transformer architectures.</p><p>Let’s Start</p><p>When we read a sentence, word order matters.<br> “The cat chased the mouse” means something very different from “The mouse chased the cat.”</p><p>For humans, understanding word order is natural. But for machine learning models — especially Transformers — this is not automatic.</p><p>Traditional sequence models like RNNs and LSTMs process text one word at a time, so they naturally capture order. However, Transformer models process all words in parallel using a mechanism called <strong>self-attention</strong>. While this makes them extremely powerful and efficient, it also creates a challenge:</p><blockquote>How does a Transformer know which word comes first, second, or last?</blockquote><p>This is where <strong>Positional Encoding</strong> comes in.</p><p>Positional Encoding is a technique used to inject information about the position of each word directly into its embedding. 
By adding positional information, Transformers can understand the structure and order of sequences — allowing them to correctly interpret meaning.</p><p>We are going to dive deep into <strong>Positional Encoding in Transformers</strong>.</p><p><strong>Why transformers need positional encoding (what self-attention can’t do)</strong></p><p><strong>What self-attention is great at : </strong>Self-attention creates <strong>contextual embeddings</strong>.</p><p>Meaning:</p><ul><li>The word “bank” in “river bank” becomes different from “bank account”</li><li>Because attention uses surrounding words to update the representation</li></ul><p>Also it’s <strong>parallel</strong>:</p><ul><li>RNNs read tokens one-by-one (slow)</li><li>Transformers process all tokens together (fast)</li></ul><p><strong>The big problem (order blindness)</strong></p><p>Self-attention, by itself, doesn’t naturally know <em>word order</em>.</p><p>If you shuffle the tokens, attention can still compute relationships… but it doesn’t have a built-in “this came before that” signal.</p><p>So:</p><ul><li>“dog bites man”</li><li>“man bites dog”</li></ul><p>contain the same words, but mean different things.<br> Without position info, the model can get confused.</p><p>Positional Encoding is the mathematical “hack” that fixes this.</p><h4>1. The Evolution of the Solution (First Principles)</h4><p>Let’s see the “First Principles” approach. Let’s trace the logic of how researchers arrived at the final solution.</p><p><strong>Attempt 1: Just Count (Integers)</strong></p><p>Why not just number the words?</p><ul><li>“The” = 1</li><li>“bear” = 2</li><li>“ate” = 3</li><li>…</li></ul><p><strong>The Problem:</strong> These numbers get <strong>unbounded</strong>. If you have a document with 5,000 words, the last word has a value of 5,000. 
Values this large hurt the numerical stability of the neural network (for example, they can cause exploding gradients).</p><h4>Attempt 2: Normalize (0 to 1)</h4><p>Okay, let’s divide by the sentence length so everything is between 0 and 1.</p><ul><li>“The” = 0.1</li><li>“bear” = 0.2</li><li>…</li></ul><p><strong>The Problem:</strong> The “step size” changes depending on sentence length.</p><ul><li>In a 10-word sentence, the distance between words is <strong>0.1</strong>.</li><li>In a 100-word sentence, the distance is <strong>0.01</strong>.</li></ul><p>The model gets confused because “next door neighbor” means different things in different sentences.</p><h4>Attempt 3: One-hot position vectors</h4><p>Position 3 = [0,0,0,1,0,0,…]</p><p><strong>The Problem: no smoothness</strong><br> Neural nets like <em>smooth, continuous</em> signals.<br> One-hot doesn’t tell the model that position 3 is closer to 4 than to 97.<br> It’s all equally “different”.</p><p>So we want:</p><ul><li>bounded values (not exploding)</li><li>smooth / continuous change</li><li>something that helps the model learn <em>relative positions</em></li></ul><h4>Attempt 4: The Sine/Cosine Solution (The Winner)</h4><p>We need a system that is:</p><ol><li><strong>Bounded:</strong> Values stay between -1 and 1.</li><li><strong>Consistent:</strong> The distance between Position 1 and 2 is always the same, whatever the sentence length.</li><li><strong>Deterministic:</strong> No random numbers.</li></ol><p>This is where <strong>Waves</strong> come in.</p><p>PE(pos) = sin(pos)</p><p>But there is a problem: <strong>The periodicity problem → </strong>Sine repeats.</p><p>So:</p><ul><li>sin(0) = 0</li><li>sin(2π) = 0</li><li>sin(4π) = 0</li></ul><p>Different positions can produce the same value → the model might think they are the same position.</p><p><strong>Fix periodicity (part 1): use sine AND cosine together</strong></p><p>Instead of encoding a position as <strong>one number</strong>, encode it as a <strong>2D vector</strong>:</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/390/1*PKZZdtf7CPcqDbt2E5SJvA.png" /></figure><p>That pair behaves like a point on a circle, which is what the image below shows.</p><p>This helps because:</p><ul><li>even if sine repeats, cosine won’t match at the same time (except after a full cycle)</li><li>together they give a stronger signature</li></ul><p>Still periodic, but improved uniqueness.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*5vDej5uqXmFNcqTRRHJCkg.png" /></figure><p><strong>Fix periodicity (part 2): don’t use just one sine/cos pair — use MANY (different frequencies)</strong></p><p>Now comes the “real” transformer positional encoding idea:</p><ul><li>Make the positional encoding a <strong>vector</strong> the same size as the token embedding (e.g., 128, 512, 768 dims)</li><li>Use many sine/cos pairs</li><li>Each pair uses a <strong>different frequency</strong> (some change fast, some change slow)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*ovxlKKiGAuDsicaY11xzWA.png" /></figure><p><strong>The classic sinusoidal positional encoding formula</strong></p><p>PE(pos, 2i) = sin(pos / 10000^(2i / d_model))<br>PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))</p><p><strong>The Math: Frequencies and Wavelengths</strong></p><p>Imagine the Positional Encoding as a set of many dials or clocks, each spinning at a different speed.</p><ul><li><strong>Low Dimensions:</strong> Spin very fast (like a second hand).</li><li><strong>High Dimensions:</strong> Spin very slowly (like an hour hand).</li></ul><p>By looking at the combination of all these hands, you can tell exactly what time (position) it is.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kyU1cDzqrl1Dk8YU2h8ymQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*w32klKD6HBcV8Tz-Ia0XMQ.png" /></figure><p><strong>What you would see:</strong></p><p>· <strong>Left side (Low dimensions):</strong> Rapid flickering (High Frequency).</p><p>· <strong>Right side (High dimensions):</strong> Slow, smooth changes (Low 
Frequency).</p><p>· This pattern is unique for every single row (word position).</p><h4>2. Visualising with Python Code</h4><p>Let’s write the code to visualize this “wobbly” matrix:</p><pre>import numpy as np<br>import matplotlib.pyplot as plt</pre><pre>def get_positional_encoding(seq_len, d_model):<br>    &quot;&quot;&quot;<br>    Generates the Positional Encoding Matrix.<br>    seq_len: Number of words in sentence (e.g., 100)<br>    d_model: Dimensionality of the embedding (e.g., 512); assumed to be even<br>    &quot;&quot;&quot;<br>    # 1. Initialize the matrix<br>    pe = np.zeros((seq_len, d_model))</pre><pre>    # 2. Create the position indices (0, 1, 2, ..., seq_len-1) as a column vector<br>    position = np.arange(seq_len)[:, np.newaxis]</pre><pre>    # 3. Create the division term (the &quot;10000^...&quot; part)<br>    # Computed in log space: exp(-2i * ln(10000) / d_model) = 1 / 10000^(2i / d_model)<br>    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))</pre><pre>    # 4. Apply Sine to even indices<br>    pe[:, 0::2] = np.sin(position * div_term)<br><br>    # 5. Apply Cosine to odd indices<br>    pe[:, 1::2] = np.cos(position * div_term)<br><br>    return pe</pre><pre># Generate and Visualize<br>seq_length = 100<br>d_model = 128<br>pe_matrix = get_positional_encoding(seq_length, d_model)</pre><pre>plt.figure(figsize=(10, 6))<br>plt.imshow(pe_matrix, cmap=&#39;RdBu&#39;, aspect=&#39;auto&#39;)<br>plt.title(&quot;Positional Encoding Matrix&quot;)<br>plt.xlabel(&quot;Embedding Dimension (Depth)&quot;)<br>plt.ylabel(&quot;Word Position (Sequence Length)&quot;)<br>plt.colorbar(label=&quot;Value (-1 to 1)&quot;)<br>plt.show()</pre><h4>3. 
The “Relative Position” Magic</h4><p>This is the coolest part of the math.</p><p>Why did we choose Sine and Cosine? Because of these trigonometric identities:</p><p>sin(x + k) = sin(x)cos(k) + cos(x)sin(k)</p><p>cos(x + k) = cos(x)cos(k) − sin(x)sin(k)</p><p><strong>In simple words:</strong></p><p>If the model knows the position of word A (at pos) and wants to look at word B (at pos+k), it doesn’t need to “re-learn” the position. Each sine/cosine pair at pos+k is a fixed linear combination of the same pair at pos, so the model can just apply a <strong>Rotation</strong> (a linear matrix multiplication that depends only on k) to get from A to B.</p><p>This allows the Transformer to easily learn concepts like <em>“pay attention to the word 3 steps behind me”</em> regardless of whether “me” is at the start or end of the sentence.</p><h4>4. Final Architecture: Addition</h4><p>How do we combine this with the word meaning? We simply <strong>Add</strong> them.</p><pre># Pseudo-code for the final step in a Transformer<br>word_embeddings = embedding_layer(input_words) # Shape: [Batch, Seq_Len, 512]<br>pos_encodings = get_positional_encoding(Seq_Len, 512) # Shape: [Seq_Len, 512]</pre><pre># Crucial Step: Direct Addition (the positional matrix is broadcast across the batch)<br>final_input = word_embeddings + pos_encodings</pre><p>We add them because the “Word Meaning” is like the content, and “Positional Encoding” is like the timestamp. Adding them allows the model to separate “What” (content) from “Where” (position) using its internal math.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RAG, Vector Stores And Positional Encoding Explained Simply And With A Practical Guide]]></title>
            <link>https://medium.com/@saha.soumyadeep90/vector-stores-positional-encoding-and-rag-explained-simply-and-with-a-practical-guide-dea70512f6fc?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/dea70512f6fc</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[vector-store]]></category>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Sun, 25 Jan 2026 17:57:43 GMT</pubDate>
            <atom:updated>2026-03-15T19:33:59.252Z</atom:updated>
            <content:encoded><![CDATA[<h3>RAG And Vector Stores Explained Simply And With A Practical Guide</h3><p>Large Language Models feel magical — they recommend movies you’ll love, understand context in long conversations, and answer questions using your own documents.</p><p>But under the hood, none of this is magic.</p><p>Three core ideas make these systems work:<br> <strong>Vector Stores</strong>, <strong>Positional Encoding</strong>, and <strong>Retrieval-Augmented Generation (RAG)</strong>.</p><p>In this article, we’ll build an intuitive understanding of all three — starting from first principles and moving toward practical implementations. We’ll see:</p><ul><li>why keyword matching fails and how <strong>vector embeddings</strong> let machines understand <em>meaning</em></li><li>why transformers are naturally <strong>order-blind</strong>, and how <strong>positional encoding</strong> mathematically injects sequence information</li><li>how <strong>RAG</strong> combines vector search with LLMs to reduce hallucinations and unlock private, up-to-date knowledge</li></ul><p>Along the way, we’ll use simple analogies, diagrams, and Python examples with tools like <strong>LangChain</strong> and <strong>Chroma</strong> — no hand-waving, no unnecessary math.</p><p>If you’ve ever wondered <em>how</em> modern AI systems actually retrieve information, understand word order, and ground their answers in facts, this article will connect the dots.</p><h3><strong>Vector Stores</strong></h3><p>We will focus on moving away from old-school “keyword matching” toward “semantic understanding” (understanding meaning).</p><h4><strong>1. The Problem: Keywords vs. 
Meaning</strong></h4><p><strong>The “keyword” way (old-school)</strong></p><p>A basic recommender might work like this:</p><ul><li>You liked a movie → take its <strong>plot text</strong></li><li>Find other plots that share <strong>similar words</strong></li><li>Recommend those</li></ul><p><strong>Problem:</strong> shared words don’t always mean shared meaning.</p><p>Example:</p><ul><li>Movies like <em>Kabhi Alvida Naa Kehna</em> and <em>My Name is Khan</em> can feel similar in theme/emotion…</li><li>…but their plots may not share many exact keywords.<br> So keyword matching can miss good recommendations.</li></ul><p><strong>The “meaning” way (semantic)</strong></p><p>Instead of comparing words, we compare <strong>meaning</strong>.</p><p>To do that, we convert each plot into a <strong>vector</strong> (a list of numbers) called an <strong>embedding</strong>.</p><p>Then:</p><ul><li>Similar meaning → vectors end up <strong>near each other</strong></li><li>Different meaning → vectors are <strong>far apart</strong></li></ul><p><strong>Mini diagram: keyword vs semantic</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vcQaipaog9YqMap1dECTQw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*U8pVvKCbFS2IppdSNc0WGQ.png" /></figure><h4><strong>2.</strong> <strong>The Solution: Embeddings (Vectors)</strong></h4><p><strong>Please Note: </strong>There are many approaches to creating embeddings, and the most widely used today is Transformer-based. 
Please see my blog post<strong> </strong><a href="https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df"><strong>https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df</strong></a> if you want to learn more.</p><p>An <strong>embedding model</strong> converts text into numbers:</p><p><strong>“movie plot text” → [0.12, -0.44, 0.98, …] (hundreds/thousands of numbers)</strong></p><p>To compare “plots” or “meanings,” computers need numbers. We convert text (like a movie plot) into a list of numbers called a <strong>Vector</strong> or <strong>Embedding</strong>.</p><ul><li><strong>What is it?</strong> A long list of floating-point numbers (e.g., [0.12, -0.98, 0.55…]).</li><li><strong>How it works:</strong> Similar concepts end up close to each other in mathematical space. The vector for “King” will be mathematically close to “Queen” and “Royalty.”</li></ul><p>This allows us to perform <strong>Semantic Search</strong>. We are no longer matching words; we are matching <em>meanings</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*ptDx-K77ulwmz_-sMoh1Sg.png" /></figure><h4><strong>3. What is a Vector Store?</strong></h4><p>Once we convert our movie plots into vectors, we need a place to save them. A spreadsheet or a plain SQL table is not designed for fast similarity search over such high-dimensional vectors. You need a <strong>Vector Store</strong>.</p><p><strong>Key features:</strong></p><ol><li><strong>Storage:</strong></li></ol><ul><li><strong>In-Memory:</strong> Fast but temporary (data is lost when the program stops). Good for testing.</li><li><strong>On-Disk:</strong> Slower but permanent. Good for production.</li></ul><ol><li><strong>Indexing:</strong> This is the “secret sauce” for speed. 
Instead of comparing your query to <em>every single movie</em> (which gets slow at scale), the store uses an index to quickly find the closest matches.</li><li><strong>Clustering:</strong> The index groups similar vectors into clusters.</li></ol><ul><li>Imagine a library. You don’t look at every book. You go to the “Sci-Fi” section.</li><li>In a Vector Store, we calculate a <strong>Centroid</strong> (the center point of a cluster). If your search query is far from that Centroid, we ignore that whole cluster. This makes searching massive datasets very fast, at the cost of occasionally missing a match that sits in a skipped cluster.</li></ul><p>A <strong>vector store</strong> is basically a system that can:</p><p>1. <strong>Store vectors</strong> (embeddings)</p><p>2. <strong>Store metadata</strong> (like title, year, genre, etc.)</p><p>3. <strong>Quickly retrieve the most similar vectors</strong> when you query</p><p>So for movie recommendations:</p><p>· Each movie plot → embedding vector</p><p>· Store vectors in a vector store</p><p>· User gives a movie / preference → embed that too</p><p>· Do a similarity search → return closest movies</p><p><strong>Core idea:</strong> “Recommendation = nearest neighbors in embedding space.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*e_j-XEIhht2cV73oMdV1nA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sPXQrMGvtLk_dLwHgFYYbg.png" /></figure><h4><strong>4. Vector Store vs. Vector Database</strong></h4><p>It is worth distinguishing between a simple “Store” and a full “Database.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6DNzUeQacW6xTDgY1i6-fg.png" /></figure><p><strong>Chroma DB</strong>, as an example, bridges this gap. 
It is lightweight and open-source but offers features like “Collections” (similar to tables in SQL) and persistent storage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nIpPtPNQSsYjlZq-TkJhcQ.png" /></figure><h4><strong>Similarity: how do we decide “close”?</strong></h4><p>A very common metric is <strong>cosine similarity</strong>:</p><ul><li>Think of vectors like arrows</li><li>Cosine similarity asks: <strong>“Are these arrows pointing in the same direction?”</strong></li><li>It focuses on <strong>direction (meaning)</strong> more than length</li></ul><p>Cosine similarity is defined as the dot product of two vectors divided by the product of their magnitudes. It ranges from -1 (opposite directions) to 1 (same direction).</p><p><strong>Tiny diagram (vector similarity intuition)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*73WY4wg-VPyyeY3jUSTv6g.png" /></figure><h4><strong>Why we need indexing (otherwise it gets painfully slow)</strong></h4><p>If you have <strong>N movies</strong>, a naive search compares your query to <strong>every</strong> movie vector.</p><ul><li>1,000 movies → okay-ish</li><li>1,000,000 movies → not okay</li></ul><p>So vector stores use <strong>indexes</strong> (special data structures) to search faster.</p><p><strong>The clustering/centroid idea</strong></p><p>You can cluster vectors into groups.<br> Each cluster has a <strong>centroid</strong> (the “center vector”).</p><p>Query time:</p><ol><li>Find nearest centroid(s)</li><li>Search inside those clusters only, instead of searching everything.</li></ol><p>This is <strong>basically the idea behind IVF-style indexing (inverted file indexes)</strong> where vectors are assigned to clusters, and search probes a subset of clusters using something like <em>nprobe</em>.</p><h3>Interpretation:</h3><p>A vector database does <strong>not understand words like humans do</strong>.</p><p>It differentiates between <strong>“happy”</strong> and <strong>“enjoy”</strong> by:</p><p>Converting them into numerical vectors and measuring their 
distance in high-dimensional space.</p><p>If the vectors are close → meanings are similar.<br> If far apart → meanings are different.</p><p><strong>Step 1: Words Become Vectors</strong></p><p>Before storing in a vector DB, text goes through an <strong>embedding model</strong>.</p><p>Example:</p><p>“happy” → [0.12, -0.45, 0.88, …] (384 dimensions)</p><p>“enjoy” → [0.10, -0.40, 0.85, …]</p><p>“sad” → [-0.60, 0.22, -0.90, …]</p><p>Each word becomes a <strong>point in high-dimensional space</strong>.</p><p><strong>Step 2: Semantic Similarity = Distance</strong></p><p>Vector DBs use math like:</p><ul><li>Cosine similarity</li><li>Euclidean distance</li><li>Dot product</li></ul><p>If two vectors are:</p><ul><li>Very close → similar meaning</li><li>Far apart → different meaning</li></ul><p><strong>Why Are “happy” and “enjoy” Close?</strong></p><p>Because embedding models are trained on <strong>massive text corpora</strong>.</p><p>They learn patterns like:</p><p>· “I am happy”</p><p>· “I enjoy this”</p><p>· “She felt happy”</p><p>· “She enjoyed the event”</p><p>The model statistically learns that:</p><p>happy ≈ enjoy ≈ joyful ≈ delighted</p><p>So their vectors end up near each other.</p><p><strong>Important: The Vector DB Does NOT Understand Meaning</strong></p><p>The embedding model does the semantic learning.</p><p>The vector DB only does:</p><pre>Store vectors<br>+<br>Compute distance</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*NLZT-Jjc2vW239ekpbAO7A.png" /></figure><p><strong>Why This Works Better Than Keyword Search</strong></p><p>Keyword search:</p><pre>Search: enjoy<br><br>Document: happy<br><br>→ No match ❌</pre><p>Vector search:</p><pre>Search: enjoy<br><br>Document: happy<br><br>→ Similar vector → Match ✅</pre><p><strong>5. Practical Implementation: Coding with LangChain &amp; Chroma</strong></p><p>Let’s look at how to build this in Python. 
We will use LangChain to manage the logic and Chroma as our database.</p><p><strong>Step A: Setup and Ingestion</strong></p><pre>pip install -U langchain-chroma langchain-openai langchain-core chromadb</pre><p>First, we need to import our tools and set up the “Embedding Function” (the brain that turns text into numbers).</p><pre># Import necessary libraries (these match the packages installed above)<br>from langchain_chroma import Chroma<br>from langchain_openai import OpenAIEmbeddings<br>from langchain_core.documents import Document<br><br># 1. Initialize the Embedding Model<br># This converts text like &quot;A story about love&quot; into [0.01, 0.45, ...]<br># OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable<br>embeddings = OpenAIEmbeddings()<br><br># 2. Prepare our Movie Data (The &quot;Documents&quot;)<br>movie_plots = [<br>    Document(page_content=&quot;A man embarks on a journey to find his lost love across borders.&quot;, metadata={&quot;title&quot;: &quot;Movie A&quot;, &quot;id&quot;: 1}),<br>    Document(page_content=&quot;Space rangers fight an alien invasion on Mars.&quot;, metadata={&quot;title&quot;: &quot;Movie B&quot;, &quot;id&quot;: 2}),<br>    Document(page_content=&quot;A romantic drama about a couple separating.&quot;, metadata={&quot;title&quot;: &quot;Movie C&quot;, &quot;id&quot;: 3})<br>]<br><br># 3. Create the Vector Store (Chroma)<br># We tell it where to save data (persist_directory) so we don&#39;t lose it.<br># Explicit string IDs let us update or delete specific documents later.<br>vector_db = Chroma.from_documents(<br>    documents=movie_plots,<br>    embedding=embeddings,<br>    ids=[&quot;1&quot;, &quot;2&quot;, &quot;3&quot;],<br>    persist_directory=&quot;./chroma_db_storage&quot;<br>)<br><br>print(&quot;Movies stored successfully!&quot;)</pre><p><strong>Step B: Similarity Search</strong></p><p>Now, let’s find a recommendation. 
If a user likes “heartbreak stories,” we query the database.</p><pre># The user&#39;s query<br>query = &quot;sad love story about separation&quot;<br><br># Perform Similarity Search<br># k=1 means &quot;give me the top 1 most similar movie&quot;<br>docs = vector_db.similarity_search(query, k=1)<br><br>print(f&quot;Recommended Movie: {docs[0].metadata[&#39;title&#39;]}&quot;)<br>print(f&quot;Plot Summary: {docs[0].page_content}&quot;)<br><br># The expected result is &quot;Movie C&quot;, because its meaning is closest to &quot;separation&quot; and &quot;sad love&quot;.</pre><p><strong>Step C: CRUD Operations (Update &amp; Delete)</strong></p><p>Managing data (create, read, update, delete) is just as important as searching it.</p><p><strong>Updating a Document:</strong> In Chroma, updating requires the document’s ID.</p><pre># Updating the plot of Movie A<br># (assumes the document was stored with the string ID &quot;1&quot;)<br>updated_movie = Document(<br>    page_content=&quot;A man travels to find his lost brother, not his love.&quot;,<br>    metadata={&quot;title&quot;: &quot;Movie A&quot;, &quot;id&quot;: 1}<br>)<br><br># Use the update function provided by the DB wrapper<br>vector_db.update_document(document_id=&quot;1&quot;, document=updated_movie)<br>print(&quot;Movie A updated.&quot;)</pre><p><strong>Deleting a Document:</strong> If a movie is removed from the catalog, we delete its vector.</p><pre># Delete the movie with ID 2 (The space movie)<br>vector_db.delete(ids=[&quot;2&quot;])<br>print(&quot;Movie B deleted.&quot;)</pre><p><strong>Where RAG fits into this (quick connection)</strong></p><p>Even though this particular section is recommendation-focused, <strong>vector stores are also the main “Retrieval” piece of RAG</strong>.</p><p><strong>RAG = Retrieval-Augmented Generation</strong></p><ul><li>Retrieve relevant documents (using vector store)</li><li>Give them to the LLM as context</li><li>LLM answers using those docs</li></ul><h3><strong>Retrieval-Augmented Generation (RAG): What It Is and How It Works</strong></h3><p>RAG 
is a technique that reduces AI “hallucination” (making things up) and gives the model access to your private data. It is an architecture that <strong>adds external knowledge</strong> to a Large Language Model (LLM) at <em>query time</em>.</p><p>Instead of relying only on what the model was trained on, RAG:</p><ul><li>retrieves relevant documents</li><li>injects them into the prompt</li><li>then generates the answer</li></ul><h4><strong>1. What is RAG? (The “Open Book Exam” Analogy)</strong></h4><p>Imagine you are taking a very hard history exam.</p><ul><li><strong>Standard LLM (ChatGPT):</strong> You have to answer purely from memory. If you studied 2 years ago, you won’t know about events that happened yesterday. You might also “guess” if you aren’t sure.</li><li><strong>RAG:</strong> You are allowed to take a textbook into the exam. When a question comes up, you <strong>Retrieve</strong> the relevant page, read it, and then <strong>Generate</strong> your answer.</li></ul><p><strong>RAG</strong> stands for:</p><p>· <strong>R</strong>etrieval: Find the right data.</p><p>· <strong>A</strong>ugmentation: Add that data to the user’s prompt.</p><p>· <strong>G</strong>eneration: Let the AI write the answer using that data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Sm5KJOOBsFrX5YqTaPrR2w.png" /></figure><h4><strong>2. Why do we need it? (The Problems)</strong></h4><p>Standard LLMs have three major problems:</p><ol><li><strong>Knowledge Cut-off:</strong> They don’t know recent news (e.g., “Who won the game last night?”).</li><li><strong>Private Data:</strong> They don’t know your company’s internal emails or documents.</li><li><strong>Hallucination:</strong> They can confidently make things up when they don’t know the answer.</li></ol><p><strong>Solution:</strong> RAG addresses all three by giving the model relevant facts to read before answering.</p><h4><strong>3. RAG vs. 
Fine-Tuning</strong></h4><p>A common question is: “Why not just train (fine-tune) the model on my data?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xw6Z5uZ1sua1Sdx74s1M3w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Wt0ayar_qcMaImhmgKWgBg.png" /></figure><h4><strong>4. How RAG Works: The 4 Steps</strong></h4><p>This is the core technical part. RAG is a pipeline.</p><p><strong>Step 1: Ingestion &amp; Indexing (Preparing the Data)</strong></p><p>Before we can search our documents, we need to prepare them.</p><ol><li><strong>Load:</strong> Read PDF, Text, or Webpage.</li><li><strong>Split (Chunking):</strong> LLMs can’t read a 500-page book at once. We cut the text into small “chunks” (e.g., 500 words each).</li><li><strong>Embed:</strong> Convert those text chunks into numbers (Vectors), just as we did in the Vector Stores section!</li></ol><blockquote><strong>Please Note: </strong>There are many approaches to creating embeddings, and the most widely used today is Transformer-based. Please see my blog post<strong> </strong><a href="https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df"><strong>https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df</strong></a> if you want to learn more.</blockquote><p>4. <strong>Store:</strong> Save them in a Vector Database.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SemgysxoyuNubISUSUPD8g.png" /></figure><p><strong>Python Code for Step 1:</strong></p><pre># Requires: pip install langchain-community langchain-text-splitters<br>from langchain_community.document_loaders import TextLoader<br>from langchain_text_splitters import RecursiveCharacterTextSplitter<br>from langchain_openai import OpenAIEmbeddings<br>from langchain_chroma import Chroma<br><br># 1. Load the data<br>loader = TextLoader(&quot;my_private_document.txt&quot;)<br>documents = loader.load()<br><br># 2. 
Split the text (Chunking)<br># We split into chunks of 1000 characters<br>text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)<br>docs = text_splitter.split_documents(documents)<br><br># 3. &amp; 4. Embed and Store<br># This creates the Vector Database automatically<br>db = Chroma.from_documents(docs, OpenAIEmbeddings())<br><br>print(&quot;Data stored in Vector Database!&quot;)</pre><p><strong>Step 2: Retrieval</strong></p><p>When a user asks a question (e.g., <em>“What is our refund policy?”</em>), the system does a <strong>Semantic Search</strong>. It compares the numbers (vector) of the user’s question with the numbers of all the saved chunks and picks the few most similar chunks (for example, the top 3).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OCCmPeFcFnu3o9AvgAVe-w.png" /></figure><p><strong>Types Of Retrieval:</strong></p><p><strong>1. Retrieve Multiple Chunks (Top-K Retrieval): Most Common</strong></p><p>Instead of sending one chunk to the LLM, we retrieve <strong>multiple relevant chunks</strong>.</p><pre>Example: Retriever → top_k = 5<br><br>So the prompt may contain:<br><br>Chunk 2<br>Chunk 5<br>Chunk 9<br>Chunk 11<br>Chunk 14</pre><p>Even if these chunks are far apart in the document, the LLM can combine them and reconstruct the <strong>global meaning</strong>.</p><p>Example in LangChain:</p><pre>retriever = db.as_retriever(search_kwargs={&quot;k&quot;: 5})</pre><p><strong>2. Larger Chunk Size</strong></p><p>Use bigger chunks so each chunk contains more context.</p><p>Example:</p><pre>chunk_size = 800<br>chunk_overlap = 100</pre><p><strong>Pros: </strong>More context inside each chunk</p><p><strong>Cons: </strong>Fewer precise matches in retrieval</p><p><strong>3. 
Parent–Child Chunking (Hierarchical Retrieval)</strong></p><p>This is a <strong>powerful RAG technique</strong>.</p><p><strong>Process:</strong></p><blockquote>Split document into <strong>large parent chunks</strong></blockquote><blockquote>Split parents into <strong>smaller child chunks</strong></blockquote><blockquote>Retrieval happens on <strong>child chunks</strong></blockquote><blockquote>When retrieved → return the <strong>full parent chunk</strong></blockquote><p>Example:</p><pre>Parent chunk: 1500 tokens<br>Child chunks: 200 tokens</pre><p>So retrieval finds precise pieces, but the LLM receives the <strong>larger parent context</strong>.</p><p><strong>LangChain example concept:</strong> ParentDocumentRetriever</p><p><strong>4. Document-Level Metadata</strong></p><p>Store metadata with chunks.</p><p>Example:</p><pre>chunk<br>├─ text<br>├─ document_id<br>├─ section<br>└─ page_number</pre><p>When a chunk is retrieved, the system can also fetch all chunks from the same section.</p><p>This helps reconstruct <strong>global context</strong>.</p><p><strong>5. Sliding Window Retrieval</strong></p><p>When one chunk is retrieved, also return <strong>neighbor chunks</strong>.</p><p>Example:</p><pre>Retrieved chunk: 7<br>Also include: 6 and 8</pre><p>So the final context becomes:</p><pre>Chunk 6<br>Chunk 7<br>Chunk 8</pre><p>This expands context automatically.</p><p><strong>Python Code for Step 2:</strong></p><pre>query = &quot;What is the refund policy?&quot;<br><br># Search the DB for the 2 most relevant chunks<br>relevant_docs = db.similarity_search(query, k=2)<br><br>print(f&quot;Found snippet: {relevant_docs[0].page_content}&quot;)</pre><p><strong>Step 3: Augmentation</strong></p><p>We take the <strong>User Query</strong> and stick the <strong>Retrieved Data</strong> right next to it. We create a “Mega Prompt” behind the scenes.</p><p><strong>The Prompt looks like this:</strong></p><p>“You are a helpful assistant. 
Answer the user’s question using ONLY the context provided below.</p><p><strong>Context:</strong> [The refund policy is 30 days…] (This is the retrieved chunk)</p><p><strong>User Question:</strong> What is the refund policy?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wViKMrPQvzH31kktRFGE7w.png" /></figure><p><strong>Step 4: Generation</strong></p><p>The LLM reads the Mega Prompt. Because the answer is right there in the context, it can generate an answer grounded in your documents instead of guessing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S6ffBosDUvcbZWjoxD9sYw.png" /></figure><p><strong>Python Code for Step 3 &amp; 4 (The Full Chain):</strong></p><pre>from langchain.chains import RetrievalQA  # legacy chain; newer LangChain releases may move or replace it<br>from langchain_openai import OpenAI<br><br># Initialize the LLM<br>llm = OpenAI()<br><br># Create the RAG Chain<br>qa_chain = RetrievalQA.from_chain_type(<br>    llm=llm,<br>    chain_type=&quot;stuff&quot;,  # &quot;Stuff&quot; simply means stuffing the context into the prompt<br>    retriever=db.as_retriever()<br>)<br><br># Run the chain (the answer is returned under the &quot;result&quot; key)<br>response = qa_chain.invoke({&quot;query&quot;: &quot;What is the refund policy?&quot;})<br><br>print(response[&quot;result&quot;])</pre><blockquote>In RAG systems, <strong>global context</strong> means the model can understand relationships across the <strong>entire document</strong>, not just neighboring chunks. 
Since simple chunk overlap only preserves <strong>local context</strong>, several techniques are used to recover <strong>global context</strong>.</blockquote><blockquote><strong><em>Embeddings themselves do NOT solve the neighboring vs global context problem.</em></strong></blockquote><p>Embeddings only convert <strong>text → vectors</strong> so that <strong>similar pieces of text are close in vector space</strong>.</p><p>The <strong>global context problem is mainly solved by retrieval strategies</strong>, not by the embedding type alone.</p><p>However, <strong>some embedding models capture semantic relationships better than others</strong>, which helps retrieve <strong>relevant chunks from anywhere in the document</strong>, indirectly helping global context.</p><p>Below is a <strong>clear list of the main embedding types you will encounter in RAG systems</strong>, with <strong>code examples and when to use them</strong>.</p><h4>1. OpenAI Embeddings</h4><p>A common choice in production.</p><pre>from langchain_openai import OpenAIEmbeddings<br>embeddings = OpenAIEmbeddings(<br>    model=&quot;text-embedding-3-small&quot;<br>)</pre><p><strong>Usage</strong></p><blockquote>RAG systems</blockquote><blockquote>semantic search</blockquote><blockquote>chatbots with private data</blockquote><p><strong>Pros</strong></p><blockquote>High accuracy<br>Optimized for retrieval<br>No local GPU needed</blockquote><p><strong>Cons</strong></p><blockquote>Paid API<br>Requires internet</blockquote><p><strong>Working</strong></p><blockquote>Uses large transformer models trained on massive datasets.</blockquote><blockquote>Converts text into <strong>high-dimensional dense vectors</strong> (1536 dimensions by default for this model).</blockquote><blockquote>Similar meaning → vectors close in vector space.</blockquote><p><strong>How it helps global context</strong></p><blockquote>High semantic understanding.</blockquote><blockquote>Even if information is <strong>far apart in the document</strong>, similar meaning 
vectors allow retrieval of relevant chunks.</blockquote><p><strong>Example:</strong></p><pre>Document:<br>Chunk1 → Introduction to AI<br>Chunk10 → Applications of AI<br><br>Query:<br>&quot;What are AI applications?&quot;</pre><p>Embedding similarity retrieves <strong>Chunk10 even if far away</strong>.</p><h4>2. HuggingFace Embeddings (Local Models)</h4><p>Useful for local testing and study purposes.</p><pre># Requires: pip install langchain-huggingface sentence-transformers<br>from langchain_huggingface import HuggingFaceEmbeddings<br>embeddings = HuggingFaceEmbeddings(<br>    model_name=&quot;sentence-transformers/all-MiniLM-L6-v2&quot;<br>)</pre><p><strong>Usage</strong></p><blockquote>local RAG systems</blockquote><blockquote>private data</blockquote><blockquote>offline applications</blockquote><p><strong>Pros</strong></p><blockquote>Free<br>Runs locally<br>Many models available</blockquote><p><strong>Cons</strong></p><blockquote>Can be less accurate than large hosted models, depending on the model chosen</blockquote><p>Examples of HF embedding models:</p><ul><li>all-MiniLM-L6-v2</li><li>all-mpnet-base-v2</li><li>bge-large</li><li>e5-large</li></ul><p><strong>Working</strong></p><blockquote>Based on <strong>BERT-style sentence transformers</strong>.</blockquote><blockquote>Uses <strong>contrastive learning</strong>:</blockquote><blockquote>Similar sentences → closer vectors</blockquote><blockquote>Different sentences → farther vectors.</blockquote><p><strong>Example training idea:</strong></p><pre>Sentence A: &quot;Cat is an animal&quot;<br>Sentence B: &quot;Dog is an animal&quot;<br>→ embeddings placed close</pre><p><strong>Global context benefit</strong></p><blockquote>Retrieves <strong>semantically similar chunks</strong>, even if wording differs.</blockquote><p><strong>Example</strong></p><pre>Query:<br>&quot;Neural network training&quot;<br><br>Chunk:<br>&quot;Backpropagation is used to train deep learning models&quot;<br>Embedding similarity connects them.</pre><h4>3. 
Cohere Embeddings</h4><p>Another cloud embedding provider.</p><pre># Requires: pip install langchain-cohere (and a COHERE_API_KEY environment variable)<br>from langchain_cohere import CohereEmbeddings<br>embeddings = CohereEmbeddings(<br>    model=&quot;embed-english-v3.0&quot;<br>)</pre><p><strong>Usage</strong></p><blockquote>enterprise search</blockquote><blockquote>semantic similarity</blockquote><blockquote>document clustering</blockquote><p><strong>Pros</strong></p><blockquote>High-quality embeddings<br>Good multilingual support (via the multilingual models)</blockquote><p><strong>Cons</strong></p><blockquote>Paid API</blockquote><p><strong>Working</strong></p><p>Trained specifically for:</p><blockquote>semantic search</blockquote><blockquote>clustering</blockquote><blockquote>retrieval tasks</blockquote><p><strong>Global context benefit</strong></p><blockquote>Better <strong>semantic similarity scoring</strong>, enabling retrieval of relevant chunks from anywhere.</blockquote><h4>4. Instructor Embeddings</h4><p>Instruction-based embeddings.</p><pre>from langchain_community.embeddings import HuggingFaceInstructEmbeddings<br>embeddings = HuggingFaceInstructEmbeddings(<br>    model_name=&quot;hkunlp/instructor-large&quot;<br>)</pre><p><strong>Usage</strong></p><blockquote>Used when embeddings must <strong>adapt to different tasks</strong>.</blockquote><p>Example:</p><pre>Instruction: Represent the document for retrieval<br>Text: &quot;Machine learning is...&quot;</pre><p>These models embed <strong>instruction + text together</strong>, which can improve retrieval performance.</p><p><strong>Pros</strong></p><blockquote>Task-aware embeddings<br>Better semantic understanding</blockquote><p><strong>Working</strong></p><blockquote>Embeddings include an <strong>instruction + text</strong>.</blockquote><p><strong>Example</strong></p><pre>Instruction: Represent document for retrieval<br>Text: Machine learning models learn patterns<br><br>The model learns to create vectors specific to the task.</pre><p><strong>Global context benefit</strong></p><blockquote>Better task-specific 
embeddings improve retrieval accuracy across the document.</blockquote><p>Example tasks:</p><ul><li>search</li><li>clustering</li><li>question answering</li></ul><h4>5. Sentence Transformer Embeddings</h4><p>A popular family of models based on <strong>BERT architecture</strong>.</p><p>Example models:</p><blockquote><em>all-MiniLM</em></blockquote><blockquote><em>mpnet</em></blockquote><blockquote><em>sentence-t5</em></blockquote><p>Sentence transformers generate <strong>sentence-level embeddings</strong> for similarity tasks.</p><p>Example:</p><pre>from sentence_transformers import SentenceTransformer<br><br>model = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)<br>embedding = model.encode(&quot;Hello world&quot;)</pre><p><strong>Usage</strong></p><blockquote>semantic search</blockquote><blockquote>document similarity</blockquote><blockquote>recommendation systems</blockquote><h4>6. Google / Gemini Embeddings</h4><p>Google provides embeddings through its AI APIs.</p><pre>from langchain_google_genai import GoogleGenerativeAIEmbeddings<br><br>embeddings = GoogleGenerativeAIEmbeddings(<br>    model=&quot;models/embedding-001&quot;<br>)</pre><p><strong>Usage</strong></p><blockquote>Google ecosystem</blockquote><blockquote>large-scale enterprise search</blockquote><h4>7. 
BGE Embeddings (BAAI)</h4><p>Very strong open-source embeddings.</p><p>Example models:</p><blockquote><em>bge-small</em></blockquote><blockquote><em>bge-large</em></blockquote><blockquote><em>bge-m3</em></blockquote><pre>from langchain_community.embeddings import HuggingFaceBgeEmbeddings</pre><p><strong>Usage</strong></p><blockquote>high-quality retrieval</blockquote><blockquote>multilingual search</blockquote><p>These models perform extremely well on the <strong>MTEB benchmark</strong> for embedding evaluation.</p><p>The <a href="https://huggingface.co/mteb">Massive Text Embedding Benchmark (MTEB)</a> is a standardized framework and <a href="https://huggingface.co/spaces/mteb/leaderboard">public leaderboard</a> used to evaluate the performance of text embedding models across a wide range of tasks and languages. It is currently the most popular and comprehensive tool for selecting embedding models for applications like <strong>Retrieval-Augmented Generation (RAG)</strong> and semantic search.</p><p><strong>Working</strong></p><blockquote>Uses <strong>contrastive learning optimized for retrieval</strong>.</blockquote><pre>Training objective:<br><br>Query → relevant document closer<br>Query → irrelevant document farther<br><br>Example training pair:<br>Query: &quot;capital of France&quot;<br>Positive: &quot;Paris is the capital of France&quot;<br>Negative: &quot;Python is a programming language&quot;</pre><p><strong>Global context benefit</strong></p><blockquote>Strong <strong>query-document matching</strong>, which improves retrieving correct chunks anywhere in the document.</blockquote><h4>8. Self-Hosted Embeddings</h4><p>You can host your own embedding models on servers or GPUs.</p><pre>from langchain.embeddings import SelfHostedEmbeddings</pre><p><strong>Usage</strong></p><blockquote>enterprise security</blockquote><blockquote>large-scale private deployments</blockquote><h4>9. 
Fake Embeddings (Testing Only)</h4><p>Used only for testing pipelines.</p><pre>from langchain.embeddings import FakeEmbeddings</pre><p><strong>Usage:</strong></p><blockquote>testing</blockquote><blockquote>debugging RAG pipeline</blockquote><h4>10. Multilingual Embeddings</h4><p>Special models for multiple languages.</p><p><strong>Examples:</strong></p><blockquote><em>multilingual-e5-large</em></blockquote><blockquote><em>LaBSE</em></blockquote><blockquote><em>bge-m3</em></blockquote><p><strong>Usage:</strong></p><blockquote>cross-language search</blockquote><blockquote>global products</blockquote><p>If you’re interested in a step-by-step working example of RAG, check out my detailed blog post.</p><p><a href="https://medium.com/@saha.soumyadeep90/designing-scalable-rag-systems-using-vectordb-a-hands-on-walkthrough-de9f1eac768d">https://medium.com/@saha.soumyadeep90/designing-scalable-rag-systems-using-vectordb-a-hands-on-walkthrough-de9f1eac768d</a></p><h3>Positional Encoding in Transformers</h3><h4>How Positional Encoding Depends on Embeddings</h4><p>Positional Encoding is not independent — it works <strong>together with word embeddings</strong>. In fact, its design is tightly connected to how embeddings are represented in Transformers.</p><p>Let’s break this down clearly.</p><p><strong><em>1. 
Same Dimension as Embeddings</em></strong></p><p>Every token in a Transformer is first converted into a <strong>word embedding vector</strong> of size d_model.</p><p>For example:</p><ul><li>If d_model = 512, each token becomes a <strong>512-dimensional vector</strong>.</li></ul><p>Positional Encoding is also created with the <strong>same dimension (512)</strong>.</p><p><strong>Why?</strong></p><p>Because positional encoding is <strong>added directly to the embedding vector</strong>:</p><p>Input to Transformer = Word Embedding + Positional Encoding</p><p>If the dimensions didn’t match, this element-wise addition wouldn’t be possible.</p><p>So positional encoding is structurally dependent on embedding size.</p><p><strong><em>2. It Modifies the Embedding Space</em></strong></p><p>Embeddings capture <strong>semantic meaning</strong>:</p><ul><li>“king” and “queen” are close in embedding space.</li><li>“dog” and “table” are far apart.</li></ul><p>Positional encoding shifts these embeddings slightly to encode position.</p><p>Example:</p><ul><li>Embedding(“cat”) at position 1</li><li>Embedding(“cat”) at position 5</li></ul><p>They start with the same semantic embedding, but after adding positional encoding, they become different vectors.</p><p>This allows the model to distinguish:</p><ul><li>The same word appearing in different positions.</li></ul><p>So positional encoding does not replace embeddings; it <strong>augments them</strong>.</p><p>Please go through my article on positional encoding:</p><p><a href="https://medium.com/@saha.soumyadeep90/positional-encoding-explained-simply-9c6b88b5d8ff">https://medium.com/@saha.soumyadeep90/positional-encoding-explained-simply-9c6b88b5d8ff</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Natural Language Processing: From Beginner to Advanced]]></title>
            <link>https://medium.com/@saha.soumyadeep90/natural-language-processing-from-beginner-to-advanced-5ea74b55e2f6?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/5ea74b55e2f6</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Sat, 24 Jan 2026 11:41:55 GMT</pubDate>
            <atom:updated>2026-02-18T11:24:40.940Z</atom:updated>
            <content:encoded><![CDATA[<p>We are going to explore the fascinating world of <strong>Natural Language Processing (NLP)</strong></p><h3><strong>Introduction to NLP</strong></h3><p>I have broken this down into <strong>5 key modules</strong>. For each, I will provide a simple explanation, a visual aid, and Python code to show you how it works in the real world.</p><h4><strong>Module 1: What is Natural Language Processing (NLP)?</strong></h4><p><strong>Simple Explanation:</strong> Imagine you are trying to teach a dog to understand English. You can teach it simple commands (“Sit”, “Stay”), but it can’t understand a Shakespeare poem or a complex joke. Computers are similar; they understand 0s and 1s, not words.</p><p><strong>NLP</strong> is the bridge that helps computers understand, interpret, and generate human language. It is a mix of three fields:</p><ol><li><strong>Linguistics:</strong> The rules of language (grammar, syntax).</li><li><strong>Computer Science:</strong> The programming and algorithms.</li><li><strong>Artificial Intelligence (AI):</strong> The “brain” that learns from data.</li></ol><p><strong>The goal (in simple words)</strong></p><blockquote>Humans talk using messy, flexible language. 
NLP tries to make machines handle that mess.</blockquote><p><strong>Humans:</strong> “Can you please remind me tomorrow?”<br> <strong>Machine must infer:</strong> “Set a reminder for tomorrow’s date/time.”</p><p><strong>Code Example: The First Step (Tokenization)</strong> Before a machine can understand a sentence, it must break it down into small pieces called “tokens” (roughly, words and punctuation marks).</p><pre># We use a popular library called NLTK (Natural Language Toolkit)<br>import nltk<br>nltk.download(&#39;punkt&#39;)  # newer NLTK releases may also need nltk.download(&#39;punkt_tab&#39;)<br>from nltk.tokenize import word_tokenize<br><br>text = &quot;NLP helps machines understand humans.&quot;<br><br># Break the text into words (tokens)<br>tokens = word_tokenize(text)<br><br>print(tokens)<br># Output: [&#39;NLP&#39;, &#39;helps&#39;, &#39;machines&#39;, &#39;understand&#39;, &#39;humans&#39;, &#39;.&#39;]</pre><p><strong>Why is NLP important?</strong></p><p>The lecture’s main point: <strong>language is how humans transfer knowledge</strong>. If machines can work with language, they become far more useful.</p><p>NLP matters because:</p><ul><li>We communicate constantly via text: emails, chats, reviews, posts</li><li>There is <strong>too much text</strong> for humans to read manually</li><li>Businesses want automation: support tickets, moderation, analytics, search</li></ul><h4><strong>Module 2: Major Applications of NLP</strong></h4><p><strong>Simple Explanation:</strong> Why do we care about NLP? Because it powers the apps you use every day. Three main uses:</p><ol><li><strong>Smart Reply &amp; Translation:</strong> Like Gmail suggesting “Sounds good!” or Google Translate converting English to Spanish.</li><li><strong>Content Moderation:</strong> Automatically hiding “hate speech” or bullying comments on social media.</li><li><strong>Sentiment Analysis:</strong> Figuring out the “mood” of a text. 
Companies use this during elections or product launches to see if people are happy (positive) or angry (negative).</li></ol><p><strong>A) Sentiment Analysis……..</strong></p><p><strong>What it is:</strong> Detect the emotion/opinion in text.</p><ul><li>Positive: “This phone is amazing!”</li><li>Negative: “Worst service ever.”</li><li>Neutral: “The package arrived today.”</li></ul><p><strong>Where used</strong></p><ul><li>Product reviews</li><li>Election/public opinion analysis (as the lecture mentions)</li><li>Brand monitoring (“Are people angry about us today?”)</li></ul><p><strong>Code Example: Sentiment Analysis</strong> Let’s write a simple program to detect if a review is positive or negative.</p><pre>from textblob import TextBlob<br><br># A user review<br>review = &quot;I absolutely love this new phone! It&#39;s amazing.&quot;<br><br># Analyze sentiment<br>blob = TextBlob(review)<br>sentiment_score = blob.sentiment.polarity<br><br># Polarity ranges from -1 (Negative) to +1 (Positive)<br>if sentiment_score &gt; 0:<br>    print(&quot;This is a POSITIVE review.&quot;)<br>elif sentiment_score &lt; 0:<br>    print(&quot;This is a NEGATIVE review.&quot;)<br>else:<br>    print(&quot;This is NEUTRAL.&quot;)<br><br># Output: This is a POSITIVE review.</pre><p><strong>B) Text Classification……..</strong></p><p><strong>What it is:</strong> Assign a label/category to text.</p><p>Examples:</p><ul><li>Email → spam / not spam</li><li>News → sports / politics / tech</li><li>Support tickets → “refund”, “delivery”, “payment”</li></ul><p><strong>C) Smart Reply (like Gmail suggestions)……….</strong></p><p><strong>What it is:</strong> Given a message, the model suggests short replies.</p><p>Incoming email: “Can we meet at 5?”<br> Smart replies: “Sure”, “Can we do 6?”, “Sounds good”</p><p>This is basically a <strong>context-based text generation</strong> problem.</p><p><strong>D) Content Moderation (toxic/hate/inappropriate filtering)………</strong></p><p><strong>What it is:</strong> 
Detect harmful content and flag/remove it.</p><p>Why it’s hard:</p><ul><li>People hide meaning using slang, sarcasm, spelling tricks</li><li>Cultural context matters</li><li>False positives can censor harmless content</li></ul><p><strong>E) Language Detection + Translation………</strong></p><p><strong>Language detection:</strong> detect language automatically<br><strong>Translation:</strong> convert one language to another</p><p>Example:</p><ul><li>“Bonjour” → French</li><li>Translate speech/text instantly (like Google Translate)</li></ul><p><strong>F) Question Answering + Knowledge Graphs………</strong></p><p>A <strong>knowledge graph</strong> is like a giant network of facts.</p><p>Example query: “Who is the CEO of X?”<br> Google often uses structured data (entities + relationships).</p><p><strong>Diagram: tiny knowledge graph</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/930/1*GyL_GJGoyqo1ncxYmP9Ikw.png" /></figure><p>This structure helps machines answer faster than searching raw text every time.</p><p><strong>G) Text Summarization………</strong></p><p><strong>What it is:</strong> Reduce a long text into a shorter version while keeping key meaning.</p><p>Two styles:</p><ol><li><strong>Extractive</strong>: pick important sentences from the original</li><li><strong>Abstractive</strong>: generate a new shorter version (more human-like)</li></ol><h4><strong>Module 3: The Evolution of NLP Techniques</strong></h4><p><strong>Simple Explanation:</strong> We didn’t just wake up with smart AI like ChatGPT. The field went through three broad stages:</p><ol><li><strong>Rule-Based (The Old Way):</strong> Programmers wrote strict manual rules.</li></ol><ul><li><em>Example:</em> “If the sentence has the word ‘bad’, label it ‘Negative’.”</li><li><em>Problem:</em> It fails on sentences like “Not bad” (which is actually positive).</li></ul><ol start="2"><li><strong>Machine Learning (1990s):</strong> Computers started using statistics. 
Instead of rules, we fed them thousands of documents and let them calculate the probability of which words appear together.</li><li><strong>Deep Learning (2010s):</strong> We built “Neural Networks”, loosely inspired by the human brain. These can handle very complex data.</li></ol><p><strong>Code Example: Rule-Based vs. Machine Learning</strong></p><ol><li><em>The Old Rule-Based Way (Brittle):</em></li></ol><pre>def simple_sentiment(text):<br>    # One hard-coded rule: any text containing &quot;bad&quot; is negative<br>    if &quot;bad&quot; in text:<br>        return &quot;Negative&quot;<br>    return &quot;Positive&quot;<br><br>print(simple_sentiment(&quot;This movie is not bad&quot;))<br># Output: Negative (INCORRECT! &#39;Not bad&#39; is good, but the rule failed.)</pre><p><em>2. The Modern Way (Concept):</em> In modern ML, we don’t look for specific keywords; we train a model on millions of sentences so it learns that “not” flips the meaning of “bad”.</p><h4><strong>Module 4: Deep Learning &amp; Transformers</strong></h4><p>Before Transformers, models read sentences one word at a time, from left to right. They often forgot the beginning of a long sentence by the time they reached the end.</p><p><strong>Transformers</strong> changed this. They can look at the <strong>entire sentence at once</strong>. They use a mechanism called <strong>“Attention”</strong>. 
Imagine reading a sentence and highlighting the most important words that relate to each other, even if they are far apart.</p><p><strong>Code Example: Using a Transformer (BERT)</strong> The transformers library by Hugging Face makes these powerful models easy to use.</p><pre>from transformers import pipeline<br><br># Load a pre-trained transformer model for sentiment analysis<br># (with no model name given, the library picks a default model)<br>classifier = pipeline(&quot;sentiment-analysis&quot;)<br><br># The model takes the whole sentence into account<br>result = classifier(&quot;The food was okay, but the service was terrible.&quot;)<br><br>print(result)<br># Example output (the exact score may vary): [{&#39;label&#39;: &#39;NEGATIVE&#39;, &#39;score&#39;: 0.99}]<br># Here &quot;terrible service&quot; outweighs &quot;okay food&quot;.</pre><h4><strong>Module 5: Challenges (Ambiguity &amp; Sarcasm)</strong></h4><p><strong>Simple Explanation:</strong> Human language is messy. Here is why machines still struggle:</p><ol><li><strong>Ambiguity:</strong> One word can have multiple meanings.</li></ol><ul><li><em>Example:</em> “I went to the <strong>bank</strong>.” (River bank or Money bank?) Humans know from context; machines struggle.</li></ul><ol start="2"><li><strong>Sarcasm:</strong> Saying the opposite of what you mean.</li></ol><ul><li><em>Example:</em> “Oh, great! 
Another flat tire.” (The machine sees “Great” and thinks you are happy).</li></ul><ol start="3"><li><strong>Idioms:</strong> Phrases that don’t make sense literally.</li></ol><ul><li><em>Example:</em> “Break a leg.” (Machine thinks you want to hurt someone; you actually mean “Good Luck”).</li></ul><p><strong>Code Example: Disambiguation (The “Bank” Problem)</strong> This example uses the Lesk algorithm, a simple “Word Sense Disambiguation” method, to try to tell the two senses apart.</p><pre>from nltk.wsd import lesk<br>from nltk.tokenize import word_tokenize<br><br># Requires the NLTK &#39;punkt&#39; and &#39;wordnet&#39; data (download once with nltk.download)<br><br># Context 1: Money<br>sentence1 = &quot;I went to the bank to deposit money.&quot;<br>sense1 = lesk(word_tokenize(sentence1), &#39;bank&#39;)<br>print(f&quot;Context 1 meaning: {sense1.definition()}&quot;)<br># Intended sense: a financial institution<br><br># Context 2: River<br>sentence2 = &quot;I sat on the bank of the river.&quot;<br>sense2 = lesk(word_tokenize(sentence2), &#39;bank&#39;)<br>print(f&quot;Context 2 meaning: {sense2.definition()}&quot;)<br># Intended sense: sloping land (especially the slope beside a body of water)<br><br># Note: Lesk picks the sense whose dictionary definition shares the most words<br># with the sentence, so the printed definitions may differ from the intended senses.</pre><h3><strong>End to End NLP Pipeline</strong></h3><p>In this lesson, we are going to walk through the <strong>5-Step End-to-End NLP Pipeline</strong>. Think of this pipeline as an assembly line in a factory. You start with raw materials (messy text), process them, and end up with a finished product (a working AI model).</p><p>Here is the roadmap we will follow:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ssLWBAN5JqluxykWtmOX-A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iv0o0jOL5e2Z7P-JVJydAQ.png" /></figure><h4><strong>Step 1: Data Acquisition (Gathering the Raw Material)</strong></h4><p><strong>Simple Explanation:</strong> Before you can cook, you need ingredients. In NLP, your “ingredients” are text data. 
You can get data in three ways:</p><ol><li><strong>Available Data:</strong> You already have it (e.g., company emails).</li><li><strong>Public Data:</strong> You download it from the internet (e.g., Kaggle datasets).</li><li><strong>Scraping:</strong> You write a bot to “read” websites and save the text (e.g., copying product reviews from Amazon).</li></ol><p><strong>Code Example: Web Scraping</strong> Let’s say we want to grab some text from a webpage using a library called BeautifulSoup.</p><pre>import requests<br>from bs4 import BeautifulSoup<br><br># The URL we want to scrape (a placeholder address)<br>url = &quot;https://example.com/reviews&quot;<br><br># Get the page content<br>response = requests.get(url)<br>soup = BeautifulSoup(response.text, &#39;html.parser&#39;)<br><br># Find all paragraphs (simulating grabbing reviews)<br>reviews = [p.text for p in soup.find_all(&#39;p&#39;)]<br><br>print(reviews[:2])<br># Illustrative output (depends on the page): [&#39;This product is great!&#39;, &#39;I did not like the service.&#39;]</pre><h4><strong>Step 2: Text Preparation (Cleaning the Ingredients)</strong></h4><p><strong>Simple Explanation:</strong> Raw text is messy. It has emojis, HTML tags (&lt;br&gt;), and weird symbols. If we feed this to the computer, it will get confused. This step involves:</p><ul><li><strong>Cleaning:</strong> Removing HTML tags, emojis, and punctuation.</li><li><strong>Tokenization:</strong> Chopping sentences into words.</li><li><strong>Stop Word Removal:</strong> Deleting very common words like “is”, “the”, “at” that don’t add much meaning.</li></ul><p><strong>Code Example: Cleaning &amp; Tokenizing</strong></p><pre>import re<br>from nltk.corpus import stopwords<br>from nltk.tokenize import word_tokenize<br><br># Raw dirty text<br>text = &quot;The movie was &lt;b&gt;AMAZING&lt;/b&gt;!!! 😃 I loved it.&quot;<br><br># 1. Remove HTML tags<br>clean_text = re.sub(&#39;&lt;.*?&gt;&#39;, &#39;&#39;, text)<br><br># 2. 
Remove special characters (keep only letters)<br>clean_text = re.sub(&#39;[^a-zA-Z]&#39;, &#39; &#39;, clean_text)<br><br># 3. Tokenize (split into words)<br>words = word_tokenize(clean_text.lower())<br><br># 4. Remove Stop Words (&quot;the&quot;, &quot;it&quot;, &quot;was&quot;)<br>stop_words = set(stopwords.words(&#39;english&#39;))<br>filtered_words = [w for w in words if w not in stop_words]<br><br>print(filtered_words)<br># Output: [&#39;movie&#39;, &#39;amazing&#39;, &#39;loved&#39;]</pre><h4><strong>Step 3: Feature Engineering (Translating for the Computer)</strong></h4><p><strong>Simple Explanation:</strong> Computers cannot read words; they only understand numbers. <strong>Feature Engineering</strong> is the process of converting your text into a list of numbers (vectors) that represent the meaning.</p><p>Two common ways to do this:</p><ol><li><strong>Bag of Words (BoW):</strong> We count how many times each word appears.</li><li><strong>TF-IDF:</strong> A smarter way that gives more importance to rare, unique words and less importance to common words.</li></ol><p><strong>Code Example: Bag of Words (CountVectorizer)</strong></p><pre>from sklearn.feature_extraction.text import CountVectorizer<br><br>documents = [<br>    &quot;I love coding&quot;,<br>    &quot;Coding is fun&quot;<br>]<br><br># Create the vectorizer<br>vectorizer = CountVectorizer()<br><br># Convert text to numbers<br>X = vectorizer.fit_transform(documents)<br><br># Show the numbers (The &quot;features&quot;)<br>print(vectorizer.get_feature_names_out())<br>print(X.toarray())<br><br># Output:<br># [&#39;coding&#39; &#39;fun&#39; &#39;is&#39; &#39;love&#39;]   (&quot;I&quot; is dropped: the default tokenizer ignores one-letter words)<br># [[1 0 0 1]    &lt;- &quot;I love coding&quot; (1 &#39;coding&#39;, 0 &#39;fun&#39;, 0 &#39;is&#39;, 1 &#39;love&#39;)<br>#  [1 1 1 0]]   &lt;- &quot;Coding is fun&quot;</pre><h4><strong>Step 4: Modeling (The Brain)</strong></h4><p><strong>Simple 
Explanation:</strong> Now that we have numbers, we can train a “Model”.</p><ul><li><strong>Machine Learning (ML):</strong> We use algorithms like Naive Bayes or Support Vector Machines. These are great when you have less data. You have to tell the model <em>what</em> to look for (manual feature engineering).</li><li><strong>Deep Learning (DL):</strong> We use Neural Networks. These are better for huge amounts of data. They figure out the features <em>automatically</em>, but they are “Black Boxes” (hard to explain <em>why</em> they made a decision).</li></ul><p><strong>Code Example: Training a Simple Classifier</strong></p><pre>from sklearn.naive_bayes import MultinomialNB<br><br># X is our numbers from Step 3, y is our labels (1=Positive, 0=Negative)<br># Both toy sentences are positive, so this model can only ever predict 1;<br># a real dataset needs examples of every class.<br>y = [1, 1]<br><br># Train the model<br>model = MultinomialNB()<br>model.fit(X, y)<br><br># Predict a new sentence<br>test_sentence = vectorizer.transform([&quot;Coding is amazing&quot;])<br>prediction = model.predict(test_sentence)<br><br>print(f&quot;Prediction: {prediction[0]}&quot;)<br># Output: Prediction: 1 (Positive)</pre><h4><strong>Step 5: Deployment (Going Live)</strong></h4><p><strong>Simple Explanation:</strong> A model sitting on your laptop is of little use to anyone else. <strong>Deployment</strong> means putting your model on a server so other people can use it (like via a website or app).</p><ul><li><strong>Monitoring:</strong> Once live, you must watch it. 
If people start using new slang words that your model doesn’t know, its accuracy will drop.</li><li><strong>Updating:</strong> You need to retrain the model periodically with new data to keep it smart.</li></ul><p><strong>Code Example: A Mock API Endpoint (Flask)</strong> This is how a web server might look when you deploy your model.</p><pre>from flask import Flask, request, jsonify<br><br>app = Flask(__name__)<br><br>@app.route(&#39;/predict&#39;, methods=[&#39;POST&#39;])<br>def predict():<br>    data = request.json<br>    text = data[&#39;text&#39;]<br><br>    # Preprocess and Predict (using the vectorizer and model from the previous steps)<br>    text_vector = vectorizer.transform([text])<br>    result = model.predict(text_vector)[0]<br><br>    return jsonify({&#39;sentiment&#39;: &#39;Positive&#39; if result == 1 else &#39;Negative&#39;})<br><br># If you ran this, you could send a text to the server and get a prediction back!</pre><h3><strong>Text Preprocessing</strong></h3><p>We are now diving deep into <strong>Step 2 of the NLP Pipeline: Text Preprocessing</strong>.</p><p>If you feed dirty text (with HTML tags, emojis, and typos) into a model, you get “Garbage In, Garbage Out.” We need to scrub it clean.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SrPEtR-m3YjyNdN-h73H1Q.png" /></figure><h4><strong>Module 1: Basic Cleaning (The “Janitorial” Work)</strong></h4><p><strong>Simple Explanation:</strong> Before we look at the meaning of words, we need to standardize the format.</p><ol><li><strong>Lowercasing:</strong> “Apple” and “apple” should be treated as the same word.</li><li><strong>Removing HTML:</strong> If you scrape data from the web, it comes with markup tags like &lt;div&gt; or &lt;br&gt; that confuse the machine.</li><li><strong>Removing URLs &amp; Punctuation:</strong> Links (http://...) and symbols (!?,.) 
often add noise without adding sentiment.</li></ol><p><strong>Code Example: Cleaning with Regex</strong> We use Python’s re (Regular Expressions) library for this.</p><pre>import re<br>import string<br><br>def clean_text(text):<br>    # 1. Lowercase<br>    text = text.lower()<br><br>    # 2. Remove HTML tags (anything between &lt; and &gt;)<br>    text = re.sub(r&#39;&lt;.*?&gt;&#39;, &#39;&#39;, text)<br><br>    # 3. Remove URLs<br>    text = re.sub(r&#39;https?://\S+|www\.\S+&#39;, &#39;&#39;, text)<br><br>    # 4. Remove Punctuation<br>    # We replace punctuation with an empty string<br>    text = text.translate(str.maketrans(&#39;&#39;, &#39;&#39;, string.punctuation))<br><br>    return text<br><br>raw_tweet = &quot;Watch this MOVIE! &lt;br&gt; It&#39;s GR8. https://movie.com&quot;<br>print(clean_text(raw_tweet))<br># Output: watch this movie  its gr8</pre><h4><strong>Module 2: Advanced Cleaning (Emojis &amp; Spelling)</strong></h4><p><strong>Simple Explanation:</strong> Two specific types of “noise” need special handling:</p><ol><li><strong>Emojis:</strong> :) or 🔥. 
Sometimes we want to keep them (for sentiment), but often we want to remove them or translate them to text (e.g., convert :) to “happy”).</li><li><strong>Chat Speak &amp; Typos:</strong> Users write “u” instead of “you” or “luv” instead of “love.” We use a <strong>dictionary mapping</strong> to fix these.</li></ol><p><strong>Code Example: Handling Emojis &amp; Chat Speak</strong></p><pre># A simple dictionary for chat speak<br>chat_words = {<br>    &quot;u&quot;: &quot;you&quot;,<br>    &quot;gr8&quot;: &quot;great&quot;,<br>    &quot;luv&quot;: &quot;love&quot;,<br>    &quot;r&quot;: &quot;are&quot;<br>}<br><br>def clean_chat_speak(text):<br>    words = text.split()<br>    new_words = []<br>    for w in words:<br>        if w in chat_words:<br>            new_words.append(chat_words[w])<br>        else:<br>            new_words.append(w)<br>    return &quot; &quot;.join(new_words)<br><br># Removing Emojis (by dropping non-ASCII characters)<br>def remove_emojis(text):<br>    # Encoding to ASCII with errors ignored drops every non-ASCII character:<br>    # most emojis, but also accented letters<br>    return text.encode(&#39;ascii&#39;, &#39;ignore&#39;).decode(&#39;ascii&#39;)<br><br>input_text = &quot;u r gr8 😃&quot;<br>clean = clean_chat_speak(input_text)<br>clean = remove_emojis(clean)<br>print(clean)<br># Output: you are great</pre><h4><strong>Module 3: Tokenization</strong></h4><p><strong>Simple Explanation:</strong> Tokenization is the act of chopping text into pieces.</p><ul><li><strong>Sentence Tokenization:</strong> Splitting a paragraph into sentences.</li><li><strong>Word Tokenization:</strong> Splitting a sentence into individual words.</li></ul><p>Why? Because the computer analyzes text one “token” (unit) at a time.</p><p><strong>Code Example: NLTK Tokenization</strong></p><pre>import nltk<br>from nltk.tokenize import word_tokenize, sent_tokenize<br><br>text = &quot;NLP is fun. 
I am learning fast!&quot;<br><br># Sentence Tokenization<br>print(sent_tokenize(text))<br># Output: [&#39;NLP is fun.&#39;, &#39;I am learning fast!&#39;]<br><br># Word Tokenization<br>print(word_tokenize(text))<br># Output: [&#39;NLP&#39;, &#39;is&#39;, &#39;fun&#39;, &#39;.&#39;, &#39;I&#39;, &#39;am&#39;, &#39;learning&#39;, &#39;fast&#39;, &#39;!&#39;]</pre><h4><strong>Module 4: Stemming vs. Lemmatization</strong></h4><p><strong>Simple Explanation:</strong></p><p>In English, words change shape (inflection): “run,” “running,” “ran,” “runs.” To a computer, these look like 4 different words. We want to reduce them to their root concept: “RUN.”</p><p>Let’s compare two methods:</p><ol><li><strong>Stemming (Fast but dumb):</strong> It just chops off the end of the word.</li></ol><ul><li><em>Example:</em> “Changing” → “Chang” (Not a real word, but usually good enough).</li></ul><ol start="2"><li><strong>Lemmatization (Slow but smart):</strong> It uses a dictionary (like WordNet) to find the actual root word.</li></ol><ul><li><em>Example:</em> “Better” → “Good” (It understands the meaning).</li></ul><p><strong>Code Example: PorterStemmer vs. 
WordNetLemmatizer</strong></p><pre>from nltk.stem import PorterStemmer, WordNetLemmatizer<br><br>stemmer = PorterStemmer()<br>lemmatizer = WordNetLemmatizer()  # needs the NLTK &#39;wordnet&#39; data<br><br>word = &quot;running&quot;<br><br>print(&quot;Stemming:&quot;, stemmer.stem(word))<br># Output: Stemming: run<br><br>word2 = &quot;better&quot;<br>print(&quot;Stemming:&quot;, stemmer.stem(word2))<br># Output: Stemming: better (Stemmer doesn&#39;t know grammar)<br><br>print(&quot;Lemmatization:&quot;, lemmatizer.lemmatize(word2, pos=&#39;a&#39;))<br># Output: Lemmatization: good (Lemmatizer knows &#39;better&#39; is an adjective form of &#39;good&#39;)</pre><h4><strong>Module 5: The Project (Movie Classification)</strong></h4><p><strong>Simple Explanation:</strong> Let’s conclude with an assignment: building a dataset of movie reviews to classify them. This brings all the steps together. You scrape the data, create a DataFrame, and apply the cleaning functions we just wrote.</p><p><strong>Code Example: The Complete Pipeline Function</strong> Here is how you would apply all these steps to a list of movie reviews in a real project.</p><pre>import re<br>import string<br>import pandas as pd<br><br># 1. Create the Dataset (Simulating the &#39;Data Acquisition&#39; step)<br>data = {<br>    &#39;review&#39;: [<br>        &quot;The movie was &lt;br&gt; AMAZING!!! 😃&quot;,<br>        &quot;worst. movie. ever. don&#39;t watch it.&quot;,<br>        &quot;I luv the acting, u should see it.&quot;<br>    ],<br>    &#39;sentiment&#39;: [&#39;positive&#39;, &#39;negative&#39;, &#39;positive&#39;]<br>}<br>df = pd.DataFrame(data)<br><br># 2. 
Define the Master Preprocessing Function<br>def master_clean(text):<br>    text = text.lower()                     # Lowercase<br>    text = re.sub(r&#39;&lt;.*?&gt;&#39;, &#39;&#39;, text)       # Remove HTML<br>    text = text.encode(&#39;ascii&#39;, &#39;ignore&#39;).decode(&#39;ascii&#39;) # Remove Emojis<br>    text = text.translate(str.maketrans(&#39;&#39;, &#39;&#39;, string.punctuation)) # Remove Punctuation<br>    text = clean_chat_speak(text)           # Fix &quot;luv&quot; -&gt; &quot;love&quot; (function from Module 2)<br>    return text<br><br># 3. Apply to the DataFrame<br>df[&#39;cleaned_review&#39;] = df[&#39;review&#39;].apply(master_clean)<br><br>print(df[[&#39;review&#39;, &#39;cleaned_review&#39;]])</pre><p><strong>Output:</strong></p><pre>                                review                       cleaned_review<br>0      The movie was &lt;br&gt; AMAZING!!! 😃                the movie was amazing<br>1  worst. movie. ever. don&#39;t watch it.       worst movie ever dont watch it<br>2   I luv the acting, u should see it.  i love the acting you should see it</pre><h3><strong>Text Representation | Bag of Words | Tf-Idf | N-grams, Bi-grams and Uni-grams</strong></h3><p>We have arrived at a critical juncture: <strong>Step 3 of the Pipeline: Feature Engineering (Text Representation).</strong></p><p>In simple terms: <em>How do we turn words into numbers so the machine can understand them?</em></p><p>This section covers three major techniques, from simple to advanced. 
I will explain each with a diagram, code, and a clear “Pros &amp; Cons” list.</p><h4><strong>Module 1: One-Hot Encoding (The Simplest Approach)</strong></h4><p><strong>Simple Explanation:</strong></p><p>Imagine you have a vocabulary of 5 words: [“I”, “love”, “NLP”, “coding”, “hate”].</p><p>To represent the word “NLP”, you create a list of zeros and put a <strong>1</strong> in the slot where “NLP” sits.</p><ul><li>“I” → [1, 0, 0, 0, 0]</li><li>“NLP” → [0, 0, 1, 0, 0]</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*35MaaAz161RKSQIqeNZ5xA.png" /></figure><p><strong>The Problem:</strong></p><p>If your vocabulary has 50,000 words (a realistic size for an English corpus), every single word becomes a list with 49,999 zeros and one 1. This is called <strong>Sparsity</strong>. It wastes memory and compute power.</p><pre>import pandas as pd<br>from sklearn.preprocessing import OneHotEncoder<br><br># Our vocabulary is implicitly created from this data<br>data = [[&#39;I&#39;], [&#39;love&#39;], [&#39;NLP&#39;]]<br>encoder = OneHotEncoder(sparse_output=False)<br><br># Convert to One-Hot<br>one_hot = encoder.fit_transform(data)<br><br>print(encoder.get_feature_names_out())<br># Output: [&#39;x0_I&#39; &#39;x0_NLP&#39; &#39;x0_love&#39;]  (categories are sorted; uppercase sorts before lowercase)<br><br>print(one_hot)<br># Output:<br># [[1. 0. 0.]   &lt;- &quot;I&quot;<br>#  [0. 0. 1.]   &lt;- &quot;love&quot;<br>#  [0. 1. 0.]]  &lt;- &quot;NLP&quot;</pre><h4><strong>Module 2: Bag of Words (BoW) &amp; N-Grams</strong></h4><p><strong>Simple Explanation:</strong> Instead of marking just <em>one</em> word, we count <em>all</em> the words in a sentence.</p><ul><li><strong>Sentence:</strong> “I love NLP and I love coding.”</li><li><strong>BoW Vector:</strong> {“I”: 2, “love”: 2, “NLP”: 1, “and”: 1, “coding”: 1}</li></ul><p><strong>The Problem with BoW:</strong> It loses order. 
“dog bites man” and “man bites dog” look exactly the same because they share the same words.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2kTkRgVTJu2iUVbEy1mZaw.png" /></figure><p><strong>The Solution: N-Grams</strong> Instead of counting single words (<strong>Unigrams</strong>), we count pairs (<strong>Bigrams</strong>) or triplets (<strong>Trigrams</strong>).</p><ul><li><strong>Bigrams:</strong> “I love”, “love NLP”, “NLP and”… With bigrams, a phrase like “not bad” is treated as a single unit, preserving its meaning.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*603aOPLPoe81gigeSZl2fA.png" /></figure><p><strong>Code Example: CountVectorizer (BoW &amp; Bigrams)</strong></p><pre>from sklearn.feature_extraction.text import CountVectorizer<br><br>text = [&quot;I love NLP and I love coding&quot;]<br><br># 1. Standard Bag of Words (Unigrams)<br>cv = CountVectorizer()<br>vector = cv.fit_transform(text)<br>print(&quot;BoW:&quot;, cv.get_feature_names_out())<br># Output: BoW: [&#39;and&#39; &#39;coding&#39; &#39;love&#39; &#39;nlp&#39;]<br># (The default tokenizer lowercases the text and drops one-letter tokens such as &#39;I&#39;)<br><br># 2. Bigrams (Pairs of words)<br>cv_bigram = CountVectorizer(ngram_range=(2, 2))<br>vector_bi = cv_bigram.fit_transform(text)<br>print(&quot;Bigrams:&quot;, cv_bigram.get_feature_names_out())<br># Output: Bigrams: [&#39;and love&#39; &#39;love coding&#39; &#39;love nlp&#39; &#39;nlp and&#39;]</pre><h4><strong>Module 3: TF-IDF (Term Frequency-Inverse Document Frequency)</strong></h4><p><strong>Simple Explanation:</strong> In Bag of Words, common words like “the” or “is” might appear 100 times, making them seem <em>most</em> important. But they carry almost no information! <strong>TF-IDF</strong> fixes this by combining two scores:</p><ol><li><strong>TF (Term Frequency):</strong> How often a word appears in <em>this</em> document. (Rewards frequent words).</li><li><strong>IDF (Inverse Document Frequency):</strong> How rare the word is across <em>all</em> documents. 
(Punishes common words like “the”).</li></ol><ul><li><em>Result:</em> Words like “Netflix” or “Quantum” get high scores. Words like “the” get near-zero scores.</li></ul><p><strong>What is TF-IDF?</strong></p><p>TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a statistical technique that quantifies the importance of a word in a document based on how often it appears in that document and across a given collection of documents (corpus). The intuition for this measure is: if a word occurs frequently in a document, it should be more important and relevant than words that appear fewer times, so we give that word a high score (TF). But if a word appears many times in a document and also in many other documents, it is probably not a relevant or meaningful word, so we assign it a lower score (IDF). The relevancy of a word is proportional to the amount of information that it gives about its context (a sentence, a document or a full dataset). The more relevant words help us better understand the entire document without reading it completely. The most relevant words are not necessarily the most frequent words, since <strong>stopwords</strong> like “the”, “of” or “a” tend to occur very often in many documents but do not give much information. The TF-IDF method is widely used in Information Retrieval and Text Mining. 
The TF-IDF score of a term in a document with respect to a corpus is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*3QhcejOESpg9_6v7RDGqMQ.png" /></figure><p><strong>TF (Term Frequency) Score</strong></p><p>How often a term appears inside a document.</p><p>Common version:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FuRoBY_RGIkEETG2lUAd8A.png" /></figure><p><strong>IDF (Inverse Document Frequency)</strong></p><p>How rare the term is across documents.</p><p>Common version:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2zaZ5GJQLZc1gc79O7omRA.png" /></figure><p><strong>TF‑IDF</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/628/1*Dgo3QSYWScU9nWyh1fBpxQ.png" /></figure><p><strong>Example 1:</strong></p><ol><li>&quot;i love nlp&quot;</li><li>&quot;i love love deep learning&quot;</li><li>&quot;nlp is fun&quot;</li></ol><p>Here:</p><ul><li>N = 3</li><li>df(&quot;love&quot;) = 2 (in doc1 and doc2)</li><li>df(&quot;deep&quot;) = 1 (only doc2)</li></ul><p>Compute IDF (natural log):</p><ul><li>IDF(love) = log(3/2) ≈ 0.405</li><li>IDF(deep) = log(3/1) ≈ 1.099</li></ul><p>In document 2 (&quot;i love love deep learning&quot;, total words = 5):</p><ul><li>TF(love) = 2/5 = 0.4 → TFIDF(love) = 0.4 × 0.405 ≈ 0.162</li><li>TF(deep) = 1/5 = 0.2 → TFIDF(deep) = 0.2 × 1.099 ≈ 0.220</li></ul><p><strong>Example 2:</strong></p><p>To see how TF-IDF works end to end, here is a concrete example. Let’s assume that we have a collection of four documents as follows:</p><ul><li>d1 : “<em>The sky is blue.</em>”</li><li>d2 : “<em>The sun is bright today.</em>”</li><li>d3 : “<em>The sun in the sky is bright.</em>”</li><li>d4 : “<em>We can see the shining sun, the bright sun.</em>”</li></ul><p><strong>Task:</strong> Determine the tf-idf scores for each term in each document.</p><ul><li><strong>Step 1:</strong> Filter out the stopwords. 
After removing the stopwords, we have</li></ul><blockquote>d1 : “<em>sky blue</em>”</blockquote><blockquote>d2 : “<em>sun bright today</em>”</blockquote><blockquote>d3 : “<em>sun sky bright</em>”</blockquote><blockquote>d4 : “<em>can see shining sun bright sun</em>”</blockquote><ul><li><strong>Step 2:</strong> Compute TF: build the document-word count matrix, then normalize each row so it sums to 1.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Ufs71IvfNgKg9aeATYSUQ.png" /></figure><ul><li><strong>Step 3:</strong> Compute IDF: find the number of documents in which each word occurs, then apply the formula:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ncRgRUOC_P3-FokzF3_K_Q.png" /></figure><ul><li><strong>Step 4:</strong> Compute TF-IDF: multiply the TF and IDF scores.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9XnUQWKcaLJXUreb7OTtAQ.png" /></figure><p><strong>Code Example: TfidfVectorizer</strong></p><pre>from sklearn.feature_extraction.text import TfidfVectorizer<br><br>corpus = [<br>    &quot;I love the movie&quot;,<br>    &quot;The movie was boring&quot;,<br>    &quot;I love popcorn&quot;<br>]<br><br># Create TF-IDF<br>tfidf = TfidfVectorizer()<br>output = tfidf.fit_transform(corpus)<br><br># Let&#39;s see the score for &quot;movie&quot; vs &quot;popcorn&quot;<br>feature_names = tfidf.get_feature_names_out()<br>print(feature_names)<br>print(output.toarray())<br><br># You will notice &#39;popcorn&#39; has a higher score in sentence 3<br># than &#39;movie&#39; does in sentence 1, because &#39;movie&#39; appears in multiple sentences (less unique).</pre><h3><strong>Word2vec | CBOW and Skip-gram</strong></h3><p>Welcome to <strong>Step 4: Word Embeddings (Word2Vec)</strong></p><p><strong>Please Note: </strong>Word2Vec is one form of embeddings. There are numerous approaches, and the current standard is Transformer-based. 
Please go through my blog<strong> </strong><a href="https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df"><strong>https://medium.com/@saha.soumyadeep90/embeddings-explained-from-sparse-representations-to-transformer-based-semantic-spaces-4defcf1d78df</strong></a> if you want to learn in detail.</p><p>In the previous section (TF-IDF/Bag of Words), we treated words as just <strong>counts</strong>. The computer knew that “Apple” appeared 5 times, but it didn’t know that “Apple” is a fruit, or that it’s similar to “Orange”.</p><p><strong>Word2Vec</strong> changes everything. It turns words into <strong>Dense Vectors</strong> (lists of numbers) where similar words are placed close together in space.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YL1uJ6ldgu7l_wJhizBFeg.png" /></figure><h4><strong>Module 1: The Core Concept (Semantic Meaning)</strong></h4><p><strong>Simple Explanation:</strong> Imagine a giant 2D map.</p><ul><li>We want to put the word “King” at coordinate [5, 5].</li><li>We want to put “Queen” nearby at [5, 6].</li><li>We want to put “Apple” far away at [90, 10].</li></ul><p>Word2Vec figures out these coordinates automatically by reading millions of sentences. The logic is simple: <strong>“You shall know a word by the company it keeps.”</strong> If “Apple” and “Orange” both appear next to “juice” often, they must be related.</p><p><strong>The Magic Calculation:</strong> Because words are now numbers, we can do math with them! The most famous example is:</p><p>King - Man + Woman ≈ Queen</p><h4><strong>Module 2: How it Learns (The Sliding Window)</strong></h4><p><strong>Simple Explanation:</strong></p><p>To teach the computer, we play a “Fill in the Blank” game. 
We slide a “Window” over a sentence to create training examples.</p><p><strong>Sentence:</strong> “The quick brown fox jumps.”</p><p><strong>Window Size:</strong> 2 (Look 2 words back and 2 words forward).</p><p>With “brown” as the center word, we create these input/output pairs:</p><ul><li><strong>Input:</strong> “brown” → <strong>Target:</strong> “The”</li><li><strong>Input:</strong> “brown” → <strong>Target:</strong> “quick”</li><li><strong>Input:</strong> “brown” → <strong>Target:</strong> “fox”</li><li><strong>Input:</strong> “brown” → <strong>Target:</strong> “jumps”</li></ul><p>This creates a “Dummy Problem” for a Neural Network to solve. We don’t actually care about the prediction; we care about the <strong>Weights</strong> of the neural network. These weights become our Vectors!</p><h4><strong>Module 3: Two Architectures (CBOW vs. Skip-gram)</strong></h4><p>There are two ways to train this model.</p><p><strong>1. CBOW (Continuous Bag of Words):</strong></p><ul><li><strong>Task:</strong> I give you the <em>context</em> (surrounding words), you guess the <em>middle</em> word.</li><li><em>Example:</em> “The quick ____ fox.” → You guess “brown”.</li><li><em>Best for:</em> Faster training; it tends to do slightly better on frequent words.</li></ul><p><strong>2. Skip-gram:</strong></p><ul><li><strong>Task:</strong> I give you the <em>middle</em> word, you guess the <em>context</em>.</li><li><em>Example:</em> “____ brown ____” → You guess “quick” and “fox”.</li><li><em>Best for:</em> Capturing rare words; it works well even with smaller datasets, but trains more slowly.</li></ul><h4><strong>Module 4: Coding Word2Vec (Game of Thrones Edition)</strong></h4><p><strong>Simple Explanation:</strong> This module uses the <strong>Game of Thrones</strong> books to train a model. Since we can’t process the whole book here, I will simulate it with a small dataset so you can see the code structure. We use the library gensim.</p><pre>pip install gensim nltk</pre><pre>import gensim<br>from gensim.models import Word2Vec<br>from nltk.tokenize import sent_tokenize, word_tokenize<br>import nltk<br><br># One-time download of the tokenizer data (newer NLTK versions may need &#39;punkt_tab&#39; instead)<br>nltk.download(&#39;punkt&#39;)<br><br># 1. 
Prepare Data (Simulating the GoT text)<br>got_text = &quot;&quot;&quot;<br>Jon Snow is a member of the Night&#39;s Watch.<br>Daenerys Targaryen consists of fire and blood.<br>Tyrion Lannister is a dwarf and a clever man.<br>Arya Stark has a sword named Needle.<br>The King in the North is Jon Snow.<br>&quot;&quot;&quot;<br><br># 2. Preprocessing (Tokenization)<br># We need a list of lists: [[&#39;jon&#39;, &#39;snow&#39;, ...], [&#39;arya&#39;, &#39;stark&#39;, ...]]<br>sentences = []<br>for sent in sent_tokenize(got_text):<br>    words = [w.lower() for w in word_tokenize(sent)]<br>    sentences.append(words)<br><br># 3. Train the Model<br># min_count=1 means &quot;keep words that appear at least once&quot; (usually set to 5 for big data)<br># vector_size=100 means &quot;create a list of 100 numbers for each word&quot;<br># window=5 means &quot;look 5 words left and right&quot;<br>model = Word2Vec(sentences, min_count=1, vector_size=100, window=5)<br><br># 4. Use the Model<br># Find the vector for &quot;jon&quot;<br>vector_jon = model.wv[&#39;jon&#39;]<br>print(f&quot;Vector for Jon (first 5 numbers): {vector_jon[:5]}&quot;)<br><br># Find similarity<br>similarity = model.wv.similarity(&#39;jon&#39;, &#39;stark&#39;)<br>print(f&quot;Similarity between Jon and Stark: {similarity}&quot;)<br><br># Find most similar words (Won&#39;t be great on this tiny text, but works on big data)<br>print(&quot;Most similar to Daenerys:&quot;, model.wv.most_similar(&#39;daenerys&#39;))</pre><h4><strong>Module 5: Visualization (PCA)</strong></h4><p><strong>Simple Explanation:</strong> Our vectors have 100 dimensions (a list of 100 numbers). Humans can only see 2D or 3D. To visualize them, we use a technique called <strong>PCA (Principal Component Analysis)</strong>. 
It squashes the 100 dimensions down to 2, keeping as much of the variation in the data as possible, so we can plot them on a scatter chart.</p><pre>from sklearn.decomposition import PCA<br>import matplotlib.pyplot as plt<br><br># Get all word vectors from the model<br>X = model.wv[model.wv.index_to_key]<br><br># Compress to 2D<br>pca = PCA(n_components=2)<br>result = pca.fit_transform(X)<br><br># Plot every word, then show the figure once<br>plt.scatter(result[:, 0], result[:, 1])<br>words = list(model.wv.index_to_key)<br>for i, word in enumerate(words):<br>    plt.annotate(word, xy=(result[i, 0], result[i, 1]))<br>plt.show()</pre><h3><strong>Text Classification | Average Word2Vec</strong></h3><p>Welcome to <strong>Text Classification</strong>, one of the most useful skills you will learn in Machine Learning.</p><p>If you have ever wondered how Gmail knows an email is “Spam” or how a support ticket system automatically sends billing questions to the “Finance” team, the answer is <strong>Text Classification</strong>.</p><p>I have broken this lecture down into clear modules with diagrams and code.</p><h4><strong>Module 1: What is Text Classification?</strong></h4><p><strong>Simple Explanation:</strong> Imagine you are a librarian with a huge pile of unorganized books. Your job is to read the title of each book and throw it into the correct bin.</p><ul><li><strong>Binary Classification:</strong> You have only two bins (e.g., “Spam” vs. 
“Not Spam”).</li><li><strong>Multi-Class Classification:</strong> You have many bins (e.g., “Sports”, “Politics”, “Tech”).</li><li><strong>Multi-Label Classification:</strong> A book can go into multiple bins at once (e.g., a movie can be both “Action” and “Comedy”).</li></ul><p><strong>The Goal:</strong> To build a machine (Model) that can look at the text and predict the label automatically.</p><h4><strong>Module 2: The Classification Pipeline</strong></h4><p><strong>Simple Explanation:</strong> You don’t just “throw data at an algorithm.” You must follow a pipeline.</p><ol><li><strong>Preprocessing:</strong> Clean the text (Lowercasing, remove HTML).</li><li><strong>Feature Engineering:</strong> Convert text to numbers (Bag of Words, TF-IDF, or Word Vectors).</li><li><strong>Modeling:</strong> Train an algorithm (Naive Bayes, Random Forest, etc.) to recognize patterns.</li><li><strong>Prediction:</strong> Give it new text and get a label.</li></ol><h4><strong>Module 3: Code Example (Building a Spam Filter)</strong></h4><p>Let’s build a small working <strong>Spam Classifier</strong> using the classic <strong>Naive Bayes</strong> algorithm (which is great for text). We will use a Pipeline to keep our code clean.</p><pre>import pandas as pd<br>from sklearn.feature_extraction.text import CountVectorizer<br>from sklearn.naive_bayes import MultinomialNB<br>from sklearn.pipeline import Pipeline<br>from sklearn.model_selection import train_test_split<br>from sklearn.metrics import accuracy_score<br><br># 1. The Dataset (Simulating emails)<br>data = {<br>    &#39;text&#39;: [<br>        &quot;Win a free iPhone now! Click here.&quot;,<br>        &quot;Hey, are we still meeting for lunch?&quot;,<br>        &quot;URGENT: Your bank account is locked.&quot;,<br>        &quot;Project deadline is tomorrow. Please review.&quot;,<br>        &quot;Free cash prize winner!!! 
claim now&quot;<br>    ],<br>    &#39;label&#39;: [&#39;Spam&#39;, &#39;Ham&#39;, &#39;Spam&#39;, &#39;Ham&#39;, &#39;Spam&#39;] # &#39;Ham&#39; means Not Spam<br>}<br>df = pd.DataFrame(data)<br><br># 2. Split Data (Training vs Testing)<br>X_train, X_test, y_train, y_test = train_test_split(df[&#39;text&#39;], df[&#39;label&#39;], test_size=0.2, random_state=42)<br><br># 3. Build the Pipeline<br># Step A: Convert text to numbers (Bag of Words)<br># Step B: Apply the Classifier (Naive Bayes)<br>pipeline = Pipeline([<br>    (&#39;vectorizer&#39;, CountVectorizer()),<br>    (&#39;classifier&#39;, MultinomialNB())<br>])<br><br># 4. Train the Model<br>pipeline.fit(X_train, y_train)<br><br># 5. Predict on New Data<br>new_emails = [&quot;Meeting at 5pm?&quot;, &quot;Free money click link!&quot;]<br>predictions = pipeline.predict(new_emails)<br><br>print(f&quot;Predictions: {predictions}&quot;)<br># Intended output: Predictions: [&#39;Ham&#39; &#39;Spam&#39;]<br># (With only five emails, the actual result depends on which email lands in the test split.)</pre><h4><strong>Module 4: Advanced Technique (Averaging Word Vectors)</strong></h4><p><strong>Simple Explanation:</strong></p><p>This technique converts a <em>whole sentence</em> into a single vector using its word vectors (Word2Vec).</p><p>Since Word2Vec gives us a vector for <em>each word</em>, how do we get one vector for the <em>sentence</em>?</p><p><strong>We take the Average.</strong></p><ul><li><em>Analogy:</em> If you mix a drop of Red paint (“Apple”) and a drop of Yellow paint (“Banana”), you get Orange (The average color).</li><li><em>Math:</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*qYfzlVATjkhPIqOT0ToHPQ.png" /></figure><pre>import numpy as np<br><br># Imagine these are Word2Vec vectors (simplified to 2D for this example)<br>word_vectors = {<br>    &quot;king&quot;: np.array([5, 5]),<br>    &quot;rule&quot;: np.array([3, 3])<br>}<br><br>sentence = &quot;king rule&quot;<br>tokens = sentence.split()<br><br># 
Calculate Average<br>vectors = [word_vectors[word] for word in tokens]<br>sentence_vector = np.mean(vectors, axis=0)<br><br>print(f&quot;Sentence Vector: {sentence_vector}&quot;)<br># Calculation: ([5,5] + [3,3]) / 2  = [4, 4]<br># Output: Sentence Vector: [4. 4.]</pre><h3><strong>Part of Speech (POS) Tagging | Hidden Markov Models | Viterbi Algorithm in NLP</strong></h3><p>We are now entering the world of grammar and structure with <strong>Part of Speech (POS) Tagging</strong>.</p><p>If you have ever wondered how a computer knows that “Book a flight” uses “Book” as a <em>Verb</em>, but “Read a book” uses “Book” as a <em>Noun</em>, the answer is POS Tagging.</p><h4>1) What is POS tagging?</h4><p><strong>POS (Part of Speech) tagging</strong> = assigning a grammatical label to each word/token in a sentence.</p><p>Examples of POS tags:</p><ul><li><strong>NOUN</strong>: dog, city, movie</li><li><strong>VERB</strong>: run, eat, is</li><li><strong>ADJ</strong> (adjective): beautiful, big</li><li><strong>ADV</strong> (adverb): quickly, very</li><li><strong>PRON</strong> (pronoun): I, you, he</li><li><strong>DET</strong> (determiner): the, a, an</li><li><strong>ADP</strong> (preposition): in, on, to</li></ul><h4>Simple example</h4><p>Sentence:</p><blockquote><em>“The cat sleeps.”</em></blockquote><p>POS tags:</p><ul><li>The → DET</li><li>cat → NOUN</li><li>sleeps → VERB</li></ul><p>Why we do this: once you know the role of each word, it’s easier for a machine to understand the <strong>structure</strong> of the sentence.</p><h4>Why POS tagging is important (applications)</h4><p>POS tagging is a “support skill” that boosts many NLP systems:</p><h4>A) Information Retrieval (Search)</h4><p>If you search: <strong>“best camera for travel”</strong>, POS tags can help identify:</p><ul><li>“camera” = main noun</li><li>“best” = adjective modifying it</li></ul><p>So search can weight the important words better.</p><h4>B) Question Answering 
systems</h4><p>Question: “<strong>Who</strong> invented the telephone?”<br> POS helps find:</p><ul><li>“Who” (question pronoun)</li><li>“invented” (verb)</li><li>“telephone” (noun)</li></ul><h4>C) Disambiguation (same word, different meaning)</h4><p>Words change meaning based on how they are used. Example:</p><ul><li>“I will <strong>book</strong> a cab.” → <strong>book = VERB</strong></li><li>“I read a <strong>book</strong>.” → <strong>book = NOUN</strong></li><li><em>Sentence A:</em> “I will <strong>park</strong> the car.” (“Park” is an Action/Verb).</li><li><em>Sentence B:</em> “I walked in the <strong>park</strong>.” (“Park” is a Place/Noun).</li></ul><p>Without POS tags, the computer thinks “park” means the same thing in both.</p><h4>D) Chatbots / intent understanding</h4><p>A chatbot often needs to identify:</p><ul><li>actions (verbs)</li><li>entities (nouns)</li><li>modifiers (adjectives/adverbs)</li></ul><h4><strong>Module 2: Doing it the Easy Way (SpaCy)</strong></h4><p>The easiest route is the <strong>spaCy</strong> library. This is the modern, fast way to do tagging without writing complex algorithms from scratch.</p><pre>import spacy<br><br># Load the English model (install it first with: python -m spacy download en_core_web_sm)<br>nlp = spacy.load(&quot;en_core_web_sm&quot;)<br><br>text = &quot;I will google the answer.&quot;<br><br># Process the text<br>doc = nlp(text)<br><br># Print the token and its POS tag<br>for token in doc:<br>    print(f&quot;{token.text} --&gt; {token.pos_} ({token.tag_})&quot;)<br><br># Output:<br># I --&gt; PRON (PRP)<br># will --&gt; AUX (MD)<br># google --&gt; VERB (VB)  &lt;-- Look! 
It knew &#39;google&#39; was a verb here!<br># the --&gt; DET (DT)<br># answer --&gt; NOUN (NN)</pre><h4><strong>Module 3: The Algorithm Behind It (Hidden Markov Models)</strong></h4><p><strong>Simple Explanation:</strong></p><p>How does the computer figure this out? It uses probability. A classic approach is the <strong>Hidden Markov Model (HMM)</strong>.</p><p>The HMM looks at two types of probabilities:</p><p><strong>1. Transition Probability (Tag → Tag):</strong></p><ul><li>How likely is a <em>Noun</em> to follow a <em>Determiner</em>? (e.g., “The cat” → Very likely).</li><li>How likely is a <em>Verb</em> to follow a <em>Determiner</em>? (e.g., “The run” → Very unlikely).</li></ul><p><strong>2. Emission Probability (Tag → Word):</strong></p><ul><li>If the tag is <em>Verb</em>, how likely is the word “run”? (High).</li><li>If the tag is <em>Noun</em>, how likely is the word “run”? (Low, but possible, like “A long run”).</li></ul><p><strong>The Math Logic:</strong></p><p>The model calculates the probability for a sequence by multiplying these together. For a two-word sentence such as “dogs run” tagged Noun, Verb:</p><p>P(Sequence) = P(Start → Noun) × P(Noun → “dogs”) × P(Noun → Verb) × P(Verb → “run”)</p><h4><strong>Module 4: The “Viterbi” Algorithm (Optimization)</strong></h4><p><strong>The Problem:</strong> If you have a long sentence, checking <em>every single possible combination</em> of Nouns and Verbs would take forever (Exponential complexity).</p><ul><li>“I saw the man with the telescope.”</li><li>Is “saw” a noun (tool) or verb (action)? Is “man” a verb (to man a station) or noun?</li></ul><p><strong>The Solution (Viterbi):</strong> Instead of checking all paths at the end, the Viterbi algorithm checks them step-by-step and <strong>throws away the bad paths immediately</strong>. At each word, it keeps only the best path leading into each possible tag.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jXG4-6qllDmuSBD3FebVfQ.png" /></figure><p><strong>Simple Analogy:</strong> Imagine you are driving from New York to LA. 
You don’t compare every complete route across the USA. For each city along the way, you remember only the best route that reaches it, and you build the next leg of the trip from those.</p><p><strong>Code Concept: Calculating Transition Probability</strong> Here is a simplified Python snippet to show how we calculate “How often does a Verb follow a Pronoun?”.</p><pre># A dummy dataset of tagged sentences<br># (Word, Tag)<br>corpus = [<br>    [(&quot;I&quot;, &quot;PRON&quot;), (&quot;love&quot;, &quot;VERB&quot;), (&quot;code&quot;, &quot;NOUN&quot;)],<br>    [(&quot;He&quot;, &quot;PRON&quot;), (&quot;runs&quot;, &quot;VERB&quot;), (&quot;fast&quot;, &quot;ADV&quot;)]<br>]<br><br># Calculate Transition: P(Tag B | Tag A)<br>def calculate_transition(tag_a, tag_b, data):<br>    count_a = 0<br>    count_a_followed_by_b = 0<br><br>    for sentence in data:<br>        for i in range(len(sentence) - 1):<br>            current_tag = sentence[i][1]<br>            next_tag = sentence[i+1][1]<br><br>            if current_tag == tag_a:<br>                count_a += 1<br>                if next_tag == tag_b:<br>                    count_a_followed_by_b += 1<br><br>    # Avoid dividing by zero when tag_a never appears before another tag<br>    return count_a_followed_by_b / count_a if count_a else 0.0<br><br># Probability that a VERB follows a PRON (pronoun)<br>prob = calculate_transition(&quot;PRON&quot;, &quot;VERB&quot;, corpus)<br>print(f&quot;Probability(VERB | PRON): {prob}&quot;)<br># Output: Probability(VERB | PRON): 1.0 (100% in this tiny dataset)</pre><p>I’ve tried to keep the explanation detailed while staying concise.<br> If you’d like to explore any of the topics in more depth, don’t hesitate to reach out. I’ll be glad to assist.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Foundation Models and Generative AI]]></title>
            <link>https://medium.com/@saha.soumyadeep90/foundation-models-and-generative-ai-b7bf0c73eaa6?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b7bf0c73eaa6</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Tue, 23 Dec 2025 04:24:13 GMT</pubDate>
            <atom:updated>2025-12-24T07:05:24.772Z</atom:updated>
            <content:encoded><![CDATA[<p>In the last few years, AI stopped being a collection of narrow tools and started feeling like a <strong>general-purpose helper</strong>: something that can write, summarize, explain, plan, generate images, and even assist in research. This shift isn’t magic. It comes from <strong>foundation models</strong>: huge models trained on “oceans” of data (text, images, code, and more) so they learn broad, reusable skills. When these models are used to <em>create</em> new content (sentences, pictures, programs, designs), we call it <strong>generative AI</strong>.</p><p>This article breaks down what’s really happening in simple terms: <strong>why self-supervised learning was the breakthrough</strong>, how modern models learn meaning from context and relationships, and why one strong base model can be adapted to dozens of tasks instead of building one model per task. We’ll also connect the technology to the real world: how businesses use these systems for “<strong>unified intelligence</strong>,” how biology and medicine benefit from learning patterns in massive datasets, and why ethics, safety, bias, and regulation matter when a single model can influence decisions at scale.</p><h3><strong>Introduction</strong></h3><h4><strong>Big picture:</strong></h4><p>· <strong>Foundation Models</strong> (like GPT, Claude, etc.) are huge AI systems trained on oceans of text, images, code, audio, and more. 
They learn <strong>general skills</strong> (language, vision, reasoning) that you can adapt to many jobs, instead of training a new small model for each job.</p><p>· <strong>Generative AI</strong> makes new things (text, images, code, molecules) by learning the <strong>patterns and relationships</strong> in data.</p><p>· The course explores <strong>how modern AI learns</strong> (especially <em>self‑supervised learning</em>), where it’s used (science, business), and <strong>what it means</strong> for how we design systems in a messy, chaotic world.</p><h4><strong>1) Why the recent AI shift matters</strong></h4><ul><li><strong>Breakthroughs</strong>: Systems like ChatGPT showed that one general model can write, reason, summarize, plan, and even help with robotics or genomics tasks once adapted a bit.</li><li><strong>Economics</strong>: Investment has surged because these models can be reused everywhere (customer support, coding, research, creative work).</li><li><strong>Autonomous agents</strong>: On top of foundation models, “agents” try to plan multi-step tasks — like a smart assistant that can break a goal into steps and act.</li><li><strong>AGI discussion</strong>: The course touches on <strong>artificial general intelligence</strong> — systems that can do most cognitive tasks a human can. The “when” is debated, but understanding the <em>pathways</em> (especially learning from raw data) is essential.</li></ul><h4><strong>2) How do machines “learn” like people do?</strong></h4><p>Humans aren’t born with libraries in our heads. 
We learn patterns from experience and context.</p><ul><li><strong>Supervised learning</strong>: Learn from examples <em>with</em> correct answers (images labelled “dog”, “cat”).</li><li><strong>Reinforcement learning (RL)</strong>: Learn by acting, getting rewards/punishments (like learning to ride a bike).</li><li><strong>Generative / self‑supervised learning</strong>: Learn by predicting the missing parts of raw data itself (next word in a sentence, hidden patch in a picture).</li></ul><p>The philosophical angle: <strong>context</strong> matters. Meaning comes from <strong>relationships</strong> (how things connect) more than from isolated labels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VQEl_cUYZS-4H9WAWCvahw.png" /><figcaption><strong>Three Ways Machines Learn</strong></figcaption></figure><h4><strong>3) Meaning comes from relationships</strong></h4><p>When a child learns “dog,” they don’t just memorize the word + a picture. They notice <strong>relations</strong>:</p><ul><li>Dogs <strong>fetch</strong> balls, <strong>bark</strong>, <strong>live</strong> with people.</li><li>Cats <strong>chase</strong> mice, <strong>sleep</strong> on sofas.</li></ul><p>These <strong>networks of relations</strong> make words meaningful. 
Generative models work similarly: they see millions of patterns like “cat–mouse–cheese” and internalize the structure, so they can write or draw sensible new combinations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*38Gjc1PdhW4D_E6lfa-6KA.png" /><figcaption><strong>“Context builds meaning” (diagram: a small relational map)</strong></figcaption></figure><h4><strong>4) Two ways to run organizations</strong></h4><blockquote><strong>Top‑down</strong>: Leaders decide → analysts set metrics → processes roll out → frontline executes.</blockquote><p><strong>Strength</strong>: clarity and consistency.</p><p><strong>Risk</strong>: misses on‑the‑ground nuance.</p><blockquote><strong>Bottom‑up</strong>: Frontline learns from customers → teams experiment → patterns bubble up → leaders align and scale.</blockquote><p><strong>Strength</strong>: grounded in customer reality.</p><p><strong>Risk</strong>: can get noisy without coordination.</p><p>The lecture connects this to philosophy (Plato’s allegory of the cave, narrated by Socrates) and science: our world is partly <strong>orderly</strong> (good for top‑down math) and partly <strong>chaotic</strong> (needs intuition and adaptation). Great organizations blend both perspectives.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CR54d2IYRPM4XLiLkoodUw.png" /><figcaption><strong>Top‑down vs Bottom‑up (diagram)</strong></figcaption></figure><h4><strong>5) The coding reality gap &amp; classic ML limits</strong></h4><p>Computers like <strong>precise math</strong> and <strong>explicit goals</strong>. But the world is <strong>messy</strong>. 
That creates friction:</p><blockquote><strong>Supervised learning</strong></blockquote><ul><li><strong>Pro</strong>: precise, works well with clear labels.</li><li><strong>Con</strong>: labels are expensive and sometimes unclear (“What counts as ‘kind customer service’?”).</li></ul><blockquote><strong>Reinforcement learning</strong></blockquote><ul><li><strong>Pro</strong>: learns sequences of actions to reach goals.</li><li><strong>Con</strong>: feedback is <strong>delayed</strong> (you drive for 2 hours, then find out you took the wrong turn); unsafe to “trial‑and‑error” in real life.</li></ul><blockquote><strong>Blank slate problem</strong></blockquote><ul><li>Starting from nothing makes exploration slow and risky. We need <strong>priors</strong> or <strong>representations</strong> that already make sense of the world.</li></ul><h4><strong>6) Why self‑supervised learning (SSL) was a breakthrough</strong></h4><p>Self‑supervision learns from <strong>raw, unlabeled data</strong> by setting <strong>make‑believe tasks</strong>:</p><ul><li><strong>Text</strong>: predict the next word or the masked word.</li><li><strong>Images</strong>: predict missing patches or align multiple views of the same scene.</li><li><strong>Audio/Video</strong>: predict the next frame/sound.</li></ul><p>This makes models learn <strong>general representations</strong> (a sense of “how the world is organized”), which you can later specialize for many tasks. 
It’s <strong>safer</strong> than RL (no risky real‑world exploration) and <strong>cheaper</strong> than supervised (no labels needed).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3SB6GkCcasWItOZJ3jxCNw.png" /><figcaption><strong>Self‑supervised learning from image crops</strong></figcaption></figure><h4><strong>7) How SSL detects relationships</strong></h4><ul><li><strong>Images</strong>: Take two crops of the <strong>same</strong> photo → train the model to put their internal vectors <strong>close</strong> together, and different images <strong>far</strong> apart. Over time, it clusters related concepts (e.g., “wheels”, “fur”, “sky”).</li><li><strong>Genomics</strong>: Predict the next DNA base from the previous ones; the model internalizes <strong>motifs</strong> and can help find genes or regulatory elements.</li><li><strong>Retail</strong>: Look at what people view/click/buy; SSL can learn “this and that often go together,” improving recommendations <strong>without</strong> hand‑built user profiles.</li></ul><h4><strong>8) Using it in business</strong></h4><ul><li><strong>Know your customers &amp; products deeply</strong>: Use self‑supervised patterns from <strong>behavioural data</strong> (clickstreams, sequences) to improve search, ranking, recommendations, demand forecasts, and even product design.</li><li><strong>Learn from observation first</strong>: It’s cheaper and safer to learn from historical data before trying interactive learning in the wild.</li><li><strong>Blend order &amp; chaos</strong>: Keep the top‑down strategy (safety, compliance, KPIs) but let bottom‑up signals (frontline/customer data) shape decisions.</li></ul><h4><strong>Concrete examples to make it stick</strong></h4><blockquote><strong>Self‑supervised text</strong></blockquote><ul><li>Task: hide the word “mouse” in “The cat chased the ___.”</li><li>The model guesses “mouse” because it has seen that <strong>cat–chase–mouse</strong> pattern 
repeatedly.</li></ul><blockquote><strong>Self‑supervised images</strong></blockquote><ul><li>Task: crop two parts of a single dog photo.</li><li>The model learns both crops are “the same thing.” Later, it recognizes <strong>dogs</strong> even in new poses or lighting.</li></ul><blockquote><strong>Retail playlists</strong></blockquote><ul><li>Customers who buy <strong>running shoes</strong> also often look at <strong>socks</strong> and <strong>phone armbands</strong>.</li><li>SSL learns this bundle — no one had to label “these three go together.”</li></ul><blockquote><strong>Genomics</strong></blockquote><ul><li>DNA has recurring “motifs.” Predicting the next base forces the model to internalize these motifs, which helps spot genes.</li></ul><h4><strong>How these pieces fit together</strong></h4><ul><li><strong>Foundation models</strong> get powerful because <strong>SSL</strong> lets them soak up the <em>structure</em> of the world from raw data.</li><li>Once they have that structure, a little <strong>supervised learning</strong> (fine‑tuning) or <strong>RL</strong> (to align behavior with goals) goes a long way.</li><li>In <strong>organizations</strong>, mirror the same idea: collect rich bottom‑up signals (customer interactions), then guide with top‑down objectives (safety, strategy).</li></ul><h4><strong>Practical tips if you’re building with this</strong></h4><ul><li>Start with <strong>self‑supervised pretraining</strong> on all the unlabelled data you can legally and ethically use.</li><li>Add <strong>task‑specific fine‑tuning</strong> with small labelled sets.</li><li>For sequential tasks (e.g., routing, pricing policies), use <strong>RL</strong> carefully in simulations or sandboxes first.</li><li>Measure both <strong>accuracy</strong> <em>and</em> <strong>robustness</strong> (does it still work when conditions shift?).</li><li>Keep <strong>human oversight</strong> for safety and fairness; raw data can encode biases.</li></ul><h3><strong>How Does It 
Work?</strong></h3><h4><strong>What this covers (at a glance)</strong></h4><ul><li><strong>Foundation models &amp; Generative AI</strong> learn general skills from <strong>unlabelled data</strong> using <strong>self‑supervised learning (SSL)</strong> — a big leap that makes AI scalable and versatile.</li><li>How <strong>language models</strong> learn meaning from context (masked vs causal next‑word prediction), and why <strong>text‑to‑text</strong> framing (like T5) simplifies everything.</li><li><strong>Contrastive learning</strong> for text and images, <strong>diffusion</strong> for image generation, and classic <strong>autoencoders</strong> &amp; <strong>GANs</strong> for compression and synthesis.</li><li>Why <strong>language is a universal interface</strong> for robots and <strong>autonomous agents</strong> (plan → act → check → improve), and how tool use (calculator, web) expands capability.</li></ul><h4><strong>1) Self‑supervised learning: the breakthrough</strong></h4><p><strong>Idea in one line:</strong> Make up a small puzzle from raw data (mask a word, crop an image, add noise), train the model to solve it, and the model is forced to learn <strong>useful general patterns</strong>.</p><ul><li><strong>No labels required:</strong> We don’t need humans to tag every example.</li><li><strong>Scales beautifully:</strong> Tons of unlabelled text/images/audio exist.</li><li><strong>Reusable knowledge:</strong> After pretraining, you can <strong>fine‑tune</strong> or <strong>prompt</strong> for many tasks (classification, search, QA, coding, robotics).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AwCg1aSUmfal93UPJ6Skpg.png" /><figcaption><strong>SSL overview → pretrain, then reuse</strong></figcaption></figure><h4><strong>2) How models understand words (vectors + context)</strong></h4><ul><li><strong>Meanings are relational:</strong> <em>Cat</em> is close to <em>kitten</em> (species/age relation).</li><li><strong>Context 
disambiguates:</strong> “bank” (money) vs “bank” (river) depends on nearby words.</li><li><strong>Masked language modeling (MLM):</strong> Hide a word and force the model to predict it using <strong>both left and right context</strong> → strong <strong>understanding</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_cyYfuEyTmcr1PyhkPvuRA.png" /><figcaption><strong>Context and vector space (cat–kitten, bank)</strong></figcaption></figure><h4><strong>3) Pretraining → fine‑tuning, plus the text‑to‑text shift</strong></h4><ul><li><strong>Pretraining</strong> gives broad language sense.</li><li><strong>Fine‑tuning</strong> nudges the model for a specific job (e.g., Amazon review sentiment).</li><li><strong>Text‑to‑text framing (T5):</strong> Represent <em>every</em> task as <strong>input text → output text</strong> (e.g., “translate: …”, “summarize: …”).</li></ul><blockquote><strong>Why it’s great:</strong> One uniform interface, less ad‑hoc engineering, easy to read and debug.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H5vOVNU9DZKvnv7RvWWYng.png" /><figcaption><strong>Pretrain → World sense → Fine‑tune; Text‑to‑text examples</strong></figcaption></figure><h4><strong>4) Two training styles for language models</strong></h4><blockquote><strong>Masked LM (BERT‑like)</strong></blockquote><ul><li><strong>Training:</strong> “The cat sat on the [MASK].” Predict the mask using <strong>both sides</strong>.</li><li><strong>Strength:</strong> Strong internal representations (great for understanding).</li><li><strong>Limit:</strong> Not naturally a left‑to‑right generator.</li></ul><blockquote><strong>Causal LM (GPT‑like)</strong></blockquote><ul><li><strong>Training:</strong> “The cat sat on the …” Predict the <strong>next</strong> word using only the <strong>left</strong> side.</li><li><strong>Strength:</strong> Fluent, open‑ended <strong>generation</strong>.</li><li><strong>Limit:</strong> No right 
context; may “guess” and sometimes hallucinate.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*O5P5GjMOZXN6yH3hA-VIDw.png" /><figcaption><strong>Masked vs Causal (pros &amp; cons)</strong></figcaption></figure><h4><strong>5) Contrastive learning (images &amp; sentences)</strong></h4><p><strong>Core idea:</strong> Pull <strong>similar</strong> things <strong>together</strong> in the model’s space; push <strong>dissimilar</strong> things <strong>apart</strong>.</p><ul><li><strong>Images:</strong> Two crops of the same photo → <strong>positive pair</strong> (close embeddings). Unrelated images → <strong>negative</strong> (far apart).<br> → Improves image representations and classification.</li><li><strong>Text:</strong> Paraphrases or augmented sentences → positive pairs. Helps models <strong>encode meaning</strong> consistently.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gx6lQUT0W6XHt4g_ox2c1Q.png" /><figcaption><strong>Contrastive learning for images &amp; sentences</strong></figcaption></figure><h4><strong>6) Diffusion models: generate by denoising</strong></h4><ul><li><strong>Forward process:</strong> Gradually add noise to an image until it’s nearly pure noise.</li><li><strong>Reverse process:</strong> Train a model to <strong>remove a little noise at a time</strong>, walking back to a clean image.</li><li><strong>Why it works:</strong> The model learns to reconstruct structure from noise → powerful, controllable image generation.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hL43kaimHJR3LHL8wdOUnw.png" /><figcaption><strong>Diffusion intuition (noise → denoise)</strong></figcaption></figure><h4><strong>7) Autoencoders &amp; GANs</strong></h4><blockquote><strong>Autoencoder:</strong></blockquote><ul><li><strong>Encoder → bottleneck → decoder.</strong></li><li>Learns compact representations; good for <strong>compression</strong>, <strong>denoising</strong>, 
<strong>feature learning</strong>.</li></ul><blockquote><strong>GAN (Generative Adversarial Network):</strong></blockquote><ul><li><strong>Generator (artist)</strong> makes fakes; <strong>Discriminator (critic)</strong> tries to spot fakes.</li><li>Training is a <strong>competition</strong> → increasingly realistic images.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jFiUxEhmIB2q9dHDdFsnjA.png" /><figcaption><strong>Autoencoders &amp; GANs in one view</strong></figcaption></figure><h4><strong>8) Language as a universal interface (robots &amp; agents)</strong></h4><p><strong>Language standardizes knowledge:</strong> Easy to write, read, and share instructions (“Pick up the red mug…”).</p><ul><li><strong>Robotics:</strong> The language model turns instructions into <strong>plans</strong> (steps), checks constraints, and sequences actions more clearly than low‑level numbers alone.</li><li><strong>Agents with tools:</strong> The model plans, uses tools (<strong>calculator</strong>, <strong>browser</strong>, <strong>database</strong>), <strong>self‑checks</strong>, retries, and learns from memory/logs.</li><li><strong>Why tools matter:</strong> Offload heavy math or retrieval to reliable tools → better accuracy and less user burden.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lQPn2gfYHQyRM_XrPMiSMg.png" /><figcaption><strong>Language → plan → execute, with tools &amp; self‑check loop</strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Gtg67YqtwaOB0Kq55f5IJg.png" /></figure><h3><strong>ChatGPT &amp; LLMs</strong></h3><h4><strong>1) What’s special about ChatGPT &amp; foundation models?</strong></h4><ul><li><strong>Self‑supervised pretraining</strong>: The model learns general language patterns by predicting the <strong>next word</strong> on huge amounts of text. 
No manual labels are needed.</li><li><strong>Transformers</strong>: The architecture that makes training fast and effective by processing tokens <strong>in parallel</strong> with <strong>self‑attention</strong>, unlike older sequential RNNs.</li><li><strong>Scaling</strong>: More data, parameters, and compute typically lower error (up to practical limits), making models more capable.</li><li><strong>Engineering details matter</strong>: Beyond big ideas, stability tricks, data curation, and training pipelines drive real‑world quality and robustness.</li><li><strong>Beyond text</strong>: The lecture mentions <strong>Stable Diffusion</strong> (images) and other emerging models, showing that these foundations generalize across modalities.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2f1SxOOLDIhfTDoASmcNKw.png" /><figcaption><strong>Next‑token prediction (the core pretraining loop)</strong></figcaption></figure><h4><strong>2) How next‑word prediction actually trains a model</strong></h4><ol><li>Feed the prompt tokens (e.g., “The cat sat on the”).</li><li>The model outputs a <strong>probability</strong> for every word in its vocabulary.</li><li>Compare with the <strong>true</strong> next word (“mat”) → compute loss.</li><li>Update the model so it assigns <strong>higher</strong> probability to the correct word next time.</li><li>Repeat billions of times with diverse text → the model internalizes grammar, facts, and patterns.</li></ol><p>This single training objective is surprisingly powerful because language encodes knowledge about the world.</p><h4><strong>3) Why Transformers changed the game</strong></h4><ul><li><strong>Self‑attention</strong> lets each word look at <strong>all</strong> other words at once to pull relevant context; the model runs <strong>many tokens at once</strong> (matrix math), not one‑by‑one.</li><li><strong>Direct long‑distance connections</strong>: The word at the start can directly attend to something at the end; 
RNNs struggle with long memories.</li><li><strong>Positional encodings</strong> provide <strong>order</strong> information so the model knows which word came first, even though it processes them in parallel.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*b5mPCbGSnn_U282oa0JQ1g.png" /><figcaption><strong>Transformer block (attention → MLP)</strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*keAuuuNE02T93ysMmj92uA.png" /><figcaption><strong>Multi‑head attention (different heads learn different relations)</strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OLhXCHAsOt5oH_333SWeTg.png" /><figcaption><strong>Positional encodings (how order is represented)</strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*88WJXFij8B4Hv0MjgVA7vg.png" /><figcaption><strong>Illustration — Toy sinusoidal signals for positions</strong></figcaption></figure><h4><strong>4) Scaling laws: why “more” often helps</strong></h4><p>As you increase <strong>data</strong>, <strong>model size</strong>, and <strong>compute</strong>, loss tends to drop smoothly — until you hit practical limits (data quality, overfitting, etc.). 
That’s why foundation models keep getting stronger when scaled correctly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*11B-vysB1UQxoTmyE9DG3w.png" /><figcaption><strong>Scaling (schematic)</strong></figcaption></figure><h4><strong>5) From raw models to helpful dialogue: SFT and RLHF</strong></h4><ul><li><strong>SFT (Supervised Fine‑Tuning)</strong>: Teach the model the <em>style</em> of helpful answers using curated instruction‑response pairs.</li><li><strong>Preference data</strong>: Humans compare two model replies and pick the better one — capturing <strong>quality</strong> beyond token‑by‑token accuracy.</li><li><strong>Reward model</strong>: A model trained to <strong>predict</strong> human preferences gives a score to a candidate reply.</li><li><strong>RLHF (Reinforcement Learning from Human Feedback)</strong>: Optimize the policy (the chat model) to <strong>maximize</strong> the reward model’s score (often via PPO). This improves helpfulness, harmlessness, and robustness over long responses, even though feedback is <strong>delayed</strong> until the end.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nd1g9UUqJKMdzUA4AxtUlQ.png" /><figcaption><strong>RLHF loop (prompt → candidates → preference → reward → PPO update)</strong></figcaption></figure><h4><strong>6) Reinforcement learning challenges in dialogue</strong></h4><ul><li><strong>Delayed feedback</strong>: You don’t know if the final answer is good until the end — hard for credit assignment.</li><li><strong>Exploration vs exploitation</strong>:</li><li><em>Exploit</em>: stick to what works now.</li><li><em>Explore</em>: try new phrasing/structures that may be better.</li><li>Best: <strong>targeted exploration</strong> — sample promising but uncertain options.</li><li><strong>Robustness</strong>: RL must not overfit to the reward model or game the metric. 
We add safeguards (consistency checks, tool‑verified steps, rule constraints).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MM9ZcFvCaqqozmcgwjREsg.png" /></figure><h4><strong>7) Feedback that grows with the model</strong></h4><p>As the model improves, we also <strong>raise the bar</strong> for feedback:</p><ol><li>Token‑level losses (pretraining).</li><li><strong>SFT</strong> demonstration quality.</li><li><strong>Pairwise preferences</strong> (ranking).</li><li><strong>Rule‑based checks</strong> (format, safety, citations).</li><li><strong>Tool‑verified answers</strong> (calculators, retrieval) and <strong>self‑check</strong> steps.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d0BQZ4WBVMQgWpVgl7RXhA.png" /><figcaption><strong>Feedback curriculum (simple → sophisticated)</strong></figcaption></figure><h4><strong>8) Practical takeaways for builders</strong></h4><ul><li><strong>Engineering intuition matters</strong>: Stability tricks, data filtering, careful validation — these “details” create the leap from demo to dependable.</li><li><strong>Attention &amp; causality</strong>: ChatGPT uses <strong>causal</strong> (left‑to‑right) attention for generation.</li><li><strong>Guardrails</strong>: Balance factuality, bias reduction, and sensitivity. 
Use tool‑use (calculators/browsers), structured prompts, citations, and post‑processing checks.</li></ul><h4><strong>Quick glossary</strong></h4><ul><li><strong>Self‑supervised learning</strong>: Learn from raw data by solving puzzles like “predict the next word.”</li><li><strong>Transformer</strong>: Architecture that uses <strong>self‑attention</strong> to combine context efficiently in parallel.</li><li><strong>SFT</strong>: Supervised fine‑tuning on instruction data to teach helpful outputs.</li><li><strong>RLHF</strong>: Use human preference judgments to train a reward model and optimize the chat policy.</li></ul><h3><strong>Data and Stable Diffusion</strong></h3><h4><strong>1) Why data is the power source for modern AI</strong></h4><ul><li><strong>Data &gt; tech (over time):</strong> Better, larger, cleaner, and properly licensed data usually beats fancy tricks. Models are only as good as what they learn from.</li><li><strong>Access matters:</strong> Whoever can legally access and refresh high‑quality datasets can <strong>retrain</strong> and keep improving (e.g., new styles, trends, vocabulary).</li><li><strong>Ethics &amp; copyright:</strong> Datasets must respect creators’ rights. 
The legal landscape affects what data can be used — and therefore what models can learn.</li><li><strong>We are data creators:</strong> Our texts, images, and interactions become the “lessons” models learn from.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sUII5oogowwTThLIC2NN3g.png" /><figcaption><strong>Data → Pretraining → Foundation Model → Apps</strong></figcaption></figure><h4><strong>2) Stable Diffusion in plain English</strong></h4><p><strong>What it does:</strong> Turns a <strong>text prompt</strong> (e.g., “a watercolor fox in a forest”) into an image by starting from <strong>random noise</strong> and gradually <strong>removing</strong> that noise in a series of small steps.</p><p><strong>Key pieces:</strong></p><ol><li><strong>Text encoder</strong>: Converts your prompt into a vector (a numerical summary of meaning).</li><li><strong>Latent space</strong>: Images are <strong>compressed</strong> into a smaller grid (a “latent”) so generation is much faster and cheaper.</li><li><strong>Denoiser (U‑Net)</strong>: Learns to <strong>remove a bit of noise</strong> at each step. After many steps, the latent looks like a clean picture representation.</li><li><strong>Decoder (VAE)</strong>: Transforms the final latent back into a full‑resolution <strong>image</strong>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JlcxQXYMLS0VVXfu9JvGKw.png" /><figcaption><strong>Stable Diffusion pipeline (overview)</strong></figcaption></figure><h4><strong>3) Why randomness is needed</strong></h4><ul><li>Without randomness, models would keep producing <strong>the same</strong> output.</li><li>A <strong>seed</strong> controls the initial noise. Change the seed → different starting point → <strong>different image</strong>. 
Keep the seed → <strong>reproducible</strong> image.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XZKZEamXpR0vKhqa3j8kMg.png" /><figcaption><strong>Seeds change the result</strong></figcaption></figure><h4><strong>4) The “iterative improvement” idea</strong></h4><ul><li>Think of an artist sketching: <strong>rough → refine → detail</strong>.</li><li>The model does the same: <strong>many small denoising steps</strong> (from heavy noise at the start to almost none at the end) until the picture emerges.</li><li><strong>Text guidance</strong> nudges each step toward what the prompt asked for.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H5M83nf0XIE-SPK3SbIxsA.png" /><figcaption><strong>From pure noise to image over steps</strong></figcaption></figure><h4><strong>5) Why it’s cost‑efficient: work in latent space</strong></h4><ul><li>Training and generating on <strong>compressed latents</strong> (smaller grids) is <strong>much faster</strong> than working on full‑resolution pixels.</li><li>The <strong>encoder</strong> compresses; the <strong>decoder</strong> reconstructs. Good encoders/decoders preserve important details while dropping redundancy.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JdO_PutFm2kWtEaBOfrdqg.png" /><figcaption><strong>Latent compression with VAE</strong></figcaption></figure><h4><strong>6) How models know an image is “good”: losses</strong></h4><p>To train an image model, we need a way to <strong>measure</strong> how good the output is relative to a target. 
Common ingredients:</p><ul><li><strong>Pixel / reconstruction loss (MSE/MAE):</strong> Simple and stable, but the resulting images can look slightly <strong>blurry</strong>.</li><li><strong>Perceptual loss:</strong> Compares <strong>features</strong> from a vision network rather than raw pixels; pushes outputs toward images humans find <strong>sharp</strong> and <strong>natural</strong>.</li><li><strong>Adversarial (GAN/patch critic):</strong> A small <strong>critic</strong> network checks local patches for realism; great for <strong>texture</strong>, but training can be <strong>tricky</strong>.</li></ul><p>Often we <strong>combine</strong> these to get both sharp details <strong>and</strong> global correctness.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yuc_5ZS1znduWw6v8tPQtg.png" /><figcaption><strong>Losses: pixel vs perceptual vs adversarial</strong></figcaption></figure><h4><strong>7) Local detail + global structure</strong></h4><ul><li><strong>Patch critic</strong> rewards <strong>locally</strong> realistic textures (fur, bark, fabric).</li><li><strong>Global similarity</strong> keeps the <strong>overall composition</strong> (shapes, layout) coherent.</li><li>Balancing both makes images look realistic <strong>up close</strong> and <strong>as a whole</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8KR6mRPcJyZmFr3d9fotQQ.png" /><figcaption><strong>Patch critic &amp; global similarity together</strong></figcaption></figure><h4><strong>8) How text aligns with images (contrastive learning)</strong></h4><ul><li>Models learn that the image of a <strong>fox in a forest</strong> should be <strong>close</strong> (in embedding space) to the caption “a red fox in a forest,” and <strong>far</strong> from unrelated captions.</li><li>This <strong>alignment</strong> helps prompts steer image generation in the intended direction.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jH91VvrcQvDBw71EBGdwng.png" /></figure><h4><strong>9) 
Training with different noise levels</strong></h4><ul><li>During training, the model sees the <strong>same image</strong> at <strong>many noise levels</strong> and learns to <strong>denoise</strong> appropriately.</li><li>This “curriculum of noise” provides <strong>directional feedback</strong>: at each step it learns how to move a little closer to the real image.</li><li>Over many iterations, it can <strong>navigate</strong> from noisy inputs to realistic outputs.</li></ul><h4><strong>Stable Diffusion: Components &amp; Trade‑offs</strong></h4><p><strong>Data is king</strong>: high‑quality, licensed, diverse data drives better models.</p><p><strong>Stable Diffusion</strong> generates images by <strong>iteratively denoising</strong> a compressed latent, guided by your text prompt.</p><p><strong>Randomness (seed)</strong> gives diversity; <strong>latent space</strong> makes it fast and cheap; <strong>combined losses</strong> ensure realistic detail and coherent structure.</p><p><strong>Contrastive alignment</strong> ties text and images together so prompts steer results effectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JS5yAJNjRp3tidaBma1-Wg.png" /></figure><h3><strong>AI Ecosystem</strong></h3><p><strong>Foundation models</strong> learn <strong>relational meaning</strong>: they understand concepts from how things connect (across text, images, audio, behaviors). Companies that <strong>unify</strong> their data and ML into a <strong>single intelligence layer</strong> compound advantages across search, recommendations, marketing, pricing, risk, support, and more.</p><h4><strong>1) What are “foundation models” and why are they a big shift?</strong></h4><p>A <strong>foundation model</strong> is a large AI model trained on a huge amount of data (text, images, etc.) so it learns <strong>general patterns</strong>. 
After that, you can adapt it to many tasks like:</p><ul><li>answering questions</li><li>summarizing documents</li><li>recommending products</li><li>classifying items</li><li>extracting info from text</li></ul><p>Earlier AI was usually <strong>one model per task</strong> (one for translation, one for search ranking, one for sentiment…).</p><p>Now, a foundation model can become a <strong>single base engine</strong> that supports many tasks.</p><h4><strong>2) The key idea: “Relational meaning” (how AI really understands concepts)</strong></h4><p><strong>What “relational meaning” means</strong></p><p>A word or concept doesn’t have meaning in isolation. It gets meaning from its <strong>relationships</strong> with other concepts.</p><p>Example:</p><ul><li>You understand <strong>“dog”</strong> not only by a dictionary definition<br> but also by its links to <strong>bark, pet, leash, park, bite, fur, vet, cute</strong>.</li></ul><p>Foundation models learn like this by observing massive data:</p><ul><li>which words appear near which words</li><li>which images match which captions</li><li>which actions follow which actions in user behavior logs</li></ul><p>This is why they can often “understand” things they were never explicitly taught.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p2cE_pL5wOndd-Ynj5Ijvg.png" /></figure><h4><strong>3) Why the section talks about “how humans learn” (and why it matters)</strong></h4><p>The speakers highlight that humans learn a lot <strong>without formal teaching</strong>:</p><ul><li>kids learn language mostly by exposure, trial, correction, context</li><li>people learn meaning from real-world experience and repeated patterns</li></ul><p>This connects to modern AI training called <strong>self-supervised learning</strong>:</p><ul><li>the model teaches itself from data patterns (no human labeling for every example)</li></ul><p>So the message is:</p><p>If you want better AI, learn from how humans build understanding: 
mostly from exposure + relationships + experience.</p><h4><strong>4) From “isolated task models” to “unified intelligence”</strong></h4><p><strong>Old style</strong></p><ul><li>Separate AI for search</li><li>Separate AI for recommendations</li><li>Separate AI for customer support</li><li>Separate AI for marketing analytics</li></ul><p>Problem: these systems don’t “talk” to each other well, so the business acts like it has <strong>multiple small brains</strong>.</p><p><strong>New style (unified intelligence)</strong></p><p>Build a <strong>central intelligence layer</strong> that understands:</p><ul><li>customers</li><li>products</li><li>context (season, location, trends)</li><li>business constraints (inventory, delivery, margins)</li></ul><p>Then different applications connect to it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bmKkgSK4Vq2ZUUyFocRDrA.png" /></figure><h4><strong>5) Why businesses need their own central model (competitive advantage)</strong></h4><p>If everyone uses the same public model (same API, same general training),<br> then <strong>everyone has the same intelligence</strong>.</p><p>So where does advantage come from?</p><p><strong>Your proprietary data</strong>:</p><ul><li>customer clicks, searches, carts, purchases</li><li>returns and complaints</li><li>store inventory and supply chain signals</li><li>product catalogs and attributes</li><li>domain rules (retail logic, policies, constraints)</li></ul><p>When you combine foundation models with <em>your</em> unique data, you get:</p><ul><li>better personalization</li><li>better predictions</li><li>better product understanding</li><li>better decision-making</li></ul><p>That’s hard for competitors to copy.</p><h4><strong>6) “Many foundation models will exist” (not just one model to rule all)</strong></h4><p>The future described is <strong>not</strong>: one single AI does everything best.</p><p>Instead:</p><ul><li>some models are great at language</li><li>some at 
images/video</li><li>some at search/ranking</li><li>some at code</li><li>some at reasoning</li><li>some at a specific industry (health, retail, finance)</li></ul><p>So companies will likely use a <strong>portfolio of models</strong>:</p><ul><li>an internal “core” model for their business brain</li><li>external models for specialized skills</li><li>tools like databases, search engines, workflow systems</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0cjn1Y2hsCVADCzz5Jfc6w.png" /></figure><h4><strong>7) Multi-modal learning (like human senses)</strong></h4><p>Humans don’t learn only from text:</p><ul><li>vision, sound, touch, movement, memory, emotion all contribute</li></ul><p>Similarly, foundation models are evolving to handle:</p><ul><li>text + images + audio + video + user behavior + structured data</li></ul><p>This creates “synergy”:</p><ul><li>the model can connect how something <strong>looks</strong> with how it’s <strong>described</strong></li><li>and how people <strong>behave</strong> around it (click, buy, return)</li></ul><h4><strong>8) System 1 vs System 2 thinking (intuition vs conscious reasoning)</strong></h4><p>The section mentions a psychological idea:</p><p><strong>System 1 (fast, automatic)</strong></p><ul><li>quick judgments</li><li>intuition</li><li>habits</li><li>pattern recognition<br> Most day-to-day decisions happen here.</li></ul><p><strong>System 2 (slow, deliberate)</strong></p><ul><li>careful reasoning</li><li>step-by-step logic</li><li>conscious effort<br> Used less often.</li></ul><p>Why this matters for AI:</p><ul><li>many useful business predictions are more like <strong>System 1</strong><br> (pattern-based, probabilistic, fast)</li><li>not everything needs heavy “reasoning” to be valuable</li></ul><p>Example:</p><ul><li>predicting a customer is likely to abandon a purchase doesn’t require a proof</li><li>it requires recognising patterns from behavior</li></ul><h4><strong>9) Retail “deep 
intelligence” (focus on understanding customers, not just solving one task)</strong></h4><p>The section argues that the biggest win in retail is not only:</p><ul><li>“answer questions”</li><li>“fix tickets”</li><li>“automate emails”</li></ul><p>…but building a model that understands:</p><ul><li>customer intent</li><li>product meaning</li><li>shopping journey</li><li>preferences and constraints</li></ul><p>That enables a better experience:</p><ul><li>better navigation</li><li>better recommendations</li><li>fewer frustrating searches</li><li>more trust</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vf53ZZ_0RXP1AA14Rv7s8g.png" /></figure><h4><strong>10) Why “expert labelling” can miss what customers actually care about</strong></h4><p>Traditional product tagging might say:</p><ul><li>Category: “wall art”</li><li>Style: “landscape”</li><li>Color: “orange”</li></ul><p>But customers might actually be reacting to something else:</p><ul><li>the <strong>feeling</strong></li><li>the <strong>scene</strong> (example mentioned: “sunsets”)</li><li>mood, aesthetic, cultural meaning</li></ul><p>AI can learn this from behavior:</p><ul><li>what people click after viewing it</li><li>what they compare it with</li><li>what they save</li><li>what they return</li><li>what they search before they buy</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ksm1oucatCHGGyqBPEkA3g.png" /></figure><h4><strong>11) Multilingual + cultural nuance (harder than it looks)</strong></h4><p>The section points out that in real markets:</p><ul><li>people mix languages (code-switching)</li><li>meanings differ by culture</li><li>translations aren’t literal</li></ul><p>So a retail intelligence system must understand:</p><ul><li>blended language queries</li><li>local synonyms</li><li>culturally specific product interpretations</li></ul><p>Example style of problem:</p><ul><li>one region’s “slippers” might be another region’s “flip-flops”</li><li>product 
descriptions may need adaptation, not direct translation</li></ul><h4><strong>12) Predictive workforce modelling (reducing attrition cost)</strong></h4><p>Instead of only relying on surveys (“Are you happy at work?”),<br> AI can learn patterns from behaviour signals like:</p><ul><li>schedule changes</li><li>overtime spikes</li><li>repeated shift conflicts</li><li>performance changes</li><li>transfer requests</li><li>absentee patterns</li></ul><p>Then it can estimate <strong>attrition risk</strong>, so managers can intervene early:</p><ul><li>better scheduling</li><li>coaching</li><li>career development</li><li>workload balancing</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v3kx_gmuDwtcHCEc0tFqCw.png" /></figure><p>(Important note in real life: this should be handled carefully with privacy, fairness, and transparency; otherwise it can create mistrust.)</p><h3><strong>AI BIOLOGY</strong></h3><h4><strong>1) AI is changing biology and medicine: what’s the big change?</strong></h4><p>Earlier, medicine progressed mainly by:</p><ul><li>observing something in patients,</li><li>making a <strong>hypothesis</strong> (“maybe X causes Y”),</li><li>testing it in small experiments.</li></ul><p>Now, we can collect huge amounts of data (genetics + hospital records + medical images + lab results), and AI helps us:</p><ul><li>find patterns we didn’t notice,</li><li>discover hidden subtypes of disease,</li><li>propose new drug targets,</li><li>personalize treatment per person.</li></ul><p>So the professor’s main message is:</p><p>Medicine is shifting from “guess first” to “measure a lot first.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CBjw2tl9a-0eZJjIocGdWA.png" /></figure><h4><strong>2) Hypothesis-driven vs data-driven research (simple comparison)</strong></h4><p><strong>Hypothesis-driven (older style)</strong></p><ol><li>Scientist guesses an explanation.</li><li>Runs a small targeted experiment.</li><li>Confirms or 
rejects it.</li><li>Repeats.</li></ol><p>Good when we already have strong ideas.<br>Limited because we might miss unexpected causes.</p><p><strong>Data-driven (new style)</strong></p><ol><li>Collect big datasets (genetics, EHR, images).</li><li>Use AI to find patterns and relationships.</li><li>Generate many candidate explanations.</li><li>Test the best candidates in lab/clinical experiments.</li></ol><p>Great at discovering surprises and hidden mechanisms.<br>Needs careful design to avoid false patterns and bias.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wf924KFY196VAdNmQ78ZoA.png" /></figure><h4><strong>3) Moving from “correlation” to “causation” (why genetics is powerful)</strong></h4><p><strong>Correlation</strong></p><p>Correlation means:</p><ul><li>“X and Y occur together”</li></ul><p>Example:</p><ul><li>People with a certain marker often have Alzheimer’s.</li></ul><p>But correlation does <strong>not</strong> prove cause:</p><ul><li>Maybe X is just a side effect, not the real driver.</li></ul><p><strong>Causation</strong></p><p>Causation means:</p><ul><li>“X actually produces Y”</li></ul><p>Genetics helps because it gives <strong>mechanistic clues</strong>:</p><ul><li>If a gene variant increases disease risk, it’s often closer to a real cause (not always, but it’s a stronger clue).</li></ul><p>Then researchers test causation by <strong>interventions</strong>, like:</p><ul><li>editing genes in cells,</li><li>switching gene circuits on/off,</li><li>checking if the disease-related outcome changes.</li></ul><p>If changing the gene changes the outcome → stronger evidence of cause → better drug target.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4vRonEj6ty-Oh-e1ja7xvw.png" /></figure><h4><strong>4) Deep learning in biology: why it’s useful</strong></h4><p>Biology data is messy and complex:</p><ul><li>thousands of genes interact,</li><li>proteins fold in complicated ways,</li><li>disease isn’t one thing (it 
can have subtypes),</li><li>data comes from many sources.</li></ul><p>Deep learning helps because it can learn patterns from high-dimensional data like:</p><ul><li>gene expression profiles,</li><li>microscopy images,</li><li>pathology slides,</li><li>multi-step patient histories.</li></ul><p>A key idea in the summary:</p><p>AI can predict outcomes, then scientists validate those predictions by doing real experiments on cells.</p><p>So AI doesn’t replace experiments; it helps choose <strong>which experiments to do first</strong>.</p><h4><strong>5) Genetic mechanisms → new therapies (examples in the summary)</strong></h4><p>The section summary gives examples of using genetic understanding like “rewiring circuits”:</p><p><strong>A) Obesity / metabolic disorders</strong></p><p>Human fat cells can behave in different modes:</p><ul><li><strong>fat-storing mode</strong></li><li><strong>fat-burning mode</strong></li></ul><p>If we can “switch” the gene circuit controlling that behavior, it may become possible to shift metabolism in a healthier direction.</p><p>(Important: conceptually powerful, but real therapies must be safe and proven in humans.)</p><p><strong>B) Alzheimer’s (APOE4 example)</strong></p><p>APOE4 is a genetic variant linked to higher Alzheimer’s risk.<br> The summary says:</p><ul><li>fixing a specific biological function (cholesterol transport) improved myelination and cognition in that context.</li></ul><p>The big idea:</p><p>Find the mechanism a risky gene disrupts, then target that mechanism.</p><p><strong>C) Cancer immunotherapy + recurrence</strong></p><p>If we understand the genetic circuits that let cancer return, we can:</p><ul><li>predict recurrence risk,</li><li>design therapies to prevent relapse,</li><li>personalize follow-up and treatment intensity.</li></ul><h4><strong>6) Integrating genetics + EHR (health records) for deeper understanding</strong></h4><p><strong>What is the goal?</strong></p><p>To connect:</p><ul><li><strong>genetic 
variation</strong> (differences in DNA)<br> with</li><li><strong>phenotypes</strong> (what we observe: symptoms, lab values, diagnoses, disease progression)</li></ul><p>If you map many patients, you can find:</p><ul><li>which gene patterns connect to which disease patterns,</li><li>subtypes of diseases that look “same” clinically but differ biologically.</li></ul><p>This is especially useful for complex diseases like Alzheimer’s.</p><h4><strong>7) How LLMs can help with medical notes (EHR text)</strong></h4><p>EHRs contain lots of unstructured text:</p><ul><li>doctor notes</li><li>discharge summaries</li><li>radiology reports</li></ul><p>Large Language Models (LLMs) can:</p><ul><li>extract meaning from that text,</li><li>standardize messy descriptions,</li><li>detect patterns across huge populations (carefully, with privacy and bias control).</li></ul><p>This is not magic — LLMs help turn text into structured signals that can be combined with labs, images, and genetics.</p><h4><strong>8) AI in pathology imaging (tumor detection)</strong></h4><p>Pathology slides are images of tissue.<br> A pathologist checks them for:</p><ul><li>tumor presence,</li><li>tumor grade,</li><li>margins,</li><li>cell patterns.</li></ul><p>AI image models can:</p><ul><li>highlight likely tumor regions,</li><li>detect subtle patterns,</li><li>speed up screening,</li><li>assist diagnosis (as a support tool, not a replacement).</li></ul><p>This improves:</p><ul><li>accuracy</li><li>speed</li><li>consistency (especially when workload is high)</li></ul><h4><strong>9) Graph Neural Networks (GNNs) for molecules and drug design</strong></h4><p>Molecules are naturally graphs:</p><ul><li>atoms = nodes</li><li>bonds = edges</li></ul><p>A GNN learns chemical behavior from structure, helping:</p><ul><li>predict molecule properties,</li><li>suggest new molecules,</li><li>support synthetic chemistry planning.</li></ul><p>This matters for pharmaceutical development because it can shorten the search for 
promising candidates.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DaVm61nYgrmz2e_JeW3UJw.png" /></figure><h4><strong>10) Multi-modal embeddings: the “one common space” idea</strong></h4><p><strong>What is an embedding (simple)?</strong></p><p>An embedding is like turning complex data into a point on a map, so that:</p><ul><li>similar things are close,</li><li>different things are far.</li></ul><p><strong>Multi-modal embedding</strong></p><p>This means combining many types of data into one representation:</p><ul><li>genetics + labs + images + notes</li></ul><p>So each patient becomes a “point” in a big patient map.</p><p>Then you can:</p><ul><li>find similar patients (“neighbors”),</li><li>predict risk/progression,</li><li>select the best treatment based on similar outcomes.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t_zave3_rARqf0ZCVWQHfA.png" /></figure><h4><strong>11) “Google Maps for knowledge” (papers &amp; concepts navigation)</strong></h4><p>The summary mentions a navigation system like Google Maps:</p><ul><li>instead of streets, you have concepts and papers,</li><li>instead of physical distance, you have “meaning distance” (embedding similarity).</li></ul><p>This helps researchers:</p><ul><li>see clusters of related work,</li><li>find gaps (“no one connected these two ideas yet”),</li><li>explore literature faster than manual reading.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_rgcEI4K2vUKIxw5yU1DIQ.png" /></figure><h4><strong>12) Bias in medical data: NMAR (Not Missing At Random, also written MNAR)</strong></h4><p><strong>What NMAR means (simple)</strong></p><p>Medical tests are not collected randomly.</p><p>Doctors order tests because they suspect something is wrong.</p><p>So whether a test result is missing depends on the patient’s condition itself, and the dataset becomes skewed:</p><ul><li>lots of abnormal cases have tests,</li><li>healthy people often don’t.</li></ul><p>If an AI model learns from that directly, it may become biased:</p><ul><li>it may treat 
“missing test” as a strong signal,</li><li>or overestimate risk because it mostly saw sick/testing cases.</li></ul><p><strong>How AI can help</strong></p><p>AI can model the <strong>process of testing</strong>:</p><ul><li>who got tested, when, and why (age, sex, symptoms, access, doctor practice)</li></ul><p>Then it can do <strong>counterfactual analysis</strong>:</p><ul><li>“What would we predict if this person had been tested?”</li><li>“What if they had not been treated?”</li></ul><p>This reduces bias and improves predictions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a_UCpXMR8fJ2LqgyOwwFJA.png" /></figure><p>AI + massive biological/clinical data + genetics + experiments → <strong>mechanism discovery + personalized medicine + faster therapy design</strong>, but we must handle <strong>bias, causality, and validation</strong> carefully.</p><h3><strong>AI AUTONOMY</strong></h3><h4><strong>1) What are “autonomous agents” in simple words?</strong></h4><p>A normal chatbot answers questions.</p><p>An <strong>autonomous agent</strong> goes further:</p><ul><li>It can <strong>take actions</strong>, not just talk.</li><li>It can <strong>use tools</strong> (search, code, databases, apps).</li><li>It works in a <strong>loop</strong> until the task is done.</li></ul><p>Example tasks:</p><ul><li>“Find the best sources and summarize them.”</li><li>“Check my logs and identify security alerts.”</li><li>“Plan a trip and create an itinerary.”</li><li>“Write code, run tests, fix errors, and repeat.”</li></ul><p><strong>Core idea: “Think + Use tools + Learn from results”</strong></p><p>Instead of giving one-shot answers, the agent:</p><ol><li>decides the next step</li><li>uses a tool</li><li>reads the result</li><li>updates the plan</li><li>repeats</li></ol><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*L80xm2ov2NGEafxzEwaycw.png" /></figure><h4><strong>2) Why is there “confusion in AI terminology”?</strong></h4><p>AI is evolving fast, so people mix words that sound similar but mean different things. Here are the common ones:</p><p><strong>LLM (Large Language Model)</strong></p><p>A model trained on lots of text that predicts the next token (word piece).<br> It’s great at language tasks: writing, explaining, summarizing, Q&amp;A.</p><p><strong>GPT</strong></p><p>Short for Generative Pre-trained Transformer: a <strong>family</strong> of LLMs from OpenAI. People also say “GPT” casually to mean “an LLM chatbot,” which adds confusion.</p><p><strong>“Model” vs “Application”</strong></p><ul><li><strong>Model</strong> = the engine (like an AI brain)</li><li><strong>Application</strong> = the product built using that engine (chatbot, copilot, agent)</li></ul><p><strong>Agent</strong></p><p>An application that uses an LLM <strong>plus tools, memory, and an action loop</strong> to get tasks done.</p><h4><strong>3) What is AGI, and why is current AI not AGI?</strong></h4><p><strong>AGI (Artificial General Intelligence)</strong></p><p>In simple terms, AGI would be:</p><p>a system that can do <strong>any</strong> intellectual task a human can (learn new things, adapt, plan, interact with the world).</p><p><strong>Why current AI isn’t AGI (as the lecture hints)</strong></p><p>Today’s LLMs:</p><ul><li>don’t truly “live” in an environment like humans do</li><li>don’t automatically form long-term goals on their own</li><li>may struggle with reliable planning and real-world adaptation</li><li>can be brittle outside their training patterns</li></ul><p>So the <strong>lecture</strong> compares LLMs to <strong>a powerful component</strong> (like a part of the brain), not the whole “complete intelligence system.”</p><h4><strong>4) How agents evolved: from “deep learning only” to “reasoning + tools”</strong></h4><p>Older AI systems were often:</p><ul><li>a single neural 
network that outputs a prediction (classify spam / detect fraud / translate text)</li></ul><p>Modern agents include:</p><ul><li><strong>LLM reasoning</strong></li><li><strong>tool use</strong></li><li><strong>memory</strong></li><li><strong>execution loops</strong></li><li>sometimes <strong>multiple agents</strong> cooperating</li></ul><p>That’s why agents feel more “useful” in real work: they can <em>do</em> things, not just answer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DOhsI8pynSBpJdodb1M9Dw.png" /></figure><h4><strong>5) Chain-of-thought (thinking step-by-step): what it really means</strong></h4><p>“Chain-of-thought prompting” means encouraging the model to:</p><ul><li>break a problem into smaller steps</li><li>reason through them iteratively</li></ul><p>Why it helps:</p><ul><li>complex tasks often fail if the model jumps directly to the final answer</li><li>step-by-step reasoning reduces mistakes (especially in multi-step logic)</li></ul><p>Important note (simple):<br> Even without seeing the full internal steps, the key benefit is that <strong>the model is guided to be more systematic</strong>.</p><h4><strong>6) Reinforcement Learning from Feedback (RLHF / RLAIF)</strong></h4><p><strong>Big idea</strong></p><p>Instead of only training a model to imitate text, we also train it to prefer better answers.</p><p>How it works (simple):</p><ol><li>model generates multiple answers (A, B, C…)</li><li>a judge picks the best:<ul><li><strong>RLHF</strong> (Reinforcement Learning from Human Feedback): humans judge</li><li><strong>RLAIF</strong> (Reinforcement Learning from AI Feedback): AI judges (with rules), sometimes mixed with humans</li></ul></li><li>model is updated to produce more “preferred” answers next time</li></ol>
<p>This improves:</p><ul><li>helpfulness</li><li>safety</li><li>style consistency</li><li>“what users actually want”</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-nGfd694BB6W4Ai3B784EA.png" /></figure><h4><strong>7) RAG (Retrieval-Augmented Generation): why it’s a big trend</strong></h4><p>LLMs can “hallucinate” because they generate text from learned patterns.<br> RAG reduces this by letting the model <strong>look things up</strong> first.</p><p><strong>RAG flow (simple)</strong></p><ol><li>user asks a question</li><li>system retrieves relevant documents/snippets (from internal files or web)</li><li>model answers using those retrieved snippets</li></ol><p>Benefits:</p><ul><li>more accurate</li><li>up-to-date (if the source is current)</li><li>can cite sources</li><li>very useful in enterprise knowledge bases</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B30SpYzq411nOJlImyuy4Q.png" /></figure><h4><strong>8) The planning problem: why agents sometimes fail at “simple tasks”</strong></h4><p>Even strong LLMs can struggle with:</p><ul><li>long multi-step planning</li><li>keeping track of constraints</li><li>not getting distracted mid-way</li><li>executing a full 20-step plan reliably</li></ul><p><strong>A practical solution discussed: “act first”</strong></p><p>Instead of making a huge plan, agents do:</p><ul><li><strong>first best action</strong></li><li>observe results</li><li>adjust next step</li><li>repeat</li></ul><p>This is closer to how humans work in real life: start, see what happens, then correct course.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1PfOou5z5-n4hgJumUn0gg.png" /></figure><h4><strong>9) “Muscle memory” for agents (automation of common behaviours)</strong></h4><p>The lecture uses an idea like “muscle memory”:</p><ul><li>humans don’t consciously plan every tiny movement</li><li>we learn 
reliable automatic routines</li></ul><p>Similarly, agents can become better if they learn reusable skills like:</p><ul><li>“how to search properly”</li><li>“how to debug”</li><li>“how to write a report”</li><li>“how to follow security playbooks”</li></ul><p>This can come from:</p><ul><li>learning from demonstrations (<strong>imitation learning</strong>)</li><li>reinforcement learning</li><li>storing successful workflows as reusable patterns</li></ul><h4><strong>10) Imitation learning + section understanding (why it matters)</strong></h4><p>For physical or environment-based tasks (robots, self-driving, navigation):</p><ul><li>the agent must understand sequences of observations (often video)</li><li>it must map perception → action</li></ul><p>So efficient video processing and learning from demonstrations can make agents:</p><ul><li>more robust in real environments</li><li>better at navigation and interaction</li></ul><h4><strong>11) Collective intelligence: many specialized agents</strong></h4><p>Instead of one general agent doing everything, you can split work:</p><ul><li>Research agent finds information</li><li>Builder agent writes code</li><li>QA agent tests and finds bugs</li><li>Manager agent coordinates</li></ul><p>This can be faster and more reliable — <strong>if</strong> they share a workspace and coordinate properly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g7FddqRRQA6SEBJYZCWaww.png" /></figure><h4><strong>12) Human oversight is essential (especially for high-stakes actions)</strong></h4><p>The future described is not “AI replaces humans.”<br> It’s more like:</p><p>AI does 80% of the work fast, humans approve critical decisions.</p><p>Where humans should stay in control:</p><ul><li>cybersecurity actions (blocking accounts, deleting resources)</li><li>financial actions (payments, purchases)</li><li>production deployments</li><li>sending sensitive emails</li><li>healthcare or legal decisions</li></ul><p>Common safe 
design:</p><ul><li>agent proposes</li><li>safety/risk checks run</li><li>human approves/edits</li><li>system logs everything</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KpTC0lMIQSpq4hSt9PfiLA.png" /></figure><h3><strong>AI ETHICS</strong></h3><h4><strong>1) Why ethics + regulation matter for foundation models</strong></h4><p><strong>Foundation models</strong> (big models that can do many tasks) and <strong>generative AI</strong> (systems that create text/images/video) are powerful because they can influence:</p><ul><li>what people believe (information, persuasion)</li><li>who gets opportunities (jobs, loans, admissions)</li><li>safety and security (fraud, phishing, cyberattacks)</li><li>privacy (learning patterns from personal data)</li><li>society at scale (jobs, culture, politics)</li></ul><p>So the key question becomes:</p><p>“Who is responsible when an AI system causes harm?”</p><p>That’s where ethics and regulation come in.</p><h4><strong>2) Accountability: don’t treat AI like a “person”</strong></h4><p>The lecture warns against <strong>anthropomorphizing AI</strong> — meaning we talk like:</p><ul><li>“the AI decided”</li><li>“the AI wanted”</li><li>“the AI is lying”</li></ul><p>This can be dangerous because it shifts blame away from real people.</p><p><strong>The simple truth</strong></p><p>AI systems are built and deployed by:</p><ul><li>companies</li><li>engineers</li><li>product teams</li><li>leaders who choose goals and incentives</li></ul><p>So accountability should point to real stakeholders:</p><ul><li>who built it?</li><li>who deployed it?</li><li>who profits?</li><li>who failed to add safeguards?</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BNptlsrHAJ2uG1pahEG2Jg.png" /></figure><h4><strong>3) Transparency: what it means (more than “open the code”)</strong></h4><p>People say “AI should be transparent,” but transparency has layers.</p><p><strong>A simple way to understand 
transparency</strong></p><ol><li><strong>Data transparency:</strong><br> Where did training/deployment data come from? What’s missing?</li><li><strong>Model transparency:</strong><br> What can it do well? Where does it fail? What are known risks?</li><li><strong>Decision transparency:</strong><br> Why did it output this? What evidence did it use?</li><li><strong>Governance transparency:</strong><br> Who is accountable? Is there auditing, logging, escalation?</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2A_VjcbxLIj5lUFjn0SjgQ.png" /></figure><h4><strong>4) Real-world vs “ideal world” ethics</strong></h4><p>The lecture highlights a classic gap:</p><ul><li><strong>How the world should work:</strong> fair, truthful, calm decision-making</li><li><strong>How it actually works:</strong> incentives, competition, manipulation, conflict</li></ul><p>So ethical discussions become more useful when they ask:</p><ul><li>“What happens when bad actors use the tech?”</li><li>“What happens when companies optimize for profit/engagement?”</li><li>“What happens when governments use it for defence or influence?”</li></ul><p>This is why the talk discusses warfare, media manipulation, and urgency in national defence.</p><h4><strong>5) Misinformation and manipulation: why generative AI raises the risk</strong></h4><p><strong>Why the risk grows</strong></p><p>Generative AI makes it cheaper and easier to create:</p><ul><li>realistic fake images/video (“deepfakes”)</li><li>persuasive fake text at huge scale</li><li>impersonation (voice, writing style, video)</li></ul><p>This can be used for:</p><ul><li>political manipulation</li><li>scams and fraud</li><li>identity theft and blackmail</li><li>social unrest (spreading distrust)</li></ul><p><strong>Why democracy is vulnerable</strong></p><p>Democracy depends on people agreeing on shared facts.<br> If people stop trusting anything (“everything might be fake”), then:</p><ul><li>it becomes easier to manipulate 
crowds</li><li>it becomes harder to hold anyone accountable</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AI3TLrOew4w2-4PbupyJVg.png" /></figure><h4><strong>6) “Information overload” → skepticism → need for critical thinking</strong></h4><p>The summary says we’re overwhelmed with information, and now we doubt:</p><ul><li>news</li><li>images</li><li>video</li><li>even “direct evidence”</li></ul><p>So individuals and societies need stronger habits like:</p><ul><li>checking sources</li><li>cross-verifying</li><li>understanding incentives (who benefits if I believe this?)</li><li>resisting emotionally-triggering content designed to provoke fast reactions</li></ul><p>This is not only a tech issue — it’s a human thinking issue.</p><h4><strong>7) Privacy risk: AI can learn your “psychology” from data</strong></h4><p>The lecture warns that AI can learn from massive personal data:</p><ul><li>what you click</li><li>what you watch</li><li>what makes you angry or happy</li><li>what convinces you</li></ul><p>This can lead to hyper-personalized persuasion:</p><ul><li>ads that push your exact emotional buttons</li><li>political messaging tailored to your fears</li><li>manipulation that feels like “your own idea”</li></ul><p>So privacy is not just about “my name and phone number.”<br> It’s also about:</p><p>“Can someone model my mind and influence my decisions?”</p><h4><strong>8) Bias and fairness: it’s not a simple on/off switch</strong></h4><p><strong>Why fairness is difficult</strong></p><p>Bias can enter at multiple stages:</p><ol><li><strong>Data bias:</strong><br> Online content is not a perfect mirror of society. It’s selective.</li><li><strong>Measurement bias:</strong><br> What gets recorded? Who gets labeled? 
Who is missing?</li><li><strong>Decision bias (use bias):</strong><br> Even a “good” model can cause harm if used wrongly (over-trusting it, no appeals, no human review).</li></ol><p>Also, fairness often has trade-offs:</p><ul><li>improving one fairness metric can worsen another</li><li>different groups can be affected differently</li></ul><p>So fairness is more like a <strong>continuum</strong> than a binary “fair/unfair.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H1Ww6MjbhWBmu8lUCW7XQA.png" /></figure><h4><strong>9) Risk of “one algorithm dominating society”</strong></h4><p>The lecture warns: if one algorithm (or a few models) become the default decision-maker for many systems (courts, hiring, credit, education), then:</p><ul><li>any bias becomes <strong>system-wide</strong></li><li>mistakes scale to millions of people</li><li>society becomes dependent on a small number of model owners</li></ul><p>This is why governance and diversity of systems matter.</p><h4><strong>10) Social impact: jobs, leisure, and isolation</strong></h4><p>AI can automate parts of work, which might lead to:</p><ul><li>productivity gains</li><li>more leisure time for some people</li></ul><p>But the lecture also points to risks:</p><ul><li>job displacement</li><li>inequality (some benefit more than others)</li><li>changes in human relationships (less interaction, more isolation)</li><li>loss of meaning (if work is a major source of identity)</li></ul><p>So the impact is not only economic — it’s psychological and cultural too.</p><h4><strong>11) Rapid change causes unrest: lessons from history</strong></h4><p>The lecture connects fast technological change to social instability:</p><ul><li>when people feel uncertain, they look for someone to blame</li><li>fear can beat creativity</li><li>conflict becomes more likely when systems change too fast</li></ul><p>This is the idea behind warning against accepting change blindly.</p><h4><strong>12) Antifragility + “time 
tests” (how to deploy responsibly)</strong></h4><p>Because predicting the future is uncertain, the lecture suggests building systems that:</p><ul><li>can fail safely</li><li>learn from failures</li><li>improve over time</li></ul><p><strong>“Time tests” (simple meaning)</strong></p><p>Instead of rolling out a powerful system everywhere:</p><ul><li>test it on a small scale</li><li>run it for a longer period</li><li>observe failures early (cheaply)</li><li>only then scale up</li></ul><p><strong>Antifragile thinking</strong></p><p>Antifragile systems don’t just survive shocks — they improve because of them:</p><ul><li>monitoring + alerts</li><li>fallback modes</li><li>human override</li><li>red-teaming (trying to break it)</li><li>post-incident learning</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7GkMpfmf7kP_Hc_78NiZ1w.png" /></figure><h4><strong>13) Why regulation is hard (and why regions differ)</strong></h4><p>Regulation is tough because it must balance:</p><ul><li><strong>safety</strong> (reduce harm)</li><li><strong>innovation</strong> (don’t freeze progress)</li></ul><p>Different regions emphasize different levers:</p><ul><li><strong>EU</strong>: transparency, risk categories, human oversight, societal impact</li><li><strong>US</strong>: guidance + sector-by-sector rules, benchmarks, risk mitigation while keeping innovation</li><li><strong>China</strong>: algorithm registration, operational standards, content and deployment controls</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tznugIzsVL4SXlcI90bViA.png" /></figure><h4><strong>14) The “regulation vs big corporations” concern</strong></h4><p>A realistic issue raised:</p><ul><li>compliance is expensive</li><li>big companies can pay for audits/lawyers/processes</li><li>small innovators may struggle</li></ul><p>So regulation can unintentionally:</p><ul><li>entrench big players</li><li>reduce competition</li></ul><p>Good policy tries to protect people 
<strong>without</strong> making it impossible for smaller companies to build responsibly.</p><h3><strong>AI PANEL</strong></h3><h4><strong>1) Centralization vs decentralization: what does it mean?</strong></h4><p><strong>Centralization in AI</strong></p><p>This means <strong>a few big companies or platforms</strong> control most of:</p><ul><li>the strongest models</li><li>the data pipelines</li><li>the distribution (apps, cloud, APIs)</li></ul><p><strong>Why it’s attractive</strong></p><ul><li>cheaper (economies of scale)</li><li>faster rollout</li><li>standardization (same tools everywhere)</li></ul><p><strong>Why it’s risky</strong></p><ul><li>“single point of failure” (one big system breaks → many people affected)</li><li>too much power in few hands</li><li>systemic bias (one model’s blind spots spread everywhere)</li><li>less competition → slower innovation over time</li></ul><p><strong>Decentralization in AI</strong></p><p>This means <strong>many different models and builders</strong> exist:</p><ul><li>open-source models</li><li>regional / domain-specific models</li><li>multiple platforms competing</li></ul><p><strong>Why it’s good</strong></p><ul><li>diversity (more approaches, more creativity)</li><li>resilience (if one fails, others still work)</li><li>innovation (new ideas appear faster)</li><li>better local fit (language, culture, domain needs)</li></ul><p><strong>Why it’s hard</strong></p><ul><li>tougher to monitor everyone</li><li>inconsistent quality</li><li>more coordination needed (standards, interoperability)</li></ul><p><strong>Panel’s main idea</strong></p><p>We need a balance: strong innovation + reduced concentration of risk.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qn6BVSDpmXkXOoOaBPHdxA.png" /></figure><h4><strong>2) Why diversity is a big theme in the discussion</strong></h4><p>The panel says diversity is not only a “social value” — it’s an <strong>engineering advantage</strong>.</p><p><strong>How diversity 
improves outcomes (simple)</strong></p><p>Different people bring different:</p><ul><li>assumptions</li><li>problem-solving styles</li><li>priorities (safety vs speed, fairness vs accuracy, etc.)</li></ul><p>This reduces blind spots and increases creativity.</p><p>Example (simple):</p><ul><li>If everyone designing AI has the same background, they may miss how the system affects other communities.</li></ul><p>In AI development, diversity matters in:</p><ul><li>the team building it</li><li>the data used to train it</li><li>the evaluation (who tests it and what tests they run)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*37zLNk88p85_Tad8FsVSfw.png" /></figure><h4><strong>3) Can AI reduce human bias?</strong></h4><p>The panel highlights a hopeful point:</p><ul><li>AI can be designed to <strong>challenge stereotypes</strong> and <strong>detect unfair patterns</strong>.</li></ul><p>But there is a catch:</p><ul><li>AI learns from data, and data often includes society’s past unfairness.</li><li>So AI can either:</li><li><strong>reduce bias</strong>, if carefully designed and tested, or</li><li><strong>amplify bias</strong>, if trained/deployed carelessly.</li></ul><p>So the “anti-bias” outcome is not automatic — it requires:</p><ul><li>clear fairness goals</li><li>careful dataset choices</li><li>testing across groups</li><li>transparency and monitoring</li></ul><h4><strong>4) “Community of models” (why AI systems aren’t one single brain)</strong></h4><p>The panel suggests systems like ChatGPT can be thought of as <strong>multiple components</strong> working together, such as:</p><ul><li>a main language model (generates text)</li><li>safety filters (reduce harmful output)</li><li>retrieval/tools (look up information or run actions)</li><li>coordination logic (decides which component to use)</li></ul><p>Why this matters for “diversity”:</p><ul><li>multiple models can give richer, more robust outcomes</li><li>you can swap/upgrade parts without 
rebuilding everything</li><li>failures can be contained (one part fails, not the entire system)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VDhus5JI1txgs9igHz8x4Q.png" /></figure><h4><strong>5) Change, disruption, and the “evolution” analogy</strong></h4><p>The panel mentions evolution and disruption:</p><ul><li>dinosaurs went extinct → mammals expanded</li><li>big changes create space for new forms of life/innovation</li></ul><p>The message applied to AI:</p><ul><li>disruption can be painful,</li><li>but it can also enable new industries and new kinds of work.</li></ul><p>They also emphasize:</p><p>AI has no “will” or “desire.”<br>Risks usually come from <strong>human misuse</strong>, incentives, or poor governance.</p><h4><strong>6) AI as “fire”: a tool that can help or harm</strong></h4><p>The panel compares AI to fire:</p><ul><li>Fire enables cooking and progress, but can also destroy.</li><li>AI can increase productivity, but also create risks.</li></ul><p><strong>Benefits</strong></p><ul><li>faster learning</li><li>higher productivity</li><li>new discoveries and services</li><li>better tutoring and personalization</li></ul><p><strong>Risks</strong></p><ul><li>scams and manipulation</li><li>bias in important decisions</li><li>concentration of power</li><li>job disruption</li></ul><p>So the outcome depends on:</p><ul><li>who controls it</li><li>what incentives exist</li><li>what safeguards and accountability exist</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*POWvpkYFxHsAaR-_sPi70g.png" /></figure><h4><strong>7) AI in education: “liberating force” for students</strong></h4><p>The panel suggests AI can act like a <strong>personal tutor</strong>:</p><ul><li>helps struggling students (step-by-step explanations)</li><li>challenges advanced students (keeps them engaged)</li><li>offers practice, feedback, and personalized pacing</li></ul><p>This can reduce inequality in education <strong>if access is 
broad</strong>.</p><p>But risks include:</p><ul><li>over-reliance (students stop practicing thinking)</li><li>misinformation (AI may be wrong)</li><li>fairness issues (if only some students can afford it)</li></ul><h4><strong>8) Future of work: jobs will change fast (5–10 years)</strong></h4><p>The panel predicts:</p><ul><li>many current jobs will be transformed (tasks automated)</li><li>some roles will shrink</li><li>new roles will appear (AI operators, evaluators, safety, data stewards, tool builders)</li></ul><p>Important nuance:</p><p>It may not be “jobs disappear everywhere.”<br> Often it’s “tasks move and skills shift,” and opportunities appear in different places.</p><p>So society needs:</p><ul><li>reskilling and upskilling</li><li>support for workers during transitions</li><li>redesigning jobs so humans + AI work together</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PY3-2xs4MJ-39aWN-L5Gug.png" /></figure><h4><strong>9) Regulation: focus on self-regulation + flexibility (but with safeguards)</strong></h4><p>The panel suggests heavy regulation can sometimes:</p><ul><li>slow innovation</li><li>push power toward big companies that can afford compliance</li></ul><p>This is called <strong>regulatory capture</strong>:</p><ul><li>large players handle paperwork easily</li><li>small innovators struggle</li><li>competition drops, centralization increases</li></ul><p>So the panel leans toward:</p><ul><li><strong>self-regulating systems</strong> (local governance, quick iteration)</li><li>plus <strong>standards and external checks</strong> for serious risks</li></ul><p>A practical compromise:</p><ul><li>light rules for low-risk uses</li><li>strong oversight for high-risk uses (health, finance, elections, justice)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wwV1CMAsk85eCe8aJuGSFw.png" /></figure><h4><strong>10) Creativity, originality, and copyright concerns</strong></h4><p>The panel mentions experiments 
suggesting:</p><ul><li>some creative patterns (like melodies) may be finite</li></ul><p>That raises questions:</p><ul><li>What counts as “original”?</li><li>Who owns AI-generated content?</li><li>Is it remixing existing work too closely?</li></ul><p>This is why copyright law and creative ethics will become more important as AI content becomes widespread.</p><ul><li><strong>Centralization</strong> gives efficiency but increases systemic risk.</li><li><strong>Decentralization</strong> boosts diversity and resilience but needs standards.</li><li>Diversity (people + models + evaluation) improves creativity and safety.</li><li>AI is a tool (like fire): outcomes depend on governance and incentives.</li><li>Education may improve, jobs will shift fast, and regulation must avoid stifling innovation while preventing harms.</li></ul><p>I know many of you will not have the patience or time to go through the full explanation, so here is the gist of the lecture above in a crisp format.</p><h3><strong>Crisp Explanation Of The Subject</strong></h3><h4><strong>1) The big idea (simple words)</strong></h4><p><strong>What is a foundation model?</strong></p><p>A <strong>foundation model</strong> is a <strong>very large AI model</strong> trained on a <strong>huge amount of data</strong> (text, images, code, audio, etc.). Because it saw so much data, it learns <strong>general abilities</strong> (language, writing, summarizing, reasoning patterns). 
Then you can <strong>reuse</strong> that same model for many tasks instead of building a separate AI for each task.</p><p><strong>What is generative AI?</strong></p><p><strong>Generative AI</strong> means AI that can <strong>create new content</strong> — like writing text, generating images, writing code, or proposing new designs — by learning patterns from the data it trained on.</p><p><strong>Why this became a “recent shift”</strong></p><p>Earlier AI was often “one model for one job.” The shift is that now <strong>one strong general model</strong> can do many jobs (write, summarize, plan, help with science/business tasks) once you adapt it slightly (prompting, fine-tuning, adding tools).</p><h4><strong>2) Why the recent AI shift matters (simple + real-world meaning)</strong></h4><p><strong>Simple view</strong></p><p>Foundation models are like a person who got a <strong>very broad education</strong> (read tons of books). After that, you can train them quickly for a specific job (like law, medicine, customer support) with much less extra training than starting from scratch.</p><p><strong>Why companies care (economics)</strong></p><p>Because one model can be reused across many products (support, search, coding help, analytics), it becomes a “platform” investment — so funding and adoption surged.</p><p><strong>Agents (a step beyond chat)</strong></p><p>A normal chatbot answers. An <strong>agent</strong> tries to <strong>do multi-step work</strong>: break a goal into steps, use tools, check results, retry, and finish a task loop (plan → act → check → improve).</p><h4><strong>3) The most important concept: “How do machines learn?”</strong></h4><p>The notes describe three main learning styles. 
Here’s the simplest way to remember them.</p><p><strong>A) Supervised learning (learning with answer keys)</strong></p><ul><li>You show examples + correct labels<br> Example: many images labeled “dog” or “cat.”</li><li>The model learns to map input → correct output.</li></ul><p><strong>Downside:</strong> Labels are expensive and sometimes unclear (“what counts as ‘good customer service’?”).</p><p><strong>B) Reinforcement learning (learning by trial-and-error)</strong></p><ul><li>The model takes actions, gets reward/punishment.<br> Example: learning to play a game by scoring points.</li></ul><p><strong>Downside:</strong> Feedback can be delayed and trial-and-error can be unsafe in the real world.</p><p><strong>C) Self-supervised / generative learning (learning from “puzzles” made from raw data)</strong></p><p>This is the breakthrough: the model learns without human labels by solving “make-believe tasks,” like:</p><ul><li>In text: predict the next word, or fill in a missing word</li><li>In images: predict missing patches or match two crops of the same image</li></ul><p><strong>Why it mattered:</strong> There’s an ocean of unlabelled data. Self-supervised learning lets models learn from it cheaply and at massive scale.</p><h4><strong>4) “Meaning comes from relationships” (simple explanation)</strong></h4><p>A key point in the notes is that <strong>concepts get meaning from how they relate to other things</strong>, not from isolated labels.</p><p><strong>Simple example</strong></p><p>A child learns “dog” not only from a picture + the word “dog,” but from a <strong>network of relations</strong>:</p><ul><li>dogs bark, fetch, have leashes, go to parks, live with people</li><li>cats meow, chase mice, sleep on sofas</li></ul><p><strong>What the model is learning</strong></p><p>Generative models see millions/billions of examples of words appearing together (and images with captions, etc.). 
Over time they build an internal “map” of:</p><ul><li>what tends to go with what</li><li>what is similar / different</li><li>what comes next in sequences</li></ul><p>That internal map is what people loosely call “understanding.”</p><h4><strong>5) The core engine: self-supervised pretraining → then reuse</strong></h4><p>This “two-stage” idea is the backbone of foundation models.</p><p><strong>Stage 1: Pretraining (learn general world/language patterns)</strong></p><ul><li>Train on massive raw data</li><li>Task looks simple (predict missing/next parts), but it forces learning deep patterns.</li></ul><p><strong>Stage 2: Adaptation (make it useful for specific tasks)</strong></p><p>After pretraining, you specialize using:</p><ul><li><strong>Prompting</strong> (tell it what you want in words)</li><li><strong>Fine-tuning</strong> (train a bit more on task data)</li><li><strong>RLHF</strong> (align behavior with human preferences)</li><li><strong>RAG</strong> (let it look things up in documents/web)</li><li><strong>Tools</strong> (calculator, database, code runner, etc.)</li></ul><h4><strong>6) Two main ways language models are trained (important)</strong></h4><p>The notes contrast two training styles: <strong>Masked LM</strong> vs <strong>Causal LM</strong>.</p><p><strong>A) Masked Language Model (BERT-style)</strong></p><ul><li>You hide a word: “The cat sat on the [MASK].”</li><li>Model predicts the missing word using <strong>both left and right context</strong>.</li></ul><p><strong>Strength:</strong> strong “understanding” representations (good for classification/search).<br> <strong>Limit:</strong> not naturally built to generate long text left-to-right.</p><p><strong>B) Causal Language Model (GPT-style)</strong></p><ul><li>You give: “The cat sat on the”</li><li>Model predicts the <strong>next word</strong> using only the left context.</li></ul><p><strong>Strength:</strong> great at generating fluent text.<br> <strong>Limit:</strong> doesn’t “see the future words,” so 
each prediction is a guess from the left context alone, and a fluent guess can still be wrong (a hallucination).</p><h4><strong>7) How GPT-like models learn (step-by-step, simple)</strong></h4><p>The “next-token prediction loop” works like this:</p><ol><li>Input prompt tokens: “The cat sat on the”</li><li>Model outputs probabilities for the next token</li><li>Compare with the true next token (“mat”)</li><li>Compute the loss (how little probability the model gave to the true token)</li><li>Update weights so it’s more likely to predict correctly next time</li><li>Repeat billions of times</li></ol><p>Even though the task is “just next word,” language contains huge amounts of world structure, so the model indirectly learns grammar, style, and many facts/patterns.</p><h4><strong>8) Why Transformers mattered (simple but accurate)</strong></h4><p>Transformers are the architecture that made modern LLMs work well at scale.</p><p><strong>The key trick: attention</strong></p><p><strong>Attention</strong> means: while processing a word, the model can “look at” other words in the sentence and decide which ones matter most right now.</p><p><strong>Why that’s a big deal</strong></p><ul><li>It handles long-range relationships better than older RNNs</li><li>It trains efficiently on GPUs because it processes many tokens in parallel</li><li>It uses positional encodings so word order still matters</li></ul><h4><strong>9) “Scaling laws” (why bigger often gets better)</strong></h4><p>The notes describe that, in general, as you increase:</p><ul><li>model size (parameters),</li><li>data,</li><li>compute,</li></ul><p>the prediction error tends to drop smoothly and predictably, roughly following a power law (up to practical limits like data quality). That’s why scaling has been such a powerful strategy.</p><h4><strong>10) From raw model → helpful ChatGPT: SFT + RLHF (deep but clear)</strong></h4><p>Pretraining makes a model <em>capable</em>, but not necessarily <em>helpful or safe</em>. 
The notes explain two main steps used to make chat models behave better.</p><p><strong>A) SFT (Supervised Fine-Tuning)</strong></p><p>Humans provide example “good answers” to prompts.<br> The model learns the style: helpful, structured, polite, etc.</p><p><strong>B) RLHF (Reinforcement Learning from Human Feedback)</strong></p><ol><li>Model generates multiple candidate replies</li><li>Humans rank which reply is better</li><li>Train a <strong>reward model</strong> to predict those preferences</li><li>Optimize the chat model to get higher reward scores</li></ol><p><strong>Why RLHF is tricky:</strong> feedback is delayed (you judge the whole answer at the end), and the model can “game” the reward model if you’re not careful — so guardrails and validation matter.</p><h4><strong>11) Other key learning/generative methods in the notes</strong></h4><p><strong>A) Contrastive learning (learn “what matches what”)</strong></p><p>Core idea: bring “related things” close in embedding space, push unrelated far away.</p><ul><li>Images: two crops of the same photo should be close</li><li>Text: paraphrases should be close</li></ul><p>This builds strong representations for retrieval/search and multimodal alignment.</p><p><strong>B) Diffusion models (generate by denoising)</strong></p><p>Diffusion is like sculpting from noise:</p><ol><li>Add noise to images during training (forward process)</li><li>Train a model to remove noise step-by-step (reverse process)</li><li>To generate: start from random noise → denoise gradually into an image</li></ol><p>This is the core idea behind Stable Diffusion-style generation described in the notes.</p><p><strong>C) Autoencoders (compress then reconstruct)</strong></p><ul><li>Encoder compresses input into a bottleneck (a small code)</li><li>Decoder reconstructs the original<br> Useful for compression/denoising/feature learning.</li></ul><p><strong>D) GANs (generator vs critic)</strong></p><ul><li>Generator makes fake samples</li><li>Discriminator tries 
to detect fakes<br> They compete, pushing realism up — though training can be unstable and can collapse to low diversity.</li></ul><h4><strong>12) Stable Diffusion pipeline (explained simply)</strong></h4><p>The notes explain Stable Diffusion as: <strong>text → image</strong> by denoising in a compressed “latent space.”</p><p><strong>Components in plain words</strong></p><ul><li><strong>Text encoder:</strong> turns your prompt into numbers representing meaning</li><li><strong>Latent space:</strong> a compressed version of the image (faster than full pixels)</li><li><strong>U-Net denoiser:</strong> removes noise step-by-step guided by the text</li><li><strong>VAE decoder:</strong> turns the final latent into a real image</li></ul><p><strong>Why “seed” matters</strong></p><p>Seed controls the starting noise. Same seed → reproducible. Different seed → different image variations.</p><h4><strong>13) Why “the world is messy” matters (the “coding reality gap”)</strong></h4><p>The notes emphasize a practical truth:</p><p>Computers like precise rules, but the real world is full of ambiguity and chaos — so systems that rely only on explicit labels and rigid rules struggle.</p><p>Self-supervised learning helps because it learns from <strong>real data patterns</strong> (how people speak, what users click, what happens over time) rather than depending only on perfect labels.</p><h4><strong>14) Business view: “unified intelligence” vs many small models</strong></h4><p><strong>Old way</strong></p><p>Separate AI systems for search, recommendations, support, etc.<br> Problem: it’s like having multiple small brains that don’t share knowledge well.</p><p><strong>New way</strong></p><p>A central intelligence layer (foundation model + company data) can improve many systems at once. 
Competitive advantage comes from your <strong>proprietary data</strong> (clicks, purchases, inventory, product catalog, domain rules), not only from using the same public model as everyone else.</p><h4><strong>15) Biology/medicine: why foundation models matter there</strong></h4><p>The notes describe a major shift: medicine is becoming more <strong>data-driven</strong>, using huge datasets (genetics, images, health records) to find patterns and disease subtypes.</p><p>Key ideas explained:</p><ul><li><strong>Correlation vs causation:</strong> genetics can provide stronger causal clues than pure correlations, but those clues still need validation.</li><li><strong>LLMs for medical notes:</strong> convert messy clinical text into structured signals (with privacy and bias controls).</li><li><strong>GNNs for molecules:</strong> molecules are graphs (atoms=nodes, bonds=edges), so graph neural nets fit naturally.</li><li><strong>MNAR bias (Missing Not At Random, also written NMAR):</strong> medical tests are ordered for a reason, so which values are missing is itself informative, and models must handle that carefully.</li></ul><h4><strong>16) Agents + RAG (retrieval) in simple terms</strong></h4><p><strong>Agents</strong></p><p>An agent is an LLM plus:</p><ul><li>tool use,</li><li>memory/logs,</li><li>an action loop (do → observe → adjust).</li></ul><p><strong>RAG (Retrieval-Augmented Generation)</strong></p><p>LLMs can hallucinate because they “generate from patterns.” RAG reduces this by:</p><ol><li>retrieving relevant documents/snippets</li><li>generating an answer grounded in those snippets</li></ol><p>That’s why RAG is big in enterprise settings (internal knowledge bases).</p><h4><strong>17) Ethics &amp; regulation (simple but serious)</strong></h4><p>The notes highlight why ethics matters: these models can influence beliefs, opportunities, safety, privacy, and society at scale.</p><p>Main points explained simply:</p><ul><li><strong>Accountability:</strong> don’t blame “the AI”; humans/organizations choose goals, data, 
deployment.</li><li><strong>Transparency has layers:</strong> data transparency, model limits, decision explanations, governance/auditing.</li><li><strong>Misinformation risk:</strong> generative AI makes mass creation of persuasive fake content cheaper.</li><li><strong>Privacy risk:</strong> not just identity — AI can learn what persuades you (“model your psychology”).</li><li><strong>Antifragility:</strong> deploy carefully, test at small scale, monitor, add human overrides, learn from failures (“time tests”).</li></ul><h4><strong>18) A clean mental model to remember everything</strong></h4><p>Think of modern AI as a <strong>stack</strong>:</p><ol><li><strong>Data</strong> (raw text/images/behavior logs)</li><li><strong>Self-supervised pretraining</strong> (learn general patterns)</li><li><strong>Foundation model</strong> (general-purpose engine)</li><li><strong>Adaptation</strong> (prompting / fine-tuning / RLHF)</li><li><strong>Grounding &amp; tools</strong> (RAG, calculators, databases)</li><li><strong>Applications</strong> (chatbot, copilot, recommender, scientist assistant, agent)</li><li><strong>Governance</strong> (safety, privacy, fairness, monitoring, accountability)</li></ol><h4><strong>19) Quick glossary (simple definitions)</strong></h4><ul><li><strong>Foundation model:</strong> big reusable model trained broadly, adapted to many tasks.</li><li><strong>Generative AI:</strong> AI that creates new content from learned patterns.</li><li><strong>Self-supervised learning:</strong> learning from raw data by solving prediction “puzzles.”</li><li><strong>Embedding:</strong> turning things (words/images/users) into points in a space where “close = similar.”</li><li><strong>Transformer/attention:</strong> architecture that lets tokens “look at each other” to use context efficiently.</li><li><strong>SFT:</strong> fine-tuning on curated instructions → answer examples.</li><li><strong>RLHF:</strong> aligning a model using preference feedback and reinforcement 
learning.</li><li><strong>RAG:</strong> retrieval + generation so answers can be grounded in documents.</li><li><strong>Agent:</strong> LLM + tools + memory + action loop to complete tasks.</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A2A vs MCP: Comparing Google’s Agent-to-Agent Protocol with Anthropic’s Model Context Protocol]]></title>
            <link>https://medium.com/@saha.soumyadeep90/a2a-vs-mcp-comparing-googles-agent-to-agent-protocol-with-openai-s-model-context-protocol-6798491cc87e?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6798491cc87e</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Thu, 23 Oct 2025 17:39:29 GMT</pubDate>
            <atom:updated>2025-10-23T18:18:22.728Z</atom:updated>
            <content:encoded><![CDATA[<p>In AI agent development, there are two main types of protocols that help different systems work together.<br> One type lets agents connect with tools and resources.<br> The other type allows agents to work and communicate with each other.<br> The <strong>Model Context Protocol (MCP)</strong> and the <strong>Agent2Agent (A2A) Protocol</strong> are designed to handle these two different but complementary functions.</p><h3>Model Context Protocol (MCP) — Simplified Explanation</h3><p>The <strong>Model Context Protocol (MCP)</strong> sets the rules for how an AI agent connects to and uses different tools or resources — like databases or APIs.</p><p>Here’s what it does:</p><ul><li><strong>Creates a standard method</strong> for AI models and agents to connect with tools, APIs, and other outside systems.</li><li><strong>Provides a clear structure</strong> for describing what each tool can do, much like how function calling works in large language models (LLMs).</li><li><strong>Handles data flow</strong> — it sends inputs to tools and receives organized, structured outputs in return.</li><li><strong>Supports common tasks</strong>, such as:</li></ul><blockquote>An LLM using an external API,</blockquote><blockquote>An agent querying a database, or</blockquote><blockquote>An agent working with built-in functions that are already defined.</blockquote><h3>Agent2Agent Protocol — Simplified Explanation</h3><p>The <strong>Agent2Agent (A2A) Protocol</strong> is designed to help different AI agents work together to reach a shared goal.</p><p>Here’s what it does:</p><ul><li><strong>Creates a standard way</strong> for independent AI agents to talk to and cooperate with each other as equals.</li><li><strong>Defines rules for communication</strong>, allowing agents to find one another, set up how they’ll work together, share tasks, and exchange both conversations and complex data.</li><li><strong>Supports common situations</strong>, such 
as:</li></ul><blockquote>A customer service agent passing a question to a billing agent, or</blockquote><blockquote>A travel agent working with flight, hotel, and activity agents to plan a trip.</blockquote><h3>Why Different Protocols?</h3><p>Both the <strong>Model Context Protocol (MCP)</strong> and the <strong>Agent2Agent (A2A)</strong> Protocol are important for creating advanced AI systems. Each serves a different purpose based on what the AI agent is interacting with.</p><h4>1. Tools and Resources (MCP Domain)</h4><ul><li><strong>What they are:</strong> Simple, clearly defined systems that take inputs and return specific outputs.</li><li><strong>Examples:</strong> A calculator, a database query API, or a weather service.</li><li><strong>Purpose:</strong> Agents use these tools to get information or perform small, focused tasks.</li><li><strong>Nature:</strong> These tools are usually <em>stateless</em> — they don’t remember past interactions.</li></ul><h4>2. Agents (A2A Domain)</h4><ul><li><strong>What they are:</strong> Independent, intelligent systems that can reason, plan, and hold longer conversations.</li><li><strong>Examples:</strong> A customer support agent, a travel booking agent, or a scheduling agent.</li><li><strong>Purpose:</strong> Agents work together to solve bigger, more complex problems that may require multiple steps or tools.</li><li><strong>Nature:</strong> These agents often <em>maintain state</em> — they remember past interactions and use them to guide future actions.</li></ul><h3>A2A ❤️ MCP: How They Work Together</h3><h4>In an <strong>agentic system</strong>, both protocols play different but connected roles:</h4><ul><li>The <strong>A2A Protocol</strong> is used for <strong>communication between agents</strong> — it helps them share information, coordinate tasks, and work together toward a goal.</li><li>Inside each agent, the <strong>MCP Protocol</strong> is used to <strong>connect with tools and resources</strong> — allowing the agent 
to access data, run functions, or use APIs to get things done.</li></ul><p>In simple terms:</p><blockquote><em>Agents talk to </em><strong><em>each other</em></strong><em> using </em><strong><em>A2A</em></strong><em>,<br> and each agent talks to </em><strong><em>its tools</em></strong><em> using </em><strong><em>MCP</em></strong><em>.</em></blockquote><p>Together, they form a complete system where agents collaborate effectively while still being able to perform specific tasks through their tools.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dQopHCV4-zGgxr0y.png" /><figcaption><em>An agentic application might use A2A to communicate with other agents, while each agent internally uses MCP to interact with its specific tools and resources.</em></figcaption></figure><h3>Example Scenario: The Smart Hospital 🏥</h3><p>Let’s imagine a <strong>smart hospital</strong> run by AI “medical staff” agents.<br> Each agent specializes in a certain role, and they all use <strong>A2A</strong> and <strong>MCP</strong> to work together efficiently.</p><h3>1. Patient Interaction (User-to-Agent using A2A)</h3><p>A patient (or an agent acting on the patient’s behalf) uses <strong>A2A</strong> to talk to the hospital’s <strong>Receptionist Agent</strong>.<br> For example, the patient might say:</p><blockquote><em>“I’ve been feeling dizzy and have a fever.”</em></blockquote><p>The Receptionist Agent collects basic information and assigns the case to a <strong>Doctor Agent</strong>.</p><h3>2. 
Doctor’s Consultation (Agent-to-Agent using A2A)</h3><p>The <strong>Doctor Agent</strong> uses <strong>A2A</strong> to coordinate with other agents in the hospital.</p><p>For example, the doctor might ask the <strong>Nurse Agent</strong>:</p><blockquote><em>“Please take the patient’s temperature and blood pressure.”</em></blockquote><p>Then the Doctor Agent might tell the <strong>Lab Agent</strong>:</p><blockquote><em>“Run a blood test for infection markers.”</em></blockquote><p>Here, <strong>A2A</strong> allows smooth, multi-turn communication between multiple agents (Doctor, Nurse, Lab) — just like human teamwork.</p><h3>3. Using Internal Tools (Agent-to-Tool using MCP)</h3><p>Now, each agent uses <strong>MCP</strong> to interact with specialized hospital tools and databases.</p><ul><li><strong>Nurse Agent (MCP call):</strong><br> use_device(device=&quot;thermometer&quot;, patient_id=&quot;P001&quot;)<br> use_device(device=&quot;blood_pressure_monitor&quot;, patient_id=&quot;P001&quot;)</li><li><strong>Lab Agent (MCP call):</strong><br> run_test(test_type=&quot;CBC&quot;, sample_id=&quot;S123&quot;)</li><li><strong>Doctor Agent (MCP call):</strong><br> query_medical_database(symptoms=[&quot;fever&quot;, &quot;dizziness&quot;])</li></ul><p>These MCP calls connect the agents to hospital systems — tools with <em>structured inputs and outputs</em> — to gather and analyze data.</p><h3>4. Pharmacy Interaction (Agent-to-Agent using A2A)</h3><p>After diagnosis, the <strong>Doctor Agent</strong> prescribes medication and communicates via <strong>A2A</strong> with the <strong>Pharmacy Agent</strong>:</p><blockquote><em>“Please prepare an antibiotic for patient P001, 500mg twice daily.”</em></blockquote><p>The Pharmacy Agent checks inventory and confirms:</p><blockquote><em>“Medicine available. Ready for pickup in 10 minutes.”</em></blockquote><p>This shows <strong>A2A’s role</strong> in managing <em>cooperative, goal-oriented dialogues</em> among agents.</p><h3>5. 
Summary: How A2A and MCP Work Together</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KvuHplKFe_FL7OCwzBMSiQ.png" /></figure><h3>Why Both Matter</h3><ul><li><strong>MCP</strong> gives agents the ability to <em>use tools</em> efficiently, for clear, structured tasks like data lookups, device operations, or computations.</li><li><strong>A2A</strong> enables <em>conversation, coordination, and teamwork</em> among agents, like doctors, nurses, and pharmacies working together to treat a patient.</li></ul><p>Together, <strong>A2A and MCP</strong> form the backbone of an intelligent, cooperative AI ecosystem, one that mirrors how humans use both <strong>tools</strong> and <strong>teamwork</strong> to solve complex problems.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agent2Agent Protocol: Building the Language of AI Collaboration]]></title>
            <link>https://medium.com/@saha.soumyadeep90/understanding-the-agent2agent-protocol-the-future-of-autonomous-system-communication-cbfa51bff502?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/cbfa51bff502</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Thu, 23 Oct 2025 10:34:28 GMT</pubDate>
            <atom:updated>2025-10-23T18:21:26.978Z</atom:updated>
            <content:encoded><![CDATA[<h3>Building Collaborative Systems with ADK and the Agent-to-Agent (A2A) Protocol</h3><p>The <strong>Agent Development Kit (ADK)</strong> lets developers build <strong>multi-agent systems</strong> in which several agents work together. Through the <strong>Agent-to-Agent (A2A) Protocol</strong>, these agents can communicate, collaborate, and coordinate their actions efficiently and securely.</p><p>This guide walks you through the fundamentals of ADK’s A2A features, helping you design interconnected agents that operate as a cohesive system. The sections below cover each part of <strong>ADK’s A2A capabilities</strong>.</p><h4>Introduction to A2A</h4><p>Start with the basics. The introduction walks you through building your first <strong>multi-agent system</strong>, complete with a <strong>root agent</strong>, a <strong>local sub-agent</strong>, and a <strong>remote A2A agent</strong>. You’ll learn how they interact, exchange data, and collaborate to perform complex tasks.</p><h4>A2A Quickstart (Exposing)</h4><p>Already have an agent running? Learn how to <strong>expose it</strong> so that other agents can discover and use it through the A2A protocol. This is your first step toward turning your agent into a service that others can rely on.</p><h4>A2A Quickstart (Consuming)</h4><p>On the other side of the equation, this quickstart teaches you how to <strong>connect your ADK agent to a remote agent</strong> via A2A. You’ll see how to securely consume data and services from other agents to extend your system’s capabilities.</p><h4>Official Website</h4><p>For more details, documentation, and the latest updates, check out the official A2A website at <strong>https://a2a-protocol.org/</strong>. It is the go-to resource for a deeper look at A2A concepts and best practices.</p><h3>Introduction to A2A</h3><p>As your systems grow in complexity, you’ll quickly realize that a single agent can only do so much. 
Real-world problems often demand <strong>multiple specialized agents</strong>, each handling a different part of the solution. That’s where the <strong>Agent-to-Agent (A2A) Protocol</strong> comes in.</p><p>The A2A Protocol acts as a <strong>common language</strong> for agents — enabling them to communicate, share insights, and collaborate effectively. With A2A, agents don’t just coexist; they <strong>work together intelligently</strong>, forming a coordinated network capable of tackling challenges far beyond the reach of any single agent.</p><h4>When to Use A2A vs. Local Sub-Agents</h4><p>· <strong>Local Sub-Agents:</strong> These are agents that run <em>within the same application process</em> as your main agent. They are like internal modules or libraries, used to organize your code into logical, reusable components. Communication between a main agent and its local sub-agents is very fast because it happens directly in memory, without network overhead.</p><p>· <strong>Remote Agents (A2A):</strong> These are independent agents that run as separate services, communicating over a network. A2A defines the standard protocol for this communication.</p><p>Consider using <strong>A2A</strong> when:</p><blockquote>· The agent you need to talk to is a <strong>separate, standalone service</strong> (e.g., a specialized financial modeling agent).</blockquote><blockquote>· The agent is maintained by a <strong>different team or organization</strong>.</blockquote><blockquote>· You need to connect agents written in <strong>different programming languages or agent frameworks</strong>.</blockquote><blockquote>· You want to enforce a <strong>strong, formal contract</strong> (the A2A protocol) between your system’s components.</blockquote><h4>When Not to Use A2A (Prefer Local Sub-Agents)</h4><p>Sometimes, using A2A is unnecessary and can even slow things down. 
In these cases, <strong>local sub-agents</strong> or simple modules are the better choice:</p><p><strong>Internal Code Organization:</strong></p><p>If you’re just breaking a big task into smaller parts inside one agent — like a <strong>DataValidator</strong> that cleans input before processing — use local sub-agents. It’s faster and simpler.</p><p><strong>Performance-Critical Tasks:</strong></p><p>For operations that need <strong>speed and low latency</strong>, such as a <strong>RealTimeAnalytics</strong> sub-agent handling live data, keep everything inside the same app. A2A’s network calls would only add delay.</p><p><strong>Shared Memory or Context:</strong></p><p>When agents need to <strong>share the same memory or state</strong>, local sub-agents work best. A2A adds extra overhead from network communication and data conversion.</p><p><strong>Simple Helper Logic:</strong></p><p>If it’s just a <strong>small, reusable function</strong> that doesn’t need to run separately — like a utility or helper class — don’t create an A2A agent. A simple local module is enough.</p><h3>The A2A Workflow in ADK: A Simplified View</h3><p>Agent Development Kit (ADK) simplifies the process of building and connecting agents using the A2A protocol. Here’s a straightforward breakdown of how it works:</p><p>1. <strong>Making an Agent Accessible (Exposing):</strong> You start with an existing ADK agent that you want other agents to be able to interact with. The ADK provides a simple way to “expose” this agent, turning it into an <strong>A2AServer</strong>. This server acts as a public interface, allowing other agents to send requests to your agent over a network. Think of it like setting up a web server for your agent.</p><p>2. <strong>Connecting to an Accessible Agent (Consuming):</strong> In a separate agent (which could be running on the same machine or a different one), you’ll use a special ADK component called RemoteA2aAgent. 
This RemoteA2aAgent acts as a client that knows how to communicate with the <strong>A2AServer</strong> you exposed earlier. It handles all the complexities of network communication, authentication, and data formatting behind the scenes.</p><p>From your perspective as a developer, once you’ve set up this connection, interacting with the remote agent feels just like interacting with a local tool or function. The ADK abstracts away the network layer, making distributed agent systems as easy to work with as local ones.</p><h3>Visualizing the A2A Workflow</h3><p>To understand how the <strong>A2A workflow</strong> actually works, let’s look at what happens <strong>before and after</strong> you expose your agent — and how everything fits together in a connected system.</p><h4>Exposing an Agent</h4><h4>Before Exposing</h4><p>At first, your agent is just a <strong>standalone program</strong>. It runs by itself and can’t be accessed by other agents over a network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/494/1*QD8OUrH0mlgN1gOKE5uy5g.png" /></figure><h4>After Exposing</h4><p>When you integrate your agent with <strong>ADK’s A2A Server</strong>, it becomes accessible to other agents remotely. The <strong>A2A Server</strong> acts like a <strong>gateway</strong>, allowing network communication between your agent and others.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/988/1*NClvuuAHK1cVn1iv7E0a6g.png" /></figure><h4>Consuming an Agent</h4><p>Just like exposing an agent makes it available for others, <strong>consuming</strong> an agent means your own agent is set up to <strong>connect to and use</strong> a remote one. Let’s see how this works.</p><h4>Before Consuming</h4><p>Your <strong>Root Agent</strong> (the main agent you’re building) can’t yet talk to any remote agents. 
It’s isolated and has no built-in way to communicate over the network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GrOPYjJlPq4QTGopw9Trvw.png" /></figure><h4>After Consuming</h4><p>Once you add <strong>RemoteA2aAgent</strong> (an ADK component) to your setup, it acts as a <strong>client-side proxy</strong> that connects your Root Agent to the remote agent. The communication now flows smoothly over the network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6-f_2cvuMbhqfo4UobNNJQ.png" /></figure><p>In short:</p><p>· <strong>Before consuming</strong>, your Root Agent can’t reach remote services.</p><p>· <strong>After consuming</strong>, the <strong>RemoteA2aAgent</strong> handles all the network details, making communication with external agents as simple as calling a local function.</p><h4>Final System (Combined View)</h4><p>Here’s how everything fits together — the <strong>consuming</strong> and <strong>exposing</strong> sides form a complete <strong>A2A system</strong>.<br> This setup shows how agents communicate seamlessly through ADK’s A2A components.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yUeWk3G_pngZYvp5bd0QOw.png" /><figcaption><strong>Full A2A Architecture</strong></figcaption></figure><h3>Concrete Use Case: Customer Service and Product Catalog Agents</h3><p>Let’s consider a practical example: a <strong>Customer Service Agent</strong> that needs to retrieve product information from a separate <strong>Product Catalog Agent</strong>.</p><h3>Before A2A</h3><p>Initially, your Customer Service Agent might not have a direct, standardised way to query the Product Catalog Agent, especially if it’s a separate service or managed by a different team.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ADPeyiySHcwX21WgE9YZwg.png" /></figure><h3>After A2A</h3><p>By using the A2A Protocol, the Product Catalog Agent can expose its functionality as an A2A service. 
Your Customer Service Agent can then easily consume this service using ADK’s RemoteA2aAgent.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dyg1FMH93of2qc3Dwp2emA.png" /></figure><p>In this setup, the Product Catalog Agent is first exposed via an A2A Server. The Customer Service Agent can then simply call methods on the RemoteA2aAgent as if it were a tool, and the ADK handles all the underlying communication to the Product Catalog Agent. This allows for clear separation of concerns and easy integration of specialized agents.</p><h3>How the A2A Protocol Works Internally</h3><p>According to the <strong>official documentation</strong>, the <strong>Agent2Agent (A2A) Protocol</strong> is an open standard designed to enable seamless communication and collaboration between AI agents.</p><p>Originally developed by Google and now donated to the Linux Foundation, A2A provides a common language for agent interoperability in a world where agents are built using diverse frameworks and by different vendors.</p><p><strong>Why use the A2A Protocol?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lhRTEFMtMO60DGIJ5vpsww.png" /></figure><p><strong>How does A2A work with MCP?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k1fciDcMRwBozVkD2VPCxA.png" /></figure><h3>How A2A and MCP Work Together</h3><p><strong>Agent-to-Agent (A2A)</strong> and <strong>Model Context Protocol (MCP)</strong> are <strong>complementary standards</strong> that form the backbone of modern, multi-agent ecosystems. 
Together, they make it possible for intelligent agents to <strong>communicate, collaborate, and access tools seamlessly</strong>.</p><h4>Model Context Protocol (MCP) — Agent-to-Tool Communication</h4><p>The <strong>MCP</strong> standard defines how an agent connects to its <strong>tools, APIs, and data sources</strong> to retrieve or process information.<br> Think of it as the <strong>bridge between an agent and its environment</strong> — standardizing how agents interact with resources like databases, APIs, or third-party services.</p><h4>🌐Agent-to-Agent Protocol (A2A) — Agent-to-Agent Communication</h4><p>The <strong>A2A Protocol</strong> focuses on how <strong>different agents talk to each other</strong>.<br> It serves as a <strong>universal, decentralized network</strong> — almost like the “public internet” for AI agents — allowing them to <strong>interoperate, share knowledge, and collaborate</strong>, regardless of which framework or platform they’re built on.</p><p>In short:</p><blockquote><strong>MCP</strong> connects <strong>agents to tools</strong>.</blockquote><blockquote><strong>A2A</strong> connects <strong>agents to each other</strong>.<br> Together, they make scalable, intelligent, and interconnected <strong>agentic systems</strong> possible.</blockquote><h4><strong>Why Use the A2A Protocol?</strong></h4><p>A2A addresses key challenges in AI agent collaboration. It provides a standardized approach for agents to interact. This section explains the problems A2A solves and the benefits it offers.</p><p><strong>Problems that A2A Solves</strong></p><p>Consider a user request for an AI assistant to plan an international trip. 
This task involves orchestrating multiple specialized agents, such as:</p><p>· A flight booking agent</p><p>· A hotel reservation agent</p><p>· An agent for local tour recommendations</p><p>· A currency conversion agent</p><p>Without A2A, integrating these diverse agents presents several challenges:</p><p>· <strong>Agent Exposure</strong>: Developers often wrap agents as tools to expose them to other agents, similar to how tools are exposed through the Model Context Protocol (MCP). However, this approach is inefficient because agents are designed to negotiate directly. Wrapping agents as tools limits their capabilities. A2A allows agents to be exposed as they are, without requiring this wrapping.</p><p>· <strong>Custom Integrations</strong>: Each interaction requires custom, point-to-point solutions, creating significant engineering overhead.</p><p>· <strong>Slow Innovation</strong>: Bespoke development for each new integration slows innovation.</p><p>· <strong>Scalability Issues</strong>: Systems become difficult to scale and maintain as the number of agents and interactions grows.</p><p>· <strong>Limited Interoperability</strong>: Point-to-point integration limits interoperability, preventing the organic formation of complex AI ecosystems.</p><p>· <strong>Security Gaps</strong>: Ad hoc communication often lacks consistent security measures.</p><p>The A2A protocol addresses these challenges by establishing a common standard that lets AI agents interact reliably and securely.</p><h4><strong>A2A Example Scenario</strong></h4><p>This section provides an example scenario to illustrate the benefits of using the A2A (Agent2Agent) protocol for complex interactions between AI agents.</p><h4>A User’s Complex Request</h4><p>A user interacts with an AI assistant, giving it a complex prompt like “Plan an international trip.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*ZGeWdQAE2ZoAhXd0J9Hkow.png" /></figure><h4>Need for Collaboration</h4><p>The AI assistant 
receives the prompt and realizes it needs to call upon multiple specialized agents to fulfill the request. These agents include a Flight Booking Agent, a Hotel Reservation Agent, a Currency Conversion Agent, and a Local Tours Agent.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*mqZL7L__6yVirguA-G0xWQ.png" /></figure><h4>The Interoperability Challenge</h4><p>The core problem: The agents are unable to work together because each has its own bespoke development and deployment.</p><p>The consequence of a lack of a standardized protocol is that these agents cannot collaborate with each other let alone discover what they can do. The individual agents (Flight, Hotel, Currency, and Tours) are isolated.</p><h4>The “With A2A” Solution</h4><p>The A2A Protocol provides standard methods and data structures for agents to communicate with one another, regardless of their underlying implementation, so the same agents can be used as an interconnected system, communicating seamlessly through the standardized protocol.</p><p>The AI assistant, now acting as an orchestrator, receives the cohesive information from all the A2A-enabled agents. It then presents a single, complete travel plan as a seamless response to the user’s initial prompt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yoCsu77qcT6WciWccNpdNQ.png" /></figure><h4>Core Benefits of A2A</h4><p>Implementing the A2A protocol offers significant advantages across the AI ecosystem:</p><p>· <strong>Secure collaboration</strong>: Without a standard, it’s difficult to ensure secure communication between agents. 
A2A uses HTTPS for secure communication and maintains opaque operations, so agents can’t see the inner workings of other agents during collaboration.</p><p>· <strong>Interoperability</strong>: A2A breaks down silos between different AI agent ecosystems, enabling agents from various vendors and frameworks to work together seamlessly.</p><p>· <strong>Agent autonomy</strong>: A2A allows agents to retain their individual capabilities and act as autonomous entities while collaborating with other agents.</p><p>· <strong>Reduced integration complexity</strong>: The protocol standardizes agent communication, enabling teams to focus on the unique value their agents provide.</p><p>· <strong>Support for LRO</strong>: The protocol supports long-running operations (LRO) and streaming with Server-Sent Events (SSE) and asynchronous execution.</p><h4>Understanding the Agent Stack: A2A, MCP, Agent Frameworks and Models</h4><p>A2A is situated within a broader agent stack, which includes:</p><p>· <strong>A2A:</strong> Standardizes communication among agents deployed in different organizations and developed using diverse frameworks.</p><p>· <strong>MCP:</strong> Connects models to data and external resources.</p><p>· <strong>Frameworks (like ADK):</strong> Provide toolkits for constructing agents.</p><p>· <strong>Models:</strong> Fundamental to an agent’s reasoning, these can be any Large Language Model (LLM).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TxNYa8-K4ERGJoSQ19s3zQ.png" /><figcaption>Agent Stack: A2A, MCP, Agent Frameworks and Models</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k1fciDcMRwBozVkD2VPCxA.png" /><figcaption><strong>A2A and MCP</strong></figcaption></figure><h3>A2A and ADK</h3><p>The <a href="https://google.github.io/adk-docs">Agent Development Kit (ADK)</a> is an open-source agent development toolkit developed by Google. 
A2A is a communication protocol for agents that enables inter-agent communication, regardless of the framework used for their construction (e.g., ADK, LangGraph, or Crew AI). ADK is a flexible and modular framework for developing and deploying AI agents. While optimized for Gemini AI and the Google ecosystem, ADK is model-agnostic, deployment-agnostic, and built for compatibility with other frameworks.</p><h4><strong>Core Actors in A2A Interactions</strong></h4><p>· <strong>User</strong>: The end user, which can be a human operator or an automated service. The user initiates a request or defines a goal that requires assistance from one or more AI agents.</p><p>· <strong>A2A Client (Client Agent)</strong>: An application, service, or another AI agent that acts on behalf of the user. The client initiates communication using the A2A protocol.</p><p>· <strong>A2A Server (Remote Agent)</strong>: An AI agent or an agentic system that exposes an HTTP endpoint implementing the A2A protocol. It receives requests from clients, processes tasks, and returns results or status updates. From the client’s perspective, the remote agent operates as an <em>opaque</em> (black-box) system, meaning its internal workings, memory, or tools are not exposed.</p><h3>Fundamental Communication Elements</h3><p>The following table describes the fundamental communication elements in A2A:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1jSxmBI_0pmsDKG0aJZf5Q.png" /></figure><h3>Interaction Mechanisms</h3><p>The A2A Protocol supports various interaction patterns to accommodate different needs for responsiveness and persistence. These mechanisms ensure that agents can exchange information efficiently and reliably, regardless of the task’s complexity or duration:</p><p>· <strong>Request/Response (Polling)</strong>: Clients send a request and the server responds. 
For long-running tasks, the client periodically polls the server for updates.</p><p>· <strong>Streaming with Server-Sent Events (SSE)</strong>: Clients initiate a stream to receive real-time, incremental results or status updates from the server over an open HTTP connection.</p><p>· <strong>Push Notifications</strong>: For very long-running tasks or disconnected scenarios, the server can actively send asynchronous notifications to a client-provided webhook when significant task updates occur.</p><h4><strong>The Role of the Agent Card</strong></h4><p>The Agent Card is a JSON document that serves as a digital “business card” for an A2A Server (the remote agent). It is crucial for agent discovery and interaction. The key information included in an Agent Card is as follows:</p><p>· <strong>Identity:</strong> Includes name, description, and provider information.</p><p>· <strong>Service Endpoint:</strong> Specifies the url for the A2A service.</p><p>· <strong>A2A Capabilities:</strong> Lists supported features such as streaming or pushNotifications.</p><p>· <strong>Authentication:</strong> Details the required schemes (e.g., &quot;Bearer&quot;, &quot;OAuth2&quot;).</p><p>· <strong>Skills:</strong> Describes the agent’s tasks using AgentSkill objects, including id, name, description, inputModes, outputModes, and examples.</p><p>Client agents use the Agent Card to determine an agent’s suitability, structure requests, and ensure secure communication.</p><h3>Discovery Strategies</h3><p>The following sections detail common strategies used by client agents to discover remote Agent Cards:</p><h4>1. Well-Known URI</h4><p>This approach is recommended for public agents or agents intended for broad discovery within a specific domain.</p><h4>2. Curated Registries (Catalog-Based Discovery)</h4><p>This approach is employed in enterprise environments or public marketplaces, where Agent Cards are often managed by a central registry. 
The curated registry acts as a central repository, allowing clients to query and discover agents based on criteria like “skills” or “tags”.</p><h4>3. Direct Configuration / Private Discovery</h4><p>This approach is used for tightly coupled systems, private agents, or development purposes, where clients are directly configured with Agent Card information or URLs.</p><h3>Life of a Task</h3><p>In the Agent2Agent (A2A) Protocol, interactions can range from simple, stateless exchanges to complex, long-running processes. When an agent receives a message from a client, it can respond in one of two fundamental ways:</p><p>· <strong>Respond with a Stateless </strong><strong>Message</strong>: This type of response is typically used for immediate, self-contained interactions that conclude without requiring further state management.</p><p>· <strong>Initiate a Stateful </strong><strong>Task</strong>: If the response is a Task, the agent will process it through a defined lifecycle, communicating progress and requiring input as needed, until it reaches an interrupted state (e.g., input-required, auth-required) or a terminal state (e.g., completed, canceled, rejected, failed).</p><h3>Agent Response: Message or Task</h3><p>The choice between responding with a Message or a Task depends on the nature of the interaction and the agent&#39;s capabilities:</p><p>· <strong>Messages for Trivial Interactions</strong>: Message objects are suitable for transactional interactions that don&#39;t require long-running processing or complex state management. 
An agent might use messages to negotiate the acceptance or scope of a task before committing to a Task object.</p><p>· <strong>Tasks for Stateful Interactions</strong>: Once an agent maps the intent of an incoming message to a supported capability that requires substantial, trackable work over an extended period, the agent responds with a Task object.</p><h3>Example Follow-up Scenario</h3><p>The following example illustrates a typical task flow with a follow-up:</p><p>1. Client sends a message to the agent:</p><pre>{<br>  &quot;jsonrpc&quot;: &quot;2.0&quot;,<br>  &quot;id&quot;: &quot;req-001&quot;,<br>  &quot;method&quot;: &quot;message/send&quot;,<br>  &quot;params&quot;: {<br>    &quot;message&quot;: {<br>      &quot;role&quot;: &quot;user&quot;,<br>      &quot;parts&quot;: [<br>        {<br>          &quot;kind&quot;: &quot;text&quot;,<br>          &quot;text&quot;: &quot;Generate an image of a sailboat on the ocean.&quot;<br>        }<br>      ],<br>      &quot;messageId&quot;: &quot;msg-user-001&quot;<br>    }<br>  }<br>}</pre><p>2. 
Agent responds with a boat image (completed task):</p><pre>{<br>  &quot;jsonrpc&quot;: &quot;2.0&quot;,<br>  &quot;id&quot;: &quot;req-001&quot;,<br>  &quot;result&quot;: {<br>    &quot;id&quot;: &quot;task-boat-gen-123&quot;,<br>    &quot;contextId&quot;: &quot;ctx-conversation-abc&quot;,<br>    &quot;status&quot;: {<br>      &quot;state&quot;: &quot;completed&quot;<br>    },<br>    &quot;artifacts&quot;: [<br>      {<br>        &quot;artifactId&quot;: &quot;artifact-boat-v1-xyz&quot;,<br>        &quot;name&quot;: &quot;sailboat_image.png&quot;,<br>        &quot;description&quot;: &quot;A generated image of a sailboat on the ocean.&quot;,<br>        &quot;parts&quot;: [<br>          {<br>            &quot;kind&quot;: &quot;file&quot;,<br>            &quot;file&quot;: {<br>              &quot;name&quot;: &quot;sailboat_image.png&quot;,<br>              &quot;mimeType&quot;: &quot;image/png&quot;,<br>              &quot;bytes&quot;: &quot;base64_encoded_png_data_of_a_sailboat&quot;<br>            }<br>          }<br>        ]<br>      }<br>    ],<br>    &quot;kind&quot;: &quot;task&quot;<br>  }<br>}</pre><p>3. Client asks to color the boat red. This refinement request refers to the previous taskId and uses the same contextId.</p><pre>{<br>  &quot;jsonrpc&quot;: &quot;2.0&quot;,<br>  &quot;id&quot;: &quot;req-002&quot;,<br>  &quot;method&quot;: &quot;message/send&quot;,<br>  &quot;params&quot;: {<br>    &quot;message&quot;: {<br>      &quot;role&quot;: &quot;user&quot;,<br>      &quot;messageId&quot;: &quot;msg-user-002&quot;,<br>      &quot;contextId&quot;: &quot;ctx-conversation-abc&quot;,<br>      &quot;referenceTaskIds&quot;: [<br>        &quot;task-boat-gen-123&quot;<br>      ],<br>      &quot;parts&quot;: [<br>        {<br>          &quot;kind&quot;: &quot;text&quot;,<br>          &quot;text&quot;: &quot;Please modify the sailboat to be red.&quot;<br>        }<br>      ]<br>    }<br>  }<br>}</pre><p>4. 
Agent responds with a new image artifact (new task, same context, same artifact name): The agent creates a new task within the same contextId. The new boat image artifact retains the same name but has a new artifactId.</p><pre>{<br>  &quot;jsonrpc&quot;: &quot;2.0&quot;,<br>  &quot;id&quot;: &quot;req-002&quot;,<br>  &quot;result&quot;: {<br>    &quot;id&quot;: &quot;task-boat-color-456&quot;,<br>    &quot;contextId&quot;: &quot;ctx-conversation-abc&quot;,<br>    &quot;status&quot;: {<br>      &quot;state&quot;: &quot;completed&quot;<br>    },<br>    &quot;artifacts&quot;: [<br>      {<br>        &quot;artifactId&quot;: &quot;artifact-boat-v2-red-pqr&quot;,<br>        &quot;name&quot;: &quot;sailboat_image.png&quot;,<br>        &quot;description&quot;: &quot;A generated image of a red sailboat on the ocean.&quot;,<br>        &quot;parts&quot;: [<br>          {<br>            &quot;kind&quot;: &quot;file&quot;,<br>            &quot;file&quot;: {<br>              &quot;name&quot;: &quot;sailboat_image.png&quot;,<br>              &quot;mimeType&quot;: &quot;image/png&quot;,<br>              &quot;bytes&quot;: &quot;base64_encoded_png_data_of_a_RED_sailboat&quot;<br>            }<br>          }<br>        ]<br>      }<br>    ],<br>    &quot;kind&quot;: &quot;task&quot;<br>  }<br>}</pre><h3>Enterprise Implementation of A2A</h3><p>The Agent2Agent (A2A) protocol is designed with enterprise requirements at its core. Rather than inventing new, proprietary standards for security and operations, A2A aims to integrate seamlessly with existing enterprise infrastructure and widely adopted best practices. This approach allows organizations to use their existing investments and expertise in security, monitoring, governance, and identity management.</p><p>A key principle of A2A is that agents are typically <strong>opaque</strong>: they don’t share internal memory, tools, or direct resource access with each other. 
This opacity naturally aligns with standard client-server security paradigms, treating remote agents as standard HTTP-based enterprise applications.</p><p>Want to know more about A2A and MCP? Please go through my blog: <a href="https://medium.com/@saha.soumyadeep90/a2a-vs-mcp-comparing-googles-agent-to-agent-protocol-with-openai-s-model-context-protocol-6798491cc87e"><strong>https://medium.com/@saha.soumyadeep90/a2a-vs-mcp-comparing-googles-agent-to-agent-protocol-with-openai-s-model-context-protocol-6798491cc87e</strong></a></p><h3>Quickstart: Exposing a remote agent via A2A</h3><p>This quickstart is the perfect starting point for any developer asking:</p><p>“I already have an agent — how do I make it accessible so other agents can use it via A2A?”</p><p>Exposing your agent is a key step in building <strong>multi-agent systems</strong>, where different agents can <strong>collaborate, share data, and interact intelligently</strong>.</p><p>In this example, you’ll learn how to <strong>expose an ADK agent</strong> so it can be accessed and used by other agents through the <strong>Agent-to-Agent (A2A) Protocol</strong>.</p><p>There are <strong>two main ways</strong> to expose an ADK agent via A2A:</p><h4>1. Using to_a2a(root_agent)</h4><p>This is the <strong>simplest and fastest</strong> method.</p><p>· Use this function to <strong>convert an existing agent</strong> into an A2A-compatible one.</p><p>· You can then expose it via a <strong>server using </strong><strong>uvicorn</strong>, instead of deploying with adk deploy api_server.</p><p>· This approach gives you <strong>more control</strong> over what gets exposed, making it great for <strong>production environments</strong>.</p><p>· The best part: the to_a2a() function <strong>automatically generates an agent card</strong> (a metadata file that describes your agent).</p><h4>2. 
Using adk api_server --a2a with a Custom Agent Card</h4><p>This method is ideal when you want <strong>more flexibility</strong> or when you’re managing <strong>multiple agents</strong>.</p><p>· You create your own <strong>agent card (</strong><strong>agent.json)</strong> and host it using the ADK API server.</p><p>· It integrates smoothly with <strong>ADK Web</strong>, making it easier to <strong>test, debug, and visualize</strong> your agents.</p><p>· You can also specify a folder containing <strong>multiple agents</strong>, and those with agent cards will automatically be exposed via the same server.</p><p>· To create agent cards manually, follow the <strong>A2A Python tutorial (</strong><a href="https://google.github.io/adk-docs/a2a/quickstart-consuming/">https://google.github.io/adk-docs/a2a/quickstart-consuming/</a><strong>)</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jmyYKTT_7sSNxnxge21OrQ.png" /></figure><p>The sample consists of :</p><p>· <strong>Remote Hello World Agent</strong> (remote_a2a/hello_world/agent.py): This is the agent that you want to expose so that other agents can use it via A2A. It is an agent that handles dice rolling and prime number checking. It becomes exposed using the to_a2a() function and is served using uvicorn.</p><p>· <strong>Root Agent</strong> (agent.py): A simple agent that is just calling the remote Hello World agent.</p><p><strong>Exposing the Remote Agent with the </strong><strong>to_a2a(root_agent) function</strong></p><p>You can take an existing agent built using ADK and make it A2A-compatible by simply wrapping it using the to_a2a() function. 
For example, if you have an agent like the following defined in root_agent:</p><pre>from google.adk.agents.llm_agent import Agent<br><br># Your agent code here<br>root_agent = Agent(<br>    model=&#39;gemini-2.0-flash&#39;,<br>    name=&#39;hello_world_agent&#39;,<br><br>    &lt;...your agent code...&gt;<br>)</pre><p>Then you can make it A2A-compatible simply by using to_a2a(root_agent):</p><pre>from google.adk.a2a.utils.agent_to_a2a import to_a2a<br><br># Make your agent A2A-compatible<br>a2a_app = to_a2a(root_agent, port=8001)</pre><p>The to_a2a() function will even auto-generate an agent card in memory, behind the scenes, by <a href="https://github.com/google/adk-python/blob/main/src/google/adk/a2a/utils/agent_card_builder.py">extracting skills, capabilities, and metadata from the ADK agent</a>, so that the well-known agent card is made available when the agent endpoint is served using uvicorn.</p><p>You can also provide your own agent card by using the agent_card parameter. The value can be an AgentCard object or a path to an agent card JSON file.</p><p><strong>Example with an </strong><strong>AgentCard object:</strong></p><pre>from google.adk.a2a.utils.agent_to_a2a import to_a2a<br>from a2a.types import AgentCard<br><br># Define A2A agent card<br># The card fields are unpacked from a dict, because a call cannot take &quot;key&quot;: value pairs directly<br>my_agent_card = AgentCard(**{<br>    &quot;name&quot;: &quot;file_agent&quot;,<br>    &quot;url&quot;: &quot;http://example.com&quot;,<br>    &quot;description&quot;: &quot;Test agent from file&quot;,<br>    &quot;version&quot;: &quot;1.0.0&quot;,<br>    &quot;capabilities&quot;: {},<br>    &quot;skills&quot;: [],<br>    &quot;defaultInputModes&quot;: [&quot;text/plain&quot;],<br>    &quot;defaultOutputModes&quot;: [&quot;text/plain&quot;],<br>    &quot;supportsAuthenticatedExtendedCard&quot;: False,<br>})<br>a2a_app = to_a2a(root_agent, port=8001, agent_card=my_agent_card)</pre><p><strong>Example with a path to a JSON file:</strong></p><pre>from google.adk.a2a.utils.agent_to_a2a import to_a2a<br><br># Load A2A agent card from a file<br>a2a_app = to_a2a(root_agent, port=8001, 
agent_card=&quot;/path/to/your/agent-card.json&quot;)</pre><p><em>Now let’s dive into the sample code:</em></p><h4>1. Getting the Sample Code</h4><p>First, make sure you have the necessary dependencies installed:</p><p><strong>pip install google-adk[a2a]</strong></p><p>You can clone and navigate to the <a href="https://github.com/google/adk-python/tree/main/contributing/samples/a2a_root"><strong>a2a_root</strong> sample</a> (<a href="https://github.com/google/adk-python/tree/main/contributing/samples/a2a_root">https://github.com/google/adk-python/tree/main/contributing/samples/a2a_root</a>) here:</p><p><strong>git clone </strong><a href="https://github.com/google/adk-python.git"><strong>https://github.com/google/adk-python.git</strong></a></p><p>As you’ll see, the folder structure is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/1*EA4F1F0vLSQNz9TNyW8WCQ.png" /></figure><h4><strong>Root Agent (</strong><strong>a2a_root/agent.py)</strong></h4><p>· <strong>root_agent</strong>: A RemoteA2aAgent that connects to the remote A2A service</p><p>· <strong>Agent Card URL</strong>: Points to the well-known agent card endpoint on the remote server</p><h4>Remote Hello World Agent (a2a_root/remote_a2a/hello_world/agent.py)</h4><p>· <strong>roll_die(sides: int)</strong>: Function tool for rolling dice with state management</p><p>· <strong>check_prime(nums: list[int])</strong>: Async function for prime number checking</p><p>· <strong>root_agent</strong>: The main agent with comprehensive instructions</p><p>· <strong>a2a_app</strong>: The A2A application created using the to_a2a() utility</p><h4>2. 
Start the Remote A2A Agent server</h4><p>You can now start the remote agent server, which will host the a2a_app within the hello_world agent:</p><pre># Ensure current working directory is adk-python/<br># Start the remote agent using uvicorn<br>uvicorn contributing.samples.a2a_root.remote_a2a.hello_world.agent:a2a_app --host localhost --port 8001</pre><p>Once executed, you should see something like:</p><pre>INFO:     Started server process [10615]<br>INFO:     Waiting for application startup.<br>INFO:     Application startup complete.<br>INFO:     Uvicorn running on http://localhost:8001 (Press CTRL+C to quit)</pre><h4><strong>3. Check that your remote agent is running</strong></h4><p>You can check that your agent is up and running by visiting the agent card that was auto-generated earlier as part of your to_a2a() function in a2a_root/remote_a2a/hello_world/agent.py: <a href="http://localhost:8001/.well-known/agent-card.json">http://localhost:8001/.well-known/agent-card.json</a></p><p>You should see the contents of the agent card, which should look like:</p><pre>{<br>    &quot;capabilities&quot;: {},<br>    &quot;defaultInputModes&quot;: [<br>        &quot;text/plain&quot;<br>    ],<br>    &quot;defaultOutputModes&quot;: [<br>        &quot;text/plain&quot;<br>    ],<br>    &quot;description&quot;: &quot;hello world agent that can roll a dice of 8 sides and check prime numbers.&quot;,<br>    &quot;name&quot;: &quot;hello_world_agent&quot;,<br>    &quot;protocolVersion&quot;: &quot;0.2.6&quot;,<br>    &quot;skills&quot;: [<br>        {<br>            &quot;description&quot;: &quot;hello world agent that can roll a dice of 8 sides and check prime numbers. 
\n      I roll dice and answer questions about the outcome of the dice rolls.\n      I can roll dice of different sizes.\n      I can use multiple tools in parallel by calling functions in parallel(in one request and in one round).\n      It is ok to discuss previous dice roles, and comment on the dice rolls.\n      When I are asked to roll a die, I must call the roll_die tool with the number of sides. Be sure to pass in an integer. Do not pass in a string.\n      I should never roll a die on my own.\n      When checking prime numbers, call the check_prime tool with a list of integers. Be sure to pass in a list of integers. I should never pass in a string.\n      I should not check prime numbers before calling the tool.\n      When I are asked to roll a die and check prime numbers, I should always make the following two function calls:\n      1. I should first call the roll_die tool to get a roll. Wait for the function response before calling the check_prime tool.\n      2. After I get the function response from roll_die tool, I should call the check_prime tool with the roll_die result.\n        2.1 If user asks I to check primes based on previous rolls, make sure I include the previous rolls in the list.\n      3. 
When I respond, I must include the roll_die result from step 1.\n      I should always perform the previous 3 steps when asking for a roll and checking prime numbers.\n      I should not rely on the previous history on prime results.\n    &quot;,<br>            &quot;id&quot;: &quot;hello_world_agent&quot;,<br>            &quot;name&quot;: &quot;model&quot;,<br>            &quot;tags&quot;: [<br>                &quot;llm&quot;<br>            ]<br>        },<br>        {<br>            &quot;description&quot;: &quot;Roll a die and return the rolled result.\n\nArgs:\n  sides: The integer number of sides the die has.\n  tool_context: the tool context\nReturns:\n  An integer of the result of rolling the die.&quot;,<br>            &quot;id&quot;: &quot;hello_world_agent-roll_die&quot;,<br>            &quot;name&quot;: &quot;roll_die&quot;,<br>            &quot;tags&quot;: [<br>                &quot;llm&quot;,<br>                &quot;tools&quot;<br>            ]<br>        },<br>        {<br>            &quot;description&quot;: &quot;Check if a given list of numbers are prime.\n\nArgs:\n  nums: The list of numbers to check.\n\nReturns:\n  A str indicating which number is prime.&quot;,<br>            &quot;id&quot;: &quot;hello_world_agent-check_prime&quot;,<br>            &quot;name&quot;: &quot;check_prime&quot;,<br>            &quot;tags&quot;: [<br>                &quot;llm&quot;,<br>                &quot;tools&quot;<br>            ]<br>        }<br>    ],<br>    &quot;supportsAuthenticatedExtendedCard&quot;: false,<br>    &quot;url&quot;: &quot;http://localhost:8001&quot;,<br>    &quot;version&quot;: &quot;0.0.1&quot;<br>}</pre><h4>4. 
Run the Main (Consuming) Agent</h4><p>Now that your remote agent is running, you can launch the dev UI and select “a2a_root” as your agent.</p><pre># In a separate terminal, run the adk web server<br>adk web contributing/samples/</pre><p>To open the adk web server, go to: <a href="http://localhost:8000/">http://localhost:8000</a>.</p><h3>Example Interactions</h3><p>Once both services are running, you can interact with the root agent to see how it calls the remote agent via A2A:</p><p><strong>Simple Dice Rolling:</strong> This interaction uses a local agent, the Roll Agent:</p><pre>User: Roll a 6-sided die</pre><pre>Bot: I rolled a 4 for you.</pre><p><strong>Prime Number Checking:</strong></p><p>This interaction uses a remote agent via A2A, the Prime Agent:</p><pre>User: Is 7 a prime number?</pre><pre>Bot: Yes, 7 is a prime number.</pre><p><strong>Combined Operations:</strong></p><p>This interaction uses both the local Roll Agent and the remote Prime Agent:</p><pre>User: Roll a 10-sided die and check if it&#39;s prime</pre><pre>Bot: I rolled an 8 for you.</pre><pre>Bot: 8 is not a prime number.</pre><h3>Quickstart: Consuming a remote agent via A2A</h3><p>This quickstart focuses on another common developer question:</p><p>“There’s a remote agent — how can my ADK agent connect to and use it via A2A?”</p><p>This step is essential when building <strong>multi-agent systems</strong>, where different agents need to <strong>communicate, share data, and collaborate</strong> to complete complex tasks.</p><p>In this example, you’ll explore how the <strong>Agent-to-Agent (A2A) Protocol</strong> works inside the <strong>Agent Development Kit (ADK)</strong>. 
It shows how multiple agents can connect and <strong>work together as one system</strong>.</p><p>The sample demonstrates a simple but powerful concept:<br> an agent that can <strong>roll dice</strong> and <strong>check if numbers are prime</strong> — with these tasks handled by <strong>different agents</strong> working together.</p><p>By the end, you’ll understand how your ADK agent can <strong>consume a remote agent</strong>, call its functions over the network, and act as if those features were built in locally.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Eg2TWcE0OwSxzRQkp_QEMw.png" /></figure><p>The A2A Basic sample consists of:</p><p>· <strong>Root Agent</strong> (root_agent): The main orchestrator that delegates tasks to specialized sub-agents</p><p>· <strong>Roll Agent</strong> (roll_agent): A local sub-agent that handles dice rolling operations</p><p>· <strong>Prime Agent</strong> (prime_agent): A remote A2A agent that checks if numbers are prime, this agent is running on a separate A2A server</p><h4>Exposing Your Agent with the ADK Server</h4><p>The ADK comes with a built-in CLI command, adk api_server --a2a to expose your agent using the A2A protocol.</p><p>In the a2a_basic example, you will first need to expose the check_prime_agent via an A2A server, so that the local root agent can use it.</p><h4>1. 
Getting the Sample Code</h4><p>First, make sure you have the necessary dependencies installed:</p><pre>pip install google-adk[a2a]</pre><p>You can clone and navigate to the <a href="https://github.com/google/adk-python/tree/main/contributing/samples/a2a_basic"><strong>a2a_basic</strong> sample</a> here:</p><pre>git clone https://github.com/google/adk-python.git</pre><p>As you’ll see, the folder structure is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/714/1*g3cowhZJrjiRfdpsYtbGhQ.png" /></figure><h4>Main Agent (a2a_basic/agent.py)</h4><p>· <strong>roll_die(sides: int)</strong>: Function tool for rolling dice</p><p>· <strong>roll_agent</strong>: Local agent specialized in dice rolling</p><p>· <strong>prime_agent</strong>: Remote A2A agent configuration</p><p>· <strong>root_agent</strong>: Main orchestrator with delegation logic</p><h4>Remote Prime Agent (a2a_basic/remote_a2a/check_prime_agent/)</h4><p>· <strong>agent.py</strong>: Implementation of the prime checking service</p><p>· <strong>agent.json</strong>: Agent card of the A2A agent</p><p>· <strong>check_prime(nums: list[int])</strong>: Prime number checking algorithm</p><h4>2. Start the Remote Prime Agent server</h4><p>To show how your ADK agent can consume a remote agent via A2A, you’ll first need to start a remote agent server, which will host the prime agent (under check_prime_agent).</p><pre># Start the remote a2a server that serves the check_prime_agent on port 8001<br>adk api_server --a2a --port 8001 contributing/samples/a2a_basic/remote_a2a</pre><p>Once executed, you should see something like:</p><pre>INFO:     Started server process [56558]<br>INFO:     Waiting for application startup.<br>INFO:     Application startup complete.<br>INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)</pre><h4>3. 
Look out for the required agent card (agent-card.json) of the remote agent</h4><p>The A2A Protocol requires each agent to have an agent card that describes what it does.</p><p>If someone else has already built the remote A2A agent that you are looking to consume in your agent, then you should confirm that they have an agent card (agent-card.json).</p><p>In the sample, the check_prime_agent already has an agent card provided:</p><p><strong>a2a_basic/remote_a2a/check_prime_agent/agent-card.json</strong></p><pre>{<br>  &quot;capabilities&quot;: {},<br>  &quot;defaultInputModes&quot;: [&quot;text/plain&quot;],<br>  &quot;defaultOutputModes&quot;: [&quot;application/json&quot;],<br>  &quot;description&quot;: &quot;An agent specialized in checking whether numbers are prime. It can efficiently determine the primality of individual numbers or lists of numbers.&quot;,<br>  &quot;name&quot;: &quot;check_prime_agent&quot;,<br>  &quot;skills&quot;: [<br>    {<br>      &quot;id&quot;: &quot;prime_checking&quot;,<br>      &quot;name&quot;: &quot;Prime Number Checking&quot;,<br>      &quot;description&quot;: &quot;Check if numbers in a list are prime using efficient mathematical algorithms&quot;,<br>      &quot;tags&quot;: [&quot;mathematical&quot;, &quot;computation&quot;, &quot;prime&quot;, &quot;numbers&quot;]<br>    }<br>  ],<br>  &quot;url&quot;: &quot;http://localhost:8001/a2a/check_prime_agent&quot;,<br>  &quot;version&quot;: &quot;1.0.0&quot;<br>}</pre><h4>4. Run the Main (Consuming) Agent</h4><pre># In a separate terminal, run the adk web server<br>adk web contributing/samples/</pre><h4>How it works</h4><p>The main agent uses the RemoteA2aAgent class to consume the remote agent (prime_agent in our example). 
As you can see below, RemoteA2aAgent() requires the name, description, and the URL of the agent_card.</p><p><strong>a2a_basic/agent.py</strong></p><pre>from google.adk.agents.remote_a2a_agent import AGENT_CARD_WELL_KNOWN_PATH<br>from google.adk.agents.remote_a2a_agent import RemoteA2aAgent<br><br>prime_agent = RemoteA2aAgent(<br>    name=&quot;prime_agent&quot;,<br>    description=&quot;Agent that handles checking if numbers are prime.&quot;,<br>    agent_card=(<br>        f&quot;http://localhost:8001/a2a/check_prime_agent{AGENT_CARD_WELL_KNOWN_PATH}&quot;<br>    ),<br>)<br><br>&lt;...code truncated&gt;</pre><p>Then, you can simply use the RemoteA2aAgent in your agent. In this case, prime_agent is used as one of the sub-agents in the root_agent below:</p><p><strong>a2a_basic/agent.py</strong></p><pre>from google.adk.agents.llm_agent import Agent<br>from google.genai import types<br><br>root_agent = Agent(<br>    model=&quot;gemini-2.0-flash&quot;,<br>    name=&quot;root_agent&quot;,<br>    instruction=&quot;&quot;&quot;<br>      &lt;You are a helpful assistant that can roll dice and check if numbers are prime.<br>      You delegate rolling dice tasks to the roll_agent and prime checking tasks to the prime_agent.<br>      Follow these steps:<br>      1. If the user asks to roll a die, delegate to the roll_agent.<br>      2. If the user asks to check primes, delegate to the prime_agent.<br>      3. 
If the user asks to roll a die and then check if the result is prime, call roll_agent first, then pass the result to prime_agent.<br>      Always clarify the results before proceeding.&gt;<br>    &quot;&quot;&quot;,<br>    global_instruction=(<br>        &quot;You are DicePrimeBot, ready to roll dice and check prime numbers.&quot;<br>    ),<br>    sub_agents=[roll_agent, prime_agent],<br>    tools=[example_tool],<br>    generate_content_config=types.GenerateContentConfig(<br>        safety_settings=[<br>            types.SafetySetting(  # avoid false alarm about rolling dice.<br>                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,<br>                threshold=types.HarmBlockThreshold.OFF,<br>            ),<br>        ]<br>    ),<br>)</pre><h4>Example Interactions</h4><p>Once both your main and remote agents are running, you can interact with the root agent to see how it calls the remote agent via A2A:</p><p><strong>Simple Dice Rolling:</strong> This interaction uses a local agent, the Roll Agent:</p><pre>User: Roll a 6-sided die<br>Bot: I rolled a 4 for you.</pre><p><strong>Prime Number Checking:</strong></p><p>This interaction uses a remote agent via A2A, the Prime Agent:</p><pre>User: Is 7 a prime number?<br>Bot: Yes, 7 is a prime number.</pre><p><strong>Combined Operations:</strong></p><p>This interaction uses both the local Roll Agent and the remote Prime Agent:</p><pre>User: Roll a 10-sided die and check if it&#39;s prime<br>Bot: I rolled an 8 for you.<br>Bot: 8 is not a prime number.</pre>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unsupervised Learning with PCA: Theory, Math, and Practical Implementation]]></title>
            <link>https://medium.com/@saha.soumyadeep90/unsupervised-learning-with-pca-theory-math-and-practical-implementation-b8053fda7d5b?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b8053fda7d5b</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Wed, 15 Oct 2025 10:15:22 GMT</pubDate>
            <atom:updated>2025-10-15T11:23:01.024Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unsupervised Learning with Principal Component Analysis (PCA): Theory, Math, and Practical Implementation</h3><p><strong>Principal Component Analysis (PCA)</strong> is a <strong>dimensionality reduction technique</strong>.<br> It helps you take <strong>high-dimensional data</strong> (data with many features or variables) and represent it using <strong>fewer dimensions</strong>, while keeping as much <strong>useful information (variance)</strong> as possible.</p><h3><strong>Why PCA</strong></h3><p><strong>The Problem: Too Many Features</strong></p><h4>1. Predictive Modeling Issue — Multicollinearity</h4><p>When you have a lot of features that are <strong>strongly related to each other</strong>, they cause a problem called <strong>multicollinearity</strong>.<br> This makes it hard for models (like regression or machine learning models) to understand which feature is actually important.</p><p>· If two features say almost the same thing, the model gets <strong>confused</strong> about which one to trust.</p><p>· To fix it, we sometimes remove features one by one — but that’s <strong>slow</strong> and may <strong>throw away useful information</strong>.</p><p><em>Simple example:</em><br> If you have both “height in cm” and “height in inches” as features, they’re perfectly correlated — keeping both is redundant.<br> But when there are hundreds of such correlated features, manually deciding which to drop becomes messy.</p><h4><strong>2. 
Visualization Issue — Too Many Dimensions</strong></h4><p>We humans can <strong>only visualize up to 3 dimensions</strong> easily (2D plots or 3D graphs).<br> If your data has 10 or 100 features, you <strong>can’t directly plot</strong> it to see patterns or clusters.</p><p>That means it’s <strong>hard to notice relationships</strong> or <strong>groupings</strong> among data points just by looking at graphs.</p><p><em>Example:</em><br> If each customer has 20 attributes (age, income, spending, habits, etc.), you can’t make a 20-D plot to see which customers are similar.</p><h3>How PCA Helps</h3><p>PCA helps in both these cases:</p><p>· It <strong>reduces the number of features</strong> by combining correlated ones into new “principal components.”<br> → This fixes <strong>multicollinearity</strong> and keeps most of the important information.</p><p>· It <strong>compresses</strong> many dimensions into 2 or 3 that capture most of the data’s variation.<br> → This makes <strong>visualization</strong> possible again.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mZRVK-sZsGay3pjd19gFwA.png" /></figure><p>In the image above, you can see that a data set having N dimensions has been approximated to a smaller data set containing ‘k’ dimensions. In this module, you will learn how this manipulation is done. 
And this simple manipulation helps in several ways such as follows:</p><ul><li>For data visualisation and EDA</li><li>For creating uncorrelated features that can be input to a prediction model: With a smaller number of uncorrelated features, the modelling process is faster and more stable as well.</li><li>Finding latent themes in the data: If you have a data set containing the ratings given to different movies by Netflix users, PCA would be able to find latent themes like genre and, consequently, the ratings that users give to a particular genre.</li><li>Noise reduction</li></ul><h3>Core Idea of PCA — In Simple Words</h3><p><strong>Principal Component Analysis (PCA)</strong> is a <strong>dimensionality reduction technique</strong>.<br> That means it takes a dataset with <strong>many variables (columns, or dimensions)</strong> and <strong>reduces</strong> it to a smaller number of variables — while keeping most of the <strong>important information (patterns, variation)</strong>.</p><h4>What “Dimension” Means</h4><p>· Each <strong>dimension</strong> represents a <strong>feature or variable</strong> in your data.<br> For example, in a dataset of students:</p><blockquote>Marks in math → one dimension</blockquote><blockquote>Marks in science → another dimension</blockquote><blockquote>Marks in English → a third dimension</blockquote><p>So, if you have 3 features, your data lies in <strong>3D space</strong>.<br> If you have 100 features, your data lies in <strong>100D space</strong> — something we can’t visualize directly.</p><h4>What PCA Does</h4><p>PCA finds <strong>new axes (directions)</strong> in this high-dimensional space that:</p><p>1. <strong>Capture most of the variance</strong> (the spread or meaningful information in the data), and</p><p>2. 
<strong>Are fewer in number</strong>, typically just 2 or 3 principal components.</p><p>This way, we can work with a smaller dataset that’s easier to visualize and analyze, <strong>without losing much information</strong>.</p><h3><strong>What Is PCA</strong></h3><p><strong>Dimensionality reduction</strong> means <strong>reducing the number of variables (features)</strong> in a dataset.<br> The goal is to keep <strong>only the useful information</strong> and remove <strong>redundant or unimportant data</strong>.</p><h4>How You’ve Already Done It Before</h4><p>You’ve already performed dimensionality reduction manually in earlier topics:</p><p>· In <strong>Exploratory Data Analysis (EDA):</strong><br> You removed columns that were mostly <strong>empty (nulls)</strong> or <strong>duplicated</strong>.</p><p>· In <strong>Linear/Logistic Regression:</strong><br> You removed features with <strong>high p-values</strong> (not statistically significant) or <strong>high VIF scores</strong> (causing multicollinearity).</p><h4>What PCA Does (and How It’s Different)</h4><p>Instead of <strong>dropping</strong> features directly, <strong>PCA creates new ones</strong>, called <strong>principal components</strong>, by <strong>combining the old features</strong> as weighted sums (linear combinations).</p><p>These new features:</p><p>· Capture most of the <strong>useful information (variance)</strong> from the original data</p><p>· Are <strong>uncorrelated</strong> with each other</p><p>· Allow you to easily decide <strong>how many to keep</strong> (based on how much of the total variance each one explains)</p><p>PCA is a statistical procedure to convert observations of possibly correlated variables to ‘principal components’ such that:</p><ul><li>They are <strong>uncorrelated</strong> with each other.</li><li>They are <strong>linear combinations</strong> of the original variables.</li><li>They help in capturing maximum <strong>information</strong> in the data set.</li></ul><p>So PCA doesn’t just 
throw away columns — it <strong>rebuilds</strong> them into a smaller, more powerful set of new variables that represent the same data better.</p><p>Here we’ll learn about two of the most important building blocks of PCA — <strong>basis </strong>and <strong>change of basis</strong>. But before that, we’ll go through a brief refresher on basic linear algebra concepts.</p><h3><strong>Vectorial Representation of Data</strong></h3><p>Before we understand how PCA works, we need to be comfortable with some <strong>basic linear algebra concepts</strong> — because PCA relies heavily on <strong>matrix and vector operations</strong>.</p><p>To summarise what you’re going to learn in this segment here’s a handy checklist:</p><ul><li>Vectors and their properties</li><li>Vector operations (addition, scaling, linear combination and dot product)</li><li>Matrices</li><li>Matrix operations (matrix multiplication and matrix inverses)</li></ul><p>Consider the following data set containing the height and weight of five patients.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*f0k4EnUWnNBkRH2FEWHpHg.jpeg" /></figure><p>The height and weight information can be represented in the form of a matrix as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*8olNFU6yFkxFjgU7ajJOoA.png" /></figure><p>with each row representing a particular patient’s data and each column representing the original variable. Geometrically, these patients can be represented as shown in the following image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Deo5wIMtoxqv5T605A9GMw.png" /></figure><p>A <strong>vector</strong> is just a mathematical way to represent data points — basically a list of numbers that describe one observation.</p><p>The vector associated with the first patient is given by the values (165, 55). This value can also be written in the following way:</p><ol><li>A column containing the values along the rows. 
This is also known as the <strong>column-vector </strong>representation.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/100/1*2J_cXTu0Qfm1PdrUkm5YqA.png" /></figure><p>2. Sometimes we write the same vector horizontally. It is still the same column vector, now written as the transpose of a row vector.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*JETNL1wfoSrdFfeBKdo9ww.png" /></figure><p>3. In terms of the basis vectors <br> This is something that you’ll learn in detail in later segments. To give a brief idea, the vector (165, 55) can also be written as 165<strong>i</strong> + 55<strong>j</strong>, where <strong>i</strong> and <strong>j</strong> are the unit vectors along X and Y respectively and are the basis vectors used to represent all vectors in the 2-D space.</p><h4>Vector Representation for n-Dimensional Data</h4><p>If you have more variables (or features), the vector just gets longer.<br> For example:</p><p>· If you add <strong>age = 22</strong> to the data,<br> → The vector becomes <strong>(165, 55, 22)</strong> → a 3D vector.</p><p>If your dataset has <strong>10 variables</strong>, then each data point is a <strong>10-dimensional vector</strong>, written as (x1, x2, x3, …, x10).</p><p>Even though we can’t visualize more than 3D, <strong>math can handle n dimensions easily</strong>, and PCA uses that math to simplify things.</p><h4><strong>Vector Operations</strong></h4><p>Now that you’ve understood what vectors are, let’s go ahead and learn about some vector properties and a few associated operations.</p><h4>1. 
Vectors Have Direction and Magnitude</h4><p>A <strong>vector</strong> represents both:</p><p>· A <strong>direction</strong> (where it points), and</p><p>· A <strong>magnitude</strong> (how long it is).</p><p>Think of a vector as an <strong>arrow</strong> drawn from the origin (0, 0) to a point in space (x, y).</p><p><strong>Example (2D):</strong><br> For the vector (2, 3):</p><p>· The <strong>direction</strong> is the arrow from (0, 0) → (2, 3).</p><p>· The <strong>magnitude (length)</strong> is calculated using the Pythagoras theorem:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/328/1*KzZgDA6vg8CVsdWB8YJJ2g.png" /></figure><p><strong>Example (3D):</strong><br> For vector (2, –3, 4):</p><p>· The <strong>direction</strong> goes from (0, 0, 0) → (2, –3, 4).</p><p>· The <strong>magnitude</strong> is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*DahqQFFarD2oY0BW1408hg.png" /></figure><p>So, magnitude tells you <strong>how strong</strong> the vector is, and direction tells you <strong>where</strong> it points.</p><h4>2. Vector Addition</h4><p>When you add two vectors, you <strong>add each component individually</strong> (element by element).</p><p><strong>Example:</strong></p><p><strong>Let</strong> V1=(2,3), V2=(1,2), <strong>Then</strong> V1 + V2 = (2+1,3+2) = (3,5)</p><p>So you’re just adding the X parts together and the Y parts together.</p><p><strong>In</strong> <strong>i, j</strong> form (where i = x-axis, j = y-axis): V1 = 2i+3j, V2 = i+2j,</p><p>Then V1 + V2 = (2+1)i+(3+2)j = 3i+5j</p><p>It’s the same concept — just written in a different notation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3SjXZIHmWQJMh4ZJRkYoeg.png" /></figure><p><strong>Geometrically:</strong><br> When you add vectors, it’s like placing one arrow after another (head to tail). The result is the diagonal (the new arrow connecting start to end).</p><h4>3. 
Scalar Multiplication</h4><p>If you <strong>multiply a vector by a positive number (scalar)</strong>, the <strong>direction stays the same</strong>, but the <strong>length (magnitude)</strong> changes.</p><p>Example:</p><p>Let V = (2,3) and Scalar = 2</p><p>Then 2 × V = (4,6)</p><p>The vector points in the <strong>same direction</strong> but becomes <strong>twice as long</strong>.</p><p>If the scalar is <strong>negative</strong>, say –2,</p><p>Then -2 × V = (-4,-6)</p><p>The direction <strong>reverses</strong> but the length doubles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kFPsEokWnGKXt1qoVkRAHg.png" /></figure><h4>Why This Matters for PCA</h4><p>All PCA does is <strong>transform</strong> data using these vector operations:</p><p>· It treats each data point as a vector.</p><p>· It uses <strong>vector addition</strong> and <strong>scaling</strong> to create <strong>new axes (principal components)</strong>.</p><p>· It uses <strong>magnitude and direction</strong> to find which directions explain the most variation in the data.</p><p>So these simple operations are the <strong>building blocks</strong> of PCA math.</p><h3><strong>Matrix Multiplication</strong></h3><p>Apart from the vector operations that we learnt previously, we need some knowledge of matrix operations as well.</p><p>Matrix multiplication is quite simple: each entry of the result is computed by multiplying a row of the first matrix element-wise with a column of the second matrix and then adding up those products. The one key rule is that when you multiply two matrices, say A and B, the number of columns of A must equal the number of rows of B. 
Visually, you can take a look at the following image to get the idea of how that should be.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4PGpHU42MgE29835dQjlVw.png" /></figure><p>As shown in the example, since the number of columns in the first matrix and the number of rows in the second matrix are equal to 4, matrix multiplication is possible and the resultant matrix has a shape of 5 x 6.</p><p>The element-wise multiplication followed by addition is also pretty straightforward as can be seen in the following example.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XxtPVEwLqTfWl3h_RqMQnA.png" /></figure><h4>What Is a Matrix Inverse?</h4><p>Just like in normal arithmetic, where the <strong>reciprocal</strong> (or inverse) of a number “undoes” its multiplication,<br> the <strong>inverse of a matrix</strong> does the same thing in matrix algebra.</p><p><strong>Analogy with Regular Numbers</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*q9NSsj0bo4Ps6-xHO2jZyg.png" /><figcaption>regular math</figcaption></figure><p>Similarly, in matrix math:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/860/1*DDx_Dde_8Ntlru7Q5gw_Pw.png" /></figure><p>Here, <strong>I</strong> is the <strong>identity matrix</strong> — it acts like the number “1” in normal arithmetic.</p><h3>Example</h3><p>Given two matrices:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/710/1*5qTF4zmSEjU-kN-aQxzR2Q.png" /></figure><p>If you multiply them as B × A, you get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/310/1*eR6Lm3gBWytM9Ja1OPCO-A.png" /></figure><p>This resulting matrix is the <strong>Identity Matrix (I)</strong> — notice it has 1s on the diagonal and 0s elsewhere.</p><h4>What the Identity Matrix Does</h4><p>The <strong>identity matrix</strong> works just like multiplying a number by 1: <strong>A × I = A</strong></p><p>It doesn’t change the matrix — it’s the “do-nothing” 
element in matrix multiplication.</p><p>So, if <strong>A × B = I or B × A = I</strong>, then A and B are <strong>inverses of each other</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WgCBx8jjxldcYeKmcfb0Hw.png" /></figure><p><strong>In Simple Words</strong></p><p>The <strong>inverse of a matrix</strong> is another matrix that “undoes” its effect when multiplied — just like dividing by a number or multiplying by its reciprocal.</p><h4>Why This Matters for PCA</h4><p>In PCA (and other algorithms like regression), we often need to:</p><p>· <strong>Undo transformations</strong> or</p><p>· <strong>Normalize data</strong> mathematically.</p><p>Matrix inverses help us reverse matrix operations — for example, when solving systems of equations or projecting data back to its original space after transformation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pDhKbrfldsvOJPFbJCgWHQ.png" /></figure><h3><strong>Basis</strong></h3><h4>What is a Basis (Intuitively)?</h4><p>Think of a <strong>basis</strong> as a set of <strong>building blocks</strong> (reference directions) that help you describe every point (or vector) in space.</p><p>You can think of it like <strong>units</strong>:</p><p>· For measuring <strong>length</strong>, the unit (basis) is <em>meter</em> or <em>centimeter</em>.</p><p>· For measuring <strong>weight</strong>, the unit (basis) is <em>kilogram</em> or <em>gram</em>.</p><p>Similarly, when you describe a <strong>vector</strong>, you need a <strong>unit direction</strong> — or <strong>basis vectors</strong> — in which to express it.</p><h4>2. 
Representing a Vector Using Basis Vectors</h4><p>In 2D space, we usually use two standard <strong>basis vectors</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/1*EpoUq7nbQRYc8QUC7qb_yg.png" /></figure><p>· <strong>î</strong> (i-hat) represents 1 unit in the <strong>x-direction</strong>.</p><p>· <strong>ĵ</strong> (j-hat) represents 1 unit in the <strong>y-direction</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GuWw5hUQwkrtpczTkGz6KA.png" /></figure><p>That means:</p><p>· Move <strong>aₓ</strong> units in the <strong>x direction (î)</strong></p><p>· Then move <strong>aᵧ</strong> units in the <strong>y direction (ĵ)</strong></p><p>and you’ll reach the point <strong>a(x, y)</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yFXsL2V48CkJMLZ-FNJk1g.png" /></figure><p><strong>Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DotKNXlMnHwAwpSMIdq57g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CtbdNHErxkHqCgJ2y__Zgg.png" /></figure><p>Basis in Higher Dimensions</p><p>· In <strong>2D</strong>, the basis is {î, ĵ}.</p><p>· In <strong>3D</strong>, the basis is {î, ĵ, k̂} — with k̂ = [0, 0, 1].</p><p>· In general, for <strong>n-dimensional data</strong>, the basis consists of <strong>n standard unit vectors</strong>, each pointing along one axis.</p><p>So any data point (or vector) in that space can be expressed as a <strong>combination of these basis vectors</strong>.</p><h4>Why Basis Matters in PCA</h4><p>The <em>basis</em> defines the <strong>coordinate system</strong> you use to describe data.</p><p>PCA works by <strong>finding a new basis</strong> — new directions (called <strong>principal components</strong>) that:</p><p>· Capture the <strong>maximum variance</strong> (spread) of data.</p><p>· Are <strong>uncorrelated</strong> with each other (the data has zero covariance along any two of them).</p><p>So PCA is like <strong>rotating the 
coordinate system</strong> to a new, smarter set of basis vectors that explain the data better.</p><h3><strong>Change of Basis: Introduction</strong></h3><h4>1. Same Data, Different Basis</h4><p>Just like you can describe a person’s <strong>weight</strong> in kilograms <em>or</em> pounds,<br> you can describe a vector (or data point) using <strong>different sets of basis vectors</strong>.</p><p>Example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K19EHX2u5lIsz1lVobjEEg.png" /></figure><p>This means:</p><p>· The first basis vector represents a <strong>1-unit change in height (ft)</strong> but no change in weight.</p><p>· The second basis vector represents a <strong>1-unit change in weight (lbs)</strong> but no change in height.</p><h4>2. Changing the Basis</h4><p>You could instead measure:</p><p>· Height in <strong>centimeters</strong> instead of feet, and</p><p>· Weight in <strong>kilograms</strong> instead of pounds.</p><p>Now, your <strong>basis vectors</strong> become different — but the <strong>underlying information</strong> (the person’s size) remains the same.</p><p>So, we’re expressing the <strong>same vector (same data point)</strong> in a <strong>new coordinate system</strong> — just like converting from one unit to another.</p><p>The following table summarises the results you get when you make the change.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wqU1xLcZorgf7E4OUY6ULw.png" /></figure><h4>3. 
Why This Matters in PCA</h4><p>This is exactly what PCA does — it changes the <strong>basis</strong> of the data.</p><p>· The <strong>old basis</strong>: your original features (like height and weight).</p><p>· The <strong>new basis</strong>: the <strong>principal components</strong> — new axes that best explain how your data varies.</p><p>So PCA doesn’t change your data’s meaning — it just <strong>re-expresses it in a more efficient coordinate system</strong> (the one aligned with maximum variance).</p><p>In the previous segment, you saw a demonstration on how the change of basis led to dimensionality reduction. Let’s go ahead and understand the elegant way of doing the same calculations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/676/1*saZpuflVglj0vuLDrcpgpQ.jpeg" /></figure><h4>1. From Scalar to Matrix Transformation</h4><p>Earlier, when dealing with one-dimensional data (like converting meters to centimeters), you could simply multiply by a <strong>number (scalar)</strong> — for example:</p><p><strong>length in cm = 100 × length in meters</strong></p><p>But when you have <strong>multi-dimensional data</strong> (e.g., height and weight together), conversion involves <strong>more than one variable</strong>, so the transformation must be done using a <strong>matrix</strong> rather than a single number.</p><p>So instead of: <strong>y = Mx</strong></p><p>(where <em>M</em> was just a number before),<br> now <em>M</em> becomes a <strong>matrix</strong> that handles how each variable affects the others.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vDu_Fel2Y1EqqK52St4RTw.png" /></figure><h4>2. 
The Example Explained</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v2SBeSpxjwjko_5VfLYrxw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Prtl351EF6KIC0FzczJLWg.png" /></figure><p>This means:</p><p>· The <strong>x-values</strong> (like height) are scaled by 30.48 — for example, converting feet to centimeters.</p><p>· The <strong>y-values</strong> (like weight) are scaled by 0.45 — for example, converting pounds to kilograms.</p><h4>3. But What If You Want to Go Back?</h4><p>If you want to convert <strong>from the new basis back to the old basis</strong>,<br> you can’t just “divide” or take a simple reciprocal (because <em>M</em> is a <strong>matrix</strong>, not a number).</p><p>To reverse a matrix transformation, you use the <strong>matrix inverse</strong>.</p><p>That is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/1*ZZuUSyjQOPWqtgcxEmGSbQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i4U24bbU4YIIyfvsZ8mSpA.png" /></figure><h4>4. 
Why This Matters for PCA</h4><p>PCA works by <strong>transforming</strong> data from the original coordinate system (your features) to a <strong>new one</strong> (the principal components).<br> To do that:</p><p>· It uses a <strong>transformation matrix</strong> made of <strong>eigenvectors</strong>.</p><p>· If you want to go back to your original space, you use the <strong>inverse (or transpose)</strong> of that matrix.</p><p>So this concept — of using a <strong>matrix and its inverse</strong> to switch between coordinate systems — is at the <strong>heart of PCA math</strong>.</p><h3><strong>Change of Basis: Solved Examples</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J8LSkFz4C38q8spoR-xjxA.png" /></figure><h3>Understanding How to Move Between Different Bases</h3><p>The main equation that helps us move from one set of basis vectors to another is:</p><p>New Basis Representation = M × Old Basis Representation</p><p>Our goal is to express <strong>M</strong> (the transformation matrix) in terms of the <strong>old basis vectors</strong> and the <strong>new basis vectors</strong>.</p><h4>Notation Setup</h4><p>To make this easier:</p><p>· <strong>B₁</strong> → represents the <strong>old basis</strong>, and <strong>v₁</strong> is the <strong>old basis representation</strong>.</p><p>· <strong>B₂</strong> → represents the <strong>new basis</strong>, and <strong>v₂</strong> is the <strong>new basis representation</strong>.</p><p>So, the above equation can now be written as: <strong>v2 = M × v1</strong></p><p>Let’s call this <strong>Equation 1</strong>.</p><h4>Relating Old and New Bases</h4><p>When switching between multiple bases, the following relationship always holds:</p><p>B1 × v1 = B2 × v2</p><p>This means that the same vector (point) can be represented using either the old or new basis — the vector itself doesn’t change, only the coordinate system does.</p><h4>Deriving the Transformation Matrix</h4><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*jPIdh8jz8zEuGuW41rDsRg.png" /></figure><p>Let’s call this <strong>Equation 2</strong>.</p><p><strong>Comparing Equation 1 and Equation 2</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2Sf-NJ9-DxXRa9tF0yciSQ.png" /></figure><p><strong>In Simple Words</strong></p><p>To convert coordinates from one basis to another, we multiply by a transformation matrix <strong>M</strong>.<br> That matrix is found by multiplying the <strong>inverse of the new basis</strong> by the <strong>old basis</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/282/1*2LdtH_mU6LlPHWI9LRxg8Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eQWAOCiaQ9So306lTBF3Fw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cNNQsR-Kx_LBGXZfgWjcjg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dVsnVVgV0itvna4DwUjNYg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VwaYdC2g-GULIkd-tdLOFA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c_Lf7LL1-M9eBrR-sS-Vug.png" /></figure><p>Mainly when we’re moving between multiple basis vectors, it’s important to know that the <strong>point’s position in space doesn’t change</strong>. 
The point’s representation might be different in different basis vectors but it would be representing the same point.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G5PyoUHhaCJIQimJLiC4zg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wANA44re3UuxEEzcBncOrw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*idrm9zd9dmxhxmOp2HvZ4w.png" /></figure><p><strong>Change of Basis: Solved Python</strong></p><p><strong>Code : </strong><a href="https://drive.google.com/file/d/1KJZ7yei5x4rrRoeGLLeT82CYH-lfAQkO/view?usp=sharing">https://drive.google.com/file/d/1KJZ7yei5x4rrRoeGLLeT82CYH-lfAQkO/view?usp=sharing</a></p><p>Practice running the above code and explore how it works. If you get stuck or have questions, let me know in the comments — we’ll figure it out together!</p><h3>Understanding the Next Step — Variance as Information</h3><p>In the previous session, you learned the first key idea behind PCA — <strong>the concept of a basis</strong> and how changing the basis can help simplify or reduce dimensions in data.<br> You also saw that the same dataset can be represented using <strong>different basis vectors</strong> (or coordinate systems).</p><p>However, we didn’t yet answer the most important question:</p><p>“How do we find the <em>best</em> or <em>ideal</em> basis vectors that summarize the data most effectively?”</p><h4>The Missing Ingredient: Variance</h4><p>This session introduces that missing piece — <strong>variance</strong>.</p><p>· In earlier approaches, we decided which features (columns) to remove using:</p><blockquote>Missing values (nulls)</blockquote><blockquote>Irrelevant or duplicate information</blockquote><blockquote>Statistical measures like <strong>p-values</strong> or <strong>VIF scores</strong></blockquote><p>But PCA uses a <strong>different and more powerful metric</strong> — <strong>variance</strong> — to decide what’s important.</p><p>· 
<strong>Variance</strong> measures <strong>how much the data spreads out or varies</strong>.</p><p>· Features with <strong>higher variance</strong> carry <strong>more information</strong> about differences between observations.</p><p>· Features with <strong>low variance</strong> are often <strong>less informative</strong> and can be reduced or removed.</p><h4>In Simple Words:</h4><p>PCA identifies which directions (basis vectors) capture the <strong>maximum variance</strong> — meaning, where the data changes the most.<br> These directions become the new <strong>principal components</strong>, helping reduce dimensions while keeping the most meaningful information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/860/1*GVvWmbU8eusPnU4LyOIs1Q.png" /></figure><h3><strong>Directions of Maximum Variance</strong></h3><h4>1. Unequal Variance → Easy Reduction</h4><p>When one feature (column) has <strong>much less variance</strong> than another, it’s clear that it contributes <strong>less information</strong>.</p><p>· For example, if <em>Height</em> varies a lot but <em>Weight</em> hardly changes, you can safely remove <em>Weight</em> without losing much information.</p><p>· PCA (or even basic feature selection) can easily reduce dimensions in such cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XUqV5b6aQDUriUHq8VjGXA.png" /></figure><h4>2. Similar Variances → Not So Easy</h4><p>Now look at the graph above — each red dot shows a data point (Height vs Weight).<br> You can see that:</p><p>· The data is spread <strong>similarly along both axes</strong> — height and weight have almost the <strong>same variance</strong>.</p><p>· So, you <strong>can’t easily decide</strong> which variable (axis) is more informative.</p><p>In this case, both features carry similar levels of variation, and neither axis seems clearly better for reduction.</p><h4>3. 
What PCA Does in This Case</h4><p>When variances are similar, PCA does something <strong>smarter</strong> — <br> it <strong>changes the coordinate system</strong> (the basis vectors).</p><p>Instead of keeping the original X (Weight) and Y (Height) axes, PCA:</p><p>· <strong>Rotates</strong> the axes to find <strong>new directions</strong> where data spreads the most.</p><p>· These new directions are called <strong>principal components</strong>.</p><p>· The <strong>first principal component (PC1)</strong> is the direction of <strong>maximum variance</strong> — the line along which the data points spread out the most.</p><p>This new basis (set of principal components) captures <strong>maximum information</strong> in fewer dimensions.</p><p><strong>Directions of Maximum Variance</strong></p><p>Basically, the steps of PCA for finding the principal components can be summarised as follows.</p><ul><li>First, it finds the basis vector that lies along the best-fit line, the direction that maximises the variance. This is our first <strong>principal component or PC1.</strong></li><li>The second principal component is perpendicular to the first principal component and contains the next highest amount of variance in the dataset.</li><li>This process continues iteratively, i.e. 
each new principal component is perpendicular to all the previous principal components and should explain the next highest amount of variance.</li><li>If the dataset contains <strong><em>n</em></strong> independent features, then PCA will create <strong><em>n</em></strong> Principal components.</li></ul><p>Consider a 2-D dataset represented as shown in the image below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uWGyLOSlj5hiq2PNIGIelA.png" /></figure><p>The principal components can be visually represented as shown in the image below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CGyjpMe74iOIwfXvpOZxmw.png" /></figure><p>Also, once the Principal Components are found, PCA assigns a percentage of variance to each PC. Essentially it’s the fraction of the total variance of the dataset explained by a particular PC. This helps in understanding which Principal Component is more important than the other and by how much. This is shown in the images below.</p><p><strong>Original Dataset</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*IaS2QLbc4uRGqYKdpU2e_w.png" /></figure><p><strong>PCA Modified Dataset</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*w_tbIlONDCa62UwvXhO8xw.png" /></figure><p>Since 100% of the total variance or information of the entire dataset is present in only one of the columns (PC1), we can safely drop PC2 and still be assured of losing no information.</p><h3><strong>The Workings of PCA</strong></h3><p>Let’s once again summarise the steps of PCA:</p><p>· <strong>Find n new features</strong><br> Choose a different set of n basis vectors (non-standard). 
These basis vectors are essentially the directions of maximum variance and are called Principal Components.</p><p>· <strong>Express the original dataset using these new features</strong><br> Transform the dataset from the original basis to this PCA basis.</p><p>· <strong>Perform dimensionality reduction</strong><br> Choose only a certain k (where k &lt; n) number of the PCs to represent the data. Remove those PCs which have less variance (explain less information) than others.</p><p>PCA acts as a pre-processing tool in the ML pipeline, predominantly used for dimensionality reduction to improve model performance.</p><p>The approach is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/842/1*SLAktpo0J7A49r7iz48Yzw.png" /></figure><p>Note — The number of principal components is the same as the number of columns in the dataset. PCs are sorted in descending order of information content.</p><p><strong>Before we end this session…</strong></p><p>· The methodology or the algorithm by which PCA maximises the variance and obtains the new basis vectors is the process of eigendecomposition of the covariance matrix.</p><p>· Using the eigendecomposition method, you’ll be able to obtain the new basis vectors that will function as the Principal Components numerically. 
These new basis vectors are also called eigenvectors.</p><p>· For <strong>example</strong>, in the roadmap case, the following PCs are obtained using the eigendecomposition of the covariance matrix of the original dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/412/1*1cZyBqSE8TgLFo4AlqCaBQ.png" /></figure><h4><strong>Implement PCA in Python</strong></h4><p><strong>Code and Data zipped : </strong><a href="https://drive.google.com/file/d/1zsd8lO5HrzzvphaeHMGtnZfH2qQj4Dzp/view?usp=sharing">https://drive.google.com/file/d/1zsd8lO5HrzzvphaeHMGtnZfH2qQj4Dzp/view?usp=sharing</a></p><h4>You learnt some important shortcomings of PCA:</h4><ul><li>PCA is limited to linear transformations; when the structure in the data is non-linear, <strong>non-linear techniques such as t-SNE</strong> can be used instead.</li><li>PCA needs the components to be perpendicular, though in some cases that may not be the best solution. An alternative technique is <strong>Independent Component Analysis.</strong></li><li>PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with high class imbalance).</li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unsupervised Learning and Clustering Explained with Python Examples]]></title>
            <link>https://medium.com/@saha.soumyadeep90/unsupervised-learning-and-clustering-explained-with-python-examples-f188c25c397b?source=rss-53767639011e------2</link>
            <guid isPermaLink="false">https://medium.com/p/f188c25c397b</guid>
            <dc:creator><![CDATA[Soumyadeep Saha]]></dc:creator>
            <pubDate>Wed, 15 Oct 2025 08:21:51 GMT</pubDate>
            <atom:updated>2025-10-15T10:11:20.813Z</atom:updated>
            <content:encoded><![CDATA[<p>In the previous blogs, you have learnt <strong>supervised learning</strong> techniques such as regression and classification. These methods rely on a <strong>training set with labels</strong> to teach the algorithm, which can then be applied to make predictions on new, unseen data.</p><p>In this module, we shift focus to <strong>unsupervised learning</strong>, where the data has <strong>no predefined labels</strong>. Instead, the algorithm tries to discover hidden patterns and structures directly from the data.</p><h3>In This Session</h3><p>· You will begin by learning about <strong>clustering</strong>, an unsupervised learning technique that groups data points based on similarity.</p><p>· A <strong>case study</strong> will demonstrate how clustering is applied in real-world industry problems.</p><p>· You will then explore the two most widely used clustering algorithms:</p><blockquote>K-Means Clustering</blockquote><blockquote>Hierarchical Clustering</blockquote><p>· You’ll also learn how to implement these algorithms in <strong>Python</strong>.</p><p>· Finally, we’ll discuss <strong>segmentation</strong> — how it differs from clustering, and where it is applied.</p><h4><strong>Practical Applications Of Clustering</strong></h4><ol><li><strong>Customer Insight:</strong> Say a retail chain with many stores across locations wants to manage its stores well and improve sales and performance. Cluster analysis can help the retail chain gain insights into customer demographics, purchase behaviour and demand patterns across locations. 
This will help the retail chain with assortment planning, planning promotional activities and store benchmarking for better performance and higher returns.</li><li><strong>Marketing:</strong> In the field of marketing, Cluster Analysis can help with market segmentation and positioning, and with identifying test markets for new product development.</li><li><strong>Social Media:</strong> In the areas of social networking and social media, Cluster Analysis is used to identify similar communities within larger groups.</li><li><strong>Medical</strong>: Cluster Analysis has also been widely used in biology and medical science, for example in human genetic clustering, grouping sequences into gene families, building groups of genes, and clustering organisms at the species level.</li></ol><h4>Segmentation: Key Requirements</h4><p>For segmentation to be meaningful and useful, the <strong>segments formed must be stable</strong>.</p><p>· This means that the <strong>same person should not fall into different segments</strong> if the data is segmented using the same criteria.</p><p>· Additionally, good segmentation requires:</p><blockquote><strong>Intra-segment homogeneity</strong> → members within the same segment should be similar to each other.</blockquote><blockquote><strong>Inter-segment heterogeneity</strong> → different segments should be clearly distinct from one another.</blockquote><p>Later in the module, you’ll see how these ideas can be expressed <strong>mathematically</strong>.</p><h4>Types of Market Segmentation</h4><p>Now, let’s look at the most commonly used types of <strong>market segmentation</strong> that are applied in real-world business contexts.</p><p><strong>Types of Customer Segmentation</strong></p><p>In practice, three main types of <strong>customer segmentation</strong> are commonly used:</p><p>1. 
<strong>Behavioural Segmentation</strong></p><blockquote>Based on the <strong>actual patterns of behavior</strong> displayed by consumers.</blockquote><blockquote>Examples: purchase frequency, product usage, brand loyalty.</blockquote><p>2. <strong>Attitudinal Segmentation</strong></p><blockquote>Based on the <strong>beliefs, values, or intentions</strong> of customers, even if these do not always translate into actual actions.</blockquote><blockquote>Example: customers who express a preference for eco-friendly products, even if they do not consistently purchase them.</blockquote><p>3. <strong>Demographic Segmentation</strong></p><p>Based on a customer’s <strong>profile information</strong>, such as:</p><blockquote>Age</blockquote><blockquote>Gender</blockquote><blockquote>Income</blockquote><blockquote>Education</blockquote><blockquote>Location</blockquote><h4>Session Summary</h4><p>In this session, you were introduced to the basics of <strong>unsupervised learning</strong> and got an initial understanding of how <strong>clustering</strong> works. You learned that clustering groups data points based on similarities, without relying on predefined labels.</p><h4>What’s Next</h4><p>In the upcoming sessions, we will dive deeper into clustering and explore two of the most widely used clustering algorithms:</p><p>· <strong>K-Means Clustering</strong> — a partition-based method that groups data into <em>k</em> clusters.</p><p>· <strong>Hierarchical Clustering</strong> — a tree-based method that builds clusters step by step in a hierarchy.</p><h3><strong>Welcome to the Session on <em>K-Means Clustering</em></strong></h3><p>In this session, we’ll take a deeper look at one of the most widely used clustering algorithms: <strong>K-Means Clustering</strong>. 
This algorithm is simple, intuitive, and highly effective, making it one of the first choices for many clustering tasks in practice.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mtq7umQeokUoLHMltjk8vQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K309gzX0wCg7ZafAWjXM4A.png" /></figure><h4><strong>Clustering with Euclidean Distance</strong></h4><p>The concept of a <strong>distance measure</strong> is quite intuitive:</p><ul><li>If two observations are <strong>close</strong> to each other, they will have a <strong>low Euclidean distance</strong>.</li><li>If two observations are <strong>far apart</strong>, they will have a <strong>high Euclidean distance</strong>.</li></ul><p>So, how does clustering use this?</p><p>In the clustering process (like in <strong>K-Means</strong>):</p><ol><li>The algorithm begins with a set of <strong>cluster centers</strong> (also called centroids).</li><li>Each observation in the dataset is assigned to the cluster whose centroid is <strong>closest</strong> to it, based on Euclidean distance.</li><li>Once all points are assigned, the centroids are <strong>recalculated</strong> as the mean of all the points in that cluster.</li><li>Steps 2 and 3 are repeated until the centroids stop moving significantly (or a maximum number of iterations is reached).</li></ol><p>In short, clustering with Euclidean distance groups observations such that points within a cluster are <strong>close to each other</strong>, while points in different clusters are <strong>farther apart</strong>.</p><h4><strong>Centroid</strong></h4><p>A crucial concept in clustering is the <strong>centroid</strong>.</p><p>From high school geometry, you may remember that the centroid is the <strong>center point of a triangle</strong>. 
Similarly, in clustering, the centroid is the <strong>center point of a cluster</strong> — it represents the “average location” of all the points belonging to that cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*lAS6rr_CfsQR8gL1z0HnxA.jpeg" /></figure><h4>Why Do We Need a Centroid?</h4><p>Take the example shown above, where students’ marks in <strong>Mathematics</strong> and <strong>Biology</strong> form four distinct clusters:</p><p>· <strong>Cluster 1</strong>: High Biology, Low Maths</p><p>· <strong>Cluster 2</strong>: Average in both Biology and Maths</p><p>· <strong>Cluster 3</strong>: High in both Biology and Maths</p><p>· <strong>Cluster 4</strong>: High Maths, Low Biology</p><p>From the visual, we can clearly see how groups are formed. But suppose you want to compare two clusters — say Cluster 1 and Cluster 2:</p><p>· By how many marks do students in Cluster 1 outperform those in Cluster 2 in Biology?</p><p>· By how much do they underperform in Maths?</p><p>You <strong>cannot answer this precisely just by looking at the plot</strong>. That’s where the <strong>centroid</strong> becomes useful.</p><h3>Calculating a Centroid</h3><p>As explained, a <strong>centroid</strong> is the <strong>cluster center</strong>, representing the average of all observations in that cluster. To compute it, you simply take the <strong>mean of each column (dimension)</strong> across the observations in the cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EjTKxg2ewYOgsHZ__xyoWQ.png" /></figure><p><strong>Step 1: Compute Mean for Each Feature</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/784/1*9xPALBPjEY13FIB4XaRRnQ.png" /></figure><p><strong>Step 2: Form the Centroid</strong></p><p>Thus, the centroid of this group of observations is: <strong>(173.75, 83.75, 23.75)</strong>. 
This single point summarizes the cluster’s <strong>average height, weight, and age</strong>.</p><p><strong>Key takeaway</strong>: Centroids provide a <strong>numerical summary</strong> of clusters, making it possible to compare groups quantitatively rather than just visually.</p><h4><strong>Steps of the Algorithm: </strong>K-Means Algorithm with a Simple Example</h4><p>Let’s understand how the <strong>K-Means algorithm</strong> works step by step using a very simple scenario.</p><p>Suppose you have the data of <strong>10 students</strong> with their marks in <strong>Biology (y-axis)</strong> and <strong>Math (x-axis)</strong>. You want to divide these students into <strong>2 clusters</strong> so that you can see the types of students in the class.</p><p>Imagine two groups forming — one colored <strong>red</strong> and the other <strong>yellow</strong>. The question is: how will the algorithm decide which student belongs to which group?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*lAS6rr_CfsQR8gL1z0HnxA.jpeg" /></figure><p><strong>Step 1: Recall the Centroid</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*55K868fohQIhOTb0NsTaDQ.png" /></figure><p><strong>Step 2: How K-Means Uses the Centroid</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HpfCKIN-RpRZWXmBgKndeA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7SkacFFU32_zyb1MY9Ay-Q.png" /></figure><p><strong>End Result</strong>:<br> The 10 students will be divided into 2 clusters, with each cluster containing students who have <strong>similar performance in Math and Biology</strong>.</p><h4>K-Means Cost Function</h4><p>The <strong>goal of K-Means</strong> is to form clusters where points within a cluster are as close as possible to their cluster centroid. 
To measure this closeness, K-Means uses a <strong>cost function</strong> (also called the objective function or distortion function).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*blASTpEmomSPz1cAStCfFA.png" /></figure><p><strong>What Does It Mean?</strong></p><p>· The formula calculates the <strong>squared Euclidean distance</strong> between each data point and its assigned cluster centroid.</p><p>· The total cost is the <strong>sum of these squared distances</strong> across all clusters.</p><p>· The K-Means algorithm works by <strong>minimizing this cost function</strong> through iterative updates:</p><blockquote>1. <strong>Assignment step</strong> — assign each point to the nearest centroid.</blockquote><blockquote>2. <strong>Update step</strong> — recalculate centroids as the mean of their assigned points.</blockquote><p>With each iteration, the cost function decreases or stays the same, and the algorithm converges when the assignments no longer change significantly.</p><h4>The Two Steps of K-Means</h4><p>The <strong>K-Means algorithm</strong> runs in an iterative loop of two key steps: <strong>Assignment</strong> and <strong>Optimization</strong>.</p><h4>1. Assignment Step</h4><p>In this step, each data point is assigned to the cluster whose centroid is closest to it.</p><p>Formally, for each data point Xi:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/866/1*1bQTV6BDPtWQj09i2Ebmig.png" /><figcaption>This ensures every point is grouped with the centroid it is closest to</figcaption></figure><p><strong>2. 
Optimization Step</strong></p><p>Once all points have been assigned, the centroids of the clusters are recalculated.</p><p>For each cluster, the new centroid is the average of all points assigned to it:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/866/1*g0Lpqn0MYqnVj-j-s_diUw.png" /><figcaption>This moves the centroid to the <strong>“center of mass”</strong> of its cluster</figcaption></figure><p><strong>Repeat Until Convergence</strong></p><p>The algorithm repeats the <strong>assignment step</strong> and the <strong>optimization step</strong> until:</p><p>· The cluster assignments no longer change, or</p><p>· The improvement in the cost function becomes negligible.</p><p>At this point, K-Means is said to have <strong>converged</strong>, producing stable clusters.</p><h3><strong>K-Means++ Algorithm</strong></h3><p>In the previous segment, you learned how the K-Means algorithm alternates between two steps — <strong>assignment</strong> and <strong>optimization</strong> — in order to minimize the cost function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*kBX2H5GwK3ceGPLxGq_dbg.png" /></figure><p>However, one limitation of standard K-Means is its <strong>sensitivity to the choice of initial centroids</strong>. If the starting centroids are chosen poorly (e.g., too close to each other), the algorithm may converge to suboptimal clusters.</p><p>To address this, the <strong>K-Means++ algorithm</strong> was introduced as a smarter initialization strategy.</p><h4>How K-Means++ Works</h4><p>1. <strong>Choose the first centroid randomly</strong> from the dataset.</p><p>2. <strong>Compute distances</strong>: For each remaining data point, calculate its distance from the <strong>nearest chosen centroid</strong>.</p><p>3. 
<strong>Select the next centroid</strong>: Pick the next centroid from the data points with a probability proportional to the <strong>square of its distance</strong> from the nearest chosen centroid.</p><blockquote>Intuitively, points that are farther away from existing centroids are more likely to be chosen as new centroids.</blockquote><p>4. <strong>Repeat Steps 2 and 3</strong> until K centroids are chosen.</p><p>5. Once the initialization is complete, proceed with the standard <strong>K-Means algorithm</strong> (assignment + optimization).</p><h4>Why K-Means++ is Better</h4><p>· Ensures that initial centroids are <strong>spread out</strong>, reducing the risk of poor clustering.</p><p>· Leads to <strong>faster convergence</strong>.</p><p>· Produces <strong>better quality clusters</strong> compared to random initialization.</p><p><strong>Key takeaway</strong>: K-Means++ is not a new algorithm but an <strong>improved initialization procedure</strong> that makes the standard K-Means more robust and efficient.</p><p>Let’s see the K-Means algorithm in action using a visualisation tool. This tool can be found on <a href="http://www.naftaliharris.com/blog/visualizing-k-means-clustering/">naftaliharris.com</a>. You can go to this link and play around with the different options available to get an intuitive feel of the K-Means algorithm.</p><p>Upon trying the different options, you may have noticed that the final clusters that you obtain vary depending on many factors, such as choice of the initial cluster centres and the value of K, i.e. the number of clusters that you want. 
You will understand these factors and other practical considerations while using the K-Means algorithm in more detail in the next segment.</p><h4>Let’s solve an exercise now</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4zyF0_9EZdzTzRTHt_RJKA.png" /></figure><p><strong>Data (X1, X2):</strong><br> 1:(1,4) 2:(1,3) 3:(0,4) 4:(5,1) 5:(6,2) 6:(4,0)</p><h4>Iteration 0 — initial assignment</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RBrW5Q-JouhAHSBAU5S-Gg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yZYvI0eWcWieMxlifSz2vA.png" /></figure><h4>Iteration 1 — recompute centroids</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zHzEV3TC2mB0g15JZfsCOA.png" /></figure><h4><strong>Practical Considerations in the K-Means Algorithm</strong></h4><p>Before applying K-Means clustering, you must be aware of some practical issues that can affect the quality of clusters:</p><p>1. <strong>Number of Clusters (K):</strong><br> The value of K must be chosen before running the algorithm. A wrong choice leads to poor clustering.</p><p>2. <strong>Initial Cluster Centers:</strong><br> The starting centroids influence the final clusters. Poor initialization can give different or unstable results.</p><p>3. <strong>Outliers:</strong><br> K-Means is sensitive to outliers, as they can shift centroids away from their true positions.</p><p>4. <strong>Feature Scaling:</strong><br> Since Euclidean distance is used, all features should be on the same scale. Standardization is usually required.</p><p>5. <strong>Categorical Data:</strong><br> K-Means does not work well with categorical variables — it is mainly for numerical data.</p><p>6. 
<strong>Convergence:</strong><br> The algorithm may not converge within a fixed number of iterations, so always check if clusters have stabilized.</p><h3>Silhouette Analysis in K-Means</h3><p>A useful method for evaluating cluster quality, and for comparing candidate values of K, is <strong>Silhouette Analysis</strong> (or the Silhouette Coefficient).</p><p>· It is a measure that shows how well a data point fits within its assigned cluster.</p><p>· It compares <strong>cohesion</strong> (similarity of a point to its own cluster) with <strong>separation</strong> (difference from other clusters).</p><p>· A higher silhouette score means the data point is well-clustered, while a lower or negative score indicates poor clustering.</p><h4>Computing Silhouette Metric</h4><p>To calculate the <strong>silhouette score</strong> for a data point, we need two measures:</p><p>1. <strong>Cohesion (a):</strong><br> The average distance of the point from all other points in its <strong>own cluster</strong>.</p><p>2. <strong>Separation (b):</strong><br> The average distance of the point from all points in the <strong>nearest neighboring cluster</strong>.</p><p>Once we have these two values, the <strong>silhouette coefficient (s)</strong> for each point is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/412/1*20xYc-1JA5kEa1HyEOolcQ.png" /></figure><p>· If <strong>s is close to +1</strong>, the point is well clustered.</p><p>· If <strong>s is close to 0</strong>, the point lies between two clusters.</p><p>· If <strong>s is negative</strong>, the point may be assigned to the wrong cluster.</p><h4><strong>K-Means Clustering — Cluster Tendency</strong></h4><p>Before applying any clustering algorithm, it is important to check whether the data actually has <strong>meaningful clusters</strong> or not. This ensures that the data is not just random. 
The process of evaluating whether data is suitable for clustering is called <strong>clustering tendency</strong>.</p><p>As discussed earlier, clustering algorithms like K-Means will always return <em>K clusters</em>, even if no natural clusters exist in the data. Hence, we should not blindly apply clustering methods. Instead, we must first check the <strong>cluster tendency</strong>.</p><p>One common way to do this is the <strong>Hopkins Test</strong>. This test checks whether the data distribution is significantly different from a uniform (random) distribution in the multidimensional space. If the data is truly random, clustering will not give meaningful results.</p><h3>Session Summary: K-Means Clustering</h3><p>In this session, we started by intuitively understanding <strong>K-Means</strong> through the example of grouping 10 random points into 2 clusters.</p><p>· The algorithm begins by selecting <strong>K random cluster centers</strong>.</p><p>· Then, two steps — <strong>Assignment</strong> (assigning points to the nearest cluster) and <strong>Optimization</strong> (updating cluster centers) — are repeated until the clusters stop changing.</p><p>· The result is a set of <strong>locally optimal clusters</strong> for the chosen starting centers: intra-cluster distance is minimized (points within a cluster are close), and the clusters tend to be well separated.</p><p>We also discussed several <strong>practical issues</strong> to keep in mind when applying K-Means:</p><p>1. <strong>Choosing K:</strong> You must decide the number of clusters before running the algorithm.</p><p>2. <strong>Non-deterministic nature:</strong> K-Means can give different results on the same dataset because outcomes depend on the choice of initial cluster centers.</p><p>3. <strong>Outliers:</strong> Outliers can distort the clusters, leading to poor results.</p><p>4. 
<strong>Feature scaling:</strong> Since Euclidean distance is commonly used, all attributes need to be brought to the same scale using standardization.</p><p>5. <strong>Categorical data:</strong> K-Means cannot be directly applied to categorical variables; specialized algorithms like <strong>K-Modes</strong> or <strong>K-Prototypes</strong> are used instead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SVxAfYzJab5vcFGBTROtEQ.png" /></figure><h3><strong>K-Means</strong> in Python</h3><p><strong>Data: </strong><a href="https://drive.google.com/file/d/1BPagY_u3059RA7wFi-0NCoY-44lawUjd/view?usp=sharing">https://drive.google.com/file/d/1BPagY_u3059RA7wFi-0NCoY-44lawUjd/view?usp=sharing</a></p><p><strong>Code: </strong><a href="https://drive.google.com/file/d/16QQio0oAI7F6nVcBiUd71tvxEDL15AHg/view?usp=sharing">https://drive.google.com/file/d/16QQio0oAI7F6nVcBiUd71tvxEDL15AHg/view?usp=sharing</a></p><p>Practice running the above code and explore how it works. If you get stuck or have questions, let me know in the comments — we’ll figure it out together!</p><h3><strong>Hierarchical Clustering</strong></h3><h4>Hierarchical Clustering vs K-Means</h4><p>· <strong>K-Means Limitation:</strong> You must decide the number of clusters <strong>K</strong> in advance.</p><p>· <strong>Hierarchical Clustering Advantage:</strong> No need to specify K beforehand.</p><h4>Output Difference</h4><p>· <strong>K-Means:</strong> Produces fixed clusters by assigning data to centroids and refining them.</p><p>· <strong>Hierarchical Clustering:</strong> Produces a <strong>dendrogram</strong> (an inverted tree structure) showing how data points merge step by step.</p><h4>Process of Hierarchical Clustering</h4><p>1. Compute an <strong>N×N distance (similarity) matrix</strong> between all items.</p><p>2. Initially, treat each item as a <strong>separate cluster</strong> (N clusters).</p><p>3. Merge the <strong>two closest clusters</strong> into one.</p><p>4. 
Repeat merging and updating distances until all items form <strong>one cluster</strong>.</p><p>5. The final output is a <strong>dendrogram</strong>, which shows the hierarchy of merges and the distance at which they happened.</p><h4><strong>Interpreting the Dendrogram</strong></h4><p>The result of the cluster analysis is shown by a dendrogram, which starts with each data point as a separate cluster and indicates at what level of dissimilarity any two clusters were joined.</p><p>As you saw, the y-axis of the dendrogram is some measure of the dissimilarity or distance at which clusters join.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/648/1*GukO3fkls9ErGLGP_wCKIQ.png" /></figure><p>In the dendrogram shown above, samples 4 and 5 are the most similar and join to form the first cluster, followed by samples 1 and 10. The last two clusters to fuse together to form the final single cluster are 3–6 and 4–5–2–7–1–10–9–8.</p><p>Determining the number of groups in a cluster analysis is often the primary goal. Typically, one looks for natural groupings defined by long stems. Here, by observation, you can identify that there are 3 major groupings: 3–6, 4–5–2–7 and 1–10–9–8.</p><p>You also saw that hierarchical clustering can proceed in 2 ways — <strong>agglomerative</strong> and <strong>divisive</strong>. If you start with n distinct clusters and iteratively merge them until only 1 cluster remains, it is called agglomerative clustering. 
On the other hand, if you start with 1 big cluster and subsequently keep on partitioning this cluster to reach n clusters, each containing 1 element, it is called divisive clustering.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*J_AwwVuyLH3R_UrtdnfNhA.png" /></figure><p>The dendrogram helps you decide at which level to “cut the tree” to obtain the desired number of clusters.</p><ul><li>You learnt about <strong>Hierarchical Clustering</strong> as another clustering method.</li><li>Unlike <strong>K-Means</strong>, it does <strong>not require pre-defining the number of clusters</strong>.</li><li>It produces a <strong>dendrogram</strong>, which shows how clusters are formed step by step.</li><li>The main drawback is that it requires computing the <strong>distance between every pair of points</strong>, making it <strong>time-consuming and computationally expensive</strong> for large datasets.</li></ul><h4>Expert Insights: K-Means vs Hierarchical Clustering</h4><p>· The choice between <strong>K-Means</strong> and <strong>Hierarchical Clustering</strong> depends mainly on:</p><p>1. <strong>Hardware/Computing power</strong> — Hierarchical clustering is more resource-intensive since it requires pairwise distance calculations.</p><p>2. 
<strong>Data Size and Nature</strong> — K-Means works well on large datasets, while Hierarchical clustering is better for smaller datasets or when you want to explore natural groupings.</p><h4>Summary Flow</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yL0lzSr32DB1MbC5lEiIjg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yA8nGR6ECTFhyHk9doROrQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9nwOSUoph0U2xmNjS83oLg.png" /></figure><h3>A Practical Hack for Segmentation</h3><p>Instead of relying on only one method, you can <strong>combine both approaches</strong>:</p><p><strong>Step 1:</strong> Use <strong>Hierarchical clustering</strong> to understand the data structure and estimate the likely number of clusters (by reading the dendrogram).</p><p><strong>Step 2:</strong> Use this number of clusters as input to <strong>K-Means</strong> to perform efficient clustering on larger datasets.</p><p>This way, you get the best of both:</p><p>· <strong>Interpretability</strong> from Hierarchical clustering,</p><p>· <strong>Scalability</strong> from K-Means clustering.</p><h3>Comparison of Linkages in Hierarchical Clustering</h3><p>1. <strong>Single Linkage (Minimum Distance)</strong></p><blockquote>Defines cluster distance by the <strong>closest pair</strong> of points.</blockquote><blockquote>Often causes a <strong>chaining effect</strong> → clusters become long and loose.</blockquote><blockquote>Dendrogram is not very well-structured.</blockquote><p>2. <strong>Complete Linkage (Maximum Distance)</strong></p><blockquote>Defines cluster distance by the <strong>farthest pair</strong> of points.</blockquote><blockquote>Produces <strong>compact, well-separated clusters</strong>.</blockquote><blockquote>Dendrogram is cleaner and easier to interpret.</blockquote><p>3. 
<strong>Average Linkage (Mean Distance)</strong></p><blockquote>Defines cluster distance as the <strong>average of all pairwise distances</strong> between clusters.</blockquote><blockquote>Balances the advantages of single and complete linkage.</blockquote><blockquote>Gives reasonably well-structured dendrograms.</blockquote><h4>Which is Best?</h4><p>· <strong>Complete Linkage</strong> generally gives the most <strong>well-separated dendrogram</strong> because it forces clusters to be compact and distinct.</p><p>· <strong>Advantages:</strong></p><blockquote>Easier to see clear groupings.</blockquote><blockquote>Better interpretability.</blockquote><blockquote>More reliable for business decisions where distinct segments are needed.</blockquote><p>Play around with various linkages and numbers of clusters. You will be able to see the number of natural clusters from the dendrogram itself. If you want, you can change the scale as well. Which group of parameters gives you the best result?</p><h4>Choosing the Number of Clusters</h4><p>By looking at the dendrogram and applying <strong>general knowledge about Indian states</strong> (e.g., southern states being more educated, BIMARU states having lower literacy), the following clusters make logical sense:</p><p>1. <strong>High literacy and higher education states</strong> → Kerala, Tamil Nadu, Delhi, Chandigarh.</p><p>2. <strong>Moderately literate, growing education states</strong> → Maharashtra, Gujarat, Karnataka, Punjab.</p><p>3. <strong>Low literacy, education-challenged states</strong> → Bihar, Uttar Pradesh, Jharkhand, Rajasthan, Madhya Pradesh.</p><p>Cutting the dendrogram at around <strong>3–4 clusters</strong> with <strong>complete or average linkage</strong> gives the most meaningful results.</p><p>That wraps up our journey through Unsupervised Learning and Clustering! 
If you have any questions or doubts, feel free to drop them in the comments — I’m always happy to help.</p>]]></content:encoded>
        </item>
    </channel>
</rss>