<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mohitkumar Mahto on Medium]]></title>
        <description><![CDATA[Stories by Mohitkumar Mahto on Medium]]></description>
        <link>https://medium.com/@mohitkumar.mahto1205?source=rss-e25cf48777e4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*jg6zzxR2nyBgWeSg</url>
            <title>Stories by Mohitkumar Mahto on Medium</title>
            <link>https://medium.com/@mohitkumar.mahto1205?source=rss-e25cf48777e4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 19 May 2026 14:45:50 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mohitkumar.mahto1205/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[NLP Looks Scary Until You Understand Bag of Words]]></title>
            <link>https://medium.com/@mohitkumar.mahto1205/nlp-looks-scary-until-you-understand-bag-of-words-6f4a8976ed04?source=rss-e25cf48777e4------2</link>
            <guid isPermaLink="false">https://medium.com/p/6f4a8976ed04</guid>
            <category><![CDATA[bag-of-words]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[genai]]></category>
            <dc:creator><![CDATA[Mohitkumar Mahto]]></dc:creator>
            <pubDate>Fri, 15 May 2026 12:34:51 GMT</pubDate>
            <atom:updated>2026-05-15T12:38:19.840Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v7WI-h8Z2o5WTEB2CcpNew.png" /></figure><p>When I first started learning NLP, one of the easiest text representation techniques I came across was <strong>Bag of Words (BoW)</strong>.</p><p>The idea is pretty simple:<br>Instead of understanding the meaning of a sentence, Bag of Words only focuses on <strong>which words are present and how many times they appear</strong>.</p><p>It completely ignores grammar and word order.</p><h3>Simple Definition</h3><p>Bag of Words is a text representation technique where each sentence is converted into a numerical vector based on the frequency or presence of each vocabulary word in it.</p><h3>Let’s Understand with an Example</h3><p>Suppose we have these sentences:</p><pre>S1 = &quot;Cats are cute&quot;<br>S2 = &quot;Dogs are cute&quot;<br>S3 = &quot;Cats and dogs play&quot;</pre><h3>Text Preprocessing</h3><p>Usually, before applying BoW, we do some preprocessing:</p><ul><li>Convert all words to lowercase</li><li>Remove stopwords if needed</li><li>Apply stemming/lemmatization if needed</li></ul><p>After lowercasing and removing the stopwords “are” and “and” (stemming is skipped here to keep the words readable):</p><pre>S1 = &quot;cats cute&quot;<br>S2 = &quot;dogs cute&quot;<br>S3 = &quot;cats dogs play&quot;</pre><h3>Create Vocabulary</h3><p>Now collect all unique words from the preprocessed sentences.</p><pre>[&quot;cats&quot;, &quot;cute&quot;, &quot;dogs&quot;, &quot;play&quot;]</pre><h3>Create Vectors</h3><p>Now represent each sentence using the vocabulary, with one position per vocabulary word.</p><pre>Sentence  cats  cute  dogs  play<br>S1        1     1     0     0<br>S2        0     1     1     0<br>S3        1     0     1     1</pre><p>This is called a <strong>Binary Bag of Words</strong> because values are either 1 or 0.</p><ul><li>1 → word is present</li><li>0 → word is absent</li></ul><h3>Frequency Based Bag of Words</h3><p>Instead of only checking presence, we can also store how many times each word appears.</p><p>Example:</p><pre>Sentence = &quot;cat cat dog&quot;</pre><p>Vocabulary:</p><pre>[&quot;cat&quot;, &quot;dog&quot;]</pre><p>Vector:</p><pre>[2, 1]</pre><p>Because:</p><ul><li>“cat” appears 2 times</li><li>“dog” appears 1 time</li></ul><h3>Advantages of Bag of Words</h3><ul><li>Simple and easy to understand</li><li>Works well for basic text classification tasks</li><li>Converts text into fixed-size numerical vectors</li></ul><h3>Limitations of Bag of Words</h3><h3>1. Ignores Word Order</h3><p>These two sentences get exactly the same representation, because they contain the same words the same number of times:</p><pre>&quot;dog bites man&quot;<br>&quot;man bites dog&quot;</pre><p>But their meanings are completely different.</p><h3>2. Sparse Matrix Problem</h3><p>If the vocabulary becomes huge, the vectors become very large and are mostly filled with zeros, since each sentence uses only a few of the vocabulary words.</p><h3>3. No Semantic Meaning</h3><p>BoW cannot understand relationships between words.</p><p>For example:</p><ul><li>“car” and “vehicle” are related</li><li>But BoW treats them as completely different words</li></ul><h3>4. Out Of Vocabulary (OOV)</h3><p>If a new word appears during testing that was not present during training, it has no position in the vocabulary, so it is simply dropped from the vector and its information is lost.</p><h3>Final Thoughts</h3><p>Bag of Words may look basic today, but it was one of the foundational ideas in NLP.</p><p>Understanding BoW makes it much easier to learn more advanced techniques later, such as:</p><ul><li>TF-IDF</li><li>Word2Vec</li><li>GloVe</li><li>FastText</li><li>Embeddings</li></ul><p>Sometimes the simplest ideas are the best place to start 🚀</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[One Hot Encoding]]></title>
            <link>https://medium.com/@mohitkumar.mahto1205/one-hot-encoding-f859cd075004?source=rss-e25cf48777e4------2</link>
            <guid isPermaLink="false">https://medium.com/p/f859cd075004</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[genai]]></category>
            <dc:creator><![CDATA[Mohitkumar Mahto]]></dc:creator>
            <pubDate>Fri, 15 May 2026 07:03:40 GMT</pubDate>
            <atom:updated>2026-05-15T07:03:40.930Z</atom:updated>
            <content:encoded><![CDATA[<h3>Simple Definition</h3><p>One Hot Encoding is a technique used to convert words or categories into binary vectors where only one position is 1 and all other positions are 0.</p><h3>Let’s Understand with a Simple Example</h3><p>Suppose we have these sentences:</p><ul><li>D1 = “I love pizza”</li><li>D2 = “I love burgers”</li><li>D3 = “Pizza is tasty”</li></ul><h3>Step 1: Create Vocabulary</h3><p>First, collect all unique words from the sentences, treating “Pizza” and “pizza” as the same word.</p><pre>[&quot;I&quot;, &quot;love&quot;, &quot;pizza&quot;, &quot;burgers&quot;, &quot;is&quot;, &quot;tasty&quot;]</pre><p>Vocabulary size = 6</p><h3>Step 2: Assign One-Hot Vectors</h3><pre>Word     Vector<br>I        [1 0 0 0 0 0]<br>love     [0 1 0 0 0 0]<br>pizza    [0 0 1 0 0 0]<br>burgers  [0 0 0 1 0 0]<br>is       [0 0 0 0 1 0]<br>tasty    [0 0 0 0 0 1]</pre><p>Each word gets its own unique binary vector, with the 1 placed at that word’s position in the vocabulary.</p><h3>Step 3: Encode Sentences</h3><p>Each sentence becomes a sequence of one-hot vectors, one per word.</p><h3>D1 = “I love pizza”</h3><pre>I      -&gt; [1 0 0 0 0 0]<br>love   -&gt; [0 1 0 0 0 0]<br>pizza  -&gt; [0 0 1 0 0 0]</pre><p>Encoded form:</p><pre>[<br> [1 0 0 0 0 0],<br> [0 1 0 0 0 0],<br> [0 0 1 0 0 0]<br>]</pre><h3>D2 = “I love burgers”</h3><pre>[<br> [1 0 0 0 0 0],<br> [0 1 0 0 0 0],<br> [0 0 0 1 0 0]<br>]</pre><h3>D3 = “Pizza is tasty”</h3><pre>[<br> [0 0 1 0 0 0],<br> [0 0 0 0 1 0],<br> [0 0 0 0 0 1]<br>]</pre><h3>Limitations of One Hot Encoding</h3><ul><li>Vectors become very large when vocabulary size increases, because each vector has one position per vocabulary word.</li><li>It does not understand word meaning.</li><li>“pizza” and “burgers” are treated as just as different as “pizza” and “car”, because every pair of one-hot vectors is equally far apart.</li><li>No semantic relationship between words is captured.</li></ul><h3>What Came After One Hot Encoding?</h3><p>To address these limitations, more advanced techniques were introduced:</p><ul><li>Word2Vec</li><li>GloVe</li><li>FastText</li><li>Word Embeddings</li></ul><p>These methods capture semantic meaning and relationships between words.</p>]]></content:encoded>
        </item>
    </channel>
</rss>