<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Dhaval Taunk on Medium]]></title>
        <description><![CDATA[Stories by Dhaval Taunk on Medium]]></description>
        <link>https://medium.com/@taunkdhaval08?source=rss-a8e9132e2784------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*8XaMg4MXuvkqy7HU4Rm51g.jpeg</url>
            <title>Stories by Dhaval Taunk on Medium</title>
            <link>https://medium.com/@taunkdhaval08?source=rss-a8e9132e2784------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 15:51:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@taunkdhaval08/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Message Passing in Graphs]]></title>
            <link>https://medium.com/analytics-vidhya/message-passing-in-graphs-0e0a7b12b52a?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/0e0a7b12b52a</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 04 Jul 2024 18:20:02 GMT</pubDate>
            <atom:updated>2024-07-04T18:20:02.077Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dFYvNVPeUAQ8p2gICh6SGQ.png" /></figure><p>In this blog post, I will discuss the message passing algorithm, which serves as the backbone for graph neural network algorithms. This algorithm facilitates information processing through the nodes of the graph, enabling the network to learn various graph attributes. Consequently, we can perform tasks such as node classification and link prediction. Let’s dive in!</p><h3>Message Passing</h3><p>Let’s explore message passing through the lens of a node classification task. So, what exactly is node classification?</p><blockquote><strong>Node Classification</strong>: Given a network with labels on some nodes, how do we assign labels to all other nodes in the network?</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3vIt7bET8FUJJcxKYSP6GA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Example</strong>: In a network where some nodes are identified as fraudsters and others as fully trusted, the challenge is to identify additional fraudsters and trustworthy nodes based on their interactions and behaviors within the network?</p><blockquote>Node classification typically falls under the category of semi-supervised learning. It leverages both labeled and unlabeled nodes within a graph to predict the labels of unlabeled nodes.</blockquote><p>First let’s discuss the intuition behind message passing framework.</p><p>Message passing in graphs entails nodes exchanging information (messages) with their neighbors iteratively. This process updates each node’s state based on aggregated information, enabling nodes to integrate knowledge from their local neighborhoods. This correlation in graph networks supports tasks such as node classification or link prediction.</p><p>Simply put, nodes that are similar are connected in some manner. Node classification is addressed using a technique called collective classification, where labels are assigned to all nodes in a network simultaneously.</p><p>Now let’s discuss different techniques of message passing algorithms.</p><p>We will be discussing majorly 3 different message passing techniques.</p><ol><li>Relational classification</li><li>Iterative classification</li><li>Belief propagation</li></ol><p>We will be discussing them in detail one by one. 
But before that, let’s discuss some important things that we will be using throughout this blog.</p><ul><li>Individual behaviors are <strong>correlated</strong> in the network</li><li><strong>Correlation:</strong> nearby nodes have the same color (belonging to the same class)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/1*C489e2SL8u6Plym3UuMNog.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>There are mainly two types of dependencies that lead to correlation:</p><ol><li>Homophily</li><li>Influence</li></ol><p>Let’s discuss both of them.</p><ol><li><strong>HomoPhily</strong>: Homophily refers to the tendency of individuals to associate and bond with others who are similar to themselves in characteristics such as beliefs, interests, or demographics.<br> - <strong>Example 1</strong>: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)<br> - <strong>Example 2</strong>: Online social network<br> — Nodes = people<br> — Edges = friendship<br> — Node color = interests (sports, arts, etc.)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/1*HQ-zx3VOPRDL6YC5E8_drA.png" /><figcaption>People with the same interest are more closely connected due to homophily — Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>2. Influence</strong>: Social connections can influence the individual characteristics of a person.<br>- <strong>Example</strong>: I recommend my musical preferences to my friends, until one of them grows to like my same favorite genres!</p><p>How do we leverage this correlation observed in networks to help predict node labels?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M7bgusi-UujO1hdzF3O_YQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Intuition: </strong>Similar nodes are typically close together or directly connected in the network</p><p><strong>Approach:</strong> <br> — <strong>Guilt-by-association: </strong>If I am connected to a node with label 𝑋, then I am likely to have label 𝑋 as well.</p><p><strong>Example</strong>: <br> —<strong> Malicious/benign web page: </strong>Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines.</p><p>Classification label of a node <strong>𝑣</strong> in network may depend on:</p><ul><li>Features of <strong>𝑣</strong></li><li>Labels of the nodes in <strong>𝑣</strong>’s neighborhood</li><li>Features of the nodes in <strong>𝑣</strong>’s neighborhood</li></ul><p>So, how to find label? 
Let’s understand the task first.</p><p><strong>Given</strong>:<br> — Graph<br> — Few labeled nodes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/674/1*XMQ3yUHoARwZlIvJ33eIKw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Find:</strong> class (red/green) of remaining nodes</p><p><strong>Main assumption</strong>: There is homophily in the network</p><p>Let’s try to solve it using semi-supervised learning approach.</p><p>Example task from above graph:</p><ol><li>Let 𝑨 be a 𝑛×𝑛 adjacency matrix over 𝑛 nodes</li><li>Let Y = {0, 1}ⁿ be a vector of labels:<br>- Y<em>ᵥ</em> = 1 belongs to Class 1<br>- Y<em>ᵥ</em> = 0 belongs to Class 0<br>- There are unlabeled node needs to be classified</li><li><strong>Goal</strong>: Predict which unlabeled nodes are likely <strong>Class 1</strong>, and which are likely <strong>Class 0</strong></li></ol><p>As mentioned earlier, this problem is solved by <strong>collective classification. </strong>Let’s learn how to perform collective classification.</p><h4>Collective Classification</h4><ol><li><strong>Intuition:</strong> Simultaneous classification of interlinked nodes using correlations.</li><li>Probabilistic framework</li><li><strong>Markov Assumption:</strong> The label 𝑌<em>ᵥ</em> of one node <em>v</em> depends on the labels of its neighbors 𝑁<em>ᵥ</em></li></ol><blockquote>P(Yᵥ) = P(Yᵥ | Nᵥ)</blockquote><p>Collective classification involves 3 steps:</p><ol><li><strong>Local Classifier: </strong>Used for initial label assignment<br>- Predicts label based on node attributes/features<br>- Standard classification task<br>- Does not use network information</li><li><strong>Relational Classifier: </strong>Capture correlations<br>- Learns a classifier to label one node based on the labels and/or attributes of its neighbors<br>- This is where network information is used</li><li><strong>Collective Inference: </strong>Propagate the correlation<br>- Apply relational classifier to each node iteratively<br>- Iterate until the inconsistency between neighboring labels is minimized<br>- Network structure affects the final prediction</li></ol><p>Till now I have covered all the basic concepts required to understand the <strong>message passing</strong> algorithm. 
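To make these three steps more concrete, here is a rough, library-agnostic sketch of the collective classification loop. The feature dictionary <em>X</em>, the label dictionary <em>y_known</em>, and the simple majority vote standing in for the relational classifier are illustrative assumptions, not part of any particular framework.</p><pre>import networkx as nx<br>from sklearn.linear_model import LogisticRegression<br><br>def collective_classification(G, X, y_known, n_iters=10):<br>    # Step 1: local classifier - initial labels from node features alone<br>    local_clf = LogisticRegression().fit([X[v] for v in y_known], list(y_known.values()))<br>    labels = {}<br>    for v in G.nodes():<br>        labels[v] = y_known[v] if v in y_known else local_clf.predict([X[v]])[0]<br>    # Steps 2 and 3: relational classifier + collective inference<br>    for _ in range(n_iters):<br>        for v in G.nodes():<br>            if v in y_known:<br>                continue<br>            nbr = [labels[u] for u in G.neighbors(v)]<br>            # a majority vote over neighbor labels plays the role of the relational classifier<br>            if nbr:<br>                labels[v] = max(set(nbr), key=nbr.count)<br>    return labels</pre><p>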
Before going back to 3 message passing algorithms I mentioned above, let’s first define a problem statement that we will take as example to understand message passing.</p><p><strong>Problem Statement</strong></p><ol><li>How to predict the labels 𝑌<em>ᵥ</em> for the unlabeled nodes <em>v</em>(in grey color)?</li><li>Each node <em>v </em>has a feature vector 𝑓<em>ᵥ</em></li><li>Labels for some nodes are given (1 for green, 0 for red)</li><li><strong>Task</strong>: Find 𝑃(𝑌<em>ᵥ</em>) given all features and the network</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1Vd7NAAUTpaKnXGJpbg45A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ol><li>We focus on semi-supervised node classification.</li><li>Intuition is based on <strong>homophily</strong>: Similar nodes are typically close together or directly connected.</li></ol><h4>Relational Classification</h4><p><strong>Basic idea</strong>: Class probability 𝑌<em>ᵥ</em> of node 𝑣 is a weighted average of class probabilities of its neighbors.</p><p><strong>Steps:</strong></p><ul><li>For labeled nodes 𝑣, initialize label 𝑌<em>ᵥ</em> with ground-truth label 𝑌<em>ᵥ*</em></li><li>For unlabeled nodes, initialize 𝑌<em>ᵥ</em> = 0.5</li><li>Update all nodes in a random order until convergence or until maximum number of iterations is reached</li><li>Update for each node 𝑣 and label 𝑐 (e.g. 0 or 1)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FouG0JCgahbt0WweW1WNCQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>If edges have strength/weight information, 𝐴<em>ᵥ,ᵤ</em> can be the edge weight between <em>v </em>and <em>u</em></li><li>𝑃(𝑌<em>ᵥ</em> = 𝑐) is the probability of node <em>v </em>having label <em>c</em></li></ul><p><strong>Challenges:</strong></p><ul><li>Convergence is not guaranteed</li><li>Model cannot use node feature information</li></ul><p><strong>Example:</strong></p><p><strong>Initialization:</strong></p><ul><li>All labeled nodes with their labels</li><li>All unlabeled nodes 0.5 (belonging to class 1 with probability 0.5)</li><li>Let 𝑃ᵧ₁ = 𝑃(𝑌₁ = 1)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tug6-vjRUGlRjAZkkTvtdg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Update for the 1st Iteration:</strong></p><ul><li>For node 3, 𝑁₃ = {1, 2, 4}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-NmkVJMFFpjnpDFfyG69Mg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>For node 4, 𝑁₄ = {1, 3, 5, 6}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lnQ7Nry6N2ik0Do705A_9g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>For node 5, 𝑁₅ = {4, 6, 7, 8}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eHXevYT_-z_ktRoAw_roxg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 1 (a round of updates for all 
unlabeled nodes)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q0fR5QEhzeuZlofKabt_8w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 2</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fXAAwUbReXhsBCKOlCETnw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 3</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AEe3_K-JmuLUiHYxxvSyKA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o23RPCx4FYsIwlXFRyQeXg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>All scores stabilize after 4 iterations. We therefore predict:<br>- Nodes 4, 5, 8, 9 belong to class 1 (𝑃<em>ᵧᵥ</em> &gt; 0.5)<br>- Nodes 3 belong to class 0 (𝑃<em>ᵧᵥ</em> &lt; 0.5)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Kjqq8bAYw4ECLgHqA0OEeA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4>Iterative classification</h4><p>Relational classifiers do not use node attributes. How can one leverage them? Main idea of iterative classification: Classify node v based on its attributes 𝒇<strong><em>ᵥ</em></strong> as well as labels <em>𝒛</em><strong><em>ᵥ</em></strong> of neighbor set 𝑵<strong><em>ᵥ.</em></strong></p><ul><li>Input: Graph<br>- <em>f</em><strong><em>ᵥ</em></strong> : feature vector for node <em>v</em><br>- Some nodes 𝑣 are labeled with 𝑌<strong><em>ᵥ</em></strong></li><li>Task: Predict label of unlabeled nodes</li><li>Approach: Train two classifiers:</li><li>𝜙%(𝑓<strong><em>ᵥ</em></strong>) = Predict node label based on node feature vector 𝑓<strong><em>ᵥ</em></strong></li><li>𝜙’ (𝑓<strong><em>ᵥ</em></strong>, 𝑧<strong><em>ᵥ)</em></strong> = Predict label based on node feature vector 𝑓<strong><em>ᵥ</em></strong> and summary 𝑧<strong><em>ᵥ</em></strong> of labels of <em>v</em>’s neighbors.</li></ul><p>Now the question comes about computing summary. 
How do we compute the summary <strong>𝒛<em>ᵥ</em></strong> of labels of <strong><em>v</em>’s</strong> neighbors 𝑵<strong><em>ᵥ</em></strong>?</p><ul><li>Idea: <strong>𝒛<em>ᵥ</em> = vector<br>- </strong>Histogram of the number (or fraction) of each label in 𝑁<strong><em>ᵥ</em></strong><br>- Most common label in 𝑁<strong><em>ᵥ</em></strong><br>- Number of different labels in 𝑁<strong><em>ᵥ</em></strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PaY6BlNE2DcJvfvrOlDJKg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Now, let’s discuss the overall process of training iterative classifiers.</p><ol><li><strong>Phase 1</strong>: Classify based on node attributes alone<br>- On a training set, train classifier (e.g., linear classifier, neural networks, …):<br>- 𝜙₁(𝑓<strong><em>ᵥ</em></strong>) to predict 𝑌<strong><em>ᵥ</em></strong> based on 𝑓<strong><em>ᵥ</em></strong><br>- 𝜙₂<em>(f</em><strong><em>ᵥ</em></strong><em>, 𝑧</em><strong><em>ᵥ)</em></strong> to predict 𝑌<strong><em>ᵥ</em></strong> based on 𝑓<strong><em>ᵥ</em></strong> and summary 𝑧<strong><em>ᵥ</em></strong> of labels of <em>v</em>’s neighbors</li><li>Phase 2: Iterate till convergence<br>- On test set, set labels 𝑌<strong><em>ᵥ</em></strong> based on the classifier 𝜙₁, compute 𝑧<strong><em>ᵥ</em></strong> and predict the labels with 𝜙₂<br>- Repeat for each node 𝑣<br>- Update 𝑧<strong><em>ᵥ</em></strong> based on 𝑌<strong><em>ᵥ</em></strong> for all 𝑢 ∈ 𝑁<strong><em>ᵥ</em></strong><br>- Update 𝑌<strong><em>ᵥ</em></strong> based on the new 𝑧<strong><em>ᵥ</em></strong> (𝜙₂)<br>- Iterate until class labels stabilize or max number of iterations is reached<br>- Note: Convergence is not guaranteed</li></ol><p>Let’s understand the above process with an example. We will take example of web-page classification for this.</p><ul><li><strong>Input:</strong> Graph of web pages</li><li><strong>Node:</strong> Web page</li><li><strong>Edge:</strong> Hyper-link between web pages</li><li><strong>Directed edge:</strong> a page points to another page</li><li><strong>Node features:</strong> Webpage description<br>- For simplicity, we only consider 2 binary features</li><li><strong>Task:</strong> Predict the topic of the webpage</li></ul><p><strong>Steps:</strong></p><ul><li><strong>Baseline</strong>: train a classifier (e.g., linear classifier) to classify pages based on binary node attributes</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CNgC8q1xUnow8_VSWO6WqQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Each node maintains vectors 𝒛𝒗 of neighborhood labels:<br> - 𝐼 = Incoming neighbor label information vector<br> - 𝑂 = Outgoing neighbor label information vector</li><li><em>I₀ </em>= 1 if at least one of the incoming pages is labelled 0. 
Similar definitions for <em>𝐼₁, 𝑂₀</em>, and <em>𝑂₁</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dY3m65BkmHbMOWKK6R4_-Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>On a different training set, train two classifiers:<br> - Node attribute vector only (green circles): 𝜙<em>₁</em><br> - Node attribute and link vectors (red circles): 𝜙₂</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_0m8a1FCj6Uj_WD8wpxCJA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>On the test set:<br> - Use trained node feature vector classifier 𝜙<em>₁ </em>to set 𝑌<strong><em>ᵥ</em></strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vPRKqgsaInwUQvw8Vah5Ww.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N6rLpzcIruY0ljLGCNo1Cg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Update 𝑧<strong><em>ᵥ</em></strong> for all nodes</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PF407pdjofZLVW4PJRQXAA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Re-classify all nodes with 𝜙₂</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zwNbtCbOKFLpQLQuLy1Bag.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Continue until convergence<br> - Update 𝑧<strong><em>ᵥ</em></strong><br> - Update 𝑌<strong><em>ᵥ </em></strong>= 𝜙₂(𝑓<strong><em>ᵥ</em></strong>, 𝑧<strong><em>ᵥ</em></strong>)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m39yw_a2IO6dNBTFGvpEXA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Stop iteration<br> - After convergence or when maximum iterations are reached</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x3I219PvNxUe2tyxSVMS9Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>And that’s how you perform iterative classification approach. Now let’s discuss the last approach known as <strong>Belief Propagation.</strong></p><h4><strong>Belief Propagation</strong></h4><p>Belief Propagation is a dynamic programming approach to answering probability queries in a graph (e.g. probability of node <em>v </em>belonging to class 1). Iterative process in which neighbor nodes “talk” to each other, passing messages. 
When consensus is reached, calculate final belief.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VmHSx3LxH06vuQJg4PDJnw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Introduction:</strong></p><ul><li>Task: Count the number of nodes in a graph*</li><li>Condition: Each node can only interact (pass message) with its neighbors</li></ul><p><strong>Example: </strong>Path Graph</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-9XVavD8EkMYbd681WEUWw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Algorithm:</strong></p><ul><li>Define an ordering of nodes (that results in a path)</li><li>Edge directions are according to order of nodes<br> - Edge direction defines the order of message passing</li><li>For node 𝑖 from 1 to 6<br> - Compute the message from node 𝑖 to 𝑖 + 1 (number of nodes counted so far)<br> - Pass the message from node 𝑖 to 𝑖 + 1</li><li><strong>Condition</strong>: Each node can only interact (pass message) with its neighbors</li><li><strong>Solution</strong>: Each node listens to the message from its neighbor, updates it, and passes it forward <em>m</em>: the message</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ENzttCyIYsCuCrLKCsgKLQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Generalizing to a tree</strong></p><ul><li>We can perform message passing not only on a path graph, but also on a tree-structured graph.</li><li>Define order of message passing from leaves to root.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*I4dW3h2xQbflffNKfeeVvg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Update beliefs in tree structure</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pKCMWac4CGwQz3hqneo65A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hztUJLeC2L7vEp9ex0OzWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Question:</strong> What message will 𝑖 send to 𝑗?</p><ul><li>It depends on what 𝑖 hears from its neighbors</li><li>Each neighbor passes a message to 𝑖 its beliefs of the state of 𝑖</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*3M42j0lchGmj61elfPzjTQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Let’s first define some notations before going into the algorithm</p><ul><li><strong>Label-label potential matrix 𝝍</strong> : Dependency between a node and its neighbor. 
𝝍(𝑌ᵢ, 𝑌ⱼ) is proportional to the probability of a node 𝑗 being in class 𝑌ⱼ given that it has neighbor 𝑖 in class 𝑌ᵢ.</li><li><strong>Prior belief 𝝓</strong>: 𝜙(𝑌ᵢ) is proportional to the probability of node 𝑖 being in class 𝑌ᵢ.</li><li>𝑚ᵢ →ⱼ(𝑌ⱼ) is 𝑖’s message / estimate of 𝑗 being in class 𝑌ⱼ.</li><li>ℒ is the set of all classes/labels</li></ul><p><strong>Steps:</strong></p><ul><li>Initialize all messages to 1</li><li>Repeat for each node:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*5bvc7mtxBOQY9w2foukABA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*soU87mDN_7dx_EvBDjWxpg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>After convergence:</strong></p><ul><li>𝑏ᵢ(𝑌ᵢ) = node 𝑖’s belief of being in class 𝑌ᵢ</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vQ6k6-9yGGeSVdIyDBHFWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Let’s understand this with an example. We will include a graph with cycles as well. This process of using Belief Propagation in cycled graph is also called as <strong>Loopy Belief Propagation.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*qbNOJfolZAKJOT8GZlmIaQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>Messages from different subgraphs are no longer independent!</blockquote><blockquote>But we can still run BP, but it will pass messages in loops.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*PPLoxd6FgSIF7aJLpG1iFQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Beliefs may not converge<br>- Message 𝑚ᵤ → ᵢ (𝑌ᵢ) is based on initial belief of 𝑖, not a separate evidence for 𝑖<br>- The initial belief of 𝑖 (which could be incorrect) is reinforced by the cycle <em>i </em>→ 𝑗 → 𝑘 → 𝑢 → 𝑖</li><li>However, in practice, Loopy BP is still a good heuristic for complex graphs which contain many branches.</li></ul><p><strong>Challenges:</strong></p><ul><li><strong>Messages loop around and around:</strong> 2, 4, 8, 16, 32, … More and more convinced that these variables are T!</li><li>BP incorrectly treats this message as <strong>separate evidence</strong> that the variable<br>is T (true).</li><li>Multiplies these two messages as if they were <strong>independent</strong>.<br> - But they don’t actually come from independent parts of the graph.<br> - One influenced the other (via a cycle).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/1*WOiMbMxh3LUbgOFF2jzJ9A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li><strong>Advantages:<br> - </strong>Easy to program &amp; parallelize<br> - General: can apply to any graph model with any form of potentials<br> - Potential can be higher order: e.g. 
𝝍(𝑌ᵢ, 𝑌ⱼ, 𝑌ₖ, 𝑌ᵥ … )</li><li><strong>Issues</strong>:<br> - Convergence is not guaranteed (when to stop), especially if there are many closed loops</li><li><strong>Potential functions (parameters):</strong><br> - Require training to estimate</li></ul><p>That’s all for the message passing algorithm. Next, I will discuss graph neural networks in an upcoming blog post. Stay tuned for that!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0e0a7b12b52a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/message-passing-in-graphs-0e0a7b12b52a">Message Passing in Graphs</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Representational Learning: Creating node and graph embeddings — Part 2]]></title>
            <link>https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-part-2-33817c5ce7f3?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/33817c5ce7f3</guid>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Mon, 01 Jul 2024 15:15:04 GMT</pubDate>
            <atom:updated>2024-07-01T15:15:04.595Z</atom:updated>
            <content:encoded><![CDATA[<h3>Graph Representational Learning: Creating node and graph embeddings — Part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UeJu9Rg0wvdFBzJ1.jpeg" /></figure><p>In the previous blog post, I covered various techniques for node-level and graph-level embeddings, explaining their intuition and training methods. Now, we’ll delve into coding some of these techniques in Python. Let’s begin!</p><h3>1. Node2Vec</h3><p>We’ll begin with Node2Vec. We’ll use NetworkX to create a random graph, then train the <strong>Node2Vec</strong> algorithm, which generates random walks over the graph and feeds them to <strong>Word2Vec</strong> from the <strong>gensim</strong> package.</p><ol><li>Install the necessary packages</li></ol><pre>pip install networkx node2vec</pre><p>2. Next, we create the input graph using the NetworkX package.</p><pre>import networkx as nx<br><br>G = nx.fast_gnp_random_graph(n=100, p=0.5)</pre><p>The above code creates a graph with 100 nodes. The parameter <em>p </em>defines the probability of two nodes being connected to each other. Therefore, this graph won’t contain every possible edge. Instead, it should have approximately half the edges of a complete graph on 100 nodes. You can adjust the <em>p</em> value as desired.</p><p>3. Now, we will initialize a Node2Vec class that takes the generated graph as input and generates random walks over the graph.</p><pre>from node2vec import Node2Vec<br><br>node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)</pre><p>In the above code, you can see that I am creating random walks of length 30, with 200 walks starting from each node. The embedding size is set to 64 in this case. Feel free to adjust these parameters according to your specific use-case.</p><p>4. Next, let’s fit the created graph and train the model on it.</p><pre>model = node2vec.fit(window=10, min_count=1, batch_words=4)</pre><p>5. The next step will involve saving the trained model in order to extract embeddings from it.</p><pre>model.wv.save_word2vec_format(&quot;embeddings_node2vec.txt&quot;)</pre><p>6. To extract embeddings from the model, you can use the following code:</p><pre>embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}</pre><p>Now feel free to experiment with it as you like.</p><h3>DeepWalk</h3><p>The second algorithm we’ll discuss is DeepWalk. The coding approach remains mostly the same as Node2Vec. The difference lies in the walking strategy: DeepWalk uses plain, unbiased random walks, whereas Node2Vec uses the biased random walks discussed in the previous post.</p><ol><li>Installing packages</li></ol><pre>pip install networkx karateclub</pre><p>2. Importing required packages</p><pre>from karateclub import DeepWalk<br>import networkx as nx</pre><p>3. Creating a graph</p><pre>G = nx.fast_gnp_random_graph(n=100, p=0.5)</pre><p>4. Initialize the DeepWalk class</p><pre>model = DeepWalk(dimensions=64, walk_length=30, walk_number=200, workers=4)</pre><p>5. Fitting the model</p><pre>model.fit(G)</pre><p>6. Get the embeddings</p><pre>embeddings = model.get_embedding()</pre><p>So that’s how you create DeepWalk embeddings. Feel free to experiment with this approach.</p><h3>Graph2Vec</h3><p>The last algorithm I’m going to discuss is Graph2Vec. It differs slightly from the previous two algorithms because it creates graph-level embeddings instead of node-level embeddings.</p><ol><li>Installing required packages</li></ol><pre>pip install karateclub networkx</pre><p>2.
Importing the packages</p><pre>import networkx as nx<br>from karateclub import Graph2Vec<br>import os<br><br>os.makedirs(&#39;graphs&#39;, exist_ok=True)</pre><p>3. Creating the graph</p><pre>for i in range(5):<br>    G = nx.fast_gnp_random_graph(n=10 + i, p=0.5)<br>    nx.write_gml(G, f&#39;graphs/graph_{i}.gml&#39;)</pre><p>4. Creating a list of graphs for training</p><pre>graphs = []<br>for i in range(5):<br>    G = nx.read_gml(f&#39;graphs/graph_{i}.gml&#39;)<br>    graphs.append(G)</pre><p>5. Fitting the graph using Graph2Vec algorithm</p><pre>model = Graph2Vec(dimensions=64, wl_iterations=2, attributed=False)<br>model.fit(graphs)</pre><p>6. Extracting the embeddings</p><pre>embeddings = model.get_embedding()<br><br>for idx, embedding in enumerate(embeddings):<br>    print(f&#39;Embedding for graph_{idx}: {embedding}&#39;)</pre><p>So that’s how you learn Graph2Vec embeddings.</p><p>For now, that’s it from my side. I hope you enjoyed this blog. Stay tuned for more important topics in the next one.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=33817c5ce7f3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-part-2-33817c5ce7f3">Graph Representational Learning: Creating node and graph embeddings — Part 2</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Representational Learning: Creating node and graph embeddings]]></title>
            <link>https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-25915b5a5aca?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/25915b5a5aca</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-representation]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 20 Jun 2024 17:08:34 GMT</pubDate>
            <atom:updated>2024-06-20T17:09:09.263Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AaHGu6bc6rpmQ5fjfnLrZQ.jpeg" /></figure><p>In the previous blog post, we covered traditional methods for extracting features from input graphs. In this post, we’ll delve into various approaches for generating node and graph-level embeddings. This includes techniques such as DeepWalk and Node2Vec for node embeddings, as well as Anonymous Walk for graph-level embeddings. Let’s dive in!</p><h3>Node Embedding</h3><p>The primary challenge with traditional hand-crafted features lies in their time-consuming creation process and their task-specific nature. Consequently, there is a pressing need for a more efficient method to encode input features. What are our next steps?</p><p>In natural language processing, concepts like learned embeddings such as Word2Vec are well-established. We can apply a similar approach here. Let’s delve into this further.</p><h4>1. Encoder — Decoder Approach</h4><p>The primary objective of this approach is to generate node-level embeddings in such a way that nodes with similar characteristics are represented by embeddings that are closer to each other, while embeddings of dissimilar nodes are farther apart.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pGoOf14dFKSpUZUSyx8JCA.png" /><figcaption>Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014</figcaption></figure><p>Let’s explore the process to achieve this. Assume we have a graph G with:</p><ul><li>V as the vertex set,</li><li>A as the adjacency matrix.</li></ul><p>The objective is to encode nodes in such a way that their similarity in the embedding space (e.g., measured by dot product) reflects their similarity in the graph structure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OqFJb1SzHnNDezRILq4nJw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Process:</p><ol><li>The encoder ENC maps nodes to embeddings.</li><li>Define a node similarity function, which measures similarity in the original network.</li><li>The decoder DEC​ maps embeddings to similarity scores.</li><li>Optimize the parameters of the encoder to ensure embeddings reflect the node similarities captured by the decoder.</li></ol><blockquote><em>similarity</em>(u, v) ≈ <em>zᵤᵀzᵥ</em></blockquote><p>Let’s clarify the roles of the encoder and the similarity function mentioned earlier:</p><p><strong>Encoder</strong>: The encoder’s role is to transform each node into a low-dimensional vector representation, known as an embedding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/966/1*rKRwsgD_v2_G2u97QnQB6Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Similarity Function: Specifies how the relationships in the vector space correspond to those in the original network. 
It plays a key role in the decoder section of the approach.</p><p>As for initializing these embeddings, a common approach is:</p><p><strong>Shallow Encodings:</strong> A straightforward approach is to treat the encoder as an embedding lookup.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-9V-rCOYdzneKBXR-CjhiQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VFBT8FDeyH_Aa54rv5lBqQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Each node is assigned a unique embedding vector, meaning we directly optimize the embedding for each individual node.</p><h4>2. Random Walk Approach</h4><p>Given a graph and a starting point, we randomly select a neighbor of the starting point and move to this neighbor. We continue this process by selecting random neighbors of the current point and moving accordingly. This sequence of points visited in this manner constitutes a random walk on the graph.</p><blockquote><em>zᵤᵀzᵥ </em>≈ probability that u and v co-occur on a random walk over the graph</blockquote><p><strong>Steps:</strong></p><ul><li>Estimate the probability of visiting node <strong><em>v</em></strong><em> </em>on a random walk starting from node <strong><em>u </em></strong>using a specific random walk strategy <strong>R</strong>.</li><li>Optimize embeddings to encode these statistics derived from random walks.</li></ul><p>We will explore how to optimize the embeddings. Before that, let’s delve into why we employ the Random Walk approach.</p><ol><li><strong>Expressivity:</strong> This method offers a flexible stochastic definition of node similarity, integrating both local and higher-order neighborhood details. The idea is that if a random walk originating from node 𝒖 frequently visits 𝒗, then 𝒖 and 𝒗 are deemed similar, leveraging high-order multi-hop information.</li><li><strong>Efficiency:</strong> By focusing solely on pairs of nodes that co-occur during random walks, we eliminate the need to consider all possible node pairs during training.</li></ol><p>Next, let’s explore the process of learning the embeddings. 
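Since everything that follows is driven by sampled walks, here is a minimal sketch of how uniform, fixed-length random walks can be generated with NetworkX; the helper name and the choice of the karate club graph are purely illustrative.</p><pre>import random<br>import networkx as nx<br><br>def random_walk(G, start, walk_length=10):<br>    # uniform random walk: at every step, move to a uniformly chosen neighbor<br>    walk = [start]<br>    for _ in range(walk_length - 1):<br>        neighbors = list(G.neighbors(walk[-1]))<br>        if not neighbors:  # dead end (possible in directed graphs)<br>            break<br>        walk.append(random.choice(neighbors))<br>    return walk<br><br>G = nx.karate_club_graph()<br>walks = [random_walk(G, node) for node in G.nodes() for _ in range(5)]</pre><p>These sampled walks define the random walk neighborhoods used in the objective below.</p><p>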
Clearly, from its description, this process is an unsupervised feature learning procedure.</p><ol><li><strong>Intuition:</strong> The goal is to find embeddings of nodes in a d-dimensional space that preserve their similarities.</li><li><strong>Idea:</strong> The objective is to learn node embeddings such that nodes that are close in the network are also close in the embedding space.</li><li>For a given node 𝑢, the notion of nearby nodes is defined by 𝑁ᵣ(u), the neighborhood of 𝑢 obtained through some random walk strategy 𝑟.</li><li>Given a graph 𝐺 = (𝑉, 𝐸), our aim is to learn a mapping 𝑓: 𝑢 → ℝᴰ, where 𝑓₍ᵤ₎ = 𝐳ᵤ, representing the embedding vector of node 𝑢.</li></ol><p>The objective function is structured as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eZe9CuzL3c7uFWmVe98AQA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Given node 𝑢, our aim is to learn feature representations that predict nodes within its random walk neighborhood 𝑁ᵣ(u).</p><p><strong>Optimization Steps:</strong></p><ol><li>Conduct short fixed-length random walks starting from each node 𝑢 in the graph using a specified random walk strategy R.</li><li>For each node <strong>𝑢,</strong> gather <strong>𝑁ᵣ(𝑢)</strong>, which represents the multiset* of nodes visited during random walks initiated from <strong>𝑢</strong>.</li><li>Optimize embeddings based on the following principle: Given a node <strong>𝑢</strong>, predict its neighbors <strong>𝑁ᵣ(u)</strong>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uJMf0pZYxF_lDQbhtdK4Fw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>which is equivalent to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Dh5JvcPVQp9geInZxdyicQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Intuition: We optimize embeddings 𝑧ᵤ to maximize the likelihood of co-occurrences in random walks.</p><p>We parameterize 𝑃(𝑣|𝐳<strong><em>ᵤ</em></strong>) using softmax.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xXZps3_DinVNQLNkYBp34Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Why softmax? We aim for node 𝑣 to be most similar to node 𝑢 among all nodes 𝑛.</p><p>Intuition: ∑ᵢ exp(xᵢ) ≈ maxᵢ exp(𝑥ᵢ)</p><p>Now, let’s combine all these insights. The overall equation looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RolMieOVIFS9XGkvEBFQcw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Therefore, we can conclude that:</p><blockquote>Optimizing random walk embeddings involves finding embeddings 𝐳ᵤ that minimize L.</blockquote><p>However, when dealing with graphs of immense size, optimizing with softmax, which includes a costly normalizing term, becomes impractical. So, what is the solution for this?</p><p>This is where Negative Sampling comes into play. 
First, let’s explain what negative sampling is, and then we’ll explore how to apply it for our purposes.</p><p><strong>Negative Sampling</strong>: Negative sampling is a method in machine learning, often used in neural networks like word embeddings, where instead of learning from all possible examples, the model trains on a subset of “negative” examples alongside positive ones to improve efficiency and performance.</p><p>Now, let’s discuss, how to use it in our case.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aF08BDwdCf6qrS0gWt_H-w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Sample 𝑘 negative nodes, each selected with a probability proportional to its degree. There are two considerations regarding 𝑘 (# of negative samples):</p><ul><li>A higher 𝑘 provides more robust estimates.</li><li>However, a higher 𝑘 also introduces a higher bias towards negative events. In practice, 𝑘 typically ranges from 5 to 20.</li></ul><p>After formulating the objective function, the next step is to optimize (minimize) it. This is usually accomplished using iterative optimization methods such as stochastic gradient descent (SGD) or its variants, which adjust the embeddings to gradually reduce the objective function’s value until convergence.</p><p>Gradient Descent: a straightforward method to minimize ℒ:</p><ol><li>Initialize 𝑧ᵢ to random values for all 𝑖.</li><li>Iterate until convergence.</li></ol><p>a. Compute the derivative ∂ℒ/∂zᵢ for all 𝑖.</p><p>b. Update each 𝑧ᵢ by taking a step in the direction of its derivative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/370/1*R9fnPxW9qKk1TsxusdRQ8w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>That’s the essence of how we perform random walks over graphs to learn embeddings for them. The DeepWalk concept is built upon this intuition. Let’s delve into it a bit.</p><h4>3. DeepWalk</h4><p>DeepWalk is a method for learning representations of nodes in a graph by leveraging techniques from natural language processing. It uses the above explained random walk technique to generate sequences of nodes in the graph and then applies a Skip-gram model (similar to word2vec) to learn embeddings that capture the structural context of nodes.</p><p>By treating nodes as words and sequences of nodes as sentences, DeepWalk learns distributed representations that encode the graph’s topology and connectivity patterns. These embeddings can be used for various tasks such as node classification, link prediction, and community detection in complex networks.</p><h4>4. Node2Vec</h4><p>Node2vec draws inspiration from the Random Walk approach. The key distinction lies in its use of biased walks rather than random walks to generate and learn embeddings. This approach aims to employ flexible biased random walks that balance between local and global perspectives of the network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-mtVxBRN3F9RaJhaNKXdsw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>In the image above, two strategies are employed to calculate walks of length 3. The first strategy, Breadth First Search (BFS), provides a local microscopic view by traversing neighboring nodes. 
The second strategy, Depth First Search (DFS), offers a global macroscopic view by exploring distant nodes within the graph.</p><p>A biased fixed-length random walk 𝑹, given a node 𝒖, generates a neighborhood 𝑁ᵣ(𝑢). Two key parameters are involved:</p><ul><li>Return parameter 𝒑: Determines the likelihood of returning to the previous node.</li><li>In-out parameter 𝒒: Balances moving outward, away from the previous node (DFS-like behavior), versus staying close to it (BFS-like behavior). Intuitively, 𝒒 represents the “ratio” of BFS to DFS behaviors.</li></ul><p>Biased 2nd-order random walks explore network neighborhoods in the following manner:</p><ol><li>After traversing the edge (𝑠₁, 𝑤), the random walk is now at node 𝑤.</li><li>Insight: The neighbors of 𝑤 can only be:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6etr_nRzVpDdrt2CSCJy6g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>Make sure to remember the origin of the walk.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TXdYkpgM45G0OHZwYTUgKg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ol><li>𝑝, 𝑞 model transition probabilities<br>- 𝑝 … return parameter<br>- 𝑞 … “walk away” parameter</li></ol><p>The walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yS0n7pNiTClWryDJT4mKgA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>BFS-like walk: Low value of 𝑝</li><li>DFS-like walk: Low value of 𝑞</li></ul><p>𝑁ᵣ(𝑢) are the nodes visited by the biased walk.</p><p>Algorithm:</p><ol><li>Compute random walk probabilities.</li><li>Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢.</li><li>Optimize the node2vec objective using Stochastic Gradient Descent.</li></ol><p><strong>Advantages:</strong></p><ul><li>Linear-time complexity</li><li>All 3 steps are individually parallelizable</li></ul><blockquote><strong>Note</strong>: It has been observed in different works that node2vec performs better on node classification while alternative methods perform better on link prediction.</blockquote><h3>Graph Embeddings</h3><p>The first two approaches are fundamental, so I’ll provide a brief overview without diving into details. Then, we’ll discuss popular graph embedding algorithms such as Anonymous Walk Embeddings and Graph2Vec.</p><h4>1. First Approach:</h4><ul><li>Apply a standard node embedding technique to the (sub)graph 𝐺.</li><li>Then simply aggregate the node embeddings in the (sub)graph 𝐺 by summing or averaging them.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*XcSNvzTBIlbC-TR8Js6IbQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4>2. Second Approach</h4><ul><li>Introduce a “virtual node” to represent the (sub)graph and apply a standard node embedding technique to it.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iFPf8Csqyf8ZPgAEV89wBg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure>
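<p>As a quick illustration of the first approach (aggregating node embeddings), here is a rough sketch that averages per-node vectors into a single graph-level embedding. The random 64-dimensional vectors simply stand in for embeddings learned by DeepWalk, node2vec, or any other node embedding method.</p><pre>import numpy as np<br>import networkx as nx<br><br>def graph_embedding(G, node_embeddings):<br>    # Approach 1: aggregate (here, average) the node embeddings of the (sub)graph<br>    vectors = np.array([node_embeddings[v] for v in G.nodes()])<br>    return vectors.mean(axis=0)  # summing instead of averaging is the other common choice<br><br># toy usage with random vectors standing in for learned node embeddings<br>G = nx.karate_club_graph()<br>node_embeddings = {v: np.random.rand(64) for v in G.nodes()}<br>z_G = graph_embedding(G, node_embeddings)</pre>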
<h4>3. Anonymous Walk Embeddings</h4><p>States in anonymous walks correspond to the index of the first visit to each node during a random walk.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2wC9WZW8giDYQuUr83Tggw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5pEzaBR-rAUu7bnlS9ImlQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Steps:</strong></p><ul><li>Simulate anonymous walks of length 𝑙 and record their frequencies.</li><li>Represent the graph using a probability distribution based on these walks.</li></ul><p>For example:</p><ul><li>Let 𝑙 = 3</li><li>We represent the graph with a 5-dimensional vector corresponding to the 5 anonymous walks 𝑤₁, …, 𝑤₅ of length 3: 111, 112, 121, 122, 123.</li><li>𝒁𝓰[𝑖] = probability of anonymous walk 𝑤ᵢ in 𝐺</li></ul><p>Now, let’s discuss how to sample the anonymous walks:</p><ol><li>Generate a set of 𝑚 independent random walks.</li><li>Represent the graph using a probability distribution based on these walks.</li><li>How many random walks 𝑚 do we need?<br>- We require 𝑚 such that the distribution’s error is less than 𝜀 with a probability greater than 𝛿.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n8jwnkxe4g2sGAc7JM6NUg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>The previous approach represents each walk by its frequency of occurrence. What if, instead, we could learn embeddings zᵢ for each anonymous walk wᵢ?</p><ol><li>We aim to learn a graph embedding 𝒁𝓰 along with embeddings 𝒛ᵢ for all anonymous walks:<br>- 𝒁 = {𝒛ᵢ : 𝑖 = 1 ... 𝜂}, where 𝜂 is the number of sampled anonymous walks.</li><li>How do we embed walks?<br>- Embed walks so that the subsequent walk can be predicted.</li><li>We use a vector parameter 𝒁𝓰 for the input graph.<br>- The entire graph&#39;s embedding is to be learned.</li><li>Starting from node 1: Sample anonymous random walks. For example:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/1*KxmQj0WK-Nnu1tYjo4a4_g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>5. Learn to predict walks that co-occur within a Δ-size window (e.g., predict 𝑤₂ given 𝑤₁ and 𝑤₃ if Δ = 1).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iQGFb6BoWy1xtKS8KyPvCw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>6. Sum the objective across all nodes in the graph.</p><p>7. Conduct 𝑻 different random walks from 𝒖, each of length 𝒍.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/756/1*Vrhjkb_tO4N184EzzdtTUg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>8. Learn to predict walks that co-occur within a Δ-size window.</p><p>9. Estimate the embedding 𝑧ᵢ of the anonymous walk 𝑤ᵢ.
Let 𝜂 be the number of all possible walk embeddings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*euT-5bbkAVW_2Son3IvZWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>10. After optimization, we obtain the graph embedding 𝒛𝑮 (a learnable parameter).</p><p>11. Utilize 𝒁𝓰 to make predictions, such as graph classification: <br> — Option 1: Inner product kernel 𝒁ᵀ𝓰₁ 𝒁𝓰₂.<br> — Option 2: Employ a neural network that takes 𝒁𝓰 as input for classification.</p><h4>4. Graph2Vec</h4><p>Graph2Vec is a method for learning fixed-size embeddings (vector representations) of entire graphs, which encapsulate the structural and topological properties of the graphs. Unlike traditional node embeddings that represent individual nodes in isolation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/824/1*RU-aJI691kpQEAvWENuUdw.png" /><figcaption><a href="https://arxiv.org/pdf/1707.05005">https://arxiv.org/pdf/1707.05005</a></figcaption></figure><p><strong>Steps to calculate Graph2Vec embeddings:</strong></p><ol><li><strong>Graph Representation:<br>- Input:</strong> Start with a dataset consisting of multiple graphs G={G₁, G₂, …, Gₙ}<br>- Each graph Gᵢ is represented as a structured entity with nodes, edges, and optionally node attributes and edge weights.</li><li><strong>Feature Extraction:<br>- </strong>Extract meaningful features from each graph <em>Gᵢ</em>. This typically involves capturing global and local structural characteristics of the graph.<br>- Example features could include node degree distribution, graph centrality measures, subgraph statistics, or any other relevant graph properties.</li><li><strong>Subgraph Sampling (Neighborhood Sampling): </strong>For each graph <em>Gᵢ</em><br>- <strong>Random Walks or Neighborhood Sampling:</strong> Perform random walks or other neighborhood sampling techniques to extract subgraphs (neighborhoods) around each node within <em>Gᵢ</em>.<br>- <strong>Context Extraction:</strong> Collect these sampled subgraphs to use as local contexts for learning embeddings.</li><li><strong>Graph Embedding Learning:<br>- Embedding Model:</strong> Utilize a model inspired by Skip-gram (used in word2vec)<br>- <strong>Objective:</strong> Train the model to predict the local contexts (sampled subgraphs) given a central subgraph (target) within each graph <em>Gᵢ</em><br>- <strong>Loss Function:</strong> Define a loss function (typically cross-entropy or negative sampling based) to optimize the model parameters.</li><li><strong>Training:<br>- Optimization:</strong> Use stochastic gradient descent (SGD) or other optimization techniques to minimize the loss function.<br>- <strong>Iterations:</strong> Iterate over the dataset multiple times (epochs) to improve the quality of the embeddings.<br>- <strong>Hyperparameters:</strong> Adjust hyperparameters such as embedding dimensions, window size (for context sampling), learning rate, and number of iterations based on validation performance.</li><li><strong>Graph2Vec Embeddings:<br>- Output:</strong> After training, obtain fixed-size vector representations (embeddings) for each graph <em>Gᵢ</em>.<br>- Each embedding <em>vᵢ</em> captures the structural and topological properties of the corresponding graph <em>Gᵢ</em>.</li></ol><p>So, that it for graph representational learning. There are a couple of other algorithms which I haven’t discussed. Feel free to explore more or comment about them. 
<p>So, that&#39;s it for graph representational learning. There are a couple of other algorithms which I haven&#39;t discussed, such as LINE and NetMF. Feel free to explore them or comment about them.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=25915b5a5aca" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-25915b5a5aca">Graph Representational Learning: Creating node and graph embeddings</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Traditional ML for Graphs]]></title>
            <link>https://medium.com/analytics-vidhya/traditional-ml-for-graphs-ca7bdbef7544?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/ca7bdbef7544</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sat, 08 Jun 2024 07:18:12 GMT</pubDate>
            <atom:updated>2024-06-08T07:18:12.222Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U27D2bd_baOHstrVYiccBw.png" /></figure><p>In the last <a href="https://medium.com/@taunkdhaval08/graph-neural-networks-an-introduction-9be46691d67c"><em>blog post</em></a>, I briefly discussed different topics in the context of Graph Neural Networks. In this blog post, I will discuss the traditional feature extraction approaches that were popular in the past and were used along with classical ML algorithms to solve different graph-based problems. So let’s get started…</p><h3>Hand Crafted Features</h3><p>Traditional ML pipelines used hand-crafted features to perform the modelling. To achieve good performance, effective feature creation is a crucial step. Overall, we can divide the feature extraction process into 3 categories:</p><ol><li>Node Level</li><li>Edge/Link Level</li><li>Graph Level</li></ol><p>We will discuss different approaches for all three categories one by one. So tighten your seat belt and let’s begin…</p><h3>Node Level Features</h3><h4>1. Node Degree</h4><p>Node Degree refers to simply counting the number of edges of each node and treating that count as a feature.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0ymMPeTrGcbHt_Sw83ffXQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu</figcaption></figure><p>Although this approach seems pretty straightforward, it has a drawback: it does not take the importance of a node into account. Therefore, we need a better strategy that takes node importance into consideration as well.</p><h4>2. Node Centrality</h4><p>It is a feature extraction method which takes node centrality into account. There are different ways by which we can calculate node centrality; let’s discuss them one by one.</p><ol><li><strong>Eigenvector Centrality</strong>: The idea behind this is that a node is important if it is surrounded by important neighboring nodes. It is calculated as the sum of the centralities of the neighboring nodes.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*dHMEPID3eAVub2hAIb4qLg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>If we rewrite the above equation in matrix form, we can observe that eigenvector centrality is nothing but an eigenvector of the adjacency matrix. The leading eigenvector is used to calculate the centrality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h0_ZqvwlLJ5B0n2cqKSyEA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>2. <strong>Betweenness Centrality</strong>: According to betweenness centrality, a node is important if it lies on many shortest paths between other nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vN9T0ExgOLThWdNWuYtMTw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>3. 
<strong>Closeness Centrality:</strong> It says that a node is important if it has small shortest-path lengths to all other nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2uCtKoIsHs2Novnl6u53bQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4><strong>3. Clustering Coefficient</strong></h4><p>It is based on the idea of how well a node’s neighboring nodes are connected to each other. It can be calculated by the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i9GIh2iBL7NEM_sGkgrkcA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>The above network, where there is a central node surrounded by other nodes, is called an <strong>ego network.</strong></blockquote><h4><strong>4. Graphlet Degree Vector (GDV)</strong></h4><p>In GDV, we count the number of graphlets rooted at a given node. So the question comes, what is a graphlet? It is basically a <em>rooted connected non-isomorphic sub-graph</em> of a given graph. The image below shows the possible graphlets up to 5-node sub-graphs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nhVw019nJGLjKEPR1mbM6g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>So how to calculate the GDV? The image below shows how to calculate it, followed by its description:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FkBV6Fh-1RH3stREeQXwZg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>In the above graph, you can see how to calculate the GDV for the sub-graph rooted at <strong><em>v. </em></strong>We will calculate the 2- and 3-node graphlets for this sub-graph.</p><blockquote><strong>Important observation</strong></blockquote><blockquote>Node degree counts the number of edges that a node touches</blockquote><blockquote>Clustering Coefficient counts the number of triangles that a node touches</blockquote><blockquote>GDV counts the number of graphlets that a node touches</blockquote><p>Overall, the above methods can be divided into two categories:</p><pre>1. Importance-based Features<br>      - Node Degree<br>      - Node Centrality<br><br>2. Structure-based Features<br>      - Node Degree<br>      - Clustering Coefficient<br>      - Graphlet Degree Vector</pre><h3>Edge/Link Level Features</h3><h4>1. Distance based features</h4><ol><li><strong>Shortest Path distance between two nodes: </strong>In this approach, we simply calculate the shortest path distance between two nodes. However, this approach does not capture the degree of neighborhood overlap. Therefore, it is not a very good metric for edge features, although it is a good starting point.</li></ol><h4>2. Local Neighborhood Overlap</h4><ol><li><strong>Common Neighbors</strong>: Counts the number of common neighbors two nodes share.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/472/1*MEZo4hPixPbvDKxDwXWEQQ.png" /></figure><p>2.<strong> Jaccard’s Coefficient</strong>: Calculates the intersection over union of the neighbor sets of the two nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/338/1*SHlwC7htEYxsIN_Gn5GyyA.png" /></figure><p>3. 
<strong>Adamic-Adar Index</strong>: It sums, over the common neighbors of the two nodes, one over the log of each common neighbor’s degree.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/506/1*aa5p1euK0KXL3tz6o-DLlQ.png" /></figure><blockquote>The main issue with local neighborhood overlap is that the value will be <strong>zero </strong>if there are no common neighbors between two nodes.</blockquote><h4>3. Global Neighborhood Overlap</h4><p>To overcome the limitation of <em>local neighborhood overlap</em>, global neighborhood overlap comes into the picture.</p><ol><li>Katz Index: It counts the number of paths of all lengths between all pairs of nodes. To do so, it utilizes the adjacency matrix of the input graph. Let’s now discuss how to calculate it:</li></ol><blockquote><em>Let Pᵤᵥᵏ = #</em>paths of length k between u and v</blockquote><blockquote>===&gt; Pᵤᵥ¹ = #paths of length 1 between u and v, which is nothing but the adjacency matrix A of the input graph</blockquote><p>Now, how to compute #paths of length 2?</p><p><strong>Step 1</strong>: Compute #paths of length 1 between each of node u’s neighbors and v.</p><p><strong>Step 2</strong>: Sum up these #paths across u’s neighbors using the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wF4mHjR8WJDD96qkSKSb5Q.png" /></figure><p>You can see that the #paths of length 2 are nothing but the adjacency matrix multiplied by itself (A²). Therefore, in this way, you can calculate the #paths for any arbitrary length.</p><p>To finally calculate the Katz index, sum the #paths over all lengths between any two nodes <em>v1</em> and <em>v2</em>, each length discounted by a factor β, as shown in the figure below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*57jYfb1AJaJtHZAHMfpCgw.png" /></figure><p>Finally, to calculate the entire matrix, you can use the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jmF9DOq4RmVnKw3AuTP5AQ.png" /></figure><pre>Summary<br><br>1. Distance based features<br>      - Uses shortest path length b/w two nodes but does not capture<br>neighborhood overlaps.<br><br>2. Local Neighborhood Overlap<br>      - Captures number of common neighbors but can become zero if no <br>common neighbors.<br><br>3. Global Neighborhood Overlap<br>      - Uses entire graph structure to calculate Katz Index.</pre>
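<p>Most of the node- and edge-level features above can be computed in a few lines with NetworkX and NumPy. The snippet below is a quick illustrative sketch (not from the original post); the Katz matrix uses the closed form (I − βA)⁻¹ − I with a placeholder discount factor β.</p><pre>import networkx as nx<br>import numpy as np<br><br>G = nx.karate_club_graph()<br><br># node-level features<br>degree      = dict(G.degree())<br>eigenvector = nx.eigenvector_centrality(G)<br>betweenness = nx.betweenness_centrality(G)<br>closeness   = nx.closeness_centrality(G)<br>clustering  = nx.clustering(G)<br><br># edge/link-level features for a candidate node pair (u, v)<br>u, v = 0, 33<br>common  = len(list(nx.common_neighbors(G, u, v)))<br>jaccard = next(nx.jaccard_coefficient(G, [(u, v)]))[2]<br>adamic  = next(nx.adamic_adar_index(G, [(u, v)]))[2]<br><br># Katz index matrix: sum over all path lengths k of beta^k * A^k<br>A = nx.to_numpy_array(G)<br>beta = 0.05   # must be smaller than 1 / largest eigenvalue of A for the sum to converge<br>S = np.linalg.inv(np.eye(len(G)) - beta * A) - np.eye(len(G))<br>katz_uv = S[u, v]</pre>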
<h3>Graph Level Features</h3><p>For creating graph-level features, kernel methods are a popular approach. They are inspired by classical machine learning, where kernel methods are quite popular. We will discuss two popular methods, known as <strong><em>Graphlet Kernels</em></strong> and the <strong><em>Weisfeiler-Lehman Kernel.</em></strong></p><h4>1. Graphlet Kernels</h4><p>The idea of the graphlet kernel is to count the number of graphlets in a graph, like we did in the <strong><em>Graphlet Degree Vector. </em></strong>Although the idea of a graphlet here is slightly different: graphlets need not be connected, and they are not rooted. The example below shows how to count graphlets for 3-node subgraphs:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CJ4yJYl1mFcrWEe2cYF8OQ.png" /></figure><p>So, how do we create a graphlet kernel with this approach? Assume you have a graph <strong><em>G</em></strong>; then the graphlet count vector <strong><em>f𝓰</em></strong> can be defined as:</p><blockquote><em>(f𝓰)ᵢ = #(gᵢ ⊆ G) for i = 1,2,….,nₖ</em>, where nₖ is the number of graphlets of size k</blockquote><p>The image below shows how to calculate the graphlet vector for k = 3 for the input graph G.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tup2b4rmv937WsNn8JcRIA.png" /></figure><p>Then, if you have two graphs <strong><em>G</em></strong> and <strong><em>G’, </em></strong>you can calculate the graphlet kernel with the formula below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/746/1*_X8fv9oIrDeW7Bd12iiKrg.png" /></figure><p>One thing you should keep in mind is that the sizes of graphs <strong><em>G</em></strong> and <strong><em>G’</em></strong> can be different. In that case, you can normalize the vectors to bring them to the same scale.</p><blockquote>There is one issue with the above approach: it is expensive. Counting size-k graphlets for a graph of size n takes time on the order of nᵏ.</blockquote><h4>2. Weisfeiler-Lehman Kernel</h4><p>To overcome the complexity of <em>Graphlet Kernels, </em>WL kernels were introduced. The WL kernels are based on the <em>color refinement</em> algorithm, which is defined below:</p><p>Given: A graph 𝐺 with a set of nodes 𝑉.<br> 1. Assign an initial color 𝑐⁽⁰⁾(𝑣) to each node 𝑣. <br> 2. Iteratively refine node colors by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*giavRoej8ws-HWLnfAmeYw.png" /></figure><p>where HASH maps different inputs to different colors.<br>3. After 𝐾 steps of color refinement, 𝑐⁽ᵏ⁾(𝑣) summarizes the structure of the 𝐾-hop neighborhood.</p><p>The set of images below shows an example of how the WL kernel works:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*03ruYykFLB7ZHZyLFBqpvA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0gzeYzfGV-Vw4X-UIJ1gcA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qqi0896zGlXgP_bPxkkmEg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U2OZdSr0VB7zVNuRfEYcVg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-2x7xPyTPzjXr7EGvWthdg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Finally, the WL kernel value can be calculated by taking the dot product of the color-count vectors obtained above.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rx32Q6i-Ku29pBXDAzRMHg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>That’s all for traditional ML for graphs. In the next post, I will discuss the different embedding algorithms. So stay tuned….</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ca7bdbef7544" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/traditional-ml-for-graphs-ca7bdbef7544">Traditional ML for Graphs</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Neural Networks: An Introduction]]></title>
            <link>https://medium.com/analytics-vidhya/graph-neural-networks-an-introduction-9be46691d67c?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/9be46691d67c</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[gnn]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Wed, 29 May 2024 17:48:34 GMT</pubDate>
            <atom:updated>2024-05-29T17:48:34.489Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/697/1*VvP1WjNzJhFKEoET0IsjYQ.png" /><figcaption><em>Credits: </em><a href="https://bdtechtalks.com/2021/10/11/what-is-graph-neural-network/"><em>https://bdtechtalks.com/2021/10/11/what-is-graph-neural-network/</em></a></figcaption></figure><p>In this series of blog posts on graph neural networks, we will discuss the basics of graph neural networks, their use cases, why they are required, and their advantages over conventional fully connected neural networks.</p><p><em>So let’s get started…..</em></p><h3>Why Graph Neural Networks?</h3><p>Many data sources are in a graph-based format where they can be represented as nodes (vertices) connected by edges, capturing relationships and dependencies between entities. Due to their complex structure, traditional fully connected networks are not effective at interpreting these kinds of datasets.</p><p>Such datasets have a rich relational structure that can be better represented as a relational graph. Therefore, graph neural networks offer a promising avenue for modeling these types of datasets.</p><h3>Use Cases of Graph Neural Networks (GNN’s)</h3><p>GNNs offer solutions to a wide range of use cases involving graph-structured data. One of the most popular examples is social media analysis. With the vast user bases of platforms like Facebook and Instagram, each user can be treated as an individual node in a graph, with their profile features serving as attributes. By leveraging their friends lists, we can construct a large graph. This is a classical example where graph neural networks outperform fully connected neural networks.</p><p>Other use cases include communication networks, citation networks, economic networks, knowledge graphs, scene graphs, and more. Graph neural networks can effectively tackle various problems across these domains.</p><h3>Graph Neural Network Applications</h3><p>GNNs can be used to solve a variety of problems. Below, I am listing the applications. We will discuss them in detail in the upcoming blogs.</p><pre>1. Graph Level Prediction:<br>    - Graph classification (Drug Discovery, Molecule Generation etc.)<br><br>2. Sub-Graph Level:<br>    - Traffic Prediction (Google Maps etc.)<br><br>3. Node Level:<br>    - Node Classification (Categorize online users/items etc.)<br><br>4. Edge Level:<br>    - Link Prediction (Recommender Systems, Drug Side effects etc.)</pre><h3>Popular Tools</h3><p>There are a variety of tools available for graph-based data analysis and modeling. Below, I am listing some of them. In the next set of blogs, I will discuss them in detail, providing hands-on examples as well.</p><pre>1. NetworkX: <br>   - It is a Python library for the graph creation, manipulation, and study of <br>complex networks.<br><br>2. Pytorch Geometric: <br>   - It is a library for deep learning on irregularly structured data such as <br>graphs and it is  built on top of PyTorch.<br><br>3. DeepSnap:<br>   - It is a Python library that facilitates deep learning on graphs by <br>providing easy-to-use data structures and tools for graph manipulation and <br>model training.<br><br>4. 
GraphGym:<br>   - It is a research platform built on PyTorch Geometric that offers modular <br>and flexible tools for designing, training, and evaluating graph neural <br>networks.</pre><h3>Machine Learning for Graph</h3><p>Before the advent of graph neural networks, people used to follow traditional ML pipelines, where they extracted features from the input graph and fed them into classical ML models to perform various tasks. These features can also be divided based on the graph structure in the following manner:</p><pre>1. Node Level Features:<br>      a. Node Degree<br>      b. Node Centrality<br>      c. Clustering Coefficient<br>      d. Graphlets<br><br>2. Edge Level Features:<br>      a. Distance Based Features: <br>            i. Shortest-path distance between two nodes<br>      b. Local Neighborhood Overlap:<br>            i.  Common Neighbors<br>            ii. Jaccard&#39;s Coefficient<br>            iii. Adamic-Adar Index<br>      c. Global Neighborhood Overlap:<br>            i. Katz Index <br><br>3. Graph Level Features:<br>      a. Graph Kernels<br>            i. Graphlet Kernel<br>            ii. Weisfeiler-Lehman Kernel</pre><p>I will be discussing them in detail in the upcoming blogs. Stay tuned for that!</p><h3>Types of Graph Neural Network</h3><p>There are numerous graph neural networks available. Discussing them in detail will require another set of blogs, which I will cover in the next series. However, for the sake of discussion and to provide an overview, I am listing some of the popularly used networks.</p><pre>1. Vanilla GNN<br>2. Graph Convolutional Networks (GCN)<br>3. Graph Attention Networks (GAT)<br>4. GraphSage<br>5. Relational Graph Convolutional Networks (R-GCNs)<br>6. Graph Recurrent Neural Networks (GRNNs) </pre><p>The above list is not exhaustive, but it includes the most prominent networks in use. Details will follow in the next set of blogs.</p><p>For now, that’s all from my side. In summary, in this blog, I briefly discussed the utility of graph neural networks, their applications, types, etc. In the next blog, we will discuss classical ML for graphs and continue to delve into graph neural networks in subsequent blogs.</p><p>So stay tuned and happy reading……..</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9be46691d67c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-neural-networks-an-introduction-9be46691d67c">Graph Neural Networks: An Introduction</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[BERT — Pre-training + Fine-tuning]]></title>
            <link>https://medium.com/analytics-vidhya/bert-pre-training-fine-tuning-eb574be614f6?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb574be614f6</guid>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[bert]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 26 Dec 2021 06:50:23 GMT</pubDate>
            <atom:updated>2023-07-28T07:22:12.916Z</atom:updated>
            <content:encoded><![CDATA[<h3>BERT — Pre-training + Fine-tuning</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hzPw6ngLghgnVdwy.png" /><figcaption>Source — <a href="https://ruder.io/content/images/2021/02/fine-tuning_methods.png">https://ruder.io/content/images/2021/02/fine-tuning_methods.png</a></figcaption></figure><p>Huggingface.co has made using the transformers-based model convenient with their Transformers API. But a lot of time, only fine-tuning does not work. Pre-training on the unlabelled data and then fine-tuning helps the model achieve the desired results. Huggingface API provides the pre-training functionality as well. In this blog post, I will be explaining how to perform pre-training and then fine-tuning a transformers based model. For this purpose, I will be using BERT as a reference model.</p><h3>Data Formatting</h3><p>To perform pre-training, the data must be in a specific format. It should be in a text file (.txt format) with one sentence per line. The purpose of this text file is first to tokenize the data using Word Piece tokenizer and then perform pre-training on the data.</p><h3>Pre-training model</h3><h4>Train tokenizer on the text</h4><p>After converting the data in the required format, the next step is to train the tokenizer on input data. This step is helpful to create the vocabulary of the data. The below code gist shows how to tokenize the text using Word Piece Tokenizer. To read more about Word Piece Tokenizer, you can refer to section 4.1 from the below link:-</p><blockquote><a href="https://arxiv.org/pdf/1609.08144v2.pdf">https://arxiv.org/pdf/1609.08144v2.pdf</a></blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fb819a089f9fb06bfd318c549ffe3854/href">https://medium.com/media/fb819a089f9fb06bfd318c549ffe3854/href</a></iframe><h4>Train BERT for MLM task</h4><p>The next step will be to pre-train BERT for the masked language modelling task. For this purpose, we will be using the same dataset we used to train the tokenizer for this purpose. For the MLM task, 15% of tokens are randomly masked, and then the model is trained to predict those tokens. This functionality is present in the Huggingface API, which is given in the below code:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/309b82465488ab4e4be0d714106f4b0a/href">https://medium.com/media/309b82465488ab4e4be0d714106f4b0a/href</a></iframe><p>Till now, we are done with the pre-training part. Let&#39;s move to fine-tuning part.</p><h3>Finetuning Model</h3><h4>Data Preparation</h4><p>For the fine-tuning section, the data must be in a different format from what we used in the pre-training part. BERT takes three inputs viz. — input_ids, attention_mask, token_type_ids. I won&#39;t be going into the details of what are they. You can refer to them from the BERT paper. Here, I will be explaining how to calculate them from the Huggingface API. I will be using the BERT model for classification purposes here. 
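<p>As a reference, the sketch below shows one minimal way to wire up both steps with the Tokenizers and Transformers APIs. It is an illustrative outline rather than my exact gist above; the file paths, vocabulary size and training arguments are placeholders.</p><pre>from tokenizers import BertWordPieceTokenizer<br>from transformers import (BertTokenizerFast, BertForMaskedLM, LineByLineTextDataset,<br>                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)<br><br># 1. train a WordPiece tokenizer on the raw text file (one sentence per line)<br>wp_tokenizer = BertWordPieceTokenizer(lowercase=True)<br>wp_tokenizer.train(files=[&#39;corpus.txt&#39;], vocab_size=30522, min_frequency=2)<br>wp_tokenizer.save_model(&#39;pretrained_tokenizer&#39;)   # the output directory must already exist<br><br># 2. pre-train BERT with masked language modelling (15% of tokens masked)<br>tokenizer = BertTokenizerFast.from_pretrained(&#39;bert-base-uncased&#39;)<br>model = BertForMaskedLM.from_pretrained(&#39;bert-base-uncased&#39;)   # continue pre-training from the public checkpoint<br><br>dataset  = LineByLineTextDataset(tokenizer=tokenizer, file_path=&#39;corpus.txt&#39;, block_size=128)<br>collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)<br><br>args = TrainingArguments(output_dir=&#39;bert-mlm&#39;, num_train_epochs=1, per_device_train_batch_size=8)<br>trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)<br>trainer.train()<br>trainer.save_model(&#39;bert-mlm&#39;)</pre>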
<p>Till now, we are done with the pre-training part. Let&#39;s move to the fine-tuning part.</p><h3>Finetuning Model</h3><h4>Data Preparation</h4><p>For the fine-tuning section, the data must be in a different format from what we used in the pre-training part. BERT takes three inputs viz. — input_ids, attention_mask, token_type_ids. I won&#39;t be going into the details of what they are; you can refer to the BERT paper for that. Here, I will explain how to calculate them with the Huggingface API. I will be using the BERT model for classification purposes here. One can make changes in the code according to their convenience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2ef375b6bd5a4d846d8ae2d68035dbe7/href">https://medium.com/media/2ef375b6bd5a4d846d8ae2d68035dbe7/href</a></iframe><p>In the above code, I have used the <strong>Dataset</strong> class from <strong>torch.utils.data</strong> and <strong>BERT&#39;s</strong> <strong>tokenizer</strong> to convert the data into the required format. Then, in the next step, I create DataLoaders for training and testing purposes.</p><h4>Model Defining</h4><p>Let&#39;s start with the model-building part for the fine-tuning purpose. I will be adding two linear layers on top of BERT for classification, with <strong>dropout = 0.1</strong> and <strong>ReLU</strong> as the activation function. One can try different configurations as well. I have defined a PyTorch class to build the model, which is shown in the code below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/092cc0e60b046808ad2c16cc9d872587/href">https://medium.com/media/092cc0e60b046808ad2c16cc9d872587/href</a></iframe><h4>Train and validation function</h4><p>The last step is to define the training and validation functions to perform fine-tuning. These are the usual training and evaluation loops used in PyTorch. The code below depicts this:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/36e57569d15b46f91effdf52c6a68669/href">https://medium.com/media/36e57569d15b46f91effdf52c6a68669/href</a></iframe><p>Voila, now you are done with all the required steps to achieve the goal. But one can try different configurations as mentioned above. You can also try a task other than classification. If you want the complete code, you can visit the link below:-</p><p><a href="https://github.com/DhavalTaunk08/NLP_scripts">GitHub - DhavalTaunk08/NLP_scripts: Contains notebooks related to various transformers based models for different nlp based tasks</a></p><p>This is all from my side this time. If you want to read more related to ML/DL, visit the below link and if you like, do give it a clap.</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb574be614f6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/bert-pre-training-fine-tuning-eb574be614f6">BERT — Pre-training + Fine-tuning</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Search Engine in Python from scratch]]></title>
            <link>https://medium.com/analytics-vidhya/search-engine-in-python-from-scratch-c3f7cc453250?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/c3f7cc453250</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[information-retrieval]]></category>
            <category><![CDATA[search-engines]]></category>
            <category><![CDATA[information-extraction]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 10 Oct 2021 12:21:24 GMT</pubDate>
            <atom:updated>2023-07-28T07:24:47.186Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*mY70Qzx_dO-p5Hmi32vJKQ.gif" /><figcaption>Source:- <a href="https://www.pinterest.com/pin/847450854860596454/?amp_client_id=CLIENT_ID(_)&amp;mweb_unauth_id=&amp;simplified=true">https://www.pinterest.com/pin/847450854860596454/?amp_client_id=CLIENT_ID(_)&amp;mweb_unauth_id=&amp;simplified=true</a></figcaption></figure><p>In this post, I will be going through all the details of building a search engine from scratch using the Wikipedia dump (approximately 84GB in size). I will walk through the step-by-step process of creating a primary index and a secondary index, and how to implement search functionality that returns results in minimal time. I will divide the post into four parts: Requirements, Wiki dump details, Indexing, and Searching. So tighten your seat belts, and let&#39;s get started.</p><h3>1. Requirements</h3><blockquote>Python3</blockquote><blockquote>NLTK stopwords</blockquote><blockquote>PyStemmer</blockquote><h3>2. Wiki dump details</h3><p>For creating the search engine, in this post, I will be referring to the <strong>Wikipedia</strong> dump for the English language, which is approximately <strong>84GB</strong> in size. One can download the data from the below-given link:-</p><blockquote><a href="https://dumps.wikimedia.org/enwiki/20210720/enwiki-20210720-pages-articles-multistream.xml.bz2">https://dumps.wikimedia.org/enwiki/20210720/enwiki-20210720-pages-articles-multistream.xml.bz2</a></blockquote><p>You will need to download and extract the dump. Alternatively, you can work on the zipped data as well. I will be explaining everything on the extracted dump.</p><h3>3. Creating the Index for Wiki dump</h3><h4>(i) Parsing the XML dump</h4><p>First of all, you will be required to parse the XML and get the necessary data. For this, there are a couple of parsers available in Python. Some of them are:-</p><blockquote>SAX</blockquote><blockquote>Etree</blockquote><blockquote>DOM</blockquote><p>Here I will be using the SAX parser to parse the XML. You can try out other parsers as well. In the below gist, I have shown how to use the SAX parser to parse the data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5ab880ea001ad802dc96887ffda11108/href">https://medium.com/media/5ab880ea001ad802dc96887ffda11108/href</a></iframe><p>I am using only two fields, i.e. title and text, from the XML in the above code. I am assigning my own IDs using the variable num_pages. How I use the title and the text is explained in different sections below.</p><h4>(ii) How to preprocess text</h4><p>It is an essential task, as this step ensures we are not adding unnecessary terms to the Index. Otherwise, it will blow up the index size. Mainly, I remove stopwords, tokenise the text, remove HTML tags, remove non-ASCII characters, etc. It is shown in the code below.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eb7f8412c4742f45d75bcf0fbea11bc6/href">https://medium.com/media/eb7f8412c4742f45d75bcf0fbea11bc6/href</a></iframe>
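<p>As a condensed, illustrative sketch of the same idea (the gist above has the full version): strip HTML tags and non-ASCII characters with regular expressions, drop NLTK stopwords, and stem the remaining tokens with PyStemmer. It assumes the NLTK stopword list has already been downloaded.</p><pre>import re<br>import Stemmer                       # PyStemmer<br>from nltk.corpus import stopwords    # run nltk.download(&#39;stopwords&#39;) once beforehand<br><br>STOP_WORDS = set(stopwords.words(&#39;english&#39;))<br>stemmer = Stemmer.Stemmer(&#39;english&#39;)<br><br>def preprocess(text):<br>    text = re.sub(r&#39;&lt;[^&gt;]+&gt;&#39;, &#39; &#39;, text)                # strip HTML tags<br>    text = re.sub(r&#39;[^a-z0-9 ]&#39;, &#39; &#39;, text.lower())      # keep only ASCII alphanumerics<br>    tokens = [t for t in text.split() if t not in STOP_WORDS]<br>    return stemmer.stemWords(tokens)                     # stemmed tokens go into the index</pre>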
<h4>(iii) Extract different fields</h4><p>There can be different possible fields on which we can do query searching. I will be using six fields: title, body, category, infobox, links, and references. One can search generic queries or field-specific queries using these fields.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f400109357ce59ec03f5917cb8da3bb6/href">https://medium.com/media/f400109357ce59ec03f5917cb8da3bb6/href</a></iframe><h4>(iv) Creating an intermediate index</h4><p>Creating the final Index directly would be a heavy task and can blow up memory as well. Therefore, we will first create intermediate indexes on sections of the data and then perform a final merge to make the final index. We will be using the SPIMI approach. You can read more about it using the link below.</p><p><a href="https://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html">Single-pass in-memory indexing</a></p><p>The code below shows how to create an intermediate index.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee94ebd92c26c703b7774b47914fe09f/href">https://medium.com/media/ee94ebd92c26c703b7774b47914fe09f/href</a></iframe><p>Here I am processing the text token by token, adding it to a dictionary, and then writing it to intermediate files. The format of the index files now looks like this:-</p><pre>apple-2314:t3b6i2r1;6432:t5c8b3i1;</pre><p>In the above example, &quot;<em>apple</em>&quot; is a token. Then, after the hyphen, every pair is separated by &quot;<strong>;</strong>&quot;. The first value before &quot;<strong>:</strong>&quot; represents the docID of the document, and the string after it encodes the per-field frequencies.</p><blockquote>Ex.:- t3b6i2r1 means the token appears 3 times in the title, 6 times in the body, 2 times in the infobox and 1 time in the references.</blockquote><h4>(v) Merging the intermediate index</h4><p>Now that we are done with writing the intermediate indexes, we need to merge them, because there will be instances where a token has its information split across multiple files. We need to merge it to create the final Index. The code below does this:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3e79537cafcd7ae03b7b2903ebe4a1f5/href">https://medium.com/media/3e79537cafcd7ae03b7b2903ebe4a1f5/href</a></iframe><p>The format of the final Index looks like the example below:-</p><pre>apple-2141:5;1232:1;5432:78;</pre><p>In the above example, &quot;<em>apple</em>&quot; is a token. Then, after the hyphen, every pair is separated by &quot;<strong>;</strong>&quot;. The first value before &quot;<strong>:</strong>&quot; represents the docID of the document, and the second value represents the frequency of the token in that document for the given field. The final index is split into separate field-wise files, unlike the intermediate indexes.</p><p>In this way, the final Index will look similar to the example shown above. If you want to see the entire code, you can visit the GitHub link given at the end of this blog, and also try out different approaches.</p><h4>(vi) Secondary Index</h4><p>You can create a secondary index as and when required. This helps speed up searching by keeping summary information about each token in the secondary Index. I will be using the format of the secondary Index shown below. One can try out other possible forms as well.</p><pre>apple-563-1-3-4--2-4-6-</pre><p>According to the above format, &#39;apple&#39; will be the token. Then the value 563 will denote the frequency of apple in the entire Wikipedia dump. 
After that, the number 1 indicates in which file number the token is present. The token can appear in any field. The number 1 says that if the token is present in a field, it will only be there in that particular file number for that field. After that, all other values are optional. I will go over the details of the field values below:-</p><p>As mentioned above, I will be using the fields title, body, category, infobox, link, reference. So the values between &#39;-&#39; will denote the line number in the final index files. The example below shows it:-</p><pre>&#39;apple-563-1-3-4--2-4-6-&#39;.split(&#39;-&#39;) ---&gt; [&#39;apple&#39;, &#39;563&#39;, &#39;1&#39;, &#39;3&#39;, &#39;4&#39;, &#39;&#39;, &#39;2&#39;, &#39;4&#39;, &#39;6&#39;, &#39;&#39;]</pre><p>Here the fourth element (&#39;3&#39;) indicates that &#39;apple&#39; appears at line number 3 of file number 1 for the title file. Similarly, &#39;4&#39; will show that it will be present at line number 4 of the body file; it does not appear in the category file as it is an empty string (&#39;&#39;), it appears at line number &#39;2&#39; for the infobox file, it appears at line number &#39;4&#39; of the link file, and at line number &#39;6&#39; in the reference file. One thing that will be common for all the files is that if the token is present for any field, it will be present in that particular file number only, nowhere else.</p><h3>4. Implementing Search functionality</h3><p>One of the crucial things required for search functionality is implementing the ranking functionality to rank the documents according to their relevance. But before that, a few other things are also needed, which I will explain below:-</p><h4>(i) Preprocessing the query</h4><p>The preprocessing is the same as what we did during the indexing phase. After applying the same steps, we get the final preprocessed query.</p><h4>(ii) Identify query type</h4><p>This is one of the most useful steps, as it tells us whether we need to run a simple query, a field query, or both. So basically a query can be of 3 types:-</p><pre>Type 1:- world war II<br>Type 2:- t:world cup i:2012<br>Type 3:- Sachin Tendulkar t:world cup i:2012</pre><p>We can identify the type of query using the below code-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1d3339a6f210147cdd25c2005252bcd9/href">https://medium.com/media/1d3339a6f210147cdd25c2005252bcd9/href</a></iframe><h4>(iii) Ranking functionality</h4><p>Ranking functionality is used to order the results so that the most relevant ones are on top. Usually, tf-idf is the metric used to rank documents.</p><p>So what is tf-idf? It is composed of two terms, <strong>TF </strong>and<strong> IDF.</strong></p><p><strong>TF (Term Frequency):- </strong>It tells us the frequency of a term in a document.</p><p><strong>IDF (Inverse Document Frequency):- </strong>Many words occur with a very high frequency across documents, and this high frequency makes them less useful for distinguishing one document from another. To reduce the effect of such high-frequency words, IDF is used. It is basically the log of the total number of documents divided by the number of documents containing the word.</p><p>So the final tf-idf value is the product of the tf and idf values.</p><pre>tf-idf = tf * idf</pre>
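<p>As a small illustrative sketch (using one common log-weighted variant; the table below lists others), the score of a document for a query can be computed like this:</p><pre>import math<br><br>def tf_idf(term_count, doc_freq, total_docs):<br>    # term_count: occurrences of the term in this document (read from the index)<br>    # doc_freq:   number of documents that contain the term<br>    if term_count == 0 or doc_freq == 0:<br>        return 0.0<br>    tf = 1 + math.log10(term_count)<br>    idf = math.log10(total_docs / doc_freq)<br>    return tf * idf<br><br>def score(query_tokens, doc_term_counts, doc_freqs, total_docs):<br>    # rank a document by summing tf-idf over the preprocessed query tokens<br>    return sum(tf_idf(doc_term_counts.get(t, 0), doc_freqs.get(t, 0), total_docs)<br>               for t in query_tokens)</pre>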
<h4>(iv) Variants of tf-idf</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/671/1*-MfyZN9X_9CkD0Dn3ijjGQ.png" /><figcaption>Source:- <a href="https://nlp.stanford.edu/IR-book/pdf/06vect.pdf">https://nlp.stanford.edu/IR-book/pdf/06vect.pdf</a></figcaption></figure><p>One can use any of the tf-idf variants given above; it is a matter of implementation and choice.</p><p>So, that is all for now from my side. If you want to see the full code, you can visit the link below:-</p><p><a href="https://github.com/DhavalTaunk08/Wiki-Search-Engine">GitHub - DhavalTaunk08/Wiki-Search-Engine</a></p><p>If you want to read more about machine learning and deep learning, do visit the link below:-</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>Happy reading…….</p><p>If you like my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c3f7cc453250" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/search-engine-in-python-from-scratch-c3f7cc453250">Search Engine in Python from scratch</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Finetune DistilBERT for multi-label text classsification task]]></title>
            <link>https://medium.com/analytics-vidhya/finetune-distilbert-for-multi-label-text-classsification-task-994eb448f94c?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/994eb448f94c</guid>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[multilabel-classifier]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 17 Sep 2020 05:14:02 GMT</pubDate>
            <atom:updated>2023-07-28T07:25:25.910Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1023/0*06fQisdQnb_BPajl.png" /><figcaption>Source — <a href="https://developer.nvidia.com/blog/efficient-bert-finding-your-optimal-model-with-multimetric-bayesian-optimization-part-1/">https://developer.nvidia.com/blog/efficient-bert-finding-your-optimal-model-with-multimetric-bayesian-optimization-part-1/</a></figcaption></figure><p>In one of my last blog post, <a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61"><em>How to fine-tune bert on text classification task</em></a><em>, </em>I had explained fine-tuning <strong><em>BERT</em></strong> for a <strong>multi-class </strong>text classification task. In this post, I will be explaining how to fine-tune <strong><em>DistilBERT</em></strong> for a <strong>multi-label</strong> text classification task. I have made a GitHub repo as well containing the complete code which is explained below. You can visit the below link to see it and can fork it and use it.</p><blockquote><a href="https://github.com/DhavalTaunk08/Transformers_scripts">https://github.com/DhavalTaunk08/Transformers_scripts</a></blockquote><h3>Introduction</h3><p>The DistilBERT model (<a href="https://arxiv.org/pdf/1910.01108.pdf"><strong><em>https://arxiv.org/pdf/1910.01108.pdf</em></strong></a>) was released by Huggingface.co which is a distilled version of BERT released by Google (<a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>https://arxiv.org/pdf/1810.04805.pdf</em></strong></a>).</p><p>According to the authors:-</p><blockquote>They leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.</blockquote><p>So let’s start with the details and the process to fine-tune the model.</p><h3>Multi-Class v/s Multi-Label classification</h3><p>First of all, it is important to understand the difference between multi-class and multi-label classification. <strong>Multi-class classification</strong> means classifying the samples into one of the three or more available classes. While in <strong>multi-label classification</strong>, one sample can belong to more than one class. Let me explain it more clearly by an example:-</p><p><strong>Multiclass classification</strong> — Let say we have 10 fruits. They can belong to one of the three classes — ‘apple’, ‘mango’ and ‘banana’. If we are asked to classify the fruits in these given classes, they can belong to only one of these classes. Therefore, it is a multi-class classification problem.</p><p><strong>Multi-label classification</strong> — Let say we have few movie names and our task is to classify these movies into the genres to which they belong to like ‘action’, ‘comedy’, ‘horror’, ‘sci-fi’, ‘drama’ etc. These movies can belong to more than one genre. For example — ‘The Matrix movie series belongs to the ‘action’ as well as ‘sci-fi’ category. Thus it is called multi-label classification.</p><h3><strong>Data Formatting</strong></h3><p>First of all, there is a need to format the data. The required data can contain 2 columns. One column containing text to be classified. Another column containing labels related to that sample. 
The below image is an example of the data frame:-</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/398/1*-PYxaeUbp_RVuR7eiKwhrg.png" /></figure><p>The above example shows that we have six different classes and the sample can belong to any number of classes.</p><p>But the question is how to convert the labels into this format? Here, scikit-learn comes to the rescue!!!</p><p>Below is an example of how to convert these labels to the required format.</p><pre><strong>&gt;&gt;&gt; from</strong> <strong>sklearn.preprocessing</strong> <strong>import</strong> MultiLabelBinarizer<br><strong>&gt;&gt;&gt; </strong>mlb = MultiLabelBinarizer()<br><strong>&gt;&gt;&gt; </strong>mlb.fit_transform([{&#39;sci-fi&#39;, &#39;thriller&#39;}, {&#39;comedy&#39;}])<br>array([[0, 1, 1],<br>       [1, 0, 0]])<br><strong>&gt;&gt;&gt; </strong>list(mlb.classes_)<br>[&#39;comedy&#39;, &#39;sci-fi&#39;, &#39;thriller&#39;]</pre><p>Also, you can refer to the below link to get more details about it.</p><blockquote><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html"><em>https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html</em></a></blockquote><h3>Code</h3><p>Now let’s get to the code part about the required libraries, how to write DataLoader, and model class for this task.</p><h4>Required libraries</h4><blockquote>transformers==3.0.2</blockquote><blockquote>torch</blockquote><blockquote>scikit-learn</blockquote><blockquote>numpy</blockquote><blockquote>pandas</blockquote><blockquote>tqdm</blockquote><p>These can be installed with the <strong>‘<em>pip install’</em></strong> command.</p><h4>Importing libraries</h4><pre>import numpy as np<br>import pandas as pd<br>import transformers<br>import torch<br>from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler<br>from transformers import DistilBertModel, DistilBertTokenizer<br>from tqdm import tqdm<br>from sklearn.preprocessing import MultiLabelBinarizer</pre><pre><strong>from</strong> <strong>torch</strong> <strong>import</strong> cuda<br>device = &#39;cuda&#39; <strong>if</strong> cuda.is_available() <strong>else</strong> &#39;cpu&#39;</pre><p>The above step is to set up the device for GPU.</p><h4>Training parameters</h4><pre>MAX_LEN = 256<br>TRAIN_BATCH_SIZE = 8<br>VALID_BATCH_SIZE = 4<br>EPOCHS = 1<br>LEARNING_RATE = 1e-05</pre><p>These parameters can be tuned according to one’s needs. But there is one important point to be noted here:-</p><blockquote>DistilBERT accepts a max_sequence_length of 512 tokens.</blockquote><p>We cannot give max_sequence_length more than this. 
If you want to give a sequence length of size more than 512 tokens, you can try the longformer model (<a href="https://arxiv.org/pdf/2004.05150"><strong><em>https://arxiv.org/pdf/2004.05150</em></strong></a>)</p><h4>DataLoader</h4><pre>tokenizer = DistilBertTokenizer.from_pretrained(&#39;distilbert-base-uncased&#39;)</pre><pre><strong>class</strong> MultiLabelDataset(Dataset):<br>    <strong>def</strong> __init__(self, dataframe, tokenizer, max_len):<br>        self.tokenizer = tokenizer<br>        self.data = dataframe<br>        self.text = dataframe.text<br>        self.targets = self.data.labels<br>        self.max_len = max_len<br><br>    <strong>def</strong> __len__(self):<br>        <strong>return</strong> len(self.text)<br><br>    <strong>def</strong> __getitem__(self, index):<br>        text = str(self.text[index])<br>        text = &quot; &quot;.join(text.split())<br><br>        inputs = self.tokenizer.encode_plus(<br>            text,<br>            <strong>None</strong>,<br>            add_special_tokens=<strong>True</strong>,<br>            max_length=self.max_len,<br>            pad_to_max_length=<strong>True</strong>,<br>            return_token_type_ids=<strong>True</strong><br>        )<br>        ids = inputs[&#39;input_ids&#39;]<br>        mask = inputs[&#39;attention_mask&#39;]<br><br>        <strong>return</strong> {<br>            &#39;ids&#39;: torch.tensor(ids, dtype=torch.long),<br>            &#39;mask&#39;: torch.tensor(mask, dtype=torch.long),<br>            &#39;targets&#39;: torch.tensor(self.targets[index], dtype=torch.float)<br>        }</pre><p>Calling the tokenizer and loading the dataset. Here, train_dataset and val_dataset will be training and validation datasets in pandas data frame format with column names as [‘text’, ‘labels’].</p><pre>training_set = MultiLabelDataset(train_dataset, tokenizer, MAX_LEN)<br>testing_set = MultiLabelDataset(test_dataset, tokenizer, MAX_LEN)</pre><pre>train_params = {&#39;batch_size&#39;: TRAIN_BATCH_SIZE,<br>                &#39;shuffle&#39;: <strong>True</strong>,<br>                &#39;num_workers&#39;: 0<br>                }<br><br>test_params = {&#39;batch_size&#39;: VALID_BATCH_SIZE,<br>                &#39;shuffle&#39;: <strong>True</strong>,<br>                &#39;num_workers&#39;: 0<br>                }<br><br>training_loader = DataLoader(training_set, **train_params)<br>testing_loader = DataLoader(testing_set, **test_params)</pre><p>The above step converts the data into the required format using the MultiLabelDataset class and PyTorch&#39;s DataLoader. 
You can read more about DataLoader by visiting the below-given link:-</p><blockquote><a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html">https://pytorch.org/tutorials/beginner/data_loading_tutorial.html</a></blockquote><h4>Model Class</h4><pre><strong>class</strong> <strong>DistilBERTClass</strong>(torch.nn.Module):<br>    <strong>def</strong> __init__(self):<br>        super(DistilBERTClass, self).__init__()<br>        self.l1 = DistilBertModel.from_pretrained(&quot;distilbert-base-uncased&quot;)<br>        self.pre_classifier = torch.nn.Linear(768, 768)<br>        self.dropout = torch.nn.Dropout(0.3)<br>        self.classifier = torch.nn.Linear(768, <strong>num_classes</strong>)<br><br>    <strong>def</strong> forward(self, input_ids, attention_mask):<br>        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)<br>        hidden_state = output_1[0]<br>        pooler = hidden_state[:, 0]<br>        pooler = self.pre_classifier(pooler)<br>        pooler = torch.nn.ReLU()(pooler)<br>        pooler = self.dropout(pooler)<br>        output = self.classifier(pooler)<br>        <strong>return</strong> output</pre><p>Here, I have used 2 linear layers on top of the DistilBERT model with a dropout unit and ReLu as an activation function. <strong><em>num_classes </em></strong>will be the number of classes available in your dataset. The model will return the logit scores for each class. The class can be called by the below method:-</p><pre>model = DistilBERTClass()<br>model.to(device)</pre><h4>Loss function and optimizer</h4><pre><strong>def</strong> loss_fn(outputs, targets):<br>    <strong>return</strong> torch.nn.BCEWithLogitsLoss()(outputs, targets)</pre><pre>optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)</pre><p>Here, BCEWithLogitsLoss is used which is used generally for multi-label classification. 
One can read more by visiting the below link:-</p><blockquote><a href="https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html">https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html</a></blockquote><h4>Training function</h4><pre><strong>def</strong> train_model(epoch):<br>    model.train()<br>    <strong>for</strong> _, data <strong>in</strong> enumerate(training_loader, 0):<br>        ids = data[&#39;ids&#39;].to(device, dtype = torch.long)<br>        mask = data[&#39;mask&#39;].to(device, dtype = torch.long)<br>        targets = data[&#39;targets&#39;].to(device, dtype = torch.float)<br><br>        outputs = model(ids, mask)<br>        loss = loss_fn(outputs, targets)<br>        <strong>if</strong> _%1000==0:<br>            print(f&#39;Epoch: <strong>{epoch}</strong>, Loss:  {loss.item()}&#39;)<br><br>        optimizer.zero_grad()<br>        loss.backward()<br>        optimizer.step()</pre><pre><strong>for</strong> epoch <strong>in</strong> range(EPOCHS):<br>    train_model(epoch)</pre><p>The above function is used for training the model for the specified number of epochs.</p><h4>Validation</h4><pre><strong>def</strong> validation(testing_loader):<br>    model.eval()<br>    fin_targets=[]<br>    fin_outputs=[]<br>    <strong>with</strong> torch.no_grad():<br>        <strong>for</strong> _, data <strong>in</strong> enumerate(testing_loader, 0):<br>            ids = data[&#39;ids&#39;].to(device, dtype = torch.long)<br>            mask = data[&#39;mask&#39;].to(device, dtype = torch.long)<br>            targets = data[&#39;targets&#39;].to(device, dtype = torch.float)<br>            # DistilBERT does not use token_type_ids, so only ids and mask are passed<br>            outputs = model(ids, mask)<br>            fin_targets.extend(targets.cpu().detach().numpy().tolist())<br>            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())<br>    <strong>return</strong> fin_outputs, fin_targets</pre><pre>outputs, targets = validation(testing_loader)<br>outputs = np.array(outputs) &gt;= 0.5</pre><pre><strong>from</strong> <strong>sklearn</strong> <strong>import</strong> metrics<br><br>accuracy = metrics.accuracy_score(targets, outputs)<br>f1_score_micro = metrics.f1_score(targets, outputs, average=&#39;micro&#39;)<br>f1_score_macro = metrics.f1_score(targets, outputs, average=&#39;macro&#39;)</pre><pre>print(f&quot;Accuracy Score = <strong>{accuracy}</strong>&quot;)<br>print(f&quot;F1 Score (Micro) = <strong>{f1_score_micro}</strong>&quot;)<br>print(f&quot;F1 Score (Macro) = <strong>{f1_score_macro}</strong>&quot;)</pre><p>Here I have used accuracy and f1_score for now. But usually, the hamming loss and hamming score are better metrics for evaluating multi-label classification tasks. I will be discussing that in my next post.</p><p>So this is it for now. Stay tuned for the next post for more details on hamming loss, hamming score, and other things. 
If you want to read more, you can visit my profile for more posts.</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=994eb448f94c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/finetune-distilbert-for-multi-label-text-classsification-task-994eb448f94c">Finetune DistilBERT for multi-label text classsification task</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NLP Preprocessing:- A useful and important step]]></title>
            <link>https://medium.com/analytics-vidhya/nlp-preprocessing-a-useful-and-important-step-e79895c65a89?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/e79895c65a89</guid>
            <category><![CDATA[nltk]]></category>
            <category><![CDATA[text-preprocessing]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[spacy]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 26 Jul 2020 06:19:01 GMT</pubDate>
            <atom:updated>2023-07-28T07:26:50.166Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0ayXED-vmUU7UTyy" /><figcaption>Source — <a href="https://s3.amazonaws.com/re-work-production/post_images/435/NLP/original.jpg?1506438363">https://s3.amuction/post_images/435/NLP/original.jpg?1506438363</a></figcaption></figure><h3>Introduction</h3><p>GPT-3 model has, for now, became a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters in comparison to GPT-2 which had around 1.5 billion parameters. It&#39;s a major breakthrough in the field of NLP. But the preprocessing steps that are required before training any model is of utmost importance. Therefore in this article, I will be explaining all the major steps that are used and are required in preprocessing the data before training any NLP model.</p><p>First I will list out the preprocessing steps and then will explain them in detail:-</p><ol><li>Removing HTML tags</li><li>Removing stopwords</li><li>Removing extra spaces</li><li>Converting numbers to their textual representations</li><li>Lowercasing the text</li><li>Tokenization</li><li>Stemming</li><li>Lemmatization</li><li>Spell-checking</li></ol><p>Now let’s start with their explanation one by one.</p><h3>Removing HTML tags</h3><p>Sometimes the text data could contain the HTML tags along with the normal text if the data has been web-scraped from the internet. This could be removed by using python’s <strong>BeautfulSoup</strong> library because these tags will not be of any use or these tags can be removed using <strong>regex</strong> as well. The code is explained below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3fffe2b4f5e9da8354dedfff346437ba/href">https://medium.com/media/3fffe2b4f5e9da8354dedfff346437ba/href</a></iframe><h3>Removing stop-words</h3><p>Many times the data contains a large number of stop-words. These might not be useful because they won’t be making any significant impact on the data. These can be removed by using <strong><em>nltk</em></strong> or <strong><em>spacy</em></strong> library. The code is shown below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d91a4938b125d5e87ee32e20fc3aef5/href">https://medium.com/media/0d91a4938b125d5e87ee32e20fc3aef5/href</a></iframe><h3>Removing extra-spaces</h3><p>There might be certain situations where the data might contain extra spaces within the sentences. These can be easily removed by python’s <strong>split() </strong>and <strong>join() </strong>functions.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/21e6d344233cd18c8dc000c155bce22a/href">https://medium.com/media/21e6d344233cd18c8dc000c155bce22a/href</a></iframe><h3>Converting numbers to their textual representations</h3><p>Converting numbers to their textual form is also much useful in NLP preprocessing steps. For this purpose, the <strong>num2words</strong> library can be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f60bd061ca6f74066275615cdfdcf7c7/href">https://medium.com/media/f60bd061ca6f74066275615cdfdcf7c7/href</a></iframe><h3>Lowercasing the text</h3><p>Converting all the words in data into lowercase is a good practice to remove redundancies. There might be a possibility that words may appear more than one time in the text. 
<h3>Tokenization</h3><p>Tokenization involves converting sentences into tokens, i.e., splitting the sentences into words. It is also useful to separate punctuation from the words, because the embedding layer of a model may not have an embedding for a word with punctuation attached. For example, ‘thanks.’ is a word with a full stop; tokenization will split it into [‘thanks’, ‘.’]. The code for doing this with NLTK’s <strong>word_tokenize</strong> is shown below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dfe63ed5f2641999f6131d2a7d67d112/href">https://medium.com/media/dfe63ed5f2641999f6131d2a7d67d112/href</a></iframe><h3>Stemming</h3><p>Stemming is the process of converting a word to its root form. For example:- ‘sitting’ will be converted to ‘sit’, ‘thinking’ will be converted to ‘think’, etc. NLTK’s <strong>PorterStemmer</strong> can be used for this purpose.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1cf23406481321503e4ac0471f66008a/href">https://medium.com/media/1cf23406481321503e4ac0471f66008a/href</a></iframe><h3>Lemmatization</h3><p>Many people consider lemmatization similar to stemming, but they are actually different: lemmatization performs a <strong>morphological analysis</strong> of words, which stemming does not. NLTK has an implementation of lemmatization (<strong>WordNetLemmatizer</strong>) which can be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3ccc642d4c058abc49e32cc194d36bbc/href">https://medium.com/media/3ccc642d4c058abc49e32cc194d36bbc/href</a></iframe><h3>Spell-checking</h3><p>The data being used may well contain spelling mistakes, so spell-checking becomes an important step in NLP preprocessing. I will be using the <strong><em>TextBlob</em></strong> library for this purpose.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/152d80fc741ca6eadacc6fa2c2da60e2/href">https://medium.com/media/152d80fc741ca6eadacc6fa2c2da60e2/href</a></iframe><p>Although the above spell-checker may not be perfect, it will still be of good use.</p>
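<p>Again, as a rough sketch of these last four steps, assuming <strong>nltk</strong> (with the punkt and wordnet data downloaded) and <strong>TextBlob</strong> are installed, and with a made-up sample sentence:-</p><pre>from nltk.tokenize import word_tokenize<br>from nltk.stem import PorterStemmer, WordNetLemmatizer<br>from textblob import TextBlob<br><br>text = &#39;She was runing late, thanks.&#39;<br><br># Spell-checking with TextBlob (e.g. &#39;runing&#39; may be corrected to &#39;running&#39;)<br>text = str(TextBlob(text).correct())<br><br># Tokenization (also separates punctuation from the words)<br>tokens = word_tokenize(text)<br><br># Stemming with PorterStemmer<br>stems = [PorterStemmer().stem(token) for token in tokens]<br><br># Lemmatization with WordNetLemmatizer (treating every token as a verb here)<br>lemmas = [WordNetLemmatizer().lemmatize(token, pos=&#39;v&#39;) for token in tokens]<br><br>print(tokens)<br>print(stems)<br>print(lemmas)</pre>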
<p>All the methods depicted above are only some of the possible techniques for these steps; other methods are available as well.</p><p>I have also created a GitHub repo collecting all the above methods in one file. You can check it by going through the below link:-</p><p><a href="https://github.com/DhavalTaunk08/NLP_Preprocessing">DhavalTaunk08/NLP_Preprocessing</a></p><p>That’s all from my side this time. Keep reading. If you want to read more, you can check out the below stories of mine:-</p><ul><li><a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61">How to fine-tune BERT on text classification task?</a></li><li><a href="https://medium.com/analytics-vidhya/l1-vs-l2-regularization-which-is-better-d01068e6658c">L1 vs L2 Regularization: Which is better</a></li><li><a href="https://medium.com/analytics-vidhya/demystifying-generative-adversarial-networks-real-vs-fake-discriminator-bb5383ce89f9">Demystifying Generative Adversarial Networks: Real vs Fake discriminator</a></li></ul><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e79895c65a89" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/nlp-preprocessing-a-useful-and-important-step-e79895c65a89">NLP Preprocessing:- A useful and important step</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to fine-tune BERT on text classification task?]]></title>
            <link>https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/723f82786f61</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[text-classification]]></category>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 07 Jun 2020 16:38:15 GMT</pubDate>
            <atom:updated>2023-07-28T07:27:34.066Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*x3vhaoJdGndvZqmL.png" /><figcaption>Source:- <a href="https://pytorch.org/tutorials/_images/bert.png">https://pytorch.org/tutorials/_images/bert.png</a></figcaption></figure><p>BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture released in the paper <strong><em>“</em></strong><a href="https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf"><strong><em>Attention Is All You Need</em></strong></a><strong><em>” </em></strong>in the year 2016 by Google. The BERT model got published in the year 2019 in the paper — “<a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</em></strong></a><strong><em>”. </em></strong>When it was released, it showed the state of the art results on <a href="https://gluebenchmark.com/"><strong><em>GLUE benchmark</em></strong></a>.</p><h3>Introduction</h3><p>First, I will tell a little bit about the Bert architecture, and then will move on to the code on how to use is for the text classification task.</p><p>The BERT architecture is a multi-layer bidirectional transformer’s encoder described in the paper <a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</em></strong></a>.</p><p>There are two different architecture’s proposed in the paper. <strong>BERT_base </strong>and <strong>BERT_large. </strong>The <strong>BERT base</strong> architecture has <strong>L=12, H=768, A=12</strong> and a total of around 110M parameters. Here <strong>L</strong> refers to the number of transformer blocks, <strong>H</strong> refers to the hidden size, <strong>A</strong> refers to the number of self-attention head. For <strong>BERT large</strong>, <strong>L=24, H=1024, A=16.</strong></p><figure><img alt="BERT: State of the Art NLP Model, Explained" src="https://cdn-images-1.medium.com/proxy/0*m_kXt3uqZH9e7H4w.png" /><figcaption>Source:- <a href="https://www.kdnuggets.com/2018/12/bert-sota-nlp-model-explained.html">https://www.kdnuggets.com/2018/12/bert-sota-nlp-model-explained.html</a></figcaption></figure><p>The input format of the BERT is given in the above image. I won’t get into much detail into this. You can refer the above link for a more detailed explanation.</p><h3>Source Code</h3><p>The code which I will be following can be cloned from the following <strong>HuggingFace’s</strong> GitHub repo -</p><blockquote><a href="https://github.com/huggingface/transformers/">https://github.com/huggingface/transformers/</a></blockquote><h3>Scripts to be used</h3><p>Majorly we will be modifying and using two scripts for our text classification task. One is <strong><em>glue.py, </em></strong>and the other will be <strong><em>run_glue.py. </em></strong>The file glue.py path is “<em>transformers/data/processors/” </em>and the file run_glue.py can be found in the location “<em>examples/text-classification/”.</em></p><h3>Format of data</h3><p>The format of the data is something like this. The first column is supposed to be the id column. The second column is supposed to be the column containing the labels. 
<h3>Source Code</h3><p>The code I will be following can be cloned from <strong>HuggingFace’s</strong> GitHub repo -</p><blockquote><a href="https://github.com/huggingface/transformers/">https://github.com/huggingface/transformers/</a></blockquote><h3>Scripts to be used</h3><p>Mainly, we will be modifying and using two scripts for our text classification task. One is <strong><em>glue.py, </em></strong>and the other is <strong><em>run_glue.py. </em></strong>The file glue.py is located at “<em>transformers/data/processors/”, </em>and the file run_glue.py can be found at “<em>examples/text-classification/”.</em></p><h3>Format of data</h3><p>The format of the data is as follows. The first column is supposed to be the id column, the second column is supposed to contain the labels, and the third column should contain the text that is to be classified.</p><pre>data = pd.DataFrame()<br>data[&#39;id&#39;] = [i for i in range(num_text)]<br>data[&#39;label&#39;] = labels<br>data[&#39;text&#39;] = text</pre><p>Here, <em>num_text</em> is the number of data points, <em>text</em> is the list of texts to be classified, and <em>labels</em> is the list of labels associated with the corresponding texts. You should save your data in <em>tsv</em> format without headers.</p><pre>#if data is your training file <br>data.to_csv(&#39;train.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><pre><br>#if data is your validation file<br>data.to_csv(&#39;dev.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><pre><br>#if data is your test file<br>data.to_csv(&#39;test.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><p>In your test file, you can omit the labels column if you want. I kept it because it can be used to check the model’s performance after prediction. Also, the file names can be chosen according to your convenience, but the corresponding file names then need to be changed in <em>glue.py</em>.</p><h3>Changes to be made in the script</h3><h4>glue.py</h4><blockquote>path — transformers/data/processors/glue.py</blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/636bf5c4ee379e428d4074ea8a871675/href">https://medium.com/media/636bf5c4ee379e428d4074ea8a871675/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3f6fbb75dbae28c9ff3ee5ce2017b37a/href">https://medium.com/media/3f6fbb75dbae28c9ff3ee5ce2017b37a/href</a></iframe><p>For classification purposes, one of these tasks can be selected: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.</p><p>I will continue with the SST-2 task; the same changes can be made for the other tasks as well. The class to be modified is the SST-2 processor class in glue.py.</p><p>The following changes need to be made -</p><ol><li>Change the function <strong><em>get_labels()</em></strong>’s return list from [‘0’, ‘1’] to the list of labels present in your data.</li><li>In the <strong><em>_create_examples()</em></strong> function, change -</li></ol><pre>text_a = line[text_index]</pre><pre>⬇ ⬇ ⬇(to)</pre><pre>text_a = line[-1]</pre><p>3. In the dictionary defined as <strong><em>glue_task_num_labels, </em></strong>change the value of the key <strong><em>‘sst-2’ </em></strong>to the number of labels present in your data.</p><h4>run_glue.py</h4><blockquote>path — examples/text-classification/run_glue.py</blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5fadf9cf97b84095554ff2368cb5b95b/href">https://medium.com/media/5fadf9cf97b84095554ff2368cb5b95b/href</a></iframe><p>Changing this file is optional. Only make changes in it if you want to save the probabilities along with the predictions. Predictions can be saved by writing the predictions array shown in the above code to a text file.</p>
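<p>As a rough, hypothetical sketch of that idea (the names below are illustrative and are not taken from run_glue.py), assuming the predictions array contains one row of logits per example:-</p><pre>import numpy as np<br><br>def save_probabilities(predictions, path=&#39;probabilities.txt&#39;):<br>    # softmax over the label dimension turns the logits into class probabilities<br>    shifted = predictions - predictions.max(axis=1, keepdims=True)<br>    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)<br>    np.savetxt(path, probs, fmt=&#39;%.6f&#39;)<br><br>save_probabilities(np.array([[1.2, -0.3], [0.1, 2.5]]))</pre>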
<h3>How to run the scripts</h3><p>The script can be run with the following command -</p><pre>python ./examples/text-classification/run_glue.py \<br>    --model_name_or_path bert-base-uncased \<br>    --task_name $TASK_NAME \<br>    --do_train \<br>    --do_eval \<br>    --data_dir $GLUE_DIR/$TASK_NAME \<br>    --max_seq_length 128 \<br>    --per_device_eval_batch_size=8   \<br>    --per_device_train_batch_size=8   \<br>    --learning_rate 2e-5 \<br>    --num_train_epochs 3.0 \<br>    --output_dir /tmp/$TASK_NAME/</pre><p>I will explain the main parameters one by one -</p><ol><li>--model_name_or_path: specifies the BERT model type which you want to use.</li><li>--task_name: specifies the GLUE task to use, one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE or WNLI.</li><li>--do_train: specifies whether you want to fine-tune the model or not. If you only want to use the pre-trained model, remove this parameter.</li><li>--do_eval: pass this if you want to perform validation.</li><li>--do_predict: pass this if you want to generate predictions on the test data.</li><li>--data_dir: directory path where the training, validation and test data are saved.</li><li>--output_dir: directory path where you want to save the predictions and other generated files.</li></ol><p>All other parameters should be self-explanatory from their names. Another setting which can be used is -</p><ul><li>--save_total_limit: specifies how many checkpoint directories you want to save.</li></ul><p>All this is sufficient to fine-tune BERT on the text classification task.</p><p>That’s all from my side this time. I hope this blog post helps you in completing the specified task.</p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=723f82786f61" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61">How to fine-tune BERT on text classification task?</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>