<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Angelo Bustamante on Medium]]></title>
        <description><![CDATA[Stories by Angelo Bustamante on Medium]]></description>
        <link>https://medium.com/@qacbustamante?source=rss-33ac81ea660b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*FplV-lIBiUr_XLy5</url>
            <title>Stories by Angelo Bustamante on Medium</title>
            <link>https://medium.com/@qacbustamante?source=rss-33ac81ea660b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 18 May 2026 11:16:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@qacbustamante/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Natural Language Processing Using Python (Part 1)]]></title>
            <link>https://medium.com/@qacbustamante/natural-language-processing-using-python-part-1-958d4ea1846e?source=rss-33ac81ea660b------2</link>
            <guid isPermaLink="false">https://medium.com/p/958d4ea1846e</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pandas-dataframe]]></category>
            <dc:creator><![CDATA[Angelo Bustamante]]></dc:creator>
            <pubDate>Mon, 06 Jun 2022 17:08:02 GMT</pubDate>
            <atom:updated>2022-07-18T15:16:33.936Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pbEak-7yWOpouipducnmGA.jpeg" /></figure><p>In this article, I will show you how to use Natural Language Processing (NLP), and more specifically sentiment analysis, to understand how people really feel about a subject.</p><h4>Install Packages</h4><p>Make sure you have pip and setuptools installed on your system. Don’t use Python 2, as it has been discontinued. If you have Python 3 &gt;=3.4 installed, you won’t need to worry, because pip is normally already included. If you already have Python 3, just make sure you have upgraded to the latest version.</p><p>If you do not have Python installed on your system, then feel free to check out this tutorial.</p><p>Check whether your pip or pip3 command is linked to Python 3, and use the one that is linked to the version of Python (&gt;=3.4) you plan to use in this tutorial. Also check which version is reported when you type python in the terminal. If it is 2.7, try typing python3; if that works, you have two different Python versions installed on your system.</p><p>To install the packages, run the following commands in your terminal (openpyxl is the engine pandas uses to read .xlsx files):</p><pre>pip install pandas<br>pip install nltk<br>pip install openpyxl</pre><h4>Import Packages</h4><p>Here are the required packages:</p><pre>import pandas<br>import re, string<br>import os<br>import nltk<br>from nltk.corpus import stopwords<br>from nltk.stem import WordNetLemmatizer<br><br># One-time download of the NLTK data used below<br># (newer NLTK releases may ask for &#39;punkt_tab&#39; instead of &#39;punkt&#39;)<br>nltk.download(&#39;stopwords&#39;)<br>nltk.download(&#39;punkt&#39;)<br>nltk.download(&#39;wordnet&#39;)</pre><h4>Getting the data</h4><p>The Pandas package is one of the most common ways to import your dataset and represent it in a tabular row-column format. The Pandas library is built on top of Numerical Python, popularly known as NumPy, and provides easy-to-use data structures and data analysis tools for the Python programming language. 
Pandas has built-in functions that you can use to analyze and plot your data and make sense of it!</p><p>Because of the power and flexibility this library provides, it has become a first choice for many data scientists. Of course, the library has some disadvantages: it can be slow at loading, reading, and analyzing big datasets with millions of records.</p><p>To read .xlsx files, Pandas provides the read_excel() function, which loads the data into a DataFrame. Here’s an example of how you can use this function:</p><pre># Assign spreadsheet filename to `file`<br>file = &#39;example.xlsx&#39;</pre><pre># Load spreadsheet into a DataFrame<br>oExcelData = pandas.read_excel(file)</pre><h4>Cleaning the data</h4><p>Tokenization is the first step in text analytics. It is the process of breaking a text paragraph down into smaller chunks, such as words or sentences. A token is a single entity that serves as a building block of a sentence or paragraph.</p><p>Here are the functions used to clean and tokenize the text:</p><pre># Clean text<br>def cleanText(text):<br>    text = text.lower()  # lowercase everything<br>    text = re.sub(r&#39;@&#39;, &#39;&#39;, text)  # drop @ signs<br>    text = re.sub(r&#39;\[.*?\]&#39;, &#39;&#39;, text)  # drop text in square brackets<br>    text = re.sub(r&#39;https?://\S+|www\.\S+&#39;, &#39;&#39;, text)  # drop links<br>    text = re.sub(r&#39;&lt;.*?&gt;+&#39;, &#39;&#39;, text)  # drop HTML tags<br>    text = re.sub(&#39;[%s]&#39; % re.escape(string.punctuation), &#39;&#39;, text)  # drop punctuation<br>    text = re.sub(r&#39;\n&#39;, &#39;&#39;, text)  # drop newlines<br>    text = re.sub(r&#39;\w*\d\w*&#39;, &#39;&#39;, text)  # drop words containing digits<br>    text = re.sub(r&quot;[^a-zA-Z ]+&quot;, &quot;&quot;, text)  # keep only letters and spaces<br>    <br>    return text</pre><pre># Tokenize Text<br>def tokenizeText(text):<br>    oStopWords = set(stopwords.words(&#39;english&#39;))  # set for fast lookups<br>    text = cleanText(text)<br>    # Tokenize the data<br>    text = nltk.word_tokenize(text)<br>    # Remove stopwords<br>    text = [w for w in text if w not in oStopWords]<br><br>    return text</pre><p>Here is what the tokenizeText function 
does:</p><ul><li>Lowercase the text and remove punctuation, emojis, links, digits, etc. Basically, remove everything that is not a letter or a space.</li><li>Tokenize the data into words, which means breaking up every comment into a list of individual words.</li><li>Remove all stopwords, which are words that don’t add value to a comment, like “the”, “a”, “and”, etc.</li></ul><p>Let’s now apply the function to the data. In the line below, sColumnName is the name of the column that holds the text, and convertToString is a helper that is not shown in this article:</p><pre>oExcelData[sColumnName] = oExcelData[sColumnName].apply(lambda sText: tokenizeText(convertToString(sText)))</pre><h4>Lemmatization</h4><p>Lemmatization reduces words to their base form, a linguistically correct lemma. It finds that base form with the use of vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming, because a stemmer works on an individual word without knowledge of the context. For example, the word “better” has “good” as its lemma. Stemming misses this because it requires a dictionary look-up.</p><p>The WordNetLemmatizer class from nltk.stem does just that. The code below lemmatizes each token first as a noun and then as a verb, so adjective cases such as “better” are not covered by it. Here is the code:</p><pre># Lemmatizer<br>def lem(text):<br>    oLemmatizer = WordNetLemmatizer()<br>    # Lemmatize each token as a noun (the default part of speech)<br>    text = [oLemmatizer.lemmatize(t) for t in text]<br>    # Lemmatize again, this time treating each token as a verb<br>    text = [oLemmatizer.lemmatize(t, &#39;v&#39;) for t in text]<br><br>    return text</pre><h4>Conclusion</h4><p>In this first part of the Natural Language Processing Using Python series, you learned how to read an Excel file using pandas, tokenize your data, remove stopwords, and lemmatize words.</p><p>In the next part, my groupmate writes about analyzing your data. Check it out <a href="https://medium.com/@mmjacosta/natural-language-processing-using-python-sample-part-2-4d651570bb89">here</a>!</p><p>Full code is available right <a href="https://github.com/bustamantegelo/NLP">here</a>.</p><p>Thanks for reading and happy coding!</p>]]></content:encoded>
        </item>
    </channel>
</rss>