Natural Language Processing Using Python (Part 1)

Angelo Bustamante — Mon, 06 Jun 2022 17:08:02 GMT

In this article, I will show you how to use Natural Language Processing (NLP), and more specifically sentiment analysis, to understand how people really feel about a subject.

Install Packages

Make sure you have pip and setuptools installed on your system. Don’t use Python 2, as it has been discontinued. If you have Python 3 (>= 3.4), pip is normally already included, so you won’t need to install it separately. If you already have Python 3, make sure you have upgraded to a recent version, since current releases of pandas and NLTK no longer support the oldest Python 3 versions.

If you do not have Python installed on your system, then feel free to check out this tutorial.

Check whether your pip or pip3 command is linked to Python 3, and use the one that belongs to the Python version (>= 3.4) you plan to use in this tutorial. You can also type python in the terminal to see which version it starts. If it shows 2.7, try typing python3. If that works, you have two different Python versions installed on your system.

To install the required packages, run the following commands in your terminal:

pip install pandas
pip install nltk
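
If you are unsure which interpreter a given command starts, a short script can report it. The following is a minimal sketch and not part of the original code; the names MIN_VERSION and describe_interpreter are made up for illustration.

```python
import sys

# Minimum version mentioned in this article
MIN_VERSION = (3, 4)


def describe_interpreter():
    """Return the path and version of the interpreter running this script."""
    version = sys.version_info[:3]
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in version),
        "supported": version >= MIN_VERSION,
    }


info = describe_interpreter()
print(info)
```

Save it to a file and run it once with python and once with python3 to see whether the two commands point to different installations.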

Import Packages

Here are the required packages:

import pandas
import re, string
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
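
The imports above only load the NLTK code. The stopword list, the tokenizer models, and WordNet are separate data packages that usually have to be downloaded once. Here is a minimal sketch, assuming the resource names below match your NLTK version (newer releases look for punkt_tab where older ones used punkt). The helper name download_nltk_resources is hypothetical.

```python
# Data packages used by stopwords, word_tokenize and WordNetLemmatizer.
# Assumption: these names match the NLTK data index; requesting both
# "punkt" and "punkt_tab" is meant to cover older and newer NLTK releases.
NLTK_RESOURCES = ("stopwords", "punkt", "punkt_tab", "wordnet", "omw-1.4")


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource and return {name: True/False} for success."""
    import nltk  # imported here so the list above can be read without NLTK installed

    return {name: nltk.download(name, quiet=True) for name in resources}
```

Call download_nltk_resources() once before running the rest of the code. It needs an internet connection the first time.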

Getting the data

The pandas package is one of the most convenient ways to import a dataset and represent it in a tabular row-column format. The library is built on top of Numerical Python, popularly known as NumPy, and provides easy-to-use data structures and data analysis tools for the Python programming language. Pandas has built-in functions that you can use to analyze and plot your data and make sense of it!

Because of the power and flexibility this library provides, it has become a first choice for many data scientists. Of course, it has some disadvantages: loading, reading, and analyzing large datasets with millions of records can be slow.

To read .xlsx files into a DataFrame, pandas provides the read_excel() function. Depending on your pandas version, you may also need to install the openpyxl package for this to work. Here’s an example of how you can use this function:

# Assign spreadsheet filename to `file`
file = 'example.xlsx'

# Load spreadsheet
oExcelData = pandas.read_excel(file)
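
Spreadsheet columns often contain empty cells or numbers mixed in with the text, which would break string operations later on. Here is a minimal sketch of tidying a column after loading. The DataFrame is built in memory so the example runs without a file, and the column name comment is a placeholder for your own column.

```python
import pandas

# Stand-in for the result of pandas.read_excel(file);
# "comment" is a made-up column name used only for this example
oExcelData = pandas.DataFrame({"comment": ["Great product!", None, 42]})

# Drop rows where the text cell is empty
oExcelData = oExcelData.dropna(subset=["comment"])

# Make sure every remaining cell is a string
oExcelData["comment"] = oExcelData["comment"].astype(str)

print(oExcelData["comment"].tolist())
```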

Cleaning the data

Tokenization is the first step in text analytics. It is the process of breaking down a paragraph of text into smaller chunks, such as words or sentences. A token is a single unit that serves as a building block for a sentence or paragraph.
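
To make the idea concrete, the sketch below splits a short text into sentence tokens and word tokens. It is a simplified stand-in built on regular expressions, not NLTK's tokenizer, and the function names are made up for illustration.

```python
import re


def simple_sentence_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (naive: abbreviations such as 'Dr.' will confuse it)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Return runs of letters, digits and apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)


sample = "I love this phone. The battery lasts all day!"

sentences = simple_sentence_tokenize(sample)
words = simple_word_tokenize(sample)

print(sentences)
print(words)
```

A real tokenizer such as nltk.word_tokenize handles many more cases (contractions, abbreviations, punctuation tokens), which is why the article uses it.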

Here is the function used to clean the text:

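# Clean text
def cleanText(text):
    text = text.lower()                                   # lowercase everything
    text = re.sub(r'@', '', text)                         # drop the "@" from mentions
    text = re.sub(r'\[.*?\]', '', text)                   # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)     # remove links
    text = re.sub(r'<.*?>+', '', text)                    # remove HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\n', ' ', text)                       # replace line breaks with a space so words don't merge
    text = re.sub(r'\w*\d\w*', '', text)                  # remove words that contain digits
    text = re.sub(r'[^a-zA-Z ]+', '', text)               # keep only letters and spaces (drops emojis etc.)

    return text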

# Tokenize text
def tokenizeText(text):
    # A set makes the stopword membership test faster than a list
    oStopWords = set(stopwords.words('english'))
    text = cleanText(text)
    # Tokenize the data
    text = nltk.word_tokenize(text)
    # Remove stopwords
    text = [w for w in text if w not in oStopWords]

    return text

Here is what the functions do:

Convert the text to lowercase and remove punctuation, numbers, emojis, links, etc. Basically, removing everything that is not a letter or a space.
Tokenize the data into words, which means breaking up every comment into a list of individual words.
Remove all stopwords, which are words that don’t add value to a comment, like “the”, “a”, “and”, etc.
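
To see what the cleaning step does to a concrete string, the sketch below runs the same regular expressions as cleanText, in the same order, on a made-up sample comment. It uses only the standard library; the names PATTERNS and clean_sample and the sample text are invented for this example.

```python
import re
import string

# Same patterns as in cleanText, in the same order (the line-break step is
# left out because the sample below is a single line)
PATTERNS = [
    r'@',                                    # "@" from mentions
    r'\[.*?\]',                              # text in square brackets
    r'https?://\S+|www\.\S+',                # links
    r'<.*?>+',                               # HTML tags
    '[%s]' % re.escape(string.punctuation),  # punctuation
    r'\w*\d\w*',                             # words containing digits
    r'[^a-zA-Z ]+',                          # anything that is not a letter or space
]


def clean_sample(text):
    text = text.lower()
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text)
    return text


sample = "Check THIS out @john: https://example.com <b>50% off</b> [ad] on 2nd item!!!"
words = clean_sample(sample).split()

print(words)
# ['check', 'this', 'out', 'john', 'off', 'on', 'item']
```

The stopword step in tokenizeText would then typically drop common words such as “this” and “on” from this list.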

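Let’s now apply the function to the data. Here oExcelData is the DataFrame loaded from the spreadsheet and sColumnName is the name of the column that holds the text. Calling str() makes sure every cell is passed in as a string:

oExcelData[sColumnName] = oExcelData[sColumnName].apply(lambda sText: tokenizeText(str(sText)))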

Lemmatization

Lemmatization reduces words to their base form, called a lemma, which is a linguistically correct word. It does this using a vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context. For example, the word “better” has “good” as its lemma. Stemming misses this link because finding it requires a dictionary look-up.

The WordNetLemmatizer class from nltk.stem does just that. Here is the code:

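# Lemmatizer
def lem(text):
    oLemmatizer = WordNetLemmatizer()
    # First pass: lemmatize each token as a noun (the default part of speech)
    text = [oLemmatizer.lemmatize(t) for t in text]
    # Second pass: lemmatize each token as a verb
    text = [oLemmatizer.lemmatize(t, 'v') for t in text]

    return text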
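
The difference between stemming and lemmatization described above can be illustrated with a toy example. This is only a sketch: the suffix rules and the three-word dictionary are invented for illustration and are far simpler than what NLTK's stemmers and WordNet actually do.

```python
# Toy lookup table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"better": "good", "mice": "mouse", "went": "go"}


def toy_stem(word):
    """Strip a common suffix using rules only, with no knowledge of the word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; leave unknown words unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ("better", "mice", "walked"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based stemmer handles “walked” but leaves “better” and “mice” untouched, while the dictionary look-up maps them to “good” and “mouse”. A real lemmatizer combines a large dictionary with morphological rules, so it covers both kinds of cases.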

Conclusion

In this first part of the Natural Language Processing Using Python series, you learned how to read an Excel file using pandas, clean and tokenize your data, remove stopwords, and apply lemmatization.

In the next part, my groupmate writes about analyzing your data. Check it out here!

Full code is available right here.

Thanks for reading and happy coding!