Sentiment Analysis (Using NLP)

Dipam Sarkar (idatawiz) · Published in GatorHut · 7 min read · Aug 12, 2023
Hidden Sentiment

Words are fascinating entities that can take on diverse meanings based on the tone and context in which they are used. The way we express ourselves using the same words can vary greatly, leading to different interpretations and connotations. This intricacy of language allows for a rich tapestry of communication, where slight changes in intonation and emphasis can significantly alter the intended message or sentiment conveyed by those words.

Now, let’s discuss what exactly sentiment analysis is.

Sentiment analysis involves the classification of a given text into categories of positive, negative, or neutral sentiment. This process aims to decipher people’s opinions in a way that provides valuable insights for businesses to enhance their operations. It goes beyond just determining polarity; it also delves into emotions, such as happiness, sadness, and anger. The methodology of sentiment analysis utilizes a variety of Natural Language Processing algorithms.

Let us explore the inner workings of sentiment analysis through an example involving user reviews of the movie “Oppenheimer”. However, before diving into sentiment analysis or any other text analysis technique, there are several crucial pre-processing steps to undertake.

Before embarking on text pre-processing, I transformed the list of reviews into a DataFrame. This move enabled me to execute the subsequent steps with greater ease and flexibility: a DataFrame offers a wealth of options that streamline the entire process, contributing to a more efficient and organized approach to the task at hand.
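As a minimal sketch of that step, assuming the reviews were collected into a plain Python list (the sample reviews and variable names here are purely illustrative):

import pandas as pd

# Illustrative stand-ins for the scraped "Oppenheimer" user reviews
reviews_list = [
    "Absolutely loved it! A masterpiece.",
    "Too long and slow, I was disappointed.",
    "It was fine. Nothing special.",
]

# Wrap the list in a single-column DataFrame; 'reviw' matches the
# column name used throughout the rest of this article
review = pd.DataFrame({'reviw': reviews_list})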

Text Pre-Processing

Step 1:

To start, we address the cleaning of the text by utilizing the following code.

import re

# Replace every character that is not a word character or whitespace with a space
review['reviw'] = review['reviw'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))

In this code, a lambda function is employed to replace all special characters with blank spaces, effectively preparing the text for further analysis.

Step 2:

Another crucial element of the cleaning process involves transforming all text data into lowercase. The following code accomplishes this task.

review["reviw"] = review["reviw"].apply(str.lower)

Converting to lowercase is a form of text normalization. It helps in reducing the complexity of the text data and ensures that words are treated the same way regardless of their case.

Step 3:

import pandas as pd
from stop_words import get_stop_words

def remove_stopwords(df, column):
    # Build the English stop-word list once
    stopwords_list = get_stop_words('english')
    # Keep only the words that do not appear in the stop-word list
    df[column] = df[column].apply(
        lambda text: ' '.join(word for word in text.split() if word not in stopwords_list)
    )
    return df

remove_stopwords(review, 'reviw')

The above code removes stop words from all the reviews. But what are stop words?

Stopwords are words that are commonly used in a language but are generally considered to be of little value in terms of conveying meaningful information. These words are often filtered out from text data during natural language processing tasks like text analysis, sentiment analysis, and topic modeling. Stopwords include words like “and,” “the,” “is,” “in,” “to,” “of,” and so on.

And why is it important to remove them?

Because stopwords occur so frequently yet contribute so little to the overall understanding of the content, removing them helps reduce noise in the data.

Also, by removing stopwords, you’re left with the words that are more likely to carry the sentiment and actual content of the text. This helps the sentiment analysis model better capture the sentiment-bearing terms.
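To see the effect, here is a quick before-and-after run of the remove_stopwords helper defined above on a single made-up review:

demo = pd.DataFrame({'reviw': ["the movie is a masterpiece and the cast is brilliant"]})
remove_stopwords(demo, 'reviw')
print(demo['reviw'][0])
# prints something like: "movie masterpiece cast brilliant"
# (the exact tokens kept depend on the stop-words list)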

After removing stopwords from the review column, the reviews look like this:

Reviews after cleaning

Now that we are done with the text pre-processing steps, we can move on to sentiment analysis.

So we will calculate both subjectivity and polarity for each review. But wait, you might be thinking: what is subjectivity, and how is it calculated for a text? Let’s explore this first.

Subjectivity in sentiment analysis refers to the extent to which a piece of text expresses personal opinions, emotions, or judgments rather than objective facts. In other words, it measures how much the text reflects the writer’s feelings or thoughts rather than being purely informative or neutral.

Now let’s try to understand how it is calculated and how to interpret it.

Text: “The movie was fantastic and emotional, but the ending was disappointing.”

Lexicon-based Approach:

Count the occurrences of positive and negative words from the lexicon in the text.

Positive Words Count: 2 (fantastic, emotional)

Negative Words Count: 1 (disappointing)

Calculate Subjectivity Score:

Calculate the subjectivity score using the formula:

Subjectivity Score = (Positive Words Count + Negative Words Count) / Total Words Count

Total Words Count = 11 (every token in the sentence)

Subjectivity Score = (2 + 1) / 11 ≈ 0.27

Interpretation:

The calculated subjectivity score of roughly 0.27 indicates that around 27% of the text’s content is subjective (expressing opinions or emotions), while the remaining 73% is objective (factual information).

A score of 0 might indicate high objectivity, while a score of 1 might indicate high subjectivity.

High objectivity indicates that a text is focused on factual information without personal opinions, while high subjectivity indicates that a text is filled with personal feelings, opinions, or emotional expressions. The degree of objectivity and subjectivity in a text can be measured using various linguistic and computational methods in sentiment analysis.
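Here is a minimal sketch of that lexicon-based formula, with a tiny made-up word list (this is only an illustration of the idea, not how TextBlob computes its score):

# Tiny illustrative lexicons; a real system would use a full sentiment lexicon
POSITIVE = {"fantastic", "emotional"}
NEGATIVE = {"disappointing"}

def lexicon_subjectivity(text):
    # Share of opinion-bearing words: (positive + negative) / total
    words = [w.strip('.,!?').lower() for w in text.split()]
    opinion_words = sum(w in POSITIVE or w in NEGATIVE for w in words)
    return opinion_words / len(words) if words else 0.0

text = "The movie was fantastic and emotional, but the ending was disappointing."
print(round(lexicon_subjectivity(text), 2))  # 3 opinion words / 11 words = 0.27

In practice, TextBlob does this work for us: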

from textblob import TextBlob

def subjectivity(text):
    # TextBlob returns a subjectivity value between 0 (objective) and 1 (subjective)
    return TextBlob(text).sentiment.subjectivity

review['sub_score'] = review['reviw'].apply(subjectivity)

So, from the TextBlob library I have used sentiment.subjectivity to calculate a subjectivity score for each review. Note that TextBlob derives this value from its own built-in lexicon rather than the simple word-ratio formula above.

Subjectivity Score

Now let’s focus on what a polarity score is and how it is calculated.

Polarity score in text analytics refers to a numerical value that indicates the sentiment or emotional tone of a piece of text. It is used to determine whether the text expresses a positive, negative, or neutral sentiment. Polarity score typically ranges from -1 to 1, where:

A score closer to 1 indicates a positive sentiment.

A score closer to -1 indicates a negative sentiment.

A score around 0 indicates a neutral sentiment.

Example:

“I absolutely loved the movie! It was a masterpiece.”

Tokenization: Tokenize the text into individual words:

[“I”, “absolutely”, “loved”, “the”, “movie!”, “It”, “was”, “a”, “masterpiece.”]

Polarity Score Assignment: For each word, assign a polarity score. Let’s assume we have a polarity lexicon with the following scores:

“I”: 0.0 (neutral)

“absolutely”: 0.8 (positive)

“loved”: 0.9 (positive)

“the”: 0.0 (neutral)

“movie!”: 0.7 (positive)

“It”: 0.0 (neutral)

“was”: 0.0 (neutral)

“a”: 0.0 (neutral)

“masterpiece.”: 0.9 (positive)

Polarity Score Aggregation: Calculate the average polarity score for all the words.

Average polarity score = (0.0 + 0.8 + 0.9 + 0.0 + 0.7 + 0.0 + 0.0 + 0.0 + 0.9) / 9 = 3.3 / 9 ≈ 0.37

The positive average indicates an overall positive sentiment.
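As a minimal sketch of this word-averaging scheme, here is the same calculation in code, using the made-up lexicon scores from the example (words missing from the lexicon count as neutral, 0.0):

# Made-up polarity lexicon matching the worked example above
POLARITY_LEXICON = {
    "absolutely": 0.8,
    "loved": 0.9,
    "movie": 0.7,
    "masterpiece": 0.9,
}

def lexicon_polarity(text):
    # Average the per-word polarity scores over all tokens
    words = [w.strip('.,!?').lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(POLARITY_LEXICON.get(w, 0.0) for w in words) / len(words)

print(round(lexicon_polarity("I absolutely loved the movie! It was a masterpiece."), 2))  # 0.37

TextBlob implements a similar idea with its own, much larger lexicon: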

def polarity(text):
    # TextBlob returns a polarity between -1 (negative) and 1 (positive)
    return TextBlob(text).sentiment.polarity

review['pol_score'] = review['reviw'].apply(polarity)
review.head()

Utilizing the code, I have computed the polarity score, which serves as the foundation for categorizing sentiments into three distinct classes.

def pol_range(score):
    # Map a polarity score to one of three sentiment classes
    if score < 0:
        return "Negative"
    elif score == 0:
        return "Neutral"
    else:
        return "Positive"

review['Sentiment_Status'] = review['pol_score'].apply(pol_range)

Upon implementing the provided code, the resultant DataFrame takes on the following structure.

Sentiment Categories

Now it’s time to look at the number of positive and negative reviews. Based on that, we can draw a conclusion about what people think of this movie.

import matplotlib.pyplot as plt

# Scatter plot of polarity against subjectivity for all reviews
plt.figure(figsize=(8, 6))
plt.scatter(review['pol_score'], review['sub_score'], color='blue')
plt.title('Sentiment Analysis')
plt.xlabel('Polarity')
plt.ylabel('Subjectivity')
plt.show()
Subjectivity Vs Polarity

The scatter plot above shows considerable dispersion among the data points. It is apparent that the prevailing sentiment in the reviews is neutral, while the analysis also reveals a tendency towards higher subjectivity in these reviews.

# Count how many reviews fall into each sentiment class
sentiment_counts = review['Sentiment_Status'].value_counts()

plt.figure(figsize=(6, 6))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%',
        startangle=140, colors=['yellow', 'green', 'red'])
plt.axis('equal')  # keep the pie circular
plt.title('Sentiment Status Distribution')
plt.show()
Percentage of Sentiment Division

The pie chart provides a clear visual representation of the distribution of reviews. It’s evident that Neutral Reviews hold the majority, followed by Positive Reviews. Notably, the count of Positive Reviews is nearly twice that of Negative Reviews.

Now that we are done with the sentiment analysis, we can quickly visualize the essence of the positive reviews through a word cloud. This creative tool gives an immediate sense of the most prominent words in positive feedback: each word’s size denotes its frequency, so the word cloud provides a visual summary that captures the prevailing sentiments and themes expressed in favorable reviews.
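One detail the code below relies on is a filtered_text_list variable holding the text of the positive reviews. One plausible way to build it from the DataFrame above (an assumption, since the original construction isn’t shown) is:

# Assumption: collect the cleaned text of every review labelled "Positive"
filtered_text_list = review.loc[
    review['Sentiment_Status'] == 'Positive', 'reviw'
].tolist()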

from wordcloud import WordCloud

wordcloud = WordCloud(max_words=150, width=5000, height=5000,
                      background_color='white').generate(' '.join(filtered_text_list))

plt.figure(figsize=(12, 8))  # size the figure before drawing into it
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Positive reviews wordcloud

These terms seem to be anticipated when discussing any film by Nolan, don’t they? I’m inclined to believe that the term “Masterpiece” is virtually synonymous with Nolan’s movies. I’d love to hear your perspective on this.

