Sentiment Analysis (Part-2)

Paavana Reddy · Published in TechJingles · 5 min read · Apr 27, 2021

Hello everyone! I’m Paavana, a third-year student pursuing a Bachelor of Technology in Computer Science. As part of my degree coursework, I did this project, “SENTIMENT ANALYSIS USING MACHINE LEARNING AND LEXICAL ANALYSIS”, along with my teammates Nikhita, Pavithra, and Rutuja.

Go check out the previous post, Sentiment Analysis (Part-1), for an overview of our project! The link is given below.

I have listed all the hardware and software requirements in the video. Do check out the slideshow below, which gives a short overview of the requirements.

FEATURE EXTRACTION

The preprocessed dataset has many distinctive properties. In the feature extraction step, we extract aspects from the processed dataset. These aspects are later used to compute the positive and negative polarity of a sentence, which is useful for determining the opinions of individuals.

Machine learning techniques require the key features of a text or document to be represented for processing. Some examples of features that have been reported are:

Words And Their Frequencies

Parts Of Speech Tags

Opinion Words And Phrases

Position Of Terms

Negation

Syntax
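The first of these feature types, words and their frequencies, can be illustrated with a minimal sketch using only Python’s standard library (the sample tweets below are invented for the example; in our project the text comes from the preprocessed dataset):

```python
from collections import Counter
import re

# Invented stand-ins for preprocessed tweets.
tweets = [
    "i love this so much",
    "ugh i miss you so bad",
    "work today was bad",
]

# Tokenize on word characters and count occurrences across the corpus.
tokens = [t for tweet in tweets for t in re.findall(r"\w+", tweet.lower())]
frequencies = Counter(tokens)

print(frequencies.most_common(3))
```

Each of the other feature types (POS tags, negation, syntax) needs heavier tooling, but the idea is the same: turn raw text into countable, comparable properties.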

WORD CLOUD

The first text visualization I chose is the controversial word cloud. A word cloud represents word usage in a document by resizing individual words proportionally to their frequency and then presenting them in a random arrangement.

The two methods of automatically annotating sentiment at the word level are:
(1) Dictionary-Based Approaches
(2) Corpus-Based Approaches
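A dictionary-based approach can be sketched with a toy lexicon. The word lists below are invented for illustration only; real systems use published lexicons such as SentiWordNet or the Opinion Lexicon:

```python
# Toy sentiment dictionary (invented for this example).
POSITIVE = {"love", "great", "good", "happy"}
NEGATIVE = {"miss", "bad", "sad", "damn", "ugh"}

def word_sentiment(word):
    """Annotate a single word as positive, negative, or neutral."""
    if word in POSITIVE:
        return "positive"
    if word in NEGATIVE:
        return "negative"
    return "neutral"

print(word_sentiment("love"))   # positive
print(word_sentiment("today"))  # neutral
```

A corpus-based approach, by contrast, infers word polarity from how words co-occur with known seed words in a large corpus, rather than from a fixed list.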

Textual analysis is the most important analysis in the case of tweets: it gives a general idea, in a quick-and-dirty way, of what kinds of words are frequent in the corpus.

For the word cloud, I used the Python library `wordcloud`.

Some of the big words can be read as quite neutral, such as “today”, “now”, etc. Some of the smaller words make sense in negative tweets, such as “damn”, “ugh”, “miss”, “bad”, etc. But “sad” appears in a rather big size.

OK, even though some of these tweets contain the word “love”, in these cases the sentiment is negative, because the tweet mixes emotions, such as “love” together with “miss”, or uses the word sarcastically.

Some of the big words can be read as quite neutral, such as “got”, “going”, etc. Some of the smaller words make sense in positive tweets, such as “bet”, “dont”, “small”, “want”, etc. And “love” appears in a rather big size.

Interestingly, the word “work” was quite big in the negative word cloud, but also quite big in the positive word cloud. This might imply that many people express negative sentiment towards work, but also that many people are positive about work.

PREPARATION FOR DATA VISUALISATION

In order to implement a couple of data visualizations in the next step, I need term-frequency data: what kinds of words are used in the tweets, and how many times each is used across the entire corpus. I used a count vectorizer to calculate the term frequencies; the count vectorizer also offers parameter options such as removing stop words and limiting the maximum number of terms.

I kept stop words included and did not limit the maximum number of terms.

OK, it looks like the count vectorizer has extracted 3,037 words from the corpus.

Term frequencies for each class can then be obtained from the document matrix.

For the part below, I succeeded in processing the data in batches by slicing the document matrix with `document_matrix[start_index:end_index].toarray()`.
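That batching idea can be sketched as follows (the batch size and corpus are invented; the point is that only one slice at a time is ever converted to a dense array):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus.
tweets = ["love this", "miss you", "so bad", "so good", "ugh work"]
cv = CountVectorizer()
document_matrix = cv.fit_transform(tweets)

batch_size = 2
total = np.zeros(document_matrix.shape[1])

# Densify one slice at a time, so the full corpus never has to fit
# in memory as a dense array.
for start_index in range(0, document_matrix.shape[0], batch_size):
    end_index = start_index + batch_size
    batch = document_matrix[start_index:end_index].toarray()
    total += batch.sum(axis=0)

print(total)
```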

Zipf’s Law

Zipf’s law is a relation between rank order and frequency of occurrence: it states that when observations (e.g., words) are ranked by their frequency, the frequency of a particular observation is inversely proportional to its rank, Frequency ∝ 1/Rank.
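Equivalently, under an ideal Zipf distribution the product frequency × rank is roughly constant. A quick sanity check on an invented frequency table:

```python
# Invented frequency table, already sorted in descending order.
frequencies = [1000, 500, 333, 250, 200]

# Under an ideal Zipf distribution, rank * frequency stays near the
# frequency of the top-ranked item.
for rank, freq in enumerate(frequencies, start=1):
    print(rank, freq, rank * freq)
```

On real corpora this product drifts, which is exactly the deviation the plot below examines.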

The X-axis shows the rank of the frequency, from the highest rank on the left up to the 500th rank on the right. The Y-axis shows the frequency observed in the corpus (in this case, the Sentiment140 dataset).

Even though the plot follows the trend of Zipf’s law, it looks like it has more area above the expected Zipf curve for the higher-ranked words.

So I will conclude the second part of the project here, having created word clouds for positive and negative sentiments and applied Zipf’s law along with data visualization. In the upcoming posts, we will create a Zipf plot for tweet tokens, visualize tweet tokens, and look at the top 50 negative and top 50 positive tokens. (My team members will interact with you further in the upcoming posts.)

Thanks for reading the post; I hope you found it resourceful. See you all in the upcoming posts!
