Sentiment Analysis on Covid-19 News

Sameer Kumar
6 min read · Jan 26, 2021


What is Sentiment Analysis?

Sentiment Analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text to determine whether the attitude towards a product or topic is positive, negative, or neutral.

Sentiment Analysis is one of the main applications of NLP.

In this article, we will work through a basic Sentiment Analysis project to analyze the sentiment of news related to Covid-19.

Steps followed:

  1. Importing the dataset and required libraries
  2. Text data preprocessing
  3. One Hot Representation
  4. Padding sequences and the embedding layer
  5. Creating the LSTM model with Dropout
  6. Metrics

Importing the Dataset

Here, positive sentiment is represented by 1 and negative sentiment by 0.

Sentiment is our dependent variable (y), and Headline and Description are our independent variables (x).

The dataset has 4,072 rows and 3 columns.
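A minimal sketch of this loading step with pandas; the file name and column names are assumptions based on the description above:

```python
import pandas as pd

# Load the dataset; the file name is assumed for illustration
df = pd.read_csv('covid19_news.csv')

print(df.shape)    # expected: (4072, 3)
print(df.columns)  # expected: Headline, Description, Sentiment

# Combine the two independent text columns into one input string per row
X = (df['Headline'] + ' ' + df['Description']).tolist()
y = df['Sentiment']  # dependent variable: 1 = positive, 0 = negative
```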

Text Data Preprocessing

In Sentiment Analysis and other NLP applications like Amazon Alexa, Google Translate, etc., text data is generally our input. We preprocess each and every word of a sentence and convert it into vectors. Let’s understand that in detail.

nltk is a well-known NLP library that helps with text preprocessing. A for loop runs through each sentence, processing every word.

We can use the re library for regular expressions, which helps us keep only the lowercase letters a to z and the capital letters A to Z and remove everything else.

The first step is to remove stop words like ‘the’, ‘of’, ‘is’, and ‘a’ from every sentence, as they do not contribute to the algorithm’s performance, and then perform stemming on each word, where we remove the suffix and reduce the word to its root, e.g. history → histori.

The PorterStemmer class helps with stemming, although lemmatization is a better approach as it performs a morphological analysis of the words.

We also lowercase the text so that the algorithm doesn’t treat ‘leave’ and ‘Leave’ differently, and then apply the split function to treat each word separately.

After stemming, we append the processed text to an initially empty corpus list, as shown in the sketch below.
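Putting these steps together, a sketch of the preprocessing loop; variable names are assumptions, and X is the combined headline and description text from the loading step:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []  # empty list that will hold the cleaned sentences

for sentence in X:
    # Keep only a-z and A-Z; replace everything else with a space
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    # Lowercase so 'leave' and 'Leave' are treated the same
    review = review.lower()
    # Split into individual words
    review = review.split()
    # Drop stop words and stem each remaining word to its root
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
```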

One Hot representation

One of the most important steps in NLP is converting text data to vectors so that the model generalizes well when making predictions. One word representation technique is One Hot Representation, where we assign an index to each word based on a vocabulary size.

Intuition of One Hot Representation

Consider a vocabulary of size 1,000 and a word ‘Man’ in a sentence that we want to convert to a vector. Suppose the word is present at the 500th index of the vocabulary. We convert ‘Man’ into a 1,000-dimensional vector of 0s and 1s, where the 1 sits at the 500th index (location) and all the rest are zero (a sparse, high-dimensional matrix).

‘Man’ → One Hot representation → [0,0,0,0….1,0,0…] of 1,000 dimensions

One of the disadvantages is that semantic information is not captured and the size is huge. It won’t treat ‘good’ and ‘great’ as similar.
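A tiny numpy sketch of this intuition, using the vocabulary size and index from the example above:

```python
import numpy as np

vocab_size = 1000
man_index = 500  # assumed position of 'Man' in the vocabulary

# One-hot vector for 'Man': all zeros except a single 1 at its index
man_vector = np.zeros(vocab_size)
man_vector[man_index] = 1

print(man_vector.sum())       # 1.0 - only one non-zero entry
print(man_vector[man_index])  # 1.0
```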

Index-based Vectors

In practice, a list comprehension converts each word to an index-based vector given a vocabulary size of 5,000, as sketched below.
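A minimal sketch of that list comprehension using Keras’s one_hot helper, assuming the corpus list from the preprocessing step:

```python
from tensorflow.keras.preprocessing.text import one_hot

voc_size = 5000

# Each word is hashed to an integer index in the range [1, voc_size)
onehot_repr = [one_hot(sentence, voc_size) for sentence in corpus]
print(onehot_repr[0])  # e.g. [4123, 871, 2956, ...]
```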

What is the solution to this?

We use the concept of Word Embedding. It overcomes the disadvantage of Bag of Words and TF-IDF, where semantic information is not captured.

In word embedding, we convert the words into vectors based on features.

It is a featurized representation of words, where similar words are represented by almost equal vector values for a particular feature.

Here we can see the vector representations of a few words like Boy, Girl, King, etc. based on certain features. The number of features decides the number of dimensions of each word vector; it does not depend on the vocabulary size.

Points

  1. The Gender feature is related to Boy, Girl, King, and Queen, but not to Apple and Mango. So vector values of 1 and -1 uniquely represent Boy/Girl and King/Queen.
  2. Royal is related to King and Queen, so those vector values will be higher and similar.

That is why feature representation is useful: it captures semantic information and reduces words to a dense, low-dimensional matrix, unlike One Hot Representation, which produces a sparse, high-dimensional matrix.
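In Keras, such feature dimensions are not hand-crafted but are learned by an Embedding layer during training. A minimal sketch, assuming a feature size of 40 (the exact size here is an assumption):

```python
from tensorflow.keras.layers import Embedding

embedding_vector_features = 40  # number of learned features per word (assumed)

# Maps each word index to a dense 40-dimensional feature vector,
# learned during training rather than hand-crafted
embedding_layer = Embedding(input_dim=voc_size,
                            output_dim=embedding_vector_features)
```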

What is pad_sequences?

Before passing the One Hot representation to the embedding layer, we need to make sure that all the sentences are of equal length. If they are not, we first define a sentence length and then apply pre-padding with zeroes to make the lengths equal.

You can see the zeroes added to make the sentence lengths equal. This is called pre-padding and is done by pad_sequences, as sketched below.
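A minimal sketch of this step with Keras’s pad_sequences; the sentence length of 25 is an assumption:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sent_length = 25  # maximum sentence length (assumed)

# padding='pre' adds zeroes at the start of shorter sentences
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs[0])  # e.g. [0, 0, ..., 4123, 871, 2956]
```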

Building the LSTM (Long Short-Term Memory) Model

The algorithm used for this purpose is LSTM.

Why LSTM over other traditional neural networks?

Humans do not start thinking from scratch every second. When we read a paragraph, we remember the previous words, and based on them we understand the next word. Information gets retained.

Traditional neural networks cannot do this, and that is one of their major disadvantages.

How do we address this problem?

We use Recurrent Neural Networks to solve this issue, as information gets retained through a loop. An RNN can be thought of as multiple copies of the same network, each passing a message (output) to its successor. The output is always with respect to time.

The same weights are assigned to each input, then a weighted sum takes place in the neuron, an activation function is applied, and the output is passed to the next layer.

But RNNs face the vanishing gradient problem, so to overcome that, we use LSTM.

In LSTM, we use a cell state and various gates which determine which information has to be remembered and which has to be forgotten.
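A sketch of how such a model might be assembled in Keras; the layer sizes and dropout rate are assumptions, not the article’s exact values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential()
# Embedding layer turns padded index vectors into dense feature vectors
model.add(Embedding(voc_size, embedding_vector_features,
                    input_length=sent_length))
model.add(Dropout(0.3))   # randomly drops units to reduce overfitting
model.add(LSTM(100))      # 100 LSTM units with cell state and gates
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))  # binary output: positive/negative
model.summary()
```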

Some important points

  1. Whenever we have a binary classification problem (like sentiment analysis), the loss function we use is binary cross entropy (see the compile sketch after this list).
  2. Whenever we have a multi-class classification problem, we use the categorical cross entropy loss function.
  3. Whenever we have a binary classification problem, the activation function we use in the output layer is sigmoid, as it gives probability values.
  4. Whenever we have a multi-class problem, the Softmax activation function is used in the output layer.
  5. The activation function generally used in hidden layers is the ReLU activation function.
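Following these points, a minimal sketch of compiling and training the model for this binary problem; the optimizer, epoch count, batch size, and validation split are assumptions:

```python
import numpy as np

# Binary classification: binary cross entropy loss, sigmoid output above
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

X_final = np.array(embedded_docs)
y_final = np.array(y)

# Train; validation_split holds out part of the data to track accuracy
model.fit(X_final, y_final, validation_split=0.2, epochs=10, batch_size=64)
```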

The accuracy reached was 80%.

Conclusion

This concludes our basic project on Sentiment Analysis. I hope you all liked it. Stay tuned for more articles on Machine Learning and Deep Learning.

Till then, you can visit my LinkedIn profile and Medium profile for more interesting content :)
