Sentiment Analysis Of WhatsApp Chat

Freeman Goja · Published in Analytics Vidhya · 5 min read · Dec 30, 2019

From mining tweets and Facebook posts to scraping online reviews, sentiment analysis is increasingly shaping the design and delivery of products and services that meet customers’ needs. Thanks to the advanced packages in the leading programming languages, it is relatively easy to do nowadays.

There are two common approaches to text mining: some practitioners go for the bag-of-words model, while others consider the structure and grammar of the words. We shall look at both approaches side by side in this article using R.

The major part of text mining is data cleaning. Loosely structured data requires a great deal of preprocessing before it is ready for analysis; often, this step takes 60% or more of a project’s time. In my earlier article, Statistical Analysis of WhatsApp Chats, I covered the first stage of the cleaning process, which is more or less generic. In this article, I will deal with the data cleaning specific to each analytical approach. If you have not read Statistical Analysis of WhatsApp Chats, click here for a quick catch-up.

The first step in mining WhatsApp chats is to export the desired chat(s) from a device to a text file. In this case, the chats were exported from a highly engaged WhatsApp group comprising 39 members, and the objective was to gauge the sentiments of the group using R. After importing the text file into RStudio, the text content, which is the target feature, was extracted from the dataframe for analysis. Let us consider the different models one by one.
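For context, the import step can be sketched as below. The file name is hypothetical, and parsing the raw lines into a dataframe chat with a text column follows the steps from the earlier article.

# Hypothetical file name: WhatsApp exports the chat as a plain .txt file
raw_lines <- readLines("whatsapp_chat.txt", encoding = "UTF-8")

# Parsing raw_lines into a dataframe `chat` with a `text` column is
# covered in Statistical Analysis of WhatsApp Chats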

1. Bag of Words Model

This approach considers the frequency of use of words in a document. Arguably, the most important step here is data cleaning.

One vital point to remember at the very beginning: an exported chat may contain text, photos and videos (if exported with media), but only the words in the body of the text are needed for analysis, so before anything else, messages posted as photos and videos have to be removed. In this case, the chat was exported without media, so posts of pictures and videos appeared as “image omitted” or “video omitted”. Hence the following code to remove both.

# Drop rows whose text is a media placeholder ("image omitted"/"video omitted")
new_chat <- chat[!endsWith(chat$text, "omitted"), ]

# Keep only the message text for analysis
my_text <- new_chat$text

The next step is to remove things that do not add meaning, such as punctuation marks, extra spaces, numbers and stopwords, and to transform all the words to lower case so that the same word in different cases is not counted as different words. (Transforming everything to upper case would work just as well.) There are special situations where punctuation marks are retained to preserve the integrity of content such as web addresses, so their removal should always be tailored to the specific objective; in this case, they are not needed. Stopwords are mapped from an inbuilt English dictionary (because the chats are in English) and removed. The remaining words are then stemmed (reducing similar words with different endings to the same root, e.g. loving, loved, lovers to “lov”), and finally a document-term matrix is created (see full code).
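As a minimal sketch of that pipeline, assuming the tm and SnowballC packages and the my_text vector from above:

library(tm)
library(SnowballC)

# Build a corpus with one document per message
corpus <- VCorpus(VectorSource(my_text))

# Lower-case, then strip punctuation, numbers, English stopwords and extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Stem the remaining words (loving, loved, lovers -> "lov")
corpus <- tm_map(corpus, stemDocument)

# Document-term matrix: rows are messages, columns are stemmed words
dtm <- DocumentTermMatrix(corpus)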

The calculate_sentiment() function was then used to score the words, and its output was converted to a dataframe.
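For context, calculate_sentiment() matches the API of the RSentiment package, and final_words is presumably a named vector of term frequencies built from the document-term matrix, along these lines:

library(RSentiment)

# Assumed derivation: per-term counts from the document-term matrix,
# sorted so the most frequent words come first
final_words <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)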

# Score each word as Positive, Negative or Neutral
cal_sentiments <- calculate_sentiment(names(final_words))

# Attach each word's frequency to its sentiment score
cal_sentiments <- cbind(cal_sentiments, as.data.frame(final_words))

The distribution of the sentiments can be visualized as follows:
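One plausible way to produce word clouds like the figures below is with the wordcloud package. This sketch assumes cal_sentiments has the text and sentiment columns returned by calculate_sentiment() plus the final_words frequency column added above:

library(wordcloud)

# Most frequently used positive words, sized by frequency
pos <- cal_sentiments[cal_sentiments$sentiment == "Positive", ]
wordcloud(pos$text, pos$final_words, colors = "darkgreen")

# Most frequently used negative words
neg <- cal_sentiments[cal_sentiments$sentiment == "Negative", ]
wordcloud(neg$text, neg$final_words, colors = "darkred")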

Positive Sentiments

The figure above shows the most frequently used positive words among members of the group, with “repay” appearing the most.

Negative Sentiments

A collection of the most used negative words is shown above. “Smoke” was clearly mentioned more frequently than the rest and appears boldest.

2. Syuzhet Model

Another way to analyze sentiments is to use the Syuzhet algorithm to extract and plot the emotional trajectory. Here, each chat message is taken as a sentence and assigned a positive or negative score based on the total score of all the words in it, termed the emotional valence. There are a number of dictionaries to choose from, but here the NRC method, which is based on the NRC word-emotion lexicon, was preferred. As usual, we start with some cleaning to remove HTML links, punctuation and non-alphanumeric characters like emojis.
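A minimal sketch of that cleaning and scoring, assuming the syuzhet package and the my_text vector from earlier (the regular expressions are illustrative):

library(syuzhet)

# Drop links, then anything that is not alphanumeric or whitespace (e.g. emojis)
clean_text <- gsub("http\\S+", "", my_text)
clean_text <- gsub("[^[:alnum:][:space:]]", " ", clean_text)

# Score each message with the NRC lexicon
valence <- get_sentiment(clean_text, method = "nrc")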

With this model, words are treated within the context of the sentence containing them, and it is not necessary to adopt some of the cleaning techniques required for the Bag of Words approach we saw above. Below is a representation of the emotional valence plot.
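For reference, a plot like that can be generated with base graphics:

# Emotional trajectory of the conversation, message by message
plot(valence, type = "l", main = "Emotional Valence", xlab = "Message", ylab = "Valence")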

Proponents of the Syuzhet model argue that treating words in isolation, outside the context of the sentence or phrase in which the author originally used them (as the Bag of Words model does), fails to capture the complete expression of the author’s emotions. In this exercise, both models showed more positive than negative sentiments. In my opinion, the choice of which approach to adopt should be based on your objective.

I hope you learnt something from this piece. Stay in touch and see you in my next article on Twitter Sentiment Analysis.

https://www.linkedin.com/in/freemangoja
