Kaptaan in the light of Data: Exploratory and Sentiment Analysis — Part 2

Rehan Ahmed
Dec 2, 2018

This is the second part in a series analyzing sentiment about Imran Khan in over eight hundred articles written about him in Dawn since 2004. To read the first part, please visit here

One of the most interesting tools in a data scientist’s toolbox is sentiment analysis. A branch of Natural Language Processing, sentiment analysis tries to evaluate people’s opinion about something or, in this case, someone.

WHAT IS SENTIMENT?

Simply put, sentiment is the view or opinion of a person about a subject. It can be anything from anger, sympathy, sadness, happiness, hate, love, basically the whole spectrum of emotions a human being is capable of feeling.

Sentiment Analysis is a technique that tries to estimate the sentiment expressed in a given sentence. There are a number of different algorithms in use today, each with its own pros and cons. Some evaluate the sentiment by checking the ratio of positive to negative words, using a bag of words model. Others take a more accurate but more complicated approach and judge the sentence as a whole, which lets them handle negation, for example knowing the difference between ‘I am happy’ and ‘I am not happy’.
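To make the difference concrete, here is a minimal sketch in Python. The word lists and the one-word negation rule are assumptions made up for illustration; they are not the lexicons or models used in this series.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"happy", "good", "great", "love"}
NEGATIVE = {"sad", "bad", "terrible", "hate"}
NEGATORS = {"not", "no", "never"}


def tokenize(sentence):
    """Lowercase the sentence, split on whitespace, and strip basic punctuation."""
    return [word.strip(".,!?'\"") for word in sentence.lower().split()]


def bag_of_words_counts(sentence):
    """Count positive and negative words, ignoring word order entirely."""
    words = tokenize(sentence)
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative


def negation_aware_score(sentence):
    """Score words +1/-1, flipping the sign of a word that directly follows a negator."""
    score = 0
    flip = False
    for word in tokenize(sentence):
        if word in NEGATORS:
            flip = True
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)
        score += -value if flip else value
        flip = False
    return score


for sentence in ("I am happy", "I am not happy"):
    print(sentence, bag_of_words_counts(sentence), negation_aware_score(sentence))
```

The bag of words counts are identical for both sentences, while the negation-aware score changes sign. Real sentence-level models are far more involved than this one-word rule.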

I will take a detailed look at the different methods I tried out, and which one worked best for me, in the third part of this series. Meanwhile, for the purpose of this article, here is some basic terminology I would like you to get used to before reading on:

Basic Terminology

  • Article Score: The sentiment score of the overall article, found by averaging the sentiment of each sentence in the article.
  • Comment Score: The sentiment score of all the comments on an article, found by averaging the sentiment of each comment on that article. I wanted to capture the sentiment of all the comments, including the ones leaning heavily to one side (positive or negative), so I used a mean instead of a median. I found this to be a more accurate measure of the overall polarity of the comments.
  • Title Score: The sentiment score of the article’s title.
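The choice of a mean over a median, mentioned in the list above, can be illustrated with a small sketch. The polarity values below are invented for the example and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical per-comment polarity scores in [-1, 1]; not real data.
comment_scores = [0.0, 0.0, 0.1, 0.0, -0.9, 0.05, 0.0]


def summarize(scores):
    """Return the mean and the median of a list of polarity scores."""
    return mean(scores), median(scores)


mean_score, median_score = summarize(comment_scores)
print(f"mean={mean_score:.3f}, median={median_score:.3f}")
```

With mostly neutral comments and one strongly negative one, the median stays at zero while the mean is pulled below zero, so the extreme comment still registers in the score.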

Here is a sample article and how it resides in my dataset,

Imran Khan summoned by NAB for ‘misusing’ KP govt’s helicopters

Title Score: -0.189

Article Score: 0.039

Comment Score: -0.0169

Positive to Negative Words: 35/939

The title, quite obviously negative (but not too negative), is captured quite accurately in the Title Score. The article, written as factual reporting rather than as an op-ed, loses some of that negativity and gets a near-neutral score. The comments on the page are clearly negative, but the score doesn’t reflect that after it has been averaged out.

The Analysis

First, I gathered all 38,866 comments in one data frame and calculated their sentiment scores. Then, in another data frame that contained all the articles, I calculated the Article Score and the Title Score. Lastly, I merged these into one grouped data frame with one row per article, containing the author name, the publishing date, the title, the Title Score, the Article Score, the Comment Score, and the total number of positive and negative words in the comments.
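The grouping and merging step can be sketched roughly as follows. Plain Python dictionaries stand in for the data frames here, and the field names (`article_id`, `polarity`, and so on) and values are hypothetical, not the columns of the actual dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the two data frames described above.
comments = [
    {"article_id": 1, "polarity": 0.2},
    {"article_id": 1, "polarity": -0.4},
    {"article_id": 2, "polarity": 0.1},
    {"article_id": 2, "polarity": 0.3},
    {"article_id": 2, "polarity": 0.5},
]
articles = [
    {"article_id": 1, "author": "Author A", "title_score": -0.19, "article_score": 0.04},
    {"article_id": 2, "author": "Author B", "title_score": 0.10, "article_score": 0.25},
]


def merge_scores(articles, comments):
    """Return one row per article with its comment score and comment count attached."""
    # Group comment polarities by the article they belong to.
    polarities = defaultdict(list)
    for comment in comments:
        polarities[comment["article_id"]].append(comment["polarity"])

    merged = []
    for article in articles:
        scores = polarities.get(article["article_id"], [])
        row = dict(article)
        row["comment_score"] = mean(scores) if scores else None
        row["comment_count"] = len(scores)
        merged.append(row)
    return merged


merged = merge_scores(articles, comments)
for row in merged:
    print(row)
```

In a data frame library the same idea is usually a group-by on the comments followed by a join on the article key.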

Then, I plotted this on a scatter plot by year, using the Comment Score as the y-axis. Here, it shows how the polarity for each article’s comments has changed over time (the dotted black line is the neutral line),

The bigger the dot, the more comments (hype) on that date. The higher the dot above the black dotted line, the more positive the comments.

The red line shows the median and the yellow line represents the mean.

There is a noticeable positive spike in the graph over 2015, showing an increase in positive sentiment in the comments on the articles written that year. To see the difference more closely, I plotted both — the positive articles (average polarity of comments greater than 0) and the negative articles (average polarity of comments less than 0) separately, showing how positive interest changes significantly over time.

I also used the bag of words model to find the ratio of positive to negative words in each article’s comment section. However, as you can see in the scatter plot below, this didn’t give me any meaningful results, except for the fact that the people commenting on Dawn’s website tend to use a lot more negative words than positive ones.

The articles above the dotted line have more positive words than negative words in the comments.

Only one article had comments with a total positive word count higher than the total negative word count.

While 2014 was the year when interest around Imran Khan spiked, he attracted mostly neutral comments that year. In 2015, however, you can see how the median, as well as the mean, jumps up considerably. This was the year when Imran Khan got married to Reham Khan and resolved most of his conflicts. His marriage in January of 2015 attracted pretty positive comments, but his nephew’s subsequent arrest and his backing of talks with the TTP brought down the average a bit. Overall, however, this was one of his best years.

To look closer, I plotted a graph of just 2015 and looked at the articles with the average sentiment score in the top 5% or the lowest 5% — i.e. the articles with the most positive and the most negative comments.
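Picking out the top and bottom 5% can be sketched as a simple rank-based cut. The function name, the `comment_score` field, and the evenly spaced scores are assumptions for illustration only.

```python
def extreme_articles(rows, fraction=0.05, key="comment_score"):
    """Return (most_negative, most_positive) rows, each holding roughly `fraction` of the data."""
    ranked = sorted(rows, key=lambda row: row[key])
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical scores for 40 articles, evenly spaced from -0.20 to 0.19.
rows = [{"title": f"Article {i}", "comment_score": (i - 20) / 100} for i in range(40)]

lowest, highest = extreme_articles(rows)
print([row["title"] for row in lowest])
print([row["title"] for row in highest])
```

With 40 articles and a 5% cut, each group holds two articles. The same function with `fraction=0.01` gives a 1% cut.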

The grey line represents the moving average for sentiment scores
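For readers unfamiliar with a moving average, here is a minimal sketch of a trailing one. The window size of 3 and the scores are assumptions; the window used for the plot is not stated in this article.

```python
def moving_average(values, window=3):
    """Trailing moving average; early points average over however many values exist so far."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical sentiment scores ordered by date.
scores = [0.0, 0.3, 0.6, 0.3, 0.0]
smoothed = moving_average(scores, window=3)
print([round(value, 2) for value in smoothed])
```

A larger window gives a smoother line at the cost of reacting more slowly to sudden changes in sentiment.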

For comparison, here is a look at 2018,

A look at the articles with the 1% most positive and the 1% most negative comments from 2011 to 2018,

One interesting thing I noticed, while looking at the outliers that got picked out above, is that the text of the articles with the most positive or negative comments had significant polarity as well.

To see if there was a relationship between how positive or negative an article was and how positive or negative the comments on that article were, I plotted another scatter plot with each article’s Article Score vs the Comment Score,

Since the sentiment score is averaged out over all the comments, I didn’t see as much variance in it as I saw in the polarity of each individual article. But to really see if an article affected the polarity, I decided to remove the most neutral articles (between a polarity of -0.1 and 0.1).

While the comments for the articles with negative polarity are spread out evenly between negative and positive, I found a significant relation between a positive article’s Article Score and the sentiment of comments on that article.

‘94.2% of the articles with a positive sentiment score (above 0.1) had overall positive comments.’
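A figure like this is a conditional proportion: among the articles whose score is above the threshold, what share had a positive Comment Score? The sketch below shows the calculation on invented rows; the field names and values are hypothetical, and the strict "greater than 0.1" cutoff is an assumption.

```python
def share_with_positive_comments(rows, threshold=0.1, key="article_score"):
    """Among rows whose `key` score exceeds `threshold`, return the fraction with comment_score > 0."""
    selected = [row for row in rows if row[key] > threshold]
    if not selected:
        return None
    positive = sum(row["comment_score"] > 0 for row in selected)
    return positive / len(selected)


# Hypothetical rows; not taken from the dataset.
rows = [
    {"article_score": 0.30, "title_score": 0.20, "comment_score": 0.12},
    {"article_score": 0.15, "title_score": 0.00, "comment_score": 0.05},
    {"article_score": 0.20, "title_score": 0.15, "comment_score": -0.02},
    {"article_score": 0.05, "title_score": 0.30, "comment_score": 0.08},
    {"article_score": -0.25, "title_score": -0.10, "comment_score": 0.01},
    {"article_score": 0.40, "title_score": 0.05, "comment_score": 0.20},
]

by_article = share_with_positive_comments(rows)
by_title = share_with_positive_comments(rows, key="title_score")
print(by_article, by_title)
```

Passing `key="title_score"` runs the same calculation on titles instead of article text.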

What if there was an even simpler way of predicting whether an article’s comments would be positive or negative? Is it possible to predict it using just the title of the article? In the age of social media, people rarely go through the whole article before heading for the comment box at the bottom. Does the title affect their view?

Here is a scatterplot with the Title Score of an article on the x-axis and the Comment Score on the y-axis,

According to my findings, if an article’s title had a positive polarity score (0.1 or more), it had a 90.1% chance of getting positive comments overall. However, take this with a pinch of salt as this observation is made only on the topic of Imran Khan and 77.7% of the articles about Imran Khan already attracted positive comments.
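To put that caveat in numbers, the 90.1% figure can be compared against the 77.7% base rate. This is only arithmetic on the two percentages quoted above, not a significance test.

```python
# Figures quoted in the article.
p_positive_given_positive_title = 0.901  # positive comments, given a Title Score of 0.1 or more
p_positive_overall = 0.777               # positive comments across all articles (the base rate)

# Percentage-point gain over the base rate, and the relative lift.
gain_points = (p_positive_given_positive_title - p_positive_overall) * 100
lift = p_positive_given_positive_title / p_positive_overall

print(f"gain: {gain_points:.1f} percentage points, lift: {lift:.2f}x")
# prints: gain: 12.4 percentage points, lift: 1.16x
```

So a positive title goes along with roughly a 12 point improvement over what you would expect from any article about Imran Khan.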

For articles with an Article Score or Title Score below 0.1, there was no significant relation: comment polarity was spread out almost evenly between positive and negative.

Does Article or Comment Polarity depend on the author?

Lastly, I wanted to see whether an author’s writing style had any effect on how polarity-charged the comments were. After removing the authors who wrote fewer than four articles and removing generic names like ‘Dawn Staff’ and ‘The Dawn Correspondent’, I plotted a bar chart of the authors who had the most positive comments overall.

Green — Average Comment Score

Red — Average Article Score

You can see how Article Score is generally pretty close to the Comment Score. Next, I plotted the authors with the most negative comment score and you can see how, here too, the correlation between both of them is pretty noticeable. So, it is fair to say that an author’s writing style (or the type of the news he or she generally covers) can have an effect, even if not significant, on the general sentiment of the comments people post.
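The author filtering and averaging described in this section can be sketched as follows. The row layout and the author names in the sample data are hypothetical; only the two generic bylines and the four-article minimum come from the text.

```python
from collections import defaultdict
from statistics import mean

GENERIC_BYLINES = {"Dawn Staff", "The Dawn Correspondent"}  # names quoted in the article


def author_averages(rows, min_articles=4):
    """Average Article Score and Comment Score per author, skipping generic and low-volume bylines."""
    by_author = defaultdict(list)
    for row in rows:
        if row["author"] not in GENERIC_BYLINES:
            by_author[row["author"]].append(row)

    summary = {}
    for author, items in by_author.items():
        if len(items) < min_articles:
            continue
        summary[author] = {
            "article_score": mean(item["article_score"] for item in items),
            "comment_score": mean(item["comment_score"] for item in items),
            "articles": len(items),
        }
    # Most positive comments first.
    return dict(sorted(summary.items(), key=lambda kv: kv[1]["comment_score"], reverse=True))


# Hypothetical rows: Author B has too few articles, and "Dawn Staff" is a generic byline.
rows = (
    [{"author": "Author A", "article_score": 0.2, "comment_score": 0.1}] * 4
    + [{"author": "Author B", "article_score": 0.3, "comment_score": 0.2}] * 3
    + [{"author": "Dawn Staff", "article_score": 0.0, "comment_score": 0.0}] * 5
    + [{"author": "Author C", "article_score": -0.1, "comment_score": -0.05}] * 4
)

result = author_averages(rows)
for author, stats in result.items():
    print(author, stats)
```

Reversing the sort order gives the authors with the most negative comments instead.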

It is interesting to see how many insights can be gathered from a simple analysis of one newspaper’s articles on just one political personality. However, it isn’t without its own share of problems. Here are some of the biggest issues I faced:

  • The sentiment analysis models available today are not ideally suited for use in Pakistan, especially for analyzing informal forms of communication like comments or messages, because we Pakistanis have a serious habit of slipping into Roman Urdu everywhere.
  • There are only two ways to fix this problem. One is to build a syntactical model capable of dealing with Roman Urdu, which will require a lot of time, hard work, and intelligent minds but will be extremely useful in the future. The other is to train a model by feeding it all the data you possibly can, which is extremely resource-intensive, requires access to a treasure trove of data, and won’t work as accurately as a syntactical model.
  • A lack of APIs for public forums, leading me to spend a lot of time designing and running web scraping scripts, as well as problems and limitations with the APIs for social networks — Twitter imposes a limit on the number of tweets you can download while Facebook doesn’t even allow access to public content without making you develop a legitimate application and making you go through a lengthy review process.
  • A lack of local resources or guides for working on a local project. While I found tons of tutorials detailing what to do and how to do it, I didn’t find many good projects by Pakistani experts on doing a data analytics project. It is sad to see, as I have come across a lot of people who are doing some great work in this field and could help a lot of people if they decided to share their experience and knowledge.

Originally published at marketlytics.com in July, 2018.
