Sentiment analysis with NLTK/VADER — Comments on Lee Hsien Loong’s Facebook post

Too lazy to read through 2,000 comments on Facebook? You’ve come to the right place. Here I’ll use NLTK’s VADER (a Python module) to sift through these comments and see what the hive mind thinks. Warning: it’s surprisingly cheery.

Spoiler: the results are positive.

Why?

A post by SUTD student Jiayu Yi piqued my interest a month ago with its timely, step-by-step method for sentiment analysis of a certain Facebook post, perfect for a beginner like me to replicate as a toy project. However, instead of the Google Cloud Natural Language API, I will use NLTK to classify comments as positive, negative or neutral.

Having grown up here, I have an intrinsic interest in the topic (plus extreme laziness: who has the time to read 2,000 comments on Facebook?!). I also wanted to see whether using NLTK, and running the analysis one month later, would yield different results.

The post in question:

Goals

  • scraping comments using the Facebook Graph API, and adapting a Python script to collect comments into a text file
  • simple natural language processing through NLTK and VADER to classify comments as positive/negative/neutral

Part 1: Scraping

Following Jiayu’s method, I was able to scrape roughly 2.2k comments from the target post into a text file. However, instead of using the access token shown in the screenshot in his post, I had to use the Access Token Debugger to fill in the token value in the script.

Your access token is in another castle!

Code for scraping

import requests

graph_api_version = 'v2.9'

# paste your access token below
access_token = ' '

# LHL's Facebook user id
user_id = '125845680811480'
# the id of LHL's response post at https://www.facebook.com/leehsienloong/posts/1505690826160285
post_id = '1505690826160285'

# the graph API endpoint for comments on LHL's post
url = 'https://graph.facebook.com/{}/{}_{}/comments'.format(graph_api_version, user_id, post_id)

comments = []
r = requests.get(url, params={'access_token': access_token})
while True:
    data = r.json()
    # catch errors returned by the Graph API
    if 'error' in data:
        raise Exception(data['error']['message'])
    # append the text of each comment into the comments list
    for comment in data['data']:
        # remove line breaks in each comment
        text = comment['message'].replace('\n', ' ')
        comments.append(text)
    print('got {} comments'.format(len(data['data'])))
    # check if there are more comments
    if 'paging' in data and 'next' in data['paging']:
        r = requests.get(data['paging']['next'])
    else:
        break

# save the comments to a file
with open('comments.txt', 'w', encoding='utf-8') as f:
    for comment in comments:
        f.write(comment + '\n')

This gives me a text file with one comment on each row.


Looking through the Facebook page and comparing it with the scraped comments, the unreadable symbols in the text file mostly turn out to be comments in Mandarin or emojis.
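Out of curiosity, a quick way to flag such comments is to check what share of their characters is ASCII. This helper is my own rough heuristic (not part of Jiayu’s script), and the sample strings are made up for illustration:

```python
def is_mostly_ascii(text, threshold=0.5):
    """Rough heuristic: comments that are mostly non-ASCII are
    typically Mandarin text or strings of emoji."""
    if not text:
        return True
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold

print(is_mostly_ascii("Well done, PM!"))  # → True
print(is_mostly_ascii("新加坡加油"))       # → False
```

Note that VADER’s lexicon is English-only, so most of these comments will come back with a compound score of exactly 0.0 and land in the neutral bucket.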

Part 2: Quick & Dirty Sentiment Analysis

I am going to try this analysis on a small subset of the data, without cleaning it first, just to make sure everything works and that the labels seem sensible when I read the comments myself. And voilà, it works, though it took a while thanks to my not having installed the lexicon for VADER:

Testing took quite a while, thanks to VADER’s missing lexicon

Final code for sentiment analysis

import nltk  # be sure to have the vader_lexicon installed, e.g. via nltk.download_shell()
from nltk.sentiment.vader import SentimentIntensityAnalyzer

messages = [line.rstrip() for line in open("filepath goes here")]

# this step will raise an error if you have not downloaded the lexicon
sid = SentimentIntensityAnalyzer()

summary = {"positive": 0, "neutral": 0, "negative": 0}
for x in messages:
    ss = sid.polarity_scores(x)
    if ss["compound"] == 0.0:
        summary["neutral"] += 1
    elif ss["compound"] > 0.0:
        summary["positive"] += 1
    else:
        summary["negative"] += 1

print(summary)

You should get:

{'positive': 1206, 'neutral': 601, 'negative': 270}
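The percentages quoted below can be reproduced directly from this dictionary with a few lines of plain Python:

```python
summary = {'positive': 1206, 'neutral': 601, 'negative': 270}
total = sum(summary.values())  # 2077 comments scored

for label, count in summary.items():
    print('{}: {:.0f}%'.format(label, 100 * count / total))
# positive: 58%, neutral: 29%, negative: 13%
```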
Pie chart generated with Excel in approx 5 seconds… Matplotlib another day

Using this method, with very few lines of code and for absolutely free, I was able to analyse a similar volume of comments.

However, the results were quite different. Instead of 68% positive, VADER found only 58% of the comments positive; and instead of 18% negative, VADER was surprisingly upbeat, finding only 13% of the comments negative.

And we are dun dun done.

Odds & Ends

Comments were anonymised for this basic analysis. A good follow-up would be to identify frequent posters and the tone of their comments.
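As a rough sketch of that follow-up: if each scraped comment also kept its author (the Graph API returns a from field alongside message, though its availability depends on your token’s permissions), a collections.Counter would surface repeat commenters. The (author, comment) pairs below are made up for illustration:

```python
from collections import Counter

# hypothetical scraped data: (author, comment) pairs
comments = [
    ('alice', 'Well said, PM!'),
    ('bob', 'Totally disagree.'),
    ('alice', 'Agree with every word.'),
    ('alice', 'Thank you for the clarification.'),
]

# count comments per author and show the most frequent poster
post_counts = Counter(author for author, _ in comments)
print(post_counts.most_common(1))  # → [('alice', 3)]
```

From there, one could run polarity_scores over each frequent poster’s comments separately to see whether the regulars skew the overall tone.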

Could the differences between my results and Jiayu’s be due to timing, with the later comments shading incrementally more positive and neutral, or is it simply VADER being sunnier than Google? (I am totally loving the pun. Can you tell?)
