GSoC’20 #2 [Week 2 & 3]

Akshara Prabhakar
2 min read · Jun 24, 2020


After working on unique messages detection last week, I began with sentiment analysis on all the issue and pull request messages within a repository.

I started off with VADER (Valence Aware Dictionary and sEntiment Reasoner), a classical lexicon- and rule-based method, and TextBlob, another rule-based sentiment analysis library. VADER is specifically attuned to sentiments expressed in social media and also handles emojis present within the text. I ran these on the most active repos, i.e., those with the most messages and comments, using the data collected by the GitHub & Pull Request workers.

To test and compare the results, I hand-labeled some messages as neutral, positive, or negative. I observed that TextBlob performed better than VADER at detecting negative sentiment, but both were easily misled by common words from the software engineering domain, such as fix, break, bug, revert, and clean. After this, I did a literature survey to learn the best approaches for classifying sentiment, specifically in the software engineering domain. I came across the following 3 models:

- SentiStrengthSE: https://ieeexplore.ieee.org/document/7962370
- SentiCR: http://amiangshu.com/papers/senticr-ase.pdf
- SentEmoji: https://dl.acm.org/doi/10.1145/3371158.3371218

The first step was to prepare an annotated dataset for supervised learning, since, surprisingly, there weren’t many annotated datasets available for sentiment analysis in the software domain! Following the approaches mentioned in the above papers, I considered the JIRA issue comments dataset, taking the love and joy groups as positive and the sadness and fear groups as negative, and discarded the surprise label since it can indicate either positive or negative sentiment. Some comments from groups 2 and 3 were labeled as neutral. The oracle dataset had only positive and negative classes, so I applied VADER on the positive ones and set an appropriate threshold to generate a neutral class. The result was split 80–20 into train and test sets and saved as CSV files.
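The relabeling and split steps above could be sketched as follows; `vader_compound` stands in for the real VADER scorer, and the 0.4 threshold is purely illustrative, not the value actually used:

```python
import csv
import random

def relabel_positive(rows, vader_compound, threshold=0.4):
    # Demote oracle-dataset "positive" messages to "neutral" when VADER's
    # compound score falls below the (illustrative) threshold.
    out = []
    for text, label in rows:
        if label == "positive" and vader_compound(text) < threshold:
            label = "neutral"
        out.append((text, label))
    return out

def train_test_split_80_20(rows, seed=42):
    # Shuffle deterministically, then cut at the 80% mark.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]

def save_csv(path, rows):
    # Persist (text, label) pairs with a header row.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(rows)
```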

Then I adopted the SentiCR approach, with modifications to the preprocessing and training parts, as the data we are considering is more diverse. It involves TF-IDF vectorization of the words, followed by boosting methods for classifying the sentiment. Currently, I’m experimenting with GradientBoostingClassifier and XGBoost.
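A minimal scikit-learn sketch of that TF-IDF + boosting pipeline is below; the toy training data and the vectorizer/classifier parameters are placeholders for illustration, not the actual SentiCR configuration or dataset:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy stand-in for the annotated train split described above.
texts = [
    "great work, thanks!", "this breaks the build", "updated the docs",
    "awesome fix", "terrible regression", "rebased onto master",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# TF-IDF features feed directly into a gradient-boosted classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", GradientBoostingClassifier(n_estimators=100)),
])
model.fit(texts, labels)
prediction = model.predict(["thanks, this looks great"])  # one of the 3 labels
```

Swapping `GradientBoostingClassifier` for `xgboost.XGBClassifier` changes only the `"clf"` step, which is what makes the pipeline convenient for comparing the two boosters.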

Next week, I shall be using the time series data to analyze trends in sentiment across a repository over time, and also diving into converting the notebooks to workers! 😄

Some of my notebooks can be found here: https://github.com/chaoss/augur/tree/akshara/messages-notebooks

P.S. I was away for most of this week due to exams at college.
