Hackbright Project Season: Sentiment Analysis

>> Goal of Week 2*: to select and implement a sentiment analysis method for analyzing my collected tweets

>> Approach:

Generally speaking, sentiment analysis is the task of computationally (and automatically) extracting information about sentiments and opinions from text. However, there are several different ways to go about it, and my first task was to look into the two most prevalent methods, lexicon-based and machine-learning-based, and determine which would make the most sense in the context of my project.

Machine-learning techniques apply general learning algorithms that ‘learn’ language rules through statistical inference. In sentiment analysis this most commonly takes the form of supervised machine learning, where hand-labeled data forms a large training corpus from which patterns are inferred.

In contrast, lexicon-based (or rule-based) methods derive text polarity by matching against a dictionary of words, each annotated with its polarity and intensity. Unlike machine-learning approaches, lexicon-based approaches do not infer the polarity of words/n-grams, but simply use the dictionaries provided.

Due to these differences, lexicon-based approaches are more useful when you want high-precision evaluation within a narrow domain. However, creating the dictionary is very labor-intensive and doesn’t scale well to larger domains. In contrast, machine-learning techniques are more flexible, as training data is less labor-intensive to curate than a fully annotated lexicon, and they can therefore be used for analyzing texts across a wider domain. In recent research on sentiment analysis, the two approaches are often combined.

While the tweets I am looking at could definitely be considered a small domain, I don’t have the time/training to create an annotated lexicon on my own. Instead, I briefly looked into the VADER lexicon (Valence Aware Dictionary and sEntiment Reasoner), as it includes idiomatic expressions often found on Twitter. However, given that words in the political sphere can have a different valence than they would in another context, I ultimately decided that a machine-learning technique would be more appropriate for my project.

I opted to go with a Naïve Bayes classifier — one of the simplest techniques available. As the name suggests, Naïve Bayes operates with the assumption that every feature is independent of all other features. At first, I was put off by this foundation — after all, at an intuitive level it doesn’t match what we know about natural language, where sentences are rarely (if ever) made up of completely independent words, and this syntactic and semantic coherence forms a lot of how we naturally understand sentences. However, despite this counter-intuitive foundation, Naïve Bayes performs surprisingly well and has the added benefit of already being integrated into the scikit-learn library for Python.
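To make that concrete, here is a minimal sketch of what training and using such a classifier looks like in scikit-learn. The tweets, labels, and variable names are toy placeholders, not my actual corpus or project code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy hand-labeled corpus standing in for the real training data.
labeled_tweets = ["love this candidate", "what a disaster", "great rally tonight"]
labels = ["pos", "neg", "pos"]

# Turn raw tweet text into word-count feature vectors.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(labeled_tweets)

# Fit the Naive Bayes classifier on the vectorized training corpus.
classifier = MultinomialNB()
classifier.fit(features, labels)

# Classify a new tweet by running it through the same vectorizer.
print(classifier.predict(vectorizer.transform(["what a great candidate"])))
```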

>> Implementing the Classifier:

When implementing my Naïve Bayes classifier I encountered two temporary roadblocks along the way. The first issue was how to avoid retraining my classifier every time I wanted to use it. I wanted to somehow store my trained/vectorized classifier so that I could import it into my seed file and classify each tweet. Luckily, I was not the first person to have this issue, and I finally discovered the wonder that is pickling.

Pickle is a module in the Python Standard Library that allows you to serialize and de-serialize a Python object. The documentation says that “‘pickling’ is the process whereby a Python object hierarchy is converted into a byte stream, and ‘unpickling’ is the inverse operation, whereby a byte stream is converted back into an object hierarchy.” In practical terms, this meant that I could create a vectorizer and a classifier based upon my training corpus, “pickle” them immediately, and then “unpickle” them in a subsequent function where I process new tweets through my classifier — while having all the adjustments I had made to the scikit-learn classifier persist. Pretty cool.
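Roughly what that looks like in practice (the file names and the tiny training step are illustrative placeholders, not my actual project code):

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training step, standing in for the real hand-labeled corpus.
vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(["love this", "what a disaster"]), ["pos", "neg"])

# Serialize both objects so a later script can reuse them without retraining.
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

# Later, e.g. in the seed script: deserialize and classify new tweets.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

print(classifier.predict(vectorizer.transform(["another disaster of a rally"])))
```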

My second problem didn’t have quite as easy a solution. I quickly discovered while hand-tagging that my ratio of positive:negative tweets was about 1:150 — and this imbalance was having a significant impact on the classifier’s precision for positive tweets. While my negative tweets showed a precision of around 82% and a recall of 90%, the classifier had almost no predictive ability for positive tweets. To address this problem, I tried several different methods. First, I expanded the size of my positive corpus by actively searching through my dataset for positive words. While this has the potential to bias my classifier, that methodological trade-off was preferable to sorting through ~100 tweets just to add a single new tweet to my corpus. This greatly improved the precision and recall for positive tweets (although it slightly reduced the negative precision), but I was still getting a fair amount of variance in the accuracy of my model.
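Per-class precision and recall of the kind quoted above can be computed with scikit-learn’s classification_report; a toy illustration with made-up labels and predictions, just to show how the imbalance surfaces in the per-class numbers:

```python
from sklearn.metrics import classification_report

# Made-up held-out labels and classifier predictions; "pos" is the
# minority class, which is where precision/recall tend to collapse.
test_labels = ["neg", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
predictions = ["neg", "neg", "neg", "pos", "neg", "neg", "pos", "neg"]

print(classification_report(test_labels, predictions))
```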

I wondered if part of the remaining variance was due to the fact that negative and positive tweets share many of the same words, with the order/context of those words determining their valence. To address this, I switched to vectorizing based upon n-grams instead of individual words, which made a slight improvement in my results.
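In scikit-learn that switch is just a parameter change on the vectorizer; a rough sketch, where the (1, 2) range (word unigrams plus bigrams) is illustrative rather than necessarily the range I settled on:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps individual words but also counts adjacent word
# pairs, so "not great" becomes a feature distinct from "great" on its own.
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))

features = ngram_vectorizer.fit_transform([
    "this rally was not great",
    "this rally was great",
])
print(sorted(ngram_vectorizer.vocabulary_))
```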

>> Unsolved Issues and Takeaways:

The more I learned about the current state of research into sentiment analysis, the more aware I became of the inherent limitations to current methodologies of natural language processing. The most predictable of these issues was the prevalence of sarcasm. Since sarcasm often requires knowledge of the subject/culture being referenced, it can be difficult for humans to discern — let alone a naïve algorithm. I had anticipated that this would be an issue, but since this is still an active area of doctoral-level research, I figured it wasn’t something I was going to be able to address in my four-week Hackbright project.

What I didn’t account for was how many tweets were “negative” in valence — but the negativity itself reflected a positive sentiment towards the referenced candidate. This phenomenon appeared to be particularly common among pro-Trump tweets — where extremely negative rants about Obama, immigrants, and the general “demise of America” were used to emphasize the tweeter’s support for Trump. While the classifier would correctly label these as negative, it would be inaccurate to count these as negative tweets about Trump.

Initially I had hoped to use the Naïve Bayes results as a rough proxy for voting preferences, and to use this as a foundation to compare against traditional polling models. However, the classifier’s inability to distinguish between negative sentiment merely expressed in the same sentence as a reference to a candidate and negative sentiment directed at that candidate rendered the comparison meaningless, and I found myself having to completely reconsider my front-end visualizations.

While I was certainly disappointed by this discovery, it really drove home the importance of agile development, and of being willing to iterate on your initial vision as you discover new inputs and constraints. Or more simply — don’t get too attached; something will probably need to change.

*Yes, I am aware it is no longer week 2