We Need to Talk About Sentiment Analysis

On the dangers of using pretrained sentiment models

Kristof Boghe
The Startup
55 min read · Jul 22, 2020


Having mixed feelings about sentiment analysis

Sentiment analysis, or the computational analysis of opinions, emotions and attitudes in written text, seems deceptively simple. Indeed, when compared with more daunting tasks of computational text analysis such as automatic text generation or even named entity recognition, constructing a model that differentiates between — for example — a positive versus negative text seems like a walk in the park. Above all, for the uninitiated, it's also one of the more obvious examples of Natural Language Processing (NLP), and therefore a popular technique among all kinds of NLP-related business applications — from social media monitoring to chatbots. And as an avid enthusiast of anything related to computational methods within the social sciences, I quickly became enthralled by sentiment analysis a couple of years ago, when (big) data-driven social media analysis was all the rage among social scientists. At the same time, the notion that the omnipresence of algorithmic curation created ideological (or even moral) online filter bubbles came into vogue as well, putting emotion — and thus sentiment — firmly in the front seat in many theoretical debates on online communication. And since most sentiment algorithms are relatively easy to grasp, it became an ideal entry point for many budding academics who wanted to dip their toes in anything vaguely related to machine learning and AI. It's exactly this combination of accessibility and theoretical relevance that made sentiment analysis one of the new cool kids in town.

But something has been bugging me for a while now.

Despite its apparent simplicity, many researchers treat sentiment models as both a black box and a one-size-fits-all solution. The black-box mentality shows itself in treating pretrained sentiment algorithms as if they were some mystical wizardry, as something that should not or even cannot be scrutinized in detail before applying the model to your own research. Indeed, there is a strong temptation to just download the model, load it into R or Python and roll with it. Maybe it's exactly because detecting negative and positive emotions seems so straightforward ('just count the happy versus sad words, right?!') that researchers tend to gloss over the specific characteristics of the data used to train the model, how the model was trained and what kind of potential pitfalls are associated with particular techniques. Precisely because of this lack of critical spirit, pretrained sentiment models are deemed perfectly interchangeable and universally applicable, no matter the specific characteristics of the corpus under study. This is emblematic of the one-size-fits-all fallacy. Feeding this negligent attitude are the academic articles published to 'vet' (but also market) a model's viability and validity. More often than not, researchers simply cite the initial validation exercise, implying that — if the authors of the model validated their algorithm properly — the model should be perfectly capable of automating sentiment analysis in their specific research as well. Plenty of academic articles are guilty of such negligence (example 1, example 2). As I'll demonstrate throughout this blog post, however, they ignore the fact that model performance can vary wildly depending on the domain and context (e.g. news versus social media texts). In the end, both of these fallacies can wreak serious havoc on the validity of automatically coded sentiment features and, ultimately, on the research findings as such.

To state my case, I will use three popular pretrained sentiment algorithms to demonstrate the risks of blindly accepting the outcome of such models and what you can do about it. Although sentiment analysis can involve multinomial and complex emotional classification schemes (e.g. fear, anger, enjoyment…) or the detection of facts versus opinions, we'll focus on the relatively straightforward goal of detecting sentiment polarity, or how negative versus positive a piece of text is, usually expressed on an interval scale. Although there are a plethora of ways to structure the (sub)field of sentiment analysis, I believe it makes most sense to place polarity models in one of three categories, defined by their particular modeling strategy. For each category, I'll pick one polarity model to demonstrate the peculiarities associated with that technique. After that, I'll compare sentiment polarity scores across different types of political texts — from official presidential speeches to politicians' tweets — and try to draw some conclusions on the pitfalls (and opportunities) of using these pretrained models. But before we do that, let me provide you with a bare-bones overview of how polarity models actually work.

Calculated emotions

the inner workings of a sentiment polarity model

In essence, the modeler can adopt one out of three different strategies when developing a sentiment polarity algorithm.

▹STRATEGY 1: Lexicons
Model example: Pattern

The most straightforward method is a dictionary or lexicon approach. This comes down to creating — you'll never guess — a dictionary of words (called features) with a corresponding polarity label. Now, this may sound as if the researcher is sitting at a desk, casually coming up with some kind of list of features with positive and negative valence; but in this day and age there are plenty of existing corpora to start from. In the last couple of decades, linguists have done a remarkable job compiling validated sentiment lexicons. Next to these authoritative dictionaries, the modeler usually draws upon a sample of 'opinionated texts' within a particular domain to come up with a list of feature candidates. This step is usually undertaken to increase the lexicon's sensitivity to a very specific language domain (e.g. politics, social media, and so on). One can, for example, extract all adjectives from a couple of thousand op-eds and present these features to a group of human coders who evaluate them on their emotional valence. If a particular feature tends to receive a similar polarity score from the panel of human experts, the feature is deemed reliable and can be added to the dictionary. The label can go from a crude binary classification (positive/negative) to a nuanced valence score (e.g. a score on a scale between -1 (negative) and 1 (positive)).

When the end user utilizes the lexicon, the algorithm invariably performs the same two simple steps.

✓ First, the computer parses or tokenizes the textual data. In its simplest form, parsing entails transforming an entire string of text into smaller — meaningful — components, usually words. For example, the sentence "This is great!" consists of multiple tokens: "This/is/great/!". Ideally, these tokens are also transformed into their lemma, the canonical form of a particular word (so 'is' → 'be' and 'schools' → 'school'). Most NLP packages include a lemmatization option, so this can all be done with a single line of code. Otherwise, inflected forms and conjugated verbs run the risk of going unrecognized by the polarity model. In essence, this kind of parsing transforms a sentence into a 'computer-friendly' format. In many NLP analyses, for example, the computer constructs what we call a document-feature matrix (DFM), where the presence or absence of tokens is represented within a straightforward matrix, such as:

The first row and first three columns of a DFM

This allows the algorithm to treat the presence or absence of a token as a predictive variable, similar to any other predictor in a statistical model such as a regression analysis.
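
To make this concrete, here is a minimal sketch of the tokenize-lemmatize-DFM pipeline described above, assuming spaCy (with the small English model installed) and scikit-learn are available; the example documents are my own.

```python
# Minimal sketch: tokenize and lemmatize with spaCy, then build a document-feature matrix.
# Assumes `python -m spacy download en_core_web_sm` has been run.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")
docs = ["This is great!", "These schools are great."]

# Lemmatize each document ('is' -> 'be', 'schools' -> 'school') and drop punctuation
lemmatized = [" ".join(tok.lemma_ for tok in nlp(doc) if not tok.is_punct) for doc in docs]

# The DFM: one row per document, one column per feature, cells count occurrences
vectorizer = CountVectorizer()
dfm = vectorizer.fit_transform(lemmatized)
print(vectorizer.get_feature_names_out())
print(dfm.toarray())
```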

✓ Next, the model scans the entire text and attaches the corresponding polarity scores to the tokens; that's it! Some dictionaries evaluate chunks instead of individual tokens. Chunks can include expressions (e.g. "son of a bitch") or other multi-word strings (called n-grams) next to single-word tokens. To arrive at a polarity score for the entire text, the algorithm performs a simple calculation, such as taking the average sentiment score of the evaluated tokens.
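
As an illustration of these two steps, here is a toy scorer; the lexicon and its scores are invented for the example, not taken from any published dictionary.

```python
import re

# A toy dictionary of features with polarity labels between -1 and 1 (invented values)
toy_lexicon = {"great": 0.8, "entertaining": 0.65, "disappointing": -0.6, "bleak": -0.8}

def lexicon_polarity(text):
    tokens = re.findall(r"[a-z]+", text.lower())                    # step 1: a crude tokenizer
    scores = [toy_lexicon[t] for t in tokens if t in toy_lexicon]   # step 2: look up polarity
    return sum(scores) / len(scores) if scores else 0.0             # average over evaluated tokens

print(lexicon_polarity("The trailer was great and very entertaining, but the movie was disappointing."))
# (0.8 + 0.65 - 0.6) / 3 = 0.28
```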

The most popular sentiment dictionary is LIWC, which has been extensively validated by academic researchers. By modern standards, however, LIWC is a rather crude polarity classifier, given that it ignores polarity intensity. Tokens are simply 'negative' or 'positive', with no room for differentiating the intensity between — for example — the token 'bad' (negative) and 'abysmal' (extremely negative).

We'll take the Pattern module as a more modern representative of the straightforward dictionary approach. Pattern is a lexicon developed by Dutch researchers (affiliated with the University of Antwerp) and contains 5500 features. It's important to note that Pattern's lexicon only includes adjectives. Other part-of-speech categories, such as "killings" (noun) or "deprived" (verb), are therefore not evaluated. It's one of the only widely available sentiment lexicons in Dutch, and for this reason it enjoys some popularity among academics in Belgium and the Netherlands. Although the lexicon was originally constructed and validated in Dutch, the authors mapped their model to English as well.

Let's see how Pattern handles a couple of example sentences. All code snippets here are written in Python.
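
The snippet itself is embedded as an image in the original post; a minimal reconstruction using the pattern.en module might look like the following (the example sentences are paraphrased from the discussion below, not copied from the original snippet).

```python
# A sketch of scoring a few sentences with Pattern's English sentiment lexicon
from pattern.en import sentiment

sentences = [
    "This is a great movie!",
    "The trailer was great and very entertaining, but the movie itself was disappointing.",
    "The riots, killings and genocide left the country in ruins.",
]

for s in sentences:
    polarity, subjectivity = sentiment(s)                 # polarity in [-1, 1], subjectivity in [0, 1]
    print(round(polarity, 4), sentiment(s).assessments)   # .assessments lists the evaluated chunks
```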

The output looks something like this:

Testing the Pattern model

As you can see, the model does a reasonably good job. Bear in mind that the output represents the average polarity score, going from -1 (very negative) to 1 (very positive). Pattern deals with direct negations ("not bad") and indirect negations ("not really a good movie") without too much trouble. Still, I was able to trick Pattern into two clearly erroneous codings. The second sentence expresses a certain expectation, followed by a negative outcome. However, since Pattern simply found two positive adjectives ("great" & "entertaining") and only one somewhat negative adjective ("disappointing"), the end result is an overall (though only slightly) positive sentiment. Adding more syntactical and grammatical rules to a dictionary could solve this issue (see the next section). The erroneous coding in the third sentence has a different source, though, and can be explained by a lack of coverage. Although the words "riots", "killings" and "genocide" clearly sketch a very bleak picture, Pattern only evaluates adjectives and thus disregards the nouns.

The code provided here also prints the evaluated chunks. Let’s take a look at the assessments of the second sentence:

[([‘great’], 0.8, 0.75, None), ([‘very’, ‘entertaining’], 0.65, 0.91, None), ([‘disappointing’], -0.6, 0.7, None)]

The first number of each evaluated chunk represents the polarity score (from -1 to 1); the second one is a subjectivity score (something we won't discuss any further here). To obtain the final polarity score of 0.2833, Pattern simply takes the average of all sentiment scores: (0.8 + 0.65 - 0.6) / 3 = 0.283. Although these kinds of lexicons are very transparent in their sentiment calculations, it's already evident from this handful of examples that they are easily led astray due to their context-insensitivity.

▹STRATEGY 2: Rule-based lexicons
Model example: Vader

The rule-based lexicon model adds another step to the dictionary approach. After attaching the appropriate polarity score to the tokens or chunks of a piece of text, the algorithm applies a list of rules to avoid some common pitfalls associated with a crude dictionary approach. In essence, all of these rules aim to take (a) the order of words and/or (b) the co-occurrence of particular words into account. The negation rule is the most obvious one, and we already encountered it in our short demo of Pattern. So, although the token "bad" would clearly receive a negative score in our lexicon, if a customer reviews the food in a restaurant as "not bad", this part of the review is at least (somewhat) positive. The algorithm could, for example, simply reverse the polarity score here and evaluate the more relevant chunk 'not bad' instead of the token 'bad' as such. In fact, the negation rule is so easy to implement that even the most bare-bones dictionary models tend to incorporate it into their algorithm. Another common rule is adjusting the polarity in the presence of an intensifier (e.g. "really bad" is stronger in polarity than "bad").

However, a rule-based lexicon worth its salt goes a step further and tends to incorporate more complex rules such as 'shift in tone' detection. For example, the sentence "I heard the movie was terrible, but it was not bad at all." is clearly positive in tone, but will fool the more bare-bones dictionary models, even if they incorporate the negation rule. After all, the review contains the token "terrible" (extremely negative) and the chunk "not bad" (slightly positive), so in the end the sentence will likely be rated as slightly negative. An extensive rule-based model could recognize that the sentence starts out negative ("was terrible"), but that the text will probably include some info contradicting the negative sentiment (", but"). Finally, the fact that a negated negative token is present in the second clause ("not bad") should reaffirm the model's suspicion that the text has a positive polarity. In the end, the algorithm could opt for (a) simply ignoring the negative token in the first clause, or it could go for a more subtle approach by (b) putting a higher weight on any second clause that starts with 'but…'. The second strategy is less desirable in our specific example here, but tends to produce more valid results overall when someone truly expresses a nuanced sentiment. Let's consider the following sentence from a restaurant review: "the lack of a kids menu was kind of disappointing, but the food is delicious and the staff is friendly, so it's all worth it." Although the customer was somewhat disappointed, in the end he or she was satisfied with the restaurant, so a moderately positive polarity seems fitting. Since language experts know that people tend to use the second clause to stress their most salient emotions and their 'final' judgement on a subject, we can incorporate a rule that ascribes a higher weight to the tokens 'delicious' and 'friendly', while at the same time not discounting the negative sentiment expressed in the first clause.
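
To make the idea tangible, here is a toy extension of the dictionary approach with a negation rule and a crude 'but'-clause weighting; the lexicon values and the 0.3/0.7 weights are invented for illustration and are far simpler than what a model like Vader actually does.

```python
import re

toy_lexicon = {"terrible": -0.9, "bad": -0.7, "disappointing": -0.6,
               "delicious": 0.8, "friendly": 0.7, "good": 0.6}
NEGATIONS = {"not", "never", "no"}

def clause_polarity(tokens):
    scores = []
    for i, tok in enumerate(tokens):
        if tok in toy_lexicon:
            score = toy_lexicon[tok]
            # Negation rule: flip the polarity if one of the three preceding tokens is a negation
            if NEGATIONS & set(tokens[max(0, i - 3):i]):
                score = -score
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

def rule_based_polarity(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    if "but" in tokens:
        # 'Shift in tone' rule: weight the clause after 'but' more heavily than the first clause
        cut = tokens.index("but")
        return 0.3 * clause_polarity(tokens[:cut]) + 0.7 * clause_polarity(tokens[cut + 1:])
    return clause_polarity(tokens)

print(rule_based_polarity("I heard the movie was terrible, but it was not bad at all."))  # > 0
```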

To represent the rule-based dictionaries, we'll use the freely available Vader model. Vader stands for "Valence Aware Dictionary and sEntiment Reasoner" and is specifically designed to detect sentiment polarity in 'microblog' texts, which basically comes down to the text types found on social media such as Facebook and Twitter. However, the researchers validated their algorithm on a relatively broad spectrum of text types, and the model performs surprisingly well across multiple domains (e.g. op-eds). Given the short length of most microblog texts, the model assumes that it has little to work with, so it is designed to be sufficiently sensitive to a wide range of polarity-relevant textual cues. To this end, Vader combines multiple existing dictionaries, adding lexical features that are characteristic of social media texts such as smileys (e.g. ":D"), acronyms (e.g. "lmao") and slang ("sux"). In total, the dictionary contains 7500 features, each rated on a valence scale by a sample of human annotators. Even more interesting, though, is the fact that Vader adds five additional rules to maximize the validity of its polarity estimates. Two of these rules are somewhat specific to microblogs: (1) words written in caps lock receive a more extreme polarity score (e.g. "I have some GREAT news!") and (2) the model accounts for expressive use of exclamation marks (e.g. "I have some great news!!!!!!!"). The other three rules are more general-purpose: (1) the model accounts for intensifiers (e.g. "really bad"), (2) it checks the preceding three tokens (or trigram) for negation (e.g. "This isn't a really good movie") and (3) it includes 'shift in tone' detection. Thus, after identifying the average sentiment score, the model checks whether one (or more) of the abovementioned rules apply to the sentence at hand and adjusts the score if necessary. For more info, check out their publication.

Let's load the Vader library into Python and perform a quick and simple validity exercise before we continue. Please note that I use different example sentences here than the ones I used with the Pattern library. Keep in mind that I purposefully came up with some unique sentences to 'stress test' the specific features and flaws of a particular package; the comparison between these models is something for later on.
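
The original snippet is shown as an image; loading Vader and scoring a few sentences might look like this, assuming the vaderSentiment package is installed (the model also ships with NLTK). The sentences below are illustrative, based on the rules discussed above.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

sentences = [
    "I have some GREAT news!!!",
    "This isn't a really good movie.",
    "I heard the movie was terrible, but it was not bad at all.",
]

for s in sentences:
    # The 'compound' score is the normalized polarity, ranging from -1 to 1
    print(s, "->", analyzer.polarity_scores(s)["compound"])
```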

Which gives us:

Testing the Vader model

Compared to Pattern, this model is clearly more nuanced and less easily fooled. Thanks to the incorporation of a few simple rules, the model smoothly handles intensifiers, negation and shifts in tone. Of course, it makes some assumptions about the writer's grammatical and stylistic capabilities. This is evident when looking at the last example: since the dictionary uses the 'but' feature to detect a shift in tone, the last sentence simply takes an average of the 'bad' and 'good' tokens, resulting in an overall slightly negative polarity score. The awkward sentence structure is clearly to blame here, but one only needs to glance at his or her Facebook or Twitter feed to know that awkward sentence structures are rampant on social media.

▹Lexicons and knowledge engineering

Thus far, the algorithms we discussed rely on what we call knowledge engineering. This means that the modeler feeds the algorithm, in an extremely formalized manner, knowledge about how language works, with the goal of explicitly mimicking the steps taken by a human reasoner (IF A THEN B; EXCEPT IF C, THEN the conclusion is D). This invariably involves heavy investment in terms of time and (human) resources. Human coders painstakingly annotate words according to their polarity, and rule-based methods rely on some (basic) understanding of grammatical structure and syntax. However, this laborious method tends to pay off in terms of reliability and validity, so much so that purely human-annotated lexicons are literally called gold standard lexicons. Using predefined lexicons and — optionally — some decision rules, the computer follows the instructions of its human masters; that's it.

For the first few decades of artificial intelligence, this kind of human-centered and explicitly coded artificial reasoning dominated the field. The machine was considered to be a sound reasoner, an entity that follows the laws of logic based on some ground rules defined by the programmer. This made deductive reasoning such as syllogisms the hallmark of AI programming for decades. Experts coined the term symbolic AI to refer to this paradigm, also sometimes called GOFAI (good old-fashioned artificial intelligence). The idea here is that machines manipulate symbols (e.g. objects, words, etc.) that are afforded a particular meaning according to explicitly programmed rules (e.g. the appearance of 'but' in a sentence means the following clause will contradict the previous clause, so the machine should act/judge accordingly). Before machine learning took over the world of AI, this knowledge-based approach is how so-called 'expert systems' were made. The software produced according to these principles often consisted of bulky programs, designed for very specific business purposes, executing decision rules or trees (together with their estimated probability of success/failure) which were 'hard-coded' into the algorithm. These kinds of programs were the real deal in the '80s and single-handedly saved the field from its demise. For example, some programs were designed for diagnostic purposes, where figures obtained from blood sample analyses were run through thousands of decision rules (in the form of IF → THEN statements) with the end goal of closely mimicking the diagnostic skills of doctors. Although these tailor-made models are outperformed in some respects by the recent rise of neural nets specifically and machine learning in general, the idea that algorithms — even complex machine learning methods — can profit from human-engineered knowledge remains relevant to this day.

When it comes to NLP specifically, all efforts to index our understanding of human knowledge in a digital, searchable and computer-friendly form are influenced by the philosophy of symbolic AI. One of the most impressive undertakings to build a relational database of language is the WordNet project. WordNet embraces the structuralist nature of human language by taking a strongly relationist approach. In WordNet, each word is organized into what they call a 'synset' of related words which are, in turn, related to other synsets. So when using WordNet, the computer can 'understand' that 'tire' has a whole-part relationship with the word 'car', and that the word 'car' is somewhat related to 'bicycle' or other modes of transportation. In some basic manner, WordNet aims to turn the computer into an 'associative machine', much like our own brain, where each and every word gains meaning by relating it to other words. Another common example of a knowledge-based NLP task is part-of-speech tagging (POS tagging). During this process, each token is marked according to the specific function it serves within the text (e.g. verb, pronoun, adverb…) and how it relates to the surrounding tokens (e.g. the word 'kind' in 'kind of bad' tells us something about how bad something is). This allows the model to handle homonyms, expressions and intensifiers. Take the words 'like' or 'kind', for example. Compare the sentence "And I was like: this is kind of rude!" with "I like my colleagues a lot, and they're really kind!". Since part-of-speech tagging serves the general purpose of making meaningless tokens (somewhat) meaningful to the computer, plenty of NLP libraries provide this service (e.g. NLTK, Pattern). Although dictionary models could incorporate part-of-speech tagging as a preprocessing step, many lexicons don't differentiate their library of tokens according to their part of speech. Again, just like the expert systems of the eighties, the machine is helpless without instructions from its human master. This is not (entirely) the case for the third and final modeling strategy we'll discuss here.
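
As a small illustration, NLTK's off-the-shelf tagger can be used like this (assuming the required NLTK resources have been downloaded):

```python
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time downloads

print(nltk.pos_tag(nltk.word_tokenize("I like my colleagues a lot, and they're really kind!")))
# 'like' is tagged as a verb here; in "And I was like: this is kind of rude!" it plays a
# different grammatical role, which is exactly the ambiguity POS tagging helps resolve.
```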

▹Machine learning models
Model example: Stanford CoreNLP

The third technique employs machine learning (ML) instead of knowledge engineering. Instead of feeding the algorithm a dictionary of words and explicit decision rules, this strategy holds that we simply feed the computer examples of labeled data — in this case negative and positive texts — and let the algorithm find out by itself which (combinations of) tokens are predictive of a specific label.

This 'learning by example' principle is the defining characteristic of supervised machine learning. In supervised learning, the algorithm only depends on human-generated knowledge insofar as it relies on valid training data. Although the source of these example data varies from model to model and can be generated manually by trained coders, many modelers opt for texts that are 'by definition' polarity-labeled. Movie reviews are a good example here and a very popular source for generating training data. For example, one can scrape 10.000 movie reviews, each with a particular rating — let's say from 1 to 5 stars — and use 2.5 stars as a cut-off to determine whether a particular review is negative or positive. The model will learn from these examples that words such as 'abysmal' and 'poor' are highly predictive of negative reviews, while words such as 'great' are indicative of positive texts. In its simplest form, a bag-of-words model transforms the entire text into an unordered 'bag' of tokens, where word order and context are irrelevant. Words are then simply treated as variables that — separately or in combination with other words present in the text — are evaluated on their predictive power for a certain label. A Naïve Bayes classifier does something along those lines.

The metaphorical bag of words
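
A bare-bones version of such a bag-of-words classifier can be put together with scikit-learn; the four reviews and their labels below are invented placeholders for the thousands of scraped, star-rated reviews described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: the labels would normally be derived from a star-rating cut-off
reviews = ["An abysmal, poor excuse for a film.",
           "A great movie with a great cast.",
           "Poor writing and abysmal acting.",
           "Great fun from start to finish."]
labels = ["negative", "positive", "negative", "positive"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # bag of words + Naive Bayes
model.fit(reviews, labels)
print(model.predict(["The acting was great."]))
```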

However, supervised learners can go much further than that. The model can learn intricate decision rules to categorize textual data; even rules that go beyond our own knowledge of how we use language in naturalistic settings, rules that we would never even consider 'feeding' the algorithm in a knowledge-engineering setup in the first place. This is the magic of machine learning: it is able to pick up complex patterns that defy formalization, patterns that are impossible to express in human language to begin with. To understand this, consider a neural network that learns to differentiate pictures of cats versus dogs (if you want to learn more about neural networks, check out this series made by 3Blue1Brown). The model might pick up some strange patterns in pixel properties that prove to be highly predictive for recognizing cats, but it's more than likely that these patterns cannot be expressed — or even comprehended — in natural language. Maybe the machine comes up with a way to recognize a cat's whiskers, but this 'rule' is entrenched within a complex web of weights (for inputs) and activation thresholds (e.g. if a pixel has a bright color — a whisker — but the surrounding pixels have a darker shade and there are multiple of these bright-dark patterns in the same neighborhood, a certain pattern of neurons tends to be activated that will rule that the picture contains a cat).

When it comes to predictive models within NLP, this means that the machine can profit from looking at thousands or even millions of specific examples of textual data (e.g. tweets, articles), looking for the metaphorical signal in the noise. Instead of reducing the text to a mere bag of words, the entire text — and thus its entire syntactic, grammatical and relational complexity — can be fed to an algorithm. The mere presence of a particular token, in combination with information on the presence of other tokens, word order and its general placing within the text provides an endless pool of candidate rules, of which some are highly predictive.

These kinds of techniques are often called 'black box' models, since it's hard to backtrack the model's line of reasoning whenever it's predicting a textual property. All we know is how well the model performs in predicting a specific textual property, not exactly why it performs well. Indeed, any kind of statistical modeling has to deal with an inherent trade-off between interpretability and predictability. Whereas simple models — like the ones published in most academic articles within the social sciences — allow us to gain key insights into our (social) reality due to their transparency, they often reduce the complexity of said reality in such a dramatic fashion that predictability isn't exactly their strong suit. Especially when it comes to language, where context is key to construing meaning, algorithms that are capable of learning complex interactions may outperform more transparent models. It's simply next to impossible, even for the most expert linguist, to come up with a clear set of 'rules' for a lexicon that covers all possible exceptions and idiosyncrasies inherent to natural language. Even trying to do so would leave us with a list of thousands and thousands of decision rules, some of which might contradict one another! Moreover, since sentiment analysis is usually utilized as a tool and not as an instrument for knowledge generation as such, the modeler sees no harm in sacrificing transparency for model performance. However, this increase in performance comes at a computational cost. Complex ML models might even be unable to code somewhat larger texts (such as movie reviews) on many personal computers, requiring a significant amount of available memory. In contrast, (rule-based) dictionary approaches are able to code texts more or less on the go, which makes them more suitable for tasks such as social media monitoring.

Although it's entirely possible to employ some unsupervised machine learning methods here, such as dimension reduction techniques, they rarely serve as the true backbone of a sentiment model. In unsupervised learning, the algorithm doesn't learn from examples, but looks for statistically significant patterns that are inherently present in the data. The big advantage here is that there is no need for labeled data; the algorithm can simply infer that, for example, the words 'bad', 'abysmal' and 'poor' tend to co-occur in the same text. However, the relevance and meaning of the distinguished patterns still need to be interpreted by human annotators. So instead of labeling the input side of the model as in supervised learning (i.e. the training data), the modeler needs to interpret the output side of the model (i.e. the observed patterns). The assumption, then, is that the co-occurrence of certain features measures some fundamental shared textual property. Admittedly, there are areas in text classification where unsupervised learning has proved to be useful, such as in detecting issue-specific news frames. An additional advantage is that the algorithm's unguided search for patterns may point to new and surprising insights that can lead to new classification schemes. This is not the case in a supervised learning environment, where the end goal is defined beforehand: both the classification scheme and the labeling of the texts are dictated by human interpretation. Unsupervised learning has no such restrictions. It could, as in the above-cited research, point to the existence of several feature clusters that are indicative of news frames that are yet to be conceptualized by researchers.

The downside here is that there is no guarantee these statistical patterns are meaningful or relevant at all. This is especially true for models built for a very narrow and well-defined purpose, such as a polarity model. Therefore, whenever unsupervised techniques are deemed relevant for constructing a sentiment classifier, researchers tend to use semi-supervised learning. These ML methods start off with a small proportion of labeled data, in this case a modest sentiment dictionary, and leverage this knowledge while looking for patterns in unlabeled data. One common technique is to use information on neighboring tokens: if particular words keep popping up in the neighborhood of a feature included in my initial sentiment dictionary (e.g. 'flawed' is likely to appear in the neighborhood of 'poor'), it is likely that these words express a similar sentiment (unless the text expresses multiple conflicting and contrary sentiments!). This allows the researcher to increase the coverage of the dictionary in a semi-automatic fashion.
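
A highly simplified sketch of that neighboring-token idea: count which words repeatedly show up within a small window around seed words from an initial dictionary and propose them as candidate features. The seed words, window size, stop word list and mini-corpus are invented for the example.

```python
import re
from collections import Counter

seed = {"poor": -0.7, "bad": -0.7}                 # a (tiny) initial sentiment dictionary
STOP = {"the", "a", "and", "was"}                  # a real system would use a proper stop word list
corpus = ["The plot was poor and deeply flawed.",
          "A bad, flawed script ruins the film."]

candidates = Counter()
for text in corpus:
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, tok in enumerate(tokens):
        if tok in seed:
            # look three tokens to the left and right of each seed word
            for neighbour in tokens[max(0, i - 3):i] + tokens[i + 1:i + 4]:
                if neighbour not in seed and neighbour not in STOP:
                    candidates[neighbour] += 1

print(candidates.most_common(3))  # 'flawed' shows up near both seed words
```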

We'll take the sentiment model from the Stanford CoreNLP package as a representative of the ML models. This software package contains a collection of impressive NLP tools, including named entity recognition and a dependency parser. Its sentiment model uses a variation of the popular recursive neural network model. The deep learning model is trained on what they call a treebank: a collection of around 10.000 labeled sentences — scraped from movie reviews — where word order is retained. You can find more info on their website. The researchers claim this type of model will outperform lexicon approaches, especially when it comes to short texts, since the neural network uses all available information: the presence of a specific token, what kind of other tokens are present in the text, the order of tokens, punctuation marks… you name it! These features aren't (necessarily) 'hard-coded' in the training data, though. It is exactly this lack of preprocessing steps that allows the model to formulate extremely complex decision rules. The sentiment classifier assigns an overall probability that the text belongs to one of five classes: very negative, negative, neutral, positive and very positive. The class with the highest probability wins, obviously. We can apply this pretrained model to our own texts without going through the hassle of building our own classifier. To do so, we'll need to spin up the CoreNLP server (a Java process) and query it from Python.

I use the subprocess module in Python to set up the server. In essence, this package allows you to spawn new processes by executing commands through the command line (Windows) or terminal (Linux) interface. After downloading the Stanford CoreNLP model, I simply launched the server from Python on a Windows machine by typing:
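
The original command is shown as an image; a sketch of what launching the server through subprocess could look like is given below. The memory setting, port and folder path are assumptions that depend on your own Stanford CoreNLP download.

```python
import subprocess

# Launch the CoreNLP server as a background Java process; 'cwd' points to the folder
# where the Stanford CoreNLP distribution was unzipped (hypothetical path below).
server = subprocess.Popen(
    'java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000',
    cwd=r"C:\stanford-corenlp",
    shell=True,
)
```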

where the value of “cwd” (current work directory) refers to the folder containing the model.

Now that we have set up the server, let's see how this more sophisticated model deals with several example sentences. In Python, I executed:
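
The snippet is an image in the original; one way to query the running server is over plain HTTP with the requests library, as sketched below (the example sentence is my own).

```python
import json
import requests

def stanford_sentiment(text, url="http://localhost:9000"):
    """Return (sentimentValue, sentiment) per sentence, where sentimentValue runs from 0 to 4."""
    props = {"annotators": "sentiment", "outputFormat": "json"}
    response = requests.post(url, params={"properties": json.dumps(props)},
                             data=text.encode("utf-8"))
    return [(s["sentimentValue"], s["sentiment"]) for s in response.json()["sentences"]]

print(stanford_sentiment("The staff was friendly and the food was delicious."))
```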

The output returns a value between 0 (very negative) and 4 (very positive). I already used Pandas’ apply-function to convert the numerical values to their corresponding labels. This is what Stanford’s model came up with:

Testing the sentiment model incorporated into Stanford CoreNLP

To be fair: I came up with some particularly challenging sentences. Since the model claims to outperform straightforward lexicon approaches, I figured it should be up to the task.

Unfortunately, it is clear that the Stanford model has its share of structural flaws as well. Only four out of six sentences are at least 'somewhat correct' in their sentiment evaluation. The second sentence is a classic but rather complex 'negation' pitfall that most sentiment libraries would have trouble dealing with. Even Vader, which explicitly incorporates an extended negation rule, is unable to detect this sort of 'stretched out' negation. Indeed, Vader gives the sentence on Prague a negative sentiment score as well (-0.4). The fourth sentence has also been erroneously coded, and it's a good example of how layered our everyday language is, where we mix different viewpoints with ease. First, the speaker here takes on the viewpoint of the general public, admitting that the Transformers flick should be considered a 'bad' film based on some shared notion of quality, such as the poor camerawork and writing. It's the ironic "so bad it's good" take at the end that is hard to recognize, even for deep learning models. The other sentences have an acceptable coding, although the first one should clearly be classified as 'very negative'. Again, in this sentence the intricate interplay between the two clauses causes some trouble.

▹Knowledge engineering and machine learning as a continuum

Although the lexicon-ML distinction is a convenient framework to think about sentiment modeling (and NLP in general), in reality these two schools of thought should be considered as two extremes alongside a continuum. Indeed, most models adopt some kind of hybrid approach. Lexicon models often use ML principles to expand their human-constructed dictionary. At the same time, though, ML models still rely on some knowledge engineering as well. While algorithms outperform any human in spotting predictive patterns, humans are king when it comes to making these patterns meaningful. Contrary to an algorithm that starts with an entirely blank slate, we humans don’t exactly need 10.000 examples to learn that ‘not’ usually negates the sentiment of whatever token comes next. This is why part-of-speech tagging remains a popular pre-processing step for many NLP-tasks, even for so-called ‘black box’ models. In this sense, humans provide a basic roadmap of language, while machines provide the speed needed to come up with complex decision rules. Below you can find a brief summary of the three strategies we just discussed and the corresponding model we take as representative for each category.

A summary of the models under study and their corresponding modeling strategy

Setting the stage

sentiment scores of presidential speeches

Now that we have a basic idea of how these models work, it’s time to put them to the test.

Initially, I just wanted to analyze the polarity scores for a couple of political speeches as part of a workshop I was developing on sentiment analysis (which you can access here; the code I wrote for this workshop can be run by anyone on Repl here). I had the feeling I did not get the most out of this little side project, so I decided to expand on it and do a full-fledged sentiment analysis of all speeches made by US presidents.

For some reason, I was unable to find a straightforward collection of speeches, so I decided to write a little scraper in Python and collect all the transcripts from the Miller Center. More specifically, I used Selenium to simulate a browser session on the Miller Center website and subsequently scraped the speeches page by page. The script looks something like this:

The first part of the code basically opens the website, scrolling down all the way from Donald J. Trump to the first inaugural address of George Washington. The page automatically loads a subset of the speeches whenever you’re scrolling down, similarly to how your social media feed (such as Facebook and Twitter) works.

Retrieving speeches from the Miller Center using Selenium — part 1
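
The snippet itself is an image; part 1 might look roughly like this, where the URL, driver choice and sleep time are assumptions.

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://millercenter.org/the-presidency/presidential-speeches")  # assumed URL

# Keep scrolling until the page stops loading new batches of speeches
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
```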

When the browser has reached the bottom of the page, we can extract the URLs of all individual speech pages. So the second part of the script opens these URLs one by one, extracting the appropriate tags and attributes using the BeautifulSoup package.

Retrieving speeches from the Miller Center using Selenium — part 2
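
Again, the actual snippet is an image; a sketch of part 2, reusing the Selenium driver from part 1, could look like the following. The CSS selector and class name are hypothetical placeholders that depend on the Miller Center's markup.

```python
from bs4 import BeautifulSoup

# Collect the links to the individual speech pages from the fully loaded overview page
soup = BeautifulSoup(driver.page_source, "html.parser")
links = [a["href"] for a in soup.select("div.views-row a[href]")]  # hypothetical selector

speeches = []
for link in links:
    driver.get(link)
    page = BeautifulSoup(driver.page_source, "html.parser")
    title = page.find("h1").get_text(strip=True)
    body = page.find("div", class_="transcript-inner")              # hypothetical class name
    speeches.append({"title": title,
                     "transcript": body.get_text(" ", strip=True) if body else None})
```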

995 pages later, we end up with an impressive dataset. I merged the presidential speeches with some additional variables from Wikipedia, such as the party of the president in question and the beginning and end year of each term. This allows us to, for example, compare the sentiment scores between Republican and Democratic presidents or to compare sentiment scores over time. You can find the result of this scraping process right here on Kaggle.

Great, now that we have plenty of textual data, let's come up with a function that attaches the sentiment scores of Pattern, Vader and Stanford CoreNLP to the corresponding texts. I came up with the following chunk of code:
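
The function itself is shown as an image in the original post; a simplified sketch of what it could look like is given below. It covers the Pattern and Vader scores, the tqdm progress bar, the langdetect column and the sentence_level switch; the author's version also attaches the Stanford CoreNLP scores, and all names and column labels here are my own.

```python
import pandas as pd
from tqdm import tqdm
from langdetect import detect
from nltk.tokenize import sent_tokenize
from pattern.en import sentiment as pattern_sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def sentiment_dataframe(df, id_col, text_col, sentence_level=0):
    rows = []
    for _, row in tqdm(df.iterrows(), total=len(df)):        # progress bar over all texts
        text = str(row[text_col])
        units = sent_tokenize(text) if sentence_level == 1 else [text]
        for unit in units:
            try:
                language = detect(unit)                       # e.g. 'en', 'nl', ...
            except Exception:
                language = None
            rows.append({id_col: row[id_col],
                         "text": unit,
                         "language": language,
                         "pattern_score": pattern_sentiment(unit)[0],
                         "vader_score": vader.polarity_scores(unit)["compound"]})
    return pd.DataFrame(rows)
```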

If you provide the function with an ID and text identifier, Python will add several new columns to the Kaggle dataset I just referred to. It's a straightforward function that appends the sentiment scores to multiple lists, although it contains a couple of additional bells and whistles. For one thing, it features a neat progress bar, kindly delivered by the people who wrote the tqdm library. The user can also define a "sentence_level" parameter. If its value is 1, the text is broken up and coded at the sentence level. The default value is 0, which means the sentiment is calculated on the aggregate text level. Third, I incorporated the language prediction function from the langdetect library. This could be relevant whenever you're dealing with bilingual texts, such as speeches from some African leaders (e.g. Nigeria), where English and local languages or dialects can appear in the very same text. Thanks to the added language column, the data analyst can simply subset the texts or sentences that are written in English or whatever language you wish to target for analysis.

Let's put our little function to use. After loading the dataset into my Python environment, I subsetted the data to only include inaugural addresses and State of the Union speeches. After that, and using only a single line of code, I transformed the dataset into a sentence-level dataframe.
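
With hypothetical column names, that single line could look like this:

```python
# Break the subsetted speeches into sentences and attach the sentiment scores
speech_sentences = sentiment_dataframe(speeches_subset, id_col="title",
                                       text_col="transcript", sentence_level=1)
```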

The first two rows of our dataframe look something like this:

The first two rows of our sentiment dataframe

In total, we coded almost 22.000 sentences (again: this means we have around 22.000 rows in our dataset). Let's take a look at the distribution of the obtained sentiment scores, while differentiating between Democratic and Republican speeches. The code is somewhat bulky, so I uploaded it to my GitHub profile right here.

Comparing the sentiment distributions across the three models — obtained from political speeches.

There are a few striking observations worthy of our attention.

  • First of all, both the Pattern and Vader scores contain a high proportion of neutral codings (score 0). The Pattern package especially fails to detect any polarity in a significant proportion of sentences: almost half of its sentiment scores are (almost) neutral, compared to only about 20% of (almost) neutral codings among the Vader codings. Be aware that the plot is divided into bins, so scores of 0.00 and 0.05 end up in the same bin here (hence the word 'almost').
  • Second, the Pattern dictionary seems to have a positivity bias. In other words, if the library detects any sentiment at all, it tends to perform better on positive sentences. Vader clearly has a larger share of negative codings (score < 0) for the very same sample data. Admittedly, this is something the authors of the Pattern lexicon have already noticed themselves. It might be explained by the role adjectives — the sole word category represented in the model — play in our everyday language. More specifically, perhaps we tend to use more adjectives when expressing positive emotions. The authors of the Pattern package also point to possible bias introduced by sarcasm (e.g. "Well, my life would be worthless without this awesome book…") or negative comparisons (e.g. "Book X is great, but this was somewhat disappointing."); all cases where positive adjectives are used to express a negative evaluation.
  • Finally, the scores obtained by the Stanford package seem a little… off. Almost 60% of all sentences are predicted to be 'slightly negative' by the model, with only one out of four sentences receiving a positive label. Strikingly, the neutral coding is one of the least popular categories. This is peculiar, since we would expect the neutral coding to be the most popular category for most sentiment algorithms due to limitations in a model's coverage (e.g. see the adjective limitation of the Pattern package). Just to double-check the validity of our code, let's see how the model deals with a couple of clearly neutral or ambiguous sentences. Next to the predicted class, I also requested the predicted sentiment distribution and added the highest class probability in the third column.
Setting up some tricky boobytraps for the Stanford CoreNLP model

The first sentence proves more or less that there's nothing wrong with the Python code on my side. Feeding the algorithm some gibberish should, indeed, produce a 'neutral' coding. There are also a couple of legitimately neutral sentences (numbers 3 and 6) that are recognized as such. The remaining three sentences, however, show a clear negativity bias. For the fifth sentence, which is undeniably positive, the Stanford model gives us a 42% probability that the text is, in fact, negative. The probabilities for the positive (15%) and extremely positive (16%) categories don't even come close. Strange, because I avoided any common pitfalls that might trick even a straightforward lexicon model; Vader gives this sentence an unambiguous positive coding (with a polarity score of 0.49). Moreover, changing "only halfway", which admittedly could be interpreted as negative in a different context, to "already halfway" only results in an even higher negative probability (49%). The mistake made with the fourth sentence is more understandable. The speaker admits that there are differences and that polarization and vilifying others are rampant in the current political climate, but the speaker will do everything in his or her power to resolve these issues. In the end, the message is rather hopeful. Still, the Stanford model gives it a whopping 57% probability that the sentence is "negative" and even a 29% probability that it's "very negative". The positive (3%) and very positive (2%) categories barely register at all! Even the French (!) opening of the Albert Camus novel 'L'étranger' somehow registers as negative.

So, of all the models under review, and solely based on the sentiment distributions displayed above, Vader seems to do the most competent job; providing plenty of variance in sentence polarity with a relatively small proportion of neutral values.

For my little workshop, I intended to show my students how to display the pattern of polarity development throughout a single speech. In other words: what does the ‘fingerprint’ of a presidential speech look like? Moreover, do these patterns differ according to party affiliation? Since we’d like to maximize the comparability of these speeches, it’s advisable to standardize the length of each text so that the first sentence represents 0% (start of speech) and the last sentence 100% (end of speech). I wrote the following function to do just that:
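
The function is shown as an image in the original; a sketch with hypothetical column names could look like this:

```python
def add_relative_position(df, speech_col="title"):
    """Express each sentence's position as a percentage of its speech (0% = first, 100% = last)."""
    df = df.copy()
    order = df.groupby(speech_col).cumcount()
    length = df.groupby(speech_col)[speech_col].transform("count")
    df["position_pct"] = 100 * order / (length - 1).clip(lower=1)
    return df
```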

Before we continue, let’s subset our dataset so we only retain the inaugural speeches starting from JFK (1961) right up to Donald J. Trump (2017).

OK, we’re all set! I wrote a little function to create a template for the graph below. You can check out the code here. Because I poured the code into a function, I can reproduce the following graph with different data using only a single line of code:

The result looks something like this:

Sentiment polarity (Vader) throughout inauguration speeches — from JFK to Trump.

Each and every subplot represents the course of a particular inaugural speech, featuring:

· The raw sentiment scores per sentence (dots)

· The mean sentiment score of the speech (horizontal dotted line)

· A moving average (bold solid line)

The moving average is a smoothing procedure that separates the metaphorical signal from the noise. In this case, I used a moving average of 10, which means that any point on the MA line represents the average of a window consisting of 10 sentences and their corresponding sentiment scores.
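
With pandas, such a moving average is a one-liner per speech (the column names are hypothetical, and whether the window is centered or trailing is a design choice):

```python
# A moving average over a window of 10 sentences, centered on each sentence
speech["vader_ma"] = speech["vader_score"].rolling(window=10, center=True).mean()
```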

Using these settings, it is hard to decipher any particular pattern of development, apart from a few minor trends. It's pretty obvious that most speeches start off with a cliff, with a relatively steep decline in sentiment scores. This makes sense: the newly elected president wants to set the stage to announce the much-needed change that the United States needs, painting a rather bleak picture of the state of the union. For similar reasons, and as a mirror image of this initial pessimism, most speeches tend to end with a bump in sentiment scores. After detailing their vision of the American Dream during the middle of the speech (between around 30% and 80% of the speech), the new president makes clear that this new leadership has what it takes to deal with the many challenges that lie ahead for the American people.

Apart from that, speeches go through seemingly random peaks and troughs. Moreover, the data do not suggest that Republicans and Democrats differ all that much in their sentiment scores either, although a Mann-Whitney U test shows otherwise. More specifically, Republican sentences have a significantly higher median sentiment score than sentences belonging to a Democrat (Median Republican = 0.36, Median Democrat = 0.20, U = 237748.5, p < 0.05). In other words: the cliché of the drum-beating Republican taking every opportunity to pledge allegiance to the great American flag, all the while expressing the can-do optimism so characteristic of the American Dream, might be reflected in their inauguration speeches. A qualitative analysis should be performed to test this hypothesis, though. Performing the Mann-Whitney test takes just a few lines of code thanks to the scipy package:
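
A sketch of that test, where rep_scores and dem_scores are the sentence-level Vader scores per party (hypothetical variable names):

```python
from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(rep_scores, dem_scores, alternative="two-sided")
print(u_stat, p_value)
```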

Questioning model outcomes

Usually, this is where most online tutorials and even academic applications of sentiment analysis end. The goal is to get to the results, produce a nifty graphic or statistical model and woo the client or fellow academics. There is little motivation to actually question the output of the model. After all, the model’s accuracy has already been demonstrated, right? Most of these sentiment libraries even have an academic publication detailing the extensive validation process! Sure, this is true. But this ignores the fact that these models give wildly different sentiment scores. To demonstrate this, take a look at the gif I created below, which reproduces the previous graphic with different MA settings using the Vader versus Pattern package. What if we, just by happenstance, opted for the Pattern library to conduct our analyses? Our figures would look a whole lot different.

Comparing results from the Vader and Pattern model for inauguration speeches — from JFK to Trump.

Admittedly, one of the main drivers behind the highly variable moving average trends produced here is the fact that Pattern has a high proportion of zero sentiment values (neutral scores). But it's not particularly hard to find outright opposite sentiment scores. Let's single out the most memorable quote from Donald J. Trump's inaugural speech and compare the evaluated chunks from the Vader versus Pattern library.

Time after time, this quote was featured in news articles around the globe, as it encapsulated the apocalyptic worldview expressed throughout Trump's address. It's pretty obvious that Vader's estimation is closer to the ground truth here and that Pattern is easily misled by a few positive adjectives. But how large is the discrepancy between these models, really? And what can you learn from this when it comes to conducting your own sentiment analysis?

Agree to disagree

comparing sentiment scores of politicians’ speeches and tweets

Before we dig into the model disagreements, I'd like some additional textual data to control for any bias introduced by text type. After all, maybe political speeches are simply too complex, ambiguous or nuanced in their emotional valence to be properly coded by these packages. Moreover, two out of three models — Vader and the Stanford model — advertise their algorithm as particularly suitable for analyzing 'microblogs' such as social media posts. Surely posts on Facebook or Twitter are more clear-cut in their emotional valence than a carefully crafted political speech, right? Besides that, one could also argue that it's unfair to consider a sentence in a speech as a proper unit of analysis, since a single sentence only gains meaning within a broader context. Social media posts, on the other hand, are usually short (sometimes by design, as on Twitter) and their overall sentiment value is contained in only one or two sentences. Maybe these models tend to deliver the same results when they analyze these shorter, more straightforward microblogs? This is a reasonable assumption and worth incorporating into our analysis here.

For this reason, I scraped around 11500 tweets from six popular politicians. I tried to strike an ideological balance, including a couple of Democrats and Republicans. Thus, I targeted the accounts of Donald J. Trump (Rep.), Rand Paul (Rep.), Ted Cruz (Rep.), Alexandria Ocasio-Cortez (Dem.), Nancy Pelosi (Dem.) and Bernie Sanders (Dem.). I'm not a big fan of the Twitter API, so I used a great little library called GetOldTweets, which is hosted on PyPI right here. I wrote a little function and a loop in just a couple of lines to obtain my sample of tweets:
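
The snippet is an image in the original; a sketch using the GetOldTweets3 package (the PyPI incarnation of GetOldTweets) might look like this. The tweet limit and the Twitter handles are assumptions.

```python
import GetOldTweets3 as got
import pandas as pd

def get_tweets(username, max_tweets=2000):
    criteria = got.manager.TweetCriteria().setUsername(username).setMaxTweets(max_tweets)
    tweets = got.manager.TweetManager.getTweets(criteria)
    return pd.DataFrame({"username": username,
                         "date": [t.date for t in tweets],
                         "text": [t.text for t in tweets]})

handles = ["realDonaldTrump", "RandPaul", "tedcruz",       # assumed handles
           "AOC", "SpeakerPelosi", "BernieSanders"]
tweets_df = pd.concat([get_tweets(h) for h in handles], ignore_index=True)
```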

Moreover, thanks to the sentiment dataframe function we created earlier, all it takes is one additional line of code to attach the appropriate sentiment scores to our Twitter data. Keep in mind that, contrary to the coding of the political speeches, the entire tweet is analyzed as such, without sentence-splitting (see the value of the 'sentence_level' parameter below, which is set to zero).
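
Reusing the function sketched earlier, that single line could look like this:

```python
# Code whole tweets at once (no sentence splitting)
tweets_coded = sentiment_dataframe(tweets_df, id_col="username", text_col="text", sentence_level=0)
```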

That’s basically all the data collection and wrangling we need to conduct our comparative analysis. I uploaded both coded datasets — one for the speeches and one for the tweets — to Kaggle right here.

Let’s take a quick look at how the sentiment distributions of our tweets corpus compare to the ones obtained from the political speeches. I plotted some kernel density plots below to ease the comparison (You can find the code for this plot on my Github here.)

Comparing sentiment distributions for political speeches versus tweets.

Interestingly, the within-model distributions of the two samples exhibit the same characteristics. For example, the positivity bias of Pattern pops up again in the tweet sample. In the same vein, the suspiciously high proportion of negatively coded texts by the Stanford model is similar across the two samples. However, Vader detects a higher proportion of negative tweets than negative sentences in the political speeches. Although I don't have a gold standard (i.e. human-coded) sample of tweets at my disposal, the finding that politicians are more negative in their emotional valence on their Twitter timeline than in their official speeches seems like common sense. Again, the Vader model looks like the most promising choice among the competing models, giving us a nicely distributed set of sentiment scores across the whole spectrum.

But how much do these codings for individual sentences and tweets really overlap? To compare and contrast individual scores, we need to come up with a couple of relevant summary statistics and a sensible way of presenting them graphically. To this end, I plotted a figure in Python using the matplotlib and seaborn libraries. I did this by:

1) Doing some additional data wrangling. This is the part where we come up with the relevant statistics to create our comparison graph.

· First, since Vader and Pattern return numerical values, I aggregated both scores so a sensible comparison could be made (i.e. negative, neutral and positive sentiment categories). I also aggregated the Stanford scores from five categories (very neg/neg/neutral/pos/very pos) to three.

· Second, since I wanted to know the proportion of agreement versus disagreement between each and every model, I had to calculate a couple of additional binary variables that would indicate whether two particular models coded the text similarly or not.

· Third, I wanted to normalize the comparison figures so that one row represents the total amount of texts in that particular category (i.e. the sum of one row should equal 100%).

I wrote the following function to do just that:
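
A simplified sketch of such a function is given below; it covers the three wrangling steps, while the author's version also returns the dictionaries used for the pie charts and heatmaps. The neutral band of ±0.05, the column names and the Stanford mapping are my own choices.

```python
import pandas as pd

def add_agreement_columns(df, neutral_band=0.05):
    df = df.copy()

    # Step 1: collapse numerical Vader/Pattern scores and the five Stanford classes into neg/neutral/pos
    def bin_score(x):
        if x > neutral_band:
            return "positive"
        if x < -neutral_band:
            return "negative"
        return "neutral"

    df["vader_cat"] = df["vader_score"].apply(bin_score)
    df["pattern_cat"] = df["pattern_score"].apply(bin_score)
    df["stanford_cat"] = df["stanford_score"].map(
        {0: "negative", 1: "negative", 2: "neutral", 3: "positive", 4: "positive"})

    # Step 2: binary agreement indicators for each pair of models
    pairs = [("vader_cat", "pattern_cat"), ("vader_cat", "stanford_cat"), ("pattern_cat", "stanford_cat")]
    for a, b in pairs:
        df[f"agree_{a.split('_')[0]}_{b.split('_')[0]}"] = (df[a] == df[b]).astype(int)

    # Step 3: a row-normalized cross-tab (each row sums to 100%), e.g. Vader versus Pattern
    heatmap = pd.crosstab(df["vader_cat"], df["pattern_cat"], normalize="index") * 100
    return df, heatmap
```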

The function outputs one dataframe — which is an adapted version of our original dataset — and six dictionaries: three for creating pie charts and three for creating so-called heatmaps.

I applied the function to my two datasets with a single line of code:

· Fourth, I wanted to calculate some additional info on (a) the correlation between the sentiment scores and (b) whether a disagreement stems from a truly opposite coding (one model codes the text as positive and the other as negative) or from the fact that one model fails to detect any sentiment at all (e.g. one positive versus one neutral score). I created the following function:
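
The original function is again an image; a sketch of a corr_info-style helper that returns the correlation plus the breakdown of disagreements (neutral-driven versus truly opposite), reusing the category columns from the previous sketch, could look like this:

```python
def corr_info(df, score_a="vader_score", score_b="pattern_score",
              cat_a="vader_cat", cat_b="pattern_cat"):
    correlation = df[score_a].corr(df[score_b])
    disagreements = df[df[cat_a] != df[cat_b]]
    neutral_driven = ((disagreements[cat_a] == "neutral") |
                      (disagreements[cat_b] == "neutral")).mean()
    return {"correlation": correlation,
            "share_disagreement": len(disagreements) / len(df),
            "share_neutral_driven": neutral_driven,
            "share_truly_opposite": 1 - neutral_driven}
```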

And obtained the relevant metrics by using the ‘corr_info’ function in the following manner:

2) After obtaining the relevant stats, we need some code to transform these unintelligible metrics into an informative graphic. To accomplish that, I wrote a chunk of code, which you can access here.

If this looks like a whole lot of clunky code to just obtain a single figure, it’s because it is. Unfortunately, Python code looks kind of chaotic whenever you want to stray off the predetermined path and create a custom-made figure. Below you can find the image produced by this wall of code.

(Dis)agreement between different sentiment polarity models for different text types (speeches versus tweets of politicians)

The image is information-dense and contains plenty of interesting points of discussion. Let’s unpack some of these for a minute:

  • Across six comparison groups ([text type] * [comparison between two models]), the range of disagreement between the sentiment models varies from 40% to 64%.

Put differently and more specifically, the Pattern and Vader models disagree whether a particular tweet is negative/neutral/positive in 4 out of 10 cases. When it comes to analyzing presidential speeches, the Pattern and Stanford model disagree in more than 6 out of 10 cases. All the other model comparisons fall somewhere in between these two extremes.

This looks disastrous and even outright implausible, but there's a catch behind these seemingly spectacular figures. More specifically, and given the aforementioned sentiment distributions (see earlier), one could expect that a substantial part of the disagreements between the lexicon models (Vader and Pattern) is driven by a model's inability to detect any polarity at all, i.e. the so-called 'neutral' sentiment category. Indeed, the role of the neutral category is of pivotal importance here: in 7 out of 10 (71%) cases where Vader and Pattern disagree on the coding of presidential speeches, one of the two libraries categorized the sentence as neutral. This means that only 13% of all sentences (0.455 * 0.29) in presidential speeches actually received an unambiguously opposite coding from the Vader versus Pattern models. The proportion of disagreement driven by the neutral category is somewhat lower for the tweet sample: around 5 out of 10 (54%) disagreements can be attributed to a neutral coding. The total proportion of disagreement in this sample is lower, but because fewer of those disagreements involve a neutral coding, about 20% of tweets end up receiving a truly opposite coding.

These coding differences seem less problematic since — in a lexicon approach — the neutral category serves as a de facto ‘trash bin’ for the model; a slot reserved for those pieces of text where the model fails to detect any emotive substance. In this sense, it seems more valid to interpret the neutral coding in Pattern and Vader as absence of any evidence that a particular statement is positive or negative, not as any evidence of absence.

While the proportion of disagreement sounds reasonable when taking this caveat of the 'trash bin' category into account, a statistical model doesn't really care about your neat rationalizations and treats the neutral category as just another factor level, similar to the positive and negative categories. This means that, more than likely, the zero-values in many sentiment distributions are riddled with faulty codings. This has severe implications for researchers who use these models within their own inferential or predictive model (e.g. predicting whether a news article becomes popular based on sentiment scores). We'll talk more about these implications in the next section.

  • The comparison between the Stanford model and any other model is all over the place.

Around half of the Vader scores don't match the ones obtained by the Stanford model; for Pattern the disagreement lies at around 60% for any text type. There is one exception though: whenever Stanford deems a text as positive, the other models tend to agree. For example, around 8 out of 10 positive Stanford scores also received a positive evaluation from the Vader model. Remember, however, that only a very small proportion of sentences are coded as positive by the Stanford model in the first place. In other words: the Stanford model seems to have an extremely high threshold for the positive sentiment category (under the assumption that the Vader categorization is somewhat valid). The neutral and negative categories are a different story. For example, only 4 out of 10 sentences that are coded as negative by the Stanford model are coded as such by the Vader model. Almost half of the sentences deemed neutral by the Stanford model are coded as positive by Vader. The comparison between the Stanford and Pattern models is even more catastrophic.

All in all, and also given the aforementioned strange concentration of negative scores, it seems unlikely that the Stanford model — for all its fancy computations and machine learning wizardry — produced valid sentiment scores. Unless, of course, more than half of the tweets and sentences in the political speeches under study here actually are negative and both the Vader and Pattern models fail to capture this trend. While this is theoretically possible and we won't know for sure unless we compare the scores to a golden standard subset of manually coded texts, it seems implausible to say the least.

  • Text type has a negligible influence on coding agreement between the models under study, although the neutral category is a weaker driver of disagreement for less complex text types (tweets). In other words: polarity models disagree on complex texts (presidential speeches) because some models fail to detect any polarity at all, while they tend to disagree on less complex texts (tweets) because of truly opposite codings.

The amount of disagreement is somewhat higher when analyzing presidential speeches versus tweets, but this can easily be explained by sample variability. However, the more fine-grained coding of the Vader and Pattern models reveals something else. While the correlation between Pattern and Vader scores is 0.52 for the tweet sample, the same figure for the speeches sample is only 0.43. This is to be expected, given that (a) tweets are probably less ambiguous in their emotional valence than a carefully crafted speech and (b) the sentence-by-sentence analysis might confuse any sentiment model, since it is unable to grasp the meaning of the text in its full context. This is also evident when looking at the patterns present in the two top heatmaps: the tweet heatmap between Vader and Pattern shows a clearer linear trend. Please note that zero-values are excluded from the two heatmaps at the top of the graph, since the overabundance of zeroes would make these subplots uninformative (besides the fact that, well, there are a lot of neutral codings).

However, this result does contradict the aforementioned higher proportion of opposite codings in the tweet sample (20%). Taken together, these somewhat contradictory findings suggest that less nuanced texts such as microblogs (i.e. social media posts) may, overall, cause less disagreement between different sentiment models than more ambiguous and complicated texts such as political speeches, but that this is driven by a lower proportion of neutral codings in the tweet sample rather than by more agreement among the non-neutral codings. In other words, there is less disagreement only because tweets are less nuanced and more blunt in their emotional expression, not because the models actually agree more on the emotional valence. The stats confirm this interpretation: the share of exact zero scores (0.00, i.e. neutral codings) ranges from 14% (Vader) to 24% (Pattern) for the tweet sample, compared to 22% (Vader) to 36% (Pattern) for the political speeches sample.

All in all, our little experiment here shows that the researcher's choice of a particular model has a significant impact on the sentiment scores obtained.

Not so neutral after all

Seven key questions to ask whenever you're considering sentiment analysis

What do these results actually mean for anyone interested in applying sentiment analysis to their own project? Most importantly, the analysis reported here makes it clear that using an off-the-shelf sentiment model without providing any further argumentation for why the model is suitable to analyze your specific text corpus is simply unacceptable. Unfortunately, this remains the rule among academic researchers — and it's probably equally true for most data scientists in the private sector. Usually, researchers simply refer to the appropriate publication of the sentiment model to reassure the reader that the off-the-shelf model they employ has been validated by the model's authors. But this simply does not suffice. Take the Pattern lexicon for example. Given this model's sentiment score distribution, its clear positivity bias, its high proportion of zero-values and the numerous faulty codings it produces, would you use its polarity score as a predictive variable in your model without conducting any quality check whatsoever? That seems, given our comparative analysis, negligent and irresponsible. But I'm pretty confident that this is exactly what happens in the majority of the (currently) 333 articles on Google Scholar that cite the Pattern library in some shape or form.

I realize I painted a rather bleak picture of the current state of the art of sentiment polarity analysis and of how researchers employ these techniques. It's easy to criticize, of course, and I see it as a moral duty to formulate a feasible way forward — or at least a set of best practices. In this light, and with the obligatory disclaimer that this list is incomplete and only reflects my own experience on the matter, I'd like to present seven key questions worth pondering whenever you decide to perform sentiment analysis.

· Do you really need to automate your coding?

Using ML models or other algorithmic models is cool. I get it. For the last couple of years, businesses and academics alike put a “machine learning” sticker on everything to ride the waves of the AI-craze. However, our enthusiasm to buy into the new hypes and buzzwords might very well be a major force that drives the unnecessary use of sentiment models.

Is it at all feasible to manually code your corpus, like in the good old days? If so, it might be a good idea to just stick with that. There is a reason why human-coded sentiment evaluations are referred to as 'golden standard' corpora. Human coders grasp the intricate complexities of language like no ML algorithm can, and this technique ultimately yields the most reliable results. Multiple trained coders can evaluate the same textual data, and measures of inter-coder reliability such as Krippendorff's alpha can be calculated with ease to gauge the quality of the coding (see the short sketch below). Of course, the counterargument is that the raison d'être of automatic coding is that you can code texts on the fly and that the speed that comes with automation is indispensable for some research projects. Although this is true in some specific cases, it is certainly not the case for many academic applications of sentiment analysis, where establishing inferential relationships on a clearly predefined corpus is often the main goal. Taking a carefully constructed sample from the bigger corpus and subjecting this smaller collection of texts to a manual coding process mostly suffices in this case. There is no real advantage in trading in a smaller, properly human-coded sample for a much bigger but less reliable one. Big is not always beautiful in data science; sampling theory in the social sciences has shown that a proper sample is able to replicate relationships present at the population level.
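As an illustration, Krippendorff's alpha for a handful of coders can be computed in a few lines, for instance with NLTK's AnnotationTask (the coders, items and labels below are made up):

```python
# Toy example: Krippendorff's alpha for three coders via NLTK
from nltk.metrics.agreement import AnnotationTask

# Each tuple is (coder, item, label)
judgements = [
    ('coder_1', 'text_1', 'positive'), ('coder_2', 'text_1', 'positive'), ('coder_3', 'text_1', 'neutral'),
    ('coder_1', 'text_2', 'negative'), ('coder_2', 'text_2', 'negative'), ('coder_3', 'text_2', 'negative'),
    ('coder_1', 'text_3', 'neutral'),  ('coder_2', 'text_3', 'positive'), ('coder_3', 'text_3', 'neutral'),
]

task = AnnotationTask(data=judgements)
print(round(task.alpha(), 2))  # Krippendorff's alpha for nominal labels
```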

Admittedly, there are plenty of exceptions. For example, what if…

a. A key component of your research question involves some kind of longitudinal trend over time (e.g. measuring polarity in news media over the last 100+ years) or a comparison between many different entities (e.g. comparing CVs submitted to your company across 20+ countries)? In that case, the research question itself demands a large corpus, given that it needs plenty of points of contrast, even if you choose to sample each point of interest in time or place.

b. You use the textual data to build a predictive model? Contrary to inferential analysis (i.e. is there a relationship between X and Y?), predictive analysis (i.e. can I predict Y if I know the value of X?) tends to benefit a lot from using enormous and ever-growing datasets to train your model. So if, for example, the goal of your research is to automatically evaluate incoming resumes from job applicants, it makes sense that you want to evaluate these resumes on the fly to improve your model over time. After all, the whole point of your model is to lessen the workload of your HR managers so that the algorithm can weed out the most bland and unconvincing motivational letters.

If your project falls under one of these two categories, it’s safe to assume that some automatic coding is desirable.

· Are you falling into the trap of algorithmic determinism and disregarding the importance of theory?

It’s one thing to know that automatic coding is the way to go, but this doesn’t mean that you know what you want to measure in the first place. This is a tricky but often overlooked issue. Given that NLP is a complex subject, researchers tend to gravitate towards the tools and techniques they already know. Sentiment analysis is a great example of this: even many laymen know that sentiment analysis exists, even if they haven’t seen a line of code in their entire life. It’s an easy to grasp concept and a prime candidate to include in any NLP-related project, regardless of whether there is a clear rationale to include some kind of sentiment measure in the first place. If you want to incorporate a sentiment algorithm in your project ‘just because you can’ or because ‘everyone does it’, you may have fallen into the trap of what I call algorithmic determinism. This concept holds that available NLP techniques and models dictate your research goals and model definition instead of the other way around.

Ideally, the researcher starts from a well-defined conceptual framework or general idea to inform model definition, either derived from theory (in academia) or from common sense and field experience (in the private sector). The often-touted idea that theory becomes obsolete in the era of Big Data is absolute nonsense. On the contrary: in an era where data scientists have an abundance of data at their disposal, it's more important than ever to have a firm grasp of our assumptions about how the world works to guide us through the jungle of data, predictors and modeling techniques. Data doesn't 'speak for itself'; it is leveraged by humans who purposefully use the data to design and interpret a model of our social world. So, above all: read the literature; and not only the quantitative stats-nerd kind of literature, but also qualitative research that aims to deep-dive into how humans make sense of the world around them. Data scientists who stick to the data as such often make bogus models that break under their own assumptions. The odds are substantial that such a model is, at worst, explaining noise or, at best, serving as a proxy for some other, more predictive, variable or process. So think about it: is there a theoretical reason to incorporate a sentiment score in your model? And if so, do you want to measure sentiment polarity? Maybe a different sentiment categorization (e.g. measuring the six basic emotions identified by Paul Ekman) makes more sense? Or if moral judgement is a key construct within your field, maybe the 'moralstrength' measure from moral foundations theory is more fitting?

But there is another way in which algorithmic determinism invariably pops up in every machine learning-driven project, no matter how developed the theoretical underpinnings of your research are. More fundamentally, predictive models still harbor inherent shortcomings in grasping interpretative constructs. Again, polarity measures serve as an excellent example. Polarity is, obviously, a reductive operationalization of the multi-layered and complex concept 'sentiment'. Indeed, many automatic coding tools opt for simplicity. This at least partially stems from the fact that ML algorithms perform better when predicting binary or evenly distributed classification schemes. For example, compared to the coarse classification of the Stanford model, developing a model that predicts whether a text primarily expresses disgust, anger, sadness or any of 10+ other emotions needs way more data to train the model, especially if some of these emotional categories are relatively uncommon. The issue here is that computational or algorithmic limitations steer the questions the researcher asks and the answers the model can deliver. For example, when predicting textual properties of news articles in an automatic fashion, academics tend to incorporate straightforward classification schemes such as hard news versus soft news, simply because it's relatively easy to develop a well-performing model for this classification task. So, next to a theoretical reflection on the concepts relevant to your project, make sure that you communicate honestly about the operationalization of these abstract concepts and about the limitations of this measurable translation (or, more bluntly: about how your NLP tool butchers the theoretical construct).

· What is your unit of analysis?

Doing sentiment analysis on movie reviews is a vastly different kind of ordeal than doing it on a bunch of tweets. Each text type requires a model sensitive to different potential pitfalls and text characteristics. Tweets are short, and the sentiment model should therefore ideally incorporate different part-of-speech categories (verbs, adjectives, etc.) to maximize the sentiment extracted. Moreover, tweets usually express one viewpoint and sentiment in a single sentence, so incorporating at least a couple of basic contextual rules (e.g. a shift in tone) to avoid dramatic faulty codings seems advisable. Movie reviews, on the other hand, are usually longer and are therefore less sensitive to common pitfalls such as negation, shifts in tone or sarcasm. In this case, a crude, straightforward model containing only one part-of-speech type without any decision rules may do just fine. An additional benefit is that simpler models are usually faster, which is especially beneficial when you want to code long texts on the fly.

Next to general text length and characteristics, one should also consider whether the project requires a single sentiment measure or a separate sentiment score for each smaller sub-unit within the text. A sentence-by-sentence analysis, like the one we did on the presidential speeches, is a good example. Such an analysis should be performed with at least a rule-based lexicon or, ideally, a machine learning model trained on individual sentences. Considering the three models under study here, Vader and Stanford CoreNLP are explicitly designed to analyze short text segments. Given its exclusive focus on adjectives, however, it seems reasonable to assume that Pattern will perform better on lengthier texts, which are more forgiving when it comes to excluding particular word categories.
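For illustration, a sentence-by-sentence polarity pass with Vader takes only a few lines (here via NLTK's port of Vader; the speech fragment is invented, and the ±0.05 cut-offs follow the thresholds suggested in Vader's documentation):

```python
# Sentence-level polarity scoring with Vader
# (requires nltk.download('punkt') and nltk.download('vader_lexicon'))
from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
speech = ("My fellow citizens, we face enormous challenges. "
          "But I have never been more hopeful about our future.")

for sentence in sent_tokenize(speech):
    compound = analyzer.polarity_scores(sentence)['compound']
    label = 'positive' if compound > 0.05 else 'negative' if compound < -0.05 else 'neutral'
    print(f'{compound:+.3f}  {label}  {sentence}')
```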

· What is the range of polarity you’re interested in?

We’ve already noticed that the vast majority of disagreements between models is driven by the neutral or ‘trash bin’ category, i.e. the category used for texts with an absence of clear evidence that a particular text is either positive or negative. We also remarked that the neutral category is most likely riddled with noisy data, populated by texts that are clearly not neutral but are symptomatic of the blind spots that the model harbors; and every model has its share of those. It is therefore worthwhile to take a moment and consider whether the neutral category actually matters for the research question at hand. Maybe you can just delete neutral-coded texts from your sample? Of course, this does introduce a new kind of bias into the model: you purposefully deleted a subset of your polarity distribution. However, if you’re only interested in comparing negative versus positive texts, there is an argument to be made that you can do without the neutral category, especially since you’re most likely deleting the most noisy subset of your sentiment scores. This might improve model fit (for inferential purposes) or accuracy (for predictive purposes) considerably.

· Are there any reasons to be concerned about domain, time or context-specificity?

After you’ve established the sentiment construct and the specific measurement of interest, take your time to index the available (and preferably free) off-the-shelf models online. When you do so, it is wise to compare the contextual factors of your corpus under study with those of the data used to develop and evaluate the off-the-shelf model. The models we employed in our analysis here are used interchangeably by researchers to analyze every conceivable text type, from short microblog (social media) texts to full length news articles. This is problematic, because language only becomes meaningful within a particular social sphere, time and place. Let’s take a look at the training and test data used in the three models under study:

Domain-dependency of the different sentiment polarity models under study

It seems that Vader did cast a wider net when it comes to the training data, combining different already existing lexicons (the authors remain somewhat vague on the specific sources, however). They also tested the validity of their model — that is, comparing the output of the model with a golden standard sample of coded texts — on a much wider range of texts than the other models. The other models restrict themselves to one text type, namely reviews of books or movies. This can be problematic, especially for the more complex ML models such as the one used by Stanford CoreNLP. To understand why, I need to explain the importance of domain-, context- and time-dependency.

First, domain-dependency refers to the fact that ML models generally tend to perform poorly outside the domain (e.g. political speeches, product reviews, news texts,…) they were specifically trained on. When it comes to sentiment analysis, research points out that even straightforward dictionary approaches such as Pattern are tailored towards specific domains, suggesting it is wise to develop separate models sensitive to the language typical for the type of text in the analysis. For example, and outside the context of sentiment analysis per se, researchers showed how a classifier trained on predicting policy issues in news texts performs poorly when applied to parliamentary questions. Similarly, politicians may differ in their word usage or use a different sentence structure than — let's say — a bunch of teens on the r/teenagers subreddit. If you trained your sentiment model on Reddit data, it might perform poorly on language from a different domain such as politics, where language use tends to be more nuanced or ambiguous and where sentence structure is expected to be more complicated and multi-layered.

Second, context-dependency holds that constructs gain meaning within a specific sociocultural environment and timeframe, even within the same domain. For example, expressions and word usage might differ significantly between book and video game reviews, simply because they operate within their own social sphere and are constrained by different social expectations. These and other context-dependencies violate the classic assumption that observations in the training data are independent and representative of a single homogeneous population.

Finally, predictive models may perform worse over time because many constructs are not time-invariant. This time-dependency is a direct consequence of the aforementioned context-dependency: as the socio-cultural context changes over time, so does the meaning of culturally shaped constructs. Imagine, for example, that I analyze all inauguration speeches, all the way from George Washington to Trump. It is entirely plausible that I need multiple models to do so, simply because the speeches of a president who held office from 1789 to 1797 are not really comparable in tone, expressions or even words with those of the current president. All these contingencies should prompt the modeler to carefully consider the de facto costs of either training their own model or using an already existing model, which may underperform in several domains, socio-cultural contexts or time periods.

In our case here, both the Pattern and the Stanford CoreNLP models run the risk of introducing bias due to domain-dependency. Although the authors of the Pattern package claim that their model generalizes well across domains, it seems somewhat of a stretch to view book reviews and CD reviews as entirely different domains. But it's the Stanford model we really should worry about here. For one thing, the Pattern package does extend its lexicon database by exploiting the relationships present in Cornetto's synsets. As such, it goes beyond whatever adjectives are present in book reviews. The second and more fundamental issue lies with the ML approach of the Stanford model. Exactly because this model learns from example data, the chances of it learning some peculiar and unexpected relationships that are unique to the text type under study — in this case movie reviews — increase. It is Pattern's simplicity that protects it from being too dependent on its source material: the word "terrible" is a negative word, whether it is featured in book reviews, political speeches or a news article. However, the broader context in which this word tends to appear may differ significantly from domain to domain, and it is exactly these dependencies that constitute the breeding ground for an ML model to come up with intricate decision rules. So, all things considered, it might have been a pretty bad idea to analyze political speeches with a pre-trained ML model that was trained and (!) validated on movie reviews.

Luckily, some recent advances in ML might attenuate the problems of domain-dependency for ML sentiment models. For example, multi-domain learners (MDL) are capable of incorporating the influence of one or more specific 'domains' into a single model. Instead of building a separate model for each domain, the MDL leverages the common predictive power of features shared across the different domains. This search for commonality makes it a more efficient alternative to developing a bunch of separate models. Still, the technique assumes a sufficient number of examples in the training dataset for each domain, which increases the cost of coding a model yet again. Moreover, there are limits to the generalization capabilities of MDL, with a drop in reliability when the domains are too divergent. Another promising avenue when it comes to text classification is the application of word embeddings. Word embeddings take on a relational linguistic perspective by considering the co-occurrence of features in the same text, paragraph or even sentence. In the end, features (e.g. 'Navy') are represented in a vector space where similar features (e.g. 'Military') tend to cluster together. This not only improves the overall accuracy of the prediction, but could also diminish the negative impact of context-dependent word usage: the relational approach makes it easier to generalize to features that were infrequent or even absent in the training data. In a similar vein, several ensemble techniques, such as bagging and boosting, should lower the generalization error as well.
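To make the intuition behind word embeddings concrete, here is a toy sketch with gensim's Word2Vec; the miniature corpus is obviously far too small for meaningful vectors, but it shows the mechanics of co-occurrence-based representations:

```python
# Toy Word2Vec example: words appearing in similar contexts end up close together
from gensim.models import Word2Vec

corpus = [
    ['the', 'navy', 'deployed', 'new', 'ships'],
    ['the', 'military', 'deployed', 'new', 'ships'],
    ['the', 'navy', 'expanded', 'its', 'fleet'],
    ['the', 'military', 'expanded', 'its', 'fleet'],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, seed=42)
print(model.wv.most_similar('navy', topn=3))  # 'military' shares most of navy's contexts
```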

· Is it desirable and feasible to build your own model?

If you believe none of the available models are suitable to analyze your corpus given its unique domain characteristics, maybe you can develop your own model? The machine learning approach is often the least labor-intensive one, since it learns from examples. Back in the day, when I was a teaching assistant, I wanted to perform a modest analysis of tweets sent by Dutch politicians. Being unsatisfied with the results obtained from the Pattern package, which is one of the few sentiment libraries available in Dutch, I planned on training my own sentiment model on a sample of tweets. This would involve me and two other researchers coding a couple of thousand tweets on their sentiment and feeding them to a neural network or a support vector machine model. Unfortunately, this plan never came to fruition, but in this case the ML approach would involve the least amount of expertise from linguists and would accommodate the domain of political social media texts most explicitly. Beware, however, that "a couple of thousand" tweets may be a euphemism, since training and validating such a model may involve a considerable amount of manually coded data. For example, researchers noted how the performance of their news topic classifier only stabilized after they incorporated around 2000 articles in their training and validation set. Keep in mind that news articles are relatively long, well-curated and professionally written texts; so imagine how many tweets we would actually need to code to come up with a well-performing model.
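Once you do have a manually coded sample, the modelling step itself is the easy part. A bare-bones sketch of the support-vector-machine route with scikit-learn could look like this (the ten toy tweets and labels below are invented stand-ins for your own coded data):

```python
# Minimal sketch: TF-IDF features + linear SVM, evaluated with cross-validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    'great debate tonight', 'fantastic turnout at the rally', 'so proud of our volunteers',
    'what an inspiring speech', 'love this new proposal',
    'this policy is a disaster', 'terrible performance in the debate', 'what a dishonest answer',
    'worst press conference ever', 'deeply disappointed in this vote',
]
labels = ['positive'] * 5 + ['negative'] * 5   # human-assigned codings

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
print(cross_val_score(clf, texts, labels, cv=5).mean())
```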

With this in mind, and if you're short on time or HR resources, there are two potential alternatives to developing your own sentiment model. First, you could potentially retrieve some training data from the web. Look for texts that are 'labeled' by definition. For example, texts mined from Facebook hate groups or Twitter timelines that are known to be hateful or toxic, versus groups or timelines that are known to be rather supportive or positive in tone, can be used to train a toxicity classifier for social media texts. Alternatively, researchers could opt for outsourcing the coding work to so-called untrained 'crowdworkers' on platforms such as Amazon Mechanical Turk (MTurk). Research suggests that these untrained coders form a viable and cheap alternative to expert coders, yielding similar reliability and validity measures even when coding latent constructs.

· Do you have the resources to produce a small sample of human-coded texts that could serve as a point of comparison for different off-the-shelf models?

If you end up using an off-the-shelf model after all, do take the effort to produce a small human-coded random sample from your corpus for comparison purposes. This golden standard sample could help you decide which model is most suitable for your project. For example, if you want to analyze tweets, code a couple of hundred texts manually and compare these results with the output of whatever candidate off-the-shelf model you’re considering. The model that most closely aligns with the manual coding should be one of the clear favorites to do the automatic coding in your final analysis. Given the importance of domain- and context-dependency (see earlier), this is really the least you could do.
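As a rough sketch of what such a comparison could look like: once the human codings and each candidate model's output sit in parallel lists, scikit-learn's agreement metrics do the rest (the five labels per list below are purely illustrative):

```python
# Benchmarking off-the-shelf models against a small human-coded 'golden standard' sample
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold         = ['positive', 'neutral', 'negative', 'negative', 'positive']
vader_pred   = ['positive', 'neutral', 'negative', 'neutral',  'positive']
pattern_pred = ['positive', 'positive', 'neutral', 'negative', 'positive']

for name, pred in [('Vader', vader_pred), ('Pattern', pattern_pred)]:
    print(name,
          'accuracy:', round(accuracy_score(gold, pred), 2),
          'kappa:', round(cohen_kappa_score(gold, pred), 2))
```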

These guidelines could serve as a quick checklist whenever you want to incorporate sentiment analysis into your project. Remember that it's not about identifying 'bad' and 'good' off-the-shelf sentiment models. All the models we discussed here are excellent tools, but they must be used within the appropriate research context. Moreover, researchers should be honest about the biases and shortcomings of whichever model they end up using. In this light, don't be blinded by the use of fancy buzzwords and complex-sounding techniques. After all, for all the Stanford model's complex machine learning wizardry, we have some reason to believe that the Vader model — a relatively straightforward rule-based lexicon — provided us with the most common-sense polarity distribution for this specific little research project.
