NLP Intro: Using Lexicons in Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a key step in unlocking the meaning of a text. Having spent some years in the media analytics industry, I can confirm it is one of the most in-demand, and most contentious, aspects of text mining (the NLP-based transformation of unstructured text into a form that is easier to analyse).
Why contentious? Because it is tough to get right. Human language is complex as it is: the same word can mean many different things depending on how we use it in a sentence, who the speaker is, and the social, cultural, and historical context of the conversation. You have a lifetime of experience helping you understand the jokes, subtleties, and overall sentiment in a discussion with your peers. A computer can only rely on the set of rules and the datasets it has been trained on. That's why sentiment analysis can do quite well with conventional information sharing but fall off a cliff entirely when it comes to sarcasm, jargon, and obscure references.
So what can we do if we still want to analyse sentiment?
Dictionary-based methods & Lexicons
Lexicons are a common component of NLP work: linguistic datasets with grammatical and semantic information about the words in a given language. In NLP, lexicons are:
- Language-specific (i.e., if you need to analyse Portuguese texts, you would need a different lexicon from the one you use for English content).
- Focused on n-grams, most commonly unigrams (single words). Each word in the lexicon has a specific score indicating whether it is typically positive or negative.
- Subject to different licenses: something to be mindful of depending on how you intend to use them in commercial/academic/personal research, etc.
- Created and validated within a specific context: lexicons and sentiment ratings do not materialise out of the ether upon request (unfortunately, it would have been handy if they did!). The quantitative nature of their sentiment ratings creates the illusion of objective measurement but it is good to keep in mind that lexicons often use crowdsourced sentiment ratings that are validated by online communities. So you should be aware there is a degree of subjectivity to them, especially as they reflect current sociocultural perceptions and might not be well suited for analysis of historical texts created in a different linguistic environment.
- Note that in dictionary-based sentiment analysis, lexicons are used to get sentiment judgements for individual words, which are then tallied up to get a sense of the overall sentiment of a piece of text. This has its limitations: at the single-word level, the contextual information surrounding a word (including negative qualifiers like no or not) is lost, as the small sketch after this list illustrates.
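To make the word-level tallying concrete, here is a minimal sketch assuming the tidytext and dplyr packages (the Bing lexicon ships with tidytext and is returned by get_sentiments() as a plain table); the example sentence is made up purely for illustration.
library(dplyr)
library(tidytext)

# A lexicon is just a table: one row per word, labelled positive or negative
get_sentiments("bing")

# Dictionary-based scoring is essentially a join of your tokens against that table
tibble(text = "what a gloomy, dreary and strangely wonderful evening") %>%
  unnest_tokens(word, text) %>%           # one word per row
  inner_join(get_sentiments("bing")) %>%  # keep only words found in the lexicon
  count(sentiment)                        # tally positive vs negative matches
# Any relationship between words (e.g. a "not" negating its neighbour) is lost
# in this tally, which is exactly the limitation described above.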
Other caveats & things to consider
- Different lexicons approach sentiment categorization in different ways. Common approaches include a binary choice (a word is listed as positive or negative) or a numeric score running from a negative value to a positive one (e.g. the AFINN lexicon goes from -5 to 5); the snippet after this list peeks at both approaches. Lexicons also vary in how many positive and negative words they contain and in how well they capture the vocabulary used by certain authors or within certain time periods. As a result, the sentiment you measure will differ from one lexicon to another, even though the main trends should remain broadly the same.
- Polysemous words (with multiple meanings) can be tricky to capture with standard lexicon ratings.
- Lexicons also feature commonly used words but might miss archaisms (old expressions) or neologisms (brand new words).
- The size of the text you analyse may impact results: individual sentences and paragraphs might lean toward negative or positive sentiment, but very long texts are likely to contain both, so in the final tally the negative and positive sentiments cancel each other out. A way around this limitation is to analyse chunks of text at a time (for example, 80–100 lines long) and plot the results to track how sentiment unfolds across the text.
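As a quick illustration of the scoring differences mentioned above, the snippet below peeks at two popular lexicons through tidytext; note that the AFINN lexicon is fetched via the textdata package the first time you request it.
library(dplyr)
library(tidytext)
# library(textdata)  # needed the first time you download the AFINN lexicon

# Bing: a binary positive/negative label per word
get_sentiments("bing") %>% count(sentiment)

# AFINN: an integer score per word, from -5 (very negative) to 5 (very positive)
get_sentiments("afinn") %>% count(value)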
Quick example: sentiment analysis of works by Edgar Allan Poe
In the previous article in this NLP series, we analysed word frequency in the work of some of my favourite authors. So let’s continue with a sentiment analysis example focusing on a small sample of works by one of them: Edgar Allan Poe.
Using the tidytext R package, we can turn the text into structured tidy data, break it down into chunks of about 80 lines, and analyse the sentiment of each chunk.
# I have preselected and downloaded 3 works by Poe (poesample) from gutenbergr;
# it is a data frame with a gutenberg_id and a text column (one row per line)
library(dplyr)
library(stringr)
library(tidytext)

# Getting the content into a nice tidy format: one word per row, keeping track
# of the original line number and chapter for each work
tidy_books <- poesample %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
# Using an inner join with the Bing lexicon to score each word, then tallying
# positive vs negative words per 80-line chunk of each work
library(tidyr)

poe_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(gutenberg_id, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
# Visualising the sentiment trajectory of each work, one panel per gutenberg_id
library(ggplot2)

ggplot(poe_sentiment, aes(index, sentiment, fill = gutenberg_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 2, scales = "free_x")
Note that, as I mentioned, the size of the text you are analysing will impact the results. I chose two short stories and a poem (The Raven) to illustrate that. The bars vary in number and size across the plots precisely because of this: The Fall of the House of Usher has many more lines than The Raven.
To nobody’s surprise, the sentiment is predominantly negative: Poe is not famous for a sunny disposition. With the exception of a few positive paragraphs on Roderick Usher’s looks, the dark mood prevails. Note the stark contrast and variability of the sentiment across the text when we compare it to longer pieces of fiction in other genres, like Jane Austen’s novels, also broken down into chunks of 80 lines (below).
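If you want to reproduce that comparison yourself, here is a sketch of one way to do it, assuming the janeaustenr package and reusing the same 80-line chunking and Bing scoring as above.
# Same pipeline as for Poe, applied to Jane Austen's novels (janeaustenr package)
library(janeaustenr)

austen_sentiment <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")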
If we are curious about which specific words stood out in our example (the tiny sample of 3 popular works by Poe), we can list the top positive and negative words and/or build a word cloud with them.
# Counting how often each word contributes to positive or negative sentiment
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

# Bar chart of the top 10 contributors to each sentiment
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
# Word cloud of the most frequent words (after removing common stop words)
library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
We can even configure the word cloud to separately highlight the most common positive and negative terms.
library(reshape2)

# Reshaping the counts into a word-by-sentiment matrix, which is the input
# format comparison.cloud() expects
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("coral", "darkolivegreen4"),
                   max.words = 100)