Multilanguage sentiment analysis, in Node.js
Hello my friend,
if you are here, it is probably because you are interested in sentiment analysis and Node.js.
First things first, let’s introduce what an AFINN document is, for those unfamiliar with the term.
AFINN is a list of English words rated for valence with an integer
between minus five (negative) and plus five (positive). The words have
been manually labeled by Finn Årup Nielsen in 2009–2011
→ details here
Even in the era of DL/ML (Deep Learning/Machine Learning), when a dictionary-based solution might not be that exciting, it can still be a good compromise for those who are just approaching this field or running a small project.
If you are already comfortable with concepts such as “tokenisation”, “stemming” or “lemmatisation” you can just skip this part and jump straight down to the bottom.
What’s the basic approach to approximating the sentiment of a phrase or, even better, a whole corpus of text?
Let’s start by simplifying the problem: we divide everything into small parts, which in our case means exactly one thing: single words.
The process of dividing a string into its single components is what we call “tokenisation”.
How to split a text into words is totally up to you; you can split on space characters, punctuation, numbers or special characters but, above all, it’s highly recommended to use regular expressions for this, whatever your expression ends up being.
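As a minimal sketch, a Unicode-aware, regex-based tokenizer could look like this (the exact expression is up to you, as noted above):

```javascript
// A minimal regex-based tokenizer: lower-cases the input and keeps only
// runs of letters, dropping punctuation, numbers and symbols.
// (The \p{L} Unicode property class requires the `u` flag, Node.js >= 10.)
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || [];
}

console.log(tokenize('Cats are fantastic, dogs too!'));
// → [ 'cats', 'are', 'fantastic', 'dogs', 'too' ]
```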
Once the tokenizer has done its job, we end up with a list of words; not enough to accomplish our goal, though…
Let’s get it straight: computers are still dumb as a rock. No matter how many times you have heard about “artificial intelligence solutions” while chatting with colleagues or businessmen… there are developers, out there, who still have to put a lot of effort into instilling useful knowledge into machines.
Our simple example is the perfect showcase for that…
imagine having software which has the following dictionary of words:
[ dog, crocodile, cat, are, is, fantastic, scary, awesome ]
then you decide to match the result of the previous tokenisation against it…
Even though the dictionary contains the word “cat”, its plural form doesn’t get matched. What a sad world, you might say… well, actually, you’ve got the point.
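You can see the problem with a naive exact match against that very dictionary (the token list here is illustrative):

```javascript
// The dictionary from above; a naive exact match misses "cats"
// even though "cat" is listed.
const dictionary = ['dog', 'crocodile', 'cat', 'are', 'is', 'fantastic', 'scary', 'awesome'];
const tokens = ['cats', 'are', 'awesome'];

const matched = tokens.filter((t) => dictionary.includes(t));
console.log(matched); // → [ 'are', 'awesome' ]  ("cats" is lost)
```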
To ease the pain we need to apply stemming to our list of words; in other terms, we need to reduce every single word to its base form to better match our dictionary entries.
Note: stemming tries to return the base form without taking the context of the phrase into consideration; context is, instead, covered by lemmatisation.
→ more info here
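To make the idea concrete, here is a deliberately naive “stemmer” that only strips a plural “s”; a real project should use a proper algorithm such as Porter’s (the `natural` npm package, for instance, ships one), but the toy version shows the point:

```javascript
// Deliberately naive English "stemmer": strips a trailing plural "s".
// Real stemmers (e.g. Porter) handle far more suffix rules than this.
function naiveStem(word) {
  return word.endsWith('s') && word.length > 3 ? word.slice(0, -1) : word;
}

console.log(['cats', 'dogs', 'is'].map(naiveStem));
// → [ 'cat', 'dog', 'is' ]
```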
Great… let’s imagine we finally have a perfect match between our words and the ones in the dictionary. How can our software compute the sentiment of this list of words at this point?
AFINN ❤ Finn Årup Nielsen
In 2011, a data scientist named Finn Årup Nielsen decided to create a list of English words “rated for valence with an integer between minus five (negative) and plus five (positive)”.
The usage of this list should be pretty clear: we match our tokens against it and store the integer value for each one of them; in our basic example, the polarity of our phrase is given by the weighted average of the scores.
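A sketch of the scoring step, using a plain average over the matched tokens as the simplest possible weighting (the scores below are illustrative values in the −5…+5 range, not the real AFINN entries):

```javascript
// Illustrative AFINN-style scores; real AFINN lists thousands of words.
const afinn = { awesome: 4, fantastic: 4, scary: -2 };

// Polarity = average score of the tokens found in the list;
// unmatched tokens (e.g. "cat", "is") simply don't contribute.
function polarity(tokens) {
  const scores = tokens.map((t) => afinn[t]).filter((s) => s !== undefined);
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

console.log(polarity(['cat', 'is', 'awesome']));      // → 4
console.log(polarity(['crocodile', 'is', 'scary']));  // → -2
```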
The Multilanguage Node.js package
The module is derived/forked from the great “sentiment” library (which, unfortunately, still doesn’t implement any multi-language support, even though they are working on it).
Multilang-sentiment differs from the original through some specific additions needed to support different languages…
Plurals and typos
There are languages, like Italian, that follow particular rules for plurals; for example, in some cases it’s not enough to add an “s” at the end of a word, you actually have to replace characters, which, on big texts, might have a huge impact from a computational point of view.
The perfect solution would have been to have a stemmer (or lemmatizer) for each language… which, unfortunately, in our case would require too big an effort; a clever way to find similar words was needed instead… so the Fuse.js package (a fuzzy-search library) has been integrated, plus some extra tweaks here and there to mitigate false-positive results.
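To get a feel for what fuzzy matching buys you here, this toy sketch uses plain Levenshtein edit distance to match an Italian plural against its dictionary entry (Fuse.js is far more sophisticated than this; the example only illustrates the idea):

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                  // deletion
        d[i][j - 1] + 1,                                  // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Return the first dictionary entry within `maxDistance` edits of `word`.
function fuzzyMatch(word, dictionary, maxDistance = 1) {
  return dictionary.find((entry) => editDistance(word, entry) <= maxDistance);
}

// Italian plural "gatti" (cats) is one edit away from the entry "gatto".
console.log(fuzzyMatch('gatti', ['cane', 'gatto', 'coccodrillo'])); // → 'gatto'
```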
Negators and their variants
“Negators” are those words (or constructs) that flip the integer value of a word… for example: “beautiful” = 4; “not beautiful” = -4.
In multilang-sentiment there are negators for all the supported languages; in addition, some possible variants are computed in order to gain accuracy.
Another concept that has been introduced is a distance threshold for negators… quickly described in the following image.
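The two ideas combined, negation plus a distance threshold, can be sketched like this (the function names and the window size are illustrative, not the library’s actual internals):

```javascript
// Illustrative negator set; multilang-sentiment ships one per language.
const NEGATORS = new Set(['not', 'never', 'no']);

// Flip a word's score when a negator appears within `window` tokens
// before it: that window is the "distance threshold".
function applyNegators(tokens, scores, window = 2) {
  return scores.map((score, i) => {
    if (score === 0) return 0;
    for (let d = 1; d <= window; d++) {
      if (i - d >= 0 && NEGATORS.has(tokens[i - d])) return -score;
    }
    return score;
  });
}

console.log(applyNegators(['not', 'beautiful'], [0, 4]));
// → [ 0, -4 ]
console.log(applyNegators(['beautiful', 'but', 'not', 'that', 'cheap'], [4, 0, 0, 0, 3]));
// → [ 4, 0, 0, 0, -3 ]  ("not" is within 2 tokens of "cheap")
```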
We were talking about rules for splitting phrases into single words… here too, there are languages with specific rules (or no rules at all) for dividing words; think of Chinese or Arabic… we would need a tokenizer for every specific case, which is a big task.
That’s why it is possible, instead, to pass single tokens to the library, produced by any 3rd-party tokenizer; the library will then skip the tokenization step and jump directly to matching against the AFINN list.
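The pattern can be sketched like this, with hypothetical function names (for the library’s real call signature and option names, check the repository’s README):

```javascript
// Illustrative tokenizer; a Chinese or Arabic text would instead be
// segmented by a dedicated 3rd-party tokenizer.
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || [];
}

// Sketch of the "bring your own tokens" pattern: accept either a raw
// string or an array of pre-tokenized words, skipping tokenization
// in the latter case and going straight to the AFINN matching.
function analyze(input, afinn) {
  const tokens = Array.isArray(input) ? input : tokenize(input);
  return tokens.reduce((sum, t) => sum + (afinn[t] || 0), 0);
}

const afinn = { awesome: 4, scary: -2 };
console.log(analyze('Cats are awesome', afinn));         // → 4
console.log(analyze(['cats', 'are', 'awesome'], afinn)); // → 4 (pre-tokenized)
```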
That’s it, we are done. Even if very basic, I hope you enjoyed it and, if you are still curious, you can:
a) play with the library here → https://repl.it/@MarcelloBarile/multilang-sentiment
b) check out the repo here → https://github.com/marcellobarile/multilang-sentiment
c) write your own solution … because there is nothing better than getting hands dirty!
P.S. If you find bugs or ways to improve the solution, don’t hesitate to contribute! Take into consideration that I’m going to port the package to TypeScript very soon.
P.P.S. If you are an academic and would like to write a story about state-of-the-art technologies and solutions you are more than welcome… just drop me a message if so.
Before leaving, don’t forget to check out some resources about our beloved Finn Årup