Combining Sentiment Analysis with Regular Expressions in Vue.js

Another approach to categorizing Bahasa Indonesia text based on certain terms.

hanifa
Nov 7 · 6 min read

There are many ways to do sentiment analysis, and some programming languages already provide shortcuts for it. Sentiment analysis is commonly used to identify the opinion expressed in a sentence and, in more advanced uses, to understand how a subject judges an object based on selected sentences. Recently, I tried a JavaScript library for this purpose so that the system could be easily paired with a Node.js-based web scraper that had already been built.

Here’s the thing. Imagine I have a mission: to find out what kinds of topics are discussed in a blog, a news portal, or, in my case, a student blog portal. Besides that, I also have to find out about the reputation of certain brands that might be discussed there. One more thing to know: I have to analyse those articles in Bahasa Indonesia instead of English.

When you choose JavaScript as your core, you have fewer options for Bahasa Indonesia-based sentiment analysis. Only a few developers have included this language in their collections. Luckily, after a deep dive on Google, I found this! A multilanguage library for sentiment analysis named Multilang-Sentiment. Multilang-Sentiment is based on a lexicon created by Finn Årup Nielsen, in which each English word is “rated for valence with an integer between minus five (negative) and plus five (positive)”. We will not discuss how this lexicon works; instead, we will break down a project that implements it in a system. If you are interested in how this lexicon works, I recommend reading some insightful articles about it.

Multilang-Sentiment supports a wide variety of languages. We will use Bahasa Indonesia here, so the lexicon used is a list of Indonesian words. Let’s dive into the code!

Source Object Breakdown

First things first: we need to be clear about what we will analyse. The objects of this project are articles. We will break each article down into 3 parts: the title, the release date, and the content. Check out this picture!

Breaking down an article into 3 parts

Multilang-Sentiment contains 2208 Indonesian words, each assigned to one of 11 rating categories from -5 to 5. Here is a snippet of the list:

list of Indonesian words on Multilang-Sentiment

Content Analysis

For sentiment analysis, we will use the content of the articles as the object. The content is chosen because, compared to the title, it is where the author expresses more opinions on an issue. The title will be used later as the object of the regular expression method, since we want to categorize the topics.

The lexicon is used to find the sentiment score of each word in the content; we store the scores in our database to understand the polarity of the articles. Again, I recommend getting familiar with this article before you move on to the next paragraph.

Multilang-Sentiment computes the total score of each input by adding up all the positive and negative scores. Based on what we found in the project, articles that express opinions on social issues contain more positive or negative words than regular informational articles.

The code snippet above illustrates how the Multilang-Sentiment method is used. The method takes the content of each article and returns an object containing, among other things, the score and the tokens. A problem arises with the total score: it can fall well outside the -5 to 5 range of a single word, so we need an extra rule to categorize the content. To determine the polarity of an article, we use the rule described in the snippet above. We classify an article as positive when the total score is more than 5, and so on. This ensures that a positive article contains more than 5 of the lowest-rated positive words, or more than 1 of the highest-rated positive words (when no negative words offset them). The ‘label’ variable simplifies how we categorize each total score.
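To make the scoring and labelling rule concrete, here is a minimal, self-contained sketch. It is not the library's code: `sampleLexicon`, `scoreText`, and `labelFromScore` are hypothetical names, the four valences are made up for illustration, and the negative threshold of -5 is only an assumption that mirrors the positive rule.

```javascript
// Tiny hypothetical lexicon for illustration only; the real Indonesian list
// in Multilang-Sentiment is much larger and its valences may differ.
const sampleLexicon = {
  bagus: 3,    // "good"
  berhasil: 2, // "succeed"
  buruk: -3,   // "bad"
  gagal: -2,   // "fail"
};

// Split text into lowercase word tokens (plain a-z letters only).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Add up the valence of every token that appears in the lexicon.
function scoreText(text, lexicon) {
  const tokens = tokenize(text);
  const positive = [];
  const negative = [];
  let score = 0;
  for (const token of tokens) {
    // hasOwnProperty avoids accidental hits on inherited keys like "constructor".
    if (!Object.prototype.hasOwnProperty.call(lexicon, token)) continue;
    const valence = lexicon[token];
    score += valence;
    if (valence > 0) positive.push(token);
    if (valence < 0) negative.push(token);
  }
  return { score, tokens, positive, negative };
}

// Positive when the total is more than 5 (the rule from the text);
// the negative and neutral branches are assumed to mirror it.
function labelFromScore(score) {
  if (score > 5) return "positive";
  if (score < -5) return "negative";
  return "neutral";
}

const result = scoreText("Acara ini bagus dan berhasil, tidak buruk", sampleLexicon);
const label = labelFromScore(result.score);
```

With this sample sentence the two positive words and the one negative word partly cancel out, so the total stays inside the neutral band even though the text contains opinion words.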

Regular Expression

Meanwhile, this function outputs the result of each analysis and then continues with the rest: the word-extraction process. This process gives us a hint about what the article discusses. In programming languages, regular expressions are used to match parts of strings. Here, we use regular expressions to match certain terms against the title. Remember that we are working in Bahasa Indonesia and, of the three parts, use only the title.

If you are new to regular expressions and interested in diving deeper into this topic, I recommend visiting the articles mentioned below. The regular expression package for this project is downloaded from here, and we will call it ‘Regex’ from now on.

construct some terms to match the title

Regex uses the test( ) method, which takes a string and tries to match the terms against it. The ‘|’ sign above separates the terms whose presence in the title we are looking for. This sign means OR, so based on the picture above, Regex tries to track down certain words in the title: if the title contains the word ‘prestasi’, OR ‘berhasil’, OR ‘penghargaan’, and so on, then Regex assigns the article to the Prestasi category. Besides OR, Regex also works well with AND conditions.
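As a rough illustration of the OR matching described here, the sketch below uses JavaScript's built-in RegExp rather than the package the article calls ‘Regex’, whose exact API is not shown in this post. The titles and the AND pattern are hypothetical examples.

```javascript
// Terms taken from the Prestasi example in the text. The "i" flag makes the
// match case-insensitive, so capitalised titles still match.
const prestasiPattern = /prestasi|berhasil|penghargaan/i;

// Hypothetical titles, for illustration only.
const hasPrestasiTerm = prestasiPattern.test("Mahasiswa Raih Penghargaan Nasional");
const hasNoTerm = prestasiPattern.test("Jadwal Kuliah Semester Baru");

// One way to express AND: each lookahead must find its term somewhere in the title.
const bothTermsPattern = /^(?=.*mahasiswa)(?=.*penghargaan)/i;
const hasBothTerms = bothTermsPattern.test("Mahasiswa Raih Penghargaan Nasional");
const hasOnlyOneTerm = bothTermsPattern.test("Penghargaan untuk Dosen Teladan");
```

Note that these patterns match substrings, so ‘berhasil’ would also match inside ‘keberhasilan’. Wrapping each term in `\b` word boundaries is one option if only whole words should count.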

As explained before, this process requires a set of terms to compare against every word in the title. When one or more words in the title match the terms, the function returns the condition ‘isExisting = true’. This condition then changes the default category to a specific category based on those words. Below is an example for the ‘Prestasi’ category:

return a condition
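The category switch described above might look roughly like the sketch below. Only the ‘isExisting’ flag and the Prestasi terms come from the article; `categoryPatterns`, `categorizeTitle`, and the default category name ‘Umum’ are hypothetical, and built-in RegExp is assumed in place of the ‘Regex’ package.

```javascript
// Map each category name to the pattern that detects it.
// More categories can be added as further entries.
const categoryPatterns = {
  Prestasi: /prestasi|berhasil|penghargaan/i,
};

// Start from a default category and replace it with the first category
// whose pattern matches the title. "Umum" (general) is an assumed default.
function categorizeTitle(title, patterns, defaultCategory = "Umum") {
  let category = defaultCategory;
  for (const [name, pattern] of Object.entries(patterns)) {
    const isExisting = pattern.test(title);
    if (isExisting) {
      category = name;
      break;
    }
  }
  return category;
}

// Hypothetical titles, for illustration only.
const matchedCategory = categorizeTitle("Tim Robotik Berhasil Juara", categoryPatterns);
const fallbackCategory = categorizeTitle("Jadwal Kuliah Semester Baru", categoryPatterns);
```

Stopping at the first match keeps one category per article. If an article should be able to carry several categories, collecting every matching name into an array would be the alternative.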

Aaand voila~

We can categorize some more! Just add or edit terms and words to match whatever you want. Now, if we want to know the sentiment towards brand XYZ, we can analyse the sentiment first and then apply this method. It is somewhat complicated at the beginning, but we gain a clear advantage: we can narrow down the categorization, which is especially useful when dealing with big data.
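For the brand case, one possible way to combine the two steps is sketched below. It assumes each stored article already carries a sentiment label from the earlier analysis; the records, the `countLabelsForBrand` helper, and the brand pattern are all hypothetical.

```javascript
// Hypothetical stored records. Only the title and the previously computed
// sentiment label are shown; other fields are omitted for brevity.
const articles = [
  { title: "Pengalaman Magang di XYZ", label: "positive" },
  { title: "Keluhan Layanan XYZ", label: "negative" },
  { title: "Jadwal Kuliah Semester Baru", label: "neutral" },
];

// Count sentiment labels among the articles whose title mentions the brand.
function countLabelsForBrand(records, brandPattern) {
  const counts = { positive: 0, negative: 0, neutral: 0 };
  for (const record of records) {
    if (brandPattern.test(record.title)) {
      counts[record.label] += 1;
    }
  }
  return counts;
}

// \b keeps the match to the whole word "XYZ"; "i" ignores case.
const xyzCounts = countLabelsForBrand(articles, /\bxyz\b/i);
```

Here the regular expression narrows the data set first, and the sentiment labels are only tallied for the articles that remain.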

That’s it, folks. Thank you for your time; I hope you found this helpful and interesting. I would love some feedback!

P.S.: If you’re interested in how I collected the articles (that is, how to do web scraping in Node.js), I’ll prepare another article! See you there.
