Sentiment Analysis in Arabic

Comparing libraries on GitHub

Nick Doiron
Voice Tech Podcast
6 min read · May 20, 2019


After my last post, where machine learning found political trolls on Twitter, I had a conversation about similar problems in the Arabic Twitter-verse. After weighing the challenges, I decided to start with a low-stakes problem: reactions to ‘@NetflixMENA’. Netflix often uses the account to Tweet in Arabic about new releases and originals, including Jinn (their first Arabic-language original series, premiering next month).

A photo from my 2016 trip to Muscat, Oman

We are also currently in Ramadan, which is apparently a big month for binge-watching: “Netflix’s research from last year revealed that, across the MENA region, people feel they gain two to three extra hours a day of free time during Ramadan… peak hours for streaming are from 2am to 5am”

Arabic NLP Libraries on GitHub

As of this writing, the Google AutoML tools which I used earlier can analyze text in these languages:

English, Spanish, Japanese, Chinese (simplified and traditional), French, German, Italian, Korean, Portuguese, and Russian

For this project, I should search for appropriate tools first, and then see how applicable and reusable their code is for Twitter analysis. I want to find one which comes with a pre-trained model or datasets, so I can get results without labeling source data myself.

Here are three projects which I found, with the help of GitHub’s arabic-nlp tag/topic and Walid Ziouche’s Awesome-Arabic repo:

Honorable mention: “Building Large Arabic Multi-domain Resources for Sentiment Analysis” has large datasets of reviews, plus combinations of classifiers, vectorizers, cross-validation parameters, etc. The repo has more stars than the other projects here. Unfortunately it was written in 2015 against an old version of scikit-learn, so updating it was too much to pursue for this post.

Updates?

This article is from 2019.

As of 2020 there are many more options: see snakers4/emoji-sentiment-dataset for a dataset of Arabic and other languages’ Tweets for sentiment analysis; AWS Comprehend for a dedicated service; and AraBERT and bert-base-arabic for the latest in neural networks on HuggingFace.

I will keep a list of the latest recommended links at https://github.com/mapmeld/use-this-now#arabic-nlp

The Analysis

I downloaded NetflixMENA’s Arabic-language Tweets and 11k replies posted between April 1st and May 18th, 2019 (when I collected the data). For each library, I looked for the most positive and most negative Twitter threads from that period.

ar-embeddings

ar-embeddings includes training data, tokenizing from NLTK, and multiple algorithms from scikit-learn (random forest, gradient descent, support vectors, logistic regression, Bayesian). In real life you could check all of the options and find the best predictor, but for the sake of this post I will sum the votes of all of them.

Make sure to follow the readme to download a vectors file based on an Arabic news corpus (a large dataset, but notably different from social media content). Also, the code trains all classifiers and graphs their accuracy on initialization, so I forked it to defer that step, letting me train each classifier individually and get predictions for my own data.
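The summing approach can be sketched like this: fit several scikit-learn classifiers on the same features, let each cast a 0/1 vote, and add the votes. The feature vectors and labels below are toy stand-ins, not the repo’s actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Toy stand-ins for word-vector features and 0/1 sentiment labels
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([1, 1, 0, 0])
X_new = np.array([[0.85, 0.15]])

classifiers = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    SGDClassifier(random_state=0),
    LinearSVC(),
    LogisticRegression(),
    GaussianNB(),
]

votes = 0
for clf in classifiers:
    clf.fit(X_train, y_train)
    votes += clf.predict(X_new)[0]  # each model votes 0 (negative) or 1 (positive)

# Majority of the five classifiers decides the label
label = "positive" if votes > len(classifiers) / 2 else "negative"
print(label)
```

Summing raw votes skips per-model weighting; with a labeled validation set, you could instead weight each classifier by its accuracy.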

Without averaging, the sum of classifiers found the largest number of positive responses to a Tweet listing all content coming in April, with the second being a meme poll anticipating the next season of Money Heist / La casa de papel.

The most negative responses were to a Tweet about how someone who would not spoil a series for you is a true friend (though the classifiers may simply be picking up on the negative context of spoiled shows).

After averaging by the total number of replies to a Tweet (among those with significant numbers of responses), the most positive Tweet announced daily new episodes. The most negative is about discussing an episode with a friend, but being a little embarrassed to talk about the content.
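The averaging step is just a normalization so that popular Tweets don’t win on reply volume alone; a minimal sketch, with invented reply scores (1 for a positive reply, 0 for a negative one):

```python
# Invented data: per-Tweet lists of 0/1 sentiment scores for each reply
reply_scores = {"tweet_a": [1, 1, 0, 1], "tweet_b": [1, 0]}

# Normalize by reply count so large threads don't dominate
averages = {tweet: sum(s) / len(s) for tweet, s in reply_scores.items()}
print(max(averages, key=averages.get))  # → tweet_a (0.75 beats 0.5)
```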

I didn’t check each algorithm, but the results of the first (random forest) and the combined results were noticeably similar.


arabic-sentiment-analysis

arabic-sentiment-analysis was created for a Kaggle project. It comes with Twitter data for training models, and multiple algorithms from scikit-learn and/or NLTK. In the scikit-learn-only option, it uses a TF-IDF vectorizer. Testing the code as-is consumed too much disk space for my machine, so I did this work in the cloud.

I had the most success with the scikit-learn-only option. I made a small change to return the vectorizing-and-classifying pipeline for use in my own script. As with the previous example, I will not choose one algorithm, but combine all of the positive and negative scoring.
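A returned pipeline of that shape might look like the sketch below; the training Tweets are invented examples, and the logistic-regression classifier is a stand-in for whichever algorithm the repo selects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented miniature training set: two positive and two negative Tweets
train_tweets = ["أحب هذا المسلسل", "مسلسل رائع جدا", "أسوأ مسلسل", "لا أحب هذا"]
train_labels = ["pos", "pos", "neg", "neg"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # words -> TF-IDF weights
    ("clf", LogisticRegression()),  # weights -> pos/neg label
])
pipeline.fit(train_tweets, train_labels)

# Returning the fitted pipeline lets a separate script score its own Tweets
print(pipeline.predict(["رائع جدا"])[0])
```

Because the vectorizer and classifier travel together in one object, a caller never has to re-apply the TF-IDF transform by hand.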

Summing up positive Tweets, the highest-scoring is the trailer for Jinn, and the best average with a significant reply count was a retweeted announcement of Our Planet.

The Tweet about a true friend not spoiling a series scored both highest in negatives and second-highest in positives here. How can that be? I think the classifiers are getting tripped up by the confusing, negative context of the Tweet.

Unfortunately the most negative average was an announcement of another series, Al-Rawabi School for Girls.

One point in this project’s favor: my Tweet-scraping code has a major flaw in that it doesn’t capture emoji; if I fixed that, I would likely use this repo to collect emoji-rich content.

Youtube-Comments-Analyzer

This uses sample positive and negative Tweets to generate a classifier with NLTK’s NaiveBayesClassifier. It will use NLTK to tokenize your string and create positive and negative scores for each tokenized word. There’s a CLI where you can input any text and get back its sentiment scores.
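That word-presence approach can be sketched in a few lines with NLTK; the sample Tweets here are invented stand-ins for the repo’s training data.

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    # Each tokenized word becomes a boolean presence feature
    return {word: True for word in text.split()}

# Invented miniature training set of labeled Tweets
train = [
    (word_features("رائع جدا"), "positive"),
    (word_features("أحببته كثيرا"), "positive"),
    (word_features("سيء جدا"), "negative"),
    (word_features("لم يعجبني"), "negative"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(word_features("رائع جدا")))  # → positive
```

Words the classifier never saw in training are simply ignored at prediction time, which is one reason misspellings and slang hurt these models so much.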

My first concern after running it was that only two Tweets received a more negative score than positive, and neither contained a complete word.

Summing up the positive and negative scores, the biggest positive audience was again the trailer for Jinn, and the most negative was for this question (“who is your favorite investigator from a crime series?”).

By averages, the most positive responses came in for a quote-Tweet of a user talking about family viewing time, and for a video interview with an actor and two actresses. The most negative were announcements for You vs. Wild and Street Food, but for unclear reasons (for example, this Tweet teasing not to watch Street Food on an empty stomach during Ramadan).

Lessons for Future Arabic NLP

Continuing with this dataset

Check out my GitHub repo for notes. I’d recommend the YouTube repo for beginner projects, and arabic-sentiment-analysis for more advanced analysis (including choosing the best model, n-gram ranges, etc.).
Here are some questions that you could research:

  • Which shows do people often request for Netflix to add?
  • In announcements of monthly releases, which are most anticipated?
  • What was the sentiment of Tweets mentioning specific shows (Jinn, Our Planet, Lucifer, and The Rain)?
  • How does including emoji improve the quality of sentiment analysis? Can some type of image-hash be used to recognize meme replies?
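On the emoji question above, a first step is simply pulling emoji out of a reply so they can be scored separately from the words. This sketch uses a regex over common emoji codepoint blocks; the ranges cover most but not all emoji.

```python
import re

# Common emoji blocks: Misc Symbols/Pictographs and related ranges
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def extract_emoji(text):
    return EMOJI.findall(text)

print(extract_emoji("مسلسل رائع 😂🔥"))  # → ['😂', '🔥']
```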

The open source community

I was happy to find projects which were open source and batteries-included, including the sample dataset, requirements.txt, etc.

It did seem strange to me that most had training and test/analysis phases, but no way to import the code as a module or to evaluate my own text samples against the trained model. There were also no notebooks in these repos. Notebooks have their pluses and minuses, but data scientists use them a lot, and AWS SageMaker and Azure’s machine learning platforms have come to depend on them.
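For illustration, here is the kind of module interface I was hoping for: train once, then expose a predict() that another script can import. Every name and the tiny training set here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

_pipeline = None

def train(texts, labels):
    """Fit the shared vectorize-and-classify pipeline once."""
    global _pipeline
    _pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
    _pipeline.fit(texts, labels)

def predict(text):
    """Score one new text sample with the trained model."""
    return _pipeline.predict([text])[0]

# Hypothetical four-Tweet training set, just to exercise the interface
train(["رائع", "جميل", "سيء", "ممل"], ["pos", "pos", "neg", "neg"])
print(predict("رائع"))
```

With this shape, another script could simply `import` the module, call train() on the repo’s dataset, and then score arbitrary Tweets.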

Helping computers understand more

Comments on YouTube and Twitter are short, full of emojis and the occasional slang or misspelled word. There is a long way to go before translation tools truly understand that. For example, it’s common to laugh “hahahahaha” with a chain of Arabic H’s:

Microsoft translation doesn’t see the humor :(

Could our artificial intelligence try to be a little more human, enough to recognize laughter?
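A hypothetical first pass at the laughter problem: treat a run of three or more Arabic heh characters (ه) as a laugh before any other sentiment scoring. This is my own rule of thumb, not something from the repos above.

```python
import re

LAUGH = re.compile(r"ه{3,}")  # "ههههه" is the Arabic "hahahaha"

def is_laughing(text):
    return bool(LAUGH.search(text))

print(is_laughing("هههههههه"))     # → True: a laughing reply
print(is_laughing("هذا المسلسل"))  # → False: ordinary text with a lone ه
```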

Written Arabic also comes with challenges which are unfamiliar to speakers of European languages: highly different classical and spoken language, distinct regional dialects (you can find some classifiers on GitHub), tashkeel vowel rules, etc. Machine learning systems will struggle with these as well, until we see more programmers from all parts of the world, and include them in discussions.
