Diving into Arabic NLP: A Beginner’s Guide to Dialect Identification

Keroles Elkess
5 min read · Mar 11, 2022


The intricate landscape of Arabic natural language processing (NLP) presents a dynamic challenge owing to the rich variety in word usage and letter diacritization (تشكيل الحروف). Across the 22 Arab countries a plethora of dialects exist, with most regions having a distinct dialect that sets them apart. In this blog, I will walk through how machine learning and deep learning models can be used to identify the various classes of Arabic dialects.

To set the ball rolling, I fetched the QADI dataset in JSON format from a private API capable of serving up to a thousand tweets per request. The data was then saved as a CSV file using the Pandas library.
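Roughly, this step looks like the sketch below. The endpoint URL, query parameters, and response fields are placeholders for illustration, not the real private API.

```python
# A minimal sketch of the fetching step; the endpoint and fields are hypothetical.
import requests
import pandas as pd

API_URL = "https://example.com/qadi/tweets"   # hypothetical endpoint
BATCH_SIZE = 1000                             # the API handles up to 1,000 tweets per call

records, offset = [], 0
while True:
    resp = requests.get(API_URL, params={"limit": BATCH_SIZE, "offset": offset})
    resp.raise_for_status()
    batch = resp.json()        # assumed: a list of {"text": ..., "dialect": ...} objects
    if not batch:
        break
    records.extend(batch)
    offset += BATCH_SIZE

pd.DataFrame(records).to_csv("qadi_dialects.csv", index=False, encoding="utf-8")
```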

Next on the agenda was a vital step: to analyze, visualize, and clean the gathered data.

For instance, the ‘EG’ class had six times more tweets than classes like ‘TN’. To tackle this, I relied on an evaluation metric better suited to imbalanced classes (the F1 score, reported alongside accuracy below) for a more faithful picture of performance. One might wonder why I did not under-sample: that would discard crucial information embedded in the dataset. Instead, I adopted binning, aggregating the 18 dialects into 10 prominent dialect groups, which mitigated the imbalance effectively. A sketch of this grouping follows.
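The grouping dictionary below is only illustrative; the exact 10 groups used in the project are not reproduced here.

```python
import pandas as pd

df = pd.read_csv("qadi_dialects.csv")

# Illustrative grouping of country codes into regional dialect groups;
# the actual mapping covered all 18 dialects and produced 10 groups.
dialect_groups = {
    "TN": "Maghreb", "DZ": "Maghreb", "MA": "Maghreb", "LY": "Maghreb",
    "EG": "Egypt", "SD": "Sudan",
    "LB": "Levant", "SY": "Levant", "JO": "Levant", "PL": "Levant",
    # ... the remaining country codes map to their own groups
}
df["dialect_group"] = df["dialect"].map(dialect_groups)

# Check how severe the imbalance remains after binning
print(df["dialect_group"].value_counts())
```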

Fortunately, the dataset contains no missing values.

Before delving into the cleaning process, it’s essential to comprehend what ‘cleaning’ entails in this context. The identification of an Arabic dialect, or any language dialect for that matter, hinges predominantly on the unique verbs and nouns it employs. Therefore, elements such as emojis, links, HTML elements, mentions, and hashtags, among other things, are superfluous and were eliminated using regular expressions, thereby focusing on the linguistic characteristics that truly matter.

Using Regex to clean the tweets
Removing Arabic and English punctuation
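The cleaning itself boils down to a handful of regular expressions. Here is a minimal sketch; the column names and the exact patterns are assumptions and may differ from the original code.

```python
import re
import pandas as pd

df = pd.read_csv("qadi_dialects.csv")

URL_RE     = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
HTML_RE    = re.compile(r"<[^>]+>|&\w+;")
EMOJI_RE   = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Arabic and English punctuation marks
PUNCT_RE   = re.compile(r"[!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~،؛؟«»]")

def clean_tweet(text: str) -> str:
    # Strip links, mentions, hashtags, HTML remnants, emojis, and punctuation
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, HTML_RE, EMOJI_RE, PUNCT_RE):
        text = pattern.sub(" ", text)
    return text

df["clean_text"] = df["text"].apply(clean_tweet)
```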

To further streamline the dataset, Arabic letters that can appear in various formats (like أ, ؤ, گ) were standardized using regex, ensuring a uniform representation.

Unifying Arabic letters
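A sketch of that normalization step; the exact substitutions in the original code may differ.

```python
import re

def normalize_arabic(text: str) -> str:
    # Unify the different alef forms and a few letters with multiple spellings
    text = re.sub(r"[إأآا]", "ا", text)
    text = re.sub(r"ؤ", "و", text)
    text = re.sub(r"ئ", "ي", text)
    text = re.sub(r"ى", "ي", text)
    text = re.sub(r"گ", "ك", text)   # Persian kaf, sometimes used for the /g/ sound
    # Strip diacritics (tashkeel) and the tatweel character
    text = re.sub(r"[\u064B-\u0652\u0640]", "", text)
    return text

df["clean_text"] = df["clean_text"].apply(normalize_arabic)
```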

Some stop words, like ‘إن’ and ‘أن’, do not influence dialect identification, while others, like ‘دونك’ (frequently used in Morocco and Algeria), do play a significant role. This guided the selective removal of stop words, retaining those that aid in dialect distinction.

Removing stop words and extra spaces
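A minimal sketch of this selective filtering, assuming an NLTK Arabic stop-word list and an illustrative, hand-picked set of dialect-bearing words to keep.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                 # one-time download of the NLTK stop-word lists

# Generic Arabic stop words, minus the ones that actually signal a dialect
arabic_stopwords = set(stopwords.words("arabic"))
keep_words = {"دونك"}                      # illustrative: dialect-bearing words to preserve
removable = arabic_stopwords - keep_words

def remove_stopwords(text: str) -> str:
    tokens = text.split()                  # splitting also collapses extra whitespace
    return " ".join(t for t in tokens if t not in removable)

df["clean_text"] = df["clean_text"].apply(remove_stopwords)
```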

After a rigorous cleaning process, I engaged in an analytical venture to pinpoint the most frequently used words across different dialects.

Most frequent words in some dialects

This analysis revealed a considerable overlap in word usage between various countries.

It is vital to remember that machine learning (ML) and deep learning (DL) models communicate in the language of numbers, not text. This necessitated the extraction of numerical features to facilitate smoother interaction with the classifiers. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes into play, extracting a specified number of words as features and vectorizing the tweets. Following this, the dialects were encoded into labels using a label encoder.

Feature Extraction
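In scikit-learn terms this step looks roughly like the following; the number of features kept and the random seed are assumptions, and the 80–20 split matches the setup described later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Vectorize the cleaned tweets; max_features is illustrative
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(df["clean_text"])

# Encode the dialect groups as integer labels
encoder = LabelEncoder()
y = encoder.fit_transform(df["dialect_group"])

# The 80-20 train-test split used in both experiments
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```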

Using LDA to visualize the generated features vs. the dialects.

Arabic Dialect Identification in the Wild
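Assuming LDA here refers to Linear Discriminant Analysis (projecting the TF-IDF features into a low-dimensional space colored by dialect), such a plot can be sketched as follows; it builds on the variables from the feature-extraction sketch above.

```python
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA needs a dense array; project to 2 components for plotting
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X_train.toarray(), y_train)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, s=3, cmap="tab10", alpha=0.5)
plt.legend(*scatter.legend_elements(), title="Dialect", loc="best", fontsize="small")
plt.title("TF-IDF features projected with LDA, colored by dialect")
plt.show()
```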

Given the non-linear nature of the features in this scenario, an advanced model such as an LSTM or a transformer seemed the prudent choice; recurrent neural networks (RNNs) in particular tend to shine on sequential data. I therefore compared Naive Bayes, Logistic Regression, a feed-forward neural network (NN), and an RNN across two separate experiments, each using an 80–20 train-test split.

In the first experiment, the models are fed the TF-IDF features. In the second experiment, the RNN is fed the original preprocessed text directly.

In the first experiment, I tested Naive Bayes, Logistic Regression, and a regular deep learning model. They achieved at most 38% accuracy and a 34% F1 score.

The Confusion Matrices of Logistic Regression, Naive Bayes, and the DL Model, Respectively
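A rough sketch of how the classical baselines of the first experiment could be trained and scored; the hyperparameters and the macro averaging of the F1 score are assumptions.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(name,
          "accuracy:", round(accuracy_score(y_test, preds), 2),
          "macro F1:", round(f1_score(y_test, preds, average="macro"), 2))
    print(confusion_matrix(y_test, preds))
```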

In the second experiment, the preprocessed text is tokenized and saved in an indexed corpus of about 405,000 words. Each tweet is then converted into a sequence of indices with a maximum length of 250, using post-padding and an OOV token, and fed to the following RNN architecture:

RNN Architecture
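A minimal Keras sketch of this second pipeline, building on the earlier sketches; the embedding size, layer types, and layer widths are assumptions, not the exact architecture shown above.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

MAX_LEN = 250
VOCAB_SIZE = 405000            # size of the indexed corpus mentioned above

train_texts, test_texts, y_train, y_test = train_test_split(
    df["clean_text"], y, test_size=0.2, random_state=42)

# Tokenize the preprocessed tweets and post-pad to a fixed length, with an OOV token
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                            maxlen=MAX_LEN, padding="post")
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test_texts),
                           maxlen=MAX_LEN, padding="post")

# Illustrative recurrent architecture; the real layer types and sizes may differ
num_classes = len(set(y))
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train_seq, y_train, validation_data=(X_test_seq, y_test), epochs=3)
```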

The RNN achieved better results than the other models with just 3 epochs! It reached about 50% accuracy and a 45% F1 score.

RNN Results

After considerable time invested in training and experimenting, I was content with the results and proceeded to deploy the model locally, utilizing Flask, HTML, and CSS to create an Arabic Dialect Identification application.

Arabic Dialect Identification Local App
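A minimal Flask sketch of such an app; the route, template, and saved-artifact file names are placeholders, not the actual repository layout, and in practice the same cleaning pipeline from above would be applied to the input before prediction.

```python
import pickle
import tensorflow as tf
from flask import Flask, render_template, request

app = Flask(__name__)
model = tf.keras.models.load_model("dialect_rnn.h5")        # hypothetical artifact names
tokenizer = pickle.load(open("tokenizer.pkl", "rb"))
encoder = pickle.load(open("label_encoder.pkl", "rb"))

@app.route("/", methods=["GET", "POST"])
def predict():
    dialect = None
    if request.method == "POST":
        text = request.form["tweet"]   # in practice, cleaned with the same preprocessing first
        seq = tf.keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([text]), maxlen=250, padding="post")
        probs = model.predict(seq)
        dialect = encoder.inverse_transform(probs.argmax(axis=1))[0]
    return render_template("index.html", dialect=dialect)

if __name__ == "__main__":
    app.run(debug=True)
```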

However, the journey doesn’t end here. Despite these milestones, the model still faces hurdles such as class imbalance and inaccuracies in the collected data. Potential enhancements include using the AraBERT pre-trained model to generate word embeddings and extending training to 100 epochs, albeit at the cost of substantial time and computational power.

You can explore the code repository here: GitHub Repository

Thank you for embarking on this journey with me! Wishing you a wonderful day ahead 😊
