Correlations between unclassified social media data and structured event data during the Syrian Civil War

Sentiment Analysis Using NLTK and Machine Learning Techniques on the Syrian Conflict

Introduction

The conflict in Syria, which has been ongoing since March 2011, has been characterized by the extensive use of online social media platforms by all involved parties. That unprecedented use of social media, together with the exceptional human and strategic urgency of the conflict, makes it an ideal case for ongoing research.

I use Twitter data and structured event data to explore and analyze public opinion and sentiment toward real war events that took place in Syria. To analyze public sentiment about specific war events, I collected over 3,000,000 relevant tweets and over 3,000 verified war events.

Goal of my research

My primary interest is to understand the extent to which Twitter activity reflects the situation on the ground, and whether it offers additional insight into the relationship between war events and the corresponding public sentiment. The reasoning is that the sentiment of individual tweets indicates the overall sentiment of the community the tweets pertain to, and the overall sentiment of the community in turn indicates its sentiment toward specific events.

This project investigates that reasoning and evaluates the accuracy of the approach.

Data Collection

My hypothesis about pre-processing is this: reducing the noise in the text should improve the performance of the classifier and speed up the classification process, thus aiding real-time sentiment analysis.

Twitter

I obtained a Twitter dataset of over 3,000,000 tweets, but raw data obtained from Twitter is not fit for extracting features. Most tweets consist of a message along with usernames, empty spaces, special characters, stop words, emoticons, abbreviations, hashtags, time stamps, URLs, etc. To make this data fit for mining, I pre-process it using various functions of NLTK (Natural Language Toolkit).
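As a rough illustration of this kind of cleaning, here is a minimal sketch that uses only Python's standard `re` module rather than NLTK. The function name `clean_tweet` and the exact patterns are my own assumptions for the example, not the pipeline used in this project.

```python
import re

# Hypothetical patterns for illustration; real tweets may need more cases.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Anything that is not a word character or whitespace (punctuation, emoji, '#'),
# plus the underscore. Unicode letters such as Arabic script are kept.
NOISE_RE = re.compile(r"[^\w\s]|_")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @usernames, and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # drop links
    text = MENTION_RE.sub(" ", text)  # drop @usernames
    text = NOISE_RE.sub(" ", text)    # drop punctuation and emoji, keep hashtag words
    return " ".join(text.split())     # collapse runs of whitespace


if __name__ == "__main__":
    print(clean_tweet("#Syria: clashes in Ar-Raqqa!!! https://t.co/zo5YUOyXyM"))
```

The order matters here: URLs and usernames are removed before punctuation, because stripping punctuation first would break them into fragments that no longer match.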

Syria events data

Syria civil war event data was collected by The Armed Conflict Location & Event Data Project (ACLED). ACLED collects the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events across Syria. From their Syrian war data, I’m using over 3,000 documented and verified war events.

Heat map and event points of all of the Syrian war events in my dataset

Initial Investigation

Before I build any models, I want to test my hypothesis. I manually selected an event from my event dataset and found related tweets based on the similarity of the tweet text and event text.

Specific event to test

Violent clashes took place in the village of Ghanim al-Ali in Ar-Raqqa countryside between the Syrian army and its allies on one side and the Islamic State on other, the clashes were accompanied with airstrikes on the area of conflict. Pro-Syrian regime forces fully controlled the village. No fatalities reported.
ACLED event dataset

After manually checking through the tweets I found a pretty good match. See below.

Tweet match

tweet example from dataset

Event and the tweet

Violent clashes took place in the village of Ghanim al-Ali in Ar-Raqqa countryside between the Syrian army and its allies on one side and the Islamic State on other, the clashes were accompanied with airstrikes on the area of conflict. Pro-Syrian regime forces fully controlled the village. No fatalities reported.

Framework

Sentiment Analysis

Sentiment analysis can be defined as a process that automates the mining of attitudes, opinions, views, and emotions from text, speech, tweets, and other data sources through Natural Language Processing.

Today, most people use social media sites to express their opinions. Sentiment analysis helps understand people in a more accurate way.

I’m using NLTK’s built-in VADER sentiment analyzer (Valence Aware Dictionary and sEntiment Reasoner) to classify a piece of text as positive, negative, or neutral using a lexicon of positive and negative words.

What is a lexicon? A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information, such as part-of-speech and sense definitions.

Why VADER for sentiment analysis? What makes VADER great for social media text? Lexicons are expensive and time-consuming to produce, so they are not updated all that often. This means most of them lack a lot of the current slang that people use to express how they are feeling. VADER’s lexicon, by contrast, was built with social media text in mind and covers slang, emoticons, and acronyms.

About the Scoring

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single uni-dimensional measure of sentiment for a given sentence. Calling it a ‘normalized, weighted composite score’ is accurate.

It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. Typical threshold values (as suggested in the VADER documentation) are:

  • positive: compound score >= 0.05
  • neutral: (compound score > -0.05) and (compound score < 0.05)
  • negative: compound score <= -0.05
Negative tweet:

['I liked a @YouTube video https://t.co/zo5YUOyXyM Tillerson calls for Syria unity, no-fly zone as Russia alerts over chemical attacks']
Positive tweet:
'@kwilli1046 They need to be sent to that beautiful country called Syria!!!!\n😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😜😄😜😜😜😜😜😜😜']
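
The thresholds above can be sketched in a few lines. This is an illustration under assumptions, not the exact code behind the examples: the helper names `label_from_compound` and `score_tweet` are hypothetical, and `score_tweet` assumes that NLTK is installed and that the `vader_lexicon` resource has been downloaded (for example with `nltk.download('vader_lexicon')`).

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_tweet(text):
    """Label one tweet with VADER. Assumes nltk and vader_lexicon are available."""
    # Imported here so the thresholding helper works without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)  # keys: 'neg', 'neu', 'pos', 'compound'
    return label_from_compound(scores["compound"])
```

For a large dataset it would make sense to create the analyzer once and reuse it, since building it reloads the lexicon each time.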

Tokenization

Tokenization is the process of breaking up a given text into smaller units, such as sentences or words. It does this by locating word boundaries.

What is a word boundary? The point where one word ends and the next word begins is called a word boundary. Tokenization is also known as word segmentation.
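
To make the idea of word boundaries concrete, here is a minimal regex-based sketch. It is a simplification I am assuming for illustration (the `tokenize` name and the pattern are hypothetical); NLTK ships more careful tokenizers, such as `TweetTokenizer` in `nltk.tokenize`, which handle emoticons and hashtags better.

```python
import re

# A token is a run of word characters, optionally followed by an
# apostrophe and more word characters (so "don't" stays in one piece).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?")


def tokenize(text):
    """Split text into lowercase word tokens at word boundaries."""
    return TOKEN_RE.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("Pro-Syrian regime forces fully controlled the village."))
```

Note that this simple pattern splits hyphenated words such as "Pro-Syrian" into two tokens, which may or may not be what you want.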

Stemming, lemmatization, and stop word removal

Stop words are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an, the, etc.

Stemming is the process of removing prefixes and suffixes from words to reduce them to their stem form. Word forms may differ from the stem due to morphological changes and grammatical reasons. For example, the word “computation” might be stemmed to “comput”. Lemmatization is a related technique that maps a word form to its basic dictionary form rather than a truncated stem.
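
Here is a small sketch of both steps, assuming NLTK is installed. It uses NLTK’s `PorterStemmer`, which needs no extra downloads. The stop word set is a tiny hand-written list taken from the examples above, purely for illustration; NLTK also provides a fuller list through its `stopwords` corpus, which must be downloaded separately. The helper names are hypothetical.

```python
from nltk.stem import PorterStemmer

# Illustrative stop list only; a realistic list would be much longer.
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the"}

_stemmer = PorterStemmer()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem_tokens(tokens):
    """Reduce each token to its Porter stem."""
    return [_stemmer.stem(t) for t in tokens]


if __name__ == "__main__":
    tokens = ["the", "computation", "is", "done"]
    print(stem_tokens(remove_stop_words(tokens)))
```

Removing stop words before stemming keeps the stop list simple, because it only has to contain the unstemmed forms.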

Term Frequency — Inverse Document Frequency (tf-idf)

What is tf-idf? Tf-idf is a very common technique for determining roughly what each tweet in a set of tweets is “about”. It does this by looking at two simple metrics: tf (term frequency) and idf (inverse document frequency). Term frequency is the ratio of the number of occurrences of a specific term to the total number of terms in a document. Inverse document frequency is the inverse of the proportion of documents that contain that term, usually log-scaled.

The general idea is that if a specific phrase appears many times in a given tweet, but it doesn’t appear in many other tweets, then we have a good idea that the phrase is important in distinguishing that tweet from all the others.

Now, I’m going to apply that same logic to the tweets, weighting each term based on its frequency relative to the other tweets, and then use those weights to match each tweet to an event.
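
The weighting described above can be sketched in pure Python. This is a minimal illustration under assumptions: the function name `tf_idf` is hypothetical, and I am assuming the common log-scaled form of idf, log(N / df). Libraries such as scikit-learn use slightly different smoothing, so their numbers will not match this exactly.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights.

    docs: list of token lists, one per tweet.
    Returns one {term: weight} dict per tweet.
    """
    n_docs = len(docs)

    # Document frequency: in how many tweets does each term appear?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # tf = share of this tweet's tokens; idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    tweets = [["clashes", "in", "raqqa"], ["airstrikes", "in", "idlib"]]
    for vec in tf_idf(tweets):
        print(vec)
```

A term that appears in every tweet gets a weight of zero, which is the behavior the general idea above calls for: it does nothing to distinguish one tweet from another.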

Cosine Similarity

Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors (or two documents in the vector space). Here’s an example illustrating the scoring between two vectors.

Below, I iterate through the tweet and event vectors and calculate the cosine similarity between a specific tweet and every event. I then take the maximum cosine similarity between the tweet and all of the events and assign the ID of that best-matching event to the tweet.
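
The matching step could look roughly like the following. This is a sketch under assumptions rather than the project’s actual code: vectors are stored as sparse {term: weight} dictionaries, and the names `cosine` and `best_event` and the event IDs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def best_event(tweet_vec, event_vecs):
    """Return (event_id, score) for the event most similar to the tweet.

    event_vecs: non-empty dict mapping event_id -> {term: weight}.
    """
    scored = ((event_id, cosine(tweet_vec, vec)) for event_id, vec in event_vecs.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    tweet = {"raqqa": 1.0, "clashes": 1.0}
    events = {
        "e1": {"raqqa": 1.0, "clashes": 1.0, "airstrikes": 1.0},
        "e2": {"idlib": 1.0},
    }
    print(best_event(tweet, events))
```

Because this always returns the highest-scoring event, even a tweet with a very low maximum similarity gets assigned to something. A minimum score cutoff would be one way to leave unrelated tweets unmatched.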

What’s next?

My next steps are to evaluate the accuracy of pairwise cosine similarity and to apply other machine learning models. Stay tuned!

Acknowledgements

I would like to thank Chipy (Chicago Python), Zax, and everyone who assisted and mentored me in my research. Thank you for your time, effort, and pizza. I really appreciate it. Especially the pizza.