Syria Twitter Data Processing, Visualization, and NLP Analysis

An overview of the Syrian Civil War

My project is aimed at bringing the Syrian crisis to the attention of the general public through data, visualizations, and analysis.

Like other Middle Eastern nations caught up in the Arab Spring, Syria had an undercurrent of popular discontent that finally reached a boiling point. It started with relatively peaceful protests, which the Assad regime met with violence. The situation escalated quickly, and many members of the Syrian military defected to join opposition groups. Thus began the Syrian Civil War.

Syria’s conflict has devolved from peaceful protests against the government in 2011 to a violent insurgency that has drawn in numerous other countries. It’s partly a civil war of government against people; partly a religious war pitting President Assad’s minority sect, aligned with fighters from Iran and Hezbollah in Lebanon, against Sunni rebel groups; and increasingly a proxy war featuring Russia and Iran against the United States and its allies. Whatever it is, it has so far killed over 500,000 people, displaced half of the country’s population (more than 7,000,000 people), and facilitated the rise of ISIS. Link

About the dataset

I’m a huge fan of open-source projects and giving back to the community. In my first blog post of this series, I talked about contacting individuals and companies for datasets. Through research, I was lucky to connect with an individual located in Finland named Daniel Zautner. Daniel has previously researched Syria’s rebel communities on Twitter and analyzed the CIA’s TOW anti-tank missile program. The CIA orchestrated a plan to provide missiles to rebel groups, and in return, the rebels had to upload a video of themselves using the weapon to YouTube. You can read more about the program [here] and [here]. Daniel assisted me with my research and provided me with his Twitter dataset. I especially want to thank Daniel for his help, and I hope we’re able to meet in the future over drinks or a coffee.

All of the code shown in this tutorial lives on GitHub. The Jupyter Notebook in the repo includes all of the text from this blog post, with some slight modifications.

Understand the data

Since the dataset is unknown to me, I need to do some basic analysis to decide what’s important and to get familiar with the data. To help me understand the data, I’m going to look at the following:

  • How big is the dataset?
  • What attributes does it have?

First, the investigation of the dataset. Each tweet JSON object contains a lot of information, and I need to figure out exactly which attributes I want to investigate. I want to cut the dataset down to identify what data is actually useful.

Print sample tweet

[{'_id': {'$oid': '595e82d713bbf01307babbba'}, 'created_at': {'$date': '2017-07-06T18:34:37.000Z'}, 'id': 8.830314454338519e+17, 'id_str': '883031445433851904', 'text': 'RT @rishibagree: Distance From Delhi to Syria - 3600 Km\nDistance From Delhi to #Basirhat WB-1100 km\n\nGuess where did @ndtv Reporters… ', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 94272634, 'id_str': '94272634', 'name': 'Oh my god!', 'screen_name': 'AnxJain', 'location': None, 'url': None, 'description': None, 'protected': False, 'verified': False, 'followers_count': 80, 'friends_count': 51, 'listed_count': 0, 'favourites_count': 6708, 'statuses_count': 4096, 'created_at': 'Thu Dec 03 06:37:58 +0000 2009', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'C0DEED', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/858923442631974912/yqngwcC2_normal.jpg', 'profile_image_url_https': 
...
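A minimal sketch of how a sample tweet like the one above can be loaded and printed, assuming the dump is stored as JSON lines (one tweet object per line, a common MongoDB export format); the file path and limit are illustrative:

```python
import json
from pprint import pprint

def load_tweets(path, limit=None):
    """Read up to `limit` tweets from a JSON-lines file (one object per line)."""
    tweets = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            tweets.append(json.loads(line))
    return tweets

# Usage (hypothetical filename):
# pprint(load_tweets("syria_tweets.json", limit=1)[0])
```

One thing worth noticing in the sample output above: the numeric `id` was parsed as a float (8.83e+17) and lost precision, so `id_str` is the field to trust for tweet identity.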

Loading sample tweets in Excel

Sometimes working with Excel is necessary for readability, and in this case, it is. I wrote a tool that takes n tweets in JSON, transposes the tweet data vertically, and writes each tweet to a separate Excel tab. This will help me gain a better understanding of the JSON object structure.

Display Excel sample

Total count of tweets and column attributes

Total count of tweets 2853841

['_id.$oid', 'contributors', 'coordinates', 'created_at.$date', 'entities.hashtags', 'entities.symbols', 'entities.urls', 'entities.user_mentions', 'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place', 'retweet_count', 'retweeted', 'retweeted_status.contributors', 'retweeted_status.coordinates', 'retweeted_status.created_at', 'retweeted_status.display_text_range', 'retweeted_status.entities.hashtags', 'retweeted_status.entities.symbols', 'retweeted_status.entities.urls', 'retweeted_status.entities.user_mentions', 'retweeted_status.extended_tweet.display_text_range', 'retweeted_status.extended_tweet.entities.hashtags', 'retweeted_status.extended_tweet.entities.media', 'retweeted_status.extended_tweet.entities.symbols', 'retweeted_status.extended_tweet.entities.urls', 'retweeted_status.extended_tweet.entities.user_mentions', 'retweeted_status.extended_tweet.extended_entities.media', 'retweeted_status.extended_tweet.full_text', 'retweeted_status.favorite_count', 'retweeted_status.favorited', 'retweeted_status.filter_level', 'retweeted_status.geo', 'retweeted_status.id', 'retweeted_status.id_str', 'retweeted_status.in_reply_to_screen_name', 'retweeted_status.in_reply_to_status_id', 'retweeted_status.in_reply_to_status_id_str',
...
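The dot-notation attribute names above come from flattening each nested tweet object. With pandas, that flattening can be sketched as follows (the miniature tweet dict is illustrative; real tweets have many more nested fields):

```python
import pandas as pd

# A miniature tweet-like dict standing in for a full tweet object.
tweet = {
    "id_str": "883031445433851904",
    "user": {"screen_name": "AnxJain", "followers_count": 80},
    "retweeted_status": {"favorite_count": 3},
}

# json_normalize flattens nested objects into dot-separated column names,
# e.g. user.screen_name, matching the attribute list above.
flat = pd.json_normalize(tweet)
print(sorted(flat.columns))
```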

Removing retweets from my dataset

First, the condensing part. I found that a large portion of the data consisted of retweets. Not only are the majority of tweets retweets, but retweets also drastically expand the size of the data. For example, if I flatten an original tweet there are only ~60 attributes; if I flatten a retweet there are ~200. For this analysis, I’m going to concentrate my efforts on original, unique tweets. My assumption is that I can find more pertinent information in unique tweets, and I can incorporate the retweets later.

Total count of tweets without retweets: 1160088
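The filter itself is straightforward, assuming the streaming-API convention that a retweet carries a nested retweeted_status object; a sketch:

```python
def remove_retweets(tweets):
    """Keep only original tweets: retweets carry a retweeted_status object."""
    return [t for t in tweets if "retweeted_status" not in t]
```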

Identify tweet attributes and save to CSV file

In this step, I’m going to condense and clean the data to get it into a more analysis-friendly format. I want to be able to load my entire dataset into a pandas data frame for easier interpretation. In order to do that, I need to format and structure my data, so I’m going to write it out as a CSV. Some notes on why I chose to write my JSON data to a CSV:

  • A lot of people like seeing the data all at once as a CSV — there are neat ways to print it in Python but this is easier to absorb
  • CSV is a great format to use if you need to share the data with people who don’t program as they can open it up in any notepad or spreadsheet program, no coding required
  • If you want to continue the analysis with Python but in a different project file, you can always use the dictionary or read in a CSV in a couple of lines of code

Tweet attributes for analysis

I’m going to identify all of the important attributes for this project and list them below. This is part of the data cleaning/processing section. I referenced the official Twitter API documentation to understand all of the fields and identify what information I want to pull from each tweet. Here’s the link.

The attributes

The user object contains public Twitter account metadata and describes the account. The entities section provides arrays of common things included in Tweets: hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media. Here’s a sample of tweet attributes I’m extracting:

  • created_at: UTC time when this Tweet was created.
  • id: The integer representation of the unique identifier for this Tweet.
  • id_str: The string representation of the unique identifier for this Tweet.
  • text: The actual UTF-8 text of the tweet.
  • source: Utility used to post the Tweet as an HTML-formatted string.
  • retweet_count: Number of times the tweet was retweeted.
  • favorite_count: Number of times the tweet was favorited.
  • lang: When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.
  • coordinates: Represents the geographic location of this Tweet as reported by the user or client application.

Data cleaning functions
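As a sketch of one such cleaning function, here is how the chosen attributes might be pulled from a raw tweet dict. The field names follow the Twitter API; missing fields fall back to None:

```python
def extract_attributes(tweet):
    """Pull the analysis attributes out of one raw tweet dict."""
    wanted = ["created_at", "id", "id_str", "text", "source",
              "retweet_count", "favorite_count", "lang", "coordinates"]
    return {field: tweet.get(field) for field in wanted}
```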

Write data to CSV file
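A minimal sketch of the CSV-writing step using the standard library, assuming each row is a flat dict with the same keys (the attributes chosen above):

```python
import csv

def write_rows_to_csv(rows, path):
    """Write a list of flat dicts to a CSV file, one column per key."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```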

Exploring and visualizing the dataset

Let’s start by getting a qualitative view of what we have before we dive into the text. There’s a mix of quantitative tools for this step and NLP features for looking at the text.

Here are my steps:

  1. Load the clean dataset into a data frame to summarize its main characteristics.
  2. Generate visualizations to quickly and simply view most of the relevant features of the dataset.
  3. Identify variables that are likely to have interesting observations.
  • What is our timeframe?
  • Top users by followers?
  • Top hashtags by count?
  • Top sources used to send tweets?

Load CSV file into dataframe
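This step is nearly a one-liner with pandas; a sketch, with created_at parsed up front so the time-based analysis below is easier (the column names match the attributes chosen earlier):

```python
import pandas as pd

def load_tweets_csv(path):
    """Load the cleaned CSV, parsing created_at into proper datetimes."""
    return pd.read_csv(path, parse_dates=["created_at"])
```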

Tweet dates timeframe

Let’s take a look at the timeline of our dataset. What are the dates and the counts over time in data? What events happened during the time frame?
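Counting tweets per day answers the first question; a sketch, assuming the data frame has a parsed created_at column:

```python
import pandas as pd

def tweets_per_day(df):
    """Count tweets per calendar day from the created_at column."""
    return df.set_index("created_at").resample("D").size()
```

The resulting series can be plotted directly to show counts over time.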

Here’s a list of events that happened during this timeframe:

July 2017

  • 1 July Syrian Army forces fully recapture the quarries area west of Baath City in Quneitra, reversing all rebel gains made during the offensive.
  • 4 July A fifth round of talks organized by Russia, Turkey, and Iran takes place in Astana, Kazakhstan, discussing the implementation of safe zones in Syria.
  • 11 July The Syrian Observatory for Human Rights confirms the death of Abu Bakr al-Baghdadi, the leader of the Islamic State.
  • 14 July The Syrian Arab Army claims that the Free Syrian Army commander of Regiment 107 was killed by a roadside bomb planted by the SAA in Daraa.
  • 29 July The Islamic State of Iraq and the Levant loses 6,000 square kilometers during the first month of the fourth year since its creation.

August 2017

  • 3 August The Syrian government shells rebel towns in Eastern Ghouta.
  • 4 August Nearly 8,000 Syrian refugees and fighters arrive in rebel-held central Syria from Lebanon as part of a ceasefire deal between Hezbollah and Fateh al-Sham.

Top tweet sources

In this part we’re going to check the source of the tweets, and by “source” I mean the device or app each tweet was sent from. As usual, the first step is cleaning. The device name sits at the end of the HTML string in the tweet_source column. With the following example we can see how the device is extracted.
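The raw source value is an HTML anchor tag, so the cleaning boils down to stripping the markup; a sketch:

```python
import re

def clean_source(source_html):
    """Keep only the app/device name from the HTML anchor in `source`."""
    match = re.search(r">([^<]+)<", source_html)
    return match.group(1) if match else source_html

example = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(clean_source(example))  # Twitter for iPhone
```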

This chart shows us that most tweets were sent through the Twitter Web Client, iPhone, and Android, with less than 3% coming from Instagram and other sources. I also noticed that IFTTT makes up about 6% of sources.

What is IFTTT?
Its name comes from the programming conditional statement “if this, then that.” The company provides a software platform that connects apps, devices, and services from different developers in order to trigger one or more automations involving them. This was really surprising to me and could be an indicator of Twitter bots.

Top tweet hashtags

It’s not surprising that some of the top hashtags are “syria”, “isis”, “iran”, and “iraq”. This is a great indication that my data’s topics are centered around the Syrian war. However, I notice that one of the hashtags is in Arabic. I’ll discuss later how I plan to handle the data that’s in a foreign language.
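Tallying hashtags reduces to a Counter over the entities; a sketch, assuming each tweet’s entities.hashtags follows the Twitter API shape of {"text": ...} objects:

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Return the n most common hashtags (lowercased) across all tweets."""
    counts = Counter(
        tag["text"].lower()
        for tweet in tweets
        for tag in tweet.get("entities", {}).get("hashtags", [])
    )
    return counts.most_common(n)
```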

Top Twitter users by followers

Tweet media types

Tweet languages

As I mentioned earlier, I noticed a hashtag in Arabic. Since my topic, and consequently my dataset, covers events in the Middle East, a large subset of my data is probably in another language.

The majority of my data is in English, which is great. However, given that my topic is centered on the Middle East, I want to be sure to include that data (i.e., the Arabic text). To find meaningful results using text processing, I need to translate all of my text data into a single language (i.e., English). The Google Translation API is very easy to set up and use. However, it is not free, so if you’re making any API calls, make sure you have taken the steps to estimate how much it will cost, especially for a large dataset.

I’m still estimating how much it will cost me to translate the non-English text into English, but I hope to keep it around $300.
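The back-of-the-envelope estimate is simple because Cloud Translation bills per character. The $20-per-million-characters rate below is an assumption based on pricing at the time; check the current pricing page before committing to a real translation job:

```python
def estimate_translation_cost(texts, usd_per_million_chars=20.0):
    """Rough cost of translating `texts`, billed per source character.
    The per-character rate is an assumption, not a quoted price."""
    total_chars = sum(len(t) for t in texts)
    return total_chars / 1_000_000 * usd_per_million_chars

# Illustrative only: 150,000 non-English tweets averaging 100 characters each.
print(estimate_translation_cost(["x" * 100] * 150_000))  # 300.0
```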

Sentiment analysis with textblob

What is sentiment analysis? Sentiment analysis refers to a classification task in the Natural Language Processing (NLP) community, the goal of which is commonly to determine the polarity (positive/negative) of the input text. Subjectivity analysis, by contrast, deals with the detection of private states (opinions, emotions, sentiments, beliefs, speculations) and classifies the textual input as objective or subjective.

What is textblob? I’m not an NLP expert, but I recently discovered a library called TextBlob. I found it very easy to get started with. You might want to read the docs and see how it suits your project. Here’s the link.

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation and more.
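With TextBlob, the call itself is tiny: `TextBlob(text).sentiment.polarity` returns a float in [-1.0, 1.0]. Under the hood it averages scores from an opinion lexicon. The toy sketch below illustrates that idea with a made-up mini-lexicon; it is not TextBlob’s actual word list:

```python
# Toy polarity scorer: average the signed scores of known opinion words.
# This lexicon is invented for illustration; TextBlob ships a far larger,
# pattern-based lexicon internally.
LEXICON = {"good": 0.7, "great": 0.8, "peace": 0.6,
           "bad": -0.7, "violence": -0.8, "war": -0.6}

def toy_polarity(text):
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("peace talks are good"))      # positive
print(toy_polarity("the war brought violence"))  # negative
```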

Does TextBlob have any advantages over or differences from the Natural Language Toolkit (NLTK)?

  • I don’t have much experience, but I found TextBlob much easier to pick up and use than NLTK.
  • It is built on top of NLTK and provides a cleaner API that can help you get your job done quickly.

For more accurate results, I should consider adding the retweets back into my dataset while dropping any duplicates, and translating any foreign text into English. An interesting next step would be to analyze the polarity of tweets by media type; restricting the analysis to tweets with a specific media type might skew the polarity more positive or negative.

Next steps

What’s next? I have a lot of data and a great start towards analyzing the Syrian Civil War. However, there’s never enough data! Since my dataset is from last summer, I’m going to try to bridge the gap from last summer to today for a subset of Twitter users. I’m also going to work on translating the text to English. For my research, here are some areas I’m going to investigate:

  • Classify topics within tweets
  • Analyze the sentiment around various forms of media, like a picture about a war (e.g. does a picture make people angry or sad?)
  • How do people’s emotional states influence their decision of whether or not to tweet?
  • How do people’s sentiments towards media types affect their decision on whether or not to engage with different types of media?

I’m excited to begin working on the data science aspect of my project.