The first things I do when I get to work are (1) grab a cup of coffee and (2) read the news.
My go-to news sources are The New York Times, Hacker News, and Reddit. These three sites already say a lot about me and my interests. I have developed a strong interest in global affairs and the way information is reported. I’ve noticed that many people are switching to alternative forms of news, and many now rely on social media for information. And for people in less developed countries or areas of conflict, social media can sometimes mean the difference between life and death.
One place where this is most apparent is within communities affected by the Syrian civil war. I’ve been following this conflict since 2011, when I was studying at the University of Iowa. I first became interested in the unrest and protests in the Middle East, and then began to focus specifically on the Syrian crisis as it progressed from protest to civil war. I’m fascinated by this conflict because it is arguably the most covered and documented war in history, having developed during the height of social media (YouTube, Twitter, Facebook, etc.).
When thinking about a subject that would inspire me throughout my ChiPy experience, I realized I wanted one that is constantly developing, involves social media, and is relevant to society today. The Syrian Civil War quickly came to mind. The fact that news comes in at such an extreme speed through social media makes it all the more intriguing, and it is something I want to delve into further during my mentorship.
My initial plan is to collect and organize data, investigate and discover questions to answer, and then apply data science to answer the question I develop. To be honest, I don’t know where my research will lead me, but the unknowns make me even more excited to start discovering.
In the meantime, I have identified a few possible routes for my project:
- Determining whether environmental factors affect a siege or area of conflict. Do holidays increase or decrease violence? How does the weather, specifically different weather patterns, affect the conflict?
- Looking for correlations between social media activity and more structured data, such as verified casualty or conflict event datasets. Using Natural Language Processing, can I find trends that correlate with the structured data, as well as trends that do not?
- How much do fake news and bot networks affect the war and Twitter? Can I detect potential bot networks and analyze their effect on the Syrian Civil War and the Twitter community?
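To make the correlation idea in the list above a little more concrete, here is a minimal sketch that computes a Pearson correlation between daily tweet counts and daily counts from a structured event dataset. The function name `pearson` and both lists of numbers are placeholders invented for illustration, not real data or a finished method; it only shows the kind of comparison such a project could start from.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # covariance and variances, without the shared 1/n factor (it cancels)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT real data).
tweets_per_day = [120, 340, 290, 510, 480]
events_per_day = [2, 5, 4, 9, 8]

print(round(pearson(tweets_per_day, events_per_day), 3))
```

A value near 1 or -1 would suggest the two series move together or in opposite directions; a value near 0 would suggest no linear relationship. Correlation alone would not say anything about cause.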
Gaining Access to Private Datasets
Part of being a successful researcher is realizing that not all information can be obtained on your own. So I have been reaching out for help as I develop my work. Everyone needs a buddy system in the data world, so don’t be afraid to ask for help. Using sites such as Reddit and Google Dataset Search (a cool new tool), and contacting people by email, is how I’m going to get this project done.
I have already had both successes and failures reaching out to non-profits and companies that hold information I was trying to obtain. One method I’ve used to find where datasets are located is reading academic papers and searching through their references. Another great resource is the footnotes of Wikipedia articles. After finding the organization that has the dataset I’m interested in, I search their website for contact information, specifically an email address.
Re: The example email
To Whom It May Concern,
My name is [Name], and I recently came across the [Name of Organization] website while exploring sources with data on the [Subject Topic]. I’m reaching out because I’m working on a research project that [Brief Description of Project].
I would like to request access to the data to support my research. I will credit the [Name of Organization] for any data I use and would be happy to present my final research to you. I would greatly appreciate any help the [Name of Organization] can offer with this project.
Please let me know if you need any additional information from me. Thank you.
Here are some tips I’ve learned to help you approach these kinds of situations:
- Before you reach out, make sure you’ve done your research about the topic and the person you’re reaching out to
- Have a specific time frame in mind for when you need to receive whatever you are asking for
- Give a brief description about your motivations and what you’re trying to accomplish
- Be courteous and thank them for their time but also be up front about what you need
Scraping Twitter with Twint
When I was reaching out to companies to gain access to private datasets, I was told about a few web scraping techniques that could help me get some of the data I’m looking for. A really cool tool that was suggested is the Python module called twint. The module is a Twitter scraping tool that’s very simple, fast, and anonymous. Below is a brief overview taken from their GitHub page.
Some of the benefits of using Twint vs. the Twitter API:
- Can fetch almost all Tweets (Twitter API limits to last 3200 Tweets only)
- Fast initial setup
- Can be used anonymously and without Twitter sign up
- No rate limitations
Requirements:
- Python 3.6
To install from source with Git:
git clone https://github.com/twintproject/twint.git
cd twint
pip3 install -r requirements.txt
Or install with pip:
pip3 install twint
Here’s an example of a small but powerful script I wrote to scrape tweets in Chicago for the month of August. The script restricts the search to English-language tweets (the Lang option filters by language rather than translating) and saves the results to a csv file.
import twint

def main():
    c = twint.Config()
    # coordinates of Chicago, Illinois
    c.Geo = "41.881832, -87.623177"
    # get all tweets from August to September at the coordinates
    c.Since = "2018-8-1"
    c.Until = "2018-9-1"
    # language filter: English ("en")
    c.Lang = "en"
    # custom output format
    c.Store_csv = True
    # custom twitter fields to capture
    c.Custom_csv = ["id", "date", "user_id", "username", "tweet", "hashtags", "location"]
    # write to file
    c.Output = "twitter.csv"
    twint.run.Search(c)

if __name__ == '__main__':
    main()
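Once the scrape has produced a csv file, a natural first step is to see how many tweets were collected per day. The sketch below uses only the standard library. It assumes the file has a header row with a `date` column whose values start with `YYYY-MM-DD`, which I have not verified against Twint's actual output; the function name `tweets_per_day` is my own.

```python
import csv
import os
from collections import Counter


def tweets_per_day(csv_path):
    """Count rows per calendar day.

    Assumes a header row with a 'date' column whose values begin
    with YYYY-MM-DD (an assumption about the scraped file).
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # keep only the date part in case a time is appended
            day = row["date"].strip()[:10]
            counts[day] += 1
    return counts


if __name__ == "__main__":
    # "twitter.csv" is the output file name used by the script above
    if os.path.exists("twitter.csv"):
        for day, n in sorted(tweets_per_day("twitter.csv").items()):
            print(day, n)
```

Daily counts like these are also the shape of data needed to line tweets up against a day-by-day event dataset later on.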
It already feels great to be part of such a cool group of people. I’m very excited to see where the next few months take me. It is one thing to program for a job on a day-to-day basis, and something completely different to do it as a passion.