Collecting Twitter Data Using R
I’m currently working on a text analysis project, and I wrote a simple R script that collects Twitter data through Twitter’s API. I also created a cron job that runs the script automatically every hour.
I’m pretty much a beginner in using R, Terminal shell, and the Twitter API, so if you have any suggestions on how to make this collection process better, please let me know.
Tools Used
- RStudio (IDE for R)
- Terminal (Mac OS X’s Bash Shell)
- Twitter’s Search API
Step 1: Create a Twitter Application
Follow this tutorial on how to create a Twitter application and generate keys. You will need these four keys and tokens from your Twitter application (they are unique to each user, so keep them secret):
- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret
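One way to keep these four values out of the script itself is to read them from environment variables. This is a minimal sketch, not part of the original script: the variable names and the read_twitter_keys() helper are made up for illustration, and the placeholder values are set inline only so the example runs on its own. In practice the values could live in a ~/.Renviron file, which R normally reads at startup.

```r
# Hypothetical environment variable names; use whatever names you prefer.
key_vars <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
              "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Hypothetical helper: return the four values, or stop with a clear
# message that names any variable that is not set.
read_twitter_keys <- function(vars = key_vars) {
  keys <- Sys.getenv(vars, unset = NA)
  not_set <- vars[is.na(keys)]
  if (length(not_set) > 0) {
    stop("Missing environment variables: ", paste(not_set, collapse = ", "))
  }
  keys
}

# Placeholder values so this example is self-contained; real keys would be
# set outside the script and never committed or shared.
Sys.setenv(TWITTER_CONSUMER_KEY = "consumer-key",
           TWITTER_CONSUMER_SECRET = "consumer-secret",
           TWITTER_ACCESS_TOKEN = "access-token",
           TWITTER_ACCESS_SECRET = "access-secret")

keys <- read_twitter_keys()

# The values could then be passed to twitteR in the usual order:
# setup_twitter_oauth(keys[[1]], keys[[2]], keys[[3]], keys[[4]])
```

The helper fails early with the name of the missing variable, which is easier to debug than an authentication error from the API.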
Step 2: Run R Script
Running this script manually in RStudio works perfectly fine, so if you only want to collect data once, you can stop here. If you want to collect data repeatedly, follow the next step on how to run a cron job.
#!/usr/local/bin/Rscript
setwd("/Users/ahipolito94/Capstone_2/Data")

library(twitteR)
setup_twitter_oauth("consumer-key", "consumer-secret",
                    "access-token", "access-secret")

# Keywords and hashtags to search for
terms <- c("iphonex", "iPhonex", "iphoneX", "iPhoneX", "iphone10", "iPhone10",
           "iphone x", "iPhone x", "iphone X", "iPhone X", "iphone 10", "iPhone 10",
           "#iphonex", "#iPhonex", "#iphoneX", "#iPhoneX", "#iphone10", "#iPhone10")
terms_search <- paste(terms, collapse = " OR ")

# Search for up to 1000 English-language tweets and convert them to a data frame
iphonex <- searchTwitter(terms_search, n=1000, lang="en")
iphonex <- twListToDF(iphonex)

# Append the results to the CSV file
write.table(iphonex, "/Users/ahipolito94/Capstone_2/Data/iphonex.csv", append=T, row.names=F, col.names=T, sep=",")
Here’s a line by line explanation:
#!/usr/local/bin/Rscript
Tells the shell to run the script with Rscript when the file is executed directly. To find where Rscript is installed on your system, type which Rscript in Terminal.
setwd("/Users/ahipolito94/Capstone_2/Data")
Sets the working directory, which is where R resolves relative file paths. The script saves its CSV file in this folder.
terms <- c("iphonex", ... "#iPhone10")
A character vector holding the keywords and hashtags you want to search for.
terms_search <- paste(terms, collapse = " OR ")
Joins the terms into a single string with OR between each term. This is the syntax used in searchTwitter() for multiple search terms.
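To see what the collapsed query string looks like, here is a small sketch with a shortened, made-up term list. The quote_phrase() helper is a hypothetical addition that is not in the original script: it wraps multi-word terms in double quotes, on the assumption that Twitter's search syntax treats a double-quoted string as an exact phrase.

```r
# A shortened, hypothetical term list (the real script uses many more).
terms <- c("iphonex", "#iphonex", "iphone x")

# Hypothetical helper: wrap terms that contain a space in double quotes so
# they stay together as a phrase; single-word terms are left unchanged.
quote_phrase <- function(term) {
  ifelse(grepl(" ", term, fixed = TRUE), paste0('"', term, '"'), term)
}

# Join everything into one query string with OR between the terms.
terms_search <- paste(quote_phrase(terms), collapse = " OR ")
cat(terms_search, "\n")
# prints: iphonex OR #iphonex OR "iphone x"
```

Without the quoting step, paste(terms, collapse = " OR ") alone gives iphonex OR #iphonex OR iphone x, which is what the original script sends.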
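iphonex <- searchTwitter(terms_search, n=1000, lang="en")
Uses the twitteR function searchTwitter() to request up to 1000 tweets in English. As far as I know, the 3,200-tweet cap that is often quoted applies to user timelines rather than to search. The standard Search API is limited by rate limits instead and only covers roughly the last week of tweets, so a search can return fewer tweets than n.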
iphonex <- twListToDF(iphonex)
Uses the twitteR function twListToDF() to convert the list of tweets into a data frame.
write.table(iphonex, "/Users/ahipolito94/Capstone_2/Data/iphonex.csv", append=T, row.names=F, col.names=T, sep=",")
Saves the data frame to a CSV file in your working directory. append=T tells R to add rows to the end of the file instead of overwriting the existing data.
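With append=T and col.names=T together, write.table() writes the column names again on every run, so the file picks up an extra header row each hour. One possible way around this is to write the header only when the file does not exist yet. The sketch below uses a hypothetical append_csv() helper, a made-up data frame, and a temporary file in place of the real tweets and path.

```r
# Hypothetical helper: append a data frame to a CSV file, writing the
# header row only when the file is being created.
append_csv <- function(df, path) {
  is_new <- !file.exists(path)
  write.table(df, path, append = !is_new, row.names = FALSE,
              col.names = is_new, sep = ",")
  invisible(path)
}

# Made-up rows standing in for the output of twListToDF().
demo <- data.frame(id = c("1", "2"),
                   text = c("first tweet", "second, with a comma"),
                   stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")  # stand-in for the real CSV path
append_csv(demo, csv_path)  # first run: header plus 2 rows
append_csv(demo, csv_path)  # second run: 2 more rows, no second header

combined <- read.csv(csv_path, stringsAsFactors = FALSE)
nrow(combined)
```

Because write.table() quotes character columns by default, tweet text that contains commas still lands in a single column when the file is read back.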
Step 3: Run Cron Job
A cron job schedules a command or script to run automatically at a specified time and date. If you’re using Mac OS X (or Linux), you can follow this step to schedule cron jobs. If you’re using Windows, I believe the equivalent of a cron job is a Scheduled Task.
I followed this guide on how to run a cron job.
- Open a Terminal window.
- Make the script executable: type
chmod u+x /Users/ahipolito94/Capstone_2/Data/Get_Data.R
and press enter. Just replace my filename with your filename.
- Add a new cron job to your crontab: type
crontab -e
and press enter. This opens the vi editor. Personally, I found vi hard to navigate, so I switched to nano. To use nano instead, type
export EDITOR=nano
and press enter before running crontab -e.
- Create the cron command: type
0 * * * * /Users/ahipolito94/Capstone_2/Data/Get_Data.R
and save it. To save it in nano, hit ctrl+x, then y, then press enter. In this example, the cron job runs at minute 0 of every hour (ex: 12:00pm, 1:00pm, etc). To change the frequency of the job, change the five schedule fields (minute, hour, day of month, month, day of week). Follow this guide for more information on how to do this.
- Check that the cron job is installed: type
crontab -l
and press enter. This lists the cron jobs that are currently scheduled, so it should return the cron command that you typed earlier, in my case
0 * * * * /Users/ahipolito94/Capstone_2/Data/Get_Data.R
- To stop the cron job, type
crontab -e
and add # before your cron command. So in my case,
#0 * * * * /Users/ahipolito94/Capstone_2/Data/Get_Data.R
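Cron runs the script in the background, so nothing is printed to a Terminal window when the job fires. One way to confirm that each hourly run actually happened is to have the R script append a timestamped line to a log file. This is a minimal sketch: log_run() is a hypothetical helper, and a temporary file stands in for wherever you would keep the log.

```r
# Hypothetical helper: append one timestamped line per run to a log file.
log_run <- function(n_rows, log_path) {
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  cat(sprintf("%s collected %d rows\n", stamp, n_rows),
      file = log_path, append = TRUE)
  invisible(log_path)
}

log_path <- tempfile(fileext = ".log")  # stand-in for a real log location

# In the collection script this would be called once, after write.table(),
# with nrow() of the tweets data frame. Two made-up runs are shown here.
log_run(1000L, log_path)
log_run(987L, log_path)

log_lines <- readLines(log_path)
print(log_lines)
```

If the log stops growing, the cron job is either disabled or the script is failing before it reaches the logging line.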
Step 4: Check Working Directory for the Data!
If everything ran smoothly, you should see data being added to the file in your working directory.
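After a few hourly runs the file may need a little cleaning. The script writes its column names on every run, and consecutive searches can return some of the same tweets. The sketch below shows one possible cleanup on made-up data: clean_tweets() is a hypothetical helper, and it assumes the data frame has the id column that twListToDF() produces.

```r
# Made-up rows standing in for the accumulated CSV contents. The third row
# imitates a header that was appended again by a later run.
raw <- data.frame(
  text = c("tweet a", "tweet b", "text", "tweet b", "tweet c"),
  id   = c("101", "102", "id", "102", "103"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop repeated header rows, then keep only the first
# copy of each tweet id.
clean_tweets <- function(df) {
  df <- df[!(df$id %in% "id"), , drop = FALSE]
  df <- df[!duplicated(df$id), , drop = FALSE]
  rownames(df) <- NULL
  df
}

cleaned <- clean_tweets(raw)
print(cleaned)
```

Deduplicating on id rather than on text keeps retweets and identical-looking tweets from different users as separate rows.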
Step 5: Read CSV File on RStudio
This script reads and views the CSV file on RStudio:
iphonex_csv <- read.csv("/Users/ahipolito94/Capstone_2/Data/iphonex.csv", header = TRUE, encoding = "UCS-2LE")
View(iphonex_csv)
Output on RStudio: