Automatically Storing Tweets in BigQuery using Google’s Compute Engine and Twitter’s Search API

Why this post might interest you

If, like me, you belong to the common herd and ‘only’ have access to Twitter’s Standard Search API and want to start sending popular tweets into BigQuery for later analysis, then this post is for you.

As mentioned above, this post shows how to use the Search API, but I suspect that switching to the Streaming API would only require minor changes to the Python script presented later on.

What’s shown in this post

On your Google VM, a Python script gathers the last two days’ most popular tweets using tweepy, extracts the needed information from each tweet, and writes it into a newline-delimited JSON file so that BigQuery can process it. A bash script ties everything together: it runs the Python script every two days, sends the JSON file to a Cloud Storage bucket and loads it from there into BigQuery.

Python, tweepy, Search API, and json

import tweepy
from tweepy.auth import OAuthHandler
import json
import datetime

# Twitter API credentials
APP_KEY = 'your_app_key'
APP_SECRET = 'your_app_secret'
OAUTH_TOKEN = 'your_oauth_token'
OAUTH_TOKEN_SECRET = 'your_oauth_token_secret'

# Authenticate against the Twitter API
auth = OAuthHandler(APP_KEY, APP_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

# Search window: everything since two days ago
targetDate = datetime.datetime.today() - datetime.timedelta(days=2)
targetDate = targetDate.strftime('%Y-%m-%d')

# The 10 most popular English tweets about "Google Cloud" from the last two days
tweets = tweepy.Cursor(api.search, q='"Google Cloud"', lang='en',
                       result_type='popular', since=targetDate).items(10)

# Write one JSON object per line (newline-delimited JSON for BigQuery)
fileName = "tweets_" + datetime.datetime.today().strftime('%Y-%m-%d') + ".json"
with open(fileName, 'a') as f:
    for tweet in tweets:
        dic = {'texts': tweet.text, 'dates': str(tweet.created_at)}
        json.dump(dic, f)
        f.write('\n')

After importing and authenticating, the time frame is defined and the 10 most popular tweets of the last two days concerning ‘Google Cloud’ are gathered. A JSON file with the current date in its name is then created and the data is written into it, one JSON object per line.
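One line of the resulting file looks roughly like this (tweet text and date are placeholders):

{"texts": "Some popular tweet about Google Cloud", "dates": "2019-03-01 12:34:56"}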

Creating a BigQuery Table

On the Google Cloud Platform console, navigate to BigQuery and create a dataset and a table whose schema corresponds to the written JSON files, i.e. two STRING columns named texts and dates. The image below illustrates this.
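If you prefer to do this programmatically rather than through the console, the following is a minimal sketch using the google-cloud-bigquery client library; the project ID is a placeholder, while the dataset, table and column names match the bq load command used further below.

from google.cloud import bigquery

# Placeholder project ID; dataset/table names match the bash script below
client = bigquery.Client(project='your-project-id')
dataset = client.create_dataset('my_twitter_data', exists_ok=True)

# Schema with the two STRING columns the Python script writes
schema = [
    bigquery.SchemaField('texts', 'STRING'),
    bigquery.SchemaField('dates', 'STRING'),
]
table = bigquery.Table(dataset.table('texts_and_dates'), schema=schema)
client.create_table(table, exists_ok=True)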

The bash script

#!/bin/bash
while true; do
    # Today's date names the Cloud Storage "folder"
    today=$(date +"%Y-%m-%d")

    # Gather the tweets and write the newline-delimited JSON file
    python3 bq_tweepy.py

    # Copy the file to Cloud Storage ...
    gsutil cp *.json gs://your-bucket-name/tweets/$today/

    # ... and load it from there into BigQuery
    bq load --source_format=NEWLINE_DELIMITED_JSON my_twitter_data.texts_and_dates gs://your-bucket-name/tweets/$today/*.json texts:STRING,dates:STRING

    # Clean up locally and wait two days before the next run
    rm *.json
    sleep 2d
done

Through bash, the Python script is executed and a JSON file created. This JSON file is copied to Cloud Storage and from there loaded into BigQuery. More information about loading data into BigQuery can be found here. The script then sleeps for two days and starts the next iteration afterwards.
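If you would rather keep the loading step in Python instead of calling bq, a sketch with the google-cloud-bigquery client could look like this; the bucket name and date folder are placeholders mirroring the bash script.

from google.cloud import bigquery

client = bigquery.Client()

# Destination table and source format, matching the bq load command above
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

# Wildcard URI pointing at the day's folder in the bucket
uri = 'gs://your-bucket-name/tweets/2019-03-01/*.json'
load_job = client.load_table_from_uri(uri, 'my_twitter_data.texts_and_dates',
                                      job_config=job_config)
load_job.result()  # block until the load job has finished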

Executing it all on Google’s Compute Engine

Establish an SSH connection to your VM and make sure you can use pip/pip3 on your VM to install tweepy. With that out of the way, upload the bash and Python scripts, named here tweets2Days.sh and bq_tweepy.py, respectively, to your VM.

As can be seen above, the bash script contains an infinite loop. Simply typing source tweets2Days.sh executes the script, but execution stops once the SSH connection is lost. To solve this, one can use

nohup bash -c "source tweets2Days.sh" &

and you can then close your SSH window without interrupting the script’s execution. Alternatively, you can also use screen as described here.

To stop the infinite loop, type ps -ef, identify the respective process ID and kill it with kill <PID>.

And that’s basically it. In practice, one would of course gather more information than just the tweet text and its creation date.
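As an illustration, the dictionary in the Python script could be extended with a few more attributes that tweepy’s Status objects expose; the field names below are just suggestions, and the BigQuery schema would have to be widened accordingly.

# Extended dictionary written per tweet; the extra fields are attributes
# exposed by tweepy's Status objects, the keys are freely chosen.
dic = {
    'texts': tweet.text,
    'dates': str(tweet.created_at),
    'ids': tweet.id_str,                  # tweet ID as a string
    'users': tweet.user.screen_name,      # author's screen name
    'retweets': tweet.retweet_count,      # number of retweets
    'favorites': tweet.favorite_count,    # number of likes
}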

Some Pics to show what happens

In Cloud Storage:

The newline-delimited JSON file:

The json file in BigQuery:

Have fun collecting and analyzing tweets!