Comparing tweets about Trump & Hillary with natural language processing
How are people talking about Hillary and Trump on Twitter? Using the same approach I took here to parse tweets about the Olympics, I crunched some data on 300k tweets from September 9th and 10th about both candidates. Here's what I came up with:
Top adjectives used in Tweets with Hillary or Trump as the subject
Top verbs used in Tweets with Hillary or Trump as the subject
Top emojis used in election tweets
Which tools did I use?
- Twitter Streaming API: get all the election tweets
- Cloud Natural Language API: parse the tweets & get syntactic data
- BigQuery: analyze the tweet syntax data
- Tableau and some JavaScript hacks: visualize the data
Twitter Streaming API: get all election tweets
I streamed tweets mentioning Hillary or Trump using the Twitter Streaming API with Node.js. You can see the search terms I tracked in the first line of code:
var search_terms = '#Trump2016,#ImWithHer,@HillaryClinton,@realdonaldtrump,#NeverTrump,#MakeAmericaGreatAgain,Hillary Clinton,Donald Trump';

client.stream('statuses/filter', {track: search_terms}, function(stream) {
  stream.on('data', function(tweet) {
    // Skip retweets, then send the tweet off for syntax analysis
    if (tweet.text.substring(0, 2) != 'RT') {
      callNLApi(tweet);
    }
  });
  stream.on('error', function(error) {
    console.log(error);
  });
});
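For context, client here is an authenticated Twitter client; with the twitter npm package (which matches the stream API used above), setting one up looks like this:

var Twitter = require('twitter');

// Placeholder credentials: use your own app's keys and access tokens
var client = new Twitter({
  consumer_key: 'YOUR_CONSUMER_KEY',
  consumer_secret: 'YOUR_CONSUMER_SECRET',
  access_token_key: 'YOUR_ACCESS_TOKEN_KEY',
  access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
});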
Once I got a tweet (excluding those starting with “RT”), I sent it to the Natural Language API for syntax analysis.
Cloud Natural Language API: parse the tweets
The new Cloud Natural Language API has three methods: syntax annotation, entity analysis, and sentiment analysis. Here I'll focus on syntax annotation, but you can check out this post for details on the other two. The syntax annotation response gives you details about the structure of the sentence and the part of speech of each word. Tweets are often missing punctuation and aren't always grammatically correct, but the NL API can still parse them and extract syntax data. For example, here's one of the ~300k tweets I streamed:
Donald Trump is the lone holdout as VP nominee Mike Pence releases his tax returns http://bit.ly/2c8ZyzP — Newsweek
And here’s a visualization of the syntactic data returned from the API for that tweet (you can create your own here):
The API’s JSON response gives you all the data visualized in the dependency parse tree above. It returns an object for each token in the sentence (a token is a word or punctuation). Here’s a sample of the JSON response for one token from the example above, in this case the word ‘releases’:
{
  "text": {
    "content": "releases",
    "beginOffset": -1
  },
  "partOfSpeech": {
    "tag": "VERB"
  },
  "dependencyEdge": {
    "headTokenIndex": 2,
    "label": "ADVCL"
  },
  "lemma": "release"
}
Let's break down the response:
- tag tells us that 'releases' is a verb.
- label tells us the role of the word in this context. Here it's ADVCL, which stands for adverbial clause modifier.
- headTokenIndex is the index of this token's head in the dependency parse tree, i.e. the token its dependency arc attaches to. Here that's token 2, the verb 'is'.
- lemma is the root form of the word, which is useful if you're counting occurrences of a word and want to consolidate duplicates (notice that the lemma of "releases" is "release").
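Once you have a response's tokens array, those fields are easy to work with. For example, here's a quick sketch that pulls the root form of every verb out of a tweet:

function verbLemmas(tokens) {
  // tokens is the 'tokens' array from the NL API's JSON response
  return tokens.filter(function(token) {
    return token.partOfSpeech.tag == 'VERB';
  }).map(function(token) {
    return token.lemma;  // e.g. 'releases' -> 'release'
  });
}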
Here’s what my request to the NL API looks like:
function callNLApi(tweetData) {
  var requestUrl = "https://language.googleapis.com/v1beta1/documents:annotateText?key=API_KEY";
  var requestBody = {
    "document": {
      "type": "PLAIN_TEXT",
      "content": tweetData.text
    },
    // The features object tells annotateText which analyses to run;
    // here we only need syntax
    "features": {
      "extractSyntax": true
    }
  };
  var options = {
    url: requestUrl,
    method: "POST",
    body: requestBody,
    json: true
  };
  // Uses the 'request' npm package: var request = require('request');
  request(options, function(err, resp, body) {
    if (!err && resp.statusCode == 200) {
      var tokens = body.tokens;
      // Do something with the tokens
    }
  });
}
Now that I have all of the syntax data as JSON, there are endless ways to analyze it. Instead of doing the analysis as tweets came in, I decided to insert every tweet into a BigQuery table and figure out how to analyze it later.
BigQuery: analyze linguistic trends in tweets
I created a BigQuery table of all tweets, and then ran some SQL queries to find linguistic trends. Here’s the schema for my BigQuery table:
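It only has a handful of columns, mirroring the row objects in the insert code below; in BigQuery's JSON schema notation, it looks roughly like this:

[
  {"name": "id", "type": "STRING"},
  {"name": "text", "type": "STRING"},
  {"name": "created_at", "type": "TIMESTAMP"},
  {"name": "user_followers_count", "type": "INTEGER"},
  {"name": "hashtags", "type": "STRING"},
  {"name": "tokens", "type": "STRING"}
]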
I inserted each tweet into my table using the google-cloud npm package with just a few lines of JavaScript:
// 'table' is a BigQuery Table object from the google-cloud npm package
var row = {
  id: tweet.id,
  text: tweet.text,
  created_at: tweet.created_at,
  // Twitter's user object calls this field followers_count
  user_followers_count: tweet.user.followers_count,
  // Hashtags live under tweet.entities in the Streaming API payload
  hashtags: JSON.stringify(tweet.entities.hashtags),
  tokens: JSON.stringify(body.tokens)
};

table.insert(row, function(error, insertErr, apiResp) {
  if (error) {
    console.log('err', error);
  } else if (insertErr.length == 0) {
    console.log('success!');
  }
});
Now it’s time to analyze the data! The tokens column in my table is a giant JSON string. Luckily BigQuery supports user-defined functions (UDFs), which let you write JavaScript functions to parse data in your table.
To identify adjectives, I looked for all tokens returned by the NL API with ADJ as their partOfSpeech tag. But I didn't want all adjectives from all the tweets I collected; I only wanted adjectives from tweets where Hillary or Trump was the subject of the sentence. The NL API makes it easy to filter for tweets that fit this criterion using the NSUBJ (nominal subject) label. Here's the finished query (with the UDF inline): it counts adjectives from all tweets with Hillary or Trump as the nominal subject.
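A simplified sketch of that query, written as standard SQL with the JavaScript UDF inline (my_dataset.tweets stands in for the real table name, and the subject check is pared down):

CREATE TEMPORARY FUNCTION subjectAdjectives(tokens STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  var parsed = JSON.parse(tokens);
  // Keep only tweets where Hillary or Trump is the nominal subject (NSUBJ)
  var candidateIsSubject = parsed.some(function(t) {
    return t.dependencyEdge.label == 'NSUBJ' &&
        /hillary|clinton|donald|trump/i.test(t.text.content);
  });
  if (!candidateIsSubject) {
    return [];
  }
  // Return the lemma of every adjective in the tweet
  return parsed.filter(function(t) {
    return t.partOfSpeech.tag == 'ADJ';
  }).map(function(t) {
    return t.lemma.toLowerCase();
  });
""";

SELECT adjective, COUNT(*) AS tweets
FROM `my_dataset.tweets`, UNNEST(subjectAdjectives(tokens)) AS adjective
GROUP BY adjective
ORDER BY tweets DESC
LIMIT 20;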
To count emojis, I modified my UDF to look for all tokens with a partOfSpeech tag of X (the catch-all tag covering foreign words, typos, and symbols like emoji), and used a regex to extract all the emoji characters (thanks Mathias for your emoji regex!). Here's the query:
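Sketched the same way (the emoji regex here is a heavily simplified stand-in for Mathias' much more thorough one):

CREATE TEMPORARY FUNCTION extractEmoji(tokens STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  // Simplified emoji match: surrogate pairs plus the misc-symbols range
  var emojiRegex = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u2600-\u27BF]/g;
  var emoji = [];
  JSON.parse(tokens).forEach(function(t) {
    if (t.partOfSpeech.tag == 'X') {
      var matches = t.text.content.match(emojiRegex);
      if (matches) {
        emoji = emoji.concat(matches);
      }
    }
  });
  return emoji;
""";

SELECT emoji, COUNT(*) AS occurrences
FROM `my_dataset.tweets`, UNNEST(extractEmoji(tokens)) AS emoji
GROUP BY emoji
ORDER BY occurrences DESC;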
The output is a table of emoji ranked by how often they appear. This data is more fun when viewed as an emoji tag cloud; see the next section for details on how I did that.
Visualizing the data
One of my favorite things about BigQuery is its integrations with data visualization tools like Tableau, Data Studio, and Apache Zeppelin. I connected my BigQuery table to Tableau to create the bar graphs shown above. Tableau lets you create all sorts of graphs depending on the type of data you're working with. Here's a pie chart showing the top 10 hashtags in the tweets I collected (lowercased to eliminate duplicates).
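The hashtags column is a JSON string too, so the top-10 list comes from a similar query (again sketched with a stand-in table name):

CREATE TEMPORARY FUNCTION extractHashtags(hashtags STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
  // Each hashtag entity from the Twitter API has a 'text' field;
  // lowercase it so #MAGA and #maga count as one hashtag
  return JSON.parse(hashtags).map(function(h) {
    return h.text.toLowerCase();
  });
""";

SELECT hashtag, COUNT(*) AS tweets
FROM `my_dataset.tweets`, UNNEST(extractHashtags(hashtags)) AS hashtag
GROUP BY hashtag
ORDER BY tweets DESC
LIMIT 10;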
To make the emoji tag cloud, I downloaded the JSON results from my emoji query and then used this handy JavaScript library for generating word clouds.
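As a rough sketch, with a browser word cloud library like wordcloud2.js (the emoji counts below are made-up placeholders), rendering comes down to a single call:

// Assumes wordcloud2.js is loaded on the page, exposing the global WordCloud
// Each entry is [text, weight]; these counts are made-up placeholders
var emojiCounts = [
  ['😂', 120],
  ['🇺🇸', 85],
  ['🔥', 40]
];

// Draw the cloud into an existing <canvas id="emoji-cloud"> element
WordCloud(document.getElementById('emoji-cloud'), {list: emojiCounts});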
What’s next?
- Get started with the Natural Language API: try it out in the browser, dive into the docs, or check out one of these blog posts for more info
- Get started with BigQuery: follow the Web UI quickstart or check out any of Felipe Hoffa’s Medium posts
Have questions? Find me on Twitter @SRobTweets or let me know what you think in the comments.