Big data and the 2016 elections
Analyzing reddit, twitter, global media, wikipedia, funding, expenses… and a touch of ML
Old media, social media, contributions, expenses — so much data you can use to understand what moves (and how to move) the elections. In these two videos we show how to query and visualize terabytes of data, and uncover patterns governing this process.
Previously I sat down with Bitnami’s Raj Rikhy, to show how to set up BigQuery and re:dash to visualize this data:
Perhaps my favorite query: let’s jump to the past. Taking today’s reddit supporters of Donald Trump, Bernie Sanders, and Hillary Clinton, where were they 4 years ago?
SELECT past_sub, FIRST(candidatesub) candidatesub, FIRST(authors) authors,
  ROUND(100*FIRST(authors)/FIRST(authors_present),2) percent, COUNT(*) subs
FROM (
  SELECT a.subreddit past_sub, candidatesub, COUNT(DISTINCT a.author) authors,
    RANK() OVER(PARTITION BY candidatesub ORDER BY authors DESC) rank,
    FIRST(authors_present) authors_present, COUNT(*) c
  FROM [fh-bigquery:reddit_comments.2012] a
  JOIN (
    SELECT author, subreddit candidatesub, COUNT(*) c,
      COUNT(DISTINCT author) OVER(PARTITION BY candidatesub) authors_present
    FROM [fh-bigquery:reddit_comments.2016_04]  -- assumed: a recent month of comments
    WHERE subreddit IN ('The_Donald', 'SandersForPresident', 'hillaryclinton')  -- assumed candidate subreddits
    GROUP BY 1, 2
  ) b
  ON a.author = b.author
  GROUP BY 1, 2
  HAVING authors > 15 AND c > 1
)
GROUP BY 1
ORDER BY 2, 4 DESC
That’s right: this query shows that today’s Bernie supporters were contributing to /r/space and /r/occupywallstreet back in 2012. Meanwhile, Trump supporters were contributing to /r/guns, /r/MensRights, and /r/ronpaul (see the chart for more). Reddit has fewer Hillary supporters, so their history is not as statistically significant, but some of the 2012 subreddits related to her followers today are /r/lgbt, /r/relationship, a bunch of sports ones, and /r/EnoughPaulSpam. Interesting “data backed” facts.
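The `percent` column in the query above is simply the shared authors divided by the candidate subreddit’s total distinct authors. As a quick illustration of that arithmetic (the numbers below are made up, not actual query results), in Python:

```python
def overlap_percent(shared_authors, candidate_sub_authors):
    """Mirror of ROUND(100*authors/authors_present, 2) in the query:
    what share of a candidate subreddit's authors were active in a
    given 2012 subreddit."""
    return round(100 * shared_authors / candidate_sub_authors, 2)

# Hypothetical: 450 of 12,000 candidate-subreddit authors also
# commented in a given 2012 subreddit.
print(overlap_percent(450, 12000))  # → 3.75
```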
And if you are looking for a monster query: how about measuring all Bernie Sanders mentions across Wikipedia, reddit, and the worldwide media (GDELT)?
SELECT source, day, AVG(normalized) OVER(PARTITION BY source
    ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) normalized_mentions
FROM (
  SELECT source, day, mentions / max_mention normalized
  FROM (
    SELECT source, day, mentions, MAX(mentions) OVER(PARTITION BY source) max_mention
    FROM (
      SELECT "gdelt" source, day, INTEGER(mentions) mentions
      FROM ...  -- GDELT mentions table, elided in the original
      WHERE candidate = "bernie sanders"
    ), (
      SELECT "reddit" source, day, INTEGER(mentions) mentions
      FROM (
        SELECT TIMESTAMP(DATE(SEC_TO_TIMESTAMP(created_utc))) day, b.x,  -- day column assumed from created_utc
          SUM(LOWER(body) CONTAINS b.x) mentions,
          COUNT(DISTINCT IF(LOWER(body) CONTAINS b.x, author, null)) authors
        FROM (SELECT * FROM [fh-bigquery:reddit_comments.all_starting_201501] WHERE score>5) a
        CROSS JOIN (SELECT 'bernie sanders' x) b
        GROUP BY 1, 2
      )
    ), (
      SELECT "wiki" source, TIMESTAMP(DATE(datehour)) day, SUM(requests) mentions
      FROM TABLE_QUERY([fh-bigquery:wikipedia],
        "table_id contains 'pagecounts_' and INTEGER(RIGHT(table_id, 6)) > 201501")
      WHERE title IN ('Bernie_Sanders')
      GROUP BY day
    )
  )
)
ORDER BY day, source
Beware before running this query — it goes over 5TB of data!
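The interesting trick in that query is the outermost two steps: each source’s daily counts are scaled by that source’s peak (so Wikipedia pageviews and reddit comments land on the same 0–1 scale), then smoothed with a 7-day trailing average (`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`). A small Python sketch of that window logic, using made-up daily counts:

```python
def normalized_moving_avg(mentions, window=7):
    """Scale a series by its maximum, then take a trailing moving average,
    mirroring MAX(mentions) OVER(PARTITION BY source) followed by
    AVG(normalized) OVER(... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)."""
    peak = max(mentions)
    normalized = [m / peak for m in mentions]
    out = []
    for i in range(len(normalized)):
        win = normalized[max(0, i - window + 1):i + 1]  # trailing window
        out.append(sum(win) / len(win))
    return out

daily = [10, 20, 40, 20, 10, 5, 5, 100]  # hypothetical mentions for one source
print(normalized_moving_avg(daily))
```

The same smoothing applied per `source` is what makes the three very differently sized signals comparable on one chart.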
Last query for this post: Which were the most efficient candidates in transforming donations into actual delegates during the primaries (by state)?
SELECT *, ROUND(a.amount/b.delegates,0) ratio_delegates, ROUND(a.amount/b.votes,0) ratio_votes
FROM (
  SELECT recipient, state,
    FIRST(CONCAT(REGEXP_EXTRACT(recipient, ', (.*)'), ' ', REGEXP_EXTRACT(recipient, '(.*),'))) candidate,
    COUNT(DISTINCT contribid) c,
    SUM(c) OVER(PARTITION BY recipient) c_tot,
    SUM(amount) amount  -- assumed: total dollars, referenced as a.amount above
  FROM [fh-bigquery:opensecrets.indivs16v2] a
  JOIN (SELECT cid, FIRST(party) party
        FROM [fh-bigquery:opensecrets.cands16]
        WHERE distidrunfor = 'PRES' GROUP BY 1) b
  ON a.recipid = b.cid  -- assumed join: recipient id to candidate id
  WHERE LEFT(realcode, 2) != 'Z9'
  GROUP BY 1, 2
) a
JOIN [fh-bigquery:opensecrets.primaries_results] b
ON a.candidate = b.candidate AND a.state = b.state  -- assumed join keys
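The efficiency ratios at the end are plain division: dollars raised in a state over delegates (or votes) won there, so lower is better. With made-up numbers for one hypothetical state:

```python
def efficiency(amount, delegates, votes):
    """Mirror of ROUND(a.amount/b.delegates, 0) and ROUND(a.amount/b.votes, 0):
    dollars per delegate and dollars per vote won. Lower means the campaign
    converted donations into results more efficiently."""
    return round(amount / delegates), round(amount / votes)

# Hypothetical: $2.5M raised, 30 delegates, 150,000 votes in some state.
print(efficiency(2_500_000, 30, 150_000))  # → (83333, 17)
```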
Watch the video to see more details, or join me in /r/bigquery to discuss!