Big data and the elections 2016

Analyzing reddit, twitter, global media, wikipedia, funding, expenses… and a touch of ML

Old media, social media, contributions, expenses — so much data you can use to understand what moves (and how to move) the elections. In these two videos we show how to query and visualize terabytes of data, and uncover patterns governing this process.

At Google I/O we did that, using data generated and shared by GDELT, reddit, Twitter, Wikipedia, the FEC, and OpenSecrets.org:

Previously I sat down with Bitnami’s Raj Rikhy, to show how to setup BigQuery and re:dash to visualize this data:

Perhaps my favorite query: Let’s jump to the past, taking today’s reddit supporters of Donald Trump, Bernie Sanders, and Hillary Clinton — Where were they 4 years ago?

SELECT past_sub, FIRST(candidatesub) candidatesub, FIRST(authors) authors, ROUND(100*FIRST(authors)/FIRST(authors_present),2) percent, COUNT(*) subs
FROM (
SELECT a.subreddit past_sub, candidatesub, COUNT(DISTINCT a.author) authors, RANK() OVER(PARTITION BY candidatesub ORDER BY authors DESC) rank, FIRST(authors_present) authors_present, COUNT(*) c
FROM [fh-bigquery:reddit_comments.2012] a
JOIN (
SELECT author, subreddit candidatesub, COUNT(*) c, COUNT(DISTINCT author) OVER(PARTITION BY candidatesub) authors_present
FROM [fh-bigquery:reddit_comments.2016_03]
WHERE score>5
AND subreddit IN (
'SandersForPresident',
'The_Donald',
'hillaryclinton'
)
GROUP BY 1,2
HAVING c>2
) b
ON a.author=b.author
WHERE a.score>3
GROUP BY 1,2
HAVING authors>15 AND c>1
)
WHERE rank<=60
GROUP BY 1
HAVING subs=1
ORDER by 2, 4 DESC

That’s right: This query shows that today’s Bernie supporters where contributing to /r/space and r/occupywallstreet back in 2012. Meanwhile Trump supporters where contributing to /r/guns, /r/MensRights, and /r/ronpaul (see chart for more). Reddit has less of Hillary supporters, so their history is not as statistically significant, but some of the 2012 reddits related to her followers today are /r/lgbt, /r/relationship, a bunch of sports ones, and /r/EnoughPaulSpam. Interesting “data backed” facts.

And if you are looking for a monster query: How about measuring all Bernie Sander metrics across Wikipedia, reddit, and the worldwide media (GDELT):

SELECT source, day, AVG(normalized) OVER(PARTITION BY source
ORDER BY DAY ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as normalized_mentions FROM (
SELECT source, day, mentions / max_mention as normalized FROM (
SELECT source, day, mentions, MAX(mentions) OVER(PARTITION BY source) as max_mention FROM (
SELECT “gdelt” as source, day, INTEGER(mentions) as mentions,
FROM [cloude-sandbox:tigani_views.candidate_first_mentions_by_day]
where candidate = “bernie sanders”
), (
SELECT “reddit” as source, day, INTEGER(mentions) as mentions, FROM (
SELECT
TIMESTAMP(DATE(SEC_TO_TIMESTAMP(created_utc))) day,
b.x,
SUM(LOWER(body) CONTAINS b.x) mentions,
COUNT(DISTINCT IF(LOWER(body) CONTAINS b.x, author, null)) authors,
FROM
(SELECT * FROM [fh-bigquery:reddit_comments.all_starting_201501] WHERE score>5) a,
CROSS JOIN (SELECT ‘bernie sanders’ x) b
GROUP BY 1, 2
)), (
SELECT “wiki” as source, TIMESTAMP(DATE(datehour)) as day, SUM(requests) as mentions
FROM TABLE_QUERY([fh-bigquery:wikipedia],
“table_id contains ‘pagecounts_’ and INTEGER(RIGHT(table_id, 6)) > 201501”)
WHERE title IN (‘Bernie_Sanders’)
AND language=’en’
GROUP BY day
)
))
ORDER BY day, source

Beware before running this query — it goes over 5TB of data!

Last query for this post: Which were the most efficient candidates in transforming donations into actual delegates during the primaries (by state)?

SELECT *, ROUND(a.amount/b.delegates,0) ratio_delegates, ROUND(a.amount/b.votes,0) ratio_votes FROM (
SELECT recipient, state,
FIRST(REGEXP_EXTRACT(recipient, ‘, (.*)’) + ‘ ‘ + REGEXP_EXTRACT(recipient, ‘(.*),’)) candidate,
SUM(amount) amount,
SUM(IF(b.party=’R’,-1,1)*amount) amount_n,
COUNT(DISTINCT contribid ) c,
SUM(c) OVER(PARTITION BY recipient) c_tot,
FROM [fh-bigquery:opensecrets.indivs16v2] a
JOIN (SELECT cid, FIRST(party) party FROM [fh-bigquery:opensecrets.cands16] WHERE distidrunfor = ‘PRES’ GROUP BY 1) b
ON a.recipid=b.cid
WHERE LEFT(realcode, 2)!=’Z9'
GROUP BY 1,2
HAVING c>10
) a
JOIN [fh-bigquery:opensecrets.primaries_results] b
ON a.candidate=b.candidate
AND a.state=b.state_abbrev

Watch the video to see more details, or join me in /r/bigquery to discuss!

@FelipeHoffa