Analyzing the “FireMcMaster” Twitter Data

Michael Hunger
Neo4j Developer Blog
Aug 7, 2017

My interest was triggered by this tweet, which points out that the “#FireMcMaster” hashtag suddenly trended and that a network of bots was driving it up.

Update: The O’Reilly book “Graph Algorithms on Apache Spark and Neo4j” is now available as a free ebook download from neo4j.com.

Here is the quote from the New York Times Article: “Trump Defends McMaster as Conservatives Seek His Dismissal”

The #FireMcMaster hashtag was tweeted more than 50,000 times since Wednesday. Echoing the drumbeat were social media organs tied to the Russian government. According to the Alliance for Securing Democracy, a bipartisan group created to focus attention on Russian interference in the West, the top hashtag among 600 Twitter accounts linked to Russian influence operations at one point on Thursday was #FireMcMaster.

Data

Neo4j Database with 90k imported Tweets

Or use this:

Database + plugins + config neo4j 3.2.3 (60MB Zip)

Twitter import

I used the Python script from our Community Graph Initiative to import users and their tweets into a blank Neo4j Sandbox.

The import data is pulled from the Twitter Search API: 900 pages with 100 tweets each.

match (n) return labels(n), count(*)
"labels(n)" │"count(*)"
["Tweet","Content","Retweet"]│79689
["Tweet","Content","Reply"] │2522
["Tweet","Content"] │5933
["Tag"] │1102
["User"] │32885
["Tweet"] │1378
["Link"] │4009

Most of the tweets are retweets though, with only 6k being original content, issued by a total of 32k users.

Tweets per day

MATCH (t:Tweet) WHERE exists(t.created)
RETURN apoc.date.format(t.created,'s','yyyy-MM-dd') AS date, count(*) AS count
ORDER BY date ASC
"date" │"count"
"2017-07-27"│13
"2017-07-28"│10
"2017-07-29"│64
"2017-07-30"│26
"2017-07-31"│26
"2017-08-01"│14
"2017-08-02"│8754
"2017-08-03"│33031
"2017-08-04"│32874
"2017-08-05"│13030
"2017-08-06"│201

The volume rises sharply on August 2nd (8,754 tweets, up from 14 the day before), peaks at about 33k tweets per day on the 3rd and 4th, and falls off again from the 5th.
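For readers without APOC at hand, the same per-day bucketing can be sketched in plain Python. This is a minimal sketch, assuming `created` holds epoch seconds and that days are cut in UTC (the `'s'` unit in the query suggests the former; the latter is an assumption). The function name `tweets_per_day` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_day(created_seconds):
    """Count tweets per calendar day (UTC) from epoch-second timestamps."""
    days = (
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        for ts in created_seconds
    )
    # Sorting the date strings is chronological for the year-month-day format.
    return sorted(Counter(days).items())


# Hypothetical sample: two tweets on 2017-08-02 and one on 2017-08-03 (UTC).
sample = [1501632000, 1501675200, 1501718400]
for day, count in tweets_per_day(sample):
    print(day, count)
```

On the real data this would be fed with the `created` property of every `Tweet` node.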

Top tags, correlated tags

Top Tags

match (t:Tag)
return t.name, size( (t)<-[:TAGGED]-() ) as deg
order by deg desc limit 20
"t.name" │"deg"
"firemcmaster" │47581
"mcmasterfacts" │16114
"mcmaster" │3668
"maga" │2407
"muslimbrotherhood"│2085
"draintheswamp" │1582
"deepstate" │860
"firemueller" │653
"drainthesewer" │575
"trump" │544
"leakerstatus" │533
"trumptrain" │462
"trumprally" │421
"leaks" │388
"firemcmasters" │345
"traitor" │343
"americafirst" │343
"thursdaythoughts" │329
"susanrice" │288
"leaker" │286

Most frequently correlated Tags

MATCH (t:Tag) WHERE toLower(t.name) = "firemcmaster"
MATCH (t)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE t2 <> t
RETURN t2.name, count(*) AS freq
ORDER BY freq DESC LIMIT 20
"t2.name" │"freq"
"mcmaster" │2407
"maga" │2087
"muslimbrotherhood" │2030
"draintheswamp" │1216
"mcmasterfacts" │836
"firemueller" │591
"deepstate" │575
"drainthesewer" │572
"leakerstatus" │531
"trumprally" │418
"trumptrain" │391
"leaks" │376
"trump" │346
"traitor" │320
"americafirst" │309
"leaker" │283
"mcleaker" │275
"iftwitterdidntexist"│234
"rednationrising" │217
"susanrice" │208
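The co-occurrence count above boils down to: for every tweet that carries the target tag, count each other tag on that tweet once. A minimal Python sketch of that logic, assuming each tweet is available as a plain list of hashtags (the function name `co_occurring_tags` and the sample tweets are hypothetical, not part of the import script):

```python
from collections import Counter


def co_occurring_tags(tweets, target):
    """Count how often other tags appear on tweets that carry `target`.

    `tweets` is an iterable of tag collections, one per tweet.
    """
    target = target.lower()
    freq = Counter()
    for tags in tweets:
        normalized = {tag.lower() for tag in tags}
        if target in normalized:
            # Every other tag on the same tweet counts once, like t2 <> t.
            freq.update(normalized - {target})
    return freq.most_common()


# Hypothetical sample tweets, each given as its list of hashtags.
sample = [
    ["FireMcMaster", "maga"],
    ["firemcmaster", "maga", "deepstate"],
    ["mcmaster"],
]
print(co_occurring_tags(sample, "firemcmaster"))
```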

Tags in tweets that replied to or retweeted tweets with this tag

MATCH (t:Tag) WHERE toLower(t.name) = "firemcmaster"
MATCH (t)<-[:TAGGED]-()<-[:REPLIED_TO|RETWEETED]-()-[:TAGGED]->(t2:Tag)
WHERE t2 <> t
RETURN t2.name, count(*) AS freq
ORDER BY freq DESC LIMIT 20
"t2.name" │"freq"
"mcmaster" │2347
"muslimbrotherhood" │2021
"maga" │1807
"draintheswamp" │1053
"mcmasterfacts" │754
"leakerstatus" │530
"drainthesewer" │529
"deepstate" │448
"trumprally" │407
"trump" │384
"leaks" │367
"firemueller" │338
"trumptrain" │337
"traitor" │286
"thursdaythoughts" │281
"americafirst" │276
"mcleaker" │274
"leaker" │266
"iftwitterdidntexist"│233
"rednationrising" │215

Top-Mentions

MATCH (t:Tag) WHERE toLower(t.name) = "firemcmaster"
MATCH (t)<-[:TAGGED]-()-[:MENTIONED]->(u:User)
RETURN u.screen_name, u.name, count(*) as freq
ORDER BY freq DESC LIMIT 20
"u.screen_name" │"u.name" │"freq"
"realDonaldTrump"│"Donald J. Trump" │7853
"POTUS" │"President Trump" │6847
"stranahan" │"Lee Stranahan" │3778
"NatashaBertrand"│"Natasha Bertrand" │3686
"DarrenKaplan" │"Darren Kaplan" │3685
"WayneDupreeShow"│"Wayne Dupree" │2299
"PeeSparkle" │"PSparkleMAGA" │2088
"LVNancy" │"ɳαɳ૮ყ" │1991
"PrisonPlanet" │"Paul Joseph Watson" │1789
"RedNationRising"│"Red Nation Rising" │1593
"Cernovich" │"Mike Cernovich " │1519
"pnehlen" │"Paul Nehlen" │1474
"StefanMolyneux" │"Stefan Molyneux" │1410
"TrumpTrain45Pac"│"Patriot 24/7" │1143
"alozrasT" │"Amy T " │992
"Pamela_Moore13" │"Pamela Moore" │936
"StockMonsterUSA"│"STOCK MONSTER" │817
"ReaganBattalion"│"The Reagan Battalion"│787
"_Makada_" │"Makada " │725
"ThePatriot143" │" Cris " │575

Most Retweeted

MATCH (u:User)-[:POSTED]->(t:Tweet)<-[:RETWEETED]-(o:Tweet)
RETURN u.screen_name, u.name, count(distinct t) AS tweets, count(*) AS freq
ORDER BY freq DESC LIMIT 10
"u.screen_name" │"u.name" │"tweets"│"freq"
"Cernovich" │"Mike Cernovich " │21 │8792
"StefanMolyneux" │"Stefan Molyneux" │6 │3895
"DarrenKaplan" │"Darren Kaplan" │1 │3737
"stranahan" │"Lee Stranahan" │17 │3721
"passionatechica"│"ᗷᗩᔕᗴᗪ ᑭᖇᎥᔕᑕᎥᒪᒪᗩ" │3 │3683
"AmyMek" │"Withheld account" │2 │3117
"LVNancy" │"ɳαɳ૮ყ" │5 │2882
"JackPosobiec" │"Jack Posobiec " │1 │2711
"WayneDupreeShow"│"Wayne Dupree" │7 │2607
"StockMonsterUSA"│"STOCK MONSTER" │3 │2572

Interestingly, Darren Kaplan shows up here. Why? Because his tweet about the “hashtag driving bot net” got retweeted and favorited so often.

Most Active Accounts

match (n:User)
return n.screen_name, n.name, n.location, size( (n)-[:POSTED]->() ) as activity
order by activity desc limit 20
"n.screen_name" │"n.name" │"n.location" │"activity"
"IRISHHEAVYT" │"HEAVY T " │"" │179
"thatgirlsandra5"│"Trump 2020 " │"Florida, USA" │157
"TheGoodGuy2017b"│"Anti-Globalist" │"" │155
"BoycottHRC" │"John Durrant" │"" │133
"MarieMa49685063"│"I ️Winning " │"TrumpTrain, USA"│126
"unablogger" │"Una Blogger" │"Los Angeles" │120
"MagaNavajo" │"MAGA " │"Main St. USA" │118
"GeneralDefense" │"General Defense " │"United States" │105
"uliw315" │"Anonymous Source" │"" │100
"clint4usa" │"BringBackFlynn" │"" │98

If we look at the distribution of activity across users, i.e. how many tweets a user posted, retweeted, or replied to, we see the common power-law distribution.

MATCH (u:User)
RETURN size((u)-->(:Tweet)) as activity, count(*)
ORDER BY activity ASC LIMIT 100

We can see that most users (31k) had fewer than 10 interactions, and only 1400 had 10 or more.

MATCH (u:User)
RETURN size((u)-->(:Tweet)) < 10 as lessThan10, count(*)
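The two queries above first build a histogram (activity level to number of users) and then split it at 10. A minimal Python sketch of both steps, assuming the data is available as a mapping from user to the tweets they touched; the names `activity_distribution` and `split_at` and the sample data are made up for illustration:

```python
from collections import Counter


def activity_distribution(tweets_by_user):
    """Map each activity level to the number of users at that level."""
    return sorted(Counter(len(t) for t in tweets_by_user.values()).items())


def split_at(distribution, threshold=10):
    """Return (users below threshold, users at or above it)."""
    below = sum(n for activity, n in distribution if activity < threshold)
    total = sum(n for _, n in distribution)
    return below, total - below


# Hypothetical sample: user -> list of tweet ids they posted.
sample = {
    "a": [1],
    "b": [2],
    "c": [3, 4],
    "d": list(range(100, 112)),  # 12 tweets
}
dist = activity_distribution(sample)
print(dist)            # [(1, 2), (2, 1), (12, 1)]
print(split_at(dist))  # (3, 1)
```

In a power-law distribution the low activity levels hold the vast majority of users, which is the shape the 31k versus 1400 split reflects.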

Something we don’t have is the follower relationships between these accounts, or any of their other interactions outside of the tags we searched for. We could pull all tweets of these accounts and start to look into this, but I leave that for a later time.

Algorithms — Centrality / PageRank

call algo.pageRank('MATCH (u:User) return id(u) as id',
'MATCH (u:User)-[:POSTED]->()<-[:RETWEETED|REPLIED_TO]-()<-[:POSTED]-(u2:User) return id(u) as source,id(u2) as target',
{graph:'cypher'});
MATCH (u:User)
RETURN u.name, u.screen_name, u.pagerank
ORDER BY u.pagerank DESC LIMIT 10
"u.name" │"u.screen_name" │"u.pagerank"
"HEAVY T " │"IRISHHEAVYT" │12.1623955
"Anti-Globalist" │"TheGoodGuy2017b"│5.187134
"King Eric" │"ericsuniverse" │4.2865675
"Dani Pereira" │"weblollipop1" │2.7041904999999997
"John Durrant" │"BoycottHRC" │2.5428434999999996
"deborah sidener" │"debbiesidener2" │2.1846535
"Based chris" │"CJTUCKERTRUPAT" │1.9179149999999998
"Bobby" │"slowbob" │1.8188475
"ZillaStevenson" │"ZillaStevenson" │1.772446
"Trump 2020 ‼" │"thatgirlsandra5"│1.7422285

PageRank on a “mention” Network

call algo.pageRank('MATCH (u:User) return id(u) as id',
'MATCH (u:User)-[:POSTED]->()-[:MENTIONED]->(u2:User) return id(u) as source,id(u2) as target',
{graph:'cypher',writeProperty:'mentionRank'});
MATCH (u:User)
RETURN u.name, u.mentionRank, u.pagerank
ORDER BY u.mentionRank desc LIMIT 10
"u.name" │"u.mentionRank" │"u.pagerank"
"Natasha Bertrand" │1273.3055510000002│0.15003400000000003
"Darren Kaplan" │690.7605669999999 │0.15000000000000002
"Donald J. Trump" │416.37035 │0.15000000000000002
"Mike Cernovich " │352.29304899999994│0.1650705
"Paul Joseph Watson" │220.54896499999998│0.15013600000000002
"The Columbia Bugle" │217.4538345 │0.15034850000000002
"President Trump" │215.79957299999998│0.15000000000000002
"STOCK MONSTER" │192.0740075 │0.15008500000000002
"ɳαɳ૮ყ" │156.2747785 │0.15512550000000003
"Stefan Molyneux" │116.32824650000002│0.15006800000000003
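To make the mechanics behind both rankings concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is not Neo4j's implementation. It assumes the classic non-normalized formulation with a damping factor of 0.85 and 20 iterations, which would be consistent with the 0.15 floor visible in the `u.pagerank` column for accounts without incoming edges. The function name `pagerank` and the toy graph are made up for illustration.

```python
def pagerank(edges, nodes, damping=0.85, iterations=20):
    """Non-normalized PageRank: PR(v) = (1 - d) + d * sum(PR(u) / out(u))."""
    out_degree = {n: 0 for n in nodes}
    for source, _ in edges:
        out_degree[source] += 1

    rank = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for source, target in edges:
            # Each source splits its current rank evenly over its out-edges.
            incoming[target] += rank[source] / out_degree[source]
        rank = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return rank


# Hypothetical toy graph: "a" and "c" both point at "b".
ranks = pagerank([("a", "b"), ("c", "b")], ["a", "b", "c"])
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

Note that the direction of the `source` and `target` columns decides who receives rank: in the sketch, as in the Cypher projections above, score flows from `source` to `target`.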

Clustering Algorithms

We run the clustering on top of the “interaction” network, i.e. people interacting with each other’s tweets.

call algo.unionFind('MATCH (u:User) RETURN id(u) as id',
'MATCH (u:User)-[:POSTED]->()<-[:RETWEETED|REPLIED_TO]-()<-[:POSTED]-(u2:User) RETURN id(u) as source,id(u2) as target',
{graph:'cypher'})

This results in 1946 partitions, but almost all users end up in a single partition.

MATCH (u:User)
RETURN u.partition, count(*) as c ORDER BY c DESC
LIMIT 10
"u.partition"│"c"
70 │30404
3657 │27
21563 │6
2721 │4
738 │3
9403 │3
1415 │3
22819 │3
14646 │3
3659 │3
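The lopsided result follows from how union-find works: any two users linked by at least one interaction are merged, so a single retweet is enough to pull an account into the giant partition. A minimal sketch of the idea in plain Python, not the library's implementation; the function name `connected_components` and the sample ids are hypothetical:

```python
def connected_components(nodes, edges):
    """Union-find with path compression; returns node -> component root."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as root so results are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {n: find(n) for n in nodes}


# Hypothetical sample: users 1-2-3 interact, 4-5 interact, 6 is isolated.
partition = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(partition)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```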

So let’s try label propagation instead.

call algo.labelPropagation('MATCH (u:User) return id(u) as id',
'MATCH (u:User)-[:POSTED]->()<-[:RETWEETED|REPLIED_TO]-()<-[:POSTED]-(u2:User) return id(u) as source,id(u2) as target','OUTGOING',
{graph:'cypher'});
match (u:User)
return count(distinct u.partition) as partitions

This results in 9786 partitions with a long tail.

match (u:User)
return u.partition, count(*) as c order by c desc
limit 30
"u.partition"│"c"
70 │21368
103537 │397
79479 │338
17685 │209
50206 │82
30059 │79
54271 │71
23203 │46
103520 │45
23026 │42
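Label propagation can split a connected graph because each node adopts the label that is most common among its neighbors, so densely connected groups converge on a shared label even when a few edges bridge them. The following is a minimal sketch under stated assumptions, not the library's algorithm: it updates all nodes synchronously, breaks ties by the smallest label, and treats edges as undirected (the call above passes 'OUTGOING'). The function name and the sample graph are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def label_propagation(nodes, edges, max_iterations=10):
    """Each node repeatedly adopts the most common label among its neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {n: n for n in nodes}  # every node starts in its own community
    for _ in range(max_iterations):
        new_label = {}
        for n in nodes:
            counts = Counter(label[m] for m in neighbors[n])
            if not counts:
                new_label[n] = label[n]  # isolated node keeps its label
                continue
            best = max(counts.values())
            # Break ties by the smallest label to keep the run deterministic.
            new_label[n] = min(l for l, c in counts.items() if c == best)
        if new_label == label:
            break  # converged; max_iterations guards against oscillation
        label = new_label
    return label


# Hypothetical sample: two groups of four users who all interact with each
# other, joined by a single interaction between users 4 and 5.
edges = list(combinations([1, 2, 3, 4], 2)) + list(combinations([5, 6, 7, 8], 2))
edges.append((4, 5))
communities = label_propagation(range(1, 9), edges)
print(communities)
```

On this sample the two groups end up with different labels, whereas a connected-components approach would report all eight users as one partition because of the bridge.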

We can mark the largest partitions and give them the name of the member with the highest pagerank.

match (u:User)
with u order by u.pagerank desc
with u.partition as p, count(*) as c, collect(u) as users,head(collect(u.screen_name)) as partitionName order by c desc
limit 50
foreach (u in users |
set u:Group set u.partitionName = partitionName)
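The same naming step, spelled out in plain Python for clarity: group users by partition, keep the largest groups, and name each after its member with the highest PageRank. This is a minimal sketch assuming the users are available as dicts with `screen_name`, `partition`, and `pagerank` keys; the function name `name_partitions` and the sample users are hypothetical.

```python
from collections import defaultdict


def name_partitions(users, top_n=50):
    """Name the `top_n` largest partitions after their highest-ranked member.

    `users` is a list of dicts with the keys screen_name, partition, pagerank.
    """
    groups = defaultdict(list)
    for user in users:
        groups[user["partition"]].append(user)

    largest = sorted(groups.values(), key=len, reverse=True)[:top_n]
    names = {}
    for members in largest:
        leader = max(members, key=lambda u: u["pagerank"])
        names[leader["partition"]] = leader["screen_name"]
    return names


# Hypothetical sample with made-up users and scores.
sample = [
    {"screen_name": "alice", "partition": 70, "pagerank": 3.2},
    {"screen_name": "bob", "partition": 70, "pagerank": 0.4},
    {"screen_name": "carol", "partition": 9, "pagerank": 1.1},
]
print(name_partitions(sample))  # {70: 'alice', 9: 'carol'}
```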

And render them as a “summary” visualization.

call apoc.nodes.group(['Group'],['partitionName']) yield nodes, relationships
unwind nodes as n
return n,relationships

There is much more possible with this data, especially if we pull in the follower relationships and the other tweets of the users.

This should just give you an idea of what is in here.

Sorry for the ugly tables, but adding proper HTML tables to Medium does not seem meant to be easy.

Free download: O’Reilly “Graph Algorithms on Apache Spark and Neo4j”
