Natural Language Processing analysis of r/The_Donald

Harsha Goonewardana
Lot of active engagement

I scraped Reddit's r/The_Donald subreddit to run a sentiment and word-frequency analysis of its submissions.

The dataset consisted of 28,726 submissions from the previous 30 days.

Sentiment Analysis

I used the TextBlob library for Python to conduct the sentiment analysis. TextBlob builds on the Natural Language Toolkit (NLTK) to classify text as positive, neutral or negative. Without going into too much detail, it uses a Naive Bayes algorithm trained on movie reviews: from those reviews the classifier learns how strongly each word is associated with positive or negative sentiment. The per-word probabilities for the words in a sentence are then multiplied to produce the sentence score.
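To make the "multiply the word scores" step concrete, here is a minimal Naive Bayes sketch using only the Python standard library. It is not TextBlob's implementation: the four training sentences, the `train` and `classify` helpers, and the Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data; a real classifier would be trained on a
# much larger labeled corpus such as movie reviews.
TRAIN = [
    ("a great and moving film", "pos"),
    ("great cast and a good story", "pos"),
    ("a bad and boring film", "neg"),
    ("bad acting and a fake story", "neg"),
]


def train(examples):
    """Count word occurrences per label and collect the vocabulary."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return (label, P(pos | text)) for a sentence."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in ("pos", "neg"):
        # Start from the prior P(label).
        log_p = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Laplace-smoothed P(word | label).
            p_word = (word_counts[label][word] + 1) / (total_words + len(vocab))
            # Adding logs is the same as multiplying the probabilities,
            # but avoids numeric underflow on long sentences.
            log_p += math.log(p_word)
        log_scores[label] = log_p
    # Normalize the two scores into a probability for the positive class.
    max_log = max(log_scores.values())
    weights = {k: math.exp(v - max_log) for k, v in log_scores.items()}
    p_pos = weights["pos"] / (weights["pos"] + weights["neg"])
    return ("pos" if p_pos >= 0.5 else "neg"), p_pos


model = train(TRAIN)
print(classify("great story", model))
print(classify("fake and bad", model))
```

With a training set this small the probabilities are only illustrative; the point is that each word contributes one factor to the sentence score.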

Here are the top 5 positive submission titles:

1. I'm sure it was an oversight that there is no Google Doodle for D-Day, so I whipped this up.
2. We must always protect those who protect us. Today, it was my great honor to sign the #VAMissionAct and to make Veterans Choice the permanent law of the land!
3. MAGIC WAND CONFIRMED!! MSNBC(ucks) ON SUICIDE WATCH
4. New Book Coming Out: "Spies in Congress: Inside the Demorats' Covered-Up Cyber Scandal"
5. Trump injects over 5 BILLION into veteran care on d-day anniv. - 🌭What did Barry Soetero do on d-day? Order 65k worth of hot dogs?!?🌭

And the top 5 negative submission titles:

1. My humble submission for the 2018 Meme War.Don't give up on us out here!
2. Batshit Crazy: Stormy Daniels sues former lawyer, accuses him of being a 'puppet' of Trump and Michael Cohen
3. When you agree with them harder than everyone else and they still find a way to hate you
4. Weird how the left COMPLETELY IGNORED the fact that Siddhartha Gautama is a Trump supporter.
5. Wrong order

Top 10 words:


In positive submissions:

trump (1506), new (571), president (470), right (372), great (355), good (343), people (311), first (287), best (284), obama (275)

In negative submissions:

trump (806), news (304), fake (280), illegal (240), hate (228), people (221), obama (194), president (182), bad (175), black (173)
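Word counts like these can be produced with `collections.Counter`. The sketch below is one possible approach, not necessarily the one used for this post: the `top_words` helper, the stop-word list, and the sample titles are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on"}


def top_words(titles, n=10):
    """Return the n most common non-stop-words across all titles."""
    counts = Counter()
    for title in titles:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up titles, for illustration only.
titles = [
    "Trump signs the new bill",
    "The president is right",
    "New poll: Trump leads",
]
print(top_words(titles, n=2))
```

Running it once on the positive titles and once on the negative titles would give two lists in the same shape as the ones above.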

Where do most of the reposts come from?

Most of the reposts come from Reddit itself, including the r/The_Donald subreddit. As expected, conservative websites also feature in the top ten, along with Twitter, perhaps a reflection of the President's penchant for tweeting.
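One way to tally where links point is to reduce each submission URL to its domain and count the results. This is a sketch under assumptions: it presumes each submission carries a link URL, and the `top_domains` helper and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=10):
    """Return the n most common link domains, ignoring a leading 'www.'."""
    domains = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip entries without a parseable host
            domains.append(host)
    return Counter(domains).most_common(n)


# Hypothetical URLs, not taken from the dataset.
urls = [
    "https://www.reddit.com/r/The_Donald/comments/abc123/example/",
    "https://i.redd.it/example.jpg",
    "https://twitter.com/example/status/1",
    "https://www.reddit.com/r/The_Donald/comments/def456/example/",
]
print(top_domains(urls, n=3))
```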

Title Length

Most of the titles are fewer than 20 words long.
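A claim like this can be checked by splitting each title on whitespace and counting the pieces. The helpers `title_lengths` and `share_below` are hypothetical names, and the third sample title is invented to show a title over the limit.

```python
def title_lengths(titles):
    """Word count of each title, splitting on whitespace."""
    return [len(title.split()) for title in titles]


def share_below(lengths, limit=20):
    """Fraction of titles with fewer than `limit` words."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < limit) / len(lengths)


sample = [
    "Wrong order",
    "When you agree with them harder than everyone else and they still find a way to hate you",
    " ".join(["word"] * 25),  # made-up 25-word title
]
lengths = title_lengths(sample)
print(lengths, share_below(lengths))
```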

I hope to do the same analysis on the r/conservative and r/republican subreddits to see the similarities.

