I wrote a Python program to calculate the most commonly used words in subreddits. Here’s what I found…

Tommy Carrascal
Dec 6, 2018 · 4 min read

Since I have been working exclusively with JavaScript over the last few months, I decided to do a project in one of the first languages I learned when I started to get serious about programming: Python!

With that, I needed some ideas for a project. I decided to mess around with the Reddit API and do some kind of data analysis/visualization project. Then I looked at the documentation for a wrapper called PRAW, saw that I could extract comments, and came up with the idea of finding the top words in each subreddit based on its comments.

Beware, lots of technical mumbo-jumbo ahead. Feel free to skip to the end to see the results.

What better community to analyze vocabulary than Reddit

The wrapper comes with a method which returns a collection of individual submissions, each with their own comments. From there I could loop through each submission, extract the individual words from each comment, and analyze them. Easy, right? Or so I thought.
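A minimal sketch of what that loop might look like, not the original code. The helper name `iter_comment_bodies` is made up for illustration. It assumes PRAW-style objects, where `submission.comments` is a comment forest with `replace_more()` and `list()`, and each comment has a `body` string. The credentials and subreddit name in the usage comment are placeholders.

```python
from typing import Iterable, Iterator


def iter_comment_bodies(submissions: Iterable) -> Iterator[str]:
    """Yield the text of every comment under each submission.

    Hypothetical helper: expects PRAW-style submission objects.
    """
    for submission in submissions:
        # Drop the "load more comments" placeholders instead of fetching them.
        submission.comments.replace_more(limit=0)
        # list() flattens the comment tree into a single list of comments.
        for comment in submission.comments.list():
            yield comment.body


# Assumed usage with PRAW (all values below are placeholders):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   submissions = reddit.subreddit("python").hot(limit=25)
#   for body in iter_comment_bodies(submissions):
#       print(body)
```

Taking the submissions as an argument keeps the loop separate from the API setup, so it can be tried out with stand-in objects before touching the real API.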

As it turns out, each comment is actually a giant array of individual characters (a string)! So now I needed to find a way to convert a giant character array into individual words.

Extracting individual words

The solution I came up with was simple. Every time I came across a space character, it would imply that the current word had ended. So I would concatenate every character until I reached a space and then store the resulting string in an array.

The resulting array looked like this:

Initial attempt to extract words
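The character-by-character splitting described above could be sketched roughly as follows. The function name `split_words` is an assumption for illustration, and this is a reconstruction of the idea rather than the original code.

```python
from typing import List


def split_words(text: str) -> List[str]:
    """Split text into words by walking it one character at a time."""
    words = []
    current = ""
    for ch in text:
        if ch == " ":
            # A space marks the end of the current word.
            words.append(current)
            current = ""
        else:
            current += ch
    # Store whatever is left after the last space.
    words.append(current)
    return words


print(split_words("Hello, world! It's great"))
```

Punctuation stays attached to the words at this stage, and consecutive spaces produce empty strings. Python's built-in `text.split(" ")` gives the same result in one call.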

Making progress, but as you can see there are characters which are not letters that could possibly skew the data later on. So I needed a way to get rid of them. Luckily, Python has a neat built-in method to test for alphanumeric characters, so I could just check for those and strip everything else out of each string. I also made sure empty words didn’t go through.

Alphanumeric Check
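One way this cleanup might look, assuming the built-in method in question is `str.isalnum()`. The helper names `clean_word` and `clean_words` are hypothetical.

```python
from typing import Iterable, List


def clean_word(word: str) -> str:
    """Keep only the alphanumeric characters of a word."""
    return "".join(ch for ch in word if ch.isalnum())


def clean_words(words: Iterable[str]) -> List[str]:
    """Clean every word and drop the ones that end up empty."""
    cleaned = (clean_word(word) for word in words)
    return [word for word in cleaned if word]


print(clean_words(["Hello,", "world!", "--", "It's"]))
```

Note that apostrophes are stripped too, so "It's" becomes "Its". Whether that matters depends on how precise the counts need to be.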

Awesome. Now the last major thing I needed to do was map each word to its count, so I used a dictionary to keep track of the number of occurrences. Then I sorted it to get the top occurrences.

Mapping words to occurrences
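A small sketch of the counting and sorting step, using a plain dictionary as described. The names `count_words` and `top_words` are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple


def count_words(words: Iterable[str]) -> Dict[str, int]:
    """Map each word to the number of times it occurs."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


counts = count_words(["goal", "ref", "goal", "goal", "ref", "ball"])
print(top_words(counts, 2))
```

The standard library's `collections.Counter` and its `most_common()` method cover the same ground if you would rather not write the loop by hand.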

When I first tested it, I found that, naturally, the top results were common words such as “the”, “this”, “that”, etc., so I wanted to ignore those to find words more unique to each subreddit. The not-so-elegant-yet-efficient solution was to create a set of common words. Before adding a word, the program checks whether it is in the set and ignores it if so. I used a set instead of a regular list for more efficient lookup: O(1) on average vs. O(n).

Yikes, I’m still adding new words to this set
The modification to check for the common words
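The common-word check might be sketched like this. The contents of `COMMON_WORDS` are a tiny illustrative subset, not the real set, and lowercasing before the lookup is an assumption added here so that "The" and "the" are treated alike.

```python
from typing import Dict, Iterable, Set

# Illustrative subset only; the real set would be much larger.
COMMON_WORDS: Set[str] = {"the", "this", "that", "a", "and"}


def count_uncommon_words(
    words: Iterable[str], common: Set[str] = COMMON_WORDS
) -> Dict[str, int]:
    """Count words, skipping any that appear in the common-word set."""
    counts: Dict[str, int] = {}
    for word in words:
        # Set membership is O(1) on average; a list would be scanned in O(n).
        if word.lower() in common:  # lowercasing is an assumption of this sketch
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_uncommon_words(["The", "goal", "that", "goal"]))
```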

Now I have an ordered mapping of unique words for a subreddit, so I just have to display it. Python has a neat library called Matplotlib which can represent data beautifully. I used its pie chart to display the data and picked the top 10 words.

Using matplotlib
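A rough sketch of the pie chart step with Matplotlib. The function name `plot_top_words`, the sample data, and the output file name are all placeholders. The `Agg` backend is chosen here only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, an assumption of this sketch
import matplotlib.pyplot as plt


def plot_top_words(top, title, path):
    """Draw a pie chart of (word, count) pairs and save it to path."""
    labels = [word for word, _ in top]
    sizes = [count for _, count in top]
    fig, ax = plt.subplots()
    # autopct prints each slice's share of the total as a percentage.
    ax.pie(sizes, labels=labels, autopct="%1.1f%%")
    ax.set_title(title)
    fig.savefig(path)
    plt.close(fig)
    return labels, sizes


# Made-up sample data, for illustration only.
plot_top_words([("goal", 3), ("ref", 2), ("ball", 1)], "Top words", "top_words.png")
```

To show the chart in a window instead of saving it, drop the `matplotlib.use("Agg")` line and call `plt.show()` in place of `fig.savefig(path)`.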

The results were glorious. I tested it out on a few subreddits, each with a sample size of 10,000 comment posts. Here’s what the subreddits had to say:

Warning, lots of foul language ahead! (it is Reddit after all)

Our dear President is a hot topic over at the politics subreddit

For a subreddit about atheism, religion sure is discussed a lot

Our president is yet again the highlight of a lot of news

Lots of positivity from this subreddit, to be expected from a supportive community

As a football fan I can confirm that ‘goal’ and profanity are the most commonly used phrases
Not surprising to see Facebook here with all the controversies recently

This was a really fun hack to do, and it turned out to be more technically challenging than I anticipated. You can check out the code for it along with my other projects here.

Thanks for reading! I hope you learned something, whether it was about Reddit or tech.

Written by Tommy Carrascal

SWE Intern at GoDaddy | Full-Stack Developer | https://www.linkedin.com/in/carrascalt/
