I wrote a Python program to calculate the most commonly used words in subreddits. Here’s what I found…

Since I have been working exclusively with JavaScript over the last few months, I decided to do a project in one of the first languages I learned when I started getting serious about programming: Python!

I needed some ideas for a project, so I decided to mess around with the Reddit API and do some kind of data analysis/visualization project. Then I looked at the documentation for a Python wrapper called PRAW, saw that I could extract comments, and came up with the idea of finding the top words in the comments of each subreddit.

Beware, lots of technical mumbo-jumbo ahead. Feel free to skip to the end to see the results.

What better community to analyze vocabulary than Reddit?

The wrapper comes with a method that returns the submissions of a subreddit, each with its own comments. From there I could loop through each submission, extract the individual words from each comment, and analyze them. Easy, right? Or so I thought.

As it turns out, each comment is actually one giant string, so looping over it gives individual characters! So now I needed a way to convert a giant sequence of characters into individual words.
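To make the fetching step concrete, here is a rough sketch of how comment text could be collected with PRAW. It is not the code from this project: the helper name collect_comment_bodies, the choice of the "hot" listing, and the submission_limit parameter are all assumptions for illustration. The PRAW calls themselves (subreddit.hot, comments.replace_more, comments.list, comment.body) are part of the library.

```python
def collect_comment_bodies(subreddit, submission_limit=10):
    """Return the text of every comment in a subreddit's hot submissions.

    `subreddit` is expected to be a PRAW Subreddit object. The helper name,
    the "hot" listing and `submission_limit` are illustrative assumptions.
    """
    bodies = []
    for submission in subreddit.hot(limit=submission_limit):
        # Drop the "load more comments" placeholders so only real comments remain.
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            # comment.body is a plain Python string, which is why iterating
            # over it yields single characters rather than words.
            bodies.append(comment.body)
    return bodies


# Hypothetical usage (needs the praw package and your own API credentials):
#   import praw
#   reddit = praw.Reddit(client_id="YOUR_ID", client_secret="YOUR_SECRET",
#                        user_agent="word-counter by u/your_username")
#   bodies = collect_comment_bodies(reddit.subreddit("soccer"))
```

Passing the subreddit object in, rather than creating the Reddit client inside the function, keeps the credentials out of the helper and makes it easy to try with a stand-in object.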

Extracting individual words

The solution I came up with was simple. Every time I came across a space character, it would mean that the current word had ended. So I would concatenate every character until I reached a space and then store the resulting string in an array.

The resulting array looked like this:

Initial attempt to extract words
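For anyone reading without the screenshot, the idea can be sketched roughly like this. It is a reconstruction of the approach described above, not the original code, and the function name split_into_words is made up for the example.

```python
def split_into_words(text):
    """Build words character by character, ending a word at each space."""
    words = []
    current = ""
    for ch in text:
        if ch == " ":
            # A space means the current word is finished.
            words.append(current)
            current = ""
        else:
            current += ch
    # Store whatever was collected after the last space.
    words.append(current)
    return words


print(split_into_words("Wow, great goal!!"))
# ['Wow,', 'great', 'goal!!']
```

Note that punctuation stays attached to the words, and two spaces in a row produce an empty string. Python's built-in str.split() does a similar job in one call, but the loop above mirrors the approach described here.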

Making progress, but as you can see there are characters that are not letters and could possibly skew the data later on. So I needed a way to get rid of them. Luckily, Python has a neat built-in string method to test for alphanumeric characters, so I could just check for those and trim each string accordingly. I also made sure empty words didn’t go through.

Alphanumeric Check
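A minimal sketch of that check, assuming the built-in method in question is str.isalnum(). The function names clean_word and clean_words are made up for the example.

```python
def clean_word(word):
    """Keep only the alphanumeric characters of a word."""
    return "".join(ch for ch in word if ch.isalnum())


def clean_words(words):
    """Clean every word and drop the ones that end up empty."""
    cleaned = []
    for word in words:
        stripped = clean_word(word)
        if stripped:  # an empty string means nothing alphanumeric was left
            cleaned.append(stripped)
    return cleaned


print(clean_words(["Wow,", "", "goal!!", "--", "don't"]))
# ['Wow', 'goal', 'dont']
```

One side effect worth knowing: apostrophes are removed too, so "don't" becomes "dont".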

Awesome. Now the last major thing I needed to do was map each word to its count, so I used a dictionary to keep track of the number of occurrences. Then I sorted it to get the top occurrences.

Mapping words to occurrences
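Here is a rough sketch of the counting and sorting step. Lowercasing each word before counting is my assumption so that "Goal" and "goal" land in the same bucket, and the names count_words and top_words are made up for the example.

```python
def count_words(words):
    """Map each word (lowercased, an assumption) to how often it appears."""
    counts = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


counts = count_words(["goal", "Goal", "ref", "goal", "ref", "pass"])
print(top_words(counts, n=2))
# [('goal', 3), ('ref', 2)]
```

The standard library's collections.Counter, with its most_common() method, covers the same ground in fewer lines. The plain dictionary version is shown because it matches the description above.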

When I first tested it, I found that the top results were naturally common words such as “the”, “this”, “that”, etc., so I wanted to ignore those and find words more unique to each subreddit. The not-so-elegant-yet-efficient solution was to create a set of common words: before counting a word, the program checks whether it is in the set and skips it if so. I used a set instead of a regular list for faster lookups: O(1) on average vs. O(n).

Yikes, I’m still adding new words to this set
The modification to check for the common words
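As a sketch of that modification: the set below is a tiny stand-in, not the real list from the project, and the names COMMON_WORDS and count_uncommon_words are made up for the example.

```python
# A small illustrative set; the real one would be much longer.
COMMON_WORDS = {"the", "this", "that", "a", "and", "to", "of", "is", "it"}


def count_uncommon_words(words, common_words=COMMON_WORDS):
    """Count words, skipping any that appear in the set of common words."""
    counts = {}
    for word in words:
        key = word.lower()
        if key in common_words:
            # Set membership is a hash lookup, so this check stays fast
            # no matter how many common words get added.
            continue
        counts[key] = counts.get(key, 0) + 1
    return counts


print(count_uncommon_words(["The", "goal", "is", "that", "goal"]))
# {'goal': 2}
```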

Now I have an ordered mapping of words to their counts for a subreddit, so I just have to display it. Python has a neat library called Matplotlib which can represent data beautifully. I used the pie chart component to display the data and picked the top 10 words.

Using matplotlib
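A pie chart of the top words could be drawn roughly as follows. This is a sketch under assumptions, not the original code: the function names, the percentage labels, and the optional path argument are all mine. The Matplotlib calls used (plt.subplots, ax.pie, ax.set_title, fig.savefig, plt.show) are standard.

```python
def top_n(counts, n=10):
    """Split the n most frequent words into parallel label and size lists."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
    labels = [word for word, _ in ranked]
    sizes = [count for _, count in ranked]
    return labels, sizes


def plot_top_words(counts, subreddit, n=10, path=None):
    """Draw a pie chart of the top n words; save to `path` or show a window."""
    # Imported here so top_n() can be used without Matplotlib installed.
    import matplotlib.pyplot as plt

    labels, sizes = top_n(counts, n)
    fig, ax = plt.subplots()
    ax.pie(sizes, labels=labels, autopct="%1.1f%%")
    ax.set_title(f"Top {len(labels)} words in r/{subreddit}")
    if path is not None:
        fig.savefig(path)
    else:
        plt.show()
```

Calling plot_top_words(counts, "soccer") opens a chart window, while passing path="soccer.png" writes the chart to a file instead.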

The results were glorious. I tested it out on a few subreddits, each with a sample size of 10,000 comment posts. Here’s what the subreddits had to say:

Warning, lots of foul language ahead! (it is Reddit after all)

Our dear President is a hot topic over at the politics subreddit

For a subreddit about atheism, religion sure is discussed a lot

Our president is yet again the highlight of a lot of news

Lots of positivity from this subreddit, to be expected from a supportive community

As a football fan I can confirm that ‘goal’ and profanity are the most commonly used phrases
Not surprising to see Facebook here with all the controversies recently

This was a really fun hack to do, and it turned out to be more technically challenging than I anticipated. You can check out the code for it along with my other projects here

Thanks for reading! I hope you learned something, whether it was about Reddit or tech.