I wrote a Python program to calculate the most commonly used words in subreddits. Here’s what I found…
I was looking for project ideas and decided to mess around with the Reddit API, aiming for some kind of data analysis/visualization project. While reading the documentation for PRAW, a Python wrapper for the API, I saw that it could extract comments, which gave me the idea of finding the top words in each subreddit based on its comments.
Beware: lots of technical mumbo-jumbo ahead. Feel free to skip to the end to see the results.
The wrapper comes with a method which returns an array of submission objects, each with its own comments. From there I could loop through each submission and, from each comment, extract the individual words and analyze them. Easy, right? Or so I thought.
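Fetching the data might look something like the sketch below. This is my own illustration, not the post's actual code: the credentials are placeholders, `fetch_comment_texts` is a hypothetical helper name, and I'm assuming the hot listing as the submission source.

```python
def fetch_comment_texts(subreddit_name, limit=100):
    # Sketch: pull comment bodies from a subreddit's hot submissions.
    # Requires Reddit API credentials (placeholders below).
    import praw  # imported here so the function can be defined without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="top-words-script",
    )
    texts = []
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        submission.comments.replace_more(limit=0)  # flatten "load more comments" stubs
        for comment in submission.comments.list():
            texts.append(comment.body)
    return texts
```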
As it turns out, the comments come back as one giant array of individual characters! So now I needed a way to convert a giant character array into individual words.
The solution I came up with was simple: every time I hit a space, the current word had ended. So I concatenated characters until I reached a space, then stored the resulting string in an array.
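A minimal sketch of that loop (my own version, not the post's exact code):

```python
def chars_to_words(chars):
    # Build words by concatenating characters until a space is hit.
    words = []
    current = ""
    for ch in chars:
        if ch.isspace():
            if current:  # skip runs of consecutive spaces
                words.append(current)
                current = ""
        else:
            current += ch
    if current:  # don't drop the final word
        words.append(current)
    return words
```

Worth noting that Python's `str.split()` does the same thing in one call, but writing the loop by hand is a fine way to learn.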
The resulting array looked like this:
Making progress, but as you can see there are characters which are not letters that could possibly skew the data later on, so I needed a way to get rid of them. Luckily, Python has a neat built-in method to test for alphanumeric characters, so I could check each character and reduce the string accordingly. I also made sure empty words didn't get through.
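The built-in in question is `str.isalnum()`. A sketch of the cleanup step (`clean_word` is an illustrative name, and lowercasing is an assumption so counts aren't split by case):

```python
def clean_word(word):
    # Keep only alphanumeric characters, lowercased for consistent counting.
    return "".join(ch for ch in word.lower() if ch.isalnum())
```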
Awesome. The last major thing I needed to do was count the words, so I used a dictionary to keep track of the number of occurrences of each word, then sorted it to get the top occurrences.
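The counting and sorting can be sketched like this (`collections.Counter` would also do the job in fewer lines):

```python
def count_words(words):
    # Tally occurrences in a plain dict.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def top_words(counts, n=10):
    # Sort by count, descending, and keep the top n.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
```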
When I first tested it I found that, naturally, the top results were common words such as "the", "this", "that", etc., so I wanted to ignore those to find words more unique to each subreddit. The not-so-elegant-yet-efficient solution was to create a set of common words; before counting a word, the program checks whether it's in the set and ignores it if so. I used a set instead of a regular list for more efficient lookup time: O(1) vs. O(n).
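The filter might look like this. The set here is a tiny illustrative sample (the real list would be much longer), and `should_count` is a hypothetical helper name:

```python
# Illustrative subset of common "stop words" -- the real set would be much longer.
COMMON_WORDS = {"the", "this", "that", "a", "an", "and", "of", "to", "is", "it"}

def should_count(word):
    # Set membership is O(1) on average, vs O(n) for a list.
    # Also rejects empty strings left over after cleaning.
    return bool(word) and word not in COMMON_WORDS
```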
Now I had an ordered mapping of words unique to a subreddit, and all that was left was to display it. Python has a neat library called Matplotlib which can represent data beautifully. I used its pie chart component to display the top 10 words.
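A sketch of the plotting step, assuming the top words arrive as a list of (word, count) pairs; `plot_top_words` and the output filename are my own illustrative choices:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also works without a display
import matplotlib.pyplot as plt

def plot_top_words(top, subreddit):
    # top: list of (word, count) pairs, e.g. the sorted top 10.
    words = [w for w, _ in top]
    counts = [c for _, c in top]
    plt.figure()
    plt.pie(counts, labels=words, autopct="%1.1f%%")
    plt.title(f"Top words in r/{subreddit}")
    out = f"{subreddit}_top_words.png"
    plt.savefig(out)
    plt.close()
    return out
```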
The results were glorious. I tested it out on a few subreddits, each with a sample size of 10,000 comments. Here's what the subreddits had to say:
Warning, lots of foul language ahead! (it is Reddit after all)
This was a really fun hack to do, and it turned out to be more technically challenging than I anticipated. You can check out the code for it, along with my other projects, here
Thanks for reading! I hope you learned something, whether about Reddit or tech.