Using PRAW to Better Understand Content in r/gamedev

AS
Social Media: Theories, Ethics, and Analytics
5 min read · Sep 22, 2020

With thousands of new games released every week, it should come as no surprise that game development is a fast-growing field. In the past, designing and building game engines was an obstacle to many aspiring developers. Today, there are many free and freemium third-party engines, which significantly lower the barrier to entry. The choice of a game engine can initially sound like a straightforward decision based on current needs. It also doesn’t sound like a topic that can provoke heated debates. Unfortunately, discussions on game engines can sometimes devolve into an unconstructive and toxic cesspool. This essay describes an exploratory study to develop a better understanding of content in online game developer communities.

My primary research question is as follows: What is the proportion of software-agnostic content to engine-specific content in a selected online community? Answering this question will validate the plausibility of using the target community in a more extensive study.

I selected r/gamedev, a subreddit dedicated to game development, as a data source. Choosing Reddit was somewhat arbitrary. It’s the platform I am most familiar with, and its information architecture is well suited to lengthy and multithreaded debates. Selecting this particular subreddit is more defensible since it is the largest in this area of interest. At the time of writing this post, r/gamedev has 488k subscribers. Other potential candidates included r/IndieDev with 61k members. Many other subreddits that gather users interested in specific subdisciplines of game development were also omitted. It’s also important to note that each of the game engines in my target list has a dedicated subreddit.

To answer my research question, I use word frequency to analyze post titles. After the posts are collected, a basic “for” loop marks those whose titles include names of game engines. I use a dictionary to store target words to account for abbreviations or extensions. For example, “Unity” and “Unity3D” are both counted as “Unity.” Similarly, “gm” also counts as “GameMaker.” This method is not very sophisticated, but it serves as a reasonable starting point for exploratory results.
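The marking step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code used for the study: the names `ENGINE_ALIASES` and `engines_in_title` are hypothetical, and only the “Unity”/“Unity3D” and “gm” aliases come from the text, while the remaining entries are assumed for the example.

```python
import re

# Hypothetical alias table: each spelling maps to one canonical engine name.
# Only the Unity/Unity3D and gm entries are taken from the essay; the rest are assumed.
ENGINE_ALIASES = {
    "unity": "Unity",
    "unity3d": "Unity",
    "unreal": "Unreal",
    "godot": "Godot",
    "gamemaker": "GameMaker",
    "gm": "GameMaker",
}


def engines_in_title(title):
    """Return the set of canonical engine names mentioned in a title."""
    found = set()
    # Lowercase, then split on anything that is not a letter or digit,
    # so "Unity3D?" becomes the single word "unity3d".
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if word in ENGINE_ALIASES:
            found.add(ENGINE_ALIASES[word])
    return found


# Made-up titles, for illustration only.
titles = [
    "How do I bake lighting in Unity3D?",
    "Devlog #4: procedural caves",
    "Moving from GM to Godot",
]
flagged = [t for t in titles if engines_in_title(t)]
```

Here `flagged` holds the titles that mention at least one engine, and dividing its length by the number of titles gives the proportion the research question asks about.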

There are some obvious biases that this data might have. This subreddit may not be representative of the entire game developer community. It also gathers almost exclusively English-speaking users. Additionally, my target word list counts only the most popular engines: Unity, Unreal, Godot, and GameMaker. Omitted engines include Scratch, Construct, Lumberyard, and Stingray. I will run a query to gather 1000 posts from the subreddit, sorted by new and by top. Reddit allows sorting top posts by day, week, month, year, and all time. I include all of these.
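A query along these lines might look like the sketch below. It assumes PRAW's `subreddit.new()` and `subreddit.top(time_filter=...)` listings; the helper names `make_subreddit` and `collect_titles`, the credential placeholders, and the user agent string are all hypothetical.

```python
# The "top" windows named in the text, as PRAW time_filter values.
TIME_FILTERS = ["day", "week", "month", "year", "all"]


def make_subreddit(name="gamedev"):
    """Build a PRAW subreddit handle. All credentials here are placeholders."""
    import praw  # imported lazily so collect_titles can be tried without PRAW

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="gamedev-title-study (placeholder)",
    )
    return reddit.subreddit(name)


def collect_titles(subreddit, limit=1000):
    """Return {listing name: [post titles]} for 'new' and each 'top' window."""
    listings = {"new": [post.title for post in subreddit.new(limit=limit)]}
    for window in TIME_FILTERS:
        posts = subreddit.top(time_filter=window, limit=limit)
        listings["top_" + window] = [post.title for post in posts]
    return listings
```

With real credentials filled in, `collect_titles(make_subreddit())` would return six lists of titles, one per listing. A listing can return fewer posts than the requested limit.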

I used the Python Reddit API Wrapper (PRAW) to collect the data. I found working with PRAW surprisingly simple. I’ve had some exposure to working with APIs as an IT undergrad, and those experiences were rarely positive. What really helps here is the simplicity of working with Python through a Jupyter Notebook. It is significantly simpler to get the project up and running than projects that require the setup of a LAMP stack. As a result, I have not encountered any bugs at this stage.

PRAW returns the comments on a Reddit post in a data structure called a forest (a collection of comment trees). Forests were new to me, and I found them unintuitive. Formatting the data into neat threads rather than a flattened list was more of a pain than I initially expected. Almost all of this was utterly unnecessary. A flattened list of comments worked fine for what I was trying to accomplish. In the end, I decided to scope down and limit queries to post titles. Any problems I’ve run into up to this point resulted from my Python skills being a little rusty.
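To make the forest idea concrete, here is a small stand-alone sketch of flattening a list of comment trees with a breadth-first walk. The `Comment` class below is a hypothetical stand-in, not PRAW's own comment type. In PRAW itself, calling `submission.comments.replace_more(limit=0)` followed by `submission.comments.list()` should give a comparable flat list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Comment:
    """Stand-in for a comment: some text plus a list of direct replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten(forest):
    """Breadth-first walk over a list of comment trees, returning one flat list."""
    flat = []
    queue = deque(forest)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies go to the back of the queue, so shallower comments come first.
        queue.extend(comment.replies)
    return flat


# Made-up thread: two top-level comments, one of which has nested replies.
forest = [
    Comment("top-level A", [
        Comment("reply A1", [Comment("reply A1a")]),
        Comment("reply A2"),
    ]),
    Comment("top-level B"),
]
bodies = [c.body for c in flatten(forest)]
```

The flat list loses the reply structure, which is acceptable when only the words matter, as they do for a frequency count.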

Additionally, there are some issues with the integrity of the data. First, there is an issue with PRAW returning duplicate results. I was unable to replicate it reliably, and this prevented me from finding a quick fix. Second, I don’t have data on the number of posts per day, so the 1000 most recent posts cover an unknown span of time. The results for posts in the “new” category are probably skewed.
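One possible workaround for the duplicates, which was not part of this study, is to filter collected posts by their `id` attribute before counting. The sketch below assumes each post object exposes a unique `id`, as PRAW submissions do. The function name `dedupe_by_id` and the sample data are hypothetical.

```python
from types import SimpleNamespace


def dedupe_by_id(posts):
    """Keep the first occurrence of each post id, preserving the original order."""
    seen = set()
    unique = []
    for post in posts:
        if post.id not in seen:
            seen.add(post.id)
            unique.append(post)
    return unique


# Made-up posts standing in for PRAW submissions.
posts = [
    SimpleNamespace(id="a1", title="First post"),
    SimpleNamespace(id="b2", title="Second post"),
    SimpleNamespace(id="a1", title="First post"),
]
unique_posts = dedupe_by_id(posts)
```

This removes repeats within a single collection run. It does not explain why the duplicates appear in the first place.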

The results were somewhat surprising. Out of the 1000 posts sorted by “New,” only 103 (10.3%) included mentions of specific engines. This was the second-largest count. Posts sorted by “Top of Year” returned the most, with 129 positive results. The lowest number of hits was 7, in posts sorted by “Top of Day.” The sharp drop in mentions from “New” to “Top of Day” was unexpected. One possible interpretation is that posts that reference specific engines are technical questions. After the question is answered, the thread is buried, as there is no reason to reply.

To dig a little deeper, I ran a query of 10000 posts sorted by “New” that separates mentions of engines into categories. These results were equally interesting but significantly less insightful.
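A per-engine breakdown of this kind could be computed with a counter keyed by engine name, as in the following sketch. The alias table, the function name `count_by_engine`, and the sample titles are illustrative assumptions, not the data or code from the study.

```python
import re
from collections import Counter

# Hypothetical alias table mapping spellings to canonical engine names.
ALIASES = {
    "unity": "Unity",
    "unity3d": "Unity",
    "unreal": "Unreal",
    "godot": "Godot",
    "gamemaker": "GameMaker",
    "gm": "GameMaker",
}


def count_by_engine(titles):
    """Count how many titles mention each engine."""
    counts = Counter()
    for title in titles:
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        engines = {ALIASES[w] for w in words if w in ALIASES}
        # Updating with a set counts each engine at most once per title.
        counts.update(engines)
    return counts


# Made-up titles, for illustration only.
sample = ["Unity vs Unity3D naming", "Godot or Unity?", "No engine here"]
per_engine = count_by_engine(sample)
```

Counting each engine once per title keeps the totals comparable to the number of posts, so a title that repeats an engine name does not inflate the count.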

This study may seem trivial, but considering the lack of literature on this specific topic, any investigation must begin with an exploration of the problem space. The results lead me to the following insights. First, using Reddit as a data source for a more extensive study on this topic is plausible. Second, even basic methods of analysis can yield interesting results. Finally, PRAW is a handy tool and may prove useful in scoping any future work in this domain.

With all that said, there were some significant limitations. My method of analysis was, for lack of a better term, primitive. It cannot distinguish the name of an engine from other uses of the same word, such as “unity” in its ordinary sense. GameMaker is a particular case: any mention of this engine written with a space between “game” and “maker” was ignored. Methods that look at words in the context of their neighbors would solve these issues.
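As a very small step in that direction, the sketch below also checks each pair of adjacent words joined together, so that “game maker” is matched as “gamemaker.” This is an assumed illustration rather than something the study implemented, and the alias table and the name `engines_with_bigrams` are hypothetical.

```python
import re

# Hypothetical alias table, trimmed to the entries needed for the example.
ALIASES = {
    "unity": "Unity",
    "unity3d": "Unity",
    "gamemaker": "GameMaker",
    "gm": "GameMaker",
}


def engines_with_bigrams(title):
    """Match single words and joined adjacent word pairs against the alias table."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    # Join each adjacent pair so "game maker" is also seen as "gamemaker".
    pairs = [a + b for a, b in zip(words, words[1:])]
    return {ALIASES[t] for t in words + pairs if t in ALIASES}


spaced = engines_with_bigrams("Is Game Maker good for platformers?")
ambiguous = engines_with_bigrams("Team unity matters more than tools")
```

The first title is now caught, but the second is still flagged as Unity even though it is not about the engine. Resolving that kind of ambiguity would need a method that actually weighs the surrounding words.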

I think taking a moment to reflect on this work is appropriate. I must admit that I did not approach this assignment with much enthusiasm. I am more familiar with qualitative methods, and quantitative methods are a little outside my comfort zone. Parsing through the data from Reddit did, however, spark my curiosity more than I expected. There are fascinating opportunities in this domain that I had not considered before. It occurs to me that PRAW offers a solution to one major issue with qualitative studies. Discovering that a study has missed its target population often happens only after a significant amount of resources has been spent. Using PRAW to narrow down and validate the target community prior to qualitative data collection can potentially address this issue.
