How Do Incoming, Current, and Former Students at the iSchool Feel About the Major? An Analysis Through the Reddit API

trinity killip
INST414: Data Science Techniques
4 min read · Sep 15, 2023

While it may or may not be a good resource, one of the resources that helped me decide which major to switch to was Reddit. Before switching to the Information Science program, I wanted to research the major further. I didn’t know many people who had graduated with a degree in Information Science at the time, so I relied on Google, which eventually led me to Reddit, where students would rant about, explain, and simply talk about the program.

I wanted to find a way to analyze the UMD subreddit and give people a clearer picture of what comes with the Information Science major. Since many Information Science students use Reddit, I decided to use the Reddit API through PRAW to scrape the subreddit and show a general overview of what’s being discussed about the major. The submissions include complaints, reviews, and questions. My goal was a dataset that narrows the subreddit’s posts down to those that mention information science.

Create application to access Reddit API
PRAW documentation

Before scraping the data, I needed access to Reddit’s API. To get it, I created a developer application, which provides the client ID and client secret; together with a user agent string, these let me connect to the API. With the Reddit API, we can access information from across the site: we can go through subreddits and look at submissions, comments, upvotes, users, etc. With the help of PRAW, I would be able to go through the posts on r/UMD and keep only the Information Science-specific ones to help students with their questions.
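As a rough sketch of this setup step (not necessarily how my notebook does it), the three values can be kept in environment variables and handed to `praw.Reddit`. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT`, the helper names, and the default user agent string are all placeholders I made up for this example.

```python
import os


def load_credentials(env=os.environ):
    """Collect the three values PRAW needs from a mapping of settings.

    The key names are hypothetical; use whatever names you store yours under.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # The user agent is a string you choose to identify your script.
        "user_agent": env.get("REDDIT_USER_AGENT", "umd-infosci-overview (placeholder)"),
    }


def make_reddit_client(env=os.environ):
    """Create a read-only PRAW client (no username or password supplied)."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(**load_credentials(env))
```

Keeping the secret out of the notebook itself also makes it safer to share the code on GitHub.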

After creating the Reddit application, I first wanted to create a submissions object that would contain what I was looking for. The object would hold a list of posts on r/UMD, and each post would carry information such as its URL, author, and upvotes. Afterwards, I planned on using pandas to create a data frame and display it as a table for users to see. I was able to grab submissions from r/UMD and found thousands of posts to filter through. To narrow the posts down a little, I decided to pull only from the subreddit’s “hot” listing, which Reddit uses to surface posts that many users have recently engaged with. This tends to leave out submissions that have 0 responses or no engagement at all, which wouldn’t be very helpful for users.
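Here is a minimal sketch of that step, assuming `reddit` is an already-configured `praw.Reddit` instance. The function name and its defaults are mine for illustration; `subreddit(...)` and `.hot(limit=...)` are the PRAW calls doing the work.

```python
def fetch_hot_submissions(reddit, subreddit_name="UMD", limit=1000):
    """Return submissions from a subreddit's "hot" listing as a list.

    `reddit` is expected to be a praw.Reddit instance. PRAW returns a lazy
    generator, so wrapping it in list() is what actually makes the requests.
    """
    subreddit = reddit.subreddit(subreddit_name)
    return list(subreddit.hot(limit=limit))
```

Usage would look like `submissions = fetch_hot_submissions(reddit)`, after which each item exposes attributes such as `title`, `score`, and `url`.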

Now that the object was created and ready, I used pandas to create a data frame. I decided that the data frame would have the columns: title, number of comments, score, upvote ratio, and url. Remember, I just want to show users posts about information science that will help them learn more about the major from students themselves. I didn’t want to show every single detail about each submission, because that would be useless.
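One way to build that data frame, sketched under the assumption that `submissions` is a list of PRAW submission objects: pull just the five attributes out of each post and hand the rows to pandas. `COLUMNS`, `submission_to_row`, and `build_dataframe` are names I chose for this example; the attribute names themselves (`title`, `num_comments`, `score`, `upvote_ratio`, `url`) are the ones PRAW exposes on a submission.

```python
# The five columns described above, using PRAW's attribute names.
COLUMNS = ["title", "num_comments", "score", "upvote_ratio", "url"]


def submission_to_row(submission):
    """Keep only the selected attributes of one submission."""
    return {column: getattr(submission, column) for column in COLUMNS}


def build_dataframe(submissions):
    """Turn an iterable of submissions into a pandas DataFrame."""
    import pandas as pd  # third-party package: pip install pandas

    rows = [submission_to_row(s) for s in submissions]
    return pd.DataFrame(rows, columns=COLUMNS)
```

Listing the columns in one place makes it easy to add or drop a field later without touching the rest of the code.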

Unfortunately, the Reddit API doesn’t let you go through every r/UMD submission, so to still get a large number of submissions, I set the limit to 1000. With pandas, I was then able to create a data frame of 1000 submissions.

After creating the data frame, I used regex to filter the submissions and keep only those related to information science. To do this, I had to use several terms to make sure I didn’t miss any submissions, such as “infosci”, “info sci”, “information science”, “iSchool”, etc. This mostly worked, but some submissions that weren’t related to information science still matched because a word shared letters with one of my keywords, like “instrument”, so I adjusted the regex to make sure I didn’t get unrelated posts.
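To illustrate the idea, here is a small sketch of a keyword pattern that uses word boundaries so that words like “instrument” don’t slip through. This is not the exact regex from my notebook; the pattern and the helper name `is_infosci` are assumptions written for this example, and the keyword list only covers the terms named above.

```python
import re

# \b marks a word boundary, so a keyword has to appear as its own word
# rather than as a fragment inside a longer one.
INFOSCI_PATTERN = r"\b(?:info\s?sci|information\s+science|ischool)\b"
INFOSCI_RE = re.compile(INFOSCI_PATTERN, flags=re.IGNORECASE)


def is_infosci(title):
    """Return True if a post title mentions one of the keywords."""
    return INFOSCI_RE.search(title) is not None
```

With a data frame of posts, the same pattern string can be applied to a whole column with pandas, for example `df[df["title"].str.contains(INFOSCI_PATTERN, flags=re.IGNORECASE)]`.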

In the end, I was able to get a table of submissions that had information about information science.

Information Science r/UMD dataframe

With this, users can scroll through each post and easily find information about the major from students themselves rather than from the school, which could be biased. Reddit has its own biases, of course, but it’s an additional source of information to consider.

ISSUES:

The Reddit API had some limitations that prevented me from scraping as many posts as I could have. For example, it doesn’t allow users to scrape all of the posts ever created in the subreddit; the most posts you can access is 1000. This affects my data frame because Information Science is still a relatively small major, so fewer students are posting on Reddit about it compared to other majors. There is an alternative data source, Pushshift. However, its documentation seems to be down at the moment, so I don’t currently know how to use it.

In all, this data frame can be used to help inform prospective and current students about the major in general, and the data frame columns give a general idea of how reliable or popular a topic is.

Github link:

https://github.com/tkillip7/inst414/blob/main/Assignment1_414.ipynb
