What’s The Word on The Hill?
A look into how the COVID-19 pandemic influenced conversations in the Cornell University subreddit
For many students at Cornell, Reddit is like an unofficial guidebook. Whether we have questions about admission, financial aid, housing, or the best dining halls on campus, the Cornell subreddit always seems to have an answer. But when the pandemic hit in March 2020, classes moved online, social gatherings were limited, and Cornell students began sharing new questions, comments, and concerns in the subreddit. So how did the pandemic shift conversations there?
Why Topic Model?
To answer this question, I used a text analysis method called topic modeling. Specifically, I used Latent Dirichlet Allocation (LDA) topic modeling, which uses statistics and probability to identify themes or “topics” in a collection of texts. It groups words into topics and calculates the probability that each word belongs to a topic based on how often words co-occur in the same documents. In this case, I examined the text in Cornell reddit posts to identify some of the main topics of discussion. While it’s not perfectly accurate, I felt that topic modeling was the best way to quickly identify the general themes being discussed in such a large subreddit.
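To make the idea of words being assigned to topics more concrete, here is a minimal sketch of LDA fitted with a toy collapsed Gibbs sampler. The article does not say which LDA implementation was used, so this is not the project's method: the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the four tiny example documents are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (doc_topic, topic_word):
    doc_topic[d] is the topic distribution of document d, and
    topic_word[k] counts the words currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_totals = [0] * num_topics                      # total words per topic
    assignments = []

    # Start by giving every word occurrence a random topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_counts[d][k] += 1
            topic_word[k][word] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assignments[d][i]
                doc_counts[d][k] -= 1
                topic_word[k][word] -= 1
                topic_totals[k] -= 1
                # A topic is likely if it is common in this document and
                # this word is common in that topic.
                weights = [
                    (doc_counts[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_counts[d][k] += 1
                topic_word[k][word] += 1
                topic_totals[k] += 1

    doc_topic = [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in counts]
        for doc, counts in zip(docs, doc_counts)
    ]
    return doc_topic, topic_word

docs = [  # made-up mini "posts", already tokenized
    "prelim exam grade professor lecture".split(),
    "exam grade prelim median course".split(),
    "quarantine testing covid campus health".split(),
    "covid health testing quarantine online".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(5) if c > 0])
```

Each row of `doc_topic` is the kind of per-post topic distribution quoted throughout this article, and the most common words in each `topic_word` entry play the role of a topic's "top words". A real analysis would use an established LDA library on far more text.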
In order to complete this project I used Web Scraping (gathering data from the internet) to collect and process posts from r/Cornell. Specifically, using the Pushshift API Wrapper psaw, I scraped Reddit posts directly from the subreddit. I created my own dataset of posts in the subreddit that had over 10 upvotes. My data set consists of 2573 text posts from December 8th, 2010 up until April 6th, 2021.
The data set contains the post titles, the text within each post, the upvote score, and the date that the submission was posted. I also filtered out any posts that did not have any text in the submission, as some users post only a title or include an attachment like an image. For the purposes of this project, I wanted to focus on the posts’ text for topic modeling.
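The filtering described above might look roughly like the following sketch. It assumes each scraped submission is a dictionary with `title`, `selftext`, `score`, and `created_utc` keys; those key names, the helper names `keep_post` and `to_row`, and the three sample records are assumptions for illustration, not the project's actual code.

```python
from datetime import datetime, timezone

MIN_SCORE = 10  # assumed threshold: the article keeps posts with over 10 upvotes

def keep_post(post):
    """Return True for posts that have body text and clear the upvote threshold."""
    text = (post.get("selftext") or "").strip()
    return bool(text) and post.get("score", 0) > MIN_SCORE

def to_row(post):
    """Keep only the four fields described in the article."""
    posted = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return {
        "title": post["title"],
        "text": post["selftext"].strip(),
        "score": post["score"],
        "date": posted.date().isoformat(),
    }

raw_posts = [  # made-up records for illustration
    {"title": "Best dining hall?", "selftext": "Looking for suggestions.",
     "score": 42, "created_utc": 1600000000},
    {"title": "Sunset on the slope", "selftext": "",
     "score": 310, "created_utc": 1600500000},
    {"title": "Prelim question", "selftext": "Is the median posted yet?",
     "score": 3, "created_utc": 1601000000},
]
dataset = [to_row(p) for p in raw_posts if keep_post(p)]
print(dataset)
```

Only the first record survives: the second has no text (for example, an image post) and the third falls below the upvote threshold.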
Gaps and Holes
One issue with this data set is that it’s missing posts made after April 6th, 2021. Given that I am examining how certain topics appear in the context of the pandemic, there are likely interesting topics in posts from late April or May 2021, when vaccine rollout in the United States improved and social gathering guidelines loosened. Including those posts might change the formation of topics slightly or alter the way that topics appear to change over time.
Also, there may be many other relevant posts that simply have fewer than 10 upvotes (as that was my criterion). However, I chose 10 upvotes as the minimum to make sure that I was using posts that were being recognized and interacted with by as many users as possible.
It’s also important to recognize that technically, I have not received consent from users to use their posts in my research. It’s possible that the people posting in this subreddit intend certain sensitive topics to be viewed only by the Cornell community. To protect their identities, I have not included any specific user names or post titles.
After topic modeling the entire Cornell subreddit data set, I ended up with 20 different topics. A few notable ones are listed below, each with a label that I assigned and the top words most likely to appear in the topic:
Academics: class, classes, semester, grades, final, grade, professor, take, exam, also, prelim, median, courses, lecture, first, taking, questions, course, level, professors
COVID/Pandemic Procedures: students, cornell, health, covid, semester, campus, quarantine, person, fall, online, time, student, ithaca, testing, housing, need, test, available, august, store
Concerns with Socializing and Friendships: like, feel, really, people, know, things, want, going, friends, lot, even, make, think, way, everyone, life, seems, else, anything, get
Recruiting/Meeting New People: people, friends, club, group, clubs, cornell, meet, new, know, fun, interested, everyone, join, anyone, want, party, get, together, freshman, make
Advice and Suggestions: know, would, anyone, looking, ithaca, thanks, wondering, campus, guys, like, really, also, need, going, good, want, love, anything, lot, freshman
Some topics, like “Academics”, are generally present regardless of date. As seen in Figure 1, “Academics” is a pretty consistent topic throughout the subreddit’s existence, with probabilities spiking around months like December 2020 to 0.15 or March 2011 to around 0.2. These are both important periods in the academic calendar, specifically the end of Fall semester or Spring mid-term season, respectively. So it’s understandable why students would be talking about this topic the most. Clearly, before and after the pandemic, grades are always a concern.
Where did “COVID” go…?
Regarding the topic that I labeled, “COVID/Pandemic procedures”, the top ten posts that had the highest probability of appearing in the topic had topic distributions ranging from 0.8198 to 0.4662. The post most likely to appear in the topic was a very lengthy summary of the University’s plan to reactivate campus for the Fall 2020 semester. Most of the top words like “students”, “health”, “online” and “quarantine” appear multiple times in this single post as well. The post’s length and singular focus on this reactivation announcement might explain why it has the highest topic distribution.
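Finding the posts "most likely to appear in a topic" amounts to sorting posts by their probability for that topic. A minimal sketch, assuming the topic model's output is a list of per-post topic distributions (the function name `top_posts` and the numbers are made up for illustration):

```python
def top_posts(doc_topic, topic, n=10):
    """Return (post_index, probability) pairs with the highest share of `topic`."""
    ranked = sorted(
        enumerate(row[topic] for row in doc_topic),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]

doc_topic = [  # made-up distributions over three topics, one row per post
    [0.10, 0.85, 0.05],
    [0.60, 0.30, 0.10],
    [0.20, 0.47, 0.33],
]
print(top_posts(doc_topic, topic=1, n=2))  # [(0, 0.85), (2, 0.47)]
```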
Also, all top ten posts are from the year 2020 or 2021. One post from March 2021 even shared a link and discussed a mass Johnson & Johnson vaccination site in Syracuse (probability ~0.4963). Thus, it’s clear that this topic encompasses posts mainly from 2020–2021, which I had expected.
According to Figure 2, it’s clear that most posts that have higher probabilities of containing this topic appear around late 2020 to early 2021, both before and during Cornell’s re-opening. The post with the highest probability (0.8198) appears on July 22nd, as do most of the posts with high topic distributions for this theme.
We also see that this topic is most likely to appear in posts in 2014 and Spring/Summer of 2020 (Figure 3). The spike in 2020 might be explained by increased questions about Cornell’s abrupt cancellation of classes, the uncertainty of reopening campus, or advice for new freshman admits. Or, given that a majority of users were in lockdown, it’s possible that people had more time to spend on social media and thus post on and scour the subreddit.
The appearance of this topic in 2014, however, was interesting.
The time series plot uses the average probability of the topic distributions, and there are only 25 posts in this dataset from 2014. With so few posts, a single post with a moderately high probability can skew the average for the “COVID/Pandemic Procedures” topic in the months of 2014.
Figure 4 displays this: there were more posts overall in later years, from around 2019 to 2021. The number of posts gradually increased with time, dropped drastically during the early months of 2020, then rose to a peak around the end of 2020 and the beginning of 2021. So, with more posts in more recent years, the average probabilities for each month are probably lower, since each month’s total is averaged over many more posts. This shows that mean probability might not be the most reliable measure for certain topics. Still, this topic is very likely to appear around 2020 (probability ~0.14), which matches the pandemic’s timeline (Figure 3).
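The effect of post counts on the monthly average can be seen in a small sketch. It assumes each post has been reduced to a `(date, probability)` pair for one topic; the function name `monthly_summary` and all of the values are made up for illustration.

```python
from collections import defaultdict

def monthly_summary(rows):
    """Group (date, probability) rows by month and return {month: (mean, count)}."""
    by_month = defaultdict(list)
    for date, prob in rows:
        by_month[date[:7]].append(prob)  # "YYYY-MM-DD" -> "YYYY-MM"
    return {
        month: (sum(probs) / len(probs), len(probs))
        for month, probs in sorted(by_month.items())
    }

rows = [  # made-up values: a sparse month and a busier month
    ("2014-03-02", 0.30), ("2014-03-19", 0.02),
    ("2020-07-22", 0.82), ("2020-07-23", 0.05),
    ("2020-07-25", 0.01), ("2020-07-28", 0.04),
]
for month, (mean, count) in monthly_summary(rows).items():
    print(month, round(mean, 2), count)
```

In this toy data, the sparse 2014 month reaches a mean of 0.16 from a single post at 0.30, while the 2020 month needs a post at 0.82 to reach 0.23 because it is averaged with three low-probability posts. Reporting the post count next to each mean makes that kind of skew easier to spot.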
Since the time series graph is not as reliable, I wanted to look at different topics and how they might change over the course of the pandemic. So I split the data set into two smaller data sets: Pre-COVID posts and Post-COVID posts (made after the pandemic was announced). The Pre-COVID data includes 1899 posts dated from December 8th, 2010 to before March 14th, 2020 (one day after classes were suspended due to the pandemic). The Post-COVID data consists of 674 posts from March 14th, 2020 up to April 6th, 2021.
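A split like this one comes down to comparing each post's date against the cutoff. A minimal sketch, assuming each post is a dictionary with a `date` field (the function name `split_by_cutoff` and the sample posts are made up for illustration):

```python
from datetime import date

CUTOFF = date(2020, 3, 14)  # from the article: one day after classes were suspended

def split_by_cutoff(posts, cutoff=CUTOFF):
    """Return (pre, post): posts dated before the cutoff, and on or after it."""
    pre = [p for p in posts if p["date"] < cutoff]
    post = [p for p in posts if p["date"] >= cutoff]
    return pre, post

posts = [  # made-up records for illustration
    {"title": "Prelim advice", "date": date(2015, 10, 2)},
    {"title": "Classes suspended?", "date": date(2020, 3, 13)},
    {"title": "Moving out early", "date": date(2020, 3, 14)},
    {"title": "Arrival testing", "date": date(2020, 8, 20)},
]
pre_covid, post_covid = split_by_cutoff(posts)
print(len(pre_covid), len(post_covid))  # 2 2
```

Every post lands in exactly one group, so the two group sizes add up to the size of the full data set, just as 1899 and 674 add up to 2573 here.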
Based on earlier analysis, the presence of the “COVID/Pandemic Procedures” topic over time in the subreddit seems to make sense since the pandemic began in 2020. But at the same time, the top three words in the topic for “COVID/Pandemic Procedures” don’t necessarily relate to the pandemic. “covid” is only the fourth most probable word in the list. So I wanted to look at “health”, which is the third most probable to appear in the topic. I wondered, how might “health” get categorized differently based on the time frame and context of posts?
What Kind of Health?
Interestingly, when I observed both Pre-COVID and Post-COVID posts, “health” appeared in two different topics as seen in Figure 5.
In the Pre-COVID topic that I labeled “Mental Health Struggles and Concerns”, “health” appears alongside text about mental health resources and concerns about mental health issues. The post most likely to contain this topic has a topic distribution of 0.575. In it, the user is seeking help for their mental health struggles on campus. The next nine posts follow similar patterns, with topic distributions ranging from 0.4531 to 0.5278. Many discuss confessions about poor work habits and the use of mental health resources, and some users even question whether to drop out.
On the other hand, when looking at Post-COVID posts, the words for the topic containing “health” are almost identical to those of the “COVID/Pandemic Procedures” topic for the entire dataset. For example, the post with the highest topic distribution in the Post-COVID topic (0.7262) is the same top post that appears in “COVID/Pandemic Procedures”. The remaining nine posts have distributions ranging from 0.2889 to 0.6313. They not only discuss reopening plans, but also cover Daily Check requirements, arrival tests for move-in, and information about online textbooks and course materials for incoming freshmen.
So, over time, “health” seems to have become more associated with physical health, specifically COVID-19 and the new on-campus COVID-19 procedures. In other words, based on the topic model and the sheer volume of posts from 2020–2021 alone, the word “health” is more likely to appear in posts about COVID-19, new guidelines, and protocols (Post-COVID) than in posts about mental health (Pre-COVID).
But discussions of mental health haven’t disappeared at all; they just come under a different context in Post-COVID reddit posts.
For example, one topic among Post-COVID posts, which I labeled “Adjusting to Online School”, contains words like, “even”, “get”, “school”, “many”, “still”, “need”, and “work” (Figure 5). But when observing the words in context, the top ten posts for this topic mainly discuss struggles with online and remote learning and their impact on students’ mental health.
For instance, in the post with the highest topic distribution for this topic (0.4935), one user discussed how their depression worsened after returning home, which resulted in drastic changes in their work ethic and motivation. They then mentioned that they had recently begun attending lectures and taking notes again, and that they were proud of their progress.
Therefore, while the primary conversation on “health” appears to have shifted from mental to physical health with the pandemic, mental health has not been completely erased from the picture. It’s simply discussed in conjunction with concerns over online learning, which became widespread with the shift to virtual classes during the pandemic.
Using techniques like topic modeling, we can see the impact that major global events have on smaller scales through something as easily accessible as a subreddit. Casual reddit posts can actually tell us a lot about how students are reacting to and discussing these issues. For further research, we could examine how other major events on more local levels (e.g., elections in New York State or Ithaca) have shifted or impacted conversations among students.
By using topic modeling, we can clearly see how conversations in the thread have shifted. For instance, there is emphasis on physical health and safety while new mental health struggles can be seen through the lens of virtual learners.
There are many other topics evident in the thread that weren’t mentioned or picked up by the topic model. And the posts mentioned here are not the only significant indicators for specific topics. So, remember this is a snapshot of the amazing ways that technology can help us explore large trends and themes in qualitative data.
Note: Last updated Sept 12, 2021. The article previously incorrectly described the topic “Adjusting to Online School” as part of the Pre-COVID reddit post group. This has been corrected, and the topic is now classified as a Post-COVID topic.