Natural Language Processing and News Media

Grant Wilson
Nov 5 · 3 min read

I recently completed a project utilizing Natural Language Processing to classify posts as coming from either the WorldNews or News communities from Reddit.

I chose the News and WorldNews subreddits because I knew they would be fairly difficult to tell apart. Reddit frequently allows similar subreddits to arise with minimal differences, which typically gives users more freedom to delve into the nuances of their interests. However, it can make classification difficult, as subreddits may link to similar articles or have similar threads and posts.

Both the News and WorldNews subreddits curate news articles for their readers. One might assume that News would focus on US news while WorldNews focuses on international matters. In reality, however, both subreddits cover both US and international news.

This presents an interesting problem for a classification model. Are certain topics more frequent in one subreddit than the other? What about political leanings? Do these communities rely on different news sources for their posts? Do posts in one community generally have higher scores than those in the other? The models built for this project seek to answer these questions.

Classifying posts in different subreddits may be an interesting thought experiment, but what practical applications does it have? An increasingly interconnected society with a 24/7 news cycle produces more media than we could possibly have time to consume. It can be incredibly easy these days to rely on one or two news outlets that already cater to our existing opinions without looking for different interpretations of the facts.

A well-rounded news diet is essential for an informed view of the issues. This includes both understanding multiple sides of an argument and understanding a diverse range of topics. So often, we are confronted by calls of “Fake News” that highlight how differently our media outlets portray the same stories. In practice, it can be hard to find unbiased news, so we are forced to consult many outlets. This way, we at least consume enough of the facts to form our own interpretation of the full picture.

These two key points, understanding multiple sides of an argument and consuming diverse media, can be supported through model building and interpretation. With perfectly balanced classes, a model that cannot tell the subreddits apart has an equal chance of assigning a post to either one. As the subreddits diverge in topic of conversation, their posts become easier to categorize. We can therefore expect high accuracy to indicate subreddits that are less alike. Additionally, breaking down each post’s title allows us to examine topic frequency and may hint at the leanings of the subreddit. For example, a subreddit talking about protests may talk more about police action or protester disruptions, depending on the subreddit’s opinion of the events.
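To make the ideas of a balanced baseline and title-level topic frequency more concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the titles are hand-written placeholders rather than scraped posts, and the helper names (`tokenize`, `baseline_accuracy`, `term_counts`, `distinctive_terms`) and the small stopword list are assumptions made for illustration.

```python
from collections import Counter
import re

# Hypothetical, hand-written titles for illustration only;
# these are NOT posts collected for the project.
posts = [
    ("news", "Police respond to protest downtown"),
    ("news", "City council votes on police budget"),
    ("news", "Protest over school budget draws crowd"),
    ("worldnews", "Protesters disrupt summit as leaders meet"),
    ("worldnews", "Leaders meet to discuss trade deal"),
    ("worldnews", "Trade deal talks stall at summit"),
]

# A tiny, assumed stopword list; a real project would use a fuller one.
STOPWORDS = {"to", "on", "as", "at", "over", "the", "a", "of"}


def tokenize(title):
    """Lowercase a title and keep its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]


def baseline_accuracy(labels):
    """Accuracy of always guessing the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def term_counts(posts):
    """Map each subreddit label to a Counter of the words in its titles."""
    counts = {}
    for label, title in posts:
        counts.setdefault(label, Counter()).update(tokenize(title))
    return counts


def distinctive_terms(counts, label, other, n=3):
    """Top words that appear more often under `label` than under `other`."""
    diff = {w: c - counts[other][w] for w, c in counts[label].items()}
    # Sort by largest difference first, then alphabetically for stable output.
    ranked = sorted(diff.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, d in ranked[:n] if d > 0]


labels = [label for label, _ in posts]
counts = term_counts(posts)

print("baseline accuracy:", baseline_accuracy(labels))
print("news:", distinctive_terms(counts, "news", "worldnews"))
print("worldnews:", distinctive_terms(counts, "worldnews", "news"))
```

With three titles per subreddit, the baseline works out to 0.5, which is the "equal chance" case: any trained classifier has to beat that number to show that the communities actually differ. The lists of distinctive terms are the kind of output one might inspect for hints about topic focus or framing, though raw count differences on a handful of titles are far too crude to support real conclusions.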

Further analyzing how we consume media is essential for creating a civil public discourse. So often we see online arguments devolving into name-calling and separation. This is due to a reactionary focus on right vs. wrong instead of a proactive conversation about shared values and the effectiveness of public policy strategies. Personal news media analysis can move us in the right direction.

The unexpected value of NLP classification was that it can hint at the similarity of the language used in two distinct communities. However interesting this may be as a thought experiment, a news classification model may be difficult to generalize over the long term. Topics most recently in the news will be frequent across similar communities, and when the scope of the daily news ranges from public policy to international affairs, it can be difficult to construct a corpus broad enough to encompass all newsworthy topics.

While the models constructed for this project may fall short of perfect classification, the use of NLP in the analysis of news media holds exciting promise for analyzing partisan bias and promoting a multi-faceted understanding of the issues of the day.