CapstoNews: Reading Balanced News

KLau
SFU Professional Computer Science
12 min read · Apr 20, 2020

Authors: Max Ou, Kenneth Lau, Juan Ospina, and Sina Balkhi

This blog is written and maintained by students in the Professional Master's Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Motivation and Background

Stop and ask yourself: how often do you question the news you read, watch, or listen to, especially on political topics? Media outlets have a profound impact on shaping public opinion. Many produce news with a particular political stance to suit their agenda, or they tell us what we want to hear, reinforcing our existing beliefs. Politically centrist news outlets exist, but the majority of outlets carry some bias. Bias is not inherently bad; receiving news from only one source is, because readers become less informed about the multiple perspectives that can exist on the same topic. Our vision is to provide an outlet for people who want to stay informed on all sides of the political spectrum, so they can form well-informed opinions.

Problem Statement

Currently, some websites provide balanced news coverage based on surveys, editorial reviews, and third-party data. Our product, in contrast, uses data science to determine the biases of articles and to find their siblings (articles about the same topic but with different political biases). To address this problem, we need to answer two questions.

The first question is how to build a machine learning model that can predict the political bias of a news article. To do this, we need an appropriate dataset that includes the article text, its bias (left, center, or right), and its category (business, culture, etc.). Furthermore, the articles should be evenly distributed across labels, since an imbalanced dataset would hurt the model's performance.

The second question is how to compare different political stances. Given a working machine learning model, we need an algorithm that can find similar articles across political biases. There are many possible ways to search for siblings, and we want a solution that is both effective and efficient.

Data Science Pipeline

Figure 1: A high-level overview of our entire data science pipeline

The preliminary work consisted of several text preprocessing modules built on the Newspaper3k library, used to prepare data for training spaCy's statistical models for bias and category classification. For political bias classification, we used the "All the News" dataset from Kaggle, which consists of around 150,000 articles published from 2016 to July 2017. For training the bias classification model, however, we used only a balanced subset of 30,000 articles: 10,000 samples for each political leaning.

The dataset itself does not come with bias labels. Instead, we used the publication as a proxy for the political bias of an article. We used https://mediabiasfactcheck.com/ and https://www.allsides.com to map each publication to its political bias. For example, Breitbart and Fox News are labelled right-leaning, Reuters is labelled center-leaning, and CNN and BBC are labelled left-leaning. We did this because a moderate-sized dataset with articles labelled by their political bias does not exist. Nonetheless, we believe that this heuristic is logically sound and that the publication of an article serves as a reliable proxy for its political leaning.

Here is a breakdown of the mapping used (a small code sketch of this lookup follows the list):

Left: The Atlantic, Buzzfeed News, CNN, The Guardian, New York Times, Talking Points Memo, Washington Post, Vox. (10,000 samples)

Right: Breitbart, National Review, New York Post, Fox News. (10,000 samples)

Center: Business Insider, NPR, Reuters. (10,000 samples)
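
As the list above suggests, this lookup is just a dictionary. A minimal sketch covering these publications (the helper name is ours):

```python
# Publication -> proxy bias label, per mediabiasfactcheck.com and allsides.com.
PUBLICATION_BIAS = {
    "The Atlantic": "left", "Buzzfeed News": "left", "CNN": "left",
    "The Guardian": "left", "New York Times": "left",
    "Talking Points Memo": "left", "Washington Post": "left", "Vox": "left",
    "Breitbart": "right", "National Review": "right",
    "New York Post": "right", "Fox News": "right",
    "Business Insider": "center", "NPR": "center", "Reuters": "center",
}

def bias_label(publication):
    """Return the proxy bias label for a publication, or None if unmapped."""
    return PUBLICATION_BIAS.get(publication)
```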

We used Kaggle's News Category dataset to train another statistical model for news category classification. The dataset mainly consists of Huffington Post articles published between 2012 and 2018, with a significant portion from other publishers such as Vox, The Verge, etc. We used the links provided in the dataset to scrape the article content for about 100,000 articles. We dropped several of the rarer categories and merged others under their "umbrella" category to obtain a more balanced dataset. The final categories and sample sizes are:

Business: 3,267

Culture: 16,727

Entertainment: 12,849

Living: 10,297 (consists of Style, Green Living, etc.)

Politics: 32,722

Science: 2,316

Society: 14,459

Sports: 3,544

World: 7,479 (consists of World News, i.e. outside North America)

To train the bias and category models, we initialized each with one of spaCy's pre-trained starter models (a transfer learning starter pack with pre-trained weights) to achieve better accuracy. We then added a TextCategorizer pipeline component, or textcat. The textcat component takes in a document, transforms the words into GloVe vectors, passes the vectors through two convolutional layers, and outputs the label classes with their confidence scores.
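
As a rough sketch, here is how that setup looks with the spaCy v2 API that was current at the time (the starter model name is one of spaCy's vectors packages; the labels are ours):

```python
import spacy

# Load a starter model with pre-trained GloVe vectors (spaCy v2 era).
nlp = spacy.load("en_vectors_web_lg")

# Add a TextCategorizer with mutually exclusive classes; "simple_cnn"
# is spaCy v2's CNN-based textcat architecture.
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat, last=True)

for label in ("left", "center", "right"):
    textcat.add_label(label)
```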

To train a model, we first loaded our dataset, shuffled it, and split off a portion to hold back for evaluation. We then created a dictionary whose keys are the labels, with a value of 1 if the article's bias or category matches the label and 0 otherwise. With the article text and this dictionary as training examples, each model was trained for 20 iterations.
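
A minimal version of that loop, assuming the `nlp` and `textcat` objects from the sketch above and spaCy v2's minibatch helpers:

```python
import random
from spacy.util import minibatch, compounding

# Each example pairs the article text with a one-hot "cats" dict.
train_data = [
    ("Full text of a left-leaning article ...",
     {"cats": {"left": 1, "center": 0, "right": 0}}),
    # ... the rest of the labelled articles
]

optimizer = nlp.begin_training()
for epoch in range(20):  # 20 iterations, as described above
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    print(f"epoch {epoch}: textcat loss {losses.get('textcat', 0.0):.4f}")
```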

To evaluate our models, we applied them to a sample of 10,000 random articles from each dataset and compared each article's ground truth to its predicted label. The bias model achieved 91% accuracy and the category model 96%.

We then ran the models over the entire dataset, including both training and testing data, to obtain model-predicted labels with their corresponding scores for both bias and category. We stored these predictions along with the articles in MongoDB as our final article pool for sibling lookup.
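
A sketch of that storage step with pymongo; the connection string, database, and field names below are illustrative:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
articles = client["capstonews"]["articles"]

articles.insert_one({
    "url": "https://example.com/some-story",
    "title": "Some story",
    "text": "Full article text ...",
    "bias": "left",           # model-predicted label
    "bias_score": 0.97,       # classifier confidence
    "category": "Politics",   # model-predicted category
    "keywords": {"election": 0.041, "senate": 0.032},  # TextRank ranks
})
```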

We used the TextRank algorithm to extract keywords for each article. TextRank is a graph-based algorithm that uses PageRank to compute the importance of words in a given piece of text. We refer the reader to the original paper (Mihalcea and Tarau, 2004) for a detailed study of TextRank.

In our case, we extracted lemmatized tokens tagged as nouns, verbs, adjectives, or proper nouns (excluding "mr.", "ms.", and "say"). We lemmatized the tokens because lemmas are more robust to variations of keywords that share the same meaning. We then built an undirected graph of tokens in adjacency matrix form, ran the PageRank algorithm to compute the rank (importance score) of each keyword, and returned the top 100 keywords of a given article.
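
A simplified sketch of this step, substituting networkx's PageRank for a hand-rolled implementation (the window size, damping factor, and convergence threshold match the values given below):

```python
import networkx as nx
import spacy

STOP_LEMMAS = {"mr.", "ms.", "say"}
KEEP_POS = {"NOUN", "VERB", "ADJ", "PROPN"}

def textrank_keywords(doc, window=4, top_k=100):
    """Rank lemmatized candidate tokens from a spaCy Doc by PageRank
    over their co-occurrence graph; return the top_k (keyword, score) pairs."""
    candidates = [
        tok.lemma_.lower() for tok in doc
        if tok.pos_ in KEEP_POS and tok.lemma_.lower() not in STOP_LEMMAS
    ]
    graph = nx.Graph()
    for i, word in enumerate(candidates):
        # Link each token to its neighbours inside the co-occurrence window.
        for other in candidates[i + 1 : i + window]:
            if word != other:
                graph.add_edge(word, other)
    ranks = nx.pagerank(graph, alpha=0.85, tol=1e-5)  # damping factor 0.85
    return sorted(ranks.items(), key=lambda kv: -kv[1])[:top_k]

nlp = spacy.load("en_core_web_sm")
print(textrank_keywords(nlp("Full article text ..."))[:10])
```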

In our experiments, we found that this area could be improved considerably. For example, we could extract better, more informative keywords by tweaking our rule-based matching in spaCy. We chose a window size of 4 and a damping factor of 0.85 with a convergence threshold of 1e-5; all of these hyperparameters could be tuned to extract better keywords.

In production, we added our TextRank module as a component in the spaCy nlp pipeline. This means we can access the keywords as an attribute of the doc object created by calling nlp on an article.
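
In spaCy v2, a custom component is just a callable added to the pipeline. A sketch of how the keywords can be exposed on the doc (the extension name is ours, and `textrank_keywords` is the sketch above):

```python
from spacy.tokens import Doc

# Register a custom extension so keywords travel with each doc.
Doc.set_extension("keywords", default=[])

def textrank_component(doc):
    doc._.keywords = textrank_keywords(doc)  # sketch from above
    return doc

nlp.add_pipe(textrank_component, name="textrank", last=True)

doc = nlp("Full article text ...")
print(doc._.keywords[:10])  # top (keyword, score) pairs
```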

Methodology

Figure 2: A visual overview of our algorithm

Definition: We define “siblings” as articles that are talking about the same topic but have different political biases.

The article submission is the single entry point to our application. When a user submits the URL of an article, we use Newspaper3k to extract the article's content and metadata. We chose Newspaper3k because it is a general-purpose news scraper, which spared us from writing individual scrapers for each publisher and lets us handle articles from any publisher the user submits.
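
The extraction step itself takes only a few lines with Newspaper3k (the URL below is a placeholder):

```python
from newspaper import Article

article = Article("https://example.com/some-news-story")  # user-submitted URL
article.download()  # fetch the HTML
article.parse()     # extract content and metadata

print(article.title, article.authors, article.publish_date)
text = article.text  # cleaned article body, ready for classification
```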

After the content and metadata of the submitted article are extracted, we use two pre-trained spaCy statistical models to predict its bias and category. We chose spaCy because it is an industrial-strength NLP library with a wide range of functionality, including neural networks for text classification. Since spaCy's nlp pipeline does not allow more than one classification component, we created two pipelines: one with the tagger, parser, bias classifier, and TextRank; the other with just the category classifier.
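
At prediction time, the two pipelines could be applied like this (the packaged model names are hypothetical):

```python
import spacy

# One classifier per pipeline, since spaCy v2 allows only one textcat each.
nlp_bias = spacy.load("capstonews_bias_model")          # tagger, parser, bias textcat, TextRank
nlp_category = spacy.load("capstonews_category_model")  # category textcat only

text = "Full article text ..."
doc = nlp_bias(text)
bias, bias_score = max(doc.cats.items(), key=lambda kv: kv[1])

category, _ = max(nlp_category(text).cats.items(), key=lambda kv: kv[1])
```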

After extracting the submitted article's keywords and predicting its bias and category, we store the article along with its prediction results in MongoDB. We chose MongoDB as our project database for several reasons. First, it offers cloud-provider distributions with easy, fast setup through web access portals. Second, it is highly scalable: we can grow our database in a few clicks through the MongoDB Atlas web console. Third, it is a document-oriented database that stores data as JSON-like documents, which fits our articles and their metadata well. Finally, it offers high read performance. Once an article is saved, we never need to update it, but we do need to search through thousands of articles to find potential siblings, so database read performance is critical for our project.

To find the submitted article's siblings, we query the database for articles with the other political biases. For example, if the submitted article is left-leaning, we fetch articles labelled right and center. This is also where the category matters: we keep the computation low by fetching only articles in the same category. It would make no sense, for example, to find the right-leaning entertainment sibling of a left-leaning political article.
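
With pymongo, that candidate query is a single find() call (field names follow the storage sketch above; `submitted` stands for the stored document of the user's article):

```python
# Candidate siblings: same category, any other bias label.
candidates = list(articles.find({
    "category": submitted["category"],
    "bias": {"$ne": submitted["bias"]},
}))
```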

For each candidate sibling of the submitted article, we compute the sum of the rank scores of their common keywords. In other words, we find the keywords the two articles share and sum their scores. The candidate with the maximum sum is taken as the closest sibling. In effect, this sum is a similarity measure, and we are computing document similarity with this approach.
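
A sketch of that scoring step; here we sum the rank scores from both articles over their shared keywords, which is one reasonable reading of the description above:

```python
def sibling_score(kw_a, kw_b):
    """Sum of TextRank scores over the keywords two articles share.
    kw_a and kw_b map keyword -> rank score."""
    common = kw_a.keys() & kw_b.keys()
    return sum(kw_a[k] + kw_b[k] for k in common)

def closest_sibling(submitted, candidates, bias):
    """Highest-scoring candidate with the given bias label, or None."""
    pool = [c for c in candidates if c["bias"] == bias]
    return max(
        pool,
        key=lambda c: sibling_score(submitted["keywords"], c["keywords"]),
        default=None,
    )
```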

We chose to apply TextRank this way mainly because we thought it would be an interesting experiment. TextRank is commonly used for keyword extraction and extractive summarization, but we could not find any previous work that used it for document similarity. We employed the heuristic that if similar articles have similar keywords, then the ranks of those keywords should give us a fairly reasonable similarity measure between the articles. Because TextRank uses PageRank to compute these importance scores, we have an empirical method of comparing two articles. In fact, we believe this is a potential research topic.

After computing the closest siblings of the submitted article this way, we publish all three sibling articles with their metadata in a Dash/Flask web application. We chose Dash because it is an easy-to-use UI library for creating analytical web applications. Our Dash app lets the user input a news article URL; acting as the View-Controller in a Model-View-Controller (MVC) design, it passes the article and its metadata to the back-end, which determines the article's political bias, category, and siblings, and returns this information to the front-end for the user to view. In the final stages of the project, we scaled back some of the Dash implementation and used native Flask for more flexibility in designing and developing the final application.
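
A minimal Flask sketch of that request flow; `analyze_article` and `find_siblings` are hypothetical helpers wrapping the scraping, classification, and sibling-lookup steps above:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        url = request.form["url"]
        submitted = analyze_article(url)     # hypothetical: scrape + classify + keywords
        siblings = find_siblings(submitted)  # hypothetical: closest siblings from MongoDB
        return render_template("results.html", article=submitted, siblings=siblings)
    return render_template("index.html")

if __name__ == "__main__":
    app.run(debug=True)
```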

Figure 3: High-level architecture diagram of our solution

Evaluation

As described above, we evaluated our models on a sample of 10,000 random articles from each dataset, comparing each article's ground truth to its predicted label. These tests yielded accuracy scores of 91% for the bias model and 96% for the category model on the test data, implying that our models perform well at identifying the political leaning and category of news articles.

With regard to our sibling-finding algorithm, we found that the closest siblings it returns have only limited accuracy. TextRank produced informative, relevant keywords that captured the context and topic of the articles. To obtain better accuracy in real-world scenarios, however, TextRank-based software should be used to support a human bias assessment rather than fully replace it.

Quantifying our results with metrics was our biggest challenge, as there is no human-annotated data that could serve as a gold standard against which to evaluate our results. Instead, we inspected individual samples while trying out our application and found that more than half of the time, the siblings were sensible (articles about the same topic, but with different political biases).

Data Product

After the user enters an article URL into the search bar, our front-end sends it to the Newspaper3k module, which scrapes the page and extracts the article content.

Figure 4: CapstoNews search bar

In the back-end, our models predict the article's bias and news category, and the TextRank algorithm extracts its top 100 keywords. The output of this process is the article itself along with its siblings. On the left side, a shortcut lets the user quickly switch between the different siblings.

Figure 5: News article and its siblings

To the right of each article, the user can view its publication information, political bias, bias confidence, and the top 10 keywords with their scores.

Figures 6, 7 and 8: News article and its attributes (publication information, political bias, bias confidence, and the top 10 keywords with their scores)

For readers, this surfaces perspectives they might never have encountered before using our product. They will gain a greater understanding of the political spectrum and expand beyond their own information bubble.

Lessons Learnt

It's fair to say that this project was not simple. While creating our product, we learned about many different technologies and their applications. Implementing the NLP pipeline and creating custom components with spaCy was a first for us. We wanted an industrial-strength NLP library instead of the traditional NLTK; unlike NLTK, spaCy supports integrated word vectors and neural network models.

Learning to use TextRank to find sibling articles was a novel way to address our second challenge. TextRank is typically used to score each sentence in a text, take the top-n sentences, and sort them in the order they appear to build an automatic summary. Instead, we used TextRank to search for articles with similar keywords and rankings.

We found Dash to be a good choice for our visualization scope because of its native visualization components and its easy integration with data science pipelines developed in Python.

Summary

Our project aims to provide an application for people who want to stay informed on all sides of the political spectrum to build well-informed opinions. To accomplish this, our product uses data science to determine the biases of news articles and to find their siblings (articles about the same topic but with different political biases). Specifically, we developed machine learning models to predict the political bias and the category (business, culture, etc.) of a news article.

Additionally, we used the TextRank algorithm to find article siblings across political biases. All input information is captured, and the corresponding results are presented in a Dash/Flask web application, a popular framework for building data science visualizations.
