Predicting Political Bias with Python

Arnaud Rachez
Published in Linalgo · 8 min read · Oct 19, 2017

Written by Arnaud Rachez and Rodrigo Castro, September 30, 2017

Recent scandals around fake news have spurred an interest in programmatically gauging the journalistic quality of an article. Companies like Factmata and Full Fact have received funding from Google, and Facebook launched its “Journalism Project” earlier this year to fight the spread of fake stories in its feed.

Discriminating between facts and fake information is a daunting task, but oftentimes the publisher is a good proxy for the journalistic quality of an article. And while there is no objective metric for evaluating the quality of a newspaper, its overall quality and political bias are generally agreed upon (one can, for example, refer to https://mediabiasfactcheck.com/).

In this article, we present a few techniques to automatically assess the journalistic quality of a newspaper. Our models, although quite simple, allow us to cluster similar newspapers on a 2D map and perform reasonably well at predicting political bias (right or left). The entire code of the analysis can be found on GitHub.

Previous Work

Media Bias Fact Check determines a publisher’s bias by subjectively assessing four variables, including wording, factual reporting, and political affiliation. A crowd-sourced poll serves to verify their findings on many U.S. outlets.

Researchers at Stanford proposed a different approach to detecting systematic bias (Quotus). Their system encodes quoting patterns over a large dataset of Barack Obama’s speeches and their media coverage. They found that quoting patterns align reasonably well with political ideology and outlet type.

Recurrent Neural Networks have also been tried for detecting systematic bias, but, to our knowledge, none of these approaches used a sufficiently large dataset.

The Data

We manually chose 67 online publishers from a balanced pool of left, right, factual, and non-factual reporting outlets. We used WebHose.io to obtain a snapshot of the web content created by those publications via their Archived Data API. We downloaded 7.9 million articles, about 11 GB compressed (~60 GB uncompressed).

We used stratified sampling to get an even number of articles per publisher. We then filtered them down to 12 online publishers while keeping the balance between left and right. The resulting 103K articles constitute the dataset used in all of our experiments.
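As an illustration, here is a minimal sketch of that balancing step with pandas. The file name, the `domain` and `text` column names, and the list of selected publishers are our own placeholders, not the original code:

```python
import pandas as pd

# Hypothetical sketch: `articles` has one row per article, a `text` column
# with the article body and a `domain` column identifying the publisher.
articles = pd.read_json("webhose_snapshot.jsonl", lines=True)  # assumed file name

# Keep only the 12 hand-picked publishers (placeholder list, balanced left/right).
selected_domains = ["breitbart.com", "dailykos.com"]  # ... plus 10 more
articles = articles[articles["domain"].isin(selected_domains)]

# Stratified sampling: draw the same number of articles from each publisher.
n_per_domain = articles.groupby("domain").size().min()
balanced = (
    articles.groupby("domain", group_keys=False)
            .apply(lambda g: g.sample(n=n_per_domain, random_state=0))
)
```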

The data is available on request.

Additionally, we make available, in pickle format, the JSON responses received from Google’s Entity Sentiment NLP API for a subset of ~85K articles.

Approaches

Clustering

Our first assumption was that we could estimate the bias and quality of a journal by looking purely at the distribution of words and topics across publishers. We decided to model our corpus as a collection of bag-of-words documents. This is a very common simplification that completely discards any syntactic information contained in the articles. While a lot of information is lost in this model, we expected that a purely lexical analysis of the articles would still retain enough signal about political and quality biases.
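As a sketch of this step (using scikit-learn here; the original analysis may have used a different library), the bag-of-words representation can be built with a simple count vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# `balanced` is the DataFrame from the previous sketch; `text` holds article bodies.
texts = balanced["text"].tolist()

# Bag of words: each article becomes a sparse vector of word counts,
# discarding word order and syntax entirely.
vectorizer = CountVectorizer(max_features=100_000, stop_words="english")
bow = vectorizer.fit_transform(texts)
print(bow.shape)  # (number of articles, vocabulary size)
```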

On top of the bag of words, we trained a Latent Dirichlet Allocation (LDA) model. Again, this is a widely used model in document analysis; it makes the assumption that each document is a mixture of topics, and that each topic in turn is simply a collection of words with associated probabilities.

We trained our model to automatically extract 20 topics and further reduced the dimensionality of the article embeddings by projecting them onto a 3D space using t-SNE. Below is a visualization of the articles using TensorBoard:
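A minimal sketch of this pipeline, again with scikit-learn (the original may well have used gensim, and TensorBoard’s projector for the plot itself):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

# 20-topic LDA on top of the bag-of-words matrix from the previous sketch.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
topic_mixtures = lda.fit_transform(bow)  # shape: (n_articles, 20)

# Project the 20-dimensional topic mixtures to 3D for visualization.
tsne = TSNE(n_components=3, random_state=0)
articles_3d = tsne.fit_transform(topic_mixtures)
```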

Articles with similar meaning (for example, articles about the American presidential election) tend to cluster together.

To visualize the biases per journal, we aggregated all articles under one domain into a single representation by averaging their embeddings. We then wanted a 2D map with all right-wing journals on the right, all left-wing journals on the left, high-quality journals at the top and low-quality ones at the bottom. To achieve that, we used PCA to project onto a 2D space and then rotated and translated the embedding, using the difference between reference journals to define the x and y axes. For example, for the x-axis (political bias) we used the vector Breitbart.com minus Dailykos.com as the reference.
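A rough sketch of that axis construction, under the assumption that per-domain embeddings are averages of the article topic mixtures (the dictionary keys and variable names are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Average the article embeddings per domain to get one vector per journal.
domains = balanced["domain"].to_numpy()
domain_vectors = {d: topic_mixtures[domains == d].mean(axis=0)
                  for d in np.unique(domains)}

# Reduce the per-domain vectors to 2D with PCA.
names = list(domain_vectors)
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(np.stack([domain_vectors[d] for d in names]))
coords = dict(zip(names, coords_2d))

# Political-bias axis: the direction from Dailykos towards Breitbart.
x_axis = coords["breitbart.com"] - coords["dailykos.com"]
x_axis /= np.linalg.norm(x_axis)
y_axis = np.array([-x_axis[1], x_axis[0]])  # perpendicular direction

# Re-express every journal in the rotated coordinate system.
journal_map = {d: (float(c @ x_axis), float(c @ y_axis)) for d, c in coords.items()}
```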

Here is the result of this projection (data & graph available here):

The mapping is not perfect but already captures a few interesting features about the publishers. One way to improve the visualization would be to play with the definition of the x and y axes.

Logistic Regression

The previous map is constructed in a completely unsupervised way: the information about political bias is only used in the very last step, when choosing the axes onto which to project the publishers. Since we have access to political labels for the journals appearing in our corpus, we also tried a supervised approach.

Again, we chose the bag-of-words model for simplicity and used logistic regression to predict the political bias of each article.

This is obviously a very difficult task, as we are using publisher-level labels for an article-level bias prediction. Nevertheless, we expected that, in aggregate, the noise present in individual articles would cancel out and that the algorithm would be able to capture patterns related to political bias.

We used 7 labels for this task: extreme left, left, left-center, center, right-center, right and extreme right. Setting up the task is quite straightforward; we used 57 journals for training and 12 for testing.
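A minimal sketch of this supervised setup with scikit-learn. The DataFrame `df`, the `bias_label` column and the `train_domains` list are illustrative placeholders, not the authors’ exact code; the key point is that the split is done by publisher, not by article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# `df` holds `text`, `domain` and `bias_label` (one of the 7 classes);
# `train_domains` is the assumed list of 57 training journals.
train_mask = df["domain"].isin(train_domains)
test_mask = ~train_mask

vectorizer = CountVectorizer(max_features=100_000, stop_words="english")
X_train = vectorizer.fit_transform(df.loc[train_mask, "text"])
X_test = vectorizer.transform(df.loc[test_mask, "text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, df.loc[train_mask, "bias_label"])

# One possible aggregation: the most frequent predicted label per test journal.
per_journal = (
    df.loc[test_mask]
      .assign(pred=clf.predict(X_test))
      .groupby("domain")["pred"]
      .agg(lambda s: s.mode().iloc[0])
)
print(per_journal)
```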

On the test set, after aggregating per-article predictions into a score per journal, we obtained reasonably good qualitative results, consistent with the human evaluation of journals that appears on mediabiasfactcheck.com. Below are a few examples:

One limitation of this algorithm is that it sometimes struggles to discriminate between the extremes.

Interestingly, one explanation could be that extreme left and extreme right journals tend to use similar wording.

Entity sentiment

To address that limitation, we thought of using sentiment polarity towards named entities as an additional feature.

We hypothesised that detecting named entities along with their sentiment would help detect systematic bias. For instance, if a publication constantly publishes positive content about a Republican and negative content about a Democrat, it could very well be a biased publication.

Under this assumption, we processed close to 85K articles through Google’s Named Entity Sentiment (beta) API, passing the article content as input and extracting named entities with their attached sentiment. We stored the resulting JSON responses in Python pickle format and added this data as features to the logistic regression model built in our previous experiment.
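For reference, a call to the entity-sentiment endpoint looks roughly like this with the current google-cloud-language client; the article used the beta API, so the exact client code likely differed:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def entity_sentiments(text):
    """Return (entity name, sentiment score, magnitude) tuples for one article."""
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_entity_sentiment(document=document)
    return [(e.name, e.sentiment.score, e.sentiment.magnitude)
            for e in response.entities]
```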

Entities were grouped by sentiment score:

  • Highly negative named entities (sentiment < -0.5)
  • Named entities with unknown sentiment (sentiment == 0)
  • Highly positive named entities (sentiment > 0.5)

Google’s extracted named entities turned out not to be a very informative feature. We assume this is because many named entities have no sentiment (a score of 0), which means the model learns little new from this additional data.

More extensive data mining might lead to a different conclusion.

Neural Networks

Neural networks have proven to work extremely well on some NLP tasks, so we explored the use of shallow neural networks for this problem. We implemented several architectures based on Jeremy Howard and Rachel Thomas’s Fast.AI course:

  • A single dense layer network, where the input is a learned D-dimensional embedding of the K most frequent words, with K = 100,000 and D = 32. The single dense layer has size 50 and the output is a classifier over 7 bias types (extreme left to extreme right). We used dropout before the classifier.
  • A 1-D CNN classifier that uses the first N word embeddings of the text, with filter size 5 and 32 output channels, followed by max pooling, a single dense layer and a classifier. Dropout is used before and after the convolution and before the classifier (a Keras sketch of this variant follows below).
  • A more advanced CNN classifier that chains multiple convolutions with filter sizes 3, 4 and 5, followed by the same max pooling architecture as above.

We initially tried implementing the models in PyTorch, but ran into an issue with the training/metrics library we were using, so we dropped it in favour of a more thoroughly tested framework: Keras.
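Here is a minimal Keras sketch of the 1-D CNN variant described above, not the authors’ exact code. The sequence length and dropout rates are assumptions (the article does not specify them), and global max pooling is used for simplicity; the other hyper-parameters follow the description in the list:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 100_000   # K most frequent words
EMBED_DIM = 32         # D-dimensional word embeddings
SEQ_LEN = 500          # assumed number of leading tokens kept per article
NUM_CLASSES = 7        # extreme left ... extreme right

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Dropout(0.2),
    layers.Conv1D(filters=32, kernel_size=5, activation="relu"),
    layers.Dropout(0.2),
    layers.GlobalMaxPooling1D(),
    layers.Dense(50, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```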

We stopped training after 2 or 3 epochs, as the neural networks had a tendency to overfit the dataset, getting close to 100% accuracy on the training set while not crossing the 70% threshold on the validation set. This behaviour suggests that the models are too complex for this type of problem, as they are practically memorizing the inputs.

The neural net architectures were built in parallel with the simpler logistic regression model and, after inspecting the results, we concluded that neural nets were too complex for the task at hand.

The simpler logistic regression model shown above achieves similar validation scores with a much simpler process.

Future Work & Conclusions

- Better annotations: our data is relatively noisy and our choice of labelling is clearly not the best. A biased publisher may not always publish biased content, and vice versa. Spending more time on this task should lead to better data, and thus better models.

- Better data pre-processing: we spent very little time cleaning the data we obtained from WebHose.io. The content of articles contains irrelevant information such as related links, ads and social network stats. Spending more time on this task could also lead to better models.

Modelling systematic bias is not an easy task. Trying to objectively model political bias, which is ultimately a widely debated and disputed topic, requires a lot of context and information about the political system and the media entities, which we assumed to be encoded in the articles themselves.

While we achieved relatively good validation accuracy, we did not explore what the model actually looks at, or whether it detects bias rather than some other correlated signal.

Furthermore, while we do have an extensive number of articles at our disposal, we used a very small sample of publishers for validation. We cannot say at this time whether the model will generalize to a larger set of publishers.

We hope our work will serve as a basis for future work.

Thank you

This work was made possible with the project direction and financial support of Vladimir Baranov.
