MLforSocial : Predicting Media Bias
Using Machine Learning to understand and predict media bias in articles.
Media bias is bias or perceived bias of journalists within the mass media including social media, in the selection of events and stories that are reported and how they are covered.
The term “media bias” implies a pervasive or widespread bias contravening the standards of journalism, rather than the perspective of an individual journalist or article. The direction and degree of media bias in various countries is widely disputed. [WIKIPEDIA]
A very recent example is the Mueller Report: depending on what you've been reading and watching in the aftermath of the release of its redacted version, you have probably encountered one of two competing narratives. But unless you have gone out of your way to expose yourself to a different perspective, it's likely that you have not encountered full-throated versions of both.
This is unfortunate. As is often the case with such things, the truth might well lie somewhere in the middle.[PUB_DISCOURSE]
For a few of us Millennials it feels as though things have only recently started spiraling downwards, but mass media's reputation has almost always been topsy-turvy. Ever since Gutenberg got his press rolling, people have been slinging mud as well.
You can find a few such examples here
Getting the Data
To begin our analysis, we need a dataset of articles from various media houses that have been pre-labelled as biased or not.
The dataset I will be using comes from Ad Fontes Media. They don't provide the actual article text, but the CSV export does include the URLs of the articles.
I created a simple Python script using Beautiful Soup and the requests library to fetch the actual article text.
To follow along with the code, you can view the Colab notebook here.
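As a sketch of what such a script might look like (the exact selectors vary by site; treating every `<p>` tag as article body is a rough heuristic, not the notebook's actual code):

```python
import requests
from bs4 import BeautifulSoup

def extract_text(html):
    """Concatenate the text of all <p> tags -- a rough article-body heuristic."""
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

def fetch_article_text(url):
    """Download a page and return its paragraph text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return extract_text(resp.text)
```

In practice you would loop this over the URL column of the CSV export, ideally with retries and polite delays between requests.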
Now the data looks as below.
- url — The public url of the article
- source — Media House
- text — text of the article
- bias — bias score where < 0 is left bias and > 0 is right bias
- quality — quality of the article as scored by a human reviewer.
We have 1646 articles in total.
There are a few sources, like Daily Mail, with very few articles; it is better to drop such sources, as they increase the skew of the data.
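A minimal way to drop such sparse sources with pandas (the threshold of 3 and the toy frame below are illustrative, not the actual data):

```python
import pandas as pd

# Toy frame standing in for the Ad Fontes export.
df = pd.DataFrame({
    "source": ["CNN"] * 5 + ["Fox"] * 4 + ["Daily Mail"] * 1,
    "text":   ["..."] * 10,
})

# Keep only sources with at least a handful of articles.
MIN_ARTICLES = 3
counts = df["source"].value_counts()
keep = counts[counts >= MIN_ARTICLES].index
df = df[df["source"].isin(keep)]
```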
We have two columns that aid in our analysis — text and bias.
Now let’s look at what this bias variable really is
We see that most articles have a bias very close to zero, with values ranging from -40 to +40.
Regression or Classification ?
Since we are trying to predict bias, and bias is a numerical column, we might want to try regression. In this approach, however, we first bin the bias column into three categories (Left, Center and Right) and then perform classification.
The cut-offs of -5 and +5 are based on eyeballing the above density plot of bias.
After binning, we see below that the most frequent bin is Center.
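The binning itself is a one-liner with `pd.cut`, using the eyeballed -5/+5 cut-offs (the scores below are toy values, not the real data):

```python
import pandas as pd

bias = pd.Series([-22.0, -6.3, -1.0, 0.4, 3.2, 7.8])  # toy bias scores

# Bin into Left (< -5), Center (-5 to +5) and Right (> +5).
bias_bin = pd.cut(bias, bins=[-40, -5, 5, 40],
                  labels=["Left", "Center", "Right"])
print(bias_bin.value_counts())
```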
Now, to extract features from the text, we first load a spaCy model and lemmatize the sentences, creating a new column, text_pos.
Then we play a trick to come up with features. The idea is to exploit the fact that we have labelled data to build a vocabulary for the TF-IDF vectorizer, so that the vocabulary is based not on individual articles but on the known bias labels.
Therefore we first group the dataframe using the “bias_bin” label and then combine the “text_pos” for each label.
Essentially, we have combined all articles for each category into one big article and then run a TF-IDF vectorizer on it.
Notice that we specify min_df=1 and max_df=1 (absolute counts), so that only phrases confined to a single category survive and the features are discriminatory between categories.
From the scikit-learn documentation:

- max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.
- min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
After that, we can get the features using the .get_feature_names method on the TF-IDF vectorizer (renamed .get_feature_names_out in newer scikit-learn versions). Below is a random sample of features
We also remove n-grams that are a subset of other n-grams.
For example, suppose we get two features like
‘politics security’ and ‘politics security demand’
then we remove ‘politics security’ to keep the vocabulary size tractable.
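One way to implement this pruning (a hypothetical helper, not the notebook's code; padding with spaces ensures only whole-word subsequences count as subsets):

```python
def prune_subset_ngrams(features):
    """Drop any n-gram that occurs as a contiguous word subsequence of a longer one."""
    kept = []
    for f in features:
        # f is a subset of another feature if it appears, word-aligned, inside it.
        if any(f != other and f" {f} " in f" {other} " for other in features):
            continue
        kept.append(f)
    return kept

features = ["politics security", "politics security demand", "tax"]
print(prune_subset_ngrams(features))  # "politics security" is pruned
```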
Now we can take the features found, pass them to another TF-IDF vectorizer as a fixed vocabulary, and vectorize all available articles.
Now we have a numerical representation of all the articles, and we can begin the modeling part.
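This second vectorization can be sketched by passing the pruned features through the vocabulary parameter (the vocabulary and articles below are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Discriminatory vocabulary found in the previous step (toy example).
vocab = ["justice", "reform", "border", "security", "budget"]

articles = [
    "the senate debated justice reform for weeks",
    "a new border security budget passed today",
]

# A second vectorizer restricted to the learned vocabulary.
vec = TfidfVectorizer(vocabulary=vocab)
X = vec.fit_transform(articles)
print(X.shape)  # one row per article, one column per vocabulary term
```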
For the modeling part, we first split the data into train and test sets and also rely on cross-validation. We try two approaches: one with a Gaussian Process Classifier and another using an ensemble stacking approach.
Gaussian Process Classifier
A Gaussian process is a stochastic process whose kernel is a Gaussian normal distribution. In other words, in a Gaussian process every (linear) combination of predictor variables is multivariate normally distributed. And hence, going the other way, any target variables whose predictor variables basically approximate this foundational distributional property can be modeled using a Gaussian process! This extension of Gaussian processes to regression, and separately to classification, exists in scikit-learn as GaussianProcessRegressor and GaussianProcessClassifier, respectively. [KAGGLE_RESIDENT_MARIO]
We get only decent scores in 5-fold cross-validation.
If we run separately on train and test sets we get
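A minimal version of this step, on synthetic data standing in for the TF-IDF article matrix (the kernel choice here is an assumption; the actual scores come from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the vectorized articles.
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, n_classes=3, random_state=0)

# GPC with an RBF kernel, scored by 5-fold cross-validation.
clf = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```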
A stacked classifier consists of stacking the outputs of individual estimators and using a classifier to compute the final prediction. Stacking allows one to use the strength of each individual estimator by using their output as input to a final estimator. [SKLEARN_STK]
So we try stacking the Gaussian Process Classifier with an AdaBoost classifier.
We get only marginal improvements in scores
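A sketch of this setup with scikit-learn's StackingClassifier, again on synthetic data; the logistic-regression combiner at the end is an assumption here, not necessarily the notebook's choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=5, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Stack a GP classifier with AdaBoost; a final estimator combines their outputs.
stack = StackingClassifier(
    estimators=[("gpc", GaussianProcessClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```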
Results and Bias-Variance Tradeoff
Even though the accuracy is almost abysmal, one fact to note is that the gap between training and test performance is not large; therefore we don't have a variance problem, but a bias problem.
Bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).[WIKIPEDIA_BIAS_VARIANCE]
Therefore the most likely cause of the poor performance is the quality of the features or the data. A simple representation of text like TF-IDF might be contributing: "media bias" is something semantic, and TF-IDF, being overly syntactic, may be unable to capture the right signals.
Hope you found the article insightful. Find me on @linkedin
[KAGGLE_RESIDENT_MARIO] — https://www.kaggle.com/residentmario/gaussian-process-regression-and-classification