A personalized ranking, content-based approach to model digital media

Tobia Albergoni · Published in ELCA IT
10 min read · Jan 19, 2022

As news reports have the power to shape our perception of world events, our choices of what information to consume matter. The rise of digital media and the abundance of sources have made these choices harder. New media-monitoring methods can raise awareness of widespread media bias, which humans cannot easily or objectively quantify.

For these reasons, we developed News Cracker, a novel embedding method for digital news sources. News Cracker is the outcome of my Master's Thesis, conducted at ELCA in collaboration with the LSIR laboratory at EPFL. It learns a latent representation space for news outlets that highlights similarities based on the content they produce. The method also exposes hidden higher-order patterns that reflect editorial guidelines and ideological alignment trends.

We will first present the key intuition behind News Cracker: approaching the subject as a personalized ranking problem, like a recommender system. We will then showcase the optimization criterion, the architecture of the model, and its performance. The second half of the article is dedicated to the analysis of the obtained embedding space, including an evaluation of News Cracker as a classifier for ideological alignment.

Photo by Obi Onyeador on Unsplash

A personalized ranking problem

The key idea is to have our model learn the implicit content-publishing preferences of news sources through a personalized ranking approach. We assume that the lexical and semantic choices made in articles implicitly reflect an interest from the sources that end up making them frequently. In other words, a source “prefers” one of its own articles (aⱼ) over an article from another news outlet (aₖ).

This modeling approach is akin to what recommendation systems do, where “users” (sources) implicitly express their preference towards an “item” (article) by, e.g., buying it. Formally, this problem is an instance of One-Class Collaborative Filtering (OCCF) from implicit feedback, with dyadic interactions of the form “source sᵢ has published article aⱼ”.

We want to optimize the representations of the sources by learning to produce an accurate total ordering >sᵢ of articles for each publisher, where

aⱼ >sᵢ aₖ

means that source sᵢ prefers article aⱼ over article aₖ.

This approach is heavily inspired by the work of D. Bourgeois and J. Rappaz Selection Bias in News Coverage: Learning it, Fighting it, where they employ the same personalized ranking idea to capture the latent structure of media’s decision process regarding event coverage. They show a worrisome coverage convergence trend following channel acquisitions from large media conglomerates.

The News Cracker source embedding method

To learn our source-embeddings, we employ the Bayesian Personalized Ranking (BPR) loss. It is a pairwise approach in which the model is presented with triplets (sᵢ, aⱼ, aₖ) sampled uniformly at random from the article corpus, where sᵢ is a random source from the source set S, aⱼ is an article published by sᵢ (the positive sample), and aₖ is an article not published by sᵢ (the negative sample).
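As an illustration, this triplet-sampling scheme can be sketched as follows (the function and data layout are our own simplification, not the thesis code):

```python
import random

def sample_triplet(articles_by_source, rng=random):
    """Sample one BPR training triplet (s_i, a_j, a_k) uniformly at random.

    articles_by_source: dict mapping a source id to the list of its article ids.
    a_j is drawn from the articles of s_i (positive sample), a_k from the
    articles of a different source (negative sample).
    """
    sources = list(articles_by_source)
    s_i = rng.choice(sources)
    a_j = rng.choice(articles_by_source[s_i])        # positive: published by s_i
    other = rng.choice([s for s in sources if s != s_i])
    a_k = rng.choice(articles_by_source[other])      # negative: published elsewhere
    return s_i, a_j, a_k
```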

In BPR, we maximize the following posterior probability over the training set of such triplets, where Θ are the parameters of an arbitrary ranking model:

p(Θ | >sᵢ) ∝ p(>sᵢ | Θ) p(Θ)

Concretely, we can compute the right-hand side of the formula thanks to a source-article interaction model (ranking model) which produces real-valued scores

x̂ᵢⱼ

that represent the likelihood of source sᵢ publishing article aⱼ. With the likelihood score from the ranking model, we can define

p(aⱼ >sᵢ aₖ | Θ) := σ(x̂ᵢⱼ − x̂ᵢₖ)

where σ denotes the sigmoid function, used as a smooth, differentiable stand-in for the Heaviside step function H(.) that would encode the hard ordering. Finally, with a Gaussian prior on Θ, we obtain the following criterion to maximize:

BPR-OPT := Σ ln σ(x̂ᵢⱼ − x̂ᵢₖ) − λ‖Θ‖²

with the sum running over all training triplets (sᵢ, aⱼ, aₖ).
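As a minimal NumPy sketch, the criterion can be computed from the scores x̂ produced by the ranking model (function name and regularization value are illustrative, not the project's code):

```python
import numpy as np

def bpr_opt(x_pos, x_neg, theta, lam=0.01):
    """BPR-OPT for a batch of triplets.

    x_pos[i], x_neg[i]: ranking-model scores for the positive and negative
    article of triplet i; theta: flat parameter vector; lam: regularization
    strength induced by the Gaussian prior on the parameters.
    Returns the quantity to maximize: sum ln sigmoid(x_pos - x_neg) - lam*||theta||^2.
    """
    diff = x_pos - x_neg
    log_sigmoid = -np.log1p(np.exp(-diff))   # ln σ(diff), numerically stable for diff > 0
    return log_sigmoid.sum() - lam * np.dot(theta, theta)
```

A larger margin between positive and negative scores yields a larger (less negative) objective, which is what gradient ascent on BPR-OPT pushes towards.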

This approach is clarified in the following diagram. Our source vectors sᵢ are initialized at random and optimized through BPR-OPT along with all the other parameters of the architecture, which are all trained jointly (embeddings, article-embedding model, source-article interaction model).

The News Cracker source-embedding method. Image by the author.

Text representations and ranking model

We obtain the likelihood scores from the source-article interaction model. First, we embed the articles through an LSTM-based architecture that combines word embeddings learned with FastText. We deliberately kept the article-embedding model relatively simple (no transformers) in order to model individual lexical choices more closely, as one of the goals of the project was to understand whether these choices alone would induce enough separation between source representations. With this article-embedding model, we obtain article representations aⱼ.

These representations are used in a ranking model that we define as the source-article interaction model, which outputs the likelihood of the article being published by source sᵢ. For this part, we experimented with latent factor models like matrix factorizations, as they are a traditional approach to personalized ranking problems, but they didn’t yield the desired results. We hence decided on a feedforward neural network architecture that, given the source vector (sᵢ) and an article representation (aⱼ), learns to output a likelihood score that maximizes the BPR-OPT criterion presented above. In doing so, and since the article representation aⱼ is given by the article embedding model, we can learn the source vectors sᵢ.
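A rough sketch of what such a feedforward scoring function could look like, with illustrative weight shapes (not the actual architecture from the thesis):

```python
import numpy as np

def interaction_score(s_vec, a_vec, W1, b1, w2, b2):
    """Hypothetical feedforward ranking model: concatenate the source vector
    and the article representation, apply one hidden ReLU layer, and output
    a scalar likelihood score x̂.  Shapes: W1 is (h, |s|+|a|), b1 is (h,),
    w2 is (h,), b2 is a scalar."""
    x = np.concatenate([s_vec, a_vec])       # joint source-article input
    h = np.maximum(0.0, W1 @ x + b1)         # hidden ReLU layer
    return float(w2 @ h + b2)                # scalar likelihood score
```

Because the score is differentiable with respect to sᵢ, maximizing BPR-OPT updates the source vectors alongside the network weights.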

The source-article interaction model architecture (ranking model). Image by the author.

Dataset and model evaluation

The article corpus used during this project was collected in the context of the Media Observatory Initiative and consists of titles and bodies of articles (in English) scraped from a large number of news outlets on the web, based all around the world.

In order to evaluate the News Cracker model, we adopted a leave-5-articles-out methodology. For each of the 446 sources, we selected 5 articles at random to be held out and used as positive samples in testing triplets, which results in a test set of 2'230 samples. From the remaining articles, we sampled 2'000'000 triplets uniformly at random to make up our training set. From the same pool of articles as the training set, we additionally sampled 5 training/validation set pairs with the same methodology (leave-5-out and 2 million training triplets) for a more robust assessment of results.
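The leave-5-articles-out split can be sketched as follows (a simplified version under our own naming assumptions):

```python
import random

def leave_n_out(articles_by_source, n=5, seed=0):
    """Split each source's articles into n held-out test positives and a
    training pool, mirroring the leave-5-articles-out protocol."""
    rng = random.Random(seed)
    held_out, train_pool = {}, {}
    for src, arts in articles_by_source.items():
        arts = list(arts)
        rng.shuffle(arts)                 # random selection per source
        held_out[src] = arts[:n]          # test positives for this source
        train_pool[src] = arts[n:]        # remaining articles feed training triplets
    return held_out, train_pool
```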

In terms of metrics, we use a formulation of the Area Under the ROC Curve (AUC) that is specific to this pairwise setting and can also be referred to as “pairwise accuracy”: the fraction of test triplets for which the model ranks the positive article above the negative one.

The model achieves 0.94 test AUC and 0.93 validation AUC, proving to be extremely good at learning this personalized ranking task.
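Under this formulation, the metric reduces to counting how often the positive article outscores the negative one; a minimal sketch:

```python
import numpy as np

def pairwise_auc(x_pos, x_neg):
    """Pairwise accuracy: fraction of test triplets in which the model
    scores a source's own (positive) article above the negative one."""
    x_pos, x_neg = np.asarray(x_pos), np.asarray(x_neg)
    return float(np.mean(x_pos > x_neg))
```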

Analyzing the embedding space

After learning the source-embeddings (dimensionality 100), we need analysis and visualization methods to interpret our results. First, we computed the cosine similarity between embeddings as a measure of source similarity, to identify the closest neighbors. Second, to visualize the space, we applied the t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique to project the embeddings onto a two-dimensional space, in which we applied the DBSCAN clustering algorithm to obtain groups of sources. The resulting space and clusters are depicted in the following plot:
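The cosine-similarity neighbor lookup can be sketched as follows (names are illustrative; in practice t-SNE and DBSCAN would typically come from a library such as scikit-learn):

```python
import numpy as np

def nearest_sources(embeddings, names, query, k=3):
    """Rank sources by cosine similarity to the query source's embedding.

    embeddings: (n_sources, d) array; names: list of source names aligned
    with the rows; query: name of the source whose neighbors we want."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]       # cosine similarity to the query
    order = np.argsort(-sims)                    # most similar first
    return [names[i] for i in order if names[i] != query][:k]
```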

The source-embedding space learned by News Cracker on our article corpus. Image by the author.

We now show that the model is able to easily capture three factors that characterize a source:

  • Geography
  • News organizations/conglomerates
  • Domain/editorial guidelines

Most of the regions of the space identified by DBSCAN can be described by one of these clustering factors.

The plot below has been annotated with respect to the geographical location of sources. It is a trivial distinguishing factor that we expect to find, as local entities (places, people, …) are frequently mentioned only by news sources from the same area. These lexical “choices” cause the source embeddings to be very similar, confirming that our embedding method is learning as intended. We can see that grouping happens at different geographical levels, from macro-regions down to individual cities.

The source-embedding space, annotated for geography. Image by the author.

The second factor is source ownership/brand membership. If we look at the closest neighbors of a few sources, we see a clear trend.

The three closest source embeddings for some selected news outlets. Image by the author.

Sources from the same brand/organization tend to have very similar embeddings, forming tight groups of neighbors. We can identify these subclusters within the location-based groups of the space, as the annotated plot below shows. This indicates that the algorithm can learn structure that goes beyond expected content differences and finds high-level patterns in the news landscape.

The source-embedding space, annotated for organizations. Image by the author.

The third and last clearly distinguishable factor is the source's domain. We can see below that clusters arise for sources reporting on the entertainment business, science & nature, finance, think tanks, etc. It should come as no surprise by now that News Cracker is equally able to group together sources with similar editorial guidelines.

The source-embedding space, annotated for the domain. Image by the author.

The embedding space seems to have a hierarchical structure. Geography represents the upper-level factor according to which sources are distributed on the projection plane. Then, for regions where we have numerous outlets (mainly the U.S.), the other two clustering factors (brand/organization and domain/editorial guidelines) form most of the remaining structure, producing source sub-clusters with clear semantics.

What about media bias?

While this analysis is interesting, we want to assess whether the approach can capture non-trivial structure in the news ecosystem and relate sources that share similar ideological/political stances on world events. We emphasize that the approach is completely unsupervised in terms of media bias prediction, and the model is free of any potentially bias-inducing design.

To determine whether some of the insights align with known political leanings, we collect explicit bias ratings for 123 of our U.S. sources from the AllSides Media Bias Ratings. This rating labels over 600 online news sources, almost exclusively from the United States, employing a sound methodology to produce the rating for each source. Additionally, the ratings are to some extent community-driven. It is the ideal type of data to compare to our purely data-driven results. The news outlets are labeled with one of five ratings: Left, Lean Left, Center, Lean Right, Right.

The media bias ratings collected from AllSides for 123 of our sources. Image by the author.

We plot the subset of (now labeled) sources on our projection plane and color code them according to their media bias rating.

Subset of U.S. news sources for which we collected an explicit media bias rating. Image by the author.

At first glance, there is no strikingly clear structure revolving around the bias ratings, but there are still many regions of the space where same-label groups appear. Therefore, to see whether News Cracker has actually learned something about ideological leaning, we use this dataset of 123 labeled sources and their projected embeddings to evaluate a majority-voting Nearest Neighbor classifier (k=12). On the task of predicting the exact label (5 classes), the model achieves 52.03% accuracy.

Confusion matrix and performances of the kNN classifier trained on sources labeled with bias ratings. Image by author.

Even though this number is not impressive by itself, the confusion matrix shows that the model makes reasonable predictions and mostly confuses labels that are adjacent on the political spectrum (e.g., Left and Lean Left). Since such a discretization of the continuous ideological spectrum is arbitrary, we also evaluate the model on the binary task of predicting whether a source leans left or right (discarding the 11 sources labeled “Center”). On this task, the classifier achieves 78.25% accuracy. In other words, given a source, our embedding space can help identify the correct leaning (according to AllSides) for almost 4 out of 5 news sources. We should reiterate how hard this task is, even for humans; this kind of performance is therefore noteworthy and surprising, considering the unsupervised nature of our model.
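A majority-voting kNN of this kind can be sketched in a few lines (a simplified stand-in, not the evaluation code used in the project):

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, x, k=12):
    """Majority-voting kNN on the 2-D projected embeddings: predict the bias
    label of a source from the labels of its k nearest projected neighbors."""
    dists = np.linalg.norm(np.asarray(train_X) - np.asarray(x), axis=1)
    neighbors = np.argsort(dists)[:k]            # indices of the k closest sources
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]            # most frequent neighbor label
```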

Conclusions

With News Cracker, we propose a novel embedding method for online news sources. We model each news source by learning its latent content preferences, an approach that allows us to identify similar sources with respect to the news reports they choose to publish. The analysis of the embedding space indicates that News Cracker is capable of successfully identifying many factors that distinguish news sources. The embeddings seem to capture to some extent the ideological leaning of news sources in certain regions of the space. This media bias prediction task is extremely difficult even for humans. Overall, we can conclude it is possible to capture source preferences that lead to bias-related considerations, even with a model that does not push any explicit definition of ideological bias.

Explore the embedding space through this demo interface!

The interactive Plotly dashboard we prepared to inspect the embedding space in detail. Image by author.
