Content-based Recommendations for News Media

Published in Froomle · 5 min read · Feb 8, 2023

Media companies need to consider several factors in developing a strategy for recommending news to readers. (Photo by Bank Phrom on Unsplash)

News recommendation is often seen as a special case in the recommender systems field, with methods and architectures created specifically for it. One important concern is the rapidly changing set of relevant items: new items are constantly published, while older ones become obsolete. Algorithms need to be able to handle this changing environment.

The state of the art for news recommendations typically uses a hybrid approach, combining content-based recommendations with collaborative filtering and/or item features such as popularity and recency.

In this blog post, we give an overview of the principles that underpin content-based recommender systems and share insights we at Froomle have gained from both literature and practice.

What Are Content-Based Approaches for News?

The items a user has read are used to construct a user profile. This profile is an aggregation of the user’s interests across a variety of what we call “features.”
These features can be broad categories (like domestic news and sports), narrow topics (like “The Champions League” or “Tour de France”), entities, or abstract features obtained with natural language processing techniques used on the title or article text.

How are these profiles computed? The first step is nearly always to compute a profile for each of the items. These can be binary vectors (is the item related to a topic or not?), count vectors (how many times does a term occur?), float vectors (embeddings computed through neural networks), or any combination thereof.

The news recommendation framework SCENE uses related topics and entities to represent items (binary profile), while CHAMELEON computes an embedding for the items using a convolutional neural network. The user profiles are constructed based on these item profiles. Typical approaches include averaging the profiles of each item a user has seen or learning the user profile using a deep neural network.
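To make the averaging approach concrete, here is a minimal sketch that builds binary item profiles and averages them into a user profile. The feature list and the function names are assumptions chosen for illustration; this is not the SCENE or CHAMELEON implementation.

```python
from typing import List, Set

# Hypothetical feature vocabulary; a real system would derive it from its own taxonomy.
FEATURES = ["sports", "tennis", "politics", "economy"]


def item_profile(topics: Set[str]) -> List[float]:
    """Binary profile: 1.0 if the item is tagged with the feature, else 0.0."""
    return [1.0 if feature in topics else 0.0 for feature in FEATURES]


def user_profile(read_items: List[List[float]]) -> List[float]:
    """Average the profiles of the items a user has read."""
    if not read_items:
        return [0.0] * len(FEATURES)
    n = len(read_items)
    # zip(*read_items) iterates over the profiles feature by feature.
    return [sum(column) / n for column in zip(*read_items)]


history = [
    item_profile({"sports", "tennis"}),
    item_profile({"sports"}),
    item_profile({"politics"}),
    item_profile({"sports", "economy"}),
]
profile = user_profile(history)
print(dict(zip(FEATURES, profile)))
```

In this toy history, three of the four articles are tagged with sports, so that feature ends up with the highest weight in the user profile.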

The system used in Google News takes a different approach. Instead of the typical similarity measures, it uses a probabilistic model. For each category of items, it computes how likely a user is to click on or interact with that category based on historical data. This probability is what gets stored in the user profile.
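The idea of a probabilistic profile can be sketched as follows. This is a deliberately simplified illustration and not Google's actual model: it estimates each category's probability as that category's share of the user's past clicks, and the additive smoothing term is an assumption added here so that unseen categories do not get a probability of zero.

```python
from collections import Counter
from typing import Dict, List


def category_click_probabilities(
    clicked_categories: List[str],
    all_categories: List[str],
    smoothing: float = 1.0,  # hypothetical additive smoothing constant
) -> Dict[str, float]:
    """Estimate P(click on category) from a user's click history."""
    counts = Counter(clicked_categories)
    total = sum(counts[c] for c in all_categories) + smoothing * len(all_categories)
    if total == 0:
        return {}
    return {c: (counts[c] + smoothing) / total for c in all_categories}


clicks = ["sports", "sports", "politics", "sports"]
categories = ["sports", "politics", "economy"]
probs = category_click_probabilities(clicks, categories)
print(probs)
```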

The computed user profile can be used for both analysis and recommendations.

In analysis, the profiles can help answer questions like “How many users are interested in topic X?” or “How big is the overlap in readers between entity A and entity B?”

To use the profiles in a recommender system, the typical approach is to compute a similarity between the profile of an item and that of a user, either through direct measures such as cosine similarity and Jaccard similarity or through a neural network.
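The two direct measures mentioned above can be sketched in a few lines. The profiles and article identifiers below are made up for illustration: cosine similarity is applied to weighted vectors, and Jaccard similarity to sets of features.

```python
import math
from typing import Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical profiles over the features [sports, tennis, politics].
user = [0.8, 0.6, 0.0]
items = {
    "tennis-final": [1.0, 1.0, 0.0],
    "election-night": [0.0, 0.0, 1.0],
}
# Rank candidate items by their similarity to the user profile.
ranked = sorted(items, key=lambda i: cosine_similarity(user, items[i]), reverse=True)
print(ranked)
```

Cosine similarity ignores vector length, which is convenient when users have read very different numbers of articles. Jaccard similarity only makes sense for binary (set-valued) profiles.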

Why Use Content-Based Recommender Systems for News?

Using content-based approaches solves two issues faced by News Recommender Systems.

First, it helps alleviate item cold start. When a new item is published, no model that relies on (co-)visitation can recommend it yet. Content-based approaches, however, can analyse the content of the article and therefore recommend it to interested readers as soon as it has been published.

Secondly, it helps in condensing data and avoiding sparsity issues. As many users do not return every day, the probability that two items are visited by the same user goes down as they are published further and further apart, even if the two articles are very similar.
Models like collaborative filtering that rely on co-visitation struggle with this sparsity.
Aggregating a user’s reading history based on the items’ contents helps alleviate this issue. There are usually fewer features used in the profiles than there are items, which automatically reduces the sparsity.
Further, the features are usually long-lived. For example, sports topics will always remain relevant, so we can recommend new sports-related articles to users who read old sports articles, even if the two articles were never read by the same user before.

How Are Content-Based Recommendations Used?

While the content-based approach is often a fundamental part of the recommender system, most state-of-the-art systems combine it with other approaches to improve results. Some companies (like Google) combine the user’s content-based score with a global current-interest score and a collaborative filtering score to get the final result. The global interest score accounts for spikes in interest caused by special events, such as COVID-19, the World Cup, or natural disasters.

Collaborative filtering helps find the right articles within a set of articles related to the broad topics a user is interested in. SCENE also combines the content-based score with additional features like recency, popularity, and list diversity to further improve the candidate list.
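One simple way to combine such signals is a weighted sum, sketched below. The weights, the exponential recency decay, and the `Candidate` fields are all assumptions made for this illustration; the source does not specify how Google News or SCENE actually combine their scores.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Candidate:
    article_id: str
    content_score: float  # similarity between the user and item profiles
    cf_score: float       # collaborative filtering score
    popularity: float     # normalised to [0, 1]
    age_hours: float      # time since publication


def recency_score(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: the score halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def hybrid_score(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted sum of the individual signals."""
    return (
        weights["content"] * c.content_score
        + weights["cf"] * c.cf_score
        + weights["popularity"] * c.popularity
        + weights["recency"] * recency_score(c.age_hours)
    )


# Hypothetical weights; in practice these would be tuned on real traffic.
WEIGHTS = {"content": 0.4, "cf": 0.3, "popularity": 0.1, "recency": 0.2}

candidates = [
    Candidate("a", content_score=0.9, cf_score=0.2, popularity=0.1, age_hours=2.0),
    Candidate("b", content_score=0.3, cf_score=0.8, popularity=0.9, age_hours=48.0),
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c, WEIGHTS), reverse=True)
print([c.article_id for c in ranked])
```

List diversity is left out of this sketch, since it depends on the whole list rather than on a single candidate and is typically handled in a separate re-ranking step.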

What Have We Learned and Applied to Our Own Use Cases?

At Froomle, we use a content-based approach in our TF-IDF (term frequency, inverse document frequency) algorithm. The difference from the approaches described above is that we compute the user profiles in real time, so that they reflect the user’s most recent interactions. Our TF-IDF algorithm precomputes item profiles based on how often tokens (words) occur in the title and category strings.

These raw frequencies are weighted by the inverse of the frequency of a token in the data of all items. This means that a token like “sport” occurring in a lot of articles gets less weight than a token like “tennis,” which will occur in fewer items and so be more representative of a user’s interest. The user profiles are computed in real-time as the average of the profiles of items they recently consumed.
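The weighting described above can be sketched as follows. This is an illustrative approximation, not Froomle's production code: the whitespace tokenizer, the `log(N / df)` form of the inverse document frequency, and the sample articles are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (assumption for this sketch)."""
    return text.lower().split()


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency: log(N / number of items containing the token)."""
    n = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(n / df) for token, df in doc_freq.items()}


def item_profile(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Raw token counts weighted by inverse document frequency."""
    counts = Counter(tokens)
    return {token: count * idf.get(token, 0.0) for token, count in counts.items()}


def user_profile(profiles: List[Dict[str, float]]) -> Dict[str, float]:
    """Average of the profiles of recently consumed items."""
    if not profiles:
        return {}
    summed: Dict[str, float] = defaultdict(float)
    for profile in profiles:
        for token, weight in profile.items():
            summed[token] += weight
    return {token: weight / len(profiles) for token, weight in summed.items()}


# Hypothetical "title + category" strings.
articles = {
    "a1": "sport tennis wimbledon final",
    "a2": "sport football transfer news",
    "a3": "sport tennis ranking update",
    "a4": "politics election debate",
}
tokens = {aid: tokenize(text) for aid, text in articles.items()}
idf = build_idf(list(tokens.values()))
item_profiles = {aid: item_profile(toks, idf) for aid, toks in tokens.items()}  # precomputed

# Computed at request time from the user's most recent reads.
user = user_profile([item_profiles["a1"], item_profiles["a3"]])
print(sorted(user.items(), key=lambda kv: kv[1], reverse=True))
```

Because "sport" appears in three of the four articles and "tennis" in only two, "tennis" ends up with the higher weight in the user profile, even though the user read both tokens equally often.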

We could pre-compute these user profiles as well, storing them for use at prediction time or for analysis purposes. The advantage of pre-computation is that we can take longer histories into account, since the profiles no longer need to be computed in real time.

The downside to pre-computation is that a user’s profile is not updated with the most recent user events. It takes a rerun of the computation to consolidate them into the profiles. Another disadvantage is that you risk computing a large number of profiles that will never be used, since you don’t know which users will come online.

When the history you want to use is short, it is better not to pre-compute the profiles, since that would introduce unnecessary complexity. You should use pre-computed profiles when you want to use a history so long that loading all the events and computing the profiles in real time would take too long.

Conclusion

Content-based approaches help solve the cold start problem since they rely on information that is readily available, whereas collaborative filtering needs user interactions that are amassed over time. Representing the user’s interest as their preference for certain features also reduces the sparsity of the data. (Typically, there are fewer categories than there are items, and they are relevant for longer periods of time.) Stored user profiles are an effective method to handle a large user history, without having to recompute it on the fly.

To get the best final result, content-based approaches are combined with other models. Simple features like popularity, recency, and diversity all play a role in getting the best result for the user. In addition, collaborative filtering can be used to incorporate the user’s recent interests and browsing behaviour.