The Value of Responsible Personalization in News Recommender Systems

Cristina Kadar
Published in NZZ Open
8 min read · Feb 11, 2022

In this article, I explain how the Data Team of NZZ, Switzerland’s German-language newspaper of record, demonstrated to stakeholders for the first time the value of personalization in content recommendations. With a responsible design and an experiment-driven approach!

Starting Point

In mid-2019, our Sunday newspaper NZZ am Sonntag (NZZaS) was up for a major redesign, with the goal of optimizing the website for mobile usage and, in general, making it faster, prettier, and… more personal. New features like bookmarking, reading history, and personalized article recommendations were planned.

Having recently joined the company as a Senior Data Scientist, I was tasked with designing the algorithm behind the personalized reading recommendations for this new website — my first major project in the media industry. At that point, NZZ had already explored content personalization in two projects:

  • NZZ Companion App: an experimental news app funded partly by a Google DNI grant. Although this app was only rolled out to a group of beta testers in order to learn more about their needs and usage patterns, it helped us kick-start our cloud data infrastructure and build our first reusable data pipelines. Based on these first results, personalized reading recommendations were also introduced in the user area of the daily newspaper NZZ;
  • meineNZZ (in English: myNZZ): NZZ’s personalized newsletter, sent every Friday afternoon. It consists of a personalized list of articles published during that week that the user has missed. To this day, meineNZZ has a loyal pool of subscribers and very good open rates.

Now, for the first time, content personalization was planned to play a major role on one of our websites. Specifically, the “Nur für Sie” (in English: Only for you) reading recommendations were to be displayed in 3 locations on the new NZZaS product:

  • in the user area, alongside bookmarked and recently read articles;
  • in a feed on the front page;
  • in a feed in the next reads section of the article pages.

Algorithmic Design

So, why content personalization? Of the hundreds of articles published in recent editions, a regular user gets to see and consume online just a handful. Not all articles make it onto the editorially curated front page, and even those that do may be missed if the user was not online at the time they were promoted. So the main purpose of a personalized feed is to surface relevant content that the user has so far missed. Done right, this makes users more engaged and loyal to the product.

While relevant content can mean personally tailored content, e.g. based on past reading interests, other factors should be accounted for, especially in the domain of news. We do want to consider individual preferences, but we also want a feed that upholds journalistic standards: it should account for the editorial judgment of our journalists as well as for what we observe in crowd behavior. This is what we call responsible personalization.

Based on the above, we arrived at the following transparent design. For each user, the candidate articles are scored along three dimensions:

  • Personal Score: here, we follow a content-based filtering approach. Articles that are semantically similar to what the specific user has read in the past receive a higher personal score. Refer to the next section for more technical details.
  • Crowd Score: all articles that have recently been popular among our readers receive more points in the crowd score. This is how the first recommenders in the news industry worked, and it remains a good baseline to outperform.
  • Editorial Score: this score reflects the editorial value of an article. There are several valid approaches to this, such as journalists manually scoring each article or defining a fixed pool of articles to be considered in a given period. We went for a fully automated solution that reverse-engineers the article’s value from the way the journalists chose to promote it on the front page (promotion day, duration, etc.).

In the end, all scores are discretized and combined in a weighted sum. The weight of each term can be a business decision, depending on the principles and goals of the newspaper, or treated as a hyper-parameter to be optimized in the learning phase.

Final score as a weighted sum of the personal, crowd, and editorial scores and further business logic.
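As a minimal sketch of this combination (the weights and the number of bins below are hypothetical; in practice they are a business decision or tuned as hyper-parameters):

```python
import numpy as np

def final_score(personal, crowd, editorial,
                weights=(0.5, 0.25, 0.25), n_bins=10):
    """Combine three raw scores in [0, 1] into one weighted sum.

    Each score is first discretized into n_bins buckets, then the
    buckets are weighted and rescaled back to [0, 1].
    """
    scores = np.array([personal, crowd, editorial], dtype=float)
    # Discretize: map each score to an integer bucket 0..n_bins-1
    binned = np.clip(np.rint(scores * n_bins).astype(int), 0, n_bins - 1)
    # Weighted sum of the bucket indices, normalized to [0, 1]
    return float(np.dot(np.array(weights), binned) / (n_bins - 1))
```

A score of 1.0 in all three dimensions yields a final score of 1.0; lowering any component lowers the result in proportion to its weight.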

Unlike recommender systems for products, movies, or songs (think of the Amazons, Netflixes, and Spotifies of this world), news items have a much shorter shelf life, and we account for that with a time-decay function (the older the article, the higher the penalty).
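One simple way to implement such a penalty is an exponential decay with a tunable half-life; the 24-hour value below is illustrative, not NZZ’s production setting:

```python
def time_decay(score, age_hours, half_life_hours=24.0):
    """Exponentially down-weight older articles: the older the
    article, the higher the penalty applied to its score."""
    return score * 0.5 ** (age_hours / half_life_hours)
```

With a 24-hour half-life, a one-day-old article keeps half of its score and a two-day-old article only a quarter.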

The feed is recalculated every few hours for each user based on the formula above. Depending on the feed’s placement in the product, surrounding articles are filtered out dynamically in the frontend (e.g. the currently open article on an article page).

From word embeddings to the personal score

Here, I explain how we compute the personal score for each user-article candidate combination by means of a content-based approach using similarities between the news items.

First, for each user, we need a user profile. The profile is the set of all articles that the user has read up until then. Each article in the profile is semantically represented as a 300-dimensional vector, which is an average of the word embeddings of all words appearing in the text (of course, after careful preprocessing, such as removing stop words). For that, we use FastText German embeddings pre-trained on Wikipedia.
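In code, the profile construction could look like this minimal sketch. A toy embedding table stands in for the pre-trained FastText German vectors (in production one would load e.g. the `wiki.de` model), and the stop-word list is likewise just illustrative:

```python
import numpy as np

# Toy stand-in for the 300-dimensional pre-trained FastText
# German embeddings; real vectors would be loaded from disk.
rng = np.random.default_rng(0)
EMB = {w: rng.standard_normal(300)
       for w in ["schweiz", "politik", "wahl", "fussball"]}
STOP_WORDS = {"der", "die", "das", "und", "in"}

def article_vector(text):
    """Represent an article as the average of the word embeddings
    of all its words, after removing stop words."""
    vecs = [EMB[w] for w in text.lower().split()
            if w not in STOP_WORDS and w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

def user_profile(read_articles):
    """The profile is the set of vectors of all articles read so far."""
    return [article_vector(t) for t in read_articles]
```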

Then, at regular intervals and for each article candidate in the pool (i.e. recently published articles that the user has not yet read), we compute the distance between this article and the user profile. We do that by means of the Euclidean distance between the vectors, but other measures such as the cosine similarity are also valid choices.

The closer an article is to the user profile, or, to at least a subset of it, the more points it receives in the final personal score.
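Turning distances into points could then look like the sketch below; using the mean distance to the k nearest profile articles (so that closeness to a subset of the profile counts) and the mapping to (0, 1] are plausible choices, not necessarily the production ones:

```python
import numpy as np

def personal_score(candidate_vec, profile_vecs, k=3):
    """Score a candidate article by its Euclidean distance to the
    user profile: the closer to (a subset of) the profile, the
    higher the score. Distance 0 maps to score 1.0."""
    dists = np.sort([np.linalg.norm(candidate_vec - p)
                     for p in profile_vecs])
    # Average over the k nearest profile articles, then invert
    return 1.0 / (1.0 + float(np.mean(dists[:k])))
```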

A/B Test Design and Results

But what is the added (user!) value of this carefully crafted personalized score? To answer that, we designed and ran the following online A/B test.

The users in the control group (A) were served recommendations computed using only the crowd and editorial scores, i.e. a highly relevant, yet unpersonalized feed. The users in the treatment group (B) were exposed to recommendations computed using all three scores, i.e. a balanced personalized feed.

Design of the A/B test

Beyond algorithmic metrics (such as precision and recall), what we ultimately want to improve is the quality of user engagement with our content. A good product metric for this is the number (or ratio) of completed reads (i.e. articles read until the end) referred by the feed.

A/B Test duration: 7 weeks (peaks represent Sundays). Assignments: approx. 26,000 users / test group. Significant differences as per Wilcoxon signed-rank test (p<0.01).

On day one of the experiment, we assigned logged-in users to two balanced groups and observed them over the course of 7 weeks. Since NZZaS is a Sunday publication, most of the traffic took place over the weekends. At the end of the test period, we could see that the personalized version of the feed exhibited an 18% uplift in the core experiment metric, a result that was statistically significant. With this, we put a first number on the value of responsible content personalization!
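For illustration, a significance check of this kind can be sketched with SciPy on synthetic, paired daily counts; the real experiment data is NZZ-internal, and the numbers below merely mimic an approximately 18% uplift:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic daily completed-read counts per group over 49 days,
# paired by day; treatment simulates a ~18% uplift plus noise.
rng = np.random.default_rng(42)
control = rng.poisson(1000, size=49)
treatment = (control * 1.18 + rng.normal(0, 20, size=49)).round()

# Paired, non-parametric test on the daily differences
stat, p = wilcoxon(treatment, control)
uplift = treatment.sum() / control.sum() - 1
print(f"uplift: {uplift:.1%}, p-value: {p:.2g}")
```

The Wilcoxon signed-rank test is a natural fit here because the daily counts are paired (same day, two groups) and need not be normally distributed.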

User Feedback

True to our design principle of transparency, we also wrote an article (in German) explaining in lay terms to our users how the feed works. The article is linked on the “Nur für Sie” container on the front page and is always accessible to the readers.

As on all our articles, after finishing the text, readers can choose to let us know whether the article met their expectations, and even leave textual feedback: a trove of qualitative data (as opposed to the quantitative data from the online experiment above). Looking at two and a half years’ worth of collected data, this is what they say: 72% of those voting found that the article fulfilled their expectations, 14% found that it only partly met them, while 14% found it unsatisfactory.

Among those voting Yes, many users found the article interesting and easy to follow, and appreciated the full transparency. Some of the users voting Partly noted that they mostly read the print version of the newspaper, which is not considered by the algorithm. Finally, some of the users voting No mentioned that they simply do not want personalized content or pointed out inherent shortcomings in the data.

The Team

Katrin already gave credit to the whole NZZaS development team in her original post, so here I will only mention the team specifically behind the “Nur für Sie” data product:

  • Marco Metzler (Rep. of the Editorial Team & Head of Digital NZZaS)
  • Katrin Huth (Product Manager)
  • Cristina Kadar (Senior Data Scientist)
  • Paweł Kaczorowski (Senior Data Engineer)

Disclaimer

NZZaS will soon be replaced by a different product called NZZ Magazin, so if you read this article sometime in the near future, do not be surprised if you cannot find “Nur für Sie”. We will be working on a new algorithm design, specifically tailored to the new title!

I have presented parts of this content at various events: in the AI & Online Business Track at the Applied Machine Learning Days conference, in the Media Innovation Seminar at ETH Zurich, and at the Data Science Night of the University of Applied Sciences and Arts Northwestern Switzerland.

If you liked this article, hit the applause button below, share it with your audience, and follow me on Medium, Twitter, and LinkedIn for more insights.
