Better article recommendations with triplet fine-tuning

SZDM Data Science
Süddeutsche Zeitung Digitale Medien
Aug 21, 2022

For the newest version of our SZ news app, we wanted to make it easier for readers to find articles they might be interested in. There are several variations of this task, for example:

  • Which articles are other people (with similar interests) reading?
  • What set of articles matches my own interests?
  • What would be good follow-up articles to the one I’ve just read?

The last variation had special priority for us because editors currently spend significant effort selecting such follow-up articles by hand. It is even more work to change them when the original recommendation needs to be removed or a better-matching article is published.

Here, we want to show how we tackled this problem — how can we find good follow-up articles using our collection of news articles, some natural language processing (NLP) know-how, and machine learning techniques?

Following up

A follow-up article should be closely related to the topic of the source article and relevant on its own, neither outdated nor too niche. Discounting for article age and moderately boosting popular articles is technically straightforward — even based on live article stats. But how do we find related articles? In this work, we focus on that problem in particular. The data engineering, tuning, data quality management, and compatibility with the other tasks mentioned above pose challenges worthy of their own story.

Finding similar documents is a basic task in information retrieval to which many NLP techniques can be applied. Since we care much more about the general content of an article than, say, selected keywords or word choice, we decided to use text embeddings as a basis for a similarity search. Text embeddings are condensed vector representations of the text with the property that semantically similar and related texts appear close together. You can read more about embeddings in NLP: Everything about Embeddings or Jay Alammar’s Illustrated Word2vec.
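To make the idea concrete, here is a minimal, self-contained sketch with toy vectors (real text embeddings have hundreds of dimensions): the closer two embedding vectors point in the same direction, the more related their texts.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1 for related vectors, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" standing in for real ones.
climate_article  = np.array([0.9, 0.1, 0.0])
climate_followup = np.array([0.8, 0.3, 0.1])
sports_article   = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(climate_article, climate_followup))  # high -> related
print(cosine_similarity(climate_article, sports_article))    # low  -> unrelated
```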

One can generate embeddings with language models based on transformers. Many models, like BERT, are available as open source and pre-trained for German texts. However, while the German BERT model is partly trained on news corpora and already ‘gets’ the basic structure and vocabulary of news articles, we believed we could do better by specializing the model on our own articles. That way, current news topics, vocabulary, and the SZ style would be represented more accurately. Even better, we already had the perfect training data for our SZ model: the thousands of recommendations hand-crafted by our editors.

Implementation and Training

To train or fine-tune language models, we need examples to learn from. An effective method for fine-tuning is training via triplet loss. The basic idea is to combine an anchor — a reference point of some sort — with both a positive and a negative example, and have the model push the positive example closer to the anchor than the negative example. The loss function to formalize the intuitive idea — the quantity we want to minimize during training — is then:

max{d(a, p) − d(a, n) + α, 0}

The loss is 0 if the positive example p is sufficiently closer to the anchor a than the negative example n, and it grows as the negative example gets closer. The α is a margin: even when the positive example is closer than the negative one, the model is still penalized unless the gap in distance exceeds α. As the distance function d(x, y), we choose the cosine distance, which is commonly used with embeddings.

The cosine distance corresponds to the angle between vectors: a smaller angle implies a smaller distance. After training, the angle between the anchor and positive examples should be much smaller than the angle to negative examples.
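As a concrete sketch, the loss could be implemented in PyTorch like this (the margin value is illustrative, not our actual training configuration):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Triplet loss with cosine distance d(x, y) = 1 - cos(x, y)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    # Penalize whenever the positive example is not closer to the
    # anchor than the negative example by at least the margin α.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Toy batch: 4 triplets of 128-dimensional embeddings.
a, p, n = (torch.randn(4, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```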

After some filtering and cleaning, we have about 35k editorial recommendations from one month as training data. For every pair of article and recommendation, we add a third article, chosen at random from articles with different editorial keywords in the same month. This is not 100% guaranteed, but it is a reasonable way of selecting an article that should never be recommended. The three articles form our triplet, with the source article as the anchor, the recommendation as the positive example, and the random article as the negative example.

We generate training triplets from editorial recommendations
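In code, the sampling could look roughly like this (the data structures and field names are illustrative, not our actual pipeline):

```python
import random

def build_triplets(recommendations, articles_by_month):
    """Build (anchor, positive, negative) triplets from editorial recommendations.

    `recommendations` holds pairs of source article and recommended article;
    the `month` and `keywords` fields are illustrative.
    """
    triplets = []
    for source, recommended in recommendations:
        # Negative candidates: same month, but no shared editorial keywords.
        candidates = [
            article for article in articles_by_month[source["month"]]
            if not set(article["keywords"]) & set(source["keywords"])
        ]
        if candidates:
            triplets.append((source, recommended, random.choice(candidates)))
    return triplets
```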

For the model itself, we start with the pre-trained German BERT embedding model and add a custom average-pooling head. The pooling head reduces the embedding size from the standard BERT dimension of 768 down to 128. We train the whole model with the triplet loss function on a subset of the article-recommendation-random triplets. In good BERT tradition, we call the resulting model HeriBERT.

Architecture and training regimen of HeriBERT
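A rough PyTorch sketch of such an architecture follows; the checkpoint name and the exact head (mean pooling followed by a linear projection) are our assumptions for illustration, not the actual HeriBERT code.

```python
import torch.nn as nn
from transformers import AutoModel

class HeriBertSketch(nn.Module):
    """German BERT with a pooling head that maps 768 down to 128 dimensions."""

    def __init__(self, checkpoint: str = "bert-base-german-cased"):  # illustrative
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        self.projection = nn.Linear(768, 128)

    def forward(self, input_ids, attention_mask):
        token_states = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Average over all non-padding tokens, then project down to 128 dims.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(dim=1) / mask.sum(dim=1)
        return self.projection(pooled)
```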

We evaluate our work on a 5% holdout set of articles. Using the pre-trained, non-customized BERT model, the similarity between anchor and recommendation is ~0.95, and to the random article ~0.91. This means the similarity between randomly selected articles is quite high and related articles do not stand out much. The most likely reason is that BERT was trained on a much more diverse set of text data, so news articles of any kind automatically appear more similar to each other and fall into a smaller subspace. Using HeriBERT embeddings for the same holdout set, we found that on average the similarity between the anchor and the recommendation is ~0.63, while the similarity to the random article is only ~0.09. As we had hoped, our custom model is now much better at separating news articles in general, and random articles from fitting recommendations in particular, so the training has been successful.
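The evaluation itself boils down to averaging cosine similarities over the holdout triplets; a minimal sketch, assuming batched PyTorch tensors of embeddings:

```python
import torch.nn.functional as F

def mean_similarities(anchors, positives, negatives):
    """Average cosine similarity of anchors to their recommendations
    and to the random negative articles, over a holdout set."""
    sim_pos = F.cosine_similarity(anchors, positives).mean().item()
    sim_neg = F.cosine_similarity(anchors, negatives).mean().item()
    return sim_pos, sim_neg  # e.g. ~0.63 and ~0.09 with HeriBERT embeddings
```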

We can even look at a 3D visualization of the article space and find that departments like sports and politics appear as distinct clusters. News stories like the series of articles about the Suisse Secrets coalesce as well.

3D PCA projection of HeriBERT article embeddings (via projector.tensorflow.org)
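For the visualization, embeddings and labels can be exported as TSV files, the format projector.tensorflow.org expects (the arrays below are stand-ins for the real data):

```python
import numpy as np

# Stand-ins for HeriBERT article embeddings and their departments.
embeddings = np.random.randn(1000, 128)
departments = ["politics"] * 500 + ["sports"] * 500

# The projector loads vectors and metadata as tab-separated files.
np.savetxt("vectors.tsv", embeddings, delimiter="\t")
with open("metadata.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(departments))
```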

The recommendations are, however, only rarely the most similar articles overall. This is to be expected, since similarity is not the only criterion by which they were selected in the first place. Instead, we manually inspected the most similar articles, which we call recommendation candidates, for several hundred test articles, and found that the candidates match the topic quite well. Once we added the age discount and popularity boost in a weighted sum with the similarity, the editorial recommendations appeared regularly in the top three of our candidates.
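A simple version of such a weighted sum might look like this; the weights, the decay constant, and the choice of clicks as the popularity signal are all illustrative, the production values are tuned:

```python
import math

def rank_score(similarity: float, age_days: float, clicks: int,
               w_sim: float = 1.0, w_age: float = 0.3, w_pop: float = 0.1) -> float:
    """Weighted sum of similarity, an age discount, and a popularity boost."""
    age_discount = math.exp(-age_days / 7.0)  # decays over roughly a week
    popularity = math.log1p(clicks)           # dampens very large click counts
    return w_sim * similarity + w_age * age_discount + w_pop * popularity

# (article_id, similarity, age in days, clicks) for some candidates
candidates = [("a1", 0.82, 1.0, 500), ("a2", 0.90, 30.0, 50), ("a3", 0.75, 0.5, 5000)]
top3 = sorted(candidates, key=lambda c: rank_score(*c[1:]), reverse=True)[:3]
print([article_id for article_id, *_ in top3])
```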

How we brought it to production

The SZDM infrastructure for the app already runs on AWS. Our recommendation service centered around HeriBERT is integrated into the larger app architecture as well. There are a lot of details we need to skip over here, but, in essence, the required components are as follows:

AWS services to deploy the recommendation system built around HeriBERT

We deploy an AWS SageMaker endpoint for our trained model. Every article publication triggers a Lambda function that extracts the text plus some metadata and calls the SageMaker endpoint. In the serverless variant, these endpoints scale down to zero and incur no costs in periods when no articles are published. The generated embeddings are stored in a DynamoDB table and an OpenSearch cluster. Every time the app backend requests follow-up recommendations for an article, our API looks up the embedding in the table. OpenSearch supports a fast kNN (k-nearest neighbor) search, so we query the cluster for the recommendation candidates, re-rank them based on age and popularity, and send back the top three candidates to be shown as follow-up options.
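The kNN lookup against OpenSearch is a single query. A sketch with the opensearch-py client, where the host, index, and field names are illustrative:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-articles.example.com:443"])  # illustrative

def knn_candidates(embedding: list[float], k: int = 20) -> list[str]:
    """Fetch the ids of the k nearest article embeddings from the kNN index."""
    response = client.search(
        index="article-embeddings",  # illustrative index name
        body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": embedding, "k": k}}},
        },
    )
    return [hit["_id"] for hit in response["hits"]["hits"]]
```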

Conclusion

HeriBERT provides follow-up recommendations for over 5000 different articles every week and updates recommendations for millions of page views in real time, with no additional effort by our editors. The images below — taken live from the app — show how both a political story and a vacation piece get matching recommendations.

An example of a fast-moving news story: Three recent background articles are chosen as recommendations.
An example of a softer article: All recommendations match the source article in tone and subject matter even when there is no ‘obvious’ follow-up article.

Through fine-tuning and condensing a pre-trained BERT model, HeriBERT represents current news topics accurately and has been able to learn from the curated selections of the editorial staff. We hope to extend the scope of HeriBERT from our mobile app to the website as well, and even to include video and audio content in the future.
