Building a content-based recommender system for a news website

Pradeep Murthy
5 min read · Dec 31, 2019


Unless you are The New York Times or The Guardian, as a news publisher you are fighting heavyweights such as Facebook, Instagram, YouTube, and TikTok to keep users engaged on your platform. In this situation, building a custom content delivery platform is of paramount importance.

Working for a digital news publishing company and managing the data platform that supports more than 120 websites helped me understand the severity of this problem. To build an article recommendation solution, I had to unlearn publishing practices that are no longer relevant and learn new data science concepts in order to build intelligent and engaging experiences for news readers.

In this article, I have summarized my experiment to build a basic content-based recommender system for a news site.

Step 1 — Setting the objective and goals

Objective

“To increase the engagement of active users on a news site using article recommendations”

Business goals

  • Increase page views per session per user
  • Increase session duration per user

User goals

  • Reduce the cognitive load by serving preferred content
  • Increase relevance and freshness of recommended content

Product goals

  • Measure the content inventory (articles DB) and use it efficiently
  • Increase content consumption among anonymous, registered, and subscribed users
  • Maintain content diversity to reduce the risk of over-customization

Step 2 — Selecting the technique to implement recommendations

Approach

  • Content-based recommendations — use past user activity and content attributes as input to recommend articles to users
  • Knowledge-based recommendations — capture explicit specifications from the user and use domain knowledge as input to recommend articles to users
  • Collaborative filtering — use content and user attributes to create clusters of similar data and provide collaborative recommendations

Since this is a first experiment, content-based recommendations are a good starting point among the 3 approaches mentioned above.

Please note that because the chosen approach is content-based recommendations, when a user “U1” is reading an article “A1”, the recommended articles are computed from article attributes (similarity, freshness, and correlation); these recommendations are specific to article A1, not to user U1.

User-specific recommendations can be built using the knowledge-based or collaborative filtering approaches, which are not covered in this article.

Step 3 — Preparing the team and data pipeline infrastructure

Team

  • Developer familiarity with Python and data science libraries (PySpark, Pandas, etc.)
  • Developer familiarity with the AWS machine learning stack (AWS Glue jobs, the SageMaker platform), infrastructure needs, visualization tools, and a new code repository and project set up on GitHub

Data Pipeline & Data Preparation

From the analytics captured on web and mobile applications, create a data pipeline for audience data and page-level events data using a scalable data streaming service.

Prepare the analytics data set for processing and for computing the recommendations:
  • Normalize ‘audience’ data
  • Enrich ‘events’ data by merging it with the audience data
  • Anonymize the merged data set for engineering consumption

Anonymization/tokenization is required to ensure that PII (personally identifiable information) remains privacy-compliant and secure for data controllers (news publishers).
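As a minimal sketch of this preparation step, assuming the audience and events data land in S3 as Parquet and that user_id and email are the PII columns (all paths and column names here are hypothetical placeholders), a PySpark job could look like this:

```python
# Sketch of the enrichment and anonymization steps (paths and columns are assumed).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prepare-analytics").getOrCreate()

audience = spark.read.parquet("s3://bucket/audience/")   # normalized audience data
events = spark.read.parquet("s3://bucket/events/")       # page-level events data

# Enrich page-level events with audience attributes
enriched = events.join(audience, on="user_id", how="left")

# Tokenize PII columns so downstream jobs never see raw identifiers
pii_columns = ["user_id", "email"]
for col in pii_columns:
    enriched = enriched.withColumn(col, F.sha2(F.col(col).cast("string"), 256))

enriched.write.mode("overwrite").parquet("s3://bucket/enriched_events/")
```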

Step 4 — Validating the approach and binding use case to a logic

Use case: As a user, I would like to see engaging & relevant recommendations when I am reading an article on a news site, so that I will read more articles in the same session

Content-based recommendation logic: we compute the following 3 metrics to identify articles suitable for recommendation:

Similarity (S), Correlation (C), Freshness (F)

The final score for an article is computed as a function of these 3 metrics, i.e.

Recommended_Article_Score = f(S + C + F)

Assumption:

The weightage of each metric in the final equation can be need-specific:

  • If users like to read articles read by others with similar preferences on the platform, set a higher weight for the Correlation metric
  • If users want to read articles of similar categories or topics, set a higher weight for the Similarity metric
  • If users want to read “fresh & trending” articles, set a higher weight for the Freshness metric
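As a sketch, assuming f is a weighted sum and that each metric is already scaled to [0, 1], the final score could be computed as below (the weights are illustrative, not values from the experiment):

```python
# Hypothetical weights; tune them per the need-specific assumption above.
WEIGHTS = {"similarity": 0.5, "correlation": 0.3, "freshness": 0.2}

def recommended_article_score(similarity: float, correlation: float, freshness: float) -> float:
    """Combine the three metrics (each in [0, 1]) into one recommendation score."""
    return (WEIGHTS["similarity"] * similarity
            + WEIGHTS["correlation"] * correlation
            + WEIGHTS["freshness"] * freshness)

# Example: an article that is very similar, weakly correlated, and fairly fresh
print(recommended_article_score(0.9, 0.2, 0.7))  # 0.65
```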

Step 5 — Defining the metrics & choosing the batch compute resource

Similarity metric: computed values range from 0 to 1

Article 1 is compared with Article 2 with respect to categories, subcategories, key topics, and entities derived from natural language processing (NLP) of the content, filtered to only those entities with a high relevance score, to compute the Jaccard similarity index ‘S’.

The Jaccard similarity between 2 articles can be computed over these attribute sets:

S (Article1, Article2) = | Article1 ∩ Article2 | / | Article1 ∪ Article2 |
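A minimal sketch of this computation, assuming each article is represented as the set of its high-relevance attributes (the attribute values below are made up):

```python
def jaccard_similarity(attrs_a: set, attrs_b: set) -> float:
    """|A ∩ B| / |A ∪ B|, in [0, 1]."""
    if not attrs_a and not attrs_b:
        return 0.0
    return len(attrs_a & attrs_b) / len(attrs_a | attrs_b)

# Hypothetical attribute sets (categories, topics, NLP entities) per article
article_1 = {"politics", "elections", "congress"}
article_2 = {"politics", "elections", "senate"}
print(jaccard_similarity(article_1, article_2))  # 0.5 (2 shared of 4 total)
```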

Correlation metric: computed values range from 0 to 1

In data mining, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model.

Lift is simply the ratio of these values: target response divided by average response. We use this logic to compute the correlation between articles read by users, as below:

C = n (people who read Article A and Article B) / [n (people who read Article A) * n (people who read Article B)]
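A minimal sketch, assuming the page views data gives us the set of (anonymized) readers per article (the reader IDs below are made up):

```python
def correlation(readers_a: set, readers_b: set) -> float:
    """n(read A and B) / (n(read A) * n(read B))."""
    if not readers_a or not readers_b:
        return 0.0
    return len(readers_a & readers_b) / (len(readers_a) * len(readers_b))

# Hypothetical anonymized reader sets per article
readers_article_a = {"u1", "u2", "u3", "u4"}
readers_article_b = {"u2", "u3", "u5"}
print(correlation(readers_article_a, readers_article_b))  # 2 / (4 * 3) ≈ 0.167
```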

Freshness metric: computed values range from 0 to 1

For every article viewed within the last 24 hours, we can compute the freshness metric based on 2 factors:

  • User traffic factor
  • Time decay factor

An article can be categorized as ‘fresh’ when more people are reading it on the site and new page views keep coming in, or when it was published more recently than other articles on the site.

User traffic may or may not decrease freshness over time, whereas the time decay factor always decreases freshness over time. Combining these 2 factors, we determine the freshness metric.

F = f (user_traffic_factor * time_decay_factor)
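The exact form of the two factors is not fixed above, so the sketch below assumes the traffic factor is the article's share of the site's peak page views in the last 24 hours and that the time decay is exponential with a hypothetical 24-hour half-life:

```python
import math

def freshness(views_last_24h: int, max_views_last_24h: int,
              hours_since_publish: float, half_life_hours: float = 24.0) -> float:
    """Combine traffic and time decay factors into a freshness score in [0, 1]."""
    traffic_factor = views_last_24h / max_views_last_24h if max_views_last_24h else 0.0
    time_decay_factor = math.exp(-math.log(2) * hours_since_publish / half_life_hours)
    return traffic_factor * time_decay_factor

# An article with half the site's peak recent traffic, published 24 hours ago
print(freshness(views_last_24h=500, max_views_last_24h=1000, hours_since_publish=24))  # 0.25
```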

Step 6 — Composing the recommendation list & filtering the display for custom requirements

1. Combining the output of the 3 metric computation Glue jobs, use a final recommendation score computation Glue job to compose the sorted, scored list of articles to recommend on the news site (a sketch follows this list).

2. Apply business logic to filter articles before recommending them (e.g., (a) no sponsored articles, (b) no articles older than 7 days, etc.)

(Figure: AWS Glue job-based batch computation of recommendation metrics)

3. Implement a job performance monitoring solution to ensure high availability and efficient usage of Glue resources (jobs, the Data Catalog, crawlers).

4. Identify the right metrics to measure recommendation quality, such as precision, recall, and mean average precision, and use them to fine-tune the algorithm as required.
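A minimal PySpark sketch of the composition and filtering steps (1) and (2), assuming each metric job writes its output to S3 as Parquet and that candidate metadata carries is_sponsored and published_date columns (all paths, column names, and weights are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compose-recommendations").getOrCreate()

pair_keys = ["article_id", "candidate_id"]
similarity = spark.read.parquet("s3://bucket/metrics/similarity/")    # per article pair
correlation = spark.read.parquet("s3://bucket/metrics/correlation/")  # per article pair
freshness = spark.read.parquet("s3://bucket/metrics/freshness/")      # per candidate article
articles = spark.read.parquet("s3://bucket/articles/")                # candidate metadata

# Step 1: join the three metric outputs and compute the weighted final score
scored = (similarity.join(correlation, pair_keys, "outer")
          .join(freshness, "candidate_id", "left")
          .na.fill(0.0, ["similarity", "correlation", "freshness"])
          .withColumn("score", 0.5 * F.col("similarity")
                               + 0.3 * F.col("correlation")
                               + 0.2 * F.col("freshness")))

# Step 2: business rules — drop sponsored candidates and anything older than 7 days
recommendations = (scored
                   .join(articles.select(F.col("article_id").alias("candidate_id"),
                                         "is_sponsored", "published_date"),
                         "candidate_id")
                   .filter(~F.col("is_sponsored"))
                   .filter(F.datediff(F.current_date(), F.col("published_date")) <= 7)
                   .orderBy("article_id", F.col("score").desc()))

recommendations.write.mode("overwrite").parquet("s3://bucket/recommendations/")
```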

Based on user feedback and by verifying these metrics, the quality of recommendations can be improved by focusing on the right metrics.
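For point 4 above, a minimal offline evaluation sketch, assuming we log which recommended articles a user actually went on to read (the article IDs and click sets below are made up):

```python
def precision_at_k(recommended: list, clicked: set, k: int) -> float:
    """Fraction of the top-k recommendations that were actually read."""
    top_k = recommended[:k]
    return sum(1 for article in top_k if article in clicked) / k

def average_precision(recommended: list, clicked: set) -> float:
    """Mean of precision at the ranks where a clicked article appears."""
    hits, total = 0, 0.0
    for rank, article in enumerate(recommended, start=1):
        if article in clicked:
            hits += 1
            total += hits / rank
    return total / len(clicked) if clicked else 0.0

recommended = ["a1", "a2", "a3", "a4", "a5"]  # ranked recommendations shown
clicked = {"a2", "a5"}                        # articles the user actually read
print(precision_at_k(recommended, clicked, k=3))  # 1/3 ≈ 0.33
print(average_precision(recommended, clicked))    # (1/2 + 2/5) / 2 = 0.45
```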

Hopefully, this article serves as a stepping stone for product managers who manage data platforms and are interested in building recommender systems.

Never stop learning!


Pradeep Murthy

An avid learner solving one problem at a time, using data, building useful and usable products