How to Build a Recommender Engine for Medical Research Papers

A step-by-step guide to building a recommender pipeline, from data wrangling to model evaluation

In 2006, Netflix, then a DVD rental service, announced a data science competition for movie rating prediction: a $1 million grand prize to the first team that could improve the prediction accuracy of its existing recommender system by 10%. The competition drew enormous interest from researchers and engineers across academia and industry; within the first year, over 40,000 teams from more than 100 countries had entered [1].

In June 2009, the prize was awarded to BellKor’s Pragmatic Chaos, a team of AT&T engineers who submitted the winning algorithm only minutes ahead of the second-place team [2].

[Image: BellKor’s Pragmatic Chaos, the winning team of the Netflix Challenge (2009)]

A decade later, recommender systems remain an active area of research. They’re now taught in computer science courses [3] and discussed at international conferences [4]. A quick Google search for “recommender systems” returns an overwhelming list of tutorials, papers, book chapters, and videos. However, much of this content focuses on the theory behind recommenders rather than their practical application. This post is a step-by-step practical guide to building a recommender pipeline, using medical research papers as a case study. The pipeline breaks down into five steps:


  1. Data Pre-Processing
  2. Model Selection
  3. Hyperparameter Tuning
  4. Model Training and Evaluation
  5. Data Post-Processing

The Problem and Dataset

[Image: QxMD’s “collection” is analogous to Spotify’s “playlist”]

This problem was more complex than a traditional recommendation task because we were dealing with user-curated collections (i.e., playlists for papers), which could be updated at any time. We had data on which collections a user had subscribed to, and which papers a user had opened and read. While we knew when a paper was read and for how long, we did not know how the paper was accessed (e.g., via a collection, the user’s home page, or a search result). Our challenge was to identify relevant collections for users from their paper-level (rather than collection-level) interaction data.

Step 1: Data Pre-Processing

We were working with an implicit feedback dataset, so each cell in our utility matrix represented a user’s interaction with a given collection. This can be thought of as our response variable.


We tested different types of preference metrics such as:

  • whether a user proceeded to read the full-text paper from the abstract
  • whether a user shared a paper
  • whether a user “favourited” a paper
  • time spent reading a paper (dwell time) [5]

You can also define your own “preference score” based on the type of activity you want to optimize. If the goal is to get users to read more paper abstracts and share more frequently, you can construct a score that takes into account a user’s abstract views and sharing activity. To incorporate temporal information, you can modify your score to give more weight to more recent interactions [6].
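As a sketch, a hand-rolled preference score might combine abstract views and shares with an exponential recency decay. The weights and half-life below are hypothetical, not values from our pipeline:

```python
def preference_score(abstract_views, shares, days_ago, half_life=30):
    """Hypothetical preference score: abstract views plus a bonus for
    shares, exponentially decayed so recent interactions weigh more [6]."""
    base = abstract_views + 2.0 * shares  # assumed: shares count double
    decay = 0.5 ** (days_ago / half_life)  # halves every `half_life` days
    return base * decay

preference_score(4, 1, days_ago=0)   # recent interaction, full weight: 6.0
preference_score(4, 1, days_ago=30)  # one half-life old: 3.0
```

Tuning the half-life lets you control how quickly old interactions stop influencing recommendations.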

As part of our initial solution, we used the number of abstract and full-text reads as our preference score. If a user had read many papers within a collection, we assumed this was a strong signal of preference for that collection. We identified outliers as users with unusual, bot-like behaviour (thousands of “shares” and paper reads in a single day) and removed them from the utility matrix. Our resulting utility matrix consisted of around 500,000 users and 10,000 collections, with a density of ~1%. As the matrix becomes denser, recommendations tend to improve in quality.
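A minimal sketch of this pre-processing step with SciPy, using a hypothetical interaction log and a made-up read-count threshold for filtering bot-like users (our actual filtering rules were more involved):

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical interaction log: (user_id, collection_id, n_reads)
interactions = [
    (0, 0, 3), (0, 2, 1),
    (1, 1, 7),
    (2, 0, 2), (2, 1, 1), (2, 2, 4),
    (3, 1, 5000),  # bot-like outlier: thousands of reads in one day
]

READ_CAP = 1000  # hypothetical threshold for bot-like behaviour
clean = [(u, c, r) for u, c, r in interactions if r < READ_CAP]

# build the sparse user x collection utility matrix
users, cols, reads = zip(*clean)
utility = csr_matrix((reads, (users, cols)), shape=(4, 3))

density = utility.nnz / np.prod(utility.shape)
```

At production scale (~500,000 users by ~10,000 collections at ~1% density), a sparse representation like this is essential to keep memory usage manageable.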

Step 2: Model Selection

1. Alternating Least Squares

R ≈ U · Vᵀ

where U is an m × k user-factor matrix, V is an n × k collection-factor matrix, and k is the number of latent factors. Multiplying the latent factor matrices U and V gives a reconstructed matrix R with populated values representing estimated interaction scores (rather than explicit ratings, since we’re dealing with implicit feedback). In traditional matrix factorization, we solve for U and V simultaneously. With ALS, we fix U and compute V, then fix V and compute U, repeating until the loss function converges. What makes ALS appealing is that it performs extremely well given sufficient data, and it’s parallelizable, which makes hyperparameter tuning relatively inexpensive.
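To make the alternating updates concrete, here is a minimal dense-matrix ALS sketch in NumPy. It is for illustration only: it treats zeros as observed values, whereas the implicit-feedback formulation of Hu et al. [7] weights entries by confidence instead:

```python
import numpy as np

def als(R, k=2, reg=0.1, iters=20, seed=0):
    """Minimal ALS: alternately solve for U and V in closed form."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    V = rng.normal(scale=0.1, size=(n, k))
    U = np.zeros((m, k))
    I = reg * np.eye(k)  # L2 regularization term
    for _ in range(iters):
        # fix V, solve the regularized least-squares problem for U
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T
        # fix U, solve for V
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T
    return U, V

# toy 4-user x 4-item interaction matrix
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])
U, V = als(R, k=2)
R_hat = U @ V.T  # reconstructed matrix of estimated interaction scores
```

Each half-step is an independent least-squares solve per row, which is what makes ALS easy to parallelize across users (or collections).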

2. Bayesian Personalized Ranking
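Bayesian Personalized Ranking [8] takes a pairwise approach: rather than predicting interaction scores directly, it learns factors such that items a user interacted with rank above items they did not. A minimal stochastic gradient step on a toy problem might look like this (dimensions and hyperparameters are illustrative):

```python
import numpy as np

def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One SGD step of BPR: push user u to rank interacted item i
    above non-interacted item j."""
    x_uij = U[u] @ (V[i] - V[j])          # current score difference
    g = 1.0 / (1.0 + np.exp(x_uij))       # sigmoid(-x_uij), gradient scale
    du = g * (V[i] - V[j]) - reg * U[u]
    di = g * U[u] - reg * V[i]
    dj = -g * U[u] - reg * V[j]
    U[u] += lr * du
    V[i] += lr * di
    V[j] += lr * dj

# toy example: one user, two items, 4 latent factors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(1, 4))
V = rng.normal(scale=0.1, size=(2, 4))
for _ in range(200):
    bpr_step(U, V, u=0, i=0, j=1)
# after training, user 0 scores item 0 above item 1
```

In practice the (u, i, j) triples are sampled randomly from the interaction data at each step.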

We ultimately chose Alternating Least Squares as the model for our recommender pipeline because it was 1) easier to implement, 2) more computationally efficient, and 3) able to generate impressive results in a smaller proof-of-concept prototype.

Step 3: Hyperparameter Tuning

We used a Python package called scikit-optimize to tune the hyperparameters of our ALS model. If you take a look at scikit-optimize’s documentation, you’ll learn that it’s based on sequential model-based optimization (SMBO), which is more efficient and less exhaustive than other optimization techniques such as grid search. With SMBO, results from the previous iteration are used to improve the sampling method of the next iteration over a defined search space.

[Image: Sequential model-based optimization using scikit-optimize]

The plot above illustrates how scikit-optimize works. In this example, there are two hyperparameters that are being optimized:

  1. number of latent factors (int)
  2. regularization factor (float, ranging from 0 to 1)

Each dot represents one iteration of the optimization process. Earlier iterations are represented by purple dots, while later iterations are represented by green and yellow dots. The red dot represents the optimal value of a given hyperparameter. As shown above, yellow dots are clustered around the red dot which suggests that as the number of iterations increases, we converge toward the optimal hyperparameter value.

Hyperparameter tuning is a dynamic process that needs to be re-run as more data enters the pipeline. Constantly tuning hyperparameters on the fly can get computationally expensive, especially if you’re dealing with a large dataset. To reduce the computation costs of hyperparameter tuning, you can set up a cron job that runs the tuning process at a reasonable time interval (e.g., hourly, daily, weekly). You can also use the “warm-start” technique which lets you initialize the current tuning process with input points calculated from a previous tuning process [10]. In other words, with warm-starting, you can continue the tuning process from where it left off.

Step 4: Model Training and Evaluation

Quantitative Evaluation
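One common quantitative metric for implicit-feedback recommenders is precision@k on held-out interactions: of the top k collections we recommend, how many did the user actually engage with? A minimal sketch with hypothetical data:

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended collections that the user
    actually interacted with in the held-out period."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# hypothetical held-out data for one user
recommended = ["c12", "c7", "c3", "c99", "c42"]  # model's ranked output
relevant = {"c7", "c42", "c5"}                    # held-out interactions
precision_at_k(recommended, relevant, k=5)        # 2 hits out of 5 -> 0.4
```

In practice you would average this over all users, and often pair it with a rank-aware metric so that hits near the top of the list count for more.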

Qualitative Evaluation

Step 5: Data Post-Processing

Another post-processing task was to generate more descriptive titles for our collections. Since collections are curated by users of QxMD, a title may not accurately represent the content of that collection. For example, say a paediatric resident’s recommended collections are titled “Aa”, “Neo”, and “Pediatrià”. While the content of these collections may be highly relevant, the ambiguous titles may discourage the user from opening them and looking at the papers within. To solve this issue, we used medical subject headings (MeSH) to extract the most relevant topics from a given collection.

Upon publication, a peer-reviewed medical research paper will get assigned a set of medical subject headings (MeSH). These MeSH terms are essentially keywords that represent the main topics of a given paper. A collection can contain anywhere from a few to over 100 papers, with each paper having several MeSH terms. We can collect these MeSH terms and perform a technique called “term frequency–inverse document frequency” (TF-IDF) to identify the most pertinent MeSH terms that describe the collection.

TF-IDF is a term-weighting scheme often used in information retrieval and text mining. It works by looking at the frequency of a term within a given collection and comparing it to the term’s frequency across all collections. As a term’s frequency within a collection increases, its TF-IDF value increases, indicating that the term is highly relevant to that collection. The value is offset by the number of other collections that also contain the term, which ensures we don’t overvalue terms that are frequent in general. We used scikit-learn to implement TF-IDF and chose the top 5 most relevant terms to describe each collection. The figure below shows an example of the results.

[Image: example of top TF-IDF-ranked MeSH terms extracted for a collection]

This MeSH feature will be deployed to production once appropriate changes to the UI have been made.

Wrap Up


  2. BellKor’s Pragmatic Chaos Wins $1 Million Netflix Prize by Mere Minutes. Wired, 2009.
  3. Recommender Systems on Coursera.
  4. ACM RecSys Conference.
  5. Yi et al. 2014. Beyond Clicks: Dwell Time for Personalization.
  6. Lee et al. 2006. A Time-based Recommender System Using Implicit Feedback.
  7. Hu et al. 2008. Collaborative Filtering for Implicit Feedback Datasets.
  8. Rendle et al. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback.
  9. Rodrigo et al. 2016. Distributed Bayesian Personalized Ranking in Spark.
  10. Kim et al. 2017. Learning to Warm-Start Bayesian Hyperparameter Optimization.

Written by

Data scientist based in Toronto, Canada.
