Common Benchmark Datasets for Recommendation Systems

AC
Data Folks Indonesia
5 min read · Aug 14, 2021
Photo by Daria Shevtsova from Pexels

Recommendation systems are everywhere: you can find them in almost every digital service, such as e-commerce, movie streaming, and e-book marketplaces. When I started a research project on recommendation systems, I struggled to find out which datasets are commonly used in the literature. The question I had was how many popular datasets exist for benchmarking model performance. During my literature review, I found several such datasets, which I list in this article.

In the era of information explosion, recommender systems play a pivotal role in alleviating information overload and have been widely adopted by many online services, including e-commerce, online news, and social media sites. The key to a personalized recommender system is modelling users’ preferences for items based on their past interactions (e.g., ratings and clicks), known as collaborative filtering. Among the various collaborative filtering techniques, matrix factorization (MF) is the most popular one. It projects users and items into a shared latent space, using a vector of latent features to represent each user or item. A user’s interaction with an item is then modelled as the inner product of their latent vectors.

There are two main types of feedback data in recommendation systems: explicit feedback and implicit feedback. Explicit feedback usually takes the form of a rating, such as a scale from 1 to 5, with 1 being bad and 5 being great. Implicit feedback, on the other hand, looks like transaction data in an e-commerce service: the fact that a user didn’t buy an item does not mean they dislike it. They may not know that the item exists, or they may simply not need it. This contrasts with explicit feedback, where the user actually says that they don’t like an item.
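To make these two ideas concrete, here is a minimal sketch in plain Python. The latent vectors are made-up numbers rather than learned values, and `predict_score` and `to_implicit` are hypothetical helper names used only for illustration.

```python
# Minimal sketch: score a user-item pair as the inner product of latent vectors.
# The vectors below are made-up values; a real model would learn them from data.

def predict_score(user_vec, item_vec):
    """Inner product of a user's and an item's latent vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


user_factors = {"alice": [1.0, 0.5], "bob": [0.0, 2.0]}
item_factors = {"movie_a": [2.0, 1.0], "movie_b": [0.5, 0.0]}

score = predict_score(user_factors["alice"], item_factors["movie_a"])
print(score)  # 1.0 * 2.0 + 0.5 * 1.0 = 2.5


def to_implicit(ratings):
    """Turn explicit (user, item, rating) triples into implicit labels.

    Every observed interaction becomes a positive (1). Pairs that were
    never observed stay missing; they are not treated as dislikes.
    """
    return {(user, item): 1 for user, item, _rating in ratings}


explicit = [("alice", "movie_a", 5), ("bob", "movie_b", 2)]
implicit = to_implicit(explicit)
print(("alice", "movie_b") in implicit)  # False: unobserved, not negative
```

Note that the implicit version throws away the rating value: a 2-star rating and a 5-star rating both become a plain "interaction happened" signal.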

These are some of the recommendation system datasets that you can use and experiment with:

  • MovieLens Dataset

The MovieLens dataset is one of the most popular datasets in research papers. It comes from movielens.org, a non-commercial, personalized movie recommendation service. The dataset is suitable for explicit feedback, since it contains the rating that a user gave to a movie. It comes in several sizes, such as MovieLens 1M, 20M, and 25M, where the number indicates how many million ratings it contains. Interesting, right? Find out more here: MovieLens | GroupLens
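As a rough sketch of how the ratings can be read, the snippet below assumes the `UserID::MovieID::Rating::Timestamp` line layout used by the `ratings.dat` file in MovieLens 1M (the 20M and 25M releases ship CSV files instead). The sample lines are illustrative only, and `parse_rating_line` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating_line(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line (assumed layout)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Illustrative lines in the assumed format, not real data.
sample_lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "2::1193::4::978298413",
]

ratings = [parse_rating_line(line) for line in sample_lines]

# Group the rated movies by user, a common first step before modelling.
movies_by_user = defaultdict(list)
for r in ratings:
    movies_by_user[r.user_id].append(r.movie_id)

print(dict(movies_by_user))  # {1: [1193, 661], 2: [1193]}
```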

  • Pinterest

This dataset was not released by Pinterest itself. It comes from an academic project by Xue Geng et al. from the School of Computing, National University of Singapore. They published a paper titled Learning Image and User Features for Recommendation in Social Networks. They explain how the dataset was collected, and I quote:

We use Pinterest, which is one of the largest social curation networks, as the source of the content-centric network for evaluating our proposed methods. To our best knowledge, there is no publicly available social media dataset that is large scale and image-centric with ground-truths of images. In Pinterest, user “pin” images to their own boards, showing their preferences of these images. In This research, we only crawled images with additional information indicating their categories (e.g., Fishing, Travel, Hockey). We used the images categories as the ground truth of user interests. In particular, given a user and his/her pinned images, we first found the category labels of these images and used these labels as the interest of this user. We crawled the profiles of 1 million users together with their pinned images from Pinterest. The users were randomly sampled from the users communities found in the 468 categories we analyzed.

I find this interesting, although I haven’t explored the dataset myself. I suggest reading the full manuscript here. The project page is Academic Projects — Xue@Alphabeta (google.com)

  • Netflix

Netflix held the $1M Netflix Prize, a competition similar to those on Kaggle but hosted on its own website, netflixprize.com. The competition launched in 2006, and the grand prize was awarded in 2009. The dataset is now hosted on Kaggle for anyone who wants to explore it and build a model. It consists of training_set.tar, which contains 17770 files, and movie_titles.txt.

Find out more here Netflix Prize data | Kaggle
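The sketch below shows one way to read the rating files. It assumes each file starts with a `MovieID:` header line followed by `CustomerID,Rating,Date` lines, which is how the Netflix Prize README describes the training data; check the copy you download, since the packaging can differ. The sample lines are illustrative, and `parse_netflix_lines` is a hypothetical helper.

```python
def parse_netflix_lines(lines):
    """Parse lines in the assumed Netflix Prize layout.

    A line ending with ':' starts a new movie; every other non-empty line
    is 'CustomerID,Rating,Date' for the most recent movie.
    """
    ratings = []
    current_movie = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current_movie = int(line[:-1])
        else:
            customer_id, rating, date = line.split(",")
            ratings.append((current_movie, int(customer_id), int(rating), date))
    return ratings


# Illustrative lines in the assumed format, not real data.
sample_lines = [
    "1:",
    "1488844,3,2005-09-06",
    "822109,5,2005-05-13",
    "2:",
    "2059652,4,2005-09-05",
]

netflix_ratings = parse_netflix_lines(sample_lines)
print(netflix_ratings[0])  # (1, 1488844, 3, '2005-09-06')
```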

  • Amazon Product Data

Let me quote straight from the website

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 — July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Download the dataset here Amazon review data (ucsd.edu)
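If you want to turn the reviews into (user, item, rating) triples, a minimal sketch could look like the one below. It assumes the review files contain one JSON object per line with the fields `reviewerID`, `asin`, and `overall`, as documented on the dataset page for this release. The sample records are made up, and `parse_review` is a hypothetical helper.

```python
import json


def parse_review(line):
    """Extract (user, item, rating) from one JSON review line (assumed fields)."""
    record = json.loads(line)
    return record["reviewerID"], record["asin"], float(record["overall"])


# Made-up records in the assumed format, not real data.
sample_lines = [
    '{"reviewerID": "A1EXAMPLE", "asin": "B000000001", "overall": 5.0, "reviewText": "Great"}',
    '{"reviewerID": "A2EXAMPLE", "asin": "B000000001", "overall": 2.0, "reviewText": "Meh"}',
]

triples = [parse_review(line) for line in sample_lines]
print(triples[0])  # ('A1EXAMPLE', 'B000000001', 5.0)
```

The review text and metadata are ignored here; they become useful once you move beyond pure collaborative filtering.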

  • Steam Dataset

The Steam Recommender Systems Dataset is used in academic research and was produced by the same author as the Amazon Product Data. “The dataset contains reviews from the Steam video game platform, and information about which games were bundled together.” There are about 8 million reviews, 2.6 million users, 15K items, and 615 bundles.

Find out more on the website: Recommender Systems Datasets (ucsd.edu)

  • MIND by Microsoft

MIND is one of the public datasets that Microsoft has released for the machine learning community. It is impressive because it contains a large number of interactions along with rich metadata.

MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of Microsoft News website. The mission of MIND is to serve as a benchmark dataset for news recommendation and facilitate the research in news recommendation and recommender systems area.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID. — Website

Find out more MIND (msnews.github.io)
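The impression logs described above can be split into clicked and non-clicked news with a few lines of Python. This sketch assumes the tab-separated `behaviors.tsv` layout from the MIND documentation: impression ID, user ID, time, click history, and impressions written as `NewsID-label` with label 1 for a click. The sample line is made up, and `parse_behavior` is a hypothetical helper.

```python
def parse_behavior(line):
    """Parse one impression log line in the assumed behaviors.tsv layout."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, not_clicked = [], []
    for entry in impressions.split():
        news_id, label = entry.rsplit("-", 1)
        if label == "1":
            clicked.append(news_id)
        else:
            not_clicked.append(news_id)
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),  # news the user clicked before this impression
        "clicked": clicked,
        "not_clicked": not_clicked,
    }


# Made-up line in the assumed format, not real data.
sample_line = "1\tU1001\t11/15/2019 10:22:32 AM\tN101 N102\tN201-1 N202-0 N203-0"

behavior = parse_behavior(sample_line)
print(behavior["clicked"])      # ['N201']
print(behavior["not_clicked"])  # ['N202', 'N203']
```

Unlike most implicit feedback datasets, the non-clicked items here are news the user was actually shown, which makes them more trustworthy negatives than items the user never saw.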

  • Yelp Dataset

I have known about the Yelp dataset for a long time. It is really large and serves many purposes. It is not specifically designed for recommendation systems, but every review comes with a rating, so it may be useful for your project.

Conclusion

That’s it. Those are the datasets that I found most often in the literature. I personally used MovieLens, which is also used in MLPerf, an HPC benchmark, and I tried to dive deeper into the model and the dataset.
