Building a Large-Scale Short-Video Recommendation Dataset and Benchmark

AI-Advance
5 min read · Mar 4, 2024



TL;DR: This article introduces MicroLens, a large-scale, content-driven dataset for short-video recommendation. The dataset comprises 1 billion user-video interaction records, 34 million users, and 1 million short videos, and it provides rich modality information for each video.

Paper: http://arxiv.org/abs/2309.15379
Code: http://github.com/westlake-rep

In recent years, short videos, ranging from a few seconds to several minutes in length, have become increasingly popular among internet users and now occupy a prominent position on social media and entertainment platforms. Because they capture users' attention so effectively, short-video recommendation has garnered widespread attention in both academia and industry. However, current research lacks large-scale, publicly available datasets dedicated to short videos, which hinders progress in the field of short-video recommendation.

Existing video recommendation datasets, such as the well-known MovieLens, focus mainly on feature-length films and lack the content characteristics of short videos. Since short videos differ substantially from feature-length films, a dataset tailored specifically to short-video research is needed. Moreover, datasets such as Tenrec and KuaiRec include only video IDs and pre-extracted video features, which prevents recommendation algorithms from extracting features directly from the original video content. There is therefore an urgent need for a large-scale short-video recommendation dataset with diverse original content to support research on short-video recommender systems.

To address these challenges, the paper introduces MicroLens, a large-scale short-video recommendation dataset consisting of 1 billion user-video interaction records, 34 million users, and 1 million short videos. Each short video carries its original modality information, including the title, cover image, audio, and video content, providing a rich and diverse feature set for recommendation models. In addition, more than ten recommendation baselines and video encoders are benchmarked on the dataset.

The dataset was constructed in five steps: seed video selection, dataset expansion, data filtering, interaction information collection, and dataset integration.

During the seed video selection phase, videos posted between June 2022 and June 2023 with more than 10,000 likes were chosen as seeds; this phase collected 400,000 short videos along with their titles, cover images, and audio/video content. During the dataset expansion phase, 10 externally linked short videos were randomly sampled from each page containing a seed video, yielding information on roughly 5 million videos. In the data filtering phase, duplicates were removed and videos were filtered according to per-modality requirements; for the text modality, for example, short videos whose titles contained fewer than three words were discarded to ensure the quality and relevance of the dataset (see the sketch below).
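As a rough illustration, the deduplication and title-length filter could look like the following minimal Python sketch; the record fields and function structure are assumptions for illustration, not the authors' actual pipeline code:

```python
from dataclasses import dataclass

@dataclass
class VideoMeta:
    video_id: str  # platform video ID (anonymized before release)
    title: str     # raw video title

def filter_videos(videos: list[VideoMeta]) -> list[VideoMeta]:
    """Drop duplicate IDs and videos whose titles contain fewer than
    three words, mirroring the text-modality filter described above."""
    seen: set[str] = set()
    kept: list[VideoMeta] = []
    for v in videos:
        if v.video_id in seen:
            continue                      # deduplication
        if len(v.title.split()) < 3:
            continue                      # title too short for text modality
        seen.add(v.video_id)
        kept.append(v)
    return kept
```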

In the interaction information collection phase, user-video interactions were derived mainly from user comments: each video's webpage was visited and up to 5,000 comments were collected per video. In the dataset integration phase, the sheer scale of the data called for dedicated infrastructure, namely a distributed large-scale download system comprising collection nodes, download nodes, and integration nodes.
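Conceptually, this step turns each video's commenters into (user, video, timestamp) records, capped at 5,000 comments per video. A minimal sketch, with the comment record layout assumed purely for illustration:

```python
from typing import Iterable, NamedTuple

MAX_COMMENTS = 5000  # per-video cap described above

class Interaction(NamedTuple):
    user_id: str
    video_id: str
    timestamp: int  # comment time, later used to order each user's history

def comments_to_interactions(video_id: str,
                             comments: Iterable[dict]) -> list[Interaction]:
    """Convert up to MAX_COMMENTS comments on one video into user-video
    interaction records. The 'user'/'time' keys are assumed fields."""
    records: list[Interaction] = []
    for i, comment in enumerate(comments):
        if i >= MAX_COMMENTS:
            break
        records.append(Interaction(comment["user"], video_id, comment["time"]))
    return records
```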

For privacy protection, the dataset includes only publicly available user behavior, and all user and video IDs were anonymized. The following figure illustrates the content features of each short video in the MicroLens dataset (title, cover image, video, and so on).
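Anonymization of this kind is commonly implemented as a salted one-way hash from raw platform IDs to opaque tokens. The sketch below shows the general idea; the salt and hashing scheme are assumptions, not the authors' documented method:

```python
import hashlib

SALT = b"private-release-salt"  # assumed; a real salt is kept secret

def anonymize_id(raw_id: str) -> str:
    """Map a raw user or video ID to an opaque, irreversible token so
    released interaction logs cannot be linked back to accounts."""
    return hashlib.sha256(SALT + raw_id.encode("utf-8")).hexdigest()[:16]

# e.g. anonymize_id("user_12345") yields a stable 16-hex-char pseudonym
```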

The dataset is released in multiple scale versions: MicroLens, MicroLens-1M, and MicroLens-100K. Detailed statistics are shown in the table below:

The following figure illustrates some statistical data for MicroLens-100K:

The article also evaluates multiple recommendation baselines on MicroLens, including ID-based models (both non-sequential collaborative filtering, CF, and sequential recommendation, SR), VIDRec models that augment IDRec with pre-extracted video features, and VideoRec models trained end to end on raw video content.

The results show that IDRec based on sequential modeling (SR) generally outperforms IDRec based on non-sequential modeling (CF). Adding pre-extracted video features in VIDRec does not significantly improve on IDRec, whereas training the video features and the recommendation model end to end does improve video recommendation. Using raw video content rather than pre-extracted, frozen features is thus crucial for the best recommendation results, underscoring the value of MicroLens as a raw-content video dataset.
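To make this concrete, the sketch below shows the end-to-end setup in PyTorch: a trainable video encoder replaces the ID embedding table and feeds a Transformer-based sequential model, so ranking gradients reach the encoder. The encoder here is a toy stand-in (the benchmark uses real video networks such as SlowFast), and the dimensions, layer sizes, and dot-product scoring are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VideoRec(nn.Module):
    """End-to-end sketch: a trainable video encoder produces item
    embeddings that feed a Transformer sequential recommender, so the
    ranking loss updates both parts (unlike frozen VIDRec features)."""

    def __init__(self, frame_dim: int, dim: int = 64,
                 n_heads: int = 2, n_layers: int = 2):
        super().__init__()
        # Toy stand-in for a real video network such as SlowFast.
        self.video_encoder = nn.Sequential(
            nn.Linear(frame_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True
        )
        self.seq_model = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, seq_len, frame_dim) raw content of watched videos
        item_emb = self.video_encoder(clips)   # trainable, not pre-extracted
        seq_emb = self.seq_model(item_emb)     # model the watch sequence
        return seq_emb[:, -1]                  # user state after last video

model = VideoRec(frame_dim=512)
user_state = model(torch.randn(8, 10, 512))        # 8 users, 10 clips each
cand = model.video_encoder(torch.randn(100, 512))  # 100 candidate videos
scores = user_state @ cand.T                       # (8, 100) ranking scores
```

The key point is that `video_encoder` receives gradients from the recommendation objective, which is exactly what VIDRec's frozen, pre-extracted features preclude.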

The following figure illustrates the impact of different video encoders on video recommendation. VideoMAE performs best on the video classification task and is therefore used as the feature extractor for VIDRec, while SlowFast networks achieve better performance under end-to-end training and therefore serve as the video encoder for VideoRec.
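The practical difference between the two regimes is where gradients stop. The self-contained toy contrast below uses plain linear layers and an MSE loss as stand-ins for the real encoders (VideoMAE/SlowFast) and the real ranking objective:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the snippet runs on its own; in the benchmark the
# frozen extractor is VideoMAE and the end-to-end encoder is SlowFast.
encoder = nn.Linear(512, 64)
recommender = nn.Linear(64, 1)
clips, targets = torch.randn(8, 512), torch.randn(8, 1)
loss_fn = nn.MSELoss()  # placeholder for a real ranking loss

# VIDRec style: features are extracted once with the encoder frozen.
with torch.no_grad():
    frozen = encoder(clips)                 # gradients can never reach encoder
loss_fn(recommender(frozen), targets).backward()  # only recommender learns

# VideoRec style: encoder and recommender are trained jointly.
recommender.zero_grad()
loss_fn(recommender(encoder(clips)), targets).backward()  # both learn
```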

The following figure illustrates that parameters learned from computer vision tasks, particularly video understanding, can improve the accuracy of video recommendation.

For a deeper exploration of the dataset, see the original paper.
