Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction

Paper by Google Research (Brain Team)

Akshay Sharma
5 min read · Apr 28, 2024

Outline

  • Introduction
  • Recent Work and Challenges
  • Proposed Solution
  • Evaluation And Results
  • Conclusion
  • Reference

Introduction

  • LLMs have shown their success in various tasks, from text generation, translation, and summarization to human-like chatbots.
  • They are good at these tasks because they were trained on large datasets, which makes them rich in information and also helps them generalize to other tasks in zero- or few-shot settings.

Recent Work and Challenges

  • There are early attempts using BERT and GPT-2 to generate recommendations through natural language on the MovieLens dataset, which show promising results but are not as good as baseline models.
  • P5 fine-tunes the open-source T5 model to unify ranking and retrieval into one model; M6-Rec is another related work that tackles CTR prediction tasks.
  • Two recent works also explored zero-shot prediction and 3-stage prediction, which show competitive results but are still not better than existing MLP baselines.
  • In spite of these efforts, there is no comprehensive study that evaluates LLMs of varying sizes and contrasts them against carefully tuned, strong baselines.

Proposed Solution

  • This paper explored LLMs ranging from 250M to 540B parameters on a specific task, user rating prediction, under 3 scenarios: 1) zero-shot, 2) few-shot, 3) fine-tuning.
  • The experiments are conducted on 2 open datasets, MovieLens and Amazon Books.
  • The task is to predict a user's rating for a given movie/book.
Figure 1

Zero and Few-shot Rating Prediction:

  • As shown in Figure 1, in both zero- and few-shot settings the prompt includes the title and genre along with the rating for each of the user's past interactions. The output is then parsed to extract a rating. Sometimes LLMs require additional instructions to give the desired output, e.g., “Do not give reasoning” to prevent the model from outputting text other than just the rating.
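The prompting and parsing steps above can be sketched in Python. This is a hypothetical illustration, not the paper's exact template: `build_prompt` and `parse_rating` are made-up names, and the wording of the prompt is an assumption.

```python
import re

def build_prompt(history, candidate, n_shots=3):
    """history: list of (title, genre, rating); candidate: (title, genre)."""
    lines = ["Here is the user's movie rating history:"]
    for title, genre, rating in history[-n_shots:]:
        lines.append(f'- "{title}" ({genre}): {rating} stars')
    title, genre = candidate
    lines.append(f'How would the user rate "{title}" ({genre})?')
    # Extra instruction to keep the model from producing free-form reasoning.
    lines.append("Answer with a single number from 1 to 5. Do not give reasoning.")
    return "\n".join(lines)

def parse_rating(text):
    """Pull the first number out of the model's reply; None if absent."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

prompt = build_prompt(
    [("Heat", "Crime", 5), ("Toy Story", "Animation", 4), ("Se7en", "Thriller", 5)],
    ("Casino", "Crime"),
)
```

Parsing defensively matters because, even with the extra instruction, a chat model may still wrap the rating in a sentence.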

Fine-tuning LLMs for Rating Prediction:

  • In this paper, they explored fine-tuning Flan-T5, which was publicly available and had competitive performance on a wide range of benchmarks at the time. The problem can be formulated in 2 different ways: 1) multi-class classification, 2) regression.
  • For classification: they defined 5 classes (one per rating) and optimized with cross-entropy loss.
  • For regression: they replaced the projection layer with a (d, 1) head and optimized with MSE loss.
Figure 2: Training Ways
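The two training objectives can be sketched with toy loss computations. This is a pure-Python illustration of the formulations, not the paper's implementation; the logits and predictions below are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_loss(logits, target_class):
    """Classification view: one logit per rating class (1-5)."""
    return -math.log(softmax(logits)[target_class])

def mse_loss(pred, target):
    """Regression view: a single (d, 1) projection yields one scalar."""
    return (pred - target) ** 2

# A ground-truth rating of 4 is class index 3 in the classification view,
# or the raw target 4.0 in the regression view.
logits = [0.1, 0.2, 0.5, 2.0, 0.7]  # hypothetical class logits
ce = cross_entropy_loss(logits, 3)
reg = mse_loss(3.6, 4.0)
```

The classification view treats ratings as unordered labels, while the regression view exploits their ordinal structure, which is why the paper compares both.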

Evaluation And Results

  • As mentioned above, they used 2 open datasets (MovieLens and Amazon Books). Items without metadata were filtered out, and the data was split by time: 90% for training and the rest for testing. Due to high computation cost, only 2,000 examples were sampled from the test set, and all results were reported on these samples. During training and evaluation, only each user's 10 most recent interactions were considered.
Table 1: Stats of the datasets
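The preprocessing above can be sketched as follows. This is an illustrative sketch assuming interactions are (user, item, rating, timestamp) tuples; the function names are made up, not from the paper's code.

```python
def time_split(interactions, train_frac=0.9):
    """Sort interactions by timestamp, then take the first 90% for training."""
    ordered = sorted(interactions, key=lambda x: x[3])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def last_k_history(user_interactions, k=10):
    """Keep only a user's k most recent interactions."""
    return sorted(user_interactions, key=lambda x: x[3])[-k:]

# Toy data: (user, item, rating, timestamp)
logs = [("u1", f"i{t}", 3 + t % 3, t) for t in range(20)]
train, test = time_split(logs)
```

A time-based split (rather than a random one) avoids training on interactions that happen after the ones being predicted.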
  • RMSE, MAE, and ROC-AUC metrics were used for the evaluation. For ROC-AUC, a rating greater than or equal to 4 is considered positive and the rest negative.
  • Baselines: they considered traditional recommenders like Matrix Factorization (MF) and Multi-Layer Perceptrons (MLP), attribute- and rating-aware predictors like Transformer-MLP, and heuristic methods like the global average rating, the candidate item's average rating, and the user's past average rating.
  • LLMs for zero-shot and few-shot learning: a total of 3 best-in-class LLMs were used. Text-davinci-003 (175B) and ChatGPT (gpt-3.5-turbo) were accessed through OpenAI's API, and Flan-U-PaLM (540B) by Google was also utilized.
  • LLMs for fine-tuning: Flan-T5-Base (250M) and Flan-T5-XXL (11B) models were used.
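The metrics can be sketched in plain Python (an illustrative sketch, not the paper's evaluation code). For ROC-AUC, ratings ≥ 4 are binarized to positive as described above, and AUC is computed pairwise as the probability that a positive item outscores a negative one.

```python
import math

def rmse(preds, targets):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def mae(preds, targets):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def roc_auc(scores, ratings, threshold=4):
    """Pairwise AUC: rating >= threshold is positive; ties count as 0.5."""
    labels = [1 if r >= threshold else 0 for r in ratings]
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

RMSE and MAE measure rating error directly, while ROC-AUC checks whether the model at least orders liked items above disliked ones.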
Table 2: User Rating Prediction Results
  • Table 2 shows that zero-shot and few-shot prompting (3 examples given, i.e., 3-shot) performed better than the global average rating and were somewhat comparable to the item and user average ratings, but underperformed traditional recommendation models.
  • Fine-tuning is a better way to feed dataset knowledge into an LLM. The fine-tuned Flan-T5-XXL model performed better than the strongest Transformer-MLP baseline, suggesting that LLMs may be more suitable for ranking tasks.
Figure 3: Size vs Performance
  • Figure 3 shows that LLM performance in the zero-shot setting only becomes reasonable beyond roughly 100B parameters.
Figure 4: Training Epoch vs Performance
  • Since LLMs were already trained on large amounts of data, they required only a small fraction of the training data needed by traditional recommendation models, which are trained from scratch. A detailed comparison can be seen in Figure 4.

Conclusion

  • This paper shows the potential of LLMs: although zero-shot and few-shot prompting are not competitive, fine-tuned LLMs were able to outperform traditional methods.
  • Fine-tuning LLMs also has the benefits of lower data requirements and simpler input data (just design a prompt), so no feature engineering is required :).

Reference

  1. Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction, Google Research. https://arxiv.org/pdf/2305.06474.pdf

NOTE: This blog is an overview of the paper to inspire the reader to read the paper mentioned in the reference section.
