Recommender Systems inspired by Large Language Models

Dinesh Ramasamy
6 min read · Aug 8, 2023


Recommender systems are designed to match users with items. Typically, they are organized into two stages: (i) Retrieval (aka Candidate Generation) and (ii) Ranking, as described in this classic YouTube paper:

Overview of a general recommender system architecture from YouTube paper in RecSys 2016

They rely heavily on feature engineering and digest a large number of features (on the order of thousands) per impression. This places a huge burden on training and serving infrastructure.

The nature of the problem dictates many feature pipelines and feature groups (i.e., collections of related features): For example, at a minimum one typically needs a user feature group, a few context feature groups and an item feature group. Furthermore, we need many more cross-feature groups (think how often the item was served during the 11AM-12PM time window as opposed to the 6AM-7AM time window; this is an item-context feature). Each feature group in turn typically contains multiple rolling-window aggregations of events (like 1h, 6h, 1d, 7d, 14d, 30d, etc). Setting up these feature groups for training and provisioning feature stores for serving is non-trivial. Ensuring the correctness of the numerous pipelines backing these systems is a challenge in itself. ML practitioners typically build features based on intuition, and it is very expensive to run ablation studies and isolate the useful features in each feature group. This contributes to quick feature bloat. Feature bloat and mismatch impact the cost and quality (latency, accuracy, etc) of recommender systems.
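To make the bloat concrete, here is a small illustrative sketch (the group, aggregation, and window names are made up, not from any real pipeline) of how quickly feature definitions multiply:

```python
from itertools import product

# Hypothetical feature groups, aggregations, and rolling windows;
# the names are illustrative, not a real pipeline definition.
feature_groups = ["user", "item", "context", "user_x_item", "item_x_context"]
aggregations = ["impression_count", "click_count", "like_rate"]
windows = ["1h", "6h", "1d", "7d", "14d", "30d"]

# Every (group, aggregation, window) triple becomes a feature to backfill,
# store, and serve; the count grows multiplicatively.
features = [f"{g}.{a}.{w}" for g, a, w in product(feature_groups, aggregations, windows)]
print(len(features))  # 5 * 3 * 6 = 90 features from just a handful of definitions
```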

This is not too different from how one used to work with language models in the past: the general approach was to engineer many features (think n-grams, entity extractors, etc) and then hand them off to models that tease out information from those features. However, we have since moved on to feeding long sequences of raw text to transformer models, and they work very well without any tailored feature engineering.

Can we take inspiration from language models to make feature engineering minimal for recommender systems?

We believe it is possible, and that it would save a lot of repetitive, wasted feature-engineering effort. To do this, we materialize all the metadata that was used to design the original features as inputs to the "RecSys LLM", in the form of user-level, time-ordered aggregations (arrays). In typical settings, one would need a separate array for item IDs (since they are used as sparse features), arrays for context such as location and time, and finally arrays for outcomes (whether the item was liked, bought, etc), as all of these are typically used in feature aggregation. One can even input a sequence of content embeddings of the items. To keep the model general, one should not use user IDs in this LLM formulation. These feature sequences should be appropriately converted to embeddings and acted upon by a Causal Transformer Encoder. Such an architecture can mine features and feature-crosses through its transformer layers.
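Here is a minimal sketch of that formulation, assuming PyTorch and made-up dimensions, vocabulary sizes, and context/outcome encodings; it is meant to show the shape of the idea, not a production implementation:

```python
import torch
import torch.nn as nn

class RecSysSequenceEncoder(nn.Module):
    """Minimal sketch of the "RecSys LLM" idea: time-ordered arrays of item IDs,
    context, and outcomes are embedded and fed to a causal transformer encoder.
    No user ID is used, so the model stays general across users."""

    def __init__(self, num_item_buckets=1_000_000, num_context=64, num_outcomes=4, d_model=128):
        super().__init__()
        # Item IDs are hashed into a fixed number of buckets (see the hashing
        # discussion later in the post) instead of a string-indexed vocabulary.
        self.item_emb = nn.Embedding(num_item_buckets, d_model)
        self.context_emb = nn.Embedding(num_context, d_model)    # e.g. bucketized hour-of-day
        self.outcome_emb = nn.Embedding(num_outcomes, d_model)   # e.g. impression / click / like / buy
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, item_ids, context_ids, outcome_ids):
        # All inputs: (batch, seq_len) int64 tensors, ordered by event time.
        # Summing the embeddings is one simple way to combine the parallel arrays.
        x = self.item_emb(item_ids) + self.context_emb(context_ids) + self.outcome_emb(outcome_ids)
        seq_len = item_ids.size(1)
        # Each position only attends to earlier events, like a decoder-only LLM.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        return self.encoder(x, mask=causal_mask)
```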

What are the challenges in building such an LLM?

For scenarios like short-video recommendation, the number of interesting events (typically positive interactions) can easily run into the thousands if the look-back window for feature aggregation is 30 days (very typical). So feature fetching in the serving path is not easy, but it is a solvable problem: instead of retrieving thousands of features, we now need to retrieve arrays with thousands of entries for active users. We can combat this to an extent by limiting the "context window" to, say, 768. Transformer models built on this kind of input will also be somewhat slow, as they need to operate on long context windows (attention cost grows quadratically with window size).
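For example, limiting the context window can be as simple as truncating the per-user event arrays consistently before they reach the model (the cap of 768 here is just the illustrative number from above):

```python
MAX_EVENTS = 768  # assumed context-window cap

def truncate_user_history(item_ids, context_ids, outcome_ids, max_events=MAX_EVENTS):
    """Keep only the most recent `max_events` events, applied consistently to all
    parallel arrays. Self-attention cost grows roughly quadratically with this
    length, so the cap bounds both serving latency and memory."""
    return item_ids[-max_events:], context_ids[-max_events:], outcome_ids[-max_events:]
```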

But the nice thing is that we can run the expensive part of the model (the causal transformer encoder) just once per user and reuse the cached result for all items that need to be ranked as part of each request. (This applies to the retrieval model as well, since it only needs to run once per request anyway.)
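A sketch of that caching pattern, reusing the (assumed) encoder interface from the earlier snippet: the encoder runs once on the user's history, and every candidate item is then scored with a cheap dot product:

```python
import torch

def rank_items_for_user(encoder, user_arrays, candidate_item_embs):
    """Run the expensive causal encoder once per request, take the final-position
    output as the user representation, then score every candidate with a dot
    product. `encoder` and the embedding shapes are assumptions carried over
    from the earlier sketch."""
    with torch.no_grad():
        hidden = encoder(*user_arrays)                       # (1, seq_len, d_model), computed once
        user_vec = hidden[:, -1, :].squeeze(0)               # last position summarizes the history
        scores = candidate_item_embs @ user_vec              # (num_candidates,)
    return scores.argsort(descending=True)                   # candidate indices, best first
```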

How is this different from sequential recommendation models such as BERT4Rec?

At a high level, the ideas are similar. However, sequential recommender models focus on the in-session setting and do not look at long contexts. Barring some early work like SASRec, recent sequential recommender models like BERT4Rec are not causal and hence may not be well suited for most settings. Most relevant to the "RecSys LLM" proposed here is the offline PinnerFormer:

PinnerFormer architecture from Pinterest

How can one pre-train such a RecSys LLM?

Pre-training a Recommender System LLM for retrieval task

While the basic ideas, like causal targets and a causal transformer encoder backbone, are the same as for training any LLM, there are some notable differences, as explained below:

  • The item ID space (analogous to the tokenizer vocabulary) is typically very large (compared to a ~32K-token vocabulary for Llama 2, the active ID space for short-form videos can be in the many millions). Rather than using a separate embedding for each unique ID, we encourage you to use QR/ROBE embeddings (or, even better, invent your own simple parameter-sharing scheme that relies on some form of hashing, like KShift); see the sketch after this list. This means that one does not need an expensive and hard-to-maintain string indexer (like this one in Spark ML).
  • The output layer of the model cannot be logits that predict the next ID exactly, for the same reasons (large vocabulary, no string indexer, etc). We encourage using embeddings to represent the next ID and applying a contrastive loss between the model output sequence and the (detached) input item embedding sequence, so that the model learns to predict the next ID accurately; see the loss sketch after this list. Furthermore, one can easily remove "accidental hits" [link to the original form of this idea in TFRS] in this setting.
Special in-batch negatives loss for sequences
  • We suggest adding content embedding sequences as inputs to the model to effectively deal with low-distribution content. To learn highly non-linear transformations, we found it useful to use a cascade of increasing-resolution LSH ops (non-learnable) followed by embedding layers (learnable) to align content embeddings with the latent embeddings (see this post for more details on LSH embeddings).
  • The label, embedding, and context sequences that are input to the model can also be predicted in a causal manner to further improve model accuracy.
  • Using left padding and reverse positional bias seems to help with predicting the next item
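To make the first two bullets concrete, here is a sketch (bucket counts, dimensions, and the temperature are illustrative assumptions) of a quotient-remainder style hashed embedding and an in-batch contrastive next-item loss; masking of accidental hits is left as a comment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QREmbedding(nn.Module):
    """Quotient-remainder style parameter sharing for very large ID spaces
    (a simplified sketch; the bucket count is an assumption, not a tuned value)."""
    def __init__(self, num_buckets=100_000, d_model=128):
        super().__init__()
        self.num_buckets = num_buckets
        self.quotient_emb = nn.Embedding(num_buckets, d_model)
        self.remainder_emb = nn.Embedding(num_buckets, d_model)

    def forward(self, ids):
        # Raw hashed IDs are split into two smaller indices so that millions of
        # items share O(num_buckets) parameters, with no string indexer needed.
        return self.quotient_emb((ids // self.num_buckets) % self.num_buckets) \
             + self.remainder_emb(ids % self.num_buckets)


def next_item_contrastive_loss(hidden, item_embs, temperature=0.1):
    """In-batch contrastive loss for causal next-item prediction. `hidden` is the
    encoder output and `item_embs` the input item embeddings, both shaped
    (batch, seq_len, d_model); the targets are detached as described above."""
    query = F.normalize(hidden[:, :-1, :], dim=-1)            # position t predicts item at t+1
    target = F.normalize(item_embs[:, 1:, :].detach(), dim=-1)
    b, t, d = query.shape
    query = query.reshape(b * t, d)
    target = target.reshape(b * t, d)
    logits = query @ target.T / temperature                   # every other target in the batch is a negative
    labels = torch.arange(b * t, device=logits.device)
    # A production version would also mask "accidental hits", i.e. in-batch
    # negatives that happen to carry the same item ID as the positive.
    return F.cross_entropy(logits, labels)
```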

The resulting model is a RecSys LLM. If serving such a model in production settings is hard, we can batch-predict with the above "query" model to generate long-term user embeddings at a regular cadence, and then feed these long-term user embeddings together with recent interactions into more responsive retrieval and ranking models. The item model can be used to generate fine-grained embeddings of items as well. This scenario is studied in the Pinterest paper TransAct:
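A sketch of that batch-prediction pattern, again reusing the assumed encoder interface from the earlier snippets:

```python
import torch

def batch_generate_user_embeddings(encoder, user_histories):
    """Offline sketch: run the "query" model over user histories at a regular
    cadence and persist one long-term embedding per user; online retrieval and
    ranking models then consume this vector plus recent interactions.
    `user_histories` is assumed to yield (user_id, item_ids, context_ids,
    outcome_ids) tuples of tensors."""
    table = {}
    with torch.no_grad():
        for user_id, item_ids, context_ids, outcome_ids in user_histories:
            hidden = encoder(item_ids, context_ids, outcome_ids)
            table[user_id] = hidden[:, -1, :].squeeze(0)  # long-term user embedding
    return table  # e.g. written to a key-value store for online lookup
```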

Pinterest TransAct usage for home feed

In certain settings like e-commerce, where the number of orders in, say, a 6-month look-back window is typically small, one may be able to train an end-to-end retrieval and ranking model on top of this pre-trained model in one go and use it as-is in production.
