Splitting user-item interaction matrix

One thing to keep in mind when splitting interaction matrices for recommender systems

Mayur Kr. Garg
ZaykXire Programming

--

Table 1. A sample log of interactions between users and items, with the time of interaction and the rating given by the user.

For creating recommendation engines, one will typically encounter a log of user-item interactions such as the one in the table above. Such data contains a historical account of which user interacted with which item, along with some supporting information such as the time of interaction, the rating given by the user, etc. An interaction could be defined simply as the user clicking on a link (ads), watching some media content (social media) or purchasing a product (e-commerce).

For use in recommendation systems, this log is often transformed into a user-item interaction matrix such as the one shown below.

Table 2. A sample user-item interaction matrix consisting of 5 users and 5 items.

Each value in the matrix could be:

  • Binary: Boolean stating whether the user interacted with the particular item or not.
  • Count: Number of interactions between the user and the item.
  • Rating: User’s rating of the interaction if the interaction is present or null otherwise.
  • Ratio: Ratio of the number of times the user interacted with the item to the number of opportunities the user had to do so. This is useful in cases where the user cannot directly interact with the entire catalogue of items but only with those shown or recommended to them.
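As a minimal sketch of this transformation, assuming a pandas DataFrame with hypothetical user_id, item_id and rating columns (the values below are made up for illustration, not taken from Table 1), the first three variants could be built as follows:

```python
import pandas as pd

# Hypothetical interaction log shaped like Table 1; the column names and
# values are made up for illustration.
log = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "item_id": ["A", "B", "A", "B", "C"],
    "rating":  [4, 5, 3, 2, 5],
})

# Binary: 1 if the user ever interacted with the item, 0 otherwise.
binary = pd.crosstab(log["user_id"], log["item_id"]).clip(upper=1)

# Count: number of interactions per (user, item) pair.
counts = pd.crosstab(log["user_id"], log["item_id"])

# Rating: the user's rating where an interaction exists, NaN otherwise
# (pivot_table averages the ratings if a pair appears more than once).
ratings = log.pivot_table(index="user_id", columns="item_id", values="rating")
```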

For building robust recommendation algorithms, this data must be partitioned into a training and a validation data set. In this article, we will look at two methods of splitting this data (randomly and based on timestamps) and examine some pitfalls that may occur with the former. All examples in this article use a synthetic data set comprising a binary interaction matrix, but the concepts are general enough for almost every use case.

A minimalistic example

Data overview

Below is a synthetic data set indicating whether a user has watched a certain movie in the theatre, denoted by a 1, or a 0 if they haven't. This data contains interactions between only 5 users and 5 movies but is meant to act as a placeholder for a more realistic data set.

Table 3. A synthetic data set containing interactions between 5 users and 5 movies, with 12 interactions in total.

Splitting randomly

Since the data set consists of 12 interactions in total, we can split the data randomly, keeping some of them in the training set and the rest in the validation set. The validation set is intended to be used for evaluation once a recommendation engine has been trained.

The actual split would most likely be done on the interaction logs (as in Table 1), but here the matrices generated after the split are shown for simplicity. Table 4 and Table 5 show the training and validation data respectively after random splitting.

Table 4. Training set using random split.
Table 5. Validation set using random split.
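In code, such a random split could look like the following sketch. It operates on a made-up log with 12 interactions (mirroring the size of the synthetic data set; the actual user/movie pairs are placeholders) and uses scikit-learn's train_test_split; the 75/25 ratio and the random seed are arbitrary choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up interaction log with 12 interactions; the pairs are placeholders.
log = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2],
    "item_id": ["A", "B", "A", "C", "B", "D", "C", "E", "A", "D", "E", "B"],
})

# Randomly hold out 25% of the interactions for validation.
train_log, val_log = train_test_split(log, test_size=0.25, random_state=42)

# Rebuild binary interaction matrices from each half (cf. Tables 4 and 5).
train_matrix = pd.crosstab(train_log["user_id"], train_log["item_id"]).clip(upper=1)
val_matrix = pd.crosstab(val_log["user_id"], val_log["item_id"]).clip(upper=1)
```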

Splitting based on timestamps

We can also split the data using the timestamps of the interactions. For this data, we will assume that the interactions were logged at the time of release of the movies, and we will split the data using the year 2021 as the cutoff.

Table 6 and Table 7 denote the training and validation data respectively after time-based splitting.

Table 6. Training set using time-based split.
Table 7. Validation set using time-based split.
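A sketch of the same split in code, again on a made-up log; here a release-year column stands in for the interaction timestamp, and sending the cutoff year itself to validation is just one possible convention:

```python
import pandas as pd

# Made-up log where each interaction carries the movie's release year as a
# stand-in for the interaction timestamp.
log = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 1],
    "item_id": ["A", "A", "B", "C", "D", "E"],
    "year":    [2019, 2020, 2020, 2021, 2022, 2021],
})

cutoff = 2021
train_log = log[log["year"] < cutoff]   # everything before the cutoff
val_log = log[log["year"] >= cutoff]    # the cutoff year onwards
```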

Note that the splits in the two cases do not have the same ratio of training and validation data points. This is only due to the very small size of this dummy data set, which, as stated before, is merely for illustration purposes.

Issues in random split

Let us assume that we used the training data from the random split (Table 4) to train a recommendation engine. We can then use it to predict the missing interactions and compare the predictions with the validation set (Table 5). Here are some pitfalls you may come across:

  • Lack of cold start examples: When splitting randomly, the training set is very likely to include at least one interaction for most items (here movies) as well as most users (as shown in Table 4). This reduces the likelihood of encountering cold start scenarios during training. When evaluating on the validation set, this can lead to better metrics than expected, since we would have some information about most items as well as users which the recommendation algorithm can exploit. However, the same algorithm may perform poorly in production as more movies are released or new users with no prior information are added to the database. Such a splitting methodology leaves these scenarios vastly underrepresented during both training and validation (a quick check for this is sketched after this list).
  • Relying on relationships not explicitly provided: A model trained on the data in Table 4 would pick up on the relationship between the movies Dune 1 and Dune 2 due to the high correlation between their interactions. Hence, if asked to predict whether User 1 would watch Dune 2, the model is likely to predict 1, which would be a correct prediction, leading to great performance on the validation set from Table 5. However, the relationship between these two movies can only be extracted if we know most of their interactions in advance, which renders the predictions less useful. This becomes clearer when making predictions for, say, Dune 3: since it would have no interactions to draw on, the model would not be able to establish the relationship between the three movies and would hence make poor predictions. Ideally, such a deep relationship between multiple items should be established using item metadata, so that these dependencies can also be built for new items or items with few interactions. Such dependencies can also exist between users and can take multiple forms.
  • Using information from hindsight: Another issue is utilizing information from the future. Consider a scenario where we are supposed to recommend news articles to users. Let us assume that there is a sports season in the first month followed by national elections in the second. Sports articles would then be popular early on, followed by a jump in the popularity of the political section. If the data were split randomly, the algorithm would be able to model this shift in preferences since it would have some data points from both months, and we would be overconfident in the ability of our model to predict such abrupt changes. But, similar to before, such a model would perform poorly in reality since it would not have information about any future shifts in the data. Shifts like this can occur in interactions, as in the example given, but also in items (the product catalogue being expanded or modified) and users (demographic shifts).
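As promised above, here is a small sketch of a check for cold start coverage. Given the train_log and val_log frames from the splitting sketches above (or any pair of interaction logs with these hypothetical column names), it counts the validation users and items that never appear in training:

```python
import pandas as pd

def cold_start_stats(train_log: pd.DataFrame, val_log: pd.DataFrame) -> dict:
    """Count validation users/items that never appear in the training log."""
    unseen_users = set(val_log["user_id"]) - set(train_log["user_id"])
    unseen_items = set(val_log["item_id"]) - set(train_log["item_id"])
    return {
        "cold_start_users": len(unseen_users),
        "cold_start_items": len(unseen_items),
    }
```

On a dense log, a random split will typically report zeros here, whereas a time-based split is far more likely to surface genuinely unseen users and items.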

Identifying problems using time-based splitting

Most of the issues with random splitting arise from the fact that the model has information that it should not have, and this may not be detected during validation. Splitting based on time allows us to identify these issues by enforcing more realistic constraints on the nature of the data, so that they can be fixed before deployment.

  • Time-based splits can enforce the existence of cold start examples, since any item or user added after the cutoff date won't have any interactions to utilize. This is illustrated in Table 6, where the movies released after the cutoff date have no interaction data. To fix this, we can try exploiting similarities between items/users based on their metadata.
  • Similarly, the lack of any dependency information between multiple items would be caught during validation (such as with Table 7). It is then up to us to model these dependencies so that our algorithm knows explicitly that interactions for Dune 2 are going to be very heavily influenced by those for Dune 1 rather than by any other movie.
  • Information about sudden shifts is lost if no data is available for interactions beyond a certain timestamp. So, if there is a big distribution shift before and after the cutoff timestamp, the validation metrics will be very poor, forcing us to find ways to account for such scenarios.

It should be stated that splitting a user-item interaction matrix randomly can cause silent failures in production which would not be caught during validation. Splitting based on time doesn't fix these failures either; it merely helps us spot them during validation instead.

Conclusion

Splitting based on timestamps should be the way to go when modelling such data. One exception could be when timestamps are not available for some reason. Another is when the intent of modelling is diagnostic rather than predictive: in such cases, the idea is to model similarities between users/items based on their interactions instead of the other way round. Another point to note is that one should try to model the same data using multiple cutoff points, to make sure that the algorithm is not sensitive to the timestamp used for splitting.
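A sketch of that last point, reusing the made-up time-stamped log from earlier; evaluate is a hypothetical stand-in for whatever training-and-scoring pipeline is in use:

```python
import pandas as pd

def evaluate(train_log: pd.DataFrame, val_log: pd.DataFrame) -> float:
    """Hypothetical stand-in: train a recommender on train_log and return
    a validation metric computed on val_log."""
    return 0.0  # replace with the actual training-and-scoring pipeline

# Made-up time-stamped log, as in the earlier time-based split sketch.
log = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 1],
    "item_id": ["A", "A", "B", "C", "D", "E"],
    "year":    [2019, 2020, 2020, 2021, 2022, 2021],
})

# Re-run the split and evaluation for several candidate cutoffs to check
# how sensitive the results are to the timestamp chosen for splitting.
for cutoff in [2020, 2021, 2022]:
    train_log = log[log["year"] < cutoff]
    val_log = log[log["year"] >= cutoff]
    print(cutoff, evaluate(train_log, val_log))
```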

For some people, all of this might sound very obvious. However, I think many others would not have thought that such a small difference in data preparation could make a major difference in actual model performance. Many of us use similar data preparation pipelines for a plethora of ML tasks (such as using train_test_split for preparing splits for all use cases or using StandardScaler for all numerical columns) without understanding the fine details of the problem. The aim of this article is to make us slightly more aware of the specifics.
