Model training & data quality matter in crafting the ideal recommendation system
Recommendation systems have become omnipresent in our daily lives. Where the technology used to be limited to tech giants like Facebook, Netflix, and Spotify, it is now increasingly used on other Web sites.
Retail companies want to show relevant products when you are browsing their store page, podcast apps suggest new shows to listen to, and news Web sites find content related to your topics of interest.
The research community is very active and proposes many new algorithms, methods, and metrics.
At Froomle, the research team realised the choice of algorithm is not as important as making sure the model is up-to-date and that it is trained on the right amount of data. Controlling these factors better has a larger impact on customers’ KPIs than trying out many different algorithms.
Here’s why these two factors are so important:
Timely training of recommendation models
A trained recommendation model freezes user interests and item relationships. The world, on the other hand, keeps changing. Items that were related a while ago might no longer be related, old items lose relevance, and user interests change. Eventually, as time passes, the model’s frozen reality is so different from the environment’s reality that it hurts the quality of the recommendations.
This is especially true for news use cases, where “relevant” items rotate quickly: most items lose relevance after a short period in the spotlight.
To avoid the difference between model reality and environment reality, we need to keep the models updated. However, every model update costs money. To balance performance with model costs, companies need to find a way to schedule the right amount of model updates at the right moments to maximise performance.
Typically, the starting point is a certain budget available for updates, which gets translated, based on past experience, into an average number of updates per day (or month). A basic scheduling solution is to update the models on a fixed cadence, for example every two hours. However, Froomle’s research shows this is not the best use of the available model updates.
Activity fluctuates. For example, at night there is usually less traffic on a Web site. Any update scheduled during a period of “low traffic” is far less useful than an update during peak hours. During peak hours, large amounts of information are collected, which must be captured by updating the model.
Fortunately, it is pretty easy to achieve this behaviour. Rather than specifying a fixed schedule based on time, scheduling can be based on the number of events that occurred since the last update. Once enough events have been collected, the model is retrained. The threshold defining “enough” can be computed based on the number of allowed updates, and the average number of events collected every day.
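A minimal sketch of such an event-based schedule is shown here. The names event_threshold and EventCountScheduler are hypothetical, and the threshold formula (average daily events divided by the allowed updates per day) is one straightforward reading of the description above, not necessarily the exact formula Froomle uses.

```python
from dataclasses import dataclass


def event_threshold(avg_events_per_day: float, updates_per_day: int) -> int:
    """Events to collect between retrains so the update budget is met on average.

    Assumption: the threshold is simply the daily event volume divided by
    the number of updates the budget allows per day.
    """
    if updates_per_day <= 0:
        raise ValueError("updates_per_day must be positive")
    return max(1, round(avg_events_per_day / updates_per_day))


@dataclass
class EventCountScheduler:
    """Triggers a retrain once enough events have arrived since the last one."""

    threshold: int
    events_since_update: int = 0

    def observe(self, n_events: int = 1) -> bool:
        """Record new events; return True when the model should be retrained."""
        self.events_since_update += n_events
        if self.events_since_update >= self.threshold:
            self.events_since_update = 0  # the counter restarts after a retrain
            return True
        return False


# Example: 1.2 million events per day and a budget of 12 updates per day.
scheduler = EventCountScheduler(threshold=event_threshold(1_200_000, 12))
```

With this setup, retrains cluster in peak hours, when the counter fills quickly, and become rare at night.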
This is already a step toward better scheduling, but the method still falls short in settings where models do not degrade at a regular pace, such as web shops and streaming services. In these settings, models can go long periods without growing stale, so they do not need to be updated frequently. Nor do they grow stale at a steady rate, as news models do. Instead, there are a few moments where the environment’s reality shifts drastically rather than gradually.
In these settings, the staleness of a model is not influenced by the number of new events collected, but rather by the amount of new information those events carry. Events that confirm knowledge already present in the model are not as useful as those the model had not expected.
For example, if the model thinks a user is interested in Squid Game, and the user watches an episode from that series, that event carries very little additional information for the model. But, if the model has no idea the user likes The Office, and they watch the first episode of that series, that event has a lot more value to a future model, because the system can learn something new about the user’s preferences.
This realisation led to exploring information-based schedules.
The first method, inverse predicted relevance (IPR), assigns each event a weight (information value) based on the model’s predicted score for that item, given the user.
So, if a model is sure a user will be interested in an item (high model score), that event gets a low information value. If the model thought it unlikely the user would be interested, the information value is high. Low model scores (high information value) can occur when the model does not have enough knowledge about the item yet, or the user has not expressed interest in similar items before. Thus, this method addresses both new items inserted into the system and users changing their interests.
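The idea can be sketched as follows. This is an illustration under stated assumptions, not Froomle's implementation: it assumes model scores are normalised to the range 0 to 1, uses 1 / score as the information value (with a floor to keep the weight bounded), and triggers a retrain once the accumulated information passes a threshold. The names information_value, IPRScheduler, and score_fn are hypothetical.

```python
from typing import Callable, Hashable

# Hypothetical signature: score_fn(user, item) returns the current model's
# predicted relevance, assumed here to lie between 0 and 1.
ScoreFn = Callable[[Hashable, Hashable], float]


def information_value(score: float, floor: float = 0.01) -> float:
    """Inverse of the predicted relevance; the floor caps the weight at 1 / floor."""
    return 1.0 / max(score, floor)


class IPRScheduler:
    """Accumulates information values and signals a retrain at a threshold."""

    def __init__(self, score_fn: ScoreFn, threshold: float) -> None:
        self.score_fn = score_fn
        self.threshold = threshold
        self.accumulated = 0.0

    def observe(self, user: Hashable, item: Hashable) -> bool:
        """Weigh one event by how unexpected it was; True means retrain now."""
        self.accumulated += information_value(self.score_fn(user, item))
        if self.accumulated >= self.threshold:
            self.accumulated = 0.0  # start counting again after a retrain
            return True
        return False


# Toy scores: the model expects Squid Game but knows nothing about The Office.
scores = {("u1", "squid_game"): 0.9, ("u1", "the_office"): 0.02}
ipr = IPRScheduler(lambda user, item: scores.get((user, item), 0.0), threshold=20.0)
```

In this toy setup the expected Squid Game event adds about 1.1 to the total, while the surprising The Office event adds 50, so a handful of unexpected events triggers an update where many expected ones would not.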
The second method is based on the assumption that two different models are going to react similarly to changes in the training data. When one model changes its recommendations (so its reality changes), we expect the other to do so as well.
We can exploit this by looking at how much a cheap model changes every time we consider training our production model. If the cheap model does not change, we can assume the production model will not change either. The decision to schedule an update of the production model is made when the cheap model changes enough (where “enough” is an estimated threshold).
While it is easy to define situations where two models are not strongly correlated in their changes, researchers have found that changes in the popularity model are an indicator of changes in personalisation algorithms.
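As an illustration, the sketch below uses a popularity model as the cheap model and measures its change as the Jaccard distance between the previous and current top-k item lists. The change measure, the threshold value, and the function names are assumptions made for this example; the text above does not specify how change is quantified.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence


def top_k_popular(items: Iterable[Hashable], k: int) -> List[Hashable]:
    """Cheap model: the k most frequently interacted-with items."""
    return [item for item, _ in Counter(items).most_common(k)]


def top_k_change(previous: Sequence[Hashable], current: Sequence[Hashable]) -> float:
    """Jaccard distance between two top-k lists: 0.0 is identical, 1.0 is disjoint."""
    prev, curr = set(previous), set(current)
    union = prev | curr
    if not union:
        return 0.0
    return 1.0 - len(prev & curr) / len(union)


def should_retrain(
    previous_top: Sequence[Hashable],
    current_top: Sequence[Hashable],
    threshold: float,
) -> bool:
    """Retrain the production model when the cheap model has changed enough."""
    return top_k_change(previous_top, current_top) >= threshold


# Top list at the last production retrain versus the top list now.
previous_top = ["a", "b", "c", "d"]
current_top = ["a", "b", "e", "f"]
retrain_now = should_retrain(previous_top, current_top, threshold=0.5)
```

Recomputing a popularity list is cheap enough to do at every candidate update moment, which is what makes it usable as a trigger for the more expensive production model.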
Froomle has already implemented the second method after promising offline results and is working on implementing the IPR scheduler as well. Applying the smarter schedulers in production has led to an increase in click-through rates (CTR) of 2% (relative) on news and a 20% (relative) increase in CTR on retail.
Right training data
The second aspect to optimise is the amount of historic data to use when training models.
Through experimentation, Froomle noticed that even if there are months or years of data available, training simple models with just the last few hours of data is a very effective way to give better recommendations.
The changing environment is again the reason this works. Old interactions reflect what was relevant then, not what is relevant now. So, using these interactions in models that need to perform now can contaminate the recommendations.
Some models are able to use more data effectively, by accounting for the order of events or how old events are, but simpler models are easily drowned by older events, giving poor recommendations when they receive too much training data.
One could conclude that these simpler models are therefore not a good fit for those use cases. However, researchers found that by using only recent interactions as training data for these algorithms, they can outperform the more complicated algorithms.
From this experimentation, the Froomle standard procedure for optimising A/B tests now also includes finding the right window of data to use when training recommendation models. For example, for news use cases, the best results were achieved with training windows between 12 hours and 36 hours for personalisation models, and one hour for popularity.
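A training window of this kind can be sketched as a simple timestamp filter applied before training. The example below uses a popularity model for brevity; the Interaction type and the function names are hypothetical, and the window length is the parameter one would tune in an A/B test.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Hashable, Iterable, List, NamedTuple


class Interaction(NamedTuple):
    user: Hashable
    item: Hashable
    timestamp: datetime


def recent_interactions(
    interactions: Iterable[Interaction], now: datetime, window: timedelta
) -> List[Interaction]:
    """Keep only the interactions that fall inside the training window."""
    cutoff = now - window
    return [e for e in interactions if cutoff < e.timestamp <= now]


def train_popularity(interactions: Iterable[Interaction], k: int) -> List[Hashable]:
    """Popularity model: the k most interacted-with items in the training data."""
    counts = Counter(e.item for e in interactions)
    return [item for item, _ in counts.most_common(k)]


now = datetime(2022, 1, 1, 12, 0)
log = [
    Interaction("u1", "old_story", datetime(2022, 1, 1, 8, 0)),
    Interaction("u2", "old_story", datetime(2022, 1, 1, 8, 5)),
    Interaction("u3", "old_story", datetime(2022, 1, 1, 8, 10)),
    Interaction("u4", "new_story", datetime(2022, 1, 1, 11, 30)),
    Interaction("u5", "new_story", datetime(2022, 1, 1, 11, 45)),
]

# A one-hour window only sees the story readers are clicking on right now.
one_hour_top = train_popularity(recent_interactions(log, now, timedelta(hours=1)), k=1)
# A six-hour window still ranks the morning story first.
six_hour_top = train_popularity(recent_interactions(log, now, timedelta(hours=6)), k=1)
```

In this toy log, the shorter window surfaces the story that is trending now, while the longer one is still dominated by the morning's traffic.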
In a series of A/B tests on Mediahuis brands, Froomle found an uplift of 8% for the popularity model, training it on one hour of data, rather than three hours of data.
An interesting side effect of this data reduction approach is that training the model takes less time and fewer resources than training on all available data. Thus, more models can be trained on the same budget, and so the production model is more up-to-date as well.
For more information, the research team at Froomle published a paper on the methodology and experiments in the Proceedings of the Perspectives on the Evaluation of Recommender Systems Workshop in 2022.
Conclusion
To get high-quality recommendations, companies need to look beyond the choice of algorithm and consider when to train their models and on which data to train them.
By getting these choices right, companies are able to make more effective recommendations. Ignoring these choices can have dramatic effects, as the model might be hopelessly out of date or trained on data that is not representative of the environment in which it needs to make predictions.