Building a Validation Framework For Recommender Systems: A Quest

Hello again and welcome to yet another one of my quests on building a Recommender System that will tackle all problems and bring optimal results!

Today, I am going to show you how I guided my team through the process of building a Validation Framework.

Now, one of the trickiest (if not the trickiest) parts of Recommender Systems is measuring the quality of the recommendations the model generates, seeing as the standard validation techniques used in Machine Learning don’t really apply to our case or, sadly, describe the results only partly, not as a whole.

In this quest, I am going to present an alternative way to evaluate a recommendation engine and build a validation framework for testing our models.

Evaluation Techniques for Recommender Systems

There are two ways to evaluate a recommendation system: The online way and the offline way.

I won’t tell you which one steered the wheel in the right direction, but I will describe how they work. The decision, after that, will be yours to make:

1. Online evaluation for Recommender Systems

In order to go through with this plan, you’ll need to deploy the algorithm to production, track down any and all recommendations it generates and validate those through customer interaction.

This may represent the real performance of the algorithm and may seem like a good, solid choice; it is, however, time consuming. The metrics Online Evaluation needs in order to work are the following (a small Click-Through Rate sketch follows the list):

  • Customer Lifetime Value (CLTV)
  • Click-Through Rate (CTR)
  • Return On Investment (ROI)
  • Purchases
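
To make the Click-Through Rate concrete, here’s a minimal sketch in Python; the event counts are hypothetical, not real Moosend numbers:

```python
# Hypothetical counts pulled from a recommendation event log
impressions = 120_000  # times a recommended product was shown
clicks = 3_600         # times a shown recommendation was clicked

ctr = clicks / impressions   # Click-Through Rate
print(f"CTR: {ctr:.2%}")     # -> CTR: 3.00%
```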

Another thing you should keep in mind is that every time you make a modification to the algorithm, you have to deploy anew and wait for the evaluation. Also, you need to apply A/B testing principles and make sure you’ve fished out all the right data.

It is only normal that this takes time to run, especially if you’re aiming for those long-term metrics.

2. Offline evaluation of Recommender Systems

Offline evaluation can be divided into two categories: Implicit and Explicit Feedback.

Implicit feedback will help you estimate the results through each interaction with the product: Clicks, views, add-to-cart, purchases, etc.

Explicit feedback requires that you measure a different component: A score from the customer, or perhaps an upvote.

In both cases, a Data Pirate like myself needed to split the dataset into two sets: a training set and a validation set. This division is meant to measure the performance with a metric or a group of metrics.
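
As a rough illustration, a minimal train/validation split in Python with pandas could look like the following; the file name and column layout are assumptions, not our actual schema:

```python
import pandas as pd

# Hypothetical interaction log with columns: customer_id, product_id, rating
interactions = pd.read_csv("interactions.csv")

# Hold out 20% of the interactions for validation; train on the rest
validation = interactions.sample(frac=0.2, random_state=42)
training = interactions.drop(validation.index)
```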

Offline Evaluation Metrics

There are many evaluation metrics to be used for offline evaluation. The most commonly used are RMSE, Normalized Discounted Cumulative Gain (NDCG@k), Precision@k, Recall@k and the F1 Score.

  • RMSE

In order to calculate this metric, you need to have ratings. RMSE starts at 0 for a perfect fit and grows with the size of the errors, so the lower the RMSE, the better.

The Root Mean Squared Error is calculated between the predicted ratings (p) that our algorithm produced and the actual ratings (a).

This is the route:
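
Written out (a standard formulation over N rated items, using the (p) and (a) notation above):

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(p_i - a_i\right)^2}$$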

Onto our second metric!

  • NDCG@k

This metric is one of the best known, due to how much information it captures. It takes the order of recommendations into account and it’s very helpful in terms of measuring the quality of web search and recommendation engines.

The results are evaluated in terms of the relevancy (rel) of each actual result compared to the predicted result list, and the metric’s values range between 0 and 1. Of course, the best value you could aim for is 1.

Next order of business: to normalize the results from the different lists, we divide the DCG by the ideal DCG (IDCG):
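
A standard formulation (one of a few variants found in the literature) is the following, where IDCG@k is the DCG@k of the same list reordered from most to least relevant:

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$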

I wouldn’t want to tire you with references from all my previous adventures, and seeing as Accuracy, Precision@k and Recall@k have been analyzed in a previous article, I’ll ship you there.

  • F1 score

This metric measures a test’s accuracy. The F1 score is the harmonic mean of precision (p) and recall (r) and its values, again, range between 0 and 1, with 1 indicating perfect precision and recall.

The following equation is what guided us through this:
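
In its standard form:

$$F_1 = 2 \cdot \frac{p \cdot r}{p + r}$$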

All aboard the Simulation Ship

We didn’t have a lot of time, of course, and certainly not enough to work on improvements and test as many scenarios as my crew and I would’ve liked, when it came to online evaluation of our Recommender System.

We decided to do it offline.

However, the standard offline evaluation just couldn’t cut it.

Moosend’s database is an unlimited source of data, and we needed a very solid plan to avoid getting shipwrecked.

The only way around that, was to create a simulation.

That simulation would measure the performance of the model and, on top of that, track the improvements we made along the way.

We needed to train our recommendation engine first. For this, we utilized 3 months’ worth of interaction data, which we ingeniously named “The Training Section”.

Right when this period ended, we entered the second phase: The “Recommendation Day”.

The 29+ days that followed formed a different time period, which we called “The Validation Set”.

This period is the time during which we wait, in order to check if the customer is going to interact with the recommended products.

Here’s the process:

We repeated the process from the very beginning, in order to generate product recommendations for the following day.

Let’s assume that you’ll need to recommend products on December 1st. Your training data should be from September 1st, up until November 30th.

And then, you wait. Up until December 30th, more specifically, to see whether the customer is going to interact with the recommended products.

This is the way to simulate the inner workings of your model under real circumstances and, at the same time, evaluate its performance during different time periods, seeing as you’ve already got the data.
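
Here’s a minimal sketch of that rolling window in Python with pandas; the file name, column names and exact dates are assumptions for illustration, not our production setup:

```python
import pandas as pd

# Hypothetical interaction log with columns: customer_id, product_id, timestamp
interactions = pd.read_csv("interactions.csv", parse_dates=["timestamp"])

recommendation_day = pd.Timestamp("2019-12-01")   # the day we generate recommendations for
training_window = pd.Timedelta(days=90)           # roughly 3 months of history
validation_window = pd.Timedelta(days=30)         # how long we wait for interactions

# "The Training Section": everything in the ~3 months before the recommendation day
training = interactions[
    (interactions["timestamp"] >= recommendation_day - training_window)
    & (interactions["timestamp"] < recommendation_day)
]

# "The Validation Set": interactions observed in the ~30 days that follow
validation = interactions[
    (interactions["timestamp"] >= recommendation_day)
    & (interactions["timestamp"] < recommendation_day + validation_window)
]

# Train on `training`, generate recommendations for `recommendation_day`,
# then look in `validation` to see whether the recommended products were interacted with.
```

Shifting recommendation_day forward and repeating the split is what lets you replay the model over different time periods.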

But what metrics were used for the evaluation of the simulator? This is where you’ll find out!

The Segmentation Trick for Recommender Systems

Let’s say that you’ve got a chap who won’t interact with anything but one product, and that alone.

The lack of information will send our recommendation engine on a new adventure, in search of similar customers who will, down the line, turn out to be irrelevant to our original customer.

You see, there’s not enough data to determine whether or not there are any similarities.

Our need to measure how our model performs, with regard to the information provided by the data of the “Training Section”, made my crew and me split the customers into three different categories, based on their product interactions.

So, we created the “New”, “Regular” and “VIP” categories.

  • New customers: 1–4 product interactions.
  • Regular customers: 5–10 product interactions.
  • VIP customers: more than 10 product interactions.
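
The segmentation boils down to a simple threshold function. Here’s a minimal sketch, with a hypothetical toy interaction list just for illustration:

```python
from collections import Counter

def customer_segment(interaction_count: int) -> str:
    """Map a customer's product-interaction count in the Training Section to a segment."""
    if interaction_count <= 4:    # 1-4 product interactions
        return "New"
    if interaction_count <= 10:   # 5-10 product interactions
        return "Regular"
    return "VIP"                  # more than 10 product interactions

# Toy example: interactions as (customer_id, product_id) pairs
interactions = [("c1", "p1"), ("c1", "p2"), ("c2", "p1")]
counts = Counter(customer for customer, _ in interactions)
segments = {customer: customer_segment(count) for customer, count in counts.items()}
print(segments)  # -> {'c1': 'New', 'c2': 'New'}
```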

This journey enabled us to measure how the model performs over a period of time, by combining the simulator and the segments.

It describes the model in various ways and can provide a full report on it, depending on the information that we have on each of the customers.
