We know what you need, but we do not see your data

Huynh Nguyen
Published in Red Gold
Nov 12, 2019 · 6 min read

How to build a recommendation system with censored data

Contrary to the common belief that a recommendation system, now a standard part of online services, requires all historical interactions between users and items in order to be accurate, I introduce an approach that recommends good items to users accurately while, on the user side, censoring users' historical interactions with items by replacing some of them with fake interactions (intentional data noise).

The purpose of injecting noise into the data is that, on the system side, it cannot be told whether an interaction between a user and an item is real or random noise, which protects users from fully exposing their privacy. It also allows users to control how much privacy they expose to an online service by tuning the level of the noise generator on the user side. This implies a new economic approach in which users can negotiate the value of their privacy while using online services, rather than the privacy-for-convenience trade-off imposed by most of today's online services.
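As a sketch, the user-side knob could look like the following randomized-response-style function. The name and interface are my own illustration, not part of any library:

```python
import random

def censor_interaction(real_item_id, preserve_prob, item_id_min, item_id_max, rng=random):
    """With probability preserve_prob, report the real item; otherwise
    report an item drawn uniformly at random from the catalogue.
    The server cannot tell which case occurred for any single interaction."""
    if rng.random() < preserve_prob:
        return real_item_id
    return rng.randint(item_id_min, item_id_max)

# preserve_prob = 1.0 means no censoring; 0.0 means the server sees pure noise.
```

The single `preserve_prob` parameter is the privacy dial: lowering it trades recommendation signal for stronger plausible deniability.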

Modern recommendation systems are already designed to deal with unobserved data, since feedback exists only for the items users interact with; the non-interacted items remain implicit feedback, which can be treated as negative feedback by random sampling. Because the algorithms already simulate unobserved data by sampling, they also have the capacity to deal with noisy observations.
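The uniform negative sampling described here can be sketched as follows. This is a simplified illustration, not Spotlight's actual implementation:

```python
import numpy as np

def sample_negatives(num_items, num_samples, rng=np.random):
    """Uniformly sample item ids to treat as negative feedback.
    Collisions with items the user actually interacted with are rare
    when user histories are small relative to the catalogue, and are
    typically tolerated by implicit-feedback losses."""
    return rng.randint(0, num_items, size=num_samples)

negatives = sample_negatives(num_items=1682, num_samples=4)
```

Because the model is already trained against such random pseudo-negatives, a user-side noise generator drawing from the same uniform distribution looks statistically familiar to it.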

By actively adding noise to user interactions to censor the real data on the user side, I demonstrate that the noise does not hurt model accuracy, even when a large percentage of interactions is replaced by noise.

First, I use Spotlight to build a recommendation model on a sequential interaction dataset, MovieLens, for next-movie recommendation. The model learns representations of movies and users with stacked causal atrous convolution layers (set to 1 layer in our experiment) and generates implicit negative feedback by uniform sampling.

import numpy as np
from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split

# Each line of u.data is: user id, item id, rating, timestamp (tab-separated).
user_ids, item_ids, ratings, timestamps = zip(
    *[line.strip().split('\t') for line in open("./ml-100k/u.data")])
user_ids = np.array([int(u) for u in user_ids])
item_ids = np.array([int(i) for i in item_ids])
timestamps = np.array([int(s) for s in timestamps])
interactions = Interactions(user_ids=user_ids, item_ids=item_ids, timestamps=timestamps)
train, test = random_train_test_split(interactions)

The 100k MovieLens dataset contains 943 users, 1,682 movies, and 100,000 interactions between users and movies. The interactions are split 80/20 into train and test sets. The train interactions consist of three parallel sequences: user ids, the item ids those users interacted with, and the timestamps of those interactions. In the normal scenario, all interactions between users and items are captured and stored on the server side. With our system, users are allowed to submit fake interactions on the condition that they are sampled from uniform noise.

I prepare three censoring scenarios that preserve 25%, 50%, and 75% of the real train interactions, replacing the remainder with uniform noise.

import random

preserving_25_percent_items = []
preserving_50_percent_items = []
preserving_75_percent_items = []
vmin = train.item_ids.min()
vmax = train.item_ids.max()
for real_item_idx in train.item_ids:
    random_item_idx = random.randint(vmin, vmax)
    sampling_threshold = random.random()
    if sampling_threshold < .25:
        preserving_25_percent_items.append(real_item_idx)
    else:
        preserving_25_percent_items.append(random_item_idx)
    if sampling_threshold < .5:
        preserving_50_percent_items.append(real_item_idx)
    else:
        preserving_50_percent_items.append(random_item_idx)
    if sampling_threshold < .75:
        preserving_75_percent_items.append(real_item_idx)
    else:
        preserving_75_percent_items.append(random_item_idx)

Then, plot each dataset's histogram to compare how each level of noise affects the distribution shape.

from matplotlib import pyplot

fig = pyplot.figure(figsize=(16, 10))
pyplot.subplot(221)
pyplot.hist(item_ids, bins=50, alpha=0.7, label='100% item preserving', color='red')
pyplot.legend(loc='upper right')
pyplot.subplot(222)
pyplot.hist(preserving_25_percent_items, bins=50, alpha=0.7, color='green',
            label='25% item preserving, 75% random noise')
pyplot.legend(loc='upper right')
pyplot.subplot(223)
pyplot.hist(preserving_50_percent_items, bins=50, alpha=0.7, color='blue',
            label='50% item preserving, 50% random noise')
pyplot.legend(loc='upper right')
pyplot.subplot(224)
pyplot.hist(preserving_75_percent_items, bins=50, alpha=0.7,
            label='75% item preserving, 25% random noise')
pyplot.legend(loc='upper right')
pyplot.show()

As shown in the histogram plots, compared to the distribution of the full data, the censored distributions become flatter in proportion to the level of noise added.
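One way to make the "flatter" observation concrete (my own addition, not part of the original analysis) is to note that mixing any skewed item distribution with uniform noise can only push its entropy upward, toward the uniform maximum:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

num_items = 1682
# A Zipf-like skewed popularity distribution, a stand-in for the real item histogram.
raw = 1.0 / np.arange(1, num_items + 1)
popularity = raw / raw.sum()
uniform = np.full(num_items, 1.0 / num_items)

# Preserving 25% of real interactions mixes the two distributions 25/75.
mixed = 0.25 * popularity + 0.75 * uniform

# entropy(popularity) < entropy(mixed) < entropy(uniform):
# the more noise, the closer the histogram gets to flat.
```

This is exactly what the histograms show visually: more noise means a distribution closer to uniform.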

Then, create three extra censored train datasets with the same user ids and timestamps, differing only in item ids.

user_ids = train.user_ids
timestamps = train.timestamps
preserving_25_percent_train = Interactions(
    user_ids=user_ids,
    item_ids=np.asarray(preserving_25_percent_items),
    timestamps=timestamps)
preserving_50_percent_train = Interactions(
    user_ids=user_ids,
    item_ids=np.asarray(preserving_50_percent_items),
    timestamps=timestamps)
preserving_75_percent_train = Interactions(
    user_ids=user_ids,
    item_ids=np.asarray(preserving_75_percent_items),
    timestamps=timestamps)

Next, prepare four models, one for each training dataset. All hyperparameters are identical across models.

from spotlight.sequence.implicit import ImplicitSequenceModel
model = ImplicitSequenceModel(embedding_dim=128)
preserving_25_percent_model = ImplicitSequenceModel(embedding_dim=128)
preserving_50_percent_model = ImplicitSequenceModel(embedding_dim=128)
preserving_75_percent_model = ImplicitSequenceModel(embedding_dim=128)

Everything is now ready to train the four models, which can take a few minutes. The loss of each model differs slightly, but all of them converge.

model.fit(train.to_sequence(), verbose=True)
preserving_25_percent_model.fit(preserving_25_percent_train.to_sequence(), verbose=True)
preserving_50_percent_model.fit(preserving_50_percent_train.to_sequence(), verbose=True)
preserving_75_percent_model.fit(preserving_75_percent_train.to_sequence(), verbose=True)

After fitting, evaluate the models using mean reciprocal rank (MRR) scores: for each item a user interacts with in the evaluation set, the reciprocal of that item's predicted rank is recorded, and mrr_score returns the mean reciprocal rank for each user.
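As a toy illustration of the reciprocal-rank idea, independent of Spotlight's internals:

```python
def reciprocal_rank(ranked_items, relevant_item):
    """1 / (1-based position of the relevant item in the ranking),
    or 0.0 if the item does not appear at all."""
    try:
        return 1.0 / (ranked_items.index(relevant_item) + 1)
    except ValueError:
        return 0.0

# If the model ranks the held-out movie third, it earns a score of 1/3.
score = reciprocal_rank(['movie_a', 'movie_b', 'movie_c'], 'movie_c')  # 1/3
```

Averaging these reciprocal ranks over a user's held-out items gives that user's MRR; higher means the model places the true items nearer the top of its ranking.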

from spotlight.evaluation import mrr_score

train_mrrs = mrr_score(model, train)
preserving_25_train_mrrs = mrr_score(preserving_25_percent_model, preserving_25_percent_train)
preserving_50_train_mrrs = mrr_score(preserving_50_percent_model, preserving_50_percent_train)
preserving_75_train_mrrs = mrr_score(preserving_75_percent_model, preserving_75_percent_train)
test_mrrs = mrr_score(model, test)
preserving_25_test_mrrs = mrr_score(preserving_25_percent_model, test)
preserving_50_test_mrrs = mrr_score(preserving_50_percent_model, test)
preserving_75_test_mrrs = mrr_score(preserving_75_percent_model, test)
print('For 100% preserving interactions')
print('Train MRRS {:.3f}, test MRRS {:.3f}'.format(train_mrrs.sum(), test_mrrs.sum()))
print('For 25% preserving interactions')
print('Train MRRS {:.3f}, test MRRS {:.3f}'.format(preserving_25_train_mrrs.sum(), preserving_25_test_mrrs.sum()))
print('For 50% preserving interactions')
print('Train MRRS {:.3f}, test MRRS {:.3f}'.format(preserving_50_train_mrrs.sum(), preserving_50_test_mrrs.sum()))
print('For 75% preserving interactions')
print('Train MRRS {:.3f}, test MRRS {:.3f}'.format(preserving_75_train_mrrs.sum(), preserving_75_test_mrrs.sum()))

For 100% preserving interactions
Train MRRS 9.566, test MRRS 10.132
For 25% preserving interactions
Train MRRS 6.064, test MRRS 11.379
For 50% preserving interactions
Train MRRS 8.932, test MRRS 12.749
For 75% preserving interactions
Train MRRS 9.836, test MRRS 12.488

The results are quite surprising: with 25%, 50%, and 75% item preserving, the models score 11.379, 12.749, and 12.488 on the test set respectively, compared to 10.132 for the full-data model. Even if scoring better could be incidental, this confirms that the recommendation model still works with noisy data.

Nassim Nicholas Taleb, the author of The Black Swan, once observed that most of a publisher's sales are contributed by a small group of authors. The same observation applies to this movie dataset: a few famous movies account for most of users' interest, while the rest never get enough exposure to attract users' attention. I do not want to claim this is bias in the dataset and then suggest some bootstrapping treatment, since I do not expect fairness in any emotionally driven human data. Perhaps all a recommendation system should focus on is finding the good content in the content pool.

Here is the full gist; just download and extract the dataset into the './ml-100k' folder, and the results can be reproduced.

https://gist.github.com/huynhnguyen/e02b3c96063327cf68c78c562cb57f3a

As a next step, I plan to validate this approach comprehensively on datasets of different types and sizes, and to build a proof of concept of this recommendation system.

Our AI solution is open source and actively developed under the Telar project.

👋 For more information please visit Telar Press or Red Gold team official website.

All kinds of feedback are always welcome. Thanks for your time!
