Optimize Batch Recommendations
Increase click-through rate by optimizing recommendations at the batch level.
Recommendations are generally displayed in batches. For example, after searching for a chainsaw and clicking on one, Home Depot provides recommendations.
Typically, there is a model that estimates a click probability based on factors such as the search term, browsing history, basket, and the item currently being viewed.
Often, the top few recommendations are simply displayed. This is suboptimal: the click probability should be optimized over the entire batch of recommendations.
Notice that in the recommendations shown above, there are two recommendations for chain oil. If the user does not want chain oil, that's two wasted spots. Even if the customer does want chain oil, he or she can only choose one. This is the flaw in displaying the top few recommendations: to maximize the probability of a click on the batch, the recommendations should be diverse.
Bayesian Optimization
The optimization of multiple, simultaneous recommendations can be framed as a Bayesian Optimization problem. Typical gradient-based methods optimize a function by evaluating it over and over. When each evaluation is computationally expensive, a surrogate function can be evaluated instead.
A surrogate function is similar to the function you want to optimize but is easier to evaluate. You use the surrogate to find the next promising points to evaluate in the original objective function. A common surrogate is a Gaussian Process.
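As a minimal sketch of the idea (with a made-up one-dimensional objective standing in for something expensive, like showing recommendations to real users), a Gaussian Process can be fit to a handful of evaluations and then queried cheaply for a mean and an uncertainty anywhere:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_objective(x):
    # stand-in for something costly to evaluate
    return np.sin(3 * x) + 0.1 * np.random.randn()

X = np.random.uniform(0, 2, size=(5, 1))  # a few expensive evaluations
y = [expensive_objective(x) for x in X.ravel()]
surrogate = GaussianProcessRegressor().fit(X, y)
X_new = np.linspace(0, 2, 100).reshape(-1, 1)
mu, sigma = surrogate.predict(X_new, return_std=True)  # cheap to query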
There is a probability that the user abandons the page and a baseline probability that the user continues shopping. Let's define the objective function as the probability that the user clicks on a recommendation, given that he or she does not do something else desirable instead, like adding the current product to the basket. Displaying recommendations and observing how the user reacts is the evaluation step in the Bayesian Optimization process.
In Bayesian Optimization, an acquisition function is optimized to convert the Gaussian Process into the next recommendations to try in the evaluation step. Our choice of acquisition function will be Most Probable Improvement, also known as exceedance. This represents the probability that the true target value (a click on any recommendation) is above a given threshold (below which the user navigates away from the site altogether).
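Concretely, if the Gaussian Process gives a batch of recommendations a posterior mean vector and covariance matrix, the exceedance probability is one minus the probability that every recommendation falls below the threshold. A minimal helper (the default threshold of 3 anticipates the baseline chosen later):
from scipy.stats import multivariate_normal as mvn

def batch_exceedance(mu, sigma, threshold=3):
    # P(at least one recommendation exceeds the threshold)
    # = 1 - P(all recommendations fall below it)
    return 1 - mvn.cdf([threshold] * len(mu), mu, sigma)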
The Bayesian Optimization approach is as follows:
- Initialize by evaluating the difficult objective function at some number of random points.
- Use these points and the function evaluated at these points to fit a Gaussian Process.
- Optimize an acquisition function such as Batch Most Probable Improvement using the Gaussian Process to find the optimal next points to evaluate in the original function (see the sketch after this list).
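A minimal sketch of step 3, reusing the batch_exceedance helper above and optimizing the acquisition by scoring random candidate batches (a simplification; this is only one of several ways to search an acquisition function):
import numpy as np

def bayes_opt_step(gpr, candidates, batch_size=4, n_trials=1000):
    # candidates: array of feature vectors; gpr: a fitted GaussianProcessRegressor
    best_batch, best_score = None, -np.inf
    for _ in range(n_trials):
        idx = np.random.choice(len(candidates), batch_size, replace=False)
        mu, sigma = gpr.predict(candidates[idx], return_cov=True)
        score = batch_exceedance(mu, sigma)
        if score > best_score:
            best_batch, best_score = idx, score
    return best_batch  # the next points to evaluate in the true objective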
In the batch recommendation problem, the suggested next points from step 3 are the recommendations to show to the consumer. Bayesian Optimization proposes the optimal set of points that have the best chance of exceeding a threshold. It turns out that this is rarely equal to the top few recommendations.
Bayesian Optimization also has the nice feature that it balances the exploration-exploitation trade-off: there is uncertainty around how a recommendation will perform, and there is a mathematical way to specify exactly how much risk to take in choosing to display a recommendation with a lot of uncertainty.
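To see the trade-off numerically, compare two recommendations with the same predicted mean but different uncertainty; against a high threshold, the uncertain one is the better gamble (the numbers here are purely illustrative):
from scipy.stats import norm

threshold = 3
print(norm.sf(threshold, loc=2.5, scale=0.1))  # confident estimate: ~2.9e-07
print(norm.sf(threshold, loc=2.5, scale=1.0))  # uncertain estimate: ~0.31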
Home Depot Kaggle Challenge
In this dataset, there are:
- product descriptions
- search terms
- relevance — this will be used as a proxy for click-through-rate (CTR). In practice, good relevance metrics are based on CTR.
Step 1: Get embeddings for product descriptions and search terms.
Cluster search terms to account for differences in spelling, capitalization, word order, word importance, and so on. Clustering also standardizes the continuous embedding feature vector.
import pandas as pd
from model2vec import StaticModel
from sklearn.cluster import KMeans

model = StaticModel.from_pretrained("minishlab/potion-base-2M")
train_df = pd.read_csv('train.csv', encoding='iso-8859-1')
product_desc_df = pd.read_csv('product_descriptions.csv', encoding='iso-8859-1')
df = train_df.merge(product_desc_df, on='product_uid')
desc_embed = model.encode(df['product_description'].tolist())
search_embeddings = model.encode(df['search_term'].tolist())
unique_search_embeddings = model.encode(df['search_term'].unique().tolist())
cluster = KMeans(n_clusters=768)
cluster.fit(unique_search_embeddings)
cluster_id = cluster.predict(search_embeddings)
To simplify, we will treat each cluster as a single search. Let's look at the search terms in the first cluster.
first_cluster_search = list(set([s for x, s in zip(cluster_id, df['search_term']) if x == 0]))
first_cluster_search
['colorpoint vanity top',
"36' vanity combo white",
'glacier bay vanity 24 inch combo',
'annakin vanity',
'double vanity cabinets only',
'CUSTOM DOUBLE BOWL VANITY TOP',
'27 inch vanity combos',
'philips duramax vanity',
'4458 speck vanity top',
'vanity with tops on sale',
"49' avalon vanity top",
'Pegasus estates vanity',
'60 single cabinet vanity base',
'60 IN. VANITY COMBO',
'48 vanity choc',
'black vanity tops']
Gaussian Processes work better when the target is Gaussian. We first transform the target variable to have a normal distribution.
from scipy.stats import ecdf, norm

transf = ecdf(train_df['relevance'])
# squeeze the empirical CDF values into (0, 1); norm.ppf maps 0 and 1 to -inf and inf
normal_relevance = [norm.ppf(x * 0.999 + 0.0005) for x in transf.cdf.evaluate(train_df['relevance'])]
Also, converting the target into a normal distribution in this way allows us to use a Copula. This means that we can model the interaction between recommendations using a multivariate normal distribution.
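As a hypothetical two-item illustration of the copula idea: once each marginal has been mapped to a standard normal, a correlation matrix is all that is needed to describe the joint behavior of the recommendations:
import numpy as np
from scipy.stats import multivariate_normal as mvn

# two recommendations with standard normal marginals and correlation 0.8
corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
# joint probability that both transformed scores fall below 1;
# if the items were independent this would be norm.cdf(1) ** 2, about 0.71
print(mvn.cdf([1, 1], mean=[0, 0], cov=corr))
Back to the data: collect the descriptions, embeddings, and transformed relevances for the first cluster.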
first_cluster_desc = [s for x, s in zip(cluster_id, df['product_description']) if x == 0]
first_cluster_product_id = [s for x, s in zip(cluster_id, df['product_uid']) if x == 0]
first_cluster_desc_embed = [de for x, de in zip(cluster_id, desc_embed) if x == 0]
first_cluster_relevance = [r for x, r in zip(cluster_id, normal_relevance) if x == 0]
print(len(first_cluster_product_id), len(set(first_cluster_product_id)))
dedup_df = pd.DataFrame({'embedding': [str(x) for x in first_cluster_desc_embed],
                         'relevance': first_cluster_relevance}).drop_duplicates()
There are a few duplicates, and in a couple of cases they have different ratings. This is to be expected, as the relevance or click probability data are noisy estimates.
from math import comb
n_recommendations = 4
comb(len(first_cluster_desc), n_recommendations)
84995410
There are 84,995,410 ways to create a batch of 4 recommendations for this cluster (your clusters may differ due to the randomization in KMeans). We can also drop some of the weakest potential recommendations up front, knowing that they will never be part of the optimal batch, as sketched below.
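A rough sketch of that pruning (a heuristic, not part of the pipeline below): keeping only the highest-relevance candidates shrinks the search space dramatically:
from math import comb

# keep the 100 highest-relevance candidates before enumerating batches
ranked = sorted(zip(first_cluster_relevance, first_cluster_desc), reverse=True)
top_candidates = ranked[:100]
print(comb(100, 4))  # 3921225 batches instead of ~85 million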
Step 2: Train Gaussian Process
The transformed relevance corresponds to the standard normal distribution. Relevance doesn't account for users who don't click on any recommendation, so let's adjust for this by choosing a high baseline threshold such as 3. Loosely, this means that a user is unlikely to click on a recommendation with a transformed score below 3 and likely to click on one above 3.
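As a quick sanity check on how demanding that baseline is: under a standard normal, a transformed score above 3 is rare for any single recommendation, so clearing it with a batch is a meaningful event:
from scipy.stats import norm

print(norm.sf(3))  # ~0.00135: one recommendation alone rarely clears the bar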
Use Gaussian Process Regression to get an estimator of relevance as a function of an embedding.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, Matern
from scipy.stats import multivariate_normal as mvn
gpr = GaussianProcessRegressor(kernel=Matern() + WhiteKernel(), copy_X_train=False)
gpr.fit(first_cluster_desc_embed, first_cluster_relevance)
recommendations = pd.DataFrame({'embedding': first_cluster_desc_embed,
                                'relevance': first_cluster_relevance}).\
    sort_values('relevance', ascending=False).\
    head(n_recommendations)
top_rec_mu, top_rec_sig = gpr.predict([x for x in recommendations['embedding']], return_cov=True)
1 - mvn.cdf([3] * n_recommendations, top_rec_mu, top_rec_sig)
0.7436341896588283
The probability that the user clicks on at least one of these top 4 recommendations is about 74%.
Calculating Most Probable Improvement uses Monte Carlo and is expensive when done over and over. We will use the quante_carlo package to speed this up.
from quante_carlo import bayes_opt as bo

opt = bo.Embeddings(bo_batch_size=1000, n_procs=4)
train_embeddings = {'embeddings': first_cluster_desc_embed,
                    'scores': first_cluster_relevance}
results = opt.suggestion_combinations(train_embeddings)
[product_desc_df[product_desc_df['product_uid'] == first_cluster_product_id[x]].product_description.iloc[0] for x in results[0]]
['The DAP 2.8 oz. Auto/Marine Sealant withstands salt water and endures sunlight for durability. The clear silicone formula will not crack or shrink.California residents: see Proposition 65 information100% silicone formula forms a strong, waterproof sealDesigned for use with windows and hatchesMildew resistantPermanently flexibleDries to the touch in 20 minutes',
'Loctite PL Marine Fast Cure Adhesive Sealant is a fast setting, moisture cure adhesive sealant which delivers strong bonds while forming a watertight, flexible seal above or below the waterline once cured. Free of solvents, phthalates and isocyanides, it will not shrink when cured, discolor from UV exposure or bubble on damp surfaces. Once cured, it withstands both saltwater and freshwater environments.Ideal for through-hull fastening and deck fittingsFast setting: sets in 30 minutes, fully cured in 24 hoursWatertight, flexible bondUse above or below waterline once curedRecommended for applications such as through-hull fastening, deck fittings, deck to hull joints, port lights, moldings, struts and planking and stem jointsAdheres to teak and other woods, fiberglass, vinyl, glass, FRP, metal, gel coat, polycarbonate and most plastics (test all plastic substrates for bond strength and compatibility before use)',
"DAP 2.8 oz. Silicone Aquarium Sealant is ideal for constructing or repairing small aquariums. The 100% silicone-rubber formulation provides a waterproof seal that won't crack or shrink. It is non-toxic when cured.California residents: see Proposition 65 information100% silicone rubber is non-toxic when curedCan be used for patching tubes and hoses and attaching plastic or ceramic accessoriesMold- and mildew resistantFlexible sealant won't crack or shrinkMeets ASTM specification C 920, class 25, type S, grade NS",
'The Duralux 1 qt. White Marine Enamel is a single package, topside enamel for use on primed steel and metal, wood, aluminum or fiberglass. Originally developed for pleasure and commercial boats, the properties of this product also make it ideal for all type onshore or offshore marine maintenance applications. It is salt water resistant, withstands repeated washings, resists oil and gasoline and holds up to the discoloration effects of harbor gases.California residents: see Proposition 65 informationFor topside use on boats and marine structuresExcellent adhesion to wood, metal, aluminum and fiberglassFights rust, chemical resistantSuperior gloss and color retentionDries in 8-16 hours']
Under the hood, quante_carlo simply fits a Gaussian Process and calculates the most probable improvement for thousands of random groupings, much like the bayes_opt_step sketch earlier. We can call the function in parallel to speed this up even more.
from multiprocessing import Pool
p = Pool(4)
results = p.map(opt.suggestion_combinations, [train_embeddings] * 4)
p.close()
The result will be 4 batches from independent randomized runs of Bayesian Optimization, each with its estimated probability, which you can then reconcile.
[([43, 13, 0, 40], 0.9254731283997663),
([43, 45, 6, 13], 0.9282933810414518),
([6, 0, 13, 10], 0.9290275825581824),
([27, 0, 13, 32], 0.9275950528289152)]
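One simple way to reconcile the runs (assuming, as the output above suggests, that each result pairs the batch indices with its estimated probability) is to keep the batch with the highest probability:
best_batch, best_prob = max(results, key=lambda r: r[1])
print(best_batch, best_prob)  # [6, 0, 13, 10] 0.9290...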
In this case, we can see that we’ve improved the probability of clicking on a recommendation from 74% to 93%.
Conclusion
In showing product recommendations to a user, it is often assumed that the best choice is to display the items with the highest individual click probabilities. However, because recommendations are correlated, the click probability of that batch as a whole is usually lower than that of some other, more diverse batch.