LEARNING GLOBAL AND NON-GLOBAL LATENT EMBEDDING
At the core of Agoda’s hotel search results page is our hotel recommendation engine. For each request, we need to rank a set of hotels by relevancy for a given customer. The “real” relevancy to a customer’s needs varies greatly from customer to customer, so we on the Machine Learning Team try to learn a recommender that gives a personalized ranking for every customer. In most cases the ranking is done within a city, meaning that the hotels to be ranked are all in the same city.
USERS, ITEMS AND LATENT FEATURE SPACES
In this article, we concentrate on Matrix Factorization, a popular and simple technique for ranking and recommendation. Here at Agoda we have successfully used Matrix Factorization for personalized hotel ranking in the past.
The basic idea is to map every user and every item to a common n-dimensional latent-feature space. For each user u we learn a feature vector $x_u$ and for each item i a feature vector $y_i$, each consisting of n real numbers, in such a way that the dot product $x_u^\top y_i$ represents the relevance score for this user and item. Here our items are hotels, so we will use these two terms interchangeably.
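As a minimal sketch (with made-up 4-dimensional vectors), scoring a user-hotel pair is just a dot product:

```python
import numpy as np

# Hypothetical 4-dimensional latent vectors for one user and one hotel.
# In practice these are learned; the numbers here are made up.
x_u = np.array([0.9, 0.1, -0.3, 0.5])   # user features
y_i = np.array([0.8, 0.0, -0.2, 0.4])   # hotel features

score = x_u @ y_i  # relevance score: higher means more relevant
print(score)       # prints roughly 0.98
```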
One popular method to learn the latent vectors is Alternating Least Squares (ALS) optimization. In the simplest case, the vectors are computed by minimizing the objective function

$$\min_{x_*, y_*} \sum_{u, i} \left( r_{ui} - x_u^\top y_i \right)^2$$

where $r_{ui} = 1$ if user $u$ ever visited hotel $i$ and $r_{ui} = 0$ otherwise.
The model thus gives a relevancy close to 1 for hotels the user visited and close to 0 for non-visited hotels. As the name suggests, this objective can be efficiently minimized by alternating the optimization between user and hotel features. In practice, we also add a regularization term, consider other actions (such as clicks), and weight each term in the sum by a confidence derived from the number of bookings and clicks. For details, see (Hu et al., 2008).
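As an illustration, here is a minimal dense-numpy sketch of the alternating closed-form updates for the objective above, with the regularization term included. A production version would use sparse matrices and the per-observation confidence weights of Hu et al. (2008); this only shows the core idea:

```python
import numpy as np

def als(R, n_factors=10, n_iters=10, reg=0.1, seed=0):
    """R is a 0/1 user-by-hotel matrix (R[u, i] = 1 if user u visited hotel i)."""
    rng = np.random.default_rng(seed)
    n_users, n_hotels = R.shape
    X = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_hotels, n_factors))  # hotel factors
    I = np.eye(n_factors)
    for _ in range(n_iters):
        # With hotel factors fixed, the regularized least-squares
        # solution for all user factors has a closed form:
        X = R @ Y @ np.linalg.inv(Y.T @ Y + reg * I)
        # ...and symmetrically for the hotel factors:
        Y = R.T @ X @ np.linalg.inv(X.T @ X + reg * I)
    return X, Y
```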
An alternative to ALS for learning the Matrix Factorization is pairwise learning. One successful pairwise method is BPR (Bayesian Personalized Ranking; Rendle et al., 2009), which we will explore in this post.
In BPR we are minimizing the objective

$$-\sum_{(u, i, j) \in D} \ln \sigma\left( x_u^\top y_i - x_u^\top y_j \right)$$

where

$$D = \left\{ (u, i, j) \mid \text{user } u \text{ visited hotel } i \text{ but did not visit hotel } j \right\}$$

and

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

is the sigmoid function.
Similar to ALS, the model gives higher relevancy to those hotels which were visited by the user than to those not visited.
The reader might wonder how to manage the huge set D of all possible triplets, but luckily it never needs to be explicitly materialized. The model can be learned by Stochastic Gradient Descent so all we need is to be able to sample from D, and that can be easily done.
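As a sketch of one such SGD step (assuming a hypothetical `visited` list that maps each user index to the set of hotel indices they have visited):

```python
import numpy as np

def bpr_sgd_step(X, Y, visited, rng, lr=0.05, reg=0.01):
    """One stochastic step on a triplet (u, i, j) sampled from D."""
    n_users, n_hotels = X.shape[0], Y.shape[0]
    # Sample a triplet: a user u, a hotel i they visited,
    # and a hotel j they did not visit.
    u = rng.integers(n_users)
    i = rng.choice(list(visited[u]))
    j = rng.integers(n_hotels)
    while j in visited[u]:
        j = rng.integers(n_hotels)
    # Gradient of -ln sigma(x_u . (y_i - y_j)), with L2 regularization.
    x_uij = X[u] @ (Y[i] - Y[j])
    g = 1.0 / (1.0 + np.exp(x_uij))  # equals 1 - sigmoid(x_uij)
    x_u = X[u].copy()
    X[u] += lr * (g * (Y[i] - Y[j]) - reg * X[u])
    Y[i] += lr * (g * x_u - reg * Y[i])
    Y[j] += lr * (-g * x_u - reg * Y[j])
```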
These different methods achieve the same goal: representing users and hotels by latent features. However, the latent features have no explicit interpretation; a single dimension doesn’t necessarily correlate with any concrete attribute like a hotel’s star rating or price. Still, the relationships between the learned vectors carry meaningful information about hotels and users.
Next, we will see that by subtly changing the learning process we can somewhat alter the behavior of the whole latent space.
GLOBAL VS. WITHIN CITY LEARNING
While the previous procedures are simple enough, there is one caveat: they learn a global latent-feature representation, so in our example case the city is not given to the model explicitly. This is not optimal for our use case. A large part of the model’s representative power is wasted on separating cities from each other, and we don’t really need a data science model to tell us that hotels in Bangkok are probably different from hotels in Helsinki.
In BPR, we can neatly fix this issue. Instead of defining D as the set of all possible (visited, not visited) pairs for each user, we add the restriction that both hotels are in the same city:

$$D = \left\{ (u, i, j) \mid \text{user } u \text{ visited hotel } i \text{ but not hotel } j, \text{ and } i \text{ and } j \text{ are in the same city} \right\}$$
We call this style of training “within-city”. It is interesting because rather than learning global latent features, the model learns a latent representation that separates hotels from each other within each city, yet in a way that similar hotels in different cities end up with similar feature representations.
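The only change needed in the SGD loop above is the sampler. A sketch, assuming hypothetical lookups `city_of` (hotel index to city) and `hotels_in_city` (city to the list of its hotel indices):

```python
def sample_within_city(visited, city_of, hotels_in_city, rng):
    """Sample a triplet (u, i, j) where i and j are in the same city."""
    u = rng.integers(len(visited))
    i = rng.choice(list(visited[u]))          # a hotel the user visited
    candidates = hotels_in_city[city_of[i]]   # negatives come from i's city only
    j = rng.choice(candidates)
    while j in visited[u]:
        j = rng.choice(candidates)
    return u, i, j
```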
EXAMPLE
Let’s review an example to understand the differences between the two approaches.
Imagine that the algorithms are processing the following user activities.
Bob made bookings at the Marina Bay Sands in Singapore and at the Lebua at the State Tower in Bangkok.
Alice made only one booking at Marina Bay Sands in Singapore.
What happens in the “global” training approach
In Bob’s case, the “global” algorithm will learn that Marina Bay Sands and Lebua at the State Tower should have similar latent representations; in Alice’s case, however, it will push Marina Bay Sands away from Lebua at the State Tower, because Lebua is treated as a negative example. As a result, Marina Bay Sands and Lebua at the State Tower might end up being separated in the latent space.
What happens in the “within-city” training approach
In Bob’s case, the “within-city” algorithm will likewise learn that Marina Bay Sands and Lebua at the State Tower should have similar latent representations. In Alice’s case, it will not push Marina Bay Sands away from Lebua at the State Tower, because within-city sampling never produces Lebua as a negative example for a booking in Singapore. Ultimately, Marina Bay Sands and Lebua at the State Tower will have similar latent representations.
There are three main benefits of having latent features that don’t incorporate city information:
1. Learning is cheaper: removing geography from the learning problem allows a smaller feature space. We used 300 dimensions for global learning via ALS, but only 100 for within-city learning in BPR.
2. The latent space is used more efficiently: every city spans the entire latent space, which means that the algorithm uses all of the features to describe hotels.
3. Global item-to-item similarity: the latent features can be used for item-to-item recommendations, either by Euclidean distance or cosine similarity. By having the cities mixed in the latent space, we can better recommend a hotel in city B based on a selection in city A (see the sketch after this list).
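As a sketch of such item-to-item retrieval, here is a cosine-similarity nearest-neighbor lookup over the learned hotel factors (Y is the hotel-factor matrix, one row per hotel):

```python
import numpy as np

def nearest_hotels(Y, target, k=6):
    """Return the k nearest hotels to `target` by cosine similarity."""
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # L2-normalize each row
    sims = Yn @ Yn[target]                             # cosine similarity to target
    ranked = np.argsort(-sims)                         # most similar first
    return [h for h in ranked if h != target][:k]
```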
We validated our models using bookings held aside in a validation set. For each booking in this set, we ranked all the hotels in the same city and checked where the booked hotel ended up in this list. The 100-dimensional BPR model trained “within-city” had considerably higher Accuracy@10** than the 300-dimensional ALS model. We also tried to fit a 100-dimensional global BPR model, but it didn’t perform as well.
** ‘Accuracy@10’ is defined as

$$\text{Accuracy@10} = \frac{1}{|B|} \sum_{b \in B} \sum_{i=1}^{10} \text{rel}_{b,i}$$

with $\text{rel}_{b,i} = 1$ if the hotel booked in validation booking $b$ appears at rank $i$ and zero otherwise, where $B$ is the set of validation bookings.
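A sketch of this evaluation, reusing the hypothetical `city_of` and `hotels_in_city` lookups from before and a `validation` list of (user, booked hotel) pairs:

```python
import numpy as np

def accuracy_at_10(X, Y, validation, city_of, hotels_in_city):
    """Fraction of validation bookings whose hotel ranks in the city's top 10."""
    hits = 0
    for u, b in validation:
        candidates = np.array(hotels_in_city[city_of[b]])
        scores = Y[candidates] @ X[u]                 # relevance within the city
        top10 = candidates[np.argsort(-scores)[:10]]  # 10 best-scoring hotels
        hits += int(b in top10)
    return hits / len(validation)
```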
FINALLY, SOME GRAPHS
To verify our hypothesis on “within-city” vs. “global” training, we chose hotels from eight selected cities and visualized their latent representations by projecting them into a three-dimensional PCA space. In figure 1, the latent vectors produced by ALS Matrix Factorization are clearly separated by city. However, the factors from “within-city” BPR do not form clusters by city; rather, hotels from each city span the entire latent space (figure 2). This implies the latent vectors do not encode a clear city feature, and the whole latent space is reserved for features other than geographic differences.
Figure 1 — PCA for the ALS latent features. We are taught to aim for beautifully separated clusters, but in this counter-intuitive example we claim that there is more benefit in overlapping segments. Graphs were created using TensorFlow.
Figure 2 — PCA for the BPR “within-city” latent features. All hotels co-exist in the same space regardless of city, which is a good attribute for recommendation.
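The projection behind the figures is straightforward; a sketch using scikit-learn, with random stand-in data in place of the real learned factors:

```python
import numpy as np
from sklearn.decomposition import PCA

Y = np.random.randn(1000, 100)   # stand-in for a learned 100-dim hotel-factor matrix
selected = np.arange(200)        # stand-in for the hotels of the eight chosen cities

coords = PCA(n_components=3).fit_transform(Y[selected])
print(coords.shape)              # (200, 3): three coordinates per hotel to scatter-plot
```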
The following are some selected hotels and their six nearest neighbors by cosine similarity, computed with the ALS and BPR feature vectors. As you can see, the neighbors of the target hotel under ALS are always in the same city, while “within-city” BPR is not restricted to the city.
Marina Bay Sands: Singapore
Shangri-La Hotel At The Shard: London
Fun Wan Hostel Bangkok: Bangkok
Finally, we need to note that while “within-city” models work very well when a city is given, they can’t be used alone in situations where we want to make global predictions. The relevancy score that we get from “within-city” BPR can be interpreted as the relevancy for the user conditioned on the user going to that city. This score can be high even if the city is totally irrelevant to the user. But if we first pick the most relevant cities by some other method, or if the city is known, then we can make even better recommendations.
CONCLUSION
We compared the “within-city” and “global” training approaches in the context of Matrix Factorization. We found that the “within-city” training approach doesn’t learn the city feature in the latent representation. The benefit of this method is that we can decrease the required model dimension and increase accuracy in situations where we want to rank hotels that are all in the same city. Finally, we verified our hypothesis by visually comparing the latent features of the “global” ALS and “within-city” BPR models, and by looking at some examples of neighboring hotels in the latent space. Even though we chose to use Matrix Factorization as the example in this blog post, we have found that this idea of “within-city” training generalizes to a wide range of more complicated ranking and recommendation algorithms as well.
Do you find this kind of research interesting? If so, let us know! We are always looking for new people to join our engineering and data science teams; check out: https://careersatagoda.com/
This post is based on a work by Pekka Aalto and Sorawit (James) Saengkyongam.
BIBLIOGRAPHY
Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM ’08). IEEE.
Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI ’09). AUAI Press.
Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.