Listings similarities using leaves
A way to compute similarities among inventory items using pre-trained leaves embeddings.
During the recent months here in our headquarter in Rotterdam, the activity in the data department has been frenetic. We embarked on a quest to improve the user experience of our platform by leveraging the pile of data that we collected over the years and take it to the next level. Our marketplace contains information about hundreds of thousands of listings that can be used to help our landlords in setting the right price for their room or apartment (henceforth referred to as listings), always making sure to ask for the correct price, aligned with the new trends of the fast-changing housing market. After some research in this direction, we developed a series of models and techniques that, aside from their initial purpose, can be leveraged for a lot of completely different purposes, such as assessing listings’ similarities or to deal with noise (read, dirty) data in a better and more consistent way.
In this blog post, we will explain how a regression model initially trained to suggest prices has been used to give meaningful listings recommendations, both for properties already available on the platform and, above all, for listings never shown before. This allowed us to successfully overcome the cold-start problem, a well-known issue that affects a lot of complex recommendation algorithms.
From Prices to Leaves
Computing a price for a listing is a classical regression problem where the features of a listing are provided as input and the price is the moving target that the algorithm tries to guess. The so trained model is able to capture the underlying market dynamics, learning which are the most relevant features in price determination based on the dynamics of each different geography. As the state-of-the-art and the results of the majority of the Kaggle competitions suggest, we immediately moved in the direction of exploiting Gradient Boosted Decision Trees (GBDT) both for their performances and for the useful insights they provide about the decision process. These algorithms work by building an ensemble of different decision trees, computing the final prediction as a sum of the weights contained in the leaves reached by the decision processes carried out in parallel. Each different leaf can be marked with an identifier, unique within each different tree of the ensemble. The process of computing a price for a listing can be seen also as assigning different leaves to each sample provided as input.
Each trained tree possibly focuses on different characteristics of a listing, trying to describe separate aspects of the housing domain. Collecting the resulting leaves is a way to have an internal representation of the features provided as input, similar, in a sense, to the one that can be obtained using the first layer of Deep Neural Networks.
Consider for example a spacious room located in Milan, very close to a university and to a metro station. These characteristics trigger different paths in each decision tree, resulting in different leaves’ indices. An apartment in the same area might share some of these leaves (the one more related to the listing’s position, for example) and be completely different for others. The hypothesis here, is that the best internal structure of the ensemble is the one that takes separately into account some key features of each listing, as being a room or an apartment, or being close or distant to a specific university, trying to assign the same leaf (or a very similar one) to different listings that share these common characteristics.
Sharing the same leaf between two listings means sharing an high level concept and the internal representation for this concept can be exploited to compute similarities and provide recommendations.
Matrix Factorization and Leaves Embeddings
After training the pricing model, each set of listing features can be associated to leaves indices. We basically have trained a function that converts an input sample to an N-sized vector of indices, where N is the number of trees in the ensemble. Each different cell might contain Li different values as the number of possible different leaves in the i-th tree of the ensemble. To have a consistent representation of the leaves domain is better to get rid of the incremental nature of the ID’s by one-hot encoding all the vectors cells. The resulting matrix size is a n_listings x (n_trees x n_leaves_i).
Matrix factorization is a linear algebra technique that is widely used in Recommender Systems where it is crucial to reduce the size of big matrices trying to learn some complex relationships between users and items. In these scenarios, users interact with items (often with a very small subset) and these interactions are tracked in the so-called user-item matrix where rows describe users and columns contain items. Each cell i, j contains the possible interaction between the user i and the item j. This interaction, or rating, it’s said to be explicit when it’s a real number (eg. assigning 1 to 5 stars to movies) or implicit when it is a binary relationship as having purchased a good or not in a marketplace or having listened to a song on Spotify. These algorithms work by decomposing the user-item matrix into the product of two lower dimensionality rectangular matrices. The new intermediate dimension introduced by these matrices can be considered as a latent representation of the user-item interactions. In this new multi-dimensional space, similarities between users and items, thus recommendations as well can be easily computed.
In our case, we can consider our listings as users and the computed leaves as items, reshaping our problem as a big user-item matrix. This matrix will contain 1s if the process of predicting the price for the listing i ends on leaf j, 0 otherwise. What we need now is exploiting matrix factorization theory and computing an internal representation that reduces the size of the leaves space, allowing to compute similarities between listings in the new low-dimensional latent space. When tasked with suggesting similar listings for an unseen sample it’s enough to ask the first model the leaves generated while predicting the market price. Using the embeddings-to-leaf matrix, the latent representation can be easily obtained by means of a matrix multiplication.
We applied this technique to the inventory of two different cities: Milan and Brussels, trying to provide recommendations for listings already on the platform or created just for this purpose. All the hypothesis explained so far have proven to make sense, leading to a consistent similarity measure that correctly prioritises some market aspects and can be exploited to give satisfactory recommendations.
More specifically, our platform contains three kinds of listings: private rooms, shared rooms, and apartments. It goes without saying that a good recommendation system should not deviate from one category, thus recommending rooms if a room is passed as a reference, and so on. For as simple as it might sound, it’s important to stress that the used latent representations don’t know anything about listings kinds and have only tried to learn this feature in the task of recommending the best market price for a listing. It’s quite surprising then (and we were certainly impressed) how the developed recommendation system tends to first recommend listings of the same kind of the starting one.
Another cool aspect of the developed system is how it prioritises the location above the other listing features. It is very common to find that in all our tests the most similar listings are those closer to the reference, where the distance from POI’s (e.g. universities) has a predominant influence.
However, it’s not always so straightforward: learning these embeddings together with listings prices might force the model to consider as similar listings very different one with the others which, for some reasons, have a similar price. As an example, if we consider the listing represented at the bottom with the filled blue dot as a reference, we would expect it to share similar features with (some of ) the apartments circled in light blue. The algorithm though recommended a listing quite far away from the starting one and the two listings don’t have much in common other than the price (rent). The exact reason why the tree ensemble considers them this similar is pretty much a mystery, but trying to consider other variables together with the final price as model’s target might be a direction of research worth investigating.
Finally, we compared the performances of this method with a basic cosine similarity computed using the listings features directly.
When compared to basic cosine similarity, leaves embeddings provided better recommendations when considering location and kind of accommodation suggested.
The cosine similarity weights all the features in the same way, without accounting for their importance nor their contribution to the price. Learned leaves embeddings have shown to well summarise some specific and non-trivial market dynamics, though this approach is not exempted from limitations, especially when used alone. Learned leaves embeddings can be efficiently used as additional input features for other tasks, possibly together with other features obtained through collaborative algorithms.
HousingAnywhere Data Engineering Team is always looking for Data Scientists and Engineers to join our team! Check our openings and send your application!