Embarrassingly Shallow Autoencoders (EASE) for Recommendations
Recommendations without the Ruffles
Post Summary
- Learn about a new(ish) State-of-the-Art recommender engine, EASE
- See all 5 lines of its code in action, step by step
- Understand the intuition behind why it works
- Discover some side benefits of its use (speed, diversity, …)
- Check out the full implementation in my PyTorch repo
Background
I always enjoy learning about Recommendation Systems that are both simple and effective. Too often I hear stories about complex engines with so many moving parts, or that are incredibly expensive to train, which then fail once they reach a production environment. Two authors I consistently see on the leaderboards of PapersWithCode.com are Steffen Rendle (Google) and Harald Steck (Netflix). The amazing thing about their research?
- Simple models that are easily understood
- Quick to implement
- Only user + item interactions needed as features
- SOTA performance
I am often guilty of jumping on the bandwagon of complexity when seeing a new deep learning architecture. It can be fun and exciting to live on the “cutting edge” of whatever domain you are interested in, adding layer upon layer of complexity. But do those extra layers actually produce the best results?
If you read my Graph-based recommender article, you may have noticed I referenced a great paper by Dacrema et al. (2019). In many domains/datasets, the best algorithms can be simple approaches:
- Matrix factorization
- Popularity-based
- Item or user similarity
Let’s embrace Occam’s Razor and stick to algorithms that produce great results without all of the fluff. This is where EASE comes into play. It is essentially an item-item similarity model, but with the constraint that an item is not allowed to be similar to itself. This forces the model to generalize and can produce results that are both interesting and diverse.
Example
I found out about EASE while browsing through PapersWithCode.com. It is the best algorithm by a wide margin on the Million Song Dataset, as well as in the Top 5 for a number of other datasets. The reference implementation was written in NumPy, and while it worked pretty well, I wanted to speed it up a bit with PyTorch. So I refactored it while attempting to learn how it worked, and you, dear reader, can follow me on this journey.
Let’s make a dummy example to see how this all works before we try to run it on a real dataset. We can make a few user + item pairs (ratings optional) and store them in a DataFrame. For each user and item, we will use an integer token to identify it. This will be our X matrix.
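Something along these lines (a minimal sketch with made-up users, items, and column names; not necessarily the exact data or code from the repo):

```python
import pandas as pd

# A handful of made-up user + item interactions (ratings are optional; for
# implicit feedback they could all just be 1)
df = pd.DataFrame({
    "user": ["Alice", "Alice", "Alice", "Bob", "Bob", "Carol", "Carol"],
    "item": ["Wall-E", "The Matrix", "Up", "The Matrix", "Blade Runner", "Up", "Blade Runner"],
    "rating": [5.0, 4.0, 3.0, 5.0, 4.0, 2.0, 5.0],
})

# Give each user and item an integer token
user_ids = {u: i for i, u in enumerate(df["user"].unique())}
item_ids = {it: i for i, it in enumerate(df["item"].unique())}
df["user_idx"] = df["user"].map(user_ids)
df["item_idx"] = df["item"].map(item_ids)
```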
Now, we could store this as a dense PyTorch Tensor, but it will probably be very sparse in a real-world application. Instead, we are going to keep it as a sparse tensor until we need it:
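Roughly like this (assuming the df, user_ids, and item_ids from the sketch above):

```python
import torch

# Sparse (users x items) interaction matrix; with implicit feedback every
# observed pair simply gets a value of 1
indices = torch.tensor([df["user_idx"].tolist(), df["item_idx"].tolist()])
values = torch.ones(len(df))
X = torch.sparse_coo_tensor(
    indices, values, size=(len(user_ids), len(item_ids))
).coalesce()
```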
From here, we need to build our B matrix, which is the item-to-item weight matrix that will be the source of our predictions:
P(user, item) = X[user] · B[:, item]
Steps
The EASE paper walks through the derivation of this closed-form solution, but here are the basics:
- Build the Gram matrix: G = X.t() @ X
- Add a regularization value (lambda) along the diagonal of G
- P = G.inverse()
- B = -P / diag(P), i.e., divide each column of P by its negated diagonal entry
- Set the diagonal of B to 0
Personally, I think it is easier to see these steps when trying to understand what they are doing:
(Figures: the intermediate matrices from Steps 1 through 5, computed on the dummy example.)
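Here is roughly what those five steps look like in PyTorch (a sketch based on my reading of the paper; lambda_reg and the variable names are my own):

```python
# The single hyper-parameter: regularization strength along the diagonal
lambda_reg = 250.0

X_dense = X.to_dense()                      # densify only now that we need it
G = X_dense.t() @ X_dense                   # Step 1: Gram matrix (item x item)
G += lambda_reg * torch.eye(G.shape[0])     # Step 2: add lambda along the diagonal
P = torch.linalg.inv(G)                     # Step 3: invert
B = -P / torch.diag(P)                      # Step 4: divide each column by its negated diagonal
B.fill_diagonal_(0.0)                       # Step 5: an item cannot recommend itself
```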
This is our weight matrix! Now we just multiply any vector that contains all of the interactions for a given user by B, and the output will be a scored list of predictions.
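For the toy example, that is just one line (user 0 is an arbitrary row here):

```python
# Predictions for a single user: their interaction vector times B
user_vector = X.to_dense()[0]   # all interactions for user 0
user_scores = user_vector @ B   # one score per item; high-scoring unseen items
                                # are the ones worth recommending
```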
These predictions should make some intuitive sense. High scores are assigned to items the user has already interacted with. The 4th item has not been interacted with and has the next largest score, so we should recommend it! We can also score a whole block of users at once by multiplying our entire X matrix by B (X.to_dense() @ B) to get the full set of predictions.
Intuition
I think of the matrix multiplication step like this:
- As a user, you have interacted with some items
- For each of those items, there are a number of similar items that are stored in the columns of B
- Each returned item recommendation is based on how similar it is to other items that you have already interacted with
- The best way to visualize these Matrix Multiplications might be with http://matrixmultiplication.xyz/.
Here I am using the term similar in a broad way: The Matrix and Wall-E may not be similar in genre, tone, etc., but it is possible that these two items are generally both watched by similar users, a weighted co-occurrence. If you liked Wall-E, other items watched by Wall-E watchers might be relevant to you as well. Possibly.
Now, we can go through our predicted scores, select the TopK items for each user, and return those item names. Simple! We did not even have to use gradient descent or the chain rule once.
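A minimal sketch of that last step, reusing the toy X and B from above (masking already-seen items and k=2 are my choices here, not necessarily what the repo does):

```python
# Batch recommendations: score every item for every user, hide items the user
# has already interacted with, then take the top-k per user
X_dense = X.to_dense()
scores = X_dense @ B
scores[X_dense > 0] = -float("inf")            # optional: don't re-recommend seen items
top_scores, top_items = torch.topk(scores, k=2, dim=1)

# Map the integer tokens back to item names
idx_to_item = {i: name for name, i in item_ids.items()}
recommendations = [[idx_to_item[int(i)] for i in row] for row in top_items]
```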
What other reasons might we use an approach like EASE, apart from how simple it is to get SOTA results? The paper’s author makes the case:
- A single hyper-parameter to tune (lambda), which seemed to have a marginal impact on diversity or performance
- Trains in minutes for most academic datasets, compared to hours/days
- Recommends more uncommon, unique items
Personally, I enjoy being recommended something rare and interesting, rather than something popular that I was probably already aware of. The novelty and joy of discovering something wonderful is uniquely satisfying.
For those of you who may be asking “Why is this an autoencoder? Where is the hidden layer?”: that is exactly why it is a shallow AE! There is no hidden layer compressing the user vector; instead, the weights in B map the input directly to the output layer.
Then why is this Embarrassing? I think the only embarrassment here is the fact that such a simple set of transformations remains so powerful as to be SOTA, while complex neural nets and ensembles lag behind.
Ratings
EASE can also handle explicit user feedback (e.g., five-star ratings) when making predictions. I have not fully tested that part of the code, but it should work in theory! More work to be done.
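In principle it just means putting the rating values into X instead of 1s; an untested sketch, with the same caveat as above:

```python
# Same construction as before, but the values are the explicit ratings
values = torch.tensor(df["rating"].tolist(), dtype=torch.float32)  # e.g. 1-5 stars
X_explicit = torch.sparse_coo_tensor(
    indices, values, size=(len(user_ids), len(item_ids))
).coalesce()
# ...then compute B from X_explicit exactly as before
```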
Metrics
Typically for these types of systems, I like to use HitRate@K as my metric of choice. It is simple to understand: if I predict [“1984”, “Attack Surface”, “Guards! Guards!”] and the held-out actual is [“1984”], that counts as a Hit. Plus, it aligns nicely with most business use cases.
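A bare-bones version of the metric might look like this (my own sketch, not the evaluation code from the repo):

```python
# HitRate@K: the fraction of users for whom at least one of their K predicted
# items appears in their held-out actuals
def hit_rate_at_k(predicted: dict, actual: dict) -> float:
    hits = sum(
        1 for user, items in actual.items()
        if set(items) & set(predicted.get(user, []))
    )
    return hits / len(actual)

# The example from the text: the held-out book was among the predictions -> a hit
hit_rate_at_k(
    predicted={"u1": ["1984", "Attack Surface", "Guards! Guards!"]},
    actual={"u1": ["1984"]},
)  # -> 1.0
```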
As a test case, I grabbed the GoodReads dataset, filtered to books in English, and trained EASE (lambda = 250.0) on 5 million book ratings. I then made 20 predictions for each of 100k users and compared the results: a HitRate@20 of 9.1%! Not too bad. If I had just predicted the most popular books in the GoodReads data, I would only have gotten 6%, so that is a definite improvement, along with some personalization. The entire pipeline took less than a minute to run!
Conclusion
I would encourage you all to check out my repo here, as well as the original Numpy version here. The paper itself is worthy of a read, as it is short and to the point. Tuan also just posted an article on other simple recommendation systems that you might want to check out.