Vegas, Baby! A Distance-based Recommender System with the Yelp Dataset

Ken Cheah
7 min read · Oct 8, 2019


The Las Vegas Strip (Credit: Timelapse Photography — Emeric Le Bars)

Ah, Vegas. The lights, the sounds, the joyful lap-dancing abound. Or so I’ve heard (I haven’t actually been there). As they say, whatever happens in Vegas, stays in Vegas; we all know what happens in Vegas, but perhaps culinary experiences do not top that list.

For the 5th project in Metis’ Data Science Bootcamp, I decided to have a go at using Yelp’s Kaggle Dataset to build a distance-based recommender system. The working hypothesis: food is an afterthought in Vegas. People are indecisive, constantly hangry, and want to be told which restaurants are good within their immediate proximity, while still having their personal preferences and visit histories taken into account.

Recommender systems you say?

Recommender systems come implemented in various forms all across the web, the advantages of which are well documented. It is how you are kept in a never-ending loop of YouTube / Netflix binging or impromptu Amazon shopping sprees, often leaving you in a state of self-disgust (yes, we’ve all been there).

While platforms see a direct increase in revenue, consumers also benefit from having tailored suggestions pushed to them. Put simply, recommender systems show the right things to the right people at the right time.

Amazon Product Page (Source: Google)
Netflix User Dashboard (Source: Google)

Collaborative & Content-based Filtering

While the level of sophistication varies greatly from use case to use case, recommender systems are generally built on two fundamental filtering approaches: collaborative and content-based.

Collaborative Filtering vs Content Filtering

In a collaborative setting, birds of a feather flock together. Users with similar likes and dislikes are grouped together. If a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.

In this example, because Users A & B have both rated Crystal Jade and Han’s highly, they can be assumed to have similar tastes and preferences. If User A then goes on to rate NamNam highly (a restaurant User B has not yet visited), NamNam would be recommended to User B.
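
To make this concrete, here’s a minimal sketch of user-based collaborative filtering using cosine similarity on rating vectors. The restaurants and ratings below are made up purely for illustration, not taken from the dataset:

```python
import numpy as np
from numpy.linalg import norm

# Toy user-item rating matrix (rows: users, columns: restaurants).
# Columns: Crystal Jade, Han's, NamNam, McDonald's. 0 means "not yet rated".
ratings = np.array([
    [5, 4, 5, 1],   # User A
    [5, 5, 0, 2],   # User B (has not visited NamNam)
    [1, 1, 0, 5],   # User C
])

def cosine_sim(u, v):
    """Cosine similarity between two rating vectors."""
    return float(np.dot(u, v) / (norm(u) * norm(v)))

sim_ab = cosine_sim(ratings[0], ratings[1])
sim_cb = cosine_sim(ratings[2], ratings[1])
print(f"sim(A, B) = {sim_ab:.2f}, sim(C, B) = {sim_cb:.2f}")

# User A is far more similar to User B than User C is, so A's high rating
# for NamNam (column 2) carries more weight when predicting B's rating for it.
# In practice you would mean-centre the ratings or compare only co-rated
# items; the raw zero-filled cosine here is a simplification.
```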

In a content setting, however, recommendations are based on a similarity comparison between the description of an item and a profile of the user’s preferences. Similar restaurants are recommended to the user based on what they have rated highly in the past, as in the KFC and McDonald’s fast-food example above. There are many ways to assess item–preference similarity, but they are beyond the scope of this post.
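
Content-based filtering can be sketched just as simply: build a user profile from the features of restaurants the user liked and compare it against candidate items. The binary cuisine features below are entirely made up:

```python
import numpy as np

# Made-up item features: [fast_food, burgers, fried_chicken, fine_dining]
items = {
    "McDonald's":       np.array([1, 1, 0, 0]),
    "KFC":              np.array([1, 0, 1, 0]),
    "Fancy Steakhouse": np.array([0, 0, 0, 1]),
}

# Build a user profile as the average feature vector of highly rated items.
liked = ["McDonald's"]
profile = np.mean([items[name] for name in liked], axis=0)

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# KFC shares the fast-food feature with the profile, so it scores higher
# than the fine-dining option and would be recommended next.
scores = {name: cosine_sim(profile, feats)
          for name, feats in items.items() if name not in liked}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```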

Data Wrangling & Exploration

Now on to business. The datasets of focus here are the business and review JSON files. The former provides information on a business’s attributes, whereabouts, and average rating. The latter contains user reviews in the form of text, along with the ratings given to the establishments visited. Both ratings are on a scale of 1–5.

Restaurant-only establishments in Vegas were isolated from the rest. In total, there were about 1.2 million reviews for ~6,500 restaurants given by ~440,000 users. The visualization below shows a rather expected spread of user ratings — the better restaurants received more reviews than the worse ones.

User Rating Spread for Las Vegas Restaurants using Tableau
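
For reference, the isolation step might look roughly like this in pandas. File and column names follow the Yelp Open Dataset schema and may differ slightly in the Kaggle release; chunk sizes and cleaning details are omitted:

```python
import pandas as pd

# Business metadata: one JSON object per line in the Yelp dataset dump.
business = pd.read_json("yelp_academic_dataset_business.json", lines=True)

# Keep only restaurant-category businesses located in Las Vegas.
is_vegas = business["city"].str.strip().str.lower() == "las vegas"
is_restaurant = business["categories"].fillna("").astype(str).str.contains("Restaurants")
vegas_restaurants = business[is_vegas & is_restaurant]

# The review file is large, so read it in chunks and keep only reviews
# that belong to a Las Vegas restaurant.
keep_ids = set(vegas_restaurants["business_id"])
chunks = pd.read_json("yelp_academic_dataset_review.json",
                      lines=True, chunksize=500_000)
reviews = pd.concat(chunk[chunk["business_id"].isin(keep_ids)]
                    for chunk in chunks)
print(len(vegas_restaurants), len(reviews))
```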

By using review activity as a proxy, it quickly becomes clear where some of the eating hotspots exist in the city.

Heatmap of Restaurant Review Activity using Folium

Unsurprisingly, a lot of activity is concentrated along the Las Vegas Strip, with smaller pockets of activity in several other places. The heatmap also reassures us that the data has good coverage throughout the city.
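
The heatmap itself takes only a few lines with Folium’s HeatMap plugin. A sketch, assuming the vegas_restaurants frame from earlier carries latitude, longitude, and review_count columns:

```python
import folium
from folium.plugins import HeatMap

# Center the map roughly on Las Vegas.
m = folium.Map(location=[36.1147, -115.1728], zoom_start=11)

# Weight each restaurant's point by its review count so that busy
# establishments glow brighter on the heatmap.
heat_data = vegas_restaurants[["latitude", "longitude", "review_count"]].values.tolist()
HeatMap(heat_data, radius=12).add_to(m)

m.save("vegas_review_heatmap.html")
```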

Modeling

Alternating Least Squares Matrix Factorization


The first model used here is a Matrix Factorization Model built in Apache Spark. Matrix Factorization is the process of decomposing a larger user-item matrix into the approximate product of two lower-dimensional matrices. The idea behind this is to represent users and items in a lower-dimensional latent space (source).

By minimizing a loss function over the observed ratings, we obtain two lower-dimensional matrices that can be multiplied back together to reconstruct the full user-item matrix, with previously missing entries filled in as “predicted ratings”.
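
For reference, the standard explicit-feedback objective being minimized looks like this, with user factors x_u, item factors y_i, observed ratings r_ui, and a regularization weight λ:

```latex
\min_{X,\,Y} \; \sum_{(u,i)\,\in\,\text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
\;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

ALS alternates between solving for the user factors with the item factors held fixed, and vice versa, each of which is a simple least-squares problem.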

Watch: Luis Serrano’s Video on Matrix Factorization

In the figure above, you see a 4 x 5 user-item matrix as the approximate product of a 4 x 2 user matrix and a 2 x 5 item matrix. While latent features can be seen as different cuisines in the example above, they should be thought of as groupings of similar observed features (like topics) identified from patterns in the data — these are often not easily labeled.

Having preprocessed the data into a form readily accepted by Spark’s MLlib, I ran a custom grid search on AWS EC2 to find the optimal hyperparameters for the model.
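
A minimal sketch of that Spark side, assuming the reviews have already been converted to a DataFrame with integer-indexed userId and businessId columns (the grid values shown are illustrative, not the ones actually tuned):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("vegas-als").getOrCreate()

# Hypothetical preprocessed file with columns: userId, businessId, stars.
ratings = spark.read.parquet("vegas_ratings.parquet")
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="stars",
                                predictionCol="prediction")

best_rmse, best_model = float("inf"), None
for rank in [10, 20, 50]:              # size of the latent space
    for reg in [0.01, 0.1, 1.0]:       # regularization strength
        als = ALS(rank=rank, regParam=reg, maxIter=10,
                  userCol="userId", itemCol="businessId", ratingCol="stars",
                  coldStartStrategy="drop")  # drop NaN predictions during eval
        model = als.fit(train)
        rmse = evaluator.evaluate(model.transform(test))
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model

print(f"Best RMSE: {best_rmse:.3f}")
```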

AutoEncoder Deep Learning Model

Source: Various (Google Search)

The second model used was a simple autoencoder built in Keras. Autoencoders work by squeezing the data through a bottleneck (the encoder) into a latent-space representation and then attempting to reconstruct the original input from that same representation (the decoder). The effect is that the bottleneck learns the most useful latent features, which can then be generalized to reconstruct the user-item matrix.

While both models described here can be said to be similar in that they reconstruct the user-item matrix from a latent-space representation of the data, the Autoencoder model allows non-linear relationships to be learned at the cost of model scalability. Check out this excellent read for an in-depth exploration of autoencoders.
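
A minimal Keras sketch of the idea, assuming each training row is a user’s zero-filled vector of ratings over the ~6,500 restaurants (the layer sizes here are placeholders):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_items = 6500            # one input/output unit per restaurant
latent_dim = 64           # size of the bottleneck

inputs = keras.Input(shape=(n_items,))
encoded = layers.Dense(256, activation="relu")(inputs)
bottleneck = layers.Dense(latent_dim, activation="relu")(encoded)   # encoder
decoded = layers.Dense(256, activation="relu")(bottleneck)
outputs = layers.Dense(n_items, activation="linear")(decoded)       # decoder

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# X is the zero-filled user-item rating matrix; the autoencoder is trained to
# reproduce its own input, and its outputs on the unrated (zero) entries serve
# as predicted ratings. A masked loss that ignores unrated entries is a common
# refinement, omitted here for brevity.
X = np.random.rand(1000, n_items)  # placeholder for the real rating matrix
autoencoder.fit(X, X, epochs=10, batch_size=64, validation_split=0.1)
```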

A Snippet of Predicted Ratings

Recommender App Features

Onward to location-based functionality! The envisioned app comes with four types of recommendations: Same Same, Try it Again, Explore the Beyond & Try Something New. In building these features, each business’s coordinates were folded into the recommendation process so that only restaurants within a fixed geographical radius of the user are considered.
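
The distance filter itself is straightforward: compute the haversine (great-circle) distance from the user’s current coordinates to every candidate restaurant and keep those inside the radius. A sketch, with an assumed 2 km radius and assumed column names:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def nearby(candidates, user_lat, user_lon, radius_km=2.0):
    """Filter a DataFrame of scored restaurants to those within the radius."""
    dists = haversine_km(user_lat, user_lon,
                         candidates["latitude"].values,
                         candidates["longitude"].values)
    return candidates[dists <= radius_km].sort_values("predicted_rating",
                                                      ascending=False)
```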

In serving recommendations to existing users, the user’s current location would be identified and the system would generate recommendations based on this reference point. The Yelp API would also be called upon to display salient business information such as contact details and restaurant photos on the map.
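
The business lookup might look roughly like this against the Yelp Fusion business-details endpoint (the API key handling and the exact response fields are assumptions to be checked against the current API docs):

```python
import requests

YELP_API_KEY = "..."  # hypothetical key; in practice, load from the environment

def business_details(business_id):
    """Fetch display information for one business from the Yelp Fusion API."""
    resp = requests.get(
        f"https://api.yelp.com/v3/businesses/{business_id}",
        headers={"Authorization": f"Bearer {YELP_API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # Keep only the fields the map popup needs.
    return {
        "name": data.get("name"),
        "phone": data.get("display_phone"),
        "photos": data.get("photos", []),
        "url": data.get("url"),
    }
```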

Flask Demo for Existing Users

But, wait. How do we recommend restaurants to new users with no review histories? This is known as the Cold-Start Problem — a well-known machine learning problem (especially with recommender systems) where insufficient data undermines the model’s ability to draw inferences.

Literally a Cold-Start Problem Right Here

To tackle this problem, I analyzed cuisine popularity and presence in the city. Identifying the popular cuisines means that new users can be asked to choose their cuisine preferences when creating an account, much like how Netflix handles new sign-ups.

Cuisine Presence in Las Vegas using Tableau

Once a user has given their cuisine preferences, popular and highly rated restaurants within the area can be served to them; from there, the user builds up a personalized data trail for the recommender system to “adapt” to.
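
A sketch of that fallback path, reusing the haversine_km helper from earlier and assuming the business table carries categories, stars, and review_count columns:

```python
def cold_start_recommendations(vegas_restaurants, preferred_cuisines,
                               user_lat, user_lon, radius_km=2.0, top_n=10):
    """Popularity-based fallback for users with no review history."""
    # Keep restaurants matching at least one of the chosen cuisines.
    pattern = "|".join(preferred_cuisines)
    matches = vegas_restaurants[
        vegas_restaurants["categories"].fillna("").astype(str)
        .str.contains(pattern, case=False)
    ]
    # Restrict to the user's immediate vicinity (haversine filter from earlier).
    dists = haversine_km(user_lat, user_lon,
                         matches["latitude"].values,
                         matches["longitude"].values)
    close_by = matches[dists <= radius_km]
    # Rank by average star rating, breaking ties with review count.
    return close_by.sort_values(["stars", "review_count"],
                                ascending=False).head(top_n)

# e.g. cold_start_recommendations(vegas_restaurants, ["Mexican", "Thai"],
#                                 36.1147, -115.1728)
```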

Flask Demo for New Users

Wrapping Up

Customary Image on Recommender Systems All Write-Ups Seem to Have

The work produced here could, of course, enjoy a greater level of sophistication. If time were not an issue, it would have been worth exploring NLP on the review text, some time-series analysis, and restaurant attributes to further personalize each user’s experience.

I will revisit this in the future. But for now, this is where I conclude.
