A Word of Advice: Revamping Foursquare’s Tip Ranking Methodology
By: Enrique Cruz
At Foursquare, we pride ourselves on empowering our community to explore the world around them. Our consumer app, Foursquare City Guide, is a location-based recommendation engine and exploration app. One of the primary actions for our users is writing tips: short, public blurbs of text attached to a venue that often serve as a quick review or suggestion. Over the years, Foursquare users have written more than 95 million tips. While these tips are valuable, that is a lot of information for users to sift through. This is why determining which tips are “better” than others for a given venue is an important task within the Foursquare app ecosystem.
A few months ago, we revamped our strategy to select the best tips for a given venue. Our new ranking model greatly improves on our prior approaches and leverages contextual, text-based, and social signals, which allows us to select the tips that provide our users with the most informative, relevant, and high-quality content. In this post, we’ll go over our new methodology as well as how the model’s introduction yielded significant positive results as measured by various A/B tests across different use cases.
Historically, Foursquare has used a few different mechanisms for sorting and selecting the best tips at a venue — but we felt none of them were fully satisfactory on their own.
Let’s discuss a few of the most prominent ranking strategies previously used and review their challenges:
Popularity: This is a measure of the positive interactions a tip has garnered since its creation, such as “upvotes”. Popular tips are generally relevant or useful to users, but ranking by popularity tends to favor content that is old or stale, leading to a feedback cycle where highly ranked tips are more prominently exposed (thus gaining even more popularity). Continuously showing old tips can make our apps appear outdated and fails to leverage the very active user community that continuously provides us with awesome new tips.
Recency: This is a measure of the amount of time that has passed since the tip was created. This measurement does a great job at showcasing the vibrancy of the Foursquare community, yet it offers no guarantee of quality or relevance.
Our Shiny, New Tip Ranker
For our new tip ranker, we wanted to build on the successes of prior approaches and develop a system that not only balanced popularity and recency, but also allowed us to factor in other nuanced signals that help differentiate a bad tip from a great one.
In addition to popularity and recency as defined above, we included the following features in our revamped tip ranking model:
Language Identification: This is a language classifier built using an ensemble of open source and home-grown solutions in order to avoid serving tips in languages that a user does not understand.
Content Richness: These are several signals that track more general attributes and metadata about the tip beyond the actual information contained within the tip itself. Among these factors are the presence or absence of a photo, links to external sources, and the number of words the tip contains.
Author Trust: These are author statistics such as tenure as a Foursquare City Guide user, total popularity, and other aggregate facts around the user’s previously written tips. These signals attempt to capture a user’s trustworthiness as a tip author.
Global Quality: This is a set of scores from various statistical classifiers that are trained to identify specific traits of a tip. One example is sentiment, which is trained using the explicit “like” and “dislike” ratings that a user provided for a venue on the same day that the tip was written; Natural Language Processing (NLP) is then used to learn which words and phrases best predict each class of tips. Another is the likelihood of a tip being reported as spam, which is trained by looking at past tips reported as spam and learning the attributes that best correlate with those reports.
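To make the simpler signals concrete, here is a minimal sketch of how popularity, recency, and the content richness features described above could be turned into numbers. The `Tip` record, its field names, and the feature names are assumptions made for illustration; they are not Foursquare's actual schema, and the classifier-based signals (language, sentiment, spam) are left out.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


# Hypothetical tip record; field names are illustrative only.
@dataclass
class Tip:
    text: str
    created_at: datetime
    upvotes: int = 0
    photo_url: Optional[str] = None


URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(tip: Tip, now: datetime) -> dict:
    """Turn one tip into a flat dict of numeric ranking features."""
    age_days = (now - tip.created_at).total_seconds() / 86400.0
    return {
        "popularity": float(tip.upvotes),            # positive interactions
        "age_days": age_days,                        # recency signal
        "has_photo": 1.0 if tip.photo_url else 0.0,  # content richness
        "has_link": 1.0 if URL_PATTERN.search(tip.text) else 0.0,
        "num_words": float(len(tip.text.split())),
    }


now = datetime(2019, 1, 31, tzinfo=timezone.utc)
tip = Tip(
    text="Get the pastrami on rye. Menu: https://example.com/menu",
    created_at=datetime(2019, 1, 1, tzinfo=timezone.utc),
    upvotes=12,
)
features = extract_features(tip, now)
```

A real pipeline would likely scale or bucket these values before training, since raw word counts and ages in days live on very different ranges.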
Putting the New Features to Work and Collecting Training Data
In order to train our model using these new features, we generated some training data by leveraging existing crowdsourcing platforms. To collect our data, we first determined the top 1,000 most popular venues by user views and proceeded to randomly sample 100 distinct pairs of tips from each of these venues. After accounting for some language filtering and de-duplicating, this yielded a dataset of 75,000 tip pairs.
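The per-venue sampling step can be sketched roughly as follows. The function name, the input shape (a dict from venue id to tip ids), and the fixed seed are assumptions for illustration; the language filtering and popularity-based venue selection mentioned above are not shown.

```python
import random
from itertools import combinations


def sample_tip_pairs(tips_by_venue, pairs_per_venue=100, seed=0):
    """Sample up to `pairs_per_venue` distinct unordered tip pairs per venue."""
    rng = random.Random(seed)
    sampled = []
    for venue_id, tip_ids in tips_by_venue.items():
        unique_ids = sorted(set(tip_ids))  # drop duplicate tip ids
        all_pairs = list(combinations(unique_ids, 2))
        k = min(pairs_per_venue, len(all_pairs))
        # random.sample draws without replacement, so pairs are distinct.
        for first, second in rng.sample(all_pairs, k):
            sampled.append((venue_id, first, second))
    return sampled


# Toy input: venue ids mapped to the ids of tips written there.
tips_by_venue = {"v1": ["t1", "t2", "t3", "t4"], "v2": ["t5", "t6", "t6"]}
pairs = sample_tip_pairs(tips_by_venue, pairs_per_venue=3)
```

Enumerating every pair is quadratic in the number of tips, so for venues with thousands of tips a sketch like this would need to sample index pairs directly instead.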
We then created labels for this data by designing a job on Figure Eight (formerly CrowdFlower, a crowdsourcing platform for tasks similar to Amazon Mechanical Turk) where the judges would be shown a tip pair from our sample pool alongside the relevant venue. The judges were then asked the question, “If you were currently at this venue or considering visiting this venue, which of the following pieces of content is more informative?” We designed the test so that the tips would be shown in a similar context to the way they are displayed in the City Guide app, exposing our judges to all the same contextual information that affects the way our real users view a tip. Our Figure Eight job yielded around 50,000 labeled pairs of tips, which we divided into training and evaluation data.
To train our new tip ranker, we explored a variety of learning-to-rank algorithms, including LambdaMART, Coordinate Ascent, and RankBoost. After evaluating the results, we settled on SVMrank (a Support Vector Machine implementation designed for ranking) as our supervised learning algorithm. Our objective was to minimize the number of disordered pairs of tips, that is, pairs the model orders differently from our crowdsourced training labels.
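For readers unfamiliar with pairwise ranking, the sketch below shows the general idea: learn a linear scoring function so that the tip judged better in each labeled pair scores higher than the other. It minimizes a regularized hinge loss on score differences with plain subgradient descent, which is the same family of objective that ranking SVMs use, but it is not the SVMrank solver, and the two toy features and all hyperparameters are assumptions for illustration.

```python
import random


def train_pairwise_ranker(pairs, num_features, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Fit linear weights from (preferred, other) feature-vector pairs."""
    rng = random.Random(seed)
    w = [0.0] * num_features
    pairs = list(pairs)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - o for b, o in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(num_features):
                grad = reg * w[i]
                if margin < 1.0:  # pair is disordered or inside the margin
                    grad -= diff[i]
                w[i] -= lr * grad
    return w


def count_disordered(pairs, w):
    """Number of labeled pairs where the preferred tip does not score higher."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return sum(1 for better, worse in pairs if score(better) <= score(worse))


# Toy pairs: (features of preferred tip, features of other tip),
# with hypothetical features [scaled length, has_photo].
training_pairs = [
    ([0.9, 1.0], [0.2, 0.0]),
    ([0.7, 0.0], [0.1, 0.0]),
    ([0.5, 1.0], [0.4, 0.0]),
    ([0.8, 1.0], [0.3, 1.0]),
]
weights = train_pairwise_ranker(training_pairs, num_features=2)
disordered = count_disordered(training_pairs, weights)
```

The hinge loss is a convex stand-in for the disordered-pair count, which cannot be optimized directly with gradient methods because it is piecewise constant.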
As we iterated and tuned our new ranker, we evaluated its performance against a held-out dataset, comparing it against some baseline metrics. We also evaluated the rankers qualitatively with a new side-by-side tool that shows the best tips for a venue as chosen by each model.
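One simple way to run this kind of held-out comparison is pairwise accuracy: the fraction of labeled pairs in which a scoring function puts the judged-better tip first. The sketch below is an assumption about how such a check might look, not the evaluation code used here, and the four held-out pairs are made-up toy data rather than real results.

```python
def pairwise_accuracy(labeled_pairs, score):
    """Fraction of pairs where the judged-better tip gets the higher score."""
    if not labeled_pairs:
        raise ValueError("need at least one labeled pair")
    correct = sum(1 for better, worse in labeled_pairs if score(better) > score(worse))
    return correct / len(labeled_pairs)


# Toy held-out pairs (preferred tip first); feature names are hypothetical.
held_out = [
    ({"upvotes": 3, "age_days": 10, "num_words": 40},
     {"upvotes": 9, "age_days": 900, "num_words": 5}),
    ({"upvotes": 8, "age_days": 30, "num_words": 25},
     {"upvotes": 1, "age_days": 5, "num_words": 3}),
    ({"upvotes": 2, "age_days": 60, "num_words": 30},
     {"upvotes": 2, "age_days": 20, "num_words": 4}),
    ({"upvotes": 5, "age_days": 15, "num_words": 12},
     {"upvotes": 4, "age_days": 400, "num_words": 20}),
]

# Single-signal baselines to compare a trained ranker against.
baselines = {
    "popularity": lambda tip: tip["upvotes"],
    "recency": lambda tip: -tip["age_days"],  # newer tips score higher
    "length": lambda tip: tip["num_words"],
}
results = {name: pairwise_accuracy(held_out, fn) for name, fn in baselines.items()}
```

A trained ranker would be passed in the same way, as one more scoring function, so that every candidate is judged on identical pairs.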
In the final model, Tip Ranker with text features, these were the features with the highest weight:
- Tip length and number of tokens
- Presence of a photo
- Positive sentiment
The features with the least amount of predictive power turned out to be:
- Author’s aggregate statistics
A/B Testing Results and Applications
After the encouraging results of the newly trained tip ranker on our held-out dataset, we brought the model into production to be used on our entire venue corpus and leveraged it in various touch points within the Foursquare ecosystem. Below are some of the places we experimented with the new ranker, and the results from running A/B tests with a 50% split of our user base.
- Application “At a Venue Ping”: When we detect that a user is at a given venue with a certain likelihood, Foursquare sends the user a ping containing the best tip (not previously seen by the user) for the venue. This was previously determined using only the global quality features which fed into a random forest model for scoring, sorting, and filtering tip candidates.
- Result: Our new ranker yielded significant improvements against the control group, resulting in a 1.5% increase in the click-through rate, while also allowing us to send 32% more tip pings by removing some calibrated filters that existed due to a lack of confidence in the prior selection method. Furthermore, the experiment group saw a 5% increase in core app activity days.
- Application “Post Check-in Insight”: When our users check in on our other consumer app, Foursquare Swarm, we show certain pieces of content for the place the user just checked into. Among these is a Foursquare City Guide tip for the venue and an up-sell to view all tips if they have the Foursquare City Guide app installed (or download it otherwise). Previously, this tip selection was done purely on social signals.
- Result: The A/B test with the new model saw a significant increase in all tip related actions (such as “likes”, tips, and photos) as well as a net lift of 1% active users for Foursquare City Guide due to more users choosing to tap the up-sell.
- Application “Venue Page Default Sort”: When displaying a venue page, we show a list of the venue’s best tips in the highlight tab. This previously defaulted to a sort on the positive social signals for the tips. We ran an A/B test grouped by venues in order to measure any SEO changes.
- Result: While the logged-in version of the experiment yielded no significant results, the SEO version resulted in a lift of 2.40% in total global referral traffic. We hypothesize that this was mostly driven by the ranker’s preference for content that was longer, included more photos, and was written more recently.
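The experiments above split traffic either by user (50% of the user base) or by venue (for the SEO test). A common way to get a stable split of either kind is to hash the unit's id together with an experiment name. The sketch below illustrates that general technique; the function name, experiment labels, and choice of SHA-256 are assumptions, not a description of Foursquare's experimentation framework.

```python
import hashlib


def assign_bucket(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user or venue id to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if fraction < treatment_share else "control"


# User-level split, e.g. for a ping experiment (ids are made up).
user_bucket = assign_bucket("user-123", "tip-ranker-ping")

# Venue-level split, e.g. for a venue page sort experiment.
venue_bucket = assign_bucket("venue-456", "tip-ranker-venue-sort")
```

Including the experiment name in the hash keeps assignments independent across experiments, and bucketing by venue rather than by user means every visitor, including a search crawler, sees the same ordering for a given page.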
Future Work and Possible Extensions
There are a few areas of work left to explore that could yield further improvements in the way we select tips by incorporating new features into the model.
Some of these include:
- Negative Social Signals: At the time the model was built, Foursquare City Guide provided users only with ways to either “like” and save a tip, or flag it as spam. Since then, we have introduced a new interaction to downvote a tip. In the future, it would be interesting to retrain the model with this new signal to validate whether it has any predictive power.
- Sentiment to Rating Matching: The model overwhelmingly prefers tips with positive sentiment. While this is helpful, it presents some dissonance when a venue has a low rating, yet the top tips are mostly positive. An extension of this work can rank tips to show a sentiment distribution that better reflects the venue’s rating and its underlying distribution of votes.
All in all, it’s critical for us to continuously evaluate the way we process, track, and showcase user feedback, which contributes to our active user base and influx of location-based insights. By analyzing past approaches and experimenting with new techniques, we are able to serve our community with the most valuable information possible.