Switching to a Probabilistic Model for Venue Search in Foursquare

Our Pilgrim contextual location technology (used by Foursquare City Guide and Foursquare Swarm as well as by numerous other partners via Pilgrim SDK) involves:

  • collecting data from our users’ phones
  • detecting whether the user has stopped at some venue
  • and then determining which specific venue the user is most likely to be in right now using all of the phone’s available sensor data. This is what we will look at in this post!

Since February, when we rolled out a new probabilistic model for determining our Pilgrim users’ locations, we have seen dramatic improvements in both accuracy and user experience.

Our improvements:

  • Our accuracy in venue search improved by 4%.
  • We had a 6% increase in active users.
  • We doubled the number of unique venues we detect our users have visited: we correctly identify a wider variety of locations.
  • Our model became more precise at modeling the visit distribution of sub-venues (such as bars or shops in hotels, malls, and airports), resulting in an increase in passive visits detected for this key commercial segment.
  • The new model significantly lowered the time needed before we start making predictions for new venues (those recently opened or added to our database).

Our machine-learned model uses all of the spatiotemporal data available to the phone (latitude and longitude, time of day, visit history, wifi data, and more) to predict the most likely venue the user is at and to assign an accurate probability to each likely candidate.

After a thorough analysis, we hypothesized that our past approach of deterministically selecting the top venue under-represented some venues, and that a probabilistic approach would, in the limit, yield a much more accurate representation of the true visit distribution as captured by our ground-truth data (check-ins and place confirmations).

By modeling visits as a probability distribution over many venues, as opposed to our previous winner-take-all approach, we are able to better represent our uncertainty and deliver more accurate results in the aggregate.
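To make the difference concrete, here is a toy sketch of the two attribution strategies. The venue names, probabilities, and visit counts are invented for illustration; they are not Foursquare data.

```python
# Two venues share a building; suppose the model's per-visit probabilities
# are 60% coffee shop and 40% bakery, over 1,000 observed stops.
visits = 1000
probs = {"coffee_shop": 0.6, "bakery": 0.4}

# Winner-take-all: every stop is credited to the argmax venue.
top = max(probs, key=probs.get)
winner_take_all = {v: (visits if v == top else 0) for v in probs}

# Probabilistic: each stop contributes fractionally to every candidate.
fractional = {v: visits * p for v, p in probs.items()}

print(winner_take_all)  # {'coffee_shop': 1000, 'bakery': 0}
print(fractional)       # {'coffee_shop': 600.0, 'bakery': 400.0}
```

In aggregate, the winner-take-all tally erases the bakery entirely, while the fractional tally recovers the true 60/40 split.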

Old Model: search ranking algorithms

Previously, we used a search ranking algorithm optimized to rank the correct venue in the top spot as often as possible. Once a user stopped somewhere, the model would perform a search, rank all possible choices, and suggest that the user check in to the top choice.

We trained this earlier model using RankLib, a library of ranking algorithms, but there were certain things we could not get our model to do.

For instance, while it was decently good at predicting the most likely venue the user was at, it was much worse at estimating the total number of visits a venue receives in a month, nor could it provide an easily interpretable measure of how confident it was in a given prediction.

Additionally, these training algorithms were not easily parallelized, so as the size of our data grew, we found ourselves limited to training with only a subset of our data.

Much Better: regression trees and probabilities

We spent eight months developing the new method. Our latest model, “Weeping Angel,” estimates the probability that a user is at a venue given the available context. This model is not only more accurate for individual predictions (since it is able to leverage numerous newly added features), but it also lets us assign fractional visits based on the probability given to each candidate; aggregated in the limit, these fractions give us a much more accurate representation of the true distribution of visits to venues around the world.

Our thought process:

  • come up with a best possible hypothesis from our available data;
  • come up with a best hypothesis given infinite data and resources;
  • converge toward the ideal venue search, which returns the true results.

We know that as we train our model on more data our estimate converges to the true probability (a property known as “statistical consistency”). So we trained our models using “maximum likelihood estimation” — a technique whose results are proven to approach the truth as our data grows.
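As a toy illustration of this consistency property, consider estimating a single Bernoulli probability (was the user at this venue or not?) by maximum likelihood. The true probability and sample sizes below are invented; for Bernoulli data the MLE is just the sample mean, and it tightens around the truth as the data grows.

```python
import random

random.seed(0)
p_true = 0.3  # hypothetical true probability that a stop is at a given venue

def mle_estimate(n):
    # For confirm/deny (Bernoulli) outcomes, the maximum likelihood
    # estimate of the probability is simply the sample mean.
    samples = [1 if random.random() < p_true else 0 for _ in range(n)]
    return sum(samples) / n

# The estimate converges to p_true as the sample grows (consistency).
for n in (100, 10_000, 1_000_000):
    print(n, round(mle_estimate(n), 4))
```

The printed estimates drift ever closer to 0.3, which is exactly the behavior we rely on when training on real visit data at scale.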

But we also had to make sure that our hypothesis is representative of reality, as close as possible to the truth. Hence our models consist of a deep ensemble of regression trees rather than overly simple linear models, incorporating as many features as necessary.

To implement this vision, we decided to use XGBoost, an optimized distributed gradient boosting library. Not only is it a framework designed and optimized for boosted-tree algorithms, but it also allows for distributed training. It used to take us 24 hours to train on half a million data points (the maximum a single machine could handle); now it takes one hour to train on 2 million examples.

We customized XGBoost to fit our needs: learning at a much larger scale (with millions of classes, each representing a different venue) and learning from users’ denials as well as their confirmations.
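A minimal sketch of the core idea, using scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost (our production setup is far larger and customized). The features, data, and labels below are all invented; the point is that confirmations become positive examples, denials become negative examples, and the trained ensemble outputs an interpretable probability.

```python
# Sketch: estimating visit probabilities with gradient-boosted trees,
# learning from both confirmations (1) and denials (0).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000

# Hypothetical per-candidate features: distance to the venue (meters),
# hour of day, and the user's prior visit count at that venue.
X = np.column_stack([
    rng.uniform(0, 200, n),    # distance
    rng.integers(0, 24, n),    # hour
    rng.poisson(2.0, n),       # prior visits
])

# Labels from user feedback: nearby stops tend to be confirmed, with
# 10% label noise mixed in.
signal = (X[:, 0] < 80).astype(int)
noise = (rng.random(n) < 0.1).astype(int)
y = signal ^ noise

model = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X, y)

# predict_proba yields an interpretable confidence for each candidate,
# which a rank-only model cannot provide.
p = model.predict_proba([[30.0, 12.0, 3.0]])[0, 1]
```

A stop 30 meters from the venue gets a high predicted probability, while the same call for a distant stop would yield a low one; those probabilities are what we aggregate downstream.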

Testing and calibrating

One of Pilgrim’s biggest advantages is having millions of users we can leverage to assess how well a new model detects the venue at which they have stopped. Users can choose to confirm or deny a detected visit. This feedback not only gives us the rich training data necessary to build our models but also provides a natural A/B testing setup for evaluating their accuracy. Based on users’ feedback we now know that our updated model is returning more accurate predictions, and we are getting closer to the ideal venue search.

Analyzing the predicted probabilities, we found that most of our errors came from venues assigned less than 1% probability; once we removed these cases, our accuracy improved. Using a metric called Normalized Absolute Deviation (NAD), we determined that our probability miscalibration was largely due to overestimating probabilities below 1%.

Applying the law of large numbers (the principle that the average of empirical results from a large data set approaches the true underlying value), we sum the probabilities of a venue being visited by a user over the course of a month, then compute NAD with respect to our ground-truth data, and end up with an accurate estimate of the share of visits a venue gets per month.
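The aggregation step above can be sketched in a few lines. The stops, probabilities, and ground-truth counts are invented, and the NAD formula shown is one plausible form (total absolute error over total true visits); the source does not spell out the exact definition.

```python
# Hypothetical month of detected stops for one user; each row holds the
# model's probability that the stop was at each candidate venue.
stops = [
    {"cafe": 0.7, "bank": 0.3},
    {"cafe": 0.6, "bank": 0.4},
    {"cafe": 0.2, "bank": 0.8},
]

# Expected visits per venue = sum of per-stop probabilities; by the law
# of large numbers this approaches the venue's true visit count.
expected = {}
for stop in stops:
    for venue, p in stop.items():
        expected[venue] = expected.get(venue, 0.0) + p

# Ground-truth counts from check-ins and confirmations (invented numbers).
truth = {"cafe": 2, "bank": 1}

# One plausible form of normalized absolute deviation: total absolute
# error divided by the total number of true visits.
nad = sum(abs(expected[v] - truth[v]) for v in truth) / sum(truth.values())
print(expected, nad)
```

Summing fractional probabilities this way lets low-probability candidates accumulate visits over many stops instead of being zeroed out by an argmax.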

By calculating visits this way we were able to address a few common issues for location intelligence:

  • popular venues get attributed too many visits.
  • sub-venues within bigger venues (bars or shops inside hotels, airports, or malls) get too few visits.
  • new venues get few visits, and it takes a long time for them to start getting more.

With the launch of the new model we were able to successfully address these issues in a way that will continue to improve as our panel size grows.

We are continuously making improvements to our Pilgrim contextual location technology. We are always looking for talented engineers and data scientists to help us push the boundaries of our tech and continue to offer a best-in-class service to our apps and partners. Reach out to us if you are interested!