Providing predictive geo-location/spatial data recommendation services

Geolocated data exists across a wide range of industries: travel, telecom, finance, marketing, advertising, manufacturing, and more. Applying machine learning techniques to such data makes it possible to identify and extract hidden patterns and insights that are meaningful to applications like fraud prevention and personalized marketing. My own experience with spatial data has been mostly in ad hoc data analysis (cohort analysis, customer segmentation) and in building data warehouses and ETL pipelines, and less in machine learning and predictive analytics.

I’ll use scikit-learn to explore a few algorithms and eventually build a geolocated venue recommender and a geofencing alerting engine. In particular, we will also see how to:

  • extract patterns from geolocated data when metadata is lacking
  • apply machine learning clustering algorithms and techniques

The data I’ll use to demonstrate this comes from Gowalla, a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using Gowalla’s public API. The dataset has a total of 6,442,890 check-ins by these users over the period Feb. 2009 to Oct. 2010.
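Before any clustering, the check-ins need to be loaded into a table. Here is a minimal sketch of that step with pandas. The column layout (user, check-in time, latitude, longitude, venue id, tab-separated, no header) is my assumption about how the check-in dump is organized, and `load_checkins` and the sample rows are made up for illustration, so adjust them to the actual file you download.

```python
import io

import pandas as pd

# Assumed column layout: tab-separated, no header row.
COLUMNS = ["user_id", "checkin_time", "latitude", "longitude", "venue_id"]


def load_checkins(source):
    """Read a tab-separated check-in file (path or file-like) into a DataFrame."""
    df = pd.read_csv(source, sep="\t", names=COLUMNS, parse_dates=["checkin_time"])
    # Keep only rows whose coordinates are valid latitude/longitude values.
    valid = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)
    return df[valid].reset_index(drop=True)


# Tiny made-up sample, only to show the expected shape of the input.
sample = io.StringIO(
    "0\t2010-10-19T23:55:27Z\t30.2359\t-97.7951\t101\n"
    "0\t2010-10-18T22:17:43Z\t30.2691\t-97.7494\t102\n"
    "1\t2010-10-17T23:42:03Z\t400.0\t-97.7633\t103\n"  # invalid latitude, dropped
)
checkins = load_checkins(sample)
print(checkins[["user_id", "latitude", "longitude", "venue_id"]])
```

For the real file you would pass its path instead of the `StringIO` sample.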

OUTPUT: Suppose you are new to a place and use an app that has access to your geolocation (latitude and longitude). Based on past data and ‘popular places’, our recommendation engine can suggest a nearby place to visit. The hyperlink points to a short wiki description or a Yelp page.
For the impatient: Find the complete analysis here:

Clustering algorithms can be used to determine which geographical areas are commonly visited and “checked into” by a given user, and which areas are not. Such geographical analyses enable a wide range of services, from location-based recommenders to advanced security systems, and in general provide a more personalized user experience.

We will look at two basic analyses:

Basic analysis

First, a simple recommendation engine that suggests, for example, the most trending venues in a given area. In particular, k-means clustering can be applied to the dataset of geolocated events to partition the map into regions.
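To make the idea concrete, here is a rough sketch of such a recommender, not the exact code from the notebook. It clusters check-in coordinates with scikit-learn's `KMeans`, counts check-ins per venue inside each region, and recommends the top venue of whichever region a new position falls into. The function names, the column names and the synthetic two-town data are all assumptions of mine. Note also that k-means on raw latitude/longitude uses plain Euclidean distance, which is only a reasonable approximation over small areas.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def trending_venues(checkins, n_regions=2, top_n=1, seed=0):
    """Partition check-ins into regions and rank venues by check-in count per region."""
    coords = checkins[["latitude", "longitude"]].to_numpy()
    km = KMeans(n_clusters=n_regions, init="k-means++", n_init=10, random_state=seed)
    labeled = checkins.assign(region=km.fit_predict(coords))
    counts = (
        labeled.groupby(["region", "venue_id"]).size()
        .rename("checkins").reset_index()
        .sort_values(["region", "checkins"], ascending=[True, False])
    )
    # Keep the top_n most visited venues of every region.
    return km, counts.groupby("region").head(top_n)


def recommend(km, top, lat, lon):
    """Return the trending venue ids of the region that contains (lat, lon)."""
    region = int(km.predict(np.array([[lat, lon]]))[0])
    return top.loc[top["region"] == region, "venue_id"].tolist()


# Synthetic data (hypothetical): two towns far apart, two venues each.
rng = np.random.default_rng(0)
town_a = rng.normal([30.27, -97.74], 0.01, size=(50, 2))
town_b = rng.normal([40.71, -74.00], 0.01, size=(50, 2))
coords = np.vstack([town_a, town_b])
venues = ["a1"] * 30 + ["a2"] * 20 + ["b1"] * 35 + ["b2"] * 15
demo = pd.DataFrame(
    {"latitude": coords[:, 0], "longitude": coords[:, 1], "venue_id": venues}
)

km, top = trending_venues(demo)
print(recommend(km, top, 30.27, -97.74))
```

On real data the number of regions is a tuning choice, and "trending" could be restricted to recent check-ins rather than all of them.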

Here is an image from the analysis using k-means++ and SciPy’s Voronoi tessellation:

Voronoi diagram from our analysis, included in the Jupyter notebook.
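The link between the two tools is that the regions k-means assigns points to are exactly the Voronoi cells of its centroids: every location belongs to the nearest centroid. A small sketch of that relationship, on made-up points rather than the check-in data, could look like the following. The five hotspot coordinates are invented for the example.

```python
import numpy as np
from scipy.spatial import Voronoi
from sklearn.cluster import KMeans

# Made-up hotspots as (longitude, latitude), so x is longitude and y is latitude.
hotspots = np.array([
    [-97.74, 30.27],
    [-97.70, 30.30],
    [-97.78, 30.31],
    [-97.72, 30.22],
    [-97.80, 30.24],
])
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(h, 0.003, size=(40, 2)) for h in hotspots])

km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(points)

# Build the Voronoi diagram of the centroids; each cell is one k-means region.
vor = Voronoi(km.cluster_centers_)

# Check the equivalence on random query points: the nearest Voronoi site
# is the same cluster that k-means predicts.
queries = rng.uniform([-97.82, 30.20], [-97.68, 30.33], size=(20, 2))
dists = np.linalg.norm(queries[:, None, :] - vor.points[None, :, :], axis=2)
nearest_site = dists.argmin(axis=1)
print((nearest_site == km.predict(queries)).all())
```

If matplotlib is installed, `scipy.spatial.voronoi_plot_2d(vor)` draws the resulting diagram.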

Second, determine geographical areas that are specific and personal to each user. In particular, I will use a density-based clustering technique such as DBSCAN to extract the areas where a user usually goes. This analysis can be used to determine if a given data point is an outlier with respect to the areas where a user normally checks in.
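As a sketch of how this could work, assuming a 500 m neighborhood radius and a minimum of 5 check-ins per area (both values are arbitrary choices of mine), DBSCAN can be run with the haversine metric on coordinates converted to radians. DBSCAN has no `predict` method, so the `is_outlier` helper below is my own simple rule: a new point counts as an outlier if no core sample lies within the radius. The user history is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0


def fit_user_areas(latlon_deg, eps_km=0.5, min_samples=5):
    """Cluster one user's check-ins given as (latitude, longitude) in degrees."""
    X = np.radians(latlon_deg)  # the haversine metric expects radians
    return DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # convert km to an angle in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    ).fit(X)


def is_outlier(db, latlon_deg, eps_km=0.5):
    """Hypothetical rule: outlier if no core sample is within eps_km of the point."""
    core = db.components_  # core samples, in radians
    if len(core) == 0:
        return True
    d_km = haversine_distances(np.radians([latlon_deg]), core) * EARTH_RADIUS_KM
    return bool(d_km.min() > eps_km)


# Synthetic history: a "home" area, a "work" area, and one stray check-in.
rng = np.random.default_rng(2)
home = rng.normal([30.27, -97.74], 0.001, size=(30, 2))
work = rng.normal([30.30, -97.70], 0.001, size=(30, 2))
stray = np.array([[31.00, -97.00]])
history = np.vstack([home, work, stray])

db = fit_user_areas(history)
print(sorted(set(db.labels_)))            # -1 marks noise points
print(is_outlier(db, [30.27, -97.74]))    # check-in near home
print(is_outlier(db, [48.85, 2.35]))      # check-in far from any usual area
```

A geofencing alert could then be raised whenever `is_outlier` returns `True` for a new check-in.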

scikit-learn provides this great comparison of clustering algorithms:

From the scikit-learn website

For the remaining part of this post, I’ll stick with the Jupyter notebook, which includes explanations and the complete analysis and code. Find the larger view of the notebook here: