Published in


You Are (Probably) Here: Better Map Pins with DBSCAN & Random Forests

Unbeholden to Geocoders

Geocoding venue addresses is not only expensive to do at scale, but also severely limited by the ability of geocoders to parse and error-correct a wide variety of international addresses. In addition, commercial geocoders’ address coverage and resolution vary widely across, and even within, countries. The point a geocoder returns could be the exact rooftop or doorstep coordinates in the best case, or an interpolation against published address ranges along the length of the street in the average case, or simply the centroid of the postal code or city.

From Filtering to Clustering

In the past, our approach to using these check-ins to place map pins was rather straightforward — for every venue, we simply computed the centroid of the latitude/longitude coordinates of its check-ins, and dropped the pin at that centroid. One problem with this approach is that GPS data is noisy, so not all check-ins are equal. Mobile phones report a horizontal accuracy along with latitude/longitude coordinates, where a numerically larger value corresponds to a higher error or lower confidence. When we noticed our centroids were thrown off by check-ins with large horizontal accuracy values, we naturally resorted to filtering them out, but that did not solve the entire problem either.

Check-ins at a Red Robin in Bellevue, WA showing distinct clusters centered around the actual restaurant location and on the other side of the street.

Combining DBSCAN and Random Forests

Having reduced our situation to a flavor of clustering problem, our goal became to build a system that was robust to outliers and which made no assumptions about the size or shape of individual clusters or how many clusters there were in the data. We identified DBSCAN, a clustering algorithm that is well known and tested in the industry, as being an effective option for our purposes. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a highly versatile algorithm, and, as the name suggests, it is particularly well suited to datasets that contain noise in a way that more popular clustering methods, such as k-means, are not. At its core, once again as its name suggests, DBSCAN is density-based, and especially when operating in two-dimensional space, the algorithm reflects the intuition by which our brains visually pick out clusters of points.

Distribution of check-ins at a Target in Tucker, GA (left) and at Sweet and Vicious, a bar in New York (right). Note that the map on the left is zoomed out to show a much larger area than the one on the right.
Top: The true shape (left) and street block shape (right) of a hair salon in Greece.
Bottom: The true shape (left) and street block shape (right) of a car wash in Turkey.
Even though the street block shapes in both cases contain other streets, these are pedestrian or private streets, and our definition excludes them.

Where We Will (Probably) Go From Here

At Foursquare, we have always believed in the power of mobile location data, and we are uniquely positioned in the industry to bring this power to bear on such a diverse set of problem domains. We continue to research other innovative uses of our location data to improve the coverage and quality of our venue database. Watch this space to learn more about how we’re using machine learning techniques to consume large amounts of data from external sources and use it to intelligently augment our own.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Rahul Pratap Maddimsetty

Engineering Manager at Facebook. Previously Engineering Director at Foursquare and Software Engineer at Microsoft.