Geolocalization plays an important role in most companies today. Some of them wouldn’t even make sense without this feature. The generalised use of smartphones allows for constant gathering of geolocated data. Where did you have dinner yesterday? Where do you work? Which stores have you visited? Business opportunities surrounding the use of this kind of data are ubiquitous.
When thinking about geolocation-based services, Foursquare comes easily to mind. For those who don’t know it, it’s a location-based social network where users can find nearby events, restaurants or places, and ‘check in’ to gain score points. Other popular, well-known apps that use geolocation services are Pokémon Go and Uber, which would be rendered useless without the ability to track their users’ positions.
A bit of business logic
Tiendeo uses geolocated data to match retailers’ physical store locations across different sources in order to create a one-to-one link between contents.
Our sources consist of lists of stores belonging to a given retailer. Each store has an associated geographical position represented by its latitude and longitude coordinates. These sources vary in size, ranging from fewer than 50 stores to more than 3,000 in certain cases.
The geographical position of a given store often differs slightly between sources, on the order of tens of metres. Content matching must be done in order to avoid duplicate content: showing the same store twice on Tiendeo’s website.
Associating content used to be a manual, repetitive task: a huge time sink, as well as a source of employee burnout from having to perform a seemingly endless mechanical task for hours at a time.
Coding the solution
What used to be a waste of resources was automated using the following approach. We decided to develop an efficient, one-to-one content matching algorithm based on geolocation. Our code accepts a tunable parameter, the maximum allowed distance between elements, which we empirically set to 750 m according to our knowledge of the sources.
A first iteration of this task involved creating a full distance matrix, computing the distances between all pairs of elements from our sources. This soon proved to be an inadequate approach, as the computational cost grows quadratically and many of the distance calculations are unnecessary. For two 10-element lists, 100 distance calculations are needed; but for two 1,000-element lists, the number grows to a million, resulting in computation times of about 10 minutes, excessive for our intended use. Thus, a second iteration was performed.
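The naive first iteration can be sketched as follows. This is a minimal illustration, not Tiendeo’s actual code: the function names `haversine_m` and `slow_matching` are my own (the latter mirrors the `SlowMatching` label used later), and the great-circle haversine formula stands in for whatever distance function the original used. Note that keeping only the closest candidate per element does not by itself guarantee a strict one-to-one mapping; resolving conflicts is left out for brevity.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    R = 6371000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def slow_matching(source_a, source_b, max_dist_m=750):
    """Naive approach: compute every pairwise distance (O(n*m) calls).

    Returns {index_in_a: index_in_b} for each element of source_a whose
    closest counterpart in source_b lies within max_dist_m metres.
    """
    matches = {}
    for i, (lat_a, lon_a) in enumerate(source_a):
        best_j, best_d = None, max_dist_m
        for j, (lat_b, lon_b) in enumerate(source_b):
            d = haversine_m(lat_a, lon_a, lat_b, lon_b)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches[i] = best_j
    return matches
```

With lists of 1,000 elements each, the inner loop runs a million times, which is exactly the blow-up described above.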
For each element of one list, we find nearby elements in the other source within a certain radius. We establish this radius via a simple coordinate subtraction of 0.08 degrees. The correspondence between a coordinate differential and a radius in metres isn’t constant: a difference of 0.08 in coordinates results in a radius of 1,400 m at the equator, and about 1,900 m at northern latitudes, such as Norway’s.
Then we calculate the distance from each element to those nearby candidates and match it to the closest element in this group. This results in a large reduction in computational cost, as we no longer need to compute distances between all elements, only between those within a small radius.
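A sketch of this second iteration, under the same assumptions as before (the names `fast_matching` and `haversine_m` are mine, hypothetical, with `fast_matching` mirroring the `FastMatching` label used later). The key point is that the prefilter uses only cheap coordinate subtractions; the expensive trigonometric distance is computed only for the few surviving candidates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    R = 6371000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def fast_matching(source_a, source_b, max_dist_m=750, box_deg=0.08):
    """Bounding-box prefilter: a cheap coordinate subtraction keeps only
    candidates within box_deg degrees, then the closest one within
    max_dist_m metres (if any) is matched."""
    matches = {}
    for i, (lat_a, lon_a) in enumerate(source_a):
        # cheap filter: no trigonometry, just coordinate subtraction
        candidates = [
            (j, lat_b, lon_b)
            for j, (lat_b, lon_b) in enumerate(source_b)
            if abs(lat_b - lat_a) < box_deg and abs(lon_b - lon_a) < box_deg
        ]
        best_j, best_d = None, max_dist_m
        for j, lat_b, lon_b in candidates:
            d = haversine_m(lat_a, lon_a, lat_b, lon_b)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches[i] = best_j
    return matches
```

As written the prefilter still scans the whole second list; in practice one would sort or index the second source by coordinates so that the candidate lookup itself is sublinear.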
Thanks to this hack we were able to reduce our computation time from 10 minutes to less than 5 seconds in a worst-case scenario.
Calculating distances between coordinates
Thanks to these tricks, we achieve precise and fast results for distances at city scale, away from the poles. You can read more about cheap ruler here.
Proof of concept
A small script can be found in my GitHub repository, comparing the performance gains of the method described above (FastMatching) against computing all distances between all elements (SlowMatching). Sample results look like this:
Made 12000000 distance computations between elements
FastMatching: 66.746ms
Made 18699 distance computations between elements