Using the Triangle Inequality to Query Geographic Data

A fast and straightforward method of querying large sets of locations.

João Paulo Figueira
tb.lx insider
Oct 28, 2019



A few years back, I found myself having to implement the DBSCAN algorithm. I did so with the support of a paper that illustrated the whole implementation with pseudo-code [1]. That paper also relied on an ingenious way of indexing locations, based on a simple mathematical principle known as the triangle inequality.

The triangle inequality states that, in any triangle, the length of each side is less than or equal to the sum of the lengths of the other two. In metric spaces, this is also a defining property of distance functions: a distance function takes two arbitrary points and returns a value that satisfies a handful of mathematical properties, the triangle inequality among them. So how can we use this property to query geographic data? And what do I mean by “query the data”? Two kinds of query report essential information about a geographic database. The first asks for the points within a given radius of a given center: how many Italian restaurants can I find within 5 km of my house?
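In symbols, for a distance function d and any three points x, y, and z:

d(x, z) ≤ d(x, y) + d(y, z)

The geodesic distance between points on the Earth's surface satisfies this property, which is what we will exploit below.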

The second query is about the locations nearest to a given reference point. For example, you could ask for the three Italian restaurants closest to your house. Both queries are variations on a theme, so we can focus on the first one only.

The first thing to observe is that calculating distances is an expensive process. When measuring distances along geodesic lines, we must resort to relatively complex calculations like the well-known haversine distance. Its formula uses lots of trigonometric functions, which are somewhat expensive to evaluate, even on modern hardware. So what can we do about this issue? On the one hand, we can simplify the calculation at the expense of precision. On the other hand, we can avoid calculating most distances altogether, which is where the triangle inequality comes in.
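As a point of reference, here is a minimal sketch of a vectorized haversine computation with NumPy. The function name and the Earth-radius constant are my own illustrative choices; the repository's implementation may differ in detail.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_distance(lat, lon, lats, lons):
    """Vectorized haversine distance (in metres) between one point (lat, lon)
    and arrays of points (lats, lons). All inputs are in decimal degrees."""
    lat, lon = np.radians(lat), np.radians(lon)
    lats, lons = np.radians(lats), np.radians(lons)
    dlat = lats - lat
    dlon = lons - lon
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat) * np.cos(lats) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
```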

The Dublin Bus Example

To illustrate the concepts and the code, I use an open-source dataset of vehicle trajectories published by the Smart Dublin website. The dataset contains information about the daily trips of a set of buses driving through the city of Dublin during January of 2013. Once fully loaded, this dataset includes over forty-four million geographically referenced points. It is a unique challenge for our geographic queries.

To automatically download the data, you can use a specialized GitHub repository that I have prepared for this purpose and further exploration of the dataset. The data downloading feature is available in the first notebook of the catalog. Once you have the data ready, please head on to this article’s repository.

In this article, I compare and contrast three different approaches to answering such queries. The discussion starts with the naïve brute-force approach, where all distances must be calculated. Next, I adjust this approach to derive the triangle-inequality-based variant. Finally, I illustrate a standard off-the-shelf method for handling these queries, and its unexpected weaknesses.

Brute-force Queries

The brute-force approach is very simple-minded, as it always calculates all distances when answering a query. When asked which points lie within a hundred meters of a given position, the brute-force query first calculates the distances between the reference point and every point in the dataset. The code then filters the records that fall within the given radius and reports them. For the query that asks for the closest locations, the principle is the same.
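A minimal sketch of both brute-force queries, reusing the haversine_distance function from the previous section (the function names here are again mine, not necessarily the repository's):

```python
import numpy as np

def brute_force_radius_query(lat, lon, lats, lons, radius_m):
    """Indices of all dataset points within radius_m metres of (lat, lon)."""
    distances = haversine_distance(lat, lon, lats, lons)  # one distance per point
    return np.nonzero(distances <= radius_m)[0]

def brute_force_knn_query(lat, lon, lats, lons, k):
    """Indices of the k dataset points closest to (lat, lon)."""
    distances = haversine_distance(lat, lon, lats, lons)
    return np.argsort(distances)[:k]
```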

If you expect this brute-force method to be slow, you are in for a surprise. On my MacBook Pro, I get a radius query result in under four seconds. It is usable for a couple of ad-hoc queries, but I am sure it scales poorly, even with some smart code parallelization. Much of this performance is very likely due to the vectorized implementation of the haversine distance calculation.

If you are following this discussion with the notebook, you will see that I used two locations for testing purposes. When asked to calculate the indices of the places within 100 meters of the second location, the code is dumb enough to recalculate all 44 million distances. Could it do any better?

After calculating the distances to the first location, we acquired a vital piece of information. Not only do we know which places are near, but we also know which ones are far. If they are far enough from the first location, maybe we can avoid recalculating their distances for the second location.

Here’s the gist of the idea. Start by calculating the distances between location A and all points in the dataset. Next, calculate the distance between locations A and B, where B is the center of the second radius query. Finally, remove from the computation all points whose distance to A is at least AB + R, where R is the query radius. This is the triangle inequality principle in action.
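Here is a sketch of that idea in code, reusing the distances to A computed for the first query. The triangle inequality also gives a symmetric lower bound (points closer to A than AB − R cannot lie within R of B), so the sketch prunes on both sides; the function and variable names are illustrative.

```python
import numpy as np

def radius_query_reusing_focus(dist_a, d_ab, lat_b, lon_b, lats, lons, radius_m):
    """Radius query around B that reuses dist_a, the distances from A to all points.
    By the triangle inequality, any point P with d(B, P) <= R must satisfy
    d(A, B) - R <= d(A, P) <= d(A, B) + R; everything else is skipped."""
    candidates = np.nonzero((dist_a >= d_ab - radius_m) &
                            (dist_a <= d_ab + radius_m))[0]
    distances = haversine_distance(lat_b, lon_b, lats[candidates], lons[candidates])
    return candidates[distances <= radius_m]
```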

Triangle Inequality Queries

The implementation of the triangle inequality queries in the accompanying code extends the above concept to two foci. Instead of using just one point from which to compute the initial distances, the implemented Python code uses two. Both locations lie along the equator, ninety degrees apart. This arrangement helps narrow the candidate set even further.

The following image illustrates how these queries work.

The triangle inequality principle at work

Point A represents our query location. We want to know how many points lie within a given query radius, represented by the red circle. Points B and E are arbitrary foci. Centered on each of them, we draw two circles tangent to the red one. The resulting shape (HIJK) is an excellent approximation of the red circle. We only need to compute the distance to location A for the points inside this shape, ignoring everything else.

The algorithm starts by calculating the distances of all points relative to both foci. It then sorts each array, keeping the original order in auxiliary index arrays. A radius query is implemented by first calculating the distances AB and AE. It then searches the sorted distance arrays in the ranges BC-BD and EF-EG. The intersection of the indices from both searches is the tentative set, from which the algorithm determines the final solution with a brute-force search.
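Below is a condensed sketch of how such a query object could be structured with NumPy's searchsorted and intersect1d, reusing the haversine_distance function from above. The class name, the exact foci coordinates at (0°, 0°) and (0°, 90°), and the method names are my own illustrative choices, and lats and lons are assumed to be NumPy arrays of coordinates in decimal degrees; the repository's implementation differs in detail.

```python
import numpy as np

class TwoFocusIndex:
    """Sketch of a two-focus triangle-inequality index: foci B and E sit
    on the equator, ninety degrees of longitude apart."""

    def __init__(self, lats, lons):
        self.lats, self.lons = lats, lons
        # Distances from every point to both foci: the expensive setup step.
        self.dist_b = haversine_distance(0.0, 0.0, lats, lons)
        self.dist_e = haversine_distance(0.0, 90.0, lats, lons)
        # Sort both distance arrays, keeping the original indices on the side.
        self.order_b = np.argsort(self.dist_b)
        self.order_e = np.argsort(self.dist_e)
        self.sorted_b = self.dist_b[self.order_b]
        self.sorted_e = self.dist_e[self.order_e]

    def query_radius(self, lat, lon, radius_m):
        # Distances from the query point A to both foci (AB and AE).
        d_ab = haversine_distance(0.0, 0.0, np.array([lat]), np.array([lon]))[0]
        d_ae = haversine_distance(0.0, 90.0, np.array([lat]), np.array([lon]))[0]
        # Candidate ranges in each sorted array (the BC-BD and EF-EG segments).
        lo_b = np.searchsorted(self.sorted_b, d_ab - radius_m, side="left")
        hi_b = np.searchsorted(self.sorted_b, d_ab + radius_m, side="right")
        lo_e = np.searchsorted(self.sorted_e, d_ae - radius_m, side="left")
        hi_e = np.searchsorted(self.sorted_e, d_ae + radius_m, side="right")
        # The tentative set is the intersection of both candidate index ranges.
        candidates = np.intersect1d(self.order_b[lo_b:hi_b],
                                    self.order_e[lo_e:hi_e])
        # Brute-force check of the (much smaller) tentative set.
        distances = haversine_distance(lat, lon,
                                       self.lats[candidates],
                                       self.lons[candidates])
        return candidates[distances <= radius_m]
```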

However, we now pay a time penalty when the query object is set up. The object initialization must calculate the distances of all points in the dataset to both foci, along with the corresponding arrays of sorted indices. Query object initialization takes about 28 seconds, but querying drops to sub-second performance. For the locations I selected, the query runs in 350 milliseconds for the first one and under 70 milliseconds for the second. The first query reports 97,867 locations, while the second returns none.

If we can bear the initial time to initialize the query object, we get speedy performance during query time. It is quite an improvement over the brute-force approach, especially when querying a large number of points.

Can we do any better than this?

Tree-based Queries

For the sake of completeness, I also ran the same queries with a “ball tree” object. You can find this feature buried deep within the scikit-learn [2] package, under the “neighbors” namespace. I came across this geographic indexing scheme through the DBSCAN implementation in the same library. So it came as a shock that initializing a ball tree object with the Dublin Bus dataset took seven minutes to complete! Worse still, query performance was also unimpressive, with each radius query taking more than 960 milliseconds.
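For reference, here is a minimal sketch of the scikit-learn ball tree usage. With the haversine metric, the tree expects coordinates in radians and measures distances on the unit sphere, so the query radius must be divided by the Earth's radius. The lats and lons arrays and the sample query point are placeholders.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# Build the tree from latitude/longitude pairs converted to radians.
coords = np.radians(np.column_stack((lats, lons)))
tree = BallTree(coords, metric="haversine")  # the slow setup step

# 100-metre radius query around a sample point in central Dublin.
query = np.radians([[53.3498, -6.2603]])
indices = tree.query_radius(query, r=100.0 / EARTH_RADIUS_M)[0]
```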

Maybe this performance hit is the price to pay when using code built for general purposes. After all, the implementation I provide here based on the triangle inequality works only for geographic coordinates, while the ball tree can serve diverse types of geometries.

Please take some time to run through the demonstration notebook; it will give you a better feel for the timings.

Final Thoughts

We have gone through three different approaches to querying large geographic datasets in memory. These queries tend to be very computationally intensive, as the brute-force approach demonstrates. With some geometric insight, we can reduce the number of computations through smart pre-calculation, as long as fast query execution offsets the setup time.

The bottom line is that you don’t necessarily need fancy data structures to solve hard problems. Sometimes small insights can go a long way.

Acknowledgments

This article was revised by my dear colleagues Margarida Pedro and Sérgio Pereira, to whom I am deeply indebted. All errors are my responsibility.

References

[1] Kryszkiewicz M., Lasek P. (2010) TI-DBSCAN: Clustering with DBSCAN by Means of the Triangle Inequality. In: Szczuka M., Kryszkiewicz M., Ramanna S., Jensen R., Hu Q. (eds) Rough Sets and Current Trends in Computing. RSCTC 2010. Lecture Notes in Computer Science, vol 6086. Springer, Berlin, Heidelberg

[2] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011

[3] GitHub repository

João Paulo Figueira works as a Data Scientist for tb.lx in Lisbon, Portugal.
