Technical Note on Calculating Many Distances for Analytics Projects

by Samuel Hillis

Published in

The Opex Analytics Blog

3 min readApr 17, 2014

Distance matrices of different varieties have applications throughout the field of analytics; though the computation varies depending on the application, they always serve the same purpose: providing a pairwise measure of difference between all data points. Some applications require geographical distance matrices; these can prove computationally intensive if the distances between geographies is on the analyst to calculate.

If latitude and longitude are not given, several simple shortcuts exist such as creating a database of distances between agglomerated areas (zip codes, cities, etc.). Intuitively, some accuracy is lost here. If the analyst knows the latitudes and longitudes of the data points, the Haversine formula is often preferred. However, at Opex Analytics, we have found that this method does not scale well for creating distance matrices as magnitude increases (note that for road distances in the US, we have found that Radical Logistics is a very good and scalable solution).

Instead, we have chosen to convert latitudes and longitudes to earth-centered, earth-fixed (ECEF) coordinates. This ECEF tool converts latitudes, longitudes, and height (if desired) into a Cartesian coordinate system, allowing the analyst to perform much quicker calculations by minimizing the use of trigonometric functions.

The graph below shows the output of a series of time trials carried out using Python’s NumPy and SciPy libraries. Distance matrices were created from 10, 100, 1,000 and 10,000 randomly generated points throughout the continental United States: the Haversine method is shown in blue and ECEF in green. When the number of data points is small, the difference in time may be acceptable; with a 100 point dataset, it is a difference of fractions of a second. However, datasets in the tens of thousands are not uncommon. Datasets containing 10,000 data points averaged over 20 minutes using the Haversine formula and just over one second using ECEF.

So why isn’t ECEF the standard? Because the Earth is round of course. If the Earth is approximately spherical, the city of Perth in Western Australia is almost exactly on the opposite side of the globe from the Opex Analytics office in Evanston, IL. The Haversine formula takes into account the curvature of the Earth (spherical law of cosines), so the distance across the surface of the sphere is used. ECEF coordinates are Cartesian in nature so the distance through the body of the sphere is used.

This realization may lead you to question the validity of all ECEF results, but consider this: in our tests of 10,000 random data points in the US, the difference in distance calculated using both methods was less than 0.5%; however, the difference in calculated distance between Perth and Evanston is nearly 30%. Clearly, ECEF coordinates should not be used to calculate distances between points on the other side of the world, but even on the scale of a large country the difference may be acceptable.

If you are in a line of business where the accuracy of measurement is the most important factor, the ECEF coordinate system would not be acceptable. If you are performing a spatial analysis of 10’s of thousands of data points that are relatively close and a small margin of error is acceptable, ECEF is definitely worth a consideration.

___________________________________________________________________

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.

Technical Note on Calculating Many Distances for Analytics Projects

by Samuel Hillis

Written by Opex Analytics