Calculating distance between addresses using Python Pandas and Geopy

Ethan Rosenrauch
5 min readFeb 7, 2022

--

Photo by Linh Pham on Unsplash

It is now tax season, and many small business owners are retrieving documentation for all expenses that could be listed as deductions against gross income. While one can easily calculate the year’s standard mileage deduction by observing their odometer value twice a year, some taxpayers utilize their vehicles for both work and personal transportation. Without proper record keeping, this can create an issue for a “by the book” taxpayer as they will not have certainty regarding the extent of the year’s business travel. To resolve such an issue, I will be showing how to utilize Python to convert raw unstandardized client addresses into coordinates, and compute the total miles traveled for the year. In this example, the taxpayer is self-employed in the trades, and travels on business service calls in their personal vehicle. They have kept track of every location that they traveled to for work using events in a Google calendar. While your method of obtaining location data may differ, the calendar data utilized in this example was easily scraped and placed in an excel file.

To begin, we will import our excel file with the year’s calendar information and display the format of our starting point:

import pandas as pdcalendar_df= pd.read_excel("calendar_2021.xlsx")
Google Calendar Data

As we can observe, there are rows with NaN values, as well as columns that will be unnecessary for our analysis. Let’s clean the data by dropping rows that have NaN locations, and only keep the description and location columns. Furthermore, we will remove potential whitespace in the location column that may affect our triangulation using the strip() function:

#Only keep the 2 columns we need and save to a new data frame 
addresses_df= calendar_df[["description","location"]]
#drop rows without proper location values
addresses_df.dropna(axis="rows", subset=["location"], inplace=True)
#strip the location column values to remove whitespace
addresses_df.location= addresses_df.location.str.strip()
#Reset the index
addresses_df.reset_index(inplace=True)
addresses_df.head()

Our cleaned data appears as follows and contains over 900 rows of documented trips:

addresses_df dataframe after pre-processing

To obtain the coordinates of each location, we will be using the Geopy library. The following code imports the library and the functions we need for our analysis. As stated in its documentation, Geopy is a “Python client for several popular geocoding web services.” Nomitatim is the free geolocation service that we will use to return standardized location geocodes and their respective coordinates. Since there are nearly one thousand rows of data to process, we will also use Geopy’s rate limiter feature to delay the processes and prevent our requests from being denied by the service.

import geopyfrom geopy.geocoders import Nominatimfrom geopy.extra.rate_limiter import RateLimitergeolocator = Nominatim(user_agent="Your_Name")
#Taken from documentation: Retrieves the standardized geocode
geocode_limited= RateLimiter(geolocator.geocode, min_delay_seconds=1)

To prevent an overload of Nominatim, we will also split our 900 rows into 9 separate data frames. We will then encapsulate each of the data frames in a list (outer_container) so that we can perform a list comprehension:

first=addresses_df[:100]second=addresses_df[100:200]third=addresses_df[200:300]fourth=addresses_df[300:400]#and on...outer_container=[first,second,third,fourth,fifth,sixth,seventh,eigth,ninth]

Now that we have smaller data frames with location data, let’s continue preprocessing for use with geolocator. Our location columns are not formatted uniformly so our best bet is to slice out important parts of the address. To do this, we will start by splitting every row in the location column using the str.split() function. The split function also creates entirely new data frames when its expand argument is set to true. We will again encapsulate every one of the split data frames into a new list named “split_dfs”:

split_dfs=[i["location"].str.split(",", expand=True) for i in outer_container][first_group, second_group, third_group, fourth_group, fifth_group, sixth_group, seventh_group, eight_group, ninth_group]= split_dfs

Each group now consists of columns with each part of the location before a comma. We will now generate new home and destination columns consisting of parts of the locations that can be processed by geolocator:

for i in split_dfs:
i["usable_address"]=i[1]+","+ i[2]
i["Home"]="Willowbrook,Staten Island,NY"
first_group with usable addresses

Our addresses are now sufficiently formatted to retrieve standardized geocodes and coordinates. The following code took over an hour to run due to the Rate Limiter and sheer amount of data. However, it successfully generated the information that we need for the distance computation:

#Get the latitude and longitude for our home address
for i in split_dfs:
i["Home Coordinates"]= list(geolocator.geocode(i["Home"]).point)
#Get the geocode for every destination address using the rate limiter
for i in split_dfs:
i["destination_geo"]=i["usable_address"].apply(geocode_limited)
#Place the coordinates in a list and slice out only latitude and longitudefor i in split_dfs:
i["Destination Coordinates"]=i["destination_geo"].apply(lambda x: list(x.point) if x else None)
for i in split_dfs:
i["Destination Coordinates"]=i["Destination Coordinates"].transform(lambda x: x[:2] if isinstance(x,list) else x)
i["Home Coordinates"]=i["Home Coordinates"].transform(lambda x: x[:2] if isinstance(x,list) else x)

Each of the data frames now contain the coordinate data that we need:

second_group after generating columns with coordinates

To compute distance traveled, we will write a function to return the Geopy distance in miles between the home and destination coordinates. We will then apply that function to every group and combine the groups back into one full data frame using pd.concat():

#Return the mile distance between the two points
def distance_func(df):
return geopy.distance.distance(df["Destination Coordinates"],df["Home Coordinates"]).miles
#apply the function to every row
for i in split_dfs:
i["Distance"]=i.apply(distance_func, axis=1)

final_df= pd.concat([i for i in split_dfs])
Data Frame with computed distance column

To find the total distance traveled this year, we can sum the distance column:

np.sum(final_df["Distance"])
#19400.83987570984

The total miles traveled came out to over nineteen thousand, which could result in quite a handsome deduction for the taxpayer. Furthermore to map all of the locations, we can utilize the folium module. As evident, the business has quite a far reach with clients around NY/NJ:

import foliumwork_map= folium.Map(location=(final_df["Home Latitude"][0],final_df["Home Longitude"][0]),zoom_start=10)
for i in final_df["Destination Coordinates"]:
folium.Marker(location=(i[0],i[1]), icon=folium.Icon(color="green")).add_to(work_map)
display(work_map)
The taxpayer serviced the entire tri-state area

While the mileage figure calculated is not exact due to the fact that our initial data was not uniform, it is quite close to the actual number and provides useful business intelligence. The taxpayer now has sufficient evidence of travel expense in the event an audit is initiated by the IRS.

The purpose of this article is to illustrate how Python and its various modules can be utilized to simplify what could have been tedious work in Excel. Python can be used for many purposes, and its libraries are especially valuable when used in accounting and finance settings that require data manipulation and extensive analysis.

Thanks for reading,

Ethan R

--

--