When Addresses Alone Are Not Enough

Geocoding and Reverse Geocoding in Python

Amy Li
Analytics Vidhya
7 min read · Sep 13, 2019

A cartoon illustrating geocoding, from the ArcGIS website: https://developers.arcgis.com/assets/img/features/features-hero_geocoding.png

In a prior project, I attempted to predict whether a restaurant would receive an inspection grade of A from the NYC Department of Health and Mental Hygiene. One thought I had was: wouldn’t it be great to visualize the areas where restaurants have been cited for critical violations, to see if there is a pattern? The more critical violations a restaurant has, the lower its chance of receiving an A. However, the dataset used for the project did not include longitude and latitude coordinates, which are necessary for plotting locations on a map. Then I heard about geocoding, the solution to this problem.

What is geocoding?

According to the Google Maps Geocoding API’s website:

Geocoding is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map, or position the map.

Since there is a way to convert addresses into coordinates, there must be a way to work backwards and convert coordinates into addresses. This is called reverse geocoding. I will be focusing on geocoding here but the same steps can be used for reverse geocoding.
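To make the symmetry concrete, here is a minimal sketch of both directions. It assumes a geocoder client object with geocode and reverse methods that return a result carrying address, latitude, and longitude attributes, which is the shape used by the GeoPy classes introduced later in this article. The two helper names are hypothetical, and the usage lines are commented out because they would contact a live service.

```python
def address_to_coords(geolocator, address):
    """Geocoding: address string -> (latitude, longitude), or None if no match."""
    location = geolocator.geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)


def coords_to_address(geolocator, latitude, longitude):
    """Reverse geocoding: coordinates -> address string, or None if no match."""
    location = geolocator.reverse((latitude, longitude))
    if location is None:
        return None
    return location.address


# Possible usage with GeoPy (makes network requests, so it is left commented out):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="my_app_name")  # placeholder user agent
# print(address_to_coords(geolocator, "175 5th Avenue, New York, NY"))
# print(coords_to_address(geolocator, 40.7410861, -73.9896297))
```

Only the method called and the shape of the input change between the two directions; the handling of a missing result is the same.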

How is geocoding done?

Now that we understand what geocoding is, the next big question is: how is geocoding done? The answer is: geocoders!

A geocoder is a piece of software or a web service made up of a reference dataset and an algorithm that converts addresses to latitude and longitude coordinates. There are a variety of geocoders available, such as ArcGIS, the Google Maps Geocoding API, TIGER (US Census), and Nominatim (which uses data from OpenStreetMap).

But which geocoder is the best one to use? The choice depends on your data and your experience. One way to find out which geocoder matches your purpose is to work with each geocoding API individually. As a newbie, I decided instead to use a library that can access a variety of geocoders.

GeoPy — A Python Solution for Geocoding

GeoPy logo from its website: https://geopy.readthedocs.io/en/stable/_images/logo-wide.png

As described in its documentation, GeoPy is a Python package that serves as a client providing consistent access to some of the most popular geocoders available on the web. For those familiar with working with databases, this is similar to using a Python driver to access a MySQL database (e.g., MySQL Connector) or a PostgreSQL database (e.g., psycopg2).

“Geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.”

The best part is that it works well with Pandas, which is really helpful for batch geocoding: if you have a large list of addresses, geocoding them in one batch is more efficient than converting each address individually.

Working with GeoPy and Pandas

The best way to learn about a package is to just jump in. Looping back to why I started looking into geocoding, I decided to use the NYC Restaurant Inspections data from my prior project to experiment with GeoPy. For those interested in the full code of this mini-project, it is available on GitHub here.

1. To begin, install GeoPy using pip: pip install geopy

2. Next, clean the data. For batch geocoding with Pandas, this also involves isolating the addresses into their own column.

Why is this step important?

Well, a difference in the spelling of an address, or in the order of its words, may prevent a match. Each address is compared to the entries in the geocoder’s reference dataset on the backend, so an address that is not formatted the same way as an entry there may not be geocoded.
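As a rough illustration of this cleaning step, the sketch below joins separate address parts into one consistently formatted string. The build_address helper, its parameters, and the default state of "NY" are assumptions made for this example and are not taken from the original project; the idea is only to collapse stray whitespace, unify the case, and keep the parts in a fixed order.

```python
import re


def build_address(building, street, borough, zipcode, state="NY"):
    """Join address parts into one consistently formatted string."""
    parts = [building, street, borough, state, zipcode]
    cleaned = []
    for part in parts:
        if part is None:
            continue
        # Collapse runs of whitespace, trim the ends, and normalize the case
        text = re.sub(r"\s+", " ", str(part)).strip().upper()
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


print(build_address("1057", "lexington   avenue ", "Manhattan", "10021"))
```

Missing values coming from a DataFrame would need their own handling before a helper like this is applied row by row.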

3. Now that the data is clean and the addresses are isolated, it is time to run them through a geocoder. As this was my first time geocoding, I used Nominatim, the geocoder featured in the examples in GeoPy’s documentation, for this mini-project. However, other geocoders can be used instead.

  • Each geocoder uses its own reference dataset. If a geocoder does not produce any results, trying another geocoder may solve the problem. For example, when I used Nominatim to geocode the address “1057 LEXINGTON AVENUE MANHATTAN NY 10021”, the address was not found. However, when I looked it up on Google Maps, it was found.
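The idea of trying another geocoder can be sketched as a simple fallback chain. This was not part of the original project: geocode_with_fallback is a hypothetical helper, and the two stub functions stand in for real geocoders so that the example runs without contacting any service. The commented lines show how real GeoPy geocoders might be plugged in.

```python
def geocode_with_fallback(address, geocode_functions):
    """Try each (name, callable) pair in order; return (name, result) for the first hit."""
    for name, geocode in geocode_functions:
        result = geocode(address)
        if result is not None:
            return name, result
    # No geocoder produced a result
    return None, None


# Stand-in geocoders for illustration only; they do not contact any service.
def first_stub(address):
    return None  # pretends the address is missing from its reference data


def second_stub(address):
    return (40.0, -73.0)  # made-up coordinates


print(geocode_with_fallback("some address", [("first", first_stub), ("second", second_stub)]))

# Possible usage with GeoPy (makes network requests, so it is left commented out):
# from geopy.geocoders import ArcGIS, Nominatim
# chain = [
#     ("nominatim", Nominatim(user_agent="my_app_name").geocode),
#     ("arcgis", ArcGIS().geocode),
# ]
# print(geocode_with_fallback("1057 LEXINGTON AVENUE MANHATTAN NY 10021", chain))
```

Returning the name of the geocoder that succeeded makes it possible to see afterwards which reference dataset each address came from.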

Here is the general code for batch geocoding with GeoPy and Pandas:

# import modules from the geopy library
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# df is assumed to be a pandas DataFrame that is already loaded and has a cleaned ADDRESS column

# create an instance of the geocoder
geolocator = Nominatim(user_agent="inspections_locations")

# wrap the geocode method so that calls are at least 1 second apart, with no retries on errors
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, max_retries=0)

# create a new column in the dataframe for storing the location details from geocoding
df['LOCATION'] = df['ADDRESS'].apply(geocode)

# create a new column called POINT holding the coordinates as a tuple (latitude, longitude, altitude),
# or None when the address was not found
df['POINT'] = df['LOCATION'].apply(lambda loc: tuple(loc.point) if loc else None)

  • As you probably noticed in the snippet above, a RateLimiter object was used. Batch geocoding can overwhelm a geocoding service with its volume of requests, as with any API. To prevent this, it is necessary to add a delay, either after each address or after a subset of addresses is processed. For this mini-project, the data were fed to the geocoder in subsets of 100, with a delay of 1 second after each address.
  • The number of retries can also be set on the RateLimiter object through max_retries. When a request fails, for example because of a connection issue or a timeout, it can be sent again, but repeated attempts on the same address delay the overall geocoding process. Limiting the number of retries therefore reduces the overall time spent. In this case, I set the maximum number of retries to 0, so each address was attempted only once.
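To show what the delay and the retry limit do conceptually, here is a simplified sketch of a rate-limiting wrapper. It is not GeoPy's implementation: rate_limited is a hypothetical function, the sleep and clock parameters exist only to make it easy to test, and it retries on any exception, which is a simplification.

```python
import time


def rate_limited(func, min_delay_seconds=1.0, max_retries=0,
                 sleep=time.sleep, clock=time.monotonic):
    """Wrap func so calls are spaced out and failures are retried a limited number of times."""
    last_call = [None]  # time of the most recent call, shared across invocations

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            # Wait until at least min_delay_seconds have passed since the last call
            if last_call[0] is not None:
                wait = min_delay_seconds - (clock() - last_call[0])
                if wait > 0:
                    sleep(wait)
            last_call[0] = clock()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Out of retries: give up on this input and move on
                if attempt == max_retries:
                    return None

    return wrapper


# Small demonstration with a harmless function instead of a geocoder
slow_upper = rate_limited(str.upper, min_delay_seconds=0.1)
print([slow_upper(text) for text in ["a", "b"]])
```

With max_retries set to 0, a failing call is attempted once and then skipped, which keeps a long batch moving at the cost of leaving some addresses without a result.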

In the end, out of 800 unique addresses, 486 (about 61%) were converted to geographic coordinates. Not perfect, but these results were sufficient for my use.

Phew! OK, that’s the end of the geocoding step. Pretty simple, right? Even for batch geocoding, there are just 3 major steps.

Geocoding Results At Play

Now that I have the coordinates, I can create visualizations to see if certain areas have restaurants with more critical violations. For this, I used Folium, a Python package for map visualizations built on the JavaScript library Leaflet.js.
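Before plotting, the geocoding results need a little reshaping: each POINT tuple holds latitude, longitude, and altitude, while a map marker only needs the first two, and addresses that were not found have no point at all. The sketch below shows one way to do this; to_lat_lon_pairs is a hypothetical helper, the sample values are made up, and the Folium lines are commented out because they assume Folium is installed and a DataFrame named df is available.

```python
def to_lat_lon_pairs(points):
    """Keep only [latitude, longitude] from each point, skipping failed lookups."""
    pairs = []
    for point in points:
        if point is None:
            continue
        latitude, longitude = point[0], point[1]
        pairs.append([latitude, longitude])
    return pairs


# Made-up example values in the shape produced by tuple(location.point)
sample_points = [(40.77, -73.96, 0.0), None, (40.68, -73.97, 0.0)]
print(to_lat_lon_pairs(sample_points))

# Possible Folium usage (left commented out):
# import folium
# nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
# for pair in to_lat_lon_pairs(df['POINT']):
#     folium.Marker(location=pair).add_to(nyc_map)
# nyc_map.save("restaurants_map.html")
```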

The map below, which visualizes each restaurant location (in red), seems to show that most of the restaurants with critical violations are in Manhattan.

A snapshot of the map visualization (individual locations) created using Folium

Another view of this is in the map below, which clusters the restaurant locations with critical violations in each area for a cleaner look.

A snapshot of the map visualization (clustered locations) created using Folium

However, this is probably because most restaurants are located in Manhattan. Looking at the distribution of restaurant inspections by borough, the borough with the most inspections is Manhattan.

Distribution of restaurant inspections by borough in NYC

Therefore, seeing that most of the restaurants with critical violations are in Manhattan does not really produce much insight into the areas to avoid when looking for a safe place to eat. In addition, while it is true that a restaurant with more critical violations is less likely to obtain an A, there are restaurants that still obtain an A despite having critical violations, whether during the same inspection visit or at a subsequent one.
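One way to separate "more violations" from "more restaurants" is to compare rates instead of raw counts. This was not part of the original project; the sketch below uses a hypothetical helper and toy data invented purely to illustrate the calculation.

```python
def violation_rate_by_group(records):
    """Return {group: share of records flagged critical} for (group, is_critical) pairs."""
    totals = {}
    critical = {}
    for group, is_critical in records:
        totals[group] = totals.get(group, 0) + 1
        if is_critical:
            critical[group] = critical.get(group, 0) + 1
    return {group: critical.get(group, 0) / totals[group] for group in totals}


# Toy data, invented purely to illustrate the calculation
toy_records = [
    ("Manhattan", True), ("Manhattan", True), ("Manhattan", False), ("Manhattan", False),
    ("Queens", True), ("Queens", False),
]
print(violation_rate_by_group(toy_records))
```

In this toy example, Manhattan has twice as many critical records as Queens, yet both boroughs end up with the same rate, which is the kind of effect a map of raw locations cannot show.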

Despite not yielding many useful insights, this project was a great way to learn about geocoding and to see how it can be used together with map visualization. Only by actually trying it out in a simple project like this was I able to understand some of the challenges of geocoding.
