How to inexpensively and quickly map millions of geographic coordinates to their zip codes

Brandon Odaniel
Xylem | AI and Big Data
8 min read · Oct 13, 2022

By Rahul Bhasker and Brandon Odaniel

Working on the Xylem/Sensus Data Science team, we encounter lots of interesting data analysis opportunities. Sometimes, to support analysis requests, we combine internal datasets with external third-party datasets.

We look for innovative ways to get the data we need without spending a lot of money. This is one such case.

Recently, we needed to identify zip codes for roughly 20 million geocoordinates we had in our data lake. This was needed so that we could associate zip code data from a third party for a one-time analysis.

We figured out a way to achieve this in a performant manner — for just a few dollars — using Amazon Web Services. We wanted to share this technique with the greater data science community in hopes that others could benefit.

Reverse geocoding

The first option we explored was reverse geocoding each of the 20 million geocoordinates.

Per Wikipedia (https://en.wikipedia.org/wiki/Reverse_geocoding), “Reverse geocoding is the process of converting a location as described by geographic coordinates (latitude, longitude) to a human-readable address or place name”.

API services like the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding/usage-and-billing) allow one to provide a set of geocoordinates and get back an address containing a zip code.
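
For illustration, a single reverse-geocode lookup against that API looks roughly like the sketch below (the coordinates are arbitrary and YOUR_API_KEY is a placeholder):

import requests  # pip install requests

# One reverse-geocode call: coordinates in, address components out.
resp = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"latlng": "38.8977,-77.0365", "key": "YOUR_API_KEY"},
)
results = resp.json().get("results", [])

# The zip code is one of the address_components of each result
if results:
    for comp in results[0]["address_components"]:
        if "postal_code" in comp["types"]:
            print(comp["short_name"])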

For our use case, we could have taken the 20 million geocoordinates for which we needed the zip code and fed them through the Google Geocoding API. Using that approach, however, the costs and performance would have been:

  • Assuming we could get the 500,000-per-month volume price of $0.004 USD per API call (https://developers.google.com/maps/documentation/geocoding/usage-and-billing), the cost of reverse geocoding 20 million geocoordinates would have been: 20 million geocoordinates * $0.004 USD per call = $80,000 USD.
  • At the Google Geocoding API limit of 50 requests per second, 20 million requests would take 20,000,000 / 50 = 400,000 seconds, or about 4.6 days of nonstop calls; with a single-threaded sequential process averaging roughly one request per second, it would take over 230 days. (See the back-of-envelope calculation after this list.)
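
A quick back-of-envelope calculation in Python makes those numbers concrete:

N_COORDS = 20_000_000
PRICE_PER_CALL = 0.004   # USD, at the 500,000+ monthly volume tier
MAX_RPS = 50             # Google Geocoding API request limit

print(f"API cost: ${N_COORDS * PRICE_PER_CALL:,.0f}")          # $80,000

# Best case: saturating the 50 requests/second limit
print(f"At 50 req/s: {N_COORDS / MAX_RPS / 86_400:.1f} days")  # ~4.6 days

# Single-threaded and sequential, at roughly 1 second per call
print(f"Sequential: {N_COORDS * 1.0 / 86_400:.0f} days")       # ~231 days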

Yikes!

For the purpose of a one-time analysis, it was not feasible to spend $80,000 USD, and to wait days or even months, just to get zip codes associated with each of the 20 million geographic points we wanted to align with third-party zip code-based data.

How did we match up zip codes to geocoordinates at scale for a few dollars?

Two major components helped us avoid that cost and achieve our desired result:

  • Zip Code Tabulation Areas (ZCTAs) data, made available for free by the US Census Bureau
  • Geospatial queries through the Amazon Web Services (AWS) Athena and S3 services

ZCTAs

As per the official documentation, “ZIP Code Tabulation Areas (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas,” whereas “USPS ZIP Codes are not areal features but a collection of mail delivery routes.” (Source: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html)

ZCTAs are used by the U.S. Census Bureau for calculating various summary statistics, and in most instances a ZCTA is the same as the zip code for its area. However, there are approximately 42,000 zip codes and only 32,000 ZCTAs. There are fewer ZCTAs than zip codes because ZCTAs are calculated only for populated areas. Because our geocoordinates were for entities present only in populated areas, this was not an issue for our needs.

AWS Athena Geospatial Queries

Geospatial queries are specialized types of SQL queries supported in Athena. They differ from non-spatial SQL queries in the following ways:

  • They use the following specialized geometry data types: point, line, multiline, polygon and multipolygon
  • They express relationships between geometry data types, including distance, equals, crosses, touches, overlaps and disjoint

(Source: https://docs.amazonaws.cn/en_us/athena/latest/ug/geospatial-query-what-is.html)
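
To see these pieces in action before touching real data, here is a minimal sketch (assuming boto3 is configured with credentials, and substituting an S3 output bucket you own) that asks whether the point (2, 2) falls inside a small square polygon:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Illustrative values only; 's3://your-bucket/...' is a placeholder.
qid = athena.start_query_execution(
    QueryString="SELECT ST_Within(ST_Point(2, 2), "
                "ST_Polygon('polygon ((1 1, 1 4, 4 4, 4 1))')) AS inside",
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until Athena finishes (this sketch assumes the query succeeds)
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)[
        "QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

# Row 0 is the header row; row 1 holds the value 'true'
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print(rows[1]["Data"][0]["VarCharValue"])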

Procedure for combining US Census ZCTA data and Amazon Athena to determine zip codes for millions of geocoordinates

  1. Download the ZCTA dataset from https://www2.census.gov/geo/tiger/TIGER2021/ZCTA520/
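
If you prefer to script the download, a short sketch follows. The archive name is an assumption inferred from the tl_2021_us_zcta520.shp file used in step 3; verify it against the directory listing.

import urllib.request
import zipfile

# Assumed archive name, matching the shapefile name used in step 3
URL = ("https://www2.census.gov/geo/tiger/TIGER2021/ZCTA520/"
       "tl_2021_us_zcta520.zip")

urllib.request.urlretrieve(URL, "tl_2021_us_zcta520.zip")

# Unpack the .shp/.dbf/.shx files needed in step 3
with zipfile.ZipFile("tl_2021_us_zcta520.zip") as zf:
    zf.extractall("tl_2021_us_zcta520")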

2. Create a Bucket in S3

a. Log into your AWS console

b. Search for S3 and click on S3

c. Click on Create Bucket

d. Give the bucket a name (this can be anything) and choose your preferred AWS region. You can also allow public access to the bucket; however, it is good security practice not to unless specifically required.

e. Click on Create Bucket and your Bucket will be created

f. Once created, the folder containing the data downloaded in step 1 can be uploaded to this bucket

g. Search for the bucket and go inside the bucket by clicking on the name of it

h. Click on Upload > Add Folder and choose the folder where you saved the US Census ZCTA data in step 1. Then click on the Upload button to upload the data.

i. One of the files is large (approx. 819 MB; even compressed it is ~782 MB). If the upload fails, try a faster internet connection or choose the AWS region geographically closest to you. (If it still fails, you can fall back to a multipart upload through programmatic means with the AWS CLI or the boto3 library; a sketch follows at the end of this step.)

j. Once uploaded, the folder contents will be listed in the bucket.
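
For reference, here is a rough boto3 sketch of the multipart fallback mentioned in step i. The local path and S3 key are placeholders; boto3’s upload_file switches to multipart automatically once a file crosses the configured threshold.

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart-upload files larger than ~100 MB, in 25 MB chunks
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=25 * 1024 * 1024)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="tl_2021_us_zcta520/tl_2021_us_zcta520.shp",  # placeholder path
    Bucket="zctademobucket",                               # your bucket from step 2d
    Key="tl_2021_us_zcta520/tl_2021_us_zcta520.shp",
    Config=config,
)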

3. We needed shapefile (.shp) data, but shapefiles cannot be directly consumed by Athena to create a table. To solve this problem, we used Python and the AWS SDK as outlined below.

a. Copy the tl_2021_us_zcta520.shp file (along with its companion .dbf and .shx files, which the shapefile reader needs) to a directory your Python code can access.

b. Execute the code below, which reads the .shp file and exports the result as a CSV. Note: this code was adapted from the notebook at the GitHub URL listed beneath the snippet.

import numpy as np
import pandas as pd
import shapefile  # pyshp

def centroid(points):
    '''
    Computes the centroid of a 2-dimensional area
    https://en.wikipedia.org/wiki/Centroid

    Parameters:
    points: array-like of polygon coordinates

    Returns:
    x/y-coordinates of the centroid, area of the polygon
    '''
    pts = np.array(points)
    assert pts.shape[1] == 2, "Not 2-dimensional"
    n = pts.shape[0]
    x = pts[:, 0]
    y = pts[:, 1]
    # Shoelace formula: signed area and centroid of a simple polygon
    xyxy = [x[i] * y[(i + 1) % n] - x[(i + 1) % n] * y[i] for i in range(n)]
    A = 0.5 * np.sum(xyxy)
    Cx = np.sum([(x[i] + x[(i + 1) % n]) * xyxy[i] for i in range(n)]) / (6 * A)
    Cy = np.sum([(y[i] + y[(i + 1) % n]) * xyxy[i] for i in range(n)]) / (6 * A)
    return Cx, Cy, np.abs(A)

def load_shape_file(shape_file_name):
    '''
    Loads an ESRI shapefile into a pandas DataFrame.

    Parameters:
    shape_file_name: path to the .shp file (the companion .dbf and .shx
    files must sit alongside it)

    Returns:
    Pandas DataFrame: the field `shape` holds the polygon definition of the
    shape that can be interpreted by the Athena function
    `ST_GeometryFromText()`. Other fields include the coordinates of the
    bounding boxes and centroids, and metadata from the shapefile.
    '''
    sf = shapefile.Reader(shape_file_name)
    recs = sf.records()
    shps = sf.shapes()

    shape_df = pd.DataFrame()
    for i, rec in enumerate(recs):
        rdata = {}
        # Copy every attribute field (sf.fields[0] is the DeletionFlag)
        for fld, _ftype, _flen, _fdec in sf.fields[1:]:
            rdata[fld] = [rec[fld]]
        # Bounding box corners: west, south, east, north
        bbox = shps[i].bbox
        for j, b in enumerate(['bb_west', 'bb_south', 'bb_east', 'bb_north']):
            rdata[b] = [bbox[j]]
        # Serialize the shape's points as WKT for Athena's ST_Polygon()
        rdata['shape'] = ["POLYGON(({poly}))".format(
            poly=','.join("{lo} {la}".format(lo=p[0], la=p[1])
                          for p in shps[i].points))]
        cog_long, cog_lat, _ = centroid(shps[i].points)
        rdata['cog_longitude'] = cog_long
        rdata['cog_latitude'] = cog_lat
        tmp_df = pd.DataFrame(rdata)
        shape_df = pd.concat([shape_df, tmp_df])

    shape_df.index = range(shape_df.shape[0])
    return shape_df

(Source: https://github.com/aws-samples/dynamodb-ml-prediction-amazon-athena/blob/main/notebook/GeoSpatialProcessingGISshapefilesWithAmazonAthena.ipynb)

zcta_df = load_shape_file('tl_2021_us_zcta520.shp')
print(f"Number of records: {zcta_df.shape[0]:,}")

Export the result to CSV (dropping the pandas index so the columns match the Athena schema defined in step 4):

zcta_df.to_csv('ZCTA.csv', index=False)

c. Execute a Python snippet like the one below to upload the output CSV file back to your S3 bucket.

import boto3

s3_client = boto3.resource('s3', 'us-east-1')
s3_client.meta.client.upload_file(
    Filename='ZCTA.csv',
    Bucket='zctademobucket',
    Key='Processed_Data/ZCTA.csv')

d. After uploading, the file can be found in the S3 bucket.

4. Next, set up an Athena table.

a. In the AWS Console, search for Athena and click on it.

b. Click on Create and then S3 Bucket Data

c. Provide the table name

d. Choose an existing database or create a new one. For this tutorial, we create a new one.

e. Click on Browse S3. Select the folder where data is present by clicking on Choose.

f. Choose the Data Format as CSV

g. Add Column Names and their Data Types.

Below are the column names and their data types (we use bigint for the two area columns, since land and water areas in square meters can overflow a 32-bit int):

zcta5ce20 string, geoid20 string, classfp20 string, mtfcc20 string,
funcstat20 string, aland20 bigint, awater20 bigint,
intptlat20 string, intptlon20 string,
bb_west double, bb_south double, bb_east double, bb_north double,
shape string, cog_longitude double, cog_latitude double

h. Click on Create Table.

i. Table creation is successful. (If you’d rather script this step than click through the wizard, see the sketch below.)
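
Here is a rough boto3 sketch of an equivalent DDL statement. The database name (zcta_db), the results prefix, and the region are assumptions to replace with your own; the OpenCSVSerDe is used because pandas quotes the comma-containing shape column, and skip.header.line.count skips the CSV header row.

import boto3

# Rough equivalent of the console wizard above; zcta_db and the
# athena-results prefix are hypothetical placeholders.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS zcta_db.zcta_data (
  zcta5ce20 string, geoid20 string, classfp20 string, mtfcc20 string,
  funcstat20 string, aland20 bigint, awater20 bigint,
  intptlat20 string, intptlon20 string,
  bb_west double, bb_south double, bb_east double, bb_north double,
  shape string, cog_longitude double, cog_latitude double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://zctademobucket/Processed_Data/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://zctademobucket/athena-results/"})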

5. Now it’s just a matter of running a geospatial query in Athena. The following geospatial SQL functions are used:

a. ST_Point(double, double): Returns a point geometry object; we use it to convert each longitude/latitude pair into a point.

b. ST_Polygon(varchar): Also returns a geometry object, and accepts only polygon well-known text as input, e.g. ST_Polygon(‘polygon ((1 1, 1 4, 4 4, 4 1))’). Note: ZCTA boundaries are present as polygons (the shape column) in the dataset we prepared in step 3.

c. ST_Within(geometry, geometry): Returns TRUE if and only if the left geometry is completely within the right geometry.

(Source: https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list-v2.html)

Example query to get zip codes matched to geocoordinates:

SELECT zcta_data.zcta5ce20 AS zip_code,
       geo.latitude,
       geo.longitude
FROM   zcta_data
       CROSS JOIN <table where your geocoordinates reside> geo
WHERE  ST_Within(ST_Point(geo.longitude, geo.latitude),
                 ST_Polygon(zcta_data.shape))
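
To run this and pull the matches back into pandas, one convenient option is the awswrangler library (a sketch under assumptions: geo_points stands in for the table holding your coordinates, and zcta_db for the database created in step 4):

import awswrangler as wr  # pip install awswrangler

# geo_points and zcta_db are hypothetical names; substitute your own.
SQL = """
SELECT zcta_data.zcta5ce20 AS zip_code,
       geo.latitude,
       geo.longitude
FROM   zcta_data
       CROSS JOIN geo_points geo
WHERE  ST_Within(ST_Point(geo.longitude, geo.latitude),
                 ST_Polygon(zcta_data.shape))
"""

# Runs the query in Athena and returns the results as a DataFrame
matched_df = wr.athena.read_sql_query(SQL, database="zcta_db")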

Summary

• We now have zip codes for all 20 million of our geocoordinates.

• The total cost of this solution was about $5, for a query that took about two minutes to run.

• That’s a whopping saving of more than $80,000 USD, and days to months of processing time, compared to reverse geocoding!

Happy learning!!


Brandon O’Daniel, Data Geek, AI/Big Data Director @ Xylem, MBA, Proud Dad of three awesome kids :)