Reverse US Geocoding in BigQuery

You can convert GPS coordinates into cities, counties, states and even ZIP codes for free!

Edoardo Nosotti
Jun 3 · 6 min read
Photo by Denise Jans on Unsplash

Geocoding APIs are common and popular today. Many of them also offer generous free tiers. Using such APIs with big data, though, can present developers with a few challenges. In order to enrich the big data with geographical information, triggers usually need to be set for external code to be executed, invoke a geocoding API and update the records. Most APIs also enforce rate limits, so queues need to be properly implemented and it might take time for all of the data to become ready and actionable.

On the other hand, joining records in a database is usually easier to implement and faster to execute. Turns out that if your geographical scope is limited to the US and the required precision down to the zip code, you can reverse-geocode your coordinates with a JOIN. If it sounds too good to be true, I can also add that the geography data is already available to you on the Google Cloud Platform and it’s free.

Before we start

“Free as in beer”

I have run the examples used in this article in the “BigQuery sandbox”, in a GCP Project without a linked billing account. The public datasets are also free to use in “billed” projects, but query & computing time is not. In the real world, you would probably need to cross-reference a lot of records and end up accruing some costs. BigQuery has a generous free tier, but do the math and set “budget alerts” before you go to production.

Locations matter

If you want to directly JOIN data from a public dataset in your queries, your tables need to be in same region (or multi-region) as the public dataset’s tables. From the official BigQuery documentation:

BigQuery processes queries in the same location as the dataset that contains the tables you’re querying.

the note above implies that all of the joined tables need to be in same location.
At the time of writing, all of the public datasets are in the US multi-region.

Working across different locations

Should you need to store your data in a different location, you will need to copy the dataset to the desired one. This also means that you will be responsible for keeping your copy synchronized. In this article we will be using the kind of data that is not supposed to change frequently, but do take note of this. The dataset copy feature in the Google Cloud Platform is built upon the BigQuery Data Transfer Service, meaning that you can schedule the sync jobs anyway.

Let go of any performance anxiety

You will notice that I have used a plain JOIN in the SQL code to reference the datasets with the geography data. It screams “cartesian product” all over the place and it is clearly less than ideal. LEFT or RIGHT OUTER JOINs do not apply here, though. If you try to use any of them, BigQuery will complain about:

LEFT[/RIGHT] OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

that is because of the: JOIN ... ON ST_CONTAINS(...) condition used to join records where the GPS coordinates (point) of our “iconic places of interest” dataset fall within the geometry of an “area”.

It’s funny how the BigQuery documentation states that JOINs “effectively” do a “cartesian product”, but eventually they might not. I will clarify that one day, for now I will stick to the basics.

Up to speed

To make the examples easy to try, I have JOINed the existing, public datasets with an in-memory list of “iconic places of interest”, acting as a table, that is created on-the-fly by the queries themselves. So you don’t have to set up anything in GCP. Just copy-and-paste and run the queries in BigQuery.
Yes, Seafood Sam’s is iconic to me, that’s a non-negotiable.

Hands-on

Choosing the right table to match the coordinates against is very important, on a scale, for both performance and costs. The difference in granularity of the areas means a lot of difference in the amount of data stored in the respective tables. The smaller the table, the faster the matching. Which in BigQuery also means “cheaper”.
The zip codes table also stores state, county and city for each record, so it would be tempting to just put it in your queries as a one-size-fits-all, reusable solution, but it would come with some hefty performance and price tags.

Matching states

Table ID: bigquery-public-data.geo_us_boundaries.states

The table above is the source for the states’ geographies and at the time of writing it stores a total of 56 rows, it’s only ~22 MB in size. Well, states are just a handful and their boundaries pretty wide. Matching coordinates against this table is very fast, so stick to it unless you need a more precise location.

Open the SQL workspace in BigQuery and execute the following query:

The results should be a perfect match for every record:

Reverse geocoding to US states

Matching counties

Table ID: bigquery-public-data.geo_us_boundaries.counties

Changing from states to counties means taking a big performance leap already. This table stores ~3200 rows and is ~190 MB in size. The query is:

and the results are, again, a full match:

Reverse geocoding to US counties

Matching cities

Table ID: bigquery-public-data.geo_us_boundaries.urban_areas

The “urban areas” are, not surprisingly, pretty close to the counties’ numbers. This table stores ~3600 rows and is ~200MB in size. The query is:

since not all of the “iconic places” are located in a city and we are using a “cartesian product”, we only get 3 rows back this time:

Reverse geocoding to US cities

Matching zip codes

Table ID: bigquery-public-data.geo_us_boundaries.zip_codes

The zip codes table has a whopping amout of ~33000 rows and is ~1.28 GB in size. This is a lot more than the other tables, so do not match coordinates against this table unless you really need to. Well, at least you get the state, county and city “for free”, bundled with the zip codes. The query is:

and the results only match 4 places with a zip code:

Reverse geocoding to US zip codes

Bonus lesson: carefully plan your trips to Yosemite, I am not sure the quick delivery services can help you there, in case you forget anything.

Other playgrounds

The “old school” geographies

The tables used in the examples above are, to my knowledge, the latest versions offered by the public datasets. There were older versions and just in case you are interested, here are the example queries for such versions:

Please note that the old zip code table had its zipcode_geom column set to the STRING type instead of GEOGRAPHY. In order to find the matching areas, the query had to cast the zipcode_geom column to GEOGRAPHY on-the-fly using the ST_GEOGFROMTEXT() function. It took ages to even match five points, so it was clearly unusable in production unless you really wanted to hurt your CFO’s feelings.

Alternatives

I have also found this zip codes table:

fh-bigquery.gis.zipcodes

Its Description reads (literally…):

CREATE TABLE `gis.zipcodes` AS SELECT *, ST_GEOGFROMTEXT(zipcode_geom) AS geog FROM `bigquery-public-data.utility_us.zipcode_area`

So apparently they wanted to fix the zipcode_geom data type problem I mentioned above. It stores ~33000 rows like the new zip codes table, but it’s only ~256 MB in size. Queries still run pretty slow on this one, so I could not find any reason to recommend it.

Locating IPs

If you want to “reverse geocode” an IP address, you can load the free MaxMind GeoLite2 Free database into BigQuery. These geo-IP databases are frequently updated, so in order to use them in production you need to set up proper update processes. To quickly experiment with GeoLite2 and understand if it suits your needs, you can use this old version I have found in:

fh-bigquery.geocode.201806_geolite2_city_ipv4_locs

Note that it is very obsolete and outdated, so do not use it in production. Create your own table in your dataset and load the latest GeoLite2 Free first.

RockedScience

Tutorials, tips and fast news on Cloud, DevOps and Code

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store