Reverse US Geocoding in BigQuery
You can convert GPS coordinates into cities, counties, states and even ZIP codes for free!
Geocoding APIs are common and popular today, and many of them offer generous free tiers. Using such APIs with big data, though, can present developers with a few challenges. To enrich big data with geographical information, triggers usually need to be set up so that external code can run, call a geocoding API, and update the records. Most APIs also enforce rate limits, so queues need to be properly implemented, and it might take time for all of the data to become ready and actionable.
On the other hand, joining records in a database is usually easier to implement and faster to execute. It turns out that if your geographical scope is limited to the US and the precision you need goes no deeper than the ZIP code, you can reverse-geocode your coordinates with a JOIN. If that sounds too good to be true, I can also add that the geography data is already available to you on the Google Cloud Platform, and it's free.
Before we start
“Free as in beer”
I have run the examples used in this article in the “BigQuery sandbox”, in a GCP Project without a linked billing account. The public datasets are also free to use in “billed” projects, but query & computing time is not. In the real world, you would probably need to cross-reference a lot of records and end up accruing some costs. BigQuery has a generous free tier, but do the math and set “budget alerts” before you go to production.
If you want to directly JOIN data from a public dataset in your queries, your tables need to be in the same location (region or multi-region) as the public dataset's tables. From the official BigQuery documentation:
BigQuery processes queries in the same location as the dataset that contains the tables you’re querying.
Working across different locations
Should you need to store your data in a different location, you will need to copy the dataset to the desired one. This also means that you will be responsible for keeping your copy synchronized. In this article we will be using the kind of data that is not supposed to change frequently, but do take note of this. The dataset copy feature in the Google Cloud Platform is built upon the BigQuery Data Transfer Service, meaning that you can also schedule recurring sync jobs.
Let go of any performance anxiety
You will notice that I have used a plain JOIN in the SQL code to reference the datasets with the geography data. It screams "cartesian product" all over the place, and it is clearly less than ideal.
OUTER JOINs do not apply here, though. If you try to use any of them, BigQuery will complain about:
LEFT[/RIGHT] OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
That is because of the JOIN ... ON ST_CONTAINS(...) condition, used to join records where the GPS coordinates (point) of our "iconic places of interest" dataset fall within the geometry of an "area".
It's funny how the BigQuery documentation states that JOINs "effectively" compute a "cartesian product", but eventually they might not. I will clarify that one day; for now I will stick to the basics.
Up to speed
To make the examples easy to try, I have JOINed the existing public datasets with an in-memory list of "iconic places of interest", acting as a table, that is created on the fly by the queries themselves. So you don't have to set up anything in GCP: just copy, paste, and run the queries in BigQuery.
Yes, Seafood Sam’s is iconic to me, that’s a non-negotiable.
Choosing the right table to match the coordinates against is very important at scale, for both performance and cost. The difference in granularity of the areas translates into a big difference in the amount of data stored in the respective tables. The smaller the table, the faster the matching, which in BigQuery also means "cheaper".
The zip codes table also stores state, county and city for each record, so it would be tempting to just put it in your queries as a one-size-fits-all, reusable solution, but it would come with some hefty performance and price tags.
Table ID: bigquery-public-data.geo_us_boundaries.states
The table above is the source for the states' geographies. At the time of writing it stores a total of 56 rows and is only ~22 MB in size. Well, states are just a handful and their boundaries pretty wide. Matching coordinates against this table is very fast, so stick to it unless you need a more precise location.
Open the SQL workspace in BigQuery and execute the following query:
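A sketch of such a query follows. The place names and coordinates are illustrative picks of mine, not necessarily the article's original list, and the column names (state_geom, state_name) are taken from the public table's schema as I recall it, so do verify them against the table details:

```sql
-- In-memory list of "iconic places", created on the fly by the query itself.
WITH iconic_places AS (
  SELECT 'Statue of Liberty' AS name, ST_GEOGPOINT(-74.0445, 40.6892) AS point
  UNION ALL SELECT 'Golden Gate Bridge', ST_GEOGPOINT(-122.4783, 37.8199)
  UNION ALL SELECT 'Space Needle', ST_GEOGPOINT(-122.3493, 47.6205)
  UNION ALL SELECT 'Half Dome', ST_GEOGPOINT(-119.5332, 37.7459)
)
SELECT
  p.name,
  s.state_name
FROM iconic_places AS p
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s
  -- Match every place whose point falls within a state's boundaries.
  ON ST_CONTAINS(s.state_geom, p.point);
```

Note that ST_GEOGPOINT() takes longitude first, then latitude — the opposite of the "lat, lon" order most people are used to.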
The results should be a perfect match for every record:
Table ID: bigquery-public-data.geo_us_boundaries.counties
Moving from states to counties already means taking a big performance hit. This table stores ~3,200 rows and is ~190 MB in size. The query is:
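A sketch along the same lines, with illustrative places of my own; the county_name and county_geom column names are assumptions worth double-checking against the table's schema:

```sql
WITH iconic_places AS (
  SELECT 'Statue of Liberty' AS name, ST_GEOGPOINT(-74.0445, 40.6892) AS point
  UNION ALL SELECT 'Golden Gate Bridge', ST_GEOGPOINT(-122.4783, 37.8199)
  UNION ALL SELECT 'Space Needle', ST_GEOGPOINT(-122.3493, 47.6205)
  UNION ALL SELECT 'Half Dome', ST_GEOGPOINT(-119.5332, 37.7459)
)
SELECT
  p.name,
  c.county_name
FROM iconic_places AS p
JOIN `bigquery-public-data.geo_us_boundaries.counties` AS c
  -- Same technique, finer granularity: point-in-county matching.
  ON ST_CONTAINS(c.county_geom, p.point);
```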
and the results are, again, a full match:
Table ID: bigquery-public-data.geo_us_boundaries.urban_areas
The "urban areas" numbers are, not surprisingly, pretty close to the counties': this table stores ~3,600 rows and is ~200 MB in size. The query is:
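A sketch with the same illustrative places; here the name and urban_area_geom columns are my assumptions about the table's schema:

```sql
WITH iconic_places AS (
  SELECT 'Statue of Liberty' AS name, ST_GEOGPOINT(-74.0445, 40.6892) AS point
  UNION ALL SELECT 'Golden Gate Bridge', ST_GEOGPOINT(-122.4783, 37.8199)
  UNION ALL SELECT 'Space Needle', ST_GEOGPOINT(-122.3493, 47.6205)
  UNION ALL SELECT 'Half Dome', ST_GEOGPOINT(-119.5332, 37.7459)
)
SELECT
  p.name,
  u.name AS urban_area
FROM iconic_places AS p
JOIN `bigquery-public-data.geo_us_boundaries.urban_areas` AS u
  -- Places outside any urban area simply produce no row.
  ON ST_CONTAINS(u.urban_area_geom, p.point);
```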
Since not all of the "iconic places" are located in a city and we are using a "cartesian product", we only get 3 rows back this time:
Matching zip codes
Table ID: bigquery-public-data.geo_us_boundaries.zip_codes
The ZIP codes table has a whopping ~33,000 rows and is ~1.28 GB in size. This is a lot more than the other tables, so do not match coordinates against it unless you really need to. Well, at least you get the state, county and city "for free", bundled with the ZIP codes. The query is:
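A sketch with my illustrative places; the zip_code, city, county, state_name and zip_code_geom column names reflect the public table's schema as I remember it, so verify them before relying on this:

```sql
WITH iconic_places AS (
  SELECT 'Statue of Liberty' AS name, ST_GEOGPOINT(-74.0445, 40.6892) AS point
  UNION ALL SELECT 'Golden Gate Bridge', ST_GEOGPOINT(-122.4783, 37.8199)
  UNION ALL SELECT 'Space Needle', ST_GEOGPOINT(-122.3493, 47.6205)
  UNION ALL SELECT 'Half Dome', ST_GEOGPOINT(-119.5332, 37.7459)
)
SELECT
  p.name,
  z.zip_code,
  -- The "free" extras bundled with each ZIP code record:
  z.city,
  z.county,
  z.state_name
FROM iconic_places AS p
JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS z
  ON ST_CONTAINS(z.zip_code_geom, p.point);
```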
and the results only match 4 places with a zip code:
Bonus lesson: carefully plan your trips to Yosemite, I am not sure the quick delivery services can help you there, in case you forget anything.
The “old school” geographies
The tables used in the examples above are, to my knowledge, the latest versions offered by the public datasets. There were older versions and just in case you are interested, here are the example queries for such versions:
Please note that the old ZIP code table had its zipcode_geom column set to the STRING type instead of GEOGRAPHY. In order to find the matching areas, the query had to cast the zipcode_geom column to GEOGRAPHY on the fly using the ST_GEOGFROMTEXT() function. It took ages to even match five points, so it was clearly unusable in production unless you really wanted to hurt your CFO's feelings.
I have also found this zip codes table:
Description reads (literally…):
CREATE TABLE `gis.zipcodes` AS SELECT *, ST_GEOGFROMTEXT(zipcode_geom) AS geog FROM `bigquery-public-data.utility_us.zipcode_area`
So apparently they wanted to fix the zipcode_geom data type problem I mentioned above. It stores ~33,000 rows like the new ZIP codes table, but it's only ~256 MB in size. Queries still run pretty slow on this one, so I could not find any reason to recommend it.
If you want to “reverse geocode” an IP address, you can load the free MaxMind GeoLite2 Free database into BigQuery. These geo-IP databases are frequently updated, so in order to use them in production you need to set up proper update processes. To quickly experiment with GeoLite2 and understand if it suits your needs, you can use this old version I have found in:
Note that it is very much outdated, so do not use it in production. Create your own table in your dataset and load the latest GeoLite2 Free first.