How to load geographic data like shapefiles into BigQuery
Now that BigQuery has support for GIS queries, let’s look at how we could load in geographic data into BigQuery. I’ll use an example of uploading boundaries polygons corresponding to US zipcodes. Note that this is just an example — the zipcode polygons are already a public dataset in BigQuery, so you don’t have to upload the data; you can simply use it.
Every year, the Census Bureau publishes the cartographic boundaries they use for their census. There is a generalized set (50 MB) suitable for mapping and a detailed one (500 MB) suitable for GIS analysis. Let’s load them both into BigQuery.
What? Why do the zipcodes change every year? And why are there two boundary files? Well … US zipcodes are not actually polygons. Instead, they are a collection of postal routes, and so the boundaries are subject to change. Also, because they are a collection of postal routes, there are infinitely many polygons that could be drawn to fit those postal routes.
1. Setup GCE VM
Launch a Google Compute Engine (GCE) instance. Change the VM to have access to all Cloud Platform APIs:
Then, accept all the defaults and launch the instance.
Once the instance is up, ssh to the GCE instance and install zip and gdal:
sudo apt-get install gdal-bin unzip
2. Download zip files
Then, download the two files from the Census Bureau:
curl -O ftp://ftp2.census.gov/geo/tiger/TIGER2017/ZCTA5/tl_2017_us_zcta510.zip
curl -O http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip
The first zip is the comprehensive (500 MB) data and the second one has the generalized polygons (50 MB).
If the second Curl command doesn’t work (the http server seems to periodically reject http requests), you might have to download using your browser and upload to the VM (perhaps via Google Cloud Storage).
3. Expand the zip files
4. Convert to GeoJSON
The data are published as shapefiles, a “mostly” open data format for GIS software. BigQuery understands GeoJSON, an open standard. So, we need to convert the shapefiles the Census Bureau provides to GeoJSON. That’s why I had you install gdal — it comes with a useful tool called ogr2ogr:
ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from cb_2017_us_zcta510_500k" zipcode_polygon2017.csv cb_2017_us_zcta510_500k.shp
ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from tl_2017_us_zcta510" zipcode_polygon_detailed2017.csv tl_2017_us_zcta510.shp
The commands above convert the shapefiles to GeoJSON-encoded geographies within a CSV file.
Note: An earlier version of this blog post recommended creating Well-Known-Text (WKT), but now that BigQuery supports GeoJSON, it is better to use GeoJSON. The reason has to do with technical details around how planar geometries are handled.
5. Upload CSV files to GCS
Because the CSV files are rather large, it is faster/safer to do a multithreaded upload Google Cloud Storage and load into BigQuery from there:
gsutil -m cp *.csv gs://BUCKET/
6. Load data into BigQuery
Create a dataset in BigQuery:
bq mk demos
Use bq-load to load the data into BigQuery, asking BigQuery to autodetect the schema from the CSV file:
bq load --autodetect --replace demos.zipcode_polygon2017 gs://BUCKET/zipcode_polygon2017.csv
bq load --autodetect --replace demos.zipcode_polygon_detailed2017 gs://BUCKET/zipcode_polygon_detailed2017.csv
7. Add information to the tables
It’s a good idea to go to the console and add some information about the columns. Key ones are:
WKT: Polygon geometry of the zipcode
ZCTA: Zipcode tabulation areas (5-digit zipcode)
GEOID: See https://www.census.gov/geo/reference/geoidentifiers.html
ALAND: Land/Water area in square meters
8. Try a query
You will use ST_GeogFromText to parse the WKT column as a polygon and then apply methods like ST_Distance and ST_DWithin to it. This is a query that finds weather stations within 10 km of a particular zipcode:
WITH params AS (
SELECT 60626 AS zipcode,
10 AS maxdist_km
zipcode AS (
SELECT ST_GeogFromGeoJson(geom) AS polygon
FROM `cloud-training-demos.demos.zipcode_polygon2017`, params
WHERE ZCTA5CE10 = params.zipcode
stations AS (
ST_GeogPoint(longitude, latitude) AS loc,
ST_Distance(ST_GeogPoint(longitude, latitude), zipcode.polygon) AS dist_meters
WHERE ST_DWithin(ST_GeogPoint(longitude, latitude), zipcode.polygon, params.maxdist_km*1000)
SELECT * from stations
ORDER BY dist_meters ASC
9. Visualize geographic data
You can visualize the data above by using the BigQuery Geo Viz. Simply run the query above, select “loc” as the geometry column and specify a style based on other columns if desired. Here’s an example of what that looks like:
10. Using the public dataset of zipcodes
The visualization tool is quite cool. Here’s another visual, this time joining the public Census data with zipcode boundaries. The query is:
with zipcodes as (
zip_census.zipcode as zipcode,
zipcode_geom as geometry,
ST_CENTROID(ST_GeogFromText(zipcode_geom)) as centroid
`bigquery-public-data.census_bureau_usa.population_by_zip_2010` AS zip_census
join `bigquery-public-data.utility_us.zipcode_area` as zip_geom
on zip_census.zipcode = zip_geom.zipcode
(gender not like 'male' and gender not like 'female') and
minimum_age is NULL and
maximum_age is NULL
`bigquery-public-data.utility_us.zipcode_area` as zip_area
join zipcodes on zip_area.zipcode = zipcodes.zipcode
ST_DWITHIN(centroid, ST_GeogPoint(-122.3321,47.6062), 10000)
and here’s a visual:
If you looked at the table name carefully, you would have noticed that the table is actually a public dataset and not the one we just uploaded. Yup, the zipcode polygon information is already a BigQuery public dataset (but this article explains how you could upload your own geographic data into BigQuery).