How to load geographic data like shapefiles into BigQuery

Lak Lakshmanan
Google Cloud - Community
5 min read · Jul 26, 2018

Now that BigQuery has support for GIS queries, let’s look at how to load geographic data into BigQuery. I’ll use the example of uploading the boundary polygons corresponding to US zipcodes. Note that this is just an example: the zipcode polygons are already a public dataset in BigQuery, so you don’t have to upload the data; you can simply use it.

Every year, the Census Bureau publishes the cartographic boundaries they use for their census. There is a generalized set (50 MB) suitable for mapping and a detailed one (500 MB) suitable for GIS analysis. Let’s load them both into BigQuery.

What? Why do the zipcodes change every year? And why are there two boundary files? Well … US zipcodes are not actually polygons. Instead, they are a collection of postal routes, and so the boundaries are subject to change. Also, because they are a collection of postal routes, there are infinitely many polygons that could be drawn to fit those postal routes.

1. Set up a GCE VM

Launch a Google Compute Engine (GCE) instance. Change the VM to have access to all Cloud Platform APIs:

Then, accept all the defaults and launch the instance.
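
If you’d rather do this step from the command line, a gcloud invocation along these lines should work (the instance name, zone, and machine type below are placeholders I picked, not anything required by the workflow):

# create a small VM with access to all Cloud Platform APIs
gcloud compute instances create shapefile-loader \
  --zone=us-central1-b \
  --machine-type=n1-standard-1 \
  --scopes=cloud-platform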

Once the instance is up, ssh into it and install unzip and gdal:

sudo apt-get update && sudo apt-get install -y gdal-bin unzip

2. Download zip files

Then, download the two files from the Census Bureau:

curl -O ftp://ftp2.census.gov/geo/tiger/TIGER2017/ZCTA5/tl_2017_us_zcta510.zip
curl -O http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip

The first zip is the comprehensive (500 MB) data and the second one has the generalized polygons (50 MB).

If the second curl command doesn’t work (the HTTP server seems to periodically reject requests), you might have to download the file in your browser and upload it to the VM (perhaps via Google Cloud Storage).
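
For example, assuming you have a Cloud Storage bucket handy (BUCKET below is a placeholder), you could stage the browser-downloaded file like this:

# on your laptop, after downloading the zip in the browser
gsutil cp cb_2017_us_zcta510_500k.zip gs://BUCKET/
# then, on the GCE VM
gsutil cp gs://BUCKET/cb_2017_us_zcta510_500k.zip .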

3. Expand the zip files

unzip tl_2017_us_zcta510.zip
unzip cb_2017_us_zcta510_500k.zip

4. Convert to GeoJSON

The data are published as shapefiles, a “mostly” open data format for GIS software. BigQuery understands GeoJSON, an open standard. So, we need to convert the shapefiles the Census Bureau provides to GeoJSON. That’s why I had you install gdal — it comes with a useful tool called ogr2ogr:

ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from cb_2017_us_zcta510_500k" zipcode_polygon2017.csv cb_2017_us_zcta510_500k.shp
ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from tl_2017_us_zcta510" zipcode_polygon_detailed2017.csv tl_2017_us_zcta510.shp

The commands above convert the shapefiles to GeoJSON-encoded geographies within a CSV file.
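
If you want to peek at the attribute columns before converting, gdal also ships with ogrinfo; the -so flag prints a summary of the layer instead of dumping every feature:

# summary of the layer in the generalized shapefile
ogrinfo -so cb_2017_us_zcta510_500k.shp cb_2017_us_zcta510_500k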

Note: An earlier version of this blog post recommended creating Well-Known-Text (WKT), but now that BigQuery supports GeoJSON, it is better to use GeoJSON. The reason has to do with technical details around how planar geometries are handled.

5. Upload CSV files to GCS

Because the CSV files are rather large, it is faster/safer to do a multithreaded upload to Google Cloud Storage and load into BigQuery from there:

gsutil -m cp *.csv gs://BUCKET/
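
This assumes the bucket already exists; if it doesn’t, create one first (the bucket name and location are up to you):

gsutil mb -l US gs://BUCKET/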

6. Load data into BigQuery

Create a dataset in BigQuery:

bq mk demos

Use bq load to load the data into BigQuery, asking BigQuery to autodetect the schema from the CSV files:

bq load --autodetect --replace demos.zipcode_polygon2017 gs://BUCKET/zipcode_polygon2017.csv
bq load --autodetect --replace demos.zipcode_polygon_detailed2017 gs://BUCKET/zipcode_polygon_detailed2017.csv
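
If you want to double-check the loads and see what schema was autodetected, bq show and a quick count query will do it:

bq show --schema --format=prettyjson demos.zipcode_polygon2017
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS num_zipcodes FROM demos.zipcode_polygon2017'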

7. Add information to the tables

It’s a good idea to go to the console and add descriptions for the columns (a scripted alternative is sketched after this list). Key ones are:

geom: Polygon geometry of the zipcode
ZCTA5CE10: Zipcode tabulation area (the 5-digit zipcode)
GEOID: See https://www.census.gov/geo/reference/geoidentifiers.html
ALAND: Land area in square meters (AWATER is the corresponding water area)
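
If you prefer to script this instead of clicking through the console, BigQuery DDL can also set column descriptions. A minimal sketch for one column (the description text is just a suggestion):

bq query --use_legacy_sql=false \
  "ALTER TABLE demos.zipcode_polygon2017 ALTER COLUMN geom SET OPTIONS (description = 'Polygon geometry of the zipcode, as GeoJSON')"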

8. Try a query

You can use ST_GeogFromGeoJson to parse the JSON column as a polygon and then apply methods like ST_Distance and ST_DWithin to it. This is a query that finds weather stations within 10 km of a particular zipcode:

#standardsql
WITH params AS (
  SELECT 60626 AS zipcode,
         10 AS maxdist_km
),
zipcode AS (
  SELECT ST_GeogFromGeoJson(geom) AS polygon
  FROM demos.zipcode_polygon2017, params
  WHERE ZCTA5CE10 = params.zipcode
),
stations AS (
  SELECT
    id,
    name,
    ST_GeogPoint(longitude, latitude) AS loc,
    ST_Distance(ST_GeogPoint(longitude, latitude), zipcode.polygon) AS dist_meters
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_stations`,
    params,
    zipcode
  WHERE ST_DWithin(ST_GeogPoint(longitude, latitude), zipcode.polygon, params.maxdist_km*1000)
)
SELECT * FROM stations
ORDER BY dist_meters ASC
LIMIT 100
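
You can run this in the BigQuery console, or from the VM by saving it to a file (the filename below is arbitrary) and handing it to bq query:

bq query --use_legacy_sql=false "$(cat stations_near_zipcode.sql)"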

9. Transform the GeoJSON String to Geography Types

Instead of storing the polygons as strings and doing ST_GeogFromGeoJson each time, it is more efficient to store the data as a Geography type in the first place. We can use SQL to do this:

CREATE OR REPLACE TABLE demos.us_zipcodes AS
SELECT * EXCEPT(geom), ST_GeogFromGeoJson(geom) AS polygon
FROM demos.zipcode_polygon2017

The query above now becomes:

#standardsql
WITH params AS (
  SELECT 60626 AS zipcode,
         10 AS maxdist_km
),
zipcode AS (
  SELECT polygon
  FROM demos.us_zipcodes, params
  WHERE ZCTA5CE10 = params.zipcode
),
stations AS (
  SELECT
    id,
    name,
    ST_GeogPoint(longitude, latitude) AS loc,
    ST_Distance(ST_GeogPoint(longitude, latitude), zipcode.polygon) AS dist_meters
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_stations`,
    params,
    zipcode
  WHERE ST_DWithin(ST_GeogPoint(longitude, latitude), zipcode.polygon, params.maxdist_km*1000)
)
SELECT * FROM stations
ORDER BY dist_meters ASC
LIMIT 100

When I did it, this query finished in 6.3 seconds (compared to 16.8 seconds for the version where ST_GeogFromGeoJson was carried out within the query itself). That is a nearly 3x speedup!
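
If you also want the detailed polygons as a geography type, the same pattern works; the target table name below (us_zipcodes_detailed) is just my choice:

bq query --use_legacy_sql=false \
  "CREATE OR REPLACE TABLE demos.us_zipcodes_detailed AS SELECT * EXCEPT(geom), ST_GeogFromGeoJson(geom) AS polygon FROM demos.zipcode_polygon_detailed2017"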

10. Visualize geographic data

You can visualize the data above by using the BigQuery Geo Viz. Simply run the query above, select “loc” as the geometry column and specify a style based on other columns if desired. Here’s an example of what that looks like:

Weather stations within 10km of a Chicago zipcode

11. Using the public dataset of zipcodes

The visualization tool is quite cool. Here’s another visual, this time joining the public Census data with zipcode boundaries. The query is:

#standardsql
WITH zipcodes AS (
  SELECT
    zip_census.zipcode AS zipcode,
    population,
    zcta_geom AS geometry
  FROM
    `bigquery-public-data.census_bureau_usa.population_by_zip_2010` AS zip_census
  JOIN `bigquery-public-data.geo_us_boundaries.us_zip_codes` AS zip_geom
  ON zip_census.zipcode = zip_geom.zip_code
  WHERE
    gender IS NULL AND
    minimum_age IS NULL AND
    maximum_age IS NULL
)
SELECT
  zipcodes.*,
  state_code,
  city,
  county
FROM
  `bigquery-public-data.utility_us.zipcode_area` AS zip_area
JOIN zipcodes ON zip_area.zipcode = zipcodes.zipcode
WHERE
  ST_DWITHIN(geometry, ST_GeogPoint(-122.3321,47.6062), 10000)

and here’s a visual:

If you looked at the table names carefully, you would have noticed that they are actually public datasets and not the tables we just uploaded. Also, the public dataset stores the polygons as geography types, which is what provides the 3x speedup. Yup, the zipcode polygon information is already a BigQuery public dataset (but this article explains how you could upload your own geographic data into BigQuery).

Enjoy!
