How to load geographic data like shapefiles into BigQuery

Lak Lakshmanan
Jul 26, 2018 · 5 min read

Now that BigQuery has support for GIS queries, let's look at how to load geographic data into BigQuery. I'll use the example of uploading the boundary polygons corresponding to US zipcodes. Note that this is just an example: the zipcode polygons are already a public dataset in BigQuery, so you don't have to upload the data; you can simply use it.

Every year, the Census Bureau publishes the cartographic boundaries they use for their census. There is a generalized set (50 MB) suitable for mapping and a detailed one (500 MB) suitable for GIS analysis. Let’s load them both into BigQuery.

What? Why do the zipcodes change every year? And why are there two boundary files? Well … US zipcodes are not actually polygons. Instead, they are a collection of postal routes, and so the boundaries are subject to change. Also, because they are a collection of postal routes, there are infinitely many polygons that could be drawn to fit those postal routes.

1. Set up a GCE VM

Launch a Google Compute Engine (GCE) instance and change the VM to have access to all Cloud Platform APIs.


Then, accept all the defaults and launch the instance.

Once the instance is up, ssh to it and install unzip and GDAL:

sudo apt-get install gdal-bin unzip

2. Download zip files

Then, download the two files from the Census Bureau:

curl -O ftp://ftp2.census.gov/geo/tiger/TIGER2017/ZCTA5/tl_2017_us_zcta510.zip
curl -O http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip

The first zip holds the detailed (500 MB) data and the second one the generalized polygons (50 MB).

If the second curl command doesn't work (the server seems to periodically reject HTTP requests), you might have to download the file in your browser and upload it to the VM (perhaps via Google Cloud Storage).

3. Expand the zip files

unzip tl_2017_us_zcta510.zip
unzip cb_2017_us_zcta510_500k.zip

4. Convert to GeoJSON

The data are published as shapefiles, a “mostly” open data format for GIS software. BigQuery understands GeoJSON, an open standard. So, we need to convert the shapefiles the Census Bureau provides to GeoJSON. That’s why I had you install gdal — it comes with a useful tool called ogr2ogr:

ogr2ogr -f csv -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from cb_2017_us_zcta510_500k" zipcode_polygon2017.csv cb_2017_us_zcta510_500k.shp

The command above converts the generalized shapefile to GeoJSON-encoded geographies within a CSV file; run the analogous command on the detailed tl_2017_us_zcta510 shapefile if you want both.

Note: An earlier version of this blog post recommended creating Well-Known-Text (WKT), but now that BigQuery supports GeoJSON, it is better to use GeoJSON. The reason has to do with technical details around how planar geometries are handled.
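Before loading, it can be reassuring to sanity-check what ogr2ogr wrote into the geom column. Here is a standard-library Python sketch; the sample GeoJSON string is made up for illustration, not taken from the actual file:

```python
import json

# A GeoJSON Polygon string like those ogr2ogr writes into the geom column
# (coordinates here are invented for illustration).
cell = ('{"type": "Polygon", "coordinates": [[[-87.7, 42.0], [-87.6, 42.0], '
        '[-87.6, 42.1], [-87.7, 42.1], [-87.7, 42.0]]]}')

geom = json.loads(cell)
assert geom["type"] == "Polygon"

ring = geom["coordinates"][0]          # outer ring of the polygon
assert ring[0] == ring[-1]             # GeoJSON rings must be closed
assert all(-180 <= lon <= 180 and -90 <= lat <= 90 for lon, lat in ring)
print("valid polygon with", len(ring) - 1, "distinct vertices")
```

A closed outer ring (first vertex equal to the last) is required by the GeoJSON spec, and malformed rings are a common reason a later ST_GeogFromGeoJson call fails.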

5. Upload CSV files to GCS

Because the CSV files are rather large, it is faster and safer to do a multithreaded upload to Google Cloud Storage and load into BigQuery from there:

gsutil -m cp *.csv gs://BUCKET/

6. Load data into BigQuery

Create a dataset in BigQuery:

bq mk demos

Use bq load to load the data into BigQuery, asking BigQuery to autodetect the schema from the CSV file:

bq load --autodetect --replace demos.zipcode_polygon2017 gs://BUCKET/zipcode_polygon2017.csv

7. Add information to the tables

It’s a good idea to go to the console and add some information about the columns. Key ones are:

geom: Polygon geometry of the zipcode
ZCTA: Zipcode tabulation area (the 5-digit zipcode)
GEOID: See https://www.census.gov/geo/reference/geoidentifiers.html
ALAND: Land area in square meters (AWATER is the corresponding water area)
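Since the area columns are reported in square meters, here is a quick unit-conversion sketch (the sample value below is made up for illustration, not taken from the table):

```python
# ALAND is reported in square meters; convert to km^2 for readability.
ALAND_M2 = 8_869_000  # an invented, plausible urban-zipcode land area

def sq_m_to_sq_km(area_m2: float) -> float:
    """Convert square meters to square kilometers (1 km^2 = 1,000,000 m^2)."""
    return area_m2 / 1_000_000

print(f"{sq_m_to_sq_km(ALAND_M2):.2f} km^2")  # 8.87 km^2
```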

8. Try a query

You can use ST_GeogFromGeoJson to parse the JSON column as a polygon and then apply methods like ST_Distance and ST_DWithin to it. This is a query that finds weather stations within 10 km of a particular zipcode (in the 2017 shapefile, the zipcode column is named ZCTA5CE10):

#standardsql
WITH params AS (
  SELECT 60626 AS zipcode,
         10 AS maxdist_km
),
zipcode_poly AS (
  SELECT ST_GeogFromGeoJson(geom) AS polygon
  FROM demos.zipcode_polygon2017, params
  WHERE ZCTA5CE10 = CAST(zipcode AS STRING)
)
SELECT id, name, ST_GeogPoint(longitude, latitude) AS loc
FROM `bigquery-public-data.ghcn_d.ghcnd_stations`, zipcode_poly, params
WHERE ST_DWithin(ST_GeogPoint(longitude, latitude), polygon, maxdist_km * 1000)
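ST_DWithin(a, b, d) evaluates whether two geographies lie within d meters of each other. As a rough sanity check outside BigQuery, the point-to-point case can be approximated with the haversine formula. This sketch uses illustrative coordinates and a spherical Earth, not BigQuery's exact geodesic computation:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def dwithin(p1, p2, meters):
    """Rough point-to-point analogue of ST_DWithin."""
    return haversine_m(*p1, *p2) <= meters

# Two invented points on Chicago's north side, a few km apart.
station = (-87.63, 41.98)
zip_center = (-87.66, 41.95)
print(dwithin(station, zip_center, 10_000))  # within 10 km -> True
```

BigQuery computes distances on a proper ellipsoidal model and against the whole polygon, so treat this only as an intuition for what the 10 km predicate means.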

9. Transform the GeoJSON String to Geography Types

Instead of storing the polygons as strings and calling ST_GeogFromGeoJson each time, it is more efficient to store the data as a Geography type in the first place. We can use SQL to do this:

CREATE OR REPLACE TABLE demos.us_zipcodes AS
SELECT * EXCEPT(geom), ST_GeogFromGeoJson(geom) AS polygon
FROM demos.zipcode_polygon2017

The query above now becomes:

#standardsql
WITH params AS (
  SELECT 60626 AS zipcode,
         10 AS maxdist_km
)
SELECT id, name, ST_GeogPoint(longitude, latitude) AS loc
FROM `bigquery-public-data.ghcn_d.ghcnd_stations`, demos.us_zipcodes, params
WHERE ZCTA5CE10 = CAST(zipcode AS STRING)
  AND ST_DWithin(ST_GeogPoint(longitude, latitude), polygon, maxdist_km * 1000)

When I did it, this query finished in 6.3 seconds, compared to 16.8 seconds for the version where ST_GeogFromGeoJson was carried out within the query itself. That is nearly a 3x speedup!

10. Visualize geographic data

You can visualize the data above by using the BigQuery Geo Viz. Simply run the query above, select “loc” as the geometry column and specify a style based on other columns if desired. Here’s an example of what that looks like:

Weather stations within 10km of a Chicago zipcode

11. Using the public dataset of zipcodes

The visualization tool is quite cool. Here’s another visual, this time joining the public Census data with zipcode boundaries. The query is:

#standardsql
WITH zipcodes AS (
  SELECT
    zip_census.zipcode AS zipcode,
    population,
    zcta_geom AS geometry
  FROM
    `bigquery-public-data.census_bureau_usa.population_by_zip_2010` AS zip_census
  JOIN `bigquery-public-data.geo_us_boundaries.us_zip_codes` AS zip_geom
    ON zip_census.zipcode = zip_geom.zip_code
  WHERE
    gender IS NULL
    AND minimum_age IS NULL
    AND maximum_age IS NULL
)
SELECT
  zipcodes.*,
  state_code,
  city,
  county
FROM
  `bigquery-public-data.utility_us.zipcode_area` AS zip_area
JOIN zipcodes ON zip_area.zipcode = zipcodes.zipcode
WHERE
  ST_DWithin(geometry, ST_GeogPoint(-122.3321, 47.6062), 10000)

and here’s a visual:


If you looked at the table names carefully, you would have noticed that they are actually public datasets and not the table we just uploaded. The public dataset also stores the polygons as geography types to provide the 3x speedup. Yup, the zipcode polygon information is already a BigQuery public dataset (but this article explains how you could upload your own geographic data into BigQuery).

Enjoy!

Google Cloud - Community

Google Cloud community articles and blogs

Lak Lakshmanan

Written by

Data Analytics & AI @ Google Cloud

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.
