An ETL pipeline for GeoJSON to BigQuery

Lak Lakshmanan
Jul 17 · 2 min read

While this tutorial is specifically for GeoJSON files for European administrative areas, I hope you will be able to use it as a guide for other sources of data. Do this in CloudShell or any machine where you have the Google Cloud SDK and Python3 installed.

To query polygon boundaries, load European administrative areas’ GeoJSON into BigQuery

First download the file you are interested in from the Eurostat website:

wget https://ec.europa.eu/eurostat/cache/GISCO/distribution/v2/nuts/download/ref-nuts-2016-01m.geojson.zip

Then, unzip it to get the GeoJSON file of interest.

mkdir tmp
cd tmp
unzip ../ref-nuts-2016-01m.geojson.zip

The GeoJSON file has a list of geometry features and some properties for each geometry. BigQuery requires new-line delimited JSON files where the geometry column is single string.

Here’s a Python program that will do the necessary conversion (note that the filename corresponds to the EPSG::4326 version — these are the ones with coordinates in longitude/latitude):

Here’s what a single line of the JSON file now looks like:

{"geometry": "{\"type\": \"LineString\", \"coordinates\": [[25.4907, 35.29833], [25.49187, 35.28897], [25.50206, 35.28633], [25.50528, 35.273], [25.49648, 35.2621], [25.49437, 35.24823], [25.49879, 35.24146], [25.51013, 35.23245], [25.51199, 35.22759], [25.49429, 35.22849], [25.47941, 35.22515], [25.47122, 35.21485], [25.46806, 35.21088], [25.4512, 35.20152], [25.43949, 35.20199], [25.42395, 35.21019], [25.41463, 35.20585], [25.41267, 35.20494], [25.41331, 35.19424], [25.41463, 35.19189], [25.41792, 35.18603], [25.41463, 35.18273], [25.39944, 35.16751], [25.40761, 35.15651], [25.41199, 35.15531], [25.40992, 35.13521], [25.41463, 35.13308], [25.42063, 35.13035], [25.42306, 35.12754], [25.42105, 35.11549], [25.4345, 35.11028], [25.44723, 35.11027], [25.46348, 35.09302], [25.47211, 35.08832], [25.50064, 35.08005], [25.5182, 35.0656], [25.50834, 35.05567], [25.49872, 35.05048], [25.49786, 35.0427], [25.50702, 35.03823], [25.52577, 35.03663], [25.52902, 35.02909], [25.53016, 35.02644], [25.54641, 35.01885], [25.54925, 35.00995], [25.54713, 34.99973], [25.55057, 34.99076]]}", "EU_FLAG": "T", "CC_FLAG": "F", "OTHR_CNTR_FLAG": "F", "LEVL_CODE": 3, "FID": 102, "EFTA_FLAG": "F", "COAS_FLAG": "F", "NUTS_BN_ID": 102}

The resulting schema printed by the program is:

geometry:GEOGRAPHY,EU_FLAG,CC_FLAG,OTHR_CNTR_FLAG,LEVL_CODE:int64,FID:int64,EFTA_FLAG,COAS_FLAG,NUTS_BN_ID:int64

Now, load the new-line separated JSON into BigQuery (change the dataset and table name appropriately):

bq load --source_format NEWLINE_DELIMITED_JSON advdata.eurostat to_load.json geometry:GEOGRAPHY,EU_FLAG,CC_FLAG,OTHR_CNTR_FLAG,LEVL_CODE:int64,FID:int64,EFTA_FLAG,COAS_FLAG,NUTS_BN_ID:int64

That’s it! Try out a query to order the polygons in terms of their distance from Frankfurt, Germany:

SELECT 
* EXCEPT(geometry)
, ST_Centroid(geometry) AS center
, ST_Distance(geometry, ST_GeogPoint(8.68, 50.11))/1000 AS dist
FROM advdata.eurostat
ORDER by dist ASC
LIMIT 5

This returns:

Enjoy!

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Lak Lakshmanan

Written by

Professional Services @ Google

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade