How to load GeoJSON files into BigQuery GIS
An ETL pipeline for GeoJSON to BigQuery
While this tutorial is specifically about GeoJSON files for European administrative areas, I hope you will be able to use it as a guide for other sources of data. Do this in Cloud Shell or on any machine where you have the Google Cloud SDK and Python 3 installed.
First download the file you are interested in from the Eurostat website:
wget https://ec.europa.eu/eurostat/cache/GISCO/distribution/v2/nuts/download/ref-nuts-2016-01m.geojson.zip
Then, unzip it to get the GeoJSON file of interest.
mkdir tmp
cd tmp
unzip ../ref-nuts-2016-01m.geojson.zip
The GeoJSON file has a list of geometry features and some properties for each geometry. BigQuery requires new-line delimited JSON files in which the geometry column is a single string.
Here’s a Python program that will do the necessary conversion (note that the filename corresponds to the EPSG:4326 version — these are the files with coordinates in longitude/latitude):
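The original program isn’t reproduced here, but a minimal sketch of the conversion looks like this: flatten each feature’s properties into top-level columns, serialize the geometry into a single JSON string, and derive a simple bq schema string (integers map to int64, everything else defaults to STRING). The filename in the usage comment is illustrative.

```python
import json

def convert(infile, outfile='to_load.json'):
    """Flatten a GeoJSON FeatureCollection into new-line delimited JSON
    with the geometry serialized as a single string, and return a
    bq-style schema string."""
    with open(infile) as f:
        features = json.load(f)['features']

    schema = None
    with open(outfile, 'w') as out:
        for feature in features:
            # Geometry first (as one JSON string), then the feature's
            # scalar properties as top-level columns.
            row = {'geometry': json.dumps(feature['geometry'])}
            row.update(feature['properties'])
            if schema is None:
                # Derive the schema from the first row: the geometry
                # column is GEOGRAPHY, ints become int64, and columns
                # with no type annotation default to STRING.
                cols = []
                for name, value in row.items():
                    if name == 'geometry':
                        cols.append('geometry:GEOGRAPHY')
                    elif isinstance(value, int):
                        cols.append(name + ':int64')
                    else:
                        cols.append(name)
                schema = ','.join(cols)
            out.write(json.dumps(row) + '\n')
    return schema

# Example (filename is illustrative; use the EPSG:4326 file you unzipped):
# print(convert('NUTS_BN_01M_2016_4326.geojson'))
```

Printing the returned schema string gives you exactly what the bq load command below needs.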
Here’s what a single line of the JSON file now looks like:
{"geometry": "{\"type\": \"LineString\", \"coordinates\": [[25.4907, 35.29833], [25.49187, 35.28897], [25.50206, 35.28633], [25.50528, 35.273], [25.49648, 35.2621], [25.49437, 35.24823], [25.49879, 35.24146], [25.51013, 35.23245], [25.51199, 35.22759], [25.49429, 35.22849], [25.47941, 35.22515], [25.47122, 35.21485], [25.46806, 35.21088], [25.4512, 35.20152], [25.43949, 35.20199], [25.42395, 35.21019], [25.41463, 35.20585], [25.41267, 35.20494], [25.41331, 35.19424], [25.41463, 35.19189], [25.41792, 35.18603], [25.41463, 35.18273], [25.39944, 35.16751], [25.40761, 35.15651], [25.41199, 35.15531], [25.40992, 35.13521], [25.41463, 35.13308], [25.42063, 35.13035], [25.42306, 35.12754], [25.42105, 35.11549], [25.4345, 35.11028], [25.44723, 35.11027], [25.46348, 35.09302], [25.47211, 35.08832], [25.50064, 35.08005], [25.5182, 35.0656], [25.50834, 35.05567], [25.49872, 35.05048], [25.49786, 35.0427], [25.50702, 35.03823], [25.52577, 35.03663], [25.52902, 35.02909], [25.53016, 35.02644], [25.54641, 35.01885], [25.54925, 35.00995], [25.54713, 34.99973], [25.55057, 34.99076]]}", "EU_FLAG": "T", "CC_FLAG": "F", "OTHR_CNTR_FLAG": "F", "LEVL_CODE": 3, "FID": 102, "EFTA_FLAG": "F", "COAS_FLAG": "F", "NUTS_BN_ID": 102}
The resulting schema printed by the program is:
geometry:GEOGRAPHY,EU_FLAG,CC_FLAG,OTHR_CNTR_FLAG,LEVL_CODE:int64,FID:int64,EFTA_FLAG,COAS_FLAG,NUTS_BN_ID:int64
Now, load the new-line separated JSON into BigQuery (change the dataset and table name appropriately):
bq load --source_format NEWLINE_DELIMITED_JSON advdata.eurostat to_load.json geometry:GEOGRAPHY,EU_FLAG,CC_FLAG,OTHR_CNTR_FLAG,LEVL_CODE:int64,FID:int64,EFTA_FLAG,COAS_FLAG,NUTS_BN_ID:int64
That’s it! Try out a query that orders the geometries by their distance from Frankfurt, Germany:
SELECT
* EXCEPT(geometry)
, ST_Centroid(geometry) AS center
, ST_Distance(geometry, ST_GeogPoint(8.68, 50.11))/1000 AS dist
FROM advdata.eurostat
ORDER BY dist ASC
LIMIT 5
This returns the five nearest geometries, along with their centroids and distances in kilometers.
Enjoy!