Call PostgreSQL from BigQuery for extra GIS powers
--
Today we’ll write a script that ties together several new BigQuery features to talk to a PostgreSQL instance transparently from a BigQuery script. With Dynamic SQL we can build BigQuery SQL queries dynamically; here we’ll also build and execute PostgreSQL queries dynamically from BigQuery.
A practical motivating example is fixing invalid polygons that don’t conform to the OGC spec and cannot be loaded into BigQuery. Unfortunately, such bad data shows up in many datasets one has to work with. You can often fix it using the ST_MakeValid function in PostgreSQL, but there is no such facility yet in BigQuery GIS. Let’s call this function from BigQuery!
Update: BigQuery now has a native make_valid option for the ST_GeogFromText and ST_GeogFromGeoJson functions.
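For simple cases that native option may already be enough. A minimal illustration, using a literal self-intersecting “bowtie” polygon rather than the dataset below:

SELECT ST_GeogFromText(
  'POLYGON((0 0, 2 2, 0 2, 2 0, 0 0))',  -- self-intersecting loop
  make_valid => TRUE) AS fixed_geog;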
We’ll use the San Francisco zoning dataset. Several zone polygons there have self-intersecting loops and are not valid polygons. These “spikes” are probably artifacts of drawing tools. Let’s fix them. You could, of course, first load the data into PostgreSQL, fix it there using ST_MakeValid, then extract it and load it into BigQuery, but that is tedious. Let’s do it entirely in a BigQuery script. The idea is to find a bad shape, use a connection to PostgreSQL to invoke a query with ST_MakeValid, and update the BigQuery table.
Setup
First, we’ll need to create a managed Cloud SQL instance of PostgreSQL and enable its PostGIS extension by running (e.g. from Cloud Shell):
CREATE EXTENSION postgis;
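To confirm the extension is active, a quick check at the psql prompt is:

SELECT PostGIS_Version();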
Then create an external connection to this instance, using the ‘Add Data > External Data Source’ link in the BigQuery UI. I called mine psql here. See https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#setting_up_database_connections for details.
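To check that the connection works, you can send PostgreSQL a trivial query through EXTERNAL_QUERY. The connection ID us.psql below assumes the connection lives in the US region and is named psql, matching the script later in this post:

SELECT * FROM EXTERNAL_QUERY("us.psql", "SELECT version();");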
Upload
On the San Francisco zoning dataset page, I clicked Export and selected the CSV version. Then I uploaded it through the BigQuery UI, with automatic schema detection, into a table tmp.sfzoning. The field the_geom is uploaded as a string. If we try to run a query like
SELECT st_geogfromtext(the_geom) FROM `tmp.sfzoning`
BigQuery reports an error similar to this (details may vary depending on which invalid polygon the query encounters first):
ST_GeogFromText failed: Invalid polygon loop: Edge 2 has duplicate vertex with edge 354
There are 9 such bad polygons; we can count them using the SAFE. prefix, which makes a function return NULL instead of failing.
SELECT count(*) FROM `tmp.sfzoning`
WHERE safe.st_geogfromtext(the_geom) IS NULL
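It can also be handy to see which rows are affected; for example (objectid is the same key the script below uses):

SELECT objectid, SUBSTR(the_geom, 1, 60) AS wkt_prefix
FROM `tmp.sfzoning`
WHERE SAFE.ST_GeogFromText(the_geom) IS NULL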
Script
The script repeats the following steps as long as it can find a bad WKT string:
- Find a WKT string that cannot be ingested into BigQuery.
- Build a query string that uses the EXTERNAL_QUERY statement to run the following PostgreSQL query: SELECT ST_AsText(ST_MakeValid(ST_GeomFromText('wkt-string'))).
- Use the result of this query to UPDATE the value in the BigQuery table.
DECLARE row_id INT64;
DECLARE bad_wkt, good_wkt, query STRING;

LOOP
  -- Find one bad WKT string.
  SET (row_id, bad_wkt) = (
      SELECT AS STRUCT objectid, the_geom
      FROM `tmp.sfzoning`
      WHERE the_geom IS NOT NULL AND
            SAFE.ST_GeogFromText(the_geom) IS NULL
      LIMIT 1);

  -- If none - we are good, done!
  IF bad_wkt IS NULL THEN
    BREAK;
  END IF;

  -- Safety check, prevent SQL injection.
  IF (REGEXP_CONTAINS(bad_wkt, '[\'\"]')) THEN
    RAISE USING MESSAGE =
        CONCAT('Bad WKT at row #', CAST(row_id AS STRING));
  END IF;

  -- Build our PostgreSQL query.
  SET query = CONCAT(
      "SELECT ST_AsText(ST_MakeValid(ST_GeomFromText('",
      bad_wkt, "'))) AS wkt;");

  -- Run it in PostgreSQL to make this geography valid,
  -- store the result into the good_wkt variable.
  EXECUTE IMMEDIATE
      'SELECT wkt FROM EXTERNAL_QUERY(\"us.psql\", ?)'
      INTO good_wkt USING query;

  -- Update the value in the BigQuery table.
  UPDATE `tmp.sfzoning`
  SET the_geom = good_wkt
  WHERE objectid = row_id;
END LOOP;
This should finish in about a minute.
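Once it finishes, re-running the earlier count should return 0, and the geometry can now be parsed without the SAFE. prefix. For example, you could materialize a proper GEOGRAPHY column (the table name tmp.sfzoning_geo is just a suggestion):

CREATE OR REPLACE TABLE `tmp.sfzoning_geo` AS
SELECT * EXCEPT(the_geom), ST_GeogFromText(the_geom) AS geom
FROM `tmp.sfzoning`;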
Limitations
This is a simple script created mostly to demonstrate various features of BigQuery scripting. Don’t use it for anything close to a production workflow; there are lots of limitations here.
First, the solution discussed here does not scale. It does everything wrong from a performance point of view, processing rows one at a time instead of in batch mode:
- instead of extracting all bad WKT strings in one batch, it extracts a single WKT string and repeats until all strings are fixed.
- it calls PostgreSQL separately for each WKT string.
- it runs a separate UPDATE query for each string.
It is simple and works reasonably well for this small dataset and small number of fixes, but don’t try to run it on anything large.
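If you did need something closer to batch mode, one rough, untested sketch would be to collect all bad WKT strings with STRING_AGG, send a single query to PostgreSQL, land the result in a temporary table, and apply one UPDATE. It assumes the same us.psql connection and objectid key as above, and that all bad rows fit into a single generated query string:

DECLARE pg_query STRING;

-- Build ONE PostgreSQL query that repairs every bad WKT string at once.
SET pg_query = (
  SELECT CONCAT(
      'SELECT id, ST_AsText(ST_MakeValid(ST_GeomFromText(wkt))) AS wkt ',
      'FROM (VALUES ',
      STRING_AGG(FORMAT("(%d, '%s')", objectid, the_geom), ', '),
      ') AS t(id, wkt);')
  FROM `tmp.sfzoning`
  WHERE the_geom IS NOT NULL
    AND SAFE.ST_GeogFromText(the_geom) IS NULL
    -- same injection guard as in the loop above
    AND NOT REGEXP_CONTAINS(the_geom, '[\'\"]'));

-- One round trip to PostgreSQL, results stored in a temp table.
EXECUTE IMMEDIATE
    'CREATE TEMP TABLE fixed AS SELECT * FROM EXTERNAL_QUERY("us.psql", ?)'
    USING pg_query;

-- One UPDATE to apply all fixes.
UPDATE `tmp.sfzoning` z
SET the_geom = f.wkt
FROM fixed AS f
WHERE z.objectid = f.id;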
Second, there might be problems that prevent BigQuery from ingesting a WKT string and that cannot be fixed by ST_MakeValid. If the dataset happens to contain such a string, the script will loop indefinitely. That could be fixed by keeping track of which rows have already been attempted, but it would complicate this sample script too much, so I left it to the reader :).