The Startup
Published in

The Startup

Call PostgreSQL from BigQuery for extra GIS powers

Today we’ll have a complex script that ties together new features of BigQuery to transparently talk from BigQuery Script to a PostgreSQL instance. With Dynamic SQL we can build BigQuery SQL queries dynamically, and here we’ll build and execute PostgreSQL queries dynamically in BigQuery.

A practical motivating example here is being able to fix invalid polygons that don’t conform to OGC spec and cannot be loaded to BigQuery. Unfortunately, such bad data can be found in various datasets one has to work with. You can often fix it using ST_MakeValid function in PostgreSQL, but there is no such facility yet in BigQuery GIS. Let’s call this function from BigQuery!

Update: BigQuery now has native make_valid options for ST_GeogFromText and ST_GeogFromGeoJson function.

We’ll use SanFrancisco zoning dataset. Several zone polygons there have self intersecting loops and are not valid polygons. These “spikes” are probably artifacts of drawing tools. Let’s fix them. Obviously you can first load data to PostgreSQL, fix it there using ST_MakeValid, extract and load to BigQuery. But it is tedious. Let’s do it fully in a BigQuery script. The idea is to find a bad shape, use connection to PostgreSQL to invoke the query with ST_MakeValid, and update the BigQuery table.


First, we’ll need to create a Cloud SQL managed instance of PostgreSQL, and enable its PostGIS extensions by running (e.g. in Cloud shell):


Then create an external connection to this instance, by using the ‘Add Data > External Data Source’ link in BigQuery UI. I called mine psql here. See


On the SanFrancisco zoning dataset page, I clicked Export and selected the CSV version. Then I uploaded it using BigQuery UI, with automatic schema detection to a table tmp.sfzoning. Field the_geom is uploaded as string type. If we try to run a query like

SELECT st_geogfromtext(the_geom) FROM `tmp.sfzoning`

BigQuery reports an error similar to (details may vary depending on specific invalid polygon query encounters first):

ST_GeogFromText failed: Invalid polygon loop: Edge 2 has duplicate vertex with edge 354

There are 9 such bad polygons, we can count them using SAFE. prefix that makes a function return NULL instead of failing.

SELECT count(*) FROM `tmp.sfzoning`
WHERE safe.st_geogfromtext(the_geom) IS NULL


The scripts repeats the following as long as it can find a bad WKT string:

  • Finds WKT that cannot be ingested in BigQuery.
  • Builds a query string, that uses EXTERNAL_QUERY statement to run the following PostgreSQL query:
    SELECT ST_AsText( ST_Make_Valid( ST_GeomFromText( ‘wkt-string’ ))).
  • Uses the result of this query to UPDATE value in the BigQuery table.
declare row_id int64;
declare bad_wkt, good_wkt, query string;
-- Find one bad WKT string
SET (row_id, bad_wkt) = (
SELECT AS STRUCT objectid, the_geom
FROM `tmp.sfzoning`
SAFE.ST_GeogFromText(the_geom) IS NULL
-- If none - we are good, done!
-- Safety check, prevent SQL injection.
IF (REGEXP_CONTAINS(bad_wkt, '[\'\"]')) THEN
CONCAT('Bad WKT at row #', CAST(row_id AS STRING));
-- Built our PostgreSQL query
SET query = CONCAT(
"SELECT ST_AsText(ST_MakeValid(ST_GeomFromText('",
bad_wkt, "'))) AS wkt;");
-- Run it in PostgreSQL to make this geography valid,
-- store the result into good_wkt variable.
'SELECT wkt FROM EXTERNAL_QUERY(\"us.psql\", ?)'
INTO good_wkt USING query;
-- Update the value in BigQuery table.
UPDATE `tmp.sfzoning`
SET the_geom = good_wkt
WHERE objectid = row_id;

This should finish in about a minute.


This is a simple script created mostly to demonstrate various features of BigQuery scripting. Don’t use it for anything near production workflow. There are lots of limitations here.

First, the solution discussed here does not scale. It performs this operation row-by-row, doing everything wrong from the performance point of view — row by row, instead of batch mode:

  • instead of extracting all bad WKT strings in a batch, it extracts a single WKT, then repeats it until all strings are fixed.
  • it calls PostgreSQL for each WKT separately.
  • it runs a separate UPDATE query for each string.

It is simple, and works reasonably well for this small dataset and small number of fixes — but don’t try to run it for anything large.

Second, there might be problems that prevent BigQuery from ingesting the WKT string that cannot be fixed by ST_MakeValid. If the dataset happens to have such a string, the script will loop indefinitely. It can be fixed by keeping track of what was fixed, but that would complicate this sample script too much so I left it to the reader :).



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Michael Entin

Michael Entin


Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.