Published in The Startup

Call PostgreSQL from BigQuery for extra GIS powers

Today we’ll write a script that ties together new features of BigQuery to talk transparently from a BigQuery script to a PostgreSQL instance. With Dynamic SQL we can build BigQuery SQL queries dynamically; here we’ll go a step further and build and execute PostgreSQL queries dynamically from BigQuery.

A practical motivating example is fixing invalid polygons that don’t conform to the OGC spec and therefore cannot be loaded into BigQuery. Unfortunately, such bad data shows up in many datasets one has to work with. You can often fix it using the ST_MakeValid function in PostgreSQL, but there is no such facility in BigQuery GIS yet. So let’s call this function from BigQuery!

We’ll use the San Francisco zoning dataset. Several zone polygons in it have self-intersecting loops and are not valid polygons; these “spikes” are probably artifacts of drawing tools. Let’s fix them. You could, of course, first load the data into PostgreSQL, fix it there using ST_MakeValid, then extract it and load it back into BigQuery — but that is tedious. Let’s do it entirely in a BigQuery script. The idea is to find a bad shape, use a connection to PostgreSQL to invoke a query with ST_MakeValid, and update the BigQuery table.

Setup

First, we’ll need to create a Cloud SQL managed instance of PostgreSQL and enable its PostGIS extension by running (e.g. in Cloud Shell):

CREATE EXTENSION postgis;

Then create an external connection to this instance using the ‘Add Data > External Data Source’ link in the BigQuery UI. I called mine psql here. See https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#setting_up_database_connections
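To confirm the connection works before scripting against it, you can run a trivial federated query. This assumes the connection is named psql and lives in the us region, as in this walkthrough:

SELECT * FROM EXTERNAL_QUERY("us.psql", "SELECT version();");

If the setup is correct, this returns the PostgreSQL version string from the Cloud SQL instance.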

Upload

On the San Francisco zoning dataset page, I clicked Export and selected the CSV version, then uploaded it through the BigQuery UI with automatic schema detection into a table tmp.sfzoning. The field the_geom is loaded as a string. If we try to run a query like

SELECT st_geogfromtext(the_geom) FROM `tmp.sfzoning`

BigQuery reports an error similar to this (the exact message varies depending on which invalid polygon the query encounters first):

ST_GeogFromText failed: Invalid polygon loop: Edge 2 has duplicate vertex with edge 354

There are 9 such bad polygons; we can count them using

SELECT count(*) FROM `tmp.sfzoning`
WHERE safe.st_geogfromtext(the_geom) IS NULL
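To see exactly which rows are affected (the dataset identifies each zone by its objectid column), a similar query lists them:

SELECT objectid
FROM `tmp.sfzoning`
WHERE the_geom IS NOT NULL
  AND SAFE.ST_GeogFromText(the_geom) IS NULL;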

Script

The script repeats the following as long as it can find a bad WKT string:

  • Finds a WKT string that cannot be ingested in BigQuery.
  • Builds a query string that uses the EXTERNAL_QUERY statement to run the following PostgreSQL query:
    SELECT ST_AsText(ST_MakeValid(ST_GeomFromText('wkt-string'))).
  • Uses the result of this query to UPDATE the value in the BigQuery table.
DECLARE row_id INT64;
DECLARE bad_wkt, good_wkt, query STRING;
LOOP
  -- Find one bad WKT string.
  SET (row_id, bad_wkt) = (
    SELECT AS STRUCT objectid, the_geom
    FROM `tmp.sfzoning`
    WHERE the_geom IS NOT NULL
      AND SAFE.ST_GeogFromText(the_geom) IS NULL
    LIMIT 1);
  -- If none - we are good, done!
  IF bad_wkt IS NULL THEN
    BREAK;
  END IF;
  -- Safety check, prevent SQL injection.
  IF REGEXP_CONTAINS(bad_wkt, '[\'"]') THEN
    RAISE USING MESSAGE =
      CONCAT('Bad WKT at row #', CAST(row_id AS STRING));
  END IF;
  -- Build our PostgreSQL query.
  SET query = CONCAT(
    "SELECT ST_AsText(ST_MakeValid(ST_GeomFromText('",
    bad_wkt, "'))) AS wkt;");
  -- Use the PostgreSQL connection to make this geography valid.
  EXECUTE IMMEDIATE
    'SELECT wkt FROM EXTERNAL_QUERY("us.psql", ?)'
    INTO good_wkt USING query;
  -- Update the value in the BigQuery table.
  UPDATE `tmp.sfzoning`
  SET the_geom = good_wkt
  WHERE objectid = row_id;
END LOOP;

This should finish in about a minute.
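Once the loop completes, re-running the earlier count confirms nothing is left to fix:

-- Expect 0 after the script has fixed all rows.
SELECT count(*) FROM `tmp.sfzoning`
WHERE SAFE.ST_GeogFromText(the_geom) IS NULL;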

Limitations

This is a simple script, created mostly to demonstrate various features of BigQuery scripting. Don’t use it for anything near a production workflow; it has lots of limitations.

First, the solution discussed here does not scale. From a performance point of view it does everything wrong, processing row by row instead of in batch mode:

  • instead of extracting all bad WKT strings in one batch, it extracts a single WKT and repeats until every string is fixed.
  • it calls PostgreSQL separately for each WKT.
  • it runs a separate UPDATE query for each string.

It is simple, and works reasonably well for this small dataset and small number of fixes — but don’t try to run it on anything large.
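A more scalable variant would push all bad rows through PostgreSQL in a single round trip. This is an untested sketch under the same assumptions as the script above (connection us.psql, columns objectid and the_geom, and the quote safety check already performed); it is still subject to query-size limits when the number of bad rows is large:

DECLARE query STRING;
-- Build ONE PostgreSQL query covering every bad WKT.
SET query = (
  SELECT CONCAT(
      "SELECT * FROM (VALUES ",
      STRING_AGG(FORMAT(
          "(%d, ST_AsText(ST_MakeValid(ST_GeomFromText('%s'))))",
          objectid, the_geom)),
      ") AS t(objectid, wkt);")
  FROM `tmp.sfzoning`
  WHERE the_geom IS NOT NULL
    AND SAFE.ST_GeogFromText(the_geom) IS NULL);
-- One federated round trip, materialized into a temp table...
EXECUTE IMMEDIATE
  'CREATE TEMP TABLE fixed AS SELECT * FROM EXTERNAL_QUERY("us.psql", ?)'
  USING query;
-- ...and one UPDATE joining all the fixes back.
UPDATE `tmp.sfzoning` z
SET the_geom = f.wkt
FROM fixed f
WHERE z.objectid = f.objectid;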

Second, a WKT string might fail to load into BigQuery for reasons that ST_MakeValid cannot fix. If the dataset happens to contain such a string, the script will loop indefinitely. This can be addressed by keeping track of what was already processed, but that would complicate this sample script too much, so I left it to the reader :).
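Short of tracking every processed row, a minimal guard is to check whether the string PostgreSQL returned is actually ingestible before updating. Something like this could go inside the loop, right after the EXECUTE IMMEDIATE (a sketch, reusing the variables from the script above):

-- If ST_MakeValid returned nothing, or a string BigQuery still
-- cannot parse, stop instead of looping forever on this row.
IF good_wkt IS NULL OR SAFE.ST_GeogFromText(good_wkt) IS NULL THEN
  RAISE USING MESSAGE =
    CONCAT('Unfixable WKT at row #', CAST(row_id AS STRING));
END IF;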

Michael Entin

Hi, I'm TL of BigQuery Geospatial project. Posting small recipes and various notes for BQ Geospatial users.
