Efficient spatial matching in BigQuery
Use a spatial join — a SQL join with a ST_* condition
This was the question (suitably anonymized):
“I have a huge dataset of points and another huge dataset of polygons and I want to find all the points within each of the polygons. In other words, I have to loop through the polygons one-by-one and find all the points inside each one. Can BigQuery do this?”
The answer? Yes, BigQuery can do it, but you don’t have to loop through the polygons. If your experience consists primarily of functional/object-oriented programming, SQL can seem very weird (I can empathize since two years ago, I had done nothing beyond a SELECT * in SQL). This is the sort of thing that is quite efficient to perform with a spatial join.
The spatial join works for any two geometries and a wide variety of spatial functions. For example, let’s find all the zipcodes that cover an urban-area:
WITH city AS (
urban_area_geom AS geom
),zip AS (
zcta_geom AS geom
city.name, TO_JSON_STRING(ARRAY_AGG(zip_code)) AS zipcodes
GROUP BY city.name
The key bit is that we do a join of the city table to the zipcode table on the condition that the city geometry intersects the zipcode geometry.
There are other functions you can use, of course. You can check that the distance of a point from a polygon is < 100 km. Or that one geography fully contains the other geography.
Try it out in BigQuery!
Many of my blog posts come from customer questions even if I don’t always mention the conversation that prompted the post. So, if you’ve got a question on doing data analytics or ML on GCP, tweet it to me at @lak_gcp.