Blueprints to BigQuery: A deep dive into large-scale spatial joins for building footprints
This blog post is a collaborative effort by Shammah & Darell van der Voort
At VIDA, handling millions of building footprints is part of our daily operations. There have been exciting developments over recent years with many building footprint datasets being published by various actors. Maxar, Google and Microsoft have all published continent-scale building footprints which have been transformative for many of our customers. These datasets provide comprehensive geographical data about the location of all buildings across entire nations - crucial information for many analytical tasks.
Recently, Microsoft released an updated building footprint dataset that covers large parts of the world. Each of these datasets differs in coverage and accuracy and is based on different imagery dates. From our experience working across Africa and Asia, we've observed variations in data quality and significant discrepancies in coverage. To address this, we found it necessary to combine multiple building footprint datasets to create a more complete dataset for downstream analyses.
The specific challenge we faced was to combine the Google and Microsoft building footprints datasets. Our aim was to retain only those Microsoft footprints that didn't intersect with any Google footprint. Although this is straightforward for small datasets, it becomes computationally intensive at scale, especially for global datasets with many millions of rows. We used BigQuery's distributed system to tackle this problem, and along the way we learned some interesting quirks and features of BigQuery that we want to share.
Leveraging Spatial Clustering for Efficient Data Processing
To improve data processing efficiency, we started with spatial clustering, which we applied to both building footprint tables. Clustering organises a table's storage around the values of the clustering column - in our case, the footprint geometries.
Clustering tables on a geography column in BigQuery significantly reduces the amount of data read to serve a query. This makes queries not only cheaper but also faster when filtering on the clustering column.
The inner workings of BigQuery's spatial clustering implementation are not publicly available. Hence, making direct comparisons with PostGIS and GeoPandas spatial indices is challenging. Nonetheless, we observed that it seemingly improves performance.
Another noteworthy aspect of BigQuery is that, unlike PostGIS, it doesn't offer a choice of clustering algorithm, and it doesn't support applying a spatial cluster in place. Instead, a new table needs to be created. Despite this additional step, the process is simple:
CREATE OR REPLACE TABLE {dataset}.{clustered_table}
CLUSTER BY geometry AS
SELECT *
FROM {dataset}.{table}
Applying the clustering to a dataset containing 125 million building footprints took just 19 seconds of elapsed time, although it consumed about an hour of slot time and shuffled 73 GB of data.
To showcase the performance improvement offered by spatial clustering, we ran the same query on both the clustered and unclustered versions of the building footprints table containing 125 million buildings:
SELECT
*
FROM
{dataset}.{building_footprints_table}
WHERE
ST_WITHIN(
geometry,
ST_GEOGFROMTEXT(
'POLYGON ((22.329426238516334 -5.77210375914855, 22.329426238516334 -6.127402332964067, 22.72180948050388 -6.127402332964067, 22.72180948050388 -5.77210375914855, 22.329426238516334 -5.77210375914855))'
)
)
The results were clear. The elapsed time dropped from 7 seconds on the unclustered table to 3 seconds on the clustered one, making the query more than twice as fast.
The improvement in slot time consumed was even bigger: 1 minute and 30 seconds without clustering versus just 7 seconds with it - roughly 13 times faster, a saving of 1 minute and 23 seconds.
The number of records read also fell sharply, from 31,830,668 without clustering to just 2,139,786 with it - about 15 times fewer, a decrease of over 93%.
Despite these big reductions in time and records read, the query still returned the same result of roughly 158K buildings. This shows how powerful and efficient spatial clustering is: it lets us handle big datasets more efficiently, saving both time and computing power.
Spatial Partitioning at Scale
When trying to improve query performance on large tables, it is common to leverage partitioning. Our next step in the quest for an efficient divide-and-conquer approach was therefore to:
- Split up the building footprints spatially
- Assign an integer identifier to each subset
- Partition the table on this identifier column
BigQuery offers a native partitioning feature that can help us tackle this challenge. This technique subdivides a table into segments, called partitions, which can be queried independently. The beauty of it is that when a query is executed, only the relevant partitions are scanned, drastically reducing the amount of data scanned and potentially accelerating query execution times.
However, BigQuery partitioning is not without its limitations. It requires us to partition the data based on either a date or a range of integers.
The questions:
- What spatial boundaries would give the best results?
- How do you spatially partition a large footprint dataset effectively, such that each partition stays within an ideal footprint count threshold, whilst also limiting the overhead incurred from having too many partitions?
Through experimentation with much smaller subsets of the data, we ascertained that a reasonable threshold for the footprint count falls within the range of approximately 100,000 to 150,000 building footprints.
The Grid-based Approach
We began with a simple approach using a homogeneous grid (one with cells of equal size) which covers the entire boundary of the AOI (Area Of Interest). With BigQuery being our primary tool in this implementation, a native function to generate a grid would have been ideal. At the time of writing this article, the only grid-like implementation we could find was ST_SNAPTOGRID. This didn't quite give the results we wanted.
SELECT
ST_SNAPTOGRID(geometry, 0.01) as geometry
FROM `area_of_interest_table`
Visualised query output using GeoViz:
SELECT
ST_SNAPTOGRID(geometry, 0.8) as geometry
FROM `area_of_interest_table`
Result:
This is because ST_SNAPTOGRID does not actually create a grid partition of the AOI, but only rounds the coordinates of the AOI geometry to a grid.
In essence, BigQuery does not have a native grid function like ST_SquareGrid provided by PostGIS. Hence, to implement a grid approach, an alternate solution was needed.
We resorted to a rather hacky approach: computing each grid cell geometry starting from the bottom-left origin of the bounding box of the AOI. This generates an adequate grid, and quickly. We then used this grid to implement the three steps mentioned earlier; a rough sketch of the grid generation is shown below, followed by what the result looks like.
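A minimal sketch of how such a grid can be generated natively in BigQuery from a bounding box (the bounding-box coordinates and the 0.1° cell size are hypothetical placeholders, not our exact query):
WITH params AS (
  SELECT
    0.1 AS cell_size,            -- hypothetical cell size in degrees
    11.6 AS xmin, -18.1 AS ymin, -- hypothetical AOI bounding box
    24.1 AS xmax, -4.3 AS ymax
)
SELECT
  ROW_NUMBER() OVER () AS cell_id,
  -- build each square cell from its bottom-left corner (x, y)
  ST_MAKEPOLYGON(ST_MAKELINE([
    ST_GEOGPOINT(x, y),
    ST_GEOGPOINT(x + cell_size, y),
    ST_GEOGPOINT(x + cell_size, y + cell_size),
    ST_GEOGPOINT(x, y + cell_size),
    ST_GEOGPOINT(x, y)
  ])) AS cell_geometry
FROM params,
  UNNEST(GENERATE_ARRAY(xmin, xmax, cell_size)) AS x,
  UNNEST(GENERATE_ARRAY(ymin, ymax, cell_size)) AS y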
From the spatial distribution of the building footprints depicted in the visual above, it is apparent that employing a uniform grid may not be optimal for this use case. This is due to the significant imbalance in the contents of the grid cells, potentially resulting in empty cells containing no data. This tendency to have heavily co-located features (datapoints clustered in certain regions) is rather typical of geospatial data!
Furthermore, we found that in order to adhere to our threshold of 100-150k footprints per partition, we would require tens of thousands of grid cells, and consequently just as many partitions.
You guessed it. This doesn’t work, as BigQuery enforces a strict limit of 4000 partitions per table. Yet another bummer!
As a result, we sought a more appropriate criterion for spatial partitioning capable of generating subsets with significantly greater content balance across all partitions.
💡 One great alternative to the homogeneous grid that resolves grid cell content imbalance is based on the Quad Tree data structure. It involves recursively decomposing the AOI (beginning with its bounding box) into four cells, until the data within every generated cell falls below the specified cell content threshold. This way, grid cell size is directly determined by cell content. This approach is neat and should yield highly balanced partitioning results. But, it is much more involved and should only be used where a finely partitioned table is required. This was not the case for us.
A Better Way: Administrative Boundaries
The AOI in our scenario is administrative, hence, partitioning the building footprints by a lower admin level seemed, in theory, like an ideal approach to explore for the following reasons:
- Admin boundaries at least two levels deep are readily available (e.g. GADM, CGAZ, etc.)
- The geometries are such that the smaller admin boundaries are in regions with high footprint density, with the more sparse regions covered by much larger admin boundaries
- The issue of having cells with no data is nonexistent
and so on.
This turned out to work quite nicely at admin level 2, with the results abiding by our partition and building footprint count restrictions. In most of the cases, the number of partitions was drastically reduced from the tens of thousands to hundreds!
This strategy does not completely get rid of content imbalance within the partitions. However, it inherently mimics a partition sizing based on the number of features within each partition, thereby eliminating the need for a more involved approach such as the Quad Tree.
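For completeness, here is a rough sketch of how each footprint could be tagged with its admin-level-2 identifier and how the per-partition counts could then be checked against our threshold. The table and column names are placeholders rather than our production schema:
-- Tag each footprint with the admin-level-2 unit it falls in; using the
-- centroid means a building straddling a boundary is assigned exactly once.
CREATE OR REPLACE TABLE {dataset}.microsoft_buildings_tagged AS
SELECT
  b.*,
  a.adm_id -- integer identifier of the admin-level-2 boundary
FROM {dataset}.microsoft_buildings AS b
JOIN {dataset}.admin_level_2_boundaries AS a
  ON ST_WITHIN(ST_CENTROID(b.geometry), a.geometry);

-- Check how many footprints end up in each candidate partition,
-- to verify the ~100-150k threshold.
SELECT adm_id, COUNT(*) AS footprint_count
FROM {dataset}.microsoft_buildings_tagged
GROUP BY adm_id
ORDER BY footprint_count DESC;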
Optimising Spatial Joins in BigQuery: Merging Datasets Efficiently
Now, let's focus on merging the datasets. We'll take an easy-to-follow approach, using a single administrative boundary for testing and applying a straightforward join and intersect condition. This effectively highlights the performance disparities and sets the foundation for our final solution.
The Simple Approach: Brute Force
A straightforward method to merge our datasets based on intersection can be achieved with a simple ST_INTERSECTS spatial join like so:
SELECT *
FROM google_buildings AS g
INNER JOIN microsoft_buildings AS m ON ST_INTERSECTS(g.geometry, m.geometry)
This query performs a spatial join on both tables without any additional filtering or pre-processing. It seems simple and intuitive, right? However, simplicity comes at a cost. Because both the google_buildings and microsoft_buildings tables contain millions of rows, this spatial join is resource-intensive and slow. It checks every pair of buildings from Google and Microsoft for intersections - resulting in up to a trillion combinations for a million rows in each table. This lack of optimisation could lead to extremely long query times and high costs.
In the BigQuery environment, running such an expensive query would consume vast amounts of computational resources, reflected in the slot time consumed. Without taking advantage of best practices like filtering or partitioning, we would end up running very resource-heavy queries.
The Filtered Approach: Adding Non-Spatial Filters
An immediate thought for optimisation might be to use additional non-spatial filters, in this case the administrative boundaries (adm_id). By filtering based on adm_id, we could reduce the size of the dataset and theoretically improve the speed of our query. So we tried that:
SELECT *
FROM google_buildings AS g
INNER JOIN microsoft_buildings AS m ON g.adm_id = m.adm_id
WHERE ST_INTERSECTS(g.geometry, m.geometry)
AND g.adm_id = @selected_adm_id
In this query, we've added an adm_id filter. However, we still observed a high compute time: the query for a single administrative boundary completed in 2 minutes and 2 seconds, consuming 12 minutes and 13 seconds of slot time. We found that the issue lies in BigQuery's inability to efficiently optimise such 'mixed conditions', where a non-spatial condition (adm_id equality) is combined with a spatial condition (ST_INTERSECTS).
Specialised spatial databases, like PostGIS, handle mixed conditions more efficiently thanks to their indexing capabilities. PostGIS, for instance, allows the creation of spatial indices (a GiST or GIN index on a geometry column) and efficiently uses these alongside traditional B-tree indices for non-spatial conditions like a.adm_id = b.adm_id. However, BigQuery does not share these capabilities.
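For comparison, the PostGIS-side setup would look roughly like the following (shown purely for illustration; BigQuery has no equivalent commands):
-- PostGIS (not BigQuery): a GiST index on the geometry column plus a
-- regular B-tree index on the non-spatial join key.
CREATE INDEX google_buildings_geom_idx ON google_buildings USING GIST (geometry);
CREATE INDEX google_buildings_adm_idx ON google_buildings (adm_id);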
Further Optimisation: Two-Step Filtering
Discovering these limitations made us reconsider our approach of mixing spatial and non-spatial conditions within the same query. To optimise performance, we split these conditions into separate stages: first filter the data on the adm_id equality condition, then apply the spatial condition (ST_INTERSECTS) to the reduced datasets:
WITH google_buildings_filtered AS (
  SELECT *
  FROM google_buildings
  WHERE adm_id = @selected_adm_id
),
microsoft_buildings_filtered AS (
  SELECT *
  FROM microsoft_buildings
  WHERE adm_id = @selected_adm_id
)
SELECT *
FROM google_buildings_filtered AS g
INNER JOIN microsoft_buildings_filtered AS m ON ST_INTERSECTS(g.geometry, m.geometry)
The results were much better! The query completed in just 2 seconds and consumed only 18 seconds of slot time.
Up until now, we've made great strides in optimising our queries. But there's another level of optimisation still to unlock. Remember our administrative boundaries partitions? Even with our refined approach, BigQuery still scans the whole table when executing queries. Considering that BigQuery billing is based on data scanned, reducing the size of scanned data is a worthwhile pursuit.
Even more efficiency: Welcome to the World of Partitioned Tables!
In our previous discussions, we identified the adm_id as the basis for our partitions. Since these adm_id values are essentially integers, we're going to use range partitioning to our advantage. In order to create an individual partition for each administrative boundary, we'll need to generate an array that ranges from 0 to the highest value of adm_id, with each value increasing by an increment of 1. The GENERATE_ARRAY function can help us accomplish this.
Here's the query to create a partitioned table:
CREATE TABLE google_buildings_partitioned
PARTITION BY RANGE_BUCKET(adm_id, GENERATE_ARRAY(0, {max_adm_id}, 1)) AS
SELECT *
FROM google_buildings
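With the partitioned tables in place, the two-step query from earlier simply points at them; the adm_id filter in each CTE now prunes the scan to a single partition. A sketch, assuming the Microsoft table was partitioned in the same way:
WITH google_buildings_filtered AS (
  SELECT *
  FROM google_buildings_partitioned
  WHERE adm_id = @selected_adm_id -- prunes to a single partition
),
microsoft_buildings_filtered AS (
  SELECT *
  FROM microsoft_buildings_partitioned
  WHERE adm_id = @selected_adm_id
)
SELECT *
FROM google_buildings_filtered AS g
INNER JOIN microsoft_buildings_filtered AS m ON ST_INTERSECTS(g.geometry, m.geometry)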
So, what does partitioning bring to the table? Let's compare our metrics before and after applying partitioning:
The results are indeed impressive! The elapsed time decreased by one second, but the real magic lies in the other two metrics. The amount of data scanned dropped from a hefty 2.27 GB to a mere 3.12 MB, reducing data costs significantly. Furthermore, the number of records read saw a phenomenal reduction - from over 2.3 million to less than 5,000.
In short, partitioning allows us to analyse massive datasets more efficiently, saving both time and computational resources. Now, isn't that something to celebrate?
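And to close the loop on the original objective - keeping only the Microsoft footprints that intersect no Google footprint - the per-boundary merge can be expressed as an anti-join along these lines. This is a sketch under the same assumptions rather than our verbatim production query; outer/anti spatial joins may not benefit from BigQuery's spatial join optimisation to the same extent as inner joins, so it is worth benchmarking:
WITH google_filtered AS (
  SELECT geometry
  FROM google_buildings_partitioned
  WHERE adm_id = @selected_adm_id
),
microsoft_filtered AS (
  SELECT *
  FROM microsoft_buildings_partitioned
  WHERE adm_id = @selected_adm_id
)
SELECT m.*
FROM microsoft_filtered AS m
LEFT JOIN google_filtered AS g ON ST_INTERSECTS(m.geometry, g.geometry)
WHERE g.geometry IS NULL -- keep Microsoft footprints with no intersecting Google footprint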
Conclusion
Wrapping up our exploration into BigQuery optimisation, we learned that the journey towards improved efficiency is not always straightforward. The BigQuery documentation was helpful but didn't always provide an immediate road map to the best strategies for our specific task. The process required us to experiment, iterate, and learn from trial and error. But the end results proved that the effort was a worthy investment.
Our adaptations and refinements led to substantial improvements in performance, with our spatial join operations becoming approximately 61 times faster in terms of elapsed time, and about 41 times quicker in slot time consumed. This demonstrates the significant potential of using efficient filtering and join conditions in BigQuery, showcasing its power in handling complex, large-scale data tasks.
Taking our optimisation a step further, we harnessed the power of BigQuery's built-in partitioning feature. By partitioning our table on adm_id, we drastically reduced the amount of data scanned during queries, leading to more efficient and cost-effective operations.
Here are the key takeaways from our work:
- BigQuery's use of distributed compute power can lead to increased costs through higher slot time consumption. Always test and monitor queries, especially on large datasets, to control expenses.
- Use spatial clustering to improve query performance. It significantly reduces the elapsed time and the number of records read, making data analysis faster and more efficient.
- Don’t mix spatial and non-spatial conditions in the same join operation. Keep these conditions separate to maintain efficiency.
- Take advantage of BigQuery’s partition feature to minimise the amount of data scanned during queries. This can lead to considerable cost savings and improved efficiency.
Our final solution led to a merged dataset that contained an additional 1.6 million buildings for Angola. This enriched dataset, free of duplicates, is an invaluable asset, particularly for regions where the Google dataset had limited coverage. Keep exploring, and keep optimising! 🚀💪