Load Geospatial data in BigQuery using DataFlow and GDAL

Sandeep Manocha
Google Cloud - Community
5 min read · Jul 2, 2023

Extracting and Loading Raster Images (TIFF files) into BigQuery using Google Cloud Dataflow, GDAL and JQ

“Extracting geospatial raster images for use in geospatial analysis entails polygonizing complex images, validating the extracted polygons, performing data validation checks, and more. The Google PSO team took on the task of extracting and loading over 2,500 raster files comprising data for numerous countries, many of which had overlapping parcels and structures. We began development with the Dataflow Geobeam library; however, owing to numerous challenges, including speed and scalability concerns caused by the polygonization process, we refactored the code and transitioned to the GDAL library. This blog intends to share those challenges and explore appropriate solutions.”

Challenges

Polygonization

Polygonization with DataFlow GeoBeam, as previously noted, was quite slow for larger, complex multi-band raster images and did not scale well with DataFlow, causing higher wall time. Following that, we used the GDAL library to transform raster (TIFF) data to GeoJSON with the gdal_polygonize utility. We continued to utilize DataFlow for scalability, but we weren’t taking advantage of essential DataFlow features, as GDAL polygonization works on the entire file. However, re-tiling the raster images (as discussed in a later section of this blog) helped achieve even quicker polygonization for these images. Furthermore, Google Kubernetes Engine or Cloud Run can be used along with the GDAL libraries as an alternative for massive compute.
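For illustration, here is a minimal sketch of invoking gdal_polygonize from Python; the file paths, band number, and output name are placeholder assumptions rather than the project's actual values.

# Minimal sketch: polygonize one band of a GeoTIFF into GeoJSON with gdal_polygonize.
# Assumes gdal_polygonize.py (GDAL) is installed and available on PATH.
import subprocess

def polygonize_band(tif_path, band, out_geojson):
    subprocess.run(
        [
            "gdal_polygonize.py",    # add "-8" here for 8-connectedness; the default is 4-connectedness
            tif_path,
            "-b", str(band),         # raster band to polygonize
            "-f", "GeoJSON",         # output OGR driver
            out_geojson,
        ],
        check=True,
    )

polygonize_band("input_tile.tif", 1, "output_tile.geojson")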

Memory Issues with Large Files

Using GDAL to scale the polygonization process alone would not address the problem. Bigger raster files may produce polygonized data ranging in size from GBs to TBs, requiring large compute worker instances and increasing cost and execution time. To illustrate the size issue: each raster file considered had ten bands, and a 300 MB raster file could expand to 200 GB when converted to GeoJSON. To overcome this, big raster files were re-tiled into smaller, more manageable tiles while maintaining image quality.
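As a sketch of what re-tiling can look like, the snippet below calls GDAL's gdal_retile utility; the 2048-pixel tile size, paths, and directory names are illustrative assumptions, not the values used in the project.

# Minimal sketch: cut a large GeoTIFF into smaller tiles with gdal_retile.
# Assumes gdal_retile.py (GDAL) is installed and available on PATH.
import os
import subprocess

def retile_raster(tif_path, tile_dir, tile_px=2048):
    os.makedirs(tile_dir, exist_ok=True)        # gdal_retile expects the target directory to exist
    subprocess.run(
        [
            "gdal_retile.py",
            "-ps", str(tile_px), str(tile_px),  # tile width and height in pixels
            "-targetDir", tile_dir,             # one GeoTIFF per tile is written here
            tif_path,
        ],
        check=True,
    )

retile_raster("large_input.tif", "tiles/")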

Re-tiling the raster images provides multiple benefits, such as:

  • Improved Scalability: with many smaller files, better parallelization can be achieved during the polygonization process
  • Reduced Data Transfer Time: as these images are read from Google Cloud Storage and copied to a local worker instance for further processing, smaller images transfer faster, improving overall execution time
  • Lower Memory Usage: smaller raster images can be processed with less compute, lowering the overall cost
  • Error Containment: if an error occurs during processing, it is typically contained to a single smaller tile rather than affecting the entire image

Slower BigQuery Inserts for GeoJSON using DataFlow

We used the BigQuery DataFlow module to load data into BigQuery after converting the files to GeoJSON, but this library was quite slow for this use case because it requires all JSON elements to be extracted first and stored in the form of a dictionary. To minimize the execution time further, we used the BQ Load utility with the GeoJSON data extension.
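For illustration, the sketch below shows one way this load path can look: flattening the FeatureCollection produced by gdal_polygonize into newline-delimited GeoJSON with jq (the tool referenced in the title), then loading it with bq load and its GeoJSON extension. Dataset, table, and file names are placeholder assumptions, not the project's actual values.

# Minimal sketch: jq flattens a FeatureCollection into one feature per line,
# then bq load ingests it as newline-delimited GeoJSON.
# Assumes the jq and bq command-line tools are installed and authenticated.
import subprocess

def flatten_feature_collection(geojson_path, ndjson_path):
    # jq -c '.features[]' emits each GeoJSON Feature as a compact single line
    with open(ndjson_path, "w") as out:
        subprocess.run(["jq", "-c", ".features[]", geojson_path], stdout=out, check=True)

def load_geojson_into_bigquery(ndjson_path, table):
    # --json_extension=GEOJSON tells BigQuery to parse each line as a GeoJSON
    # feature and load its geometry into a GEOGRAPHY column
    subprocess.run(
        [
            "bq", "load",
            "--source_format=NEWLINE_DELIMITED_JSON",
            "--json_extension=GEOJSON",
            "--autodetect",
            table,           # e.g. "my_dataset.parcel_polygons" (placeholder)
            ndjson_path,
        ],
        check=True,
    )

flatten_feature_collection("tile.geojson", "tile.ndjson")
load_geojson_into_bigquery("tile.ndjson", "my_dataset.parcel_polygons")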

BigQuery 100 MB Row Limit

After resolving the issue with slow BigQuery inserts, we ran into another: some rows were larger than 100 MB. This was because certain MultiPolygons in the GeoJSON contained an excessive number of polygons and vertices. The problem was largely solved as part of the raster re-tiling for larger files. We knew that files smaller than 50 MB ran flawlessly even on lower compute, so we elected to divide any file larger than 80 MB (keeping a 30 MB buffer) into smaller tiles.

  1. Scaling: splitting large raster files into smaller ones makes it easier to scale across multiple compute nodes.
  2. Auditing: keeping track of all the files and their progress was challenging in itself, and splitting made it more so. We therefore kept this tracking in two tables, one at the file-plus-band level and one at the split-plus-band level (an illustrative sketch follows below).
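As a rough illustration only, the snippet below sketches what such tracking tables could look like using the BigQuery Python client; all table and column names here are assumptions, not the project's actual schema.

# Hypothetical sketch of the two audit tables; names and columns are illustrative.
from google.cloud import bigquery

file_band_audit_schema = [
    bigquery.SchemaField("file_name", "STRING"),
    bigquery.SchemaField("band_id", "INTEGER"),
    bigquery.SchemaField("status", "STRING"),        # e.g. PENDING / DONE / FAILED
    bigquery.SchemaField("updated_at", "TIMESTAMP"),
]

split_band_audit_schema = [
    bigquery.SchemaField("file_name", "STRING"),
    bigquery.SchemaField("split_name", "STRING"),    # tile produced by re-tiling
    bigquery.SchemaField("band_id", "INTEGER"),
    bigquery.SchemaField("status", "STRING"),
    bigquery.SchemaField("updated_at", "TIMESTAMP"),
]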

Multiprocessing

Although DataFlow provided the scalability benefits of distributed computing, we also used Python's native concurrency (a thread pool) to further optimize the concurrent processing of several files and bands. This was particularly helpful because we were using DataFlow only for dynamic scaling of worker nodes and not for core DataFlow features such as distributed data-frames. Because each file or set of files was handled by a single worker node at a time, running several threads per worker gave us better performance.

# PARALLEL EXECUTION - Using a ThreadPoolExecutor to process the files in parallel
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

result_bucket_list = []
with ThreadPoolExecutor(max_workers=conv_parallelism) as executor:
    for result in executor.map(ConvertGeotifToGeoJSON, repeat(job_id), repeat(current_working_directory), file_list_with_band_ids):
        # Adding the staging file path returned by each conversion to the list
        result_bucket_list.append(result)

# De-duplicate the staging paths while preserving insertion order
result_bucket_list = list(dict.fromkeys(result_bucket_list))

Limitations

  • Some files still generated very large MultiPolygons, resulting in rows larger than 100 MB. As of the day this article is published, BigQuery supports a maximum row size of 100 MB. Separate logic needs to be implemented to further divide these polygons into smaller polygons, or to ignore such records during ingestion (one possible approach is sketched after this list).
  • BQ load job quota per day — parallelization of BigQuery load jobs needs to be implemented with the daily quota in mind.
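As one possible mitigation, and not the approach implemented in this project, a large MultiPolygon feature could be split into one feature per polygon before ingestion, so that no single BigQuery row carries the entire geometry.

# Hypothetical sketch: split a GeoJSON MultiPolygon feature into one Feature per polygon.
def split_multipolygon_feature(feature):
    geometry = feature.get("geometry", {})
    if geometry.get("type") != "MultiPolygon":
        return [feature]
    return [
        {
            "type": "Feature",
            "properties": feature.get("properties", {}),
            "geometry": {"type": "Polygon", "coordinates": polygon_coords},
        }
        for polygon_coords in geometry.get("coordinates", [])
    ]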

Possible Alternative Approach

  1. Use Composer Kubernetes Operator instead of DataFlow Job Operator
  2. Use Serverless Dataproc to manage scalability while retaining the flexibility of using different libraries in Python
  3. Use Cloud Run (Batch Mode) for Compute with Eventarc trigger for Cloud Storage
  4. 4-connectedness polygonization can be performed instead of 8-connectedness for faster polygonization (depending on the accuracy requirements of the business scenario)

Conclusion

Using Cloud Composer and Dataflow, an end-to-end solution can be developed to process terabytes of geospatial raster images and ingest the data into BigQuery for further processing and analysis.

Special thanks to Rahul Panda for helping with this publication and for working towards the successful completion of this massive project, along with the whole team.
