Loading Geographic Multiband Raster Data in BigQuery

Himanshu Rai
Google Cloud - Community
6 min readDec 19, 2022

Goal: Load Raster Data in BigQuery using the below two approaches,

1. Using Dataflow with GeoBeam

2. Using GDAL core libraries.

To set the context, here are few terminologies around raster data explained,

Raster Images

In Raster form, data is captured through remote sensing as images and represented as a grid of pixels. A satellite or a camera mounted aeroplane captures imagery of a certain area on the earth representing continuous varying data e.g. nature of the soil, plantations in a forest land and others.

The images are geotagged and hold geographic positional information. Normally this information consists of a coordinate for the top left pixel in the image, the size of each pixel in the X direction, the size of each pixel in the Y direction, and the amount (if any) by which the image is rotated. With these few pieces of information, the GIS application can ensure that raster data are displayed in the correct place.

Bands

Raster images can contain one or more bands, each covering the same spatial area, but containing different information. Some rasters have a single band, or layer (a measure of a single characteristic), of data, while others have multiple bands. Basically, a band is represented by a single matrix of cell values, and a raster with multiple bands contains multiple spatially coincident matrices of cell values representing the same spatial area.

Bands can represent any portion of the electromagnetic spectrum including ranges not visible to the eye such as the infrared or ultraviolet sections. Images with a single band are called grayscale images. Three of the bands of a multi-spectral image can be shown in the colours Red, Green and Blue so that we can see them.

Representing Features on Raster Images

Each pixel or a set of pixels can represent a geo-point or a feature line like a river or an area shown as a polygon.

Loading Raster Data in BigQuery dataset

GeoTIFF is a widely used raster data format. This format lets one map pixels in a TIFF image to geographic coordinates. Tag Image File Format, abbreviated TIFF or TIF, is an image file format for storing raster graphics.

Here is a tiff file I got from the geobeam examples,

Soil Grid Tiff File

This raster file is loaded into BigQuery for geospatial analysis. The schema of the data in this file is as below,

[
{
"name": "h3",
"type": "INT64",
"description": "Derived available soil water capacity (volumetric fraction) with FC = pF 2.5 for depth 0 cm"
},
{
"name": "geom",
"type": "GEOGRAPHY"
}
]

We can load some geospatial data directly into BigQuery and Earth Engine, depending on data type. BigQuery lets one load vector data in the WKT, WKB, and GeoJSON file formats if the data uses the WGS 84 reference system. Earth Engine integrates directly with the data that’s available in the Earth Engine catalog and supports loading raster images directly in the GeoTIFF file format. However, we can always convert GeoTiff file to a Vector Data file and load it in BigQuery for geospatial analysis.

BigQuery offers many functions that lets us construct new geography values, compute the measurements of geographies, explore the relationship between two geographies, and more. We can do hierarchical geospatial indexing with S2 grid cells using BigQuery S2 functions. In addition, the machine learning features of BigQuery ML can be used to identify patterns in the data, such as creating a k-means machine learning model to cluster geospatial data.

Ingestion of GeoTiff Files in BigQuery is done using the below two methods as in the scope of this article,

  1. Use of GeoBeam Library which is based on GDAL, PROJ and other libraries
  2. Using OSGEO libraries — GDAL, Rasterio directly

Ingestion using GeoBeam

I used dataflow with geobeam example code to load multiple images and in this article I am going to cover the Soil Grid tiff file loading steps and findings.

  1. Install geobeam,
pip install geobeam

2. Run dataflowrunner to call the geobeam example python code to load tiff file into BigQuery,

python -m geobeam.examples.geotiff_soilgrid \
--runner DataflowRunner \
--sdk_container_image gcr.io/dataflow-geobeam/base \
--temp_location <your temp bucket> \
--project <your project> \
--service_account_email <your service account> \
--region us-central1 \
--worker_machine_type c2-standard-8 \
--gcs_url gs://geobeam/examples/soilgrid-test-clipped.tif \
--dataset examples \
--table soilgrid \
--band_column h3

3. The example code in the geobeam library, geobeam.examples.geotiff_soilgrid is used to demonstrate however it can be customised to suit the needs of any use case. The example code loads a single band image file. I modified it to load multiple bands.

Basically the code converts the raster pixels to polygon features using a polygonization method, which creates vector polygons for all connected regions of pixels in the raster sharing a common pixel value. Each polygon is created with an attribute indicating the pixel value of that polygon.

The RasterPolygonSource in GeoBeam was used to enable row level data load which is ready for immediate querying in BigQuery.

class geobeam.io.RasterPolygonSource(file_pattern, bidx=1, in_epsg=None, in_proj=None, **kwargs)

with beam.Pipeline(options=pipeline_options) as p:
(p
| beam.io.Read(GeotiffSource(known_args.gcs_url))
| 'MakeValid' >> beam.Map(geobeam.fn.make_valid)
| 'FilterInvalid' >> beam.Filter(geobeam.fn.filter_invalid)
| 'FormatRecords' >> beam.Map(geobeam.fn.format_record,
known_args.band_column, known_args.band_type)
| 'WriteToBigQuery' >> beam.io.WriteToBigQuery('DATASET.TABLE'))

The above sample pipeline code shows the read to write steps and options to filter records, format it and then write to BigQuery.

The Polygon data gets loaded in the geography data type column, geom, and is used for further querying.

Ingest data using GDAL

This approach I used the OSGEO open source libraries, GDAL and OGR. This method gives more control in the data loading methods and steps like polygonizing the raster pixels can be done in terms of connectedness levels.

GDAL can be installed as,

sudo apt-get install gdal-bin unzip 

The below code loads the tiff file and finds all the bands in the file and loops through each band. Each band in the image is converted to a shape file(A vector data format that was developed by Esri. It lets you store geometric locations and associate attributes). However, other formats can be used instead of shape files.

from osgeo import gdal, ogr
import sys

src_ds = gdal.Open('./SoilGrid.tif')
band = 0
if src_ds is None:
print ('Unable to open tif ')
sys.exit(1)

print ("[ RASTER BAND COUNT ]: ", src_ds.RasterCount)
for band in range( src_ds.RasterCount ):
band += 1
print ("[ GETTING BAND ]: ", band)
srcband = src_ds.GetRasterBand(band)
if srcband is None:
continue

stats = srcband.GetStatistics( True, True )
print("band num is : ", band)
print ("[ NO DATA VALUE ] = ", srcband.GetNoDataValue())
print ("[ MIN ] = ", srcband.GetMinimum())
print ("[ MAX ] = ", srcband.GetMaximum())
print ("[ SCALE ] = ", srcband.GetScale())
print ("[ UNIT TYPE ] = ", srcband.GetUnitType())

dst_layername = "POLYGONIZED_DATA_BAND" + str(band)
drv = ogr.GetDriverByName("ESRI Shapefile")
dst_ds = drv.CreateDataSource( dst_layername + ".shp" )
dst_layer = dst_ds.CreateLayer(dst_layername, srs = None )

gdal.Polygonize( srcband, None, dst_layer, -1, [], callback=None )

if stats is None:
continue

The Shape file can be loaded using geobeam or can be converted to CSV using OGR and then loaded into BigQuery using BQ load command,

ogr2ogr -f csv -dialect sqlite -sql 'select AsGeoJSON(geometry) AS geom, * from "crop_data"' crop_data.csv crop_data.shp
bq load --autodetect --replace sample_data.crop_data_vector gs://sample_data/crop_data.csv

Further the loaded data can be used for analysis using the geo functions in BigQuery.

For testing and prototyping the geospatial analyses, I used BigQuery GeoViz as a way to validate the queries and to generate a visual output from BigQuery. For business intelligence reporting, Looker Studio can be used or Looker can be used to connect to BigQuery and combine the geospatial visualisations with a wide variety of other report types in order to present a unified view of the insights needed.

I am planning to publish more elaborate examples in the coming blogs on analysis of geospatial data and creating dashboards.

References

https://webhelp.esri.com/

https://docs.qgis.org/2.8/en/docs

--

--