Unifying Raster and Vector data with Mosaic

Milos Colic
6 min read · Feb 14, 2023

Geospatial data processing at scale involves a remarkably diverse set of data formats. Each format was designed with particular use cases in mind and is very efficient within that context. However, combining data across these formats can be a tough challenge. Frameworks like GDAL have emerged to abstract away these complexities. In this blog, we will cover how Mosaic integrates with GDAL, the abstractions Mosaic provides in spatial SQL, and how they allow us to easily combine raster and vector data in a unified system.

The value behind the raster “paywall”

Rasters are a compelling set of data formats. Data assets describing climate data, weather data, satellite data, flood regions, soil data, harvest data, and many more are often supplied as raster data. The trouble is that most data scientists are not trained to handle raster formats. In most cases, barriers such as acquiring the requisite GIS expertise and the complexity of advanced frameworks like GDAL create a “paywall” that hides available insights from those seeking to work with raster data. However, the value inside these assets can help us address use cases such as climate risk, flood risk, ESG, and alternative energy site planning; the list goes on and on. How can we bring these data assets closer to the data scientist and analyst communities and decouple them from GIS expertise?

GDAL as an enabler

GDAL stands for “Geospatial Data Abstraction Library”, and as the name suggests, it provides an abstraction layer over raster and vector data formats. This is a robust approach toward simplifying and unifying the code required to process data. Indeed some formats are more complex than others (e.g. NetCDF) and require reasoning about subdatasets, but the abstractions provided allow us to overcome these complexities easily.

import org.gdal.gdal.gdal;
import org.gdal.gdal.Band;
import org.gdal.gdal.Dataset;
...
Dataset dataset = gdal.Open(filename);
Band band = dataset.GetRasterBand(1);

The code above illustrates how we can read a band of raster data using GDAL and a filename. It is important to note that we aren’t explicitly providing the format; we expect GDAL to work out the appropriate driver for the file on our behalf. This can be a massive enabler that simplifies and unifies the code needed to produce a piece of analysis.
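
One way an abstraction layer can open a file without being told its format is to let each registered driver inspect the file and claim it. The sketch below illustrates that idea in plain Python using well-known file signatures. The driver list and the `identify_driver` helper are hypothetical and are not GDAL's actual implementation.

```python
from typing import Callable, List, Optional, Tuple

# Illustrative only: a toy driver registry, not GDAL's real code.
# Each hypothetical "driver" is a name plus a function that inspects the
# first bytes of a file and reports whether it recognises the format.
DRIVERS: List[Tuple[str, Callable[[bytes], bool]]] = [
    # TIFF files start with a byte-order mark followed by the number 42.
    ("GTiff", lambda head: head[:4] in (b"II*\x00", b"MM\x00*")),
    # Classic NetCDF files start with the ASCII letters "CDF".
    ("netCDF (classic)", lambda head: head[:3] == b"CDF"),
    # NetCDF-4 files are HDF5 containers and carry the HDF5 signature.
    ("HDF5 / netCDF-4", lambda head: head[:8] == b"\x89HDF\r\n\x1a\n"),
]


def identify_driver(header: bytes) -> Optional[str]:
    """Return the name of the first driver that recognises the header."""
    for name, recognises in DRIVERS:
        if recognises(header):
            return name
    return None
```

Calling `identify_driver` with the first few bytes of a file returns a driver name, or `None` when no driver claims the file; the caller never has to name the format.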

Finally, GDAL is compiled to native code, which makes it a very performant framework. Unfortunately, this performance comes at a cost. Because GDAL is platform-specific, there is a higher risk of running into problems when installing it. Furthermore, using the Java bindings for GDAL adds further complexity. It is not an overstatement to say that GDAL expects a certain skill level from its users. So how can we leverage this framework to democratise geospatial data?

Easy GDAL for Databricks with Mosaic

Since version 0.3.7, Mosaic integrates with GDAL and provides a set of abstractions that make raster data more accessible to people who would not classify themselves as GIS experts.

Firstly, Mosaic introduces an easy install of GDAL for Databricks. This involves generating an init.sh script and providing the necessary shared object files (without these, GDAL won’t work in Java-based languages or in Java-based frameworks like Apache Spark™).

# Once per cluster/workspace: generate the init script, which can then be reused
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.setup_gdal(spark)

# After configuring the init script on a fresh cluster: enable GDAL
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

The benefit of this approach is that the packaged scripts are version-controlled and unit tested against the matching version of Mosaic. This way, we ensure consistency between the installed GDAL dependency and the exposed functionality.

RST_ APIs in Mosaic

Mosaic brings 30+ raster functions/expressions. All the raster functions that you can use in Mosaic are prefixed by RST_. RST_ stands for “raster” or “raster spatiotemporal” functions.

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)
df = spark.read \
.format("binaryFile").option("pathGlobFilter", "*.nc") \
.load("dbfs:/sample_raster_data/binary/netcdf-coral")
df.select(mos.rst_georeference("path")).limit(1).display()

All the APIs provided by Mosaic are available in Python, SQL, R and Scala/Java; please refer to the documentation page for examples of function usage.

RST_ expressions include metadata inspection, re-tiling of rasters, and conversion of rasters into a grid index system representation. Grid index systems such as H3 and BNG are core to the way Mosaic represents vector data.

from pyspark.sql import functions as F

df = spark.read \
.format("binaryFile").option("pathGlobFilter", "*.nc") \
.load("dbfs:/sample_raster_data/binary/netcdf-coral")
df.select(mos.rst_rastertogridavg("path", F.lit(3))).show()

RST_RasterToGridAvg example of NetCDF data

Using the RST_RasterToGrid* set of operations, Mosaic users can easily convert raster data into grid-based data that can be accessed from notebooks or Databricks SQL. This is a significant simplification of data access for people who aren’t GIS experts and have no knowledge of the underlying data formats or of GDAL’s existence.
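
To give an intuition for what a raster-to-grid average involves, here is a minimal plain-Python sketch. It assumes a GDAL-style six-element affine geotransform and uses a hypothetical square grid in place of H3 or BNG. The helper names are illustrative and are not part of the Mosaic API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]


def pixel_center(gt: Sequence[float], row: int, col: int) -> Tuple[float, float]:
    """Map a pixel (row, col) to the x/y of its centre using a GDAL-style
    geotransform: (x origin, pixel width, row rotation,
                   y origin, column rotation, pixel height)."""
    x = gt[0] + (col + 0.5) * gt[1] + (row + 0.5) * gt[2]
    y = gt[3] + (col + 0.5) * gt[4] + (row + 0.5) * gt[5]
    return x, y


def to_cell(x: float, y: float, cell_size: float) -> Cell:
    """Hypothetical square-grid cell id; stands in for an H3 or BNG index."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def raster_to_grid_avg(
    band: List[List[Optional[float]]], gt: Sequence[float], cell_size: float
) -> Dict[Cell, float]:
    """Average all pixel values whose centres fall into the same grid cell."""
    sums: Dict[Cell, float] = defaultdict(float)
    counts: Dict[Cell, int] = defaultdict(int)
    for row, values in enumerate(band):
        for col, value in enumerate(values):
            if value is None:  # None marks a no-data pixel in this sketch
                continue
            cell = to_cell(*pixel_center(gt, row, col), cell_size)
            sums[cell] += value
            counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# A 4x4 band with 1x1 pixels, north-up, top-left corner at (0, 4).
band = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
geotransform = (0.0, 1.0, 0.0, 4.0, 0.0, -1.0)
grid = raster_to_grid_avg(band, geotransform, cell_size=2.0)
```

With 2x2 cells over this 4x4 band, each cell receives the mean of the four pixels it covers, so the result is a small lookup table keyed by cell id rather than a pixel array.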

Combining vector and raster

We have followed the same philosophy for raster data as we did for vector data. Through this shared approach, we can ensure both vector and raster data are represented in a unified way that makes it easy to combine data assets.

In Mosaic, users can use the ST_GridTessellate* set of operations to represent vector data in grid index systems as a set of piecewise geometries (chips), each contained within a single index cell.

Tessellation of a polygon in H3(8)
Tessellation of a polygon in BNG(4)
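
The idea of chips can be sketched in a few lines of plain Python. To keep the example short, it tessellates an axis-aligned rectangle over a hypothetical square grid, rather than an arbitrary polygon over H3 or BNG. The `Chip` structure and its field names are illustrative assumptions, not Mosaic's actual schema.

```python
import math
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class Chip(NamedTuple):
    index_id: Tuple[int, int]  # hypothetical square-grid cell id
    is_core: bool              # True when the cell lies fully inside the shape
    geometry: Box              # the piece of the shape inside this cell


def tessellate_box(shape: Box, cell_size: float) -> List[Chip]:
    """Split an axis-aligned rectangle into one chip per overlapping cell."""
    xmin, ymin, xmax, ymax = shape
    chips = []
    for i in range(math.floor(xmin / cell_size), math.ceil(xmax / cell_size)):
        for j in range(math.floor(ymin / cell_size), math.ceil(ymax / cell_size)):
            cell = (i * cell_size, j * cell_size,
                    (i + 1) * cell_size, (j + 1) * cell_size)
            # Intersection of the shape with this cell.
            piece = (max(xmin, cell[0]), max(ymin, cell[1]),
                     min(xmax, cell[2]), min(ymax, cell[3]))
            if piece[0] < piece[2] and piece[1] < piece[3]:  # non-empty overlap
                chips.append(Chip((i, j), piece == cell, piece))
    return chips


chips = tessellate_box((1.0, 0.0, 4.0, 2.0), cell_size=2.0)
```

Here the rectangle covers half of cell (0, 0), giving a boundary chip, and all of cell (1, 0), giving a core chip. Together the chips reassemble the original shape exactly, which is what makes a per-cell join lossless.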

Now that both vector and raster data are represented in the same domain, we can easily combine data assets between vector and raster domains and produce insights. Combining data is as easy as doing a simple SQL join based on grid index cell ids.

from pyspark.sql.functions import col, explode, lit

vector_df = spark.read.load("dbfs:/sample_vector_data/regions")
raster_df = spark.read \
    .format("binaryFile").option("pathGlobFilter", "*.nc") \
    .load("dbfs:/sample_raster_data/binary/netcdf-coral")
vector_df = vector_df.select(mos.grid_tessellateexplode("wkt", lit(3)))
# RST_RasterToGrid* functions return a set of grid index cells
# for each band of the raster, so we need to flatten the dataframe
# Please note that we need to explode the dataframe twice
raster_df = (
    raster_df
    .withColumn("grid_values", mos.rst_rastertogridavg("path", lit(3)))
    # explode bands
    .withColumn("grid_values", explode("grid_values"))
    # explode (cell, measure) pairs
    .withColumn("grid_values", explode("grid_values"))
    # the cell id lives inside the exploded (cellID, measure) pair
    .withColumn("index_id", col("grid_values.cellID"))
)
result_df = vector_df.join(raster_df, on="index_id")
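
The flatten-then-join logic can also be traced in plain Python, without a Spark cluster. The sketch below uses made-up cell ids and measures; the nested list mirrors the "bands, then (cell, measure) pairs" shape described in the comments above. The helper names are hypothetical, and the band number is added only for readability.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical output for one raster file: one list per band, each holding
# (cellID, measure) pairs.
grid_values = [
    [{"cellID": 10, "measure": 1.0}, {"cellID": 11, "measure": 3.0}],  # band 1
    [{"cellID": 10, "measure": 5.0}, {"cellID": 13, "measure": 9.0}],  # band 2
]

# Hypothetical tessellated vector data: one row per (region, cell).
vector_rows = [
    {"region": "A", "index_id": 10},
    {"region": "A", "index_id": 11},
    {"region": "B", "index_id": 12},
]


def flatten(nested: List[List[Row]]) -> List[Row]:
    """Turn bands -> pairs into flat rows; the two loops play the role of
    the two explode calls."""
    rows = []
    for band_number, band in enumerate(nested, start=1):  # first explode
        for pair in band:                                  # second explode
            rows.append({"band": band_number,
                         "index_id": pair["cellID"],
                         "measure": pair["measure"]})
    return rows


def join_on_index_id(left: List[Row], right: List[Row]) -> List[Row]:
    """Inner join two lists of rows on their 'index_id' key."""
    by_cell = defaultdict(list)
    for row in right:
        by_cell[row["index_id"]].append(row)
    return [{**l, **r} for l in left for r in by_cell.get(l["index_id"], [])]


raster_rows = flatten(grid_values)
result = join_on_index_id(vector_rows, raster_rows)
```

Only cells present on both sides survive the join: region B's cell 12 has no raster measure, and raster cell 13 touches no region, so both drop out, as they would in an inner SQL join.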

Mosaic supports all raster drivers supported by GDAL, including TIF, COG, GRIB, NetCDF and many more formats. Currently, Mosaic doesn’t expose the OGR vector drivers from GDAL; this will come with future releases.

Future releases

In future releases, Mosaic will focus on expanding both the vector and raster sets of APIs to simplify more use cases. In addition, we will bring custom grid index system support and user-defined functions (UDFs) for easy integration of custom shapely, rasterio and GDAL code. Finally, Mosaic will support query optimization rules that automatically reorder your queries to use grid index systems when the operation can benefit from them.

This blog has covered how grid index systems can serve as a unifying framework for raster and vector data. We have discussed both the value and the pain points of frameworks like GDAL, and shown an approach that gives easy access to such tools within the Databricks Lakehouse Platform. Finally, we have summarised what is coming next to Mosaic.
