Managing raster (Satellite Imagery) in DuckDB with the spatial extension (I)

ah
8 min readApr 2, 2024

--

Spatial data management has become increasingly essential across various domains, ranging from geographic information systems (GIS) to urban planning and beyond. With the proliferation of spatial data sources, efficient handling and analysis of such data have become paramount.

DuckDB has emerged as a versatile analytical database management system, renowned for its performance, flexibility, and ease of integration.

In recent dates, its capabilities have been extended to include spatial data management via the spatial extension.

Understanding the Components

Before we embark on this article, let’s briefly introduce the key components involved:

  1. DuckDB is an open-source, embeddable database management system designed for analytical workloads. It boasts features such as columnar storage and vectorized query execution, making it well-suited for analytical tasks requiring high performance and concurrency. Its lightweight nature and ease of integration make it a popular choice for applications ranging from data exploration to data transformation.
  2. Spatial Extension: The spatial extension of DuckDB enhances its capabilities to handle vector data, enabling users to store and query spatial datasets efficiently.

Key features of DuckDB’s spatial extension include:

  • Spatial Data Types: The extension introduces spatial data types, such as points, lines, polygons, and multi-geometries, allowing users to represent various spatial entities accurately.
  • Spatial Functions: The extension offers a rich set of spatial functions for geometric calculations, spatial relationships, and transformations, facilitating advanced spatial analysis directly within the database.
  • Vector Data Import and Export: The extension provides utilities for importing and exporting vector data in popular formats such as GeoJSON, Shapefile, and Well-Known Text (WKT), ensuring interoperability with existing spatial data workflows.
  • Integration with Analytical Workflows: The spatial extension seamlessly integrates with existing analytical workflows, enabling the incorporation of spatial data into data science pipelines and business intelligence applications.

But, what about raster types? Raster types are not supported and there is currently no plan to add them to the extension.

Raster data is fundamental to remote sensing applications, where satellite and aerial imagery are used to monitor Earth’s surface from a distance. Remote sensing techniques, enabled by raster data, facilitate the monitoring of natural disasters, assessment of environmental changes, and management of natural resources.

Proposing support for raster in DuckDB

Alright, this is the idea behind this post: to describe this implementation to support raster in the DuckDB’s spatial extension and to outline the capabilities of the developed functions using Spatial SQL statements.

Raster functions are using GDAL under the hoods, DuckDB’s spatial extension already includes this amazing library who loads/processes/writes an incredible set of geospatial data formats.

Let me take you on this journey writing SQL in DuckDB and handling raster data!

Sample data

For all of the following commands, I will use a grid of four Gtiffs to test functions. Raster files have an unique band (Int16Grayscale) with some of overlap between them, but aligned on the same grid.

Raster files are included in the Pull Request, because they are loaded by the set of SQL unit tests developed.

Basic functions, loading rasters

First of all, let me show you all spatial drivers that DuckDB provides, including now drivers those support raster data sources:

SELECT * FROM ST_Drivers();

Great, we can view as well drivers managing raster (Gtiff, COG!). Table now shows a new column “type” indicating the data type supported by each driver (some of them support both spatial types).

The most basic table function is ST_ReadRaster(), to load a raster from a file:

SELECT * FROM ST_ReadRaster('./test/data/mosaic/SCL.tif-land-clip00.tiff');
-- or
SELECT * FROM './test/data/mosaic/SCL.tif-land-clip00.tiff';

The second column “raster” is a raster-type, this is the field with the object with raster data (A wrapper of the GDALDataset class).

You can use as well the scalar function ST_RasterFromFile() to load rasters:

WITH __input AS (
SELECT
file,
ST_RasterFromFile(file) AS raster
FROM
glob('./test/data/mosaic/*.tiff')
)
SELECT file, ST_GetGeometry(raster) FROM __input;

ST_GetGeometry() returns the spatial envelope of the raster, as we will see later.

Saving rasters

We can use ST_RasterAsFile() function or COPY TO (…) to save a raster to file:

WITH __input AS (
SELECT
raster
FROM
ST_ReadRaster('./test/data/mosaic/SCL.tif-land-clip00.tiff')
)
SELECT
ST_RasterAsFile(
raster,
'./test/data/output.tiff',
'COG',
['COMPRESS=LZW']
)
AS result
FROM
__input
;

Using COPY TO…

COPY (
SELECT * FROM './test/data/mosaic/SCL.tif-land-clip00.tiff'
)
TO
'./test/data/output.tiff'
WITH (
FORMAT RASTER, DRIVER 'COG', CREATION_OPTIONS ('COMPRESS=LZW')
);

Raster / Raster Band accessors

Getting basic metadata of a raster:

SELECT
ST_SRID(raster) AS srid,
ST_Width(raster) AS cols,
ST_Height(raster) AS rows,
ST_NumBands(raster) AS num_bands,
ST_UpperLeftX(raster) AS ulx,
ST_UpperLeftY(raster) AS uly,
ST_ScaleX(raster) AS scale_x,
ST_ScaleY(raster) AS scale_y,
ST_SkewX(raster) AS skew_x,
ST_SkewY(raster) AS skew_y,
ST_PixelWidth(raster) AS px_width,
ST_PixelHeight(raster) AS px_height
FROM
ST_ReadRaster('./test/data/mosaic/SCL.tif-land-clip00.tiff')
;

ST_ReadRaster_Meta() table function loads the raster metadata more easily:

SELECT * 
FROM ST_ReadRaster_Meta('./test/data/mosaic/SCL.tif-land-clip00.tiff');

ST_GetGeometry() returns the polygon representation of the extent of the raster:

SELECT 
ST_GetGeometry(raster)
FROM
ST_ReadRaster('./test/data/mosaic/SCL.tif-land-clip00.tiff')
;

For metadata of Raster Bands, we can query:

SELECT 
ST_HasNoBand(raster, 1) AS hb1,
ST_HasNoBand(raster, 2) AS hb2,
ST_GetBandPixelType(raster, 1) AS pxt1,
ST_GetBandPixelTypeName(raster, 1) AS pxtn1,
ST_GetBandNoDataValue(raster, 1) AS ndv1,
ST_GetBandColorInterp(raster, 1) AS ci1,
ST_GetBandColorInterpName(raster, 1) AS cin1
FROM
'./test/data/mosaic/SCL.tif-land-clip00.tiff'
;

And ST_Value() to get a pixel value:

SELECT
ST_Value(raster, 1, 0, 0) AS val00,
ST_Value(raster, 1, ST_Width(raster)-1, 0) AS val10,
ST_Value(raster, 1, ST_Width(raster)-1, ST_Height(raster)-1) AS val11,
ST_Value(raster, 1, 0, ST_Height(raster)-1) AS val01,
FROM
'./test/data/mosaic/SCL.tif-land-clip00.tiff'
;

-9999.0 is the “NODATA” value for this raster.

We can also transform between World and Pixel coordinates, using ST_RasterToWorldCoord[XY] and ST_WorldToRasterCoord[XY] functions:

SELECT
ST_RasterToWorldCoord (raster, 0, 0) AS coord,
ST_RasterToWorldCoordX(raster, 0, 0) AS coord_x,
ST_RasterToWorldCoordY(raster, 0, 0) AS coord_y,

ST_WorldToRasterCoord (raster,
ST_RasterToWorldCoordX(raster, 1, 2),
ST_RasterToWorldCoordY(raster, 1, 2)) AS coord,
ST_WorldToRasterCoordX(raster,
ST_RasterToWorldCoordX(raster, 1, 2),
ST_RasterToWorldCoordY(raster, 1, 2)) AS col,
ST_WorldToRasterCoordY(raster,
ST_RasterToWorldCoordX(raster, 1, 2),
ST_RasterToWorldCoordY(raster, 1, 2)) AS row,
FROM
'./test/data/mosaic/SCL.tif-land-clip00.tiff'
;

Raster aggregates

We only have two aggregate functions, ST_RasterMosaic_Agg() and ST_RasterUnion_Agg(). First function creates a new virtual raster mosaicking input rasters, the second one joins them at band level. Under the hoods functions really invoke GDALBuildVRT().

WITH __input AS (
SELECT
1 AS mosaic_id,
ST_RasterFromFile(file) AS raster
FROM
glob('./test/data/mosaic/*.tiff')
),
__mosaic AS (
SELECT
ST_RasterMosaic_Agg(raster, options => ['-r', 'bilinear']) AS mosaic
FROM
__input
GROUP BY
mosaic_id
)
SELECT
ST_RasterAsFile(
mosaic,
'./test/data/mosaic.tiff',
'COG',
['COMPRESS=LZW']
)
AS result
FROM
__mosaic
;

Previous SQL statement loads all Gtiff’s in a folder, our four sample tiles, creates a mosaic from them (Aggregate function) and finally saves the mosaic to a new COG file.

Raster processing

I have developed only two functions, ST_RasterWarp() & ST_RasterClip(), as you can suppose, they are a simple wrapper of GDALWarp().

WITH __input AS (
SELECT
1 AS mosaic_id,
ST_RasterFromFile(file) AS raster
FROM
glob('./test/data/mosaic/*.tiff')
),
__mosaic AS (
SELECT
ST_RasterMosaic_Agg(raster, options => ['-r', 'bilinear']) AS mosaic
FROM
__input
GROUP BY
mosaic_id
),
__warp AS (
SELECT
ST_RasterWarp(
mosaic,
options => [ '-r', 'bilinear', '-tr', '40.0', '40.0']
)
AS warp
FROM
__mosaic
)
SELECT
ST_RasterAsFile(
warp,
'./test/data/warp.tiff',
'COG',
['COMPRESS=LZW']
)
AS result
FROM
__warp
;

Previous SQL statement, saves the mosaic of input tiles with 40m of resolution, the original resolution is 20m.

For clipping the mosaic with a Geometry:

WITH __input AS (
SELECT
1 AS mosaic_id,
ST_RasterFromFile(file) AS raster
FROM
glob('./test/data/mosaic/*.tiff')
),
__mosaic AS (
SELECT
ST_RasterMosaic_Agg(raster, options => ['-r', 'bilinear']) AS mosaic
FROM
__input
GROUP BY
mosaic_id
),
__geometry AS (
SELECT
geom
FROM
ST_Read('./test/data/mosaic/CATAST_Pol_Thownship-PNA.gpkg')
),
__clip AS (
SELECT
ST_RasterClip(
mosaic,
(SELECT geom FROM __geometry LIMIT 1),
options => [
'-r', 'bilinear',
'-crop_to_cutline',
'-wo', 'CUTLINE_ALL_TOUCHED=TRUE'
]
)
AS clip
FROM
__mosaic
)
SELECT
ST_RasterAsFile(
clip,
'./test/data/clip.tiff',
'COG',
['COMPRESS=LZW']
)
AS result
FROM
__clip
;

As you can view, “__geometry” stage loads a Geopackage with ST_Read(), this function is already included in the spatial extension to load geometries from a feature layer.

This is our result:

Almost there, just a moment…

Yes, now, yes. I had to include this! :-)

Conclusion

Integrating raster and satellite imagery offers significant advantages in geospatial data workflows. By incorporating raster data, such as satellite images, into spatial analyses, users gain access to a wealth of information about the Earth’s surface, including land cover, vegetation health, urban development, and more.

This implementation is a proof of concept. It works integrating vector and raster data, of course, only a little subset of functions are there (PostGIS provides a huge set functions), but it can be a starting point.

Anyway, this was a lot of fun, the DuckDB source code contains a huge collection of C/C++ classes, which are hard to understand at first. However, once you understand the impressive array of functionality it offers, browsing its code becomes a pleasure :-)

DuckDB rocks!

Thanks to

  • To DuckDB team, DuckDB is an open-source and fast in-process analytical database. It is an impressive database with incredible performance and a rich set of features and extensions!!!
  • To GDAL project, you know it, GDAL is another super-fantasic toolkit to manage vector & raster data sources.
  • Of course, that tools are composed by an infinity of other open source libraries, thanks to people and organizations who develop amazing open source stuff.

Next steps

  1. Maybe I will write another post, continuing this one, where we query metadata from a SpatioTemporal Asset Catalog (STAC), for example loading imagery from the Sentinel-2 Cloud-Optimized GeoTIFFs dataset (Registry of Open Data on AWS).

The idea is to perform with DuckDB a spatio-temporal filter to fetch tiles, create a mosaic with them, merge their raster bands (e.g. using B2, B3, B8 bands) and finally clip outputs using a specific area of interest (Polygon).

But first of all, I should figure out how to send a HTTP request in DuckDB, for invoking the STAC Catalog endpoint. Maybe something in the “httpfs” extension helps…

UDPATE: Second story is available

2. Maybe my PR is interesting for getting merged in the DuckDB spatial extension, I would be happy to help in this path.

Bye!

--

--