Querying STAC to load Satellite Imagery (Sentinel2) in DuckDB with the spatial extension (II)

ah
7 min read · Apr 7, 2024


This article is a continuation of the previous one, which described a proof-of-concept (PoC) implementation for integrating raster data into DuckDB using the spatial extension.

Now I will use all of that in combination with the httpfs extension; the idea is to query metadata from a STAC (SpatioTemporal Asset Catalog), geo-process some of the rasters from a collection in DuckDB, and finally export the results.

Understanding the Components

Before we start with this article, let’s introduce the key components involved:

  1. DuckDB: An open-source, embeddable database management system designed for analytical workloads.
  2. Spatial Extension: The spatial extension of DuckDB enhances its capabilities to handle spatial data.
  3. httpfs Extension: The httpfs extension supports, among other things, connecting to HTTP(S) endpoints.
  4. STAC (SpatioTemporal Asset Catalog): A standardized specification for organizing and describing geospatial assets in a manner optimized for search and discovery across both spatial and temporal dimensions.
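Before running the examples below, the extensions have to be installed and loaded in the DuckDB session. A minimal setup sketch (note: the raster functions used later come from the PoC build of the spatial extension described in the previous article, not from an official release):

```sql
-- One-time install per database, then load per session.
INSTALL spatial;
INSTALL httpfs;
LOAD spatial;
LOAD httpfs;
```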

Input data

ESA Sentinel-2

The Copernicus Sentinel-2 mission is a land-monitoring constellation of satellites that provides high-resolution optical imagery. The mission provides global coverage of the Earth’s land surface, making the data of great use in ongoing studies.

There are many sources of Sentinel-2 imagery. First of all, of course, the Copernicus Open Access Hub from the European Space Agency (ESA), but we would have to download their products completely to process them. Amazon and Google provide Sentinel collections as well, free of charge or at very low cost.

STAC (SpatioTemporal Asset Catalog)

The SpatioTemporal Asset Catalog (STAC) specification aims to standardize the way geospatial assets are exposed online and queried. A “spatiotemporal asset” is any file that represents information about the earth captured in a certain space and time.
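Concretely, a STAC Item is a GeoJSON Feature with some extra fields. A minimal sketch of its shape (placeholder values; the field names come from the STAC Item specification) shows where the queries later in this article read their data from:

```json
{
  "type": "Feature",
  "id": "<item id>",
  "geometry": { "type": "Polygon", "coordinates": [ "..." ] },
  "properties": { "datetime": "<acquisition date/time>" },
  "assets": {
    "B02": { "href": "<URL of the blue-band COG>" }
  }
}
```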

On the Amazon platform, you can find the amazing Sentinel-2 COG bucket managed by Element84.

Its description says: “This dataset is the same as the Sentinel-2 dataset, except the JP2K files were converted into Cloud-Optimized GeoTIFFs (COGs). Additionally, SpatioTemporal Asset Catalog metadata is available in a JSON file alongside the data, and a STAC API called Earth-search is freely available to search the archive”.

I love this dataset: COG files. I am going to use them as input data for this article.

Proposing new functions in httpfs

The key question now is how to perform, from DuckDB, a spatio-temporal filter against a STAC collection, and then digest the response data to load raster assets.

A lot of the work for making HTTP requests is already done in the httpfs extension, but unfortunately I didn’t find a way to send arbitrary HTTP requests to a web server from SQL.

So, I am proposing (sorry, again) a new PR providing two new functions in the httpfs extension: HTTPGetRequest() and HTTPPostRequest().

Let me explain how to use them.

Basic functions, sending HTTP requests

First sample, a basic HTTP GetRequest:

WITH __input AS (
    SELECT
        HTTPGetRequest('https://earth-search.aws.element84.com/v0/search') AS data
),
__features AS (
    SELECT
        unnest(from_json((data::JSON)->'features', '["json"]')) AS features
    FROM
        __input
)
SELECT
    features->>'id' AS id,
    features->'properties'->>'sentinel:product_id' AS product_id,
    concat(
        'T',
        features->'properties'->>'sentinel:utm_zone',
        features->'properties'->>'sentinel:latitude_band',
        features->'properties'->>'sentinel:grid_square'
    ) AS grid_id,
    ST_GeomFromGeoJSON(features->'geometry') AS geom
FROM
    __features
;

That is a lot of SQL code; there are three stages:

  1. Send one HTTP GET request to the desired endpoint, and save the result (the response is a string) into “data”.
  2. Parse “data” as a JSON document and unnest the loaded array; each item in the array is the metadata (feature object) of one match.
  3. Digest each feature into flat attribute columns.

OK, but I have specified no parameters, so the server only returns the first ten features of a collection.
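Before moving to POST, note that the STAC API also accepts some filters as query-string parameters on GET requests, so a quick way to tweak the page size is to encode them in the URL itself (a sketch; the rest of the query stays the same as above):

```sql
-- Ask the server for a single feature via the 'limit' query parameter.
SELECT
    HTTPGetRequest('https://earth-search.aws.element84.com/v0/search?limit=1') AS data;
```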

We need to perform a POST request to customize the response:

WITH __geometry AS (
    SELECT
        geom
    FROM
        ST_Read('./test/data/mosaic/CATAST_Pol_Township-PNA.gpkg')
),
__geojson AS (
    SELECT
        ST_AsGeoJSON(
            ST_Transform(geom, 'EPSG:32630', 'EPSG:4326', always_xy => true)
        ) AS g
    FROM
        __geometry
    LIMIT
        1
),
__input AS (
    SELECT
        HTTPPostRequest('https://earth-search.aws.element84.com/v0/search',
            headers => MAP {
                'Content-Type': 'application/json',
                'Accept-Encoding': 'gzip',
                'Accept': 'application/geo+json'
            },
            params => {
                'collections': ['sentinel-s2-l2a-cogs'],
                'datetime': '2021-09-30/2021-09-30',
                'intersects': (SELECT g FROM __geojson),
                'limit': 16
            }
        ) AS data
),
__features AS (
    SELECT
        unnest(from_json((data::JSON)->'features', '["json"]')) AS features
    FROM
        __input
)
SELECT
    features->>'id' AS id,
    features->'properties'->>'sentinel:product_id' AS product_id,
    concat(
        'T',
        features->'properties'->>'sentinel:utm_zone',
        features->'properties'->>'sentinel:latitude_band',
        features->'properties'->>'sentinel:grid_square'
    ) AS grid_id,
    ST_GeomFromGeoJSON(features->'geometry') AS geom
FROM
    __features
;

Two S2L2A tiles match the filter criteria! Let me explain all the stages again:

  1. Load a GeoPackage with my area of interest, a polygon located in Navarra (Spain). Of course, we could load a feature layer from a remote S3 bucket as well, or define the geometry using WKT format.
  2. Transform the spatial reference to EPSG:4326; the endpoint needs geometries defined in this spatial reference system.
  3. Perform the POST request, defining as parameters: the AOI (geometry), the collection (sentinel-s2-l2a-cogs) and the date range (2021-09-30/2021-09-30).
  4. Parse “data” as a JSON document and unnest the loaded array; each item in the array is the metadata (feature object) of one match.
  5. Digest each feature into flat attribute columns.

That is more useful, right?
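Under the hood, the params struct is serialized into a standard STAC API search body. The POST request above sends JSON equivalent to the following (the intersects geometry is abbreviated; the field names come from the STAC API Item Search specification):

```json
{
  "collections": ["sentinel-s2-l2a-cogs"],
  "datetime": "2021-09-30/2021-09-30",
  "intersects": { "type": "Polygon", "coordinates": [ "..." ] },
  "limit": 16
}
```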

Everything in one query: STAC & raster processing

WITH __geometry AS (
    SELECT
        geom
    FROM
        ST_Read('./test/data/mosaic/CATAST_Pol_Township-PNA.gpkg')
),
__geojson AS (
    SELECT
        ST_AsGeoJSON(
            ST_Transform(geom, 'EPSG:32630', 'EPSG:4326', always_xy => true)
        ) AS g
    FROM
        __geometry
    LIMIT
        1
),
__input AS (
    SELECT
        HTTPPostRequest('https://earth-search.aws.element84.com/v0/search',
            headers => MAP {
                'Content-Type': 'application/json',
                'Accept-Encoding': 'gzip',
                'Accept': 'application/geo+json'
            },
            params => {
                'collections': ['sentinel-s2-l2a-cogs'],
                'datetime': '2021-09-30/2021-09-30',
                'intersects': (SELECT g FROM __geojson),
                'limit': 16
            }
        ) AS data
),
__features AS (
    SELECT
        unnest(from_json((data::JSON)->'features', '["json"]')) AS features
    FROM
        __input
),
__rasters AS (
    SELECT
        ST_RasterFromFile(features->'assets'->'B02'->>'href') AS b02,
        ST_RasterFromFile(features->'assets'->'B03'->>'href') AS b03,
        ST_RasterFromFile(features->'assets'->'B04'->>'href') AS b04,
        1 AS group_id
    FROM
        __features
),
__mosaic AS (
    SELECT
        ST_RasterMosaic_Agg(b02, options => ['-r', 'bilinear']) AS b02_m,
        ST_RasterMosaic_Agg(b03, options => ['-r', 'bilinear']) AS b03_m,
        ST_RasterMosaic_Agg(b04, options => ['-r', 'bilinear']) AS b04_m
    FROM
        __rasters
    GROUP BY
        group_id
),
__clip AS (
    SELECT
        ST_RasterClip(b02_m,
            (SELECT geom FROM __geometry LIMIT 1),
            options => [
                '-r', 'bilinear',
                '-crop_to_cutline',
                '-wo', 'CUTLINE_ALL_TOUCHED=TRUE'
            ]
        ) AS b02_c,
        ST_RasterClip(b03_m,
            (SELECT geom FROM __geometry LIMIT 1),
            options => [
                '-r', 'bilinear',
                '-crop_to_cutline',
                '-wo', 'CUTLINE_ALL_TOUCHED=TRUE'
            ]
        ) AS b03_c,
        ST_RasterClip(b04_m,
            (SELECT geom FROM __geometry LIMIT 1),
            options => [
                '-r', 'bilinear',
                '-crop_to_cutline',
                '-wo', 'CUTLINE_ALL_TOUCHED=TRUE'
            ]
        ) AS b04_c
    FROM
        __mosaic
),
__rgb AS (
    SELECT b04_c AS band FROM __clip
    UNION ALL
    SELECT b03_c AS band FROM __clip
    UNION ALL
    SELECT b02_c AS band FROM __clip
),
__raster AS (
    SELECT
        ST_RasterUnion_Agg(band, options => ['-r', 'bilinear']) AS rgb
    FROM
        __rgb
)
SELECT
    ST_RasterAsFile(rgb,
        './mydata/results/rgb.tiff',
        'COG',
        ['COMPRESS=LZW']
    ) AS result
FROM
    __raster
;

Wow, it produces results! Commenting on the stages of the query:

Stages 1–4 are the same as in the previous SQL statement.

New stages:

  1. Load rasters from each tile; we are reading the assets of bands B02, B03 and B04. The raster files are in COG format; no problem, this format is cloud-friendly for remote data sources. We are not going to download the tiles completely, only a small area of them.
  2. Create a mosaic for each band. Each mosaic joins the two tiles into a new virtual raster, because we want a continuous layer of data pixels.
  3. Clip the mosaic objects with the desired area of interest (geometry).
  4. Merge the three clipped raster objects (bands B04, B03 and B02) into one new raster, each object as a separate band. So, we now have an RGB image.
  5. Save the result to a file.
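As a quick sanity check (a sketch using the same PoC function introduced earlier), the exported COG can be loaded back into the session to verify it is readable:

```sql
-- Load the exported COG back as a raster object.
SELECT ST_RasterFromFile('./mydata/results/rgb.tiff') AS rgb;
```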

That’s all!

Thanks to (again)

  • To the DuckDB team: DuckDB is an open-source, fast, in-process analytical database. It is an impressive database with incredible performance and a rich set of features and extensions!
  • To the GDAL project: you know it, GDAL is another super-fantastic toolkit for managing vector & raster data sources.
  • To everyone, for reading these two posts; I would be happy if you like them.

Next steps

If both PRs are considered interesting for merging into DuckDB, I would be happy to help along that path.

Anyway, this second one needs more testing, and maybe more capabilities to customize a request.

It would be exciting to see this integrated into DuckDB-Wasm; we would have a complete component for generating Earth Observation data-processing workflows in the web browser, with zero setup. The possibilities are endless!

Bye!
