Building the VIDA Data Catalog with STAC and Cloud-Native Geospatial formats

Giorgio Basile · Published in VIDA engineering · 11 min read · Jun 13, 2024

This blog post leverages the long-standing efforts of the Data Engineering and Platform teams at VIDA.

VIDA risk assessment, visualizing global Drought risk and several other indicators for any given location.

In this blog post, we are excited to share some of the latest engineering developments at VIDA, in our effort to provide our users with a great data platform centered on climate, environmental and social indicators worldwide.

Given the steady growth in size and complexity of our data offering, we recently took on the challenge of implementing a Geospatial Data Catalog, leveraging the formats and infrastructure already used in our platform and fully aligned with modern Cloud-Native Geospatial best practices.

To accomplish this, we relied on the SpatioTemporal Asset Catalog (STAC) standard, designed to simplify the search and retrieval of geospatial data. We are excited to share our journey toward adopting STAC, describing the interplay with the Cloud-Native formats and technologies already in place at VIDA, and documenting our learnings and design choices along the way.

Motivation and Use Cases

A Data Catalog offers significant benefits for data-driven companies and platforms. At VIDA, our key motivations include creating a structured, unified, and searchable view of all our datasets, in order to better support data discovery, transformation, analysis and visualization use cases.

Building a flexible Data Catalog involves several technical considerations related to the infrastructure and tools needed to run the catalog as a service, the definition of an expressive metadata language, and an efficient ingestion process.

Fortunately, with the adoption of the STAC standard and its supportive community, much of this effort can be considered a “solved problem”, leaving us only the tailoring needed for our data strategy and use cases.

STAC

STAC is a (Geo)JSON-based specification providing a common structure for describing and cataloging spatiotemporal assets. A spatiotemporal asset is any file or resource representing information about the Earth captured in a given space and time.

VIDA Data Catalog with some of the available 25+ Collections, rendered through Radiant Earth’s STAC Browser.

STAC is organized as a set of four semi-independent specifications that outline the fundamental concepts — Catalog, Collection, Item — and the RESTful endpoints to search, filter and retrieve Items of interest.

While these represent the core part of the standard, STAC is an extensible language that is agnostic about custom metadata keys and values. Nevertheless, there are often common domains and use cases where further standardization can benefit multiple actors.

That’s where STAC Extensions come in, providing additional schemas related to particular domains, topics or scenarios, which typically require further metadata at the Collection, Item or Asset levels. Using STAC Extensions proved essential to satisfying several of the requirements we had for our use cases, and was a great opportunity to learn and collaborate.

A STAC Item is a GeoJSON Feature listing inseparable Assets and belonging to a parent Collection, which is typically associated with a dataset. (Credits: STAC website)

Beyond the excellent language and concepts it provides, STAC’s power lies in the community behind it — some of the most prominent organizations and developers in geospatial tech — and in the set of tools built around the specification to facilitate and streamline the development and deployment of catalogs, which we leveraged fully in our journey.

Infrastructure and tools

The first step towards implementing the Catalog was to work on the necessary infrastructure to create it and expose it as a service. We use Google Cloud Platform (GCP) as our main cloud provider, and we heavily rely on its numerous storage and compute services.

As a starting point, we took inspiration from the Planetary Computer API repository, which collects and glues together several Python-based utility packages provided by the STAC community.

STAC utilities overview, in red those we actively use at VIDA. (Credits: stac-utils on GitHub)

In particular, we decided to use stac-fastapi-pgstac, an extension of stac-fastapi that exposes a STAC API with a PgSTAC backend. PgSTAC defines schemas and procedures to store STAC-related records in a PostGIS database as jsonb blobs.

stac-fastapi-pgstac allowed us to continue working with the awesome FastAPI framework, already widely used in our backend infrastructure. The package defines the STAC API endpoints and their models, performs request validation and keeps up with the specification’s evolution.

Moreover, it implements the Transaction extension, providing endpoints for insertion and update of Collections and Items through HTTP POST/PUT requests, performing the necessary validations and placing the data in the right tables.
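As an illustration, here is a minimal sketch of registering a Collection through those endpoints (the host and Collection contents are hypothetical):

```python
import httpx

# Hypothetical internal STAC API host; the Transaction extension
# exposes POST /collections for Collection creation.
STAC_API = "https://stac.example.com"

collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "drought-risk",
    "description": "Global drought risk indicators.",
    "license": "proprietary",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2020-01-01T00:00:00Z", None]]},
    },
}

# stac-fastapi-pgstac validates the payload and stores it
# as a jsonb row in the PgSTAC database.
resp = httpx.post(f"{STAC_API}/collections", json=collection)
resp.raise_for_status()
```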

So, the only work required at the infrastructure level was to set up the necessary Terraform files and deployment pipelines to:

  • Build a Docker image running a Gunicorn server and push it into our Artifact Registry.
  • Deploy a dedicated app through ArgoCD to our backend GKE cluster.

Our Platform Team set up dedicated instances of the service, and we quickly started experimenting with the endpoints, creating test Collections and Items.

Datasets and buckets

At this point, the main task was filling the catalog with relevant metadata regarding our datasets. We decided to focus on those we leverage in our risk assessments feature, providing interesting indicators related to climate, environmental and social domains for any point on the globe.

This needed to be done in accordance with the bucket strategy implemented by the Data Engineering team at VIDA, which is responsible for all things data in support of other teams and business-related activities.

Simplified GCS bucket strategy implemented at VIDA.

We store most of our datasets in environment-based GCS buckets, neatly organizing files by spatial and temporal coverage. Each dataset usually has multiple time slices — i.e. dates or years — and optional partitioning depending on the way it was generated. Some of these files are mirrors from our providers; others we generate and maintain ourselves.

In other cases, data is streamed directly from external sources like Source Cooperative and the Planetary Computer Data API, so the original URL should be available in the catalog.

For raster data, we mostly use Cloud-Optimized GeoTIFFs (COGs) with global coverage for each time slice, although in other cases datasets are split into multiple COGs, with a MosaicJSON document placed alongside them to describe their mosaicking.

MosaicJSON enables the processing and visualization of virtual rasters. (Credits: MosaicJSON on GitHub)

COGs can be used directly for both data analytics and web-based visualization through HTTP Range Requests, but for large-scale use it is usually important to apply dynamic tiling and server-side styling or transformations. For this, we rely on TiTiler by Development Seed as part of our geospatial backend to enable large-scale visualizations of raster datasets.

In particular, TiTiler provides endpoints returning — among others — TileJSON documents to describe web-based map layers, with information such as XYZ URLs, zoom levels, available tiles and bounds. Those documents can be generated from either a single COG or a MosaicJSON file, using dedicated endpoints.
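As a sketch, requesting a styled TileJSON for a single COG might look like this (the TiTiler host and COG URL are hypothetical):

```python
import httpx

# Hypothetical TiTiler deployment and COG location.
TITILER = "https://titiler.example.com"
COG_URL = "https://storage.googleapis.com/example-bucket/drought_2030.tif"

resp = httpx.get(
    f"{TITILER}/cog/tilejson.json",
    params={
        "url": COG_URL,
        "rescale": "0,5",            # server-side value rescaling
        "colormap_name": "viridis",  # server-side styling
    },
)
tilejson = resp.json()
print(tilejson["tiles"])  # XYZ template URL(s) for a web map client
```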

For vector data, GeoParquet is our raw format of choice, as its columnar structure allows for efficient querying with modern tools like DuckDB. We also ingest it into BigQuery for data warehousing and distributed processing.
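For example, a typical DuckDB session over one of our GeoParquet files might look like the following sketch (the file path and polygon are hypothetical; reading straight from GCS would additionally require the httpfs extension and credentials):

```python
import duckdb

con = duckdb.connect()
# The spatial extension provides geometry types and predicates.
con.execute("INSTALL spatial; LOAD spatial;")

# Filter building footprints to a small bounding polygon; the GeoParquet
# geometry column is stored as WKB, hence ST_GeomFromWKB.
df = con.execute(
    """
    SELECT *
    FROM read_parquet('buildings.parquet')
    WHERE ST_Within(
        ST_GeomFromWKB(geometry),
        ST_GeomFromText('POLYGON((4.8 52.3, 5.0 52.3, 5.0 52.4, 4.8 52.4, 4.8 52.3))')
    )
    """
).df()
```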

We then enable large-scale visualizations by providing PMTiles archives, so that vector tiles can be streamed directly to the client from our buckets. If a client does not yet support PMTiles, it can use a specialized TiTiler extension we developed on top of the aiopmtiles library, which serves PMTiles archives through an XYZ endpoint.

Our “pmtiler” extension serves PMTiles data as Mapbox Vector Tiles (MVT) through an XYZ endpoint.
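A heavily simplified sketch of the idea, using the synchronous pmtiles package rather than aiopmtiles and a hypothetical local archive path, could look like this:

```python
from fastapi import FastAPI, HTTPException, Response
from pmtiles.reader import MmapSource, Reader

app = FastAPI()

# Hypothetical local archive; our actual extension streams from GCS.
archive = open("buildings.pmtiles", "rb")
reader = Reader(MmapSource(archive))

@app.get("/tiles/{z}/{x}/{y}.mvt")
def tile(z: int, x: int, y: int) -> Response:
    data = reader.get(z, x, y)
    if data is None:
        raise HTTPException(status_code=404, detail="Tile not found")
    # PMTiles vector archives commonly store gzip-compressed MVT tiles;
    # we pass the compression on to the client.
    return Response(
        data,
        media_type="application/vnd.mapbox-vector-tile",
        headers={"Content-Encoding": "gzip"},
    )
```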

In summary, for each dataset we have one or more raw data files per time slice, along with tiled map URLs for enhanced, large-scale visualizations, all of which should be available in the catalog.

Luckily, STAC provides all the concepts and semantics to represent the necessary resources by appropriately using Items’ Assets and Links, also leveraging dedicated extensions depending on the use cases.

Collections and Items

With the previous considerations in mind, we went on to sketch what Collections and Items would look like in the different scenarios, and we came up with a set of guidelines to follow.

While there is usually a 1-to-1 relationship between datasets and STAC Collections, there may be situations where a set of files released in the same dataset could be split into different Collections, depending on the emphasis and discoverability that need to be enabled.

An example is the NEX-GDDP dataset provided by NASA’s NCCS, which groups together heterogeneous data on storms, droughts, average temperatures, and extreme heat. In this case, we found it sensible to consider them as separate Collections, each with its own set of metadata and descriptions. We also defined a set of VIDA-specific keys for properties we generally assign to our datasets, marked with a vida:* namespace.

NEX-GDDP — Drought risk Collection in the VIDA Data Catalog. Each Drought risk map for different years and RCP scenarios is represented by a dedicated STAC Item.

We then outlined how Items within Collections would be constructed, and the number of Assets and Links they would have in different cases. Considering raster data, in most — not all — cases we have a 1-to-1 correspondence between files and Items, meaning that for each COG on GCS there should be one Item with the file’s spatial extent as its geometry. Each Item would have one Asset attached, with the related COG file URL as the href value.
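A minimal pystac sketch of such an Item (ids and URLs are hypothetical):

```python
from datetime import datetime, timezone

import pystac

item = pystac.Item(
    id="drought-risk-2030-rcp45",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]],
    },
    bbox=[-180, -90, 180, 90],
    datetime=datetime(2030, 1, 1, tzinfo=timezone.utc),
    properties={},
)

# The single Asset points at the COG file on GCS.
item.add_asset(
    "data",
    pystac.Asset(
        href="https://storage.googleapis.com/example-bucket/drought_2030_rcp45.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
```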

This leaves the open question of how to properly catalog TileJSON documents generated through TiTiler. While initially we thought of creating another Asset — as the Planetary Computer does — we then realized that the Web Map Links (WML) and Rendering extensions address exactly this situation:

  • WML specifies how to use STAC Links to describe XYZ, PMTiles, TileJSON or any other web mapping service URL, along with relevant metadata.
  • Rendering describes useful visualization parameters for Assets, like colormaps, no-data values and minimum-maximum zoom levels. These align closely with the parameters accepted by TiTiler for server-side styling, making the extension a perfect fit for our use cases.

One of the Drought risk Items, with the STAC Extensions we used for visualization and rendering.
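Building on the Item sketch above, the two extensions could be wired up roughly as follows (schema versions, URLs and rendering values are illustrative):

```python
import pystac  # continuing the Item created in the previous snippet

item.stac_extensions.extend([
    "https://stac-extensions.github.io/web-map-links/v1.2.0/schema.json",
    "https://stac-extensions.github.io/render/v1.0.0/schema.json",
])

# Rendering extension: styling parameters closely mirroring TiTiler's.
item.properties["renders"] = {
    "default": {
        "assets": ["data"],
        "rescale": [[0, 5]],
        "colormap_name": "viridis",
        "minmax_zoom": [0, 12],
    }
}

# Web Map Links extension: a TileJSON Link to the dynamic tiler.
item.add_link(pystac.Link(
    rel="tilejson",
    target="https://titiler.example.com/cog/tilejson.json?url=https://storage.googleapis.com/example-bucket/drought_2030_rcp45.tif",
    media_type="application/json",
    title="Drought risk 2030 (RCP 4.5)",
))
```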

These specifications have not yet reached the Stable maturity level, but we decided to implement them and keep up with their evolution. We also had the opportunity to make a small contribution to the Rendering extension, ensuring it leverages STAC Item properties, as per best practices.

While this is good enough for Collections where each COG covers the full dataset extent, for spatially partitioned datasets we also needed to account for MosaicJSON-related Items. In this case, we followed a similar approach where the MosaicJSON file URL is attached as an Asset, with the related TileJSON URL again exposed as a Link.

One additional challenge was making sure that a client would be able to transparently access the available Collection-wide map layers for listing and visualization, whether they were the result of mosaicking or not. Therefore, we introduced the {"vida:coverage": "collection"} property, which marks all COG-related Items when each covers the full spatial extent, or only the MosaicJSON-related Item in the case of mosaics.

Sample CQL2 JSON payload to filter for any STAC Items in the catalog, providing Collection-wide map links.
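A hedged reconstruction of such a search, issued through the filter extension against a hypothetical host:

```python
import httpx

# CQL2 JSON filter matching Items marked as Collection-wide.
payload = {
    "filter-lang": "cql2-json",
    "filter": {
        "op": "=",
        "args": [{"property": "vida:coverage"}, "collection"],
    },
}

resp = httpx.post("https://stac.example.com/search", json=payload)
items = resp.json()["features"]
```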

An alternative approach would be to use Collection-level Links or Assets, at the cost of losing cross-Collection searches of visualization layers, a capability we decided to retain. Also, MosaicJSON documents become unnecessary when using TiTiler-PgSTAC — as explained in the conclusions.

For vector data, we follow an analogous approach, where each raw GeoParquet file is exposed as an Asset of a dedicated Item. For any PMTiles archive, we also create a dedicated Item with an Asset for the raw file URL, and an additional Link with the XYZ endpoint we provide through our TiTiler extension. Any rendering information is available in the renders object, which we extend to accommodate key properties required by the Mapbox Style Specification.

Our Google-Microsoft building footprints Collection, listing Items for some of the partitioned GeoParquet files, referring to the same time slice, and the global PMTiles-related Item.
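As an illustrative sketch only — not our exact schema — a renders entry for a PMTiles Item could carry Mapbox style properties like these:

```python
# Hypothetical renders entry mixing Rendering extension conventions
# with Mapbox Style Specification paint properties.
renders = {
    "default": {
        "type": "fill",
        "paint": {
            "fill-color": "#1f77b4",
            "fill-opacity": 0.6,
        },
        "minmax_zoom": [4, 14],
    }
}
```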

Data ingestion

At this point, the only missing piece was a workflow that would accept a configuration for a given Collection, together with the files to consider when creating Items, Assets and Links, automatically implementing the guidelines described above.

We came up with a GCP Cloud Function — to be refactored into a more scalable GCP Workflow — that satisfies those requirements, following a series of standardized steps.

Simplified ingestion workflow.

It accepts a JSON configuration with basic metadata about the Collection — e.g. id, title, description, custom properties — and supports two complementary modes to derive the URLs considered for ingestion:

  • Harvesting: specifies metadata about a GCS bucket — name, filtering prefix and regexes — and collects the URLs of all available files, extracting further information from file path variables or embedded metadata. Particularly useful for very large collections of files.
  • Item specification: a more flexible approach where the client supplies a list of files to consider for ingestion, with related metadata. Works better for small collections of files, or for describing custom properties for specific Items.

Sample configuration with combined ingestion modes.

The two modes can be combined to achieve more sophisticated results — e.g. performing bulk ingestion of raw data while specifying additional properties for specific Items of interest.
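A hedged sketch of what such a combined configuration could look like (field names are illustrative, not our exact schema):

```python
config = {
    "collection": {
        "id": "drought-risk",
        "title": "Drought risk",
        "description": "Global drought risk maps for multiple RCP scenarios.",
        "properties": {"vida:category": "climate"},
    },
    # Harvesting mode: bulk-collect file URLs from a bucket prefix,
    # deriving properties from path variables via named regex groups.
    "harvesting": {
        "bucket": "example-data-bucket",
        "prefix": "drought/",
        "regex": r"drought_(?P<year>\d{4})_(?P<scenario>rcp\d+)\.tif$",
    },
    # Item specification mode: extra properties for specific files.
    "items": [
        {
            "href": "gs://example-data-bucket/drought/drought_2030_rcp45.tif",
            "properties": {"vida:coverage": "collection"},
        }
    ],
}
```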

The pipeline uses pystac to describe Collections and Items, and then hits the transactional endpoints provided by stac-fastapi-pgstac to store them as jsonb blobs in a Cloud SQL for PostgreSQL instance.

Data access

As already mentioned, Catalogs are accessible through STAC API calls, and one of the major advantages of the specification is that it allows standardized clients to consume data.

STAC Browser lets users consult catalogs through a rich web interface, and we also rely on two great libraries for programmatic access in our Python-based data pipelines and analytics workloads:

  • PySTAC Client: allows querying the catalog with standardized Python objects and methods.
  • odc-stac: turns STAC Items retrieved through PySTAC Client into Xarray objects, performing raster operations like cropping, mosaicking, resampling, and reprojection.

With a few lines of code, it is possible to go from a STAC endpoint to multi-dimensional arrays tailored for analytics, saving valuable time for Data Engineers and Scientists.
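For instance, a hedged end-to-end sketch against a hypothetical endpoint:

```python
import odc.stac
import pystac_client

# Hypothetical internal STAC endpoint and collection id.
catalog = pystac_client.Client.open("https://stac.example.com")
search = catalog.search(
    collections=["drought-risk"],
    bbox=[4.7, 52.2, 5.1, 52.5],
    datetime="2030-01-01/2030-12-31",
)
items = search.item_collection()

# Lazily load the matching COG Assets into an Xarray Dataset,
# chunked for out-of-core processing with Dask.
ds = odc.stac.load(items, bands=["data"], chunks={"x": 2048, "y": 2048})
```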

Biodiversity intactness analysis with odc-stac, which reads Assets and performs chunked processing over large rasters in a few lines of code. Learn more in its Example Notebooks.

Conclusion and next steps

Our STAC Data Catalog currently contains over 25 Collections, aggregating 50+ data layers. Those are just a subset of the 70+ layers currently used in the risk assessments feature we recently revamped for VIDA Workspace users, not counting several others we make available depending on use cases and needs.

We are currently working on increasing the number of available Collections and improving metadata quality in terms of thumbnails, descriptions and summaries.

Another significant enhancement to our geospatial backend is the adoption of TiTiler-PgSTAC as our default raster tiler, in alignment with Development Seed’s eoAPI reference project. It will support dynamic tiling for single STAC Items or mosaics derived from Collections or any STAC Item search, eliminating the need for MosaicJSON documents and related custom properties. Additionally, it will offer built-in statistics endpoints that will simplify many of our data analysis workflows.

While the catalog is currently accessed only internally, we plan to expose part of our STAC metadata through the VIDA platform, with authentication and authorization policies ensuring controlled access for our users based on their subscription plan.

Finally, we want to acknowledge the amazing work carried out by the STAC community, and by all the organizations and developers that make Cloud-Native Geospatial technologies a reality. Their specifications and tooling help us immensely every day, supporting VIDA in its mission to use geospatial data for good and to make the greatest impact on infrastructure risk assessments, ESG targets and sustainability initiatives.
