Discoverable and Reusable ML Workflows for Earth Observation (Part 1)
Using STAC to catalog machine learning training data.
Researchers and data scientists are increasingly combining Earth observation (EO) with ground truth data from a variety of sources to build faster, more accurate machine learning (ML) models to gain valuable insights in domains ranging from agriculture to autonomous navigation to ecosystem health monitoring. These models are integrated into analytic pipelines that generate on-the-fly predictions at scale. The accuracy of these inferences is then evaluated using well-defined validation metrics, and the results are used to improve the performance of the original model in a continuous feedback loop.
If this sounds like a complex process, that’s because it is! Ad-hoc techniques for handling these workflows may work well within a single organization, but can lead to a bewildering array of algorithms and data for end-users. There is a clear need in the machine learning for Earth observation (ML4EO) community to find common ways of cataloging ML data and models to make workflows more searchable and reproducible.
In this post, we will show how you can leverage the STAC specification and its ecosystem of extensions to address many of the needs in cataloging machine learning training data. There are also a number of efforts to develop specifications and tools for cataloging ML models and performance metrics that we will examine in a future post.
The Importance of Training Data
High-quality training datasets are a crucial requirement for building an accurate ML model, but creating these datasets can be time-consuming, expensive, and error-prone. Let’s say that you are interested in training a model to detect buildings from satellite imagery in Uganda. You will first need to find a sufficient number of high-resolution, cloud-free satellite scenes captured recently, and then pair them with accurate building polygons that are spatially and temporally aligned with the source imagery. Having spent countless hours on manual inspection, ad-hoc scripts, and verification to generate the dataset, you probably want to catalog it so you (and perhaps others) can easily discover and reuse it in the future. This can go a long way towards making your work reproducible and conserving valuable resources when you are retraining or troubleshooting your model down the road.
The SpatioTemporal Asset Catalog (STAC) specification was created to address exactly this need for EO data. STAC provides a commonly agreed-upon baseline of EO metadata that has, in turn, fostered a rich ecosystem of tools in various languages for cataloging and searching data. Resources within STAC are connected through links, making it a breeze to navigate large data collections and find related content. While the spec has been tested in a variety of production environments for a while now, the recent release of STAC v1.0.0 means that users can rely on a stable interface for indexing and discovering EO data.
This post only scratches the surface of the STAC spec. For more information on the STAC spec and its applications, please refer to this excellent series of posts on the topic by Chris Holmes and others.
Describing Labeled Data
Cataloging source imagery using STAC allows us to filter training data based on project needs. For instance, you can easily filter a STAC catalog for images with < 10% cloud cover over Kampala, Uganda from 2017 with 2 m spatial resolution or better. That’s a start, and it is a lot easier and faster than rooting through FTP folders, interpreting CSV inventories, or parsing file names to figure out when they were created. However, to fully describe an ML4EO training dataset we need to go beyond the core STAC spec.
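To make the Kampala example concrete, here is a minimal sketch of that filter applied to STAC Items held in memory as plain dictionaries. It assumes the Items carry the `eo:cloud_cover` field from the EO Extension and the common `gsd` field; the item IDs, the approximate Kampala bounding box, and the helper names are made up for illustration. In practice you would more likely send an equivalent query to a STAC API.

```python
from datetime import datetime

# Approximate (west, south, east, north) box around Kampala; illustrative only.
KAMPALA_BBOX = (32.5, 0.2, 32.7, 0.4)


def bboxes_intersect(a, b):
    """Return True if two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(item, bbox, year, max_cloud, max_gsd):
    """Check one STAC Item dict against the location, date, cloud and resolution criteria."""
    props = item["properties"]
    # STAC datetimes are RFC 3339 strings; swap the trailing "Z" so fromisoformat accepts it.
    captured = datetime.fromisoformat(props["datetime"].replace("Z", "+00:00"))
    return (
        bboxes_intersect(item["bbox"], bbox)
        and captured.year == year
        # Items missing a field are treated as failing that criterion.
        and props.get("eo:cloud_cover", 100) < max_cloud
        and props.get("gsd", float("inf")) <= max_gsd
    )


def make_item(item_id, bbox, when, cloud, gsd):
    """Build a stripped-down, hypothetical STAC Item with only the fields used here."""
    return {
        "id": item_id,
        "bbox": list(bbox),
        "properties": {"datetime": when, "eo:cloud_cover": cloud, "gsd": gsd},
    }


kampala = (32.55, 0.25, 32.65, 0.35)
items = [
    make_item("scene-a", kampala, "2017-06-01T08:30:00Z", 4, 0.5),
    make_item("scene-b", kampala, "2017-07-12T08:30:00Z", 35, 0.5),  # too cloudy
    make_item("scene-c", kampala, "2019-01-20T08:30:00Z", 2, 0.5),   # wrong year
    make_item("scene-d", kampala, "2017-03-03T08:30:00Z", 1, 10.0),  # too coarse
    make_item("scene-e", (36.7, -1.4, 36.9, -1.2), "2017-06-01T08:30:00Z", 1, 0.5),  # elsewhere
]

selected = [
    item["id"]
    for item in items
    if matches(item, KAMPALA_BBOX, year=2017, max_cloud=10, max_gsd=2.0)
]
print(selected)
```

Only the first scene satisfies all four criteria; each of the others fails exactly one of them.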
The core spec aims to be a lightweight specification covering the most commonly used EO metadata fields. However, it does not cover the full array of metadata needed to describe all EO and ML datasets. STAC Extensions give us a way of providing more detailed metadata for specific applications, including data sources (e.g., SAR, electro-optical, etc.), storage formats, or spatial projections, to name just a few. In particular, the Label Extension and ML AOI Extension have been developed to enable a more thorough description of ML4EO training datasets and workflows.
Developed through a collaborative effort between Radiant Earth Foundation, Azavea, Development Seed and others, the STAC Label Extension allows us to describe ML labels and their relationship to source imagery. In addition to providing fields to describe label formats, classes, and statistical summaries of the label contents, it also defines a mechanism for linking those label files to the source imagery to which they can be applied. Combined with the core STAC metadata and other extensions, this provides a powerful way of describing geospatial ML training data in a discoverable and interoperable way.
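As a rough illustration, a label Item for the building example might look like the dictionary below. The `label:*` field names and the `source` link relation follow the Label Extension as we understand it; the schema URL and version, IDs, paths, asset keys, and class values are assumptions for this sketch and should be checked against the extension repository before use.

```python
import json

# Hypothetical label Item describing building footprints for one source scene.
label_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    # Assumed schema URL/version; confirm against the Label Extension repository.
    "stac_extensions": ["https://stac-extensions.github.io/label/v1.0.0/schema.json"],
    "id": "kampala-buildings-labels",
    "bbox": [32.55, 0.25, 32.65, 0.35],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [32.55, 0.25], [32.65, 0.25], [32.65, 0.35], [32.55, 0.35], [32.55, 0.25],
        ]],
    },
    "properties": {
        "datetime": "2017-06-01T08:30:00Z",
        "label:description": "Building footprints digitized from the source scene",
        "label:type": "vector",
        # Name of the GeoJSON property that holds the class of each feature.
        "label:properties": ["building"],
        "label:classes": [{"name": "building", "classes": ["yes", "no"]}],
        "label:tasks": ["segmentation"],
    },
    "links": [
        {
            # Points at the imagery the labels were drawn on.
            "rel": "source",
            "href": "../scenes/scene-a.json",
            "type": "application/json",
            # Which assets of the source Item the labels apply to (hypothetical key).
            "label:assets": ["visual"],
        }
    ],
    "assets": {
        "labels": {
            "href": "./labels.geojson",
            "type": "application/geo+json",
            "roles": ["data"],
        }
    },
}


def source_hrefs(item):
    """Return the hrefs of all source imagery linked from a label Item."""
    return [link["href"] for link in item["links"] if link["rel"] == "source"]


print(source_hrefs(label_item))
print(json.dumps(label_item["properties"], indent=2))
```

The `source` link is what ties the labels back to the imagery, so a tool that understands the extension can walk from a label file to the scene it describes.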
With our labels and source imagery cataloged using the core STAC spec and Label Extension, we can start feeding the data into our ML framework to train a model. In most supervised ML4EO training workflows, this will involve splitting our dataset into subsets for training and testing the model, which is exactly what the ML AOI Extension allows us to do. Documenting this train/test split is crucial in enabling the reproducibility of model results. Developed initially by Azavea and now open-sourced as a community extension, the ML AOI Extension defines a mechanism for assigning source imagery and label resources to a training, test, or validation split (“train”, “test”, or “validate”). When combined with the metadata and relationships defined in the Label Extension, this gives us the tools for fully describing a dataset used in model training.
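One way to record a split is sketched below, assuming the `ml-aoi:split` property name from the ML AOI Extension (worth confirming against the extension's README). The hash-based assignment, the 70/15/15 proportions, and the chip IDs are our own illustrative choices and are not part of the extension, which only describes how the result is recorded.

```python
import hashlib

# Split names as quoted in the ML AOI Extension.
SPLITS = ("train", "test", "validate")


def assign_split(item_id, train=0.7, test=0.15):
    """Map an Item ID to a split deterministically by hashing the ID.

    The same ID always lands in the same split, so the assignment can be
    reproduced without storing a random seed.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    if fraction < train:
        return "train"
    if fraction < train + test:
        return "test"
    return "validate"


# Hypothetical Items, reduced to the fields this sketch needs.
items = [{"id": f"chip-{n:04d}", "properties": {}} for n in range(200)]

for item in items:
    item["properties"]["ml-aoi:split"] = assign_split(item["id"])

counts = {split: 0 for split in SPLITS}
for item in items:
    counts[item["properties"]["ml-aoi:split"]] += 1
print(counts)
```

Because the split is stored on each Item, anyone retraining the model later can recover exactly which data were held out.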
Cataloging Model Predictions
In many ways, model predictions look a lot like training data: you have a source image and labels applied to that source image. The semantics may have changed (the labels now represent detections rather than ground truth), but the structure remains the same. Cataloging these predictions using STAC and the Label Extension simplifies your ML workflows by allowing you to use the same tools for your predictions and training data. Having this spatio-temporal index of your predictions also improves the experience for downstream users of your predictions. They can use it to filter national-scale results to their area of interest or easily link multiple model predictions for the same time and place to provide additional context.
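For example, if predictions are cataloged as label Items with a `source` link back to the imagery, grouping the outputs of several models by scene takes only a few lines. The item IDs and paths here are hypothetical, and the Items are trimmed to the fields the sketch needs.

```python
from collections import defaultdict


def source_href(item):
    """Return the href of the first source-imagery link, or None if there is none."""
    for link in item["links"]:
        if link["rel"] == "source":
            return link["href"]
    return None


# Hypothetical prediction Items from two models.
predictions = [
    {"id": "model-a-pred-1", "links": [{"rel": "source", "href": "scenes/scene-a.json"}]},
    {"id": "model-b-pred-1", "links": [{"rel": "source", "href": "scenes/scene-a.json"}]},
    {"id": "model-a-pred-2", "links": [{"rel": "source", "href": "scenes/scene-b.json"}]},
]

# Collect prediction IDs under the scene they were generated from.
by_scene = defaultdict(list)
for item in predictions:
    by_scene[source_href(item)].append(item["id"])

for scene, ids in by_scene.items():
    print(scene, ids)
```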
The core STAC specification, combined with the Label and ML AOI Extensions, goes a long way towards cataloging ML4EO workflows in a searchable, discoverable way. But more work needs to be done. For training data, the Label Extension currently supports only a limited set of tasks (regression, detection, classification, and segmentation). It also does not provide a clear way of describing the semantic meaning of values in a raster label file. The ML AOI Extension can only be applied to supervised learning scenarios and may need more work to fully describe how training data are used in training a model. There are also no well-established standards for cataloging ML models themselves, nor are there standard ways of describing the performance of ML4EO models or capturing potential bias.
This is where you come in!
If you are an ML practitioner working with EO data, we encourage you to try out the tools described in this article, ask questions, and provide feedback so they can continue to evolve. Join us on Gitter to get involved in the STAC conversation, or contribute directly to the STAC core spec, Label Extension, and ML AOI Extension repositories.
In a follow-up post, we will dive deeper into some of the work that we are doing to develop a Geospatial Machine Learning Model Catalog (GMLMC) specification to describe ML4EO models in a searchable way. Our dream is for this to be a community-driven standard informed by real-world use-cases. We encourage ML practitioners of all backgrounds to join the discussion and contribute to that spec in any way they can.
Click here to read Part 2.