Discoverable and Reusable ML Workflows for Earth Observation (Part 2)
Describing ML Models with the Geospatial Machine Learning Model Catalog (GMLMC)
“If we, as a community, can build tools that leverage these formats to allow users to discover, query, re-use, and publish their data and models, then the proliferation of their use and artifacts in project repositories will form the commonality that is required for effective standardization.”
During the height of the COVID-19 pandemic, the government of Togo launched a program to “boost national food production in response to the COVID-19 crisis by distributing aid to farmers” [1]. To accomplish this, the government needed accurate information about the distribution of smallholder farmers throughout the country. This kind of cropland map did not exist for the country, so they worked with NASA Harvest to rapidly develop one using AI (see Kerner and Tseng et al., 2020 [1] for details). Finding enough high-resolution labeled training data to train the machine learning (ML) model was also a significant challenge, so the team combined global and local crowdsourced labels collected using the Geo-Wiki platform [2] with hand-labeled imagery in targeted areas to train a new model for predicting crop areas. Impressively, the team was able to deliver the government of Togo a cropland map that “outperforms existing publicly-available cropland maps for Togo while also being higher resolution and more recent” [1], all in under 10 days!
Reflecting on lessons learned by the NASA Harvest team, we examine how developing open specifications for cataloging models can enable and accelerate these kinds of time-sensitive ML workflows. This post explores the use of open specifications like the SpatioTemporal Asset Catalog (STAC) and the new Geospatial Machine Learning Model Catalog (GMLMC) to enable rapid model development projects and concludes with ways that you can contribute to the development of these specifications!
Building the Model
Finding high-quality training data for a given area can be a real roadblock to developing accurate models, especially when working with tight budgets and deadlines. To overcome this, the Harvest team started with an existing global reference database of “crop”/“non-crop” labels. They then used common GIS tools like QGIS and Google Earth Pro to draw labels over 2019 basemap imagery across a variety of agro-ecological zones in Togo. After initially training the model using these combined labels, they identified areas of model confusion and created additional labels for those areas to improve model performance.
In our first blog post in this series, we covered some standards and specifications that can make this kind of data discovery and analysis easier and more reproducible. Cataloging the global Geo-Wiki dataset using the STAC spec and Label Extension would allow teams like NASA Harvest to more easily discover labels in their area of interest and integrate them with other labeled data sources also cataloged using STAC. Using tools like Azavea’s GroundWork, which exports labels along with a STAC catalog, also makes it easy to create new labeled data and metadata that is already in a searchable format. Additionally, describing the train/test/validation split using the STAC ML AOI Extension makes reproducing the model training environment that much easier.
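To make this concrete, here is a minimal sketch of what a STAC Item using the Label Extension might look like for a crop/non-crop label dataset, written as a plain Python dict. The `label:*` field names come from the Label Extension spec; the item ID, geometry, dates, and class values are all illustrative assumptions, not the actual NASA Harvest metadata.

```python
# Illustrative STAC Item (as a plain dict) describing "crop"/"non-crop"
# labels with the STAC Label Extension. All IDs, bounds, and dates below
# are made up for the example.
label_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        # Label Extension schema (version shown here is assumed)
        "https://stac-extensions.github.io/label/v1.0.1/schema.json"
    ],
    "id": "togo-crop-labels-example",
    "bbox": [0.0, 6.0, 1.8, 11.0],  # rough bounds over Togo
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 6.0], [1.8, 6.0], [1.8, 11.0],
                         [0.0, 11.0], [0.0, 6.0]]],
    },
    "properties": {
        "datetime": "2019-06-01T00:00:00Z",
        "label:description": "Crowdsourced crop/non-crop labels",
        "label:type": "vector",
        "label:properties": ["crop"],
        "label:tasks": ["classification"],
        "label:classes": [
            {"name": "crop", "classes": ["crop", "non-crop"]}
        ],
    },
    "links": [],
    "assets": {},
}

print(label_item["properties"]["label:classes"][0]["classes"])
# → ['crop', 'non-crop']
```

Because this metadata is just JSON following a published schema, any STAC-aware search tool can index it, which is exactly what makes label discovery across projects possible.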
Rinse, Lather, Repeat
Shortly after the successful cropland mapping effort in Togo, the NASA Harvest team undertook a similar effort to create high-resolution cropland maps in Kenya (see Tseng and Kerner et al., 2020 [3]). The researchers built upon the methods and models from the Togo project to train and deploy a model in a new region quickly. But what if someone not intimately involved in the original project wanted to apply these techniques and models to a new region? Making data and code publicly available, as their team has done, is a huge start and will hopefully become standard practice. However, publishing datasets and models that are easily discoverable and usable at scale will require well-known and transparent metadata specifications.
The GMLMC establishes some specifications for describing the different parts of the ML workflow. The Model Training section provides details on the hardware, software, and data used to train the model, while the Runtime and Usage Recommendations sections describe how to use the model to generate predictions. We also plan on developing a Metrics section in the near future that will help users evaluate the performance of the model under different conditions.
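As a thought experiment, a GMLMC-style model record might be organized along those same sections. The sketch below is hypothetical: the actual field names are still being defined in the open specification, and every value shown (hardware, framework, container, inputs) is an illustrative assumption.

```python
# Hypothetical GMLMC-style model record. Field names and values are
# illustrative only -- the real spec is still under development.
model_record = {
    "id": "togo-cropland-v1",
    "training": {  # ~ the "Model Training" section
        "hardware": {"gpu": "example-gpu", "gpu_count": 1},
        "software": {"framework": "pytorch", "version": "1.6"},
        "data": [
            "https://example.org/catalog/geo-wiki-labels",
            "https://example.org/catalog/togo-hand-labels",
        ],
    },
    "runtime": {  # ~ the "Runtime" section
        "container": "docker://example/cropland-model:latest",
        "entrypoint": "python predict.py",
    },
    "usage_recommendations": {  # ~ the "Usage Recommendations" section
        "input": "Satellite time series over the area of interest",
        "output": "Per-pixel crop probability raster",
    },
}

print(sorted(model_record))
```

The point is not these particular fields but the separation of concerns: a consumer deciding whether a model fits their problem reads one section, while a consumer actually running it reads another.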
The speed with which the NASA Harvest team developed and delivered the cropland models is remarkable, but can we make it routine? The scientific community has already taken an important step by embracing the FAIR (Findable, Accessible, Interoperable, Reusable) Principles. We now face an engineering question of how best to put these principles into practice for the ML4EO (machine learning for Earth observation) domain.
The NASA Harvest team made their data and code publicly available for use in other efforts of this kind. What is still missing is a widely adopted, machine-readable model metadata format like GMLMC, which would allow future model builders to search a central location like Radiant Earth MLHub and discover models and their training data by description, type, and geographic area of applicability.
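To illustrate what "discover models by geographic area of applicability" could mean in practice, here is a minimal sketch of a client-side search over model records, assuming each record carries a `[xmin, ymin, xmax, ymax]` bounding box and a task list. The record IDs, bounding boxes, and the `search_models` helper are all hypothetical, not part of any existing MLHub API.

```python
def bbox_intersects(a, b):
    """True if two [xmin, ymin, xmax, ymax] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search_models(records, bbox, task=None):
    """Filter model records by area of applicability and, optionally, task."""
    return [
        r for r in records
        if bbox_intersects(r["bbox"], bbox)
        and (task is None or task in r["tasks"])
    ]

# A tiny hypothetical catalog of model records:
catalog = [
    {"id": "togo-cropland-v1", "tasks": ["segmentation"],
     "bbox": [-0.2, 6.1, 1.8, 11.1]},   # roughly Togo
    {"id": "kenya-cropland-v1", "tasks": ["segmentation"],
     "bbox": [33.9, -4.7, 41.9, 5.5]},  # roughly Kenya
]

# A user working in western Kenya finds the Kenya model:
hits = search_models(catalog, bbox=[34.0, -1.0, 36.0, 1.0],
                     task="segmentation")
print([m["id"] for m in hits])  # → ['kenya-cropland-v1']
```

A real registry would push this filtering server-side, but the principle is the same: once applicability is encoded as machine-readable metadata, "which models can I reuse here?" becomes a query rather than a literature search.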
Users will not be swayed by promises of interoperability unless those promises are backed by action. If we, as a community, can build tools that leverage these formats to allow users to discover, query, re-use, and publish their data and models, then the proliferation of their use and artifacts in project repositories will form the commonality that is required for effective standardization.
As foundational libraries like PySTAC and the STAC spec itself stabilize and reach maturity, downstream tools like intake-stac can provide easy programmatic query and data access to STAC catalogs. These, in turn, can feed into distributed compute frameworks like Dask or be used in PyTorch data loaders directly. These tools are also being developed using a “cloud native” approach, with data munging, model training, and model inference on the cloud as core design principles. Thus, teams that adopt these tools can themselves be distributed, with training data easily accessible via cloud storage instead of hidden away on some device. Rather than trying one set of model hyperparameters, why not try one hundred in parallel? Prediction at scale is just a matter of pushing your model to a prediction service that uses these same formats. As this way of working becomes commonplace, it is not crazy to imagine cloning not only the model repository, but also the whole training environment.
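The "one hundred hyperparameter sets in parallel" idea can be sketched with nothing more than the Python standard library, as a stand-in for what Dask would do at cluster scale. The `train_and_score` function below is a toy objective, not a real training run, and the hyperparameter names and ranges are invented for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    """Stand-in for a real training run; returns (score, params).

    A toy objective whose score peaks near lr=0.01, depth=6.
    """
    lr, depth = params
    score = 1.0 - abs(lr - 0.01) * 10 - abs(depth - 6) * 0.01
    return score, params

# Sample 100 hyperparameter combinations (seeded for reproducibility):
rng = random.Random(0)
trials = [(rng.choice([0.1, 0.01, 0.001]), rng.randint(2, 10))
          for _ in range(100)]

# Evaluate all trials concurrently instead of one at a time:
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(train_and_score, trials))

best_score, best_params = max(results)
print(best_params)
```

In a real workflow each trial would pull its training chips from cloud storage via the same STAC metadata, so swapping `ThreadPoolExecutor` for a Dask cluster scales the loop without changing its shape.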
Research to operations can become just the operations. To get there, we need to develop tools in conjunction with these standards that provide time savings at every step. Here, we need your help and your insights.
We are committed to growing the community of people who discover, leverage, and improve these models. To that end, we are developing the GMLMC as an open specification on GitHub, and we encourage anyone interested to get involved. If something seems incomplete or off-base, please let us know by creating an issue. You can also use this survey to join our email list or join the RadiantMLHub Slack workspace and find the #geo-ml-model-catalog channel.
1. Hannah Kerner, Gabriel Tseng, Inbal Becker-Reshef, Catherine Nakalembe, Brian Barker, Blake Munshell, Madhava Paliyam, and Mehdi Hosseini. 2020. Rapid Response Crop Maps in Data Sparse Regions. In KDD ’20: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) Humanitarian Mapping Workshop. ACM, New York, NY, USA, 7 pages. https://arxiv.org/abs/2006.16866
2. Linda See. 2017. A global reference database of crowdsourced cropland data collected using the Geo-Wiki platform. International Institute for Applied Systems Analysis, PANGAEA. https://doi.org/10.1594/PANGAEA.873912
3. Gabriel Tseng, Hannah Kerner, Catherine Nakalembe, and Inbal Becker-Reshef. 2020. Annual and in-season mapping of cropland at field scale with sparse labels. Tackling Climate Change with Machine Learning workshop at NeurIPS ’20, December 11th, 2020. https://www.climatechange.ai/papers/neurips2020/29/paper.pdf