In December 2019, we publicly launched Radiant MLHub, the first open-access cloud-based repository for geospatial training datasets. Since then, we have continuously published new datasets and expanded the ecosystem around Radiant MLHub.
The idea of Radiant MLHub was born in spring 2018, after several discussions with, and feedback from, community members and funders. We had started a new project, LandCoverNet, to develop a global and geographically diverse land cover training dataset using human verification. Soon after the launch of LandCoverNet in 2018, we identified a gap in the ecosystem when it came to facilitating the publication and uptake of training datasets in our community. That gap in the data value chain led us to design and implement Radiant MLHub.
Data Value Chain
Based on the description articulated by Open Data Watch, the Data Value Chain has four major stages: Collection, Publication, Uptake, and Impact. Figure 1 shows different processes in each of these stages and highlights that the data value increases as it moves from collection to impact.
While various sectors are interested in working on the data collection stage, there is less interest in facilitating publication and uptake, particularly in providing infrastructure and standards. Data collection is usually centered on a project or, in the case of some commercial organizations, on a product. In these cases, the next step is internal use of the data and, potentially, publication of the research results or release of the end product.
To maximize the value of a dataset, philanthropic and government organizations mandate open publication of data and research results from grantees. In the case of projects that use machine learning (ML) on geospatial data, such a mandate requires infrastructure and an ecosystem that enable easy sharing of the data while giving credit and incentives to data publishers. Such infrastructure should also follow the FAIR data principles (Findable, Accessible, Interoperable, and Reusable).
Radiant MLHub was established to fill this gap. It is designed around the FAIR principles and builds on community standards such as the SpatioTemporal Asset Catalog (STAC). While Radiant MLHub is focused on the publication and uptake of geospatial training datasets, we will work closely with organizations in the data collection and data impact stages to better inform our design choices. At the same time, we will provide feedback to those organizations to ensure data interoperability across multiple data value chain stages.
Since its launch, Radiant MLHub has gathered a large, diverse user community who have used the API to search for and access ML-ready training datasets. Moreover, we have regular inbound requests to host new datasets. Building on users’ feedback, we have designed the 2021 roadmap to expand our services and enhance the usability of Radiant MLHub.
Roadmap for 2021
Enhancing user experience
Our main goal is to make data search and download from Radiant MLHub seamless for users with varying experience levels. So far, users have had to write their own code against the API to search for data and download individual items or datasets. We are now developing a Python client, compatible with the STAC API specification, that lets users work with the API in Python without writing basic API calls by hand. To our knowledge, this is the first Python client for a STAC API, and we hope that it will encourage other groups in our community to contribute to it. Look for announcements around the first release of the Python client in March 2021.
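To give a sense of the hand-written code a client would replace, here is a minimal sketch of a STAC API item search using only the Python standard library. It is not the Radiant MLHub client, and it does not reflect that client's interface. The body fields (`collections`, `bbox`, `datetime`, `limit`) follow the STAC API item search specification; the API root, the collection ID, and any authentication headers are placeholders that you would need to replace.

```python
import json
from datetime import date
from urllib import request


def build_search_body(collections, bbox=None, start=None, end=None, limit=10):
    """Build the JSON body for a STAC API item search (POST /search)."""
    body = {"collections": list(collections), "limit": limit}
    if bbox is not None:
        if len(bbox) != 4:
            raise ValueError("bbox must be [west, south, east, north]")
        body["bbox"] = list(bbox)
    if start is not None or end is not None:
        # STAC expects RFC 3339 datetimes; ".." marks an open-ended interval.
        lo = f"{start.isoformat()}T00:00:00Z" if start else ".."
        hi = f"{end.isoformat()}T23:59:59Z" if end else ".."
        body["datetime"] = f"{lo}/{hi}"
    return body


def search(api_root, body, headers=None):
    """POST the search body and return the parsed GeoJSON FeatureCollection.

    `api_root` and `headers` are assumptions: supply the root URL and any
    authentication headers required by the API you are querying.
    """
    req = request.Request(
        api_root.rstrip("/") + "/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "example-collection" is a hypothetical collection ID.
    body = build_search_body(
        ["example-collection"],
        bbox=[28.0, -5.0, 42.0, 5.0],
        start=date(2020, 1, 1),
        end=date(2020, 12, 31),
        limit=5,
    )
    print(json.dumps(body, indent=2))
    # To run the query, pass a real API root, for example:
    # items = search("https://example.com/stac", body)["features"]
```

Even this small example has to deal with request encoding, datetime formatting, and endpoint construction, which is the kind of boilerplate a dedicated client can hide.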
Defining metadata for model cataloging
Radiant MLHub is more than just a data repository. We think of Radiant MLHub as a set of commons to advance applications of ML on Earth observations. Therefore, we aim to expand its services to meet the needs of the community in this respect. One such need is a library of existing ML models that users can easily find and put into practice, either for inference or as pre-trained models.
While there are some examples of such model catalogs within the ML ecosystem, they do not support metadata related to geospatial ML models. For example, one might be interested in models that detect surface water at a specific spatial resolution, or in a model trained on data from a certain geographical region. Therefore, we are developing a geospatial ML model catalog in which users can 1) register and publicly publish their models and 2) search for existing models using various query parameters. This catalog will require a standard definition for model metadata, which we will develop in consultation with various groups across all community sectors.
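As an illustration of the kind of query such a catalog could support, the sketch below filters model records by task, spatial resolution, and training region. Every name here (`ModelRecord`, its fields, `search_models`, and the example entries) is hypothetical and chosen only for this example; the actual metadata standard is still to be defined with the community.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north) in degrees


@dataclass
class ModelRecord:
    """Hypothetical metadata for one registered geospatial ML model."""
    model_id: str
    task: str                # e.g. "surface-water"
    resolution_m: float      # spatial resolution of the input imagery, in meters
    training_bbox: BBox      # extent of the data the model was trained on
    tags: List[str] = field(default_factory=list)


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True if two boxes overlap (boxes crossing the antimeridian are not handled)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search_models(
    catalog: List[ModelRecord],
    task: Optional[str] = None,
    max_resolution_m: Optional[float] = None,
    region: Optional[BBox] = None,
) -> List[ModelRecord]:
    """Return the records that satisfy every query parameter that was given."""
    results = []
    for model in catalog:
        if task is not None and model.task != task:
            continue
        if max_resolution_m is not None and model.resolution_m > max_resolution_m:
            continue
        if region is not None and not bboxes_intersect(model.training_bbox, region):
            continue
        results.append(model)
    return results


# Made-up entries, for illustration only.
EXAMPLE_CATALOG = [
    ModelRecord("water-10m-a", "surface-water", 10.0, (28.0, -12.0, 42.0, 5.0)),
    ModelRecord("water-30m-b", "surface-water", 30.0, (68.0, 6.0, 97.0, 36.0)),
    ModelRecord("landcover-10m-c", "land-cover", 10.0, (-18.0, 4.0, 16.0, 25.0)),
]

if __name__ == "__main__":
    hits = search_models(
        EXAMPLE_CATALOG,
        task="surface-water",
        max_resolution_m=10.0,
        region=(30.0, -5.0, 40.0, 2.0),
    )
    print([m.model_id for m in hits])
```

General-purpose model catalogs typically lack fields such as `resolution_m` and `training_bbox`, which is why geospatial-specific metadata is needed before queries like this one become possible.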
Training and capacity development
One of Radiant Earth’s three pillars is education, along with increasing awareness and capacity, to inspire better use of Earth observations (EO) in addressing international development challenges. Since the launch, our initial focus has been on building the infrastructure and expanding the data catalog. In 2021, we are expanding our training and capacity development activities. We will hold our first virtual training bootcamp focused on ML for EO and organize two more competitions based on challenging new training datasets, which we will release on Radiant MLHub.
We look forward to working together to strengthen ML on EO in the coming year!