A Guide for Collecting and Sharing Ground Reference Data for Machine Learning Applications
Five criteria for making ground reference data ready for Earth Observation machine learning models
Machine learning (ML) applications for Earth observation (EO) can draw on existing data collected through surveys, whether for empirical research in the applied sciences or for socio-economic analysis. However, those survey data are likely to be incomplete. Many ML applications in EO require ground reference data: accurate observations of some property on the ground that can serve as a label or description of what a potential overhead image¹ represents.
It is, of course, possible to label images remotely using online platforms like OpenStreetMap, but image annotation has significant limitations. For example, one cannot identify crop type or crop yield by looking at an image; that requires collecting reference data on the ground. Ground data collection, however, is expensive and complicated, so it is critical to leverage existing data collection efforts, such as surveys that others have completed for their own research purposes. For this to work, survey questionnaires must capture the right data along with the corresponding metadata.
Leveraging Existing Data
Radiant Earth Foundation is grateful to its partners, such as PlantVillage from Penn State University, who provided the data used to build its African crop type training datasets, now available on Radiant MLHub. The team paired these data, along with some from other partners, with satellite imagery from the European Space Agency’s Sentinel-2 mission to create crop type training data for essential crops in Kenya, Tanzania, and Uganda. While reviewing potential datasets for this project, Radiant Earth learned crucial lessons about the limitations of data collection. Issues such as overlapping crop fields, imprecise classifications, or missing dates and times for the records make it hard, if not impossible, to use these data for ML purposes.
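To make these three issues concrete, here is a minimal Python sketch of how survey records might be screened before use. Everything in it is an assumption made for illustration: the record fields (`id`, `crop`, `observed_on`, `bbox`), the `VAGUE_LABELS` set, and the helper functions are hypothetical and are not taken from Radiant Earth's tooling or from any partner dataset. Field overlap is approximated with bounding boxes; real field boundaries are polygons and would call for a geometry library.

```python
from datetime import datetime

# Hypothetical labels treated as too vague for crop type classification.
VAGUE_LABELS = {"", "crop", "mixed", "other", "unknown"}


def bboxes_overlap(a, b):
    """Return True if two (min_lon, min_lat, max_lon, max_lat) boxes share area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def has_valid_date(value):
    """Return True if value parses as an ISO 8601 date or datetime string."""
    if not value:
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def check_records(records):
    """Return a list of (record id, issue) tuples for the given field records."""
    issues = []
    # Per-record checks: observation date and crop label.
    for rec in records:
        if not has_valid_date(rec.get("observed_on")):
            issues.append((rec["id"], "missing or invalid date"))
        if (rec.get("crop") or "").strip().lower() in VAGUE_LABELS:
            issues.append((rec["id"], "imprecise crop label"))
    # Pairwise check: flag fields whose bounding boxes overlap.
    for i, first in enumerate(records):
        for second in records[i + 1:]:
            if bboxes_overlap(first["bbox"], second["bbox"]):
                issues.append((first["id"], f"overlaps {second['id']}"))
    return issues


# Made-up example records, for illustration only.
records = [
    {"id": "f1", "crop": "maize", "observed_on": "2019-06-14",
     "bbox": (34.00, 0.50, 34.01, 0.51)},
    {"id": "f2", "crop": "mixed", "observed_on": "2019-06-15",
     "bbox": (34.005, 0.505, 34.02, 0.52)},
    {"id": "f3", "crop": "cassava", "observed_on": None,
     "bbox": (35.00, 1.00, 35.01, 1.01)},
]

for record_id, issue in check_records(records):
    print(record_id, issue)
```

In this made-up sample, f2 is flagged for its vague label, f3 for its missing date, and f1 for overlapping f2. The pairwise overlap loop is quadratic in the number of fields, which is fine for a small survey but would need a spatial index for larger ones.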
The Best Practices for Ground Reference Data Collection and Catalogue Guide emerged out of this experience. The purpose of the Guide is to encourage the community to collect and prepare their data so that it…