A Guide for Collecting and Sharing Ground Reference Data for Machine Learning Applications

Five criteria for making ground reference data ready for Earth Observation machine learning models

Published in Radiant Earth Insights · 5 min read · Mar 30, 2020

By Yonah Bromberg Gaber, Geospatial Data Specialist, Radiant Earth Foundation

Machine learning (ML) applications for Earth observation (EO) can use existing data collected through surveys for empirical research, whether to investigate applied sciences or to conduct socio-economic analysis. However, the chances of those survey data being incomplete are high. Many ML applications for EO require ground reference data: accurate observations of some property on the ground that can be used as a label or description of what a potential overhead image¹ represents.

It is, of course, possible to label images remotely using online platforms like OpenStreetMap, but image annotation has significant limitations. For example, one cannot identify crop type or crop yield by looking at an image; that requires collecting reference data on the ground. However, ground data collection is expensive and complicated, so it is critical to leverage existing data collection efforts, such as surveys completed by others for their own research purposes. For this to occur, survey questionnaires must capture the right data with the corresponding metadata.

Leveraging Existing Data

Radiant Earth Foundation is grateful for its partners, such as PlantVillage from Penn State University, who have provided the data used to build its African crop type training datasets that are now available on Radiant MLHub. The team has been able to pair these data, and some from other partners, with satellite imagery from the European Space Agency’s Sentinel-2 mission to create crop type training data for essential crops in Kenya, Tanzania, and Uganda. Radiant Earth learned crucial lessons about the limitations of data collection while reviewing potential datasets to use in this project. Issues such as overlapping crop fields, imprecise classifications, or a missing date/time for the records make it hard, if not impossible, to use these data for ML purposes.

The Best Practices for Ground Reference Data Collection and Catalogue Guide emerged out of this experience. The purpose of the Guide is to encourage the community to collect and prepare their data so that they can also be used for ML modeling and enhance productivity across the board. The Guide is a living document, and Radiant Earth welcomes feedback from data partners, field workers, and ML developers to improve it.

Additionally, please reach out to us if you have data you’d like to share or have questions or suggestions about the Guide!

Best Practices for Ground Reference Data Collection and Catalogue Guide

Radiant Earth has identified five criteria for making ground reference data usable as labels in ML training or validation. Each criterion has both an ideal method and minimum requirements. The ideal collection technique makes the information collected highly useful, but it is not required for the data to function as labels.

Criterion 1: Is the data geographically specific?

To be used as training data, the collected data have to be matched correctly with imagery. As such, each record (i.e., data point or row) has to be geographically specific, that is, a particular data value mapped to a distinct geometry (or in the case of anonymized data, a single image without geographic data).

Ideally, for applications such as land cover or crop type classification, a GPS trace of the field boundary should be included where the limited resources of field collection allow.

At a minimum, each record needs to have a discrete and well-defined geometry that maps to a defined set of pixels. For example, polygons that don’t overlap and points with set buffers are acceptable, while overlapping polygons or points that could refer to a variable-radius buffer are not.
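The non-overlap requirement above can be screened for before any imagery is matched. The following is a minimal sketch, not a method taken from the Guide: it compares only the bounding boxes of field polygons, so it flags candidate pairs for review rather than proving an overlap. The record ids, coordinates, and function names are hypothetical, and an exact geometric test would need a geometry library.

```python
from itertools import combinations


def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_intersect(a, b):
    """True when two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def candidate_overlaps(records):
    """Return pairs of record ids whose bounding boxes intersect.

    This is a coarse screen: polygons that truly overlap always appear here,
    but a reported pair may turn out not to overlap once the exact shapes
    are compared.
    """
    boxes = {rec_id: bounding_box(ring) for rec_id, ring in records.items()}
    return [
        (id_a, id_b)
        for id_a, id_b in combinations(sorted(boxes), 2)
        if boxes_intersect(boxes[id_a], boxes[id_b])
    ]


# Hypothetical field boundaries as (longitude, latitude) vertices.
fields = {
    "field_a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "field_b": [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)],  # overlaps field_a
    "field_c": [(5.0, 5.0), (6.0, 5.0), (6.0, 6.0), (5.0, 6.0)],  # isolated
}

print(candidate_overlaps(fields))
```

Pairs returned by this screen are the ones worth checking by hand or with an exact intersection test before the records are used as labels.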

Criterion 2: Are the classes well defined and consistent?

The data need to have well-documented and consistent class definitions so users can easily develop high-quality training data without incorrect or false labels. The methods used to identify each class, or measure a value, must be included in the metadata for the dataset to be replicable.

Ideally, datasets should follow existing taxonomies and methodologies, so that datasets can be compared and/or combined. For example, the ML4GD Working Group schema is recommended for land cover classes, and FAO AGROVOC URIs would make agricultural data easily replicable.

Any other classification schema would work as long as it’s well-defined and documented. If classes are discrete, then each category should be unambiguous; if the data are continuous, measurement accuracy and precision should be provided.
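One way to check that discrete classes are used consistently is to compare every label in the records against the documented class list. The sketch below assumes a hypothetical `crop_type` field and hypothetical class definitions; it is an illustration only and does not reproduce any particular schema.

```python
# Hypothetical class definitions; a real dataset would document its own.
CLASS_DEFINITIONS = {
    "maize": "Field where maize is the only crop observed",
    "cassava": "Field where cassava is the only crop observed",
}


def undefined_labels(records, definitions, field="crop_type"):
    """Return the sorted labels used in records that have no documented definition.

    The comparison is exact on purpose: 'Maize' and 'maize' count as two
    different labels, which is the kind of inconsistency worth surfacing.
    """
    used = {record[field] for record in records}
    return sorted(used - set(definitions))


# Hypothetical survey records.
records = [
    {"id": 1, "crop_type": "maize"},
    {"id": 2, "crop_type": "Maize"},   # inconsistent capitalization
    {"id": 3, "crop_type": "mixed"},   # class never defined
    {"id": 4, "crop_type": "cassava"},
]

print(undefined_labels(records, CLASS_DEFINITIONS))
```

Any label this returns either needs a documented definition or needs to be mapped onto an existing class before the data are shared.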

Criterion 3: Are the required metadata included?

Metadata provides valuable information for deciding whether a dataset is useful for building an ML model. Moreover, metadata can be used to discover datasets through search engines and APIs. As such, it should contain high-level information about the dataset, including the spatial and temporal coverage of the data and who is responsible for it.

Radiant Earth has identified a list of metadata fields that are required for each dataset, including Date, Coordinate system, Methods, Data/Class fields, Organization/Author, Data field definitions, Data citation, and License.

Additional metadata fields may include a description of the data, the consent or rights provided for the survey, and any other additional fields.
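A check for the required fields listed above can be automated before a dataset is shared. This is a minimal sketch: the key names are hypothetical spellings of those fields, chosen for illustration, and the example metadata values are invented.

```python
# Hypothetical key names for the required metadata fields named in the text.
REQUIRED_FIELDS = [
    "date",
    "coordinate_system",
    "methods",
    "data_class_fields",
    "organization_author",
    "data_field_definitions",
    "data_citation",
    "license",
]


def missing_metadata(metadata, required=REQUIRED_FIELDS):
    """Return the required keys that are absent, None, or blank."""
    missing = []
    for key in required:
        value = metadata.get(key)
        if value is None or str(value).strip() == "":
            missing.append(key)
    return missing


# Invented example metadata with two problems.
example = {
    "date": "2019-06-15",
    "coordinate_system": "EPSG:4326",
    "methods": "Field boundaries walked with a handheld GPS",
    "data_class_fields": "crop_type",
    "organization_author": "Example Organization",
    "data_field_definitions": "",  # present but blank
    "license": "CC-BY-4.0",
    # "data_citation" is absent
}

print(missing_metadata(example))
```

An empty result means every required field is present and non-blank; it says nothing about whether the values themselves are accurate.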

Criterion 4: Is the data properly licensed?

Any dataset, including ground reference data, needs to be licensed to each user so they can use it in their application or product. To increase the impact of ground-collected data, Radiant Earth recommends that the data license be as open as possible. An open data license extends the impact of the data beyond their initial purpose and enhances innovation. The recommended open data license is Creative Commons (particularly CC-BY).

Additionally, consideration should be given to the data collector’s rights to the information collected. Data should only be shared and licensed by practitioners who have the rights and permissions to share them. Individual identities should be anonymized without changing (or distorting) the geographic location of the data.

Criterion 5: Is the data properly formatted?

Any geographic file format can be used to store the data, so long as it is documented and well-defined. Radiant Earth generally recommends the GeoJSON format for vector data, which is open, compatible with many standards, and easy to transfer and use; other formats such as Shapefile and CSV work as well.

Radiant Earth has included a sample GeoJSON file for use as a template for creating ground reference data in the guide repository.
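For readers unfamiliar with GeoJSON, the sketch below builds a single-record FeatureCollection with Python's standard `json` module. It is not the template from the guide repository: the property names (`crop_type`, `observed_on`), the record id, and the coordinates are all hypothetical.

```python
import json


def make_feature(record_id, ring, crop_type, observed_on):
    """Build a GeoJSON Feature for one field polygon.

    ring is a list of (longitude, latitude) pairs; GeoJSON requires the
    ring to be closed, so the first vertex is repeated at the end if needed.
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    return {
        "type": "Feature",
        "id": record_id,
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(point) for point in ring]],
        },
        # Hypothetical property names, for illustration only.
        "properties": {"crop_type": crop_type, "observed_on": observed_on},
    }


# Invented field boundary, listed counterclockwise as (longitude, latitude).
boundary = [(36.80, -1.30), (36.81, -1.30), (36.81, -1.29), (36.80, -1.29)]

collection = {
    "type": "FeatureCollection",
    "features": [make_feature("field_001", boundary, "maize", "2019-06-15")],
}

text = json.dumps(collection, indent=2)
print(text)
```

Keeping the class value and the observation date in `properties` next to the geometry means each record carries what is needed to match it to imagery.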

Best Practices Guideline Feedback and Discussions

Radiant Earth is inviting you to provide feedback on the Ground Reference Data Collection and Catalogue Guide on GitHub. Additionally, Radiant Earth hosted a virtual meeting on April 21, 2020, at 10:00 A.M. ET for an in-depth conversation on the Guide. You can watch the meeting and discussions here: https://bit.ly/GroundReferenceGuideWebinar

[1] Throughout this article, “image” refers to georeferenced satellite or airborne imagery unless specified otherwise.