Making multi-source geospatial imagery useful for Machine Learning
Essential ingredients for ecosystem-wide geospatial analytics capability
A Tale of Two Communities
The past decade has seen an explosion of new sources of Earth observation (EO) data, which has enabled new forms of analysis driven by remote sensing scientists. We have also seen tremendous growth in the research, commercialization and enterprise adoption of deep learning-powered computer vision (CV) at scale, driven by researchers, computer scientists and computer vision practitioners. These two communities bring different expertise and have been largely decoupled. However, recent advances in deep learning-powered remote sensing for generalized, scalable, automated analysis sit at the intersection of the two communities and have initiated cross-pollination between them. Workshops and conferences in the last couple of years have brought these communities together, including DeepGlobe2018, ARD2018 and EarthVision2019.
Analysis Ready Data (ARD) holds the promise of combining critical preprocessing steps with raw sensor outputs, making data ready for analysis. However, this definition is still broad and aims to address a multitude of use cases, both human and machine learning powered. To make ARD ready for machine learning model development, we need to be able to seamlessly create training data that is statistically representative of the data on which these models will serve inference.
In this article, we will distill the wide range of topics associated with ARD to the most critical features for application-focused machine learning model development.
Addressing the Source Gap: Data Transformation and Delivery
Geospatial data with associated time metadata over specific regions of interest allows us to create spatio-temporal datasets, which the community calls data cubes. These data cubes make geospatial imagery more accessible for visual (human-centric) analysis and for automation-centric, machine learning (ML) driven analysis. Spatiotemporal inference on EO data cubes requires a few critical capabilities to make them accessible for ML applications:
- Masking: A well-defined Unusable Data Mask (UDM) that conveys the certainty of each geospatial observation at a specific time makes it clear which pixels are ready for analysis and can serve as input to model training and inference.
- Gap filling: Uncertain data points or pixels need to be gap-filled with the best approximation from neighboring pixels in space and across time. This creates a synthetic data product, transformed from the source data into a more complete form that fills in pixels masked by clouds and haze and is ready for consumption by most computer vision algorithms. Both the CubeSat Enabled Spatiotemporal Enhancement Method (CESTEM) and the Framework for Operational Radiometric Correction for Environmental Monitoring (FORCE) demonstrate the use of such techniques.
- Color space transformations: Most ML and computer vision algorithms are trained and benchmarked on datasets of 8-bit Red-Green-Blue (visual) imagery. ARD implementations often aim to preserve the imagery as atmospherically corrected, multispectral (more than three spectral bands) surface reflectance (SR) measurements. Different data providers use different techniques to transform SR into visual imagery. Data cubes that preserve SR data while carrying clearly defined SR-to-visual transformations as companion metadata will enable flexible use of ARD in ML algorithms, and could power transfer learning from open source models trained on visual imagery datasets in other domains.
- Geometric and radiometric normalization: Accurate geometric positioning, with a defined tolerance on positional accuracy, allows labels to be reused for supervised learning over time. Radiometric normalization allows us to use data from multiple sources in the same data cube (as in NASA HLS, CESTEM and FORCE).
- Grid standardization: Most computer vision and time series algorithms require a fixed format and shape for the input data and imagery. Providing a standardized way to grid imagery and match up resolutions allows downstream ML applications to directly consume ARD as training data or to serve inference.
- Data cube as a service: On-demand (or pre-computed) access to analysis ready data cubes with the above attributes makes them ready for visual or automated analysis. There are a few emerging approaches to this capability (Open Data Cube and ESA Data Cube Facility Service). The data from these cubes could be cataloged using the SpatioTemporal Asset Catalog (STAC) specification and served through a Web Coverage Service (WCS).
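To make a few of these capabilities concrete, here is a minimal NumPy sketch of masking, temporal gap filling, an SR-to-visual transformation and fixed-size gridding. It is an illustration under stated assumptions, not how CESTEM, FORCE or any data provider implements these steps. The function names, the (time, height, width, band) array layout, the blue-green-red-NIR band order and the stretch parameters (`lo`, `hi`, `gamma`) are all hypothetical choices made for this example, and simple linear interpolation in time stands in for the more sophisticated gap-filling methods cited above.

```python
import numpy as np


def gap_fill_time(cube, usable):
    """Fill unusable pixels by linear interpolation along the time axis.

    cube:   float array (T, H, W, B) of surface reflectance (assumed layout)
    usable: bool array (T, H, W), True where the UDM marks a pixel as usable
    """
    n_times, height, width, n_bands = cube.shape
    t = np.arange(n_times)
    filled = cube.astype(float)  # astype returns a copy; the input is untouched
    for i in range(height):
        for j in range(width):
            good = usable[:, i, j]
            if good.all() or not good.any():
                continue  # nothing to fill, or no usable observation to fill from
            for b in range(n_bands):
                # np.interp holds the nearest usable value beyond the ends
                filled[~good, i, j, b] = np.interp(
                    t[~good], t[good], cube[good, i, j, b]
                )
    return filled


def sr_to_visual(sr, rgb_bands=(2, 1, 0), lo=0.0, hi=0.3, gamma=1 / 2.2):
    """Map surface reflectance to 8-bit RGB with a linear stretch and gamma.

    Returns the image together with the parameters used, so the transform
    can travel with the data as companion metadata. Defaults are assumptions.
    """
    rgb = sr[..., list(rgb_bands)]
    scaled = np.clip((rgb - lo) / (hi - lo), 0.0, 1.0) ** gamma
    visual = np.round(scaled * 255).astype(np.uint8)
    params = {"rgb_bands": list(rgb_bands), "lo": lo, "hi": hi, "gamma": gamma}
    return visual, params


def to_chips(image, size):
    """Cut an (H, W, ...) image into non-overlapping size x size chips.

    Partial chips at the right and bottom edges are dropped.
    """
    height, width = image.shape[:2]
    return [
        image[r:r + size, c:c + size]
        for r in range(0, height - size + 1, size)
        for c in range(0, width - size + 1, size)
    ]


# Toy cube: 3 dates, 4x4 pixels, 4 bands, with one "cloudy" observation.
cube = np.full((3, 4, 4, 4), 0.1)
cube[1] = 0.2
cube[2] = 0.3
usable = np.ones((3, 4, 4), dtype=bool)
cube[1, 0, 0] = 0.9      # cloud-contaminated reflectance
usable[1, 0, 0] = False  # flagged as unusable by the UDM

filled = gap_fill_time(cube, usable)
visual, params = sr_to_visual(filled[1])
chips = to_chips(visual, size=2)
```

Keeping `params` next to the 8-bit output is the point of the color space bullet: the SR values stay in the cube, and anyone consuming the visual product can see exactly which stretch produced it.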
ML on ARD: Usage Considerations
Here are some desirable dimensions that would make analysis ready data usable for machine learning powered applications:
- Interoperability: Using ARD data cubes derived from multiple sources as training data can enable the use of source-agnostic models in the future. Interoperability would also enable inference on data cubes irrespective of the data source on which the model was trained. Essentially, ML-ready ARD would enable the matching of training and serving data from a radiometric and geometric perspective, independent of the sources used in the ARD.
- Discoverability: Often, the biggest challenge in creating a new supervised learning capability is access to well-curated, clean training data. Using a STAC-oriented catalog for ARD would allow the discovery and use of existing and new sources of training data. Radiant Earth has been rallying the community around a similar capability with MLHub and open training data. Existing labeled datasets on geospatial imagery have largely been sourced for open challenges, are few in number and are exclusively single source: FMoW (WorldView), SpaceNet (WorldView), EuroSAT (Sentinel-2), Ship Detection Challenge (SPOT), Amazon from Space (PlanetScope) and NIST-DSE (NEON), to name a few.
- Geodiversity: New commercial capabilities in geospatial imagery have been pushing the limits of generalized inference at planetary scale. Creating a capability that serves consistent inference over a range of land surfaces, atmospheric conditions and seasons requires capturing sufficient geodiversity in the training data. Metadata in the ARD data cubes can enable a quick understanding of the geodiversity of the dataset in order to build towards a generalized capability.
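As one way to act on the geodiversity and interoperability points above, the sketch below summarizes how training chips are distributed across a few metadata fields and lists values that appear in the serving data but never in training. The field names (`biome`, `season`, `source`), the `min_share` threshold and both function names are hypothetical; they are not part of STAC or any ARD specification, and a real cube would need its own agreed-upon metadata.

```python
from collections import Counter


def coverage_report(items, keys=("biome", "season", "source"), min_share=0.1):
    """Count items per metadata value and flag under-represented values."""
    total = len(items)
    report = {}
    for key in keys:
        counts = Counter(item[key] for item in items)
        report[key] = {
            value: {
                "count": count,
                "share": count / total,
                "under_represented": count / total < min_share,
            }
            for value, count in counts.items()
        }
    return report


def unseen_values(training, serving, keys=("biome", "season", "source")):
    """List metadata values present in serving data but absent from training."""
    return {
        key: sorted({s[key] for s in serving} - {t[key] for t in training})
        for key in keys
    }


# Toy metadata records, one per training chip (values are made up).
training = (
    [{"biome": "forest", "season": "summer", "source": "sentinel-2"}] * 6
    + [{"biome": "cropland", "season": "summer", "source": "planetscope"}] * 3
    + [{"biome": "desert", "season": "winter", "source": "sentinel-2"}] * 1
)
serving = [
    {"biome": "tundra", "season": "winter", "source": "sentinel-2"},
    {"biome": "forest", "season": "summer", "source": "planetscope"},
]

report = coverage_report(training, min_share=0.2)
gaps = unseen_values(training, serving)
```

In this toy example `report` flags desert and winter as under-represented, and `gaps` shows that tundra is requested at inference time without ever having been seen in training.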
As previously described by my colleagues Ignacio Zuleta and Chris Holmes, and associate Hamed Alemohammad, making analysis ready data operational for machine learning applications requires industry-wide collaboration and cooperation. We are collectively hosting ARD19, a cross-industry workshop on data interoperability at USGS in Menlo Park in early August, where I will be moderating a panel on ML-ready ARD. Feel free to reach out to me directly and register for ARD19 if the contents of this post resonate with your geospatial analysis needs and the technical capabilities you are currently building. In a future post, I will summarize key learnings from the workshop and provide a list of early movers and resources in this space.