Open Earth Observation Data in the Age of Machine Learning
The More Things Change, the More They Stay the Same.
“Raising awareness about the importance of sharing open training data is critical to building a strong foundation for this field of data science. And yet, many high-value EO training datasets remain closed and siloed for various reasons.”
Not so long ago, there was just one viable source for Earth imaging data: Landsat, a joint program of NASA and the U.S. Geological Survey whose mission dates back to 1972. Back then, civilian remote sensing was still in its infancy and commercial satellite operators were unheard of. And yet, the nascent professional remote sensing community was galvanized by the promise of what could be. Decades later, civil society has a wealth of commercial and government Earth observation (EO) data to analyze, and more is on the way thanks to a dramatic period of innovation. That innovation, fueled by the confluence of available EO data, machine learning methods, cloud computing, and an expanding data science workforce eager to create new products and solutions, will change everything. And, in some respects, nothing.
Consider machine learning. Applying machine learning techniques to EO holds great promise for the development sector. That promise can only be realized, however, with high-quality training data that is geographically diverse and encompasses all of the features and phenomena that consumers of geospatial information wish to identify, map, and monitor. The most fundamental step toward creating an environment that can drive this innovation is developing open training data libraries. Building these libraries collectively will require that training data be registered in a repository once it is labeled or collected, so that it can be discovered and shared openly. Although the object is new (machine learning training data instead of the Global Spatial Data Infrastructure), the GIS data development mantra that underlies it remains the same: collect once, use many times.
Over the past decade, there has been a constant dialogue about the importance of open data to improve government transparency and drive innovation in the global marketplace. The geospatial community, in particular, has had robust discussions about the policies, technical procedures, and underpinning economic value of open data. Raising awareness about the importance of sharing open training data is critical to building a strong foundation for this field of data science. And yet, many high-value EO training datasets remain closed and siloed for various reasons. Upon closer investigation, the failure to share this important asset owes in large part to the same institutional and individual behaviors that existed prior to the push for open GIS data.
Generally, these constraints fall into several categories:
- Technical constraints: The ability to share and register training data is still nascent, so it is incumbent on the geospatial community to spread awareness and build tools that make sharing easy, seamless, and cost-efficient.
- Policy constraints: Many governments still view open data with suspicion. This is typically beyond the control of data scientists and must be expertly negotiated by executives in the institutions funding the original data collection. Additionally, much of the ground-referenced survey work that could be exceptionally valuable to data scientists contains highly detailed, and often very sensitive, personally identifiable information.
- Financial constraints: More often than not, preparing training data for sharing on a repository is an afterthought, addressed long after the grant funds are spent.
- Competitiveness: Organizations are reluctant to share high-quality training data because it may become the basis of their competitive advantage in future projects, even when the work was originally funded by government or philanthropic grants that require the research to be open.
“The rapid acceleration of machine learning techniques and the remarkable volume of imagery that is now available on a daily basis puts the geospatial community and its customers on the cusp of an intelligence revolution.”
Another constraint is a lack of awareness within the funding community about the ways in which training data can buy down future costs and speed the pace of innovation. The geospatial community cannot afford to spend years admiring this problem. Instead, it must leverage lessons learned from the drive for open geospatial data and bring best practices forward by applying the same techniques and policies to open training data.
There are several simple policies that can be implemented in order to encourage broad compliance with developing open training datasets for EO:
- Require recipients of government and philanthropic grants to make all generated data FAIR (Findable, Accessible, Interoperable, and Reusable).
- Specifically, require a data management plan from recipients of government and philanthropic grants, and mandate that datasets resulting from those grants be shared publicly on a permanent repository to maximize their impact and downstream use. An open license such as CC BY 4.0 (or similar) should be assigned to each dataset to allow broad reuse of the data. Furthermore, datasets should be accompanied by both human-readable documentation and machine-readable metadata. The former should describe how the data was collected or produced, the quality assessment methodology, and contact information; the latter should capture temporal coverage, spatial coverage, keywords, links to the files, and other related information.
- Establish close-out procedures for government and philanthropic grants to ensure that the data has been made available on an open repository, and condition the final grant payment on publication of the open training data.
- Develop and implement simple tools to automate the creation and registration of training datasets as much as possible.
- Fund further research and development into applied methodologies to anonymize spatial data to protect privacy without losing the original value in the training dataset.
- Initiate more discussion about how to incentivize sharing of training data while creating additional forums in which to highlight successful use cases.
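To make the machine-readable metadata requirement above concrete, a minimal sketch of a registration record for a hypothetical training dataset might look like the following. All identifiers, field names, and URLs here are illustrative, loosely echoing common cataloging conventions; no specific repository or standard mandates this exact shape:

```python
import json

# Illustrative machine-readable metadata for a hypothetical EO training
# dataset, covering the fields named in the policy above: license,
# temporal coverage, spatial coverage, keywords, and links to the files.
metadata = {
    "id": "example-crop-labels-v1",          # hypothetical dataset identifier
    "title": "Example Crop-Type Training Labels",
    "license": "CC-BY-4.0",                  # open license, per the policy above
    "keywords": ["agriculture", "land cover", "training data"],
    "temporal_coverage": {"start": "2019-01-01", "end": "2019-12-31"},
    "spatial_coverage": {
        # bounding box as [min_lon, min_lat, max_lon, max_lat]
        "bbox": [33.0, -2.5, 35.0, 0.5]
    },
    "links": [
        {"rel": "data", "href": "https://example.org/data/labels.geojson"},
        {"rel": "documentation", "href": "https://example.org/docs/readme.html"},
    ],
    "contact": "data-team@example.org",
}

# A repository could check that a submitted record carries the fields a
# grant policy requires before accepting the registration.
REQUIRED = {"license", "keywords", "temporal_coverage", "spatial_coverage", "links"}
missing = REQUIRED - metadata.keys()
assert not missing, f"metadata is missing required fields: {missing}"

# The serialized JSON form is what would actually be registered.
record = json.dumps(metadata, indent=2)
```

The point of the sketch is not the particular schema but the workflow: a few required fields, validated automatically at registration time, are enough to make a dataset findable long after the grant closes.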
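On the privacy point, one widely studied family of methods is geographic masking, which displaces point locations by a bounded random offset so that individual survey sites cannot be re-identified while coarse spatial patterns remain usable for training. A minimal sketch follows; the offset radii and the flat-Earth approximation are illustrative, not a vetted privacy guarantee:

```python
import math
import random

def geomask(lat, lon, min_m=50.0, max_m=250.0, rng=random):
    """Displace a point by a random offset between min_m and max_m meters.

    Illustrative "donut" geomasking: the point always moves at least
    min_m (so it never stays at the true location) and at most max_m.
    Uses a simple equirectangular approximation, adequate for small
    offsets away from the poles.
    """
    distance = rng.uniform(min_m, max_m)       # offset length in meters
    bearing = rng.uniform(0.0, 2.0 * math.pi)  # random direction in radians
    meters_per_deg_lat = 111_320.0             # approximate, near-constant
    meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(lat))
    new_lat = lat + (distance * math.cos(bearing)) / meters_per_deg_lat
    new_lon = lon + (distance * math.sin(bearing)) / meters_per_deg_lon
    return new_lat, new_lon

# Example: mask a hypothetical survey point near the equator.
masked_lat, masked_lon = geomask(0.25, 34.5, rng=random.Random(42))
```

The design choice worth noting is the minimum displacement: a purely random offset that can be zero leaves some records unprotected, whereas the donut shape bounds re-identification risk from below while capping the spatial error from above.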
To be clear, these comments are not focused on commercial suppliers of imagery or machine learning analytical services, which have invested hundreds of millions of dollars of private capital into commercial services with clearly stated intentions: They are selling products, and much of their “secret sauce” is their training data and intellectual capital. For all other stakeholders, however, the need for open training datasets for EO is as stark as the benefits such datasets will offer.
Indeed, the rapid acceleration of machine learning techniques and the remarkable volume of imagery that is now available on a daily basis puts the geospatial community and its customers on the cusp of an intelligence revolution. Being able to truly analyze geospatial data in concert with domain experts and decision makers will support the development of robust solutions that solve the world’s most intransigent problems. But for this vision to be realized, open data principles must reign supreme in the development of training data repositories.