Creating a Machine Learning Commons for Global Development
Advances in sensor technology, cloud computing, and machine learning (ML) continue to converge to accelerate innovation in the field of remote sensing. However, fundamental tools and technologies still need to be developed to drive further breakthroughs and to ensure that the Global Development Community (GDC) reaps the same benefits that the commercial marketplace is experiencing. This process requires us to take a collaborative approach.
Data collaborative innovation, in which a group of actors from different data domains work together toward common goals, might hold the key to solving some of the global challenges the world faces. That is why Radiant.Earth is investing in new technologies such as Cloud Optimized GeoTIFFs, the SpatioTemporal Asset Catalog (STAC), and ML. Our approach to advancing ML for global development begins with creating open libraries of labeled images and algorithms. This initiative and others require a data collaborative approach and, in fact, will thrive as a result of it.
“Data is only as valuable as the decisions it enables.”
This quote by Ion Stoica, professor of computer science at the University of California, Berkeley, may best describe the challenge facing those of us who work with geospatial information:
How can we extract greater insights and value from the unending tsunami of data that is before us, allowing for more informed and timely decision making?
Part of the response to this challenge is in advancing technological capabilities such as ML — training computers to rapidly sift through large amounts of data and identify specific characteristics. Another part — perhaps not quite as exciting and sometimes overlooked, but just as critical — is establishing a diverse and collaborative community devoted to creating and openly sharing these game-changing solutions.
Radiant.Earth embraces a vision of leveraging ML technology and growing a collaborative community of experts to advance global development.
Our vision emerges from the reality that an ever-growing number of Earth observation (EO) satellites are producing an unparalleled amount of data at various spatial, temporal, and spectral resolutions. ML is essential to analyzing this unprecedented amount of data.
Classical physics-based or process-based analysis techniques are neither designed nor optimized to run in near-real time and extract information from large amounts of satellite imagery. ML models, however, have characteristics that make them well suited to such applications. Moreover, ML can be used to find patterns and anomalies in observations in ways that would be very hard, if not impossible, to achieve with process-based techniques.
An additional feature of ML models is that they have a rapid development cycle. This enables developers and scientists to test and customize different model configurations in a relatively short amount of time and find the best performing algorithm for each specific problem.
ML algorithms use observations to explore patterns, find anomalies, and, more generally, provide new insights such as predictions of future events. For example, supervised ML techniques use training data (a set of input observations and known results) to learn complex relationships between input observations and the output variable(s) of interest. Unsupervised ML models, in turn, use training data to detect clusters of data with similar properties. Training data is therefore a building block of ML techniques. In the case of remote sensing data, input observations are spectral information captured by satellites from around the globe on a regular basis. Variables of interest include land cover type, deforested area, number of cars in a parking lot, burnt area after a wildfire, oil volume in storage tanks, crop yield, and soil moisture, among others.
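To make the supervised case concrete, here is a minimal sketch of a nearest-centroid classifier written in plain Python. It is not a Radiant.Earth tool or method; the two-band "spectra", the class names, and the function names are all invented for illustration. It only shows the idea of learning a relationship between input observations and known results, then applying it to a new observation.

```python
from math import dist
from statistics import mean

# Toy training data: (red reflectance, near-infrared reflectance) -> label.
# All values are invented for illustration, not real satellite measurements.
training = [
    ((0.05, 0.45), "vegetation"),
    ((0.07, 0.50), "vegetation"),
    ((0.04, 0.03), "water"),
    ((0.05, 0.02), "water"),
    ((0.25, 0.30), "bare soil"),
    ((0.28, 0.33), "bare soil"),
]


def fit_centroids(samples):
    """Learn one mean spectrum (centroid) per label from labeled samples."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(band) for band in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest to the observation."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(training)
prediction = predict(centroids, (0.06, 0.48))  # a new, unlabeled observation
```

A real model would use many more bands, samples, and a more capable algorithm, but the dependence on labeled training data is the same.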
What is good training data?
ML algorithms learn from the training data to which they are exposed and can then generate outputs for future observations. However, if the training data is not accurate or not representative of all possible scenarios, ML models may not provide acceptable outputs.
Training data needs to accurately capture (or, in statistical terms, sample) the wide range of possible outcomes in both space and time. For example, a training dataset for land cover classification should include all the different land cover classes and their temporal variations that appear around the globe (e.g., images of cropland at the beginning of the growing season are different from those of the same land close to harvest time). Moreover, there needs to be sufficient diversity in the imagery of each class; otherwise, ML model outputs will be biased.
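One simple way to check this kind of coverage is to count how many labeled samples exist for every combination of class and season. The sketch below assumes a hypothetical record layout (dictionaries with "label" and "season" keys) and a hypothetical helper name, coverage_report; neither comes from an existing library.

```python
from collections import Counter
from itertools import product


def coverage_report(samples, classes, seasons):
    """Return (missing class/season combinations, sample count per class)."""
    counts = Counter((s["label"], s["season"]) for s in samples)
    missing = [combo for combo in product(classes, seasons) if counts[combo] == 0]
    per_class = Counter(s["label"] for s in samples)
    return missing, per_class


# Invented example records, for illustration only.
samples = [
    {"label": "cropland", "season": "early"},
    {"label": "cropland", "season": "late"},
    {"label": "forest", "season": "early"},
]

missing, per_class = coverage_report(
    samples, classes=["cropland", "forest"], seasons=["early", "late"]
)
```

Here the report would flag that no late-season forest samples exist and that cropland has twice as many samples as forest, both of which could bias a model trained on this set.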
ML models for EO can be divided into two groups, each with its own challenges with regard to collecting training data:
- Classification models: these models need imagery and labels for objects on the Earth. Generating this training data requires people who can look at each image and label the objects or verify the labels generated by an algorithm. However, because objects seen from a space-based satellite look different from the same objects seen on the ground, users can disagree on how to label an object. Therefore, we need to develop standards for consistent classification of each object by different users.
- Regression models: these models estimate variables that have a continuous value (e.g., crop yield and soil moisture). Generating training data for these models requires ground measurements of these variables. However, those measurements will inherently contain some uncertainty due to measurement device error and/or human error. Moreover, these ground-based “point measurements” are not representative of the spatial scale of a satellite measurement. Therefore, it is necessary to develop standards and best practices to mitigate these issues.
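Both issues in the list above can be quantified before any standard is agreed on. The sketch below is an illustration under stated assumptions, not a proposed standard: it uses Cohen's kappa to measure how far two labelers agree beyond chance, and it summarizes several ground point measurements that fall inside one satellite pixel as a mean and a spread. The function names and all values are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same objects."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labelers assigned labels independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pixel_summary(point_values):
    """Mean and sample standard deviation of ground points within one pixel."""
    return mean(point_values), stdev(point_values)


# Invented labels from two hypothetical annotators for the same six objects.
annotator_1 = ["crop", "crop", "forest", "water", "crop", "forest"]
annotator_2 = ["crop", "forest", "forest", "water", "crop", "crop"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Invented soil moisture readings (volumetric fraction) inside one pixel.
pixel_mean, pixel_spread = pixel_summary([0.21, 0.25, 0.23])
```

A low kappa suggests the labeling guidelines are ambiguous, and a large spread within a pixel suggests that a single point measurement would be a poor stand-in for the satellite-scale value.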
These challenges call for a collaborative community effort to build new training data and standards to enhance applications of ML on EO data.
Radiant.Earth, as a neutral organization that invests in collaborative solutions for EO, is forming a new initiative to foster ML applications using remote sensing observations specifically to benefit the GDC.
MLHub.Earth: An open source library for machine learning applications of Earth observation data
Radiant.Earth is proud to announce its new MLHub.Earth initiative, which is focused on promoting the creation of new open source training data, models, and standards for applications of ML to EO data. MLHub.Earth is a library for hosting and reusing ML tools focused on EO.
Radiant.Earth will enable the community to develop training data, models, and standards for ML applications.
The Radiant.Earth vision includes a community of open source developers and citizen scientists who actively contribute to MLHub.Earth and enable the broader community to harness the power of ML and EO to tackle global challenges. Moreover, Radiant.Earth will invest in empowering the GDC to tap into this knowledge hub, contribute to its tools, and use them in conjunction with Radiant.Earth’s cloud platform to address the Sustainable Development Goals (SDGs) and other key targets.
By hosting workshops, sponsoring hackathons, and convening technical meetings, Radiant.Earth will enable the community to develop new training data, models, and standards for ML applications of EO data. We will also invest in developing new tools, both internally and in collaboration with other institutions. Meanwhile, as an open source platform, MLHub.Earth is available to the global community, and we encourage all interested parties to join this initiative and contribute their knowledge and expertise.
As this important work advances, we will keep the community informed via this blog and the Radiant.Earth website.