Regional Training Data are Essential for Building Accurate Machine Learning Models

Hamed Alemohammad
Radiant Earth Insights
Mar 30, 2020

The idea that familiarity with Las Vegas in the United States would be enough to navigate the streets of Khartoum, Sudan, sounds far-fetched. Yet many machine learning models use Western-centric training data to predict features from satellite imagery in places as culturally and economically diverse as Bangladesh, Uganda, or Honduras.

Advances in computer vision and machine learning (ML) are improving our ability to accurately extract insights from frequent, high-resolution satellite imagery, shedding light on global development and progress toward the Sustainable Development Goals. While these advances, along with the increased availability of high-capacity computational resources, result in improved models, the lack of geographic diversity in training datasets significantly limits where these models can be applied.

Research conducted by Radiant Earth Foundation staff, including former intern Yoni Nachmany (now with Mapbox), showed that the types, surfaces, and arrangements of roads are highly heterogeneous across geographies. Regionally curated training data are therefore needed to improve the accuracy of deep learning models that predict roads from satellite imagery worldwide.

Global imagery with skewed training data leads to inaccurate results

Representativeness of training data is a pervasive problem in Artificial Intelligence (AI) beyond satellite imagery applications. ImageNet, the visual database that is used as a benchmark for training image classification models, is Americentric. As a result, classification models generate biased results — a photograph of a woman in a traditional white wedding dress is classified as a ‘bride’, while a North Indian bride is labeled as ‘performance art’. Although algorithms can be adjusted to be more adaptive, the underlying training data is key to reducing bias.

Using Western standards to define classes in training datasets has consistently been found to lead to biased labels and model predictions. To circumvent this issue, one must first recognize existing biases and then actively work to increase the diversity of training datasets. This approach is particularly important when working with satellite imagery.

Satellite imagery is a powerful tool for monitoring environmental and development indicators: its coverage is consistent, global, and, for some applications, very affordable (many agricultural applications rely on openly available satellite imagery). Meanwhile, ML models can detect patterns and features in that imagery, such as roads, buildings, and farm boundaries, that yield practical insights for policymaking.

While imagery is available for the most vulnerable parts of the world, training data (a.k.a. labeled data) for building ML models typically exist only for the developed world. This shortcoming limits the adoption of these data and modeling techniques by those interested in addressing global challenges.

To quantify the impact of regional training data on the accuracy of models for road detection, Radiant Earth's research team designed a set of experiments in Khartoum, Sudan and Kumasi, Ghana, using SpaceNet training datasets in Las Vegas, USA, and Khartoum, Sudan. The team used the Raster Vision library in Python to train and evaluate the models in each experiment.
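Raster Vision wraps chipping, training, and evaluation in experiment configurations, and its API has changed across releases, so the snippet below does not reproduce it. Instead, here is a minimal, hypothetical sketch of the core preprocessing step any such pipeline performs: burning vector road labels into a binary mask aligned with the imagery and cutting fixed-size chips. The file paths, buffer width, and 300-pixel chip size (matching the experiments described later) are illustrative assumptions.

```python
import numpy as np
import rasterio
from rasterio.features import rasterize
import geopandas as gpd

CHIP = 300  # chip size in pixels; matches the experiments described below

def make_chips(image_path, roads_path):
    """Rasterize road vectors onto the image grid and cut CHIP x CHIP chips."""
    with rasterio.open(image_path) as src:
        img = src.read()  # (bands, H, W)
        roads = gpd.read_file(roads_path).to_crs(src.crs)
        # Burn road centerlines, buffered to approximate road width, into a
        # binary mask. The 2.0 buffer is in CRS units and purely illustrative.
        mask = rasterize(
            [(geom.buffer(2.0), 1) for geom in roads.geometry],
            out_shape=(src.height, src.width),
            transform=src.transform,
            fill=0,
            dtype="uint8",
        )
    chips, masks = [], []
    _, h, w = img.shape
    for i in range(0, h - CHIP + 1, CHIP):
        for j in range(0, w - CHIP + 1, CHIP):
            chips.append(img[:, i:i + CHIP, j:j + CHIP])
            masks.append(mask[i:i + CHIP, j:j + CHIP])
    return np.stack(chips), np.stack(masks)
```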

Regional labeled data is essential

The first study compared the effectiveness of using the Las Vegas training data versus the Khartoum training data to classify roads in Khartoum. Figure 1 shows qualitative results of predicting roads in Khartoum using models trained on each dataset. For this experiment, a model with the MobileNetV2 architecture was used. The results show that the model trained in Las Vegas completely fails to predict roads in Khartoum, despite its relative success in Las Vegas (not shown here), likely due to differences in road types and road network layout between the two cities. The model trained in Khartoum, on the other hand, was reasonably successful in predicting roads, indicating a significant advantage to using locally generated training data (for more details on this experiment, see our CVPR 2019 paper).

Figure 1 — Prediction results in Khartoum for three different scenes. Top: input imagery. Middle: prediction results (shaded) from the model trained in Las Vegas overlaid with labels (red lines) on top of the input imagery. Bottom: prediction results (shaded) from the model trained in Khartoum overlaid with labels (red lines) on top of the input imagery.
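For readers who want a concrete picture of what this cross-city comparison involves, below is a hedged PyTorch sketch, not the exact model from the paper: a MobileNetV2 encoder with a minimal segmentation head, plus the step of applying a model trained on one city to chips from another. The checkpoint name vegas_model.pt and the chip tensors are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class RoadSegNet(nn.Module):
    """MobileNetV2 encoder with a minimal 1x1-conv head for binary road masks."""
    def __init__(self):
        super().__init__()
        self.encoder = mobilenet_v2().features  # stride-32 feature extractor
        self.head = nn.Conv2d(1280, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.head(self.encoder(x))
        # Upsample the coarse logits back to the input resolution
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)

@torch.no_grad()
def predict_roads(model, chips, thresh=0.5):
    """Binary road masks for a batch of (N, 3, 300, 300) image chips."""
    model.eval()
    return (torch.sigmoid(model(chips)) > thresh).squeeze(1)

# Hypothetical cross-region usage: a model trained on Las Vegas chips,
# applied to Khartoum chips.
# model = RoadSegNet()
# model.load_state_dict(torch.load("vegas_model.pt"))
# khartoum_pred = predict_roads(model, khartoum_chips)
```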

What about transfer learning or crowdsourced labels?

While this experiment shows that regional training data are essential for building an accurate road detection model, such data are not always available. For example, to the best of our knowledge, there are no expert-labeled road detection data for Kumasi. However, Kumasi has many road labels on OpenStreetMap (OSM) contributed by crowdsourced users. Crowdsourcing is an effective way to generate a large number of labels, but they may not match the quality of expert labels such as SpaceNet's. To assess the feasibility of transfer learning from Khartoum or of using crowdsourced OSM labels, the team designed and evaluated a series of models. For all models, the chip size was set to 300 x 300 pixels, the training/validation split was 80/20, and the architecture was MobileNetV2. The specifics of each model follow:

  1. Khartoum Model - This model was trained using SpaceNet data in Khartoum with hyperparameter tuning, which yielded a learning rate of 1.0E-3. We then used the same learning rate for all the other models;
  2. Kumasi Model - This model used DigitalGlobe WorldView-3 imagery as input and labels from OSM in Kumasi; and
  3. Khartoum Model retrained in Kumasi - This model was the Khartoum Model fine-tuned on OSM labels in Kumasi for 10K steps (compared with 100K training steps for the other two models); a sketch of this fine-tuning step follows the list.
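Model 3 above is plain transfer learning: initialize from the Khartoum checkpoint and continue training on the OSM-labeled Kumasi chips for a tenth of the steps. Here is a minimal sketch, reusing the hypothetical RoadSegNet from the earlier snippet; the 1e-3 learning rate and 10K steps follow the setup described here, while the optimizer, batch size, and binary cross-entropy loss are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, kumasi_chips, kumasi_masks, steps=10_000, lr=1e-3):
    """Fine-tune a Khartoum-pretrained model on Kumasi OSM labels."""
    loader = DataLoader(TensorDataset(kumasi_chips, kumasi_masks),
                        batch_size=8, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.BCEWithLogitsLoss()  # assumed loss; the post doesn't specify
    model.train()
    step = 0
    while step < steps:
        for chips, masks in loader:
            opt.zero_grad()
            loss = loss_fn(model(chips).squeeze(1), masks.float())
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model

# Hypothetical usage, with khartoum_model.pt as the pretrained checkpoint:
# model = RoadSegNet()
# model.load_state_dict(torch.load("khartoum_model.pt"))
# model = finetune(model, kumasi_chips, kumasi_masks)
```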

To compare the performance of these models, the team labeled a section of the Kumasi region to validate the models' predictions. The labeled area is about 13 sq. km centered on 6°41'30" N and 1°41'23" W. These labels were used only for validation and include 5,406,942 road pixels and 50,627,010 background pixels. Figure 2 shows a qualitative comparison of the three models' predictions vs. the validation labels. As Figure 2 (b) shows, the Khartoum model has a hard time detecting the road network.

Figure 2 — (a) Labels generated by experts for validation. (b) Predictions from the Khartoum Model. (c) Predictions from Kumasi Model. (d) Predictions from Khartoum Model retrained in Kumasi with 10K steps.

Table 1 also shows four accuracy metrics (F1, IoU, precision, and recall) for each model. These scores are calculated as an average over all pixels as well as for each class (road and background). All three models have high scores for the background class, which is unsurprising given that background pixels outnumber road pixels roughly nine to one in the validation set. However, the Khartoum model has relatively low scores for the road class (other than precision). The Kumasi model and the Khartoum model retrained on Kumasi labels have relatively close scores, except for precision and recall: the Kumasi model has higher recall (0.7513) but lower precision (0.5662) for the road class, while the retrained Khartoum model has higher precision (0.6363) and lower recall (0.5921).

Table 1 — Accuracy results for each model presented for Average (all classes), Road class and Background class.
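All four scores in Table 1 are standard pixel-level metrics derived from confusion counts between the predicted and reference masks. As a quick reference, here is how they are computed for a single class (the table's averaging scheme is not spelled out in this post, so treat any averaging as an implementation choice):

```python
import numpy as np

def class_metrics(pred, target, cls):
    """Precision, recall, F1, and IoU for one class from pixel masks."""
    p, t = (pred == cls), (target == cls)
    tp = np.sum(p & t)   # pixels correctly predicted as cls
    fp = np.sum(p & ~t)  # pixels wrongly predicted as cls
    fn = np.sum(~p & t)  # cls pixels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

# e.g. road-class scores: class_metrics(pred_mask, ref_mask, cls=1)
```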

These results underscore the importance of local training data for building accurate road detection models from satellite imagery. Moreover, the crowdsourced OSM labels in Kumasi proved promising, demonstrating the value of these open data (with the caveat that not all parts of the world are labeled with the same quality and completeness as Kumasi).

Designing unbiased models

With the growing availability of satellite imagery and its applications, it is necessary to ensure the geographical representativeness of the benchmark training datasets. This is key to building unbiased models for regional and global applications.

Radiant Earth's research has shown the significance of regionally appropriate training data and models. Moving forward, as a community, we need to address training data and model biases more proactively by sharing best practices and by investing in the development and sharing of diverse training datasets. Otherwise, the best ML advancements will remain inaccessible and unusable for some of the world's most vulnerable populations.

Acknowledgment: This research was partially funded by a grant from Schmidt Futures to Radiant Earth Foundation.
