Scaling machine learning models across the globe: the quest for geo-generalisability in mangrove forests.
The Data Science team here at the UKHO specialises in applying the latest in machine learning and big data technologies to marine geospatial data.
One of the challenges we face across the numerous projects and data types we work with is creating machine learning models that can scale to any geographical location around the world. Typically, we may only have training labels and data from a limited number of locations, which may not represent the full set of conditions present in another part of the world. We’ve found that when a model is trained on data from one location, its performance can drop when it is applied to new parts of the world.
Our sometimes frustrating experiences of this problem have driven us to coin a new phrase, “geo-generalisability”, which we define as the desired ability of a machine learning model to achieve acceptable performance when it is applied to new areas of the world. This can be an elusive problem to fully understand and control for, as geospatial proximity or distance does not necessarily imply similarity or difference in the appearance, or the environmental context, of a target feature. This can make it difficult to design training data that fully accounts for the global variation in the problem that our machine learning models need to solve.
UKHO Data Scientist Kari Dempsey recently described the work we’ve done to map the extent of mangrove forests around the world. In this blog post, I’ll set out how we went about trying to create one machine learning model that could successfully geo-generalise to identify mangrove forests within Sentinel 2 satellite imagery, regardless of geographical location. The work described here was a collaboration between the Data Science, Data Engineering and Remote Sensing teams at the UKHO.
To create a mangrove-labelled data set, our first step was to understand our feature of interest (mangrove forests) and what controls its appearance in satellite imagery. By reading relevant scientific papers, we found that mangrove forests are made up of a collection of saltwater-tolerant tree species and that, in different parts of the world, a single mangrove forest can be made up of numerous tree species. Mangrove forests also don’t exist in isolation: they sit adjacent to other forest types (such as rain forests), arid areas, urban areas and deforested areas that now contain land uses such as aquaculture and agriculture. In addition, mangroves can exist in different geomorphological settings, including at the mouths of rivers, in coastal lagoons and in low-lying areas that are under tidal influence.
Armed with this understanding, we devised a sampling strategy whereby we defined Areas of Interest (AOIs) around the world that captured as much of the possible variability in the appearance of mangrove forest as we could. We used a pre-existing data set (Global Mangrove Watch Baseline 2010) to guide the selection of AOIs. We split the world into 9 regions, and within each region we manually selected between 5 and 10 AOIs, aiming to select not only areas with mangrove forest but also mangroves in different geomorphological settings and with different adjacent land use types. In total we selected 60 different AOIs from across the world.
Typically, when a machine learning model is trained, the overall labelled data set is split; a proportion is used to train the model (the training set) and the rest is used to test the performance of the model (the testing set). This train/test split is typically something around 80/20, though there are no hard rules. Here, we faced a dilemma. On the one hand, to improve the geo-generalisability of the model, we wanted to train the model on as much data as possible. On the other, to ensure we conducted a robust test of the model’s performance, we wanted to test the model on a larger number and variety of images than it was trained on, to replicate the scenario of the model being used in production (something that breaks with convention). In the end, we settled on a train/test split of 50/50.
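As a rough illustration of a 50/50 split made at the AOI level, the sketch below assigns whole AOIs to either the training or the test set, region by region, so that chips cut from one scene never end up on both sides of the split. The post does not say how individual AOIs were assigned to each half, so the random assignment, the function name and the `(aoi_id, region)` input format are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_aois_by_region(aois, test_fraction=0.5, seed=0):
    """Split AOIs into train and test lists, one region at a time.

    `aois` is a list of (aoi_id, region) pairs (a hypothetical format).
    Whole AOIs are assigned to one side, so every chip from a scene
    stays together.
    """
    by_region = defaultdict(list)
    for aoi_id, region in aois:
        by_region[region].append(aoi_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for region in sorted(by_region):
        ids = sorted(by_region[region])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

Splitting within each region keeps every region represented in both halves, which matters when the test set is meant to mimic worldwide deployment. Regions with an odd number of AOIs will not split exactly in half.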
For all 60 of our AOIs, we downloaded imagery (Sentinel 2 bands 2, 3, 4, 5, 8A, 11 and 12). Thirty of the images were labelled into mangrove/not mangrove masks by our Remote Sensing team; we termed this the Gold Standard data set, and it made up our test set. The remaining 30 AOI images were labelled into mangrove/not mangrove masks by the data science team, and these made up our training set. We used a histogram stretching and clustering technique to speed up the data labelling process (perhaps a topic for another blog). Each AOI mask in our training and test sets was chipped into 256x256 chips (with corresponding multi-band image chips created), and for each chip we calculated a number of statistics, including the percentage of mangroves, the percentage cloud cover, the dominant land use class (e.g. vegetated, water etc.), the minimum, maximum and mean elevation, and the region of the world. We used this chip metadata to filter out chips that only contained water or that were over an elevation threshold (as mangroves exist in low-lying areas). When we then visualised the filtered chip metadata, we realised that, despite our best efforts to create a globally sampled data set, we had inadvertently introduced two sources of bias into our training data set:
1) The majority of image chips contained a low percentage of mangroves
2) Some regions of the world had more chips containing mangroves than others
These two sources of bias would need overcoming if we were to create our geo-generalisable model.
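The chipping and metadata filtering described above could look something like the following minimal sketch. It assumes the mangrove mask, a water mask and an elevation raster are already loaded as aligned NumPy arrays. The function names, the dropping of partial edge tiles, the use of minimum elevation for the filter and the 30 m threshold are all hypothetical, as the post does not state them.

```python
import numpy as np

CHIP_SIZE = 256


def iter_chips(array, size=CHIP_SIZE):
    """Yield (row, col, chip) for non-overlapping size x size tiles.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    rows, cols = array.shape[:2]
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            yield r, c, array[r:r + size, c:c + size]


def chip_metadata(mangrove_mask, water_mask, elevation, size=CHIP_SIZE):
    """Compute a few per-chip statistics from aligned boolean masks and elevation."""
    records = []
    for r, c, chip in iter_chips(mangrove_mask, size):
        window = (slice(r, r + size), slice(c, c + size))
        records.append({
            "row": r,
            "col": c,
            "pct_mangrove": 100.0 * float(chip.mean()),
            "pct_water": 100.0 * float(water_mask[window].mean()),
            "min_elevation": float(elevation[window].min()),
        })
    return records


def keep_chip(record, max_min_elevation_m=30.0):
    """Drop chips that are entirely water or that sit above the elevation threshold.

    The 30 m default is a placeholder, not the value used in the project.
    """
    all_water = record["pct_water"] >= 100.0
    too_high = record["min_elevation"] > max_min_elevation_m
    return not (all_water or too_high)
```

Keeping the statistics as plain records makes it easy to plot the distribution of `pct_mangrove` per region, which is the kind of view that exposes the two biases listed above.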
Our first step was to understand what was creating these biases in our data. After completing some more data exploration, we found that we needed to understand a bit more about mangrove forest geography. When we looked at example mangrove forests in different locations around the world, we found that the density and extent of the mangroves varied dramatically. In some areas (e.g. the Sundarbans) there are large areas of densely forested mangroves. Generating chips from a Sentinel 2 scene in this region created a large number of chips with a high percentage of mangrove forest. Conversely, when we looked in other areas (e.g. arid areas) we found that mangrove forests can have a low density, with a sparse arrangement of trees. Chipping up Sentinel 2 scenes in these types of areas therefore created only a few mangrove-containing chips, each with a low percentage of mangroves.
We considered several possible remedies to these problems, including generating more training data and devising a further sampling strategy that would have balanced the number of mangrove-containing chips from each region. In the end, we applied rotational (90, 180 and 270 degree) and flipping (vertical and horizontal) augmentation to the regions of the world with the fewest chips and applied no augmentation to the regions with the most mangrove-containing chips. The aim was to balance the representation of the different densities of mangrove within the training set without removing any training examples.
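The rotation and flip augmentation can be sketched with NumPy as below. This is an illustrative sketch rather than the project's code: the function names, the dictionary layout of a chip and the way under-represented regions are passed in are assumptions. The key point it shows is that each transform is applied identically to the image chip and its mask so the labels stay aligned with the pixels.

```python
import numpy as np


def augment_chip(image, mask):
    """Return the five rotated and flipped variants of one (image, mask) pair.

    `image` has shape (height, width, bands) and `mask` has shape
    (height, width). The same transform is applied to both.
    """
    variants = []
    for k in (1, 2, 3):  # 90, 180 and 270 degree rotations
        variants.append((np.rot90(image, k, axes=(0, 1)),
                         np.rot90(mask, k, axes=(0, 1))))
    variants.append((np.flipud(image), np.flipud(mask)))  # vertical flip
    variants.append((np.fliplr(image), np.fliplr(mask)))  # horizontal flip
    return variants


def augment_sparse_regions(chips, regions_to_augment):
    """Add augmented copies only for chips from the named regions.

    `chips` is a list of dicts with 'region', 'image' and 'mask' keys
    (a hypothetical layout). Original chips are always kept.
    """
    augmented = list(chips)
    for chip in chips:
        if chip["region"] in regions_to_augment:
            for image, mask in augment_chip(chip["image"], chip["mask"]):
                augmented.append({"region": chip["region"],
                                  "image": image, "mask": mask})
    return augmented
```

With all five variants applied, each chip from an under-represented region contributes six training examples instead of one, while chips from well-represented regions are left untouched.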
We trained a deep learning model based on the U-Net fully convolutional neural network architecture on a GPU-equipped AWS EC2 instance, using Python and Keras (with TensorFlow). Model training initially started on chips with over 10% mangrove cover and, as model performance improved, we added more and more chips with lower percentages of mangrove to the training set. We iterated through model training runs, tuning the hyperparameters, and in the end we applied early stopping (with a patience of 20 epochs), a validation split of 0.15 and learning rate reduction on plateau (with a patience of 6 epochs and a delta of 0.1), with each model run lasting a maximum of 200 epochs. We used MLflow to keep track of training experiments and eventually arrived at 4 candidate models, which varied in a) whether they contained elevation data and b) whether the input bands were centred and scaled prior to model training. We devised a model review methodology to decide which model we would use in production, conducting three independent tests:
1) A quantitative analysis of the model’s performance against the validation data set
2) A qualitative assessment of each model’s predictions on unseen test areas
3) An assessment of whether each input band to the model was used by the model to make predictions
Each of the above tests was conducted independently by separate data scientists. We held a meeting to review our findings and we unanimously chose the same model — which was good!
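For readers who want to see how the training settings quoted earlier map onto Keras, here is a hedged sketch. The patience values, validation split and epoch limit come from the post. Everything else is an assumption: in particular, the "delta of 0.1" is interpreted here as the `factor` argument of `ReduceLROnPlateau` (it could equally refer to `min_delta`), and monitoring `val_loss` is simply the Keras default rather than something the post confirms.

```python
TRAINING_SETTINGS = {
    "max_epochs": 200,
    "validation_split": 0.15,
    "early_stopping_patience": 20,
    "lr_plateau_patience": 6,
    # Assumption: the post's "delta of 0.1" is read as the LR reduction factor.
    "lr_plateau_factor": 0.1,
}


def build_callbacks(settings=TRAINING_SETTINGS):
    """Build the early stopping and learning rate callbacks for model.fit."""
    # Imported here so the settings can be inspected without TensorFlow installed.
    from tensorflow import keras

    return [
        keras.callbacks.EarlyStopping(
            monitor="val_loss",
            patience=settings["early_stopping_patience"],
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss",
            factor=settings["lr_plateau_factor"],
            patience=settings["lr_plateau_patience"],
        ),
    ]
```

These would typically be passed to training as `model.fit(images, masks, epochs=200, validation_split=0.15, callbacks=build_callbacks())`, where `model`, `images` and `masks` stand in for the compiled U-Net and the chip arrays.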
At this point we were nervous; our deadline for the first iteration of our model was looming, yet we didn’t really know how well the model would scale to the whole world. We handed our model over to our colleagues in the UKHO’s Data Engineering practice, who had been building the serverless pipeline on AWS Lambda to scale the model to the 1500 or so images required to build our new global mangrove data set, and we waited. The first results to come in were from our 30 test scenes, so we quickly started to compare the results of our model against the Gold Standard masks created by our Remote Sensing team. We computed the balanced accuracy for each Sentinel 2 scene in the test set, looked at the results from around the world and found… success!
Well… sort of. On average, our model scored a balanced accuracy of 82% against the Gold Standard data set, which is lower than we had hoped for. Importantly, though, the variation between regions was within a tolerable range: the worst performing region was the Gulf with a score of 80%, and the best performing region was the Caribbean with 90%. The model was performing at a similar level of (balanced) accuracy across the world, indicating that it had geo-generalised well.
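Balanced accuracy is the mean of the recall on each class, which makes it a sensible choice when mangrove pixels are a small minority of a scene. The sketch below shows the calculation for one predicted mask against one reference mask. The function name is hypothetical, and the choice to skip a class that is absent from the reference mask is an assumption.

```python
import numpy as np


def balanced_accuracy(predicted, reference):
    """Mean of the recall on mangrove pixels and on non-mangrove pixels.

    Both inputs are binary masks of the same shape. A class that does not
    appear in `reference` is left out of the average (an assumption).
    """
    predicted = np.asarray(predicted, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    recalls = []
    for label in (True, False):
        in_class = reference == label
        if in_class.any():
            recalls.append(float((predicted[in_class] == label).mean()))
    return sum(recalls) / len(recalls)
```

A model that predicts "not mangrove" everywhere scores 50% on any scene containing both classes, however rare mangroves are, whereas plain pixel accuracy would reward it for the imbalance.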
This now gives us a solid platform for building the next iteration of the model. Our plan is to gather feedback on where the model is performing well (and where not so well) from our customer and use this information to incorporate new training data into the next iteration of the model.
This project has given us a broader sense of what is required to create one single model that can be used to make predictions on satellite imagery from around the world. To achieve geo-generalisability in our future projects we’ll need to:
- Develop an in-depth understanding of the feature of interest: its geography, geomorphology and what factors affect its appearance in satellite imagery
- Think carefully about data sampling strategies to encapsulate as much variation as possible, iterating if needs be as we collect, label, filter and augment training data
- Design model testing strategies that reflect the manner in which a model will be deployed (e.g. not just following conventions on the train test split)
- Pull in Subject Matter Experts where available to help us understand more about the problems we’re training machine learning models to address
In conclusion, working at the UKHO presents a data scientist with a number of fantastic challenges. We have data from a number of different sources, and we create machine learning models that scale to global problems and products. We’re developing the methodological approaches required to step from proofs of concept to world-leading applications of machine learning to the pressing environmental problems of our time.