Mangroves are a group of salt-tolerant tree species that grow along the coast in the tropical and subtropical regions of the world. They are of interest for a variety of reasons, including their effect on coastal navigation, their role as a natural sea defence, and their ability to capture carbon very effectively.
I work at the UK Hydrographic Office, where we specialise in the marine environment and create a variety of global data-sets to support marine and maritime organisations. We also use our data to create navigational charts, used by ships to navigate the coast. The charts include a wide range of information, such as water depth, coastline, hazardous features, and seabed type. Mangrove is a key element of the coastline because it significantly changes how the coast can be accessed.
We are continuously looking at ways to improve our data-sets and take advantage of the rapidly growing source data available to us. One source that is becoming more abundant is satellite data; however, due to the large size of the data and the vast regions we cover, it is not feasible to process this data manually. We therefore use machine learning to process it into vector outputs.
In the past, we have created machine learning pipelines which have had great value in processing large volumes of previously unmanageable data, such as our object detection pipeline, which identifies safety-critical offshore infrastructure.
The problem has been that getting these pipelines into live service can be more painful and slow than we would like, with multiple rewrites of the code and fiddling with the infrastructure. In this project, mapping regions of mangrove globally, we set out to try a Serverless architecture to see whether it would reduce some of these pain points and be suitable for this type of geo-deep learning.
To train a deep learning classifier, it is first necessary to obtain labelled training data. Our products already include areas of mangrove, but the environment is constantly changing, and the current data are often geared towards navigation, which can mean they are generalised. To create more, less generalised, labelled data we used hierarchical clustering, a form of unsupervised learning. With the help of our remote sensing analysts we identified the cluster that denotes mangrove.
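As a rough illustration of this step, the sketch below runs hierarchical (Ward) clustering over pixel spectra using SciPy. The spectra here are synthetic stand-ins for three land-cover types, not our actual Sentinel-2 data, and the band values are invented for the example; in practice an analyst inspects the resulting clusters to decide which one denotes mangrove.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic pixel spectra: rows = pixels, columns = spectral bands.
# The three groups stand in for distinct land-cover types (values invented).
rng = np.random.default_rng(0)
water = rng.normal(0.1, 0.02, size=(50, 4))
vegetation = rng.normal(0.4, 0.02, size=(50, 4))
bare_ground = rng.normal(0.8, 0.02, size=(50, 4))
pixels = np.vstack([water, vegetation, bare_ground])

# Build the cluster hierarchy, then cut it into three flat clusters.
Z = linkage(pixels, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
```

Each pixel receives a cluster label; labelling the "mangrove" cluster then yields masks that can serve as training data for the supervised model.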
A U-Net segmentation model, built with Keras and TensorFlow, was then developed and trained against this new data, as well as some detailed hand-labelled data. The model worked well in identifying the regions of mangrove.
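To give a feel for the architecture, here is a minimal U-Net sketch in Keras. It is a toy version, assuming 64 × 64 chips with four bands and only two encoder/decoder levels; the actual model's depth, input size, and hyperparameters are not described here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(64, 64, 4), base_filters=16):
    """Minimal U-Net: two encoder blocks, a bottleneck, two decoder blocks."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolutions followed by downsampling.
    c1 = layers.Conv2D(base_filters, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(base_filters * 2, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck.
    b = layers.Conv2D(base_filters * 4, 3, activation="relu", padding="same")(p2)

    # Decoder: upsampling with skip connections to the encoder features.
    u1 = layers.Concatenate()([layers.UpSampling2D()(b), c2])
    c3 = layers.Conv2D(base_filters * 2, 3, activation="relu", padding="same")(u1)
    u2 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
    c4 = layers.Conv2D(base_filters, 3, activation="relu", padding="same")(u2)

    # Per-pixel mangrove probability.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The skip connections are what make U-Net well suited to segmentation: fine spatial detail from the encoder is carried directly into the decoder, so the per-pixel output keeps sharp boundaries.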
Once the model was defined we set about trying to determine the best way to get it into live service. Part of this work was to determine whether this type of processing would work well on a Serverless architecture. To test this we used AWS Lambda and the AWS-hosted Sentinel-2 optical satellite data. We used a static HTML file hosted on S3 to allow users to select the region of interest, which then triggers a pipeline of Lambda functions.
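As a hypothetical sketch of the entry point to such a pipeline, the handler below validates a user-selected bounding box before any downstream functions run. The event shape (a JSON body with a `bbox` field) is an assumption for illustration, not the actual interface of our pipeline.

```python
import json

def handler(event, context):
    """Hypothetical entry Lambda: validate the user's region of interest.

    Expects a JSON body like {"bbox": [min_lon, min_lat, max_lon, max_lat]}.
    """
    body = json.loads(event.get("body", "{}"))
    bbox = body.get("bbox")
    if not bbox or len(bbox) != 4 or bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid bbox"})}
    # In a real pipeline this is where further Lambdas would be fanned out
    # (e.g. one chipping/prediction invocation per intersecting Sentinel-2 tile).
    return {"statusCode": 200, "body": json.dumps({"bbox": bbox})}
```

Keeping validation in its own small function up front means the heavier chipping and prediction Lambdas only ever receive well-formed regions.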
The pipeline takes Sentinel-2 tiles of the selected region and processes them into chips ready for prediction. To reduce the processing overhead, chips were filtered to areas of low cloud cover, and areas where mangrove definitely cannot occur, such as well inland, were discarded. Once filtered, prediction takes place on the chips, which are then recombined into a single raster image stored on S3. An example region, showing the chips and the mangrove prediction in black, is the first image above.
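The chip-and-recombine step can be sketched with plain NumPy as below. This is a deliberate simplification, assuming non-overlapping chips and ignoring the georeferencing that rasterio handles in the real pipeline; the function names are illustrative only.

```python
import numpy as np

def make_chips(image, chip_size):
    """Split a (H, W, C) image into non-overlapping chips keyed by offset."""
    h, w = image.shape[:2]
    chips = {}
    for r in range(0, h - chip_size + 1, chip_size):
        for c in range(0, w - chip_size + 1, chip_size):
            chips[(r, c)] = image[r:r + chip_size, c:c + chip_size]
    return chips

def recombine(masks, shape, chip_size):
    """Stitch per-chip prediction masks back into one raster of `shape`."""
    mosaic = np.zeros(shape, dtype=np.float32)
    for (r, c), mask in masks.items():
        mosaic[r:r + chip_size, c:c + chip_size] = mask
    return mosaic
```

Keying each chip by its pixel offset is what makes the final stitch trivial: each predicted mask is written straight back to the position it was cut from.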
There are many benefits to Serverless, including ease of management and cost, but there are also challenges around memory and run-time restrictions. The deployment package size limit constrains how large your libraries can be; we used Python with libraries such as rasterio and GDAL, and loading the trained model for prediction also required work due to the size of the TensorFlow library. There are also function execution-time limits, which can be hit fairly easily with this type of data, as well as run-time memory limits, which we also hit initially.
A number of iterations were required for us to come in under these constraints; definitely enough to warrant another, more technical blog to come! The Lambda pipeline takes 5 minutes to run over a region of 100 x 100 km (the area covered by a Sentinel-2 satellite image). A very large number of regions can be processed in parallel due to the high scalability of the Serverless platform. This approach is certainly something we will take forward; despite some of the initial engineering challenges, it has proven to be a great way to process our global data.