Creating a global data set: using serverless applications and deep learning
One of the roles of the Data Engineering team here at the UKHO is to take models and other deployable artefacts from the Data Science team, suitably deploy them, and make data sets with them.
One such model was a U-Net, trained to detect mangrove forests. Thomas Redfern, a Data Scientist at the UKHO has written a blog where he explains how the Data Science team produced a geo-generalised model. In the following post I will explain how we took this model, created a pipeline, and produced a global data set of mangrove forest and swamp cover.
Our compute provider of choice for this pipeline was Amazon Web Services (AWS). The strongest deciding factor for this was the availability of the European Space Agency’s Sentinel 2 (S2) imagery, which is available as an open data set, detailed here. The data set gives almost complete global coverage at 10m resolution. Having the data set sitting right beside our compute, without having to store it ourselves, was a winner for us.
The Data Science team had already done some investigations into finding where mangrove forests were and where they were likely to be found. The most up-to-date global data set they found was Global Mangrove Watch (GMW), and this was used as the baseline for our efforts to find mangrove. Using GMW and other sources we were able to identify approximately 1600 S2 tiles (around 16,000,000 km² of imagery) which covered the areas in the data set. Each S2 image is the exact size of an MGRS square, and the product ID of each image contains its corresponding square; this allowed us to create a list of squares around the globe that needed predicting.
To predict on all of these images would take hundreds, if not thousands, of hours without a massively scalable prediction pipeline. However, we had the whole arsenal of AWS services available to us and with that in mind, we started work on the first version of our pipeline.
The Pipeline (V.1)
As you can see from that lovely diagram, our pipeline consists of 7 Lambda functions which interact with external data sources, AWS’ S3, DynamoDB, and SNS. You’ll also notice that we use Lambda layers (we’ll cover these later) and Tensorflow.
To quickly summarise the functions, their uses are:
- find-images: Grabs all the S2 images for a given MGRS square and returns the most recent product ID with the lowest % of cloud
- retrieve-dem: SRTM tiles are 1° x 1°, which means that we need more than one to cover an MGRS square; this will get us the correct tiles, cookie-cut out the area we need, and save it as a GeoTIFF
- retrieve-tile: Grabs all of the S2 bands that our model requires for the given ID and saves them to S3 as GeoTIFFs
- chip-tile: Stacks the tiles retrieved from the previous two functions into one image. This image is then chipped up into 256*256px chips (the size our model predicts on), and those that pass filtering are saved to S3 as .npy files
- model-predict: Takes the .npy files, predicts on them and then saves those predictions as .npy files in S3
- combine-predictions: Takes the prediction .npy files, merges them into one S2 tile sized array and saves this as a GeoTIFF to S3
- publish-global: Takes the GeoTIFF and reprojects it into WGS84, so that all the predictions our internal customers receive are in the same co-ordinate reference system
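To give a flavour of the chip-tile step, here is a minimal sketch of slicing a stacked tile into fixed-size chips with NumPy. The function and variable names are my own for illustration, not the pipeline's actual code, and it skips the filtering step:

```python
import numpy as np

def chip(stacked, chip_size=256):
    """Slice a (height, width, bands) array into non-overlapping
    chip_size x chip_size chips, dropping any ragged edge."""
    h, w = stacked.shape[:2]
    chips = []
    for row in range(0, h - chip_size + 1, chip_size):
        for col in range(0, w - chip_size + 1, chip_size):
            chips.append(stacked[row:row + chip_size, col:col + chip_size])
    return chips

# A fake 3-band tile of 512 x 768 px yields a 2 x 3 grid of 256 px chips
tile = np.zeros((512, 768, 3), dtype=np.float32)
print(len(chip(tile)))  # 6
```

In the real pipeline each chip would then be checked against the filtering criteria (elevation range and so on) before being written to S3 as a .npy file.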
AWS Lambda is a powerful serverless compute platform; if you just want to use bog standard Python libraries then you’ll have no worries!
Unfortunately (not really but anyway), we deal with geospatial data AND machine learning. These areas both bring in a whole network of libraries and dependencies that can be painful to install and very very big.
Ironically, deep learning was the easiest hurdle to get over… 🤔
After doing some research we came across this repository, which offered a whole selection of publicly available Lambda layers with Tensorflow to point at. We found the one that suited our needs: tf_1_11_keras:2 and off we went! We loaded in the geo-generalised U-Net and out came predictions.
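For anyone wondering what wiring a public layer in actually looks like, here is a rough sketch of the relevant fragment of a serverless config. The ARN below is a placeholder, not the real one; the correct ARN for your region and the tf_1_11_keras:2 version is listed in that repository:

```yaml
# serverless.yml fragment (sketch only) - substitute the real
# tf_1_11_keras:2 layer ARN for your region from the repository.
functions:
  model-predict:
    handler: handler.predict
    layers:
      - arn:aws:lambda:<region>:<account>:layer:tf_1_11_keras:2
```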
Deep learning in a Lambda function? Completed it. 🧙♂️
Geospatial stuff is a bit trickier. You need GDAL (Geospatial Data Abstraction Library 🌍) to do near enough anything. GDAL is great, but it needs to be built for the environment it’s running in, and it is pretty big. Luckily, someone had already done the hard work of building GDAL against the Lambda runtime and trimming it down. Shout out to RemotePixel for that great help.
Based on the layer they provided, we were able to create our own layer using Docker. We installed the additional packages we required (like Rasterio and Shapely), bundled in some of our own helper functions, and pushed that image up to DockerHub. If you’re in town looking for a geospatial Lambda layer, we’ve got one right here.*
With the functions, dependencies, and known areas in place, we started to produce our first edition of the Global Mangrove Forest and Swamp Cover data set.
We produced some statistics to help us monitor our coverage whilst we were generating the data set. We based our measure of ‘completeness’ for a MGRS square on the following equation:
% completeness = (pP/tP) * 100
Where pP is the number of chips predicted on in the MGRS square and tP is the target number of predicted chips in the square (those that meet criteria such as being in the correct elevation range for mangrove etc.). We then averaged these figures over areas of the globe for regional stats. From that, here’s what we got out of 2 runs for the ‘entire’ globe:
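As a worked example of that equation, here is a small sketch of the per-square figure and the regional average (the numbers are made up, not our actual counts):

```python
def completeness(predicted_chips, target_chips):
    """% completeness for one MGRS square: chips predicted on (pP)
    over the target number of chips (tP) that meet the filtering
    criteria, as a percentage."""
    return 100.0 * predicted_chips / target_chips

# (pP, tP) for three hypothetical squares in one region
squares = [(900, 1000), (450, 500), (820, 1000)]

# Regional figure: average the per-square percentages
regional = sum(completeness(p, t) for p, t in squares) / len(squares)
print(round(regional, 2))  # 87.33
```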
North America (Inc the Caribbean): 89.46%, South America: 86.16%, Africa: 90.37%, Asia: 90.00%, and Australasia: 91.08%.
So for 2 runs of the pipeline against every MGRS square we’d identified as likely to contain mangrove (all 1600 of them) we were nearly at 90% coverage for our data set! Pretty swish!
But ‘wait’ you say! How come you haven’t got 100%? — Well unfortunately for us, clouds exist and they do insist on blocking optical imagery’s view of the ground! Luckily clouds do move and by performing more runs on areas that have a lower coverage, we can bump it up as we patch in areas of lower cloud.
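The patching idea can be sketched in a few lines of NumPy. This assumes cloud-masked areas carry a nodata marker (the marker value here is my assumption, not the pipeline's):

```python
import numpy as np

NODATA = -1  # assumed marker for chips lost to cloud

def patch(first_run, second_run):
    """Fill gaps in the first run's predictions with values from a
    later run wherever the first run saw only cloud."""
    return np.where(first_run == NODATA, second_run, first_run)

run1 = np.array([1, NODATA, 0, NODATA])
run2 = np.array([1, 0, 0, 1])
print(patch(run1, run2).tolist())  # [1, 0, 0, 1]
```

Each extra run over a low-coverage square can only fill gaps, so the completeness figures climb towards 100% as the clouds move on.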
What does that look like on a map? Glad you asked.
But ‘wait’ you say! (again) Those don’t look like mangroves! Well yes, I guess you’re not wrong… let’s take a closer look.
So that our customers internally could enjoy these mangroves, we packaged the data set up in two different forms. The first was a VRT file (GDAL Virtual Format), which is essentially a big piece of XML that can be loaded into QGIS and loads the imagery required as the user pans around. The second was an ArcMap project containing an ArcGIS geodatabase of around 3GB, for users who need the data to be portable and available without internet access.
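For the curious, a VRT really is just XML pointing at the underlying rasters. Here is a heavily trimmed illustration (the tile file name is made up, and a real VRT carries georeferencing elements omitted here):

```xml
<!-- Minimal VRT sketch; a real one lists every prediction tile -->
<VRTDataset rasterXSize="10980" rasterYSize="10980">
  <VRTRasterBand dataType="Byte" band="1">
    <SimpleSource>
      <SourceFilename relativeToVRT="1">predictions/16PEC.tif</SourceFilename>
      <SourceBand>1</SourceBand>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
```

QGIS reads this file and only pulls in the referenced GeoTIFFs as the user pans over them, which keeps the data set lightweight for connected users.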
So, what’s next?
For the past few months we’ve been developing the pipeline from a mangrove-specific implementation into a more generic pipeline. The thinking behind this is that we can plug and play different models from our Data Science Team and create global data sets of… well, basically anything. Ice? Ports? Puffins?! If you can make a model that expects an array of chips all 256*256px, the pipeline will accommodate it!
We’ve now moved on from ‘daisy-chaining’ Lambdas with SNS to orchestration with AWS Step Functions. We’re finding this gives us greater control over pipeline flow logic, and it’s easier to monitor progress through a run.
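In Step Functions the chain becomes an explicit state machine definition. A rough sketch of how the first two steps might be chained in Amazon States Language (the ARNs are placeholders, and the real machine has states for every function plus error handling):

```json
{
  "Comment": "Sketch: chaining the first pipeline steps",
  "StartAt": "FindImages",
  "States": {
    "FindImages": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:<region>:<account>:function:find-images",
      "Next": "RetrieveTile"
    },
    "RetrieveTile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:<region>:<account>:function:retrieve-tile",
      "End": true
    }
  }
}
```

Because the whole flow lives in one definition, the Step Functions console shows exactly where each run is, rather than us chasing SNS messages between functions.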
We hope this has given you a good insight into what we’re doing here in Data Engineering at the UKHO. Make sure to watch this space for more of our geospatial projects and work that other teams around the business are doing.💖🚢🌍
*At the time of writing, I found that RemotePixel had been busy bees 🐝 and developed a more refined geospatial Lambda layer which was similar to our own rolled one, you can find it here.