Building Data Science Capability at UKHO: our July 2021 Research Week

Cate Seale
UK Hydrographic Office
10 min read · Aug 25, 2021

This blog was co-authored by Andrew Smith, Dr Susan Nelmes, Cate Seale, Dr Thomas Redfern and Rachel Keay.

Using research weeks to push the boundaries of data science at the UKHO

Welcome to our quarterly blog post in which we share our ‘research week’ findings, as we explore the interesting bits of data science and machine learning. This month, the team look at detecting offshore infrastructure, meta-learning, modelling text topics, adding residual units to convolutional neural networks, and satellite-derived bathymetry.

Firstly, what is a research week?

Members of the Data Science team at the UKHO come up with excellent ideas for how we can apply the latest in machine learning or big data research to the marine data domain. Sometimes these ideas are outside the scope of our current project deliverables so, to push the boundaries of what is possible, every quarter we conduct a ‘research week’ — an allocated time for data scientists to experiment with new ideas, techniques and technologies.

In this blog, we outline our work, how it might tie into our ongoing projects, and how the lessons learned will help us in future.

Detecting offshore infrastructure: Cate Seale

I’ve worked on detecting uncharted hazards at sea before, using blob detection to locate different types of offshore infrastructure. After a chance meeting, I found Brian Wong and Christian Thomas at SkyTruth were working on the same thing. We bonded over oil platform design and whether wind turbines look different depending on which way they are pointing (classic icebreaker chat at a conference 😎). We also had different approaches to detecting infrastructure, so I was keen to implement the methodology from their paper.

Sentinel-1 coverage and availability for the North Sea in 2020.

The source data for both our approaches is Sentinel-1 synthetic aperture radar data from the European Space Agency. I first visualised the available imagery for the North Sea captured in 2020 (loads!).

After creating a composite image using median pixels, a pre-processing step is needed to remove the artefacts left by compositing. First, a Mean Filtered image was created by passing a circular kernel over the composite and taking the mean value within the kernel. Then, the Mean Filtered image was subtracted from the composite. This was a simple step, but it worked impressively well at removing noise.

The median composite on the left, the Mean Filtered image in the centre, and the Differenced Image (the composite minus the Mean Filtered image) on the right.
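A minimal sketch of this differencing step, assuming scikit-image and SciPy; the kernel radius here is a placeholder rather than the value used in the real pipeline:

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import disk

def difference_image(composite: np.ndarray, radius: int = 15) -> np.ndarray:
    """Subtract a circular-kernel mean filter from a median composite."""
    kernel = disk(radius).astype(float)
    kernel /= kernel.sum()                                # normalise so the convolution gives a local mean
    mean_filtered = convolve(composite.astype(float), kernel, mode="reflect")
    return composite - mean_filtered                      # bright point targets stand out from the background
```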

A threshold is applied to the cleaned image to output a binary image showing bright clusters against a black background. Erosion and dilation are then performed on the clusters to remove noise from their edges, after which their centroids are detected.

Erosion and dilation of clusters to clean cluster edges. Image Credit: https://www.sciencedirect.com/science/article/pii/S0034425719304316

My results detecting objects in the North Sea.
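Sketched with scikit-image, the detection step might look like this; the threshold and structuring-element size are placeholders:

```python
import numpy as np
from skimage.measure import label, regionprops
from skimage.morphology import binary_dilation, binary_erosion, disk

def detect_centroids(differenced: np.ndarray, threshold: float = 0.1):
    """Threshold the differenced image, clean cluster edges, and return centroids."""
    binary = differenced > threshold                      # bright clusters on a black background
    cleaned = binary_erosion(binary, disk(1))             # erode to strip noisy edge pixels
    cleaned = binary_dilation(cleaned, disk(1))           # dilate to restore cluster extent
    labelled = label(cleaned)                             # connected-component labelling
    return [region.centroid for region in regionprops(labelled)]
```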

Using this method I was able to detect clusters (and therefore suspected infrastructure locations) for the whole of my area of interest.

I’m looking forward to integrating elements of this procedure (particularly the pre-processing and cleaning steps) as we work to iteratively improve our own methodology.

Learning about learning to learn via Meta-Learning: Andrew Smith

I have come across the term ‘meta-learning’ a number of times, and its popularity seems to have grown considerably over the last few years (an opinion I have formed mainly from Twitter). Consequently, I decided to learn more about it for my research week.

One of the motivations for meta-learning is that humans often learn new tasks quickly. The aim of meta-learning is to train a model that learns how to learn, so that it can adapt quickly to new, unseen tasks. Meta-learning systems are trained on a large number of tasks (each consisting of only a small data set) and are then tested on their ability to learn new tasks. Here is one very impressive demo of meta-learning that is capable of mastering a new task after only the first episode!

Image credit: https://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/

Inspired by this example, I decided to implement this particular algorithm, called MAML (Model-Agnostic Meta-Learning). Following the original paper, I experimented with using MAML to learn sine waves. Each task consisted of training on only a few points from a sine wave generated with a different phase and amplitude. It worked surprisingly well.
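As a concrete illustration, here is a minimal sketch of the MAML training loop on the sine-wave problem, assuming PyTorch 2 (for `torch.func.functional_call`). The network size, learning rates and task ranges echo the regression set-up in the original paper, but treat the details as illustrative rather than my exact research-week code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

# Small regression network, as in the MAML sine-wave experiments
net = nn.Sequential(nn.Linear(1, 40), nn.ReLU(), nn.Linear(40, 40), nn.ReLU(), nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)
inner_lr = 0.01

def sample_task():
    """One task = a sine wave with a random amplitude and phase."""
    amp = torch.empty(1).uniform_(0.1, 5.0)
    phase = torch.empty(1).uniform_(0.0, torch.pi)
    def draw(n=10):
        x = torch.empty(n, 1).uniform_(-5.0, 5.0)
        return x, amp * torch.sin(x + phase)
    return draw

for step in range(20000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                                    # a meta-batch of four tasks
        draw = sample_task()
        x_s, y_s = draw()                                 # support set: adapt on these points
        x_q, y_q = draw()                                 # query set: judge the adaptation
        # Inner loop: one gradient step on the support set, keeping the graph
        grads = torch.autograd.grad(F.mse_loss(net(x_s), y_s),
                                    net.parameters(), create_graph=True)
        fast_weights = {name: p - inner_lr * g
                        for (name, p), g in zip(net.named_parameters(), grads)}
        # Outer loop: loss of the adapted ("fast") weights on the query set
        y_hat = functional_call(net, fast_weights, (x_q,))
        meta_loss = meta_loss + F.mse_loss(y_hat, y_q)
    meta_loss.backward()                                  # second-order gradients reach the meta-parameters
    meta_opt.step()
```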

This investigation suggested that the approach can work well for problems involving lots of tasks, each with only a small data set. At the UKHO, we have lots of similar tasks (e.g. image segmentation problems on satellite data for mapping mangroves, kelp, coastline, …) and not always lots of training data, so future work could determine how well this approach works on our more specific problems.

Topic Modelling Methods: Susan Nelmes

My research week involved looking at how topic modelling can be applied to textual data. Topic modelling is an unsupervised Natural Language Processing (NLP) technique to discover topics in a collection of documents, which can range in length from tweets to books.

For example, looking at travel guides, topic modelling would treat words that appear in many of the books, such as ‘hotel’ and ‘restaurant’, as uninformative; but if two books both contain ‘volcano’, ‘surf’ and ‘palm trees’, they are likely to be about a similar topic, and indeed a different topic to books containing ‘mountains’, ‘ski’ and ‘chocolate’. On visual inspection of the words that define these abstract topics, we might be able to apply topic labels such as ‘Hawaii’ and ‘Switzerland’.

Pre-processing

Before applying topic modelling methods, I pre-processed my data (short texts of one or two sentences). I removed digits and punctuation and transformed it to lowercase. I tokenised it (splitting it into individual words) and removed stopwords (commonly used words that don’t add topic meaning). I then lemmatised the data, reducing words to a common base form. For example, ‘am’, ‘are’ and ‘is’ would be reduced to ‘be’.
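As a sketch with NLTK (one common choice; spaCy would work equally well), the whole pipeline might look like this. Note that the verb-tagged lemmatisation pass is what maps ‘am’/‘are’/‘is’ to ‘be’:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads of the NLTK corpora used below
for resource in ("stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[\d\W]+", " ", text.lower())               # remove digits and punctuation, lowercase
    tokens = [t for t in text.split() if t not in stop_words]  # tokenise and drop stopwords
    # Lemmatise: the verb pass maps 'am'/'is' -> 'be', the noun pass maps 'hotels' -> 'hotel'
    return [lemmatizer.lemmatize(lemmatizer.lemmatize(t, pos="v"), pos="n") for t in tokens]

print(preprocess("The hotels are lovely and the restaurants were great!"))
# ['hotel', 'lovely', 'restaurant', 'great']
```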

Latent Dirichlet Allocation

The first method I tried is one of the most popular: Latent Dirichlet Allocation (LDA). This is a probabilistic model that assumes each topic has a probability distribution of words associated with it and that each document contains a mixture of topics. LDA works out, through an iterative process, which combination of topics would best produce the distribution of words observed in the documents.

To prepare my data I used scikit-learn’s CountVectorizer and then implemented LDA using LatentDirichletAllocation. I also performed a hyperparameter search to find the best values to use. Additionally, I found the visualisation package pyLDAvis fantastic for visualising my results.
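A hedged sketch of that pipeline with scikit-learn; the toy `docs` list stands in for my pre-processed texts (re-joined into strings), and the grid values are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

docs = [
    "hotel beach volcano surf palm tree",
    "mountain ski chocolate train hotel",
    "surf palm tree beach restaurant",
    "ski mountain restaurant chocolate",
]  # placeholder corpus; the real pre-processed texts would go here

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Hyperparameter search over topic count and learning decay
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    {"n_components": [2, 5, 10], "learning_decay": [0.5, 0.7, 0.9]},
    cv=2,  # small cv so the toy corpus above still runs
)
search.fit(counts)
lda = search.best_estimator_

# Inspect the top words that define each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```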

However, the resulting topics were not reflecting my data well. I came across suggestions that another method, Non-negative Matrix Factorization (NMF), could work better for smaller data sets like mine, so I tried this next.

Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF) is another widely used topic modelling method. As the name suggests, it is a matrix factorisation technique: a matrix of words by documents is factorised into two matrices, words by topics and topics by documents, with the second of these defining which topics appear in each document.

Image Credit: https://www.researchgate.net/figure/Conceptual-illustration-of-non-negative-matrix-factorization-NMF-decomposition-of-a_fig1_312157184

To prepare my data, I used the TfidfVectorizer and then implemented NMF. I found that the topics produced using this method were much more distinct and applicable to my data.
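The corresponding sketch, again with scikit-learn and the same toy `docs` as in the LDA sketch above; the component count is illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)   # same pre-processed documents as before

# Factorise the (documents x words) matrix into (documents x topics) and (topics x words)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)

terms = tfidf_vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```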

Residual U-Nets: Thomas Redfern

Following the success of our work detecting mangrove forests, the team has used the U-Net convolutional neural network architecture for image segmentation tasks. However, published research has demonstrated that adding residual units to the encoder and decoder paths may improve model performance. Therefore, in this research week I implemented a U-Net architecture inspired by that described by Zhang et al. (2019) to test whether it would speed up model training or lead to improved performance.
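For illustration, a residual unit of the kind added to the encoder and decoder paths might look like this: a generic pre-activation sketch in PyTorch, not the exact blocks from our model.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Pre-activation residual unit: BN -> ReLU -> conv, twice, plus a skip connection."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        # 1x1 convolution matches channel counts so the skip can be added to the body output
        self.skip = (nn.Conv2d(in_channels, out_channels, kernel_size=1)
                     if in_channels != out_channels else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.skip(x)

# e.g. a 4-band satellite patch passing through a first encoder stage
x = torch.randn(1, 4, 256, 256)
print(ResidualUnit(4, 64)(x).shape)   # torch.Size([1, 64, 256, 256])
```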

I added residual blocks to the encoder and decoder paths and then trained and tested the model on the same data as the mangrove project. I only trained these new models for 2 epochs, whereas our original mangrove U-Net model was trained for 200 epochs! The results are shown below:

The addition of the residual units has led to an improvement in model performance for all of the test metrics apart from Precision — and this is only after two epochs — a promising result!

In our work detecting mangroves, we created a bespoke labelled data set in which each Sentinel-2 image had a corresponding label; in other words, there was a 1:1 relationship between images and labels. After mulling over this relationship, and inspired by Andrew Ng’s discussions around data-centric AI, I decided to experiment with creating a training data set where each label corresponds to more than one image.

My tested assumption was that mangrove extent changes over time, but not quickly enough for a label captured at location x and time y to be too incorrect for location x and time z, so long as z isn’t too far from y. Adding more images would hopefully increase the variety of imagery the model is trained on, improving its ability to generalise to unseen images.

For each image-to-label pair in the original labelled data set, I added another image from a different time, so each label then corresponded to two images, both from location x but from time periods y and z. Using the top-performing architecture from my first experiment, I trained a new model on this new data set and tested it. Again, I trained for only 2 epochs.

The ‘Double Data’ model led to a significant improvement in quantitative model performance — for example, the F1 score increased from 0.72 to 0.79! This is a very exciting result that may allow us to gain greater value from our labelled data sets in the future.

Satellite-Derived Bathymetry using Random Forest Regression and Multi-Temporal Prediction Aggregation: Rachel Keay

The UKHO receives and collects a large quantity of bathymetric data. I used data from a 2018 LiDAR survey over the British Virgin Islands to research and implement satellite-derived bathymetry (SDB) using the methodology from Sagawa et al.’s 2019 paper, Satellite Derived Bathymetry using Machine Learning and Multi-Temporal Satellite Images. This approach overcomes the challenges of image variation in satellite imagery by aggregating the SDB results from each satellite image. They achieved an RMSE of 1.41m for depths of 0 to 20m, and I achieved a close result of 1.62m.

To start, I downloaded nine random European Space Agency Sentinel-2 images using an automated Python script with the sentinelsat and sentinelhub APIs. The images had less than 10% cloud cover and were captured between early 2018 and late 2019. The raw Sentinel-2 SAFE files were atmospherically corrected with the Royal Belgian Institute of Natural Sciences’ ACOLITE tool to output 10m resolution images of reflectance values for the coastal, blue, green, red and near infra-red (NIR) bands. I then used a convolutional approach defined by J. Immerkær, called Fast Noise Variance Estimation, to identify the least noisy and glint-free images to take forward for further image pre-processing and machine learning.
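Immerkær’s estimator is compact enough to sketch, assuming NumPy and SciPy; in practice it would be applied per band, and this version takes a single 2-D image:

```python
import numpy as np
from scipy.ndimage import convolve

def noise_sigma(image: np.ndarray) -> float:
    """Fast Noise Variance Estimation (Immerkaer, 1996).

    Convolves the image with a Laplacian-difference mask that suppresses
    image structure, leaving mostly noise, then scales the absolute sum
    to an estimate of the noise standard deviation.
    """
    mask = np.array([[1, -2, 1],
                     [-2, 4, -2],
                     [1, -2, 1]], dtype=float)
    height, width = image.shape
    response = convolve(image.astype(float), mask)
    return float(np.sqrt(np.pi / 2) * np.abs(response).sum()
                 / (6 * (width - 2) * (height - 2)))
```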

Following the approach in Sagawa et al, I implemented:

(1) Land, cloud, glint and boat/object masking using NIR band thresholding.

(2) Deep water masking using blue and green band thresholding.

(3) Machine learning with scikit-learn’s random forest regressor. I used a typical machine learning approach, which included splitting my image into training and testing areas, sampling and balancing my data, and grid-search cross-validation to find the best hyperparameters for tree depth, number of estimators and samples per leaf node (sketched after this list).

(4) And finally, aggregating the per-image predictions using the median depth value to produce the final SDB data and results.
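A hedged sketch of steps (3) and (4), assuming scikit-learn; the random arrays are placeholders for per-pixel band reflectances and LiDAR depths, and the hyperparameter grid is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Placeholders: rows are pixels, columns are band reflectances (coastal, blue, green, red)
X_train = rng.random((1000, 4))
y_train = rng.uniform(0.0, 20.0, 1000)        # LiDAR depths in metres, 0-20 m

# (3) Grid-search cross-validation over tree depth, estimator count and leaf size
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"max_depth": [10, 20, None], "n_estimators": [100, 300], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# (4) Predict a depth per pixel from each of the nine images, then take the pixel-wise median
per_image_pixels = [rng.random((500, 4)) for _ in range(9)]
predictions = np.stack([search.predict(pixels) for pixels in per_image_pixels])
sdb = np.median(predictions, axis=0)          # final satellite-derived bathymetry estimate
print(sdb.shape)                              # (500,) -- one aggregated depth per pixel
```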

The aggregated data achieved an average RMSE of 1.62m and an R² of 0.94. The scatter plots of LiDAR bathymetry depth points (x-axis) against SDB (y-axis) show a higher correlation between predicted and true depths shallower than 12.5m; beyond that depth, the model tends to under-predict.

The SDB image plot below shows that the training area over Virgin Gorda (the island on the right) has less salt and pepper noise in the predicted bathymetry in comparison to the testing area over East Tortola (the island on the left). This indicates that the model could improve its confidence with more training data.

Satellite-derived bathymetry results from the method under test.

My recommendations for further work include further research into and improvement of deep-water masking techniques, experimentation with other SDB algorithms (e.g. support vector regression and neural networks), and further evaluation of the model’s prediction quality at depth intervals, for accuracy reporting to users of SDB data.

And that’s all for this quarter. I hope you enjoyed this blog and that these insights into our research are interesting and useful!
