Building Data Science capability at UKHO: our January 2023 Research Week

Published in UK Hydrographic Office, Mar 15, 2023

This blog was co-authored by Andrew Smith and Dr Susan Nelmes.

Hello and welcome to our first data science roundup of 2023! This is the latest in our series of blog posts in which we share our quarterly ‘research week’ findings, where we explore the interesting bits of data science and machine learning. This month the team looked at topics including transformers and their application in computer vision, and the use of Gradio to get feedback on models we deploy. Here we take a deeper dive into three of the topics covered: reinforcement learning from human feedback, Python packages for analysing satellite imagery, and using the H3 indexing system with graph databases for AIS data.

Firstly, what is a research week?

Members of the Data Science team at the UK Hydrographic Office (UKHO) come up with excellent ideas for how we can apply the latest in machine learning or big data research to the marine data domain. Sometimes these ideas are outside the scope of our current project deliverables so, to push the boundaries of what is possible, every quarter we conduct a ‘research week’ — an allocated time for data scientists to experiment with new ideas, techniques and technologies.

In this blog, we outline our work, how it might tie into our future projects, and how the lessons learned can help deliver insight to UKHO business problems.

Andrew Smith: Reinforcement Learning from Human Feedback

OpenAI recently introduced ChatGPT, a system that can be interacted with in a conversational manner. You can ask it questions on a wide range of topics, as well as ask it to assist with a large variety of tasks (e.g. to draft a custom email or to write a specific Python implementation of an algorithm). Inspired by the impressive capabilities of this system, I wanted to learn more about how it works and how the model has been trained. This introduced me to the concept of Reinforcement Learning from Human Feedback (RLHF), so I spent this research week learning more about it.

The ChatGPT blog post (linked above) provides a great overview of the methods used. One of the main things I have learned regarding how RLHF works is that in addition to training a model that learns in a supervised manner (e.g. GPT-3.5 in the context of ChatGPT), a reward model is also created to score an output generated from the original model. This reward model is trained on data created via human feedback. Given these two models, a reinforcement learning update step optimises the original model by evaluating the “action” (i.e. generated output) using the reward (computed by the reward model). If you want to learn more about RLHF, I also found the following blog post/diagrams extremely helpful for this: Illustrating Reinforcement Learning from Human Feedback (RLHF).

Illustrating Reinforcement Learning from Human Feedback process flow diagram — `Lambert, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)", Hugging Face Blog, 2022.`
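To make the three ingredients concrete, here is a minimal toy sketch, not how ChatGPT itself is trained. It assumes a "model" that simply chooses between three canned outputs, a reward model that is one learned score per output fitted to pairwise human preferences, and a simple REINFORCE-style policy-gradient update standing in for the more sophisticated reinforcement learning algorithm used in practice. All function names and numbers are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw policy scores into a probability for each candidate output."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reward_model(preferences, n_outputs, lr=0.1, epochs=200):
    """Fit one scalar score per output from (preferred, rejected) pairs."""
    scores = [0.0] * n_outputs
    for _ in range(epochs):
        for win, lose in preferences:
            # Probability the reward model currently gives to the human's choice.
            p = 1.0 / (1.0 + math.exp(-(scores[win] - scores[lose])))
            # Gradient ascent on the log-likelihood of that choice.
            scores[win] += lr * (1.0 - p)
            scores[lose] -= lr * (1.0 - p)
    return scores


def rl_finetune(logits, reward_scores, steps=2000, lr=0.05, seed=0):
    """Nudge the policy towards outputs that the reward model scores highly."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # The "action" is the output the model generates.
        action = rng.choices(range(len(logits)), weights=probs)[0]
        # Reward relative to the policy's current expected reward (a baseline).
        baseline = sum(p * r for p, r in zip(probs, reward_scores))
        advantage = reward_scores[action] - baseline
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return logits


# Hypothetical human feedback: output 2 beats 0 and 1, and output 1 beats 0.
human_preferences = [(2, 0), (2, 1), (1, 0)]
reward_scores = train_reward_model(human_preferences, n_outputs=3)
tuned_logits = rl_finetune([0.0, 0.0, 0.0], reward_scores)
print(softmax(tuned_logits))
```

After fine-tuning, most of the probability mass should sit on the output that the human raters preferred, even though the policy never saw the preference data directly, only the reward model's scores.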

Here at the UKHO we have a number of text-based problems and products that could potentially benefit from RLHF, particularly those where it is hard to define an explicit loss function. I am also interested in exploring how RLHF might be used in non-textual problems, as well as how it might enable colleagues to train custom machine learning models for their needs in a no-code/low-code manner.

Dr Susan Nelmes: Python packages for satellite imagery analysis

For our January 2023 research week, I took a look into a couple of Python packages that can help to analyse satellite imagery. We have used a few tools, including Google Earth Engine, to do this in the past so I thought it would be interesting to see how the latest tools compared.

The first Python package, or rather collection of subpackages, that I looked at was eo-learn which claims it “acts as a bridge between Earth observation/Remote sensing field and Python ecosystem for data science and machine learning”. It makes use of three core building blocks: EOPatch, EOTask and EOWorkflow. EOPatch stores the data, whether time-dependent or time-independent spatial data, time-dependent or time-independent scalar data, or data in any format readable by Python packages. EOTask is an operation on an EOPatch, such as obtaining the data or calculating and applying a mask. As a user you are not limited to the EOTasks that are within the package as you can easily add your own. Finally, an EOWorkflow links together EOTasks within a pipeline that can then be run.
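The patch/task/workflow pattern can be sketched in plain Python. This is a conceptual stand-in only: the classes below (Patch, Task, Workflow, AddNdvi, MaskLowNdvi) are hypothetical and are not eo-learn's actual API, and the band values are made up. The real package works on arrays and richer workflow graphs, whereas this sketch chains tasks in a simple line over lists.

```python
class Patch(dict):
    """Stand-in for an EOPatch: a named collection of features for one area."""


class Task:
    """Stand-in for an EOTask: an operation that takes a patch and returns it."""

    def execute(self, patch):
        raise NotImplementedError


class AddNdvi(Task):
    """Custom task: derive a vegetation index from the red and near-infrared bands."""

    def execute(self, patch):
        patch["ndvi"] = [
            (nir - red) / (nir + red)
            for red, nir in zip(patch["red"], patch["nir"])
        ]
        return patch


class MaskLowNdvi(Task):
    """Custom task: flag pixels whose index is above a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def execute(self, patch):
        patch["vegetation_mask"] = [v > self.threshold for v in patch["ndvi"]]
        return patch


class Workflow:
    """Stand-in for an EOWorkflow: runs tasks one after another on a patch."""

    def __init__(self, tasks):
        self.tasks = tasks

    def execute(self, patch):
        for task in self.tasks:
            patch = task.execute(patch)
        return patch


# Three made-up pixels with red and near-infrared reflectance values.
patch = Patch(red=[0.1, 0.4, 0.3], nir=[0.5, 0.4, 0.1])
workflow = Workflow([AddNdvi(), MaskLowNdvi(threshold=0.2)])
result = workflow.execute(patch)
print(result["vegetation_mask"])
```

Adding a new processing step means writing one more small task class and adding it to the list, which is the kind of customisation the package encourages.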

Together these building blocks can be used to apply machine learning techniques to satellite imagery. For an easy-to-follow example see here, a repository which shows eo-learn being used to predict land-use-land-cover. Once a model has been trained it can be converted to a JSON script that can be run on Sentinel Hub rather than locally, decreasing processing time and overheads. Another option for scaling up is to use the eo-grow package, which builds on eo-learn.

The package seems easy to use and, importantly, customisable. I appreciated the many examples available within its documentation, making understanding and implementing eo-learn a smooth experience.

I also, more briefly, looked at TorchGeo, which is a PyTorch domain library that claims to “provide datasets, samplers, transforms, and pre-trained models specific to geospatial data.”

The package has a set of dataset classes for common geospatial and non-geospatial datasets, such as Sentinel satellite imagery. If there isn’t a class for the data that you are using, you can create a custom one. It is easy to calculate intersections and unions of datasets using just the & and | operators.
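As a rough illustration of what combining datasets with these operators means spatially, here is a toy sketch. ToyGeoDataset is a hypothetical class, not part of TorchGeo, and it tracks only a bounding box: & keeps the area covered by both datasets (as you might want when pairing imagery with labels), while | spans the extent covered by either (as you might want when pooling imagery from two sources).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGeoDataset:
    """Hypothetical stand-in for a geospatial dataset: a name and a bounding box."""

    name: str
    bounds: tuple  # (min_x, min_y, max_x, max_y)

    def __and__(self, other):
        """Intersection: only the area covered by both datasets."""
        min_x = max(self.bounds[0], other.bounds[0])
        min_y = max(self.bounds[1], other.bounds[1])
        max_x = min(self.bounds[2], other.bounds[2])
        max_y = min(self.bounds[3], other.bounds[3])
        if min_x >= max_x or min_y >= max_y:
            raise ValueError("datasets do not overlap")
        return ToyGeoDataset(
            f"({self.name} & {other.name})", (min_x, min_y, max_x, max_y)
        )

    def __or__(self, other):
        """Union: the extent spanned by either dataset."""
        return ToyGeoDataset(
            f"({self.name} | {other.name})",
            (
                min(self.bounds[0], other.bounds[0]),
                min(self.bounds[1], other.bounds[1]),
                max(self.bounds[2], other.bounds[2]),
                max(self.bounds[3], other.bounds[3]),
            ),
        )


# Made-up extents for an imagery dataset and a label dataset.
imagery = ToyGeoDataset("imagery", (0, 0, 10, 10))
labels = ToyGeoDataset("labels", (5, 5, 15, 15))
print((imagery & labels).bounds)
print((imagery | labels).bounds)
```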

The package also provides samplers to sample the data for machine learning processes, along with benchmark datasets, trainers and pre-trained models for a variety of tasks e.g. classification, object detection and segmentation.

TorchGeo seems a great addition to the PyTorch ecosystem that allows for better and more streamlined use of geospatial data while benefitting from all that PyTorch already offers.

Dr Kate Liddell: Using a graph database for AIS data continued!

This research week I continued to look at how we can use a graph database to answer questions about shipping patterns in AIS data (see previous blog). I was interested in how we could assign a port node to a stopping place automatically if we did not necessarily have geospatial boundary information for all global ports.

I looked into using Uber’s H3 hexagonal hierarchical geospatial indexing system, which allows spatial data to be joined to an indexed hexagonal grid at a range of scales. I joined vessel stopping locations, detected from AIS data using DBSCAN, to the H3 cells using the h3-py library. I then removed any stops that fell in cells that did not intersect land, so that anchorages and other out-of-port waiting were excluded. I used the cell ID to represent unique port nodes in the Neo4j graph database.
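The indexing and filtering steps above might look something like the following sketch. It assumes the stops have already been detected (the DBSCAN step is not shown) and that a set of cell IDs intersecting land is available from elsewhere. The function names, the stop dictionary layout and the land_cells input are all hypothetical. The point-to-cell lookup function in h3-py has a different name in version 3 and version 4 of the library, so the helper tries both.

```python
from collections import defaultdict


def h3_cell(lat, lng, resolution=7):
    """Return the ID of the H3 cell containing a point, using h3-py."""
    import h3  # imported here so the grouping logic below runs without h3-py

    if hasattr(h3, "latlng_to_cell"):  # h3-py version 4 naming
        return h3.latlng_to_cell(lat, lng, resolution)
    return h3.geo_to_h3(lat, lng, resolution)  # h3-py version 3 naming


def stops_to_port_nodes(stops, land_cells, cell_of=h3_cell):
    """Group stops by cell ID, keeping only cells that intersect land.

    stops      -- iterable of dicts with "vessel", "lat" and "lng" keys (assumed layout)
    land_cells -- set of cell IDs known to intersect land (assumed to be precomputed)
    cell_of    -- function mapping (lat, lng) to a cell ID
    """
    ports = defaultdict(list)
    for stop in stops:
        cell = cell_of(stop["lat"], stop["lng"])
        if cell in land_cells:  # drops anchorages and other offshore waiting
            ports[cell].append(stop["vessel"])
    return dict(ports)
```

Each key of the returned dictionary is a candidate port node, and the list of vessels against it gives the visits that would link to that node in the graph.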

A map showing stopping locations of vessels visiting ports in Lisbon as magenta dots and the boundary of H3 level 7 hexagonal cells. — Stopping locations of vessels visiting ports in Lisbon indexed by H3 level 7 cells

As in the case illustrated above, some ports fall neatly within the boundary of a single hexagonal cell, but this is not always the case, so it is important to choose an H3 resolution level appropriate to the questions you want to answer with your graph queries. For example, you can see below that at Rotterdam Maasvlakte neither the port nor its berths fit neatly into a single cell at H3 levels 6 and 7.

Stopping locations of vessels visiting ports in Rotterdam Maasvlakte indexed by H3 level 6 and 7 cells

Once I had created the port nodes in my Neo4j database, I was able to apply graph centrality algorithms to determine the relative importance of the ports. This is a promising technique that could help answer a range of business problems within the UKHO.
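To give a feel for what a centrality score captures, here is a small pure-Python sketch of one such algorithm, PageRank, run over a made-up set of voyages between port nodes. The choice of PageRank, the port names and the voyages are assumptions for illustration. In practice this kind of calculation would be run inside the graph database rather than in hand-written code.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Score nodes by repeatedly passing rank along directed edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a small base share of rank.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no departures spreads its rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Made-up voyages as (departure port node, arrival port node) pairs.
voyages = [
    ("port_b", "port_a"),
    ("port_c", "port_a"),
    ("port_d", "port_a"),
    ("port_a", "port_b"),
]
scores = pagerank(voyages)
for port, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(port, round(score, 3))
```

Ports that receive many arrivals, especially from other well-connected ports, end up with the highest scores.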

Written by Kate Liddell