Building Data Science capability at UKHO: our October 2022 Research Week

Kate Liddell
UK Hydrographic Office
7 min read · Nov 4, 2022

This blog was co-authored by Dr Susan Nelmes, Andrew Smith, David Stephens and Dr Thomas Redfern.

Hello and welcome to our latest data science roundup of 2022! I’ve taken over the reins as your guide for our regular blog post, in which we share findings from our quarterly ‘research week’ exploring the interesting bits of data science and machine learning. This month the team look at data from the Sentinel satellites, the use of transformers for summarising text, topic modelling, graph databases for Automatic Identification System (AIS) data, and new technologies.

Firstly, what is a research week?

Members of the Data Science team at the UK Hydrographic Office (UKHO) come up with excellent ideas for how we can apply the latest in machine learning or big data research to the marine data domain. Sometimes these ideas are outside the scope of our current project deliverables so, to push the boundaries of what is possible, every quarter we conduct a ‘research week’ — an allocated time for data scientists to experiment with new ideas, techniques and technologies.

In this blog, we outline our work, how it might tie into our future projects, and how the lessons learned can help deliver insight to UKHO business problems.

Dr Susan Nelmes: Investigating Level 2 Sentinel products

For this quarter’s research week, I investigated the various marine products available from the Copernicus programme’s Sentinel satellites. The Data Science team here at UKHO have previously used Sentinel-1 and Sentinel-2 data in many of our projects, valuing its long-term, continuous, consistent and open data. However, we have not yet used any of the Level 2 products from the Sentinel-1 and Sentinel-3 satellites, which are geolocated, geophysical products processed from SAR (Synthetic Aperture Radar) data, so I looked into whether they would be useful in our work.

An image of the Sentinel 1 satellite in space with the curve of the earth just visible in the bottom of the image.
Image credit: Sentinel-1

The Level 2 products from Sentinel-1 contain three components: Ocean Wind field (OWI), Ocean Swell spectra (OSW) and Surface Radial Velocity (RVL). These can be downloaded, in netCDF format, from the Copernicus Open Access Hub and explored using the SNAP (SeNtinel Application Platform) toolbox.

The Level 2 Sentinel-3 products include altimetry data both over land and water, provided in netCDF format. The marine data can be downloaded from the EUMETSAT Data Store and explored using the BRAT (Broadview Radar Altimetry Toolbox). Valuable documentation on all Sentinel missions and products is provided on ESA’s Sentinel website.

This research week exploration provided a good insight into the data products available that our Data Science team may make use of in the future.

Andrew Smith: Exploring Transformers and Hugging Face

Transformers have become a dominant architecture in many areas of deep learning research, primarily in the fields of computer vision and natural language processing. A number of large language models have recently been developed given the availability of large text data sets, huge compute power, and the ability to efficiently train transformers in a self-supervised manner. These demonstrate amazing capabilities like text generation, text summarisation, and question-answering. Here at the UKHO, we have a number of text-based problems and products. Consequently, for this research week I spent some time learning more about transformers and exploring how they might be able to help us.

The Hugging Face transformers library is a fantastic resource for using such models. It provides a standardised interface and many pre-trained models, enabling you to quickly experiment with these on your own data sets. The Hugging Face Tasks page helped me to identify the types of problems the library could help me solve. I combined some of these with Streamlit to create a quick dashboard that is useful for demonstrating the capability of such models to non-technical users:
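As a minimal sketch of the kind of summarisation workflow the dashboard wraps (assuming the transformers library is installed; “t5-small” is a small public checkpoint chosen for illustration, not necessarily the model used in our dashboard, and the input text here is invented):

```python
from transformers import pipeline

# Load a pre-trained summarisation model from the Hugging Face Hub.
summariser = pipeline("summarization", model="t5-small")

text = (
    "Multibeam echosounders produce dense point clouds of the seafloor, "
    "but the raw data contain noise that must be removed before charting. "
    "Manual cleaning is slow, so automated approaches are attractive."
)

# do_sample=False gives deterministic (greedy/beam) output; the length
# limits are in tokens and would be tuned per use case.
result = summariser(text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```

Swapping in a different model is just a change of the `model` string, which is what makes a Streamlit front-end for comparing models so quick to build.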

Streamlit application showing the block of text to be summarised
Resulting block of summary text using the selected transformer in Streamlit
Basic Streamlit application for exploring various huggingface/transformers models.

I tested this on a number of our data sets and was amazed at how well the pre-trained models worked without any specific fine-tuning to our data. For demonstration purposes, the input text here is the abstract of our paper “Using three dimensional convolutional neural networks for denoising echosounder point cloud data”, and the model’s task is to summarise it.

For future improvements, Hugging Face also supports fine-tuning so that the models could be customised to our particular domain. The dashboard could also be developed further to more specifically help teams within UKHO for their particular text problems.

Dr Thomas Redfern: Document topic modelling

For this research week I have been exploring the Top2Vec library for text topic modelling. Top2Vec provides a very simple programming interface to a pipeline that clusters documents into groups, where each “topic” is defined as the collection of words occupying a similar region of the vector space to the documents in that cluster. This is achieved by embedding words, documents and topics in the same multi-dimensional vector space (via pre-trained language models), applying dimensionality reduction to the embedding vectors (via the UMAP algorithm) and then extracting topics from the dense regions of the reduced space (via the HDBSCAN algorithm). What makes Top2Vec really useful in a discovery or experimental project is that the whole pipeline can be applied to data with only one or two lines of code, while still allowing the user to change the settings for each stage by passing in a dictionary of arguments.

To test the Top2Vec method, I used the 20newsgroups data set available in the Scikit-learn library. The data set contains nearly 20,000 newsgroup posts (of varying lengths) drawn from discussions of news and current affairs. The following figure shows a visualisation of the output of Top2Vec, illustrating the clustering of points (each representing a document) into different topics, shown in different colours.

Each point represents a document in the 20newsgroups data set. Clusters (topics) were extracted using the HDBSCAN algorithm and coloured accordingly.

The Top2Vec library also provides some helpful functions for creating word clouds of the words that describe the topics within each cluster. This is a quick way of interrogating and analysing the topics contained within your data set.

David Stephens: Exploring technologies for efficient working

I used the time in our research week to explore several technologies that had caught my eye; I describe just the highlights here.

Quarto is an open-source system, built on Pandoc, for publishing your scientific and technical work. You can author documents in either plain-text markdown or in interactive Jupyter notebooks. The markdown can then be rendered into any number of formats e.g. report, presentation, website, blog, book — you name it. It is particularly nice because it allows you to create dynamic, interactive content in Python, R, Julia or Observable JS. Since learning about Quarto I have used it to deploy a GitHub Pages website hosting all the documentation for one of our projects. It is really easy to set up and publishes with only a couple of commands. I’ll certainly be rendering more of my work with Quarto in future!

Quarto logo
Image credit: Quarto
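As a sketch, a minimal Quarto document mixing prose with an executable Python cell might look like this (the filename and content are illustrative):

````markdown
---
title: "Project documentation"
format: html
---

## Example figure

```{python}
import matplotlib.pyplot as plt
plt.plot([0, 1, 2], [0, 1, 4])
plt.show()
```
````

Running `quarto render report.qmd` produces the output document, and `quarto publish gh-pages` is one of the couple of commands that pushes a rendered site to GitHub Pages.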

Another technology I was keen to try out was the newly released Shiny for Python. Shiny is the popular dashboarding system for R, allowing rapid development of interactive dashboards, and having it available in Python gives us another option for deploying dashboards and visualisations. In particular, Shiny for Python offers something not currently possible in the R version: Shinylive. Shinylive lets you develop serverless dashboards, which run entirely in the client’s browser rather than being hosted on a centralised server. There are pros and cons to this approach, but one positive is that you can host dashboards on static website hosts (e.g. GitHub Pages), and because the compute is done in the client’s browser, your dashboards scale easily. Here is a small example of a Shinylive dashboard along with the source code. There is also now a Shinylive extension for Quarto!
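A minimal Shiny for Python app, as a sketch of the core API (the slider and text output are just for illustration, not from our dashboards):

```python
from shiny import App, render, ui

# UI: a slider input and a text output, laid out on a single page.
app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of vessels", min=1, max=100, value=20),
    ui.output_text("summary"),
)

# Server: reactive logic that re-runs automatically when the slider moves.
def server(input, output, session):
    @render.text
    def summary():
        return f"Showing {input.n()} vessels"

app = App(app_ui, server)
# Run locally with:  shiny run app.py
```

The same app structure can be exported to run in the browser with Shinylive, which is what makes the serverless deployment described above possible.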

Dr Kate Liddell: Using a graph database for AIS data

For this research week I have been exploring how we can understand the shipping traffic between critical infrastructure using AIS data in a graph database. A small data set was created by selecting all container ships that visited the Port of London during 2022. Positional data for these vessels during 2022 were extracted from our Fleetmon AIS feed, and stopping locations were detected using the DBSCAN clustering algorithm.
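The stop-detection step can be sketched with scikit-learn's DBSCAN on synthetic AIS-like positions (the coordinates, eps and min_samples values below are illustrative only; real AIS work would use a metric projection or haversine distances and tune the parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic (lat, lon) positions: two tight clusters where a vessel
# lingered, plus scattered points while underway.
rng = np.random.default_rng(42)
stop_a = rng.normal([51.50, 0.05], 0.001, size=(50, 2))
stop_b = rng.normal([51.95, 1.30], 0.001, size=(50, 2))
transit = rng.uniform([51.4, 0.0], [52.0, 1.4], size=(30, 2))
positions = np.vstack([stop_a, stop_b, transit])

# eps is in degrees here purely for simplicity of the sketch.
labels = DBSCAN(eps=0.01, min_samples=10).fit_predict(positions)

# Label -1 marks noise (points in transit); the rest are stops.
n_stops = len(set(labels) - {-1})
print(n_stops)  # → 2
```

Each detected cluster is a candidate stopping location that can then be matched to a known port before loading into the graph.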

A graph database was created in Neo4j following the data model for vessel journeys proposed by Laddada and Ray (2020). A ship is associated with journey segments which are stored as nodes with attributes such as maximum and minimum speed. These segments are connected to port nodes by edges which represent the start and end of the segment, attributed with the date. Additionally, the data model contains information about the ship obtained from the UKHO’s Clarksons data supply, such as flag state, beneficial owner, operator etc.

Schema for the graph data model showing nodes for ship, journey segment and port and start, end and following edges.
Graph model for vessel journeys implemented in Neo4j (Laddada & Ray 2020)

The stopping locations I created were assigned to real-world port locations and the graph data model was populated. This enabled a range of queries to be performed on the graph database using the Cypher query language. It is possible to use the results of these queries to understand the importance of a port in the global network and its connectivity to other ports. We can measure the percentage of traffic that moves from one port to another and show this in visual representations of the network.

Network of journey segments and start and stopping ports for the vessel Caucedo Express
Journeys between ports for the vessel Caucedo Express
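A Cypher query over this model might, for example, count journeys out of London (the node labels, relationship types and property names here are illustrative names for the model described above, not the exact schema):

```cypher
// Top destination ports for journey segments starting in London.
MATCH (seg:Segment)-[:START]->(:Port {name: "London"}),
      (seg)-[:END]->(dest:Port)
RETURN dest.name AS destination, count(seg) AS journeys
ORDER BY journeys DESC
LIMIT 10;
```

Queries like this are the building blocks for the connectivity and traffic-share measures mentioned above.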

This is a promising technique that could be scaled up to help answer a range of business problems within the UKHO.


Dr Kate Liddell OBE, Principal Data Scientist @ UK Hydrographic Office