Building Data Science Capability at UKHO: our October 2021 Research Week
Using research weeks to push the boundaries of data science at the UKHO
Welcome to our quarterly blog post, in which we share our ‘research week’ findings as we explore the interesting bits of data science and machine learning. This quarter, the team look at tools for big data, 3D visualisation, advanced reinforcement learning and data augmentation strategies for satellite imagery.
Firstly, what is a research week?
Members of the Data Science team at the UKHO come up with excellent ideas for how we can apply the latest in machine learning or big data research to the marine data domain. Sometimes these ideas are outside the scope of our current project deliverables. So, to push the boundaries of what is possible, every quarter we hold a ‘research week’: allocated time for data scientists to experiment with new ideas, techniques and technologies.
In this blog, we outline our work and how the lessons learned might feed into our future projects.
3D visualisation of bathymetric data: Rachel Keay
This research week, I decided to move into the third dimension to communicate visually the accuracy of a satellite derived bathymetry (SDB) machine learning model.
The Python libraries matplotlib and holoviews/plotly are part of my day-to-day data science toolkit. Matplotlib has a simple interface, supports many backends and operating systems, and underpins many other Python packages. However, it can be verbose and is not well suited to creating dynamic, interactive graphics. HoloViews, by contrast, is a high-level tool that renders through plotting backends such as Bokeh, matplotlib and plotly (and integrates with datashader), so with the plotly backend I can output a default 3D visualisation of bathymetry data in a couple of lines. I therefore used both: holoviews/plotly for quick interactive plots, and matplotlib with Pillow to output animated visualisations as gifs.
Visualisation is used to discover, explore, and explain patterns in data. The matplotlib gifs show:
1. that the SDB surface is more textured, demonstrating uncertainty in the predictions.
2. that there is a tendency to under-predict at greater depths, demonstrating an “extinction depth” of around 20 m, beyond which the satellite multispectral bands no longer penetrate the water column well enough to derive bathymetry.
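As a rough illustration of the matplotlib and Pillow approach, the sketch below renders a rotating 3D surface and saves it as a gif. The depth grid is synthetic (a sloping seabed with ripples invented purely for this example), and the function name, grid size and colour map are assumptions rather than the code used during the research week.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter


def make_rotating_surface_gif(path, n_frames=36, fps=12):
    """Save a gif of a synthetic depth surface rotating about the vertical axis."""
    # Synthetic bathymetry (hypothetical): depth increases with x, with ripples in y.
    x = np.linspace(0.0, 1.0, 60)
    y = np.linspace(0.0, 1.0, 60)
    xx, yy = np.meshgrid(x, y)
    depth = -(2.0 + 20.0 * xx + 1.5 * np.sin(6.0 * np.pi * yy))

    fig = plt.figure(figsize=(5, 4))
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(xx, yy, depth, cmap="viridis", linewidth=0)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_zlabel("Depth (m)")

    def update(frame):
        # Rotate the camera a little further on every frame.
        ax.view_init(elev=30, azim=frame * 360.0 / n_frames)
        return ()

    anim = FuncAnimation(fig, update, frames=n_frames, blit=False)
    anim.save(path, writer=PillowWriter(fps=fps))  # Pillow writes the gif
    plt.close(fig)
    return path
```

Calling make_rotating_surface_gif("surface.gif") writes the animation to disk. Plotting a predicted surface and a reference surface in the same figure would be one way to compare them visually.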
Python tools for exploring, analysing and visualising big, labelled data: Dr Susan Nelmes
I spent my research week getting up to speed with some of the common Python tools used to explore, analyse and visualise big, labelled data. To do this, I followed tutorials that originally ran at the SciPy 2020 and 2021 conferences. Each conference is preceded by two days of tutorials, pitched at both beginner and intermediate level.
Like many data science teams, we often work with very large data sets that cannot be held in memory. One of the tools that can help with this is Dask, a Python library for scaling and parallelizing code on a single machine or across a cluster. To learn more, I followed the Dask tutorial from SciPy2020, Parallel and Distributed Computing in Python with Dask (material, video), which provided a great introduction with informative exercises to work through.
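To give a flavour of how Dask works, here is a minimal sketch (not taken from the tutorial) that splits an array into chunks, builds a lazy computation, and only does the work when compute() is called. The array size and chunk size are arbitrary choices for illustration.

```python
import dask.array as da

n = 1_000_000

# A Dask array is split into chunks; here, ten chunks of 100,000 elements each.
x = da.arange(n, chunks=100_000, dtype="int64")

# This only builds a task graph: nothing has been calculated yet.
total = (x ** 2).sum()

# compute() runs the graph, processing the chunks in parallel where it can.
result = int(total.compute())
print(result)
```

Because each chunk is processed independently, the same pattern applies to arrays too large to hold in memory, provided each individual chunk fits.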
Another tutorial from SciPy2021 focussed on a particular use case of applying Dask to explore AirBnB data (material, video) and was particularly insightful on the use of Panel to create interactive dashboards.
I also enjoyed the tutorial from SciPy2020 on Xarray, a Python package that allows labelled data to be manipulated as multi-dimensional arrays (material, video). Xarray works well with Dask, and the tutorial provides examples and exercises on this.
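The short sketch below shows the basic idea of labelled arrays in Xarray. The coordinates and values are made up for illustration, and the final step assumes Dask is installed so that the array can be chunked.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up data: 4 days x 2 latitudes x 3 longitudes.
times = pd.date_range("2021-10-01", periods=4, freq="D")
lats = [50.0, 50.5]
lons = [-4.5, -4.0, -3.5]
values = np.arange(4 * 2 * 3, dtype="float64").reshape(4, 2, 3)

depth = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": lats, "lon": lons},
    name="depth",
)

# Select by coordinate label rather than by integer position.
point_series = depth.sel(lat=50.5, lon=-4.0)

# Reduce over a named dimension instead of remembering the axis number.
time_mean = depth.mean(dim="time")

# Backed by Dask (assumed installed), the same code runs lazily in chunks.
lazy_mean = depth.chunk({"time": 2}).mean(dim="time").compute()
```

Naming dimensions keeps the analysis code readable and avoids mistakes with axis order.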
The final tutorial that I followed was HoloViz (material, video), which aims to make it easy to visualise all your data, from notebooks to dashboards. It looked at ways to use hvPlot and, again, Panel to visualise and interact with data.
A great aspect of research weeks is that they give you time to dive deeper into areas that catch your interest or that are tangential to the main topic. For example, it was interesting to look further into how Python works under the hood and to understand why the Global Interpreter Lock, which allows only one thread at a time to hold control of the Python interpreter, affects how code can be parallelised. Another interesting tangent was comparing Dask with other tools for parallelising Python code, such as Ray and Modin.
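A small experiment along these lines can make the effect of the Global Interpreter Lock concrete. The sketch below, using only the standard library, runs the same CPU-bound function sequentially and in a thread pool. The prime-counting workload and the helper names are invented for illustration. On a standard CPython build the threaded version is typically no faster, although exact timings depend on the machine and the interpreter.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound pure-Python work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    return [count_primes(limit) for limit in limits]


def run_threaded(limits, workers=4):
    # Threads share one interpreter, so only one runs Python bytecode at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


def timed(fn, *args):
    """Return the function's result together with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    limits = [20_000] * 4
    _, sequential_seconds = timed(run_sequential, limits)
    _, threaded_seconds = timed(run_threaded, limits)
    print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

This is why tools such as Dask can also use multiple processes, each with its own interpreter, for CPU-bound pure-Python work.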
Advanced reinforcement learning: Andrew Smith
I have been interested in learning more about reinforcement learning since my initial blog post (found within this previous team blog post). Reinforcement learning could be a very useful technology for some of our problems, such as determining optimal shipping routes and addressing considerations around autonomous shipping. Consequently, for this research week I decided to learn more about MuZero.
MuZero is a recent development from DeepMind, a company that aims to solve artificial general intelligence (AGI). The methods DeepMind has been developing have become progressively more general, and MuZero is the most general of them. In addition to benefitting from earlier advances, such as learning entirely from self-play and using a single algorithm to solve multiple games, MuZero is capable of learning the rules of the environment it is acting in.
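To make "learning the rules" a little more concrete: MuZero is built around three learned functions, namely a representation function that encodes observations into a hidden state, a dynamics function that predicts the next hidden state and reward for an action, and a prediction function that outputs a policy and a value. The toy sketch below shows only how these functions fit together when imagining future steps without calling the real environment. The weights are random and untrained, the dimensions are arbitrary, and the tree search and training loop are omitted entirely, so this is an illustration of the structure rather than a working agent.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN_DIM, N_ACTIONS = 8, 16, 4  # arbitrary toy sizes

# Random, untrained weights standing in for the learned networks.
W_repr = rng.normal(scale=0.1, size=(OBS_DIM, HIDDEN_DIM))
W_dyn = rng.normal(scale=0.1, size=(HIDDEN_DIM + N_ACTIONS, HIDDEN_DIM))
w_reward = rng.normal(scale=0.1, size=HIDDEN_DIM + N_ACTIONS)
W_policy = rng.normal(scale=0.1, size=(HIDDEN_DIM, N_ACTIONS))
w_value = rng.normal(scale=0.1, size=HIDDEN_DIM)


def representation(observation):
    """Encode an observation into a hidden state."""
    return np.tanh(observation @ W_repr)


def dynamics(state, action):
    """Predict the next hidden state and the reward for taking an action."""
    x = np.concatenate([state, np.eye(N_ACTIONS)[action]])
    return np.tanh(x @ W_dyn), float(x @ w_reward)


def prediction(state):
    """Predict a policy (action probabilities) and a value for a hidden state."""
    logits = state @ W_policy
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    return policy, float(state @ w_value)


def unroll(observation, actions):
    """Imagine a sequence of actions using only the model, never the environment."""
    state = representation(observation)
    trajectory = []
    for action in actions:
        state, reward = dynamics(state, action)
        policy, value = prediction(state)
        trajectory.append((reward, policy, value))
    return trajectory
```

For example, unroll(np.zeros(OBS_DIM), [0, 1, 2]) returns three imagined steps, each holding a predicted reward, policy and value.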
If you want to give MuZero a try yourself, I found this code repository useful. The following resources were also found to be extremely helpful for learning more about it:
Data augmentation strategies for satellite imagery: Dr Thomas Redfern
When training deep learning models for image segmentation tasks, we commonly use data augmentations to increase the volume and variety of our training data. Typically we might apply image rotations, flips and random crops. For natural images, or in settings where image exposure is well controlled, these augmentations help a deep learning model generalise to different angles of capture. In satellite remote sensing imagery, however, the exposure of an image may vary, because changes in sun angle, weather conditions, atmospheric interference and seasonality may all alter the reflectance of the Earth’s surface recorded by a satellite sensor. Therefore, if we train a model using a labelled image from time t1, the exposure of an image used for inference at time t2 may be quite different, and this may reduce the accuracy of our trained model.
To overcome this problem, in this research week I experimented with the development of an image exposure augmentation technique. I devised the following exposure adjustment strategies:
- All bands within a Sentinel-2 image were augmented by a single factor, i.e. new band value = original band value × augmentation factor. I experimented with augmentation factors between 0.5 and 2.
- Each individual band within a Sentinel-2 image was augmented by its own individual augmentation factor, again between 0.5 and 2.
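The two strategies above can be sketched in a few lines of NumPy. This is an illustrative version only: the function names and the (bands, height, width) array layout are assumptions, as is drawing the factor uniformly from the 0.5 to 2 range, since the sampling distribution is not specified here.

```python
import numpy as np


def augment_all_bands(image, rng, low=0.5, high=2.0):
    """Scale every band of a (bands, height, width) image by one shared factor."""
    factor = rng.uniform(low, high)  # assumption: uniform sampling
    return image * factor


def augment_per_band(image, rng, low=0.5, high=2.0):
    """Scale each band by its own independently drawn factor."""
    # Shape (bands, 1, 1) broadcasts one factor across each band's pixels.
    factors = rng.uniform(low, high, size=(image.shape[0], 1, 1))
    return image * factors


rng = np.random.default_rng(42)
image = rng.random((4, 64, 64))  # stand-in for a 4-band image chip

same_factor = augment_all_bands(image, rng)
own_factors = augment_per_band(image, rng)
```

Passing the random generator in explicitly makes the augmentation reproducible, which helps when comparing experiments.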
I used our pre-existing labelled mangrove train and test data sets and compared the results with previous production and experimental models. Unfortunately, the results were not an improvement over previous models. I believe this could be because I was applying a different augmentation factor in each new epoch. A better technique may have been to create a larger, fixed augmented data set first and then train on all the imagery for a number of epochs. Something to try next time!