XBT Project Summary — Using Machine Learning to infer missing metadata in Climate Science Datasets

Stephen Haddad
Met Office Informatics Lab
7 min read · Oct 22, 2020

Background and Data

Figure 1: An Expendable BathyThermograph probe, the source of the observation data that is the subject of this project, and a picture of an XBT probe being launched off a ship.

Recently the Met Office, historically a very data-focused and data-rich organisation, has invested in further developing its use of cutting-edge Data Science, including Machine Learning, through projects across the organisation designed to build skills and inform future infrastructure choices. The Informatics Lab has taken on one such project: filling in missing metadata in a climate dataset of ocean temperature measurements. These measurements come from Expendable BathyThermograph probes (XBTs), which have been collected since the 1960s. Statistical bias corrections are an important part of the pipeline for including these measurements in an ocean temperature dataset such as those produced by the Met Office (e.g. EN4, https://www.metoffice.gov.uk/hadobs/en4/ ), but these corrections rely on metadata specifying the type of probe that produced each reported profile. Missing metadata increases the uncertainty of the resulting dataset, so the aim of this project is to fill in the missing metadata and thereby reduce that uncertainty.

Secondary motivations include demonstrating the value, and increasing awareness, of Data Science and Machine Learning techniques among Climate Science domain researchers. In addition, although this is a real-world application of meaningful size and complexity, it is certainly not in the realm of truly big data: with approximately 2 million observations (~25 GB file size), training and evaluation do not require large computing resources. This makes it a good test case for evaluating the usability of different tools and platforms, both internally and on cloud services. The project has tested the suitability of currently available tools for subsequent projects and will inform future tooling choices, so that standard Data Science and Machine Learning techniques become part of the standard toolkit of Met Office scientists.

The source data for this project is the database of XBT temperature profiles and associated metadata. This data is freely available from the World Ocean Database (WOD, https://www.ncei.noaa.gov/products/world-ocean-database ), so together with open-source code (including experiment configurations), built on open-source libraries and running on public cloud, a key project aim is to serve as an example of good practice for an open, reproducible machine learning project.

Solution Overview

This project is not new or innovative from a machine learning perspective; rather, it applies machine learning to a new problem. The key challenge is not to develop new ML techniques, but to understand the problem domain and the data well enough to make good choices among standard techniques that are appropriate and will give useful results for this problem.

There are approximately 2.2 million data points (each a temperature profile with associated metadata). Of these, approximately 1 million are labelled, meaning the metadata includes the probe type. The aim is to train and evaluate a classifier using the labelled data and then produce probe type classifications for the unlabelled data. A particular challenge with this dataset is the imbalance between the different probe type classes, shown in figure 2. Although there are about 20 probe types in the dataset, ~80% of the labelled profiles come from just three predominant types. Care is needed in the experiment design to take this into account.

Figure 2: The frequency of different XBT probe types, showing class imbalance.
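The kind of class-imbalance check behind figure 2 is straightforward with Pandas. The snippet below is a minimal sketch using made-up probe type labels (the real labels come from the WOD metadata fields):

```python
import pandas as pd

# Hypothetical illustration: probe-type labels for a handful of profiles.
# In the real project these come from the WOD metadata.
profiles = pd.DataFrame({
    "probe_type": ["T-7", "T-4", "T-7", "DB", "T-7", "T-4", "T-6", "T-7"]
})

# Relative frequency of each class; a skew like this is what motivates
# careful experiment design and per-class metrics such as recall.
class_freq = profiles["probe_type"].value_counts(normalize=True)
print(class_freq)
```

With the full dataset, the same one-liner reveals that three probe types dominate the labelled profiles.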

This work builds on two previous efforts. The first is a heuristic algorithm called Intelligent Metadata (iMeta) that selects probe type based on thresholds and values in certain fields, derived from statistical analysis that exploits the class imbalance; this is essentially a manually configured decision tree. The second was a first attempt to use ML for this problem, using a neural network, and the present work builds on the approaches used there. After reproducing the results of that work, I extended the decision tree approach by using a standard ML decision tree that learns its thresholds from the data, rather than relying on human-specified ones. The good results with the manual tree approach suggested this as a promising starting point. Other advantages over a more powerful algorithm include speed of training and explainability of the results, as the learned thresholds can be extracted and interrogated to understand why a classification was output.
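The explainability point can be demonstrated with scikit-learn's decision tree. This sketch uses synthetic features with illustrative names (the real pipeline uses fields from the profile metadata), but shows how the learned thresholds can be printed and inspected, much like the hand-built iMeta rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in features; the feature names are illustrative only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1000, size=(200, 2))   # e.g. max profile depth, year
y = (X[:, 0] > 500).astype(int)           # synthetic probe-type label

# A shallow tree trains in milliseconds and remains interpretable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Unlike iMeta's human-specified thresholds, these were learned from
# the data, yet they can still be extracted and interrogated.
rules = export_text(tree, feature_names=["max_profile_depth", "launch_year"])
print(rules)
```

Reading the printed rules shows exactly which threshold produced each classification, which is much harder to do with a neural network.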

Figure 3: Flowchart showing the data pipeline for the XBT project

Figure 3 shows the data flow through the experiment pipeline. The relevant data for this project is extracted from the raw NetCDF files downloaded from the WOD. We use the standard tabular data library Pandas to store and manipulate the data. We split the data carefully into train/test/dev sets. For example, each profile comes from a cruise, a single ship's voyage during which multiple temperature profiles were usually taken. Through initial data exploration, we discovered that some unlabelled profiles come from cruises where other profiles are labelled, which makes it likely (though not certain) that those unlabelled profiles have the same probe type. Other unlabelled profiles come from cruises where no profiles have probe type metadata. Training and evaluation need to consider both cases to successfully produce classifications for the unlabelled profiles.
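Because profiles from the same cruise are likely to share a probe type, a naive random split could leak information between train and test sets. One standard way to split by cruise, sketched here with a hypothetical mini-dataset, is scikit-learn's `GroupShuffleSplit`:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical mini-dataset: each profile tagged with its cruise.
df = pd.DataFrame({
    "cruise_id":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "feature":    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "probe_type": ["A", "A", "A", "B", "B", "A", "A", "A", "B", "B"],
})

# Split so that all profiles from a cruise land on the same side,
# preventing leakage of cruise-level information into the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["cruise_id"]))

train_cruises = set(df.loc[train_idx, "cruise_id"])
test_cruises = set(df.loc[test_idx, "cruise_id"])
print(train_cruises, test_cruises)  # disjoint sets of cruises
```

The same grouping idea extends to the dev set, and to any other metadata field that correlates profiles with each other.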

So far we have primarily used scikit-learn for Machine Learning, mainly due to the ease of getting started and the variety of algorithms available. The scikit-learn API makes trying many approaches very simple. The outputs from the system are:

  1. Probe type classifications for all profiles
  • For labelled data this is used to calculate performance metrics. We use the standard F1, precision and recall to evaluate results, primarily using recall for consistency with previous work.
  • For unlabelled data, this will be used to inform downstream dataset production. In some cases the classifier cannot produce a prediction and we fall back on iMeta.

2. Metric scores for labelled data

3. A trained classifier or an ensemble of classifiers

  • We save out the state of the trained classifier, because in the application new data will become available and we may need to produce predictions for it that are consistent with the rest of the dataset.
  • Each classifier is trained on a different subset of data, and we combine them using voting to get a pseudo-probabilistic output, as a step towards later creating a proper probabilistic model of classification.
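The voting scheme described in point 3 can be sketched as follows. This is a minimal illustration on synthetic data, not the project's exact implementation: several trees are trained on different random subsets, and the fraction of members voting for a class serves as the pseudo-probabilistic output:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class stand-in for the probe-type problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Train each ensemble member on a different random subset of the data.
members = []
for seed in range(5):
    idx = np.random.default_rng(seed).choice(len(X), size=200, replace=False)
    clf = DecisionTreeClassifier(max_depth=4, random_state=seed)
    members.append(clf.fit(X[idx], y[idx]))

# Voting: the fraction of members predicting class 1 acts as a
# pseudo-probability, a step towards a fully probabilistic model.
votes = np.stack([m.predict(X) for m in members])  # shape (5, 300)
pseudo_prob_class1 = votes.mean(axis=0)
print(pseudo_prob_class1[:5])
```

The fitted members can then be persisted (e.g. with joblib) so that consistent predictions can be produced when new data arrives.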

Conclusions, Lessons and Next Steps

This structure has been implemented in Python, built on top of standard tools such as Pandas and scikit-learn. Using this pipeline, we have trained, evaluated and selected a machine-learning algorithm that improves the accuracy of classification on the labelled data. Despite trying several kinds of algorithms, the best result was for a decision tree, achieving a recall value of 0.94, compared to 0.75 for iMeta and 0.85 for the previous neural network approach. Figure 4 shows recall by year for the different algorithms. There is still scope for further analysis of misclassified profiles to see if results for low-support probe types (i.e. those with few data points) can be improved, and thus improve overall accuracy. These results are currently being written up into a scientific paper describing the data exploration, solution design and results in greater detail.

Figure 4: Recall by year for different algorithms. Blue shows the original heuristic prediction scheme output and red shows the best performing trained ML classifier.

One of the key challenges in all projects is good communication and coordination, but it is especially important in Data Science projects where you typically have a divide between domain experts (e.g. climate scientists) and machine learning researchers. Regular communication to develop an understanding of the requirements and expectations of each group was essential to making progress. This has been especially challenging in 2020 when impromptu water cooler conversations or pair programming in front of the same computer have not been possible!

The code has been tested in three computing environments: SPICE, the internal Met Office Linux cluster; a Pangeo JupyterHub-based cloud platform; and Microsoft's Azure Machine Learning Studio service. The source has been designed to facilitate using different libraries (e.g. PyTorch, XGBoost) and platforms (e.g. AWS SageMaker), so that this problem can serve as a test case to evaluate each against requirements such as ease of getting started and scalability.

There has been an explosion of possible pathways for implementing Data Science workflows. Now that I have a benchmark, I look forward to exploring the applicability of some of those possibilities in subsequent Data Science work. In the coming months there will be further posts exploring some of the technical details of software engineering for Data Science implemented in this project, and the results of those explorations.

Project Repository and Documentation on GitHub

https://github.com/MetOffice/XBTs_classification
