Microsoft AI for Earth

Helping Scientists Protect Beluga Whales with Deep Learning

Ming Zhong
Microsoft Azure
Published in Microsoft AI for Earth · Mar 10, 2020 · 5 min read


This post was written by Ming Zhong from the Microsoft AI for Good Research Lab, in collaboration with Dan Morris from Microsoft AI for Earth and Manuel Castellote from the NOAA Alaska Fisheries Science Center.

Background

In the U.S., there are five populations of beluga whales, all in Alaska. Of those five, the Cook Inlet population is the smallest and has declined by about seventy-five percent since 1979. Subsistence hunting contributed to the initial population drop, but the practice was regulated starting in 1999, with the last hunt in 2005. Still, the Cook Inlet beluga population has yet to recover. It was listed as an endangered species in 2008, with hopes that it would begin to recover in the near future, but more than a decade later it continues to decline, with a current population estimate of 328 whales.

Like other toothed whales, beluga whales rely heavily on sound. They produce acoustic signals to find prey and to communicate; consequently, scientists can use acoustic recordings to study beluga populations and behavior. In 2008, the NOAA (National Oceanic and Atmospheric Administration) Alaska Fisheries Science Center, in partnership with the Alaska Department of Fish and Game, put together an acoustic research program to continuously monitor beluga whale habitat. This program has two main objectives: (1) studying beluga whale behavior and population size, and (2) understanding the extent to which human-generated noise is disrupting beluga populations.

How Machine Learning Can Help

In the past, NOAA scientists ran the raw audio recordings collected by underwater moorings through a very basic detector, based on energy levels in certain frequency bands, to detect acoustic signals from beluga whales. This detector was tuned toward high recall, i.e., it was tuned to make sure it didn't miss any beluga sounds, but as a result it let through many false positives: uninteresting background noise. Consequently, every one of those detections had to be manually validated. This validation process is very time-consuming and labor-intensive, which limits the number of sensors the team can deploy and the speed with which the team can provide answers to critical conservation questions.
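To make the idea of an energy-based detector concrete, here is a minimal sketch, not NOAA's actual implementation: it flags any window whose mean energy in a frequency band of interest exceeds a threshold. The band edges, window length, and threshold below are hypothetical placeholders.

```python
# Illustrative energy-band detector (not NOAA's actual detector).
# Band edges, window length, and threshold are hypothetical placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def energy_detector(wav_path, band=(1000, 8000), win_s=2.0, threshold_db=-40.0):
    sr, audio = wavfile.read(wav_path)
    if audio.ndim > 1:                      # use a single channel
        audio = audio[:, 0]
    audio = audio.astype(np.float32)
    freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_energy_db = 10 * np.log10(sxx[in_band].mean(axis=0) + 1e-12)
    # Flag each win_s-second window whose mean in-band energy exceeds the threshold.
    hop = times[1] - times[0]
    cols_per_win = max(1, int(win_s / hop))
    detections = []
    for start in range(0, len(times), cols_per_win):
        if band_energy_db[start:start + cols_per_win].mean() > threshold_db:
            detections.append(times[start])
    return detections  # start times (in seconds) of candidate windows
```

A low threshold makes such a detector miss very little (high recall) but also pass along large amounts of background noise, which is exactly the trade-off the team faced.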

In this project, our goal is to build a machine learning model that automatically detects beluga whale acoustic signals. Reducing the burden of annotating acoustic recordings will allow the team to scale their deployments, and will allow them to allocate their personnel to conservation science and planning, rather than data annotation.

Data

The original raw data was collected with hydrophones (i.e., underwater microphones) placed in permanent moorings within the Cook Inlet beluga whale critical habitat. The moorings are serviced twice per year by the National Marine Fisheries Service (NMFS) and the Alaska Department of Fish and Game. The recording datasets included in this study correspond to five- to seven-month mooring deployments during the ice-free season (May to September) or winter season (October to April) in 2017–2018 at seven locations (Figure 1), accounting for more than 13,000 hours of audio recordings.

The NOAA team ran all of these audio recordings through the basic detector that they have used throughout this project, and the results were manually validated through visual and aural inspection of spectrograms. Every detection was labeled as either a true detection (i.e., with beluga whale calls) or a false detection (i.e., without beluga whale calls). This labeled dataset served as invaluable training and test data for our machine learning work.

Figure 1: Location of microphone deployments in Cook Inlet.

Applying Machine Learning to Find Beluga Sounds

In order to take advantage of the extensive tooling that has been developed around computer vision, many projects applying machine learning to audio data first convert sound to images by way of a spectrogram. A spectrogram is an image in which each row is a frequency, each column is a time point, and each pixel's intensity tells us how much that frequency is present at that time point (Figure 2).

Figure 2: An example 2-second spectrogram, indicating the intensity of frequencies from 0 Hz to 12,000 Hz. Each pixel has only one value (i.e., this is actually a grayscale image), but spectrograms are typically colorized so that red or yellow indicates high intensity and dark blue indicates low intensity.

We first extracted spectrograms from the audio files, where each spectrogram was generated from a 2-second audio segment starting at the detection time. Spectrograms were resized to 300 by 300 pixels. In total, we generated 89,000 spectrograms from true detections, 146,000 spectrograms from false detections, and 25,000 spectrograms from clips with no detections at all (we'll discuss later how we use those).
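A minimal sketch of this preprocessing step, assuming WAV inputs and using librosa and Pillow; the FFT size, hop length, and output path are assumptions rather than the exact parameters used in the project:

```python
# Sketch of spectrogram extraction: a 2-second clip starting at each detection,
# rendered as a 300x300 grayscale image. FFT parameters are assumptions.
import numpy as np
import librosa
from PIL import Image

def detection_to_spectrogram(wav_path, detection_start_s, out_path,
                             clip_s=2.0, n_fft=1024, hop_length=256):
    # Load only the 2-second segment beginning at the detection start time.
    y, sr = librosa.load(wav_path, sr=None, offset=detection_start_s, duration=clip_s)
    # Magnitude spectrogram in dB.
    s = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    s_db = librosa.amplitude_to_db(s, ref=np.max)
    # Scale to 0-255, flip so low frequencies sit at the bottom, resize to 300x300.
    img = (255 * (s_db - s_db.min()) / (s_db.max() - s_db.min() + 1e-9)).astype(np.uint8)
    img = np.flipud(img).copy()
    Image.fromarray(img).resize((300, 300)).save(out_path)
```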

The manually validated detections were used as the ground truth to train and evaluate a binary image classifier, i.e. a model with only two output classes: “beluga” and “not beluga”.
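As an illustration of how labeled spectrograms might be fed into such a classifier, here is one possible data pipeline; the directory layout and the 80/20 split are assumptions, not the project's actual setup.

```python
# Hypothetical data pipeline: labeled spectrograms organized into
# "beluga" and "not_beluga" subfolders, loaded with a Keras generator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "spectrograms/",            # contains beluga/ and not_beluga/ subfolders
    target_size=(300, 300),
    class_mode="binary",
    batch_size=32,
    subset="training",
)
val_gen = datagen.flow_from_directory(
    "spectrograms/",
    target_size=(300, 300),
    class_mode="binary",
    batch_size=32,
    subset="validation",
)
```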

Specifically, we built four individual convolutional neural network models:

· Model 1: A CNN built from scratch using the AlexNet architecture.

· Model 2: Transfer learning with fine-tuning from a pre-trained VGG16 model.

· Model 3: Transfer learning with fine-tuning from a pre-trained ResNet50 model.

· Model 4: Transfer learning with fine-tuning from a pre-trained DenseNet model.

We then fit an ensemble model that weighted the outputs of these four individual CNNs. All models were trained in Keras (with the TensorFlow backend) on an Azure Deep Learning Virtual Machine.
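As a rough sketch of what one of these transfer-learning models looks like in Keras (the head layer sizes, optimizer settings, and freezing strategy are assumptions, not the exact configuration used here):

```python
# Hedged sketch of a binary "beluga / not beluga" classifier built by
# fine-tuning a pre-trained ResNet50 in Keras. Hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_resnet_classifier(input_shape=(300, 300, 3)):
    # Spectrograms are grayscale; stacking the single channel three times is
    # one common way to match the pre-trained 3-channel input.
    base = ResNet50(weights="imagenet", include_top=False,
                    input_shape=input_shape, pooling="avg")
    base.trainable = False                      # first stage: train the head only
    model = models.Sequential([
        base,
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # P(beluga)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

# For fine-tuning, a second stage would unfreeze some of the top layers of the
# base network and continue training with a smaller learning rate.
```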

Results

With a default neutral threshold score of 0.5 used for classification, the individual CNN models performed similarly, though the models that leverage pre-trained weights via transfer learning yielded better results, especially in recall. The best overall performance in terms of the area under the curve (AUC), a common metric that factors in both precision and recall, was achieved by the ensemble model.
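A minimal sketch of how such an ensemble and its metrics can be evaluated, assuming the four models' predicted probabilities on a held-out test set are available; the equal weights below are a placeholder, and the post does not specify which AUC curve was used, so both common variants are shown.

```python
# Sketch of a weighted ensemble over the four CNNs' probabilities, evaluated
# at a 0.5 threshold. Equal weights are a placeholder; the real weights may differ.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score)

def ensemble_scores(prob_matrix, weights=None):
    # prob_matrix: shape (n_models, n_samples) of per-model P(beluga).
    prob_matrix = np.asarray(prob_matrix)
    if weights is None:
        weights = np.full(prob_matrix.shape[0], 1.0 / prob_matrix.shape[0])
    return np.average(prob_matrix, axis=0, weights=weights)

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "pr_auc": average_precision_score(y_true, y_score),  # area under the precision-recall curve
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
```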

Putting the model into practice

We implemented the spectrogram generation and training steps using an Azure Deep Learning Virtual Machine, and provided the NOAA team with the model files and Python scripts to run those models. The team is now able to run their existing detector, then run our ensemble model to substantially reduce the number of false detections they need to manually review.
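As a sketch of what this second-pass filtering step could look like (the model file names, directory layout, and score threshold are assumptions, not the scripts actually delivered to the team):

```python
# Hypothetical second-pass filter: score each candidate detection's spectrogram
# with the ensemble and keep only those above a threshold for manual review.
import glob
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

MODEL_PATHS = ["alexnet.h5", "vgg16.h5", "resnet50.h5", "densenet.h5"]  # hypothetical names

def filter_detections(spectrogram_dir, threshold=0.5):
    models = [load_model(p) for p in MODEL_PATHS]
    keep = []
    for path in sorted(glob.glob(f"{spectrogram_dir}/*.png")):
        img = img_to_array(load_img(path, target_size=(300, 300))) / 255.0
        batch = np.expand_dims(img, axis=0)
        # Simple unweighted average of the four models' probabilities.
        score = np.mean([float(m.predict(batch, verbose=0)[0][0]) for m in models])
        if score >= threshold:
            keep.append((path, score))   # candidate still worth manual review
    return keep
```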

Next steps

The methodology presented here can be readily adapted to similar bioacoustics applications, whether for binary or multi-class classification. To facilitate this, we've released all of the source code here:

http://github.com/microsoft/belugasounds

The NOAA team is currently using our model to refine the outputs of their original detector. While this saves them a substantial amount of time, we're hoping to further streamline their process in the future by replacing the original detector entirely.
