Bioacoustic Classification using Deep Learning for Multiple Species in a Puerto Rico Rainforest

Ming Zhong
Microsoft Azure
Apr 21, 2020

This post is written by Ming Zhong from the Microsoft AI for Good Research Lab, in collaboration with Sieve Analytics and Dan Morris from Microsoft AI for Earth.

Background

Acoustic monitoring has gained widespread interest as an ecological tool for wildlife population assessment, conservation, and biodiversity research. Many species emit regular vocalizations or other acoustic signals that are species-specific, which enables monitoring via sound recognition. Advances in automated acoustic recorders have enabled data collection at greater temporal and spatial scales and have resulted in enormous datasets. However, as with many other data-driven approaches, including camera traps and eDNA, methods for acoustic data collection are progressing faster than those for effective analysis and interpretation. In many cases, acoustic analyses are done manually, and this often limits the analyses to a small subset of the complete datasets. To enable analysis of entire datasets, accurate, automated sound recognition methods are paramount.

To address this challenge, researchers have created species-specific algorithms by applying automatic speech recognition techniques, and recently deep learning techniques have become popular as well. To achieve good performance, training a deep learning model typically requires large amounts of labeled data. Unfortunately, training data is limited for most classification problems due to the high cost of labeling data or because a particular class is very infrequent (e.g. a rare or endangered species). Here, we propose a novel approach that combines transfer learning and pseudo-labeling as a data augmentation technique to: 1) train a deep convolutional neural network (CNN) model, 2) evaluate the model against a test set of audio clips with labeled species presence or absence, and 3) automate the detection of 24 species of birds and frogs from 1-minute field recordings from the mountains of Puerto Rico. The main focus of this study was species of Greatest Conservation Need according to the State Wildlife Action Plans, but we also included some other common species in Puerto Rico (e.g., the common coqui (a type of frog), the pearly-eyed thrasher (a common bird), and the scaly-naped pigeon).

Data

Audio data was collected by our collaborator, Sieve Analytics, from ~700 sampling sites across the mountains of Puerto Rico from 2015 to 2019. Audio recordings were collected using portable acoustic recorders, which were programmed to record 1 minute of audio every 10 minutes, 24 hours per day, for 1–2 weeks per sampling site.

To create the training and test data, we used the call template matching process for each species within the ARBIMON II platform, a tool developed by Sieve Analytics to manage ecological acoustic data. The user annotated each detection as either positive or negative, indicating the presence or absence of the target species within the audio segment. Our manually validated dataset consists of 100,000 positive and 243,000 negative single-species detections across 24 species. To create CNN training samples from the call detections, we transformed each raw audio recording into a mel spectrogram, an image representation of the frequencies contained in the recording at each point in time.

Mel spectrogram images were computed from the 2-second audio clip beginning at the start time of each detection (see Fig. 1 for an example mel spectrogram from each species). These images were the input to the machine learning model, and the corresponding single-species labels for each image (i.e., species present (positive) or absent (negative)) served as the ground truth for training and evaluating the multi-label, multi-class classification model.

Fig. 1. Sample mel spectrograms for each species’ two-second call. The horizontal axis represents time, and the vertical axis represents frequency (ranging from 0 to 24,000 Hz) with log transformation. Species names in the first row from left to right: Eleutherodactylus unicolor, Eleutherodactylus brittoni, Eleutherodactylus wightmanae, Eleutherodactylus coqui, Eleutherodactylus hedricki, Eleutherodactylus gryllus; second row from left to right: Eleutherodactylus richmondi, Eleutherodactylus locustus, Eleutherodactylus antillensis, Eleutherodactylus portoricensis, Leptodactylus albilabris, Vireo altiloquus; third row from left to right: Loxigilla portoricensis, Patagioenas squamosa, Spindalis portoricensis, Nesospingus speculiferus, Megascops nudipes, Margarops fuscatus; fourth row from left to right: Setophaga angelae, Coereba flaveola, Turdus plumbeus, Melanerpes portoricensis, Todus mexicanus, Coccyzus vieilloti.
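As an illustrative sketch of this preprocessing step (not the exact script used in the study), a 2-second clip can be converted to a log-scaled mel spectrogram with librosa. The sampling rate, FFT size, and mel-band settings below are assumptions, chosen so that the frequency axis reaches 24,000 Hz as in Fig. 1.

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Assumed parameters: a 48 kHz sampling rate gives a Nyquist frequency of
# 24,000 Hz, matching the frequency range shown in Fig. 1.
SR = 48000
CLIP_SECONDS = 2.0

def detection_to_mel_image(wav_path, start_time, out_png):
    """Cut a 2-second clip at a detection's start time and save its mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=SR, offset=start_time, duration=CLIP_SECONDS)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=2048,
                                         hop_length=512, n_mels=128,
                                         fmax=SR // 2)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scaled power

    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    librosa.display.specshow(mel_db, sr=SR, hop_length=512,
                             x_axis="time", y_axis="mel", fmax=SR // 2, ax=ax)
    ax.set_axis_off()  # save the image alone, without axes, as model input
    fig.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```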

Applying Machine Learning to Classify Species Sounds

A. Build CNN using VGG16 architecture

We used the VGG16 neural network architecture to classify the calls of the 24 species. The VGG16 CNN takes RGB images as input, which are passed through a stack of convolutional layers with 3×3 kernels. Spatial pooling is carried out by five max-pooling layers inserted among the convolutional layers. The final max-pooling layer is followed by three fully-connected layers and a softmax output layer.
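A minimal sketch of this model in Keras is shown below. The input size and head configuration are assumptions, and the output layer uses sigmoid activations (rather than the softmax of the original VGG16 classifier) because the overall task here is multi-label.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SPECIES = 24  # 24 bird and frog classes

# VGG16 convolutional base with random initialization (weights=None),
# i.e. trained "from scratch" rather than from ImageNet weights.
base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),            # first fully-connected layer
    layers.Dense(4096, activation="relu"),            # second fully-connected layer
    layers.Dense(NUM_SPECIES, activation="sigmoid"),  # one score per species
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```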

B. Transfer learning and fine-tuning with pre-trained CNN model

Transfer learning is a machine learning technique in which a model trained on one task (or domain) is re-purposed for a second, related task (or domain). Pre-trained models are usually shared as the millions of parameter values learned while training the model to an optimal state. This approach is effective because the source model was trained on a large number of photographs spanning a relatively large number of classes, which required it to learn general-purpose image features in order to perform well. With fine-tuning, we freeze some layers of the pre-trained model and train only the last several layers, instead of training the whole model from a random initialization of all parameters. In this study, we used a pre-trained ResNet50 and fine-tuned it by adding a fully connected layer, a dropout layer, and an output layer.
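A hedged sketch of this fine-tuning setup in Keras follows; the layer width, dropout rate, learning rate, and input size are assumptions rather than values from the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SPECIES = 24

# ResNet50 base pre-trained on ImageNet; the pre-trained layers are frozen
# so that only the newly added head is trained.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(1024, activation="relu"),            # added fully connected layer
    layers.Dropout(0.5),                              # added dropout layer
    layers.Dense(NUM_SPECIES, activation="sigmoid"),  # added output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```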

C. Custom Loss Function with Pseudo Labeling

The CNN models described above only make use of the detected calls with positive labels (i.e., the 100,000 clips), assuming the absence of all other species in each clip and encoding them with negative labels. However, even within the 2-second window that each mel spectrogram covers, calls from multiple species may be present, so the single-species assumption does not always hold and yields some incorrect labels. Furthermore, these models do not make use of the detected calls with negative labels (i.e., the 243,000 clips), which may contain useful information for the classification task.

Pseudo labeling, a semi-supervised learning technique, provides a simple yet effective way to augment the training data and to incorporate the information contained in unlabeled data into the model, giving it a better sense of the data's general structure. In this scenario, the initial model, trained on labeled data only, assigns a presence or absence label to an unlabeled sample whenever the corresponding predicted probability surpasses a chosen threshold. The model is then re-trained in a supervised fashion on the labeled and pseudo-labeled data together, with the pseudo labels treated as if they were true labels.
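One way to realize this idea is sketched below under assumptions of our own: the -1 encoding for unknown species, the 0.95 confidence threshold, and the helper names are illustrative, not the study's exact implementation. The sketch pairs a binary cross-entropy loss that only counts annotated or pseudo-labeled entries with a routine that fills in confident predictions as pseudo labels.

```python
import numpy as np
import tensorflow as tf

PSEUDO_THRESHOLD = 0.95  # assumed confidence required to accept a pseudo label

def masked_binary_crossentropy(y_true, y_pred):
    """Binary cross-entropy over a (batch, 24) label matrix whose entries are
    1 (present), 0 (absent), or -1 (unknown); unknown entries are masked out."""
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    y = tf.clip_by_value(y_true, 0.0, 1.0)          # map -1 placeholders to 0
    p = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)  # numerical stability
    bce = -(y * tf.math.log(p) + (1.0 - y) * tf.math.log(1.0 - p))
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

def add_pseudo_labels(y_partial, y_prob, threshold=PSEUDO_THRESHOLD):
    """Replace unknown (-1) entries with confident predictions from the
    initially trained model; annotated entries are left untouched."""
    y = y_partial.copy()
    unknown = (y == -1)
    y[unknown & (y_prob >= threshold)] = 1.0        # confident presence
    y[unknown & (y_prob <= 1.0 - threshold)] = 0.0  # confident absence
    return y

# Re-training then uses the augmented labels with the masked loss, e.g.:
# model.compile(optimizer="adam", loss=masked_binary_crossentropy)
# model.fit(x_train, add_pseudo_labels(y_partial, model.predict(x_train)), ...)
```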

Results

Using the mel spectrograms and their corresponding labels as input, we built three individual convolutional neural network models, and we trained them in Keras (with the TensorFlow backend) on an Azure Deep Learning Virtual Machine.

· Model 1: A CNN built from scratch using the VGG16 architecture.

· Model 2: Transfer learning with fine-tuning from a pre-trained ResNet50 model.

· Model 3: Transfer learning with fine-tuning from a pre-trained ResNet50 model, with a custom loss function and pseudo labeling.

We used a default neutral threshold score of 0.5 to classify the test dataset, which includes 30,000 positive detections (labeled "1") and 194,000 negative detections (labeled "0"), and report three key metrics for each model: sensitivity, specificity, and area under the curve (AUC). While sensitivity and specificity depend on the choice of threshold, AUC provides an aggregate measure of performance across all possible classification thresholds. Although this is a multi-label, multi-class classification model, for each detection we evaluated only the classification result for the annotated species; we did not evaluate the remaining 23 species, because their presence or absence is not annotated even though their pseudo labels were generated. Using Model 1 and Model 2 as baselines, we found that including both positive and negative labeled data in training, together with pseudo labeling, improved model performance substantially.
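For reference, sensitivity, specificity, and AUC for a single species can be computed as in the sketch below; the function name and structure are ours for illustration, not taken from the released code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate_species(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, and AUC for one species' binary labels,
    given predicted probabilities and a classification threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": roc_auc_score(y_true, y_prob),
    }
```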

For our proposed model (Pre-trained ResNet50 + Custom Loss Function + Pseudo Labeling), which performs best among the three, the histograms of predicted probability scores on the test dataset show high confidence (predicted scores >0.9 for positively labeled data and <0.1 for negatively labeled data; see Fig. 2). When implementing the model for practical use, the classification threshold can be adjusted depending on the relative importance of reducing false positives or false negatives. For example, a threshold closer to 1 would only predict species presence when there is a high level of confidence.

Fig. 2. Left: Histogram of predicted probabilities for positive labeled test dataset consisting of 30,000 observations. Right: Histogram of predicted probabilities for negative labeled test dataset consisting of 194,000 observations.

Implementation

We implemented the mel spectrogram generation and training steps on an Azure Deep Learning Virtual Machine, and provided the Sieve Analytics team with the model files and Python scripts to run the models. The team can now run our proposed model, which substantially reduces the time needed for labeling.

The methodology presented here can be readily adapted to similar bioacoustics applications, whether for binary or multi-class classification. To facilitate this, we have released all the source code here:

https://github.com/microsoft/Multi_Species_Bioacoustic_Classification

The Sieve Analytics team is currently using our model and building a pipeline that enables training convolutional neural networks for multi-species identification in soundscapes, starting from raw unlabeled recordings. This addresses an important need for more accessible training data from study sites to leverage deep learning for acoustic monitoring.
