Cornell Bird-call Identification

Kumar Gaurav
32 min read · Sep 26, 2020


Identifying the species of a bird based on the sound (call) it makes.

Bird-Call Image

Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rain forests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population.

Business Problem

Description

There are already many projects underway to extensively monitor birds by continuously recording natural soundscapes over long periods. However, as many living and nonliving things make noise, the analysis of these datasets is often done manually by domain experts. These analyses are painstakingly slow, and results are often incomplete. Data science may be able to assist, so researchers have turned to large crowdsourced databases of focal recordings of birds to train AI models. Unfortunately, there is a domain mismatch between the training data (short recordings of individual birds) and the soundscape recordings (long recordings with often multiple species calling at the same time) used in monitoring applications. This is one of the reasons why the performance of the currently used AI models has been subpar.

To unlock the full potential of these extensive and information-rich sound archives, researchers need good machine listeners to reliably extract as much information as possible to aid data-driven conservation.

The Cornell Lab of Ornithology’s Center for Conservation Bio-acoustics (CCB)’s mission is to collect and interpret sounds in nature. The CCB develops innovative conservation technologies to inspire and inform the conservation of wildlife and habitats globally. By partnering with the data science community, the CCB hopes to further its mission and improve the accuracy of soundscape analyses.

In this case study, we will identify a wide variety of bird vocalizations in soundscape recordings. Due to the complexity of the recordings, they contain weak labels. There might be anthropogenic sounds (e.g., airplane overflights) or other bird and non-bird (e.g., chipmunk) calls in the background, with a particular labeled bird species in the foreground. We will try to bring new ideas to build effective detectors and classifiers for analyzing complex soundscape recordings!

CREDITS:- Kaggle

Problem Statement

For each time window, we need to predict the bird species that made a call beginning or ending in that time window. If there are no bird calls in a time window, we will predict no-call.

Real world/Business Objectives and Constraints

  • CPU Notebook <= 9 hours run-time.
  • GPU Notebook <= 2 hours run-time.
  • External data, freely & publicly available, is allowed. This includes pre-trained models.

Now that we have understood the problem we are going to solve and the constraints we need to keep in mind, let us have a look at the data we are going to work with.

Data

Data Overview

The data is downloaded from Kaggle's birdcall identification competition. The downloaded data is a zipped archive which contains multiple files, as described below.

train_audio

The train data consists of short recordings of individual bird calls, collected with the generous help of users of xenocanto.org.

test_audio

The hidden test_audio directory contains approximately 150 recordings in mp3 format, each roughly 10 minutes long. They will not all fit in a notebook’s memory at the same time. The recordings were taken at three separate remote locations in North America. Sites 1 and 2 were labeled in 5 second increments and need matching predictions, but due to the time consuming nature of the labeling process the site 3 files are only labeled at the file level. Accordingly, site 3 has relatively few rows in the test set and needs lower time resolution predictions.
Two example soundscapes from another data source are also provided to illustrate how the soundscapes are labeled and the hidden data set folder structure. The two example audio files are BLKFR-10-CPL_20190611_093000.pt540.mp3 and ORANGE-7-CAP_20190606_093000.pt623.mp3. These soundscapes were kindly provided by Jack Dumbacher of the California Academy of Science’s Department of Ornithology and Mammology.

test.csv

The test set csv file has the following columns:

  • site: Site ID.
  • row_id: ID code for the row.
  • seconds: the second ending the time window, if any. Site 3 time windows cover the entire audio file and have null entries for seconds.
  • audio_id: ID code for the audio file.

example_test_audio_metadata.csv

Complete metadata for the example test audio. These labels have higher time precision than is used for the hidden test set.

example_test_audio_summary.csv

Metadata for the example test audio, converted to the same format as used in the hidden test set.

  • filename_seconds: a row identifier.
  • birds: all ebird codes present in the time window.
  • seconds: the second ending the time window.

train.csv

A wide range of metadata is provided for the training data. We will look at all the features of the train data in detail in the data analysis part; the most directly relevant fields are:

  • ebird_code: a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
  • recordist: the user who provided the recording.
  • location: where the recording was taken. Some bird species may have local call ‘dialects’, so you may want to seek geographic diversity in your training data.
  • date: while some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data.
  • filename: the name of the associated audio file.

First Cut Approach

The main challenge of this case study is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This is the exact problem facing scientists trying to automate the remote monitoring of bird populations.

Based on the problem we are facing, the first solution that comes to mind is to treat this as a multi-class classification problem where, for each time window, we predict the species of bird that made a call in that audio sample; if no bird made a call, we predict nocall. In simple words, this is an audio classification problem, similar to an image classification problem where we classify an image based on the objects it contains (for example, cat or dog in an animal classification task). The difference, and the challenge, is that here we need to do this for audio.

Now that we know the problem, its type and the approach to a solution, let us decide on the performance measure or error metric that we need to optimize for the task.

Performance Metric

Row-wise micro averaged F1 score

For a given recording, we predict the birds calling within each time window and measure the micro-averaged F1-score across rows. A long recording may contain calls from many different birds.
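To make the metric concrete, here is a minimal sketch of one way to compute a per-row F1 and average it across rows, assuming labels are given as space-separated strings. The helper names and example values are illustrative, not the official evaluation code.

import numpy as np

def row_f1(true_birds, pred_birds):
    # F1 for a single time window, given sets of labels, e.g. {"amecro"} or {"nocall"}.
    tp = len(true_birds & pred_birds)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_birds)
    recall = tp / len(true_birds)
    return 2 * precision * recall / (precision + recall)

def mean_row_f1(true_col, pred_col):
    # Average the per-row F1 over all rows; labels are space-separated strings.
    scores = [row_f1(set(t.split()), set(p.split())) for t, p in zip(true_col, pred_col)]
    return float(np.mean(scores))

# Two time windows: one with a call, one without.
print(mean_row_f1(["amecro", "nocall"], ["amecro aldfly", "nocall"]))  # ~0.83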

Exploratory Data Analysis

We have discussed the problem statement, constraints, a brief solution approach and the metric, and had a glance at the features the data contains. So, let us dive deeper into those features and see what they have to say. In this section, I will take you through the exploratory analysis of the data, starting with the features present in our train data csv file.

 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   rating            21375 non-null  float64
 1   playback_used     19575 non-null  object
 2   ebird_code        21375 non-null  object
 3   channels          21375 non-null  object
 4   date              21375 non-null  object
 5   pitch             21375 non-null  object
 6   duration          21375 non-null  int64
 7   filename          21375 non-null  object
 8   speed             21375 non-null  object
 9   species           21375 non-null  object
10   number_of_notes   21375 non-null  object
11   title             21375 non-null  object
12   secondary_labels  21375 non-null  object
13   bird_seen         19575 non-null  object
14   sci_name          21375 non-null  object
15   location          21375 non-null  object
16   latitude          21375 non-null  object
17   sampling_rate     21375 non-null  object
18   type              21375 non-null  object
19   elevation         21375 non-null  object
20   description       15176 non-null  object
21   bitrate_of_mp3    21367 non-null  object
22   file_type         21375 non-null  object
23   volume            21375 non-null  object
24   background        8300 non-null   object
25   xc_id             21375 non-null  int64
26   url               21375 non-null  object
27   country           21375 non-null  object
28   author            21375 non-null  object
29   primary_label     21375 non-null  object
30   longitude         21375 non-null  object
31   length            21375 non-null  object
32   time              21375 non-null  object
33   recordist         21375 non-null  object
34   license           21375 non-null  object
dtypes: float64(1), int64(2), object(32)

The above table shows the features present in the data frame with their data types and the number of non-null instances. The maximum count is 21375, suggesting that we have around 21375 recordings of different birds' calls. From the table we gather that these columns mainly describe the audio files we have collected, and the columns species and ebird_code give us our class labels. Now that we have mentioned our class labels, let us look at them next.

Checking Bird Species
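A minimal sketch of what that check might look like with pandas, assuming train.csv has been extracted to the working directory:

import pandas as pd

train = pd.read_csv("train.csv")

# Number of distinct species and a peek at the ebird codes used as class labels.
print(train["ebird_code"].nunique())      # 264
print(train["ebird_code"].unique()[:5])   # e.g. ['aldfly' 'ameavo' ...]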

The above code snippet finds the number of unique bird species in the data frame. From the output we find that we have recordings of 264 different bird species. We also observe that each species is abbreviated to a six-letter eBird code for ease of labeling and readability. For example, the Alder Flycatcher species is abbreviated as aldfly, where the first three letters come from Alder and the last three from Flycatcher; similarly, American Avocet becomes ameavo. To learn more about the class labels, let us look at their distribution.

distribution of class labels

The distribution above shows the number of instances present per bird species. From the plot we observe that the maximum number of recordings for a single species is 100. Around 40% of the 264 bird species have fewer than 90 recordings in the data set.

Let us look at the other features of data now.

The plot describes the distribution of audio-recording ratings. From it we can see that 75% of the training recordings have a rating greater than 3, indicating good audio quality. Around 6000 of the roughly 21k recordings carry the highest rating of 5.

Let us look at the distribution of the pitch of the sound. Pitch: the sensation of a frequency is commonly referred to as the pitch of a sound. A high-pitch sound corresponds to a high-frequency sound wave and a low-pitch sound corresponds to a low-frequency sound wave. As shown in the graph, more than 70% of the recordings give no information on the pitch, i.e. whether it is increasing or decreasing. For fewer than 2% of the examples the pitch is specified as increasing or decreasing, meaning some bird calls are high-pitched while others are low.

The above plot shows the types of channels used in the recordings. Basically, the number of channels is the number of signals in the recording.

https://music.stackexchange.com/questions/24631/what-is-the-difference-between-mono-and-stereo
In monaural sound one single channel is used. It can be reproduced through several speakers, but all speakers are still reproducing the same copy of the signal.
In stereophonic sound more channels are used (typically two). You can use two different channels and make one feed one speaker and the second channel feed a second speaker (which is the most common stereo setup). This is used to create directionality, perspective, space.
So, from our plot we observe that more than 50% of the recordings have a single channel (mono).

Playback here is similar to the feedback concept: it is a way of reproducing an audio or video recording to recheck it. "Playback not used" means the recorded audio has not been reproduced or altered and is provided as is. From the distribution above we can see that around 18000 of the 21000 recordings did not use playback.

The distribution plot above shows the speed at which the files in the data set were recorded, i.e. normal, fast-forwarded or slow. From the plot we observe that, similar to the pitch distribution, most values are not specified; around 70% fall into the "not specified" category.

The plot above shows the year in which each bird-call recording was made. It is observed that 90% of the recordings were made in the last 10 years or so, with 2014 being the year with the highest number of recordings (around 14% of the total).

This distribution shows the month in which the recordings were made, i.e. the month in which the birds called or sang. Around 80% of the bird calls fall in the six months from March to August, which suggests that most of the birds called during spring (March to May) and summer (June to August).

The plot titled `Top 10 countries where recording is done` shows the top 10 countries by number of recordings. From the plot it is clear that around 70% of the recordings are from the USA and more than 75% from the North American continent.

The plot titled `Recordings File Types` shows the distribution of file types of the recordings. The recordings come in four formats: mp3, wav, aac and mp2, but around 99% of them are in mp3 format.

The plot titled `sampling rate` shows the distribution of the sampling rate at which the recordings were sampled, i.e. the number of audio samples carried per second, measured in Hz or kHz. The recordings contain 8 different sampling rates, with more than 95% sampled at 44100 or 48000 Hz; the most common sampling rate is 44100 Hz.

The duration distribution plot above shows the length of the recording files. The distribution starts at 0 and is right-skewed. The minimum duration is 0 seconds, around 70% of the recordings are shorter than 100 seconds, and around 99% are shorter than 500 seconds.
From the distribution plots titled `Distribution of duration of audio files` and `Cumulative distribution of Duration` we observe that 98% of the data is less than 250 seconds long, while 99% is less than 500 seconds (about 8 minutes) long.
To get a clearer picture we printed the percentiles of the duration and observed that 90% of the recordings are around 2 minutes long or shorter.

We have now looked at almost all the important features of the data frame and have an idea of the data we are dealing with. So, let us keep up the pace and look at some feature-extraction techniques that can convert these audio recordings into features we can use to train a classification model.

Feature Extraction

To load and process these audio files we will use the librosa library, which has audio-processing functions similar to scipy. librosa is chosen over other audio-processing libraries because it can load various types of audio files and processes them quickly. The library also offers useful techniques such as the short-time Fourier transform and spectrograms, and many more, but we will focus on only a few of them here.

Before we go deep into feature extraction, let us understand the simple processing librosa does while loading an audio file. librosa loads an audio file into two major components:
1. Sound (samples): the sequence of vibrations of varying pressure strength, represented as an array of numbers.
2. Sample rate (sample_rate): the number of samples of audio carried per second, measured in Hz or kHz.

Let us look at a code snippet and the output we get after loading an audio file.

loading and sampling audio
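A minimal sketch of the loading step with librosa; the file path below is hypothetical, and the printed shape matches the example discussed next.

import librosa

SR = 44100  # sampling rate used throughout

# Hypothetical path to one Alder Flycatcher recording inside train_audio.
samples, sample_rate = librosa.load("train_audio/aldfly/some_recording.mp3", sr=SR)
print(samples.shape, sample_rate)  # e.g. (1601280,) 44100 for a roughly 36-second file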

Here, we used librosa's load function to read the audio files. The load function takes two arguments, the audio file path and the sample rate at which to sample the file, and returns the array of samples and the sample rate. As shown, we used a sample rate of 44100 Hz, meaning we sampled 44100 samples per second from the audio file. From the analysis we found that most of the audio in the data set is sampled at 44100 Hz, and the audio-processing references we consulted treat 44100 Hz as the standard sampling rate, so we will use it for our data processing. The samples array has shape (1601280,), which is the sample rate multiplied by the duration of the file.
Since the durations of the audio files differ, we will get a larger or smaller samples array for every file.

Let us understand the sound waves graphically, since, as they say, humans perceive better with visuals.

We now have a samples array for each audio file, but we cannot say much from it except the duration and sampling rate of the audio. So, we plotted a 2D representation of the audio signal where the x-axis represents time and the y-axis the amplitude of the sound wave at that time. In the first plot we see spikes suggesting the presence of bird sound at those points. The plot also shows points where the amplitude decreases, suggesting that the bird is not calling continuously or that its pitch varies over the call.

So, it might be possible to capture the bird call in smaller samples, i.e. shorter slices of the file. Let us check this theory.

From the plot we observe that our theory is correct: even the smaller samples contain bird calls. Generally, bird calls are short while songs are long, which is also visible in this graph. Since bird calls appear in short durations, we can divide the files into 5-second samples each and check for the presence of a call; this way we get an equal sample size for every bird call, which helps with data preparation during model training (see the slicing sketch below). Let us then look at some more ways of processing audio and see whether we can find more insights.
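As a quick illustration of the 5-second windowing idea, here is a sketch that slices the samples array from the earlier loading step into consecutive 5-second clips (variable names follow the previous snippet):

import numpy as np

SR = 44100
window = 5 * SR  # 5 seconds expressed in samples

# Drop the ragged tail and reshape into (number of clips, samples per clip).
n_clips = len(samples) // window
clips = samples[: n_clips * window].reshape(n_clips, window)
print(clips.shape)  # (n_clips, 220500)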

We have observed the sound waves in a 2D representation showing how the amplitude varies with time. Next, we will look at features of the sound wave where other parameters, such as frequency, vary with time.

Spectrogram

We have seen the 2D representation of audio in the plots above, which does not describe much more about the recording than its amplitude over time. To explore the audio files further, we turn to another representation called the spectrogram. According to Wikipedia, a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams.

Spectrograms are used extensively in the fields of music, linguistics, sonar, radar, speech processing, seismology, and others. Spectrograms of audio can be used to identify spoken words phonetically and to analyze the various calls of animals. A spectrogram can be generated by an optical spectrometer, a bank of band-pass filters, or the Fourier transform.

A spectrogram is usually depicted as a heat map, i.e., as an image with the intensity shown by varying the color or brightness.

As noted above, a spectrogram can be generated by several methods, one of which is the Fourier transform. Let us learn more about the Fourier transform, as it is the most widely used mathematical method for generating spectrograms from audio recordings.

In simple terms, sound is a sequence of vibrations over time, as we have seen in the plots above. The Fourier transform is another mathematical way of representing sound.
In mathematics, a Fourier transform (FT) is a mathematical transform that decomposes a function (often a function of time, or a signal) into its constituent frequencies, such as the expression of a musical chord in terms of the volumes and frequencies of its constituent notes. The term Fourier transform refers to both the frequency domain representation and the mathematical operation that associates the frequency domain representation to a function of time.
So, let us see how the FT transforms the sound wave through the operations below on the aldfly audio call.

Short-Time Fourier Transform
Musical signals are highly non-stationary, i.e., their statistics change over time. It would be rather meaningless to compute a single Fourier transform over the entire song length.

The short-time Fourier transform (STFT) (Wikipedia; FMP, p. 53) is obtained by computing the Fourier transform for successive frames in a signal.

X(m, ω) = Σ_n x(n) · w(n − m) · e^(−jωn)

As we increase m, we slide the window function w to the right. For the resulting frame, x(n)·w(n − m), we compute the Fourier transform. Therefore, the STFT X is a function of both time, m, and frequency, ω.

librosa.stft computes an STFT. We provide it a frame size, i.e. the size of the FFT, and a hop length, i.e. the frame increment, and plot the results.
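A minimal sketch of that call, reusing the samples array loaded earlier; the frame size and hop length here are illustrative and match the values discussed later in this section.

import numpy as np
import librosa

# n_fft is the frame (FFT) size, hop_length the frame increment.
X = librosa.stft(samples, n_fft=1024, hop_length=512)
print(X.shape)         # (1 + n_fft/2, number of frames) = (513, ...)
magnitude = np.abs(X)  # complex STFT -> magnitude; the phase is discarded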

I know we cannot make much sense of this image yet, so let us see what we can do to get more out of it. In music processing, we often only care about the spectral magnitude and not the phase content.
As defined above, the spectrogram shows the intensity of frequencies over time, and it is simply the squared magnitude of the STFT X that we calculated above:

S(m, ω) = |X(m, ω)|²

There is one more thing: human perception of sound intensity is logarithmic in nature, so we are often interested in the log amplitude. Therefore, we convert the output to a log (decibel) scale and plot it.
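A short sketch of that conversion and plot, continuing from the STFT X computed above; the plotting details are illustrative.

import numpy as np
import librosa.display
import matplotlib.pyplot as plt

S = np.abs(X) ** 2                         # power spectrogram: squared magnitude of the STFT
S_db = librosa.power_to_db(S, ref=np.max)  # log (decibel) scale, closer to how we perceive loudness

librosa.display.specshow(S_db, sr=44100, hop_length=512, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.show()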

This is what a spectrogram looks like: the call part of the audio is highlighted while the rest stays dark.

Mel-Spectrogram

When we applied the log transformation to the STFT (short-time Fourier transform), we got the spectrogram above. So how is the Mel-spectrogram different? The Mel-spectrogram is a normal spectrogram, but with a Mel scale on the y-axis rather than the log scale used above, where the Mel scale, mathematically speaking, is the result of a non-linear transformation of the frequency scale.
We can use the Mel-spectrogram feature of the librosa library to easily compute this feature of the audio signals.
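A minimal sketch of that call on the same samples array; the parameters are the ones discussed in this section, and the plotting part is illustrative.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

M = librosa.feature.melspectrogram(y=samples, sr=44100, n_fft=1024, hop_length=512,
                                   n_mels=128, fmin=20, fmax=16000)
M_db = librosa.power_to_db(M, ref=np.max)  # Mel power spectrogram on a decibel scale

librosa.display.specshow(M_db, sr=44100, hop_length=512, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.show()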

We can clearly see the changes from the previous plot: here the y-axis uses the Mel scale rather than the scale used before. We have now learned about quite a few features, such as the Fourier transform, the spectrogram and the Mel-spectrogram. It is time to understand how to use these features properly.

From the pre-processing part we found that 5-second samples contain the bird's call. So, let us go ahead and check the spectrogram features for these samples, and also verify some other parameters like frequency range and window length.

The above two plots are the Mel-spectrogram features for the 5-second audio samples. The parameters used, such as n_fft (window length) and hop length, are taken from the audio pre-processing paper we referenced and were found to work best, so we plotted the final versions using them.

Also, in the first plot we used n_mels=128 and in the second n_mels=256, while frequencies below 20 Hz and above 16000 Hz were discarded, since this range roughly covers human hearing. The first plot looks good: the bright, contrasting bell shape shows the presence of the bird call. The second plot looks similar, but has lines and distortion in its lower half, suggesting that a higher number of Mel bands introduces distortion into the image. So we conclude that a spectrogram with n_fft=1024, hop_length=512, n_mels=128 and a frequency range of 20–16000 Hz gives the best features for the audio, with the bird's call properly represented. Let us summarize what we have gathered so far.

Observation on EDA and Feature Extraction

1. We have 264 unique bird species in our data.
2. 50% of the bird species have 100 recordings each, while the rest have between 1 and 100 recordings available to us.
3. Recordings contain bird calls, songs and many other kinds of bird sounds as well as background sounds.
4. The pitch and speed of the audio are not specified in the data for around 90% of the recordings.
5. 90% of the recordings have been rated higher than 3, where 1 is the lowest rating and 5 the highest.
6. Most of the bird calls were recorded in the spring and summer seasons of North America (USA, Canada and Mexico), with the USA being the country with the highest number of recordings over the past 9 years, from 2011 to 2019.
7. The minimum duration of the recorded audio is zero seconds, and 85% of the recordings are shorter than 100 seconds.
8. 99% of the recordings are in mp3 format, and more than 55% of them have a sampling rate of 44100 Hz.

Moving ahead, let us see what we observed in the feature-extraction part:
1. The librosa library is used to load and process the data.
2. librosa loads the sound as a sequence of vibrations stored as an ndarray, together with the sample rate, the rate at which the sound is sampled.
3. We learned how to visualize a sound wave in a 2D representation with time on the x-axis and amplitude of the sound on the y-axis.
4. Using librosa we extracted several features from the sound, such as the zero-crossing rate (the number of times the wave crosses the horizontal axis); the Alder Flycatcher's call crossed the horizontal axis 431290 times.
5. We observed how to convert audio signals into images using the spectrogram of the sound wave.
6. The Fourier transform, a mathematical way to convert a sound wave from the time domain to the frequency domain, plays an important role in generating these features.
7. Another important feature, the Mel-spectrogram, is the same as the spectrogram but calculated on the Mel scale rather than a log scale, and provides a more readable image, as observed.
8. Features like harmonics and perceptual components helped us learn about the pitch and color of the sound.
9. We learned that we can use 5-second samples of the audio files instead of whole recordings, which would otherwise cause size mismatches across files and require extra effort in padding the sequences.
10. The 5-second sample length was chosen after analyzing shorter samples and finding that 5 seconds is an optimal duration in which every audio clip contains a bird call, so we should not end up with silence- or noise-only samples.
11. The Mel-spectrogram with n_mels=128 and the signal between 20 and 16000 Hz showed the best image features for the 5-second samples, with the presence of the call clearly visible.
12. The frequency range was decided from the harmonics and the 2D representation of the sound, where we found that this range has minimum distortion.

Data Preparation

Now that we have learned about the features that can be generated from audio files, we can go ahead and prepare the data accordingly. Research on audio classification shows that LSTM and CNN models are widely used for this type of task. A CNN needs image-like (or text) data, while an LSTM needs sequential data. So, to meet both requirements, we generate the raw samples produced by librosa and the spectrograms as sequence data for the LSTM models, and convert the spectrograms into image features by stacking them into 3 dimensions for the CNNs. We also convert our class labels, the ebird codes, into integer values as shown below.

As discussed above, we sample the files in 5-second windows, generate the features and store them as compressed npz files containing the data and labels. We processed batches of 32 audio files at a time and stored each batch in an npz file. Look at the code snippet below to understand what I did.
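Since the original notebook snippet is embedded as a gist, here is a hedged sketch of what such a batch-to-npz step could look like; melspec_of_clip, process_batch, file_paths and labels are hypothetical names, and the Mel parameters follow the feature-extraction section.

import numpy as np
import librosa

SR = 44100
CLIP_SECONDS = 5
BATCH_SIZE = 32  # audio files processed per saved .npz chunk

def melspec_of_clip(clip, sr=SR):
    # Mel-spectrogram with the parameters found to work best earlier.
    m = librosa.feature.melspectrogram(y=clip, sr=sr, n_fft=1024, hop_length=512,
                                       n_mels=128, fmin=20, fmax=16000)
    return librosa.power_to_db(m)

def process_batch(file_paths, labels, out_path):
    feats, labs = [], []
    for path, label in zip(file_paths, labels):
        samples, _ = librosa.load(path, sr=SR)
        clip = samples[: SR * CLIP_SECONDS]                      # take a 5-second window
        clip = np.pad(clip, (0, SR * CLIP_SECONDS - len(clip)))  # pad recordings shorter than 5 s
        feats.append(melspec_of_clip(clip))
        labs.append(label)
    np.savez_compressed(out_path, data=np.array(feats), labels=np.array(labs))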

So, we have created a data loader that loads the data in batches, processes it and extracts features, making it ready for the model. It is time to experiment with some classification models and observe the results. But before that, let us have a look at some augmentation techniques which might come in handy for improving the data samples as well as the performance of the models.

Data Augmentation

Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models without actually collecting new data. Techniques such as cropping, padding and horizontal flipping are commonly used to train large neural networks on image data. We have sound data instead, but thanks to domain experts there are various augmentation techniques for sound as well; among the most widely used are time stretching, pitch shifting and frequency stretching.

Before going further let us understand these techniques first. According to Wikipedia, Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance.

For our task, we used the addition of white noise, background-noise reduction, time stretching and pitch shifting, as these were found to give the best results, which we will verify further in the model-training section. We can easily implement these techniques using audiomentations transforms, as shown in the code snippet below.
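A minimal sketch with audiomentations; the exact parameter values are illustrative. Note that audiomentations expresses pitch shift in semitones, so a small semitone range stands in for the modest pitch change described below.

from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),  # light white/Gaussian noise
    TimeStretch(min_rate=0.7, max_rate=1.3, p=0.5),                     # stretch/compress time by up to 30%
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),               # small pitch shift
])

# `samples` is the float array returned by librosa.load for one recording.
augmented = augment(samples=samples, sample_rate=44100)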

As seen in the above code snippet, we stretch or compress time by up to 30 percent, while the pitch is shifted by only about 5 percent, since a large shift causes a noticeable change in the file, which we don't want as it could impact the model badly.

Modeling

There is no fixed rule for knowing which model will work best on the data we have, so it is important to experiment with models suited to the problem, observe which performs best and go with it. We started with machine learning models like decision trees and random forests, but found that, due to the complexity of the data, they take a long time to train and their results are worse than even the simple neural network model we will see below. Hence, we decided to experiment with some famous neural network architectures as well as some deep learning models of our own. The models we experimented with are:

  1. Different LSTM Architectures
  2. VGG16
  3. ResNet50
  4. Inceptionv3
  5. DenseNet
  6. Ensemble of one or two CNN Architectures

They are discussed in more detail below.

LSTM Architectures

Now that we have done the feature extraction and created the data-loading pipeline, it is time to decide a few more things before we build and train the neural network models. We already have an idea of the model inputs and outputs and of the performance metric we want to maximize, as discussed in the metrics section above. We also need to decide which loss to minimize during training. Since we are dealing with multi-label classification, we will use binary cross-entropy loss for each of the 264 classes and average it. The performance metric will be the F1-score with weighted averaging. To keep track of training we will use various callbacks: reducing the learning rate if updates stop improving significantly, saving model checkpoints to avoid re-training, and early stopping once the desired results are reached. The code snippets below implement the loss, metrics and callbacks just mentioned.
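A hedged sketch of how these pieces could be set up in Keras; f1_metric here is a simple batch-wise approximation of the weighted F1 tracked in the write-up, and the callback arguments are illustrative.

import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

def f1_metric(y_true, y_pred):
    # Batch-wise F1 on thresholded sigmoid outputs (an approximation of the epoch-level score).
    y_pred = K.cast(K.greater(y_pred, 0.5), K.floatx())
    tp = K.sum(y_true * y_pred)
    precision = tp / (K.sum(y_pred) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, verbose=1),  # shrink LR when training stalls
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True, verbose=1),  # keep the best weights
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),  # stop early if no improvement
]

loss = "binary_crossentropy"  # independent binary cross-entropy over the 264 sigmoid outputs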

Since we are working with multi-class, multi-label classification, we want our model to take the processed audio as input and output the classes the clip belongs to. We have 264 class labels, so we create a simple 4-layer LSTM network that takes a 3D input of shape (batch_size, sequence length, time-step) and outputs sigmoid probabilities for each of the 264 classes. Let us have a look at the model creation and training in the code snippet below.
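A minimal sketch of such a network in Keras, reusing the f1_metric, callbacks and loss from the previous snippet; the input shape and the train_gen/val_gen generators are assumptions for illustration.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization

NUM_CLASSES = 264

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(50, 128)),  # (sequence length, features per step)
    LSTM(64, return_sequences=True),
    LSTM(32, return_sequences=True),
    LSTM(32),
    BatchNormalization(),
    Dropout(0.3),
    Dense(NUM_CLASSES, activation="sigmoid"),  # one independent probability per species
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=loss, metrics=[f1_metric])

# train_gen / val_gen stand in for the batch data loaders described earlier.
model.fit(train_gen, validation_data=val_gen, epochs=20, callbacks=callbacks)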

The above code shows how to build the model, define the callbacks that keep track of training, define custom metrics, compile the model and finally train it, in that order. In the model you can see the output layer, metrics and callbacks specified above.

The model is trained for 20 epochs with the Adam optimizer and a learning rate of 0.01, which was found to work best among the parameters tried in earlier runs. We used the weighted F1-score and averaged binary cross-entropy loss since we have multi-label outputs. The result at the end of training is an F1-score of 0.12, which is better than nothing, so let us experiment with some more models and check their performance.

Another LSTM based Architecture

Building on the previous model, we stay with LSTMs but change some layers. Also, instead of taking raw samples as input, we use the spectrogram features extracted from the audio during processing. The input to the model is now spectrograms generated from the 5-second samples, with 50 frames taken at a time, giving an input shape of (batch, 128, 50); the output is a 264-unit sigmoid layer giving a probability for each class.

The best initialization was found to be Glorot with ReLU activations and the Adam optimizer at a 0.01 learning rate. The model is trained like the previous one with the best parameters, with regularization between layers (batch normalization and dropout), on the whole training set. The results obtained are noticeably better than the previous model's.

In this approach we used an LSTM network on spectrogram data, but only 50 frames per 5-second sample, and achieved better results. So we trained the same architecture again with the full spectrogram of each 5-second sample, changing only the input layer to shape (None, 128, 451). The model was trained in runs of 10 epochs each, around 5 times, and the results were observed. We used the ReduceLROnPlateau callback to adjust the learning rate during training, along with the other callbacks to keep an eye on progress. The results are better than the previous ones, but we don't see any significant improvement, so let us experiment with different architectures.

VGG16
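The original snippet is embedded as a gist, so here is a hedged sketch of the general idea: a frozen VGG16 backbone as feature extractor over stacked (3-channel) spectrogram images, topped with a small classification head. The input shape and head sizes are assumptions.

import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 264

# Frozen ImageNet backbone; global average pooling turns the conv features into a vector.
base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 451, 3), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")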

The above code shows the implementation of the VGG16 model. We used this architecture as a feature extractor for our spectrogram and Mel-spectrogram data and then trained a simple NN head for classification. The models trained on spectrograms and Mel-spectrograms performed better than the previous results, with scores of 0.54 and 0.69 respectively, suggesting that CNN models work well, and even better with Mel-spectrogram images. To further improve and verify our results, let us try some more complex networks.

ResNet50

Here we tried the well-known ResNet50 architecture, a more complex network than VGG16. We used it as a feature extractor and trained it like the previous model on spectrogram and Mel-spectrogram images, getting scores of 0.44 and 0.72 respectively. The model performs better than VGG for Mel-spectrogram input but worse on the spectrogram data. To understand whether this is due to the complexity of the architecture or the spectrogram feature itself, let us train an even more complex architecture.

Inceptionv3

Following these results, we tried another deep CNN architecture, InceptionV3, in the same way. We evaluated it on both spectrogram and Mel-spectrogram images and got scores of 0.56 and 0.67 respectively. InceptionV3 gives a better result for spectrogram images than ResNet, but not significantly, suggesting that the feature itself is not helping the model learn much. The Mel-spectrogram score for InceptionV3 dropped compared to ResNet50; to check whether this is due to the architecture's complexity, we decided to try an even deeper CNN, DenseNet.

DenseNet

To further verify our observation from InceptionV3, we trained a deep DenseNet model and confirmed that the more complex models do not perform better than our previous ones, with a score of around 0.48. So far, ResNet with Mel-spectrogram image features performs best of all, but the score is still not very promising. To improve it further we decided to try ensembling some of the better-performing models. Let us see how we can combine these models and achieve more.

Ensembles

To ensemble multiple architectures, we referred to the attached diagram and tried various combinations. For example, we used ResNet and VGG, ResNet and Inception, and Inception and VGG as feature extractors, and performed either feature fusion on the extracted features or result fusion on the outputs of the individual models, depending on the shapes of the features or results from each architecture. The code snippet below shows an example of one ensemble model we tried.
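A hedged sketch of the feature-fusion idea for the ResNet50 + VGG16 pair: both frozen backbones see the same Mel-spectrogram image and their pooled features are concatenated before a shared classification head. Shapes and layer sizes are assumptions.

import tensorflow as tf
from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras import layers, Model

NUM_CLASSES = 264
INPUT_SHAPE = (128, 451, 3)  # stacked Mel-spectrogram "image"

inp = layers.Input(shape=INPUT_SHAPE)

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
res = ResNet50(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False
res.trainable = False

# Feature fusion: concatenate the pooled features from both backbones.
features = layers.Concatenate()([vgg(inp), res(inp)])
x = layers.Dense(512, activation="relu")(features)
x = layers.Dropout(0.3)(x)
out = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)

ensemble = Model(inputs=inp, outputs=out)
ensemble.compile(optimizer="adam", loss="binary_crossentropy")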

For the outputs obtained from either feature fusion or result fusion, we tried some machine learning algorithms and simple NN heads with sigmoid activation to get the predictions. The results from the ML models are around 0.57, suggesting they do not learn well, either because of the complexity of the features generated by these architectures or because of the amount of data.

So we shifted our focus to the other method and observed that ensembles using result fusion with a sigmoid output layer do well, with a best score of 0.76, which is indeed better than our previous best achieved by ResNet50. The ensemble with feature fusion of ResNet50 and VGG16 on Mel-spectrogram images achieved the best result of all, with an average F1-score of 0.83.

Conclusion

We have done a lot, including model implementation and training, so let us summarize what we have done and observed so far.
1. We used raw audio samples, spectrogram images and Mel-spectrogram images as processed features.
2. Very low and very high frequencies have been removed from the data.
3. In the Mel-spectrogram data the Mel values are converted to a power scale with exponent 0.5 to reduce background noise.
4. Augmentation is done by adding Gaussian noise and white noise and by pitch shifting the audio.
5. These augmentations are implemented with the help of the audiomentations library.
6. The Mel-spectrogram and spectrogram images are extracted from 5-second clips and properly normalized and cropped/padded while being fed to training.
7. The extracted Mel-spectrogram and spectrogram images are stacked to make them usable by the CNN architectures.
8. A Keras Sequence generator and data loader are used to load the data in batches during training and evaluation.
9. Multiple networks are used during training and their performance is observed as shown in the above summary table.
10. The first part of the table shows the networks trained with raw data and spectrogram data with LSTM and CNN architectures.
11. As shown in the first part of the summary table, a not-too-deep 3-layer network with 32 LSTM units, with proper dropout and batch normalization, was trained on spectrogram and raw data. It achieved reasonable results of up to a 0.34 F1-score before we moved on to CNN models.
12. From plain CNNs to deep CNNs, we tried four different architectures: VGG16, ResNet50, InceptionV3 and DenseNet. The deepest CNNs did not perform as well as the other architectures, as shown in the table.
13. The Inception architecture works best for the spectrogram image data with around a 0.56 F1-score, while ResNet and VGG are comparatively good with scores around 0.54.
14. The ML ensemble models trained on the features extracted by these CNN models achieved decent results but did not meet our expectations.

15. Failing to reach a performance above 60%, we tried the same CNN architectures with some custom ensembling and combinations, but with different features, i.e. Mel-spectrogram data, and with some augmentation as well.
16. The same models perform well on this feature data, as you can verify from the results in the summary table.
17. The VGG16 model achieved a better result than the models from part 1, with a score of around 0.69.
18. Among all the single architectures, ResNet50 achieves the best result with a score of around 0.72, with Inception close behind at around 0.67.
19. We tried ensembling the features and outputs generated by these models one by one, as shown in the summary table.
20. When the features of two of the best-performing networks, VGG16 and ResNet50, are combined to train another model to classify the calls, we achieve the best result of a 0.83 F1-score.
21. The other ensembles do well too, as can be seen in the summary table. We tried every combination, such as VGG with ResNet, VGG with Inception, and ResNet with Inception, so that we would know what works well and what does not.

Now that we have discussed what worked best, let us discuss what did not work.

1. The simple LSTM network performs worst of all, with around a 0.12 F1-score, as shown in the summary table.
2. Among all the optimizers and learning rates, Adam with a learning rate of 0.001 worked best for almost all the models, while SGD and Adadelta behaved poorly except for a few models, as shown in the summary table.
3. Only the better-performing training results for each individual model appear in the notebook, as the poorer runs were overwritten with each training sequence.
4. Spectrograms performed worse than Mel-spectrograms, as the latter reduce the background noise and remove very low and very high frequency content from the audio.
5. Ensembling the features generated by VGG and the other CNN networks across spectrogram and Mel-spectrogram data performed poorly, due to the differences in input shape (spec vs. Mel-spec) as well as in the feature shapes produced by the last convolution block, except for ResNet and VGG, which achieved the best results thanks to the similarity of their architectures.
6. The ML ensembles did not work well due to the complexity of the features generated by the networks.
7. Overall, the LSTMs performed poorly due to inconsistencies in the audio files, such as the presence of noise, background sounds, stretches with no sound at all, and discontinuous calls.

Existing Approaches and Improvement

Now that we have examined our approach quite closely, it is time to discuss whether we have improved on the existing approaches.

The existing approach we referenced used a ResNet50 CNN trained on spectrograms generated from the audio files and achieved an average mAP of around 0.54. Here we tried various models, as shown above, and found that our Inception model already achieved better results than the existing approach, and our best result overall is around a 0.83 F1-score.

Summary

We have now covered almost all aspects of the case study, from data processing and analysis to model building and training, and we have already drawn conclusions in the section above. So, let us summarize the results for a better and easier understanding of our approach.

The above two tables show the results we obtained with each model and the hyper-parameters we tried.

For the full implementation of the case study, refer to this GitHub repository. For any queries, connect with me on LinkedIn and GitHub.

Future Works

Before wrapping up, let us also look at what we missed or could not do because of certain limitations.

During training and evaluation we found that when we introduced more background sounds and noise, especially overlapping the bird calls, the model did not generalize well and gave even poorer results. We could gather more data like this and retrain the models on these outlier cases. Deeper and better-tuned LSTM networks, as well as other convolutional networks like EfficientNet and Xception, could be tried, since CNNs like ResNet have been found to work best. Further hyper-parameter tuning and data augmentation techniques could also be explored to achieve better results.


Kumar Gaurav

Data Science Enthusiast, Mathematics lover, always striving to learn new things. https://www.linkedin.com/in/krgaurav45