Bird Species Classification in High-Resolution Images

Akash Kumar
7 min read · Oct 12, 2018

This post is a summary of my winning entry in a challenge held at the International Conference on Computer Vision & Image Processing (ICCVIP’18). The challenge involves supervised classification of bird species from a set of bird images. What’s the catch??? Let’s find out.

Training Dataset Overview

Relevancy of the Problem

From an ecological and environmental point of view, monitoring bird diversity is an important task, especially in the case of Himalayan birds: there are many diverse species, with very few images of each species. While bird monitoring is a well-established process, the observation is largely carried out manually, which is time-consuming😐😐😐, and hence the scalability is low👎🏻. This has motivated the use of machine learning methods to analyze bird images and sounds, using camera-trap data, recorded data or crowd-sourcing. The main challenge, especially for Himalayan birds, is that the available data is a limited but diverse crowd-sourced set. In particular, the present challenge involves a fairly small amount of labelled data and may require transfer-learning-based approaches for effective classification.

Let’s begin with an introduction to the dataset, followed by data augmentation practices, object detection, ImageNet models and, lastly, model ensembling.

Dataset Overview

The training dataset contains 150 images and the testing dataset contains 158 images (1 of them corrupted). The birds are to be classified into a total of 16 species. The dataset is class-imbalanced too, with the number of images ranging from 5 to 20 per species. The resolution of the images ranges from 800x600 to 6000x4000. Such high-resolution images were one of the major hurdles in this challenge. Now, let’s discuss the step-by-step classification of bird species.

Methodology Overview

The methodology covers the following subtopics:

  1. Data Augmentation
  2. Bird Detection Algorithms
  3. Multi-Stage Training
  4. Architecture Overview
  5. Test Results & Future Work

Now, let’s discuss these one by one. The code is shared at my Git repo.

Data Augmentation

We have only got 150 images in the training set. That’s damn low🙁🙁. I used the imgaug library for data augmentation of the dataset. imgaug makes it quite easy to process images, and its documentation and installation are quite easy too. You can install the latest version using this command:

sudo pip install git+https://github.com/aleju/imgaug

You can also look at the GitHub repo by aleju to get a better understanding of the different types of data augmenters. We used several augmentation practices such as Gaussian noise, Gaussian blur, flipping, contrast enhancement, HSV colour-space variation, image sharpening, affine transformations, etc. It also helped us manage the class-imbalance problem. The table below illustrates the different types of data augmentation used for different bird species:

Data Augmentation Overview

Data augmentation helped us increase the dataset from 150 images to around 1.3k images 🙃. Now that’s something we can use for transfer learning, I guess. 🤔
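For the curious, here is a minimal sketch of the kind of imgaug pipeline described above. The specific augmenter parameter ranges are illustrative assumptions, not the exact values from my code:

import imgaug.augmenters as iaa

# Illustrative pipeline covering the augmentations listed above;
# the parameter ranges here are assumptions, not the exact ones used.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                   # flipping
    iaa.GaussianBlur(sigma=(0.0, 1.5)),                # Gaussian blur
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),  # Gaussian noise
    iaa.LinearContrast((0.75, 1.5)),                   # contrast enhancement
    iaa.Sharpen(alpha=(0.0, 0.5)),                     # sharpening
    iaa.AddToHueAndSaturation((-20, 20)),              # HSV colour variation
    iaa.Affine(rotate=(-15, 15), scale=(0.9, 1.1)),    # affine transformations
], random_order=True)

# images: a list of HxWxC uint8 numpy arrays
augmented = seq.augment_images(images)

Applying the pipeline more times to the under-represented species is also a simple way to even out the class imbalance.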

Bird Detection Algorithms

Although transfer learning helped us improve the accuracy of our model, it was unable to learn the fine-grained features of the species. That’s what I focused on next. Initially, I used the Single Shot Detector (SSD) and YOLO to detect birds, but these networks require images to be resized to 512x512 or 416x416. This leads to a huge loss of information, so they were unable to detect birds in the images. I then moved on to a saliency detection algorithm that separates the foreground and background of an image. That also failed.😐 We were kind of stuck at this phase. 😰

So I got to thinking🤔🤔🤔 and looked into classic object detection techniques such as R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, etc. Mask R-CNN’s accuracy stood out from the rest. Mask R-CNN takes the whole image as input and then runs object detection over the whole image. I used Mask R-CNN pretrained on the COCO dataset, which contains 1.5 million object instances across 80 object categories (including birds). That’s what helped me the most. Out of 150 images, more than 140 gave perfect bird crops. A sample of bird crops is as follows:

Mask R-CNN cropped images
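To give a rough idea of how the cropping works, here is a sketch based on the Matterport Mask R-CNN implementation (https://github.com/matterport/Mask_RCNN) with COCO weights. The config values and the crop_birds helper are my illustrative assumptions, not the exact code from my repo:

import mrcnn.model as modellib
from mrcnn.config import Config

class InferenceConfig(Config):
    NAME = "coco_inference"
    NUM_CLASSES = 81        # 80 COCO classes + background
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(),
                          model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)

BIRD_CLASS_ID = 15  # index of 'bird' in the Matterport COCO class list

def crop_birds(image):
    """Return crops of every detected bird in an HxWxC uint8 image."""
    result = model.detect([image], verbose=0)[0]
    crops = []
    for box, cls in zip(result["rois"], result["class_ids"]):
        if cls == BIRD_CLASS_ID:
            y1, x1, y2, x2 = box  # boxes come back as (y1, x1, y2, x2)
            crops.append(image[y1:y2, x1:x2])
    return crops

Passing these crops to the classifiers, rather than a brutally resized full image, is what preserves the fine-grained plumage detail.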

Multi-stage Training

For training purposes, ImageNet pretrained models came to the rescue. ImageNet contains 1.2 million images belonging to 1000 classes. Network architectures such as AlexNet, VGG16, ResNet and the Inception nets have been found to be great at classifying images using learned features. Unfortunately, these networks have millions of parameters to learn if trained from scratch. Instead of training from scratch, we used ImageNet pretrained weights. Transfer learning helped increase the classification accuracy from 6–7% to 30–40%. In particular, Inception V3 and Inception ResNet V2 gave the best results.

Training Parameters

  1. Number of epochs: 7
  2. Activation Function: Swish (Improves the accuracy by 2~3%)
  3. Loss: Categorical Cross-entropy
  4. Optimizer: Adam
  5. Initial Learning Rate: 0.01
  6. Learning Rate Schedule: Stepwise decay after every 4 epochs

Initially, the ImageNet models were fine-tuned on the original augmented dataset of 1.3k images. That gave an accuracy of around 31% and 48% in the case of Inception V3 and Inception ResNet V2 respectively. After that, we trained on data-augmented Mask R-CNN crops, which helped improve the accuracy by 10% in the case of Inception V3 and 2~3% in the case of Inception ResNet V2. That’s pretty good. 🙂

Multi-stage Training on Different Architectures
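To make the training setup concrete, here is a minimal fine-tuning sketch in tf.keras using the parameters listed above. The classifier head, the decay factor of 0.1 and the train_data/train_labels names are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import InceptionV3

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x * tf.sigmoid(x)

base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
head = layers.Dense(256, activation=swish)(base.output)
out = layers.Dense(16, activation="softmax")(head)  # 16 bird species
model = models.Model(base.input, out)

model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Stepwise decay: shrink the learning rate after every 4 epochs.
def step_decay(epoch, lr):
    return lr * 0.1 if epoch > 0 and epoch % 4 == 0 else lr

# train_data: (N, H, W, 3) float array; train_labels: (N, 16) one-hot
model.fit(train_data, train_labels, epochs=7,
          callbacks=[callbacks.LearningRateScheduler(step_decay)])

The same recipe is run twice: first on the 1.3k augmented full images, then again on the augmented Mask R-CNN crops.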

Architecture Overview

At last, we created an end-to-end deep learning pipeline consisting of Mask R-CNN and the trained ensemble models. All test images were first passed through the pretrained Mask R-CNN. From there, it splits into two cases: if a bird is detected in the test image (which it mostly was), it is a SUCCESS, otherwise a FAILURE. The two cases are handled as follows:

  1. SUCCESS:
    a. A batch of cropped images from Mask R-CNN is created.
    b. The whole batch is then passed through the trained ImageNet models.
    c. The ImageNet model with the higher prediction value assigns its species to the whole image.
  2. FAILURE:
    a. The test image is resized to 416x416.
    b. The resized image is passed through the trained ImageNet models, and the species is assigned based on the highest predicted value.
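Putting it together, a minimal sketch of this dispatch logic could look as follows. Here crop_birds, and the two classifier models, refer back to the earlier sketches; the crop resize size and pixel scaling are assumptions:

import cv2
import numpy as np

def predict_species(image, model_v3, model_irv2):
    crops = crop_birds(image)  # Mask R-CNN bird crops (may be empty)
    if crops:
        # SUCCESS: classify the whole batch of bird crops.
        batch = np.stack([cv2.resize(c, (416, 416)) for c in crops])
    else:
        # FAILURE: fall back to the resized full image.
        batch = cv2.resize(image, (416, 416))[np.newaxis]
    batch = batch.astype("float32") / 255.0  # simple pixel scaling (assumed)

    preds_v3 = model_v3.predict(batch)
    preds_irv2 = model_irv2.predict(batch)
    # The model that is more confident anywhere in the batch wins,
    # and its strongest class is assigned to the whole image.
    best = preds_v3 if preds_v3.max() > preds_irv2.max() else preds_irv2
    return int(best.max(axis=0).argmax())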

The architecture is shared as below:

Model Architecture

Mask R-CNN + Inception nets increased the accuracy by 6% in the case of Inception V3 and by 2% in the case of Inception ResNet V2. Inception V3 performed better in some cases. After ensembling, the final accuracy increased by another 4~5%. Et voilà!!! 😇😇😇

Test Results

Evaluation of test data in terms of class-averaged Precision, Recall and F1-scores is shown in the table below:

Evaluation Metrics (in %) on Test Dataset

Confusion Matrix of Mask R-CNN + Ensemble model is as follows:

Confusion Matrix

Challenges

Any new dataset comes with some problems, and this one poses some major challenges too. Some problems with this dataset are as follows:

  1. The training dataset mostly contains images in which the bird covers only about 10–20% of the whole image, whereas in the test images the bird covers 70–80% of the image. Sometimes the model fails to detect birds simply because the dataset contains so few bird images.
  2. In some classes, the bird covers less than 10% of the whole image, or the colours of the bird and its surroundings are very similar, e.g., cases where the birds are brown against a brown background. In those cases, the model fails to localize the birds due to occlusion or background-similarity problems. Some examples are as follows:
A few examples of images where Mask R-CNN fails to detect birds

Can you recognize the birds in the above cases??? 😯😯😯 Try cropping those out!

Future Work

We also tried Siamese networks, but they didn’t perform that well. In the future, we are planning to extend this work using a part-model-based approach with an NTM (Neural Turing Machine) and a Visual Attention Network.

References

[1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, “Rethinking the Inception Architecture for Computer Vision”, arXiv preprint arXiv:1512.00567.
[2] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv preprint arXiv:1602.07261.
[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, “Mask R-CNN”, arXiv preprint arXiv:1703.06870.

We were the WINNERS of this challenge. ✌️ If you found this post useful, kindly show your appreciation by giving a clap. 👏 👏 👏 If you have any doubts, please mention them in the comments section; I will be happy to help out.

