Watch out for Invading Species! A Kaggle Competition
Recently I participated in a Kaggle playground competition, Invasive Species Monitoring. This is only my second competition, and I am excited to share the experience. I was not alone on this venture: Kunal Sarkar, a friend of mine, joined me too, and Vinayak Raj gave us some useful insights as well.
Overview
The organizers of the competition wanted a way to classify plant species as invasive or non-invasive based on a collection of images taken from their natural habitat. According to an article published by the National Wildlife Federation, a species is termed invasive when it does not natively belong to a particular habitat but begins to expand rapidly in number and harm the native species. Wildlife experts see this as a serious threat to the ecosystem. Kudzu tangling trees in Georgia and cane toads invading natural habitats in large numbers are just a few of the examples worldwide.

Are invasive species truly threatening?
Yes! When a new species is introduced in an environment:
- they might not have predators
- they can breed and spread quickly
- they can out-compete the native species for food and other resources
- the native species might not have the ability to defend themselves against the invaders
Problem At Hand
Presently, species monitoring in a particular habitat depends heavily on wildlife experts who have been studying the region for years; only they know the native species of the land.
But relying on such expert knowledge alone across different landscapes can prove extremely costly, time-consuming and insufficient, as humans cannot manually sample large areas of land.
The organizers have thus reached out to the Kaggle community for a robust and acceptable solution. As the competition is categorized as ‘playground’, it does not award any ranking points or prizes.
Data
The data set contains images captured at various places in a Brazilian natural forest. Some of the pictures contain an invasive species, Hydrangea, which is native to Asia.
Evaluation Criteria
Submissions are evaluated on the area under the Receiver Operating Characteristic (ROC) curve between the predicted probability and the observed target.
In simple terms, for each image we have to predict the probability that it contains the invasive species.
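To make the metric concrete, here is a minimal NumPy sketch (not the official scorer) of ROC AUC via its rank interpretation: the probability that a randomly chosen positive image is scored higher than a randomly chosen negative one.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs where
    the positive image receives the higher score; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A score of 0.5 corresponds to random guessing, 1.0 to a perfect ranking.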
Data Exploration
Packages used
For this competition we used Python and its associated libraries:
- OpenCV : for image processing capabilities
- Numpy : for mathematical/matrix operations
- Matplotlib : for plotting and visualizing
- Keras : for building CNNs and using pre-trained networks (VGGNet, ResNet-50).
Data
A training set containing roughly 2,300 images was provided along with labels. Predictions have to be made on a test set containing approximately 1,500 images. The data can be downloaded from HERE.

The training set contained all the images (invasive and non-invasive species) together, so we had to separate them into TWO different directories.
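A sketch of how that split might be scripted, assuming the labels come in a CSV with `name` and `invasive` columns (the exact file layout on Kaggle may differ):

```python
import csv
import os
import shutil

def split_by_label(labels_csv, src_dir, dst_dir):
    """Move each training image into an invasive/ or non_invasive/
    subdirectory of dst_dir based on its label in the CSV."""
    for cls in ("invasive", "non_invasive"):
        os.makedirs(os.path.join(dst_dir, cls), exist_ok=True)
    with open(labels_csv) as f:
        for row in csv.DictReader(f):
            cls = "invasive" if row["invasive"] == "1" else "non_invasive"
            shutil.move(os.path.join(src_dir, row["name"] + ".jpg"),
                        os.path.join(dst_dir, cls, row["name"] + ".jpg"))
```

This directory layout is what Keras' `flow_from_directory` expects: one subdirectory per class.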
Decisions
- We decided to perform classification using neural networks, given the complexity of the images at hand.
- The training data provided was not sufficient to train a neural network from scratch, so we had to bump up the data set.
Data Augmentation
Augmenting the training data was necessary because the existing data was not sufficient to train a neural network. Moreover, augmentation lets the network see the images in a more generalized way, from different angles, thus reducing over-fitting.
We resorted to the following augmentation steps:
- Vertically flipping images (did not use horizontal flipping as it does not make sense)
- Rotating images at various angles (did not improve accuracy)
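The flips and rotations above can be sketched in NumPy (assuming images are H x W x 3 arrays):

```python
import numpy as np

def augment(image):
    """Return augmented variants of an H x W x 3 image array:
    the original plus a vertical flip; 90-degree rotations are
    included too, although they did not improve our score."""
    variants = [image, np.flipud(image)]                 # vertical flip
    variants += [np.rot90(image, k) for k in (1, 2, 3)]  # rotations
    return variants
```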
Image Preprocessing
Preprocessing is a fundamental prerequisite before feeding data to a neural network, and it is especially important when the data consists of images.
1. Contrast stretching
Some images in the training data were dull, so to enhance them we stretched the pixel values lying between the 1st and 99th percentiles across all three channels of each image.
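A sketch of this per-channel percentile stretch in NumPy (the 1st/99th cut-offs are as described; values outside them are clipped):

```python
import numpy as np

def stretch_contrast(image, low=1, high=99):
    """Per channel, map the low-th percentile to 0 and the high-th
    to 255, clipping values outside that range."""
    out = np.empty_like(image, dtype=np.uint8)
    for c in range(image.shape[2]):
        lo, hi = np.percentile(image[..., c], (low, high))
        ch = np.clip(image[..., c].astype(np.float64), lo, hi)
        scaled = (ch - lo) / max(hi - lo, 1e-6) * 255
        out[..., c] = np.clip(scaled, 0, 255).astype(np.uint8)
    return out
```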
2. Padding
The images in the training set were rectangular, but CNNs perform best when the input images have the same height and width. As a result, we decided to pad all the images into squares.
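A sketch of center padding to a square with NumPy (we pad with black, i.e. zeros; OpenCV's `cv2.copyMakeBorder` would work equally well):

```python
import numpy as np

def pad_to_square(image):
    """Pad an H x W x C image with black borders so that
    height == width, keeping the original content centered."""
    h, w = image.shape[:2]
    size = max(h, w)
    top = (size - h) // 2
    left = (size - w) // 2
    return np.pad(image,
                  ((top, size - h - top), (left, size - w - left), (0, 0)),
                  mode="constant")
```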

Note: Preprocessing steps were also applied to the test data before prediction.
Our Various Approaches
Here is a graph depicting the different scores achieved for the various attempts:

We made several attempts to get the best score possible. We initially built a CNN with 8 layers followed by two dropout layers and a dense layer.
- In the first attempt we hit a score of 0.76268, without performing augmentation or preprocessing.
- For the subsequent attempts, we varied the number of filters in each convolutional layer, the kernel size of the filters and the image size, hitting scores between 0.80 and 0.91.
- In one of our attempts, using the HSV color space, we hit a low score of 0.65.
- It was at this point that we decided to perform data augmentation and image preprocessing. Upon performing image flipping, contrast stretching and image padding with the same CNN architecture, we got a score as high as 0.95195.
- Finally, for our last few attempts we used VGGNet and a Residual Network (ResNet), with weights pre-trained on the ImageNet challenge. We hit our highest score of 0.9913.
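A minimal Keras sketch of this transfer-learning setup (the exact classification head we used may have differed; `weights="imagenet"` downloads the pre-trained weights):

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

def build_model(input_shape=(224, 224, 3), weights="imagenet"):
    """VGG16 as a frozen feature extractor, topped with a single
    sigmoid unit predicting the probability of 'invasive'."""
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    for layer in base.layers:
        layer.trainable = False  # keep the ImageNet features fixed
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)
    out = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the convolutional base means only the small head is trained, which suits the modest ~2,300-image training set.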
The competition expects each participant/team to nominate their two BEST submissions. We chose the last two:

Ending Note
Other relevant information including code for augmentation and preprocessing is available on THIS GitHub page.
I have also written about our take on another Kaggle competition, on how to reduce testing time for Mercedes Benz vehicles. Feel free to have a look!
Do you have an article you would like me to read? Share it!!!