Automated Identification of Indonesian Rhinos

A proposal for the development and training of an automated rhino identification system from limited camera trap data.

By Cooper Oelrichs


Indonesia’s two rhino species, the Sumatran and Javan Rhino, are both critically endangered. It is estimated that a total of 68 Javan Rhino and 80 Sumatran Rhino remain in the world.

This article proposes expanding the camera trap programs to collect data to study the remaining Indonesian rhino populations and then developing an artificial intelligence system to automatically identify the individual rhino in the camera trap images.

This article will detail the design of the artificial intelligence system.


There are ongoing efforts to conserve these species, but understanding of the remaining populations is limited. This is largely due to the difficulty of studying these rare animals in dense Indonesian rainforest. The remaining populations are so small that the most effective approach to conserving them is to monitor and manage individual rhino. This will allow, for example, isolated rhino to be translocated to possible mates and poaching to be detected and responded to when an individual rhino goes missing.

Camera traps have been used successfully to study the Javan Rhino in Ujung Kulon National Park and have been used in limited numbers on Sumatran Rhino in Mt Leuser and Way Kambas National Parks. However, identifying individual rhino from camera trap images is a time-consuming process that can only be performed by staff experienced in recognising rhino. This limits the volume of camera trap images that can be processed and makes it impractical to use camera trap images to identify issues that require rapid responses.

See this post for a more detailed description of the problem and the proposed solution.

Data Collection

Why Camera Traps

Many options exist for studying wild populations of animals, including transects, GPS tags or collars, camera traps, and satellite or aerial imagery. Transects are an intensive manual process that is expensive to perform over a multiyear project. GPS tags provide excellent data on individual behaviour but are invasive, as they require rhinos to be captured for collaring or tagging. Satellite and aerial imagery cannot penetrate the dense rainforest canopy.

Camera traps are proposed for this project because they work successfully in dense rainforest and produce a dataset with enough detail to identify individuals while being minimally invasive. Camera traps have been used successfully in combination with machine learning models in several recent conservation projects. One notable example is the Snapshot Serengeti project, which deployed 225 camera traps across 1,125 km² in Serengeti National Park, Tanzania, and collected 1.2 million sets of images by 2013. This data was used to train a deep neural network that classifies species in images with 88.9% top-1 accuracy.

Critically Endangered Species and Limited Data Availability

The limited size of the critically endangered Javan and Sumatran populations makes collecting sufficient data to train an image recognition system challenging. There is an inherent limit on the number of individuals the dataset can contain, simply due to the size of the populations. In addition, the challenging environment and the low probability of encountering a member of such a small population limit the total number of images that can be collected.

This issue is addressed in the design of the image classification system.

The Artificial Intelligence System

Manually Identifying Individuals in Images

Humans are able to identify individual animals from images. This paper from 1996 details a manual method of identifying Black Rhino from images.

Machine Learning and Individual Identification

In recent years significant progress has been made on developing machine learning systems for identifying individual animals from images. Some significant examples are presented below.

  1. Face Recognition: Primates in the Wild. This paper discusses the development of an identification model called PrimNet. Three versions of the model were trained on 3,000 Lemur images, 1,450 Golden Monkey images, and 5,559 Chimpanzee images, and achieved rank-1 open-set accuracies of 82%, 66%, and 37% respectively. The model’s pipeline consists of face alignment and detection followed by a convolutional neural network classifier.
  2. Wildbook. Wildbook is a generalised AI system for identifying individual animals from images. Wildbook has used it internally to develop systems for identifying cetaceans (including humpback whales, sperm whales, and bottlenose dolphins), giraffe, whale sharks, manta rays, and sea turtles, and other organisations have applied it to many other species. The model’s pipeline has two steps. Detection uses a cascade of deep convolutional neural networks that perform whole-scene species classification, object bounding box localisation, and a final species classification for each candidate bounding box. Identification then classifies the bounding boxes using extracted SIFT descriptors.
  3. Chimpanzee faces in the wild. This paper discusses a series of model experiments on two datasets: the C-Zoo dataset with 2,109 images and the C-Tai dataset with 4,377 usable images (the complete dataset is larger). The most accurate experimental model for each dataset achieved a class-wise average recognition rate of 92% and 77% respectively. The model’s pipeline starts with images that have been cropped to face regions, which are fed into a classification model. The classification model consists of an SVM trained on the outputs of a number of layers from a pre-trained convolutional neural network (VGGFaces or BVLC AlexNet). In some experiments the pre-trained network was fine-tuned at a low learning rate.
  4. Towards Automated Visual Monitoring of Individual Gorillas in the Wild. This paper discusses the development of an identification model that is trained on a dataset of 2,500 images of gorillas and achieves an accuracy of 62%. The model’s pipeline receives field imagery, after which face detection is performed using a fine-tuned You Only Look Once (YOLO) model. Each candidate region is then processed by the lower layers of the BVLC AlexNet Model for feature extraction (without fine-tuning), and extracted features are then classified by a linear SVM.
  5. Towards Automatic Identification of Elephants in the Wild. This paper follows a similar approach to the gorilla paper above. An identification model is trained on a dataset of 2,078 images of elephants and achieves an accuracy of 74%. The model’s pipeline receives input imagery, after which face detection is performed using a fine-tuned You Only Look Once (YOLO) model. Each candidate region is then processed by the lower layers of the ResNet50 Model for feature extraction (without fine-tuning), and extracted features are then classified by an SVM.
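
The figures quoted above are reported under different metrics (rank-1 accuracy, class-wise average recognition rate, and plain accuracy), so they are not directly comparable. As a rough illustration of why the choice matters when some individuals are photographed far more often than others, the sketch below computes top-1 accuracy and a class-wise average recognition rate, taken here to mean the mean of per-individual recall. The function names and the labels are hypothetical, and the open-set setting used in the primate paper is not modelled.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of images whose predicted individual matches the label."""
    if not y_true:
        raise ValueError("no samples")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def classwise_average_recognition_rate(y_true, y_pred):
    """Mean of per-individual recall, so every individual counts equally."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: rhino "A" appears in 4 images, rhino "B" in only 1,
# and the model predicts "A" for every image.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]

print(top1_accuracy(y_true, y_pred))
print(classwise_average_recognition_rate(y_true, y_pred))
```

In this made-up case the model never recognises rhino "B", yet top-1 accuracy still looks high because "A" dominates the images. The class-wise figure exposes the failure, which is relevant for camera trap data where sighting counts per individual are likely to be uneven.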

The Proposed Model

Based on the research above, an artificial intelligence system for automatically identifying Indonesian rhino from camera trap images is proposed. The proposed approach, which is described below, is similar to the ones used in the “Towards Automated Visual Monitoring of Individual Gorillas in the Wild” and “Towards Automatic Identification of Elephants in the Wild” papers. As data is collected, the modelling pipeline will be tested and adjusted, and alternative approaches will be evaluated based on new and existing research.

The proposed pipeline:

  1. Pre-processing. Camera trap images will be pre-processed as required. For example, image scaling and normalisation may be performed, and the dataset may be augmented with generated images.
  2. Object detection. Object detection will be used to crop images to rhino faces or bodies, depending on which achieves the best results. This could be performed using a YOLO model (a unified, real-time object detection model that combines object classification and bounding box prediction) or some other object detection model.
  3. Feature extraction. Cropped images will be input into a pre-trained computer vision neural network for feature extraction. As with the “Towards Automated Visual Monitoring of Individual Gorillas in the Wild” and “Towards Automatic Identification of Elephants in the Wild” papers, this will most likely be a deep convolutional network that has been trained on the ImageNet dataset and has had a number of its later hidden layers and its output layer removed.
  4. Classification. The output of the feature extraction step will be input into a shallow model that has been trained to classify individual rhino. This could be a neural network, an SVM, or another type of model. This stage will output class probabilities.
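
The four stages above can be sketched as a chain of interchangeable functions. This is a minimal illustration of how data might flow between stages, not the proposed implementation: the class name RhinoIdPipeline, the nested-list image format, and all four stand-in stages are assumptions made so the example runs with the Python standard library alone. In practice the detector, feature extractor, and classifier would be trained models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class RhinoIdPipeline:
    """Hypothetical container wiring the four proposed stages together."""
    preprocess: Callable[[Image], Image]
    detect: Callable[[Image], List[Box]]
    extract: Callable[[Image], List[float]]
    classify: Callable[[List[float]], Dict[str, float]]

    def run(self, image: Image) -> List[Dict[str, float]]:
        """Return one {individual: probability} dict per detected region."""
        image = self.preprocess(image)
        results = []
        for top, left, bottom, right in self.detect(image):
            crop = [row[left:right] for row in image[top:bottom]]
            results.append(self.classify(self.extract(crop)))
        return results


# Toy stand-ins (assumptions) so the sketch runs end to end.
def scale_to_unit(image: Image) -> Image:
    """Pre-processing: scale 0..255 pixel values to 0..1."""
    return [[px / 255.0 for px in row] for row in image]


def whole_image_box(image: Image) -> List[Box]:
    """Detection placeholder: treat the whole frame as one detection."""
    return [(0, 0, len(image), len(image[0]))]


def mean_brightness(crop: Image) -> List[float]:
    """Feature extraction placeholder: a single mean-brightness feature."""
    pixels = [px for row in crop for px in row]
    return [sum(pixels) / len(pixels)]


def brightness_classifier(features: List[float]) -> Dict[str, float]:
    """Classification placeholder: two made-up individuals."""
    p = features[0]
    return {"rhino_A": p, "rhino_B": 1.0 - p}


pipeline = RhinoIdPipeline(
    scale_to_unit, whole_image_box, mean_brightness, brightness_classifier
)
predictions = pipeline.run([[255, 255], [0, 0]])
print(predictions)
```

Keeping each stage behind a plain function signature means a stage can be swapped, for example trying face crops versus body crops in detection, without touching the others.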

Leveraging a Pre-trained Model

One of the key issues that this project will face is the limited availability of image data on which a model can be trained. The ImageNet dataset currently contains over 14 million images, whereas the rhino identification model will have to be trained using only thousands of images. Based on the success of the models used in the papers described above, this proposal will use a neural network that has been trained on millions of images to extract features on which a much simpler individual classifier can be trained.
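
To make the idea of a "much simpler" classifier concrete, the sketch below trains on fixed feature vectors of the kind a frozen pre-trained network would produce. It is an assumption-laden illustration: a nearest-centroid classifier stands in for the SVM or shallow network discussed above, the two-dimensional feature vectors are invented, and the function names are hypothetical. Class probabilities are obtained with a softmax over negative distances to each individual’s centroid.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Average the feature vectors belonging to each individual."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict_proba(centroids, vec):
    """Softmax over negative Euclidean distances to each centroid."""
    scores = {label: -math.dist(vec, c) for label, c in centroids.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Invented 2-D features; real extracted features would have hundreds or
# thousands of dimensions.
features = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-1.2, 0.0]]
labels = ["rhino_A", "rhino_A", "rhino_B", "rhino_B"]

centroids = fit_centroids(features, labels)
probs = predict_proba(centroids, [0.8, 0.0])
print(probs)
```

Because only the centroids are learned, a classifier like this has very few parameters to fit, which is the property that makes training on thousands rather than millions of images plausible. Whether an SVM, a small neural network, or something else works best would need to be tested on real data.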