Object detection on Satellite Imagery using RetinaNet (Part 1) — Training

Ijeoma
9 min read · Jan 14, 2020

This tutorial walks through the data loading, preprocessing and training steps of implementing an object detector using RetinaNet on satellite images.

Photo by Will B on Unsplash

Object detection is a subfield of computer vision and refers to the process of determining the class or category to which an identified object belongs and estimating the object's location by generating a bounding box around it. Deep Convolutional Neural Networks (CNNs) have been used extensively for object detection and have consistently achieved remarkable results. Thanks to the availability of large datasets, powerful computing resources, and constant innovation around network architectures, increasingly faster and more accurate models surface every day.

So what exactly is object detection used for? It is applicable to many domains across industries, including surveillance (tracking entities like people and vehicles, or identifying unattended baggage), medical diagnosis (detecting lung nodules or localizing lesions), and transportation (detecting objects on the road for autonomous driving). While these are more specialised use cases, the value of object detection more generally lies in the fact that information can be obtained faster, more accurately, and at a significantly lower cost than with conventional data collection methods.

The data used in this post is the Swimming Pool and Car Detection dataset from Kaggle. It consists of 3,750 satellite images of residential areas with annotation data for swimming pools and cars. Extracting visual features using convolutional neural networks on aerial imagery could be used together with more traditional housing features to make more accurate estimates of property value, for example. This would equally be valuable for local government agencies who otherwise are forced to rely on information obtained through expensive and sporadic surveys to assess property taxes.

Now that we know what object detectors are useful for, let’s jump right to the fun stuff.

This tutorial demonstrates an end-to-end deep learning workflow and is organised as follows:

  1. Some a priori choices
  2. Loading and preparing training data
  3. Training a deep neural network using RetinaNet
  4. Evaluation
  5. Using the model for inference (discussed in Part II of this post)

1. Some a priori choices

A few things to think about before starting to train: What deep learning environment can we use? What network architecture should we choose? Do we train from scratch or do we use pre-trained weights?

Environment: For this project I use Google Colab. Colaboratory is a research tool for machine learning education and research. It enables you to run interactive Jupyter notebooks in a GPU-backed deep learning environment, free of charge(!). Mind you, a notebook is automatically disconnected after a maximum of 12 hours (after which you can reconnect to a fresh session). It requires no setup and is great for developing deep learning applications using popular libraries such as Keras, TensorFlow, PyTorch, or OpenCV. I happened to get connected to a powerful Tesla P100 GPU with 16GB of memory, but you might equally get connected to others, most commonly K80 and T4 nodes. For more info on Google Colab see the Colab FAQs. Here's a great post for other deep learning environments you might want to try out.
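To check which GPU your session has been assigned, you can run nvidia-smi straight from a notebook cell:

!nvidia-smi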

Architecture: I opt for the state-of-the-art RetinaNet architecture (see image below), a one-stage object detector that is much faster than, yet just as accurate as (if not more accurate than), two-stage detectors. RetinaNet was developed out of the desire to match the accuracy of two-stage detectors, which previously had been the most accurate, with the speed of a one-stage detector. While one-stage detectors are typically faster and simpler, they also tend to perform worse in terms of accuracy.

One-shot RetinaNet network architecture: a multi-scale convolutional feature pyramid consisting of a feedforward ResNet architecture and Feature Pyramid Network (FPN) backbone. Source: Tsung-Yi Lin et al. (2017).

The reason is that the number of anchors containing positive classes (i.e. relevant objects) is much smaller than the number of anchors containing negative (or background) classes. And since one-stage detectors consider every anchor in the feature map grid as a region proposal (since we only look once), the classifier ends up being fed significantly more negative examples than positive ones. This biases learning towards background examples and results in suboptimal accuracy.

RetinaNet addresses this extreme foreground-background class imbalance by tweaking the loss function, the so-called focal loss, such that it discounts the loss assigned to well-classified examples. That way the detector is not overwhelmed by easy negatives (the background class) during training. As a result, RetinaNet is able to match the speed of previous one-stage detectors while even surpassing the accuracy of existing state-of-the-art two-stage detectors.
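For intuition, here is a minimal NumPy sketch of the focal loss for binary classification, using the paper's default settings (alpha = 0.25, gamma = 2); the actual keras-retinanet implementation differs in detail:

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # p: predicted foreground probability, y: 1 (object) or 0 (background)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class balancing weight
    # the (1 - p_t)**gamma factor shrinks the loss of well-classified examples,
    # so easy background anchors contribute almost nothing to training
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

Setting gamma = 0 recovers the ordinary (alpha-balanced) cross-entropy loss.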

(Here I just decide to go with this state-of-the-art architecture, but when working on real-life projects you might want to experiment with a number of different architectures to find the most suitable one for your particular task, given your various performance, infrastructure, cost, and time constraints.)

Transfer learning: I also decide to initialise the network with weights from a pre-trained model, meaning I reuse previously learned convolutions and then retrain the dense layers on our data. This approach is called transfer learning: a model that was created and trained for one task is used as the starting point for another task, either as an initialisation or as a fixed feature extractor. This means we require less data to achieve similar results, and the network generalizes better. In fact, it has been shown that initialising a network with transferred features, regardless of the number of layers, can boost model performance.
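As a generic illustration of the idea (this is not the keras-retinanet code, which we run later via train.py; the two-class head is made up for the example), freezing a pre-trained backbone in Keras looks roughly like this:

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense
from keras.models import Sequential

# load a backbone pre-trained on ImageNet, without its classification head
backbone = ResNet50(weights='imagenet', include_top=False, pooling='avg')
backbone.trainable = False  # keep the learned convolutions fixed

# only the new task-specific head is trained from scratch
model = Sequential([backbone, Dense(2, activation='softmax')])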

2. Loading and preparing training data

As already mentioned, using a Google Colab Jupyter notebook, which comes with everything pre-installed, means you can pretty much start working straight away. Just make sure you are using Keras version 2.3.1 for this tutorial.

To start off, I clone the GitHub repository of Fizyr's popular Keras implementation of RetinaNet.

import os

repo_url = 'https://github.com/fizyr/keras-retinanet'
repo_dir_path = os.path.abspath(os.path.join('.', os.path.basename(repo_url)))
# clone git repository
!git clone {repo_url}

The dataset is downloaded from Kaggle using the Kaggle API (once you've installed it), which is pretty straightforward (remember to insert your personal Kaggle username and Kaggle key):

# os is already imported above
os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME"
os.environ['KAGGLE_KEY'] = "YOUR_KAGGLE_KEY"
!kaggle datasets download kbhartiya83/swimming-pool-and-car-detection -p /content/data --unzip

Next, I prepare the data so that it can be fed directly into the training step. For this exercise, I first select a random sample of 800 images to quickly get an idea of how well the algorithm can be expected to do on this task. Feel free to use the full dataset, but bear in mind the computational and time implications.

# select subsample of N images for initial training
# (os is already imported above)
import random
from shutil import copyfile

# determine the number of images in the sample
NUM = 800

# create directory for the sample
base_dir = os.getcwd()
sub_dir = base_dir + '/images_subset/'
!mkdir {sub_dir}

image_dir = base_dir + '/data/training_data/training_data/images/'
image_paths = os.listdir(image_dir)

# randomly select the subsample
random_NUM = random.sample(image_paths, NUM)

# copy the subsample into the sample directory
for i in random_NUM:
    copyfile(image_dir + i, sub_dir + i)

I then proceed to split the dataset into train and test sets and generate the train.csv, test.csv and classes.csv files required for model training. I start by constructing an argument parser and creating variables from the arguments.

Argument parser
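The gist embed is not reproduced here, but given the flags passed to build_dataset.py further down (-l, -i, -r, -e, -c), the parser plausibly looks something like this sketch (the long-option names are my own):

import argparse

ap = argparse.ArgumentParser()
ap.add_argument('-l', '--labels', required=True, help='path to the XML annotation files')
ap.add_argument('-i', '--images', required=True, help='path to the input images')
ap.add_argument('-r', '--train', required=True, help='path for the output train.csv')
ap.add_argument('-e', '--test', required=True, help='path for the output test.csv')
ap.add_argument('-c', '--classes', required=True, help='path for the output classes.csv')
args = vars(ap.parse_args())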

The next bit of code defines our train and test sets by creating two respective lists of image paths:

Train and test split
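Again as a rough sketch, reusing args from the parser sketch above, an 80-20 split could look like this:

import os
import random

# list all image paths (images only) and shuffle them for a random split
image_paths = [os.path.join(args['images'], f)
               for f in os.listdir(args['images']) if f.endswith('.jpg')]
random.shuffle(image_paths)

# 80-20 train-test split
split = int(0.8 * len(image_paths))
train_paths = image_paths[:split]
test_paths = image_paths[split:]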

Now I can create the actual train and test sets by extracting the annotation data, i.e. each object's class and its bounding box coordinates, from the XML annotation files according to the train and test image path lists. There is one XML file per image, and each file can contain several object tags (one per annotated object).

Generate train, test and classes files for training
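keras-retinanet expects annotation CSV rows of the form path,x1,y1,x2,y2,class_name, and the labels here are PASCAL VOC-style XML (an assumption; adjust the tag names if yours differ). A sketch of the extraction step:

import csv
import os
import xml.etree.ElementTree as ET

def write_annotations(paths, csv_path, labels_dir, classes):
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for img_path in paths:
            xml_name = os.path.splitext(os.path.basename(img_path))[0] + '.xml'
            root = ET.parse(os.path.join(labels_dir, xml_name)).getroot()
            for obj in root.iter('object'):  # one <object> tag per annotation
                name = obj.find('name').text
                box = obj.find('bndbox')
                x1, y1, x2, y2 = (int(float(box.find(t).text))
                                  for t in ('xmin', 'ymin', 'xmax', 'ymax'))
                # keras-retinanet CSV row: path,x1,y1,x2,y2,class_name
                writer.writerow([img_path, x1, y1, x2, y2, name])
                classes.add(name)

Here classes is a set collecting the class names, which are written out to classes.csv (one name,id row per class) at the end.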

After combining the three preceding code snippets into a build_dataset.py file, it can be easily run as follows:

!python /content/ije_retinanet/build_dataset.py \
-l /content/data/training_data/training_data/labels/ \
-i /content/images_subset/ \
-r /content/images_subset/train.csv \
-e /content/images_subset/test.csv \
-c /content/images_subset/classes.csv

With 800 images and an 80–20 split, the output should look something like this:

[INFO] creating 'train' set ... 
[INFO] 640 total images in 'train'
[INFO] writing train annotations ...
[INFO] total 2827 annotations
[INFO] train.csv completed
[INFO] creating 'test' set ...
[INFO] 160 total images in 'test'
[INFO] writing test annotations ...
[INFO] total 604 annotations
[INFO] test.csv completed
[INFO] writing classes ...
[INFO] classes.csv completed
[FINAL] Task completed!

Lastly, I download a pre-trained model for transfer learning. I use a RetinaNet model with a ResNet50 backbone that has been pre-trained on the COCO dataset.

import urllib.request

# path where pre-trained weights should be saved
PRETRAINED_MODEL = '/content/keras-retinanet/snapshots/resnet50_coco_best_v2.1.0.h5'
# model url
URL_MODEL = 'https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5'
# retrieve model from model url and save
urllib.request.urlretrieve(URL_MODEL, PRETRAINED_MODEL)

3. Training a deep neural network using RetinaNet

Ready to train your model? If you’ve done the preceding prep work right, running keras-retinanet’s train.py file should successfully kick off training:

!python /content/keras-retinanet/keras_retinanet/bin/train.py \
--freeze-backbone \
--random-transform \
--weights {PRETRAINED_MODEL} \
--weighted-average \
--batch-size 32 \
--steps {no_steps} \
--epochs 30 \
csv '/content/images_subset/train.csv' '/content/images_subset/classes.csv'

A few things to note here:

  • freeze-backbone: freezes the backbone layers during training (useful for transfer learning).
  • random-transform: enables random transformations of images and annotations (for data augmentation).
  • weighted-average: computes the mean Average Precision (mAP), discussed later in this section, using the weighted average of precisions among classes.
  • weights: I use the previously retrieved PRETRAINED_MODEL to initialise the weights.
  • batch-size: you can experiment with the batch size based on your individual setup. Generally speaking, if training starts successfully you can try increasing it in multiples of 8.
  • epochs: the number of epochs you decide to train for.
  • steps: the number of steps per epoch. This needs to be calculated and depends on the batch size: the number of (training) annotations divided by the batch size (saved as no_steps, as computed below):
import pandas as pd
import math
df = pd.read_csv('/content/images_subset/train.csv', header=None)
count = len(df)
no_steps = math.ceil(count/32)
print("Count of annotations: {}".format(count))
print("Number of steps per epoch: {}".format(no_steps))
--------------------------------
Count of annotations: 2827
Number of steps per epoch: 89

This training process (640 images, 2827 annotations, batch size 32, 30 epochs) takes me just over an hour on Colab.

4. Evaluation

The previously generated test.csv file is used to evaluate the model's performance. A /snapshots/ folder is created in the directory from which RetinaNet's train.py is run, and the model's weights (the training model) are automatically saved there after each epoch. You can specify the most recent snapshot to compute the object detector's mAP (or use it as pre-trained weights to continue training if you wish to do so).

# model path
model_path = os.path.join('/content/snapshots', sorted(os.listdir('/content/snapshots'), reverse=True)[0])
# evaluate model using test.csv
!python /content/keras-retinanet/keras_retinanet/bin/evaluate.py \
csv '/content/images_subset/test.csv' \
'/content/images_subset/classes.csv' \
{model_path} --convert-model

Mean Average Precision (mAP) is a popular metric for measuring the accuracy of object detectors. It computes the area under the precision-recall curve (AUC) for each class. So how do we define precision and recall in the case of object detection? As in classification, precision = TP / (TP + FP) and recall = TP / (TP + FN), so we first need a way to count TP (true positives) and FP (false positives).

To decide whether a prediction should be counted as a true positive or a false positive, an Intersection over Union (IoU) threshold is chosen. IoU measures how much the predicted boundary overlaps with the ground truth (the given boundary coordinates): the area where the two boxes intersect divided by the area of their union. The Average Precision is typically averaged over all categories, which gives the 'mean Average Precision' (mAP). You can refer to this post for a more detailed explanation.
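As a minimal sketch, IoU for two axis-aligned boxes given as (x1, y1, x2, y2) tuples can be computed like this:

def iou(box_a, box_b):
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # union = sum of both areas minus the overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

A prediction whose IoU with a ground-truth box is at or above the threshold (here 0.5) counts as a true positive; otherwise it is a false positive.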

I end up with an mAP of 0.74 and a weighted-average mAP of 0.76 (at an IoU threshold of 0.5), a pretty decent result for this quick experiment trained on only a fairly small set of data.

138 instances of class 2 with average precision: 0.7031 
466 instances of class 1 with average precision: 0.7768
Inference time for 160 images: 0.0718
mAP using the weighted average of precisions among classes: 0.7599
mAP: 0.7399

Overall, the model’s performance is not bad at all given the size of the dataset used.

A good place to start for model improvement would be to look through the predicted images to understand the characteristics of the objects the detector consistently struggles to detect. For example, many false negatives for class 1 (cars) seem to be dark cars that cast large shadows. Instead of randomly selecting images, carefully including enough examples of dark cars casting large shadows in the training set (either by finding more or by generating synthetic examples) might improve accuracy. Since I have a whole lot more data available, however, there is a good chance I can already improve the detector's performance simply by training on a larger dataset.

Head over to Part II to see how to generate detections on previously unseen test images.
