Satellite images object detection —part 1: 95% accuracy in a few lines of code

Romain Guion

Published in

VorTECHsa

7 min read · Feb 4, 2020

Ship surveillance & tracking with TensorFlow 2.0 — part 1: the basic solution

The diversity of conditions and scales in satellite images makes object detection one step harder.

The question is not “if”, but “how” image-based insights should be consumed

Given the exponential growth in available imagery, in particular optical, infrared and SAR satellite images, business opportunities are growing faster than the number of data scientists who know how to handle them.

For companies used to consuming structured data, this means either:

Dedicating a number of their data scientists to develop image processing skills in-house, or
Consuming already processed feeds, for example Vortexa and its SDK or its datasets on AWS data exchange in the case of ship analytics.

Target audience for this post:

The data scientist tasked with developing image processing skills will be interested in the simple implementation of Part 1, as well as the reasoning behind the often unspoken architecture and optimisation choices covered in Parts 2 & 3.
The reader interested in consuming already processed feeds might want to look at the images and their captions in both posts, to get a feel for what’s involved in object detection and how interpretable its results are.

This is the first part of a three-post series:

Ship detection — Part 1: ship detection, i.e. binary prediction of whether there is at least 1 ship, or not. Part 1 is a simple solution showing great results in a few lines of code
Ship detection & early localisation — Part 2: towards object segmentation with (a) training fully convolutional NN through transfer learning, to build an encoder for a U-Net, and (b) first emerging localisation properties with CAM attention interpretability
Ship localisation — Part 3: identify where ships are within the image, building a U-Net from the encoder developed in Part 2

What you will learn in this post (Ship detection — Part 1):

Context: satellite imaging context & opportunities
On-the-fly preprocessing generators with TensorFlow
Learning rate selection technique
Simple ConvDenseNet (LeCun style) implementation, reaching 95% accuracy while having seen the data set only 2.5 times

Code available on github

Source: Kanjir et al, 2018, Remote Sensing of Environment, https://www.sciencedirect.com/science/article/pii/S0034425717306193

Context, motivation and opportunities.

With the growing number of satellites in orbit, daily pictures of most of the world are now widely available. Data sources include both free (USGS, Landviewer, ESA’s Copernicus etc) and commercial (Planet, Orbital Insight, Descartes Labs, URSA etc) entities. The competition is fierce and seems to yield very fast innovation, with companies such as Capella Space now pitching hourly updates by launching constellations of small satellites of only a few kg. These satellites increasingly cover more than the visible spectrum, notably with Synthetic Aperture Radar (SAR), which works independently of clouds and daylight. Along with the growth of other image sources, the ability to interpret image data is a key to untapped commercial opportunities.

Human activity is increasingly monitored, particularly at sea, across fishing, drilling, exploration, cargo and passenger transport, and tourism, for both governmental and commercial purposes. For ship tracking in particular, satellite images offer a rich complement to baseline cooperative tracking systems such as AIS, LRIT and VMS.

Feeding the beast: preprocessing steps

Label each image as ship / no-ship.
Split the data into mini-batches of 40 images. Batches larger than 1 leverage vectorisation and make computation faster than pure stochastic gradient descent (i.e. batch size = 1). Batches much smaller than the entire dataset (~20–500 images vs. 200k here) let the data fit in memory, and the extra noise tends to prevent premature convergence on local minima.
Reduce the image size from 768 x 768 x 3 to e.g. 256 x 256 x 3, to make training faster and reduce overfitting. With modern ConvNets (fully convolutional or with GAP/GMP), it is good practice to start training with small images and fine-tune with larger images later.
Data augmentation: flip horizontally and vertically initially. Extra distortions could be added later on, once the model starts overfitting, to squeeze out extra performance.
Renormalise magnitude from [0, 255] to [0, 1].

Here is how we conduct this preprocessing on the fly with Keras’ ImageDataGenerator class, with the labeling done with flow_from_dataframe, all feeding later on into the fit / fit_generator API:

Simple generator that loads images on the fly, labels, resizes and normalises them, and performs basic data augmentation
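A minimal sketch of such a generator is shown below. It assumes a pandas DataFrame with two hypothetical columns, `ImageId` (file name) and `has_ship` (a string label, "0" or "1"), plus a folder of images; the 10% validation split is also an illustrative assumption rather than a value taken from this post.

```python
import tensorflow as tf

BATCH_SIZE = 40          # mini-batch size quoted in the post
IMG_SIZE = (256, 256)    # down from the native 768 x 768


def make_generators(df, image_dir, img_size=IMG_SIZE, batch_size=BATCH_SIZE):
    """Return (train, validation) iterators yielding (images, labels) batches.

    `df` is assumed to be a pandas DataFrame with hypothetical columns
    'ImageId' (file name) and 'has_ship' (string label, '0' or '1').
    """
    datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1.0 / 255,     # [0, 255] -> [0, 1]
        horizontal_flip=True,  # random flips as basic augmentation
        vertical_flip=True,
        validation_split=0.1,  # assumption: hold out 10% for validation
    )
    common = dict(
        dataframe=df,
        directory=image_dir,
        x_col="ImageId",
        y_col="has_ship",
        target_size=img_size,  # images are resized as they are loaded
        class_mode="binary",   # one 0/1 label per image
        batch_size=batch_size,
    )
    # Note: sharing one ImageDataGenerator means validation images are
    # randomly flipped too, which is harmless for ship / no-ship labels.
    train = datagen.flow_from_dataframe(subset="training", shuffle=True, **common)
    val = datagen.flow_from_dataframe(subset="validation", shuffle=False, **common)
    return train, val
```

Both iterators can then be passed directly to `model.fit`, with `steps_per_epoch` controlling how many batches make up one (pseudo-)epoch.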

1st architecture — a simple ConvDense network

Starting simple and iterating:

Images are represented as 4 dimensional tensors: batch size, height, width and channels (RGB at the input, layers of features deeper in the network)
3 blocks of convolutional + max pooling to extract features from the image (with 3x3 sliding windows). As we go deeper into the network and work with increasingly abstract features, the image dimension is reduced (256 -> 128 -> 64 -> 32) while the number of channels is increased (3 -> 16 -> 32 -> 64)
1 dropout layer for regularisation (i.e. prevent over-fitting)
2 dense layers to make sense of all these extracted features, and combine the presence or absence of those to decide whether there is a ship on the image or not

To optimise this model we need:

loss: sparse_categorical_crossentropy if using softmax as the last activation or, as there are only two categories, binary_crossentropy with sigmoid as the last activation (this can also work with raw logits, with a twist, see Part 2)
optimiser: one choice is Adam, either with a tuned learning rate (image below) or with a cycled learning rate (to get out of local minima, more on this in Part 2). For this first run I ended up choosing a learning rate of 3E-4. The other parameters, which control the exponentially weighted averages of the gradient moments, use their default values (the first-order moment acts like momentum and is controlled by beta1; the second-order moment acts like RMSprop and is controlled by beta2)
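To make the role of the two moment parameters concrete, here is a minimal pure-Python sketch of the Adam update for a single scalar weight. The defaults shown (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-7) are, to my knowledge, the Keras defaults; the quadratic toy loss is purely illustrative and not part of the ship model.

```python
import math


def adam_step(w, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: momentum-like average
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: RMSprop-like average
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def minimise(grad_fn, w0, steps, **kwargs):
    """Run `steps` Adam updates starting from w0 and return the final weight."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        w, m, v = adam_step(w, grad_fn(w), m, v, t, **kwargs)
    return w


# Toy problem: loss(w) = w**2, so grad(w) = 2 * w and the minimum is at w = 0.
w_after_1 = minimise(lambda w: 2 * w, w0=1.0, steps=1)
w_after_1000 = minimise(lambda w: 2 * w, w0=1.0, steps=1000)
print(w_after_1, w_after_1000)
```

Because the update divides the averaged gradient by the square root of the averaged squared gradient, the very first step has a size close to the learning rate itself, whatever the scale of the gradient.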

[Left] Illustration of a neural network optimisation problem: the goal here is to find the minimum while being blindfolded. [Right] Exploring the learning rate landscape: we want to pick a learning rate where the loss decreases fastest, while staying away from divergence. Method: start at a very low LR, say 1E-7, and increase the LR slowly at each batch, up to a high LR, say 1E-2 or even 10. This is a very approximate method, as both weights and data are changing as the learning rate increases. However, it is computationally cheaper than running multiple simulations in parallel, so hopefully a reasonable stop-gap solution to pick a safe learning rate.

Here is the TensorFlow 2.0 implementation:

Implementation in TensorFlow, using the Keras API
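A minimal sketch of the network described above, using the Keras Sequential API. The three 3x3 convolution + max pooling blocks, the 16/32/64 channels, the dropout layer, the two dense layers, Adam at 3E-4 and the sigmoid / binary_crossentropy pairing come from the text; the "same" padding, ReLU activations, 128 hidden units, dropout rate of 0.5 and the position of the dropout layer are assumptions made for illustration.

```python
import tensorflow as tf

layers = tf.keras.layers


def build_conv_dense(input_shape=(256, 256, 3), dense_units=128, dropout_rate=0.5):
    """3 x (Conv 3x3 + MaxPool) -> Flatten -> Dropout -> Dense -> Dense(sigmoid)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),                      # 256 -> 128
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),                      # 128 -> 64
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),                      # 64 -> 32
        layers.Flatten(),                           # 32 * 32 * 64 features
        layers.Dropout(dropout_rate),               # regularisation
        layers.Dense(dense_units, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # probability of "ship"
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_conv_dense().summary()` prints the layer-by-layer output shapes and parameter counts.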

Keras’ architecture summary of the network described earlier, following a classic LeNet design (LeCun) of a series of convolutional and max pooling layers, followed by dense layers. Note that most of the weights are in the first dense layer, despite limiting the image size to 256 x 256 pixels. More recent architectures have moved away from this design and are now fully convolutional, avoiding this concentration of weights in a single layer which, all else being equal, tends to overfit and yield lower performance. See Part 2.

Here is the basic TensorFlow / Keras code to train the model, with the parameters used.
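The concentration of weights in the dense layer can be checked with a little arithmetic. The sketch below counts parameters for the layout described in this post, assuming 3x3 convolutions with biases, pooling that halves each spatial dimension, and a hypothetical first dense layer of 128 units (the actual width is not stated here).

```python
def conv_params(k, c_in, c_out):
    """Weights + biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights + biases of a fully connected layer."""
    return n_in * n_out + n_out


def count_params(size=256, channels=(3, 16, 32, 64), dense_units=128):
    """Parameter count per layer for the ConvDense layout described in the post."""
    counts = {}
    for i in range(1, len(channels)):
        counts[f"conv{i}"] = conv_params(3, channels[i - 1], channels[i])
        size //= 2  # each 2x2 max pooling halves height and width
    flat = size * size * channels[-1]  # 32 * 32 * 64 = 65,536 features
    counts["dense1"] = dense_params(flat, dense_units)
    counts["dense2"] = dense_params(dense_units, 1)
    return counts


counts = count_params()
total = sum(counts.values())
dense_share = counts["dense1"] / total
print(counts)
print(f"first dense layer holds {dense_share:.1%} of {total:,} parameters")
```

Under these assumptions the three convolutional layers together hold about 24k parameters, against more than 8 million for the first dense layer, which is why the flattened 32 x 32 x 64 feature map dominates the model size.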

Parameters used in basic TensorFlow code to train the model.

The x-axis is the number of pseudo-epochs, where one pseudo-epoch is 1000 batches of 40 pictures. Given 200k training samples and x4 data augmentation (flipping horizontally and vertically), it takes ~ (200k x 4) / (1k x 40) = 20 pseudo-epochs for the model to see all the images, i.e. 1 true epoch is ~ 20 pseudo-epochs.

By showing this simple neural network our training set about 2.5 times, we reached 94.8% cross-validation accuracy. A naive classifier that always predicts the majority class would reach 77.5%, given the class imbalance.
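The same arithmetic as a small helper, using the figures quoted above (the function name is only illustrative):

```python
def pseudo_epochs_per_true_epoch(n_images, augmentation_factor,
                                 batch_size, batches_per_pseudo_epoch):
    """How many pseudo-epochs it takes to see every (augmented) image once."""
    images_per_pseudo_epoch = batch_size * batches_per_pseudo_epoch
    return n_images * augmentation_factor / images_per_pseudo_epoch


ratio = pseudo_epochs_per_true_epoch(
    n_images=200_000,               # training images
    augmentation_factor=4,          # original, horizontal flip, vertical flip, both
    batch_size=40,
    batches_per_pseudo_epoch=1_000,
)
print(ratio)  # 20.0
```

In Keras terms, a pseudo-epoch corresponds to calling `fit` with `steps_per_epoch=1000` rather than letting an epoch run through the whole dataset.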
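For reference, a naive baseline of this kind can be computed as below. The label list is a toy stand-in matching the 77.5% figure, and which class is the majority is an assumption here (the post does not say).

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels: assumes 'no-ship' is the majority class, at 77.5% of images.
toy_labels = ["no-ship"] * 775 + ["ship"] * 225
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)  # no-ship 0.775
```

Any model worth keeping has to beat this number, which is why raw accuracy alone can be misleading on imbalanced datasets.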

Note that the model is still learning, and that we aren’t yet observing signs of overfitting. This is by design, due to the relatively small size of the model compared to the dataset size (200k + 600k augmentation), and the dropout layer.

Sample predictions from this simple ConvDense model.

Examples of misclassification are shown below. Some cases are hard to resolve even for the human eye, while a minority still look like easy wins. It is also interesting that a fraction of the labels seem wrong: another reason to avoid overfitting.

Misclassifications tend to involve small vessels at the edge of the image, as well as potential mislabelling (e.g. the image in row 1, column 3).

Improving this model

A tested improvement to this model is to add batch normalisation between weights and activation: it allows the model to reach 93% accuracy in only 10 pseudo-epochs, compared to 25 pseudo-epochs without. The plateau seems the same, close to 95% (only trained for 40 pseudo-epochs).

Further improvements could include giving the model more capacity, as it may currently underfit slightly, and exploiting higher-resolution images.

An issue with this architecture is that the input image size is fixed. This can be solved with:
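"Between weights and activation" means the convolution is applied without its own activation, then normalised, then passed through the non-linearity. A minimal Keras sketch of one such block follows; the ReLU activation, "same" padding and the choice to drop the convolution bias are assumptions for illustration.

```python
import tensorflow as tf

layers = tf.keras.layers


def conv_bn_block(filters):
    """Conv -> BatchNorm -> ReLU -> MaxPool, returned as a list of layers."""
    return [
        # No activation here, and no bias: BatchNormalization adds its own
        # learned offset (beta), which makes the convolution bias redundant.
        layers.Conv2D(filters, 3, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D(),
    ]


block = tf.keras.Sequential(
    [tf.keras.Input(shape=(256, 256, 3))] + conv_bn_block(16)
)
```

Stacking `conv_bn_block(16) + conv_bn_block(32) + conv_bn_block(64)` would give a batch-normalised version of the three feature extraction blocks.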

Global Average / Max Pooling layers (GAP / GMP) at the end, just before one or more dense layers, for a global aggregation over the spatial dimensions (batch size, height, width, channels) -> (batch size, 1, 1, channels), irrespective of image size.
Fully Convolutional Network (FCN), using a similar idea but generalising it with a learned spatial operation through a 1x1 depth filter.
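As a small illustration of the first option, here is a sketch of a toy network with a Global Average Pooling layer that accepts images of different sizes. The layer sizes are arbitrary and this is not the Part 2 architecture. Note that Keras' `GlobalAveragePooling2D` returns (batch size, channels) by default, dropping the two singleton spatial dimensions.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),    # height and width left unspecified
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),           # (batch, H, W, 32) -> (batch, 32)
    layers.Dense(1, activation="sigmoid"),
])

# The same weights handle both the reduced and the native resolution.
small = model(np.zeros((1, 256, 256, 3), dtype="float32"))
large = model(np.zeros((1, 768, 768, 3), dtype="float32"))
print(small.shape, large.shape)
```

Because the dense layer only ever sees one value per channel, its size no longer depends on the image resolution, which is what makes fine-tuning on larger images possible.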

In the second part of this post, we explore the former architecture, along with a deeper network. Click here to read part 2!