Leading up to the holidays, we looked back at the deep learning and computer vision literature from 2018. As a team we constantly review new innovations in deep learning, and we are seeing exciting opportunities to apply them to our global, daily imagery of the Earth. In this post we highlight five recent papers that we think hold promise to power the next generation of geospatial applications based on satellite imagery.
“Functional Map of the World” — Gordon Christie, Neil Fendley, James Wilson, Ryan Mukherjee (Paper, Dataset, Code)
Problem addressed: Using time-based sequences of multispectral satellite imagery and associated metadata, can we determine building function and land use?
New contributions: In this paper the authors demonstrate a model that reasons jointly over temporal sequences of images (at Planet we call them “stacks”) and their associated metadata, e.g. UTM zone (i.e. location), ground sample distance (aka resolution; unit: pixel side length in meters), sun angle, off-nadir angle (the angle of the satellite to the ground) and bounding box-to-image size ratios. They also publish the fMoW dataset, which comprises over 1 million images from over 200 countries, with bounding box labels for 63 categories, such as ‘crop field’, ‘airport’, and ‘gas station’.
The authors try several supervised learning approaches on the fMoW dataset, including a CNN architecture with a DenseNet-161 backbone that averages its per-image predictions across each temporal sequence, as well as LSTM-based models. Metadata parameters are fused with the CNN outputs and fed to the fully connected layers. Unlike typical object detection datasets, the model is expected to output only a class label, not a bounding box as well.
The authors found that adding metadata gave a meaningful improvement in classification accuracy, which is an exciting finding. They also found that enlarging bounding boxes with a buffer improved performance for small objects, e.g. single-unit residential buildings and fountains. This makes intuitive sense: giving the model more context from the surrounding area helps it distinguish small objects under varying conditions.
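To make the fusion and temporal averaging steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the feature sizes, the random weights, and the function names `fuse_and_classify` and `predict_sequence` are all hypothetical, and the CNN backbone is replaced by random feature vectors.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(cnn_features, metadata, w_hidden, b_hidden, w_out, b_out):
    """Concatenate CNN features with metadata, then apply two fully connected layers."""
    fused = np.concatenate([cnn_features, metadata], axis=-1)
    hidden = np.maximum(0.0, fused @ w_hidden + b_hidden)  # ReLU activation
    return softmax(hidden @ w_out + b_out)

def predict_sequence(per_image_probs):
    """Average per-image class probabilities over a temporal stack, then pick the top class."""
    mean_probs = per_image_probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

# Hypothetical sizes: 4 images in the stack, 16 CNN features, 5 metadata values,
# 8 hidden units, 3 classes. Random values stand in for a trained network.
rng = np.random.default_rng(0)
T, F, M, H, C = 4, 16, 5, 8, 3
cnn_features = rng.normal(size=(T, F))   # stand-in for backbone outputs
metadata = rng.normal(size=(T, M))       # stand-in for normalized GSD, sun angle, etc.
w_hidden, b_hidden = rng.normal(size=(F + M, H)) * 0.1, np.zeros(H)
w_out, b_out = rng.normal(size=(H, C)) * 0.1, np.zeros(C)

probs = fuse_and_classify(cnn_features, metadata, w_hidden, b_hidden, w_out, b_out)
label, mean_probs = predict_sequence(probs)
print(probs.shape, label)
```

The point of the sketch is the data flow: metadata enters as extra input dimensions to the fully connected layers, so the classifier can weigh it alongside the image features, and a single label per stack comes from averaging the per-image predictions.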
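One simple way to add such a buffer is sketched below. The function name `expand_box`, the proportional padding rule, and the `context_ratio` parameter are assumptions for illustration; the paper's exact buffering rule is not reproduced here.

```python
def expand_box(box, image_size, context_ratio=0.25):
    """Grow a box by a fraction of its size on every side, clipped to the image.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    pad_x = (x_max - x_min) * context_ratio
    pad_y = (y_max - y_min) * context_ratio
    return (
        max(0.0, x_min - pad_x),
        max(0.0, y_min - pad_y),
        min(float(width), x_max + pad_x),
        min(float(height), y_max + pad_y),
    )

# A 20x20 pixel object gains 10 pixels of context on each side.
print(expand_box((10, 10, 30, 30), (100, 100), context_ratio=0.5))
```

Clipping to the image bounds matters for objects near the edge of a tile, where the requested context would otherwise fall outside the available pixels.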
Relevance to Geospatial: Satellite imagery comes with a rich set of metadata. The key takeaway for us was that a model fed both imagery & metadata can jointly reason on both, yielding a performance boost vs the imagery alone. It’s fun to hypothesize what the model might be learning e.g. “a factory/powerplant is more likely in this part of the world”.
Problem addressed: Spatial pooling is a foundational operation used consistently by many of the well known architectures, e.g. ResNet-101, VGG16, GoogLeNet and DeepLabV3. It reduces the number of parameters, improves invariance to certain distortions, and increases the receptive field size. It is, however, an inherently lossy process: a downsampling step that discards spatial information and hence reduces a network’s discriminability.
New contributions: This paper from CVPR 2018 demonstrates a learnable pooling layer called ‘Detail Preserving Pooling’ (DPP). For every feature map, it learns the shape of a nonlinear function between Max pooling and Average pooling. Max pooling preserves details better than Average pooling, and was found to yield better performance for features with low activation probability. However, Max pooling distorts the image, and the result is visually less plausible than for Average pooling (see figure). DPP therefore lets a network learn how much detail to preserve, with only a minimal increase in the number of parameters and in computation time.
Relevance to Geospatial: In satellite imagery analytics, where the sensor is located in space (!), detecting small objects is as essential as it is challenging. For small objects, performance tends to degrade due to inconsistent feature resolution in the downstream layers of a CNN. Learning when and where to preserve details through the sequence of layers in a network may lead to improved object detection and segmentation performance.
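The idea of a pooling operator that sits between Average and Max pooling can be illustrated with a heavily simplified NumPy sketch. This is not the layer from the paper: the function names are hypothetical, the window mean is used as the reference for what counts as "detail", and `alpha` and `lam` are fixed arguments here, whereas in a trainable layer they would be learned per feature map.

```python
import numpy as np

def windows_2x2(feature_map):
    """Split an (H, W) map with even H and W into non-overlapping 2x2 windows: (H/2, W/2, 4)."""
    h, w = feature_map.shape
    return (feature_map.reshape(h // 2, 2, w // 2, 2)
                       .transpose(0, 2, 1, 3)
                       .reshape(h // 2, w // 2, 4))

def detail_weighted_pool(feature_map, alpha=0.0, lam=1.0, eps=1e-3):
    """Weighted average per window; pixels above the window mean get larger weights.

    lam = 0 gives plain average pooling; a large lam approaches max pooling.
    """
    win = windows_2x2(feature_map)
    mean = win.mean(axis=-1, keepdims=True)
    detail = np.maximum(0.0, win - mean)            # only positive deviations count
    weights = alpha + np.sqrt(detail**2 + eps**2) ** lam
    return (weights * win).sum(axis=-1) / weights.sum(axis=-1)

fm = np.array([[1., 2., 0., 0.],
               [3., 9., 0., 8.],
               [5., 5., 1., 1.],
               [5., 5., 1., 7.]])

avg_pooled = windows_2x2(fm).mean(axis=-1)
max_pooled = windows_2x2(fm).max(axis=-1)
print(detail_weighted_pool(fm, lam=0.0))  # matches average pooling
print(detail_weighted_pool(fm, lam=1.0))  # somewhere in between
print(detail_weighted_pool(fm, lam=8.0))  # close to max pooling
```

Because the output is always a weighted average of the window with non-negative weights that grow with pixel value, it stays between the Average-pooled and Max-pooled results, and a single exponent controls where in that range it lands.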
“Perceptual Generative Adversarial Networks for Small Object Detection” — Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, Shuicheng Yan (Paper)
Problem addressed: CNN-based object detection models perform worse on smaller objects.
New contributions: A Generative Adversarial Network (GAN) is embedded within a CNN object detection model architecture, where it narrows the difference between the internal representations of small objects and large ones. Specifically, the conditional GAN’s generator network learns to transform the poor representations of small objects into super-resolved ones that are similar enough to real large objects to fool a competing discriminator. Meanwhile, the discriminator competes with the generator by trying to identify the generated representations. The discriminator contains a second branch, which the authors term ‘perceptual loss’, that calculates classification and bounding box accuracy for the generator’s output. This requires that the super-resolved features also improve detection accuracy.
Relevance to Geospatial: As with DPP, small object performance is essential in satellite imagery analytics. Often in object detection model architectures, ‘scale’ data augmentation is used to brute-force learn representations of objects at multiple scales. This is expensive in terms of both compute and time, and cannot cover all possible object sizes. Perceptual GANs offer a more efficient, learnable solution to enforce scale invariance.
Though this paper was originally published in 2017, we couldn’t help including it. It was too interesting to leave out!
“Rethinking the Faster R-CNN Architecture for Temporal Action Localization” — Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar (Paper)
Problem addressed: In the field of CNN-based ‘video action classification’ (VAC), the time window over which 3D convolutional models learn — the ‘memory’ — is inflexible in duration.
New contributions: This paper proposes dilated, 1D temporal region proposals to tackle this problem.
Before we get to that, first some background on CNN-based video modeling. If you extract CNN features from each frame of a video independently, then pool and classify them, you cannot distinguish between e.g. opening versus closing a door, because pooling discards the order of the frames. State of the art models for VAC do not use LSTMs, but ‘3D convolutions’: an adaptation of 2D convolutions that convolves over the ‘3D space’ of a video’s image sequence (two spatial dimensions plus time). In fact, the pioneering paper on 3D convolutions (Tran et al. 2014) did not get state of the art results out of the gate. Carreira & Zisserman made a breakthrough in 2017 when they added a pre-computed optical flow imagery stream alongside the RGB frames, fusing together the 3D CNN features from each. [Footnote: optical flow in high temporal cadence medium-resolution satellite imagery is new territory and will require investigation!]
This is still some way from practical utility for object detection in satellite imagery. For starters, all of these papers address classification rather than detection. Secondly, they use temporally trimmed clips of video where an action, e.g. a person sitting in a chair, coincides with the start and end of the video. In expansive, daily satellite imagery one doesn’t have that luxury! To tackle this, Xu et al. 2017 proposed ‘action segmentation’, where they adapted the concept of ‘region proposal’ from the Faster R-CNN architecture to ‘temporal proposal’, using ‘1D temporal convolutions’. Temporal proposals provide the start and end point of an action within a longer, untrimmed video.
The paper we selected here builds on all of this (3D convolutions, fused optical-flow features, 1D temporal convolutions) and adds ‘dilated’ temporal convolutions, which allow variable and longer-term memory.
Relevance to Geospatial: This model design has promise to help detect e.g. a ‘construction project’ in satellite imagery, i.e. a series of events starting with ground clearing/leveling, then building the foundations, then adding the roof, which can occur anywhere in the world and vary in duration from months to years. It’s exciting to imagine a 3D convolutional, dilated temporal region proposal network, convolving through both space and time to find objects like this!
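The door example can be reproduced with a toy NumPy sketch. The single per-frame feature (a door angle in degrees) and the function names are hypothetical, and a 1D temporal difference stands in for the temporal part of a 3D convolution.

```python
import numpy as np

def mean_pool_over_time(frame_features):
    """Average per-frame features over the time axis; the result ignores frame order."""
    return frame_features.mean(axis=0)

def temporal_difference(frame_features):
    """Apply a 1D temporal kernel [-1, +1] between neighboring frames, then average."""
    return (frame_features[1:] - frame_features[:-1]).mean(axis=0)

# Hypothetical per-frame feature: how far the door is open, in degrees.
opening = np.array([[0.0], [30.0], [60.0], [90.0]])
closing = opening[::-1]  # same frames, reversed order

print(mean_pool_over_time(opening), mean_pool_over_time(closing))  # identical
print(temporal_difference(opening), temporal_difference(closing))  # opposite signs
```

Pooling over independently extracted frame features gives the same vector for both clips, while even a two-tap convolution along the time axis separates them by sign.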
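To see why dilation lengthens the ‘memory’, the sketch below implements a plain dilated 1D convolution in NumPy and computes the receptive field of a stack of such layers. The function names and the example kernel are assumptions for illustration, not code from the paper.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D correlation with `dilation` time steps between neighboring kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # time steps covered by one kernel placement
    out_len = len(signal) - span + 1
    return np.array([
        sum(kernel[j] * signal[t + j * dilation] for j in range(k))
        for t in range(out_len)
    ])

def receptive_field(kernel_size, dilations):
    """Input time steps seen by one output of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = np.arange(10.0)
out = dilated_conv1d(signal, kernel=[1.0, 0.0, -1.0], dilation=2)
print(out)                                 # each output compares frames 4 steps apart

print(receptive_field(3, [1, 1, 1]))       # three ordinary layers
print(receptive_field(3, [1, 2, 4]))       # three dilated layers, same parameter count
```

With the same number of weights, the dilated stack sees roughly twice as many time steps as the undilated one, which is what makes longer and more variable event durations tractable.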
Problem addressed: Supervised, deep learning-based object detection models require large amounts of training data. Collecting accurate bounding box labels from human annotators is an expensive and time-consuming process.
New contributions: This paper poses the coordination of bounding box annotation by humans as a learning problem. Specifically, for a given image with an initial set of object proposals from a bootstrapped detector [and access to a crowd of human annotators], they design a model to suggest in what sequence (or ‘dialog’) to send the bounding box proposals to human annotators for ‘yes/no verification’ and/or ‘manual hand-drawing’ to minimize the total time taken.
Typically one would use a fixed dialog, e.g. “Yes/no verify the first 5 boxes. If all 5 are rejected, send the image for manual drawing”. However, a fixed dialog will not always suit a particular image: large objects on homogeneous backgrounds may require fewer verification steps (i.e. difficulty varies), while a weak bootstrapped detector may require more. In these scenarios annotation time and expense are unnecessarily increased.
The authors test two modeling strategies, which take in a set of input features including e.g. the bounding box confidence score, its size relative to the image, and the average confidence score of all box proposals for the target class:
- IAD-Prob: a neural network classifier, which predicts the acceptance of each bootstrapped bounding box proposal for an image, from which a dialog is then derived
- IAD-RL: a reinforcement learning approach — deep Q-learning is used to learn an approximate optimal policy for dialog selection, from interactions with a simulated environment.
The authors’ experiments show that the IAD agent-based approaches do better than fixed strategies, and that the two perform approximately as well as each other.
Relevance to Geospatial: Previously at Planet, we’ve explored multistep data annotation to collect training data on objects in disaster regions. When collecting over diverse geographies, annotation tasks can vary widely in difficulty. An approach like this shows promise for reducing the expense of creating labels of consistent quality, especially on large, global datasets.
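The trade-off between verifying and drawing can be made concrete with a small sketch in the spirit of IAD-Prob: given predicted acceptance probabilities, pick the dialog with the lowest expected time. Everything here is an assumption for illustration, including the function names, the timings of 2 seconds per verification and 10 seconds per manual drawing, the independence of the boxes, and the restriction to dialogs of the form “verify the first k boxes, then draw”. It is not the paper's procedure.

```python
def expected_dialog_time(accept_probs, num_verifications, t_verify=2.0, t_draw=10.0):
    """Expected seconds for: verify boxes in order, stop at the first accepted box,
    and draw manually if the first `num_verifications` boxes are all rejected."""
    expected = 0.0
    p_all_rejected = 1.0                   # probability we are still searching
    for p in accept_probs[:num_verifications]:
        expected += p_all_rejected * t_verify
        p_all_rejected *= (1.0 - p)
    return expected + p_all_rejected * t_draw

def best_num_verifications(accept_probs, t_verify=2.0, t_draw=10.0):
    """Number of verification steps that minimizes the expected dialog time."""
    candidates = range(len(accept_probs) + 1)
    return min(candidates,
               key=lambda k: expected_dialog_time(accept_probs, k, t_verify, t_draw))

strong_detector = [0.9, 0.3, 0.1]          # hypothetical acceptance probabilities
weak_detector = [0.1, 0.05, 0.02]

print(best_num_verifications(strong_detector))  # verifying a couple of boxes pays off
print(best_num_verifications(weak_detector))    # cheaper to go straight to drawing
```

Under these assumptions a verification step is only worth its cost when the box is likely enough to be accepted, which is why a single fixed dialog wastes time on images where the detector is either very strong or very weak.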
Connecting the Dots
Building object detection analytics on geospatial imagery requires careful design of both model architecture and data preparation to suit the unique aspects of the data e.g. the resolution, on-nadir perspective, and radiometric variation. The findings in “Functional Map of the World” show that fusing imagery metadata with CNN features can yield better performance. Detail Preserving Pooling and Perceptual GANs can improve detection of small objects. Temporal Action Localization can help identify a complex series of events such as building construction, which play out over many timepoints and can occur anywhere. IAD agents can improve the efficiency of creating large datasets with human-in-the-loop workflows. Together, these novel approaches can move us closer to fully automated, high performing object detection, well suited to the unique aspects of satellite imagery.
What were your favorite deep learning papers in 2018? Let us know in the comments. Happy holidays and see you in the new year!