A baseline model for the SpaceNet 4: Off-Nadir Building Detection Challenge
Note: SpaceNet is a collaborative effort between CosmiQ Works, DigitalGlobe, and Radiant Solutions hosted on Amazon Web Services as a public dataset. To learn more, visit https://spacenetchallenge.github.io/.
This post is part 2 of a series about the SpaceNet 4: Off-Nadir Dataset and Building Detection Challenge. For the first part of the series, click here.
The SpaceNet 4: Off-Nadir Building Detection Challenge has begun, and participants are vying for $50,000 in prize money by competing to see who can most accurately identify buildings in 27 different WorldView-2 satellite image collects taken at different angles over Atlanta. For competition specifics and details about the underlying dataset, check out the Topcoder challenge website and the dataset announcement blog post.
Mapping building footprints from imagery acquired at different nadir angles poses unique challenges, as detailed below. To highlight these and help challenge competitors get started, we built a set of baseline building footprint detection models and then evaluated their building detection performance. The source code and instructions to get the models up and running are available in a repo on CosmiQ Works’ GitHub.
We chose a semantic segmentation approach followed by object boundary tracing to identify building outlines. For the segmentation step we trained models with the TernausNetV1 architecture (a U-Net with a VGG11 encoder, not using pre-trained weights in our case; for full details, see their paper and the source code) using Keras. We trained the model to produce a binary mask of building footprints using 70% of the Pan-Sharpened RGB imagery from the SpaceNet 4 challenge training dataset tiles, holding out 15% of the tiles for validation and 15% for internal testing. To explore how nadir angle impacts model performance, we trained TernausNetV1 four times: once with image tiles taken from all of the collects, once with only the “nadir” (0–25 degree) collects, once with the “off-nadir” (26–40 degree) collects, and once with “far off-nadir” (>40 degree) collects.
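As a sketch, the 70/15/15 tile split described above could be produced like this (a minimal illustration of the partitioning idea; the function name and seed are ours, not taken from the baseline repo):

```python
import random

def split_tiles(tile_ids, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle tile IDs, then partition into train / validation / test."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_tiles(range(1000))  # 700 / 150 / 150 tiles
```

Splitting by whole tiles, rather than by pixels or crops, keeps the held-out evaluation honest: no part of a validation or test tile is ever seen during training.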
Model training and loss
Training the models with standard loss functions proved challenging. Semantic segmentation models are often trained using binary or categorical cross-entropy loss; however, these functions can struggle when segmentation outputs have imbalanced classes. In the SpaceNet 4 training dataset, only 9.5% of pixels correspond to building footprints. Possibly due to the class imbalance, our models quickly learned to classify all pixels as belonging to background when using a binary cross-entropy loss function (for more on this topic, see this paper).
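To see concretely why plain BCE invites the all-background solution, consider a toy numpy example with a simulated 9.5%-building mask (an illustration of the failure mode, not the actual training code): predicting a uniform low probability everywhere already yields a small loss with zero skill.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# A flattened "mask" in which 9.5% of pixels are building (class 1).
y_true = np.concatenate([np.ones(95), np.zeros(905)])

# The lazy "all background" solution: a constant low probability everywhere.
all_background = np.full_like(y_true, 0.095)
loss = bce(y_true, all_background)  # ~0.31: already low, with zero skill
```

Gradient descent can settle into this flat region of the loss landscape and never learn to mark any pixel as building.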
To overcome the “all-background valley” during training, we tried two approaches: a weighted binary cross-entropy loss and a composite binary cross-entropy + Jaccard loss.
Weighted Binary Cross-entropy Loss: First, we weighted our binary cross-entropy loss function to penalize misclassification of building pixels more strongly than that of background pixels; however, this model performed very poorly at the boundaries of buildings, often merging adjacent objects, as shown below. This resulted in an unacceptably low SpaceNet F1 score (at IoU > 0.5) of ~0.2 on nadir imagery.
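A weighted BCE of this kind can be sketched in numpy as follows (an illustrative implementation; the weight value is our assumption, not the one used in the baseline):

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=4.0, eps=1e-7):
    """Binary cross-entropy that penalizes errors on building pixels
    (class 1) pos_weight times more heavily than errors on background.
    pos_weight=4.0 is an illustrative value, not the baseline's."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_pixel = -(pos_weight * y_true * np.log(y_pred)
                  + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(per_pixel))

# A missed building pixel now costs 4x an equally wrong background pixel:
miss = weighted_bce(np.array([1.0]), np.array([0.1]))         # ~9.21
false_alarm = weighted_bce(np.array([0.0]), np.array([0.9]))  # ~2.30
```

The asymmetry escapes the all-background valley, but because every building pixel is up-weighted equally, the model is nudged toward over-predicting building area, which is consistent with the merged-boundary behavior we observed.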
Composite Binary Cross-entropy + Jaccard Loss: This composite loss function is similar to one described in the previously mentioned article. Specifically, we used a composite of binary cross-entropy (BCE) and Jaccard loss:
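In a common form (following the TernausNet paper's approach of combining pixel-wise cross-entropy with a soft Jaccard term), the composite can be written as L = α·L_BCE + (1 − α)·(1 − J), where J is the differentiable (soft) Jaccard index; the equal weighting below is our illustrative assumption, not necessarily the baseline's exact formulation. A numpy sketch:

```python
import numpy as np

def soft_jaccard(y_true, y_pred, smooth=1e-7):
    """Differentiable (soft) Jaccard index computed on probabilities."""
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)

def composite_loss(y_true, y_pred, alpha=0.5, eps=1e-7):
    """alpha * BCE + (1 - alpha) * Jaccard loss; alpha=0.5 is illustrative."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(alpha * bce + (1 - alpha) * (1.0 - soft_jaccard(y_true, y_pred)))

# The all-background prediction is no longer a comfortable minimum:
y_true = np.concatenate([np.ones(95), np.zeros(905)])   # 9.5% building pixels
all_bg = np.full(1000, 0.095)                           # predict background everywhere
decent = np.where(y_true == 1, 0.8, 0.1)                # an imperfect but real attempt
```

The Jaccard term is what changes the landscape: an all-background prediction has near-zero overlap with the true footprints, so its Jaccard loss stays close to 1 no matter how small the BCE term gets.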
This function penalizes misclassification of background pixels while strongly penalizing under-prediction of building pixels, thereby eliminating the “all-background valley” in the loss landscape. Training the TernausNetV1 model architecture with this loss function went off without a hitch, completing in under 8 hours on a single NVIDIA Titan Xp GPU using a batch size of 4 and 512-by-512 crops. Interestingly, the loss values for training and validation data matched very closely, suggesting that the models did not substantially overfit. There is a lot of room to explore batch size, model depth, and other parameters if you wish to iterate on our model! Source code for defining, training, and evaluating the models is available here.
Data preparation and augmentation
We experimented with a number of data pre-processing parameters — image crop sizes, data augmentation methods, and convolutional filter count, among others — and found that many of them impacted model performance. Smaller crops from the imagery (256-by-256) were much less effective for model training. Check out our source code to see the full details of what we ended up with, but in short, we used: rotations, x/y flipping, and a reduction to 8-bit depth. There is still a lot to explore, however — from using the full bit depth of the source imagery, to using the 4th Pan-Sharpened Near-IR channel, to the 8-channel MUL imagery, and much more — plenty of room to improve your models and beat our baseline!
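The augmentations listed above can be sketched in a few lines of numpy (a simplified stand-in for the baseline's actual pipeline; the rotation set, flip probabilities, and bit-depth conversion shown here are illustrative):

```python
import numpy as np

def augment(tile, rng):
    """Random 90-degree rotation, random x/y flips, and reduction to 8-bit
    depth: a simplified stand-in for the baseline's augmentation pipeline."""
    tile = np.rot90(tile, k=int(rng.integers(4)))
    if rng.random() < 0.5:
        tile = tile[:, ::-1]   # flip in x
    if rng.random() < 0.5:
        tile = tile[::-1, :]   # flip in y
    # One simple way to squash e.g. 11-bit WorldView values into 8 bits;
    # the baseline's actual conversion may differ.
    tile = (tile / tile.max() * 255).astype(np.uint8)
    return tile

rng = np.random.default_rng(0)
fake_tile = rng.integers(1, 2048, size=(512, 512, 3))  # stand-in for an image crop
out = augment(fake_tile, rng)
```

Rotations and flips are safe for overhead imagery because buildings have no canonical orientation, which is exactly why they are a standard first choice for remote sensing augmentation.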
A critical part of any image segmentation pipeline is post-processing of the segmented output, and this is an area where we have left a lot of room for improvement. For our baseline, our only post-processing was to remove very small (< 20 px) footprints from our model output. As is often the case in cities, many of the building footprints were juxtaposed within a few pixels of one another, and segmentation often joined these footprints into a single polygon (see the example above). We encourage competitors to explore further post-processing, such as classical computer vision methods like binary morphology operations and watershedding, to improve their submissions.
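The small-footprint filter described above amounts to dropping connected components under 20 pixels. A dependency-free sketch of that idea (our own illustrative implementation, not the baseline's code, which may use a different connectivity or library):

```python
import numpy as np
from collections import deque

def drop_small_footprints(mask, min_px=20):
    """Remove 4-connected components smaller than min_px pixels from a
    binary building mask; an illustrative version of the baseline's only
    post-processing step."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros((h, w), dtype=bool)
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                component, queue = [], deque([(r, c)])
                seen[r, c] = True
                while queue:                     # BFS over one footprint
                    y, x = queue.popleft()
                    component.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(component) >= min_px:     # keep only real footprints
                    for y, x in component:
                        out[y, x] = True
    return out

mask = np.zeros((64, 64), dtype=bool)
mask[5:10, 5:10] = True      # 25-pixel footprint: kept
mask[40:42, 40:44] = True    # 8-pixel speck: removed
cleaned = drop_small_footprints(mask)
```

Note that this only removes noise; it does nothing to split the merged adjacent buildings described above, which is where morphology or watershed-based approaches could pay off.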
And now the segment you have all been weighting for (ha): the baseline model performance! After training the baseline models, we evaluated their performance against the held-out images from the training set as well as the test set. The results were striking: the model trained only on nadir data (the only angles available in previous SpaceNet datasets) detected buildings modestly well in nadir imagery (footprint F1 score of 0.679 at IoU > 0.5), but very poorly on images collected far off-nadir, dropping as low as 0.003 on imagery acquired at 53 degrees off-nadir (see the plot below). The model trained on imagery from all angles performed nearly as well on the nadir imagery (F1 of 0.638) and showed some improvement through the off-nadir angles. Surprisingly, the model trained on all of the data out-performed the models trained only on off-nadir (26–40 degrees) or far off-nadir (>40 degrees) imagery, even within the specific angle ranges those models were trained on. Inference on the test set takes about an hour, and inference on the final evaluation set takes approximately 6 hours, on a single NVIDIA Titan Xp GPU using the overlapping-tile approach described in the source code (which is much slower than typical inference methods).
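For intuition on the scoring, the footprint F1 at IoU > 0.5 can be sketched with axis-aligned boxes standing in for real footprints (the actual SpaceNet metric matches georegistered polygons; the boxes, greedy matching, and values here are purely illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def f1_at_iou(preds, truths, thresh=0.5):
    """Greedily match each prediction to an unmatched ground-truth footprint
    at IoU > thresh, then compute F1 from the match counts."""
    unmatched, tp = list(truths), 0
    for p in preds:
        for t in unmatched:
            if iou(p, t) > thresh:
                tp += 1
                unmatched.remove(t)
                break
    fp, fn = len(preds) - tp, len(truths) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]  # one good match, one false positive
score = f1_at_iou(preds, truths)            # 1 TP, 1 FP, 1 FN -> F1 = 0.5
```

Because each prediction must clear the 0.5 IoU bar against a single ground-truth footprint, merged polygons that span two buildings usually fail to match either one, which is why the post-processing discussed above matters so much for the final score.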
As you can see, model performance varies dramatically across collects at very similar angles, particularly in the off-nadir range. We will have another DownLinQ post coming soon about this phenomenon, but if you want to try to determine what’s going on yourself, check out the images below as well as collect metadata at the SpaceNet Dataset Website.
The full source code to train and run the models described here is provided in the CosmiQ SpaceNet 4 Baseline GitHub Repo. There are scripts for running each step of the process, as well as a pip-installable library of utility functions for Keras and general remote sensing data processing. There are some useful bits of code in there that we didn't have space to discuss here (hint: check out model.py!). See the README there for usage details.
We hope you participate in this exciting challenge! Start by registering at the SpaceNet Challenge Page and then download the dataset to compete in the SpaceNet Off-Nadir Building Detection Challenge.
In the next couple of weeks, SpaceNet partners will be releasing several blog posts that discuss different aspects of off-nadir imagery.
Thanks for reading, and good luck in SpaceNet 4: Off-Nadir Building Footprints!