Building Extraction with YOLT2 and SpaceNet Data
The first SpaceNet Challenge to identify building outlines from satellite imagery demonstrated that the field of computer vision as applied to satellite imagery remains relatively nascent. In many computer vision tasks (e.g. ImageNet), accuracies >95% are common, even expected. The winning SpaceNet Challenge score of F1=0.26 underscores the challenges of extracting building footprints from satellite imagery in diverse and often very crowded scenes.
The majority of submissions to the SpaceNet Challenge utilized an image segmentation approach, where each pixel in an image is labeled as belonging to one of several classes (in this case: building or background). Since the goal of the SpaceNet Challenge was to provide exact building outlines, this approach makes sense: if every pixel were classified correctly, every building would be precisely defined.
In this post we detail a different approach: object detection with the YOLT2 pipeline. Recall that YOLT outputs bounding rectangles around objects of interest. As such, this approach will never achieve perfect building footprint detection. Nevertheless, we demonstrate that this approach proves competitive for the challenge evaluation metric of assigning a true positive to any proposal with a Jaccard index ≥ 0.5 compared to ground truth.
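To make the evaluation metric concrete, here is a minimal sketch of the Jaccard index (intersection over union) for two cardinally oriented rectangles. It is a simplification: the challenge compares proposals against building footprint polygons, whereas this sketch assumes both shapes are rectangles. The function name and the (xmin, ymin, xmax, ymax) box layout are illustrative choices, not part of the YOLT2 code.

```python
def jaccard_index(box_a, box_b):
    """Jaccard index of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical example: a 10 x 10 footprint and a proposal covering its central 8 x 8.
footprint = (0.0, 0.0, 10.0, 10.0)
proposal = (1.0, 1.0, 9.0, 9.0)
is_true_positive = jaccard_index(footprint, proposal) >= 0.5
```

In this toy case the intersection is 64 and the union is 100, so the Jaccard index is 0.64 and the proposal would count as a true positive.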
1. YOLT2
Recall that YOLO (upon which YOLT is based) is an object detection framework that uses a 7x7 final grid, meaning that each object is assigned to one of 49 grid cells. YOLO version 2 incorporates a number of improvements over the original paper, such as batch normalization, finer grained features, multi-scale training, and a denser 13x13 final grid. These enhancements improve the accuracy to state-of-the-art (see Table 3 in YOLO version 2), while maintaining a speed advantage over other options such as Faster R-CNN and SSD. Many of these improvements were independently implemented in the version of YOLT discussed in previous blogs, and the remaining improvements have been incorporated into YOLT version 2.
2. Training Data
We utilize data from the first SpaceNet Challenge, obtainable from AWS. In the previous post we discussed methods for transforming the GeoJSON label files into formats more conducive to machine learning. Recall that YOLT2 requires cardinally oriented rectangles to label ground truth. In this post we utilize the NumPy arrays of building pixel coordinates to infer bounding boxes around buildings.
In most computer vision object detection paradigms, bounding boxes fully encompass objects of interest. Here, however, the goal is a Jaccard index ≥ 0.5 against the building footprint, so the ground truth bounding boxes used for YOLT2 deliberately do not fully enclose the buildings, as illustrated in Figure 1.
To construct training data we utilize the geojson_to_pixel_arr.py script and, as in Figure 1, construct a bounding box extending 80% of the length and width of each building in the training dataset. Examples are shown in Figure 2.
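The box construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in geojson_to_pixel_arr.py: the function name `shrunken_bbox` is hypothetical, the input is assumed to be an (N, 2) array of (x, y) pixel coordinates for one building, and the box is assumed to be centered on the building's extent.

```python
import numpy as np


def shrunken_bbox(pixel_coords, scale=0.8):
    """Return (xmin, ymin, xmax, ymax) spanning `scale` of a building's extent.

    pixel_coords: iterable of (x, y) pixel coordinates outlining one building.
    The box is centered on the midpoint of the building's extent (an assumption).
    """
    pts = np.asarray(pixel_coords, dtype=float)
    xmin, ymin = pts.min(axis=0)
    xmax, ymax = pts.max(axis=0)

    # Center of the full extent, and half-sizes of the scaled box.
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    half_w = scale * (xmax - xmin) / 2.0
    half_h = scale * (ymax - ymin) / 2.0

    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Hypothetical building outline, 10 pixels wide and 20 pixels tall.
outline = [(0, 0), (10, 0), (10, 20), (0, 20)]
box = shrunken_bbox(outline)  # spans 8 x 16 pixels around the center (5, 10)
```

Exposing the scale as a parameter makes it easy to experiment with other fractions; 0.8 is the value quoted in the text.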
3. Model Training
We train on 90% of the SpaceNet training dataset, discarding images without any buildings present; the remaining 10% is withheld for internal testing purposes. This leaves 3926 labeled 200 x 200 meter images for training purposes. Image cutouts for the pan-sharpened 3-band imagery are 438–439 pixels in width, and 406–407 pixels in height. We craft a new network architecture with a denser 26 x 26 final grid to accurately localize buildings in the highly concentrated regions of central Rio de Janeiro. Training occurs for seven days on a single NVIDIA Titan X GPU.
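A quick back-of-the-envelope calculation helps show why a denser grid matters for crowded scenes. The sketch below assumes the final grid evenly tiles a single 200 meter, 438 pixel wide chip; how YOLT2 actually resizes or pads its input is not covered here, so treat the numbers as approximate. The constant and function names are illustrative.

```python
CHIP_SIZE_M = 200.0   # ground extent of one chip, from the text
CHIP_WIDTH_PX = 438   # chip width in pixels, from the text


def cell_size(grid, chip_px=CHIP_WIDTH_PX, chip_m=CHIP_SIZE_M):
    """Return (pixels per cell, meters per cell) along one axis of the final grid."""
    px_per_cell = chip_px / grid
    m_per_cell = chip_m / grid
    return px_per_cell, m_per_cell


for grid in (13, 26):
    px, m = cell_size(grid)
    print(f"{grid} x {grid} grid: {px:.1f} px per cell, about {m:.1f} m on the ground")
```

Under these assumptions a 13 x 13 grid gives cells roughly 15 meters across, while the 26 x 26 grid halves that to under 8 meters, which is closer to the scale of individual buildings in dense neighborhoods.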
4. Model Evaluation
The YOLT2 SpaceNet model is evaluated on the entirety of the SpaceNet test dataset from the SpaceNet Challenge. For the 200 x 200 meter test chips the YOLT2 pipeline inference proceeds at a rate of 45 frames per second. Post-processing is minimal, simply consisting of non-max suppression. We achieve an F1 score of 0.21 over the test set; this score is certainly far from ideal, though it would have been moderately competitive in the challenge results (reported scores are F1 * 1,000,000). Example outputs are shown below.
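For readers unfamiliar with non-max suppression, the following is a minimal greedy sketch in NumPy. It is not the YOLT2 implementation; the function name, the (xmin, ymin, xmax, ymax) box layout, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of kept boxes, highest score first.

    boxes: (N, 4) array-like of (xmin, ymin, xmax, ymax).
    scores: (N,) array-like of detection confidences.
    iou_threshold: assumed value; boxes overlapping a kept box by more are dropped.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score

    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))

        # Overlap between the best remaining box and every other candidate.
        x0 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y0 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x1 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y1 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        union = areas[best] + areas[rest] - inter
        iou = inter / np.maximum(union, 1e-12)

        # Keep only candidates that do not overlap the chosen box too much.
        order = rest[iou <= iou_threshold]
    return keep


# Hypothetical detections: the second box heavily overlaps the first.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this toy example the second box is suppressed because it overlaps the higher scoring first box, while the distant third box survives.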
The actual F1 score of 0.21 is calculated via the SpaceNet Building Detector Visualizer, screenshots of which we display below.
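As a reminder of how an F1 score follows from detection counts, here is a small sketch. It is not the SpaceNet scoring code; the function name is illustrative and the counts in the example are invented purely to exercise the formula. A proposal is assumed to count as a true positive when its Jaccard index with a ground truth footprint is at least 0.5.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 score, the harmonic mean of precision and recall, from detection counts."""
    predicted = true_pos + false_pos   # all proposals
    actual = true_pos + false_neg      # all ground truth buildings
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts for illustration only (not results from this post).
example_f1 = f1_score(true_pos=3, false_pos=1, false_neg=2)

# The challenge leaderboard reports F1 * 1,000,000.
leaderboard_score = example_f1 * 1_000_000
```

With these invented counts, precision is 0.75 and recall is 0.6, giving an F1 of about 0.67.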
5. Conclusions
The YOLT2 detection pipeline is designed to rapidly localize objects of interest in satellite imagery via rectangular bounding boxes. This design has shown promise in detecting both vehicles and infrastructure projects, though it is suboptimal for the task of precisely defining the highly varied shapes and orientations of building footprints. Nevertheless, we achieve reasonably competitive results on the SpaceNet Challenge.
Inspection of results indicates a large number of proposals with a Jaccard index between 0.4 and 0.5, very nearly at the threshold for a true positive. Further post-processing may nudge those scores above the threshold. In addition, a denser grid than the 26 x 26 grid used here may improve results, at the cost of increased computation time.
In the long term, image segmentation algorithms offer the best approach for defining the precise outline of buildings. Yet if the goal is to rapidly determine the location and approximate area of buildings, the YOLT2 pipeline may prove useful.
May 29, 2018 Addendum: See this post for paper and code details.