Building Extraction with YOLT2 and SpaceNet Data

The first SpaceNet Challenge to identify building outlines from satellite imagery demonstrated that the field of computer vision as applied to satellite imagery remains relatively nascent. In many computer vision tasks (e.g. ImageNet), accuracies >95% are common, even expected. The winning SpaceNet Challenge score of F1=0.26 underscores the challenges of extracting building footprints from satellite imagery in diverse and often very crowded scenes.

The majority of submissions to the SpaceNet Challenge utilized an image segmentation approach, where each pixel in an image is labeled as belonging to one of several classes (in this case: building or background). Since the goal of the SpaceNet Challenge was to provide exact building outlines, this approach makes sense: if every pixel were classified correctly, every building would be precisely defined.

In this post we detail a different approach: object detection with the YOLT2 pipeline. Recall that YOLT outputs bounding rectangles around objects of interest. As such, this approach will never achieve perfect building footprint detection. Nevertheless, we demonstrate that this approach proves competitive for the challenge evaluation metric of assigning a true positive to any proposal with a Jaccard index ≥ 0.5 compared to ground truth.

1. YOLT2

Recall that YOLO (upon which YOLT is based) is an object detection framework that uses a 7x7 final grid, meaning that each object is assigned to one of 49 grid cells. YOLO version 2 incorporates a number of improvements over the original version, such as batch normalization, finer grained features, multi-scale training, and a denser 13x13 final grid. These enhancements improve the accuracy to state-of-the-art (see Table 3 in the YOLO version 2 paper), while maintaining a speed advantage over other options such as Faster R-CNN and SSD. Many of these improvements were independently implemented in the version of YOLT discussed in previous blogs, and the remaining improvements have been incorporated into YOLT version 2.

2. Training Data

We utilize data from the first SpaceNet Challenge, obtainable from AWS. In the previous post we discussed methods for transforming the GeoJSON label files into formats more conducive to machine learning. Recall that YOLT2 requires cardinally oriented rectangles to label ground truth. In this post we utilize the NumPy arrays of building pixel coordinates to infer bounding boxes around buildings.

In most computer vision object detection paradigms, bounding boxes fully encompass objects of interest. Our goal is to achieve a Jaccard index ≥ 0.5, so the ground truth bounding boxes used for YOLT2 do not fully enclose the buildings, as illustrated in Figure 1.

Figure 1. Proposed bounding boxes for YOLT2 training. Left: Ground truth building outline shown in red. Middle: Bounding box (white) that fully encompasses the red building; given the non-cardinal orientation of this building the Jaccard index is below the threshold of 0.5. Right: bounding box extending only 80% of the length and width of the building, which gives a Jaccard index of 0.51, greater than the threshold for a true positive detection. For labeling purposes we therefore use the partial box depicted on the right.
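The effect shown in Figure 1 can be checked numerically. The sketch below is an illustration only and is not part of the YOLT2 code: it assumes an idealized square building rotated by 45 degrees (not the building in Figure 1), and the helper names (polygon_area, clip_to_box, jaccard, shrink_box) are hypothetical. It clips the footprint to a cardinal box with Sutherland-Hodgman clipping and compares the Jaccard index of the full enclosing box with that of a box covering 80% of the length and width.

```python
def polygon_area(pts):
    """Shoelace area of a polygon given as a list of (x, y) vertices."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_to_box(pts, box):
    """Clip a polygon to a cardinal box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    # Each clip edge: (coordinate axis, bound, keep points >= bound?)
    edges = [(0, xmin, True), (0, xmax, False), (1, ymin, True), (1, ymax, False)]
    out = list(pts)
    for axis, bound, keep_ge in edges:
        inp, out = out, []
        for i in range(len(inp)):
            cur, prev = inp[i], inp[i - 1]
            cur_in = cur[axis] >= bound if keep_ge else cur[axis] <= bound
            prev_in = prev[axis] >= bound if keep_ge else prev[axis] <= bound
            if cur_in != prev_in:
                # The edge prev -> cur crosses the clip line: add the crossing.
                t = (bound - prev[axis]) / (cur[axis] - prev[axis])
                out.append((prev[0] + t * (cur[0] - prev[0]),
                            prev[1] + t * (cur[1] - prev[1])))
            if cur_in:
                out.append(cur)
    return out


def jaccard(footprint, box):
    """Jaccard index (intersection over union) of a footprint and a box."""
    xmin, ymin, xmax, ymax = box
    inter = polygon_area(clip_to_box(footprint, box))
    union = polygon_area(footprint) + (xmax - xmin) * (ymax - ymin) - inter
    return inter / union if union > 0 else 0.0


def shrink_box(box, frac=0.8):
    """Scale a box about its center so it spans `frac` of each dimension."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    hw, hh = frac * (xmax - xmin) / 2.0, frac * (ymax - ymin) / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)


# Assumed toy footprint: a square building rotated 45 degrees.
building = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
full_box = (-1.0, -1.0, 1.0, 1.0)

iou_full = jaccard(building, full_box)                 # 0.5
iou_partial = jaccard(building, shrink_box(full_box))  # about 0.676
print(round(iou_full, 3), round(iou_partial, 3))
```

For this toy case the fully enclosing box sits exactly at the 0.5 threshold, while the 80% box scores higher because it gives up a little of the building in exchange for much less background.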

To construct training data we utilize the geojson_to_pixel_arr.py script and, as in Figure 1, construct a bounding box extending 80% of the length and width of each building in the training dataset. Examples are shown in Figure 2.
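As a rough sketch of this labeling step (not the actual YOLT2 label code), the snippet below takes a list of (x, y) pixel coordinates for one building, standing in for the NumPy arrays mentioned above, and returns a cardinal box spanning 80% of the building's extent. We assume the partial box is centered on the full enclosing box. The function names and the Darknet-style normalized label (class, x center, y center, width, height) are assumptions made for illustration.

```python
def partial_box(pixel_coords, frac=0.8):
    """Cardinal box covering `frac` of the footprint extent in x and y.

    Assumption: the box is centered on the full enclosing box.
    """
    xs = [x for x, _ in pixel_coords]
    ys = [y for _, y in pixel_coords]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    half_w = frac * (max(xs) - min(xs)) / 2.0
    half_h = frac * (max(ys) - min(ys)) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def to_normalized_label(box, img_w, img_h, class_id=0):
    """Hypothetical Darknet-style label with values scaled to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    return (class_id,
            (xmin + xmax) / 2.0 / img_w,
            (ymin + ymax) / 2.0 / img_h,
            (xmax - xmin) / img_w,
            (ymax - ymin) / img_h)


# Made-up footprint of a building that is not cardinally oriented.
footprint = [(100, 50), (140, 90), (120, 110), (80, 70)]
box = partial_box(footprint)                  # enclosing box shrunk to 80%
label = to_normalized_label(box, img_w=438, img_h=406)
print(box)
print(label)
```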

Figure 2. Examples of YOLT2 SpaceNet training labels. The left column depicts ground truth labels overlaid in yellow on the image cutouts, whereas the right column shows YOLT2 bounding box labels in red. In dense areas the bounding boxes often overlap, complicating efforts to disentangle nearby buildings.

3. Model Training

We train on 90% of the SpaceNet training dataset, discarding images without any buildings present; the remaining 10% is withheld for internal testing purposes. This leaves 3926 labeled 200 x 200 meter images for training purposes. Image cutouts for the pan-sharpened 3-band imagery are 438–439 pixels in width, and 406–407 pixels in height. We craft a new network architecture with a denser 26 x 26 final grid to accurately localize buildings in the highly concentrated regions of central Rio de Janeiro. Training runs for seven days on a single NVIDIA Titan X GPU.
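To give a feel for why the denser grid matters, the short calculation below estimates how much ground and how many pixels a single grid cell covers for the 7 x 7, 13 x 13, and 26 x 26 grids. It is a back-of-the-envelope sketch that assumes an entire 200 meter, 438 pixel wide chip maps onto the final grid with no tiling or padding; the function name is ours.

```python
CHIP_SIZE_M = 200.0   # ground extent of one image chip, in meters
CHIP_WIDTH_PX = 438   # approximate chip width, in pixels


def cell_size(grid, extent):
    """Extent covered by one cell when `extent` is split into `grid` cells."""
    return extent / grid


for grid in (7, 13, 26):
    meters = cell_size(grid, CHIP_SIZE_M)
    pixels = cell_size(grid, CHIP_WIDTH_PX)
    print(f"{grid} x {grid} grid: {meters:.1f} m ({pixels:.1f} px) per cell")
```

Under these assumptions a 26 x 26 grid cell spans under 8 meters, which is closer to the scale of a single small building than the roughly 15 meter cells of a 13 x 13 grid.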

4. Model Evaluation

The YOLT2 SpaceNet model is evaluated on the entirety of the SpaceNet test dataset from the SpaceNet Challenge. For the 200 x 200 meter test chips the YOLT2 pipeline inference proceeds at a rate of 45 frames per second. Post-processing is minimal, simply consisting of non-max suppression. We achieve an F1 score of 0.21 over the test set; this score is certainly far from ideal, though it would have been moderately competitive in the challenge results (reported scores are F1 * 1,000,000). Example outputs are shown below.
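For readers unfamiliar with non-max suppression, a minimal greedy version is sketched below. This is a generic illustration rather than the YOLT2 implementation: the detection format (xmin, ymin, xmax, ymax, confidence), the function names, and the overlap threshold of 0.5 are all assumptions.

```python
def box_iou(a, b):
    """Jaccard index of two cardinal boxes given as (xmin, ymin, xmax, ymax, ...)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_thresh=0.5):
    """Keep the most confident box among each group of heavily overlapping boxes."""
    kept = []
    # Visit detections from most to least confident.
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        # Keep a detection only if it does not overlap a kept box too much.
        if all(box_iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


# Made-up detections: the first two overlap heavily, the third stands alone.
detections = [
    (10, 10, 50, 50, 0.9),
    (12, 12, 52, 52, 0.8),
    (100, 100, 140, 140, 0.7),
]
kept = non_max_suppression(detections)
print([d[4] for d in kept])  # [0.9, 0.7]
```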

Figure 3. Example YOLT2 outputs on SpaceNet test data. The top two images demonstrate that YOLT2 struggles in central urban environments with adjacent large buildings. The lower two images are somewhat better, as YOLT2 performs relatively well with separated buildings.

The actual F1 score of 0.21 is calculated via the SpaceNet Building Detector Visualizer, screenshots of which we display below.
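The scoring logic itself is simple to sketch. The snippet below is not the Visualizer's code; it is a simplified illustration that treats both proposals and ground truth as cardinal boxes (the challenge compares proposals against polygon footprints) and assumes each ground truth building can be matched by at most one proposal. A proposal counts as a true positive when its Jaccard index with an unmatched building is at least 0.5, and F1 is the harmonic mean of precision and recall.

```python
def box_iou(a, b):
    """Jaccard index of two cardinal boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_f1(proposals, truths, thresh=0.5):
    """Return (true positives, false positives, false negatives, F1)."""
    matched = set()
    tp = 0
    for prop in proposals:
        # Find the best still-unmatched ground truth box for this proposal.
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            iou = box_iou(prop, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= thresh:
            matched.add(best_idx)
            tp += 1
    fp = len(proposals) - tp
    fn = len(truths) - tp
    precision = tp / len(proposals) if proposals else 0.0
    recall = tp / len(truths) if truths else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return tp, fp, fn, f1


# Made-up example: one good proposal, one too small, one in empty space.
truths = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
proposals = [(1, 1, 10, 10), (20, 20, 26, 26), (60, 60, 70, 70)]
tp, fp, fn, f1 = score_f1(proposals, truths)
print(tp, fp, fn, round(f1, 3))  # 1 2 2 0.333
```

The second proposal overlaps its building with a Jaccard index of only 0.36, so it is scored as a false positive and the building itself as a false negative.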

Figure 4. Screenshots of the SpaceNet Building Detector Visualizer evaluation of YOLT2 outputs. White boxes denote true positives, yellow boxes denote false positives, and blue shows ground truth for those buildings not correctly predicted; the Jaccard index is displayed over each proposal. Top + Middle: YOLT2 generally does well for separated buildings, though it struggles with unusual outlines. Bottom: Dense urban areas are more challenging due to the overlap of bounding boxes; these scenes are also weighted higher in the total score due to the much greater number of buildings.

5. Conclusions

The YOLT2 detection pipeline is designed to rapidly localize objects of interest in satellite imagery via rectangular bounding boxes. This design has shown promise in detecting both vehicles and infrastructure projects, though it is suboptimal for the task of precisely defining the highly varied shapes and orientations of building footprints. Nevertheless, we achieve reasonably competitive results on the SpaceNet Challenge.

Inspection of results indicates a large number of proposals with a Jaccard index between 0.4 and 0.5, very nearly at the threshold for a true positive. Further post-processing may nudge those scores above the threshold. In addition, a denser grid than the 26 x 26 grid used here may improve results, at the cost of increased computation time.

In the long term, image segmentation algorithms offer the best approach for defining the precise outline of buildings. Yet if the goal is to rapidly determine the location and approximate area of buildings, the YOLT2 pipeline may prove useful.