How to make a building footprint detector
Introduction
Why detect building footprints? You have probably heard many times that buildings can be detected from satellite images, but for what purpose? Having up-to-date maps of buildings and settlements is key for tasks ranging from disaster and crisis response to locating eligible rooftops for solar panels. Creating automated maps of buildings from aerial or satellite imagery is the best way to obtain large-scale, up-to-date geospatial information on populations and their settlements.
In this article, we will learn how to make our own building footprint detector from aerial or satellite imagery. In the first part, we present the standard machine learning approach to this kind of problem and discuss its limitations. In the second part, we present an alternative approach, using Picterra, that remedies those limitations.
1 Traditional machine learning approach
a) Data handling & preparation
As in every machine learning project, the first thing to do is to collect labeled data, in our case images of buildings with their corresponding footprints (segmentation masks).
The preparation of data is very often overlooked in machine learning projects. This phase of the pipeline can represent up to 50% of the project workload if we don’t have an established data (pre-)processing pipeline.
We used data from the second SpaceNet challenge, available from Amazon Web Services [1]. This dataset consists of 10,608 pan-sharpened images at 30 cm spatial resolution (native resolution 1.20 m) covering 4 cities: Las Vegas, Paris, Shanghai and Khartoum (see Figure 1).
To train the model, we randomly took 80% of the dataset; the remaining 20% was used as a test set for evaluating the model.
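As a rough illustration (a sketch rather than the exact code we used, with `tile_ids` standing in for a hypothetical list of SpaceNet tile identifiers), such a random 80/20 split could look like this:

```python
import random

def train_test_split(tile_ids, train_fraction=0.8, seed=42):
    """Randomly split tile identifiers into train and test sets."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical usage: `all_tiles` would hold the 10,608 SpaceNet tile ids.
# train_ids, test_ids = train_test_split(all_tiles)
```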
b) Training procedure
We approached this problem as a semantic segmentation problem, with each pixel belonging to one of two classes: “building” or “not building”.
One problem with classifying pixels as “building” or “not building” is that pixels on building boundaries can be difficult to classify. To help the model learn per-pixel information about the location of the boundary and implicitly capture geometric properties, we add another learning objective: a signed (+/-) distance in which each pixel value indicates the distance to the closest building wall (border), with pixels inside buildings having positive values and pixels outside buildings having negative values (see Figure 2).
We truncated the distance to a maximum absolute value of 64 so that only the pixels nearest to the border are taken into account [2].
Instead of learning this signed distance directly and treating it as a regression problem, we uniformly quantized the distance values into 20 bins to facilitate training [3].
This new 20-class classification problem is an intermediate objective that makes the classification into “building” or “not building” more fine-grained and improves accuracy, especially for contiguous buildings.
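For illustration, here is a minimal sketch of how this truncated, quantized signed-distance target can be computed from a binary building mask using SciPy’s Euclidean distance transform (not necessarily our exact implementation):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_target(mask, clip=64, n_bins=20):
    """Build the auxiliary target from a binary building mask (1 = building)."""
    mask = mask.astype(bool)
    # Distance of building pixels to the nearest non-building pixel (the border)...
    dist_inside = distance_transform_edt(mask)
    # ...and of non-building pixels to the nearest building pixel.
    dist_outside = distance_transform_edt(~mask)
    # Signed distance: positive inside buildings, negative outside.
    signed = np.where(mask, dist_inside, -dist_outside)
    # Truncate to +/-64 so that only pixels near the border matter.
    signed = np.clip(signed, -clip, clip)
    # Uniformly quantize the truncated distance into 20 bins (classes 0..19).
    edges = np.linspace(-clip, clip, n_bins + 1)
    return np.digitize(signed, edges[1:-1])
```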
The model that learns to predict this representation is a variant of U-Net [4], a very popular end-to-end encoder-decoder network for semantic segmentation (see Figure 3). The chosen encoder is a ResNet50 [5] pretrained on ImageNet [6].
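As a sketch of this kind of architecture (not our exact implementation), such a model can be assembled with the `segmentation_models_pytorch` library. Here the two objectives simply share a single output head whose channels are split afterwards; separate decoder heads are another option:

```python
import torch
import segmentation_models_pytorch as smp

# U-Net decoder on top of a ResNet50 encoder pretrained on ImageNet.
model = smp.Unet(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,
    classes=2 + 20,  # 2 segmentation logits + 20 signed-distance bins
)

x = torch.randn(1, 3, 256, 256)   # dummy RGB tile
logits = model(x)                 # shape: (1, 22, 256, 256)
seg_logits, dist_logits = logits[:, :2], logits[:, 2:]
```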
The loss function defines what the model should focus on optimizing during training. In this case, it combines two objectives:
- the cross-entropy loss for the signed-distance-to-border objective,
- a combination of cross-entropy and IoU loss [7] for the segmentation mask.
The model also learned the weights between these three losses using the procedure described in [8].
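A minimal sketch of this combined loss, with one learned log-variance per term in the spirit of [8], could look as follows (an illustration only, not our exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_iou_loss(seg_logits, seg_target):
    """Differentiable IoU loss on the "building" probability, in the spirit of [7]."""
    p = seg_logits.softmax(dim=1)[:, 1]              # P(building) per pixel
    t = (seg_target == 1).float()
    inter = (p * t).sum(dim=(1, 2))
    union = (p + t - p * t).sum(dim=(1, 2))
    return (1 - inter / union.clamp(min=1e-6)).mean()

class MultiTaskLoss(nn.Module):
    """Weights the three losses with learned uncertainties, following [8]."""
    def __init__(self):
        super().__init__()
        # One log-variance per loss: segmentation CE, segmentation IoU, distance CE.
        self.log_vars = nn.Parameter(torch.zeros(3))

    def forward(self, seg_logits, dist_logits, seg_target, dist_target):
        # seg_target and dist_target are per-pixel class indices (long tensors).
        losses = torch.stack([
            F.cross_entropy(seg_logits, seg_target),
            soft_iou_loss(seg_logits, seg_target),
            F.cross_entropy(dist_logits, dist_target),
        ])
        precisions = torch.exp(-self.log_vars)
        return (precisions * losses + self.log_vars).sum()
```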
c) Evaluation
We used IoU (Intersection over Union) to evaluate our model’s results as in the SpaceNet challenge.
A building is considered detected if and only if the IoU between the ground-truth building and the predicted building is greater than or equal to 0.5.
Finally, we evaluate our model by combining precision (the fraction of predicted buildings that are actually buildings) and recall (the fraction of buildings that are detected) into one overall F1 score.
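For illustration, a simplified version of this evaluation (after the predicted mask has been vectorized into polygons) can be written with Shapely, matching each ground-truth building to at most one prediction at an IoU threshold of 0.5; the official SpaceNet scorer is more involved:

```python
from shapely.geometry import Polygon

def iou(a: Polygon, b: Polygon) -> float:
    """Intersection over Union of two building polygons."""
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def f1_score(predictions, ground_truths, threshold=0.5):
    """Greedy matching: each ground-truth building may be claimed by one prediction."""
    unmatched = list(ground_truths)
    true_positives = 0
    for pred in predictions:
        match = next((gt for gt in unmatched if iou(pred, gt) >= threshold), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```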
Our model had a final F1 score of 0.686. This is a bit below the score of the SpaceNet challenge winner (0.693), but we only used one model, whereas the winner used an ensemble of three different models.
Figure 4 illustrates results on some images in the test set.
d) Limitations
So now we have our own building footprint detector that performs well on the SpaceNet dataset, but that was not our main goal. We want to use it for our own use case, and this might be problematic for several reasons.
- The resolution of the training images.
We trained our model with images at 30 cm resolution, and it is unlikely that the model will be effective on images with a different resolution. One possible solution would be to train a model, using the same procedure described above, on images at the proper resolution, but such data might be hard to find since there are not many annotated building footprint datasets.
- Variations in overall appearance and context of our images.
Indeed, when we train a model to perform a task, the goal is to make the model accurate not only on the training data but, more importantly, on unseen data (generalization). That is why we report the score on the test set and not on the training set. Generally, the test set is somewhat related to the training set; for example, here we validate on different building footprints, but from the same cities.
However, if our images of interest are too different from the training set (different building architecture, different geographical characteristics, …), the model might not generalize as well.
For example, let's say we wanted to detect images that contain a dog.
If the only dog breed in the training set is the Labrador and our own images contain, for example, Chihuahuas, it is unlikely that the model will categorize them as dogs.
In other words, our images can be too different from the training set, and thus the model will be less effective on them.
The ideal solution to these issues would be to train a building footprint detector directly on our images of interest. This is what Picterra has to offer.
2 Alternative to the traditional training using Picterra
As we have seen, the major inconvenience of the traditional way of training comes from the need for an annotated dataset similar to the images we ultimately want to perform well on.
What Picterra allows is to train the model directly on the images of interest.
We will now describe the workflow to train a model with Picterra.
We will use two images of Austin in Canada at 30 cm spatial resolution. The first one will be used to train the Picterra model and the second one to see how well it generalizes.
First of all, we need to select zones (training areas, see Figure 5) and annotate every building inside each training area (see Figure 6). The annotated training areas play the same role as the training set for the model from Part 1. Finally, we just need to click on the build button and wait for the results.
We will visually compare the predictions of the model from the first part (trained on the much larger SpaceNet dataset) and the model trained with Picterra.
Figure 7 shows the predictions on the image where the training areas were drawn. We can see that the Picterra model gives better results than the classical model on this image, as expected. One important thing to notice is that it took only two minutes to train and predict, whereas it took a full day to train the classical model on a GTX 1080 Ti.
Figure 8 shows the detections on an unseen image for both models. We can see that the Picterra model is far better at finding small buildings but less precise with bigger ones. This can be explained by the fact that we mainly annotated small to medium-sized buildings. If we want to improve Picterra's model further, we can add training areas where it performed below expectations, directly in the platform. This is another key strength of Picterra's platform: it allows us to do active learning ourselves by inspecting the model's outputs so that we can help it improve. This is much harder to do with the traditional machine learning pipeline.
Conclusions
In this article, we gave an overview of how to make our own building detector using traditional machine learning and presented an alternative approach using Picterra. We noticed that the classical pipeline has inherent inconveniences that might make it hard to use for purposes other than winning a competition.
The major issue was that we needed annotated data similar to the images we want to work on. To bypass this issue, Picterra offers an approach where we can train a model directly on our own data, making it easy to annotate and iterate, and produce superior results. Another key point is that, by annotating data similar to the images we care about, we can provide the model with orders of magnitude less data than SpaceNet and still have it perform much better, while also training it in much less time. Superior results, with less training data, in less time: these are just some of the key strengths of Picterra's platform.
References
[1] SpaceNet on Amazon Web Services (AWS). “Datasets.” The SpaceNet catalog. Last modified April 30, 2018. https://spacenetchallenge.github.io/datasets/datasetHomePage.html. Accessed: 2019-04-14.
[2] B. Bischke, P. Helber, J. Folz, D. Borth, and A. Dengel, “Multi-task learning for segmentation of building footprints with deep neural networks,” CoRR, vol. abs/1709.05932, 2017.
[3] Z. Hayder, X. He, and M. Salzmann, “Shape-aware instance segmentation,” CoRR, vol. abs/1612.03129, 2016.
[4] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[7] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. S. Huang, “Unitbox: An advanced object detection network,” CoRR, vol. abs/1608.01471, 2016.
[8] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” CoRR, vol. abs/1705.07115, 2017.