3D Buildings from Imagery with AI. Part 1: From Elevation Rasters

Published in GeoAI · Jul 9, 2021 · 14 min read

by Dmitry Kudinov, Camille Lechot, David Yu, Hakeem Frank

Recent advancements in artificial neural networks that focus on reconstructing 3D meshes from input 2D images show great potential and significant practical value in a multitude of GIS applications.

This series of posts describes our experiments with one such neural network architecture that we applied to reconstruct 3D building shells from various types of remotely sensed data.

This post, the first of the series, describes extracting buildings from elevation rasters, specifically, normalized digital surface model rasters. You can find the “Part 2. Adding RGB Orthophotos” here.

Introduction

Modern municipal governments and national mapping agencies are evolving their traditional 2D geographic datasets into interactive, realistic-looking 3D digital twins to get more out of planning and analysis projects. For example, local governments responsible for urban design, public event planning, safety, pollution monitoring, solar radiation potential assessment, etc. rely more and more on this kind of data. Cost-efficient and prompt acquisition of accurate 3D buildings is vital to their success. In recent years, significant progress has been made in automating 3D building reconstruction workflows, yet the amount of manual labor spent on cleanup and quality control remains high, leading to slower adoption of the Digital Twin concept.

Can deep learning help in increasing the quality of reconstructions while reducing the labor cost? Let us see…

(…but if you are short on time and want to go straight to results, here is an interactive reconstruction)

Input: Normalized digital surface model rasters

A normalized digital surface model (nDSM) raster can be created by aggregating the heights of points in a lidar point cloud — usually collected by an aircraft or a drone — into square bins/pixels of a fixed size, such as 0.5m per pixel. The result is a georeferenced single-band raster whose pixel values represent heights above, in the case of a normalized raster, the local ground. To learn more about creating surface model rasters, see this imagery workflow and conversion tool.
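For intuition, here is a minimal NumPy sketch of the binning idea. It assumes the point heights are already normalized to the local ground and simply keeps the highest return per cell; real pipelines rely on dedicated lidar tooling such as the conversion tool linked above, and the function name and layout here are purely illustrative.

```python
import numpy as np

def points_to_ndsm(points_xyz, cell_size=0.5):
    """Bin (x, y, z) lidar points into a normalized DSM grid.

    points_xyz: (N, 3) array whose z values are heights above local ground.
    """
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    cols = ((x - x.min()) / cell_size).astype(int)
    rows = ((y.max() - y) / cell_size).astype(int)   # row 0 at the northern edge
    ndsm = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.float32)
    # Keep the highest return per cell, mimicking a first-surface DSM.
    np.maximum.at(ndsm, (rows, cols), z)
    return ndsm
```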

Fig. 1. A segment of a 0.5m nDSM raster with a cathedral in the middle. Note how the pixel values range from zero at the ground to ~58.3m at the top of the spire. From the Zurich test set.

Floating-point rasters like those shown above already contain enough semantics for crude 2.5D extrusions (Fig.2), but can we take it further from the quality standpoint with the help of artificial intelligence?

Fig. 2. Procedural extrusion of the cathedral generated from the nDSM segment shown in Fig. 1, resulting in a mesh of two vertices/m for a total of 4,308 vertices and 8,290 faces. Note the pixelized boundary and an excessive number of vertices. Simple reduction of vertices/m leads to quick degradation of extruded meshes. Test set.

Output: LOD2 Building Models

Level of Detail (LOD) — can we, given the nDSM rasters above, automatically and efficiently produce LOD2 building models [1]?

Fig. 3. Levels of Detail (Biljecki, Ledoux & Stoter, 2016)

Or maybe push the quality of extracted building models even further, say LOD2.1–2.3? [2]:

Fig. 4. LOD2 Family (Biljecki, Ledoux & Stoter, 2016)

LOD2.0 is a coarse model with standard roof structures, and potentially including large building parts such as garages (larger than 4 m and 10 m²).

LOD2.1 is similar to LOD2.0 with the difference that it requires smaller building parts and extensions such as alcoves, large wall indentations, and external flues (larger than 2 m and 2 m² ) to be acquired as well. In comparison to the coarser counterpart, modelling such features in this LOD could benefit use cases such as estimation of the energy demand because the wall area is mapped more accurately.

LOD2.2 follows the requirements of LOD2.0 and LOD2.1, with the addition of roof superstructures (larger than 2 m and 2 m² ) to be acquired. This applies mostly to dormers, but also to other significantly sized roof structures such as very large chimney constructions. Because the roof is mapped in more detail, this LOD can be advantageous for the estimation of the insolation of roofs.

LOD2.3 requires explicitly modelled roof overhangs if they are longer than 0.2 m, therefore the roof edge and the footprints are always at their actual location, which has advantages for use cases that require the volume of the building.
[2],[10]

Prominent Neural Network Architectures

So, what could be the launching pad to explore potential solutions here? About a year ago we started by evaluating recent publications and associated implementations, the most prominent of which were Differentiable Rendering[3], Occupancy Networks[4], and Mesh R-CNN[5].

Differentiable Rendering-based approaches, actively popularized by both the NVIDIA and Facebook AI teams, use 3D rendering as part of the training loop. A loss is calculated based on the similarity between the ground-truth image and a rendering of the mesh, which is morphed, textured, and lit by the neural network on every iteration. While novel and conceivably a weakly supervised approach, training setups like these require significant compute and rendering resources.

Occupancy Networks are expected to consume less compute at training time, but returning results once the network is trained is more computationally demanding: in inference mode, the model returns occupancy probabilities for an input point cloud, making the recovery of an accurate object boundary an iterative process.

Mesh R-CNN, on the other hand, has a traditional convolutional backbone (ResNet50[6]+FPN[7]) with three to four additional heads (for our story, the "Mesh" head, built with graph convolutions, is of special interest). Therefore, the compute expectations at training time are relatively low compared to Differentiable Rendering-based solutions — great for running quick-turnaround experiments. At inference time, Mesh R-CNN is expected to be computationally closer to traditional convolutional architectures, i.e. faster than Occupancy Networks. Inference time was important to us as well, since we were looking for a practical solution which could potentially grow into a product or a service later.

Fig. 5. Detection, voxelization, voxel-to-mesh conversion, and mesh-refinement processing stages with Mesh R-CNN. https://github.com/facebookresearch/meshrcnn

Bld3d Neural Network

Given the considerations above, we decided to build our prototype on top of the PyTorch Mesh R-CNN implementation kindly shared by the Facebook AI team.

But there were a few important modifications that we needed to make in the original code and architecture before we could start training:

  1. The original Mesh R-CNN code was designed to work with images taken by a perspective camera, whereas our input rasters were orthorectified, i.e. the translations from world space to normalized device coordinates (NDC) to camera space and back had to be replaced with orthographic camera projections.
  2. Each mesh refinement stage uses the hyperbolic tangent as an activation function to predict vertex offsets, which sometimes causes numeric stability issues when predicting offsets for large objects like buildings, since Mesh R-CNN calculates losses in world space. To address this issue, we modified the loss calculation logic to work in NDC space, which is confined to the [-1, +1] cube.
  3. Mesh R-CNN uses a voxel head to predict a coarse-grained set of occupied voxels based on FPN outputs. Next, these voxels are converted into a triangular mesh, which then gets morphed (vertex connectivity stays unchanged, only vertex positions are modified) into a final shape by a chain of refinement stages of the mesh head.
    Since we were working with nDSM rasters as input, we replaced the trainable voxel head with procedural extrusion code, which produced initial meshes as shown in Fig. 2. This allowed us to speed up training experiments and drastically reduce memory requirements. (The voxel head uses cross-entropy to predict occupied voxels in 3D space, resulting in O(n³) memory usage.)
  4. We streamlined the code further by removing the mask head. The mask head of Mesh R-CNN plays the same role as in its ancestor Mask R-CNN[8] architecture and was designed to return a per-instance pixel occupancy mask — pixels occupied by an object in the given image.
    For our use case, we already had building footprints. These days there are multiple ways of getting accurate building polygons from raw data: pretrained deep-learning models, statistical methods working off raw point clouds, footprint-regularization methods, etc. Again, this led to a significant speedup in training times.
  5. Original Mesh R-CNN output is always a watertight mesh with an arbitrary height above the “ground”. For the purpose of our building extraction experiments, we modified the refinement stage code to keep initial ground vertices always at zero-ground allowing movements in the horizontal plane only. Also, to reduce the output size, our resulting meshes do not have “floor” faces.
  6. In our experiments with 0.5m nDSM rasters, we found that initial meshes of 2 vertices per meter and refinement stages with larger than original Mesh R-CNN settings work better:
    GRAPH_CONV_DIM = 128
    POOLER_RESOLUTION = 32
  7. In some experiments we used up to four mesh refinement stages instead of the original three, training them sequentially and gradually adding new stages once the validation loss and mesh quality stabilized. Compared to our two-stage models, four-stage Bld3d models showed somewhat better reconstruction results in geographic regions where architectural styles were similar to those of the training set, but demonstrated less stability in geographies where the style was significantly different.
  8. We modified the normal consistency loss to penalize deviations between neighboring face-normals only when they fall below a certain threshold (15 grads), which reduced the noise on reconstructed planar surfaces; a minimal sketch of this idea follows the list.
  9. We used MSE of relative ground perimeter lengths as an additional component of the compound weighted loss.
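To illustrate modification 8, here is a minimal PyTorch sketch of a thresholded normal-consistency term. It is not the actual Bld3d code: it assumes unit face normals and a precomputed (E, 2) tensor of face pairs sharing an edge, and the default cutoff of 13.5° (≈15 grads) is our reading of the threshold above.

```python
import torch

def thresholded_normal_consistency(face_normals, adjacent_faces, thresh_deg=13.5):
    """Penalize only small deviations between neighboring face normals.

    face_normals: (F, 3) unit normals of mesh faces.
    adjacent_faces: (E, 2) long tensor of face indices sharing an edge
        (assumed precomputed elsewhere).
    Pairs whose angle exceeds `thresh_deg` are treated as intentional
    edges/ridges and excluded, so only noise on planar surfaces is punished.
    """
    n0 = face_normals[adjacent_faces[:, 0]]
    n1 = face_normals[adjacent_faces[:, 1]]
    cos = (n0 * n1).sum(dim=1).clamp(-1.0, 1.0)
    angle = torch.acos(cos)
    mask = angle < torch.deg2rad(torch.tensor(thresh_deg, device=cos.device))
    if not mask.any():
        return cos.new_zeros(())
    return (1.0 - cos[mask]).mean()
```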

Training Data

The second most (or maybe even the most) important aspect of Deep Learning is the data that will be used to train the neural network.

For our experiments, we decided to tap into the Zurich Open Data portal, which offers multiple high-quality datasets collected around the same time, featuring Digital Terrain Model (DTM) and DSM rasters (0.5m resolution), visible-spectrum orthoimagery (0.1m resolution), and, most importantly, about 48,000 manually crafted LOD2.3 building models. We selected datasets from 2015.

Fig. 6. Manually created LOD2.3 (i.e., it has features of 2m+ in size and roof overhangs) model of the cathedral on top of the matching 0.5m nDSM raster from the Open Data Zürich portal. 1,199 vertices x 567 faces.

Preparing the Data

To create the training data, we first created an nDSM raster by subtracting the DTM from the DSM and zeroing all pixels with a height value below 2.0m.
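A minimal sketch of this step, assuming the DSM and DTM rasters are already aligned to the same grid; the file names are hypothetical, and rasterio/NumPy stand in for whatever raster tooling is at hand.

```python
import numpy as np
import rasterio

with rasterio.open("dsm_2015.tif") as dsm_src, rasterio.open("dtm_2015.tif") as dtm_src:
    dsm = dsm_src.read(1)
    dtm = dtm_src.read(1)
    profile = dsm_src.profile

ndsm = (dsm - dtm).astype(np.float32)
ndsm[ndsm < 2.0] = 0.0                      # zero out everything below the 2 m threshold

profile.update(dtype=rasterio.float32, count=1)
with rasterio.open("ndsm_2015.tif", "w", **profile) as dst:
    dst.write(ndsm, 1)
```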

Next, using a simple Python wrapper on top of Esri's ArcGIS Pro Export Training Data for Deep Learning tool, we exported georeferenced nDSM chips, each containing an individual building in the center with all pixels beyond the current building's boundaries zeroed out. In addition to the nDSM chips, mask rasters of the same resolution were exported: each mask chip is a Boolean raster with a value of 0 representing ground and 1 representing the building. The original building mask geometries were calculated with the help of the Multipatch Footprint tool, which effectively computes building footprint polygons from the original 3D building shells.

The exported Boolean masks helped speed up data augmentation at training time and allowed for better-quality extrusions of the initial meshes, specifically in cases where a building's nDSM pixels fell below the 2m threshold due to data quality issues, e.g., excessively large DTM raster values.

With the help of CityEngine, we exported individual buildings as OBJ files along with the metadata needed to link the exported OBJs with their nDSM image-chip counterparts.

The Training/Test split was done manually by selecting representative architectural styles and environments (urban and rural), ending up with 1,031 buildings in the Test set and 45,057 in the Training set. About 2,000 buildings were removed from the original pool because either the underlying nDSM chip had an average height below 2m or the building footprint area was less than 30m².

Training Bld3d Neural Network

Hardware
We conducted our experiments on multi-GPU nodes, including a p3dn.24xlarge instance for up to 1,000 epochs in some of the runs.

Loss
The majority of the hyperparameter search focused on the weights of the compound loss of the mesh head. We found that a weighted sum of Chamfer, Edge, Normal, modified Normal Consistency, and MSE Perimeter losses produces the highest-quality meshes.
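As an illustration, the sketch below assembles such a weighted sum from stock PyTorch3D losses. The modified Normal Consistency and MSE Perimeter terms are Bld3d-specific and omitted here; the weight dictionary and function names are placeholders.

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_edge_loss, mesh_normal_consistency
from pytorch3d.ops import sample_points_from_meshes

def compound_mesh_loss(pred_meshes, gt_meshes, w):
    """Weighted sum of mesh losses, sketched with stock PyTorch3D terms.

    w: dict of weights, e.g. {"chamfer": 16.0, "normal": 0.1,
       "edge": 0.3, "normal_consistency": 0.1}.
    """
    pred_pts, pred_nrm = sample_points_from_meshes(pred_meshes, 50_000, return_normals=True)
    gt_pts, gt_nrm = sample_points_from_meshes(gt_meshes, 50_000, return_normals=True)
    # Chamfer distance also yields a normal term when per-point normals are given.
    cham, nrm = chamfer_distance(pred_pts, gt_pts, x_normals=pred_nrm, y_normals=gt_nrm)
    edge = mesh_edge_loss(pred_meshes)
    ncons = mesh_normal_consistency(pred_meshes)
    return (w["chamfer"] * cham + w["normal"] * nrm
            + w["edge"] * edge + w["normal_consistency"] * ncons)
```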

Augmentation
We extended the initial augmentation of Mesh R-CNN to include random rotations around the vertical axis, which significantly improved convergence and the F1-Score on the Test set.
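A minimal sketch of this augmentation, assuming a single-band chip as a NumPy array and ground-truth vertices centered on the chip center (the real augmentation runs on tensors inside the data loader; names here are illustrative, and the mesh-rotation sign depends on the raster's row order).

```python
import numpy as np
from scipy.ndimage import rotate

def random_z_rotation(ndsm_chip, vertices, rng=None):
    """Rotate an nDSM chip and its GT mesh by the same random angle about z."""
    if rng is None:
        rng = np.random.default_rng()
    angle_deg = rng.uniform(0.0, 360.0)
    # Rotate the raster around its center: keep the chip size (reshape=False),
    # resample bilinearly (order=1), fill exposed corners with ground (cval=0).
    chip = rotate(ndsm_chip, angle_deg, reshape=False, order=1, cval=0.0)
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    # The sign of the mesh rotation may need flipping depending on whether
    # raster rows increase northward or southward.
    return chip, vertices @ rot.T
```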

Flex-batching
Input nDSM chips, given the fixed 0.5m resolution, varied in size from ~12 to almost 1,000 pixels along the longest side, depending on the size of the building. Therefore, relying on Detectron2's ImageList container, which pads all input images to the dimensions of the largest one in the minibatch, would significantly reduce the average number of images going into a forward pass.

To address this issue and make GPU utilization more efficient, we dynamically varied the minibatch size. Specifically, in collate_fn(), we controlled the payload size by keeping track of the largest image dimensions in the batch and the total size of the tensors of the initial meshes. As soon as we reached an empirically calculated memory limit, the assembly of the minibatch would terminate and the result would be passed into the training loop.
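A condensed sketch of the flex-batching idea (not the production collate_fn): samples are accumulated greedily until a padded-pixel budget is exceeded. The real implementation also tracks the total size of the initial-mesh tensors, which is omitted here; the "image"/"mesh" keys and the budget value are assumptions.

```python
import torch

def flex_collate(samples, max_pixels=2**22):
    """Grow the minibatch until the padded payload would exceed the budget."""
    batch, max_h, max_w = [], 0, 0
    for s in samples:
        h, w = s["image"].shape[-2:]
        new_h, new_w = max(max_h, h), max(max_w, w)
        # All images get padded to the running maximum, so the payload is
        # (batch size) x (largest H) x (largest W).
        if batch and (len(batch) + 1) * new_h * new_w > max_pixels:
            break
        batch.append(s)
        max_h, max_w = new_h, new_w
    padded = torch.zeros(len(batch), 1, max_h, max_w)
    for i, s in enumerate(batch):
        h, w = s["image"].shape[-2:]
        padded[i, :, :h, :w] = s["image"]
    return padded, [s["mesh"] for s in batch]
```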

Mesh Quality Metrics and Tensorboard
We added Tensorboard hooks to keep track of the metrics below:
At training time for each refinement stage:

  • Original (unweighted) Chamfer, Edge, Normal, Normal Consistency, MSE Perimeter losses.
  • Total loss and its moving average.
  • Some manually selected buildings from the Training set, representative of typical architectural styles and zoning, were captured inside the training loop and logged for interactive visualization in Tensorboard.
Fig. 7. Tensorboard visualization of a raw reconstruction captured inside the training loop of Bld3d neural network. A setup of eight spotlights highlights faces with different colors depending on face-normals. Training set.

At validation time we kept track of losses of the final refinement stage as well as additional quality metrics:

  • Absolute and relative normal consistencies.
  • Precision, Recall, and F1-Scores for point clouds sampled (50,000 points) from the GT and predicted meshes at various thresholds: 0.1m, 0.25m, 0.5m (a minimal sketch of this metric follows the list).
  • Test meshes from the batch with the worst F1-Score: both the metrics and an interactive visualization as in Fig. 7.
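The point-sampled precision/recall/F1 metrics above can be computed along the following lines; this is a brute-force sketch for clarity (for 50,000-point samples a kNN routine such as pytorch3d.ops.knn_points is a better fit than a full distance matrix).

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau=0.25):
    """Precision, recall, and F1 between two sampled point clouds at threshold tau (meters).

    pred_pts: (P, 3) points sampled from the predicted mesh.
    gt_pts:   (G, 3) points sampled from the ground-truth mesh.
    """
    d = torch.cdist(pred_pts, gt_pts)              # (P, G) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()
    recall = (d.min(dim=0).values < tau).float().mean()
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision.item(), recall.item(), f1.item()
```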

Results

Zurich test sites

We trained a 4-stage Bld3d model to the following metrics on the Zurich test set:

  • Chamfer-L2: 1.28m
  • Absolute normal consistency: 0.88
  • Normal consistency: 0.80
  • F1@0.25m / 25,000 sampled points: 0.49
  • F1@0.5m / 25,000 sampled points: 0.75
Fig. 8. Raw extrusions from a 4-stage Bld3d model. A historical quarter of Zurich with a variety of architectural styles and moderate vegetation. Test set.

You can see a live web scene of the test reconstructions shown above at ArcGIS — Bld3d reconstructions. Zurich Test site #4.

LOD2.?
In our experiments we have observed that, depending on the weights of the compound loss, resulting meshes vary significantly in quality, extracted features, and spatial resolution.

Specifically, with chamfer_w=16.0, normal_w=0.1, edge_w=0.3, and normal_consistency_w=0.1, we have seen superstructure features being acquired by Bld3d models even from just a few input pixels. Below is an example of a smaller spire on the same cathedral captured from only ~1.5–2.0m² of input nDSM pixels — an LOD2.2-level capability:

Fig. 9. A raw Bld3d reconstruction. LOD2.2-level spatial resolution of extracted features: a spire represented by nDSM adjacent pixels of <2m². Test set.

We expect that by training the Bld3d neural network on nDSM rasters of a higher resolution, we will see a proportional increase in resolution of the reconstructed building models.

The weight of the Chamfer loss heavily affects another quite curious ability of Bld3d models — the ability to replicate eaves, or roof overhangs. By doubling the Chamfer weight and training for a few more epochs, we observed an LOD2.3 feature in the outputs:

Fig. 10. LOD2.3: roof overhangs generated by a trained Bld3d model. Test set.

This is particularly intriguing, since the semantics of roof overhangs are not explicitly present in the nDSM rasters (nDSMs contain 2.5D information only), while roof overhangs are a true 3D feature. The ability of a trained Bld3d neural network to replicate these geometries comes from a learnt bias, effectively letting the model guesstimate whether the building represented by a given nDSM chip might or might not have roof overhangs.

Nonetheless, consistently generating roof overhangs was hard to achieve in our experiments. The likely reason is a combination of relatively low-resolution, noisy nDSM rasters and the overall low number of training examples.

Buildings reconstructed with roof overhangs often have self-overlaps in their mesh topology, which makes them harder to regularize and texture later. Nevertheless, roof overhangs are an LOD2.3 feature.

Denoising
One of the biggest problems in reconstructing buildings from rasters is noise caused by overhanging tree canopies that block a significant portion of the roof from the view of a lidar sensor or camera:

Fig. 11. A building in the center with multiple trees blocking portions of the roof from the view of the lidar sensor. 0.5m nDSM raster. Test set.

But what we have seen in Fig.10 above, with learnt bias enabling a Bld3d model to reason about roof overhangs, works here as well:

Fig. 12. Left: raw extrusion from the 0.5m nDSM raster given building footprint (same raster as in Fig. 11). Note the large tree blocking one corner of the roof, and smaller size trees partially obscuring the opposite wall and a lower roof. Right: Corresponding Bld3d reconstruction. Test set.

Transferability
As we said before, well-trained Bld3d models show an F1-Score@0.5m of ~0.75 on the Zurich Test set.

But would these models work in other geographic regions, with unfamiliar architectural styles, and nDSMs produced from point clouds collected by lidar sensors of different specs?

We conducted inference experiments in two different regions, Turkey and Miami-Dade County, U.S., which showed promising results in both general extraction quality and denoising. Roof overhangs, on the other hand, were reproduced less reliably, sometimes leading to unrealistic-looking building shapes.

Fig. 13. Raw 4-stage Bld3d reconstructions. Model trained on Zurich data. 0.5m nDSM test site, Turkey.
Fig. 14. Raw 4-stage Bld3d reconstructions. Model trained on Zurich data. 0.5m nDSM test site, Miami-Dade county, U.S.
Fig. 15. Left: A raw 4-stage Bld3d reconstruction. Right: A raw 2-stage and lower Chamfer-weight Bld3d reconstruction. Model trained on Zurich data. 0.5m nDSM test site, Miami-Dade county, U.S.

Alternatively, the convolutional nature of the Bld3d architecture allows for efficient transfer learning by freezing some portion of the graph and retraining the rest.

Similarly, as our experiments showed, adding new mesh refinement stages to a pretrained Bld3d model and finetuning only those produces practical improvements in just a few epochs of training.

Inferencing

We observed an average Bld3d inference throughput of about 120 buildings per second (with 2 vertices per meter of initial mesh density) when running on an NVIDIA P100 GPU with 16 GB of vRAM. This is pure GPU forward-pass time, without accounting for input data preparation and the export of the resulting meshes as OBJs.

To improve the data I/O we relied on the Export Training Data for Deep Learning (ETDFDL) geoprocessing tool and its RCNN_Masks output option, which conveniently writes out nDSM tiles and building instance masks.

To process large geographic areas, we ran the ETDFDL tool with a 200m overlap and used an approach similar to Non-Maximum Suppression to pick the best-represented building footprints on the tile borders (we developed a GPU-optimized version, which counts per-pixel overlaps instead of bounding-box overlaps).
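An illustrative CPU sketch of the per-pixel variant of this suppression step (the version we used is GPU-optimized and works on georeferenced instance masks; the names and the overlap threshold below are assumptions).

```python
import numpy as np

def suppress_duplicate_footprints(masks, min_overlap=0.5):
    """Keep the best-represented instance among overlapping building masks.

    masks: list of boolean arrays resampled onto a common georeferenced grid.
    Duplicates are detected by counting overlapping pixels rather than
    comparing bounding boxes; larger footprints win.
    """
    order = np.argsort([-m.sum() for m in masks])    # largest footprint first
    keep, taken = [], np.zeros_like(masks[0], dtype=bool)
    for i in order:
        m = masks[i]
        overlap = np.logical_and(m, taken).sum() / max(m.sum(), 1)
        if overlap < min_overlap:                    # not yet claimed by a kept instance
            keep.append(i)
            taken |= m
    return keep
```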

Moreover, thanks to the ETDFDL tool, all the tiles for inference come with georeferencing metadata, which makes the translation of the reconstructed OBJs into geographic space trivial.

Once the OBJs are processed, we used ArcGIS Pro's Import 3D Files geoprocessing tool to place them on a geographic map for collective visualization, analysis, publishing, etc.

3D Regularization

Raw reconstructions produced by Bld3d models have high vertex counts. When working with massive geographic areas, it makes sense to reduce the number of vertices, even if it comes at the expense of some spatial resolution. Level-of-detail pyramids built this way are useful for faster drawing and lower client/server traffic.

Here, we used a two-pass approach: first fixing and leveling off most self-overlaps in the raw output meshes, making them more conformant to two-dimensional manifolds, and then, as the second pass, applying Variational Shape Approximation [9] to further reduce the vertex/face count.
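As a rough stand-in for the second pass only, the snippet below reduces vertex/face counts with Open3D's quadric decimation; this is not Variational Shape Approximation, and the file paths and target ratio are hypothetical.

```python
import open3d as o3d

# Clean up the raw mesh a little before decimating.
mesh = o3d.io.read_triangle_mesh("building_raw.obj")
mesh = mesh.remove_duplicated_vertices().remove_degenerate_triangles()

# Decimate to roughly 10% of the original face count (illustrative ratio).
target = max(len(mesh.triangles) // 10, 4)
simplified = mesh.simplify_quadric_decimation(target_number_of_triangles=target)
o3d.io.write_triangle_mesh("building_simplified.obj", simplified)
```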

Fig. 16. Left: Raw Bld3d reconstructions — 272,487 vertices x 512,428 faces. Right: After 2-step 3D regularization, coarse settings — 24,622 vertices x 42,430 faces. Test set.

Continue reading: Part 2. Adding RGB Orthophotos

References

[1] Gröger, G. & Plümer, L., 2012. CityGML — Interoperable semantic 3D city models. ISPRS Journal of Photogrammetry and Remote Sensing. Vol. 71. pp 12–33. DOI: 10.1016/j.isprsjprs.2012.04.004. ISSN: 0924–2716.

[2] Biljecki, F., Ledoux, H., Stoter, J. (2016): An improved LOD specification for 3D building models. Computers, Environment, and Urban Systems, vol. 59, pp. 25–37.

[3] Chen, W., Gao, J., Ling, H., Smith, E.J., Lehtinen, J., Jacobson, A., Fidler, S. (2019) Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer. arXiv:1908.01210v2

[4] Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A. (2019): Occupancy Networks: Learning 3D Reconstruction in Function Space. arXiv:1812.03828v2

[5] Gkioxari, G., Malik, J., Johnson, J. (2020): Mesh R-CNN. arXiv:1906.02739v2

[6] He K., Zhang X., Ren S., Sun J. (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385

[7] Lin T., Dollár P., Girshick R., He K., Hariharan B., Belongie S. (2017) Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2

[8] He K., Gkioxari G., Dollár P., Girshick R. (2018) Mask R-CNN. arXiv:1703.06870v3

[9] Cohen-Steiner, D., Alliez, P., Desbrun, M. (2004) Variational Shape Approximation. ACM Transactions on Graphics, Vol. 23, Issue 3, August 2004 pp 905–914

[10] Biljecki, F., Heuvelink, G.B.M., Ledoux, H., Stoter, J., 2015a. Propagation of positional error in 3D GIS: estimation of the solar irradiation of building roofs. International Journal of Geographical Information Science 29, 2269–2294
