Buildings reconstructed by a DRGB Bld3d neural network model from a combination of 0.1m-resolution elevation rasters (D) + orthophotos (RGB) and building footprints. White buildings: geometry after 3D regularization; textured buildings: roof textures from orthophotos, facades were textured by Pix2PixHD neural network; cyan buildings: a selection demonstrating per-footprint segmentation resulting in feature geometries suitable for further spatial analysis. R&D by Esri, source drone imagery courtesy of Wingtra AG.

3D Buildings from Imagery with AI. Part 2: Adding Orthophotos.


by Dmitry Kudinov, Camille Lechot, David Yu, Hakeem Frank, Oktay Eker, Yasin Koçan

Part 1 of this series looked into extracting 3D building models from 2.5D elevation rasters (nDSM). If you have not read it yet and are relatively new to the subject, we recommend reading it first to better understand the data types, neural network architecture, terminology, and relevant publications referred to herein.

Introduction

In the first section, “A. Our RGB-Only Experiments”, we focus on reconstructing buildings from visible-spectrum RGB orthophotos. Then, in the second section, “B. Mixing Orthophotos and Elevation Rasters”, we cover in detail a set of experiments that use a combination of orthophotos and nDSM elevation rasters as a single blended input for the neural network to perform the reconstructions.

Orthophotos are readily available to businesses and local governments because they are collected with much more affordable equipment (cameras) than elevation rasters, which primarily require active lidar sensors. Also, regions covered by RGB orthophotos are often flown multiple times over the span of a few months, growing the size of the data collection usable for training a neural network.

We uncover insights into intriguing questions: can 3D buildings and complex 3D features in general be extracted from RGB orthophotos alone? And, if so, how do the results compare to extractions made from a combination of visible-spectrum RGB and elevation rasters?

Data

We experiment again with the datasets from the Zurich Open Data portal, which conveniently offers multiple RGB and elevation rasters of the entire Canton of Zurich, captured at different, yet reasonably close, moments in time:

Training & Validation Rasters

  1. RGB 0.1m per-pixel resolution from Summer 2014
  2. RGB 0.1m from Spring 2015/2016
  3. RGB 0.1m from Summer 2018
  4. Normalized DSM raster 0.5m from 2014
    Normalization was done in ArcGIS Pro by subtracting the DTM raster from the DSM raster, which brings ground pixel values close to 0.0 (a short sketch follows this list):
    - DSM raster
    - DTM raster
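
As a quick illustration of that normalization step, here is a minimal arcpy sketch (file names are hypothetical) that subtracts the DTM from the DSM and saves the resulting nDSM:

import arcpy
from arcpy.sa import Raster

arcpy.CheckOutExtension("Spatial")
# nDSM = DSM - DTM, so ground pixels end up close to 0.0 (paths are placeholders)
ndsm = Raster("DSM_2014.tif") - Raster("DTM_2014.tif")
ndsm.save("nDSM_2014.tif")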

We ran validation metrics on a manually picked set of around 1,000 buildings representative of different urban and rural environments and zoning: commercial, historic, and residential.

Test Data

  1. RGB 0.1m from Summer 2020
  2. Normalized DSM raster 0.5m from 2017/2018 (for the experiments combining orthophotos and nDSM)
    - DSM raster
    - DTM raster
  3. In the extended tests (covered at the end of the post) we used rasters produced from photogrammetric meshes built with the SURE for ArcGIS meshing engine from oriented imagery kindly collected and shared with us by Wingtra AG (captured with a WingtraOne GenII light drone equipped with a 24MP Sony a6100 camera, processed with SURE by nFrames/Esri) and Hexagon (captured with a Leica CityMapper-2, processed with SURE by nFrames/Esri).

Building Models

For training, validation, and testing we used the high-quality LOD2.3 building models, which are also available for download in various formats: LOD2 Zurich Buildings from 2015. With the help of CityEngine we exported the buildings as individual OBJ files.

For 2D building footprints, as in previous experiments, we used the Multipatch Footprint geoprocessing tool to convert the LOD2 models into footprint polygons.
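
For reference, a minimal arcpy call of that tool might look like the following sketch (layer names are hypothetical, and we assume the 3D Analyst extension is licensed):

import arcpy

arcpy.CheckOutExtension("3D")
# derive 2D footprint polygons from the LOD2 multipatch features (names are placeholders)
arcpy.ddd.MultiPatchFootprint("zurich_lod2_buildings", "zurich_building_footprints")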

Related Work

So, can a neural network accurately reconstruct a 3D object out of a pure 2D image?

The short answer: Yes, to some degree.

The longer answer is more complicated, because there are multiple sources of semantics hidden in the input rasters, and these semantics provide the clues a neural network needs to guesstimate the third dimension of the detected objects.

For more than a decade, neural networks have relied on learned bias to perform depth estimation from a single-view image. One of the latest examples is PLADE-Net [1] from CVPR 2021, which “shows remarkable accuracy levels, exceeding 95% in terms of the δ¹ metric on the challenging KITTI dataset”.

Fig. 1. Left: input monocular RGB image; Right: Output per-pixel depth map by PLADE-Net [1].

Meanwhile, a depth map does not yet represent a true 3D conversion; in fact, it is very similar to an elevation raster, just oriented along the camera's viewing axis rather than the vertical one, i.e. it contains only 2.5D information.

Nevertheless, the 2.5D transformation is already extremely useful, and in Part 1 of the series we showed that a neural network can push bias-based inferencing even further and produce true 3D geometric features, like roof overhangs, from 2.5D elevation rasters alone (Fig. 2).

Fig. 2. nDSM version of Bld3d neural network reconstructing true 3D features like roof overhangs from 2.5D elevation rasters. See Part 1 of the series for more details.

A true jump from 2D RGB space into 3D geometries was demonstrated only recently by works like Mesh R-CNN [2] (Fig. 3), Occupancy Networks [3], etc.

Fig. 3. Mesh R-CNN [2] performing object detection, instance segmentation, voxel estimation, voxel-to-mesh transformation, and, finally, mesh refinement as a single end-to-end differentiable architecture.

Multiple flavors of Mesh R-CNN were trained by Gkioxari et al. on semi-synthetic datasets like ShapeNet [4] and its RGB renderings [5], which use a fixed perspective transformation and resolution, and on Pix3D [6], which contains real-world images but just 395 unique 3D ground-truth meshes.

A. Our RGB-Only Experiments

In one of our recent experiments with elevation rasters as input, we used the Facebook Research team’s Mesh R-CNN implementation as the foundation of our Bld3d neural network for accurate 3D feature extraction. With 2.5D semantics explicitly available in the nDSM inputs, the Bld3d architecture proved its ability to jump into 3D feature space and to resist significant levels of noise.

But the question this time: Can a Mesh-RCNN-style architecture jump from pure 2D RGB pixel space into 3D vector space relying on real-world GIS data as input? In contrast to ShapeNet- and Pix3D-based inputs, our dataset was significantly different: the 2D raster representations of Zurich buildings were not fixed in size and contained fewer images per model — initially, only one image chip per building. Moreover, the Zurich LOD2 models had a wide variety of shapes and scales — 45,000+ unique buildings from small sheds to industrial-sized warehouses and transportation hubs with areas in the thousands of square meters.

Fig. 4. A portion of the training set. Left: the input 2015 RGB raster 0.1m resolution and (blue) individual 2D building footprint polygons; Right: LOD2 manually crafted buildings corresponding to these inputs and used as the ground truth for loss calculation during the training of Bld3d neural network.

Training with Single RGB Raster

We started with a single RGB raster (RGB 0.1m from Spring 2015/2016) and 2D building footprints as input to train Bld3d models with up to 4 mesh refinement stages. The building footprint polygons were rasterized at 0.1m resolution into boolean masks and added as the 0-th band, forming the M(ask)RGB input tensors. The mask band was later normalized with a mean of 0.5 and a standard deviation of 0.5.
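
As an illustration of the MRGB assembly (not our exact production code), here is a minimal sketch that rasterizes a footprint polygon with rasterio and stacks it with an already-normalized RGB chip; the helper name and the use of rasterio are assumptions:

import numpy as np
import torch
from rasterio import features  # any polygon rasterizer would do

def make_mrgb_tensor(footprint_geom, rgb_chip, transform):
    # rgb_chip: (3, H, W) float32, already normalized per band
    _, h, w = rgb_chip.shape
    # burn the single footprint polygon into a 0/1 mask on the chip's 0.1m grid
    mask = features.rasterize([footprint_geom], out_shape=(h, w),
                              transform=transform, fill=0, default_value=1)
    mask = (mask.astype(np.float32) - 0.5) / 0.5   # mask band normalized with mean 0.5, std 0.5
    return torch.from_numpy(np.concatenate([mask[None], rgb_chip], axis=0))  # (4, H, W) MRGB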

We refer to models trained on the MRGB rasters as “MRGB” Bld3d models and models trained on a combination of nDSM and RGB rasters as “DRGB” Bld3d models.

The loss function was a weighted sum of Chamfer, Normal, Normal Consistency and Edge losses comparing the Bld3d-produced meshes and ground-truth LOD2 Zurich buildings:

CHAMFER_LOSS_WEIGHT = 16.
NORMALS_LOSS_WEIGHT = 0.2
EDGE_LOSS_WEIGHT = 0.8
NORMAL_CONSISTENCY_WEIGHT = 0.4
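
A minimal PyTorch3D sketch of such a weighted compound loss (our actual training code differs in details such as per-stage weighting and point sampling) could look like this, with the weights above as defaults:

from pytorch3d.loss import chamfer_distance, mesh_edge_loss, mesh_normal_consistency
from pytorch3d.ops import sample_points_from_meshes

def compound_loss(meshes_pred, meshes_gt, num_samples=25000,
                  w_chamfer=16.0, w_normals=0.2, w_edge=0.8, w_consistency=0.4):
    # sample point clouds (with normals) from the predicted and ground-truth meshes
    pts_p, nrm_p = sample_points_from_meshes(meshes_pred, num_samples, return_normals=True)
    pts_g, nrm_g = sample_points_from_meshes(meshes_gt, num_samples, return_normals=True)
    chamfer, normals = chamfer_distance(pts_p, pts_g, x_normals=nrm_p, y_normals=nrm_g)
    return (w_chamfer * chamfer + w_normals * normals
            + w_edge * mesh_edge_loss(meshes_pred)
            + w_consistency * mesh_normal_consistency(meshes_pred))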

Although we were training on NVIDIA 32GB V100 GPU cards, the high 0.1m input raster resolution and the large variation in building sizes intermittently caused spikes in loss values, sometimes leading to divergence, particularly at early stages of training. To stabilize the training, we calculated a moving average of the loss and skipped backpropagation steps for loss values larger than 10x (2.5x at later stages of training) the moving average.
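
A minimal sketch of that stabilization trick, assuming an exponential moving average (the exact averaging window and momentum we used are not reproduced here):

class LossSpikeFilter:
    # skip backprop when the current loss exceeds `factor` times the moving average
    def __init__(self, factor=10.0, momentum=0.98):
        self.factor, self.momentum, self.avg = factor, momentum, None
    def should_skip(self, loss_value):
        if self.avg is None:
            self.avg = loss_value
            return False
        skip = loss_value > self.factor * self.avg
        if not skip:  # only non-spiking steps update the moving average
            self.avg = self.momentum * self.avg + (1.0 - self.momentum) * loss_value
        return skip

spike_filter = LossSpikeFilter(factor=10.0)   # factor dropped to 2.5 at later stages
# inside the training loop:
#   if not spike_filter.should_skip(loss.item()):
#       loss.backward(); optimizer.step()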

In our previous experiments with elevation rasters, we decided not to use Mesh R-CNN’s voxel head: for the spatial resolution we were aiming at (the initial meshes were generated with a vertex density of 2 vertices per meter), the voxel head’s cross-entropy-driven architecture demanded too much GPU memory. Instead, we relied on extruding the initial mesh from nDSM chips limited by the building footprint. This time though, with RGB-only input, there was no roof-elevation data available anymore, only boolean footprint masks. Thus, the initial mesh was produced procedurally from the mask pixels alone, with boundary vertices placed on the ground at 0 and flat-roof vertices placed at the mean Z-height of the Zurich ground-truth buildings.

As before, the mesh refinement head had to learn to modify the positions of the vertices of the initial mesh given the FPN’s [7] P2 feature maps (on top of a ResNet50 [8] backbone), but since the initial mesh was now always a flat-roof extrusion, it took considerably longer to train the network than in the previous experiments working off nDSM rasters. Nevertheless, the mesh refinement stages, with a larger number of trainable parameters than in the original Mesh R-CNN design, did the job:

NUM_STAGES: 4
GRAPH_CONV_DIM: 128
POOLER_RESOLUTION: 32

The training data was organized such that each square MRGB input tile had an individual building in its center. Even when multiple buildings were visible in a tile, the mask (M) band always contained only the one building in its center (Fig. 6).

For data augmentation we used random rotations around the vertical axis and random horizontal flips. The random rotations explain the square-shaped tiles: to guarantee the availability of RGB bands in the corners after a random rotation, the side of the building footprint’s least bounding square was multiplied by SQRT(2), resulting in the larger tiles we exported for the MRGB (and, later, DRGB) experiments.
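
A small sketch of that tile-sizing rule (coordinates in meters; the helper name is ours):

import math

def export_tile_extent(xmin, ymin, xmax, ymax):
    # square tile centered on the footprint, enlarged by sqrt(2) so RGB pixels
    # remain available in the corners after any random rotation
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    half = max(xmax - xmin, ymax - ymin) / 2.0 * math.sqrt(2.0)
    return cx - half, cy - half, cx + half, cy + half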

The larger tiles may also have played a curious role by somewhat compensating for imperfections in the RGB rasters’ orthorectification.

Fig. 5. Left: Zurich RGB rasters have defects in their orthorectification: you can see some vertical walls and the “spills” of the building pixels beyond the footprint polygon (red). Right: Correctly orthorectified drone imagery.

As you can see in the left half of Fig. 5, the footprint polygons coming from the Zurich LOD2 building models do not match the imagery exactly; a good example here is the cathedral’s towers, which “spill” way beyond the red footprint boundary. In an ideal world, the images obtained by an airborne camera undergo orthorectification, after which the joint raster represents a nadir point of view at every pixel; this would result in an exact match between the polygonal footprint and the RGB band content (Fig. 5, right). Alas, this is not always possible due to errors in any of the orthorectification inputs. You can learn more about orthorectification here.

Under such circumstances, allowing extra RGB pixels beyond the footprint mask to be passed to the neural network led the training to converge faster and to higher F1-score values compared to an ablation experiment in which we zeroed out the RGB bands beyond the masks.

Fig. 6. MRGB Bld3d model: squared and buffered MRGB input raster 0.1m resolution (left), and reconstructed building shape (right) from the Zurich test set.

After training for ~540K iterations on 8xV100s (one iteration is a single flexible size mini-batch per GPU, see Part 1 on flexible batching), the validation losses and quality metrics stabilized. At this point we used the 2020 RGB raster to run the metrics on the test inputs, using global histogram matching for better performance on the unseen RGB raster.

  • Chamfer-L2: 5.86m
  • Absolute normal consistency: 0.85
  • Normal consistency: 0.81
  • F1@0.25m / 25,000 sampled points: 0.45
  • F1@0.5m / 25,000 sampled points: 0.66
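
The global histogram matching mentioned above can be approximated with scikit-image (assuming band-last arrays and a recent scikit-image version with channel_axis support); our pipeline details may differ:

from skimage.exposure import match_histograms

def match_to_training(test_rgb, train_rgb):
    # test_rgb, train_rgb: (H, W, 3) arrays; match each band of the unseen
    # test raster to the corresponding band of a training raster
    return match_histograms(test_rgb, train_rgb, channel_axis=-1)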

Shadows

Looking at the MRGB inputs, one may ask a reasonable question: is there something else besides learned bias that helps a trained neural network predict geometry in the third dimension?

After more closely examining various predictions from the test set, it became clear that there is one more strong source of semantics at work: the shadows. In particular, the effect of shadows could be vividly demonstrated by specific reconstruction defects in commercial zones where nearby high-rises cast shadows on buildings below (Fig. 7).

The role of shadows was also indirectly confirmed by the higher F1-score achieved by the model trained on full RGB bands versus the one trained on RGB bands zeroed out beyond the footprint masks.

Fig. 7. Effect of a nearby high-rise casting a shadow on a lower neighboring building. Left: RGB bands of the MRGB input; right: reconstruction with the defects aligning with the high-rise’s shadow.

The shadows and orthorectification imperfections, however, are static and specific to a given RGB raster. Since we had access to multiple rasters captured at different times, we moved on from here to the next set of experiments.

Training with Multiple RGB Rasters

We used the 2014, 2015, and 2018 RGB rasters for training and validation, and the 2020 raster for testing (as before, applying global histogram matching to the 2020 RGB raster at inference time for best performance).

Fig. 8. Representations of the cathedral (shown before in Fig. 5) in various Zurich RGB rasters: 2014, 2015, 2018, 2020. Note the differences in shadows and orthorectification defects, variations in hue/saturation as well as seasons (2015 raster is from an early spring — the tree at the top of the 2015 tile has no leaves).

The training and validation dictionaries were rebuilt to include references to three image chips (one per raster) for every ground-truth building model. Our custom dataset implementation was redesigned to return flex-size batches of MRGB images in which the RGB bands were randomly selected from the three available options.
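
A stripped-down sketch of that dataset logic (the record layout and chip-loading helper are hypothetical; the real implementation also handles flex-size batching):

import random
import torch
from torch.utils.data import Dataset

class MultiYearMRGBDataset(Dataset):
    def __init__(self, records, load_chip):
        self.records = records      # each record: {"mask": path, "rgb_chips": [p2014, p2015, p2018]}
        self.load_chip = load_chip  # callable: path -> float32 tensor
    def __len__(self):
        return len(self.records)
    def __getitem__(self, i):
        rec = self.records[i]
        mask = self.load_chip(rec["mask"])                     # (1, H, W) boolean footprint mask
        rgb = self.load_chip(random.choice(rec["rgb_chips"]))  # randomly pick one of the three years
        return torch.cat([(mask - 0.5) / 0.5, rgb], dim=0)     # 4-band MRGB tensor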

As a result, we were able to improve the reconstruction quality metrics, particularly Chamfer-L2 and F1-Score:

  • Chamfer-L2: 4.80m (was 5.86m)
  • Absolute normal consistency: 0.85 (was 0.85)
  • Normal consistency: 0.81 (was 0.81)
  • F1@0.25m / 25,000 sampled points: 0.48 (was 0.45)
  • F1@0.5m / 25,000 sampled points: 0.69 (was 0.66)

How Practical are the MRGB Results?

But what does the Chamfer-L2 of 4.80m really mean? Are the extracted buildings suitable for any spatial analysis? Or could they even be useful as a basic visual backdrop for 3D scenes?

Fig. 9. Left: Buildings reconstructed by an MRGB Bld3d model on the 2020 RGB test raster. Right: Ground truth LOD2 buildings from the test subset.

As you can see in the Fig. 9, the heights of the extracted buildings often do not match the ground truth heights, although the slopes of gable and hip roofs are properly fit into the given building footprints.

With the MRGB Bld3d models trained on multiple RGB rasters, learned bias still plays the major role in guesstimating the heights, making the model behavior unstable when dealing with buildings a few times taller than the mean height of the Zurich ground-truth models (10.95m). The reconstruction difficulties with tall buildings are also exacerbated by orthorectification defects, which get worse the taller the building is.

Fig. 10. Left: Defective reconstruction of a tall building by an MRGB Bld3d model. Right: ground truth LOD2 tall building of ~64m height.

As a result, buildings produced this way are not suitable for precision spatial analysis like internal volume estimation or visibility evaluation.

Also, despite the promising performance on such simple and ubiquitous GIS input data, we were not able to train the MRGB Bld3d models to precision levels sufficient for consistent reconstruction of true 3D geometric detail like roof overhangs: the vast majority of the extracted features remained in 2.5D space. For these, we speculate, more trainable parameters in the mesh head would be needed to keep more detailed geometric representations compressed in the trained network’s “memory”.

Applying Textures

On the other hand, if building-height precision and roof detail are not a top priority, the extracted buildings can be used for quick visualizations or gaming projects, since the acquisition cost is much lower than that of reconstructions made with the help of elevation rasters. And, if textures are applied, the results look fairly appealing at certain levels of detail (LOD).

Fig. 11 shows a portion of the Zurich test set with roof textures clipped out of the RGB raster, while the façade textures were generated by a Pix2PixHD [9] deep neural network model. The building geometry was 3D-regularized with the same method we described in Part 1 of the series.

Fig. 11. Fully textured buildings reconstructed from a test area by an MRGB Bld3d model from RGB rasters and given building footprints.

Although they rely heavily on learned bias, MRGB Bld3d models can still be applied to other geographies for inferencing, but the reconstructed roof shapes are clearly of the Zurich architectural style rather than the local one. As an example, the test area from the city of Amasya, Turkey shows discrepancies in roof slopes, ridge allocation, and average building heights (Fig. 12).

Fig. 12. MRGB Bld3d reconstructions on a 0.1m resolution RGB raster from Amasya, Turkey after global histogram matching. Left: raw untextured reconstructions; Right: fully textured reconstructions with prior 3D regularization. Data courtesy of General Directorate of Land Registry and Cadastre of Turkey.

B. Mixing Orthophotos and Elevation Rasters

In Part 1 of the series, we described the experiments and reconstructions done by the nDSM version of the Bld3d neural network, which worked off 0.5m-resolution nDSM rasters alone. The trained nDSM Bld3d models set a worthy quality bar back then, beating our best MRGB Bld3d models despite working off lower-resolution inputs and having fewer trainable parameters:

  • Chamfer-L2: 1.28m (MRGB Bld3d — 4.80m)
  • Absolute normal consistency: 0.88 (MRGB Bld3d — 0.85)
  • Normal consistency: 0.80 (MRGB Bld3d — 0.81)
  • F1@0.25m / 25,000 sampled points: 0.49 (MRGB Bld3d — 0.48)
  • F1@0.5m / 25,000 sampled points: 0.75 (MRGB Bld3d— 0.69)

So, how about joining these two types of rasters together: can we get the best of both worlds and clear the high bar set by the nDSM Bld3d models?

Data & Initial Meshes

This time we trained on the 2014, 2015, and 2018 RGB rasters plus the 2014 nDSM raster, while reserving the 2020 RGB and 2017 nDSM raster combination for testing.

Since the original nDSM rasters were of 0.5m resolution, we up-sampled and cell-aligned them with the RGB rasters using the Resample tool with the Nearest option (see the sketch below). After that, we exported a set of square nDSM chips of the same dimensions as the RGB chips exported earlier for the MRGB experiments.
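
For reference, a minimal arcpy sketch of that up-sampling step (paths are placeholders; snapRaster is set so the up-sampled nDSM cells align with the 0.1m RGB grid):

import arcpy

arcpy.env.snapRaster = "rgb_2014_01m.tif"   # align output cells with the RGB raster
arcpy.management.Resample("ndsm_2014_05m.tif", "ndsm_2014_01m.tif", "0.1 0.1", "NEAREST")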

Next, we rebuilt the training and validation dictionaries, adding the corresponding nDSM chip to every ground-truth building.

When fed to the Bld3d network’s forward pass, the RGB and nDSM chips were stacked together similarly to the MRGB experiments, with the nDSM band taking the place of the mask band, forming 4-band DRGB tensors (Fig. 13).

Each nDSM chip, before being stacked with the RGB bands, was also cleaned up by zeroing out all the pixels beyond the building footprint mask, while all the nDSM pixels under the mask were raised to a minimum of 2.0 meters above ground. The former step allowed us to convey the single-building-mask semantics through the nDSM band, which is particularly useful when a chip contains more than one building, as in dense historic neighborhoods. The latter corrects nDSM defects caused by imperfections in the DTM raster which, upon subtraction from the DSM, pushed some of the building pixels below the 2m threshold.

For normalization of the bands we used the mean and standard deviation calculated individually for all three RGB rasters; for the nDSM band, we used the mean and standard deviation calculated on the building pixels only of the cleaned chips (a short sketch follows).
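
A minimal numpy sketch of the chip clean-up and nDSM-band normalization described above (the mean/std values are computed offline and are placeholders here):

import numpy as np

def clean_ndsm_chip(ndsm, footprint_mask, min_height=2.0):
    # zero out pixels beyond the footprint, raise building pixels to at least 2.0m
    return np.where(footprint_mask, np.maximum(ndsm, min_height), 0.0).astype(np.float32)

def normalize_ndsm_band(cleaned, building_mean, building_std):
    # mean/std computed over building pixels only, across the cleaned training chips
    return (cleaned - building_mean) / building_std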

Since we now had the nDSM band as part of the input, we relied on a procedural extrusion of the initial mesh (see the GitHub gist below), which immediately brought the architecture closer to the desirable output geometries.
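
The original gist is not reproduced in this copy of the post; as a rough, simplified stand-in, the sketch below triangulates only the top surface of a cleaned-up nDSM chip into a PyTorch3D mesh at a given vertex-per-meter density (the full version also extrudes side walls down to the ground):

import numpy as np
import torch
from pytorch3d.structures import Meshes

def nds_to_mesh(ndsm_chip, cell_size=0.1, vpm=2.0, device="cpu"):
    # sample the cleaned-up nDSM chip (H x W, meters above ground, zeros outside
    # the footprint) on a regular grid at roughly `vpm` vertices per meter
    h, w = ndsm_chip.shape
    step = max(1, int(round(1.0 / (vpm * cell_size))))   # pixels between neighboring vertices
    rows, cols = np.arange(0, h, step), np.arange(0, w, step)
    zz = ndsm_chip[np.ix_(rows, cols)]                   # sampled heights
    yy, xx = np.meshgrid(rows * cell_size, cols * cell_size, indexing="ij")
    verts = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)
    # two triangles per grid cell of the sampled top surface
    nr, nc = len(rows), len(cols)
    idx = np.arange(nr * nc).reshape(nr, nc)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)], 0)
    return Meshes(verts=[torch.as_tensor(verts, dtype=torch.float32, device=device)],
                  faces=[torch.as_tensor(faces, dtype=torch.int64, device=device)])

Note that the vpm parameter can simply be raised at inference time (e.g. from 2.0 to 3.0), which is the trick discussed later in the “Bld3d as a Continuous Transformation Function” section.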

This way, both training and inference forward passes were given 4-band DRGB tensors, each tensor containing a single building, and the initial building mesh extruded from the cleaned-up nDSM chip by the above nds_to_mesh code.

Training and Resulting Metrics

We trained a Bld3d neural network with four mesh refinement stages and the same number of trainable parameters as in previous single-raster MRGB experiments:

NUM_STAGES: 4
GRAPH_CONV_DIM: 128
POOLER_RESOLUTION: 32

We started with the following weights of the compound loss:

CHAMFER_LOSS_WEIGHT = 16.
NORMALS_LOSS_WEIGHT = 0.1
EDGE_LOSS_WEIGHT = 1.0
NORMAL_CONSISTENCY_WEIGHT = 0.005
SURFACE_AREA_WEIGHT = 1e-4

…but increased the Chamfer loss weight to 32.0 at later iterations.

You may have also noticed the new SURFACE_AREA_WEIGHT coefficient above. The Surface Area loss was calculated based on comparing the total face areas between reconstructed and ground truth meshes per minibatch. The Surface Area loss helped to keep the reconstructed meshes closer to 2-manifold topology, reducing the number of self-overlaps at later stages of training.

# L2 distance between the total face areas of the predicted and ground-truth
# meshes in the minibatch (PyTorch3D packed representation):
area_loss = (stage_meshes_pred.faces_areas_packed().sum()).dist(meshes_gt.faces_areas_packed().sum())

After validation losses and metrics stabilized, we moved to testing.

Back when we trained the nDSM Bld3d models, we did not have the 2017 nDSM raster, so, to compare apples to apples, we ran the DRGB model on the same test areas of the 2014 nDSM raster we used for evaluating the nDSM models in Part 1 (these areas were not part of the DRGB training dictionaries):

  • Chamfer-L2: 1.07m (nDSM Bld3d — 1.28m)
  • Absolute normal consistency: 0.89 (nDSM Bld3d — 0.88)
  • Normal consistency: 0.82 (nDSM Bld3d — 0.80)
  • F1@0.25m / 25,000 sampled points: 0.58 (nDSM Bld3d — 0.49)
  • F1@0.5m / 25,000 sampled points: 0.80 (nDSM Bld3d — 0.75)

Then we ran the tests with the 2020 RGB and 2017 nDSM, applying global histogram matching to both, to compare with the best MRGB results:

  • Chamfer-L2: 1.24m (MRGB Bld3d — 4.80m)
  • Absolute normal consistency: 0.89 (MRGB Bld3d — 0.85)
  • Normal consistency: 0.82 (MRGB Bld3d — 0.81)
  • F1@0.25m / 25,000 sampled points: 0.56 (MRGB Bld3d — 0.48)
  • F1@0.5m / 25,000 sampled points: 0.79 (MRGB Bld3d — 0.69)

Fig. 13. DRGB Bld3d model: squared and buffered DRGB input raster 0.1m resolution (left), and predicted building shape (right) from the Zurich test set.

LOD2.3 Features: Roof Overhangs

DRGB Bld3d models reconstruct roof overhangs for simple hip and gable roof shapes and properly place them on the front façade in the case of gable roofs, withholding the ones on the sides (in Zurich, many gable-roof buildings share a wall with another structure) (Fig. 14, left). For more complex roof shapes though, significantly longer training was required to get consistent roof-overhang geometry (Fig. 14, right).

Fig. 14. Roof overhangs reconstructed by a DRGB Bld3d model. Test set. Left: a relatively simple gable roof shape of a block-corner building. Right: a more complex roof shape and, consequently, a much noisier roof-overhang geometry.

Transferability

Transferability of the DRGB Bld3d models is also better than that of the MRGB ones. Fig. 15 shows reconstructions in the same neighborhood of Amasya, Turkey, but this time the quality of the shapes is much higher: correct fitting of the roof ridges into the building footprints, as well as consistent roof overhangs. The Amasya RGB raster was modified through global histogram matching against the Zurich RGB training set; the Amasya nDSM raster was left unmodified.

Fig. 15. High quality LOD2.3-level raw reconstructions from Amasya, Turkey by DRGB Bld3d models trained on Zurich data. Data courtesy of General Directorate of Land Registry and Cadastre of Turkey.

A strong architectural bias remains in the DRGB Bld3d models, as we will cover in the “Bad Examples” section below. In short, the Zurich dataset has relatively narrow spatial variability, leading to unstable reconstructions in geographies with significantly different architectural styles, for example downtown areas of U.S. cities with skyscrapers packed closely together. A solution here can be transfer learning / fine-tuning with more local examples.

Bld3d as a Continuous Transformation Function

There is a trick to getting a bit more detail out of pretrained Bld3d models: the vertex transformation function learned by the neural network is continuous, yet the initial mesh given to the network as part of the forward-pass input is discrete, with vertices spaced at equal intervals (see the GitHub gist above for details). Moreover, the initial mesh is oriented along the X-Y pixel-space axes, which, under certain angles, causes a discrete noise along the wall/roof creases that is hard to cancel even for a well-trained Bld3d model, due to the vertex budget and the edge-length constraints (Fig. 16).

Fig. 16. Left: 0.5m nDSM input raster with building footprint (cyan). Right: DRGB Bld3D reconstruction with 2.0 vertex per meter density (vpm) of initial mesh. Note the noise on the façade plane and the wall/façade creases.

What can be done here, specifically at inference time, is to increase the vertex-per-meter density of the initial meshes: empirically, we found that 3.0 vertices per meter produces more accurate results (Fig. 17). Increasing the density further led to climbing noise again, likely caused by a growing discrepancy between vertex density and input raster resolution.

Fig. 17. DRGB Bld3D reconstructions with different vertex per meter (vpm) density of initial mesh. Left: Original 2.0 vpm density used in training. Right: 3.0 vpm density used for inferencing.

On the other hand, the downside of a higher vertex count is, obviously, a larger mesh with proportionally higher bandwidth and rendering requirements for production applications. In Part 1 of the series we talked about 3D regularization as a way to reduce the number of vertices. It is interesting to note that, even after the regularization is applied, the original 3.0 vpm density still yields cleaner geometry in the end (Fig. 18).

Fig. 18. Results of 3D regularization of DRGB Bld3D reconstructions with different vertex per meter (vpm) density of initial mesh. Left: 2.0 vpm. Right: 3.0 vpm.

Comparing Side-by-Side

How do these models compare with each other from the human-perception standpoint? When visually examining the reconstructions, there are a few distinctive properties to look for: spatial resolution of extracted features, level of noise (around the roof-to-wall creases, wall smoothness), roof-ridge sags, and roof-overhang quality (where the overhangs are placed, how consistent the overhang rim is, planarity) (Fig. 19).

Fig. 19. (Click for full size) Side-by-side comparison of a building from test set. Top-left: Ground truth. Top-right: DRGB reconstruction. Bottom-left: MRGB reconstruction. Bottom-right: nDSM-only reconstruction from Part 1 of the series. Note the zones: 1 — spatial resolution, 2 — level of noise, 3 — roof ridge sags.

Another powerful feature comes with deep learning: resistance to noise in the input rasters. Specifically, tree canopies partially obscuring building roofs are a notorious source of errors in traditional RANSAC-based deterministic reconstruction algorithms.

As shown in Part 1, the nDSM Bld3d models already have a strong resistance to noise built into them at training time (Fig. 20).

Fig. 20. nDSM Bld3d reconstructions. Left: raw extrusion (used as initial mesh) from the 0.5m nDSM raster. Note the large tree blocking one corner of the roof, and smaller size trees partially obscuring the opposite wall and a lower roof. Right: Corresponding nDSM Bld3d reconstruction. Test set.

DRGB Bld3d takes it further, with a higher spatial resolution of the affected buildings and better-looking guesstimated portions of the geometry (Fig. 21).

Fig. 21. Left: nDSM Bld3d reconstruction. Right: DRGB Bld3d reconstruction. More planar roof surface, roof overhangs, better pronounced three-step structure in front.

Bad Examples

Tall and Large Area Buildings

With the mean height of Zurich buildings at 10.95m and a standard deviation of just 1.8m, the DRGB Bld3d models clearly have difficulties reconstructing buildings that are a few times larger or taller than average. Of course, the results are not as bad as the MRGB-based reconstructions (shown earlier in Fig. 10), but the resolution of the rooftops is still far from being as sharp as for buildings that are a few stories high, or near the mean. The fact that the resolution of tall buildings’ roofs is even lower than with the nDSM Bld3d models likely indicates once again that the orthorectification problems in the training RGB rasters were not always playing a positive role, particularly at inference time.

Fig. 22. A DRGB Bld3d reconstruction of a ~64m tall building (shown before in Fig. 10) from 2020 RGB + 2017 nDSM test rasters. The rooftop resolution is not as sharp as in case of nDSM models — are orthorectification issues to blame?

With regards to large-area buildings, the DRGB Bld3d reconstructions sometimes exhibit low-frequency noise (Fig. 23), which again is likely caused by underfitting to large structures (98.7% of Zurich’s building footprints are under 1,700m²).

Growing the training dataset would be one solution. Another option for future research, specifically for large-area structures, is multiscale iterative inferencing: refining the mesh in multiple passes and adding mesh vertices only where needed.

Fig. 23. DRGB Bld3d reconstruction of a building of ~12,700m² in its footprint. Note the low-frequency noise — blurred and wavy lines in the middle of the reconstruction. From 2020 RGB + 2017 nDSM test rasters.

Too Much Tree Noise

In cases where tree pixels exceed about 25% of the affected roof, particularly in the nDSM band, the Bld3d neural network starts having difficulties compensating for that much noise (Fig. 24).

Fig. 24. Two examples of reconstructions significantly affected by a substantial size of tree canopies. From 2020 RGB + 2017 nDSM test rasters.

A potential solution here is extending the data augmentation code to place random, yet statistically plausible, tree noise into the training data on the fly, as sketched below.
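
One possible, purely illustrative shape of such an augmentation, painting a few random disk-shaped canopy bumps into an nDSM chip; all parameters below are assumptions, not values from our experiments:

import numpy as np

def add_random_tree_noise(ndsm_chip, rng, max_trees=3,
                          canopy_height=(4.0, 12.0), radius_px=(10, 40)):
    # paint up to `max_trees` random disk-shaped "canopy" bumps into the chip
    h, w = ndsm_chip.shape
    out = ndsm_chip.copy()
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(rng.integers(0, max_trees + 1)):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        r = rng.integers(*radius_px)
        disk = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        out[disk] = np.maximum(out[disk], rng.uniform(*canopy_height))
    return out

# usage: augmented = add_random_tree_noise(ndsm_chip, np.random.default_rng())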

Thin Appendages in Building Footprints

Sometimes, when a building footprint has a long and thin appendage, the Bld3d models tend to represent it with roof overhangs, or canopies, often leading to inferior-looking geometry (Fig. 25).

Beyond the typical “more training data”, there are other potential solutions to this issue. One is to increase NORMAL_CONSISTENCY_WEIGHT during training: we noticed that, while making the shapes smoother and losing some spatial resolution, models trained this way reconstructed fewer, yet more geometrically stable and accurate-looking, roof overhangs. Another, rather radical, solution is to modify the ground-truth LOD2 building models to have no roof overhangs at all. That would leave the reconstructed buildings without LOD2.3 features and increase the internal volume of the reconstructions (which may not be desirable given additional analysis requirements); on the other hand, it would produce more stable geometry, conforming more closely to 2-manifold topology.

Fig. 25. Bld3d models tend to represent thin appendages of building footprints as canopies without walls.

Practicality and no-Lidar Inference

Zurich Drone Imagery

Earlier this year our business partner, Wingtra AG, collected oblique imagery of a portion of Zurich using their WingtraOne GenII light drone equipped with a 24MP Sony a6100 camera (more on this here).

With the help of the SURE for ArcGIS meshing engine, we processed the collected set of oriented images (about 13,000 in total, collected during 6 hours of flight time) into a pair of orthorectified RGB + nDSM rasters and ran a pretrained DRGB Bld3d model on it (Fig. 26).

Fig. 26. Although trained on an elevation raster collected by an expensive lidar sensor, DRGB Bld3d neural network successfully reconstructed buildings from imagery collected by a drone equipped with an optical camera only. The input elevation raster was created with SURE for ArcGIS meshing engine using the collected imagery. Left: raw 3vpm-density geometry on top of nDSM elevation raster. Right: fully-textured buildings after 3D regularization on top of RGB raster. Roof textures are from the RGB raster, while wall textures are Pix2PixHD-generated. Cyan-highlights here are to emphasize individual buildings segmentation which is based on input footprints.

What is remarkable about this particular case is that no active lidar sensor was involved in the source data collection used to perform the 3D building reconstruction. Thanks to a pretrained DRGB Bld3d model, the acquisition of 3D building shells became nearly automated and took just a few hours of manual intervention to complete:

  • Oriented image collection with an unpiloted drone
  • SURE for ArcGIS photogrammetric mesh reconstruction and conversion into orthorectified RGB and nDSM rasters
  • Inference with a pretrained DRGB Bld3d model using Zurich Open Data building footprints
  • Import of inference results into ArcGIS for display, further analysis, and web-service publication.

You can view an interactive web scene with about 7,800 buildings reconstructed from drone-collected imagery here.

Munich Aerial Imagery and OSM Buildings

We took a pretrained DRGB Bld3d model further, and moved to an RGB+nDSM raster pair of Munich, Germany. The rasters were produced with the help of the SURE for ArcGIS meshing engine, from the source imagery courtesy of Hexagon. Again, no lidar sensor was employed during data collection nor reconstruction. The RGB bands were histogram-matched against the Zurich training rasters.

For the building footprints we used OpenStreetMap polygons. Although the OSM building footprints do not perfectly align with the imagery and some of the footprint geometry is inaccurate, a pretrained DRGB Bld3d model was able to successfully reconstruct 3D buildings from the given inputs (Fig. 27).

Fig. 27. DRGB Bld3d reconstructions from RGB+nDSM rasters built from oriented imagery only, no lidar. OpenStreetMap building footprints. Source imagery 2021 Hexagon / Esri. Captured with Leica CityMapper-2 — processed with SURE by nFrames/Esri. All rights reserved.

We did not have ground truth LOD2 building models for Munich to run a quantitative evaluation of the reconstructions, but visual comparison against unsegmented photogrammetric mesh shows a close match (Fig. 28).

Fig. 28. DRGB Bld3d Munich reconstructions with OSM building footprints. Comparing photogrammetric unsegmented mesh (Left) against the DRGB Bld3d reconstructions. Cyan highlights emphasize per-OSM footprint building segmentation. Source imagery 2021 Hexagon / Esri. Captured with Leica CityMapper-2 — processed with SURE by nFrames/Esri. All rights reserved.

The level of noise in the Munich reconstructions was significantly higher than in Zurich, but given the size and limited geography of the training dataset, this was expected.

Nevertheless, looking at the Fig. 28 image, a good question to ask is: “Why would I want to extract geometrically less detailed building shells out of a photogrammetric mesh… could I just use building footprints to cut the mesh into individual buildings instead?” The answer again lies in the denoising ability of the neural network: the Munich mesh is rich in vegetation in direct proximity to buildings, and if we simply used the building footprints to cut the mesh as-is, we would see gaps in façade geometry and vegetation melded into the buildings (Fig. 29).

Fig. 29. Left: A portion of the Munich mesh with vegetation in direct proximity to the building, blending geometrically into the façade. Right: DRGB Bld3d reconstruction with clean façade wall geometry. Source imagery 2021 Hexagon / Esri. Captured with Leica CityMapper-2 — processed with SURE by nFrames/Esri. All rights reserved.

Summary

In the previous post, Part 1 of the series, we discussed the nDSM version of the Bld3d neural network, which performed reconstructions based on elevation rasters and building footprints.

In this post we covered two more flavors of Bld3d: the MRGB version, working off an input RGB raster and building footprints, and the DRGB version, which reconstructs buildings using footprints and an RGB + nDSM pair of rasters.

After conducting a broad range of experiments on various test inputs, we conclude that the DRGB models produce the best reconstructions from the spatial-resolution and geometric-stability standpoints. The DRGB version also showed the best results when run in other geographic regions, exhibiting the best noise resistance.

Second place is taken by the nDSM version of Bld3d, which similarly demonstrates strong noise resistance but performs worse on the spatial-resolution side (except for tall buildings affected by RGB orthorectification issues, where the nDSM models often performed better than DRGB).

The MRGB version showed, as expected, the worst performance due to the significant gap in semantics of the pure 2D inputs. Relying heavily on learned bias, it also exhibited the worst results in the transferability tests.

Pretrained DRGB models, combined with preprocessing by a meshing engine, demonstrated a compelling way of lowering the costs of 3D asset generation and acquisition by processing input data collected by passive optical systems only, significantly reducing the need for expensive lidar sensors.

From a practicality standpoint, the Bld3d architecture has a fully convolutional backbone and distinct mesh refinement stages built with graph convolutions, making the pretrained models suitable for transfer learning and fine-tuning, which in turn makes the architecture a good candidate for rapid user adoption.

Acknowledgments

Many thanks to Konrad Wenzel, Robert Garrity, Jim McKinney for valuable feedback reflected in this series’ content.

References

[1] Juan Luis Gonzalez, Munchurl Kim: PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6851–6860

[2] Gkioxari, G., Malik, J., Johnson, J. (2020): Mesh R-CNN. arXiv:1906.02739v2

[3] Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A. (2019) : Occupancy Networks: Learning 3D Reconstruction in Function Space. arXiv:1812.03828v2

[4] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. In CoRR 1512.03012, 2015.

[5] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.

[6] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR, 2018.

[7] Lin T., Dollár P., Girshick R., He K., Hariharan B., Belongie S. (2017) Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2

[8] He K., Zhang X., Ren S., Sun J. (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385

[9] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, June 2018. arXiv:1711.11585
