Reconstructing 3D buildings from aerial LiDAR with AI: details

Authors: Dmitry Kudinov, Daniel Hedges; Contributors: Omar Maher

In this blog post we cover a proof-of-concept project we did here at Esri on reconstructing 3D building models from aerial LiDAR data with the help of deep neural networks, in particular a Mask R-CNN model trained to detect and report instances of roof segments of various types. We cover the data preparation, the Mask R-CNN training and achieved precision, the inference architecture and integration of TensorFlow and ArcGIS Pro 2.3 BETA, and the steps used to reconstruct 3D building models from the predictions.

Introduction

The value of accurate 3D building models for cities is hard to overestimate, but modern-day approaches to collecting and maintaining such data are labor-intensive, error-prone, and expensive. We teamed up with Miami-Dade County and NVIDIA on this project to see if we could streamline this data-acquisition workflow, or at least make it more cost-effective.

The workflow traditionally used to reconstruct 3D building models from aerial LiDAR is relatively straightforward: the LiDAR point cloud is transformed into a Digital Surface Model (DSM) raster, which human editors then inspect for buildings. If a building is found, one or more polygons describing the roof form of the building are manually digitized; e.g. if it is a large hip roof with two gable outlets, the editor draws three polygons (one hip and two gables on top). Once all the roofs are described that way, a set of ArcGIS Procedural rules is applied to extrude the building models from the manually digitized roof segments, with heights and ridge directions computed from the DSM.

A building with complex roof shape and its representation in visible spectrum (RGB), Aerial LiDAR, and corresponding roof segments digitized by a human editor. The last one is a 3D reconstruction of the same building using manually digitized masks and ArcGIS Procedural rules.

The most time-consuming and expensive step in the above workflow is the manual search and digitization of the roof segment polygons from a DSM raster (an average human editor digitizes about 70 roof segments per hour), and this is where we decided to help our users. As a result, we now have a trained network which can produce up to 60,000 roof segments per hour from a single GPU for that geographic region! Of course, the AI predictions are not as accurate as the ones digitized by humans, but having close-enough automatically generated segments can instantly boost human editors' productivity: they only need to fine-tune the AI-suggested models instead of digitizing them from scratch.

Procedural 3D renderings of a sample building with different roof types (flat not shown).

We recently published a high-level overview of this project, so if you are interested in the non-technical details, you can find them there.

Training

Data Preparation

As mentioned in the overview post above, we had about 200 square miles (~518km²) of aerial LiDAR with 213,000 roof segments manually digitized by human editors. Each roof segment is a polygon of one of seven types: Flat, Gable, Hip, Shed, Dome, Vault, and Mansard.

The roof segments were stored as features in a polygon feature class in a local file geodatabase, whereas the LiDAR point cloud was converted into a single-band (32 bit) raster layer with 2.25 square feet (~0.21m²) per pixel resolution using the “LAS Dataset to Raster” geoprocessing tool.

Originally, the LiDAR raster's single channel represented heights above sea level (a Digital Surface Model, DSM), which we then normalized by subtracting the Digital Terrain Model (DTM). The resulting nDSM raster contained heights above ground level.

We found that using the normalized Digital Surface Model (nDSM) allows for more efficient training of the neural network: it removes the dependency on surface elevation, making the range of heights much more compact and dense, and allows reaching the same mAP values with fewer training examples.
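The normalization itself is simple raster algebra. A minimal NumPy sketch (the function name is ours; the actual workflow would use geoprocessing tools on the full rasters):

```python
import numpy as np

def normalize_dsm(dsm: np.ndarray, dtm: np.ndarray) -> np.ndarray:
    """Subtract the terrain model from the surface model to get
    heights above ground (nDSM); clamp tiny negatives from sensor noise."""
    ndsm = dsm - dtm
    return np.maximum(ndsm, 0.0)

# Toy example: a 10 ft structure standing on terrain at 5 ft elevation.
dsm = np.array([[5.0, 15.0], [5.0, 5.0]])
dtm = np.full((2, 2), 5.0)
print(normalize_dsm(dsm, dtm))  # heights above ground: 0 and 10
```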

Creating Training, Validation, Test Sets

Once the nDSM and human-digitized roof segments layers were ready, we ran the ArcGIS Pro 2.3 BETA version of the “Export Training Data for Deep Learning” geoprocessing tool which exported tiles and masks for the instance segmentation problem (Fig. 1).

Figure 1. Exporting tiles and masks for the instance segmentation problem.

The tool writes the output tiles and masks into the destination folder with a convenient folder structure which can, with a little scripting, be easily plugged into a deep learning framework. Tiles and masks in our case were 512 x 512 GeoTIFFs: tiles containing chips of rasterized LiDAR, and per-roof-type instance masks, i.e. images of the same pixel size whose color channel values enumerate the instances of a particular roof type present in the tile (Fig. 2).

Figure 2. Pseudo-color tile of the input nDSM raster (left) and corresponding per-roof-type masks with multiple segments (instances) of the Flat, Gable, and Shed types.

The resulting set contained about 19,000 unique samples (a tile plus its masks), which is not much considering the complexity of the task and the size of the training sets of similar successful projects. In order to preserve as many training samples as possible, we split the original set into non-overlapping Training (95%), Validation (2%), and Test (3%) subsets. Traditionally, the Validation samples were used to verify loss convergence at the end of each training epoch, whereas the Test set was used only at the end of each experiment.

The modest size of the training set was causing a problem with the variety of building heights: buildings in our region of interest vary from a single-story shed under 9 feet (~2.7m) high to 600 ft. (~183m) and above. With such a variation in height, we simply did not have enough samples to train the neural network efficiently. The solution we found was to normalize each input tile so that its single Float32 band gets converted into three 8-bit unsigned integer bands, allocating the Red band for the normalized height value, and the Green and Blue bands for the normalized height derivatives, sobelX and sobelY respectively (Fig. 3). Such normalization made the input signal pretty much building-height invariant, although it reduced the accuracy on small structures on top of and around tall buildings.

Figure 3. Sample Python3 script which converts tiles with single-band representing heights into pseudo-color normalized height, sobelX, sobelY three-band images.
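Since the figure's script is not reproduced here, the idea can be sketched in pure NumPy (function names are ours; the actual script may differ in how it scales bands and computes the Sobel filters):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def filter3(img, kernel):
    """3x3 cross-correlation with edge padding (pure NumPy)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def to_uint8(band):
    """Min-max scale a float band into the 0-255 range."""
    lo, hi = float(band.min()), float(band.max())
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return ((band - lo) * scale).astype(np.uint8)

def normalize_tile(heights):
    """Single-band Float32 nDSM tile -> 3-band uint8 image:
    R = normalized height, G = sobelX, B = sobelY."""
    return np.dstack([to_uint8(heights),
                      to_uint8(filter3(heights, SOBEL_X)),
                      to_uint8(filter3(heights, SOBEL_Y))])
```

Because each tile is scaled by its own min/max, the result is largely invariant to absolute building height, which is exactly the property described above.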

Data Augmentation

Still, even with the above normalization, the resulting training set of ~18,200 samples was not that big, so in order to train on a larger variety of samples we used traditional image augmentation techniques, in particular the imgaug Python library, relying mostly on affine transformations and flips (Fig. 4).

Figure 4. Image augmentation code which allows for bigger variation in training samples.
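The actual augmentation code is in the figure and uses imgaug; as a dependency-free illustration of the same idea, here is a toy NumPy version that applies an identical random flip or right-angle rotation to both the tile and its instance masks, so geometry stays aligned:

```python
import numpy as np

def augment(tile, masks, rng):
    """Apply the same random 90-degree rotation and optional horizontal
    flip to a tile and its masks (a toy stand-in for the imgaug pipeline)."""
    k = int(rng.integers(0, 4))          # rotate by k * 90 degrees
    tile, masks = np.rot90(tile, k), np.rot90(masks, k)
    if rng.random() < 0.5:               # horizontal flip
        tile, masks = np.fliplr(tile), np.fliplr(masks)
    return tile, masks

rng = np.random.default_rng(0)
tile = np.zeros((512, 512, 3), np.uint8)
masks = np.zeros((512, 512, 7), np.uint8)   # one channel per roof type
aug_tile, aug_masks = augment(tile, masks, rng)
```

Applying exactly the same transform to inputs and masks is the crucial point; imgaug handles this with its deterministic-augmenter mechanism.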

Loading Training Data and Mask R-CNN Configuration

For the neural network, we chose one of the open-source implementations of the state-of-the-art Mask R-CNN architecture.

Compared to the original Mask R-CNN and FPN papers, our implementation differed only in the anchor sizes (10, 20, 40, 80, 160), to pick up smaller masks more efficiently while working at 2.25 sq. ft. per pixel resolution, and in the number of RoIs (1,000), allowing for a bigger number of RPN proposals per image. Our implementation also uses a custom DETECTION_MAX_INSTANCES = 1,000 config value, which allows a bigger number of detections to be reported through the matterport framework.
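The overridden values can be summarized as follows (a sketch only: in the real code these fields would live on a subclass of the matterport `mrcnn.config.Config` class, and `NAME` is a hypothetical label we picked):

```python
# Sketch of the configuration overrides described above; in practice
# this would subclass matterport's mrcnn.config.Config.
class RoofConfig:
    NAME = "roofs"                              # hypothetical experiment name
    NUM_CLASSES = 1 + 7                         # background + 7 roof types
    RPN_ANCHOR_SCALES = (10, 20, 40, 80, 160)   # smaller anchors for 2.25 sq ft/px
    TRAIN_ROIS_PER_IMAGE = 1000                 # more RPN proposals per image
    DETECTION_MAX_INSTANCES = 1000              # report up to 1,000 detections
    IMAGES_PER_GPU = 4                          # batch size used in training
```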

We used TensorFlow 1.7 to train the model with a ResNet-101 backbone. Starting from ImageNet pre-trained weights, we selectively trained the fully-connected layers first to adapt them to our new classes, and only then proceeded with full network training for about 1,400 epochs, 4 images per GPU, using two NVIDIA Quadro GV100s connected with NVLink, achieving ~0.574 mAP on the validation set and ~0.483 mAP on the test set.

The matterport implementation allows for convenient subclassing of utils.Dataset to load your data into the framework for training. See Fig. 5 for our implementation, which loads the tiles and masks produced by the "Export Training Data for Deep Learning" geoprocessing tool (at this point the tiles are already normalized with the script from Fig. 3).

Figure 5. utils.Dataset subclass for loading samples into the training framework.
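The figure's Dataset subclass is not reproduced here; below is a dependency-free sketch of the indexing step such a subclass would perform, pairing each exported tile with its per-roof-type mask images. The folder names (`images`, `labels/<RoofType>`) are our assumptions about the export tool's layout, not a documented contract:

```python
from pathlib import Path

ROOF_TYPES = ["Flat", "Gable", "Hip", "Shed", "Dome", "Vault", "Mansard"]

def index_samples(root):
    """Pair each exported tile with its per-roof-type mask images.
    Folder names are assumptions; check the actual output layout of the
    'Export Training Data for Deep Learning' tool."""
    root = Path(root)
    samples = []
    for tile in sorted((root / "images").glob("*.tif")):
        masks = {t: root / "labels" / t / tile.name
                 for t in ROOF_TYPES
                 if (root / "labels" / t / tile.name).exists()}
        samples.append({"tile": tile, "masks": masks})
    return samples
```

A utils.Dataset subclass would then register each entry via `add_image()` and read the mask files in `load_mask()`.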

Inferencing

Architecture

For inferencing we chose a client-server architecture, where the service role is played by a pre-trained Mask R-CNN model wrapped into a REST API, and ArcGIS Pro 2.3 BETA acts as the client. A client-server architecture is justified in this case, as inferencing happens on multiple expensive GPUs which an organization may want to share across many clients for better utilization.

Server: REST Wrapper and JSON FeatureSet as Output

As a proof of concept, before wrapping the model into a full-blown server framework, we used Python 3 libraries to prototype a simple HTTP server with image-upload functionality, which accepts rasterized LiDAR tiles and returns a JSON FeatureSet of recognized roof segments. Our complete server prototype is contained in a single Jupyter Notebook, which includes the Mask R-CNN inferencing code (Keras + TensorFlow), the code for vectorization of the resulting masks (skimage), and the code for batch processing.

The output JSON FeatureSet follows the ArcGIS REST API format for polygonal features. Although the output of Mask R-CNN is a raster mask, we perform server-side vectorization and Ramer–Douglas–Peucker simplification of the detections before serializing them into JSON; see Fig. 6 for a sample server response.
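The vectorization itself relies on skimage; the Ramer–Douglas–Peucker simplification step can be sketched in pure Python (a recursive toy version for an open polyline, with the tolerance `epsilon` in pixels; not the production code):

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification: drop points closer than
    epsilon to the chord between the endpoints, recursing on the
    farthest point when it exceeds the tolerance."""
    if len(points) < 3:
        return points
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        # perpendicular distance from the point to the endpoint chord
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[:idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right   # avoid duplicating the split point
```

Running the contours extracted from each mask through such a simplifier keeps the polygon vertex counts, and thus the JSON payload, small.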

Figure 6. Sample server response containing a FeatureSet with two detected roof segments. Note that the coordinates are in “super-tile” 1024 x 1024 pixel space (see “Super-tile and batch inference” section for details).

Server: Tiles and Tiling Artifacts

The service expects 512 x 512 x 1 input tiles. There is a challenge though: with a simple non-overlapping tiling scheme, roofs which fall between adjacent tiles are reported as disconnected pieces and need to be merged somehow later with the help of seam-location data. Although that looks like an acceptable solution, the tile-seam "healing" process needs to be performed with a CPU-based algorithm on the client side, which requires the client to carry a fairly advanced geometry engine and therefore excludes most thin clients.

We decided to take a different approach and make the server-side GPU do most of the work instead. The solution we found uses overlapping tiles with a stride of 256 pixels, i.e. 50% overlap along each axis. For each tile, we report only the detections which are either completely within the central portion of the tile (Fig. 7), or too big to be reported by the overlapping nearest neighbors; all other detections are discarded, assuming the neighboring tiles will report them.
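A minimal sketch of this keep/discard rule (the helper is hypothetical, and the exact "too big" criterion is our guess at what is described above, namely a bounding box exceeding the stride along either axis):

```python
def keep_detection(bbox, tile_size=512, stride=256):
    """Decide whether this tile should report a detection.
    bbox = (x0, y0, x1, y1) in tile-local pixels. Keep it if it lies
    fully inside the central stride x stride window, or if it is too
    large for any neighbor's central window to contain it."""
    x0, y0, x1, y1 = bbox
    lo = (tile_size - stride) // 2        # 128: central window start
    hi = lo + stride                      # 384: central window end
    central = lo <= x0 and lo <= y0 and x1 <= hi and y1 <= hi
    too_big = (x1 - x0) > stride or (y1 - y0) > stride
    return central or too_big
```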

There are pros and cons to this approach too. The good thing is that such processing effectively uses the GPU to remove seam artifacts in a scalable manner, keeping the requirements on the client-side implementation relatively low. On the other hand, this comes at the cost of quadrupling the amount of work on the server GPU, as almost every input pixel must be processed four times as part of different overlapping tiles. Another disadvantage is the reliance on the neighbors to report detections which fall outside the central portion of the current tile: in rare cases, the same roof segment can be considered by two tiles as belonging to the other (this may happen because the convolutional layers see different initial conditions on neighboring tiles), and as a result such a detection may not get reported at all.

Nevertheless, given the high scalability requirement at hand, we decided to have the server-side GPU do the work of removing tile-seam artifacts.

Server: “Super-tile” and Batch Inference

Figure 7. Tiling schema: 1024x1024 “super-tile” which gets split into nine 512x512 overlapping tiles. Each colored segment is the central portion of each 512x512 tile. Adjacent “super-tiles” overlap by 256 pixels to provide seamless coverage.

It is more efficient to ask the GPU to perform inference on multiple tiles in a single pass (as many as fit in the GPU's memory), as it cuts the costs of data marshalling and initialization of the inference stack. Because of the pseudo-color local normalization performed on the original 512 x 512 training tiles, it made sense to stick to the same tile dimensions for inference as well, and with a 16 GB NVIDIA Quadro GP100 GPU we could perform inference on multiple such tiles in a single pass.

To leverage this efficiently, we made the client upload bigger tiles ("super-tiles") of 1024 x 1024 size, which the server then batches into 9 overlapping tiles each (a 3 x 3 grid of 512 x 512 tiles with 50% overlap), so it can batch-process them in a single request-response cycle. Because of that, the pixel coordinates in the Fig. 6 sample response are in the 1024 x 1024 "super-tile" pixel space.
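The 3 x 3 batching described above can be sketched as follows (a toy illustration; the function name and arrays are ours, not the production code):

```python
import numpy as np

def split_supertile(size=1024, tile=512, stride=256):
    """Top-left offsets of the nine overlapping tiles inside a
    super-tile: a 3 x 3 grid of 512 x 512 windows with 50% overlap."""
    steps = range(0, size - tile + 1, stride)   # 0, 256, 512
    return [(x, y) for y in steps for x in steps]

# Usage: cut a super-tile array into a batch for a single GPU pass.
supertile = np.zeros((1024, 1024), dtype=np.uint8)
batch = [supertile[y:y + 512, x:x + 512] for x, y in split_supertile()]
```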

Client: ArcGIS Pro 2.3 BETA and Python Raster Function

For this experiment we used ArcGIS Pro 2.3 BETA and its new "Detect Objects Using Deep Learning" geoprocessing tool, which conveniently performs tiling of the input raster and, for each extracted tile, calls a custom Python Raster Function (PRF) working as a REST client for our Mask R-CNN inference service. You can find more details about PRFs here; please note that until ArcGIS Pro 2.3 is officially released, the method signatures and API may change. Come back to this GitHub page after the official release for the latest documentation and samples on Python Raster Functions.

Figure 8. Python Raster Function (PRF) — this is a custom Python3 object which is used by the “Detect Objects using Deep Learning” geoprocessing tool to perform roof segment detection on the input LiDAR raster layer. The PRF’s vectorize() method is called with “super-tiles” of 1024 x 1024 x 1 size each.

3D Reconstruction

Once the "Detect Objects Using Deep Learning" tool is done running, and the detections are written into a feature class in a geodatabase, we can use traditional geoprocessing tools and Procedural rules to extrude the 3D building multipatches.

First, we used the "Regularize Building Footprint" tool from the 3D Analyst toolbox to regularize the raw detections. Next, we used the surface elevation and DSM rasters to acquire the base elevation for each building segment and calculate the roof-ridge direction. The final step was calling the "Features From CityEngine Rules" GP tool, which applied a procedural rule to these polygons to produce the final reconstruction of the 3D buildings.

Results

Pros and Cons

At this point, after four weeks of training, and having reached the training plateau on the given set of Miami data, we can draw some conclusions. While the neural network did not reach the prediction accuracy of human editors, the speed (60,000 segments per hour from a single NVIDIA Quadro GP100 GPU versus 70 per man-hour) and the negligible marginal cost at which predictions can be produced make a pre-trained Mask R-CNN service a great helper tool which will significantly reduce the cost of creating and maintaining 3D city models. With this new tool in the workflow, human editors, instead of starting from a rasterized LiDAR point cloud and manually digitizing roof segments, can now focus on fine-tuning the AI-offered 3D building multipatches using standard editing tools.

Live Web Scene

Use the link below to open a live WebScene with 2D predicted masks and 3D reconstructions.

In the following 3D WebScene you can find the raw Mask R-CNN predictions (Predictions__2D) and the 3D buildings extruded on top of them (Predictions__3D); no manual edits were made to either:

Ongoing and Future work

Growing the Training Set and Improving Predictions Quality

Still, the accuracy of the AI-predicted roof segments can be improved a lot further by growing the size of the training set, adding more architectural variations and environmental conditions: downtowns with closely standing high-rises, tree-shaded buildings, highway overpasses as negative examples, nearby waterbodies, etc.

We will be adding more examples of complex roof types, like historical buildings, churches, and cathedrals, whose roofs can be used to train the model for more precise discrimination between sloped segments.

A big enough training set may even allow discarding the tile normalization step (Fig. 3), bringing the prediction accuracy higher, especially for tall buildings and their surroundings.

Another issue with real-world training data is class imbalance: the number of flat-roof buildings, at least in Miami-Dade County, significantly outnumbers all other roof types, while domes and vaults are rare exceptions, which is a problem for training a neural network classifier. Here we will be using synthetic training examples produced with ArcGIS CityEngine and ArcGIS Pro.

Building ArcGIS Online Inference Service

Our goal is to achieve consistent quality and accuracy across a larger range of geographies, conditions, and architectural styles. Once this is done, we look forward to publishing this model as a BETA ArcGIS Online inferencing service, which our customers and partners can use to bring the cost of 3D building models down while increasing the quality and coverage of such content.
