3D cities: Deep Learning in three-dimensional space

Dmitry Kudinov
Published Mar 16, 2019


Authors: Dmitry Kudinov; Contributors: David Yu, Tamrat Belayneh

In this post we continue exploring how Deep Learning models can be applied to extract information from three-dimensional spatial data, particularly in the context of 3D building model reconstruction over large geographic areas, from individual cities to entire regions.

The third dimension in GIS is not new: for many years there have been fields where all three spatial coordinates were considered essential, such as climatology, oceanography, geology, architecture, and utility networks. Nowadays, 3D models of cities belong on that list too. Such models are rapidly being adopted and recognized as an invaluable source for decision making in urban design, event planning, maintenance, safety and security, disaster response, crowd control, and insurance, to name a few.

So, what raw data sources can be used to reconstruct a 3D model of a city? Which tools and workflows are available? In which directions should we explore further? We look for answers to these questions below.

As with several previous Deep Learning projects, we performed the following experiments in collaboration with NVIDIA, who provided high-end QUADRO GV100 cards with 32GB of memory, enough to fit large mini-batches for efficient training of massive Mask R-CNN and PointCNN networks. https://www.nvidia.com/en-us/design-visualization/quadro-store/

Realism vs Cubism

Before we dive deep into technical details, let us take a step back and look at the bigger picture of the needs and means of the subject.

In an ideal world, we want the digital model to be indistinguishable from the real city: high-fidelity details and textures from cars to buildings, road cracks, leaves on every tree, the exact position and height of every chimney pipe and each dormer window handle. On second thought, though, such a level of detail, while theoretically possible, is not practical because of the computational requirements, the time it takes to acquire, and the ongoing resources needed to keep the model up to date.

Therefore, for all practical purposes, 3D building models are often split into two main categories: high-fidelity and schematic. The high-fidelity category consists predominantly of models of historical buildings, whose appearance is fixed and protected by regulations, or which are even listed as UNESCO World Heritage Sites. High-fidelity building models require a significant initial investment, but once created they need only small, infrequent updates to reflect occasional restorations of the originals.

The schematic category, on the other hand, is where all the other parts of the city belong: commercial, residential, and industrial zones, which often go through development, reconstruction, expansion, and rezoning, i.e. where changes happen every day. These changes must be reflected periodically in the digital model of the city with a reasonable trade-off between accuracy, speed, and cost: maximizing the speed-to-cost ratio without letting accuracy drop. This is where our focus will be for the rest of this post.

But why do we need fast, periodic, and cost-effective updates to the building models of the schematic category? One particularly prominent reason is that these parts of the city accommodate the majority of the population during both business and night hours. In case of a disaster such as an earthquake, a quick update to the model, compared against its state before the event, would give rescuers a powerful tool to see where the most damage has occurred and to estimate how many people are in the affected areas, by computing the number of collapsed floors, the lost square footage, the amount of debris, etc.

Data Sources

There are two main data sources for acquiring 3D building models at various scales: airborne or terrestrial LiDAR, and 3D triangulated meshes computed from oblique imagery by Structure-From-Motion algorithms and photogrammetric processing. The former is an older, well-established technique, which typically requires rather expensive sensors. The product of a LiDAR scan is an unsorted three-dimensional point cloud, where each point may also carry additional attributes such as Intensity, Red-Green-Blue values, etc. The latter, Structure-From-Motion, is a more recent technique that allows reconstruction of a continuous 3D mesh. The mesh is calculated from a series of oblique pictures taken by an aircraft or a drone flying over the city while recording detailed trajectory information. Such a continuous mesh is often constructed from millions of interconnected triangles and has associated high-resolution RGB textures.

You can construct 3D meshes using the Drone2Map for ArcGIS extension. To learn more: https://doc.arcgis.com/en/drone2map/

Both sources have a common problem though: it is not known which points (in a LiDAR point cloud), or triangles (in a mesh) belong to a building, ground, a tree, a water body, a car, etc… they are just raw unsorted X-Y-Z points, or a huge number of connected triangles with RGB textures.

LiDAR Point Clouds

The aerial LiDAR point clouds we used in the experiments are of relatively high density: about 15–20 points per square meter on average. Such point density is needed to get a strong enough signal in the input data, so that a neural network can be trained faster, with fewer examples, and to a higher accuracy. The reason is that the statistical properties of local neighborhoods in the source point cloud carry the valuable signal that allows the neural network to discriminate between the object classes present in the cloud. Therefore, the sparser the cloud gets, the vaguer the signal becomes, quickly resulting in a need for an exponentially larger number of training samples to learn from.
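To make the density figure concrete, here is a minimal sketch in plain Python that estimates points per square meter by binning points into square cells. It is not part of the original workflow: the function names and the 1 m default cell size are illustrative assumptions, and coordinates are assumed to be in meters.

```python
from collections import Counter


def density_per_cell(points, cell_size=1.0):
    """Map each occupied grid cell to its point density (points per square unit)."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y, _z in points
    )
    cell_area = cell_size * cell_size
    return {cell: n / cell_area for cell, n in counts.items()}


def mean_density(points, cell_size=1.0):
    """Average density over the cells that contain at least one point."""
    per_cell = density_per_cell(points, cell_size)
    return sum(per_cell.values()) / len(per_cell)
```

Cells whose density falls far below the average are the places where the local signal discussed above is likely to be weakest.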

Instance Segmentation in Rasterized Point Clouds

We wrote about the pilot project we performed with Miami-Dade County back in 2018. In that project we tried to optimize one of the steps of an existing and well-established workflow of reconstructing 3D building models, which required manual digitization of building segments of seven distinct roof types from a rasterized point cloud.

In short, the traditional workflow is straightforward:

1. The point cloud is converted into a raster with a color channel storing the average height of LiDAR points per pixel.

2. GIS engineers manually digitize roof segment polygons on top of the step #1 raster: Flats, Gables, Hips, Sheds, Domes, Vaults, and Mansards.

3. ArcGIS 3D Analyst extension tools and CityEngine Procedural Rules are used to extrude the schematic-type building models from the roof segment polygons.

1. point cloud rasterization; 2. manual digitization of roof segments; 3. extrusion of building models.

Overview of the Miami-Dade project: https://www.esri.com/arcgis-blog/products/product/3d-gis/restoring-3d-buildings-from-aerial-lidar-with-help-of-ai/
More technical details with code snippets: https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0

The original rasterized point cloud, which GIS engineers used to manually digitize roof segments, was calculated using the LAS Dataset To Raster geoprocessing tool with a cell size of 2.25 square feet per pixel. The result is a single-channel two-dimensional raster, with a pseudo-color channel representing the height of each pixel: a so-called Digital Surface Model (DSM). You can find more details on creating such surfaces from point clouds here.
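The rasterization idea can be illustrated with a small sketch in plain Python: bin the points into square cells and store the mean height per cell. This is only an illustration of the averaging step, not the LAS Dataset To Raster tool itself, which offers more options; the function name and the use of None for empty cells are assumptions. A cell of 2.25 square feet corresponds to a side of 1.5 feet.

```python
def rasterize_mean_height(points, cell_size, nodata=None):
    """Bin (x, y, z) points into a grid and return the mean z per cell.

    Row 0 is the northern (maximum y) edge, as in a typical raster.
    Cells that receive no points are set to `nodata`.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_max = min(xs), max(ys)
    n_cols = int((max(xs) - x_min) // cell_size) + 1
    n_rows = int((y_max - min(ys)) // cell_size) + 1

    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y, z in points:
        col = int((x - x_min) // cell_size)
        row = int((y_max - y) // cell_size)
        sums[row][col] += z
        counts[row][col] += 1

    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else nodata
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Calling `rasterize_mean_height(points, 1.5)` on coordinates in feet would mimic the cell size mentioned above.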

The manual process of digitizing roof segments from the Digital Surface Model was painfully slow and the most expensive part of the workflow, so the idea was simple: train a Mask R-CNN neural network to, at least, help with extracting roof segments from the DSM raster.

But before we jumped into training the model, some additional preprocessing of the input data needed to happen. The original DSM encodes height values that include the ground elevation, and therefore, if fed to the Mask R-CNN as is, it would require a significantly larger number of training examples to make the network terrain-invariant. (Un)fortunately, we did not have that many examples, so we converted the original DSM into a so-called normalized Digital Surface Model (nDSM) raster, with the ground elevation (Digital Terrain Model, DTM) subtracted:

nDSM = DSM - DTM

Just to elaborate on this formula a little bit:

1. We calculated the DSM as described above.

2. The source point cloud had its points classified as Ground / non-Ground by the Classify LAS Ground geoprocessing tool.

3. The LAS Dataset To Raster tool was used to create the Digital Terrain Model from the classified point cloud, filtered to the Ground class only.

4. The normalized DSM (nDSM) was calculated by running the Spatial Analyst’s Minus tool: nDSM = DSM - DTM.
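The subtraction in step 4 is a cell-by-cell operation on two rasters of the same shape. Below is a minimal sketch in plain Python, with rasters as nested lists and None marking empty cells. Both conventions are assumptions made for illustration; the experiments themselves used the Minus tool.

```python
def normalized_dsm(dsm, dtm):
    """Return nDSM = DSM - DTM cell by cell.

    A cell that is missing (None) in either input stays None in the output.
    """
    if len(dsm) != len(dtm) or any(len(a) != len(b) for a, b in zip(dsm, dtm)):
        raise ValueError("DSM and DTM must have the same shape")
    return [
        [None if s is None or t is None else s - t
         for s, t in zip(dsm_row, dtm_row)]
        for dsm_row, dtm_row in zip(dsm, dtm)
    ]
```

A roof at 12.0 m over ground at 10.0 m becomes 2.0 m in the nDSM, which is what removes the terrain from the network's input.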

Once the nDSM was ready, we used the Export Training Data for Deep Learning geoprocessing tool to create the training set from the nDSM and roof segment polygons which were manually digitized before by Miami-Dade GIS engineers. Although the number of unique training tiles was just about 18,000, the data augmentation and additional pseudo-color conversion allowed the Mask R-CNN to achieve impressive results, significantly improving the efficiency of the traditional workflow (you can read more about data augmentation and pseudo-color conversion in the second article mentioned above).

After the Mask R-CNN results were imported back into ArcGIS Pro, the only additional tool we needed in order to get back to the traditional workflow was Regularize Building Footprint, which geometrically transforms the predicted roof segment polygons so that they have the right and diagonal angles typical of man-made structures.

The results of the Procedural Rules-based extrusion, which followed the regularization, can be seen in the live 3D WebScene below. It is important to emphasize that no manual edits were made to any of the input or intermediate data, nor to the final building shells. Equally important, the area covered in the WebScene belongs to the so-called Test region, i.e. the region that the Mask R-CNN model did not see during training.

Resulting schematic building models: https://arcg.is/1jvDO00

As you may have noticed, these models are not always perfect, but this is a huge jump in productivity and a reduction in cost compared with the baseline workflow, where manual labor was required at the DSM level: now the GIS engineers can fine-tune the proposed 3D models as needed, rather than manually digitizing every single roof segment.

And, as it often happens in the world of Deep Learning, you can always make the Mask R-CNN predictions better by bringing more training samples.

Semantic Segmentation in Raw Point Clouds

In the previous example with rasterized point clouds we were bound to a predefined workflow for multiple reasons, one being the need for an apples-to-apples comparison before and after introducing Deep Learning into the process. But can we perform a similar experiment on the raw point cloud itself, without a preliminary conversion into a DSM? And can we find a similar workflow that gives us a good baseline to compare against and improve upon?

It turns out we have one: there is another well-established process of reconstructing 3D building models from raw point clouds using RANSAC algorithms. The whole workflow, step-by-step, if performed in ArcGIS Pro, looks like this:

First, we assign appropriate labels to the points which statistically look like ground (class 2), and building rooftops (class 6):

1. ClassifyLASGround, if ground has not already been classified.

2. ClassifyLASBuilding.

Next, we rasterize the points of class 6 (Buildings) and vectorize the result into polygons. Then, we apply the building footprint regularization algorithm to fit the most suitable shapes, with right and diagonal angles, to the initial polygons:

3. LASPointStatisticsAsRaster with LAS layer filtered on class 6 (Buildings), and using the ‘Most Frequent Class Code’ option.

4. RasterToPolygon with the Simplify Polygons option turned off.

5. EliminatePolygonPart to remove small holes (could alternately be done by morphological manipulations on the step#3 raster).

6. RegularizeBuildingFootprint to straighten things out.

Finally, we calculate the local terrain raster using points of class 2 (Ground), and run RANSAC to build three-dimensional shells from the building footprints, the DEM, and the class 6 (Building) points:

7. LASDatasetToRaster with input LAS layer filtered on class 2 (Ground) points to make DEM.

8. LASBuildingMultipatch to build the actual shells.

RANSAC reconstructed building shells: not so manual-edits-friendly due to a large number of vertices; sensitive to noise in the original point cloud and misclassification of building points.
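For readers unfamiliar with RANSAC, here is a minimal, generic sketch of fitting a single plane (for example one roof face) to noisy points in plain Python. It illustrates the general technique only and makes no claim about how the LASBuildingMultipatch tool works internally; the function names, the 0.1 distance threshold, and the iteration count are assumptions.

```python
import random


def plane_from_points(p1, p2, p3):
    """Return (a, b, c, d) with unit normal for the plane through three points.

    Returns None when the points are (nearly) collinear.
    """
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    a = uy * vz - uz * vy
    b = uz * vx - ux * vz
    c = ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    d = -(a * p1[0] + b * p1[1] + c * p1[2])
    return a, b, c, d


def ransac_plane(points, threshold=0.1, iterations=200, seed=0):
    """Find the plane supported by the largest number of points.

    Repeatedly fits a plane to three random points and keeps the candidate
    with the most inliers (points within `threshold` of the plane).
    """
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [
            p for p in points
            if abs(a * p[0] + b * p[1] + c * p[2] + d) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Because every candidate plane is scored against the points it is given, misclassified tree or ground points that land in the building class can pull in spurious planes, which is one way the noise discussed below can arise.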

At first glance, the results look pretty good. But once we examine the resulting building shells more closely, we find a fairly high level of noise and a huge number of minuscule triangles contributing to each shell, which together make such building models unsuitable for further manual fine-tuning and editing.

But where does this triangle noise come from? Some of it can be attributed to uneven point cloud density and LiDAR scanner sensitivity, and some to the misclassification of points at steps #1 and #2 above, particularly inside the Classify LAS Buildings tool.

The Classify LAS Buildings tool has to perform the fairly sophisticated task of figuring out which points of the point cloud look like reflections from a building, and which don’t. In a perfect world, with a flat-roofed, four-walled cubic structure on perfectly flat ground, this is easy… but in the real world, buildings often have complex roof shapes, dormers, chimney pipes, staged floors, etc., with nearby trees overshadowing parts of the structure, and all of this sits on complex, step-like terrain.

In such complex environments, it is no wonder that the Classify LAS Buildings tool often misses building points or mixes them with points that in reality belong to trees, bushes, ground, cars parked nearby, etc.

Meanwhile, raw LiDAR point clouds contain billions of points which need to be classified this way. Altogether it sounds like a perfect task for a neural network to take care of, doesn’t it? Or, to be more precise and practical: can we train a Deep Learning model to label an unclassified point cloud more efficiently than existing deterministic algorithms?

From a pure X-Y-Z point cloud to classes of objects… Is this possible with Deep Learning?

Well, there are some serious challenges when we talk about Deep Learning in such sparse and unordered spaces as point clouds — we just cannot apply traditional and well-known Convolutional Neural Networks to them:

…point clouds are irregular and unordered, thus directly convolving kernels against features associated with the points, will result in desertion of shape information and variance to point ordering.
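The "variance to point ordering" mentioned in the quote can be seen in a toy example. The sketch below, in plain Python with made-up numbers, applies a fixed kernel to neighbor features in storage order and compares it with a symmetric aggregation (a maximum, in the style of PointNet). The maximum is shown only as a contrast; PointCNN itself takes a different route and learns a transformation of the neighborhood before convolving.

```python
def ordered_kernel_response(features, kernel):
    """Dot product of a kernel with features taken in their storage order."""
    return sum(w * f for w, f in zip(kernel, features))


def symmetric_response(features):
    """Order-independent aggregation: the maximum over the neighborhood."""
    return max(features)


kernel = [0.5, -1.0, 2.0]
neighborhood = [1.0, 2.0, 3.0]           # per-point features, one ordering
same_points_reordered = [3.0, 2.0, 1.0]  # the same points stored differently

# The kernel response changes with storage order (4.5 versus 1.5) ...
print(ordered_kernel_response(neighborhood, kernel),
      ordered_kernel_response(same_points_reordered, kernel))
# ... while the symmetric aggregation does not (3.0 in both cases).
print(symmetric_response(neighborhood), symmetric_response(same_points_reordered))
```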

Nevertheless, although much less explored than the traditional computer vision domain, Deep Learning analysis of point clouds benefits from the recent explosive growth of robotics, self-driving cars, and SLAM in general, where LiDAR sensors play a key role. Another helping hand comes from Graph Convolutions, which were designed to work on graph-like data structures such as social networks; under some conditions a point cloud can be reduced to fit such structures.

After researching recent publications on the subject, we chose a PointCNN implementation for the experiments, as it promised up to state-of-the-art results on many common benchmarks.

New post on PointCNN: replacing 50,000 man hours with AI.

The core idea behind PointCNN revolves around the multi-pass application of Multilayer Perceptrons on local fixed-size neighborhoods inside the point cloud to elevate the initially sparse features into a dense latent feature-space where traditional convolutions can be used for further processing.
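The "local fixed-size neighborhoods" can be pictured as the k nearest neighbors of a representative point, expressed relative to that point. Here is a brute-force sketch in plain Python. It covers only the neighborhood-gathering step, not the perceptrons or convolutions, and the function names are illustrative assumptions rather than the PointCNN code.

```python
import heapq


def k_nearest_neighbors(points, query_index, k):
    """Indices of the k points closest to points[query_index], nearest first."""
    qx, qy, qz = points[query_index]

    def squared_distance(i):
        x, y, z = points[i]
        return (x - qx) ** 2 + (y - qy) ** 2 + (z - qz) ** 2

    candidates = (i for i in range(len(points)) if i != query_index)
    return heapq.nsmallest(k, candidates, key=squared_distance)


def local_neighborhood(points, query_index, k):
    """Neighbor coordinates relative to the query point (a fixed-size patch)."""
    qx, qy, qz = points[query_index]
    return [
        (points[i][0] - qx, points[i][1] - qy, points[i][2] - qz)
        for i in k_nearest_neighbors(points, query_index, k)
    ]
```

On clouds with billions of points a spatial index would replace the brute-force search; the sketch keeps it simple on purpose.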

While being quite compact, with just about 3.5M trainable parameters, the PointCNN model reached 0.97 accuracy on the validation set after just 6.5 hours of training on a single NVIDIA QUADRO GV100 with 32GB of GPU RAM. The training set was constructed from a subset (covering Amsterdam only, 1.8B points total) of the Netherlands open LiDAR dataset with about 18 points per square meter average density. Testing was performed on the nearby city of Utrecht, from the same data-source.

Although we do not know how the Netherlands point cloud was initially classified (by an algorithm, or by manual labor?), the PointCNN model we trained on it showed impressive results when discriminating between building and non-building classes on the test set, in most cases surpassing the results of the traditional Classify LAS Buildings tool.

It is also quite fascinating that PointCNN was trained on X-Y-Zs only (no Intensity, RGB, or any other attributes), meaning that the model was able to effectively learn properties of the spatial distributions specific to different classes of objects and, just as importantly, the distinct boundaries between the classes.

A block in the city of Utrecht (test set) with lots of vegetation inside, which partially covers the main structure and significantly covers the low-height sheds of the inner courtyard.
Classify LAS Buildings tool’s building points. Note the missing building portion on the south side.
PointCNN model’s building points. Note the much larger number of shed points picked up properly despite the overhanging trees.
The point cloud inside the east side of the courtyard after being classified by PointCNN (Ground, Building, Vegetation classes are shown). Note how complex the environment is, with tree canopies overshadowing the sheds and buildings in direct proximity.

Here is another surprising example of the PointCNN model labeling a windmill from the Utrecht test set. We are not sure whether there were any windmills in the training set, but even if there were some, their number is negligible compared to other building types. In other words, from the classification standpoint, there is a huge class imbalance which would puzzle a traditional classifier like Mask R-CNN. This remarkable case demonstrates the ability of PointCNN to learn properties of complex spatial distributions specific to general man-made objects of a particular size and proportions (e.g. cars in the test set were correctly excluded from the building class as not being tall enough) and, apparently, a windmill meets the learned criterion.

A windmill in Utrecht, accurately segmented by PointCNN.

A good example of how the PointCNN model relies on the height of a point neighborhood, and on how much it resembles a near-vertical plane, can be seen in the partially misclassified tall ship below, from the Utrecht test set:

A partially misclassified tall ship in Utrecht giving a hint about internal PointCNN criterion of the importance of planar point neighborhoods and their height above the ground.

Well, now that we have successfully trained a PointCNN model and used it to label the Ground and Building points of the Utrecht test point cloud, how does it all affect the resulting building model reconstruction? Here is the answer:

RANSAC building shells from the same point cloud labeled by Classify LAS Buildings, and by PointCNN.

The above animation shows the results of two identical workflows (the steps described at the beginning of this chapter) performed on the same point cloud. The only difference is in steps #1 and #2: in the first case the Building and Ground points were labeled by the traditional deterministic algorithm, while in the second case they were labeled by the PointCNN model. As you can see, the latter building shells have a much lower level of noise, particularly where vegetation is close to the buildings.

3D Meshes

It looks like we have some encouraging results with PointCNN and LiDAR point clouds, doesn’t it? But there is another source of raw 3D data mentioned earlier, which comes from Structure-From-Motion algorithms, and which is actively growing in popularity because it is much cheaper to acquire. Another significant advantage of this data source is that it comes with high-resolution RGB textures precisely attached to the triangulated mesh right out of the box! So, the question is: can we extend our previous success to three-dimensional continuous meshes?

Semantic Segmentation in 3D Meshes

The main problem with a raw 3D mesh is that it represents a continuous triangulated surface: millions of triangles connected with one another. To a human, especially when high-resolution RGB textures are applied to the mesh, it is clear which triangles belong to buildings, which to the ground, trees, light posts, cars, etc. But these attributes are not associated with the triangular faces coming out of the Structure-From-Motion pipeline, which makes raw meshes useless for tasks such as estimating the square footage of buildings. Again, either a sophisticated deterministic algorithm or, most often, manual segmentation is employed to solve this problem.

Nevertheless, it looks like we have a very good chance to succeed here by using the same PointCNN! The idea is quite simple: sample the mesh with a fixed distribution (even Monte Carlo seems to work) to produce a synthetic point cloud, and then just ask PointCNN to label it. Finally, apply the resulting labels back to the triangular faces of the source mesh, and that’s it: we get a segmented 3D mesh!
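The sampling and label-transfer steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: faces are picked with probability proportional to their area, a point is drawn uniformly inside the chosen triangle, and each face finally takes the majority label of the points sampled from it. The function names are hypothetical, and the classifier that produces the per-point labels is left out.

```python
import random
from collections import Counter


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the length of the edge cross product."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * (cx * cx + cy * cy + cz * cz) ** 0.5


def sample_mesh(vertices, faces, n_points, seed=0):
    """Monte Carlo sampling of a mesh surface.

    Returns the sampled points and, for each point, the index of its source face.
    """
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    face_ids = rng.choices(range(len(faces)), weights=areas, k=n_points)
    points = []
    for fid in face_ids:
        a, b, c = (vertices[i] for i in faces[fid])
        # Square-root trick: uniform barycentric weights inside the triangle.
        s, r = rng.random() ** 0.5, rng.random()
        wa, wb, wc = 1.0 - s, s * (1.0 - r), s * r
        points.append(tuple(wa * a[i] + wb * b[i] + wc * c[i] for i in range(3)))
    return points, face_ids


def labels_to_faces(face_ids, point_labels, n_faces, default=None):
    """Assign each face the majority label of the points sampled from it."""
    votes = [Counter() for _ in range(n_faces)]
    for fid, label in zip(face_ids, point_labels):
        votes[fid][label] += 1
    return [v.most_common(1)[0][0] if v else default for v in votes]
```

Keeping the face index next to every sampled point is what makes the final step trivial: once the points are labeled, the labels can be voted back onto the triangles without any spatial search.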

One may say that it sounds too good to work in practice, but here are some remarkable examples from a synthetic point cloud, which was produced from a mesh by using Monte Carlo point sampling:

Synthetic point clouds sampled from a triangulated mesh and then segmented by a PointCNN model trained on Amsterdam LiDAR data.

Of course, the results are not ideal, but here is the most surprising part: the segmentation was performed by the same PointCNN model that was trained on the true Amsterdam LiDAR point cloud, which has a very different point density, a different distribution of points per class, and even a different architectural style (the mesh is from a city outside the Netherlands).

This is truly encouraging, because it strongly suggests that a PointCNN model trained on a synthetic point cloud in the first place will segment meshes sampled with the same sampling technique much more accurately. Moreover, synthetic point clouds sampled from triangulated meshes will have additional attributes, such as RGB and sampled face-normal vectors, which will greatly help PointCNN learn the proper segmentation rules.

Mask R-CNN and PointCNN: be careful with…

As often happens in the complex world of Deep Learning, both Mask R-CNN and PointCNN networks tend to learn some undesirable semantics from the training data, which leads to biases and makes the networks harder to transfer to other geographies or sensor models. This is their disadvantage compared to traditional deterministic algorithms.

We have completed a few experiments by moving the trained models to other data sources and geographies, and here is a short list of biases which are usually picked up by the two networks when used in the aforementioned workflows. It is not meant to be a complete list, but these are the most striking things to look out for:

Mask R-CNN sensitivity / bias:

  • Architectural styles.
  • LiDAR scanner: point density.

PointCNN sensitivity / bias:

  • LiDAR scanner: point density, intensity, RGB consistency.
  • Sampling technique when segmenting 3D Meshes.

And, as mentioned earlier, the best way to reduce biases in a neural network is to bring more training samples to the table, as long as the “mental capacity” (number of trainable parameters and architecture) of the network allows. Even synthetic data helps: for example, if we do not have enough LiDAR coverage for a given geography, architectural style, or sensor type, it is possible to build synthetic training samples using ArcGIS Pro and CityEngine’s procedurally generated 3D content, which will contain the valuable signal we are trying to teach the model to extract.

Voxels and future work

We are continuously experimenting with various Deep Learning architectures to find the best fit for different industries, use cases, and environments. Another exciting family of DL models, which we are actively exploring, works in voxel space. Here is an update from David Yu, Esri’s Data Scientist:

Whereas 3D scenes are normally created by stitching together disparate oblique views, one idea to explore is the possibility of generating 3D models from a single 2D image. This can be achieved, and has been tested with limited success, with DCGANs that take as input an embedding produced by a Variational Autoencoder. Such an approach requires a unique model to be fitted for every class of 3D objects (e.g. cars, trees, lamp posts, fences, etc.) in order for the output to have sufficient variation while preserving the fidelity of the general form of the class. By this method, a 3D DCGAN combined with a latent vector from an overhead shot is sufficient to recreate the object’s unique attributes in 3D.

A voxel representation of objects is chosen because, while it is possible to represent the output of generative models as meshes (AtlasNet) or even point clouds (PC-GAN), it is intuitive to extend the original GAN network to produce 3D grid outputs (voxels) without having to redesign the network. The high availability of voxelized training data from libraries such as ShapeNet makes this process easy and painless. Further, voxel shapes, being a collection of points without explicit coordinates, have certain advantages such as permutation invariance and memory efficiency (compared with meshes) when it comes to representing irregular and non-homogeneously filled objects. However, when it comes to simple objects or objects of higher resolution, voxel representations are very memory-intensive, which is why they are better suited for generating specific classes of objects rather than whole scenes.

In terms of architecture, this model effectively merges two well-known networks: a variational autoencoder (VAE) is used to generate the 1D embedding vector from an overhead shot of a 3D object. This latent vector is then concatenated with a noise prior to act as input for the GAN generator. The generator generates its own voxel model and passes it to the discriminator, which attempts to distinguish the generated inputs from the real ones and backpropagates the error to the generator. As of now, this architecture is capable of producing decent results, but future extensions could include incorporating color encoding into the voxel outputs as well as introducing normalizing flows into the noise prior in order to model a more complex distribution that is less likely to suffer from mode collapse.
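To give a feel for the voxel representation and its memory cost mentioned in the update above, here is a small sketch in plain Python that turns a point set into an occupancy grid and counts the cells of a dense grid. It is an illustration only and not part of the VAE/GAN model; the function names and the cubic bounding box are assumptions.

```python
def voxelize(points, resolution):
    """Return the set of occupied (i, j, k) cells in a resolution^3 grid.

    The grid spans a cube anchored at the minimum corner of the points, with
    a side equal to the largest extent of the points along any axis.
    """
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    extent = max(maxs[a] - mins[a] for a in range(3)) or 1.0
    occupied = set()
    for p in points:
        cell = tuple(
            min(int((p[a] - mins[a]) / extent * resolution), resolution - 1)
            for a in range(3)
        )
        occupied.add(cell)
    return occupied


def dense_grid_cells(resolution):
    """Number of cells a dense voxel grid stores, occupied or not."""
    return resolution ** 3
```

Doubling the resolution multiplies the dense cell count by eight (32,768 cells at 32 per side, 262,144 at 64), which is the cubic growth behind the memory remark.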

And here is a list of other connected initiatives we are currently working on. We will keep you updated on the progress in future posts.

  • PointCNN has great potential to work not just on building classification, but also on other, more complex point classification tasks such as labeling powerlines and associated devices, railroad equipment, telecom devices inside tunnels, etc. In other words, in areas where the use of deterministic labeling algorithms is limited, or where such algorithms don’t exist at all.
  • The Mask R-CNN approach needs lots of training samples. We are working on a CityEngine script which will help with the automation of synthetic training samples creation.
  • Building Footprint Regularization tool will be improved.
  • Simplification tools for RANSAC-reconstructed shells.

Thank you for reading such a long post, we hope you have enjoyed it! :)



Dmitry Kudinov

Senior Principal Data Scientist at Esri Inc. Research of AI applications in remote sensing and transportation.