Segmentation Pipeline for 3D Tiles

Published in d*classified · 10 min read · Oct 3, 2023

Dylan Chua & Anne Lee developed a pipeline to semantically segment models in GL Transmission Format (glTF) contained within 3D Tiles. The pipeline reads and traverses a 3D Tileset to output a transformed tileset of segmented objects enriched with metadata. The project delivered a minimum viable product for a 3D semantic segmenter, serving as a proof-of-concept with varied applications. They were mentored by How Chang Hong, Principal Engineer, Simulation and Training Systems Hub.

Motivation

Contemporary 3D representations of the real world are often handmade and expensive. High-fidelity models also demand long production times, so a model may already be outdated by the time it is completed. 3D representations of Singapore currently exist in the form of OneMap, “the authoritative national map of Singapore with the most detailed and timely updated information developed by the Singapore Land Authority”. Although the models in OneMap are clean and elaborate, they lack trees and terrain elevation, and infrastructure such as bridges is absent entirely.

Extracted from OneMap

With the advent of satellite photogrammetry, photorealistic 3D models of the Earth can be built swiftly, at a lower labour and resource cost. The shortcomings above could be resolved with the latent power of these 3D Tiles, which require little time to build, so digital twins can accordingly be kept up to date. The automatic nature of creating these 3D Tiles, however, results in a noisy amalgamation of meshes with little metadata. Herein lies the challenge: to develop a method to process and classify this data.

Existing Work

In the field of 3D segmentation, existing segmentation methods can be broadly categorised under:

1. Multi-view: render 2D images of the object at different angles
2. Volumetric: transform into binary voxels
3. Point clouds: using the vertices
4. Mesh-based: using the mesh primitives

Of the four aforementioned, point cloud-based techniques have the greatest depth of research. However, this route was not explored due to two main considerations:

1. Large-Scale Urban Meshes
Most of the existing literature in the field of 3D semantic segmentation has focused on the smaller scale, such as distinguishing features like a bottle cap from its body or identifying types of furniture in a room. However, the type of data given for this project demanded a solution tailored to the unique characteristics of large-scale urban meshes.

2. Data Limitations
The state-of-the-art 3D segmentation systems for the urban-scale have often relied on rich point clouds, frequently generated using 3D laser-scanning technologies like Mobile Laser Scanning, Airborne Laser Scanning, or LiDAR. However, the data given was generated through photogrammetry. Hence, the meshes provided lacked the level of detail required to produce robust point clouds. This limitation is evident when comparing images from OneMap and Google’s Map Tiles, where the boundaries between vegetation and buildings tend to blend together, and artifacts such as vehicles persist in the data.

Ergo, novel means of segmentation were explored (covered below in the Approaches section).

Overall Pipeline

In essence, the pipeline consumes the root tileset JSON and subsequently traverses the tileset tree. Each 3D model, a binary glTF (.glb) file, is segmented and its vertices classified. Metadata encoding the classes can be stored either in the tileset JSON or within the .glb files.

Approaches

We explored two approaches for segmenting the 3D models:

  1. Using the texture image
  2. Mesh-based segmentation

Image Segmentation

The photogrammetry 3D Tileset to be converted contains .glb files with photorealistic textures attached. Research into 2D image segmentation is diverse and there exists a multitude of well-trained computer vision models, especially on cityscapes. Consequently, semantic segmentation can potentially be performed on these textures.

Sample 3D model

Lamentably, owing to the limited duration of the project, a computer vision model was not trained on the data. Notwithstanding, a pre-trained semantic segmentation model built on Facebook’s Mask2Former was utilised. With a reported mean Intersection over Union (mIoU) of 57.7, the model could label regions as “roads”, “buildings” and “vegetation”, for instance. Although the model was trained on cityscapes, there is sufficient similarity to textures derived from satellite imagery. Results of the image segmentation model are displayed below, where orange indicates buildings while purple is vegetation. Visual inspection reveals adequate performance from the segmentation, separating obvious areas of buildings and vegetation. Nonetheless, there are pockets of the texture that are incorrectly labelled; this could definitely be improved by training a model specifically on these textures.

Semantic Segmentation performed on the texture of the 3D Model. Orange: Building; Purple: Vegetation
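As a rough sketch of this step, a pre-trained Mask2Former checkpoint from Hugging Face can be run over a texture image as follows. The specific checkpoint named below (a Cityscapes semantic model) is an assumption for illustration and not necessarily the exact model used in the project.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Assumed checkpoint: a Mask2Former model fine-tuned for Cityscapes semantic segmentation
CHECKPOINT = "facebook/mask2former-swin-large-cityscapes-semantic"
processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
model = Mask2FormerForUniversalSegmentation.from_pretrained(CHECKPOINT)

texture = Image.open("texture.png").convert("RGB")   # texture image taken from the .glb
inputs = processor(images=texture, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-pixel class IDs (road, building, vegetation, ...) resized back to the texture size
mask = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[texture.size[::-1]]
)[0]
```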

Mapping to 3D

To translate the 2D segmentation into 3D, the .glb file has to be loaded to retrieve its vertices, faces and texture coordinates. High-level libraries such as trimesh can obtain these easily; however, this incurs a data compression penalty. Each vertex has a respective UV coordinate, a 2D vector of two numbers from 0 to 1, corresponding to a pixel on the texture. Thence, the vertices can be classified by their respective value on the mask.
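A minimal sketch of this mapping with trimesh, assuming `mask` is the 2D array of per-pixel class IDs produced earlier; whether the V axis needs flipping depends on the loader’s UV convention.

```python
import numpy as np
import trimesh

# Load the binary glTF and flatten the scene into a single mesh
mesh = trimesh.load("tile.glb", force="mesh")

uv = np.asarray(mesh.visual.uv)      # (N, 2) per-vertex UV coordinates in [0, 1]
mask_np = np.asarray(mask)           # (H, W) per-pixel class IDs from the texture
h, w = mask_np.shape

# Convert UVs to pixel indices; flip V here if the loader uses a bottom-left origin
px = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
py = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)

vertex_classes = mask_np[py, px]     # one class ID per vertex
```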

Semantic Urban Mesh Segmentation (SUMS) by TUDelft

Another method explored employed a lesser-known segmentation tool made by TUDelft. The main difference from the aforementioned method is that, in addition to performing image segmentation on the texture image, it also takes mesh features (e.g. geometric and contextual information) into account. SUMS was built by the 3D geoinformation research group at Delft University of Technology (TUDelft). It is an open-source program that allows for the automatic semantic segmentation of large-scale urban meshes. Their GitHub repository is here.

SUMS by TUDelft

It is primarily implemented in C++ and utilizes open-source libraries like CGAL and Easy3D.

Why SUMS
It provides a pre-trained semantic segmentation model trained on a meticulously annotated mesh dataset. This dataset spans 19 million triangles, covering a 4 km² area of Helsinki, and encompasses six object classes commonly found in urban environments: terrain, high vegetation, building, water, vehicle, and boat. It is important to acknowledge the degree of similarity between this dataset and the provided data, as the closer the training data aligns with the given data, the higher the likelihood of improved model performance. SUMS also offers the potential for model refinement through the use of a mesh annotation tool.

SUMS — under the hood

The technicalities and details of how it works can be read in their paper here. In essence, it over-segments a mesh to group triangles, identifies planar (i.e. flat) segments, and extracts features. These features are then fed into a random forest classifier for mesh segmentation.
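To give a flavour of the final step only — this is not SUMS itself — feeding per-segment features into a random forest with scikit-learn might look like the sketch below. The feature and label arrays are synthetic placeholders; the features SUMS actually computes are described in its paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: each row describes one over-segmented region with hand-crafted
# features (e.g. planarity, average height, mean texture colour)
rng = np.random.default_rng(0)
segment_features = rng.random((500, 12))        # 500 segments, 12 features each
segment_labels = rng.integers(0, 6, size=500)   # 6 urban classes, as in SUMS

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(segment_features, segment_labels)

# Every triangle in a segment then inherits the class predicted for that segment
predicted = clf.predict(segment_features)
```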

On Given Data

The provided data was initially in .glb format, but it needed to be converted into .ply format for compatibility with SUMS.

File conversion illustrated
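A conversion along these lines can be done with trimesh, though the exact PLY properties SUMS expects (for example per-face texture coordinates and texture file references) may need additional handling beyond this sketch.

```python
import trimesh

# Load the photogrammetry tile and flatten it into a single mesh
mesh = trimesh.load("tile.glb", force="mesh")

# Write the mesh out as PLY for SUMS to consume
mesh.export("tile.ply")
```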

The labelled output appears as shown below:

Conversion process from original .glb file to converted .ply file and colour-coded output

As SUMS was trained on large-scale urban meshes, it was also tested on a higher-level mesh.

SUMS detection on low- and high-level meshes

A table comparing SUMS’s performance on both meshes is shown below.

It performed about 4.5x better when applied to high-level meshes compared to low-level meshes, primarily attributable to the high-level mesh’s resemblance to the data upon which SUMS was trained. In these output representations, every triangular face has been colour-coded to denote its respective class. Each triangular face comprises three vertices, each with its own set of coordinates in 3D space, and each vertex is labelled to signify its associated class.

Metadata

The resulting metadata can be stored either in the 3D model itself or in the JSON file encompassing it. Hierarchical storage is represented in the diagram below. Feature and vertex markings must live inside the .glb file, whereas higher-level information belongs in the 3D Tileset JSON file.

Figure illustrating the 3D Tiles structure with metadata at varying levels, by Cesium

Within .glb File

Through a glTF extension, EXT_mesh_features, per-vertex features can be included in the .glb file itself. Appendix C demonstrates how the classes of the vertices are packed into bytes and can be retrieved by the loader.
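As a simplified sketch of the idea (not the project’s exact writer), the per-vertex class IDs can be packed into a compact byte buffer that backs a `_FEATURE_ID_0` vertex attribute, with `EXT_mesh_features` on the primitive pointing at that attribute.

```python
import numpy as np

# vertex_classes: one small integer class ID per vertex, from the earlier steps
feature_ids = np.asarray(vertex_classes, dtype=np.uint8)

# Bytes appended to the glTF binary buffer; a new accessor (componentType
# UNSIGNED_BYTE, type SCALAR) over these bytes is referenced by the primitive
# as the "_FEATURE_ID_0" attribute. A real writer must also respect glTF's
# 4-byte alignment rules for vertex attributes.
feature_bytes = feature_ids.tobytes()

# Extension object attached to the mesh primitive: featureCount is the number of
# distinct classes, and "attribute": 0 points at _FEATURE_ID_0
ext_mesh_features = {
    "featureIds": [
        {"featureCount": int(feature_ids.max()) + 1, "attribute": 0}
    ]
}
```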

Per-vertex semantic metadata allows for ease of visualisation. A custom shader can be employed, imparting a different appearance, such as colour, based on the class of each vertex. Further details are expanded on later in the Visualisation section.

Within 3D Tileset JSON

To segment the tileset, each 3D model is split into multiple .glb files by class. With the classes of each vertex known, faces of the model can consequently be selected and extracted. How a face is classified depends on how the classes of its vertices are treated. If a face is assigned to every class found among its vertices, all faces are guaranteed to be grouped, but overlap between the sub-meshes is probable. If instead a face is assigned only to a class common to all its vertices, overlap is prevented, but unclassified faces could result (the stricter strategy is sketched after the figure below).

Separation of the 3D model by vertex class
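A minimal sketch of the stricter, overlap-free strategy with trimesh, assuming `vertex_classes` holds one class ID per vertex:

```python
import numpy as np
import trimesh

def split_by_class(mesh, vertex_classes):
    """Split a mesh into non-overlapping sub-meshes, one per class.

    A face is kept for a class only when all three of its vertices carry that
    class, so sub-meshes never overlap but some faces may stay unclassified.
    """
    parts = {}
    face_classes = vertex_classes[mesh.faces]          # (F, 3): class of each corner
    for cid in np.unique(vertex_classes):
        keep = np.where((face_classes == cid).all(axis=1))[0]
        if len(keep):
            parts[cid] = mesh.submesh([keep], append=True)
    return parts

# Each sub-mesh can then be exported as its own .glb, labelled with its class:
# for cid, part in split_by_class(mesh, vertex_classes).items():
#     part.export(f"tile_class_{cid}.glb")
```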

As one .glb model is now split into multiple files, the tile must carry multiple contents, each labelled with the class it belongs to. Semantic metadata and multiple contents per tile are newly supported in 3D Tiles version 1.1, whilst the given tileset is still on version 1.0. Hence, a 3D Tiles reader and exporter, based on py3dtiles, was required. Splitting into multiple contents enables manipulation of the segmented models: it permits selecting specific classes to render and interacting with their meshes. For example, vegetation could be switched off, removing the categorised meshes from the scene.

Output

To move from a single .glb file to the entire tileset, the tileset tree needs to be traversed to locate the .glb files to segment. As 3D Tiles content URIs may be relative to the tileset JSON file, the traverser accounted for relative paths when searching for child tiles and .glb files. Only a subset of the data was processed, considering the large size and numerous 3D files: extraction of a particular region of interest reduced the workload from over 75,500 to a mere 174 3D models to segment.

Extracted region of 3D tiles
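A traversal along these lines (a simplified sketch rather than the project’s py3dtiles-based reader) resolves each content URI relative to the tileset JSON that declares it and recurses into external tilesets:

```python
import json
from pathlib import Path

def iter_glb_paths(tileset_json):
    """Yield paths to .glb content found by walking a 3D Tiles tileset tree."""
    tileset_json = Path(tileset_json)
    tileset = json.loads(tileset_json.read_text())

    def walk(tile):
        content = tile.get("content", {})
        uri = content.get("uri") or content.get("url")   # older 1.0 tilesets use "url"
        if uri:
            path = tileset_json.parent / uri             # URIs are relative to this JSON
            if path.suffix == ".glb":
                yield path
            elif path.suffix == ".json":                 # external tileset: recurse
                yield from iter_glb_paths(path)
        for child in tile.get("children", []):
            yield from walk(child)

    yield from walk(tileset["root"])

# Example: collect the models under a region of interest before segmenting them
# glb_files = list(iter_glb_paths("data/tileset.json"))
```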

Running on 174 .glb files, the pipeline took just under 5 minutes, averaging around 1.5 seconds per file. The majority of the computation time was expended by the machine learning models: the first approach took approximately 1 second per texture image, while the second took approximately 5 seconds per mesh and its corresponding texture images.

Visualisation

Cesium, the platform which introduced 3D Tiles, also developed libraries to visualise 3D Tiles. In particular, we concentrated on CesiumJS for the web and Cesium for Unity. In both scenarios, a local server provides the tileset data. A Node.js web app with Express was set up with the path to the data as a static directory, allowing the app to host the files.

CesiumJS

CesiumJS is an open-source JavaScript library provided by Cesium, designed for handling massive datasets and widely used to create interactive web apps for sharing geospatial data.

The Node.js app simultaneously serves the HTML and JavaScript files that visualise the data within the browser, employing CesiumJS to deliver a browser-based visualiser.

Visualisation of the segmented 3D Tiles is accomplished by feeding CesiumJS a custom shader that reads the class ID of each vertex and gives each class a unique appearance. This shader can be toggled to display either the original textures or the colours of the classes.

A custom shader distinguishing between buildings (blue) and vegetation (green) in CesiumJS

Cesium’s Unity plugin was instrumental in rendering the 3D tiles within Unity, enabling interactivity and the creation of an immersive VR experience. The detailed process of accomplishing this can be found in the project’s GitHub repository. To navigate and explore the map, a humanoid avatar was incorporated into the game, following the instructions provided in this tutorial. Control of the avatar is achieved using the WASD keys.

A humanoid avatar walking within the 3D Tiles, in Bishan, Singapore

Metadata can be selected and viewed within Unity as well.

Example scene in Unity where Metadata is shown as text

Conclusion

Although 3D mesh segmentation for large-scale urban meshes is a relatively new and emerging field, two distinct methods to execute the segmentation task were explored: one reliant on texture image analysis and the other employing a mesh-based segmentation approach. These methods enable the grouping of vertices and faces within each mesh, facilitating the addition of valuable metadata for enhanced data organisation and analysis.

Anne, Chang Hong (mentor), and Dylan
