3D surface reconstruction on Compute over Data running Neuralangelo

Circum Protocol · Published in Frontier tech · 6 min read · Nov 30, 2023

About:

This is the first article in a series of technical tutorials about scaling 3D surface reconstruction jobs on top of Compute over Data (CoD) infrastructure. We will introduce the algorithms we are using, evaluate the performance and cost constraints of this application compared to traditional cloud-native applications, and explore how to scale by combining workflow orchestration tools to run large-scale training and point cloud generation.

Background Overview:

3D surface reconstruction is a rapidly growing field of interest. Its objective is to accurately map geographical areas and assets of various shapes and layouts, and then reconstruct them over time in the form of continuous point cloud data. Traditionally, this work has been carried out by surveying firms implementing ELT pipelines with geometric algorithms that:

  1. Collect data using LiDAR or high-resolution photogrammetry techniques.
  2. Transform the data using geospatial metadata by importing it into GIS tools such as CloudCompare and normalizing the output.
  3. Finally, run image-transform pipelines by applying geometric transforms (using GDAL library operations such as Delaunay triangulation) and chaining point cloud transformations (using PDAL); a minimal example of such a chain is sketched below.
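As a rough illustration of step 3, the sketch below uses PDAL's Python bindings to read a LAS file, build a Delaunay mesh, and write the result. The file names are placeholders, and this is only a minimal example of the kind of pipeline such firms chain together.

```python
import json
import pdal  # python-pdal bindings (pip install pdal)

# Hypothetical file names for illustration only.
pipeline_json = {
    "pipeline": [
        "survey_scan.las",               # reader inferred from the .las extension
        {"type": "filters.delaunay"},    # Delaunay triangulation over the points
        {
            "type": "writers.ply",       # write the triangulated mesh
            "filename": "survey_mesh.ply",
            "faces": True,
        },
    ]
}

pipeline = pdal.Pipeline(json.dumps(pipeline_json))
point_count = pipeline.execute()         # run the chain; returns the number of points processed
print(f"processed {point_count} points")
```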

However, these geometric reconstruction techniques had the following limitations:

  • They were not efficient at reconstructing heterogeneous and diverse object surfaces, such as a gallery containing multiple objects, or at capturing fine details on those objects.
  • They required manual tweaking each time to construct edges, overlapping surfaces, textures, and other details in traditional GIS software, making the process cost-intensive.

Thus, in 2020, a research paper was published introducing the neural radiance field (NeRF) architecture for synthesizing photorealistic 3D representations of a given region using images/videos as inputs. This modeling technique was further enhanced by the NVIDIA team with Neuralangelo, which improved neural surface reconstruction by computing higher-order derivatives to represent complex surfaces. As a result, it became possible to reconstruct detailed 3D point cloud maps from commercial footage and the corresponding calibration data.
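As a one-line intuition for the "higher-order derivatives" point: Neuralangelo replaces analytical gradients of the hash-encoded signed distance function f with central finite differences whose step size is annealed from coarse to fine, which also keeps the second-order terms used for curvature regularization well behaved across hash-grid cells. A simplified form of the numerical gradient along axis e_i is:

```latex
\nabla_{x_i} f(\mathbf{x}) \;\approx\;
\frac{f(\mathbf{x} + \epsilon\,\mathbf{e}_i) - f(\mathbf{x} - \epsilon\,\mathbf{e}_i)}{2\epsilon}
```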

Workflow of Neuralangelo (credits to Laurent for the simplified description).

Tutorial

Workflow diagram of the tutorial notebook.

Here, we will have three main components:

  1. Workflow orchestration platform (Flyte): Flyte allows the execution of stateless, asynchronous, and serverless services (referred to as agents) for running data/ML operations. It spawns pods for the corresponding operations and schedules jobs as separate workflows on the Bacalhau compute layer, based on the requirements and on complex dependency patterns between task outputs.
  2. Bacalhau network: We have already covered this Compute over Data (CoD) platform in previous articles. In this application, we invoke the compute infrastructure through the Flyte agent plugin, which wraps the calls on top of the Bacalhau infrastructure and lets developers launch multiple workflows in parallel on Bacalhau. With concepts such as lazy execution, it also allows custom conditions so that downstream jobs wait on the outputs of upstream pipelines, based on the given function mapping. A minimal sketch of this wrapping pattern follows the list.
  3. Storage and containers: Here, we mount S3 storage onto the Bacalhau container, demonstrating the benefit of running ML ops on private data and storing the results back in the same source.
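To make the wiring concrete, below is a minimal, illustrative sketch of a Flyte task that shells out to the Bacalhau CLI to submit a containerized job with an S3 input mount. The container image, bucket path, and CLI flags are assumptions for illustration (flag syntax differs between Bacalhau versions; check `bacalhau docker run --help`), and the actual tutorial relies on the Flyte Bacalhau agent plugin rather than this manual shell-out.

```python
import subprocess

from flytekit import task, workflow


@task
def submit_bacalhau_job(s3_input: str) -> str:
    """Submit a containerized preprocessing job to the Bacalhau network and return its job ID.

    NOTE: the image, mount target, and flags below are illustrative placeholders;
    verify the exact syntax for your Bacalhau version.
    """
    cmd = [
        "bacalhau", "docker", "run",
        "--id-only",                        # print only the job ID (assumed flag)
        "--input", f"{s3_input}:/inputs",   # mount the S3 prefix into the container (assumed syntax)
        "ghcr.io/example/neuralangelo:latest",   # hypothetical image
        "--", "python", "preprocess.py", "--data", "/inputs",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


@workflow
def demo_wf(s3_input: str = "s3://my-bucket/scene-01") -> str:
    return submit_bacalhau_job(s3_input=s3_input)
```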

1. Workflow:

a. In our demo, we first install the necessary libraries and CLI tools for importing the datasets and integrating Flyte (a DAG-based job-running framework that integrates with Kubernetes to run machine learning and data engineering jobs concurrently).

b. Users will then need to integrate the dataset they want to process, following the format defined in the repository. There are three ways to do this:

  • Host the dataset in the /content/dataset directory using the gdown <video-id> command. This method stores the images along with the calibration data, but it is not recommended due to size constraints.
  • Plug external storage into the Docker mounts.
  • Set up the AWS CLI to enable S3 integration (the approach we use in this tutorial); a short staging example follows this list.
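As a quick illustration of the S3 option, the snippet below stages a local dataset directory into a bucket with boto3. The bucket name and prefix are placeholders, and the same result can be achieved with the AWS CLI's sync command.

```python
from pathlib import Path

import boto3

# Hypothetical bucket and prefix; replace with your own.
BUCKET = "my-reconstruction-datasets"
PREFIX = "scene-01/"

s3 = boto3.client("s3")
local_root = Path("/content/dataset")

# Upload every file under the dataset directory, preserving relative paths as S3 keys.
for path in local_root.rglob("*"):
    if path.is_file():
        key = PREFIX + path.relative_to(local_root).as_posix()
        s3.upload_file(str(path), BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")
```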

c. To set up Flyte, follow the instructions mentioned here. We recommend setting up a single compute cluster on the free instance and then connecting the infrastructure to the notebook using the flytectl command; a sketch of connecting from Python follows.
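For reference, once flytectl has written its config file, the notebook can talk to the cluster through flytekit's FlyteRemote. The project, domain, workflow name, and version below are placeholders (the project/domain values are flytekit's common defaults) and may differ in your setup.

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Reads the endpoint from the flytectl-generated config (e.g. ~/.flyte/config.yaml).
remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",   # placeholder project name
    default_domain="development",    # placeholder domain name
)

# Fetch a previously registered workflow and launch it (name and version are placeholders).
wf = remote.fetch_workflow(name="workflows.run_workflow", version="v1")
execution = remote.execute(wf, inputs={"video_path": "s3://my-bucket/scene-01.mp4"})
print(execution)
```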

d. Next, we define the script for running the chained jobs on the Flyte plugin for Bacalhau. As shown in the notebook, these jobs perform the following three steps (a minimal sketch of the chained workflow follows this list):

  • Preprocess the video and generate the configuration for model training.
  • Pass the outputs of the first task to the training script, which then trains the model.
  • Finally, use the trained model and the parameters (such as resolution and block spec) to run the extraction script and generate the corresponding point cloud file.
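The sketch below shows the shape of such a chained workflow in plain flytekit: each task's output feeds the next, so Flyte only schedules a step once its upstream dependency has finished. The task bodies are placeholders standing in for the Bacalhau-backed preprocessing, training, and extraction jobs, and the parameter names are illustrative rather than the notebook's exact signature.

```python
from flytekit import task, workflow


@task
def preprocess(video_path: str) -> str:
    """Extract frames and camera poses (e.g. via COLMAP) and emit a training config.

    Placeholder body: in the tutorial this step runs as a Bacalhau job and
    returns the URI of the preprocessed dataset.
    """
    return f"{video_path}.preprocessed"


@task
def train(dataset_uri: str) -> str:
    """Train the Neuralangelo model on the preprocessed dataset; returns a checkpoint URI."""
    return f"{dataset_uri}.checkpoint"


@task
def extract_mesh(checkpoint_uri: str, resolution: int, block_res: int) -> str:
    """Run iso-surface extraction from the trained model; returns the point cloud/mesh URI."""
    return f"{checkpoint_uri}.res{resolution}_block{block_res}.ply"


@workflow
def run_workflow(video_path: str, resolution: int = 2048, block_res: int = 128) -> str:
    dataset_uri = preprocess(video_path=video_path)      # step 1
    checkpoint_uri = train(dataset_uri=dataset_uri)      # step 2: waits on step 1's output
    return extract_mesh(                                 # step 3: waits on step 2's output
        checkpoint_uri=checkpoint_uri,
        resolution=resolution,
        block_res=block_res,
    )
```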

e. Register the workflow with the cluster using “pyflyte register” so that its tasks can be executed on the cluster and the results can be seen on the dashboard.

And that’s it! After running the function run_workflow with the appropriate input parameters, you will be able to track the jobs executed for each step of the training process on the Flyte console at https://localhost:8080.

Additionally, by using the --wandb option, you can monitor the training metrics on Wandb.

Benchmark performance:

Now that we have explained the benefits of combining a DAG-based job orchestration framework like Flyte with Bacalhau for running training jobs, let’s discuss the cost implications of three deployment architectures:

  1. Significantly decentralised architecture with publicly hosted Bacalhau nodes + IPFS-mounted storage + workflow orchestration using Flyte on EC2.

Here the cost is only for hosting the Flyte deployment (single cluster, as explained here):

  • EKS, 1 cluster: 73 USD
  • S3 object storage (around 600 GB): 15 USD
  • MySQL instance hosting (db.t3.small): 82 USD
  • AWS Lambda + API Gateway for serving requests: free tier
  • Storage on an IPFS private gateway service (such as web3.storage lite or business tier): 100 USD
  • Total: approximately 277 USD

  2. Running the Docker version of Bacalhau (here) on AWS Elastic Container Service (which instantiates the compute and requester nodes each time a compute task is launched), with S3 block storage:

  • Object storage: 20 USD
  • Amazon Fargate serverless compute, assuming 4-hour jobs on an 8-core, 16 GB GPU configuration: 1009 USD
  • With Spot capacity, given the long-running nature of the jobs (ref), the cost can be reduced by around 50%: roughly 500 USD

  3. Traditional cloud hosting of a private Bacalhau cluster + S3 storage with the Flyte deployment:

  • 2 EC2 xlarge instances (running the COLMAP reconstruction and training instances separately), with upfront costs of approx. 200 USD (156 USD with a 1-year upfront commitment)
  • S3 object storage: 30 USD
  • EKS cluster hosting: 177 USD
  • Total: 477 USD

The costs above are taken from the AWS pricing calculator and are indicative of a beta-service scenario running 50 training + mesh-generation jobs per day. The benefits and challenges are compared as follows:

Comparison of the various characteristics of the three deployment scenarios.

What’s Next

The above demo showcases the capabilities that can be achieved by using the framework built on top of Bacalhau to run geospatial reconstruction pipelines at scale.

In the coming tutorials, we will show how to build more complex pipelines by exploiting more advanced workflow and trigger patterns in the Flyte framework on top of a Bacalhau cluster, in order to create modified versions of NeRF algorithms that are tailored to user data and to crowdsourced calibration of geospatial mapping using ordinary phones as capture devices.

Stay tuned to our blog for updates and the beta release of our geospatial pipeline. Additionally, you can connect with us on our Discord group (here) to share your feedback and follow updates on the frameworks we are developing on our GitHub.

Credits

  1. Codebase of the Neuralangelo paper.
  2. My colleague Laurent, a researcher in the field of geospatial data.
  3. Bacalhau-Flyte integration tutorials.
  4. Expanso.io
