Monocular depth-sensing & point cloud from webcam feed using Unity Barracuda

Raju K
Published in XRPractices
Oct 11, 2020

Depth sensing is typically achieved using physical sensors such as ToF (Time of Flight) or LiDAR, where the depth details are delivered directly by the sensor. There are also photogrammetric approaches that use a pair of cameras, known as "stereo", and these come with processing overhead. We did see promising developments like "Google Tango", which used a stereo pair for depth sensing, but unfortunately Google decided to kill "Google Tango" in favor of ARCore.

Given all these dependencies on hardware and processing, there have been efforts to remove them and enable depth sensing from single images (monocular) using machine learning.

Depth sensing is a fundamental problem to be solved in robotics, autonomous vehicles, and Extended Reality (SLAM). Right now in autonomous vehicles, depth sensing is solved with two major approaches:

  1. Using a LiDAR sensor
  2. Using an ML/AI model named "monodepth"

In this article, we are going to leverage the pre-trained "monodepth" ML model to bring depth sensing to Unity applications using the webcam feed. "monodepth" is a CNN-based ML model, and executing it on mobile hardware may be costly. Let's find out how efficient Unity Barracuda is when it comes to performance. Before we proceed further, note that there is already a mobile/Edge-optimized version of "monodepth" ported by MIT, called "fastdepth".

For this article, I have used an already ported and ONNX-converted model from here. You can download the ONNX model to experiment with it at your end.

Step 1: Unity Barracuda Setup

Create a new Unity 3D project (2019.3) and import “Barracuda” using “Window → Package Manager”

Select Barracuda and click “Install”.

After importing Barracuda, copy the fastdepth_7.onnx file into the Assets folder. The ONNX importer will automatically recognize and import the model into the project. After the import, the important thing to notice is the input and output shape dimensions, as highlighted below,

This model has an input shape of (1, 224, 224, 3), which is (batch, width, height, channels), meaning the input is a single batch of 224 x 224 pixels of RGB values.

Similarly, the output shape is (1,224,224,1) meaning the output is a single batch of 224 x 224 pixels of depth values.
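If you would like to double-check these shapes from code rather than from the Inspector, Barracuda's Model API exposes them. Below is a minimal sketch (the "ModelShapeLogger" name is mine, not part of this project, and depending on the Barracuda version the reported shape array may be padded with extra dimensions),

using Unity.Barracuda;
using UnityEngine;

public class ModelShapeLogger : MonoBehaviour
{
    [SerializeField] private NNModel _monoDepthONNX;

    void Start()
    {
        Model model = ModelLoader.Load(_monoDepthONNX);

        // Each input reports its name and static shape, e.g. (1, 224, 224, 3)
        foreach (var input in model.inputs)
            Debug.Log($"Input '{input.name}' shape: ({string.Join(", ", input.shape)})");

        // Outputs are listed by name only; their shapes are resolved during execution
        foreach (var outputName in model.outputs)
            Debug.Log($"Output: {outputName}");
    }
}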

With that understanding, let us create a script that will read the webcam feed, pass it to the fastdepth ONNX model, construct a depth map, and build a point cloud from the depth output.

Step 2: Scene Setup

Our scene setup looks something like this: a Canvas (Screen Space - Camera) with two RawImages inside it. One renders the incoming webcam feed, aligned to the bottom-right corner of the canvas. The second RawImage renders the depth image at the bottom-left corner of the canvas.

Add a simple 3D cube object to the scene, rename it to "PointCloud", and align it to the middle of the camera view. This cube's mesh will be replaced with the point cloud that we generate from the depth map derived from the webcam feed. The material assigned to this cube is a custom material with a custom point cloud shader (the Google Tango / PointCloud shader).
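As a rough sketch of that mesh swap (my own illustration, not the project's actual code), a point-topology mesh with one vertex per depth pixel can be prepared like this and assigned to the cube's MeshFilter,

using UnityEngine;

[RequireComponent(typeof(MeshFilter))]
public class PointCloudMeshSetup : MonoBehaviour
{
    private const int Width = 224;   // matches the fastdepth output resolution
    private const int Height = 224;

    void Awake()
    {
        var mesh = new Mesh
        {
            // 224 x 224 = 50,176 vertices fits the default 16-bit index buffer,
            // but a 32-bit buffer keeps us safe if the resolution ever grows
            indexFormat = UnityEngine.Rendering.IndexFormat.UInt32
        };

        // One vertex per depth pixel; positions get rewritten every frame
        var vertices = new Vector3[Width * Height];
        var indices = new int[Width * Height];
        for (int i = 0; i < indices.Length; i++) indices[i] = i;

        mesh.vertices = vertices;
        mesh.SetIndices(indices, MeshTopology.Points, 0);
        mesh.MarkDynamic();

        GetComponent<MeshFilter>().mesh = mesh;
    }
}

The point cloud shader then renders each vertex as a single colored point.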

Step 3: Script to execute the model

Create a new C# script, name it "DepthSensor", and declare three serializable fields as shown below,

using UnityEngine;
using UnityEngine.UI;
using Unity.Barracuda;

public class DepthSensor : MonoBehaviour
{
    [SerializeField] private NNModel _monoDepthONNX;
    [SerializeField] private RawImage _sourceImageView;
    [SerializeField] private RawImage _destinationImageView;

The script should be attached to the "PointCloud" game object in the scene, and these three fields should be assigned in the Inspector as shown below,

In Unity Barracuda, we need a runtime model and a worker:

m_RuntimeModel = ModelLoader.Load(_monoDepthONNX);
worker = WorkerFactory.CreateComputeWorker(m_RuntimeModel);

m_RuntimeModel will contain the runtime instance of the pre-trained model, and the worker is needed for execution (inference).
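The webcam capture itself is not shown above, so here is a hedged sketch of what the "Start" method could look like, wiring the webcam feed to the source RawImage and preparing the Barracuda worker (the "_webcamTexture" field is my own assumption, not a field from the original listing),

private Model m_RuntimeModel;
private IWorker worker;
private WebCamTexture _webcamTexture;   // assumed helper field, not shown in the article

void Start()
{
    // Start capturing from the default camera and display the raw feed
    _webcamTexture = new WebCamTexture();
    _webcamTexture.Play();
    _sourceImageView.texture = _webcamTexture;

    // Load the imported ONNX asset and create a GPU (compute) worker for inference
    m_RuntimeModel = ModelLoader.Load(_monoDepthONNX);
    worker = WorkerFactory.CreateComputeWorker(m_RuntimeModel);
}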

To execute the model, the following lines are added to the “Update” method.

var tensor = new Tensor(inputTexture);             // wrap the 224 x 224 input texture as a Barracuda tensor
var output = worker.Execute(tensor).PeekOutput();  // run inference and peek at the output tensor
float[] depth = output.AsFloats();                 // copy the depth values out as a flat float array
tensor.Dispose();                                  // release the input tensor to avoid leaking memory

Yes, that's all it takes to load the ML model, execute it, and obtain the results.
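One detail worth noting: the model expects a 224 x 224 input, while the webcam delivers frames at its native resolution, so the frame has to be scaled down before it becomes "inputTexture". The article does not show this step, so the helper below is an assumption of how it could be done with a temporary RenderTexture,

private Texture2D ResizeToModelInput(WebCamTexture source, int size = 224)
{
    // GPU-scale the webcam frame down to the model's input resolution
    var rt = RenderTexture.GetTemporary(size, size, 0);
    Graphics.Blit(source, rt);

    // Read the scaled frame back into a Texture2D the Tensor constructor can consume
    var previous = RenderTexture.active;
    RenderTexture.active = rt;
    var resized = new Texture2D(size, size, TextureFormat.RGB24, false);
    resized.ReadPixels(new Rect(0, 0, size, size), 0, 0);
    resized.Apply();

    RenderTexture.active = previous;
    RenderTexture.ReleaseTemporary(rt);
    return resized;
}

Since Barracuda's Tensor constructor also accepts a RenderTexture directly, the ReadPixels round trip can be skipped by passing the scaled RenderTexture straight to new Tensor(...); the CPU copy is kept here only to make the example explicit.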

For creating the depth texture and the point cloud from the obtained depth values, additional code is added to "DepthSensor.cs". The complete class looks as follows,
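As a hedged sketch of what that additional code could look like (the field and method names below are my own assumptions, not necessarily the original ones), the depth floats can be written into a grayscale texture for the left RawImage and into the point mesh's vertices,

private Texture2D _depthTexture;     // assumed helper fields, not shown earlier
private Mesh _pointCloudMesh;        // assumed to be the point-topology mesh on the MeshFilter

private void UpdateDepthTexture(float[] depth, int width = 224, int height = 224)
{
    if (_depthTexture == null)
    {
        // Single-channel float texture; the depth values show up in the red channel
        _depthTexture = new Texture2D(width, height, TextureFormat.RFloat, false);
        _destinationImageView.texture = _depthTexture;
    }

    _depthTexture.SetPixelData(depth, 0);
    _depthTexture.Apply();
}

private void UpdatePointCloud(float[] depth, int width = 224, int height = 224)
{
    var vertices = new Vector3[width * height];
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int i = y * width + x;
            // Spread the points on a normalized image plane and push each one back by its depth.
            // A real implementation would unproject using the camera intrinsics instead.
            vertices[i] = new Vector3(x / (float)width - 0.5f, y / (float)height - 0.5f, depth[i]);
        }
    }

    _pointCloudMesh.vertices = vertices;
    _pointCloudMesh.RecalculateBounds();
}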

Here is some output from this experiment:

  - Left: depth image obtained from fastdepth
  - Middle: point cloud generated from the depth and RGB
  - Right: actual webcam feed

Exterior:

Depth (Left), Point Cloud (Middle), Original Monocular Image (Right)

Interior:

Depth (Left), Point Cloud (Middle), Original Monocular Image (Right)

Summary:

The whole experiment was executed on an Android phone (Samsung Galaxy S9+), and we were able to get a decent 30 FPS out of the box without any performance tweaking, making it friendly for mobile and Edge devices (Raspberry Pi and the like). While the accuracy of the output is debatable, it can still be utilized for some practical computer vision applications in robotics and XR (SLAM). The pre-trained fastdepth ONNX model is almost 2 years old; we could also look at re-training the model with newer datasets to reach an acceptable accuracy.

Can we make a 3D Scanner application out of it?

Maybe not right now. However, this may become a reality within a short span of time as the fastdepth model improves.

