Unleashing Depth Anything v2: SOTA Monocular Depth Estimation on Intel CPU with OpenVINO and NNCF

Luís Condados
Published in LatinXinAI · 6 min read · Jun 16, 2024
Video by cottonbro studio from Pexels: https://www.pexels.com/video/close-up-video-of-a-basketball-passing-through-the-hoop-6777262/

In this article, we’ll look at the latest advances in monocular depth estimation, focusing on the state-of-the-art Depth Anything V2 model. We’ll walk through how to convert the model to OpenVINO, run inference with OpenVINO 2024.1, and optimize it further by quantizing to INT8 with the Neural Network Compression Framework (NNCF). By the end of this guide, you’ll be able to use the converted models for accurate and efficient monocular depth estimation on CPU.

Let’s get started! 🚀

You can find the complete code and all utility functions used here on my GitHub.

Depth Anything V2

Depth Anything V2 is a cutting-edge model for monocular depth estimation. Developed by Yang et al. (2024), it significantly advances the accuracy and efficiency of depth estimation from a single image.

Image from the paper

About OpenVINO

OpenVINO™ (Open Visual Inference and Neural Network Optimization) is an open-source toolkit that optimizes and accelerates AI inference. The latest version, OpenVINO 2024.1, offers enhanced support for various deep learning models, making it a perfect choice for deploying depth estimation models in real-world applications.

Python Environment

For this project, I’m using a Python virtual environment with Python 3.11.

The requirements.txt files can be found in the GitHub repo.
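Before running anything, it can help to confirm that the environment actually contains the packages imported throughout this article. The snippet below is a minimal sketch of such a check; the package list is simply collected from the imports in this article's code, and `check_environment` is a hypothetical helper, not part of the repo.

```python
import sys
from importlib.util import find_spec

# Packages imported somewhere in this article's code.
REQUIRED = ["numpy", "torch", "openvino", "nncf", "datasets", "tqdm"]


def check_environment(packages=REQUIRED):
    """Return {package_name: bool}, True when the package can be imported."""
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, available in check_environment().items():
        print(f"{name:10s} {'ok' if available else 'MISSING'}")
```

Anything reported as missing can be installed with pip inside the virtual environment.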

Download Pre-trained Weights and Inference with PyTorch

First, let’s download the pre-trained weights for Depth Anything V2 and run inference using the original PyTorch implementation.

Official weights provided by the authors

Below, I show how to build the PyTorch model, load the pre-trained weights, and run inference.

import numpy as np
import torch
from depth_anything_v2.dpt import DepthAnythingV2

import utils  # helper module from the GitHub repo


class DepthAnythingV2Pytorch:
    def __init__(self, model_type="vits", device="cpu"):
        self.model_configs = {
            'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
            'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
            'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
        }

        self.device = device
        self.weights_path = f"weights/depth_anything_v2_{model_type}.pth"
        self.model = DepthAnythingV2(**self.model_configs[model_type])
        self.model.load_state_dict(torch.load(self.weights_path, map_location=device))
        # move the model to the requested device and switch to inference mode
        self.model = self.model.to(device).eval()

    def predict(self, image):
        """Depth estimation from an RGB image.

        Args:
            image (numpy.ndarray): RGB image of shape (height, width, 3)
        """
        input_tensor, image_size = utils.image_preprocess(image)
        with torch.no_grad():  # gradients are not needed for inference
            out = self.model(torch.from_numpy(input_tensor).to(self.device))
        depth = utils.postprocess(out.cpu().numpy(), image_size)
        return depth


# example of how to use it
if __name__ == "__main__":
    # load model with pretrained weights - choosing small version and using cuda
    model = DepthAnythingV2Pytorch(model_type="vits", device="cuda")

    # download image from a url and convert to numpy (RGB Image on PIL to numpy)
    image_url = "https://images.pexels.com/photos/5740792/pexels-photo-5740792.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1"
    image = np.array(utils.download_image(image_url))

    # prediction
    depth = model.predict(image)
    # colorful depth map with values; check the utils module on GitHub to see the full code

The authors have published three model versions, the small, base and large. With this code we can choose which one to use, but to keep it as fast as possible, I’m going to use the small version for all analysis here.
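The `utils.image_preprocess` and `utils.postprocess` helpers are not listed in this article. To give a rough idea of what such helpers typically do, here is a minimal NumPy-only sketch. Everything in it is an assumption for illustration: the function names ending in `_sketch`, the ImageNet normalization constants, the fixed 518x518 resize without preserving aspect ratio, and the nearest-neighbour interpolation. The actual helpers in the repo may differ (for example, by using OpenCV with a smoother interpolation).

```python
import numpy as np

# ImageNet statistics, commonly used with ViT encoders (assumption for this sketch).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_nearest(array, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array using only NumPy."""
    in_h, in_w = array.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return array[rows[:, None], cols]


def image_preprocess_sketch(image, input_size=518):
    """RGB uint8 (H, W, 3) -> float32 (1, 3, input_size, input_size) plus the original (H, W)."""
    image_size = image.shape[:2]
    resized = resize_nearest(image, input_size, input_size).astype(np.float32) / 255.0
    normalized = (resized - MEAN) / STD
    input_tensor = normalized.transpose(2, 0, 1)[None]  # HWC -> NCHW
    return np.ascontiguousarray(input_tensor, dtype=np.float32), image_size


def postprocess_sketch(model_output, image_size):
    """Model output (1, h, w) -> depth map (H, W) rescaled to the range [0, 1]."""
    depth = np.squeeze(model_output)
    depth = resize_nearest(depth, *image_size)
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + 1e-8)
```

The key point is the contract: preprocessing returns a NumPy tensor in NCHW layout together with the original image size, and postprocessing uses that size to bring the prediction back to the input resolution.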

Prediction using original Pytorch model running on my RTX3060

Original video — by cottonbro studio from Pexels: https://www.pexels.com/video/close-up-video-of-a-basketball-passing-through-the-hoop-6777262/

Converting the PyTorch Model to OpenVINO

from pathlib import Path

import numpy as np
import openvino as ov
import torch
from depth_anything_v2.dpt import DepthAnythingV2

import utils

##########################
# Load Pre-trained model #
##########################
model_select = "vits"

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
}

weights_path = f"weights/depth_anything_v2_{model_select}.pth"

model = DepthAnythingV2(**model_configs[model_select]).eval()
model.load_state_dict(torch.load(weights_path, map_location='cpu'))

########################
# Get Sample RGB Image #
########################

image_url = "https://images.pexels.com/photos/5740792/pexels-photo-5740792.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1"
image = np.array(utils.download_image(image_url))

########################
# Preprocess RGB Image #
########################
input_tensor, image_size = utils.image_preprocess(image)

#######################
# Convert to OpenVINO #
#######################

# same file name as the weights, with the .xml extension, inside models_ov/
ov_model_path = Path("models_ov") / Path(weights_path).with_suffix(".xml").name
if not ov_model_path.exists():
    ov_model_path.parent.mkdir(parents=True, exist_ok=True)
    ov_model = ov.convert_model(model, example_input=input_tensor, input=[1, 3, 518, 518])
    ov.save_model(ov_model, ov_model_path)

With recent OpenVINO releases, converting a PyTorch model is simple: load the pre-trained PyTorch model and pass it to ov.convert_model together with an example input.

Note: By default, ov.save_model compresses the weights of the saved OpenVINO Intermediate Representation (IR) to float16 (compress_to_fp16=True).
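To get a feel for what storing weights in float16 costs in precision, the sketch below round-trips some random float32 values through float16 with NumPy. The random array is only a stand-in for real layer weights, so the numbers are illustrative rather than a measurement of this model. If you prefer to keep float32 weights in the IR, `ov.save_model` accepts `compress_to_fp16=False`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a layer's FP32 weights; real weights come from the checkpoint.
weights_fp32 = rng.uniform(-1.0, 1.0, size=10_000).astype(np.float32)

# Store as FP16, then read back as FP32 to measure what was lost.
weights_fp16 = weights_fp32.astype(np.float16)
restored = weights_fp16.astype(np.float32)

max_abs_error = float(np.abs(weights_fp32 - restored).max())
print(f"size: {weights_fp32.nbytes} bytes -> {weights_fp16.nbytes} bytes")
print(f"max absolute error: {max_abs_error:.2e}")
```

The storage is halved, while the rounding error for values in this range stays small, which is why FP16 is usually a safe default for the saved model.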

Using OpenVINO model for inference

Here’s how to load the converted OpenVINO IR model and run inference. As before, I’m wrapping it in a class so the code is easy to reuse later.

Note that the inputs and outputs are the same as in the PyTorch version.

import numpy as np
import openvino as ov

import utils


class DepthAnythingV2OpenVINO:
    def __init__(self, ov_model_path="depth_anything_v2_vit.xml", device="AUTO"):
        self.ov_model_path = ov_model_path
        self.core = ov.Core()
        # read and compile the IR model for the selected device
        self.compiled_model = self.core.compile_model(self.ov_model_path, device)

    def predict(self, image):
        """Depth estimation from an RGB image.

        Args:
            image (numpy.ndarray): RGB image of shape (height, width, 3)
        """
        input_tensor, image_size = utils.image_preprocess(image)
        out = self.compiled_model(input_tensor)[0]
        depth = utils.postprocess(out, image_size)
        return depth


# example of how to use it
if __name__ == "__main__":
    # path of the IR model saved in the conversion step
    model = DepthAnythingV2OpenVINO(ov_model_path="models_ov/depth_anything_v2_vits.xml")

    # download image from a url and convert to numpy (RGB Image on PIL to numpy)
    image_url = "https://images.pexels.com/photos/5740792/pexels-photo-5740792.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1"
    image = np.array(utils.download_image(image_url))

    # prediction
    depth = model.predict(image)
    # colorful depth map with values; check the utils module on GitHub to see the full code
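The visualization step itself lives in the utils module and is not shown in this article. As a rough illustration of the idea, the sketch below maps a depth array to an RGB image with a simple blue-to-red ramp using only NumPy. `colorize_depth_sketch` is a hypothetical name, and the real helper most likely uses a proper colormap from OpenCV or Matplotlib.

```python
import numpy as np


def colorize_depth_sketch(depth):
    """Map an (H, W) depth array to an (H, W, 3) uint8 RGB image.

    Low values become blue and high values become red (simple two-colour ramp).
    """
    depth = depth.astype(np.float32)
    span = depth.max() - depth.min()
    # normalise to [0, 1]; a constant map falls back to all zeros
    t = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    rgb = np.stack([t, np.zeros_like(t), 1.0 - t], axis=-1)
    return (rgb * 255).round().astype(np.uint8)
```

Given the output of `model.predict(image)`, calling `colorize_depth_sketch(depth)` would produce an image that can be saved or displayed next to the input frame.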

Quantization to INT8 using NNCF

To run this step you’ll need to install two more packages:

pip install nncf
pip install datasets

NNCF (Neural Network Compression Framework) compresses models for faster OpenVINO™ inference.

The datasets package downloads the calibration dataset that NNCF uses to quantize our OpenVINO model (currently in FP16 precision).

NNCF needs a few images, a calibration dataset, to quantize the model while losing as little output quality as possible. We’ll use the dataset created for the Depth Anything V2 work, which is published in the authors’ Hugging Face collection under the name “depth-anything/DA-2K”.

import datasets  # to get the images for calibration
import numpy as np
import openvino as ov
from tqdm import tqdm

import utils  # to use the image_preprocess function

#################################
# Creating the calibration data #
#################################
calibration_data = []

dataset = datasets.load_dataset("depth-anything/DA-2K",
                                split="train",
                                streaming=True)

# let's shuffle and take just a small portion of it
dataset = dataset.shuffle(seed=2024).take(300)

for batch in tqdm(dataset):
    # keep only the RGB channels (drops an alpha channel if present)
    image = np.array(batch["image"])[..., :3]
    input_tensor, _ = utils.image_preprocess(image)
    calibration_data.append(input_tensor)

##############################
# Load the openvino ir model #
##############################
ov_model_path = "models_ov/depth_anything_v2_vits.xml"

# output path
ov_model_int8_path = "models_ov/depth_anything_v2_vits_INT8.xml"

print("[INFO] Reading input ov model ...")
core = ov.Core()
model = core.read_model(ov_model_path)

Finally, let’s run the quantization process itself:

import nncf
import openvino as ov

print("[INFO] Running quantization process ...")

# number of calibration samples used to collect statistics
subset_size = 300

quantized_model = nncf.quantize(
    model=model,
    subset_size=subset_size,
    model_type=nncf.ModelType.TRANSFORMER,
    calibration_dataset=nncf.Dataset(calibration_data),
)

print("[INFO] Saving quantized model at {} ...".format(ov_model_int8_path))
ov.save_model(quantized_model, ov_model_int8_path)

print("[INFO] Done!")

Note: This process will take time and need computational resources.
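To build some intuition for why calibration data is needed, here is a toy NumPy sketch of affine INT8 quantization: the calibration samples determine the value range, and that range fixes the scale used to map floats to 8-bit integers. This is only an illustration under simplified assumptions (a single global min/max range, hypothetical helper names); NNCF's actual post-training quantization is considerably more elaborate.

```python
import numpy as np


def calibrate_range(samples):
    """Collect the global min/max over all calibration samples."""
    lo = min(float(s.min()) for s in samples)
    hi = max(float(s.max()) for s in samples)
    return lo, hi


def quantize_int8(x, lo, hi):
    """Affine quantization of x to int8 using the calibrated range [lo, hi]."""
    scale = (hi - lo) / 255.0
    zero_point = -128 - round(lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale


calibration = [np.array([-1.0, 0.0], dtype=np.float32),
               np.array([0.5, 3.0], dtype=np.float32)]
lo, hi = calibrate_range(calibration)

# 10.0 lies outside the calibrated range and will be clipped
x = np.array([-1.0, 0.5, 3.0, 10.0], dtype=np.float32)
q, scale, zero_point = quantize_int8(x, lo, hi)
print(q, dequantize(q, scale, zero_point))
```

Values inside the calibrated range are recovered to within about half a quantization step, while the value outside it saturates. That is the reason the calibration images should be representative of what the model will see at inference time.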

Prediction using converted model to OpenVINO IR + Int8 Quantization

We can reuse the same OpenVINO inference code; we just need to provide the path to the INT8 model.

Below is a visual result from that model running on my Intel Core i7-12700H.

Original video — by cottonbro studio from Pexels: https://www.pexels.com/video/close-up-video-of-a-basketball-passing-through-the-hoop-6777262/

Findings

By converting the Depth Anything V2 model to OpenVINO and quantizing it to INT8 with NNCF, we sped up inference by almost 3x using only the CPU, without much degradation in output quality (based on qualitative analysis only).
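If you want to reproduce this kind of comparison on your own machine, a small timing helper is enough. The sketch below is one possible way to do it, not the exact procedure used for the numbers above: `benchmark` and `speedup` are hypothetical helpers, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark(predict_fn, sample, warmup=3, runs=20):
    """Return the median latency of predict_fn(sample) in milliseconds."""
    for _ in range(warmup):  # let caches and lazy initialisation settle
        predict_fn(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized model is than the baseline."""
    return baseline_ms / optimized_ms
```

For example, timing `benchmark(fp16_model.predict, image)` against `benchmark(int8_model.predict, image)` with two instances of the OpenVINO wrapper class, and passing both results to `speedup`, gives the ratio for your hardware. The median is used so that a few slow outlier runs do not skew the result.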

You can find the full code and the converted models on my GitHub.

Acknowledgments

This work is heavily based on the Depth Anything notebook by the OpenVINO Toolkit team. Special thanks to Yang et al. (2024) for their groundbreaking research.

Licenses:

  • Depth-Anything-V2-Small: Apache 2.0
  • Other variants: CC-BY-NC-4.0

References

Thanks for reading! Happy coding! 💻✨

#IntelSoftwareInnovator #openvino


Luís Condados
LatinXinAI

A Computer Engineer with a background in robotics, computer vision, and deep learning.