Using Monocular Depth Estimation to Mask an Image

Or, How I Selected a Dinosaur and Isolated It Out of a Flat Image

Bob Chesebrough
Intel Analytics Software
5 min read · Mar 27, 2024


Image by author

In this article, I will walk through the steps I took to clip an image so that the background clutter is removed, using monocular depth estimation. This technique could potentially be used to automate structure-from-motion and other image-related tasks where you want to highlight or focus on a single portion of an image, particularly the parts closest to the camera. Specifically, I apply depth estimation to a couple of images I took at a natural history museum. The challenge I gave myself was to capture just the dinosaur in the foreground, eliminating the background murals, lights, and building structure. The cool thing about this approach is that it produces a depth estimate from a single image!

Monocular Depth Estimation (DPT)

Monocular depth estimation, which aims to infer detailed depth from a single image or camera view, finds applications in fields like generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging because the problem is under-constrained. Recent progress is largely attributed to learning-based methods, particularly MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, pioneered by models like ViT, there has been a shift toward using them for depth estimation as well. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The MiDaS v3.1 paper describes the integration of these backbones into MiDaS, provides a thorough comparison of the different v3.1 models, and offers guidance on using future backbones with MiDaS.

The DPT model used here, created by Intel, uses BEiT as the backbone and adds a neck + head on top for monocular depth estimation. It is called dpt-beit-large-512 (more compact variants with lower internal resolutions are also available). The model works at a resolution of 512 x 512 internally, but images of any size can be passed to it for inference. For more information, see the condensed explanation on the Hugging Face model card for dpt-beit-large-512, or the deeper dive in the paper MiDaS v3.1 — A Model Zoo for Robust Monocular Relative Depth Estimation by Reiner Birkl, Diana Wofk, and Matthias Müller.
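
If you want a lighter-weight model, the MiDaS 3.1 zoo on the Hugging Face hub also includes smaller checkpoints that drop into the same code. The sketch below is my own addition; the exact model ID (Intel/dpt-beit-base-384) is an assumption, so check the hub for the variants actually published.

# A minimal sketch, assuming Intel/dpt-beit-base-384 exists on the hub:
# smaller internal resolution (384 x 384), faster inference, same API.
from transformers import DPTImageProcessor, DPTForDepthEstimation

model_id = "Intel/dpt-beit-base-384"  # assumed compact variant; verify on the hub
processor = DPTImageProcessor.from_pretrained(model_id)
model = DPTForDepthEstimation.from_pretrained(model_id)

Everything else in the article stays the same; only the checkpoint name changes.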

Coding Steps

Install the libraries required by dpt-beit-large-512 from a bash terminal or a Jupyter notebook cell (remember to add a ! in front of pip in a notebook):

# bash
pip install transformers==4.34.1
pip install intel_extension_for_transformers==1.2.2
pip install intel_extension_for_pytorch==2.1.100
pip install tqdm
pip install einops
pip install neural_speed==0.2
pip install torch==2.1.1

Import the libraries in your Jupyter notebook or Python script.

import torch
import numpy as np
import requests
import transformers
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

print(torch.__version__)
print(transformers.__version__)

Load the image:

# python
path = "image/DSC_0566.png"
image = Image.open(path)
MAX_SIZE = (600, 400)
image.thumbnail(MAX_SIZE)  # resizes in place, preserving aspect ratio

Notice that the dinosaur, a stegosaurus, stands in front of a painted mural, which makes separating it a more manual task with many common methods:

Image by author

Load the Intel/dpt-beit-large-512 model and generate the depth estimate. Notice how the mural all but disappears while the stegosaurus pops to the front.

processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-512")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-512")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
Image by author

Create a black-and-white mask from the depth estimate:

Threshold = 45
a = np.array(depth)
# keep depth values above the threshold (near the camera); zero out the rest
b = np.where(a > Threshold, a, 0)
b[b > Threshold] = 255
mask = Image.fromarray(b).convert('L')
black = Image.fromarray(a * 0).convert('L')
mask.convert('RGB').show()

Composite the image and the mask:

# use the mask to composite the original image over a black background
out = Image.composite(image.convert('RGBA'), black.convert('RGBA'), mask.convert('1'))
out.convert('RGB').show()

Notice that the depth estimation completely masked out the background mural painting and let me select just the foreground dinosaur!
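
To reuse this pipeline on other photos, the depth, mask, and composite steps above can be collected into one helper. This is a sketch based on the code in this article; the function name, signature, and default threshold are my own choices, not part of the original post.

# Sketch: wrap the depth -> mask -> composite steps into a reusable helper.
# Names and defaults are illustrative, not from the original article.
import numpy as np
import torch
from PIL import Image

def mask_foreground(image, processor, model, threshold=45):
    """Black out everything 'far' from the camera, keeping the foreground."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted_depth = model(**inputs).predicted_depth
    # interpolate the depth map back to the original image size
    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )
    depth = prediction.squeeze().cpu().numpy()
    depth = (depth * 255 / np.max(depth)).astype("uint8")
    # pixels brighter than the threshold (closer to the camera) are kept
    mask = Image.fromarray(np.where(depth > threshold, 255, 0).astype("uint8")).convert("1")
    black = Image.new("RGBA", image.size, (0, 0, 0, 255))
    return Image.composite(image.convert("RGBA"), black, mask)

With the processor and model already loaded, out = mask_foreground(image, processor, model, threshold=45) reproduces the dinosaur result.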

Pottery in Museum Example:

Image by author

We’ll use the same code as above, but with two changes:

# first cell

path = "image/DSC_0566.png"  # change this path to point at the pottery photo
image = Image.open(path)
MAX_SIZE = (600, 400)
image.thumbnail(MAX_SIZE)
# Threshold cell

Threshold = 155  # use 155 for the pottery image
a = np.array(depth)
b = np.where(a > Threshold, a, 0)
b[b > Threshold] = 255
mask = Image.fromarray(b).convert('L')
black = Image.fromarray(a * 0).convert('L')
mask.convert('RGB').show()

The following images show the depth estimate and the background removed:

Image by author
Image by author
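
If you try this on your own photos, the threshold is the main value to tune (45 worked for the dinosaur, 155 for the pottery). One way to pick it is to look at how the depth values are distributed; the snippet below is my own addition, not part of the original article.

# Sketch (my own addition): print a coarse histogram of the depth image so you
# can see where the bright (near-camera) pixels cluster and pick a threshold.
import numpy as np

a = np.array(depth)  # 'depth' is the grayscale depth image produced above
hist, edges = np.histogram(a, bins=16, range=(0, 255))
for lo, hi, count in zip(edges[:-1], edges[1:], hist):
    print(f"{lo:5.0f} - {hi:5.0f}: {count}")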

You can experiment with these concepts and play with monocular depth estimation on the Intel Developer Cloud. It’s free to sign up and spend time on a really powerful server. The code for this article and the rest of the series is located on GitHub. For this article, experiment with the file: dpt_dino.ipynb. See the Hugging Face model card (https://huggingface.co/Intel/dpt-beit-large-512) for more information.


Bob Chesebrough
Intel Analytics Software

Robert Chesebrough is currently a Solution Architect in the Intel Developer Academy, where he teaches others how to apply optimizations to data science algorithms.