Swift and Simple: Calculate Object Distance with Ease in Just a Few Lines of Code

Nabeel Khan · Published in Artificialis · 5 min read · May 26, 2023

Over the past few days, I’ve rummaged through various internet sources looking for ways to calculate object distance using monocular vision. Along the way, I found deep-learning-based monocular depth estimation models and some landmark-based distance approximation methods. These were the only options, given my insistence on a low-resource solution to this problem.

Generally, depth or distance is calculated using stereo vision. Stereo vision is a powerful technique inspired by human vision: it uses the concept of binocular disparity to precisely approximate the distance of an object from the camera. However, it requires a stereoscopic camera capable of depth estimation by capturing two slightly offset images of a scene simultaneously. I lacked that luxury, so I focused my efforts on a landmark-based distance approximation approach.
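For context, here is a minimal sketch of the disparity-to-depth relationship that stereo systems exploit. This is not part of the project code, and the calibration numbers (focal_length_px, baseline_m, disparity_px) are hypothetical placeholders:

# Sketch: depth from binocular disparity in a calibrated, rectified stereo pair
# All values below are hypothetical; real ones come from camera calibration
focal_length_px = 700.0   # focal length, in pixels
baseline_m = 0.06         # distance between the two camera centers, in meters
disparity_px = 21.0       # horizontal pixel shift of the object between views

# Depth is inversely proportional to disparity: Z = f * B / d
depth_m = (focal_length_px * baseline_m) / disparity_px
print(f"Estimated depth: {depth_m:.2f} m")  # -> 2.00 m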

So that was a bit of background; without wasting another minute, let’s go through the code:

import mediapipe as mp
import cv2
import numpy as np

mp_pose = mp.solutions.pose
pose = mp_pose.Pose()

Here we import the dependencies, instantiate the MediaPipe pose class with mp_pose.Pose(), and assign it to the variable pose.

cap = cv2.VideoCapture('distance.mp4')
while cap.isOpened():
    ret, img = cap.read()
    if not ret:
        break

    # Converting to RGB, since OpenCV reads frames in BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    results = pose.process(img)

    # Check to see if body landmarks are being detected
    if results.pose_landmarks is not None:
        mp_drawing = mp.solutions.drawing_utils
        mp_drawing.draw_landmarks(img, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

    # Convert back to BGR so OpenCV displays the colors correctly
    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    cv2.imshow('ImgWindow', img)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Here we open the video with OpenCV’s VideoCapture and convert each frame to RGB, since cv2 loads images in BGR format while MediaPipe expects RGB. Next, we draw the extracted landmarks on the image as a sanity check. Here’s the result so far:

Fig. 02: Landmark Detection

Now that the landmarks are being detected, the next step is to extract the landmark coordinates we’ll use as a reference point for estimating distance. We’ll take the nose landmark as our reference; although the body is only visible from behind, the model is still able to extrapolate the nose keypoint.

# Extracting the nose landmark
landmarks = []
for landmark in results.pose_landmarks.landmark:
    landmarks.append((landmark.x, landmark.y, landmark.z))

nose_landmark = landmarks[mp_pose.PoseLandmark.NOSE.value]
_, _, nose_z = nose_landmark

We iterate over all detected landmarks and append each one to the landmarks list. From there we extract nose_landmark, which holds three values: x, y, and z. The x and y values give the position of the point in the image plane, while z provides relative depth information with respect to the other keypoints. By leveraging this depth information, we can estimate the distance of the object from the camera. Any keypoint can serve as the reference; the figure below, from the official MediaPipe website, shows all the keypoint locations with their respective names and numbers to help you choose.

Fig. 03: Landmark List (credits)
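Before computing the distance, note that the x and y values are normalized to the frame size. As a quick aside (a sketch of my own, not part of the original walkthrough), this is how you would recover pixel coordinates from the nose landmark, e.g. to anchor a label next to the person:

# Sketch: convert normalized landmark coordinates to pixel coordinates
frame_height, frame_width = img.shape[:2]
nose_x_px = int(nose_landmark[0] * frame_width)
nose_y_px = int(nose_landmark[1] * frame_height)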
# Calculating distance from the z-axis value
# Set the depth_scale to 1
depth_scale = 1

def depth_to_distance(depth_value, depth_scale):
    return -1.0 / (depth_value * depth_scale)

distance = depth_to_distance(nose_z, depth_scale)
cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(distance, precision=2)),
            (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)
cv2.imshow('Video', img)

The function depth_to_distance converts the z-axis value into a distance value. The parameter depth_value is the value obtained from the nose landmark, while depth_scale adjusts the depth values to the desired unit of measurement. The depth scale is usually provided by the algorithm; in this scenario it is in meters, as mentioned on the model’s documentation site. Furthermore, the constant -1.0 inverts the depth value, since the raw depth is usually negative.
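To make the conversion concrete, here is a quick worked example with a hypothetical z value (actual readings will vary from frame to frame):

# Hypothetical example: suppose the nose landmark reports z = -0.25
sample_z = -0.25
depth_scale = 1
distance = -1.0 / (sample_z * depth_scale)  # -> 4.0
# The -1.0 flips the typically negative z into a positive distance value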

Fig. 04: Distance Measurement

The above depth values fluctuate rapidly, and there are random spikes in the values. The spikes are due to the camera not being stationary, causing occasional glitches in keypoint detection that result in negative or very large values. To stabilize the fluctuations, we can use an exponential moving average (EMA) filter, which blends each new reading with the previous filtered value (filtered = alpha * current + (1 - alpha) * previous) to smooth out the noise. A full explanation of the filtering technique is beyond the scope of this article, so we’ll save it for another day and another application. Here’s the code to apply this filter and get better results.

# Tweak the alpha value to suit your needs
alpha = 0.6
previous_depth = 0.0

def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth

filtered_depth = apply_ema_filter(nose_z)
distance = depth_to_distance(filtered_depth, depth_scale)

Here’s the result of the video after applying the filter. The value is a bit more stable than before; however, there are still some random negative and positive peaks due to the mobility of the camera, as discussed above. A stationary camera will give much better results.
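If the remaining spikes bother you, one simple option (my own addition, not part of the original pipeline) is to reject implausible jumps in z before they reach the EMA filter:

# Sketch: ignore z readings that jump implausibly far from the last accepted one
# The max_jump threshold of 0.5 is an arbitrary assumption; tune it to your feed
last_valid_z = None

def reject_outliers(z, max_jump=0.5):
    global last_valid_z
    if last_valid_z is None or abs(z - last_valid_z) <= max_jump:
        last_valid_z = z
    return last_valid_z  # fall back to the last accepted value on a spike

Calling nose_z = reject_outliers(nose_z) before apply_ema_filter would apply it.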

Here’s the entire code:

import mediapipe as mp
import cv2
import numpy as np

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(static_image_mode=False)

# Tweak this parameter to suit your own needs
alpha = 0.6
previous_depth = 0.0

def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth

# Play with the depth_scale value and vary it to check what suits your needs
def depth_to_distance(depth_value, depth_scale):
    return -1.0 / (depth_value * depth_scale)

# Change it to your own camera feed
cap = cv2.VideoCapture('distance.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Converting to RGB, since MediaPipe expects RGB input
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(img)

    # Check to see if body landmarks are being detected
    if results.pose_landmarks is not None:
        landmarks = []
        for landmark in results.pose_landmarks.landmark:
            landmarks.append((landmark.x, landmark.y, landmark.z))

        nose_landmark = landmarks[mp_pose.PoseLandmark.NOSE.value]
        _, _, nose_z = nose_landmark

        # Convert back to BGR for display
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
        filtered_depth = apply_ema_filter(nose_z)
        distance = depth_to_distance(filtered_depth, 1)
        # Convert the distance to your own requirement
        cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(distance, precision=2)),
                    (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)
        cv2.imshow('ImgWindow', img)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Conclusion:

This monocular method of finding distance is an effective way to play around with the concept. However, dedicated stereo vision equipment is recommended for applications needing high accuracy and precision. I hope this satisfies your monocular-vision-based distance estimation needs. Give me a 👏 if you liked this. I’ll try to bring more refined versions of this project in the near future.
Till then… take care and stay blessed!
