Day 93: Advanced Video Object Removal Using Deep Learning

Adithya Prasad Pandelu

--

Welcome to Day 93 of our 100 Days of ML Journey! Today, we’ll explore an exciting project that demonstrates the potential of deep learning in video processing: Video Object Removal. Inspired by an older project by zllrunning, we’ll bring it up to date with modern techniques and tools available in 2025. This enhanced version leverages advancements in computer vision and generative AI to remove unwanted objects from videos seamlessly.

(Image generated using AI)

What is Video Object Removal?

Imagine shooting a video of a scenic beach, only to find unwanted elements like trash or passersby in the frame. Video object removal is the process of identifying and removing these undesired objects while maintaining the consistency and realism of the background.

This task requires:

  1. Object Detection: Identifying the objects to remove.
  2. Inpainting: Filling in the removed region with background details.

Enhancing the Project

We’ll modernize the original pipeline with:

  • YOLOv8 for state-of-the-art object detection.
  • Stable Diffusion Inpainting for generating realistic backgrounds.
  • FFmpeg for video processing.
  • Streamlit for an interactive user interface.

Step 1: Setting Up the Environment

Install the required libraries:

pip install ultralytics opencv-python-headless diffusers torch torchvision ffmpeg-python streamlit
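Stable Diffusion inpainting is compute-heavy, so a GPU is strongly recommended. A quick sanity check, assuming you installed a CUDA-enabled PyTorch build:

import torch

# Inpainting on CPU is painfully slow; confirm PyTorch can see a CUDA device
print(torch.cuda.is_available())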

Step 2: Object Detection with YOLOv8

We’ll use YOLOv8 for object detection; it offers an excellent balance of accuracy and speed.

Input video source: car_detection.mp4

Code for Object Detection

from ultralytics import YOLO
import cv2
import matplotlib.pyplot as plt

# Load a pre-trained YOLOv8 model (nano variant; larger variants trade speed for accuracy)
model = YOLO('yolov8n.pt')

# Open the input video
video_path = 'car_detection.mp4'
cap = cv2.VideoCapture(video_path)

# Run object detection on each frame
frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame)
    for result in results:
        for box in result.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)  # Draw bounding box
    frames.append(frame)
cap.release()

# Display one frame (index 73 here) with bounding boxes
plt.imshow(cv2.cvtColor(frames[73], cv2.COLOR_BGR2RGB))
plt.title('Detected Objects')
plt.axis('off')
plt.show()

Step 3: Inpainting with Stable Diffusion

Once objects are detected, we’ll create masks and use Stable Diffusion Inpainting to fill in the background.

Extract Bounding Box Coordinates

From the YOLOv8 output, the bounding box coordinates are stored in result.boxes. You can extract these values and create a mask for the object to be removed.

import numpy as np
import cv2

def generate_mask(frame, results):
    """
    Generate a binary mask for the detected object(s) in a frame.

    Parameters:
    - frame: The video frame (numpy array).
    - results: YOLO detection results containing bounding boxes.

    Returns:
    - mask: A binary mask with detected objects marked in white (255).
    """
    mask = np.zeros_like(frame[:, :, 0])  # Blank single-channel mask, same size as the frame
    for result in results:
        for box in result.boxes:
            # Extract bounding box coordinates
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            # Fill the bounding box region with white (thickness=-1 fills the rectangle)
            cv2.rectangle(mask, (x1, y1), (x2, y2), 255, -1)
    return mask
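One optional refinement: diffusion inpainting tends to leave fewer edge artifacts when the mask extends slightly beyond the object, so dilating the mask by a few pixels often helps. A minimal sketch (the kernel size is a guess to tune per video):

import cv2
import numpy as np

def dilate_mask(mask, kernel_size=15):
    # Expand the mask outward so inpainting also covers object edges and shadows
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)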

Use the Mask in Inpainting

After generating the mask, pass it along with the frame to the inpainting function.

from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
import numpy as np
import torch

# Load the inpainting pipeline (move it to the GPU if available; CPU inference is very slow)
inpaint_pipeline = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
if torch.cuda.is_available():
    inpaint_pipeline = inpaint_pipeline.to("cuda")

def inpaint_frame(frame, mask):
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    mask_image = Image.fromarray(mask)
    result = inpaint_pipeline(
        prompt="fill the missing background naturally",
        image=image,
        mask_image=mask_image,
    ).images[0]
    # The pipeline may return a different resolution (512x512 by default),
    # so resize back to the original frame size before converting to BGR
    result = result.resize(image.size)
    return cv2.cvtColor(np.array(result), cv2.COLOR_RGB2BGR)

# Example: Inpaint one frame
sample_frame = frames[73]                           # Frame at index 73 from Step 2
results = model(sample_frame)                       # Run YOLOv8 detection on the frame
sample_mask = generate_mask(sample_frame, results)  # Build the mask from the detections

# Inpaint the frame
inpainted_frame = inpaint_frame(sample_frame, sample_mask)

# Visualize the inpainted frame
plt.imshow(cv2.cvtColor(inpainted_frame, cv2.COLOR_BGR2RGB))
plt.title('Inpainted Frame')
plt.axis('off')
plt.show()

Explanation of the Steps

1. YOLO Detection:

  • YOLO detects objects in the frame and provides bounding box coordinates (x1, y1, x2, y2).
  • These coordinates define the region of the object to be removed.

2. Generate the Mask:

  • Create a blank mask of the same dimensions as the frame.
  • Draw rectangles on the mask where the detected objects are located.

3. Inpainting:

  • Pass the frame and the mask to the inpaint_frame function.
  • The inpainting model will fill in the masked regions with realistic background content.

Notes:

  • If your video contains multiple cars or other objects, this method handles all detected objects in a frame.
  • Ensure that YOLO’s detection confidence threshold is set appropriately to avoid masking false positives (see the snippet below).
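Ultralytics exposes the threshold directly on the predict call; the values here are illustrative rather than tuned:

# Keep only confident detections; optionally restrict to specific COCO classes
# (class 2 is 'car' in the COCO label set used by the pre-trained weights)
results = model(frame, conf=0.5, classes=[2])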

Step 4: Video Processing

Combining Detection and Inpainting

def process_video(video_path, output_path):
    cap = cv2.VideoCapture(video_path)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(
        output_path, fourcc, cap.get(cv2.CAP_PROP_FPS),
        (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    )

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        results = model(frame)                # Detect objects
        mask = generate_mask(frame, results)  # Reuse the mask helper from Step 3
        inpainted_frame = inpaint_frame(frame, mask)
        out.write(inpainted_frame)
    cap.release()
    out.release()

# Process video
input_vpath = 'YourPath/car_detection.mp4'
output_vpath = 'YourPath/output.mp4'
process_video(input_vpath, output_vpath)
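Note that cv2.VideoWriter drops the audio track. If the source clip has audio, a short ffmpeg-python remux (this is where the ffmpeg dependency from Step 1 comes in) copies it back; the output file name is illustrative:

import ffmpeg

# Mux the processed video stream with the original clip's audio stream
original = ffmpeg.input(input_vpath)
processed = ffmpeg.input(output_vpath)
ffmpeg.output(
    processed.video, original.audio, 'YourPath/output_with_audio.mp4',
    vcodec='copy', acodec='aac'
).run(overwrite_output=True)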

Step 5: Interactive User Interface with Streamlit

We’ll build a simple interface where users can upload videos, specify objects to remove, and download the processed video.

Streamlit Code

import streamlit as st

st.title("Video Object Removal App")

# Upload video
uploaded_file = st.file_uploader("Upload a Video", type=["mp4", "mov", "avi"])
if uploaded_file:
    with open("input_video.mp4", "wb") as f:
        f.write(uploaded_file.read())
    st.video("input_video.mp4")

    # Process video (only once a file has actually been uploaded)
    if st.button("Remove Objects"):
        process_video("input_video.mp4", "output_video.mp4")
        st.success("Processing Complete!")
        st.video("output_video.mp4")

Enhancements

  1. Use of Vision Transformers (ViT): ViT-based models can be integrated for sharper, pixel-accurate segmentation (a lighter-weight starting point is sketched after this list).
  2. Real-Time Processing: Leverage GPU acceleration and batched inference for near real-time object removal.
  3. Context-Aware Inpainting: Use vision-language models such as GPT-4 Vision to describe the scene and feed richer, scene-specific prompts to the inpainting model, rather than a generic fill prompt.
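For the first enhancement, here is a minimal sketch using YOLOv8’s segmentation variant (yolov8n-seg.pt) as a lighter-weight stand-in for a full ViT segmenter; it drops in as a replacement for the rectangular generate_mask from Step 3:

import cv2
import numpy as np
from ultralytics import YOLO

seg_model = YOLO('yolov8n-seg.pt')  # Segmentation variant of YOLOv8

def generate_seg_mask(frame, conf=0.5):
    """Build a pixel-accurate binary mask from YOLOv8 segmentation output."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    for result in seg_model(frame, conf=conf):
        if result.masks is None:
            continue
        for m in result.masks.data:  # One mask tensor per detected instance
            m = (m.cpu().numpy() > 0.5).astype(np.uint8) * 255
            # Masks come back at the model's inference resolution; resize to the frame
            m = cv2.resize(m, (frame.shape[1], frame.shape[0]))
            mask = cv2.bitwise_or(mask, m)
    return mask

Pixel masks remove far less background than bounding boxes, which gives the inpainting model more real context to work with.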

Applications

  1. Media Production: Remove unwanted elements in videos seamlessly.
  2. Surveillance: Blur or remove sensitive objects or faces in security footage.
  3. Augmented Reality: Enable AR applications to remove real-world objects from view.

Wrapping up

This project showcases the power of deep learning in solving challenging problems like video object removal. By modernizing it with state-of-the-art tools like YOLOv8 and Stable Diffusion, we’ve created a robust, practical solution. As always, the possibilities with machine learning are endless — so keep experimenting and innovating!

Stay tuned, happy coding!

Thank you for reading… Let’s connect!
