Step-by-Step Guide to Automatic Background Removal in Videos with the U2-Net Model (Deep Learning)

CodingNerds COG
3 min read · Dec 24, 2023


The magic of digital content creation often lies in the invisible threads of post-processing. One such strand of magic is background removal in videos, a technique used in applications ranging from virtual conferencing to creating engaging content for social media. Today, we’ll walk through a Pythonic journey to seamlessly remove the background from any video, using the powerful U2-Net segmentation model.

Setting the Stage: Preparing Your Environment

Before diving into the code, let’s set up our environment. Ensure you have OpenCV installed in your Python environment, along with the U2-Net code and its pretrained weights. OpenCV will handle our video processing needs, while U2-Net, known for its remarkable performance in salient object detection, will be the workhorse for background removal.

The First Act: Capturing Frames from the Video

Our journey begins with a video file, a series of moments captured in frames. Using OpenCV’s VideoCapture, we can step through each video frame. We read the frames one by one, and as we do so, we save them as individual images. This is our raw material, the untouched canvas upon which we’ll perform our digital sorcery.

video = cv2.VideoCapture(input_video_path)
fps = video.get(cv2.CAP_PROP_FPS)
count = 0
flag, image = video.read()
while flag:
    cv2.imwrite(f'input_frame{count}.png', image)
    flag, image = video.read()
    count += 1

The Second Act: Invoking the U2-Net Model

With the frames extracted, we call upon the U2-Net model. This deep learning model will process each image and produce a corresponding mask, delineating the main subject from the background. For the sake of this tale, let’s assume the magic has been done, and our masks are ready, each a perfect silhouette of our subject.
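For completeness, here is a minimal sketch of what that inference step might look like, assuming the official U-2-Net PyTorch repository is on your Python path and the pretrained u2net.pth weights have been downloaded. The file names, paths, and preprocessing values below are illustrative and mirror the repository’s u2net_test.py; the u2net_video.py script referenced in the full code follows the same pattern.

import glob
import cv2
import numpy as np
import torch
from model import U2NET  # provided by the U-2-Net repository

# Load the pretrained network (the weights path is an assumption - adjust to your setup)
net = U2NET(3, 1)
net.load_state_dict(torch.load('u2net.pth', map_location='cpu'))
net.eval()

for path in sorted(glob.glob('input_frame*.png')):
    frame = cv2.imread(path)
    h, w = frame.shape[:2]
    # Resize and normalize roughly the way the repository's test script does
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (320, 320)).astype(np.float32) / 255.0
    img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    tensor = torch.from_numpy(img.transpose(2, 0, 1)).float().unsqueeze(0)
    with torch.no_grad():
        pred = net(tensor)[0]  # the first output is the fused prediction
    pred = pred[0, 0].numpy()
    pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
    mask = cv2.resize((pred * 255).astype(np.uint8), (w, h))
    cv2.imwrite(path.replace('input_frame', 'mask'), mask)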

The Third Act: Merging Frames and Masks

Here’s where we blend the output of U2-Net with our original frames. For each pair of frame and mask, we perform a bitwise operation. This is the heart of our background removal — where the mask tells us what to keep from each frame and what to cast into the void.

for i in range(count):
    mask = cv2.imread(f'mask{i}.png')
    frame = cv2.imread(f'input_frame{i}.png')
    # Keep only the pixels where the mask is white; everything else goes black
    result = cv2.bitwise_and(frame, mask)
    cv2.imwrite(f'result_frame{i}.png', result)
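One practical note: U2-Net produces soft, grayscale masks rather than hard silhouettes. If you want crisper edges, you can binarize the mask before the AND; the threshold value of 127 below is an illustrative choice, not something prescribed by the model.

# Optional: turn the soft mask into a hard 0/255 mask before combining
_, binary_mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
result = cv2.bitwise_and(frame, binary_mask)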

The Final Act: Reassembling the Video

With all frames now processed, it’s time to reassemble them into a new video. This is a reversal of our first act, where instead of splitting, we’re combining, stitching together our frames into a coherent whole, now devoid of its original background.

# frame_size is the (width, height) of the frames, e.g. taken from the first result image;
# processed_frames is the list of result images loaded back with cv2.imread
video_writer = cv2.VideoWriter(output_video_path, cv2.VideoWriter_fourcc(*'MP4V'), fps, frame_size)
for frame in processed_frames:
    video_writer.write(frame)
video_writer.release()

The Curtain Call: Reflecting on Our Journey

What we’ve achieved is nothing short of alchemy. We started with a standard video and, using Python and a sprinkle of deep learning, removed the background, rendering a subject free from its original constraints, ready to be placed into any context we desire.

A Standing Ovation

And there you have it. With Python as our director and U2-Net as our lead actor, we’ve performed the magical act of background removal. The story we’ve told is one of transformation, and the script we’ve written is a testament to the power of technology in the realm of digital content creation.

Full Code Here

import cv2
import os

def remove_background_from_video(input_video_path, output_path):
    # Create the video capture object
    video = cv2.VideoCapture(input_video_path)
    fps = video.get(cv2.CAP_PROP_FPS)

    # Make sure the working directories exist
    for folder in ('input_frames', 'u2net_results', 'bitwise_images', 'output'):
        os.makedirs(os.path.join(output_path, folder), exist_ok=True)

    # Read each frame from the video and save it as an image
    count = 0
    flag = True
    while flag:
        flag, image = video.read()
        if not flag:
            break
        cv2.imwrite(os.path.join(output_path, 'input_frames', f'input{count}.png'), image)
        count += 1
    video.release()

    # Assume that the u2net_video.py script has been executed here and masks are
    # available in 'u2net_results', one mask per input frame with the same file name

    # Perform a bitwise AND on each frame and its mask
    for i in range(count):
        mask_path = os.path.join(output_path, 'u2net_results', f'input{i}.png')
        frame_path = os.path.join(output_path, 'input_frames', f'input{i}.png')
        mask = cv2.imread(mask_path)
        frame = cv2.imread(frame_path)
        real_part = cv2.bitwise_and(mask, frame)
        cv2.imwrite(os.path.join(output_path, 'bitwise_images', f'mask{i:04d}.png'), real_part)

    # Collect the processed images for the output video
    img_array = []
    size = None
    for i in range(count):
        final_img_path = os.path.join(output_path, 'bitwise_images', f'mask{i:04d}.png')
        final_img = cv2.imread(final_img_path)
        height, width, layers = final_img.shape
        size = (width, height)
        img_array.append(final_img)

    # Write the output video
    out_video_path = os.path.join(output_path, 'output', 'input_1_BGR.mp4')
    outv = cv2.VideoWriter(out_video_path, cv2.VideoWriter_fourcc(*'MP4V'), fps, size)

    for img in img_array:
        outv.write(img)

    outv.release()

# Call the function with the paths
input_video_path = '/path/to/input/video.mp4'
output_path = '/path/to/output/directory'
remove_background_from_video(input_video_path, output_path)
