Using CLIP to Build a Natural Language Video Search Engine

Guy Ross
7 min read · Dec 28, 2022

Discover how to use CLIP to build a video search engine that responds to natural language prompts with minimal effort.

CLIP (Contrastive Language-Image Pre-training) is a machine learning technique that allows computers to understand and classify images and natural language text with impressive accuracy. This has far-reaching implications for image and language processing and has already been used as the underlying mechanism in the popular diffusion model DALL-E. In this post, we’ll explore how we can adapt CLIP to assist with video search.

This post will not delve into the technical details of the CLIP model, but rather show a practical application of CLIP (beyond diffusion). As a simplified explanation of CLIP, here is a quote from OpenAI's announcement post:

CLIP uses an image encoder and a text encoder to predict which images were paired with which texts in our dataset.

https://openai.com/blog/clip/
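
To make this concrete, here is a minimal sketch of that idea using the Hugging Face transformers CLIP classes: one image is scored against two candidate captions, and the resulting logits are turned into probabilities. The image URL and captions are illustrative examples only, not part of the original post.

import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# an example image (two cats) and two candidate captions
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores;
# softmax turns them into probabilities over the candidate captions
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)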

Using CLIP for search

By using a pre-trained CLIP model from Hugging Face, we can build a simple yet powerful video search engine with natural language capabilities and zero feature engineering.

There are many techniques for searching videos with text. At a high level, our search engine is composed of two parts: indexing videos and searching those indexes.

Dependencies

  • Python ≥ 3.8
  • ffmpeg

Indexing Implementation

Video indexing typically involves a combination of human and machine processes. Humans preprocess videos by adding relevant keywords to titles, tags, and descriptions, while automated processes extract visual and auditory features, such as object detection and audio transcription. User interaction metrics are also collected to understand what parts of the video are most relevant, as well as how long they stay relevant. All of these steps help to create a searchable index of the video’s content.

A high-level overview of the indexing process is as follows:

  1. Split the video into scenes
  2. Sample the scenes for frames
  3. Process pixel embeddings from frames
  4. Index into storage
Image of the indexing process
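
Before diving into each step, here is a rough outline of how they fit together. This is only a hypothetical skeleton: split_scenes, sample_frames, average_scene_tensor, and index_scene are placeholder names for the code developed in the sections below, not functions from the post's repository.

import uuid

# hypothetical outline of the indexing pipeline; the helpers are placeholders
# for the snippets shown in the following sections
def index_video(video_path: str) -> str:
    video_id = str(uuid.uuid4())

    scenes = split_scenes(video_path)                # 1. split the video into scenes
    for scene in scenes:
        frames = sample_frames(video_path, scene)    # 2. sample frames from each scene
        scene_tensor = average_scene_tensor(frames)  # 3. preprocess and average pixel tensors
        index_scene(video_id, scene, scene_tensor)   # 4. write to the LevelDB indexes

    return video_id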

Splitting the video into scenes

Why is scene detection important? A video is composed of scenes, and a scene is composed of similar frames. If we were to sample only arbitrary frames from a video, we might miss key frames throughout the video.

Additionally, it will allow us to more accurately identify and locate specific events or actions within a video. For example, if I search for “a dog at a park” and the video I’m searching for contains multiple scenes, including one scene of a man biking and another scene of a dog at a park, scene detection allows me to identify the scene that most closely matches my search query.

To accomplish this we can use the scenedetect Python package (PySceneDetect).

import scenedetect as sd

video_path = '' # path to video on machine

video = sd.open_video(video_path)
sm = sd.SceneManager()

# ContentDetector triggers a new scene when the content change between frames exceeds the threshold
sm.add_detector(sd.ContentDetector(threshold=27.0))
sm.detect_scenes(video)

# list of (start, end) FrameTimecode pairs, one per detected scene
scenes = sm.get_scene_list()

Sample the scenes for frames

Next, we can use cv2 to sample a fixed number of frames from each scene, maintaining a list of all of the samples.

import cv2

cap = cv2.VideoCapture(video_path)

no_of_samples = 3 # number of frame samples per scene

scenes_frame_samples = []
for scene_idx in range(len(scenes)):
    scene_length = abs(scenes[scene_idx][0].frame_num - scenes[scene_idx][1].frame_num)
    every_n = round(scene_length / no_of_samples)
    local_samples = [(every_n * n) + scenes[scene_idx][0].frame_num for n in range(no_of_samples)]

    scenes_frame_samples.append(local_samples)

Process pixel embeddings from frames

Once we have these samples collected, we need to convert them into something usable by the Hugging Face CLIP model.

To start, we will convert each sampled frame into the preprocessed pixel tensor (the "pixel_values") that the CLIP model expects as input.

from transformers import CLIPProcessor
from PIL import Image

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embeddings(image):
    # preprocess the image (resize + normalize) into CLIP's expected input format
    inputs = clip_processor(images=image, return_tensors="pt", padding=True)

    return inputs['pixel_values']

# ...
scene_clip_embeddings = [] # holds the path of each scene's saved tensor (next step)

for scene_idx in range(len(scenes_frame_samples)):
    scene_samples = scenes_frame_samples[scene_idx]

    pixel_tensors = [] # holds the preprocessed pixel tensors for each sample in the scene
    for frame_sample in scene_samples:
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_sample)
        ret, frame = cap.read()
        if not ret:
            print('failed to read', ret, frame_sample, scene_idx, frame)
            break

        # OpenCV returns frames as BGR; convert to RGB before handing off to PIL/CLIP
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        clip_pixel_values = clip_embeddings(pil_image)
        pixel_tensors.append(clip_pixel_values)

Next, we average all of the samples within the same scene, collapsing them into a single representative tensor and smoothing out any noise present in an individual sample.

import torch
import uuid

def save_tensor(t):
    # write the tensor to a temporary file and return its path
    path = f'/tmp/{uuid.uuid4()}'
    torch.save(t, path)

    return path

# .. still inside the per-scene loop from the previous snippet
avg_tensor = torch.mean(torch.stack(pixel_tensors), dim=0)
scene_clip_embeddings.append(save_tensor(avg_tensor))

After this, we are left with a list of file paths, one per scene, each pointing to a saved tensor that represents that scene of our video.

Index into Storage

For the underlying index storage, we will be using LevelDB, a key/value store maintained by Google.

The schema for our search engine will include three separate indexes:

  • Video scenes index: which scenes belong to a specific video.
  • Scene embeddings index: the tensor data for a specific scene.
  • Video metadata index: metadata about a video.
Various index diagrams

At a high level, we will first generate a unique identifier for the video and insert the video's computed metadata into the metadata index under that identifier.

import leveldb
import json
import uuid

def insert_video_metadata(videoID, data):
    b = json.dumps(data)

    level_instance = leveldb.LevelDB('./dbs/videometadata_index')
    level_instance.Put(videoID.encode('utf-8'), b.encode('utf-8'))

# ...
video_id = str(uuid.uuid4())
insert_video_metadata(video_id, {
    'VideoURI': video_path,
})

Next, for each of the saved scene tensors of a video, we create a new entry in the scene embeddings index, keyed by a freshly generated unique scene identifier.

import leveldb
import uuid

def insert_scene_embeddings(sceneID, data):
    level_instance = leveldb.LevelDB('./dbs/scene_embedding_index')
    level_instance.Put(sceneID.encode('utf-8'), data)

# ...
for f in scene_clip_embeddings:
    scene_id = str(uuid.uuid4())

    with open(f, mode='rb') as file:
        content = file.read()

    insert_scene_embeddings(scene_id, content)

Finally, we need to keep track of which scenes belong to which video.

import leveldb
import uuid

def insert_video_scene(videoID, sceneIds):
    b = ",".join(sceneIds)

    level_instance = leveldb.LevelDB('./dbs/scene_index')
    level_instance.Put(videoID.encode('utf-8'), b.encode('utf-8'))

# ...
scene_ids = []
for f in scene_clip_embeddings:
    # .. scene_id and content are computed as shown in the previous step
    scene_ids.append(scene_id)
    insert_scene_embeddings(scene_id, content)

insert_video_scene(video_id, scene_ids)

Searching Indexes

Now that we have a way to ingest videos into a group of indexes, we can search those indexes and rank the videos based on the model's output.

First, we iterate over all of the records in our scene index, building a list of (video id, scene ids) pairs.

records = []

level_instance = leveldb.LevelDB('./dbs/scene_index')

for k, v in level_instance.RangeIter():
    record = (k.decode('utf-8'), v.decode('utf-8').split(','))
    records.append(record)

Next, we will need to collect all of the scene tensors that exist for each of the videos.

import leveldb
from io import BytesIO

def get_tensor_by_scene_id(id):
    level_instance = leveldb.LevelDB('./dbs/scene_embedding_index')
    b = level_instance.Get(bytes(id, 'utf-8'))

    return BytesIO(b)

for r in records:
    tensors = [get_tensor_by_scene_id(id) for id in r[1]]

Once we have all of the tensors that make up a video, we can pass them into the model. The model expects an input dictionary that includes the key "pixel_values", containing the tensor representing a scene of the video.

import torch
from transformers import CLIPProcessor, CLIPModel

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

text = 'a dog at a park' # the natural language search query (example from earlier)

inputs = processor(text=text, return_tensors="pt", padding=True)

for tensor in tensors:
    image_tensor = torch.load(tensor)
    inputs['pixel_values'] = image_tensor
    outputs = model(**inputs)

To get the model's prediction, we can access "logits_per_image" on the model's output.

Logits are the raw, unnormalized predictions of the network. Since we are supplying only one text string and a single tensor representing one scene of a video, the result is a single-value prediction.

logits_per_image = outputs.logits_per_image    
probs = logits_per_image.squeeze()

prob_for_tensor = probs.item()

To get the average score for the video, we sum each iteration's score and divide it by the total number of tensors at the end of the operation.

def clip_scenes_avg(tensors, text):
    avg_sum = 0.0

    for tensor in tensors:
        # ... run inference as shown in the previous snippets to get `probs`
        avg_sum += probs.item()

    return avg_sum / len(tensors)

Finally, to return a response, we keep track of the average score for each video, sort by score, and return the requested number of search results.

import leveldb
import json

top_n = 1 # number of search results we want back

def video_metadata_by_id(id):
    level_instance = leveldb.LevelDB('./dbs/videometadata_index')
    b = level_instance.Get(bytes(id, 'utf-8'))
    return json.loads(b.decode('utf-8'))

scored = []
for r in records:
    # .. collect the scene tensors as shown in the previous step

    # r[0]: video id, r[1]: scene ids
    scored.append((clip_scenes_avg(tensors, text), r[0]))

scored.sort(key=lambda x: x[0], reverse=True)

results = []
for s in scored[:top_n]:
    data = video_metadata_by_id(s[1])

    results.append({
        'video_id': s[1],
        'score': s[0],
        'video_uri': data['VideoURI']
    })

And that is it! It is now just a matter of inserting some videos and testing out the search.
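
For example, if the indexing snippets were wrapped into an index_video() helper and the search snippets into a search() helper (hypothetical names, not functions from the repository), usage might look like this:

# hypothetical wrappers around the snippets above
video_id = index_video('dog_park.mp4')

results = search('a dog at a park', top_n=1)
# each result contains the video id, its average score, and the stored VideoURI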

Conclusion

You can view the full code used in this post here: https://github.com/GuyARoss/CLIP-video-search/tree/article-01.

A modified version of this code with improved efficiency is available here: https://github.com/GuyARoss/CLIP-video-search.

CLIP can be used to create a natural language video search engine with little effort. By using a pre-trained CLIP model and Google’s LevelDB, we can index and process videos for searching with natural language prompts. This search engine enables users to easily find relevant videos without extensive preprocessing or feature engineering.

What's next?

  • Determining the best scene, with the timestamp of that scene (see the sketch after this list).
  • Modifying the inference to run on a compute cluster + LSH.
  • Benchmarking CLIP search against BERT + Whisper + Elastic Search.
  • Video recommender using CLIP.
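
As a starting point for the first idea, here is a rough sketch that scores each scene of a video individually (instead of averaging) and returns the timestamps of the best-matching scene. It reuses the processor and model from the search section; scene_tensors (the per-scene pixel tensors of one video) and scenes (the (start, end) FrameTimecode list from scenedetect) are assumed inputs.

def best_scene(scene_tensors, scenes, text):
    inputs = processor(text=text, return_tensors="pt", padding=True)

    best_score, best_idx = float('-inf'), None
    for idx, pixel_values in enumerate(scene_tensors):
        inputs['pixel_values'] = pixel_values
        score = model(**inputs).logits_per_image.squeeze().item()
        if score > best_score:
            best_score, best_idx = score, idx

    start, end = scenes[best_idx]
    # FrameTimecode objects render as HH:MM:SS.nnn strings
    return best_score, start.get_timecode(), end.get_timecode()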
