Discover how to use CLIP to build a video search engine that responds to natural language prompts with minimal effort.
CLIP (Contrastive Language-Image Pre-training) is a machine learning technique that allows computers to understand and classify images and natural language text with impressive accuracy. This has far-reaching implications for image and language processing, and CLIP has already been used as an underlying mechanism in OpenAI's popular image generation model DALL-E. In this post, we'll explore how we can adapt CLIP to assist with video search.
This post will not delve into the technical details of the CLIP model, but rather show a practical application of CLIP (besides image generation). For a simplified explanation of how CLIP works, see OpenAI's blog post introducing the model.
Using CLIP for search
By using a pre-trained CLIP model from Hugging Face, we can build a simple yet powerful video search engine with natural language capabilities and zero feature engineering.
There are many techniques for searching videos with text. At a high level, our search engine consists of two parts: indexing video content and searching that index.
Dependencies
- Python ≥ 3.8
- ffmpeg
Indexing Implementation
Video indexing typically involves a combination of human and machine processes. Humans preprocess videos by adding relevant keywords to titles, tags, and descriptions, while automated processes extract visual and auditory features, such as object detection and audio transcription. User interaction metrics are also collected to understand what parts of the video are most relevant, as well as how long they stay relevant. All of these steps help to create a searchable index of the video’s content.
A high-level overview of the indexing process is as follows (a rough sketch tying these steps together appears just after the list):
- Split the video into scenes
- Sample the scenes for frames
- Process pixel embeddings from frames
- Index into storage
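The sketch below shows how these stages might fit together. The helper names (detect_scenes, sample_frames, embed_frames, index_scenes) are hypothetical placeholders for the code developed in the following sections, not functions from any library.

# A minimal orchestration sketch. Each helper is a hypothetical placeholder
# for the code walked through in the sections below.
def index_video(video_path):
    scenes = detect_scenes(video_path)                 # split the video into scenes
    frame_samples = sample_frames(video_path, scenes)  # sample frames from each scene
    scene_tensors = embed_frames(frame_samples)        # compute CLIP pixel tensors per scene
    index_scenes(video_path, scene_tensors)            # store everything in LevelDB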
Splitting the video into scenes
Why is scene detection important? A video is composed of scenes, and a scene is composed of similar frames. If we only sampled frames at arbitrary points in the video, we might miss key scenes entirely.
Additionally, it will allow us to more accurately identify and locate specific events or actions within a video. For example, if I search for “a dog at a park” and the video I’m searching for contains multiple scenes, including one scene of a man biking and another scene of a dog at a park, scene detection allows me to identify the scene that most closely matches my search query.
To accomplish this, we can use the "scenedetect" Python package.
import scenedetect as sd

video_path = '' # path to video on machine
video = sd.open_video(video_path)

sm = sd.SceneManager()

# ContentDetector starts a new scene when adjacent frames differ by more than the threshold.
sm.add_detector(sd.ContentDetector(threshold=27.0))

sm.detect_scenes(video)
scenes = sm.get_scene_list()
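Each entry in the resulting list is a pair of FrameTimecode objects marking where a scene starts and ends. A quick loop (purely illustrative) prints those boundaries:

# Each scene is a (start, end) pair of FrameTimecode objects.
for idx, (start, end) in enumerate(scenes):
    print(f'scene {idx}: frames {start.frame_num} to {end.frame_num} '
          f'({start.get_timecode()} to {end.get_timecode()})')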
Sample the scenes for frames
Next, we can use cv2 (OpenCV) to sample a fixed number of frames from each scene, maintaining a list of all the samples.
import cv2

cap = cv2.VideoCapture(video_path)
no_of_samples = 2 # number of frames to sample per scene

scenes_frame_samples = []
for scene_idx in range(len(scenes)):
    scene_length = abs(scenes[scene_idx][0].frame_num - scenes[scene_idx][1].frame_num)
    every_n = round(scene_length / no_of_samples)
    local_samples = [(every_n * n) + scenes[scene_idx][0].frame_num for n in range(no_of_samples)]

    scenes_frame_samples.append(local_samples)
Process pixel embeddings from frames
After we have these samples collected, we need to convert them into a format usable by the Hugging Face CLIP model.
To start, we will need to convert each of these samples to an image embedding tensor.
from transformers import CLIPProcessor
from PIL import Image

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embeddings(image):
    # Preprocess the frame into the pixel tensor the CLIP model expects.
    inputs = clip_processor(images=image, return_tensors="pt", padding=True)

    return inputs['pixel_values']
# ...
scene_clip_embeddings = [] # to hold the scene embeddings in the next step

for scene_idx in range(len(scenes_frame_samples)):
    scene_samples = scenes_frame_samples[scene_idx]

    pixel_tensors = [] # holds all of the clip embeddings for each of the samples
    for frame_sample in scene_samples:
        # Seek to the sampled frame and read it.
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_sample)
        ret, frame = cap.read()
        if not ret:
            print('failed to read', ret, frame_sample, scene_idx, frame)
            break

        # OpenCV reads frames in BGR order; convert to RGB before creating the PIL image.
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        clip_pixel_values = clip_embeddings(pil_image)
        pixel_tensors.append(clip_pixel_values)
Next, we average all of the samples within the same scene to collapse them into a single representative tensor and smooth out any noise present in an individual sample.
import torch
import uuid
def save_tensor(t):
path = f'/tmp/{uuid.uuid4()}'
torch.save(t, path)
return path
# ..
avg_tensor = torch.mean(torch.stack(pixel_tensors), dim=0)
scene_clip_embeddings.append(save_tensor(avg_tensor))
After this, we will be left with a list of tensors that represent our video in the CLIP embedding space.
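As a quick sanity check, you can load one of the saved tensors back and look at its shape; with the default "openai/clip-vit-base-patch32" processor the frames are resized to 224x224, so each averaged tensor should have the shape [1, 3, 224, 224] (assuming the processor's default settings).

# Sanity check: load one averaged scene tensor back from disk.
sample = torch.load(scene_clip_embeddings[0])
print(sample.shape) # expected: torch.Size([1, 3, 224, 224]) with the default processor settings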
Index into Storage
For the underlying index storage, we will be using LevelDB. LevelDB is a key/value store maintained by Google.
The schema for our search engine will include three separate indexes:
- Video scenes index: references which scenes belong to a specific video.
- Scene embeddings index: the tensor data for a specific scene.
- Video metadata index: metadata about a video.
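If you have not used the Python "leveldb" bindings before, the pattern used throughout the rest of this post is just Put, Get, and RangeIter over byte keys and values. Here is a minimal, standalone example (the database path is arbitrary):

import leveldb

# LevelDB stores raw bytes, so keys and values are encoded before writing.
db = leveldb.LevelDB('./dbs/example_index') # arbitrary path, for illustration only
db.Put('some-key'.encode('utf-8'), 'some-value'.encode('utf-8'))

print(db.Get('some-key'.encode('utf-8'))) # b'some-value'

# Iterate over every key/value pair in the database.
for key, value in db.RangeIter():
    print(key, value)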
At a high level, we will first insert all of the computed metadata from the video into the metadata index, as well as a unique identifier for the video.
import leveldb
import uuid
import json

def insert_video_metadata(videoID, data):
    b = json.dumps(data)

    level_instance = leveldb.LevelDB('./dbs/videometadata_index')
    level_instance.Put(videoID.encode('utf-8'), b.encode('utf-8'))
# ...
video_id = str(uuid.uuid4())
insert_video_metadata(video_id, {
'VideoURI': video_path,
})
Next, for each of the pixel-embedded tensors in a video, we want to create a new entry in the scene embeddings index. We also want to be able to identify each scene by a unique identifier, which will also be computed here.
import leveldb
import uuid
def insert_scene_embeddings(sceneID, data):
level_instance = leveldb.LevelDB('./dbs/scene_embedding_index')
level_instance.Put(sceneID.encode('utf-8'), data)
# ...
for f in scene_clip_embeddings:
scene_id = str(uuid.uuid4())
with open(f, mode='rb') as file:
content = file.read()
insert_scene_embeddings(scene_id, content)
Finally, we need to keep track of which scenes belong to which video.
import leveldb
import uuid
def insert_video_scene(videoID, sceneIds):
b = ",".join(sceneIds)
level_instance = leveldb.LevelDB('./dbs/scene_index')
level_instance.Put(videoID.encode('utf-8'), b.encode('utf-8'))
# ...
scene_ids = []
for f in scene_clip_embeddings:
    # .. scene_id and content computed as shown in the previous step
    scene_ids.append(scene_id)
    insert_scene_embeddings(scene_id, content)

insert_video_scene(video_id, scene_ids)
Searching Indexes
Now that we have a way to ingest videos into a group of indexes, we can search and sort them based on the predictions found in the output of the model.
First, we iterate over all of the records in our scene index and build a list of each video with its matching scene IDs.
records = []
level_instance = leveldb.LevelDB('./dbs/scene_index')
for k, v in level_instance.RangeIter():
record = (k.decode('utf-8'), str(v.decode('utf-8')).split(','))
records.append(record)
Next, we will need to collect all of the scene tensors that exist for each of the videos.
import leveldb
from io import BytesIO

def get_tensor_by_scene_id(id):
    level_instance = leveldb.LevelDB('./dbs/scene_embedding_index')
    b = level_instance.Get(bytes(id,'utf-8'))

    return BytesIO(b)

for r in records:
    tensors = [get_tensor_by_scene_id(id) for id in r[1]]
After we have all of the tensors that make up a video, we can pass them into the model. The model expects an input that includes the key "pixel_values", containing the tensor representing a scene of the video.
import torch
from transformers import CLIPProcessor, CLIPModel

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

text = 'a dog at a park' # example natural-language search prompt

inputs = processor(text=text, return_tensors="pt", padding=True)

for tensor in tensors:
    image_tensor = torch.load(tensor)
    inputs['pixel_values'] = image_tensor
    outputs = model(**inputs)
To get the prediction, we can access "logits_per_image" on the model's output.
Logits are essentially the raw, unnormalized predictions of the network. Since we are only supplying one string of text and one tensor representing a single scene of the video, the logits will contain a single value.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.squeeze()
prob_for_tensor = probs.item()
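As a side note, if you passed several candidate text prompts at once, "logits_per_image" would contain one score per prompt and you could normalize them with a softmax. A small illustrative example, reusing the processor and model loaded above:

# With multiple prompts, logits_per_image has shape [1, num_prompts];
# a softmax turns the raw logits into relative probabilities across the prompts.
multi_inputs = processor(text=['a dog at a park', 'a man biking'], return_tensors="pt", padding=True)
multi_inputs['pixel_values'] = image_tensor
multi_outputs = model(**multi_inputs)

print(multi_outputs.logits_per_image.softmax(dim=1))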
To get the average probability for the video, we sum each iteration's probability and divide by the total number of tensors at the end of the operation.
def clip_scenes_avg(tensors, text):
    avg_sum = 0.0

    for tensor in tensors:
        # ... previous code snippets
        avg_sum += probs.item()

    return avg_sum / len(tensors)
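For reference, here is one way the elided parts could be filled in, combining the earlier snippets (loading each stored tensor, setting "pixel_values", and reading the single logit). Treat it as a sketch rather than the exact implementation from the repository:

def clip_scenes_avg(tensors, text):
    # Tokenize the search prompt once for the whole video.
    inputs = processor(text=text, return_tensors="pt", padding=True)

    avg_sum = 0.0
    for tensor in tensors:
        # Each entry holds the serialized tensor bytes read back from LevelDB.
        inputs['pixel_values'] = torch.load(tensor)
        outputs = model(**inputs)

        avg_sum += outputs.logits_per_image.squeeze().item()

    return avg_sum / len(tensors)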
Finally, to return a response, we keep track of the probability for each video, sort the probabilities, and return the requested number of search results.
import leveldb
import json
top_n = 1 # number of search results we want back
def video_metadata_by_id(id):
level_instance = leveldb.LevelDB('./dbs/videometadata_index')
b = level_instance.Get(bytes(id,'utf-8'))
return json.loads(b.decode('utf-8'))
scored = []
for r in records:
    # .. collect the scene tensors for this video, as shown above
    # r[0]: video id
    scored.append((clip_scenes_avg(tensors, text), r[0]))

scored.sort(key=lambda x: x[0], reverse=True)

results = []
for s in scored[:top_n]:
data = video_metadata_by_id(s[1])
results.append({
'video_id': s[1],
'score': s[0],
'video_uri': data['VideoURI']
})
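As a quick check, you can print the top results for the query:

# Print the top results for the search prompt used above.
for result in results:
    print(f"{result['score']:.3f} {result['video_uri']}")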
And that is it! It is now just a matter of inserting some videos and testing out the search.
Conclusion
You can view the full code used in this post here: https://github.com/GuyARoss/CLIP-video-search/tree/article-01. A modified version of this code, tuned for improved efficiency, is available here: https://github.com/GuyARoss/CLIP-video-search.
CLIP can be used to create a natural language video search engine with little effort. By using a pre-trained CLIP model and Google’s LevelDB, we can index and process videos for searching with natural language prompts. This search engine enables users to easily find relevant videos without extensive preprocessing or feature engineering.
What's next?
- Determining the best scene, with the timestamp of that scene.
- Modifying the inference to run on a compute cluster + LSH.
- Benchmarking CLIP search against BERT + Whisper + Elastic Search.
- Video recommender using CLIP.