Building a simple video analytics tool using the CLIP model

We are drowning in information but starved for knowledge — John Naisbitt

Yiwen Lai
CSIT tech blog


Benbrook Lake, Fort Worth, Texas, USA


Misinformation and propaganda are forms of communication that are designed to deceive or mislead. They can be spread through various social media channels because they allow users to share information and ideas with a large audience quickly and easily.

The spread of misinformation and propaganda through social media can have serious consequences. It can lead to the spread of false or misleading information, which can negatively impact individuals and society. For example, it can cause people to make decisions based on incorrect information, leading to negative outcomes. It can also fuel conflict and division and undermine trust in institutions and the media. In extreme cases, it can even contribute to violence and destabilization. It is therefore important to be aware of the potential for misinformation and propaganda to spread through social media, and to take steps to counteract it.

Introducing video analytics

Video analytics is the automated analysis of video content to extract meaning and insights from temporal and spatial events. It can be applied to a variety of scenarios and use cases.

In the realm of online security, video analytics can be applied to identify and track the spread of misinformation and propaganda. This is because the modus operandi of a misinformation actor is to propagate and amplify a piece of narrative over one or more social media channels, often in a cascading fashion. With videos being a popular medium in the social media space, being able to apply video analytics to quickly understand and track their dissemination will be useful in identifying the sources of misinformation before taking the necessary actions to stop their spread.

CLIP model

One way to build such a video analytics tool is to use the CLIP model created by OpenAI. CLIP (Contrastive Language–Image Pre-training) is a state-of-the-art model developed by OpenAI that can analyze both text and images. By being trained on a dataset of 400 million image and text pairs, whereby the text describes the image, CLIP can combine knowledge of language concepts with semantic knowledge of images. The core idea for CLIP is to learn good visual and text representations from the massive dataset.

Other than text and images, CLIP can also be useful for analyzing video data, as it can be applied to understand the captions and scenes appearing in a video, finding applications in video filtering, retrieval and even topic modelling.

In this article, I will describe how I adapted CLIP for analysing videos.

CLIP paper — Learning Transferable Visual Models From Natural Language Supervision

CLIP is a state-of-the-art computer vision model. The following shows that CLIP outperforms other computer vision models regardless of scale and compute efficiency, tested across multiple datasets covering a wide spectrum of categories.

Building a simple video analytics tool

I had the chance to experiment with the CLIP model and develop a simple video analytics tool. These were the steps I took:

  1. Collecting and storing video data
  2. Pre-processing the video by creating a video summary
  3. Running the video summary through the CLIP model to generate text-image embeddings
  4. Creating a series of downstream tasks that make use of the above embeddings. I experimented with the following tasks:
  • Task 1: Text and image search
  • Task 2: Search navigation
  • Task 3: Video topic modelling

1. Collecting and storing video data

I curated a collection of 7,500 short videos (each no more than five minutes long) spanning a wide range of topics, from politics to food. The videos were only partially annotated by their content creators: most lacked short descriptions and carried only generic hashtags.

2. Preprocessing the video by creating a video summary

What do we mean by video summarization? As we know, video files are very rich and content heavy. To balance computation and accuracy on downstream tasks, we need to extract a small subset of data from each video. This could be text, audio clips or images extracted from the video.

There are several approaches to creating a video summary, such as time-step sampling, uniform temporal subsampling and Gaussian subsampling for frame extraction. As the focus of this post is applying CLIP to analyse video, I will use a simple summarization algorithm that prioritizes extracting frames that are both clear and diverse from each video.

To slice our video correctly, the algorithm consists of the following steps.

Do take note that a typical video can be recorded at different frame rates (fps), such as 30 fps or 60 fps. This determines how many frames you will need to process for each video.

  1. Decide how many frames to extract from each video. For example, we take 5 frames from each video.
  2. Divide the entire video timeline into 5 segments.
  3. For each segment, select the images with the best clarity. For this, I utilised the Variance of Laplacian method implemented in the OpenCV Python library.
  4. For the first segment, select the image with the best color diversity using the following steps: reduce the image resolution, run color clustering to extract prominent colors, sort the colors for ease of comparison, and finally compare the colors of each image within the segment to select the frame exhibiting the highest color variance. This diversity ensures the extracted frames are not repetitive and helps eliminate static frames, for example a black screen at the start or end of a video.
  5. From the second segment onwards, repeat step 4, but this time select the frame with the highest color variance relative to the previously selected frames. This ensures the selected frames have good diversity. Repeat until we have a frame from each of the 5 segments.

The output of this algorithm ensures the 5 frames extracted from each video have high clarity and great diversity of features.
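The clarity-scoring part of the steps above can be sketched as follows. This is a minimal, numpy-only illustration: it implements the segment slicing and a Variance-of-Laplacian clarity score by hand (in practice OpenCV's `cv2.Laplacian(gray, cv2.CV_64F).var()` would be used), and it omits the color-diversity comparison.

```python
import numpy as np

# 3x3 Laplacian kernel: strong response at edges, zero on flat regions.
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]])

def laplacian_variance(gray):
    """Clarity score: variance of the Laplacian response (higher = sharper)."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

def pick_frames(frames, n_segments=5):
    """Split the frame list into n_segments equal time slices and pick
    the sharpest frame index from each slice."""
    bounds = np.linspace(0, len(frames), n_segments + 1, dtype=int)
    picks = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        scores = [laplacian_variance(f) for f in frames[a:b]]
        picks.append(int(a) + int(np.argmax(scores)))
    return picks
```

A blurry or static frame (e.g. a black screen) scores near zero, so it is never chosen over a detailed frame in the same segment.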

Extracting text from the 5 extracted frames

In addition to images, text data is also valuable for video analysis. Many of the videos in our collection have captions, which we can extract to provide more information in the video summary. To extract text from images, we use a tool called PaddleOCR. To ensure that the text we extract is accurate, we set a threshold on the OCR confidence to filter out unreliable results.

The result of this process is that we can obtain several pieces of information for each video, including a brief description, the text extracted using OCR, video hashtags and images extracted from it.
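The confidence filtering can be sketched as below, assuming PaddleOCR's usual per-line output of a bounding box plus a (text, confidence) pair; the 0.9 threshold here is an illustrative choice, not necessarily the value used in the project.

```python
def filter_ocr(lines, threshold=0.9):
    """Keep only OCR text whose confidence meets the threshold.

    `lines` is assumed to follow PaddleOCR's per-line shape:
    (bounding_box, (text, confidence)).
    """
    return [text for _box, (text, conf) in lines if conf >= threshold]
```

Low-confidence detections (often garbled captions or background texture) are dropped before the text joins the video summary.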

Other than the method described, there are other methodologies for video summarization, such as the Towhee framework. Interested readers can find out more via the links provided below.

PaddleOCR is one of the fastest and most accurate multilingual OCR tools.

I would also recommend looking into keyframe extraction from FFmpeg as an alternative to my algorithm. Refer to the following links for more details.

3. Run the video summary through the CLIP model

In machine learning, embedding is a technique used to represent input data, such as words or images, as vectors in a high-dimensional space. Embedding captures important features of the input in a way that machine learning algorithms can use.

In this project, embedding is done by using the CLIP model without any additional training or fine-tuning. The CLIP model is used to directly convert video descriptions and extracted images into embeddings. The CLIP model has two components: one for text (clip-ViT-B32-multilingual) and another for images (clip-ViT-B32).

clip-ViT-B32 — This is the image & text CLIP model, which maps text and images to a shared vector space. It can encode both images and text, but only English text.

clip-ViT-B32-multilingual — This is a multilingual version trained to map text (in 50+ languages) and images into a common dense vector space such that images and their matching texts are close.

We use the multilingual model so that we can handle different languages (50+ are supported). This allows us to create downstream tasks that work across languages, for example searching for videos using Malay, Hindi, English or Chinese text.
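A minimal sketch of how the two encoders fit together and why a shared space enables cross-modal comparison. The commented-out loading code assumes the sentence-transformers package and its published model names; the toy vectors below merely stand in for real CLIP embeddings.

```python
import numpy as np

# Loading the two CLIP components (sketch; requires the
# sentence-transformers package and a model download):
#   from sentence_transformers import SentenceTransformer
#   img_model = SentenceTransformer("clip-ViT-B-32")
#   txt_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")
#   img_embs = img_model.encode(pil_images)  # one vector per frame
#   txt_embs = txt_model.encode(captions)    # one vector per caption

def normalize(v):
    """L2-normalise so that a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine(a, b):
    return float(normalize(a) @ normalize(b))

# Toy vectors standing in for real CLIP embeddings:
cat_img = np.array([0.9, 0.1, 0.0])
cat_txt = np.array([0.8, 0.2, 0.1])  # e.g. "a photo of a cat"
dog_txt = np.array([0.1, 0.9, 0.2])
assert cosine(cat_img, cat_txt) > cosine(cat_img, dog_txt)
```

Because images and text land in the same space, ranking videos against a text query reduces to a cosine-similarity search over precomputed vectors.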

4. Create a series of downstream tasks

With all the video pre-processing done above, we can now explore some practical applications to better understand the capabilities of our video analytics tool.

In the first task, we will evaluate our application’s search functionality using both textual and visual inputs. In the second task, we will delve into the latent space produced by the CLIP model, enabling us to retrieve videos that match a user’s specific context through text and images. Lastly, in the third task, we will create a topic model for videos utilizing a modified version of the multimodal BERTopic.

Task 1: Searching for video using text

The goal of this task is to test the performance of the embeddings we created using the CLIP model by searching for videos based on text input only. Specifically, we’ll input the text “Destroyed tank on a street” and see how well the tool can retrieve relevant videos from our collection.

Input text: Destroyed tank on a street

First return from our text input

It did well, showing a result of a tank running down a street, but it missed the “destroyed” aspect. Next, we will use an image alone and see how well the application performs.

Task 1: Searching for video using image

Input image of a destroyed tank on a street
First return from our input image

This time round we supply an image of a destroyed tank, taken from the internet and not part of our collection, and use it for the search. Our tool manages to identify the “destroyed” aspect of the input, but it struggles to retrieve tank-related videos. To improve the search results, we can combine both text and image input. By doing so, we provide more information to the tool, leveraging the fact that both text and image are represented in a shared latent space. This should help us retrieve more relevant videos from our collection.
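One simple way to combine the two modalities is to blend their normalised embeddings into a single query vector, assuming both inputs have already been encoded by their respective CLIP components; the equal 50/50 weighting is an assumption for illustration, not a tuned setting.

```python
import numpy as np

def norm(v):
    return v / np.linalg.norm(v)

def combined_query(text_emb, image_emb, alpha=0.5):
    """Blend normalised text and image embeddings into one query vector.
    alpha weights the text side; 0.5 weighs both modalities equally."""
    return norm(alpha * norm(text_emb) + (1 - alpha) * norm(image_emb))
```

The blended vector is then used for the same cosine-similarity search as a single-modality query.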

Return from combining text and image input

Task 2: Navigating search results (exploring latent space)

Input image (Zelensky giving a speech) and keywords describing what Zelensky would look like post-war

This task is inspired by word2vec, which supports arithmetic on word embeddings (for example, King − Man + Woman ≈ Queen). Similarly, we want to see if we can use text descriptions to refine the content we want our tool to retrieve. This tests the arithmetic properties of the CLIP model: since CLIP represents images and text in the same space, we can apply arithmetic to both.

Our goal is to retrieve videos by providing some context. In this example we will be searching for President Zelensky after the war started while filtering out content from before the war. To do this, we will add and subtract keywords to our input image, which should help us find the relevant content.

The following keywords are added and subtracted:

Add (+) #Ukraine military uniform, wearing bullet vest and helmet, tired, sad

Subtract (-) Giving speech, tie, suit, smile

Top 3 results, before adding and subtracting keywords
Top 3 results, after adding and subtracting keywords

We have successfully retrieved videos of President Zelensky after the war started by adding and removing specific keywords from our original image query. By adding context to an image during search, this method could be useful for retrieving propaganda videos that reuse clips with some modifications.
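The add/subtract operation can be sketched as follows, assuming each keyword phrase has already been encoded with the CLIP text model; giving every added and subtracted term equal weight is an illustrative choice.

```python
import numpy as np

def norm(v):
    return v / np.linalg.norm(v)

def steer_query(image_emb, add_embs=(), sub_embs=()):
    """Shift an image embedding toward added concepts and away from
    subtracted ones, then renormalise for cosine search."""
    q = norm(image_emb)
    for e in add_embs:
        q = q + norm(e)
    for e in sub_embs:
        q = q - norm(e)
    return norm(q)
```

The steered vector replaces the raw image embedding in the nearest-neighbour search, pulling results toward the added keywords and away from the subtracted ones.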

An example account whose narrative focuses on Zelensky

Investigating the accounts returned in our results, we can see that one account focuses on Zelensky: nearly 90% of the videos posted by this account were related to him. More investigation is needed to identify the account’s intent.

For more details on the arithmetic properties of CLIP, please refer to the paper.

Performance Metrics

MRR can be interpreted as the probability that a user will find the relevant document at the top of the search results.

MAP can be interpreted as the average relevance of the documents returned by the system.

To ensure the effectiveness of our video analytics, we need to assess how well our search function performs (Task 2). We will evaluate the search function’s performance on text and image retrieval, which involves navigating search results. The two evaluation metrics are MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision).

Integrating text and image retrieval is a daunting task, primarily because of the intricate nature of multimodal embeddings and semantic ambiguity. The contextual meaning of words, such as “bank,” can be subject to multiple interpretations. As an illustration, the embedding of the word “bank” in text may not correspond accurately with an image if it refers to either a financial institution or a riverbank.

Our Task 2 scores are MRR@5 = 0.895 and MAP@5 = 0.857. The MRR@5 means that, on average, the first relevant result was found within the top 1/0.895 ≈ 1.12 results. The MAP@5 means that when the system retrieves 5 results for a query, on average 85.7% of those results are relevant.

Overall, the high scores for both MRR@5 and MAP@5 indicate that the search function is effective for combining text and image search in Task 2, despite the difficulty of combining text and image search.
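For reference, MRR@k and MAP@k can be computed from ranked relevance judgements as follows. This is a self-contained sketch: `results` holds one list of 0/1 relevance flags per query, in ranked order.

```python
def mrr_at_k(results, k=5):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for rel in results:
        for rank, r in enumerate(rel[:k], start=1):
            if r:
                total += 1.0 / rank
                break
    return total / len(results)

def map_at_k(results, k=5):
    """Mean Average Precision: mean over queries of the average
    precision taken at each relevant position in the top-k."""
    total = 0.0
    for rel in results:
        hits, ap = 0, 0.0
        for rank, r in enumerate(rel[:k], start=1):
            if r:
                hits += 1
                ap += hits / rank
        total += ap / hits if hits else 0.0
    return total / len(results)
```

For example, two queries whose relevant hits sit at ranks (1, 3) and (2) yield MRR@5 = 0.75 and MAP@5 ≈ 0.667.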

Task 3: Video topic model
We do not use any language model for BERTopic; we provide image embeddings directly

The motivation for this task is to find the overall narrative of an author and, if possible, surface trends or propaganda-like behaviour through topic clusters over time. We will use a slight modification of BERTopic to achieve this.

BERTopic is a topic modelling technique that leverages embedding models and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. BERTopic is primarily used for NLP tasks on text documents.

The approach we’re taking is similar to how BERTopic works on multimodal data. Instead of passing a single image, we pass the average of the five pre-calculated embeddings that represent each video file. This means that BERTopic won’t create an embedding from the video’s text description. Instead, it will cluster the videos based on their image embeddings and then create a topic representation using the provided text descriptions.

If you have difficulty understanding the above paragraph, you might want to look at how the BERTopic algorithm works. Links provided below.
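The averaging step can be sketched as below, with the hand-off to BERTopic shown in comments; the `fit_transform(..., embeddings=...)` call follows BERTopic's documented API, but treat the snippet as an assumption-laden outline rather than the project's exact code.

```python
import numpy as np

def video_embedding(frame_embs):
    """Average the per-frame CLIP embeddings into one vector per video,
    then L2-normalise it."""
    v = np.mean(frame_embs, axis=0)
    return v / np.linalg.norm(v)

# The resulting video vectors can be passed to BERTopic as precomputed
# embeddings, so it clusters on images but labels topics with the text
# descriptions (sketch; requires the bertopic package):
#   from bertopic import BERTopic
#   topic_model = BERTopic()
#   topics, _ = topic_model.fit_transform(descriptions,
#                                         embeddings=video_vecs)
```

Since clustering happens on the precomputed vectors, BERTopic never embeds the descriptions itself; they are only used to build the c-TF-IDF topic representations.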

We will look at the account we detected previously, running the modified BERTopic algorithm on the content it has posted. We obtained 5 different clusters; we will look at some of them to analyse the account’s posting patterns.

Input: Identified account’s video post

5 clusters, posting date 05 Mar — 07 May
Cluster 0 — Zelensky’s office with green arm chair
Cluster 2 — Zelensky with Ukraine flag as background
Cluster 3 — Zelensky outdoors on the street

Upon examining the videos posted by the account, it becomes apparent that its posting activity corresponds to the escalation of the conflict in Ukraine, which peaked in March. However, the posting activity ceased after May 7, coinciding with a decrease in the intensity of the conflict.

Cluster 3 consists of numerous videos captured on the streets at night. This demonstrates President Zelensky’s confidence in the war, as he is able to venture outside without fear of incoming missiles or drone attacks from Russia.

The primary focus of the content shared by the account is to provide timely updates on the war situation and offer words of comfort and encouragement to the people of Ukraine. From this, we can infer that the account was intended to engage Ukrainians on social media, providing them with emotional support and fostering a sense of connection with the government.

Why is this interesting?

This tool is interesting because it can help us gain insights into current social media trends. By analyzing the posting frequency and patterns over time, we can potentially identify activity patterns on certain topics. This could be useful for detecting propaganda content posted by bots, which often reuse the same or similar video feeds. Overall, this tool has the potential to provide us with a better understanding of social media content and how it is shared over time.

Not all bells and whistles

From the above tasks it may seem like a perfect solution, but as we know, all tools have their flaws. So what are its limitations?

Task 2: Navigating search results (exploring latent space)

CLIP models have not learned facial recognition, so they cannot retrieve videos of a given person who is not a public figure.

CLIP models struggle with abstract or systematic tasks.

An abstract sentence is one that leaves a lot of room for interpretation and doesn’t give you a clear idea of what’s actually happening. For example, “A red car closest to the image”: what image are you referring to? What perspective should be taken?

A systematic task requires a series of logical steps to understand a sentence. For example, to understand the sentence “Menu at a bar with a drink named 7 heaven” you need to break it down into four steps: looking at the menu, recognizing you’re at a bar, finding the drink called 7 heaven, and putting it all together.

Task 3: Video topic model

Our topic modelling technique will sometimes wrongly cluster certain videos into a topic. This might be due to several reasons:

  • Losing information from text content. Since we do not use the text content for clustering, we lose information that might be useful for the topic model.
  • Clustering based on average images loses too much information. Images of different scenes in a video are merged, making it hard for clustering to perform well.
  • The topic model itself requires trial and error to arrive at a number of clusters that makes sense to the user, much like LDA (Latent Dirichlet Allocation). Although there are measures such as Perplexity or Coherence scores, they are still proxies for a topic model’s performance. The best way is to present the topics through a UI and have humans visually check whether they make sense.

A detailed discussion on identifying the correct topic model can be found here.


Video analytics can be used to identify patterns of suspicious activity in online communities. This information can then be used to alert authorities or implement measures to prevent the further spread of misinformation. Overall, it can be a powerful tool for improving online security and preventing the spread of misinformation, but it is important to use it responsibly and ethically.

Additionally, it is important to recognize that video analytics is just one part of a larger effort to improve online security and combat the spread of misinformation. Other measures, such as educating the public on how to identify and report misinformation, strengthening laws and regulations related to online behavior, and working with technology companies to develop more effective tools and strategies, may also be necessary. By taking a multifaceted approach and working together, we can create a more secure and trustworthy online environment for all.

Other references

Check out the job opportunities that we have in AI & Data Analytics at CSIT.


