Harnessing AI Synergy: Combining Video Analysis with LLM

Michael Balber
CyberArk Engineering
Mar 13, 2024

In the thrilling realm of artificial intelligence, the fusion of various AI tools with large language models (LLMs) is ushering in a revolution, particularly in video analysis. This article explores the potent combination of Amazon Rekognition and Amazon Bedrock in transforming the way visual media is analyzed for critical business applications.

In a previous post, my colleague Tamar Yankelevich explained how to implement LLM analysis over large volumes of text. In this post, I describe a way to create a visual analysis. Together, the two posts show how CyberArk utilizes this technology to extract meaningful cybersecurity events from extensive data.

Step 1: Intelligent Frame Sampling in Video Analysis

A typical video file is a sequence of frames; the challenge lies in efficiently analyzing these frames. The naïve method of examining every frame is neither cost-effective nor practical. An initial approach involves sampling frames at a constant rate (e.g., every five seconds).

However, more sophisticated methods are employed for deeper analysis:

  • Frame Difference Analysis. This involves calculating the difference between consecutive frames. Frames with significant changes are flagged as key frames, indicating essential events.
  • Histogram Comparison. Comparing the color histograms of successive frames detects significant changes that indicate a scene shift or an important event. (Both of these first two techniques are sketched in the code after this list.)
  • Pre-trained Models. Pre-trained deep learning models designed for tasks like scene detection can help extract key frames. These models can be downloaded from Hugging Face and deployed using Amazon SageMaker.
  • Clustering Techniques. Frames are treated as high-dimensional data points, and clustering algorithms group similar frames. The representative frame of each cluster is then analyzed as a key frame.
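
As a rough sketch, the first two techniques can be combined in a few lines of OpenCV; the thresholds below are illustrative and would need tuning for real footage:

```python
# Illustrative key-frame extraction combining frame-difference analysis
# and histogram comparison. Thresholds are example values, not tuned.
import cv2

def extract_key_frames(video_path, diff_threshold=30.0, hist_threshold=0.7):
    cap = cv2.VideoCapture(video_path)
    key_frames = []
    prev_gray, prev_hist = None, None
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_gray is None:
            key_frames.append(frame)  # always keep the first frame
        else:
            # Frame difference: mean absolute pixel change between frames
            diff = cv2.absdiff(gray, prev_gray).mean()
            # Histogram comparison: correlation near 1 means similar frames
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if diff > diff_threshold or similarity < hist_threshold:
                key_frames.append(frame)
        prev_gray, prev_hist = gray, hist
        ok, frame = cap.read()
    cap.release()
    return key_frames
```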

Step 2: Advanced Image Analysis with Amazon Rekognition

The next step involves the application of Amazon Rekognition for in-depth image analysis. This tool surpasses traditional OCR capabilities, offering high accuracy in extracting text and related labels from images. For instance, when analyzing a screenshot of a CyberArk webpage, Rekognition can extract the visible text and classify the image under labels like “webpage” or “file explorer.” This dual extraction is crucial for understanding both the content and its context. Custom label recognition is another powerful feature that allows identifying specific elements, such as company logos.
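
As an illustration, pulling both kinds of signal from a single key frame with boto3 could look like this (the bucket and object names are placeholders):

```python
# Illustrative sketch: extract text lines and labels from one key frame
# stored in S3, using Amazon Rekognition via boto3.
import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-video-frames", "Name": "frame-0001.png"}}

# Text detection returns both WORD and LINE entries; keep whole lines
text_response = rekognition.detect_text(Image=image)
lines = [d["DetectedText"] for d in text_response["TextDetections"]
         if d["Type"] == "LINE"]

# Label detection classifies the image (e.g., "Monitor", "Text")
label_response = rekognition.detect_labels(Image=image, MaxLabels=10)
labels = [l["Name"] for l in label_response["Labels"]]
```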

Step 3: Contextual Interpretation with LLM

With text and labels extracted from sequential video frames, we can now use generative AI to explain what happened. The problem is that this is a massive amount of data and, more importantly, data without context.

To illustrate this problem, let us look at a real-life example: a screenshot of a terminal window in which the user runs a Bash script. The terminal shows a command prompt, the script execution, and its output.
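
A simplified, illustrative sketch of the image-recognition output (a real Rekognition response also carries bounding boxes and confidence scores, omitted here) might look like this:

```json
{
  "labels": ["Terminal", "Text", "Screen"],
  "lines": [
    "user@host:~/project$ ls -al",
    "total 12",
    "drwxr-xr-x  3 user user 4096 Mar 10 09:12 .",
    "drwxr-xr-x 12 user user 4096 Mar 10 09:10 ..",
    "drwxr-xr-x  2 user user 4096 Mar 10 09:12 reports"
  ]
}
```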

How do we extract a meaningful story from such output? This is where generative AI comes to the rescue: it can read, understand, and tell the story behind the flow.

Amazon Bedrock allows us to interact with various LLMs to do exactly that. For example, you can choose Claude, Amazon Titan, or other proprietary models without needing a dedicated machine and GPUs. Bedrock simplifies the way we interact with those models.
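
For illustration, a minimal call to a Claude model through the Bedrock runtime might look like the sketch below; the model ID, prompt, and helper name are examples rather than the exact ones we use:

```python
# Illustrative sketch: ask a Bedrock-hosted Claude model to narrate the
# content of one or more frames described as JSON.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def describe_frames(frames_json):
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": "Describe what the user is doing in these "
                       "terminal frames:\n" + json.dumps(frames_json),
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```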

An LLM can take the JSON input and convert the features of a frame, or of a list of sequential frames, into a textual description.
Let us look at two JSON files. The first one, presented previously, shows a user running ‘ls -al’. The second one is this:
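
Again, an illustrative reconstruction rather than a verbatim response:

```json
{
  "labels": ["Terminal", "Text", "Screen"],
  "lines": [
    "user@host:~/project$ rm -rf reports",
    "user@host:~/project$"
  ]
}
```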

In this JSON, the user deletes the directory that was found.

An LLM can provide us with two crucial pieces of information:

  • What is happening in each image (e.g., a show-all-files command and a directory deletion)
  • A clear and straightforward explanation of the flow (e.g., the steps that the user has completed)

The LLM connects the two stages: the user lists a directory’s contents and then deletes that directory.

This output can then be fed back into an LLM for further analysis, to decide whether this is normal behavior or to extract meaningful cybersecurity events.

Challenges with Long Videos

Processing long videos is a complex task due to LLMs’ inherent input size limitations.

Strategies to manage this include:

  • Segmentation. Breaking the video into segments, each within the LLM’s processing limit.
  • Overlap Integration. Including information from adjacent frames to maintain context.
  • Sequential Analysis with Summary. Generating summaries for each segment to aid the LLM in understanding the overall narrative (see the sketch after this list).
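
A minimal sketch of these strategies combined, reusing the illustrative describe_frames helper from earlier together with a hypothetical summarize helper (both stand in for Bedrock calls):

```python
# Sequential analysis with overlap and running summaries: each segment is
# described together with a summary of everything before it, so the model
# keeps the overall narrative despite input-size limits.
# describe_frames and summarize are hypothetical Bedrock-backed helpers.
def analyze_long_video(segments, overlap=2):
    summary = ""
    for i, segment in enumerate(segments):
        # Overlap integration: carry the tail of the previous segment
        context = segments[i - 1][-overlap:] if i > 0 else []
        description = describe_frames({
            "previous_summary": summary,
            "frames": context + segment,
        })
        # Keep a running summary so later segments stay in context
        summary = summarize(summary + "\n" + description)
    return summary
```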

The Evolving AI Landscape and What It Means for Cybersecurity Technology

Recent enhancements in Amazon Rekognition allow for direct text extraction from videos, potentially reducing the need for complex frame sampling. In addition, the advent of multimodal LLMs capable of processing images directly is a promising development, potentially revolutionizing our current methodologies.

The integration of tools like Amazon Rekognition and Bedrock with LLMs represents a significant leap in video analysis technology. This combination is advantageous in fields requiring deep, contextual analysis of extensive visual data. As AI technologies continue to evolve, such systems’ potential applications and capabilities will expand, heralding a new era of AI-driven analysis and insights.

Thanks to Roy Ben Yosef and Daniel Alfasi for contributing to this post.


Michael Balber is a software architect at CyberArk, experienced in leading high-scale projects from scratch and specializing in cloud architecture and leveraging LLMs.