GPT-4-vision: Trying Out Real-time Image Analysis Based on Context
In this article, we walk through a process that captures video frames in real time from a PC's camera, encodes them in Base64, and sends them to OpenAI's GPT-4-vision model. By feeding the model contextual information from past frames, the technique analyzes the current situation and predicts what happens next, with a particular focus on real-time tracking of situational changes.
Implementation
1. Capturing Video Frames
Real-time capture of video frames is accomplished using the PC’s camera.
2. Encoding Frames
Captured frames are encoded in Base64 format, allowing image data to be handled as text for easy API transmission.
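In isolation, this step is just a bytes-to-text conversion. A minimal sketch using only the standard library, where the byte string stands in for the JPEG buffer that cv2.imencode would produce:

```python
import base64

# Stand-in for the JPEG buffer returned by cv2.imencode(".jpg", frame);
# any bytes object encodes the same way.
jpeg_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"

# Base64-encode the buffer and decode to a plain ASCII string for the API
encoded = base64.b64encode(jpeg_bytes).decode("utf-8")

# The transformation is lossless: decoding restores the original bytes
assert base64.b64decode(encoded) == jpeg_bytes
```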
3. Sending to AI Model
The encoded frames are sent to OpenAI's GPT-4-vision model, which analyzes the video content to describe the current situation and predict the upcoming one.
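The request pairs a text instruction with the Base64 frame embedded as a data URL. A sketch of the payload shape used by the Chat Completions vision format; the prompt text and frame value here are placeholders:

```python
frame_b64 = "PLACEHOLDER_BASE64"  # would come from the encoding step

# One user message carrying both a text part and an image part
payload = {
    "model": "gpt-4-vision-preview",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the current situation."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
    "max_tokens": 500,
}
```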
4. Using Context
The model utilizes context information from past frames to enhance the accuracy of current situation analysis and future predictions.
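The rolling context can be sketched with collections.deque: with maxlen=5, the oldest entry is discarded automatically, so the prompt always carries at most the five most recent descriptions. The sample texts here are placeholders:

```python
from collections import deque

# Keep only the five most recent frame descriptions
previous_texts = deque(maxlen=5)

for i in range(7):  # simulate seven analyzed frames
    previous_texts.append(f"[t{i}] description {i}")

# The two oldest entries have been evicted automatically,
# leaving descriptions 2 through 6
context = " ".join(previous_texts)
```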
Code Example
import cv2
import base64
import os
import time
from openai import OpenAI
from collections import deque
from datetime import datetime


def encode_image_to_base64(frame):
    _, buffer = cv2.imencode(".jpg", frame)
    return base64.b64encode(buffer).decode('utf-8')


def send_frame_to_gpt(frame, previous_texts, client):
    # Combine texts and timestamps from the last 5 frames to create context
    context = ' '.join(previous_texts)
    # Ask the model to check its previous prediction against the current
    # frame, then describe the present and predict what comes next
    prompt_message = (
        f"Context: {context}. Assess if the previous prediction matches "
        "the current situation. Current: explain the current situation in "
        "10 words or less. Next: Predict the next situation in 10 words or less."
    )
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_message},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame}"}},
        ],
    }]
    # Parameters for the API call
    params = {
        "model": "gpt-4-vision-preview",
        "messages": messages,
        "max_tokens": 500,
    }
    # Make the API call
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content


def main():
    # Initialize OpenAI client
    client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
    # Open PC's internal camera
    video = cv2.VideoCapture(0)
    # Queue to hold the texts of the most recent 5 frames
    previous_texts = deque(maxlen=5)
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame in Base64
        base64_image = encode_image_to_base64(frame)
        # Get the current timestamp
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        # Send the frame to GPT and get the generated text
        generated_text = send_frame_to_gpt(base64_image, previous_texts, client)
        print(f"Timestamp: {timestamp}, Generated Text: {generated_text}")
        # Add the text with timestamp to the queue
        previous_texts.append(f"[{timestamp}] {generated_text}")
        # Wait for 1 second before the next frame
        time.sleep(1)
    # Release the camera
    video.release()


if __name__ == "__main__":
    main()
Running the Program
Creating and Activating a Virtual Environment:
Creation: python -m venv myenv
Activation: source myenv/bin/activate
Installing Necessary Packages:
pip install opencv-python requests openai
Setting Environment Variable:
export OPENAI_API_KEY="your-api-key"
Sample Outputs
Timestamp: 2025-01-01 00:00:00, Generated Text:
Current: Astronaut analyzing a complex star chart on a digital screen.
Timestamp: 2025-01-01 00:00:01, Generated Text:
Current: View of the cockpit filled with flickering control panels and monitors.
Next: Unable to predict without additional context or data.
Timestamp: 2025-01-01 00:00:02, Generated Text:
Current: Astronaut adjusting a glowing holographic navigation system.
Next: Unable to predict future actions or events.
Timestamp: 2025-01-01 00:00:03, Generated Text:
Current: Astronaut communicating through headset, lit by console lights.
Next: Continues communication or turns attention to nearby controls.
Timestamp: 2025-01-01 00:00:04, Generated Text:
Current: Close-up of astronaut's hand switching toggles on a control panel.
Next: May initiate a new course or continue monitoring systems.
Timestamp: 2025-01-01 00:00:05, Generated Text:
Current: Astronaut intently watching a 3D model of a galaxy rotating.
Next: Possible adjustment of the spaceship's trajectory or continued observation.
Timestamp: 2025-01-01 00:00:06, Generated Text:
Current: Reflective helmet visor showing a reflection of distant stars.
Next: Astronaut might turn to engage with a different instrument panel.
Timestamp: 2025-01-01 00:00:07, Generated Text:
Current: View of the cockpit's window revealing a distant nebula.
Next: Man could look up or interact with the cockpit's technology.
Timestamp: 2025-01-01 00:00:08, Generated Text:
Current: Astronaut typing on a virtual keyboard, data screens active.
Next: Continuation of data analysis or interaction with another astronaut.
Timestamp: 2025-01-01 00:00:09, Generated Text:
Current: Glancing outside the cockpit window at a passing meteoroid.
Next: May report the sighting or focus back on internal controls.