GPT-4-vision: Trying Out Real-time Image Analysis Based on Context
In this article, we walk through a process that captures video frames in real time from a PC's camera, encodes them in Base64, and sends them to OpenAI's GPT-4-vision model. By feeding the model contextual information from past frames, the technique analyzes the current situation and predicts what happens next, with a particular focus on real-time tracking of situational changes.
Implementation
1. Capturing Video Frames
Real-time capture of video frames is accomplished using the PC’s camera.
2. Encoding Frames
Captured frames are encoded in Base64 format, allowing image data to be handled as text for easy API transmission.
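In isolation, this step is just a bytes-to-text conversion. A minimal sketch using only the standard library, where the byte string stands in for the JPEG buffer that cv2.imencode would produce:

```python
import base64

# Stand-in for the JPEG buffer returned by cv2.imencode(".jpg", frame);
# any bytes object encodes the same way.
jpeg_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"

# Base64-encode the buffer and decode to a plain ASCII string for the API
encoded = base64.b64encode(jpeg_bytes).decode("utf-8")

# The transformation is lossless: decoding restores the original bytes
assert base64.b64decode(encoded) == jpeg_bytes
```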
3. Sending to AI Model
The encoded frames are sent to OpenAI's GPT-4-vision model, which analyzes the video content to describe the current situation and predict the upcoming one.
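The request pairs a text instruction with the Base64 frame embedded as a data URL. A sketch of the payload shape used by the Chat Completions vision format; the prompt text and frame value here are placeholders:

```python
frame_b64 = "PLACEHOLDER_BASE64"  # would come from the encoding step

# One user message carrying both a text part and an image part
payload = {
    "model": "gpt-4-vision-preview",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the current situation."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
    "max_tokens": 500,
}
```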
4. Using Context
The model utilizes context information from past frames to enhance the accuracy of current situation analysis and future predictions.
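The rolling context can be sketched with collections.deque: with maxlen=5, the oldest entry is discarded automatically, so the prompt always carries at most the five most recent descriptions. The sample texts here are placeholders:

```python
from collections import deque

# Keep only the five most recent frame descriptions
previous_texts = deque(maxlen=5)

for i in range(7):  # simulate seven analyzed frames
    previous_texts.append(f"[t{i}] description {i}")

# The two oldest entries have been evicted automatically,
# leaving descriptions 2 through 6
context = " ".join(previous_texts)
```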
Code Example
import cv2
import base64
import os
import time
from openai import OpenAI
from collections import deque
from datetime import datetime


def encode_image_to_base64(frame):
    _, buffer = cv2.imencode(".jpg", frame)
    return base64.b64encode(buffer).decode('utf-8')


def send_frame_to_gpt(frame, previous_texts, client):
    # Combine texts and timestamps from the last 5 frames to create context
    context = ' '.join(previous_texts)
    # Ask the model to check its previous prediction against the current
    # frame, then describe the present and predict what comes next
    prompt_message = (
        f"Context: {context}. Assess if the previous prediction matches "
        "the current situation. Current: explain the current situation in "
        "10 words or less. Next: Predict the next situation in 10 words or less."
    )
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_message},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame}"}},
        ],
    }]
    # Parameters for the API call
    params = {
        "model": "gpt-4-vision-preview",
        "messages": messages,
        "max_tokens": 500,
    }
    # Make the API call
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content


def main():
    # Initialize OpenAI client
    client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
    # Open PC's internal camera
    video = cv2.VideoCapture(0)
    # Queue to hold the texts of the most recent 5 frames
    previous_texts = deque(maxlen=5)
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame in Base64
        base64_image = encode_image_to_base64(frame)
        # Get the current timestamp
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        # Send the frame to GPT and get the generated text
        generated_text = send_frame_to_gpt(base64_image, previous_texts, client)
        print(f"Timestamp: {timestamp}, Generated Text: {generated_text}")
        # Add the text with timestamp to the queue
        previous_texts.append(f"[{timestamp}] {generated_text}")
        # Wait for 1 second before the next frame
        time.sleep(1)
    # Release the camera
    video.release()


if __name__ == "__main__":
    main()
Running the Program
Creating and Activating a Virtual Environment:
Creation: python -m venv myenv
Activation: source myenv/bin/activate
Installing Necessary Packages:
pip install opencv-python requests openai
Setting Environment Variable:
export OPENAI_API_KEY="your-api-key"
Sample Outputs
Timestamp: 2025-01-01 00:00:00, Generated Text:
Current: Astronaut analyzing a complex star chart on a digital screen.
Timestamp: 2025-01-01 00:00:01, Generated Text:
Current: View of the cockpit filled with flickering control panels and monitors.
Next: Unable to predict without additional context or data.
Timestamp: 2025-01-01 00:00:02, Generated Text:
Current: Astronaut adjusting a glowing holographic navigation system.
Next: Unable to predict future actions or events.
Timestamp: 2025-01-01 00:00:03, Generated Text:
Current: Astronaut communicating through headset, lit by console lights.
Next: Continues communication or turns attention to nearby controls.
Timestamp: 2025-01-01 00:00:04, Generated Text:
Current: Close-up of astronaut's hand switching toggles on a control panel.
Next: May initiate a new course or continue monitoring systems.
Timestamp: 2025-01-01 00:00:05, Generated Text:
Current: Astronaut intently watching a 3D model of a galaxy rotating.
Next: Possible adjustment of the spaceship's trajectory or continued observation.
Timestamp: 2025-01-01 00:00:06, Generated Text:
Current: Reflective helmet visor showing a reflection of distant stars.
Next: Astronaut might turn to engage with a different instrument panel.
Timestamp: 2025-01-01 00:00:07, Generated Text:
Current: View of the cockpit's window revealing a distant nebula.
Next: Man could look up or interact with the cockpit's technology.
Timestamp: 2025-01-01 00:00:08, Generated Text:
Current: Astronaut typing on a virtual keyboard, data screens active.
Next: Continuation of data analysis or interaction with another astronaut.
Timestamp: 2025-01-01 00:00:09, Generated Text:
Current: Glancing outside the cockpit window at a passing meteoroid.
Next: May report the sighting or focus back on internal controls.