Building a Python App to Capture Images and Interact with OpenAI’s Vision API
In a world teeming with data, our ability to make sense of visual information is paramount. OpenAI’s recent DevDay introduced breakthroughs like the Vision API, which empowers developers to extend the sense of sight to machines. With this advancement, applications can now not only see but also interpret the world around us, laying the foundation for a myriad of innovative uses.
The Project’s Heartbeat
The app’s primary function is to capture an image using a camera and then apply AI to interpret and describe that image. In this project, I’ve utilized OpenAI’s newly released Vision API in conjunction with their Audio API. This combination allows the system to not only analyze visual content but also convey the findings through synthesized speech, effectively translating visual information into auditory descriptions. The integration showcases how visual and linguistic AI models can be orchestrated to convert image data into a spoken narrative.
Why This Matters
As highlighted during the keynote, OpenAI’s mission extends beyond the technology itself — it’s about augmenting human capabilities and fostering creativity. By leveraging the GPT-4 Turbo model, developers like myself can craft solutions that bring this vision to life, making technology an active participant in our daily lives.
The Tools
The core technologies I used for this project were:
- Python: A versatile programming language that’s become synonymous with AI and machine learning projects.
- OpenCV (Open Source Computer Vision Library): A library of programming functions aimed at real-time computer vision.
- OpenAI’s API: Provides access to powerful AI models capable of understanding and generating natural language and now, analyzing images.
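To follow along, you’ll need the packages behind the imports used throughout this post. Assuming a standard pip setup (package names inferred from the imports below), the install looks like:

pip install openai opencv-python requests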
Step-by-Step Implementation
Initializing the Camera Application
We begin by defining a CameraApp class with an initializer that takes a camera_index to identify which camera device to use and an api_key for authenticating with the OpenAI API. The initialize_camera method attempts to start video capture with the specified camera index.
from openai import OpenAI
import cv2
import time
import base64
import requests

client = OpenAI(api_key='API_KEY_HERE')  # Replace with your OpenAI API key

class CameraApp:
    def __init__(self, camera_index, api_key):
        self.camera_index = camera_index
        self.api_key = api_key

    def initialize_camera(self):
        self.cap = cv2.VideoCapture(self.camera_index)
        if not self.cap.isOpened():
            print("Cannot open camera")
            return False
        return True
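Before wiring up the rest, it’s worth confirming that OpenCV can actually open your device. A minimal sanity check, assuming camera_index=0 (usually the built-in webcam; adjust for your hardware):

app = CameraApp(camera_index=0, api_key='API_KEY_HERE')
print(app.initialize_camera())  # True if the device opened
app.cap.release()               # free the device when done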
Capturing and Saving the Image
The capture_image method activates the camera, waits for it to adjust its autoexposure settings, and then captures a single frame. If successful, it saves the frame to a file.
    def capture_image(self, filename='capture.jpg'):
        if not self.initialize_camera():
            return False
        time.sleep(2)  # Camera warm-up
        for _ in range(10):  # Autoexposure adjustment
            self.cap.read()
        ret, frame = self.cap.read()
        if ret:
            cv2.imwrite(filename, frame)
            print("Image captured successfully")
        else:
            print("Failed to capture image")
        self.cap.release()
        cv2.destroyAllWindows()
        return ret
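The throwaway reads before the real capture aren’t arbitrary: many webcams need a couple of seconds and a handful of frames before autoexposure and white balance settle, and skipping the warm-up tends to produce dark or washed-out first frames.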
Encoding the Image and Making the API Request
To interact with the OpenAI API, the image needs to be encoded in base64. The encode_image method handles this, and send_request sends the encoded image to the OpenAI API, asking for a description of the image limited to 100 words.
    def encode_image(self, image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    def send_request(self, image_data):
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Tell me about this image. Limit your response to 100 words."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
                ]
            }],
            "max_tokens": 300
        }
        return requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload).json()
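The call returns the parsed JSON of a standard Chat Completions response, so the description itself sits at choices[0].message.content; failed calls carry an error object instead of choices. A small defensive helper (extract_description is my own addition, not part of the class above):

def extract_description(response_data):
    # Error responses replace 'choices' with an 'error' object
    if 'error' in response_data:
        raise RuntimeError(response_data['error'].get('message', 'Unknown API error'))
    return response_data['choices'][0]['message']['content']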
Converting the Response to Audio
Once we have the textual response, save_response_as_audio uses the OpenAI Audio API to create a speech output of the description, which is saved as an MP3 file.
    def save_response_as_audio(self, response_data):
        response = client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=response_data['choices'][0]['message']['content'],
        )
        response.stream_to_file("output.mp3")  # Write the synthesized speech to disk
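This leaves output.mp3 next to the script; playback itself is outside the OpenAI SDK. If you want the app to speak immediately, one option (an assumption on my part; any audio player works) is the third-party playsound package:

from playsound import playsound  # pip install playsound

playsound("output.mp3")  # blocks until playback finishes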
Estimating the API Usage Cost
The openai_api_calculate_cost method calculates the cost of using the OpenAI API based on the number of tokens used in the prompt and completion.
    def openai_api_calculate_cost(self, usage, model="gpt-4-1106-vision-preview"):
        # Prices in USD per 1,000 tokens, at the time of writing
        pricing = {
            'gpt-3.5-turbo-1106': {'prompt': 0.001, 'completion': 0.002},
            'gpt-4-8k': {'prompt': 0.03, 'completion': 0.06},
            'gpt-4-32k': {'prompt': 0.06, 'completion': 0.12},
            'gpt-4-1106-vision-preview': {'prompt': 0.01, 'completion': 0.03},
        }
        try:
            model_pricing = pricing[model]
        except KeyError:
            raise ValueError("Invalid model specified")
        prompt_cost = usage['prompt_tokens'] * model_pricing['prompt'] / 1000
        completion_cost = usage['completion_tokens'] * model_pricing['completion'] / 1000
        total_cost = prompt_cost + completion_cost
        print(f"\nTokens used: {usage['prompt_tokens']:,} prompt + {usage['completion_tokens']:,} completion = {usage['total_tokens']:,} tokens")
        print(f"Total cost for {model}: ${total_cost:.4f}\n")
        return total_cost
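As a quick sanity check of the arithmetic: a vision-preview request that uses 1,000 prompt tokens and 100 completion tokens comes to 1,000 × $0.01 / 1,000 + 100 × $0.03 / 1,000 = $0.010 + $0.003, or about $0.013.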
Running the Application
Finally, we instantiate the CameraApp with the appropriate camera index and API key. We then capture an image, send it for analysis, generate the audio output, and calculate the cost.
camera_app = CameraApp(camera_index=1, api_key=client.api_key)

if camera_app.capture_image():
    encoded_image = camera_app.encode_image('capture.jpg')
    response_data = camera_app.send_request(encoded_image)
    camera_app.save_response_as_audio(response_data)
    total_cost = camera_app.openai_api_calculate_cost(response_data['usage'])
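One thing to watch: camera_index=1 points at the second camera on my machine; on most laptops the built-in webcam is index 0, so adjust the index if the capture fails.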
Connect with Me
I’m always open to connecting with fellow developers and tech enthusiasts. If you want to discuss this project, share ideas, or simply connect, feel free to reach out to me.
- LinkedIn: Ryan Klapper
- GitHub: klapp101
Explore the Codebase
The complete code for this project is available for you to delve into. Check it out, play around with it, and let me know what you think!
- GitHub Repository: CameraApp Codebase
I encourage you to star, fork, and contribute to the repository. Any feedback or contributions are highly appreciated!