Building a Python App to Capture Images and Interact with OpenAI’s Vision API

4 min read · Nov 9, 2023

In a world teeming with data, our ability to make sense of visual information is paramount. OpenAI’s recent DevDay introduced breakthroughs like the Vision API, which empowers developers to extend the sense of sight to machines. With this advancement, applications can now not only see but also interpret the world around us, laying the foundation for a myriad of innovative uses.

The Project’s Heartbeat

The app’s primary function is to capture an image using a camera and then apply AI to interpret and describe that image. In this project, I’ve utilized OpenAI’s newly released Vision API in conjunction with their Audio API. This combination allows the system to not only analyze visual content but also convey the findings through synthesized speech, effectively translating visual information into auditory descriptions. The integration showcases how visual and linguistic AI models can be orchestrated to convert image data into a spoken narrative.

Why This Matters

As highlighted during the keynote, OpenAI’s mission extends beyond the technology itself — it’s about augmenting human capabilities and fostering creativity. By leveraging the GPT-4 Turbo model, developers like myself can craft solutions that bring this vision to life, making technology an active participant in our daily lives.

The Tools

The core technologies I used for this project were:

  • Python: A versatile programming language that’s become synonymous with AI and machine learning projects.
  • OpenCV (Open Source Computer Vision Library): A library of programming functions aimed at real-time computer vision.
  • OpenAI’s API: Provides access to powerful AI models capable of understanding and generating natural language and now, analyzing images.
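
To follow along, you'll need these libraries installed. Assuming the standard PyPI package names (the post doesn't list install commands), the setup is:

pip install openai opencv-python requests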

Step-by-Step Implementation

Initializing the Camera Application

We begin by defining a CameraApp class with an initializer that takes a camera_index to identify which camera device to use and an api_key for authenticating with the OpenAI API. The initialize_camera method attempts to start video capture with the specified camera index.

from openai import OpenAI
import cv2
import time
import base64
import requests

client = OpenAI(api_key='API_KEY_HERE')

class CameraApp:
    def __init__(self, camera_index, api_key):
        self.camera_index = camera_index
        self.api_key = api_key

    def initialize_camera(self):
        self.cap = cv2.VideoCapture(self.camera_index)
        if not self.cap.isOpened():
            print("Cannot open camera")
            return False
        return True

Capturing and Saving the Image

The capture_image method activates the camera, waits for it to adjust its autoexposure settings, and then captures a single frame. If successful, it saves the frame to a file.

def capture_image(self, filename='capture.jpg'):
    if not self.initialize_camera():
        return False
    time.sleep(2)  # Camera warm-up
    for _ in range(10):  # Autoexposure adjustment
        self.cap.read()
    ret, frame = self.cap.read()
    if ret:
        cv2.imwrite(filename, frame)
        print("Image captured successfully")
    else:
        print("Failed to capture image")
    self.cap.release()
    cv2.destroyAllWindows()
    return ret
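
Before wiring in the API, it can help to sanity-check the capture step on its own. A minimal test might look like this (the camera index and filename are assumptions; your device index may differ):

app = CameraApp(camera_index=0, api_key='API_KEY_HERE')
if app.capture_image('test.jpg'):
    print("Check test.jpg")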

Encoding the Image and Making the API Request

To interact with the OpenAI API, the image needs to be encoded in base64. The encode_image method handles this, and send_request sends the encoded image to the OpenAI API, asking for a description of the image limited to 100 words.

def encode_image(self, image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def send_request(self, image_data):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {self.api_key}"
    }
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Tell me about this image. Limit your response to 100 words."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }],
        "max_tokens": 300
    }
    return requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload).json()
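
The JSON that comes back follows the Chat Completions shape, so the description text lives under choices[0].message.content. Here's a minimal sketch of pulling it out, with a basic guard for error responses (the guard is my addition, not part of the original code):

response_data = camera_app.send_request(encoded_image)
if 'error' in response_data:
    print("API error:", response_data['error'].get('message'))
else:
    print(response_data['choices'][0]['message']['content'])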

Converting the Response to Audio

Once we have the textual response, save_response_as_audio uses the OpenAI Audio API to synthesize speech from the description, which is saved as an MP3 file.

def save_response_as_audio(self, response_data):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=response_data['choices'][0]['message']['content'],
    )
    response.stream_to_file("output.mp3")
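
The script stops at writing output.mp3 to disk. If you want it to speak immediately, one option is to play the file from Python; here's a sketch using the third-party playsound package (my choice, not part of the original project):

from playsound import playsound

playsound("output.mp3")  # Blocks until playback finishes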

Estimating the API Usage Cost

The openai_api_calculate_cost method calculates the cost of using the OpenAI API based on the number of tokens used in the prompt and completion.

def openai_api_calculate_cost(self, usage, model="gpt-4-1106-vision-preview"):
    # Prices in USD per 1,000 tokens
    pricing = {
        'gpt-3.5-turbo-1106': {
            'prompt': 0.001,
            'completion': 0.002,
        },
        'gpt-4-8k': {
            'prompt': 0.03,
            'completion': 0.06,
        },
        'gpt-4-32k': {
            'prompt': 0.06,
            'completion': 0.12,
        },
        'gpt-4-1106-vision-preview': {
            'prompt': 0.01,
            'completion': 0.03,
        }
    }

    try:
        model_pricing = pricing[model]
    except KeyError:
        raise ValueError("Invalid model specified")

    prompt_cost = usage['prompt_tokens'] * model_pricing['prompt'] / 1000
    completion_cost = usage['completion_tokens'] * model_pricing['completion'] / 1000

    total_cost = prompt_cost + completion_cost
    print(f"\nTokens used: {usage['prompt_tokens']:,} prompt + {usage['completion_tokens']:,} completion = {usage['total_tokens']:,} tokens")
    print(f"Total cost for {model}: ${total_cost:.4f}\n")

    return total_cost

[Screenshot: the cost for me to generate the photo]
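
To make the arithmetic concrete, here's a hypothetical call with made-up token counts at the gpt-4-1106-vision-preview rates ($0.01 per 1K prompt tokens, $0.03 per 1K completion tokens):

# Hypothetical usage (numbers made up for illustration):
usage = {'prompt_tokens': 1000, 'completion_tokens': 100, 'total_tokens': 1100}
# prompt cost:     1000 * 0.01 / 1000 = $0.010
# completion cost:  100 * 0.03 / 1000 = $0.003
# total:                                $0.013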

Running the Application

Finally, we instantiate the CameraApp with the appropriate camera index and API key. We then capture an image, send it for analysis, calculate the cost, and generate the audio output.

camera_app = CameraApp(camera_index=1, api_key=client.api_key)
if camera_app.capture_image():
    encoded_image = camera_app.encode_image('capture.jpg')
    response_data = camera_app.send_request(encoded_image)
    camera_app.save_response_as_audio(response_data)
    total_cost = camera_app.openai_api_calculate_cost(response_data['usage'])

Here are the final results:

[Photo: Here I am!]
[Audio: Here’s AI describing me!]

Connect with Me

I’m always open to connecting with fellow developers and tech enthusiasts. If you want to discuss this project, share ideas, or simply connect, feel free to reach out to me.

Explore the Codebase

The complete code for this project is available for you to delve into. Check it out, play around with it, and let me know what you think!

I encourage you to star, fork, and contribute to the repository. Any feedback or contributions are highly appreciated!
