Decoding the Herb and Spice Aisle with Generative AI

David Kolb
6 min read · Dec 3, 2023

How CV2 Segmentation and the GPT-4 Vision API Boosted Product Recognition

An intricate supermarket aisle captured on an iPhone 12 Max, filled with a diverse array of products ranging from fresh produce to packaged goods.
Image: David Kolb, DALL-E 3

Have you ever wandered the grocery store’s busy herb and spice aisle, searching in vain for a particular jar buried among the lookalike labels? I have faced that mundane yet frustrating dilemma far too often. That prompted an idea: could ChatGPT4 simplify ingredient searches?

The ChatGPT4 iOS app has demonstrated impressive capabilities in understanding images and identifying objects within them. However, when asked to locate a specific product, lemongrass, on a shelf, it struggled: it put the lemongrass in the wrong place and described it inaccurately.

To improve the analysis, I set up an experiment using OpenCV, an open-source computer vision library with Python bindings. By splitting the dense shelf image into separate segments, I could feed GPT-4 Vision more targeted, simpler images to analyse.

Use Cases

The broader applications of this technology are vast and varied. From aiding visually impaired individuals in shopping to retail inventory management, the implications extend into various domains, including e-commerce and supply chain optimisation.

Approach

  • Take a photo of a complex store setting on an iPhone 12 Max.
  • Perform image segmentation using OpenCV.
  • Analyse the images with the GPT-4 Vision Preview API.

The setting

The original iPhone photo shows densely packed shelves, with jars, containers, and packages of all shapes crowded together.

Setup

Install the openai library.

pip install openai

Install the OpenCV library, along with requests, which the code below uses to call the API.

pip install opencv-python requests

OpenCV for Image Segmentation

Use OpenCV to split the image into six segments. Each segment overlaps its neighbours by 10%, so that a product sitting on a boundary still appears whole in at least one segment and no detail is lost to the vision analysis.

    import cv2

    # rows, cols and overlap_percentage are set beforehand
    # (rows * cols = 6 segments, overlap_percentage = 10)
    image = cv2.imread(image_name)

    # Calculate the dimensions of each square
    height, width, _ = image.shape
    square_height = height // rows
    square_width = width // cols

    # Calculate the overlap size in pixels
    overlap_height = square_height * overlap_percentage // 100
    overlap_width = square_width * overlap_percentage // 100

The segmentation strategy, including the number of segments and the extent of overlap, can be adjusted based on the characteristics of different images.

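    import os

    # Make sure the output folder exists before writing to it
    os.makedirs('sub_images', exist_ok=True)

    for i in range(rows):
        for j in range(cols):
            # Start each segment one overlap early (except at the image edge)
            # so that neighbouring segments share a strip of fixed size
            y_start = max(i * square_height - overlap_height, 0)
            y_end = (i + 1) * square_height
            x_start = max(j * square_width - overlap_width, 0)
            x_end = (j + 1) * square_width

            # Crop the segment out of the full image
            sub_image = image[y_start:y_end, x_start:x_end]

            # Save the sub-image to disk
            sub_image_filename = f'sub_images/sub_image_{i}_{j}.jpg'
            cv2.imwrite(sub_image_filename, sub_image)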
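
To see how the grid size and the overlap interact, the segment boundaries can be computed on their own, without loading an image. The following is a minimal sketch, not the code used in the experiment: the function name tile_bounds and the 2 x 3 grid in the example are assumptions. Each segment is extended backwards by a fixed overlap, and the last row and column run to the image edge.

```python
def tile_bounds(height, width, rows, cols, overlap_percentage=10):
    """Return (y_start, y_end, x_start, x_end) for each tile, row by row."""
    tile_h = height // rows
    tile_w = width // cols
    overlap_h = tile_h * overlap_percentage // 100
    overlap_w = tile_w * overlap_percentage // 100

    bounds = []
    for i in range(rows):
        for j in range(cols):
            # Step back by one overlap, but never past the image edge.
            y_start = max(i * tile_h - overlap_h, 0)
            x_start = max(j * tile_w - overlap_w, 0)
            # The last row and column run to the image edge so that
            # leftover pixels from integer division are not dropped.
            y_end = height if i == rows - 1 else (i + 1) * tile_h
            x_end = width if j == cols - 1 else (j + 1) * tile_w
            bounds.append((y_start, y_end, x_start, x_end))
    return bounds


if __name__ == "__main__":
    # Hypothetical 3024 x 4032 photo split into 2 rows and 3 columns.
    for box in tile_bounds(3024, 4032, rows=2, cols=3):
        print(box)
```

Each tuple can be used directly as a NumPy slice, for example image[y_start:y_end, x_start:x_end], so trying a different grid or overlap only means changing the arguments.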

GPT-4 Vision API

The GPT-4 Vision API accepts multiple image inputs, supplied either as base64-encoded strings or as image URLs. For this experiment, I used base64 encoding.

    import base64

    def encode_image(image_name):
        # Read the file as bytes and return its base64 text representation
        with open(image_name, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

The image files were sorted by filename and then base64 encoded, ready for processing. The images were stored locally so that they could be accessed and processed at various stages of the experiment.

    import os

    # image_directory is the folder containing the segments saved earlier
    encoded_images = []
    # Get a list of filenames in the directory and sort them alphabetically
    image_files = sorted(os.listdir(image_directory))

    # Loop through the sorted filenames
    for filename in image_files:
        if filename.endswith(".jpg"):
            image_name = os.path.join(image_directory, filename)

            # Encode the image using the encode_image function
            base64_image = encode_image(image_name)
            encoded_images.append(base64_image)
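
When the images are attached to the request, each base64 string is wrapped in a data URL. As a small sketch of that step, the helper below guesses the MIME type from the file extension with Python's standard mimetypes module rather than hard-coding image/jpeg, which may be useful if the segments are ever saved as PNG. The helper name to_data_url is hypothetical and not part of the original experiment.

```python
import base64
import mimetypes


def to_data_url(image_name):
    """Read an image file and return it as a base64 data URL."""
    # guess_type works from the file extension only; it does not open the file.
    mime_type, _ = mimetypes.guess_type(image_name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {image_name}")
    with open(image_name, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```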

Set up the request headers for OpenAI’s API: the type of data being sent and your API key.

    # Messages to send to the API; filled in below
    messages = []

    # api_key holds your OpenAI API key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
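
The api_key value has to come from somewhere. One common approach, sketched here as an assumption rather than the setup used in the experiment, is to read it from an environment variable so that the key never appears in the source file. The variable name OPENAI_API_KEY follows the convention used by OpenAI's own client library; the helper name build_headers is hypothetical.

```python
import os


def build_headers(api_key=None):
    """Build request headers, reading the key from the environment if not given."""
    api_key = api_key or os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```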

A prompt is included in the API request. My prompt was:

“In these images, please specify in which image I can find the lemongrass and provide a detailed description of the product and its location.”

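    # custom_text_message holds the prompt shown above
    text_message = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": custom_text_message,
            }
        ]
    }
    # Add the prompt to the list of messages sent to the API
    messages.append(text_message)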

Add each of the six base64 encoded images to the API message.

    # Iterate through encoded images and add image messages
    for base64_image in encoded_images:
        image_message = {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
        messages.append(image_message)

Create the API payload, including the prompt and images.

    # Create the payload with the list of messages
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": messages,
        "max_tokens": 300
    }
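
The Chat Completions format also allows a single user message to carry several content parts, so the prompt and all the images can travel together. The sketch below shows that alternative layout; it is not the structure used in the experiment. The function name build_payload and the "Image N:" text labels are assumptions, added on the guess that numbered labels make it easier for the model to say which segment it means.

```python
def build_payload(prompt, encoded_images, model="gpt-4-vision-preview", max_tokens=300):
    """Return a request payload with the prompt and all images in one user message."""
    content = [{"type": "text", "text": prompt}]
    for number, base64_image in enumerate(encoded_images, start=1):
        # A short text label before each image (an assumption, see above).
        content.append({"type": "text", "text": f"Image {number}:"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

Whether this layout improves the answer compared with one message per image is something to test rather than assume.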

Send the request to OpenAI.

    import requests

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers=headers,
        json=payload
    )

Once the model has processed the request, retrieve and display the response.

    print(response.json())
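
Printing the raw JSON is fine for a quick look, but the answer itself sits a few levels down. A minimal sketch of pulling it out follows, assuming the response has the choices and usage fields seen in the output later in this article, and that failures come back as a JSON object with an error key. The helper name extract_answer is hypothetical.

```python
def extract_answer(response_json):
    """Return (answer_text, total_tokens) from a chat completion response dict."""
    if "error" in response_json:
        # Error responses carry a message instead of choices.
        raise RuntimeError(response_json["error"].get("message", "Unknown API error"))
    answer = response_json["choices"][0]["message"]["content"]
    # usage may be absent, in which case total_tokens is None.
    total_tokens = response_json.get("usage", {}).get("total_tokens")
    return answer, total_tokens
```

With the request above, this would be called as answer, tokens = extract_answer(response.json()).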

Results

The GPT-4 Vision API’s response correctly locates the lemongrass. It describes the product as dried lemongrass, typically used as a flavouring in Asian cuisine for its distinctive, citrus-like flavour, and places it on the second shelf from the bottom, in cylindrical containers with greenish-yellow labels.

Zoom in on the Lemongrass Location

The labels are marked with “LEMONGRASS” in bold black text against a light backdrop, next to a black-labeled product called “Palm Sugar.” The analysis also captures other elements, including various spices above and packages of dried mushrooms below the lemongrass.

The response also mentions the lemongrass’s price tag, which further confirms the identification. This detailed breakdown shows that the API, when combined with OpenCV image segmentation, can recognise and describe products in a visually complex setting.

JSON from the GPT-4 Vision API’s response
API request took 16.05 seconds
{'id': '<removed>', 'object': 'chat.completion',
'created': 1700243878, 'model': 'gpt-4-1106-vision-preview',
'usage': {'prompt_tokens': 7030, 'completion_tokens': 217,
'total_tokens': 7247}, 'choices': [{'message': {'role': 'assistant',
'content': 'Lemongrass can be found in the last image you uploaded.
Here is a description and location:\n\nDescription:\nThe product in question
is lemongrass, which appears to be in dried form. Specifically, it is
"Waitrose Cooks\' Ingredients Lemongrass." This is commonly used as a
flavoring in Asian cuisine, providing a distinctive citrus flavor to dishes
without the tartness of actual lemon.\n\nLocation:\nThe lemongrass is
located on the second shelf from the bottom. There are two cylindrical
containers with greenish-yellow labels. Each container is labeled clearly
with "LEMONGRASS" in black text on a light background. The containers are
positioned next to a black-labeled product called "Palm Sugar." Above the
lemongrass, you can see various spices and below, there are packages of
dried mushrooms. The price tag visible below the lemongrass indicates that
it costs £1.80. The shelf tag located directly under the lemongrass\'s
shelf also appears to indicate the product\'s name, reinforcing its
identification.'}, 'finish_details': {'type': 'stop', 'stop':
'<|fim_suffix|>'}, 'index': 0}]}
127.0.0.1 - - [17/Nov/2023 17:58:05] "POST / HTTP/1.1" 200

Key Takeaways

This experiment underscores the synergy between computer vision and large language models in practical AI applications. The successful identification of a specific item in a complex retail setting paves the way for more nuanced and sophisticated uses of Generative AI in retail and other settings.

  • In this experiment, combining OpenCV image segmentation with GPT-4’s vision API identified a product in a densely packed retail setting that the unsegmented image did not.
  • There is significant potential to transform retail experiences by integrating computer vision and large language models. Applications could range from improving customer service to optimising inventory tracking.
  • Generative AI innovations often call for multidisciplinary approaches, such as pairing OpenCV’s image processing with the capabilities of large neural network models.

Interested in the intersection of Generative AI and retail? Share your thoughts in the comments below, or reach out for a deeper discussion.


David Kolb

Innovation Strategist & Coach | Cyclist 🚴‍♀️ | Photographer 📸 | IDEO U Alumni Coach