Decoding the Herb and Spice Aisle with Generative AI

Dec 3, 2023

How CV2 Segmentation and the GPT-4 Vision API Boosted Product Recognition

An intricate supermarket aisle captured on an iPhone 12 Max, filled with a diverse array of products ranging from fresh produce to packaged goods. Image: David Kolb, DALL·E 3

Have you ever wandered the grocery store’s busy herb and spice aisle, searching in vain for a particular jar buried among the lookalike labels? I have faced that mundane yet frustrating dilemma far too often recently. That prompted an idea: could ChatGPT4 simplify ingredient searches?

ChatGPT4 on iOS has demonstrated impressive capabilities in understanding images and identifying objects within them. However, when asked to locate a specific product, lemongrass, on a shelf, ChatGPT4 struggled: it placed the lemongrass in the wrong spot and described it inaccurately.

To improve the analysis, I set up an experiment using OpenCV, an open-source computer vision library with Python bindings. By segmenting the dense shelf image into separate sections, I could feed the model smaller, simpler images to analyse.

Use Cases

The broader applications of this technology are vast and varied. From aiding visually impaired individuals in shopping to retail inventory management, the implications extend into various domains, including e-commerce and supply chain optimisation.

Approach

Take a photo of a complex store setting on an iPhone 12 Max.
Perform image segmentation using OpenCV.
Analyse the images with the GPT-4 Vision Preview API.

The setting

The original iPhone photo shows the densely packed shelves, with jars, containers, and packages of all shapes crowded together.

Setup

Install the openai and requests libraries. The API call later in this article is sent with requests.

pip install openai requests

Install the OpenCV library.

pip install opencv-python

OpenCV for Image Segmentation

Use OpenCV to split the image into six segments. A 10% overlap between segments helps ensure that products lying on a boundary between segments are still captured, so that no details are missed in the vision analysis.

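    import cv2

    # image_name, rows, cols and overlap_percentage are supplied by the caller;
    # this experiment used six segments and a 10% overlap
    image = cv2.imread(image_name)
    if image is None:
        # cv2.imread returns None rather than raising when the file cannot be read
        raise FileNotFoundError(f"Could not read image: {image_name}")

    # Calculate the dimensions of each square
    height, width, _ = image.shape
    square_height = height // rows
    square_width = width // cols

    # Calculate the overlap size in pixels
    overlap_height = square_height * overlap_percentage // 100
    overlap_width = square_width * overlap_percentage // 100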

The segmentation strategy, including the number of segments and the extent of overlap, can be adjusted based on the characteristics of different images.

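    for i in range(rows):
        for j in range(cols):
            # Start each tile one overlap before its grid cell (clamped at the
            # image edge) so every pair of neighbours shares the same overlap
            y_start = max(i * square_height - overlap_height, 0)
            y_end = (i + 1) * square_height
            x_start = max(j * square_width - overlap_width, 0)
            x_end = (j + 1) * square_width

            sub_image = image[y_start:y_end, x_start:x_end]

            # Save the sub-image to disk; the sub_images directory must already
            # exist, and cv2.imwrite reports failure by returning False
            sub_image_filename = f'sub_images/sub_image_{i}_{j}.jpg'
            if not cv2.imwrite(sub_image_filename, sub_image):
                raise IOError(f"Could not write {sub_image_filename}")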
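Before cutting a real photo, it can help to look at the tile boundaries that a given grid and overlap would produce. The following is a minimal sketch rather than the code used in the experiment: `tile_bounds` is a hypothetical helper, the 4000 x 3000 image size and the 2 x 3 grid are assumptions chosen for illustration, and the choice to clamp each start at the image edge and to run the last row and column to the edge is mine.

```python
def tile_bounds(height, width, rows, cols, overlap_percentage):
    """Return (y_start, y_end, x_start, x_end) for each tile, row by row."""
    square_height = height // rows
    square_width = width // cols
    overlap_height = square_height * overlap_percentage // 100
    overlap_width = square_width * overlap_percentage // 100

    bounds = []
    for i in range(rows):
        for j in range(cols):
            # Start one overlap before the grid cell, but never before the edge
            y_start = max(i * square_height - overlap_height, 0)
            x_start = max(j * square_width - overlap_width, 0)
            # The last row and column run to the image edge so that pixels
            # left over by the integer division are not dropped
            y_end = height if i == rows - 1 else (i + 1) * square_height
            x_end = width if j == cols - 1 else (j + 1) * square_width
            bounds.append((y_start, y_end, x_start, x_end))
    return bounds


# Hypothetical 4000 x 3000 pixel photo cut into 2 rows by 3 columns
for tile in tile_bounds(3000, 4000, rows=2, cols=3, overlap_percentage=10):
    print(tile)
```

Each tuple can be used directly as a NumPy slice, for example `image[y_start:y_end, x_start:x_end]`. Printing the bounds first makes it easy to check that neighbouring tiles really share the overlap you intended.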

GPT-4 Vision API

The GPT-4 Vision API accepts multiple images in one request, supplied either as base64 encoded strings or as image URLs. This experiment used base64 encoding.

    import base64

    def encode_image(image_name):
        # Read the file as bytes and return its base64 text representation
        with open(image_name, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

The images were sorted by filename and base64 encoded ready for processing. They were stored locally so that they could be accessed and processed at various stages of the experiment.

    import os

    # image_directory is the folder that holds the segmented images
    encoded_images = []
    # Get a list of filenames in the directory and sort them alphabetically
    image_files = sorted(os.listdir(image_directory))

    # Loop through the sorted filenames
    for filename in image_files:
        if filename.endswith(".jpg"):
            image_name = os.path.join(image_directory, filename)

            # Encode the image using the encode_image function
            base64_image = encode_image(image_name)
            encoded_images.append(base64_image)
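Alphabetical sorting gives the right order for six tiles, because every row and column index is a single digit. With a larger grid it would place `sub_image_10_0.jpg` before `sub_image_2_0.jpg`. A possible way around that, sketched here under the assumption that the files keep the `sub_image_{row}_{col}.jpg` naming used above, is to sort on the numeric indices instead. `TILE_PATTERN` and `tile_sort_key` are hypothetical names introduced for this example.

```python
import re

# Matches the tile naming used in this article, e.g. sub_image_2_10.jpg
TILE_PATTERN = re.compile(r"sub_image_(\d+)_(\d+)\.jpg$")


def tile_sort_key(filename):
    """Return (row, col) parsed from a tile filename."""
    match = TILE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Not a tile filename: {filename}")
    return int(match.group(1)), int(match.group(2))


names = ["sub_image_10_0.jpg", "sub_image_2_0.jpg", "sub_image_0_1.jpg"]
print(sorted(names))                     # alphabetical: 0_1, 10_0, 2_0
print(sorted(names, key=tile_sort_key))  # grid order:   0_1, 2_0, 10_0
```

Keeping the upload order identical to the grid order matters because the model refers to images by their position in the request.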

Set up the request headers for OpenAI’s API: the type of data being sent and your API key.

    messages = []

    # api_key must hold your OpenAI API key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
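The article does not show where `api_key` comes from. One common option, offered here as an assumption rather than as the setup used in the experiment, is to read it from the `OPENAI_API_KEY` environment variable so that the key never appears in the source file. `build_headers` is a hypothetical helper.

```python
import os


def build_headers(api_key=None):
    """Build the request headers, falling back to an environment variable."""
    # Assumption: the key is stored in OPENAI_API_KEY when not passed in
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY or pass api_key explicitly")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
```

Calling `headers = build_headers()` then yields the same dictionary shape as the snippet above.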

A prompt is included in the API request. My prompt was:

“In these images, please specify in which image I can find the lemongrass and provide a detailed description of the product and its location.”

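    # custom_text_message holds the prompt quoted above
    text_message = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": custom_text_message,
            }
        ]
    }
    # Add the prompt to the list of messages that goes into the payload
    messages.append(text_message)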

Add each of the six base64 encoded images to the API message.

    # Iterate through encoded images and add image messages
    for base64_image in encoded_images:
        image_message = {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
        messages.append(image_message)

Create the API payload, including the prompt and images.

    # Create the payload with the list of messages
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": messages,
        "max_tokens": 300
    }
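The chat completions format also allows the prompt and all of the images to travel in a single user message, as several entries in one `content` list. The sketch below shows that variant; it is an alternative layout, not the one used in this experiment, and whether it changes the model's answer was not tested here. `build_payload` is a hypothetical helper, and its defaults copy the model name and token limit from the payload above.

```python
def build_payload(prompt, encoded_images, model="gpt-4-vision-preview",
                  max_tokens=300):
    """Return a chat completions payload with one user message."""
    # The text part comes first, followed by one part per image
    content = [{"type": "text", "text": prompt}]
    for base64_image in encoded_images:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

The images keep the order of `encoded_images`, so a reply such as "the last image" still maps back to the last tile in the list.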

Send the request to OpenAI.

    import requests

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers=headers,
        json=payload
    )

Once the model has processed the request, retrieve and display the response.

    print(response.json())
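Printing the whole dictionary is fine for a first look, but usually only the answer text and the token count are needed. The sketch below pulls those two values out of the parsed JSON. `extract_answer` is a hypothetical helper, and `sample` is a trimmed stand-in for `response.json()` that follows the shape of the reply reported in this article.

```python
def extract_answer(data):
    """Return (answer_text, total_tokens) from a parsed chat completions reply."""
    if "error" in data:
        # Failed requests carry an "error" object instead of "choices"
        raise RuntimeError(data["error"].get("message", "Unknown API error"))
    answer = data["choices"][0]["message"]["content"]
    total_tokens = data.get("usage", {}).get("total_tokens")
    return answer, total_tokens


# Trimmed stand-in for response.json()
sample = {
    "usage": {"prompt_tokens": 7030, "completion_tokens": 217,
              "total_tokens": 7247},
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "Lemongrass can be found in the last image you uploaded.",
        },
        "index": 0,
    }],
}

answer, total_tokens = extract_answer(sample)
print(answer)
print(f"Total tokens: {total_tokens}")
```

In the real script the call would be `extract_answer(response.json())`.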

Results

The GPT-4 Vision API’s response successfully details the location of the lemongrass. It describes the product as dried lemongrass, commonly used as a flavouring in Asian cuisine for its distinctive citrus flavour. The lemongrass is on the second shelf from the bottom, in two cylindrical containers with greenish-yellow labels.

The labels read “LEMONGRASS” in black text on a light background, and the containers sit next to a black-labelled product called “Palm Sugar.” The response also notes the surroundings: various spices above the lemongrass and packages of dried mushrooms below it.

The response mentions the lemongrass’s price tag (£1.80) and the shelf tag as well, which further support the identification. Combined with OpenCV image segmentation, the API was able to recognise and describe a product in a complex visual setting.

JSON from the GPT-4 Vision API’s response

API request took 16.05 seconds
{'id': '<removed>', 'object': 'chat.completion',
'created': 1700243878, 'model': 'gpt-4-1106-vision-preview', 
'usage': {'prompt_tokens': 7030, 'completion_tokens': 217, 
'total_tokens': 7247}, 'choices': [{'message': {'role': 'assistant', 
'content': 'Lemongrass can be found in the last image you uploaded. 
Here is a description and location:\n\nDescription:\nThe product in question 
is lemongrass, which appears to be in dried form. Specifically, it is 
"Waitrose Cooks\' Ingredients Lemongrass." This is commonly used as a 
flavoring in Asian cuisine, providing a distinctive citrus flavor to dishes 
without the tartness of actual lemon.\n\nLocation:\nThe lemongrass is 
located on the second shelf from the bottom. There are two cylindrical 
containers with greenish-yellow labels. Each container is labeled clearly 
with "LEMONGRASS" in black text on a light background. The containers are 
positioned next to a black-labeled product called "Palm Sugar." Above the 
lemongrass, you can see various spices and below, there are packages of 
dried mushrooms. The price tag visible below the lemongrass indicates that 
it costs £1.80. The shelf tag located directly under the lemongrass\'s 
shelf also appears to indicate the product\'s name, reinforcing its 
identification.'}, 'finish_details': {'type': 'stop', 'stop': 
'<|fim_suffix|>'}, 'index': 0}]}
127.0.0.1 - - [17/Nov/2023 17:58:05] "POST / HTTP/1.1" 200
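The reply identifies the tile only by position ("the last image you uploaded"). Because the tiles were uploaded in sorted filename order, row by row, a position can be turned back into a grid cell and a filename. This is a sketch under one loud assumption: the article says six segments but does not state the grid shape, so the 2 x 3 layout (`cols=3`) used here is a guess. `tile_for_position` is a hypothetical helper.

```python
def tile_for_position(position, cols):
    """Map a 0-based upload position to (row, col, filename)."""
    # Tiles are ordered row by row, so divmod recovers the grid cell
    row, col = divmod(position, cols)
    return row, col, f"sub_images/sub_image_{row}_{col}.jpg"


# Six tiles, so "the last image" is position 5; cols=3 is an assumption
print(tile_for_position(5, cols=3))  # (1, 2, 'sub_images/sub_image_1_2.jpg')
```

With the cell known, the tile's pixel offsets within the original photo can be used to point a shopper at the right part of the shelf.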

Key Takeaways

This experiment underscores the synergy between computer vision and large language models in practical AI applications. The successful identification of a specific item in a complex retail setting paves the way for more nuanced and sophisticated uses of Generative AI in retail and other settings.

Combining CV2 image segmentation with GPT-4’s vision API enables more accurate identification of products in densely packed retail settings.
There is significant potential to transform retail experiences by integrating computer vision and large language models. Applications could range from improving customer service to optimising inventory tracking.
Generative AI innovations often necessitate multidisciplinary approaches, like blending OpenCV computer vision libraries with neural network capabilities.

Interested in the intersection of Generative AI and retail? Share your thoughts in the comments below, or reach out for a deeper discussion.

Links to Code

OpenCV2 https://pypi.org/project/opencv-python/

OpenAI Vision https://platform.openai.com/docs/guides/vision

David Kolb www.linkedin.com/in/david-kolb/