Revolutionizing Image Search with GPT-4 Vision

Unveiling Precision and Context in Visual Discovery with AI!

Suresh R
13 min read · Nov 7, 2023
Source: Dall-E 3

After OpenAI unveiled GPT-4 Vision on their website in September, I became captivated by the potential applications we could build once an API version of this technology became available. That anticipation paid off today, as OpenAI announced the Vision API, accompanied by an array of new models and features, during their Developer Day presentation.

One of the compelling applications we can explore is conducting image searches without the need to compare text directly to an image. In this article, I will explain what GPT-4 Vision is, why it’s a game-changer in the realm of visual search, and how I utilized it to streamline the process of searching through images.

The GitHub link for the project can be found here.

Source: GPT-4 vision it’s amazing (Alpha users) : r/OpenAI (reddit.com)

Problem Statement

The primary issue confronting current image search technology is its limited ability to understand and process visual content with the same nuance and depth as human perception. Traditional image search algorithms rely heavily on metadata and image recognition software that can misinterpret the context or content of an image, leading to inaccurate search results.

The aim of this article is to demonstrate how GPT-4 Vision can be leveraged to overcome the limitations of conventional image search methods, thereby providing a more intuitive and user-friendly search experience, while also taking into account the cost and policy constraints associated with using such cutting-edge technology.

Value Proposition of the Project

  1. Enhanced Accuracy: By leveraging GPT-4’s advanced understanding of images and language, search results are more aligned with the users’ intent and context of the query.
  2. Contextual Understanding: GPT-4 Vision goes beyond keyword matching to comprehend the nuances and subtleties within images, thus providing results that account for context and content more deeply.
  3. Efficient Processing: By converting visual content into text descriptions, the system allows for faster and more efficient comparison of search queries with a database of images, reducing search time.
  4. Innovative Content Interaction: Users can interact with the search system in novel ways, such as asking complex questions about the contents of an image, which traditional image search engines cannot handle.
  5. Improved User Experience: By providing more accurate and relevant results, user satisfaction with the image search process is expected to increase.
  6. Reduced Infrastructure Demand: For businesses, utilizing GPT-4 Vision can mean less reliance on heavy computational infrastructure compared to running their own complex image processing and recognition systems.
  7. Flexibility and Adaptability: The system is adaptable to various types of image searches, ranging from simple object identification to complex scene descriptions, making it a versatile tool for multiple use cases.
  8. Future-proof Technology: By incorporating the state-of-the-art AI model, the system is poised to evolve with advancements in AI, ensuring long-term relevance and improvement.

What is GPT-4 Vision?

GPT-4 Vision is a Visual Question Answering (VQA) model that lets users interact with GPT-4 not only through text but also through images, asking questions about the images they provide.

VQA is a research area in artificial intelligence that combines techniques from computer vision and natural language processing to answer questions about the contents of an image. GPT-4 Vision extends this concept by attaching an image encoder to a large language model (GPT-4), which allows GPT-4 to understand visual information from images in addition to the textual information it already handles.

Source: (PDF) A Novel Approach on Visual Question Answering by Parameter Prediction using Faster Region Based Convolutional Neural Network (researchgate.net)

Why is GPT-4 Vision so significant?

  • GPT-4 Vision is a major advancement because it uses GPT-4, which is a cutting-edge language model, to understand and analyze images that users upload. This allows it to answer questions about these images more accurately than any other tool currently available.
  • It is versatile and can manage various tasks related to both text and images, such as describing, explaining, summarizing, translating, and generating content. It can also answer complex questions about the information it receives.
  • Moreover, GPT-4 Vision can generate creative content like poems, stories, code, essays, and songs based on the images and text it processes.
  • Safety and responsibility are at the forefront of GPT-4 Vision’s design. It has gone through rigorous testing to ensure it behaves safely, with built-in measures to prevent risks, especially in sensitive areas like identifying people, giving medical advice, or making assumptions without solid evidence.
Source: [2303.08774] GPT-4 Technical Report (arxiv.org) (Notably, the model is capable not just of recognizing the events in an image but also of identifying the humor in them.)

The only comparable alternatives currently available are Fuyu-8B and LLaVA 1.5, both of which are open-source. You can opt for either if you find that GPT-4 Vision does not meet your specific needs or if you need more control over your data.

Conventional Approach

In a typical image search algorithm, you might start by transforming the input text into a text embedding and converting the image into image embeddings. These are then compared using a model that operates within a shared embedding space, like CLIP.
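
For reference, a minimal sketch of this conventional baseline is shown below, using Hugging Face’s CLIP implementation. This is not part of the project in this article, and the image path and query text are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # placeholder local image
text_inputs = processor(text=["a tall blue perfume bottle"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = clip.get_text_features(**text_inputs)      # text embedding in the shared space
    image_emb = clip.get_image_features(**image_inputs)   # image embedding in the same space

# Cosine similarity in CLIP's shared embedding space acts as the relevance score
score = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(score.item())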

However, there are several drawbacks to this approach, including:

  • Limited Understanding: Traditional embedding-based methods may have limited capability in understanding the nuanced relationship between text and images. They might be good at matching direct correlations but can struggle with abstract concepts or interpreting the context of the images.
  • Data Bias: The accuracy of embeddings often depends on the data they were trained on. If the dataset has biases or lacks diversity, the search results could be skewed, leading to less accurate or fair outcomes.
  • Complex Queries: When users input complex or detailed search queries, these methods might not always provide accurate results because the algorithm may not capture all the nuances in both text and visual content.
  • Computational Resources: Generating and comparing embeddings can be resource-intensive, requiring significant computational power, especially when dealing with large image databases.

My Methodology

What if we could take a different approach by not directly using an image embedding model? Imagine if we pass our images to GPT-4 Vision and let it describe the images in words first. Then, we could use a superior text embedding model to process these descriptions. By comparing the text embeddings of the input and the image descriptions, we might enhance our results in several ways.

Advantages of this new approach include:

  • Improved Accuracy: GPT-4 Vision can extract details from images more effectively than any open-source VQA model currently available, leading to more precise search results.
  • Reduced Search Time: It is generally faster and requires less computational effort to compare two sets of text embeddings than to compare text embeddings with image embeddings. This is because text embeddings typically have fewer dimensions; for instance, Word2Vec or GloVe models usually have dimensions in the hundreds (e.g., 300 for Word2Vec), while image embeddings from convolutional neural networks can have thousands or even tens of thousands of dimensions (see the sketch after this list).
  • Space Efficiency: Text embeddings also occupy less disk space. Their lower dimensionality means they are more compact compared to the more dimensional image embeddings.
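
To make the dimensionality gap mentioned above concrete, here is a small illustration; the numbers are typical examples rather than measurements from this project, and the ResNet-50 feature extractor is only a stand-in for a generic CNN image encoder.

import torch
from torchvision import models

# ResNet-50 with its classifier head removed: the pooled output is a
# 2048-dimensional image embedding.
resnet = models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

with torch.no_grad():
    image_feature = feature_extractor(torch.randn(1, 3, 224, 224)).flatten(1)

print(image_feature.shape)  # torch.Size([1, 2048])
# For comparison: bge-large-en-v1.5 text embeddings (used later) are 1024-dimensional,
# and classic Word2Vec/GloVe vectors are typically around 300-dimensional.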

Implementation

I am now going to implement an image search for an e-commerce store that specializes in selling luxury commodities such as liquor and perfumes. The dataset has three columns: skuName, skuDescription, and imageUrl. The skuName refers to the product name, the skuDescription contains a 2-3 sentence description provided by the brand, and imageUrl contains the URL of the product image.

The dataset will be available in the GitHub repository here. Note: The dataset is quite small (only 85 images) as I hit the rate limit for GPT-4 Vision, which is still in preview as of the time of writing this article. You can try scaling it up once it is fully available.

Sample of the Dataset
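
For orientation, a single row has roughly the following shape. The description and URL below are placeholders rather than actual values from the dataset; only the column names match the schema described above.

{
    "skuName": "Aberlour Double Cask 12 Year Old",                        # product name
    "skuDescription": "A Speyside single malt matured in two casks ...",  # placeholder brand description
    "imageUrl": "https://example.com/images/aberlour-12.jpg"              # placeholder image URL
}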

Now, install all of these prerequisite libraries using pip or conda:

openai
numpy
pandas
matplotlib
pillow
requests
FlagEmbedding

Let’s import all of these into our Python file:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import base64
from openai import OpenAI
from PIL import Image
from FlagEmbedding import FlagModel

Before we proceed, ensure that you have access to GPT-4 Vision. You can do this by visiting OpenAI, creating a paid account, and generating an API key. Once you have your key ready, run the following code once:

import os
os.environ['OPENAI_API_KEY'] = 'your-openai-key'

Once you are done with that, let’s add some helper functions.

def load_image(url_or_path):
    if url_or_path.startswith("http://") or url_or_path.startswith("https://"):
        return Image.open(requests.get(url_or_path, stream=True).raw)
    else:
        return Image.open(url_or_path)

def cosine_similarity(vec1, vec2):
    # Compute the dot product of vec1 and vec2
    dot_product = np.dot(vec1, vec2)

    # Compute the L2 norm of vec1 and vec2
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)

    # Compute the cosine similarity
    similarity = dot_product / (norm_vec1 * norm_vec2)

    return similarity

The load_image() function loads an image from either a web URL or a local file path. The cosine_similarity() function calculates the cosine similarity between two vectors. For an in-depth understanding of cosine similarity, please refer to my previous article, where I provide a detailed explanation of this concept.
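
As a quick sanity check, here is what the helper returns for two simple vectors (values chosen purely for illustration):

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, b))  # ~0.7071, since the vectors are 45 degrees apart
print(cosine_similarity(a, a))  # 1.0, identical direction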

Let’s now start the pre-processing stage.

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

api_key = os.environ["OPENAI_API_KEY"]  # the key we set earlier
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}
client = OpenAI()

def question_image(url, query):
    if url.startswith("http://") or url.startswith("https://"):
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": f"{query}"},
                        {
                            "type": "image_url",
                            "image_url": url,
                        },
                    ],
                }
            ],
            max_tokens=1000,
        )
        return response.choices[0].message.content
    else:
        base64_image = encode_image(url)
        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"{query}"
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            },
                        }
                    ]
                }
            ],
            "max_tokens": 1000
        }

        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        temp = response.json()
        return temp['choices'][0]['message']['content']

The encode_image() function encodes any image into its base64 representation. This step is necessary for local images that we wish to send to GPT-4 Vision.

The question_image() function can take either an image URL or a local file path as input and applies distinct logic depending on whether the input is an online resource or a local file, because the payload sent to the OpenAI servers varies with the type of image source. It takes two parameters: url, which specifies the location of the image, and query, the question or prompt related to the image. It returns the answer provided by GPT-4 Vision.
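
Before wiring it into the dataset loop, a quick standalone call looks like this (the URL is a placeholder, not a product from the dataset):

sample_url = "https://example.com/images/sample_whisky_bottle.jpg"  # placeholder URL
print(question_image(sample_url, "Describe this image to me in detail"))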

data = pd.read_excel(r"image_search_dataset.xlsx")
data['imageDescription'] = None
query = "Describe this image to me in detail"
for i, row in data.iterrows():
    url = row['imageUrl']
    try:
        response = requests.get(url)
        if response.status_code == 200:
            desc = question_image(url, query)
            data.at[i, 'imageDescription'] = desc
            data.to_excel(r"image_search_dataset.xlsx", index=False)
    except requests.exceptions.RequestException:
        continue

In the given code, we start by reading our dataset into a variable named data. Next, we create a new column titled imageDescription to hold the descriptions that will be generated by GPT-4 Vision. We then loop through our dataset, and for each image, we check if the URL loads correctly (indicated by a status code of 200). If it does, we send the image URL to GPT-4 Vision with the request, “Describe this image to me in detail.” Upon receiving the description, we capture the response and save it to an Excel file. If a URL fails to load, we skip it and proceed to the next one.
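
Since GPT-4 Vision was heavily rate-limited during the preview (as noted earlier), it can help to wrap the call in a simple retry with backoff so a transient error does not abort the whole loop. A minimal sketch, using a hypothetical wrapper around the question_image() function defined above:

import time

def question_image_with_retry(url, query, retries=3, wait_seconds=20):
    # Retry the GPT-4 Vision call a few times, waiting longer between attempts,
    # so a transient rate-limit error does not stop the description loop.
    for attempt in range(retries):
        try:
            return question_image(url, query)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(wait_seconds * (attempt + 1))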

Now that we have compiled all of our image descriptions, the next step is to focus on the text embeddings. For this purpose, I referred to the MTEB leaderboard on HuggingFace, a benchmark that evaluates text embedding models across multiple datasets and ranks them. From there, I selected the top-performing open-source text embedding model, BAAI/bge-large-en-v1.5, which produces 1024-dimensional embeddings and performs especially well on retrieval tasks. You can choose another model according to your needs; just be mindful that different embedding models differ in dimensionality and speed.

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  use_fp16=True)

We load bge-large-en-v1.5 through FlagEmbedding and set query_instruction_for_retrieval to the value the model’s authors recommend for retrieval tasks. You can also load this model through Sentence Transformers or HuggingFace pipelines.
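
For example, a roughly equivalent Sentence Transformers setup might look like the sketch below; following the model authors’ guidance, the retrieval instruction is prepended to queries but not to the product descriptions themselves.

from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
instruction = "Represent this sentence for searching relevant passages: "

# Queries get the instruction prefix; passages/descriptions are encoded as-is.
query_embedding = st_model.encode(instruction + "tall blue perfume bottle",
                                  normalize_embeddings=True)
passage_embedding = st_model.encode("A tall cylindrical bottle with a deep blue color ...",
                                    normalize_embeddings=True)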

data['embedding'] = None
for i, row in data.iterrows():
    temp = row['skuName'] + row['imageDescription'] + row['skuDescription']
    data.at[i, 'embedding'] = model.encode(temp)

Next, we loop through the dataframe once more. We construct a string variable named temp that concatenates the skuName, imageDescription, and skuDescription. This temp string is then fed into the bge-large model to generate embeddings. The resulting embeddings are stored back into the corresponding rows of the dataframe.
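
As a small efficiency note, FlagModel.encode also accepts a list of strings, so the same embeddings can be computed in one batched call instead of row by row. A sketch under that assumption (fillna guards against rows whose description failed to download):

texts = (data['skuName'].fillna('') +
         data['imageDescription'].fillna('') +
         data['skuDescription'].fillna('')).tolist()
embeddings = model.encode(texts)       # one batched call; returns a 2-D numpy array
data['embedding'] = list(embeddings)   # one 1-D vector per row, as before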

We are finally done with all the pre-processing necessary for the project. Now let’s make the function which will combine all these elements.

def top_5_products(user_input):
    user_embedding = model.encode(user_input)
    data['scores'] = None
    for i, row in data.iterrows():
        data.at[i, 'scores'] = cosine_similarity(user_embedding, row['embedding'])
    data['scores'] = pd.to_numeric(data['scores'], errors='coerce')
    top_5 = data.nlargest(5, 'scores')
    fig, axs = plt.subplots(1, 5, figsize=(15, 3))
    for i, row in enumerate(top_5.iterrows()):
        image = load_image(row[1]['imageUrl'])
        axs[i].imshow(np.array(image))
        axs[i].set_title(f"{row[1]['scores']}")
    plt.show()

The top_5_products() function starts by taking a user’s query and passing it through the bge-large model to produce a text embedding. We then use the cosine_similarity() function defined earlier to compare this query embedding with the embeddings of all the descriptions generated by GPT-4 Vision. These similarity scores are added to the dataframe as a new column. Next, we sort the dataframe to identify the top 5 products with the highest similarity scores and save this subset as top_5. In a live application, this top_5 dataset would be returned to the user; for demonstration purposes, I will present the images alongside their corresponding scores in a visual grid using matplotlib.
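
Calling the function is then a one-liner, for example with one of the queries used in the results below:

top_5_products("Single malt with a sherry cask finish")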

This was the entire end-to-end process behind designing an image search engine using GPT-4 Vision. Now let’s put this to the test.

Results

I will evaluate the program using four distinct types of queries: product type, product description, visual attributes, and sensory or unique characteristics. You will be able to review and assess the results firsthand.

Query 1: Aberlour Double Cask 12 Year Old (Exact product name)

Query 1 Results (Scores are arranged in descending order, with the highest score on the left and decreasing towards the right)

Query 2: Scotch whiskey aged and mellowed (Search based on description)

Query 2 Results (Scores descending from left to right)

Based on the brand’s description, we can see that Ballantine has been recommended the most by the algorithm.

Product description of Ballantine 12 YO 100cl

Query 3: Tall cylindrical bottle with a deep blue color (Search based on Image description)

Query 3 Results (Scores descending from left to right)

We can see from the image description below why the algorithm chose Boss Bottled Night Eau De Toilette 200ml.

Image Description of Boss Bottled Night Eau De Toilette 200ml

Query 4: Single malt with a sherry cask finish (Sensory attribute)

Query 4 Results (Scores descending from left to right)

We can see that the algorithm is able to suggest Auchentoshan Heartwood 100cl because of its image description provided by GPT-4 Vision.

Image Description of Auchentoshan Heartwood 100cl

Scope for Improvements

While GPT-4 Vision has brought remarkable improvements to image search technology, there are specific areas where further development could enhance its performance and usability:

  • Enhanced Text Embedding Models: Future updates could integrate text embedding models with larger sequence lengths than those of bge-large. This improvement would allow for a richer retention of product information, enhancing the precision of search results.
  • Advanced Retrieval Techniques: As the project scales, it would be beneficial to transition from using a dataframe and cosine similarity to employing vector databases with clustering algorithms. This would enable faster and more efficient retrieval of search results, especially in systems with a vast image database (see the sketch after this list).
  • Cost Optimization Strategies: Considering the cost of $0.01105 per high-detail image of size 1366px X 768px, exploring cost-saving measures is crucial. Utilizing lower detail settings could reduce expenses significantly, to around $0.85 per 1000 images. However, this cost-cutting could impact the quality of results, necessitating a balance between detail and expense.
  • Content Policy Adaptations: GPT-4 Vision’s strict content policy may restrict its ability to describe certain types of images, such as those involving fashion models. Finding alternative solutions that comply with content guidelines yet maintain descriptive accuracy will be essential for broadening the application scope of the technology.
  • Open-Source Alternatives: In scenarios where GPT-4 Vision’s costs or content policies pose constraints, open-source alternatives like Fuyu-8B or LLaVA 1.5 could be considered, despite the potential increase in hardware requirements and possible compromise on performance.
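
As a starting point for the “Advanced Retrieval Techniques” item above, here is a minimal sketch using FAISS with inner-product search over L2-normalised vectors, which is equivalent to cosine similarity; the variable names assume the dataframe and model from the implementation above. For very large catalogues, an IVF index (e.g. faiss.IndexIVFFlat) would add the clustering step mentioned in that item.

import faiss
import numpy as np

# Stack the per-row embeddings into one float32 matrix and normalise it,
# so that inner product equals cosine similarity.
embedding_matrix = np.vstack(data['embedding'].to_numpy()).astype('float32')
faiss.normalize_L2(embedding_matrix)

index = faiss.IndexFlatIP(embedding_matrix.shape[1])
index.add(embedding_matrix)

query_vector = np.asarray([model.encode("tall cylindrical bottle with a deep blue color")],
                          dtype='float32')
faiss.normalize_L2(query_vector)

scores, ids = index.search(query_vector, 5)   # top-5 most similar descriptions
print(data.iloc[ids[0]]['skuName'].tolist(), scores[0])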

Conclusion

In summary, the integration of GPT-4 Vision into image search algorithms has the potential to revolutionize the way we interact with and retrieve visual information. This article walked you through the concept of GPT-4 Vision, its underlying mechanism, and a practical application that enhances image search functionality.

By employing GPT-4 Vision, we’ve seen that not only can search accuracy be significantly improved by leveraging more descriptive image interpretations, but search efficiency can also be enhanced due to the reduced computational load of comparing text embeddings over traditional image-text comparisons.

Moreover, the practical application and testing of this technology across various query types have demonstrated its robustness and versatility, catering to a wide spectrum of informational and sensory attributes.

As we stand on the brink of this new era, the implications of such advancements reach far beyond mere convenience. They pave the way for more accessible digital environments, where the visual web can be navigated with the same ease as text-based information, opening up new possibilities for users with visual impairments or those seeking more intuitive search experiences.

As we continue to fine-tune and integrate these models into our systems, one thing remains clear: the way we search, discover, and interact with images is set to evolve in extraordinary ways, making the act of finding not just an outcome, but an experience in itself.

Follow For More!

I try to implement a lot of theoretical concepts in the ML space, with an emphasis on practical and intuitive applications.

Thanks for reading this article! If you have any questions, I will be happy to answer them. Feel free to message me on my LinkedIn or my email for other queries.


Suresh R

Passionate about all things data science, machine learning and coffee ;) LinkedIn: www.linkedin.com/in/suresh-raghu