Building Scalable Image Retrieval Systems: Unlocking the Power of Image Embeddings

Malik Muhammad Kashif Saeed

Can you remember when finding the right photograph meant wading through folder after folder, or relying on simple keyword searches? Those days are behind us thanks to a transformation in image search technology. At the heart of this revolution lies a powerful concept: image embeddings.

This article will guide you through the fascinating world of image embeddings. We will cover how they are reshaping image search, introduce the latest techniques, and show you how to build your own effective image retrieval system. Let's dive in!

Fig 1 Image Embeddings Process and Applications

An image embedding is a numerical vector that captures the semantics of an image's content. The goal of image embedding is to compress visual data into a compact form, which lets machine learning models work with the semantic and visual features of that data easily. By vectorizing images, we can compare them according to their semantic similarity, which yields more precise search results, i.e., retrieval of similar images. This enables applications such as image search, recommendation systems, and object recognition.

There are two main kinds of embedding: dense and sparse. Dense embeddings pack a large amount of information into relatively few values, whereas sparse embeddings use a much larger array of values, most of which are zero. Each has its own strengths depending on the use case. For image search, dense embeddings generally lead the way because they perform better at scale.
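
As a minimal illustration, here is how two dense embeddings can be compared with cosine similarity. The four-dimensional vectors below are made up for the example; real embeddings (e.g. from CLIP) have hundreds of dimensions.

import numpy as np

# Two hypothetical dense embeddings; real ones (e.g. from CLIP) have 512+ dimensions
emb_a = np.array([0.12, -0.48, 0.31, 0.80])
emb_b = np.array([0.10, -0.52, 0.29, 0.77])

# Cosine similarity: dot product of the L2-normalized vectors
cos_sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"Cosine similarity: {cos_sim:.4f}")  # values close to 1.0 indicate semantically similar images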

The Rise of Transformers in Image Processing:

Transformers, originally developed for natural language processing, have recently proven very effective in image processing as well. Unlike CNNs, which rely on spatial operations over local neighborhoods, transformers use self-attention to capture global relationships within an image.

Fig 2 Transformer based image processing

In Vision Transformers (ViT), images are divided into fixed-size patches. These patches are linearly embedded and treated as a sequence of tokens. To preserve information about where each patch sits in the image, positional embeddings are added to the sequence. The Vision Transformer then processes these tokens with multi-head self-attention and feed-forward networks.
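
To make the patching step concrete, here is a minimal PyTorch sketch, using a random 224×224 tensor as a stand-in for a real image, of how an image is cut into 16×16 patches, flattened into tokens, and linearly projected before positional embeddings and self-attention are applied.

import torch

# Toy input: one 3-channel 224x224 image (random values stand in for real pixels)
image = torch.randn(1, 3, 224, 224)
patch_size = 16

# Cut the image into non-overlapping 16x16 patches
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)  # (1, 3, 196, 16, 16)
tokens = patches.permute(0, 2, 1, 3, 4).flatten(2)                     # (1, 196, 768)

# Project each flattened patch to the model dimension (768 is ViT-Base's hidden size)
projection = torch.nn.Linear(3 * patch_size * patch_size, 768)
embedded_tokens = projection(tokens)
print(embedded_tokens.shape)  # torch.Size([1, 196, 768]), ready for self-attention layers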

Fig 3 CNN based image processing

Compared to CNNs:

  1. Attention Mechanism: Vision Transformers excel at capturing long-range relationships, in contrast to the constrained receptive fields of CNNs.
  2. Scalability: Vision Transformers tend to perform better as both the data and the model size grow.
  3. Inductive Bias: CNNs have a strong built-in inductive bias, whereas Vision Transformers are more flexible but may need more data to learn the same properties.
  4. Computational Efficiency: CNNs typically do better on smaller datasets, whereas Vision Transformers prove their worth with large-scale pretraining.

Integration with Vector Databases:

By integrating image embeddings with a vector database such as Milvus, we can build efficient and scalable image retrieval systems: visual data is reduced to compact numerical vectors that allow fast comparison and retrieval. The approach works as follows:

Image Embedding Generation: A model such as CLIP converts each image into a fixed-length embedding that captures its essential visual properties and semantic context.

Vector Storage: The resulting embeddings are stored in a vector database such as Milvus, which is built to handle large volumes of vector data.

Indexing: Milvus builds specialized indexes (such as HNSW and IVF_FLAT) to enable fast approximate nearest neighbor (ANN) search.

Similarity Search: To find similar images, the query image is converted into an embedding, and Milvus quickly locates the closest vectors in its database.

Retrieval: The system returns the actual images associated with the most similar embeddings it finds.

Fig 4 Image Retrieval system

Building a Robust Image Search System:

Before starting, make sure to set up a Python virtual environment and install the required libraries.

Run the following commands in your terminal.

On Windows:

python -m venv .venv

Activate the environment:

.venv\Scripts\activate
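
On macOS/Linux, the activation command is:

source .venv/bin/activate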

Install packages:

pip install pymilvus pillow torch torchvision transformers

1. Image Embedding with CLIP

CLIP, designed by OpenAI, excels at connecting images and language, which makes it a strong choice for generating image embeddings. Here's how to use it:

import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_embedding(image_path):
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model.get_image_features(**inputs)

    # L2-normalize the embedding so distances reflect semantic similarity
    embedding = outputs.squeeze().numpy()
    embedding = embedding / np.linalg.norm(embedding)

    return embedding

The get_embedding() function takes an image path, processes the image with CLIP, and returns a normalized 512-dimensional embedding.
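
A quick sanity check (using a placeholder image path) could look like this:

embedding = get_embedding("path/to/any/image.jpg")
print(embedding.shape)             # (512,) for the ViT-B/32 CLIP model
print(np.linalg.norm(embedding))   # ~1.0, since the vector is L2-normalized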

2. Setting up Milvus

Milvus is a powerful vector database that allows for efficient similarity search. Here’s how to set it up:

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512)
]
schema = CollectionSchema(fields, "Image search")
collection = Collection("image_search", schema)

index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)

The IVF_FLAT index is used for the Milvus collection, which provides a good balance between speed and accuracy.

3. Inserting Embeddings into Milvus

We now have an image embedding function and a Milvus collection. Let's walk through our image folder, call get_embedding() for each image, and store the resulting embeddings in Milvus.

import os

image_folder = "path/to/your/images"
image_paths = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith(('.png', '.jpg', '.jpeg'))]

embeddings = [get_embedding(path).tolist() for path in image_paths]
# The id field is auto-generated, so only the embedding column is passed
collection.insert([embeddings])
collection.flush()

4. Performing Image Search

Now that our images are stored in the vector database, we can perform similarity searches to retrieve the most relevant images.

# Load the collection into memory before searching
collection.load()

def search_similar_images(query_image_path, top_k=5):
    query_embedding = get_embedding(query_image_path)
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = collection.search(
        data=[query_embedding.tolist()],
        anns_field="embedding",
        param=search_params,
        limit=top_k
    )
    return results

results = search_similar_images("path/to/query/image.jpg")
for hit in results[0]:
    print(f"Image ID: {hit.id}, Distance: {hit.distance}")

This function takes a query image, generates its embedding, and retrieves the five most similar images from the Milvus vector database.

Testing and Optimization

To ensure your image search system performs optimally, consider the following guidelines:

  1. Use a diverse dataset: Test different types of images and include edge cases to ensure comprehensive coverage.
  2. Measure performance: Use metrics such as precision@k and mean average precision (mAP) to evaluate the effectiveness of your system; a minimal precision@k sketch follows this list.
  3. Benchmark search speed: Time your queries and strive to optimize both accuracy and speed.
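
Here is a minimal sketch of the precision@k idea, using hypothetical result IDs and ground-truth labels in place of real search output:

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved images that are actually relevant
    top_k = retrieved_ids[:k]
    hits = sum(1 for image_id in top_k if image_id in relevant_ids)
    return hits / k

# Hypothetical example: 3 of the top 5 results are relevant
print(precision_at_k([7, 42, 3, 19, 8], {42, 3, 8, 55}))  # 0.6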

Optimization Tips

  1. Tune CLIP parameters:

Try different CLIP models, such as ViT-B/32, ViT-B/16, and ViT-L/14.

If possible, fine-tune the CLIP model on your specific dataset to further enhance its performance.
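
For example, swapping in the larger ViT-L/14 checkpoint is a two-line change; note that it produces 768-dimensional embeddings, so the dim in the Milvus schema would need to match:

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# ViT-L/14 embeddings are 768-dimensional; change dim=512 to dim=768 in the FieldSchema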

2. Optimize Milvus indexing:

Experiment with the nlist value in the index parameters.

Experiment with different index types, such as HNSW for better recall:

optimized_index_params = {
    "metric_type": "L2",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 500}
}
collection.create_index(field_name="embedding", index_params=optimized_index_params)

3. Implement batch processing: Generate embeddings in batches rather than one image at a time for better throughput.
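
A sketch of batched embedding generation, reusing the model and processor defined earlier (the CLIP processor accepts a list of images):

def get_embeddings_batch(image_paths, batch_size=32):
    all_embeddings = []
    for i in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert('RGB') for p in image_paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            outputs = model.get_image_features(**inputs)
        # L2-normalize each embedding in the batch
        outputs = outputs / outputs.norm(dim=-1, keepdim=True)
        all_embeddings.extend(outputs.numpy().tolist())
    return all_embeddings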

4. Utilize GPU acceleration: If a GPU is available, use it for embedding generation and, where supported, for Milvus operations.
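
A minimal sketch for the embedding side (Milvus GPU support depends on how your Milvus instance is deployed):

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# In get_embedding(), move the inputs to the same device before the forward pass,
# e.g. inputs = {k: v.to(device) for k, v in inputs.items()},
# and bring the result back with outputs.cpu().squeeze().numpy()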

5. Implement caching: Cache the embeddings of frequent queries to avoid recomputing them.
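
A simple in-process option is to memoize embeddings by image path (a sketch; a production system might use an external cache such as Redis):

from functools import lru_cache

@lru_cache(maxsize=1024)
def get_query_embedding_cached(image_path):
    # Repeated queries for the same path skip the CLIP forward pass
    return get_embedding(image_path)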

Conclusion:

And that's a wrap: we've taken a quick tour of the fascinating world of image embeddings and how they change the game for image search. We started with the basics of embeddings, moved into the more advanced territory of Vision Transformers, and even built a working search system with Milvus.

The image search field is evolving fast, driven by advances in embedding techniques. These powerful tools are opening up new possibilities for precise, fast, and intelligent image retrieval.

So, whether you're building the next big thing in image search or just want to get more out of your own photo collection, remember: the right embedding approach can make all the difference.
