Optimizing SAM: Elevating Prompt Engineering

Nandini Lokesh Reddy
6 min read · May 1, 2024


In today’s digital world, when we talk to smart computers like GPT, we don’t just use words. We can show them pictures, videos, or even sounds to tell them what we want. These are called prompts.

But what exactly are prompts, and how do they work?

Well, prompts are like instructions that we give to a smart computer. These instructions help the computer understand what we want it to do. For example, if we want the computer to write a magical story, we might show it a picture of a castle and tell it about wizards and dragons.

Source: Google → Different types of Prompts

Prompts and Embedding

Before the computer can understand our instructions, it needs to change them into a simpler form that it can understand. This process is called embedding. It’s like translating our instructions into a language that the computer knows.

In other words, embedding transforms high-dimensional data (like text or images) into a lower-dimensional representation that the machine-learning model can process more easily.

There are different embedding techniques for different types of data, such as CLIP (Contrastive Language-Image Pre-training) and ALIGN, both of which map images and text into a shared embedding space.
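As a small illustration of what an embedding looks like in code, here is a minimal sketch using CLIP through the Hugging Face transformers library (the checkpoint, library choice, and file name are my assumptions, not something the article prescribes):

# Minimal sketch (assumptions: Hugging Face transformers and the public
# "openai/clip-vit-base-patch32" checkpoint) of embedding an image and a
# sentence into the same vector space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("castle.jpg")  # hypothetical example image
inputs = processor(text=["a castle with wizards and dragons"],
                   images=image, return_tensors="pt", padding=True)

image_embedding = model.get_image_features(pixel_values=inputs["pixel_values"])
text_embedding = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
print(image_embedding.shape, text_embedding.shape)  # both are 512-dimensional vectors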

Prompt vs. Input: What’s the Difference?

While often used interchangeably, there is a subtle distinction between a prompt and an input. A prompt is a set of instructions describing the task the model should perform, while the input data fed to the model may include additional context beyond the prompt itself.

For example, when working with a language model, the prompt might be a question or a few sentences, but the input could also include additional context like the user’s previous messages or relevant background information.

Input vs Prompt
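A tiny sketch of the distinction (the strings below are purely illustrative):

# The prompt is the instruction; the input is everything the model receives.
prompt = "Summarize the customer's issue in one sentence."
context = "Previous message: 'My order arrived damaged and support has not replied.'"
model_input = context + "\n\n" + prompt  # the prompt is only one part of the input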

Visual Prompting: A Case Study with Segment Anything Model (SAM)

Visual prompting is a method of interacting with pre-trained models to accomplish specific tasks by providing a set of instructions, often in different formats like text, images, or bounding boxes.

The Segment Anything Model (SAM) is a prime example of visual prompting in action. SAM is a computer vision model designed for image segmentation — a technique that partitions a digital image into discrete groups of pixels or image segments. Image segmentation is commonly used in object detection, 3D reconstructions, and image editing workflows.

More precisely, image segmentation is the process of assigning a label to every pixel in an image, such that pixels with the same label share certain characteristics. This enables the model to identify and separate objects or regions of interest within an image.
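Concretely, a segmentation mask can be stored as an array with the same height and width as the image, where each entry holds the label of that pixel (a made-up 4×4 example):

import numpy as np

# 0 = background, 1 = first object, 2 = second object (labels are illustrative)
mask = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 0, 1],
                 [2, 2, 0, 0]])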

The image below illustrates the process of image segmentation:

Image Segmentation

To use SAM for image segmentation, we can provide visual prompts in the form of bounding boxes or point annotations on the input image, guiding the model to segment the desired objects or regions.

SAM Architecture

Implementation: Segmenting an Image with SAM Using Prompts

Let’s walk through the process of segmenting an image using SAM and visual prompts:

Install Dependencies: First, we install the required packages, namely the Ultralytics library (which provides Fast SAM) and PyTorch.

Load the Image: Next, we’ll load the image we want to segment into our Python environment.

!pip install ultralytics torch  # install the Ultralytics library (Fast SAM) and PyTorch

from PIL import Image
raw_image = Image.open("./image")  # path to the image you want to segment
raw_image
Source — Google

Before we can start creating masks with the Segment Anything Model (SAM), we need to prepare the input image. The first step is to resize the image so that its longest side is 1024 pixels, the input size the SAM model expects, which ensures optimal performance.

from PIL import Image

def resize_image(image, input_size):
    # Scale the longest side to input_size while preserving the aspect ratio.
    w, h = image.size
    scale = input_size / max(w, h)
    new_w = int(w * scale)
    new_h = int(h * scale)
    image = image.resize((new_w, new_h))
    return image
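We can then apply the helper to the loaded image; the resized result is what all the later prompts operate on:

resized_image = resize_image(raw_image, 1024)  # 1024 matches SAM's expected input size
resized_image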

It’s important to note that SAM requires significant computational resources to run, particularly in terms of memory. To address this, we’ll be using a more efficient version called Fast SAM, which is available through the Ultralytics library.

Fast SAM has been optimized for faster inference and reduced memory consumption, making it more accessible for use on a wider range of hardware configurations.

from ultralytics import YOLO

# Load the Fast SAM checkpoint (here, the small FastSAM-s weights).
model = YOLO('/content/FastSAM-s.pt')
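Before we can prompt for specific objects, Fast SAM needs to run once over the resized image to produce candidate masks; the point and box prompts below select among them. The exact inference arguments here are a reasonable sketch rather than the article's original settings:

# Run Fast SAM over the resized image; `results` is reused by the prompt
# helpers below. The device, confidence, and IoU values are assumptions.
results = model(resized_image, device="cpu", retina_masks=True,
                imgsz=1024, conf=0.5, iou=0.6)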

SAM offers several options for creating masks based on different types of prompts. We can use a single set of pixel coordinates, multiple coordinate points, or even bounding boxes to guide the model’s segmentation process. Let’s explore these options:

1. Single Point Prompt: We can start by using SAM to isolate an object based on a single set of pixel coordinates. To do this, we'll define a point on the image that the model will use as a starting point for segmentation. This point will have a positive label, indicating that the region around it should be included in the mask.

from utils import show_points_on_image
input_points = [[350, 300]]
input_labels = [1]  # positive point
show_points_on_image(resized_image, input_points)

Now, we can obtain the mask using the point as a prompt:

from utils import format_results, point_prompt
from utils import show_masks_on_image

results = format_results(results[0], 0)  # convert Fast SAM's output into the format the prompt helpers expect
masks, _ = point_prompt(results, input_points, input_labels)  # pick the mask containing our point
show_masks_on_image(resized_image, [masks])

2. Multiple Point Prompts: To refine the segmentation further, we can provide multiple points and labels.

input_points = [[350, 300], [620, 300]]
input_labels = [1, 1]  # both positive points
show_points_on_image(resized_image, input_points)

We can also add a negative label to exclude a specific region from the mask. This way, a positive label includes the region around one point, while a negative label excludes the region around the other.

input_points = [[350, 450], [400, 300]]
input_labels = [1, 0]  # positive prompt, negative prompt
show_points_on_image(resized_image, input_points, input_labels)

The red star indicates the portion that should be excluded, while the green star indicates the portion to be included.
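As before, passing both the points and their labels to the prompt helper yields the refined mask:

masks, _ = point_prompt(results, input_points, input_labels)
show_masks_on_image(resized_image, [masks])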

3. Bounding Box Prompts: Another option for providing prompts is to use bounding boxes. We can define one or more bounding boxes to guide the segmentation process. Each bounding box is specified by four coordinates that delimit the region of interest. The model uses these boxes as prompts to segment the objects or regions within their boundaries.

from utils import box_prompt
from utils import show_boxes_on_image

input_boxes = [430, 100, 780, 600]  # a single box covering the object of interest
show_boxes_on_image(resized_image, [input_boxes])

Now, we can obtain the binary mask for this box:
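The helper's exact signature isn't shown in the article, so treat the call below as a sketch under the assumption that box_prompt mirrors point_prompt, taking the formatted results plus the box and the resized image's dimensions and returning the best-matching mask:

# Sketch only: the arguments assume box_prompt follows the same pattern as
# point_prompt above; check the repository's utils for the real signature.
masks, _ = box_prompt(results, input_boxes, resized_image.height, resized_image.width)
show_masks_on_image(resized_image, [masks])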

Visual prompting and models like SAM are evolving quickly, opening up new uses in many areas. They could improve tumor detection in medical imaging, catch defects in manufacturing, speed up creative work like image editing, and support robotics and augmented reality. As these technologies mature and new applications emerge, their impact could be huge. Keep following along as we explore more of what visual prompting can do.

For full code examples, visit my GitHub repository: https://github.com/NandiniLReddy/PromptEnggWithSam
