A Computer Vision Developer’s New Friend: SAM (Segment Anything Model)

Abirami Vina
4 min read · Apr 12, 2023

Hours spent on annotations may soon be a thing of the past.

Meta’s FAIR lab recently launched the Segment Anything Model, an advanced image segmentation model aimed at revolutionizing computer vision. It can generate high-quality object masks from input prompts such as points or boxes. This model can also produce masks for all objects within an image.

Source: https://arxiv.org/pdf/2304.02643.pdf

What’s the big deal?

If, like me, you’ve spent hours annotating images in LabelImg, LabelMe, or any other annotation tool, you’ll understand why SAM is our new friend. Annotating for segmentation is a very time-consuming and tedious task. For example, to annotate a person in an image, you have to place multiple points around their body to capture the different edges and curves and create a mask for the person. With SAM, the model segments the image for you, and you just confirm the mask and add a label such as “Person.”

An interesting feature: SAM boasts zero-shot generalization. This refers to the model’s ability to perform well on image segmentation tasks that it has not been explicitly trained on, without requiring additional training or fine-tuning.

Try it Yourself

The demo page lets you get a feel for the model’s capabilities. Check it out here: https://segment-anything.com/demo

Try it out!

Setting up the codebase and trying out the model is also pretty simple. It took me about 20 minutes. Here’s an overview of the steps I followed (I recommend going through SAM’s GitHub page for the details, as it’s all relatively straightforward):

  1. Installed PyTorch and TorchVision.
  2. Installed Segment Anything.
  3. Installed a couple of other libraries mentioned for mask post-processing.
  4. Downloaded the ViT-H SAM model and placed it in the notebooks folder of the cloned repo.
  5. Opened the notebook automatic_mask_generator_example.ipynb.
  6. Replaced the image path in cell 7 with the path of the image I wanted to segment.
  7. Ran the notebook and bada-bing! (A rough code sketch of these steps follows below.)
Me, my husband, and SAM.
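
For reference, here’s roughly what those notebook steps boil down to in code. Treat it as a minimal sketch based on the repo’s example notebook rather than a definitive recipe: the image path is a placeholder, and the checkpoint filename should match whichever ViT-H file you downloaded from the README.

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load the ViT-H checkpoint downloaded from the repo (step 4)
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    sam.to(device="cuda")  # drop this line to run on CPU (slower)

    # The automatic generator samples a grid of point prompts over the whole image
    mask_generator = SamAutomaticMaskGenerator(sam)

    # OpenCV loads BGR, but SAM expects an RGB uint8 array of shape (H, W, 3)
    image = cv2.imread("my_photo.jpg")  # placeholder path (step 6)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Each result is a dict with the binary mask, area, bounding box, and quality scores
    masks = mask_generator.generate(image)
    print(f"SAM produced {len(masks)} masks")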

Things I found impressive: SAM doesn’t just segment the distinct objects in an image. For example, in the above image, it didn’t just segment the two people present. It went ahead and segmented us, our clothes, my glasses, our hair, and our seashell necklaces. I was also amazed at the accuracy of the masks. The boundaries are almost perfect, as though they were hand-annotated rather than predicted. That probably comes down to SAM’s training dataset of 11 million images and 1.1 billion masks.

How does SAM work?

Source: https://arxiv.org/pdf/2304.02643.pdf

It comprises three main components (a code sketch of how they fit together follows this list):

  • An image encoder takes in the image and converts it into a mathematical representation that the model can work with.
  • A flexible prompt encoder handles the input prompts (points, boxes, text, and masks) and converts them into embeddings that the model can understand.
  • A fast mask decoder takes in the image embedding, prompt embeddings, and an output token and generates a mask that outlines the objects in the image based on the prompts.
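
To make that pipeline concrete, here’s a minimal sketch of the promptable workflow using the SamPredictor class from the repo. The file path and point coordinates are placeholders I made up; the key idea is that the heavy image encoder runs once in set_image(), while the prompt encoder and mask decoder are light enough to rerun for every new prompt.

    import cv2
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Load the model and wrap it in a predictor
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # The image encoder runs once here and caches the image embedding
    image = cv2.cvtColor(cv2.imread("my_photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
    predictor.set_image(image)

    # A single foreground point prompt (label 1 = foreground, 0 = background)
    point = np.array([[500, 375]])  # (x, y) in pixels; made-up coordinates
    label = np.array([1])

    # The prompt encoder and mask decoder turn the cached embedding into masks
    masks, scores, logits = predictor.predict(
        point_coords=point,
        point_labels=label,
        multimask_output=True,  # return three candidate masks for an ambiguous prompt
    )
    print(masks.shape, scores)  # (3, H, W) boolean masks plus predicted quality scores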

When reading about SAM, you’ll notice the word “foundational” come up multiple times. How does it relate to SAM? SAM builds on foundation models in machine learning, specifically Transformer vision models: its image encoder is an MAE pre-trained Vision Transformer (ViT) adapted to process high-resolution inputs.
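
If you’re curious, you can see this structure directly in the codebase: the loaded model exposes the three components as attributes, with the ViT serving as the image encoder. A tiny sketch (class names as they appear in the segment_anything package):

    from segment_anything import sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

    # The three components described above, as attributes of the loaded model;
    # the image encoder is the (MAE pre-trained) Vision Transformer
    print(type(sam.image_encoder).__name__)   # ImageEncoderViT
    print(type(sam.prompt_encoder).__name__)  # PromptEncoder
    print(type(sam.mask_decoder).__name__)    # MaskDecoder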

What does the future look like?

SAM’s ability to recognize and outline objects within an image based on various input prompts makes it a valuable tool for a range of industries, including healthcare, transportation, and entertainment. As SAM continues to be adopted by researchers and practitioners, we can expect to see further improvements in its performance and efficiency. In the future, SAM may be integrated into more sophisticated computer vision systems and used in conjunction with other machine learning models to solve even more complex tasks.

Always remember to keep your eyes and ears open for the latest in AI. Thanks for reading and learning with me.

Resources:

  1. Paper: https://arxiv.org/pdf/2304.02643.pdf
  2. Demo: https://segment-anything.com/demo
  3. GitHub Repo: https://github.com/facebookresearch/segment-anything

Abirami Vina

Vanakkam! I'm a computer vision engineer who writes because it's the next best thing to Dumbledore's Pensieve. I believe in love, kindness, and dreaming.