SegmentAnything: A Segmentation Model with Target Specification

David Cochard
Published in axinc-ai
4 min read · Feb 29, 2024


This is an introduction to "SegmentAnything", a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK, as with many other ready-to-use ailia MODELS.

Overview

SegmentAnything is a segmentation model developed by Meta, released in April 2023. It can produce high quality object masks from input prompts such as points or boxes, making it ideal for image editing tasks such as background removal.

Architecture

In recent years, language models and other foundation models have reached significantly higher accuracy by training on the vast amounts of data available on the Internet. However, no comparably large dataset existed for segmentation. SegmentAnything addresses this gap by creating a new large-scale dataset of over 11 million images and more than 1 billion masks, and using it to build a foundation model for segmentation.

By training on this new large dataset, SegmentAnything achieves segmentation based on prompts such as point locations, bounding boxes, or text.

Here is an overview of the SegmentAnything architecture. It converts the image into embeddings using an image encoder, then generates the segmentation with a mask decoder based on the prompt. The architecture uses a Vision Transformer (ViT) for the image encoder, a CLIP text encoder for the prompt encoder, and a combination of a transformer and a multilayer perceptron (MLP) for the mask decoder.

SegmentAnything architecture (Source: https://github.com/facebookresearch/segment-anything)

Below is an example of segmentation based on a box input. Only the tire that is within the specified box is segmented.

Result of constrained segmentation (Source: https://github.com/facebookresearch/segment-anything)

The output embeddings of the image encoder are unique to the input image, so they only need to be computed once; the mask decoder can then be executed multiple times with different segmentation constraints. The computational load of the image encoder is quite high, while the mask decoder is relatively lightweight.
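As a reference, below is a minimal sketch of this workflow using the official segment-anything Python package rather than the ailia SDK sample; the checkpoint file name and the prompt coordinates are placeholders.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder file name) and wrap it in a predictor
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# set_image runs the heavy image encoder once and caches the embeddings
image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Box prompt (x1, y1, x2, y2): only the object inside the box is segmented
masks, scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)

# Further prompts reuse the cached embeddings, so only the lightweight
# mask decoder runs again
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground point
    multimask_output=True,
)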

By default, input images are expected in RGB order and are resized so that their longest side is 1024 pixels before being passed to the image encoder. Preprocessing follows the ImageNet convention: the mean is subtracted, then the result is divided by the standard deviation.
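For reference, this preprocessing can be sketched as follows; the normalization constants are the ImageNet mean and standard deviation scaled to the 0–255 range, as used in the official implementation.

import cv2
import numpy as np

def preprocess(image_bgr, long_side=1024):
    # Convert to RGB and resize so that the longest side is 1024 pixels
    image = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    scale = long_side / max(image.shape[:2])
    new_w = int(round(image.shape[1] * scale))
    new_h = int(round(image.shape[0] * scale))
    image = cv2.resize(image, (new_w, new_h))
    # ImageNet mean/std scaled to the 0-255 range
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    image = (image - mean) / std
    # HWC -> NCHW; the encoder additionally pads the input to 1024x1024
    return image.transpose(2, 0, 1)[None]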

The output of the mask decoder consists of multiple masks, and by default, the mask with the highest score is selected.
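A short sketch of this selection, again using the official package with placeholder paths and coordinates: running the decoder with multimask_output=True returns several candidate masks together with quality scores, and the highest-scoring one is kept.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth"))
predictor.set_image(cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB))

# The decoder returns several candidate masks, each with a predicted quality score
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask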

Applications

By combining SegmentAnything with object detection or pose estimation models to determine the segmentation target, tasks such as layer separation can be automated.
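As an illustration, the boxes produced by a detector can be passed to SegmentAnything as box prompts to cut each object out as its own layer; detect_objects below is a hypothetical placeholder for any detection model that returns (x1, y1, x2, y2) boxes.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def detect_objects(image_rgb):
    # Hypothetical detector returning an (N, 4) array of xyxy boxes,
    # e.g. the output of a YOLO or DETR model
    raise NotImplementedError

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth"))
predictor.set_image(image)  # the image encoder runs only once

layers = []
for box in detect_objects(image):
    # Each detection box becomes a prompt; only the mask decoder runs per box
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    layers.append(image * masks[0][..., None])  # keep only the detected object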

Usage

From ailia SDK version 1.2.16 onwards, SegmentAnything can be used with the following command.

$ python3 segment-anything.py --input input.jpg --savepath output.jpg

By adding the --gui option, it is also possible to interactively segment the area around the location clicked in the image.

$ python3 segment-anything.py --gui
Segmentation of the tire only
Segmentation of the entire vehicle

Output Examples

Here are some output examples on images generated using SDXL, in which we try to segment either the background or the character.

Background Segmentation
Character Segmentation
Background Segmentation (© Unity Technologies Japan/UCL)
Character Segmentation (© Unity Technologies Japan/UCL)

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
