SegmentAnything: A Segmentation Model with Target Specification

David Cochard
Published in axinc-ai
4 min read · Feb 29, 2024


This is an introduction to "SegmentAnything", a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK, as with many other ready-to-use ailia MODELS.

Overview

SegmentAnything is a segmentation model developed by Meta, released in April 2023. It can produce high quality object masks from input prompts such as points or boxes, making it ideal for image editing tasks such as background removal.

Architecture

In recent years, language models and other foundation models have reached significantly higher accuracy by training on the vast amounts of data available on the Internet. However, no comparably large dataset existed for segmentation. SegmentAnything addresses this gap by creating a new large-scale dataset of over 11 million images and more than 1 billion masks, and using it to build a foundation model for segmentation.

By training on this new large dataset, SegmentAnything achieves segmentation based on prompts such as point locations, bounding boxes, or text.

Here is an overview of the SegmentAnything architecture. It converts the image into embeddings using an image encoder, then generates the segmentation with a mask decoder based on the prompt. The architecture uses a Vision Transformer (ViT) for the image encoder, a CLIP text encoder for the prompt encoder, and a combination of a transformer and a multilayer perceptron (MLP) for the mask decoder.

SegmentAnything architecture (Source: https://github.com/facebookresearch/segment-anything)

Below is an example of segmentation based on a box input. Only the tire that is within the specified box is segmented.

Result of constrained segmentation (Source: https://github.com/facebookresearch/segment-anything)

The output embeddings of the image encoder are unique to the input image, so they only need to be computed once; the mask decoder can then be executed multiple times with different segmentation constraints. The computational load of the image encoder is quite high, while the mask decoder is relatively lightweight.
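As a reference, below is a minimal sketch of this workflow using the official segment-anything Python package rather than the ailia SDK sample; the checkpoint file name and the prompt coordinates are placeholders.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder file name) and wrap it in a predictor
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# set_image runs the heavy image encoder once and caches the embeddings
image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Box prompt (x1, y1, x2, y2): only the object inside the box is segmented
masks, scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)

# Further prompts reuse the cached embeddings, so only the lightweight
# mask decoder runs again
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground point
    multimask_output=True,
)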

By default, input images are expected in RGB order and are resized so that their longest side is 1024 pixels before being passed to the image encoder. Preprocessing follows the ImageNet convention: the mean is subtracted, then the result is divided by the standard deviation.
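For reference, this preprocessing can be sketched as follows; the normalization constants are the ImageNet mean and standard deviation scaled to the 0–255 range, as used in the official implementation.

import cv2
import numpy as np

def preprocess(image_bgr, long_side=1024):
    # Convert to RGB and resize so that the longest side is 1024 pixels
    image = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    scale = long_side / max(image.shape[:2])
    new_w = int(round(image.shape[1] * scale))
    new_h = int(round(image.shape[0] * scale))
    image = cv2.resize(image, (new_w, new_h))
    # ImageNet mean/std scaled to the 0-255 range
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    image = (image - mean) / std
    # HWC -> NCHW; the encoder additionally pads the input to 1024x1024
    return image.transpose(2, 0, 1)[None]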

The output of the mask decoder consists of multiple masks, and by default, the mask with the highest score is selected.
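A short sketch of this selection, again using the official package with placeholder paths and coordinates: running the decoder with multimask_output=True returns several candidate masks together with quality scores, and the highest-scoring one is kept.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth"))
predictor.set_image(cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB))

# The decoder returns several candidate masks, each with a predicted quality score
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask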

Applications

By combining SegmentAnything with object detection or pose estimation models to determine the segmentation target, tasks such as layer separation can be automated.
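As an illustration, the boxes produced by a detector can be passed to SegmentAnything as box prompts to cut each object out as its own layer; detect_objects below is a hypothetical placeholder for any detection model that returns (x1, y1, x2, y2) boxes.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def detect_objects(image_rgb):
    # Hypothetical detector returning an (N, 4) array of xyxy boxes,
    # e.g. the output of a YOLO or DETR model
    raise NotImplementedError

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor = SamPredictor(sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth"))
predictor.set_image(image)  # the image encoder runs only once

layers = []
for box in detect_objects(image):
    # Each detection box becomes a prompt; only the mask decoder runs per box
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    layers.append(image * masks[0][..., None])  # keep only the detected object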

Usage

From ailia SDK version 1.2.16 onwards, SegmentAnything can be used with the following command.

$ python3 segment-anything.py --input input.jpg --savepath output.jpg

By adding the --gui option, it is also possible to interactively segment the area around the location clicked in the image.

$ python3 segment-anything.py --gui
Segmentation of the tire only
Segmentation of the entire vehicle

Output Examples

Here are some output examples on images generated using SDXL, in which we try to segment either the background or the character.

Background Segmentation
Character Segmentation
Background Segmentation (© Unity Technologies Japan/UCL)
Character Segmentation (© Unity Technologies Japan/UCL)

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
