SAM v2 by Meta for Video Segmentation

Segment Anything 2 features

Mehul Gupta
Data Science in your pocket

--

Meta, after Llama 3.1, is out with another banger of an open-source model, this time for both images and videos: SAM v2, which extends the capabilities of SAM v1 from image-only segmentation to video segmentation.

The model not only provides segmentation but also enables several video editing effects, such as

Object removal

Object blurring

Object highlighting

Video cut-outs

Sunray effects, etc.

For the demo, Meta hosts a website where you can try the model on your own videos and even download the results for free.

Talking about some major features:

Supports real-time object segmentation in images and videos.

Released under Apache 2.0 license for community access.

Trained on the SA-V dataset, which includes ~51,000 videos and 600,000+ spatio-temporal masks.

Zero-Shot Generalization: Segmentation of unseen objects in new media.

Interactive Segmentation: Allows real-time refinement through prompts (a minimal usage sketch follows this list).

Memory Mechanism: Tracks objects across video frames for continuity.
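To make the interactive part concrete, here is a minimal sketch of point-prompted image segmentation with the official sam2 Python package. Treat it as an approximation: build_sam2, SAM2ImagePredictor, set_image, and predict do exist in Meta's repo, but the checkpoint path, config name, and image file below are placeholders, and exact arguments may differ between releases.

```python
import numpy as np
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths: download the checkpoint and config from the SAM 2 repo.
checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Embed the image once; subsequent prompts are cheap, which is what
# makes the real-time, click-to-refine interaction possible.
image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# A single foreground click (label 1) at pixel (x=500, y=300).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]]),
    point_labels=np.array([1]),
)
print(masks.shape, scores)  # candidate masks with confidence scores
```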

How does it work?

https://ai.meta.com/blog/segment-anything-2/

SAM 2 was developed to extend the capabilities of the original SAM model (the first version, which handled only images), enabling accurate object segmentation in both images and videos. Key aspects of the development include:

  • Unified Approach: SAM 2 treats an image as a one-frame video, using a memory of previously processed frames to segment objects consistently across a whole video (a conceptual sketch follows this list).
  • Challenges Addressed: The model tackles significant challenges in video segmentation, such as object motion, occlusion, and varying video quality, which existing models struggled with.
  • New Dataset: The development involved creating the SA-V dataset, significantly larger than previous datasets, to train SAM 2 and enhance its performance.
  • Innovative Methodology: The process included designing a promptable visual segmentation task and a model capable of handling it, resulting in state-of-the-art segmentation capabilities.
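The memory idea is easiest to see as pseudocode. The sketch below is purely conceptual, not Meta's implementation: every name in it (encoder, decoder, memory_encoder) is a hypothetical stand-in for the components the blog post describes.

```python
# Purely conceptual sketch; all names here are hypothetical, not SAM 2's code.
def segment_video(frames, first_frame_prompt, encoder, decoder, memory_encoder):
    memory_bank = []  # embeddings of past frames and their predicted masks
    masklet = []      # the object's mask on every frame

    for t, frame in enumerate(frames):
        features = encoder(frame)
        # The decoder attends to memories of earlier frames, so the object
        # can be re-identified after motion or a brief occlusion.
        prompt = first_frame_prompt if t == 0 else None
        mask = decoder(features, memory_bank, prompt)
        memory_bank.append(memory_encoder(features, mask))
        masklet.append(mask)

    # An image is just the one-frame case: empty memory, a single pass.
    return masklet
```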

What is Promptable Visual Segmentation?

It is a new task for segmenting videos that builds on the image segmentation task. The original SAM model was trained to use points, boxes, or masks on an image to identify and predict the shape of an object.

SAM 2 extends this: it is trained to accept prompts on any frame of a video and predict a “masklet,” a mask that tracks where the object is over time.

When given a prompt, SAM 2 quickly predicts the mask for the current frame and then propagates that prediction to all frames in the video.

After the initial masklet is created, users can refine it by adding more prompts on any frame, repeating this process as many times as needed until they get the desired result.
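In code, that prompt-propagate-refine loop looks roughly like the sketch below, based on the public sam2 repo. build_sam2_video_predictor, init_state, add_new_points_or_box, and propagate_in_video come from that repo, but signatures change between releases, and all paths and coordinates here are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint; "frames_dir" holds the video as JPEG frames.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml",
                                       "checkpoints/sam2_hiera_large.pt")
state = predictor.init_state(video_path="frames_dir")

# Prompt: a single foreground click on frame 0 for object id 1.
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate that one prompt through the whole video to get the masklet.
masklet = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masklet[frame_idx] = (mask_logits[0] > 0).cpu().numpy()

# If the mask drifts on some later frame, add a corrective click on that
# frame with the calls above and propagate again, as often as needed.
```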

I will be updating this post as I read about SAM 2 in more depth!
