Victor Bebnev
Developer at OpenCV.ai
Fast Segment Anything Model: Advancing Object Segmentation with Speed

Tech track #2. Fast SAM review

Fast SAM, an innovative approach to the Segment Anything Task, runs roughly 50 times faster than the original SAM model. The task relies on prompts, enabling a broad range of problems to be solved and breaking the barriers of traditional supervised learning. This blog post briefly explains the Segment Anything Task and the Fast SAM approach.
July 3, 2023

Introduction

The Fast SAM paper recently introduced a significant advancement on top of SAM, increasing its speed by a whopping 50 times. Both models address the Segment Anything Task and are trained on the SA-1B dataset. Let's explore the nature of this task and understand why Fast SAM outperforms SAM in terms of speed.

Task

Usually, in the instance segmentation problem, a model is trained on a dataset and, during inference, can only identify the specific object types that were included in the annotations. In other words, it follows a traditional supervised learning approach. To incorporate new types of objects or extract additional information from the data, a new model needs to be trained with updated annotations. This poses a significant problem and limits possible applications: we may want to scale to a new set of classes, or get a more fine-grained classification on the fly, without the effort of data collection, annotation, and re-training.

The Segment Anything Task breaks these barriers by introducing prompts (in the form of points, boxes, or text descriptions). This allows a broad range of problems to be solved with a single type of annotation. Consider these examples of selecting objects using text prompts; a short usage sketch follows the list:

Parrot Beak → Parrot

Pillow → Chair with a pillow

Backpack → Man with a backpack
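
Here is a minimal usage sketch of the three prompt types, assuming the authors' reference FastSAM implementation on GitHub; the class names and argument shapes follow its README and may differ between versions:

```python
from fastsam import FastSAM, FastSAMPrompt

model = FastSAM("FastSAM-x.pt")  # released Fast SAM checkpoint

# Run class-agnostic "segment everything" once per image.
results = model("dogs.jpg", device="cpu", retina_masks=True,
                imgsz=1024, conf=0.4, iou=0.9)
prompt = FastSAMPrompt("dogs.jpg", results, device="cpu")

# Select objects with any of the three prompt types.
masks = prompt.text_prompt(text="a photo of a dog")
masks = prompt.point_prompt(points=[[620, 360]], pointlabel=[1])
masks = prompt.box_prompt(bbox=[[200, 200, 300, 300]])
```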

Model

First, let's take a brief look at the Segment Anything Model. The architecture consists of a Vision Transformer as the image encoder, the CLIP text encoder as the prompt encoder, and a transformer-like mask decoder. The image encoder creates embeddings for an input image, the prompt encoder encodes the prompts, and the mask decoder then produces an instance mask for the target image based on these embeddings.
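
For reference, this is roughly how the original SAM is invoked through its official segment_anything package; the checkpoint name matches the released ViT-H weights, and the image here is a placeholder:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (ViT-H is the largest released variant).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # placeholder HxWx3 RGB image
predictor.set_image(image)  # runs the heavy ViT image encoder once

# A single foreground point prompt; the mask decoder is cheap to re-run per prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 512]]),
    point_labels=np.array([1]),
)
```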

While Transformers offer superior accuracy and generalizability, their main drawback lies in their slower inference speed compared to CNN-based models, which can be several times faster. With this in mind, let’s move on to the Fast SAM approach.

The authors split the task into two parts: class-agnostic image segmentation and prompt-guided selection. 

For the first part, instead of the heavy Vision Transformer architecture, the authors use YOLOv8-seg (a YOLACT-style architecture with a YOLOv8 backbone). The model was trained to predict instance segmentation masks for a single class only, so by design it generalizes to any object type (this task is usually called salient object segmentation). Moreover, this single-class instance segmentation model was trained on just 2% of the SA-1B dataset. Switching to a CNN-based model makes the image encoder significantly faster.
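
As an illustration, here is how a YOLOv8 segmentation model can be run through the ultralytics package to obtain instance masks. Note that the stock COCO-trained checkpoint used below is multi-class, whereas Fast SAM trains its own single-class "object" checkpoint on 2% of SA-1B:

```python
from ultralytics import YOLO

# Stock YOLOv8-seg weights; Fast SAM replaces these with a checkpoint
# trained to predict a single class-agnostic "object" class on SA-1B.
model = YOLO("yolov8x-seg.pt")

results = model("image.jpg", retina_masks=True, conf=0.4, iou=0.9)
masks = results[0].masks.data   # (N, H, W) tensor, one binary mask per instance
boxes = results[0].boxes.xyxy   # matching bounding boxes
```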

The objects found by the class-agnostic instance segmentation model, along with the text descriptions of the objects being searched for, are passed to the CLIP image encoder and text encoder, respectively. Finally, objects and descriptions are matched by the similarity of their embeddings.
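
A minimal sketch of this matching step, using the openly available CLIP model from the Hugging Face transformers library; the model choice and the argmax selection are assumptions for illustration, not necessarily the paper's exact setup:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_match(crops, text):
    """Return the index of the mask crop most similar to the text prompt.

    crops: list of PIL images, one per predicted instance mask.
    """
    inputs = processor(text=[text], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image has shape (num_crops, 1): crop-to-text similarity.
    return int(out.logits_per_image.squeeze(-1).argmax())
```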
Point prompts and box prompts can be used instead of text. In that case, during the post-processing step, the algorithm selects the masks that contain the prompt points and the masks that overlap the prompt boxes.
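
A sketch of that post-processing, assuming the predicted masks are binary NumPy arrays; the IoU threshold below is illustrative, not a value from the paper:

```python
import numpy as np

def select_by_point(masks, point):
    """Keep masks whose region contains the prompt point (x, y)."""
    x, y = point
    return [m for m in masks if m[y, x]]

def select_by_box(masks, box, iou_thr=0.7):
    """Keep masks whose IoU with the prompt box exceeds a threshold."""
    x0, y0, x1, y1 = box
    box_mask = np.zeros_like(masks[0], dtype=bool)
    box_mask[y0:y1, x0:x1] = True
    keep = []
    for m in masks:
        inter = np.logical_and(m, box_mask).sum()
        union = np.logical_or(m, box_mask).sum()
        if union and inter / union >= iou_thr:
            keep.append(m)
    return keep
```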

This approach offers several advantages, including a significant speed improvement (50 times faster) and high-quality masks despite the relatively small training set (2% of SA-1B). However, there are drawbacks to consider. Firstly, the model’s accuracy is lower than that of the original SAM. Additionally, artifacts may appear at the borders of small objects.

Conclusion

Fast SAM presents immense potential for semi-automatic segmentation of arbitrary objects. Imagine a visual editor where object selection is as simple as describing them – a reality made possible by this solution.

At OpenCV.ai, we bring our expertise in AI and CV to the fore, dissecting complex models like Fast SAM and translating them into practical, efficient solutions for developers. Stay tuned to our blog as we continue to explore and share groundbreaking developments in this exciting field.
