
EfficientSAM3

EfficientSAM3 compresses SAM3 into lightweight, edge-friendly models via progressive knowledge distillation for fast promptable concept segmentation and tracking.

Install / Use

/learn @SimonZeng7108/Efficientsam3

README

EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3

Chengxi Simon Zeng<sup>1,†</sup>, Yuxuan Jiang<sup>1</sup>, Gao Ge<sup>1</sup>, Shuai Wang<sup>2</sup>, Duolikun Danier<sup>3</sup>, Bin Zhu<sup>4</sup>, Stevan Rudinac<sup>2</sup>, David Bull<sup>1</sup>, Fan Aaron Zhang<sup>1</sup> <sup>1</sup>Visual Information Lab, University of Bristol; <sup>2</sup>MultiX lab, University of Amsterdam; <sup>3</sup>University of Edinburgh; <sup>4</sup>Singapore Management University

<sup>†</sup>Tech Lead & Corresponding Author

arXiv · Project Page · Hugging Face · Discord

Updates

  • [2026/02/18] SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. Paper available on arXiv. Code available in sam3_litetext branch and weights on Hugging Face.
  • [2026/01/11] Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
  • [2025/12/08] Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L) - distilled on 1% Recap-DataComp-1B dataset.
  • [2025/12/02] Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT) - unsupervised distilled on 1% of SA-1B dataset.
  • [2025/11/25] Teaser model released. See Above. More models are baking in the oven🔥.
  • [2025/10/18] Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.


SAM3 (Segment Anything Model 3) has introduced powerful Promptable Concept Segmentation (PCS) capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.

EfficientSAM3 addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.

<p align="center"> <img src="images/efficientsam3_full.svg" alt="EfficientSAM3 Architecture" width="100%"> </p>
<details> <summary>Supported Models and Architecture</summary>

| Component | Model/Backbone | Purpose |
|-----------|----------------|---------|
| Teacher Models | SAM (Segment Anything Model) | Foundation for image-level encoder distillation |
| | SAM2 | Temporal memory and video tracking distillation |
| | SAM3 | Promptable Concept Segmentation (PCS) capabilities |
| Datasets | SA-1B | Image segmentation dataset |
| | SA-V | Video object segmentation dataset |
| | SA-Co/Gold | Promptable concept segmentation benchmark |
| | Recap-DataComp-1B | Large-scale image-text dataset for text encoder distillation |
| Student Backbones (Image) | RepViT (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
| | TinyViT (5M, 11M, 21M) | Balanced efficiency and performance |
| | EfficientViT (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| Student Backbones (Text) | MobileCLIP S0 | Lightweight text encoder (42.57M params) |
| | MobileCLIP S1 | Balanced text encoder (63.56M params) |
| | MobileCLIP2 L | Larger text encoder (123.6M params) |

</details>
<details> <summary>Three-Stage Progressive Training Curriculum</summary>

EfficientSAM3 is trained through a three-stage progressive distillation:

Stage 1: Encoder Distillation (Image-Level Segmentation)

  • Distill the SAM3 image encoder into nine student backbones (three sizes each of RepViT, TinyViT, and EfficientViT)
  • Distill the SAM3 text encoder into three student text encoders (MobileCLIP S0, MobileCLIP S1, and MobileCLIP2 L)
  • Use the SA-1B dataset with Prompt-in-the-Loop Distillation for image encoder distillation
  • Use the Recap-DataComp-1B dataset for text encoder distillation
  • Align student backbone features with teacher encoder outputs
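The feature-alignment objective above can be sketched as a simple feature-matching loss. This is an illustrative simplification, not the repo's actual Prompt-in-the-Loop Distillation objective: the student's feature map is resized to the teacher's resolution and matched with MSE.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical Stage-1 sketch: align student backbone features
    with teacher encoder outputs via an MSE objective."""
    # Resize student features to the teacher's spatial resolution if they differ
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(
            student_feats,
            size=teacher_feats.shape[-2:],
            mode="bilinear",
            align_corners=False,
        )
    return F.mse_loss(student_feats, teacher_feats)
```

In practice a projection layer would also map the student's channel dimension to the teacher's; it is omitted here for brevity.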

Stage 2: Temporal Memory Distillation (Video Tracking)

  • Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
  • Distill memory-conditioned mask predictions using SA-V dataset
  • Train the Perceiver module to compress and retrieve spatiotemporal features efficiently
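The Perceiver-style memory idea can be sketched as follows (a hypothetical stand-in, not the EdgeTAM module): a small, fixed set of learned latent tokens cross-attends to per-frame features, so memory cost stays constant regardless of how many frames have been seen.

```python
import torch
import torch.nn as nn

class PerceiverMemory(nn.Module):
    """Illustrative sketch of a Perceiver-based memory compressor:
    learned latents query the frame features via cross-attention,
    compressing an arbitrary number of spatial tokens into a fixed
    number of memory tokens."""

    def __init__(self, dim: int = 64, num_latents: int = 16, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, dim) flattened per-frame features
        b = frame_feats.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return compressed  # (B, num_latents, dim), independent of N
```

Because the output size depends only on `num_latents`, the memory bank no longer grows with video length, which is the key saving over SAM3's dense memory.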

Stage 3: End-to-End Fine-Tuning (Concept Segmentation)

  • Refine the complete EfficientSAM3 pipeline using SAM3 official dataset
  • Joint optimization of distilled encoder + compressed memory + mask decoder
  • Preserve Promptable Concept Segmentation capabilities while maintaining efficiency
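The joint optimization above can be sketched with stand-in modules (plain `nn.Linear` placeholders, not the real EfficientSAM3 components): all three parts share one optimizer and are updated by a single task loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the distilled encoder, compressed memory,
# and mask decoder (the real modules are far larger).
encoder = nn.Linear(16, 32)
memory = nn.Linear(32, 32)
decoder = nn.Linear(32, 1)

# One optimizer over all three components enables end-to-end fine-tuning
opt = torch.optim.AdamW(
    [p for m in (encoder, memory, decoder) for p in m.parameters()], lr=1e-4
)

x = torch.randn(4, 16)        # stand-in input features
target = torch.rand(4, 1)     # stand-in mask supervision
pred = torch.sigmoid(decoder(memory(encoder(x))))
loss = F.binary_cross_entropy(pred, target)
opt.zero_grad()
loss.backward()
opt.step()
```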

tl;dr

Stage 1: We distill the SAM3 encoder using SAM1 data. <br> Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data. <br> Stage 3: We fine-tune the complete pipeline using SAM3 data. <br>

</details>

Installation

EfficientSAM3 purposely shares the same software contract as upstream SAM3:

  • Python ≥ 3.12
  • PyTorch 2.7.0
  • Device: NVIDIA GPU (CUDA), Apple Silicon (MPS), or CPU
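The supported devices listed above can be selected programmatically; a minimal sketch using standard PyTorch device checks:

```python
import torch

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")
```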

For non-CUDA platforms (MPS/CPU), install scipy for distance transform operations:

pip install scipy

Follow the exact environment setup from the official SAM3 README or use the condensed steps below:

git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3

conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3

pip install --upgrade pip

# Install PyTorch (choose one based on your device):
# CUDA (default):
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# MPS/CPU (Apple Silicon or CPU-only):
pip install torch==2.7.0 torchvision torchaudio

# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"

# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything

Inference

Download checkpoints from the Model Zoo section. All Stage 1 image encoder weights are available via Google Drive and Hugging Face links in the table below.

Quick Start (Image Segmentation):

🔥 Teaser Image Model

<p align="center"> <img src="https://github.com/SimonZeng7108/efficientsam3/blob/main/images/es-ev-s-teaser.jpg" width="30%"> </p>

EfficientViT-S (0.68M params) distilled from the SAM3 encoder (461.84M params), 99.85% smaller, trained on 1% of SA-1B.

from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
  checkpoint_path="efficient_sam3_efficientvit_s.pt",
  backbone_type="efficientvit",
  model_name="b0",
  enable_inst_interactivity=True,
)

# Load an image (path is illustrative)
image = Image.open("example.jpg")

# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
# The predict call was truncated in the source; confirm the exact
# method name and signature against the repo's examples.
masks, scores, _ = model.predict(inference_state, points, labels)

Repository Info

  • GitHub: 467 stars, 35 forks, last updated 2h ago
  • Category: Content
  • Languages: Jupyter Notebook
  • Security score: 85/100 (audited Mar 23, 2026, no findings)