# EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3
Chengxi Simon Zeng<sup>1,†</sup>, Yuxuan Jiang<sup>1</sup>, Gao Ge<sup>1</sup>, Shuai Wang<sup>2</sup>, Duolikun Danier<sup>3</sup>, Bin Zhu<sup>4</sup>, Stevan Rudinac<sup>2</sup>, David Bull<sup>1</sup>, Fan Aaron Zhang<sup>1</sup> <sup>1</sup>Visual Information Lab, University of Bristol; <sup>2</sup>MultiX lab, University of Amsterdam; <sup>3</sup>University of Edinburgh; <sup>4</sup>Singapore Management University
<sup>†</sup>Tech Lead & Corresponding Author
## Updates
- [2026/02/18] SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with performance similar to the original text encoder. Paper available on arXiv. Code available in the `sam3_litetext` branch and weights on Hugging Face.
- [2026/01/11] Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
- [2025/12/08] Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L) - distilled on 1% Recap-DataComp-1B dataset.
- [2025/12/02] Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT) - unsupervised distilled on 1% of SA-1B dataset.
- [2025/11/25] Teaser model released. See above. More models are baking in the oven 🔥.
- [2025/10/18] Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.
## Table of Contents
- Table of Contents
- Updates
- Installation
- Inference
- Training and Evaluation
- Datasets
- EfficientSAM3 Model Zoo & Weight Release
- Preliminary Evaluation
- CoreML / ONNX Export
- Web Demo
- Development To-Do List
- Call for Pull Requests
- Citation
- License
- Acknowledgments
- Users
SAM3 (Segment Anything Model 3) has introduced powerful Promptable Concept Segmentation (PCS) capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.
EfficientSAM3 addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.
<p align="center"> <img src="images/efficientsam3_full.svg" alt="EfficientSAM3 Architecture" width="100%"> </p><details> <summary>Supported Models and Architecture</summary>
| Component | Model/Backbone | Purpose |
|-----------|----------------|---------|
| Teacher Models | SAM (Segment Anything Model) | Foundation for image-level encoder distillation |
| | SAM2 | Temporal memory and video tracking distillation |
| | SAM3 | Promptable Concept Segmentation (PCS) capabilities |
| Datasets | SA-1B | Image segmentation dataset |
| | SA-V | Video object segmentation dataset |
| | SA-Co/Gold | Promptable concept segmentation benchmark |
| | Recap-DataComp-1B | Large-scale image-text dataset for text encoder distillation |
| Student Backbones (Image) | RepViT (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
| | TinyViT (5M, 11M, 21M) | Balanced efficiency and performance |
| | EfficientViT (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| Student Backbones (Text) | MobileCLIP S0 | Lightweight text encoder (42.57M params) |
| | MobileCLIP S1 | Balanced text encoder (63.56M params) |
| | MobileCLIP2 L | Larger text encoder (123.6M params) |
</details><details> <summary>Three-Stage Progressive Training Curriculum</summary>
EfficientSAM3 is trained through a three-stage progressive distillation:
Stage 1: Encoder Distillation (Image-Level Segmentation)
- Distill the SAM3 image encoder into nine student backbones (three variants each of RepViT, TinyViT, and EfficientViT)
- Distill the SAM3 text encoder into three student text encoders (MobileCLIP S0, MobileCLIP S1, and MobileCLIP2 L)
- Use SA-1B dataset with Prompt-in-the-Loop Distillation for image encoder distillation
- Use Recap-DataComp-1B dataset for text encoder distillation
- Align student backbone features with teacher encoder outputs.
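The feature-alignment objective in Stage 1 can be sketched as a plain distillation loss between student and teacher feature maps. The 1×1 projection and MSE loss below are illustrative assumptions, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class FeatureAlignLoss(nn.Module):
    """Align student backbone features to frozen teacher encoder features.

    A 1x1 conv projects student channels to the teacher's width when they
    differ; MSE is used as the alignment loss. Both choices are illustrative.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = (nn.Conv2d(student_dim, teacher_dim, kernel_size=1)
                     if student_dim != teacher_dim else nn.Identity())

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Teacher runs frozen; detach so no gradients flow into it.
        return nn.functional.mse_loss(self.proj(student_feat), teacher_feat.detach())

# Toy shapes: student (B, 128, 64, 64) vs. teacher (B, 256, 64, 64)
loss_fn = FeatureAlignLoss(student_dim=128, teacher_dim=256)
loss = loss_fn(torch.randn(2, 128, 64, 64), torch.randn(2, 256, 64, 64))
loss.backward()  # gradients reach only the student-side projection
```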
Stage 2: Temporal Memory Distillation (Video Tracking)
- Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
- Distill memory-conditioned mask predictions using SA-V dataset
- Train the Perceiver module to compress and retrieve spatiotemporal features efficiently
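A Perceiver-style memory can be sketched as cross-attention from a small set of learned latents onto the dense per-frame tokens, so the memory cost is fixed regardless of how many tokens arrive. The sizes and single-block design below are assumptions for illustration, not EdgeTAM's exact module:

```python
import torch
import torch.nn as nn

class PerceiverMemory(nn.Module):
    """Compress N memory tokens into M learned latents via cross-attention (M << N)."""
    def __init__(self, dim: int = 256, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, memory_tokens: torch.Tensor) -> torch.Tensor:
        # memory_tokens: (B, N, dim) dense spatiotemporal features
        B = memory_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.attn(queries, memory_tokens, memory_tokens)
        return compressed  # (B, num_latents, dim), a constant-size memory

mem = PerceiverMemory(dim=256, num_latents=64)
out = mem(torch.randn(2, 4096, 256))  # e.g. a 64x64 feature map, flattened
print(out.shape)  # torch.Size([2, 64, 256])
```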
Stage 3: End-to-End Fine-Tuning (Concept Segmentation)
- Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
- Joint optimization of distilled encoder + compressed memory + mask decoder
- Preserve Promptable Concept Segmentation capabilities while maintaining efficiency
tl;dr
Stage 1: We distill the SAM3 encoder using SAM1 data. <br> Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data. <br> Stage 3: We fine-tune the complete pipeline using SAM3 data. <br>
</details>

## Installation
EfficientSAM3 purposely shares the same software contract as upstream SAM3:
- Python ≥ 3.12
- PyTorch 2.7.0
- Device: NVIDIA GPU (CUDA), Apple Silicon (MPS), or CPU
For non-CUDA platforms (MPS/CPU), install scipy for distance transform operations:

```bash
pip install scipy
```
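Distance transforms are commonly used to pick well-centered point prompts inside a mask. This standalone sketch shows the idea with `scipy.ndimage.distance_transform_edt` (the toy mask is an assumption, not repo code):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy binary mask with a square object
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True

# Distance from each foreground pixel to the nearest background pixel;
# the maximum lies deepest inside the object, giving a robust point prompt.
dist = distance_transform_edt(mask)
y, x = np.unravel_index(np.argmax(dist), dist.shape)
print((int(x), int(y)))  # a pixel near the square's center
```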
Follow the exact environment setup from the official SAM3 README or use the condensed steps below:
```bash
git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3
conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3
pip install --upgrade pip

# Install PyTorch (choose one based on your device):
# CUDA (default):
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# MPS/CPU (Apple Silicon or CPU-only):
pip install torch==2.7.0 torchvision torchaudio

# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"

# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything
```
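Since the repo targets CUDA, MPS, and CPU alike, runtime device selection is the usual pattern. This helper is generic PyTorch, not an EfficientSAM3 API:

```python
import torch

def resolve_device() -> torch.device:
    """Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = resolve_device()
print(device)  # whichever backend is available on this machine
```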
## Inference
Download checkpoints from the Model Zoo section. All Stage 1 image encoder weights are available via Google Drive and Hugging Face links in the table below.
Quick Start (Image Segmentation):
🔥 Teaser Image Model
<p align="center"> <img src="https://github.com/SimonZeng7108/efficientsam3/blob/main/images/es-ev-s-teaser.jpg" width="30%"> </p>EfficientViT-S (0.68M params) distilled from SAM3 Encoder (461.84M) — 99.85% smaller, trained on 1% SA-1B.
```python
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
    enable_inst_interactivity=True,
)

# Process image and predict
image = Image.open("your_image.jpg")
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_
```
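When a predictor returns several candidate masks with one confidence score each, keeping the best one is plain NumPy. The array shapes below are assumptions for illustration (N candidate masks over an H×W image), not the repo's exact output format:

```python
import numpy as np

# Hypothetical predictor outputs: 3 candidate masks over a 64x64 image
masks = np.random.rand(3, 64, 64) > 0.5
scores = np.array([0.42, 0.91, 0.67])

best = int(np.argmax(scores))   # index of the most confident candidate
best_mask = masks[best]         # boolean HxW mask, ready for overlay/export
print(best, best_mask.shape)    # 1 (64, 64)
```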