# SurgMotion

Official code for "SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos"
<div align="center">
<a href="https://surgmotion.cares-copilot.com/"><img src='https://img.shields.io/badge/Project-Homepage-0A66C2' alt='Project Page'></a> <a href="https://arxiv.org/abs/2602.05638"><img src='https://img.shields.io/badge/arXiv-2602.05638-b31b1b' alt='arXiv'></a> <a href="https://github.com/CAIR-HKISI/SurgMotion"><img src='https://img.shields.io/badge/GitHub-Repository-blue' alt='GitHub'></a> <a href="https://huggingface.co/CAIR-HKISI/SurgMotion"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow' alt='HuggingFace'></a>
</div>
Built on top of V-JEPA 2, SurgMotion is a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction, with technical innovations tailored to surgical videos.
## Model Overview
Key innovations:
- Latent motion prediction — shifts from pixel-level reconstruction to abstract motion forecasting in latent space
- Flow-Guided Latent Prediction — a novel objective that prevents feature collapse in homogeneous surgical tissue regions
- Pre-trained on SurgMotion-15M — the largest multi-modal surgical video dataset to date (15M frames, 3,658 hours, 13+ anatomical regions)
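Purely as intuition for the flow-guided objective, a flow-weighted latent prediction loss could look like the sketch below. The function name, the normalization scheme, and all shapes are illustrative assumptions, not the released implementation; the idea is only that high-motion tokens dominate the loss, so static, homogeneous tissue regions cannot drag the features toward a collapsed constant.

```python
import numpy as np

def flow_guided_latent_loss(pred, target, flow_mag):
    """Hypothetical sketch of a flow-weighted latent prediction loss.

    pred / target : (B, N, D) predicted vs. target token features
    flow_mag      : (B, N) per-token optical-flow magnitude

    Tokens in high-motion regions get larger weights, so static tissue
    regions contribute little and cannot drive feature collapse.
    """
    per_token = ((pred - target) ** 2).mean(axis=-1)                   # (B, N) errors
    weights = flow_mag / (flow_mag.sum(axis=1, keepdims=True) + 1e-6)  # normalize per clip
    return float((per_token * weights).sum(axis=1).mean())
```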
### Model Variants

| Variant | Backbone | Parameters | Pre-training Data |
|---------|----------|------------|-------------------|
| SurgMotion-L | ViT-Large | 300M | SurgMotion-15M |
| SurgMotion-G | ViT-Giant-xformer | 1B | SurgMotion-15M |
### Architecture
- Video Encoder (ViT) — processes 64-frame surgical video clips into spatiotemporal token sequences
- Latent Predictor — predicts masked region representations in latent space guided by optical flow
- Probing Head — lightweight temporal classifier for downstream phase recognition
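As a rough illustration of the data flow through these three components, here is a shape walk-through. The tubelet/patch sizes, the hidden width, mean pooling, and the linear head are assumptions for illustration (not the real config); Cholec80's 7 phases are used only as an example label space.

```python
import numpy as np

# Illustrative shape walk-through of encoder -> probing head
# (tubelet/patch sizes and widths are assumptions, not the real config).
T, H, W = 64, 256, 256              # one 64-frame surgical clip
tubelet, patch = 2, 16              # hypothetical spatiotemporal patchify
n_tokens = (T // tubelet) * (H // patch) * (W // patch)
D = 1024                            # ViT-Large hidden width

tokens = np.random.randn(n_tokens, D)   # encoder output: spatiotemporal tokens
clip_feat = tokens.mean(axis=0)         # pooled clip representation, shape (D,)
head = np.random.randn(D, 7) * 0.01     # probing head for 7 phases (e.g. Cholec80)
logits = clip_feat @ head
phase = int(np.argmax(logits))          # predicted phase id
```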
## Performance Highlights

SurgMotion achieves state-of-the-art results on all representative surgical tasks (workflow recognition, action recognition, segmentation, triplet recognition, skill assessment, and depth estimation). For detailed results, see our paper and project page.
## Quick Start
- Setup:
- Usage:
- Extend:
Project Structure
SurgMotion/
├── src/ # V-JEPA2 core: ViT, VideoMAE, datasets, masks
├── evals/ # Evaluation entry points & foundation phase probing
│ ├── main.py # Single-task entry: python -m evals.main --fname <yaml>
│ └── foundation_phase_probing/
│ ├── eval.py # Probing evaluation logic
│ ├── models.py # Probing head definitions
│ └── modelcustom/ # Per-model adapters (SurgMotion, DINOv3, SurgVLP, …)
├── configs/
│ └── foundation_model_probing/
│ ├── surgmotion/ # YAML configs per dataset
│ ├── dinov3/
│ ├── endofm/
│ ├── … # 15 model families supported
│ └── videomaev2/
├── data_process/ # End-to-end dataset preprocessing scripts
│ ├── autolaparo_prepare.py
│ ├── cholect80_prepare.py
│ ├── egosurgery_prepare.py
│ ├── m2cai2016_prepare.py
│ ├── ophnet_prepare.py
│ ├── pitvis_prepare.py
│ ├── pmlr50_prepare.py
│ ├── polypdiag_prepare.py
│ └── surgicalactions160_prepare.py
├── ckpts/ # Store all the foundation models
├── scripts/ # Batch probing & environment setup shells
├── foundation_models/ # Third-party model implementations (git submodules)
├── data/ # Data directory
├── setup.py # pip install -e .
└── requirements.txt # All dependencies (excluding EndoMamba)
## Environment Installation

### Main Environment (Recommended)

```bash
conda create -n SurgMotion python=3.12 -y
conda activate SurgMotion
# Install PyTorch matching your CUDA version first:
# https://pytorch.org/get-started/locally/
pip install -e .
```

### EndoMamba (Separate Environment)

EndoMamba requires its own Conda env with custom CUDA extensions. Do not mix it with the main environment.

```bash
bash scripts/srun_endomamba_complie.sh   # Creates env + compiles extensions
conda activate endomamba                 # Use only for EndoMamba configs
```
### Dependency Files

| File | Scope |
|------|-------|
| `requirements.txt` | All dependencies (V-JEPA2 core + foundation probing) |
| `setup.py` | `pip install -e .` reads `requirements.txt` automatically |

EndoMamba has its own isolated environment, managed by `scripts/srun_endomamba_complie.sh`.
## Data Preparation

All preprocessing scripts under `data_process/` now follow a unified end-to-end pipeline and support one-command execution:

```bash
python data_process/<dataset>_prepare.py --step all
```

Common step options:

```
--step all|frames|metadata|clips
--window_size 64
--stride 1
--fps 1
--no_padding
```

Typical outputs:

- `clip_infos/*.txt` — per-case frame path lists
- `{train,val,test}_metadata.csv` — frame-level metadata for dense clip generation
- `clips_<window_size>f/{train,val,test}_dense_<window_size>f_detailed.csv`
- `clips_<window_size>f/clip_dense_<window_size>f_info/{train,val,test}/*.txt`
Frame-level metadata schema:
| Column | Description |
|--------|-------------|
| Case_ID | Numeric case / video identifier |
| Frame_Path | Absolute/relative frame image path |
| Phase_GT | Integer phase/class id for this frame |
| Phase_Name | Human-readable phase/class name |
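For concreteness, here is a minimal sketch of writing rows in this schema with the standard library. The paths are invented examples; the phase names are Cholec80-style labels used only for illustration.

```python
import csv
import io

# Hypothetical frame-level rows following the schema above
# (paths are invented; phase names are Cholec80-style examples).
rows = [
    {"Case_ID": 1,
     "Frame_Path": "data/Surge_Frames/Cholec80/frames/video01/00000.jpg",
     "Phase_GT": 0, "Phase_Name": "Preparation"},
    {"Case_ID": 1,
     "Frame_Path": "data/Surge_Frames/Cholec80/frames/video01/00001.jpg",
     "Phase_GT": 1, "Phase_Name": "CalotTriangleDissection"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Case_ID", "Frame_Path", "Phase_GT", "Phase_Name"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```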
### Dense Sampling Strategy

We use an online workflow recognition setting:

- A clip is a sliding temporal window.
- The last frame in the window is the target frame to predict.
- The preceding frames in the window are temporal context.
- Neighboring windows overlap.

For `window_size=64`, `stride=1`:

- clip 1: frames `[0, ..., 63]`
- clip 2: frames `[1, ..., 64]`
- clip 3: frames `[2, ..., 65]`

Padding at video start:

- If the early timeline does not have enough preceding frames, we pad the window by repeating the current window's last frame.
- This behavior is enabled by default; use `--no_padding` to disable it.
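The sampling rule above can be sketched as follows. This is a simplified illustration, not the repo's implementation (the actual logic lives in `gen_clips.py`, and details such as where the padded copies are placed may differ there).

```python
def dense_clips(n_frames, window_size=64, stride=1, padding=True):
    """Enumerate dense sliding windows for online recognition.

    Each clip ends at a target frame t, with the preceding frames as
    context. With padding (the default), early targets that lack enough
    history get the window's last frame repeated to fill the window.
    """
    clips = []
    first_target = 0 if padding else window_size - 1
    for t in range(first_target, n_frames, stride):
        idx = list(range(max(0, t - window_size + 1), t + 1))
        idx += [t] * (window_size - len(idx))   # repeat last frame to pad
        clips.append(idx)
    return clips

# Matches the example above: window_size=64, stride=1, no padding.
clips = dense_clips(66, window_size=64, stride=1, padding=False)
# clips[0] == [0, ..., 63], clips[1] == [1, ..., 64], clips[2] == [2, ..., 65]
```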
### Clip Labeling Rule

For phase recognition:

- Frames are sampled at 1 fps.
- Clip label = label of the clip's last frame.

Example:

- frames 0-40: Phase 0
- frames 41-63: Phase 1
- clip `[0, ..., 63]` label is Phase 1
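The rule is a one-liner; as a sketch, with the worked example from above:

```python
def clip_label(frame_labels, clip_indices):
    """A clip inherits the label of its last (target) frame."""
    return frame_labels[clip_indices[-1]]

# Worked example from above: frames 0-40 are Phase 0, frames 41-63 are Phase 1.
labels = [0] * 41 + [1] * 23          # 64 frame-level labels
label = clip_label(labels, list(range(64)))   # clip [0, ..., 63] -> Phase 1
```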
### Notes on Performance Gaps
If reproduced results are lower than expected, dense sampling mismatch is one possible source, but not the only one. We also recommend checking:
- longer training schedules (e.g., 2 / 4 / 8 epochs)
- class balancing / class weighting strategy
Class weighting can strongly affect surgical long-tail performance. See implementation in evals/foundation_phase_probing/eval.py.
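One common weighting scheme, shown purely as an illustration (check `evals/foundation_phase_probing/eval.py` for what the repo actually implements), is inverse-frequency weighting:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by the inverse of its frequency, scaled so the
    average weight over *samples* equals 1. Rare tail phases then
    contribute as much to the loss as the dominant head phase.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)          # guard against empty classes
    return counts.sum() / (n_classes * counts)

# Long-tailed toy label distribution: class 0 dominates.
labels = [0] * 90 + [1] * 9 + [2] * 1
w = inverse_frequency_weights(labels, 3)
# w[0] < w[1] < w[2]: the rarest class gets the largest weight.
```

Such a weight vector can be passed to a weighted cross-entropy loss during probing.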
## Supported Datasets
Most datasets already provide extracted frames in data/Surge_Frames/.... The pipelines read annotations and frames; frame extraction from videos (--step frames) is optional and only needed if you have raw mp4 files.
| Dataset | Script | Annotation Path | Frames Path | Extract? |
|---------|--------|-----------------|-------------|----------|
| Cholec80 | cholect80_prepare.py | cholec80/phase_annotations | Surge_Frames/Cholec80/frames/{videoXX}/ | Optional |
| AutoLaparo | autolaparo_prepare.py | autolaparo/task1/labels | Surge_Frames/AutoLaparo/frames/{NN}/ | Optional |
| M2CAI2016 | m2cai2016_prepare.py | m2cai16/{train,test}_dataset | Surge_Frames/M2CAI16/frames/{video}/ | No |
| EgoSurgery | egosurgery_prepare.py | EgoSurgery/annotations/phase | Surge_Frames/EgoSurgery/frames/{video_id}/ | No |
| PitVis | pitvis_prepare.py | pitvits/26531686 | Surge_Frames/PitVis/frames/video_{XX}/ | No |
| OphNet2024 | ophnet_prepare.py | OphNet2024_trimmed_phase/*.csv | Surge_Frames/OphNet2024_phase/frames/ | No |
| PmLR50 | pmlr50_prepare.py | PmLR50/PmLR50/labels/*.pickle | Surge_Frames/PmLR50/frames/{XX}/ | No |
| SurgicalActions160 | surgicalactions160_prepare.py | (from video filenames) | Surge_Frames/SurgicalActions160_v1/frames/ | Yes |
| PolypDiag | polypdiag_prepare.py | (from video filenames) | Surge_Frames/PolypDiag/frames/ | Yes |
All annotation and frame paths above are relative to the data/ directory (e.g., data/Landscopy/cholec80/phase_annotations).
## Pipeline behavior
The scripts are built around frame-based clip CSVs. Depending on whether you already have extracted frames or only raw videos, use the path that matches your data.
### 1) Frames-first input (recommended default)
You already have image sequences under --frames_root (e.g. data/Surge_Frames/...).
| Step | What it does |
|------|----------------|
| --step all | Runs metadata → clips. Does not decode videos in most scripts (see the Supported Datasets table). |
| --step metadata | Builds {train,val,test}_metadata.csv from annotations + Frame_Path. |
| --step clips | Writes dense sliding-window clip lists and detailed CSVs via gen_clips.py. |
Typical command: python data_process/<dataset>_prepare.py --step all with correct --frames_root / annotation paths. No --videos_dir needed.
### 2) Videos-first input (optional extraction)

You only have `.mp4` files and need JPEG/PNG frames under `--frames_root`. In that case, pass the dataset's video directory (e.g. `--videos_dir`) and run `--step frames` (or `--step all`, which includes it) to extract frames before building metadata and clips.
