SurgMotion

Official Code for "SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos"


<div align="center"> <h1>SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos</h1>

<a href="https://surgmotion.cares-copilot.com/"><img src='https://img.shields.io/badge/Project-Homepage-0A66C2' alt='Project Page'></a> <a href="https://arxiv.org/abs/2602.05638"><img src='https://img.shields.io/badge/arXiv-2602.05638-b31b1b' alt='arXiv'></a> <a href="https://github.com/CAIR-HKISI/SurgMotion"><img src='https://img.shields.io/badge/GitHub-Repository-blue' alt='GitHub'></a> <a href="https://huggingface.co/CAIR-HKISI/SurgMotion"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow' alt='HuggingFace'></a>

</div>

Main

SurgMotion is a video-native foundation model built on top of V-JEPA 2. It shifts the learning paradigm from pixel-level reconstruction to latent motion prediction, with technical innovations tailored to surgical videos.

Model Overview

Key innovations:

  • Latent motion prediction — shifts from pixel-level reconstruction to abstract motion forecasting in latent space
  • Flow-Guided Latent Prediction — a novel objective that prevents feature collapse in homogeneous surgical tissue regions
  • Pre-trained on SurgMotion-15M — the largest multi-modal surgical video dataset to date (15M frames, 3,658 hours, 13+ anatomical regions)

Model Variants

| Variant | Backbone | Parameters | Pre-training Data |
|---------|----------|------------|-------------------|
| SurgMotion-L | ViT-Large | 300M | SurgMotion-15M |
| SurgMotion-G | ViT-Giant-xformer | 1B | SurgMotion-15M |

Architecture

  1. Video Encoder (ViT) — processes 64-frame surgical video clips into spatiotemporal token sequences
  2. Latent Predictor — predicts masked region representations in latent space guided by optical flow
  3. Probing Head — lightweight temporal classifier for downstream phase recognition

Performance Highlights

SurgMotion achieves state-of-the-art results across all representative surgical tasks (workflow recognition, action recognition, segmentation, triplet recognition, skill assessment, and depth estimation). For detailed results, see our paper and project page.

Quick Start

Project Structure

SurgMotion/
├── src/                        # V-JEPA2 core: ViT, VideoMAE, datasets, masks
├── evals/                      # Evaluation entry points & foundation phase probing
│   ├── main.py                 # Single-task entry: python -m evals.main --fname <yaml>
│   └── foundation_phase_probing/
│       ├── eval.py             # Probing evaluation logic
│       ├── models.py           # Probing head definitions
│       └── modelcustom/        # Per-model adapters (SurgMotion, DINOv3, SurgVLP, …)
├── configs/
│   └── foundation_model_probing/
│       ├── surgmotion/         # YAML configs per dataset
│       ├── dinov3/
│       ├── endofm/
│       ├── …                   # 15 model families supported
│       └── videomaev2/
├── data_process/               # End-to-end dataset preprocessing scripts
│   ├── autolaparo_prepare.py
│   ├── cholect80_prepare.py
│   ├── egosurgery_prepare.py
│   ├── m2cai2016_prepare.py
│   ├── ophnet_prepare.py
│   ├── pitvis_prepare.py
│   ├── pmlr50_prepare.py
│   ├── polypdiag_prepare.py
│   └── surgicalactions160_prepare.py
├── ckpts/                      # Store all the foundation models
├── scripts/                    # Batch probing & environment setup shells
├── foundation_models/          # Third-party model implementations (git submodules)
├── data/                       # Data directory
├── setup.py                    # pip install -e .
└── requirements.txt            # All dependencies (excluding EndoMamba)

Environment Installation

Main Environment (Recommended)

conda create -n SurgMotion python=3.12 -y
conda activate SurgMotion

# Install PyTorch matching your CUDA version first:
# https://pytorch.org/get-started/locally/

pip install -e .

EndoMamba (Separate Environment)

EndoMamba requires its own Conda env with custom CUDA extensions. Do not mix with the main environment.

bash scripts/srun_endomamba_complie.sh   # Creates env + compiles extensions
conda activate endomamba                 # Use only for EndoMamba configs

Dependency Files

| File | Scope |
|------|-------|
| requirements.txt | All dependencies (V-JEPA2 core + foundation probing) |
| setup.py | pip install -e . reads requirements.txt automatically |

EndoMamba has its own isolated environment managed by scripts/srun_endomamba_complie.sh.

Data Preparation

All preprocessing scripts under data_process/ now follow a unified end-to-end pipeline and support one-command execution:

python data_process/<dataset>_prepare.py --step all

Common step options:

--step all|frames|metadata|clips
--window_size 64
--stride 1
--fps 1
--no_padding
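For reference, the flag set above can be mirrored in a small argparse sketch; the flag names and defaults come from the list above, while the help strings are illustrative assumptions:

```python
import argparse

def build_parser():
    """Illustrative parser mirroring the common step options;
    help texts are assumptions, not the scripts' exact wording."""
    p = argparse.ArgumentParser(description="Dataset preparation (sketch)")
    p.add_argument("--step", choices=["all", "frames", "metadata", "clips"],
                   default="all", help="Which pipeline stage(s) to run")
    p.add_argument("--window_size", type=int, default=64,
                   help="Frames per sliding-window clip")
    p.add_argument("--stride", type=int, default=1,
                   help="Offset between consecutive windows")
    p.add_argument("--fps", type=int, default=1,
                   help="Frame sampling rate")
    p.add_argument("--no_padding", action="store_true",
                   help="Disable start-of-video padding")
    return p

args = build_parser().parse_args(["--step", "clips", "--window_size", "64"])
print(args.step, args.window_size, args.no_padding)
```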

Typical outputs:

  • clip_infos/*.txt — per-case frame path lists
  • {train,val,test}_metadata.csv — frame-level metadata for dense clip generation
  • clips_<window_size>f/{train,val,test}_dense_<window_size>f_detailed.csv
  • clips_<window_size>f/clip_dense_<window_size>f_info/{train,val,test}/*.txt

Frame-level metadata schema:

| Column | Description |
|--------|-------------|
| Case_ID | Numeric case / video identifier |
| Frame_Path | Absolute/relative frame image path |
| Phase_GT | Integer phase/class id for this frame |
| Phase_Name | Human-readable phase/class name |
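A minimal sketch of emitting rows in this schema with the standard csv module; the paths and phase names below are made up for illustration:

```python
import csv
import io

# Hypothetical rows following the frame-level metadata schema above.
rows = [
    {"Case_ID": 1, "Frame_Path": "frames/video01/00000001.jpg",
     "Phase_GT": 0, "Phase_Name": "Preparation"},
    {"Case_ID": 1, "Frame_Path": "frames/video01/00000002.jpg",
     "Phase_GT": 1, "Phase_Name": "CalotTriangleDissection"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["Case_ID", "Frame_Path", "Phase_GT", "Phase_Name"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```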

Dense Sampling Strategy

We use an online workflow recognition setting:

  • A clip is a sliding temporal window.
  • The last frame in the window is the target frame to predict.
  • Previous frames in the window are temporal context.
  • Neighboring windows overlap.

For window_size=64, stride=1:

  • clip 1: frames [0, ..., 63]
  • clip 2: frames [1, ..., 64]
  • clip 3: frames [2, ..., 65]

Padding at video start:

  • If the early timeline does not have enough preceding frames, we pad the window by repeating the current window's last frame.
  • This behavior is enabled by default; use --no_padding to disable.
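The sampling and padding rules above can be sketched as follows; this is an illustrative reimplementation, not the repository's gen_clips.py:

```python
def dense_clips(num_frames, window_size=64, stride=1, padding=True):
    """Dense sliding windows for online recognition: each clip ends at
    its target frame. Early windows that lack enough preceding frames
    are padded by repeating the window's last (target) frame, as
    described above; --no_padding corresponds to padding=False."""
    clips = []
    start_t = 0 if padding else window_size - 1
    for t in range(start_t, num_frames, stride):
        idx = list(range(max(0, t - window_size + 1), t + 1))
        if padding and len(idx) < window_size:
            idx = idx + [t] * (window_size - len(idx))
        clips.append(idx)
    return clips

demo = dense_clips(num_frames=6, window_size=4)
print(demo[0], demo[-1])  # first window is padded, last is fully dense
```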

Clip Labeling Rule

For phase recognition:

  • Frames are sampled at 1 fps.
  • Clip label = label of the clip's last frame.

Example:

  • frames 0-40: Phase 0
  • frames 41-63: Phase 1
  • clip [0, ..., 63] label is Phase 1.
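The labeling rule and the example above, as a minimal sketch:

```python
# Frame-level phase ids for one video, sampled at 1 fps (illustrative):
# frames 0-40 are Phase 0, frames 41-63 are Phase 1.
frame_labels = [0] * 41 + [1] * 23   # 64 frames total

def clip_label(labels, window):
    """Clip label = label of the clip's last (target) frame."""
    return labels[window[-1]]

window = list(range(64))                  # clip [0, ..., 63]
print(clip_label(frame_labels, window))   # → 1 (Phase 1, label of frame 63)
```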

Notes on Performance Gaps

If reproduced results are lower than expected, dense sampling mismatch is one possible source, but not the only one. We also recommend checking:

  • longer training schedules (e.g., 2 / 4 / 8 epochs)
  • class balancing / class weighting strategy

Class weighting can strongly affect surgical long-tail performance. See implementation in evals/foundation_phase_probing/eval.py.
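As an illustration, one common weighting scheme is inverse-frequency weights normalized to mean 1; this is an assumption for exposition, and the repository's actual strategy in evals/foundation_phase_probing/eval.py may differ:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1/count, normalized so the
    mean weight is 1. A common remedy for long-tailed phase
    distributions; rare phases receive larger loss weights."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

weights = inverse_frequency_weights([0, 0, 0, 0, 1, 1, 2])
print(weights)  # rarest class (2) gets the largest weight
```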

Supported Datasets

Most datasets already provide extracted frames in data/Surge_Frames/.... The pipelines read annotations and frames; frame extraction from videos (--step frames) is optional and only needed if you have raw mp4 files.

| Dataset | Script | Annotation Path | Frames Path | Extract? |
|---------|--------|-----------------|-------------|----------|
| Cholec80 | cholect80_prepare.py | cholec80/phase_annotations | Surge_Frames/Cholec80/frames/{videoXX}/ | Optional |
| AutoLaparo | autolaparo_prepare.py | autolaparo/task1/labels | Surge_Frames/AutoLaparo/frames/{NN}/ | Optional |
| M2CAI2016 | m2cai2016_prepare.py | m2cai16/{train,test}_dataset | Surge_Frames/M2CAI16/frames/{video}/ | No |
| EgoSurgery | egosurgery_prepare.py | EgoSurgery/annotations/phase | Surge_Frames/EgoSurgery/frames/{video_id}/ | No |
| PitVis | pitvis_prepare.py | pitvits/26531686 | Surge_Frames/PitVis/frames/video_{XX}/ | No |
| OphNet2024 | ophnet_prepare.py | OphNet2024_trimmed_phase/*.csv | Surge_Frames/OphNet2024_phase/frames/ | No |
| PmLR50 | pmlr50_prepare.py | PmLR50/PmLR50/labels/*.pickle | Surge_Frames/PmLR50/frames/{XX}/ | No |
| SurgicalActions160 | surgicalactions160_prepare.py | (from video filenames) | Surge_Frames/SurgicalActions160_v1/frames/ | Yes |
| PolypDiag | polypdiag_prepare.py | (from video filenames) | Surge_Frames/PolypDiag/frames/ | Yes |

All annotation and frame paths above are relative to the data/ directory (e.g., data/Landscopy/cholec80/phase_annotations).

Pipeline behavior

The scripts are built around frame-based clip CSVs. Depending on whether you already have extracted frames or only raw videos, use the path that matches your data.

1) Frames-first input (recommended default)

You already have image sequences under --frames_root (e.g. data/Surge_Frames/...).

| Step | What it does |
|------|--------------|
| --step all | Runs metadata → clips. Does not decode videos in most scripts (see table below). |
| --step metadata | Builds {train,val,test}_metadata.csv from annotations + Frame_Path. |
| --step clips | Writes dense sliding-window clip lists and detailed CSVs via gen_clips.py. |

Typical command: python data_process/<dataset>_prepare.py --step all with correct --frames_root / annotation paths. No --videos_dir needed.

2) Videos-first input (optional extraction)

You only have .mp4 files and need to extract JPEG/PNG frames under --frames_root first.

No findings