# SegAnyMo
[CVPR 2025] Code for Segment Any Motion in Videos
Nan Huang<sup>1,2</sup>, Wenzhao Zheng<sup>1</sup>, Chenfeng Xu<sup>1</sup>, Kurt Keutzer<sup>1</sup>, Shanghang Zhang<sup>2</sup>, Angjoo Kanazawa<sup>1</sup>, Qianqian Wang<sup>1</sup>
<sup>1</sup>UC Berkeley <sup>2</sup>Peking University
<img src="assets/pipeline.png" alt="pipeline" width="600"/>

Overview of our pipeline. We take 2D tracks and depth maps generated by off-the-shelf models as input, which are processed by a motion encoder to capture motion patterns, producing featured tracks. Next, we use a tracks decoder that integrates DINO features to decode the featured tracks, decoupling motion and semantic information to obtain the dynamic trajectories (a). Finally, using SAM2, we group dynamic tracks belonging to the same object and generate fine-grained moving-object masks (b).
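As a rough intuition for stage (a), the toy sketch below labels a track as dynamic when its motion deviates from the dominant (camera-induced) motion shared by most tracks. The real pipeline learns this with a motion encoder and tracks decoder, so the heuristic and threshold here are purely illustrative.

```python
import numpy as np

# Toy stand-in for the learned per-track motion labeling (illustrative only):
# a track is "dynamic" if its net displacement deviates from the median
# displacement of all tracks, which approximates the camera-induced motion.
def label_dynamic_tracks(tracks, thresh=2.0):
    # tracks: (N, T, 2) array of (x, y) positions over T frames
    disp = tracks[:, -1] - tracks[:, 0]            # (N, 2) net displacement
    cam = np.median(disp, axis=0)                  # crude "camera motion" estimate
    residual = np.linalg.norm(disp - cam, axis=1)  # motion unexplained by camera
    return residual > thresh                       # True = dynamic track

# Example: three tracks that only follow the camera, one independently moving.
tracks = np.array([
    [[0, 0], [1, 0], [2, 0]],
    [[5, 5], [6, 5], [7, 5]],
    [[9, 1], [10, 1], [11, 1]],
    [[3, 3], [6, 6], [9, 9]],   # moves differently -> dynamic
], dtype=float)
print(label_dynamic_tracks(tracks))  # [False False False  True]
```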
## Contents
This repository contains the code for Segment Any Motion in Videos.
## Installation

Our code is developed on Ubuntu 22.04 with Python 3.12 and PyTorch 2.4.0+cu121 on an NVIDIA RTX A6000. Please note that the code has only been tested with these versions. We recommend using conda to install the dependencies.
```shell
git clone --recurse-submodules https://github.com/nnanhuang/SegAnyMo
cd SegAnyMo/
conda create -n seg python=3.12.4
conda activate seg
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
- Download DINOv2 for preprocessing:

  ```shell
  cd preproc
  git clone https://github.com/facebookresearch/dinov2
  ```
- Set up the environments and download the checkpoints for SAM2 and TAPNet:

  ```shell
  # SAM2 (sam2_hiera_large.pt)
  cd sam2
  pip install -e .
  cd checkpoints && \
  ./download_ckpts.sh && \
  cd ../..

  # install tapnet
  cd preproc/tapnet
  pip install .
  cd ..
  mkdir checkpoints
  cd checkpoints
  wget https://storage.googleapis.com/dm-tapnet/bootstap/bootstapir_checkpoint_v2.pt
  ```
## Usage

You can run our full method with three commands:

```shell
# Make sure you have set up the environment and downloaded all the checkpoints,
# and that the model ckpt path is written into the config file.

# 1. Data preprocessing:
python core/utils/run_inference.py --video_path $VIDEO_PATH --gpus $GPU_ID --depths --tracks --dinos --e
# 2. Predicting per-track motion labels:
python core/utils/run_inference.py --video_path $VIDEO_PATH --motin_seg_dir $OUTPUT_DIR --config_file $PATH --gpus $GPU_ID --motion_seg_infer --e
# 3. Generating the final masks:
python core/utils/run_inference.py --video_path $VIDEO_PATH --sam2dir $RESULT_DIR --motin_seg_dir $OUTPUT_DIR --gpus $GPU_ID --sam2 --e

# For example:
python core/utils/run_inference.py --data_dir ./data/images --gpus 0 1 2 3 --depths --tracks --dinos --e
python core/utils/run_inference.py --data_dir ./data/images --motin_seg_dir ./result/moseg --config_file ./configs/example.yaml --gpus 0 1 2 3 --motion_seg_infer --e
python core/utils/run_inference.py --data_dir ./data/images --sam2dir ./result/sam2 --motin_seg_dir ./result/moseg --gpus 0 1 2 3 --sam2 --e
# or use --video_path ./data/video.mp4
```
Please see below for specific usage details.
## Preprocessing

We depend on the following third-party models for preprocessing:

- Monocular depth: Depth Anything v2
- 2D tracks: TAPIR
- DINO features: DINOv2
- Processed root directories should be organized as:

  ```
  data
  ├── images
  │   ├── scene_name
  │   │   ├── image_name
  │   │   ├── ...
  ├── bootstapir
  │   ├── scene_name
  │   │   ├── image_name
  │   │   ├── ...
  ├── dinos
  │   ├── scene_name
  │   │   ├── image_name
  │   │   ├── ...
  ├── depth_anything_v2
  │   ├── scene_name
  │   │   ├── image_name
  │   │   ├── ...
  ```
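A quick way to confirm this layout before running inference is to diff the scene folders. The helper below is not part of the repo; the directory names simply mirror the tree above.

```python
import os

# Hypothetical helper (not part of the repo): list (subdir, scene) pairs that
# are missing preprocessed outputs, based on the layout documented above.
EXPECTED = ["bootstapir", "dinos", "depth_anything_v2"]

def missing_outputs(data_root):
    scenes = sorted(os.listdir(os.path.join(data_root, "images")))
    return [(sub, scene)
            for sub in EXPECTED
            for scene in scenes
            if not os.path.isdir(os.path.join(data_root, sub, scene))]
```

An empty return value means every scene under `images/` has matching track, DINO, and depth outputs.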
- We recommend enabling Efficiency Mode (`--e`) to accelerate data processing. This mode speeds up the pipeline through frame-rate reduction, interval sampling, and resolution scaling.

- During inference, we use a stride of 10 (`--step`) when processing image sequences, meaning only every 10th frame (designated as a Query Frame) is considered valid and processed. All data processing operates exclusively on these Query Frames. You can improve model performance by setting a smaller value.

- You can generate depth maps, DINO features, and 2D tracks with the commands below (~10 min). Use `--data_dir` if your input is an image sequence and `--video_path` if your input is a video.

  ```shell
  python core/utils/run_inference.py --data_dir $DATA_DIR --gpus $GPU_ID --depths --tracks --dinos --e
  python core/utils/run_inference.py --video_path $VIDEO_PATH --gpus $GPU_ID --depths --tracks --dinos --e
  ```
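The stride behaviour can be pictured as follows; this is a minimal sketch of the idea, not the repo's implementation.

```python
# With --step 10, only every 10th frame becomes a Query Frame; all data
# processing then runs exclusively on those frames. Illustrative sketch only.
def query_frames(num_frames, step=10):
    return list(range(0, num_frames, step))

print(query_frames(25))          # [0, 10, 20]
print(query_frames(25, step=5))  # a smaller step processes more frames
```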
## Tracks Label Prediction

- First, download the model checkpoints and write their path into the `resume_path` field of `configs/example_train.yaml`.

- You can download the checkpoints from Hugging Face,

- or from Google Drive.

- Run inference after processing depth, DINO features, and 2D tracks. The predicted results will be saved to `motin_seg_dir`.

  ```shell
  python core/utils/run_inference.py --data_dir $DATA_DIR --motin_seg_dir $OUTPUT_DIR --config_file $PATH --gpus $GPU_ID --motion_seg_infer --e
  ```
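The config entry to edit looks like this; the path below is only a placeholder for wherever you saved the downloaded checkpoint:

```yaml
# configs/example_train.yaml (excerpt; path is a placeholder)
resume_path: /path/to/downloaded/checkpoint.pth
```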
## Mask Densification Using SAM2

- Run prediction and save the mask and video results. `sam2dir` is where the SAM2-predicted masks are saved, `data_dir` is the directory of the original RGB images, and `motin_seg_dir` holds the results of the Tracks Label Prediction model, which contain dynamic trajectories and visibilities.

  ```shell
  python core/utils/run_inference.py --data_dir $data_dir --sam2dir $result_dir --motin_seg_dir $tracks_label_result --gpus $GPU_ID --sam2 --e
  ```

- Coordinate convention: the trajectory output is (x, y), and SAM2's input is also (x, y).

- Important: the official SAM2 code hard-codes the image suffix to `.jpg`/`.jpeg` and expects filenames that are pure numbers. You can either rename the images or change the code; we change the code in this repo.
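If you prefer renaming the images instead of patching SAM2, a copy step like the hypothetical helper below (assuming the frames are already JPEG-encoded) produces the pure-number `.jpg` names the official loader expects:

```python
import os
import shutil

# Hypothetical helper: copy frames to pure-number .jpg filenames
# (00000.jpg, 00001.jpg, ...) in sorted order. Assumes the source
# frames are already JPEGs; this does not transcode image data.
def rename_for_sam2(src_dir, dst_dir):
    os.makedirs(dst_dir, exist_ok=True)
    for i, name in enumerate(sorted(os.listdir(src_dir))):
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(dst_dir, f"{i:05d}.jpg"))
```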
## Evaluation

### Download Pre-computed Results

- The masks pre-computed by us can be found here.

### MOS task evaluation

For the DAVIS dataset, for example, we use the script below; you can pass a different `$eval_seq_list` to evaluate a subset of DAVIS, such as DAVIS2016-Moving.

```shell
CUDA_VISIBLE_DEVICES=7 python core/eval/eval_mask.py --res_dir $res-dir --eval_dir $gt-dir --eval_seq_list /$root-dir/core/utils/moving_val_sequences.txt
```

If you don't specify `eval_seq_list`, the full sequence list is used by default.
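For reference, the region-similarity score behind DAVIS-style evaluation is plain intersection-over-union between binary masks. The exact metrics computed by `eval_mask.py` may differ, so treat this as a sketch:

```python
import numpy as np

# Region similarity (Jaccard / IoU) between predicted and ground-truth
# binary masks -- the J measure in DAVIS-style evaluation.
def jaccard(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(jaccard(pred, gt))  # 0.5
```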
### Fine-grained MOS task evaluation

```shell
cd core/eval/davis2017-evaluation
CUDA_VISIBLE_DEVICES=3 python evaluation_method.py --task unsupervised --results_path $mask_path
```
## Model Training

### Preprocess data

We take HOI4D as an example.

```shell
# preprocess images and dynamic masks
python core/utils/process_HOI.py
# preprocess everything else
python core/utils/run_inference.py --data_dir current-data-dir/kubric/movie_f/validation/images --gpus 0 1 2 3 4 5 6 7 --tracks --depths --dinos
```

- (optional) You can use this script to check whether all data has been processed:

  ```shell
  python current-data-dir/dynamic_stereo/dynamic_replica_data/check_process.py
  ```
- If you want to train on a custom dataset, the dataset should have ground-truth RGB images and dynamic masks.

- (optional) After processing, clean the data to save storage:

  ```shell
  python core/utils/run_inference.py --data_dir $data_dir --gpus 0 1 2 3 4 5 6 7 --clean
  ```
### Training

- Process the Kubric dataset (we train on the Kubric-Movie-F subset); process Dynamic Stereo and HOI4D as above.

- When preprocessing the Dynamic Stereo dataset, we use `cal_dynamic_mask.py` to obtain the ground-truth dynamic masks by calculating the ground-truth trajectory motion.

- Train on these datasets together:

  ```shell
  CUDA_VISIBLE_DEVICES=3 python train_seq.py ./configs/$CONFIG.yaml
  ```
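The idea behind deriving ground-truth dynamic masks from ground-truth trajectories can be sketched as below; the threshold and the splatting of track points into a mask are illustrative, not the actual `cal_dynamic_mask.py` logic.

```python
import numpy as np

# Illustrative sketch: a trajectory is "dynamic" if its accumulated ground-truth
# motion exceeds a threshold; dynamic track points are splatted into a per-frame
# binary mask. Threshold and splatting scheme are hypothetical.
def dynamic_mask(tracks, h, w, frame=0, thresh=1.0):
    # tracks: (N, T, 2) ground-truth (x, y) trajectories
    motion = np.linalg.norm(np.diff(tracks, axis=1), axis=2).sum(axis=1)  # (N,)
    mask = np.zeros((h, w), dtype=bool)
    for (x, y), m in zip(tracks[:, frame].astype(int), motion):
        if m > thresh:
            mask[y, x] = True  # mark the dynamic track's query-frame location
    return mask

# One static track at (1, 1) and one moving track starting at (2, 2).
tracks = np.array([[[1, 1], [1, 1]],
                   [[2, 2], [5, 2]]], dtype=float)
m = dynamic_mask(tracks, 8, 8)
print(m[2, 2], m[1, 1])  # True False
```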
## Citation

If you find our repo or paper useful, please cite us as:

```bibtex
@InProceedings{Huang_2025_CVPR,
    author    = {Huang, Nan and Zheng, Wenzhao and Xu, Chenfeng and Keutzer, Kurt and Zhang, Shanghang and Kanazawa, Angjoo and Wang, Qianqian},
    title     = {Segment Any Motion in Videos},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3406-3416}
}
```