ORV: 4D Occupancy-centric Robot Video Generation
TL;DR: ORV generates robot videos under the geometric guidance of 4D occupancy, achieving higher control precision, strong generalization, multiview-consistent video generation, and simulation-to-real visual transfer.
<img src="assets/teaser.png" width="100%"/>
Xiuyu Yang*, Bohan Li*, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao
CVPR 2026 (arXiv 2506.03079)
BibTeX
If you find our work useful in your research, please consider citing our paper:
@article{yang2025orv,
title={ORV: 4D Occupancy-centric Robot Video Generation},
author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Zhu, Zheng and Jin, Xin and Zhao, Hang and Zhao, Hao},
journal={arXiv preprint arXiv:2506.03079},
year={2025}
}
Environment Setup
Clone the ORV repository first:
git clone --recurse-submodules https://github.com/OrangeSodahub/ORV.git
cd ORV/
Create a new Python environment:
conda create -n orv python=3.10
conda activate orv
pip install -r requirements.txt
Note that we use CUDA 11.8 by default; please modify the lines in requirements.txt shown below to match your own version:
torch==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118
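If you are on a different CUDA toolkit, you can rewrite the pinned wheel index in one pass instead of editing by hand. A minimal sketch, assuming you want the cu121 wheels (check the PyTorch install matrix for the versions that actually match your toolkit); it is demonstrated here on a copy of the three pinned lines:

```shell
# Write the three default cu118 pins to a scratch copy (in the repo you
# would run the sed command directly on requirements.txt).
printf '%s\n' \
  'torch==2.5.1 --index-url https://download.pytorch.org/whl/cu118' \
  'torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118' \
  'torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118' \
  > /tmp/requirements.txt

# Swap the CUDA suffix; cu121 is an assumption, substitute your own.
sed -i 's|/whl/cu118|/whl/cu121|g' /tmp/requirements.txt
cat /tmp/requirements.txt
```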
Our checkpoints are hosted on Hugging Face; feel free to download them.
Data Preparation
1. (Singleview) Video-Trajectory Data
For BridgeV2 and RT-1 data (singleview), we primarily reuse the video-trajectory data from IRASim (originally from the OXE version). We also put the download links below for convenience:
| Data | Train | Evaluation |
|:--:|:-----:|:----------:|
| BridgeV2 | bridge_train_data | bridge_eval_data |
| RT-1 | rt1_train_data | rt1_eval_data |
This version of the data has raw resolutions of 480×640 for BridgeV2 and 256×320 for RT-1; we train ORV models at a preprocessed 320×480 resolution (please refer to Section E.3 in the paper for details).
2. (Multiview) Video-Trajectory Data
Please download the official BridgeV2 tfds data (each episode has 1~3 views) and then extract the usable bridge data:
bash scripts/extract_data_tfds.sh bridge
We follow the official DROID tutorials to download the DROID dataset (each episode has 2 views) in RLDS format and then extract it:
# download raw tfds data (~1.7TB)
gsutil -m cp -r gs://gresearch/robotics/droid <path_to_your_target_dir>
# extract
bash scripts/extract_data_tfds.sh droid
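Since the raw tfds download is around 1.7 TB, it may be worth checking free space on the target filesystem first. A small optional check of our own (the 1700 GB threshold is an assumption; run it from your target directory):

```shell
# Verify free disk space before the ~1.7 TB DROID download.
NEED_GB=1700
AVAIL_GB=$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')
if [ "$AVAIL_GB" -ge "$NEED_GB" ]; then
  echo "ok: ${AVAIL_GB}G free"
else
  echo "need ${NEED_GB}G, only ${AVAIL_GB}G free"
fi
```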
This version of the data has raw resolutions of 256×256 for BridgeV2 and 180×256 for DROID; we train ORV models at 320×480 for BridgeV2 and 256×384 for DROID.
3. Occupancy Data
To be finished.
4. Encode to (VAE) latents
Training or evaluating with preprocessed latents instead of encoding videos or images online dramatically saves memory and time. We use the VAE loaded from the Hugging Face THUDM/CogVideoX-2b repo.
Please refer to scripts/encode_dataset.sh and scripts/encode_dataset_dist.sh to encode images or videos into latents and save them to disk. First check the arguments in the scripts (--dataset, --data_root and --output_dir), then run:
# single process
bash scripts/encode_dataset.sh $SPLIT $BATCH
# multiple processes
bash scripts/encode_dataset_dist.sh $GPU $SPLIT $BATCH
where $SPLIT is one of 'train', 'val', 'test', $GPU is the number of devices, and $BATCH is the dataloader batch size (we recommend 1).
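The steps above can be sketched as a small driver loop. A dry-run example of our own (the 8-GPU count is an assumption about your machine; drop the echo to actually launch):

```shell
# Print the distributed encode command for every split, 8 GPUs, batch size 1.
for SPLIT in train val test; do
  echo bash scripts/encode_dataset_dist.sh 8 "$SPLIT" 1
done
```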
For data reused from IRASim, please ignore their processed latents; only the raw .mp4 data will be used.
Training
Stage1 Singleview Action-to-video Generation
We first obtain the basic singleview action-to-video generation model by SFT, starting from the pretrained THUDM/CogVideoX-2b (text-to-video) model. Please check out and run the following script:
bash scripts/train_control_traj-image_finetune_2b.sh --dataset_type $DATASET
where $DATASET is chosen from ['bridgev2', 'rt1', 'droid'].
Use the correct configurations:
- CUDA devices: set the correct value for the key `ACCELERATE_CONFIG_FILE` in these `.sh` scripts, which are used for accelerate launching. Predefined `.yaml` files are at config/accelerate;
- Experimental settings: each configuration in config/traj_image_*.yaml corresponds to one training experimental setting and one model. Set the correct value for the key `EXP_CONFIG_PATH` in the scripts.
Stage2 Occupancy-conditioned Generation
We incorporate occupancy-derived conditions for more accurate control. First set the correct path to the stage-1 pretrained model in config/traj_image_condfull_2b_finetune.yaml:
transformer:
<<: *runtime
pretrained_model_name_or_path: THUDM/CogVideoX-2b
transformer_model_name_or_path: outputs/orv_bridge_traj-image_480-320_finetune_2b_30k/checkpoint
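Before launching stage 2, it can be worth verifying that the stage-1 checkpoint referenced in the yaml actually exists. A quick check of our own (the path below is the example value from the config; substitute your own run directory):

```shell
# Confirm the stage-1 checkpoint directory is present before stage-2 training.
CKPT=outputs/orv_bridge_traj-image_480-320_finetune_2b_30k/checkpoint
if [ -d "$CKPT" ]; then
  echo "found stage-1 checkpoint: $CKPT"
else
  echo "missing stage-1 checkpoint: $CKPT"
fi
```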
Then run the following script (the yaml config file above is set in this script):
bash scripts/train_control_traj-image-cond_finetune.sh
Stage3 Multiview Generation
This step further extends the singleview generation model to a multiview generation model. First set the correct path to the pretrained singleview model in config/traj_image_2b_multiview.yaml, then run the following script:
bash scripts/train_control_traj-image-multiview.sh
Note that the RGB and condition data of all views must be processed into latents first.
Evaluation and Metrics
1. Inference on dataset
Generally, run the following script to run inference with the trained model on a specific dataset:
# single process
bash scripts/eval_control_to_video.sh
# multiple processes
bash scripts/eval_control_to_video_dist.sh $GPU
Please choose the correct *.yaml configuration file in the scripts:
- `eval_traj_image_2b_finetune.yaml`: base action-to-video model
- `eval_traj_image_cond_2b_finetune.yaml`: singleview occupancy-conditioned model
- `eval_traj_image_condfull_2b_multiview.yaml`: multiview occupancy-conditioned model
2. Metrics Calculation
Set the keys GT_PATH and PRED_PATH in the following script and run it to calculate metrics (refer to Section E.4 in the paper for more details):
bash scripts/eval_metrics.sh
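For reference, GT_PATH should point at the ground-truth videos and PRED_PATH at the generated ones. A hypothetical layout sketch (both paths below are assumptions, not paths the repo guarantees; mirror whatever structure your inference run produced before editing scripts/eval_metrics.sh):

```shell
# Hypothetical example values for the two keys in scripts/eval_metrics.sh.
GT_PATH=data/bridge/eval_videos
PRED_PATH=outputs/eval/generated_videos
echo "GT_PATH=$GT_PATH PRED_PATH=$PRED_PATH"
```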
Inference on Demo Data
To be finished.
TODO
- [x] Release arXiv technique report
- [x] Release full codes
- [x] Release checkpoints
- [ ] Finish the full instructions
- [ ] Release processed data
Acknowledgement
Thanks to these excellent open-source works and models: CogVideoX; diffusers.