ORV: 4D Occupancy-centric Robot Video Generation
TL;DR: ORV generates robot videos under the geometric guidance of 4D occupancy, achieving higher control precision, strong generalization, multiview-consistent video generation, and simulation-to-real visual transfer.
<img src="assets/teaser.png" width="100%"/>
Xiuyu Yang*, Bohan Li*, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao
CVPR 2026 (arXiv 2506.03079)
BibTeX
If you find our work useful in your research, please consider citing our paper:
@article{yang2025orv,
title={ORV: 4D Occupancy-centric Robot Video Generation},
author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Zhu, Zheng and Jin, Xin and Zhao, Hang and Zhao, Hao},
journal={arXiv preprint arXiv:2506.03079},
year={2025}
}
Environment Setup
Clone the ORV repository first:
git clone --recurse-submodules https://github.com/OrangeSodahub/ORV.git
cd ORV/
Create a new Python environment:
conda create -n orv python=3.10
conda activate orv
pip install -r requirements.txt
Note that we use CUDA 11.8 by default; please modify the lines in requirements.txt shown below to match your own version:
torch==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118
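If you are on a different CUDA toolkit, you can rewrite the pinned wheel index in one pass instead of editing by hand. A minimal sketch, assuming you want the cu121 wheels (check the PyTorch install matrix for the versions that actually match your toolkit); it is demonstrated here on a copy of the three pinned lines:

```shell
# Write the three default cu118 pins to a scratch copy (in the repo you
# would run the sed command directly on requirements.txt).
printf '%s\n' \
  'torch==2.5.1 --index-url https://download.pytorch.org/whl/cu118' \
  'torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118' \
  'torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118' \
  > /tmp/requirements.txt

# Swap the CUDA suffix; cu121 is an assumption, substitute your own.
sed -i 's|/whl/cu118|/whl/cu121|g' /tmp/requirements.txt
cat /tmp/requirements.txt
```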
Our checkpoints are hosted on Hugging Face; feel free to download them.
Data Preparation
1. (Singleview) Video-Trajectory Data
For BridgeV2 and RT-1 data (singleview), we primarily reuse the video-trajectory data from IRASim (originally from the OXE version). We also put the download links below for convenience:
| Data | Train | Evaluation |
|:--:|:-----:|:----------:|
| BridgeV2 | bridge_train_data | bridge_eval_data |
| RT-1 | rt1_train_data | rt1_eval_data |
This version of the data has raw resolutions of 480×640 for BridgeV2 and 256×320 for RT-1; we train ORV models at a preprocessed 320×480 resolution (please refer to Section E.3 in the paper for details).
2. (Multiview) Video-Trajectory Data
Please download the official BridgeV2 tfds data (each episode has 1~3 views) and then extract the usable bridge data:
bash scripts/extract_data_tfds.sh bridge
We follow the official DROID tutorials to download the DROID dataset (each episode has 2 views) in RLDS format and then extract it:
# download raw tfds data (~1.7TB)
gsutil -m cp -r gs://gresearch/robotics/droid <path_to_your_target_dir>
# extract
bash scripts/extract_data_tfds.sh droid
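Since the raw tfds download is around 1.7 TB, it may be worth checking free space on the target filesystem first. A small optional check of our own (the 1700 GB threshold is an assumption; run it from your target directory):

```shell
# Verify free disk space before the ~1.7 TB DROID download.
NEED_GB=1700
AVAIL_GB=$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')
if [ "$AVAIL_GB" -ge "$NEED_GB" ]; then
  echo "ok: ${AVAIL_GB}G free"
else
  echo "need ${NEED_GB}G, only ${AVAIL_GB}G free"
fi
```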
This version of the data has raw resolutions of 256×256 for BridgeV2 and 180×256 for DROID; we train ORV models at 320×480 for BridgeV2 and 256×384 for DROID.
3. Occupancy Data
To be finished.
4. Encode to (VAE) latents
Training or evaluating with preprocessed latents instead of encoding videos or images online dramatically saves memory and time. We use the VAE loaded from the Hugging Face THUDM/CogVideoX-2b repo.
Please refer to scripts/encode_dataset.sh and scripts/encode_dataset_dist.sh to encode images or videos into latents and save them to disk. First check the arguments in the scripts (--dataset, --data_root and --output_dir), then run:
# single process
bash scripts/encode_dataset.sh $SPLIT $BATCH
# multiple processes
bash scripts/encode_dataset_dist.sh $GPU $SPLIT $BATCH
where $SPLIT is one of 'train', 'val', 'test', $GPU is the number of devices, and $BATCH is the dataloader batch size (we recommend 1).
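The steps above can be sketched as a small driver loop. A dry-run example of our own (the 8-GPU count is an assumption about your machine; drop the echo to actually launch):

```shell
# Print the distributed encode command for every split, 8 GPUs, batch size 1.
for SPLIT in train val test; do
  echo bash scripts/encode_dataset_dist.sh 8 "$SPLIT" 1
done
```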
For data reused from IRASim, please ignore their processed latents; only the raw .mp4 data will be used.
Training
Stage1 Singleview Action-to-video Generation
We first obtain the basic singleview action-to-video generation model by SFT, starting from the pretrained THUDM/CogVideoX-2b (text-to-video) model. Please check out and run the following script:
bash scripts/train_control_traj-image_finetune_2b.sh --dataset_type $DATASET
where $DATASET is chosen from ['bridgev2', 'rt1', 'droid'].
Use the correct configurations:
- CUDA devices: set the correct value for the key `ACCELERATE_CONFIG_FILE` in these `.sh` scripts, which are used for accelerate launching. Predefined `.yaml` files are at config/accelerate;
- Experimental settings: each configuration in config/traj_image_*.yaml corresponds to one training experimental setting and one model. Set the correct value for the key `EXP_CONFIG_PATH` in the scripts.
Stage2 Occupancy-conditioned Generation
We incorporate occupancy-derived conditions for more accurate control. First set the correct path to the stage-1 pretrained model in config/traj_image_condfull_2b_finetune.yaml:
transformer:
<<: *runtime
pretrained_model_name_or_path: THUDM/CogVideoX-2b
transformer_model_name_or_path: outputs/orv_bridge_traj-image_480-320_finetune_2b_30k/checkpoint
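Before launching stage 2, it can be worth verifying that the stage-1 checkpoint referenced in the yaml actually exists. A quick check of our own (the path below is the example value from the config; substitute your own run directory):

```shell
# Confirm the stage-1 checkpoint directory is present before stage-2 training.
CKPT=outputs/orv_bridge_traj-image_480-320_finetune_2b_30k/checkpoint
if [ -d "$CKPT" ]; then
  echo "found stage-1 checkpoint: $CKPT"
else
  echo "missing stage-1 checkpoint: $CKPT"
fi
```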
Then run the following script (the yaml config file above is set in this script):
bash scripts/train_control_traj-image-cond_finetune.sh
Stage3 Multiview Generation
This step further extends the singleview generation model to a multiview generation model. First set the correct path to the pretrained singleview model in config/traj_image_2b_multiview.yaml, then run the following script:
bash scripts/train_control_traj-image-multiview.sh
Note that the RGB and condition data of all views must be processed into latents first.
Evaluation and Metrics
1. Inference on dataset
Generally, run the following script to run inference with the trained model on a specific dataset:
# single process
bash scripts/eval_control_to_video.sh
# multiple processes
bash scripts/eval_control_to_video_dist.sh $GPU
Please choose the correct *.yaml configuration file in the scripts:
- `eval_traj_image_2b_finetune.yaml`: base action-to-video model
- `eval_traj_image_cond_2b_finetune.yaml`: singleview occupancy-conditioned model
- `eval_traj_image_condfull_2b_multiview.yaml`: multiview occupancy-conditioned model
2. Metrics Calculation
Set the keys GT_PATH and PRED_PATH in the following script and run it to calculate metrics (refer to Section E.4 in the paper for more details):
bash scripts/eval_metrics.sh
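For reference, GT_PATH should point at the ground-truth videos and PRED_PATH at the generated ones. A hypothetical layout sketch (both paths below are assumptions, not paths the repo guarantees; mirror whatever structure your inference run produced before editing scripts/eval_metrics.sh):

```shell
# Hypothetical example values for the two keys in scripts/eval_metrics.sh.
GT_PATH=data/bridge/eval_videos
PRED_PATH=outputs/eval/generated_videos
echo "GT_PATH=$GT_PATH PRED_PATH=$PRED_PATH"
```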
Inference on Demo Data
To be finished.
TODO
- [x] Release arXiv technique report
- [x] Release full codes
- [x] Release checkpoints
- [ ] Finish the full instructions
- [ ] Release processed data
Acknowledgement
Thanks to these excellent open-source works and models: CogVideoX; diffusers.