UniVLA

[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions


> [!IMPORTANT]
> 🌟 Stay up to date at opendrivelab.com!

:earth_asia: UniVLA

<div id="top" align="center"> <p align="center"> <img src="assets/univla-teaser_new.png" width="1000px" > </p> </div>

:page_facing_up: Paper | :rocket: Demo Page (Coming Soon)

:black_nib: Qingwen Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, H. Li
:e-mail: Primary Contact: Qingwen Bu (buqingwen@opendrivelab.com)

:fire: Highlights

  • A recipe towards generalist policy by planning in a unified, embodiment-agnostic action space.
  • A novel approach for extracting task-centric latent actions from cross-embodiment videos.
  • A VLA that achieves state-of-the-art results on multiple benchmarks with compute-efficient training.


:movie_camera: Demo

Real-world robot experiments.

<table style="width:100%;border-collapse:collapse;"> <tr> <td style="text-align:center;"><b>Store the screwdriver (1x speed)</b></td> <td style="text-align:center;"><b>Clean the cutting board (1x speed)</b></td> <td style="text-align:center;"><b>Fold towel twice (1x speed)</b></td> </tr> <tr> <td><video src="https://github.com/user-attachments/assets/b11b4e83-24d8-4b55-b50e-f8271249422c" style="object-fit:cover;" autoplay loop muted></video></td> <td><video src="https://github.com/user-attachments/assets/bafb5bac-8c8e-41d4-89d0-ec774b9b6e1c" style="object-fit:cover;" autoplay loop muted></video></td> <td><video src="https://github.com/user-attachments/assets/6779e0e4-aa6e-4c16-adb9-30dedfd4db85" style="object-fit:cover;" autoplay loop muted></video></td> </tr> <tr> <td style="text-align:center;"><b>Stack the tower of hanoi (1x speed)</b></td> </tr> <tr> <td><video src="https://github.com/user-attachments/assets/61f663da-18df-4892-ae8f-5e03aac7469e" style="object-fit:cover;" autoplay loop muted></video></td> <td><video src="https://github.com/user-attachments/assets/da7d7d4e-0634-42d7-8e88-8bb269965b1a" style="object-fit:cover;" autoplay loop muted></video></td> <td><video src="https://github.com/user-attachments/assets/cb3afa9a-ffeb-4879-b915-1803d7ff8262" style="object-fit:cover;" autoplay loop muted></video></td> </tr> </table>

:loudspeaker: News

  • [2025/05] The code of UniVLA v1.0 is released. Please check it out!

🤗 Model Zoo <a name="ckpts"></a>

<table>
<tr> <th>Model Name</th> <th>Backbone</th> <th>HF Path</th> <th>Note</th> </tr>
<tr> <td>lam-stage-1</td> <td> - </td> <td><a href="https://huggingface.co/qwbu/univla-latent-action-model">univla-latent-action-model</a></td> <td>The stage-1 latent action model trained on OpenX and Ego4D.</td> </tr>
<tr> <td>lam-stage-2</td> <td> - </td> <td><a href="https://huggingface.co/qwbu/univla-latent-action-model">univla-latent-action-model</a></td> <td>The stage-2 latent action model trained on OpenX and Ego4D. (Generates task-centric latent actions.)</td> </tr>
<tr> <td>univla-7b</td> <td><a href="https://huggingface.co/TRI-ML/prismatic-vlms/tree/main/prism-dinosiglip-224px%2B7b">TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b">univla-7b</a></td> <td>UniVLA pretrained on our full data collection (Manip. + Navi. + Human).</td> </tr>
<tr> <td>univla-7b-bridge-pt</td> <td><a href="https://huggingface.co/TRI-ML/prismatic-vlms/tree/main/prism-dinosiglip-224px%2B7b">TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-bridge-pt">univla-7b-bridge-pt</a></td> <td>UniVLA pretrained only on BridgeV2 data.</td> </tr>
<tr> <td>univla-7b-human-pt</td> <td><a href="https://huggingface.co/TRI-ML/prismatic-vlms/tree/main/prism-dinosiglip-224px%2B7b">TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-human-pt">univla-7b-human-pt</a></td> <td>UniVLA pretrained only on Ego4D human videos.</td> </tr>
<tr> <td>univla-libero</td> <td><a href="https://huggingface.co/qwbu/univla-7b">univla-7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-224-sft-libero">univla-7b-224-sft-libero</a></td> <td>Finetuned on the LIBERO dataset.</td> </tr>
<tr> <td>univla-calvin</td> <td><a href="https://huggingface.co/qwbu/univla-7b">univla-7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-224-sft-calvin">univla-7b-224-sft-calvin</a></td> <td>Finetuned on the CALVIN dataset.</td> </tr>
<tr> <td>univla-r2r</td> <td><a href="https://huggingface.co/qwbu/univla-7b">univla-7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-224-sft-r2r">univla-7b-224-sft-r2r</a></td> <td>Finetuned on the R2R dataset.</td> </tr>
<tr> <td>univla-bridge</td> <td><a href="https://huggingface.co/qwbu/univla-7b">univla-7b</a></td> <td><a href="https://huggingface.co/qwbu/univla-7b-224-sft-simpler-bridge">univla-7b-224-sft-simpler-bridge</a></td> <td>Finetuned on the BridgeV2 (OXE ver.) dataset.</td> </tr>
</table>

:video_game: Getting Started <a name="installation"></a>

  1. (Optional) We use conda to manage the environment.

```shell
conda create -n univla python=3.10 -y
conda activate univla
```

  2. Install dependencies.

```shell
# Install PyTorch
# Look up https://pytorch.org/get-started/previous-versions/ with your CUDA version for the correct command
# Our experiments are conducted with 'torch 2.2.0 + cuda 12.1'
pip install torch torchvision

# Clone our repo and pip install to download dependencies
git clone git@github.com:OpenDriveLab/UniVLA.git
cd univla
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
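After installation, a quick Python check can confirm that the core packages are importable. This helper script is illustrative and not part of the repo:

```python
# Sanity check (illustrative, not part of UniVLA): verify that the core
# dependencies are importable and report their versions.
import importlib


def check_deps(names):
    """Return a dict mapping package name -> version string, or None if missing."""
    found = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            found[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    # 'flash_attn' is needed for training; torch/torchvision for everything.
    for pkg, ver in check_deps(["torch", "torchvision", "flash_attn"]).items():
        print(f"{pkg}: {ver or 'MISSING'}")
```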

:fire: Training Recipe

:zero: Data Preparation

Please refer to this script for an example of how to download datasets from OXE.

(Optional) Please follow this instruction if you'd like to convert Ego4D data into RLDS format for training UniVLA.
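For orientation, RLDS organizes each trajectory as an episode of per-step records. The sketch below mimics that layout with plain Python dicts; the concrete keys inside `observation` are illustrative, and real RLDS data is serialized with TensorFlow Datasets rather than built by hand:

```python
# Minimal sketch of an RLDS-style episode using plain Python dicts.
# The steps/observation/action/is_first/is_last layout follows the RLDS
# convention; the keys inside 'observation' are illustrative.
def make_episode(frames, actions, instruction):
    steps = []
    n = len(frames)
    for i, (frame, action) in enumerate(zip(frames, actions)):
        steps.append({
            "observation": {"image": frame, "language_instruction": instruction},
            "action": action,
            "is_first": i == 0,          # marks the start of the trajectory
            "is_last": i == n - 1,       # marks the end of the trajectory
        })
    return {"steps": steps}


episode = make_episode(
    frames=["frame_0.jpg", "frame_1.jpg", "frame_2.jpg"],
    actions=[[0.0] * 7, [0.1] * 7, [0.0] * 7],  # e.g. 7-DoF end-effector deltas
    instruction="pick up the cup",
)
```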

:one: Task-centric Latent Action Learning

We highly recommend directly using our pre-trained latent action model checkpoints to save time and compute.

> [!NOTE]
> Our latent action model is trained on a comprehensive data collection, encompassing multiple robotic manipulation and navigation datasets from Open X-Embodiment, along with a curated subset of the Ego4D dataset (detailed data construction procedures are provided in the appendix of our paper).

To adapt the model to additional datasets or custom data sources, users may refer to ./prismatic/vla/datasets/rlds/oxe/mixtures.py to either utilize predefined data mixtures or define new ones. Subsequently, the data_mix parameter in the configuration file should be updated accordingly.
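For illustration, an OXE-style mixture registry typically maps a mixture name to (dataset_name, sampling_weight) pairs. The sketch below is hypothetical; the dataset names and weights are not from the repo, so consult mixtures.py for the real definitions:

```python
# Hypothetical sketch of a data-mixture registry following the
# (dataset_name, sampling_weight) convention of OXE-style mixture files.
# All names and weights below are illustrative, not from the repo.
MIXTURES = {
    "bridge": [("bridge_orig", 1.0)],
    "my_custom_mix": [
        ("bridge_orig", 1.0),    # robot manipulation
        ("gnm_dataset", 0.5),    # navigation
        ("ego4d_split", 0.25),   # human videos
    ],
}


def normalized_weights(mix_name):
    """Return per-dataset sampling probabilities for a named mixture."""
    mix = MIXTURES[mix_name]
    total = sum(w for _, w in mix)
    return {name: w / total for name, w in mix}
```

With a mixture registered this way, pointing the data_mix parameter at "my_custom_mix" would sample datasets in proportion to the normalized weights.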

The latent action model is implemented based on a VQ-VAE. We train it on a data collection comprising robot manipulation, navigation, and human videos. In stage-1 training, we use an overall batch size of 512 and 100k optimization steps to construct the task-irrelevant latent actions:

```shell
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-1.yaml \
    2>&1 | tee lam-stage-1.log
```
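The quantization step at the core of a VQ-VAE can be sketched as follows. This is a pure-Python illustration with a toy 2-D codebook; the actual latent action model operates on torch tensors with a learned codebook:

```python
# Minimal sketch of VQ-VAE quantization: each continuous latent is snapped to
# its nearest codebook entry, and the entry's index becomes the discrete
# latent action. Toy 2-D vectors for illustration only.
def quantize(latent, codebook):
    """Return (index, code) of the codebook entry closest to `latent` in L2."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # C = 4 codes
idx, code = quantize([0.9, 0.1], codebook)  # -> nearest entry is [1.0, 0.0]
```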

The following stage-2 training then focuses on learning task-centric latent actions on the basis of the stage-1 results. Please modify the stage_one_ckpt entry in latent_action_model/config/lam-stage-2.yaml to point to your local stage-1 checkpoint, then run training with:

```shell
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-2.yaml \
    2>&1 | tee lam-stage-2.log
```

:two: Pretraining of Generalist Policy

  • Latent Action Pseudo-Labeling for Policy Optimization: The trained latent action model is employed to generate pseudo-labels for policy optimization via a next-token prediction objective. Specifically, the indices of inferred latent actions in the VQ-VAE codebook are mapped to dedicated tokens in the LLaMA tokenizer, denoted as {ACT_0, ACT_1, ..., ACT_C}.

  • Cost-effective Pre-Training: The full-scale pre-training procedure, incorporating both OpenX and Ego4D datasets, was performed on a 32-GPU A100 cluster over 20,000 optimization steps. This training regimen required approximately 960 A100 GPU-hours, representing just 5% of the computational resources utilized by OpenVLA.
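The index-to-token mapping described in the pseudo-labeling step can be sketched as follows. The codebook size and tokenizer wiring here are illustrative; the real token set is added to the LLaMA tokenizer's vocabulary:

```python
# Sketch of mapping VQ codebook indices to dedicated action tokens: index i in
# a codebook of size C becomes the token ACT_i, as in the pseudo-labeling
# step. CODEBOOK_SIZE is illustrative; the real value comes from the config.
CODEBOOK_SIZE = 16

ACTION_TOKENS = [f"ACT_{i}" for i in range(CODEBOOK_SIZE)]


def latent_indices_to_tokens(indices):
    """Map a sequence of latent-action codebook indices to action tokens."""
    return [ACTION_TOKENS[i] for i in indices]


# A pseudo-labeled chunk of latent actions becomes a next-token-prediction
# target for the policy:
tokens = latent_indices_to_tokens([3, 0, 15])  # -> ['ACT_3', 'ACT_0', 'ACT_15']
```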
