VLABench

Official repo of VLABench, a large-scale benchmark designed for fairly evaluating VLAs, embodied agents, and VLMs.

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

🎓 Paper | 🌐 Project Website | 🤗 Hugging Face | 🐳 Quick Start with Docker

<img src="docs/images/Figure1_overview.png" width="100%" />

<span style="font-size:16px"> 🚨 <span style="color:#AB4459;">NOTICE:</span> Please feel free to open an issue or create a PR! If I do not respond to issues in a timely manner, feel free to email me directly. I will do my best to build a more user-friendly community ecosystem for VLABench within this year.</span>

News

  • Preview: A complete infrastructure framework will be open-sourced alongside new work, including full training pipelines, VLABench evaluation, a new leaderboard, data processing, and real-device deployment. Stay tuned!
  • 2025/11/10 Uploaded several new checkpoints as baselines. Action representation matters.
  • 2025/8/06 Updated VLABench with:
    • parallel data collection;
    • a parallel evaluation example;
    • camera augmentation;
    • fine-tuning datasets in RLDS and LeRobot formats on Hugging Face;
    • an update to the pi0 codebase.
  • 2025/6/26 VLABench has been accepted by ICCV 2025.
  • 2025/4/10 Released the fine-tuned pi0 checkpoints (pi0-base and pi0-fast) on Hugging Face.
  • 2025/3/25 Released the standard evaluation episodes and the primitive-task fine-tuning dataset.
  • 2025/2/26 Released the reference evaluation pipeline.
  • 2025/2/14 Released the scripts for trajectory generation.
  • 2024/12/25 The preview version of VLABench has been released! The preview version showcases most of the designed tasks and structure, but its functionality is still being organized and tested.

Recent Work Todo

  • [x] Organize the functional code sections.
    • [x] Rebuild an efficient, user-friendly, and comprehensive evaluation framework.
    • [x] Manage the automatic data workflow for existing tasks.
    • [x] Improve the DSL of the skill library.
  • [x] Release the trajectory and evaluation scripts.
  • [x] Test the interface for humanoid and dual-arm manipulation.
  • [x] Release the remaining tasks that were not included in the preview version.
  • [ ] Leaderboard of VLAs and VLMs in the standard evaluation
    • [x] Release standard evaluation datasets/episodes, across different dimensions and difficulty levels.
    • [x] Release the standard fine-tuning dataset.
    • [ ] Integrate commonly used VLA models to facilitate replication (continuously updated).
  • [ ] Release more datasets, including a pretraining version and composite tasks, once testing is finalized.
  • [ ] Release more checkpoints.
  • [ ] Update the complete training and evaluation community.

Installation

Install VLABench

  1. Prepare conda environment
conda create -n vlabench python=3.10
conda activate vlabench

git clone https://github.com/OpenMOSS/VLABench.git
cd VLABench
pip install -r requirements.txt
pip install -e .
  2. Download the assets
python scripts/download_assets.py

The script automatically downloads the necessary assets and unzips them into the correct directory.
  3. (Optional) Initialize submodules
git submodule update --init --recursive

This updates the other policy repositories, such as openpi.

Data Collection

Run scripts to generate an HDF5 dataset with multiprocessing

We provide a brief tutorial in tutorials/2.auto_trajectory_generate.ipynb, and the full code is in scripts/trajectory_generation.py. Trajectory generation can be sped up several times by using multiple processes. A naive way to use it is:

sh dataset_generation.sh

The current version does not support multiprocessing environments within the code itself. We will optimize collection efficiency as much as possible in future updates. After the script runs, each trajectory is stored as an HDF5 file in the directory you specify.
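Because each trajectory is written as its own HDF5 file, a quick way to check collection progress is to count files per folder. The following is a minimal sketch, not part of VLABench: the helper name `count_trajectories`, the `.hdf5` suffix, and the one-sub-folder-per-task layout are all assumptions, so adjust them to match your output directory.

```python
from collections import Counter
from pathlib import Path


def count_trajectories(root: str, suffix: str = ".hdf5") -> dict[str, int]:
    """Count trajectory files under `root`, grouped by parent folder name.

    Assumption: files end in `suffix` and sit in one sub-folder per task.
    """
    counts: Counter[str] = Counter()
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            counts[path.parent.name] += 1
    return dict(counts)


# Usage (replace the path with the directory you passed to the script):
#     for folder, n in sorted(count_trajectories("/your/path/to/dataset").items()):
#         print(f"{folder}: {n} trajectories")
```

Grouping by the parent folder name keeps the sketch independent of how deep the files are nested.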

To run parallel data collection across multiple GPUs (for example, 8 GPUs), refer to

bash sh/data_generation/multi_gpu_data_generation.sh
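The general idea behind multi-GPU collection can be sketched as follows: split the task list into one shard per GPU and give each worker process its own `CUDA_VISIBLE_DEVICES` value. This is a generic pattern under stated assumptions and does not reproduce what the shell script actually does. The helper names and task names are hypothetical.

```python
import os


def shard_round_robin(items: list[str], n_workers: int) -> list[list[str]]:
    """Split `items` into `n_workers` shards, assigning items in round-robin order."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [items[i::n_workers] for i in range(n_workers)]


def worker_env(gpu_id: int) -> dict[str, str]:
    """Copy the current environment and restrict the worker to one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


# Hypothetical task names, used only for illustration.
tasks = ["task1", "task2", "task3", "task4", "task5"]
for gpu_id, shard in enumerate(shard_round_robin(tasks, n_workers=2)):
    env = worker_env(gpu_id)
    # A real launcher would start one process per shard here, for example
    # with subprocess.Popen(<your generation command>, env=env).
    print(f"GPU {env['CUDA_VISIBLE_DEVICES']}: {shard}")
```

Round-robin assignment keeps shard sizes within one item of each other, which helps when tasks take roughly the same time to generate.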

Convert to RLDS format

Because some frameworks, such as Octo and OpenVLA, train on data in the RLDS format, we follow the process from rlds_dataset_builder to provide an example of converting the HDF5 dataset above into RLDS format. First, run

python scripts/convert_to_rlds.py --task [list] --save_dir /your/path/to/dataset

This creates a Python file containing the task's RLDS builder in that directory. Then run

cd /your/path/to/dataset/task

tfds build

This process takes a long time with a single process, and we are still testing a multithreaded method. The code in the original repository seems to have some bugs.

Convert to LeRobot format

Following the way openpi processes the LIBERO dataset, we offer a simple way to convert HDF5 data files into LeRobot format. Run the script with

python scripts/convert_to_lerobot.py --dataset-name [your-dataset-name] --dataset-path /your/path/to/dataset --max-files 100

By default, the processed LeRobot dataset is stored in HF_HOME/lerobot/dataset-name.
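To find where the converted dataset landed, you can resolve that default path yourself. The sketch below assumes the HF_HOME/lerobot/dataset-name layout described above and the usual Hugging Face rule that HF_HOME falls back to `$XDG_CACHE_HOME/huggingface`, or `~/.cache/huggingface` when neither variable is set. The helper name `default_lerobot_dir` is hypothetical, and your installed LeRobot version may use a different location.

```python
import os
from pathlib import Path


def default_lerobot_dir(dataset_name: str) -> Path:
    """Return HF_HOME/lerobot/<dataset_name>, resolving HF_HOME's usual default."""
    cache_root = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    hf_home = os.environ.get("HF_HOME", str(Path(cache_root) / "huggingface"))
    return Path(hf_home).expanduser() / "lerobot" / dataset_name


# Replace the placeholder with the name passed to --dataset-name.
print(default_lerobot_dir("your-dataset-name"))
```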

Extension

VLABench adopts a flexible modular framework for task construction, offering high adaptability. You can follow the process outlined in tutorial 6.

<!-- ### Register New Entity 1. Process the obj file with `obj2mjcf`(https://github.com/kevinzakka/obj2mjcf). Here is an use demo, `obj2mjcf --verbose --obj-dir your_own_obj_dir --compile-model --save-mjcf --decompose` 2. Put the processed xml files/directory to somewhere under VLABench/assets/meshes. 3. If it's a new class of entity, please register a entity class in VLABench/tasks/components with global register. Then, import the class in the `VLABench/tasks/components/__init__.py`. 4. Register it in `VLABench/configs/constant.py` for global access. ### Register New Task 1. Create new task class file under `VLABench/tasks/hierarchical_tasks`. And register it with global register in `VLABench/utils/register.py`. Notice that if the current condition can not met your requirement, you should write a single Condition class in `VLABench/tasks/condition.py`. 2. Import the new task class in `VLABench/tasks/hierarchical_tasks/__init__.py`. -->

Evaluate

VLABench currently provides standard benchmark datasets, focusing on generalization across multiple dimensions. In the VLABench/configs/evaluation/tracks directory, we have set up multiple benchmark sets across different dimensions. These configs ensure that different models can be fairly compared under the same episodes on different machines.

| Track | Description |
|----------|----------|
| track_1_in_distribution | Evaluation of the policy's task learning ability, requiring it to fit in-domain episodes with a small and diverse set of data. |
| track_2_cross_categroy | Evaluation of the policy's generalization ability at the object category level & instance level, requiring visual generalization capability. |
| track_3_common_sense | Evaluation of the policy's application of common sense, requiring the use of common sense understanding for describing the target. |
| track_4_semantic_instruction | Evaluation of the policy's ability to understand complex semantics involves instructions that are rich in contextual or semantic information. |
| track_5_cross_task | Evaluation of the policy's ability to transfer skills across tasks is kept open in this setting, allowing users to choose training tasks and evaluation tasks according to their needs. |
| track_6_unseen_texture | Evaluation of the policy's visual robustness, involving episodes with different backgrounds and table textures in this setting. |

NOTICE: The evaluation can also be done by directly sampling episodes from the environment. This evaluation method is more flexible, but there is a risk of improperly initialized episodes. We recommend using the 'evaluation_tracks' method for evaluation.

VLA/policy evaluation

We provide a standardized fine-tuning dataset, which can be downloaded from hf-dataset. In this version, the data focuses on primitive tasks. We selected 10 basic tasks and provided 500 samples for each task.

Since the current version of VLA does not perform well on primitive tasks, we plan to focus on enhancing VLA’s capabilities in this area first. In the future, we will release a more organized dataset for more composite tasks.

1. Evaluate OpenVLA

Before evaluating your fine-tuned OpenVLA, compute the norm_stat on your dataset and place it in VLABench/configs/model/openvla_config.json.
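As a rough illustration of what computing normalization statistics involves, the sketch below derives per-dimension statistics from a list of action vectors using only the standard library. The function name `action_stats` and the key names (`mean`, `std`, `min`, `max`, `q01`, `q99`) are assumptions chosen for illustration. Check the config file for the fields and layout it actually expects before copying values into it.

```python
import statistics


def action_stats(actions: list[list[float]]) -> dict[str, list[float]]:
    """Per-dimension statistics over a list of equal-length action vectors."""
    if len(actions) < 2:
        raise ValueError("need at least two actions to compute quantiles")
    columns = list(zip(*actions))  # transpose: one tuple per action dimension
    stats: dict[str, list[float]] = {
        "mean": [], "std": [], "min": [], "max": [], "q01": [], "q99": [],
    }
    for column in columns:
        # 99 cut points: index 0 is the 1st percentile, index -1 the 99th.
        percentiles = statistics.quantiles(column, n=100, method="inclusive")
        stats["mean"].append(statistics.fmean(column))
        stats["std"].append(statistics.pstdev(column))  # population std
        stats["min"].append(min(column))
        stats["max"].append(max(column))
        stats["q01"].append(percentiles[0])
        stats["q99"].append(percentiles[-1])
    return stats


# Toy data with two action dimensions, for illustration only.
print(action_stats([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]))
```

Percentile bounds are included alongside min and max because they are less sensitive to a few outlier actions.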

Run the evaluation scripts by

python scripts/evaluate_policy.py --n-sample 20 --model openvla --model_ckpt xx --lora_ckpt xx --eval_track track_1_in_distribution --tasks task1, task2 ...
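To evaluate the same checkpoint on several tracks, it can help to generate one command per track. The sketch below only assembles and prints the commands, reusing the flags from the command above and the track names from the table. The helper `build_eval_command` is hypothetical, and passing task names as separate arguments after `--tasks` is an assumption about how the script parses them.

```python
import shlex

# Track names copied verbatim from the table in this README.
TRACKS = [
    "track_1_in_distribution",
    "track_2_cross_categroy",
    "track_3_common_sense",
    "track_4_semantic_instruction",
    "track_5_cross_task",
    "track_6_unseen_texture",
]


def build_eval_command(track: str, tasks: list[str], model_ckpt: str,
                       lora_ckpt: str, n_sample: int = 20) -> list[str]:
    """Build the argument list for one evaluation run on a single track."""
    return [
        "python", "scripts/evaluate_policy.py",
        "--n-sample", str(n_sample),
        "--model", "openvla",
        "--model_ckpt", model_ckpt,
        "--lora_ckpt", lora_ckpt,
        "--eval_track", track,
        # Assumption: task names are passed as separate arguments.
        "--tasks", *tasks,
    ]


for track in TRACKS:
    cmd = build_eval_command(track, ["task1", "task2"], "xx", "xx")
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run to execute it
```

Building the command as a list rather than a string avoids quoting problems when checkpoint paths contain spaces.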

Multi-GPU Accelerated Evaluation

To speed up the evaluation process, VLABench supports multi-GPU parallel evaluation.

Example command:

bash sh/evaluation/example_multi_gpu_ev
