VLABench

Official repo of VLABench, a large-scale benchmark designed for fairly evaluating VLAs, embodied agents, and VLMs.

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

🎓 Paper | 🌐 Project Website ｜ 🤗 Hugging Face | 🐳 Quick Start with Docker

🚨 NOTICE: Please feel free to open an issue or create a PR! If I do not respond to issues in a timely manner, feel free to send me an email directly. I will do my best to build a more user-friendly community ecosystem for VLABench within this year.

News

Preview A complete infra framework will be open-sourced alongside new work, including full training pipelines, VLABench evaluation, a new leaderboard, data processing, and real-device deployment. Stay tuned!
2025/11/10 Upload several new checkpoints as baselines. Action representation matters.
- Pi05-ft-primitive: Track 1 SR 40.6%.
- Pi0-ft-primitive(10 tasks): Track 1 SR 47%.
- Pifast-ft-primitive(10 tasks)(relative chunk): Track 1 SR 29.1%. This is the official transform.
- Pifast-ft-primitive(10 tasks)(delta chunk): Track 1 SR 51.2%. This is the aligned transform.
2025/8/06 Update VLABench with:
- parallel data collection;
- parallel evaluation example;
- camera augmentation;
- rlds/lerobot data format ft dataset on hf;
- update on pi0 codebase.
2025/6/26 VLABench has been accepted by ICCV 2025.
2025/4/10 Releasing the finetuned pi0 checkpoints (pi0-base and pi0-fast) on hf.
2025/3/25 Releasing standard evaluation episodes and primitive task finetune dataset.
2025/2/26 Releasing referenced evaluation pipeline.
2025/2/14 Releasing the scripts for trajectory generation.
2024/12/25 The preview version of VLABench has been released! The preview version showcases most of the designed tasks and structure, but the functionalities are still being organized and tested.

Recent Work Todo

[x] Organize the functional code sections.
- [x] Reconstruct the efficient, user-friendly, and comprehensive evaluation framework.
- [x] Manage the automatic data workflow for existing tasks.
- [x] Improve the DSL of the skill library.
[x] Release the trajectory and evaluation scripts.
[x] Test the interface of humanoid and dual-arm manipulation.
[x] Release the remaining tasks that were not included in the preview version.
[ ] Leaderboard of VLAs and VLMs in the standard evaluation
- [x] Release standard evaluation datasets/episodes, in different dimensions and difficulty levels.
- [x] Release standard finetune dataset.
- [ ] Integrate commonly used VLA models to facilitate replication. (Continuously updated)
[ ] Release more datasets, including a pretraining version and composite tasks, once testing is finalized.
[ ] Release more checkpoints.
[ ] Update the complete training and evaluation community.

Installation

Install VLABench

Prepare conda environment

conda create -n vlabench python=3.10
conda activate vlabench

git clone https://github.com/OpenMOSS/VLABench.git
cd VLABench
pip install -r requirements.txt
pip install -e .

Download the assets

python scripts/download_assets.py

The script will automatically download the necessary assets and unzip them into the correct directory.

(Optional) Initialize submodules

git submodule update --init --recursive

This will fetch the other policy repos, such as openpi.

Data Collection

Run scripts to generate hdf5 dataset with multi-processing

We provide a brief tutorial in tutorials/2.auto_trajectory_generate.ipynb, and the full code is in scripts/trajectory_generation.py. Trajectory generation can be sped up several times by launching multiple processes. A naive way to do this is:

sh dataset_generation.sh

The current version does not support multi-processing inside the environment code itself, so the parallelism comes from launching separate processes. We will optimize the collection efficiency as much as possible in future updates. After running the script, each trajectory will be stored as an HDF5 file in the directory you specify.

To run parallel data collection on distributed hardware, such as 8 GPUs, please refer to

bash sh/data_generation/multi_gpu_data_generation.sh
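
The shell script above is the supported entry point. As a rough illustration of the idea behind it, the sketch below splits a list of tasks across GPUs and prepares one environment per worker, pinned with `CUDA_VISIBLE_DEVICES`. The function `plan_workers` and the task names are hypothetical and are not part of VLABench; how the repo's script actually assigns work may differ.

```python
import os


def plan_workers(tasks, n_gpus):
    """Assign tasks to GPUs round-robin; return one plan per GPU that has work.

    Hypothetical helper for illustration only, not a VLABench API.
    """
    plans = []
    for gpu in range(n_gpus):
        shard = tasks[gpu::n_gpus]  # every n_gpus-th task, starting at index gpu
        if not shard:
            continue  # more GPUs than tasks: leave this GPU idle
        # Copy the current environment and restrict the worker to one GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        plans.append({"gpu": gpu, "tasks": shard, "env": env})
    return plans


if __name__ == "__main__":
    # Placeholder task names; substitute the tasks you want to collect.
    for plan in plan_workers(["task_a", "task_b", "task_c"], n_gpus=2):
        print(plan["gpu"], plan["tasks"])
```

Each plan's `env` could then be passed to `subprocess.Popen(..., env=plan["env"])` together with whatever collection command you use for that shard.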

Convert to rlds format

Because some frameworks, such as Octo and OpenVLA, train on data in the RLDS format, we follow the process from rlds_dataset_builder to provide an example of converting the HDF5 dataset described above into RLDS format. First, run

python scripts/convert_to_rlds.py --task [list] --save_dir /your/path/to/dataset

This will create a Python file containing the task's RLDS builder in that directory. Then

cd /your/path/to/dataset/task

tfds build

This process takes a long time with a single process, and we are still testing a multithreaded method. The code in the original repo seems to have some bugs.

Convert to Lerobot format

Following the way openpi processes the Libero dataset, we offer a simple way to convert HDF5 data files into LeRobot format. Run the script with

python scripts/convert_to_lerobot.py --dataset-name [your-dataset-name] --dataset-path /your/path/to/dataset --max-files 100

By default, the processed LeRobot dataset will be stored in HF_HOME/lerobot/dataset-name.
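
To check where the converted dataset ended up, a small sketch like the following can resolve that default location. It assumes that, when `HF_HOME` is unset, the Hugging Face default of `~/.cache/huggingface` (or `$XDG_CACHE_HOME/huggingface`) applies; the helper name `default_lerobot_dir` is hypothetical and not part of VLABench or LeRobot.

```python
import os
from pathlib import Path


def default_lerobot_dir(dataset_name, env=None):
    """Resolve HF_HOME/lerobot/<dataset_name> (hypothetical helper).

    The fallback for an unset HF_HOME is an assumption based on the
    documented Hugging Face cache default.
    """
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    if hf_home is None:
        cache_root = env.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        hf_home = str(Path(cache_root) / "huggingface")
    return Path(hf_home).expanduser() / "lerobot" / dataset_name


if __name__ == "__main__":
    print(default_lerobot_dir("your-dataset-name"))
```

If the conversion script was run with a different output setting, the path printed here will not match.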

Expansion

VLABench adopts a flexible modular framework for task construction, offering high adaptability. You can follow the process outlined in tutorial 6.

Evaluate

VLABench currently provides standard benchmark datasets, focusing on generalization across multiple dimensions. In the VLABench/configs/evaluation/tracks directory, we have set up multiple benchmark sets across different dimensions. These configs ensure that different models can be fairly compared under the same episodes on different machines.

| Track | Description |
|----------|----------|
| track_1_in_distribution | Evaluation of the policy's task learning ability, requiring it to fit in-domain episodes with a small and diverse set of data. |
| track_2_cross_categroy | Evaluation of the policy's generalization ability at the object category level and instance level, requiring visual generalization capability. |
| track_3_common_sense | Evaluation of the policy's application of common sense, requiring common-sense understanding to identify the described target. |
| track_4_semantic_instruction | Evaluation of the policy's ability to understand complex semantics, using instructions that are rich in contextual or semantic information. |
| track_5_cross_task | Evaluation of the policy's ability to transfer skills across tasks. This setting is kept open, allowing users to choose training tasks and evaluation tasks according to their needs. |
| track_6_unseen_texture | Evaluation of the policy's visual robustness, using episodes with different backgrounds and table textures. |

NOTICE: The evaluation can also be done by directly sampling episodes from the environment. This evaluation method is more flexible, but there is a risk of improperly initialized episodes. We recommend using the 'evaluation_tracks' method for evaluation.

VLA/policy evaluation

We provide a standardized fine-tuning dataset, which can be downloaded from hf-dataset. In this version, the data focuses on primitive tasks. We selected 10 basic tasks and provided 500 samples for each task.

Since current VLAs do not perform well on primitive tasks, we plan to focus on enhancing VLA capabilities in this area first. In the future, we will release a more organized dataset for more composite tasks.

1. Evaluate OpenVLA

Before evaluating your finetuned OpenVLA, please compute the norm_stat on your dataset and place it in VLABench/configs/model/openvla_config.json
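
As a minimal sketch of what computing such statistics can look like, the function below derives per-dimension action statistics with the Python standard library only. The key names (`mean`, `std`, `min`, `max`, `q01`, `q99`) are an assumption modelled on common OpenVLA-style dataset statistics; the exact schema expected in openvla_config.json is not specified here, so compare against the existing file before writing anything into it. `action_norm_stats` is a hypothetical helper, not part of VLABench.

```python
import statistics


def action_norm_stats(actions):
    """Per-dimension statistics for a list of equal-length action vectors.

    Hypothetical helper; key names are an assumption, not a confirmed schema.
    Needs at least two actions, because quantiles require two data points.
    """
    stats = {"mean": [], "std": [], "min": [], "max": [], "q01": [], "q99": []}
    for column in zip(*actions):  # transpose: one tuple per action dimension
        # 99 cut points; index 0 is the 1st percentile, index 98 the 99th.
        cuts = statistics.quantiles(column, n=100, method="inclusive")
        stats["mean"].append(statistics.fmean(column))
        stats["std"].append(statistics.pstdev(column))  # population std
        stats["min"].append(min(column))
        stats["max"].append(max(column))
        stats["q01"].append(cuts[0])
        stats["q99"].append(cuts[98])
    return stats


if __name__ == "__main__":
    demo = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]  # made-up 2-D actions
    print(action_norm_stats(demo))
```

For real datasets you would collect the action vectors from your finetuning data first; the demo values above are placeholders.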

Run the evaluation scripts by

python scripts/evaluate_policy.py --n-sample 20 --model openvla --model_ckpt xx --lora_ckpt xx --eval_track track_1_in_distribution --tasks task1, task2 ...
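
Once per-episode outcomes are available, a success rate (SR) can be summarized along the following lines. This is only a sketch: the input format (task name mapped to a list of per-episode booleans) and the choice to report the track SR as the mean of per-task rates are assumptions, and `success_rates` is a hypothetical helper rather than the output format of the evaluation script.

```python
def success_rates(results):
    """Summarize per-task and overall success rates.

    results: dict mapping task name -> list of per-episode booleans.
    Assumption: the overall SR is the unweighted mean of per-task rates.
    """
    per_task = {
        task: sum(flags) / len(flags)
        for task, flags in results.items()
        if flags  # skip tasks with no recorded episodes
    }
    overall = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, overall


if __name__ == "__main__":
    # Made-up outcomes for two placeholder tasks.
    demo = {"task1": [True, False], "task2": [True, True]}
    print(success_rates(demo))
```

With the same number of episodes per task (for example `--n-sample 20`), the per-task mean and the per-episode mean coincide.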

Multi-GPU Accelerated Evaluation

To speed up the evaluation process, VLABench supports multi-GPU parallel evaluation.

Example command:

bash sh/evaluation/example_multi_gpu_ev
