VITRA
[ICRA 2026] VITRA: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
For video demonstrations, please refer to our project page.
🚩 New & Updates
- [2026-02-09] 🚀 Release the finetuning dataset collected by teleoperation.
- [2025-12-05] 🚀 Release the code for performing zero-shot inference using a single image.
- [2025-11-30] 🚀 Our code, pretrained models, and datasets are now open-sourced.
- [2025-10-24] 🚀 VITRA paper is released on arXiv.
🤗 Pretrained Models and Datasets
Our pretrained model and datasets are available on the Hugging Face Hub:
<table> <thead> <tr> <th>Hugging Face Model</th> <th>#Params</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><a href="https://huggingface.co/VITRA-VLA/VITRA-VLA-3B" target="_blank"><code>VITRA-VLA-3B</code></a></td> <td style="font-size: 0.92em;">3B</td> <td style="font-size: 0.92em;">Base VLA model pretrained on human hand data.</td> </tr> </tbody> </table>Note: Our base VLA model is finetuned from Paligemma2. If you do not have access to Paligemma2, please request permission on the official website.
<table> <thead> <tr> <th rowspan="2">Hugging Face Dataset</th> <th colspan="2" style="text-align: center;">Sub Datasets</th> </tr> <tr> <th>Dataset Name</th> <th>Number of Episodes</th> </tr> </thead> <tbody> <tr> <td rowspan="6"><a href="https://huggingface.co/datasets/VITRA-VLA/VITRA-1M" target="_blank"><code>VITRA-1M</code></a></td> <td><code>ego4d_cooking_and_cleaning</code></td> <td>454,244</td> </tr> <tr> <td><code>ego4d_other</code></td> <td>494,439</td> </tr> <tr> <td><code>epic</code></td> <td>154,464</td> </tr> <tr> <td><code>egoexo4d</code></td> <td>67,053</td> </tr> <tr> <td><code>ssv2</code></td> <td>52,718</td> </tr> <tr> <td><strong>Total</strong></td> <td><strong>1,222,918</strong></td> </tr> </tbody> </table>Note: See data/data.md for detailed information about our datasets.
Our finetuning dataset collected by teleoperation can be downloaded from https://huggingface.co/datasets/microsoft/VITRA-TeleData. See data/teleoperate_data.md for detailed information about the data (for the best rendering of the formulas, we recommend previewing teleoperate_data.md locally with the VS Code Markdown preview).
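If you prefer to fetch the checkpoints and datasets programmatically rather than through the web UI, a minimal sketch using the `huggingface_hub` client is shown below; the local target directories are purely illustrative, and gated repos may require `huggingface-cli login` first.

```python
# Minimal sketch: download the pretrained model and datasets from the Hugging Face Hub.
# Requires `pip install huggingface_hub`; the local_dir paths below are illustrative.
from huggingface_hub import snapshot_download

# Base VLA model
snapshot_download(repo_id="VITRA-VLA/VITRA-VLA-3B", local_dir="./checkpoints/VITRA-VLA-3B")

# Human hand pretraining dataset and teleoperation finetuning dataset
snapshot_download(repo_id="VITRA-VLA/VITRA-1M", repo_type="dataset", local_dir="./data/VITRA-1M")
snapshot_download(repo_id="microsoft/VITRA-TeleData", repo_type="dataset", local_dir="./data/VITRA-TeleData")
```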
📑 Table of Contents
- 1. Installation
- 2. Inference with Human Hand Image
- 3. Fine-tuning with a Custom Robot Dataset
- 4. Deployment in the Real World
- 5. Human Hand VLA Dataset Utilization
- 6. Human Data Pretraining from Scratch
- Contact
- Citation
1. Installation
1.1 Training / Inference Requirements
We recommend using conda to manage the environment. PyTorch >= 2.3.0 and CUDA >= 12.1 are required (lower versions may work, but we have not tested them). If the environment is used solely for training, we recommend a newer PyTorch release for improved training speed.
# Clone the repository
git clone https://github.com/microsoft/VITRA.git
cd VITRA
# Create environment
conda create -n vitra python=3.10 -y
conda activate vitra
# Install dependencies
pip install -e .
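After installation, you can optionally confirm that the environment meets the version requirements above with a quick check such as:

```python
# Optional sanity check for the requirements above (PyTorch >= 2.3.0, CUDA >= 12.1).
import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")  # >= 16 GB for inference
```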
<details>
<summary>Click to view detailed system requirements</summary>
- OS: Linux (Ubuntu 20.04/22.04 recommended)
- Python: 3.10+
- CUDA: 11.8+
- GPU: Minimum 16GB VRAM for inference, A100/H100 recommended for training.
</details>
1.2 Visualization Requirements
If you want to visualize the results after inference, run dataset visualization, or perform zero-shot human hand action prediction from a single image, please follow the instructions below.
Submodules Installation
Please clone the submodules required for hand pose estimation:
git submodule update --init --recursive
Libraries Installation
Please install the additional modules needed for visualization using the command below:
pip install -e .[visulization] --no-build-isolation
<details>
<summary>Click here if you encounter issues when installing <a href="https://github.com/facebookresearch/pytorch3d?tab=readme-ov-file">PyTorch3D</a> </summary>
If you encounter issues when installing PyTorch3D, please follow the installation instructions provided in the PyTorch3D repository or try installing it separately using:
pip install --no-build-isolation git+https://github.com/facebookresearch/pytorch3d.git@stable#egg=pytorch3d
</details>
If FFmpeg is not installed on your system, please install it first.
sudo apt install ffmpeg
MANO Hand Model
Our reconstructed hand labels are based on the MANO hand model; only the right-hand model is required. Download the model parameters from the official MANO website and organize them in the following structure:
weights/
└── mano/
├── MANO_RIGHT.pkl
└── mano_mean_params.npz
Please download the model weights of HaWoR for hand pose estimation:
wget https://huggingface.co/spaces/rolpotamias/WiLoR/resolve/main/pretrained_models/detector.pt -P ./weights/hawor/external/
wget https://huggingface.co/ThunderVVV/HaWoR/resolve/main/hawor/checkpoints/hawor.ckpt -P ./weights/hawor/checkpoints/
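After downloading the MANO parameters and HaWoR checkpoints, you can optionally verify that everything sits in the expected locations; the paths in this small check simply mirror the layout described above.

```python
# Optional check that the MANO and HaWoR weights match the expected directory layout.
from pathlib import Path

expected = [
    "weights/mano/MANO_RIGHT.pkl",
    "weights/mano/mano_mean_params.npz",
    "weights/hawor/external/detector.pt",
    "weights/hawor/checkpoints/hawor.ckpt",
]
missing = [p for p in expected if not Path(p).is_file()]
print("All weight files found." if not missing else "Missing:\n  " + "\n  ".join(missing))
```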
2. Inference with Human Hand Image
You can use our pretrained model to perform zero-shot 3D human hand action prediction directly from an egocentric human hand image (landscape view), conditioned on a language instruction. To predict human actions from pre-captured images, please run scripts/run_human_inference.sh. Here is a simple example:
python scripts/inference_human_prediction.py \
--config VITRA-VLA/VITRA-VLA-3B \
--image_path ./examples/0002.jpg \
--sample_times 4 \
--save_state_local \
--use_right \
--video_path ./example_human_inf.mp4 \
--mano_path ./weights/mano \
--instruction "Left hand: None. Right hand: Pick up the picture of Michael Jackson." \
All example images were captured on mobile phones in rooms that do not appear anywhere in the V-L-A dataset. They also include entirely unseen concepts, such as photos of celebrities.
