VITRA
[ICRA 2026] VITRA: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
For video demonstrations, please refer to our project page.
🚩 New & Updates
- [2026-02-09] 🚀 Release the finetuning dataset collected by teleoperation.
- [2025-12-05] 🚀 Release the code for performing zero-shot inference using a single image.
- [2025-11-30] 🚀 Our code, pretrained models, and datasets are now open-sourced.
- [2025-10-24] 🚀 VITRA paper is released on arXiv.
🤗 Pretrained Models and Datasets
Our pretrained model and datasets are available on the Hugging Face Hub:
<table> <thead> <tr> <th>Hugging Face Model</th> <th>#Params</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><a href="https://huggingface.co/VITRA-VLA/VITRA-VLA-3B" target="_blank"><code>VITRA-VLA-3B</code></a></td> <td style="font-size: 0.92em;">3B</td> <td style="font-size: 0.92em;">Base VLA model pretrained on human hand data.</td> </tr> </tbody> </table>Note: Our base VLA model is finetuned from Paligemma2. If you do not have access to Paligemma2, please request permission on the official website.
<table> <thead> <tr> <th rowspan="2">Hugging Face Dataset</th> <th colspan="2" style="text-align: center;">Sub Datasets</th> </tr> <tr> <th>Dataset Name</th> <th>Number of Episodes</th> </tr> </thead> <tbody> <tr> <td rowspan="6"><a href="https://huggingface.co/datasets/VITRA-VLA/VITRA-1M" target="_blank"><code>VITRA-1M</code></a></td> <td><code>ego4d_cooking_and_cleaning</code></td> <td>454,244</td> </tr> <tr> <td><code>ego4d_other</code></td> <td>494,439</td> </tr> <tr> <td><code>epic</code></td> <td>154,464</td> </tr> <tr> <td><code>egoexo4d</code></td> <td>67,053</td> </tr> <tr> <td><code>ssv2</code></td> <td>52,718</td> </tr> <tr> <td><strong>Total</strong></td> <td><strong>1,222,918</strong></td> </tr> </tbody> </table>Note: See data/data.md for detailed information about our datasets.
Our finetuning dataset collected by teleoperation can be downloaded from https://huggingface.co/datasets/microsoft/VITRA-TeleData. See data/teleoperate_data.md for detailed information about the data (for the best rendering of the formulas, we recommend previewing teleoperate_data.md locally with the VS Code Markdown preview).
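If you prefer to fetch the checkpoints and datasets programmatically rather than through the web UI, a minimal sketch using the `huggingface_hub` client is shown below; the local target directories are purely illustrative, and gated repos may require `huggingface-cli login` first.

```python
# Minimal sketch: download the pretrained model and datasets from the Hugging Face Hub.
# Requires `pip install huggingface_hub`; the local_dir paths below are illustrative.
from huggingface_hub import snapshot_download

# Base VLA model
snapshot_download(repo_id="VITRA-VLA/VITRA-VLA-3B", local_dir="./checkpoints/VITRA-VLA-3B")

# Human hand pretraining dataset and teleoperation finetuning dataset
snapshot_download(repo_id="VITRA-VLA/VITRA-1M", repo_type="dataset", local_dir="./data/VITRA-1M")
snapshot_download(repo_id="microsoft/VITRA-TeleData", repo_type="dataset", local_dir="./data/VITRA-TeleData")
```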
📑 Table of Contents
- 1. Installation
- 2. Inference with Human Hand Image
- 3. Fine-tuning with a Custom Robot Dataset
- 4. Deployment in the Real World
- 5. Human Hand VLA Dataset Utilization
- 6. Human Data Pretraining from Scratch
- Contact
- Citation
1. Installation
1.1 Training / Inference Requirements
We recommend using conda to manage the environment. PyTorch >= 2.3.0 and CUDA >= 12.1 are required (lower versions may work, but we have not tested them). If the environment is used solely for training, we recommend a newer PyTorch release for improved training speed.
# Clone the repository
git clone https://github.com/microsoft/VITRA.git
cd VITRA
# Create environment
conda create -n vitra python=3.10 -y
conda activate vitra
# Install dependencies
pip install -e .
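After installation, you can optionally confirm that the environment meets the version requirements above with a quick check such as:

```python
# Optional sanity check for the requirements above (PyTorch >= 2.3.0, CUDA >= 12.1).
import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")  # >= 16 GB for inference
```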
<details>
<summary>Click to view detailed system requirements</summary>
- OS: Linux (Ubuntu 20.04/22.04 recommended)
- Python: 3.10+
- CUDA: 11.8+
- GPU: Minimum 16GB VRAM for inference, A100/H100 recommended for training.
</details>
1.2 Visualization Requirements
If you want to visualize the results after inference, run dataset visualization, or perform zero-shot human hand action prediction from a single image, please follow the instructions below.
Submodules Installation
Please clone the submodules required for hand pose estimation:
git submodule update --init --recursive
Libraries Installation
Please install the additional modules needed for visualization using the command below:
pip install -e .[visulization] --no-build-isolation
<details>
<summary>Click here if you encounter issues when installing <a href="https://github.com/facebookresearch/pytorch3d?tab=readme-ov-file">PyTorch3D</a> </summary>
If you encounter issues when installing PyTorch3D, please follow the installation instructions provided in the PyTorch3D repository or try installing it separately using:
pip install --no-build-isolation git+https://github.com/facebookresearch/pytorch3d.git@stable#egg=pytorch3d
</details>
If FFmpeg is not installed on your system, please install it first.
sudo apt install ffmpeg
MANO Hand Model
Our reconstructed hand labels are based on the MANO hand model; only the right-hand model is required. Download the model parameters from the official MANO website and organize them in the following structure:
weights/
└── mano/
├── MANO_RIGHT.pkl
└── mano_mean_params.npz
Please download the model weights of HaWoR for hand pose estimation:
wget https://huggingface.co/spaces/rolpotamias/WiLoR/resolve/main/pretrained_models/detector.pt -P ./weights/hawor/external/
wget https://huggingface.co/ThunderVVV/HaWoR/resolve/main/hawor/checkpoints/hawor.ckpt -P ./weights/hawor/checkpoints/
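After downloading the MANO parameters and HaWoR checkpoints, you can optionally verify that everything sits in the expected locations; the paths in this small check simply mirror the layout described above.

```python
# Optional check that the MANO and HaWoR weights match the expected directory layout.
from pathlib import Path

expected = [
    "weights/mano/MANO_RIGHT.pkl",
    "weights/mano/mano_mean_params.npz",
    "weights/hawor/external/detector.pt",
    "weights/hawor/checkpoints/hawor.ckpt",
]
missing = [p for p in expected if not Path(p).is_file()]
print("All weight files found." if not missing else "Missing:\n  " + "\n  ".join(missing))
```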
2. Inference with Human Hand Image
You can use our pretrained model to perform zero-shot 3D human hand action prediction directly from an egocentric human hand image (landscape view), conditioned on a language instruction. To predict human actions from pre-captured images, please run scripts/run_human_inference.sh. Here is a simple example:
python scripts/inference_human_prediction.py \
--config VITRA-VLA/VITRA-VLA-3B \
--image_path ./examples/0002.jpg \
--sample_times 4 \
--save_state_local \
--use_right \
--video_path ./example_human_inf.mp4 \
--mano_path ./weights/mano \
--instruction "Left hand: None. Right hand: Pick up the picture of Michael Jackson." \
All example images were captured on mobile phones in rooms that do not appear anywhere in the V-L-A dataset. They also include entirely unseen concepts, such as photos of celebrities.
