ViPRA: Video Prediction for Robot Actions
<div align="center"> <picture> <!-- Optional: light/dark variants --> <img src="assets/teaser_vipra.png" alt="ViPRA teaser" style="max-width: 100%; height: auto;"> </picture> <p> <a href="https://arxiv.org/abs/2511.07732"> <img src="https://img.shields.io/badge/arXiv-2511.07732-b31b1b.svg" alt="Paper"> </a> <a href="https://vipra-project.github.io"> <img src="https://img.shields.io/badge/Project-Page-green.svg" alt="Project Page"> </a> <a href="https://github.com/sroutray/vipra"> <img src="https://img.shields.io/badge/Code-GitHub-blue.svg" alt="Code"> </a> <a href="https://huggingface.co/vipra-project"> <img src="https://img.shields.io/badge/🤗-Hugging_Face-yellow.svg" alt="Hugging Face"> </a> </p> <h3> <a href="https://sroutray.github.io">Sandeep Routray</a><sup>1,2</sup>, <a href="https://hengkaipan.github.io">Hengkai Pan</a><sup>1</sup>, <a href="https://unnat.github.io">Unnat Jain</a><sup>2,3</sup>, <a href="https://shikharbahl.github.io">Shikhar Bahl</a><sup>2</sup>, <a href="https://www.cs.cmu.edu/~dpathak/">Deepak Pathak</a><sup>1,2</sup> </h3> <h4><sup>1</sup>Carnegie Mellon University <sup>2</sup>Skild AI <sup>3</sup>University of California, Irvine</h4> <h4>Corresponding author: <a href="mailto:sroutra2@cs.cmu.edu">Sandeep Routray</a></h4> </div>
News
- [2026/01/26] ViPRA accepted at ICLR 2026.
- [2025/12/06] ViPRA won the Best Paper Award at NeurIPS 2025 EWM Workshop.
- [2025/10/13] ViPRA accepted for an Oral at NeurIPS 2025 EWM Workshop.
- [2025/10/01] ViPRA accepted at NeurIPS 2025 SpaVLE Workshop.
Overview
- A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
- A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
- A flow matching action decoder with action chunking for high-frequency continuous control.
- Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.
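The flow-matching objective that the action decoder optimizes can be pictured with a minimal sketch, assuming a linear (rectified-flow style) interpolation path between a noise sample and a ground-truth action chunk; this is illustrative, not the repository's implementation:

```python
def flow_matching_target(x0, x1, t):
    """Linear-path flow matching: interpolate x_t between a noise sample x0
    and a data sample x1 at time t in [0, 1]; the regression target is the
    constant velocity v = x1 - x0. Illustrative sketch only."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v
```

During training a velocity network would regress `v` from `(x_t, t, context)`; at inference the decoder integrates the learned velocity field from noise to a continuous action chunk.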
Latent Action Model
The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.
Key Features
- Actionless Learning: Learns from videos directly; no action annotations required.
- Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
- Multi-Dataset: Trained on diverse human and robot data.
- Optical Flow Consistency: Uses optical flow for temporal consistency regularization.
Architecture
- Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
- Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
- Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
- Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
- Flow Network: RAFT-based optical flow estimation for consistency loss.
Environment Setup
cd laq/
conda env create -f environment.yml -n laq
conda activate laq
Configuration
Training configs live in laq/configs/config.py. Key parameters:
- Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
- Data: 224×224 crops, 8-frame sequences.
- Quantization: 32-dim latent space, NSVQ codebook.
- Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
- Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.
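Grouped as a Python fragment, the key parameters above might look like the following; the field names are illustrative and need not match the exact schema in laq/configs/config.py:

```python
# Illustrative grouping of the key hyperparameters listed above;
# the actual schema in laq/configs/config.py may differ.
model = dict(embed_dim=768, encoder_layers=6, decoder_layers=8)
data = dict(crop_size=224, num_frames=8)
quantizer = dict(latent_dim=32, method="nsvq")
training = dict(
    max_steps=300_000,
    batch_size=18,
    precision="bf16",
    num_gpus=8,
    grad_clip_norm=6.0,
)
```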
Dataset Structure Requirements
You can match these layouts or extend laq/model/data.py to support your own.
Something-Something-v2 (SSv2)
ssv2/
├── labels/
│ ├── train.json
│ ├── validation.json
│ └── test.json
├── 20bn-something-something-v2/
│ ├── [video_id].webm
│ └── ...
Example config:
ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",  # "train", "val", "trainval", "test", "all"
    stepsize=2,  # frame sampling stride
)
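To make stepsize concrete, this small helper (illustrative, not a repo function) shows which raw frame indices one 8-frame clip would cover at stride 2:

```python
def clip_indices(start, seq_len=8, stepsize=2):
    # Raw-video frame indices for one training clip: `seq_len` frames
    # sampled every `stepsize` frames, beginning at `start`.
    return [start + i * stepsize for i in range(seq_len)]

# e.g. clip_indices(0) covers frames 0, 2, 4, ..., 14
```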
OpenX Datasets (Fractal, Bridge, Kuka)
dataset_name/
├── processed/
│ ├── trajectory_001/
│ │ └── images/
│ │ ├── 000000.jpg
│ │ ├── 000001.jpg
│ │ └── ...
│ ├── trajectory_002/
│ └── ...
Example config:
bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)
LIBERO
LIBERO/
├── libero_10_modified/
│ └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│ └── images/...
├── libero_object_modified/
│ └── images/...
└── libero_spatial_modified/
└── images/...
Example config:
libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
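The num_trajs convention (int = absolute count, float = fraction of the dataset) can be resolved with a helper along these lines; `resolve_num_trajs` is a hypothetical name, not a repo function:

```python
def resolve_num_trajs(spec, total_trajs):
    # Per the config convention above: a float means a fraction of the
    # available trajectories, an int means an absolute count.
    if isinstance(spec, float):
        return int(spec * total_trajs)
    return spec
```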
Custom Dataset
- Add a discovery function in laq/model/data.py:
def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # return a list of frame directories / trajectories
    return list_of_paths
- Add your dataset case in VideoDatasetCoTrain.
- Add your config block to laq/configs/config.py.
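As a concrete sketch, a discovery function for the OpenX-style layout shown earlier could look like this; the path conventions are assumptions, so adapt the globbing to your data:

```python
from pathlib import Path
from typing import List

def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # Assumes the OpenX-style layout above: one directory of frames per
    # trajectory under data_root/processed/. Adjust for your dataset.
    traj_dirs = sorted((data_root / "processed").glob("trajectory_*"))
    return [str(d / "images") for d in traj_dirs if (d / "images").is_dir()]
```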
Training
Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:
bash run_train_laq.sh
Inference and Evaluation
To reproduce codebook analysis and figures shown in the paper:
# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage
# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer
To generate ViPRA policy pretraining data with latent actions from the LAQ model, use the dataset-specific latent generation scripts:
# LIBERO
python -m inference.libero.libero_latent
# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka
# SSv2
python -m inference.ssv2.ssv2_latent
These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:
Sample JSONL Entry:
{
  "instruction": "pick up the red block and place it in the blue bowl",
  "raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
  "image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
  "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
  "latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
  "fields_la": "[instruction],[vision],latent_action",
  "fields_ls": "[instruction],[vision],latent_state",
  "fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
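Since each line is a standalone JSON object, samples can be streamed with a few lines of stdlib Python (a sketch; field handling is up to your dataloader):

```python
import json

def iter_jsonl(lines):
    # Stream JSONL training samples: one JSON object per non-empty line.
    for line in lines:
        if line.strip():
            yield json.loads(line)

# Minimal example line (abbreviated from the sample entry above).
sample = '{"instruction": "pick up the red block", "latent_action_idxs": [3, 7, 1, 4]}'
```

The three fields_* strings appear to name alternative conditioning/target layouts (latent action only, latent state only, or both); check the dataloader for their exact semantics.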
ViPRA Policy
The ViPRA policy builds on a video-language foundation model, the Large World Model (LWM). We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and flow matching for continuous control.
Environment Setup
cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra
Before training, download the VQ-GAN image tokenizer, text tokenizer, and pretrained model parameters from LWM-Chat-1M-Jax and place them under vipra/lwm/:
mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/
Pretraining Data
We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:
mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/
cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2.
Each training sample includes:
- history frames
- latent state target
- latent action tokens from LAQ
- natural language task text
This dataset is already chunked into 14-step latent action sequences.
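The 14-step chunking can be sketched as a simple fixed-horizon split over a trajectory's latent action tokens (an illustrative helper, not the repository's code):

```python
def chunk_latent_actions(tokens, horizon=14):
    # Split a trajectory's latent action tokens into fixed-horizon chunks,
    # dropping any trailing partial chunk.
    return [tokens[i:i + horizon]
            for i in range(0, len(tokens) - horizon + 1, horizon)]
```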
Vision Cache (Optional, speeds up training)
We also release a VQGAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:
mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/
This contains precomputed VQGAN token sequences for each frame, which can be used instead of running the image tokenizer online.
If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.
Running Pretraining
Launch pretraining using the provided script (configured for 8×H200 GPUs):
cd vipra/
bash scripts/pretrain.sh
See vipra/scripts/pretrain.sh for full hyperparameters.
Finetuning
Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:
cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/
For task-specific finetuning, prepare your dataset in JSONL format where each line represents a training sample, following the format described above.