[ICLR 2026] [NeurIPS 2025] ViPRA: Video Prediction for Robot Actions



<div align="center"> <picture> <!-- Optional: light/dark variants --> <img src="assets/teaser_vipra.png" alt="ViPRA teaser" style="max-width: 100%; height: auto;"> </picture> <p> <a href="https://arxiv.org/abs/2511.07732"> <img src="https://img.shields.io/badge/arXiv-2511.07732-b31b1b.svg" alt="Paper"> </a> <a href="https://vipra-project.github.io"> <img src="https://img.shields.io/badge/Project-Page-green.svg" alt="Project Page"> </a> <a href="https://github.com/sroutray/vipra"> <img src="https://img.shields.io/badge/Code-GitHub-blue.svg" alt="Code"> </a> <a href="https://huggingface.co/vipra-project"> <img src="https://img.shields.io/badge/🤗-Hugging_Face-yellow.svg" alt="Hugging Face"> </a> </p> <h3> <a href="https://sroutray.github.io">Sandeep Routray</a><sup>1,2</sup>, <a href="https://hengkaipan.github.io">Hengkai Pan</a><sup>1</sup>, <a href="https://unnat.github.io">Unnat Jain</a><sup>2,3</sup>, <a href="https://shikharbahl.github.io">Shikhar Bahl</a><sup>2</sup>, <a href="https://www.cs.cmu.edu/~dpathak/">Deepak Pathak</a><sup>1,2</sup> </h3> <h4><sup>1</sup>Carnegie Mellon University <sup>2</sup>Skild AI <sup>3</sup>University of California, Irvine</h4> <h4>Corresponding author: <a href="mailto:sroutra2@cs.cmu.edu">Sandeep Routray</a></h4> </div>

News


Overview

  • A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
  • A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
  • A flow matching action decoder with action chunking for high-frequency continuous control.
  • Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.

Latent Action Model

The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.

Key Features

  • Actionless Learning: Learns from videos directly; no action annotations required.
  • Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
  • Multi-Dataset: Trained on diverse human and robot data.
  • Optical Flow Consistency: Uses optical flow for temporal consistency regularization.

Architecture

  • Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
  • Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
  • Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
  • Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
  • Flow Network: RAFT-based optical flow estimation for consistency loss.
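The NSVQ step above can be illustrated with a minimal numpy sketch. This is not the repository's implementation, and the function and variable names are invented for illustration; the core idea is that during training the non-differentiable quantization error is replaced by random noise of the same magnitude, so gradients can flow through the quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def nsvq(z, codebook, training=True):
    """Noise Substitution Vector Quantization (sketch).

    z: (N, D) continuous latents, codebook: (K, D) code vectors.
    Returns (quantized_or_noise_substituted, codebook_indices).
    """
    # Nearest-neighbour lookup in the codebook.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)
    zq = codebook[idx]
    if not training:
        return zq, idx
    # Training: substitute the quantization error with unit-norm random
    # noise scaled to the error's magnitude. The output stays close to
    # the codebook entry, but the mapping from z remains differentiable.
    err_norm = np.linalg.norm(zq - z, axis=1, keepdims=True)
    v = rng.standard_normal(z.shape)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return z + err_norm * v, idx
```

At inference time the hard nearest-neighbour assignment is used directly, which is what produces the discrete latent action tokens.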

Environment Setup

cd laq/
conda env create -f environment.yml -n laq
conda activate laq

Configuration

Training configs live in laq/configs/config.py. Key parameters:

  • Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
  • Data: 224×224 crops, 8-frame sequences.
  • Quantization: 32-dim latent space, NSVQ codebook.
  • Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
  • Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.
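As a rough illustration, the parameters listed above might be collected into a config dict like the following. The field names here are invented; check laq/configs/config.py for the actual ones:

```python
# Hypothetical mirror of the key training parameters listed above.
# Field names are illustrative; the real config is laq/configs/config.py.
laq_config = dict(
    model=dict(embed_dim=768, encoder_layers=6, decoder_layers=8),
    data=dict(crop_size=224, num_frames=8),
    quantizer=dict(latent_dim=32, method="nsvq"),
    losses=dict(l1=True, lpips=True, flow_consistency=True),
    training=dict(
        steps=300_000,
        batch_size=18,
        precision="bf16",
        grad_norm_clip=6.0,
    ),
)
```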

Dataset Structure Requirements

You can match these layouts or extend laq/model/data.py to support your own.

Something-Something-v2 (SSv2)

ssv2/
├── labels/
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── 20bn-something-something-v2/
│   ├── [video_id].webm
│   └── ...

Example config:

ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",   # "train", "val", "trainval", "test", "all"
    stepsize=2,         # frame sampling stride
)

OpenX Datasets (Fractal, Bridge, Kuka)

dataset_name/
├── processed/
│   ├── trajectory_001/
│   │   └── images/
│   │       ├── 000000.jpg
│   │       ├── 000001.jpg
│   │       └── ...
│   ├── trajectory_002/
│   └── ...

Example config:

bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)

LIBERO

LIBERO/
├── libero_10_modified/
│   └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│   └── images/...
├── libero_object_modified/
│   └── images/...
└── libero_spatial_modified/
    └── images/...

Example config:

libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
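The num_trajs convention above (ints are absolute trajectory counts, floats are fractions of the trajectories found on disk) could be resolved with a small helper like this hypothetical sketch; the real logic lives in laq/model/data.py:

```python
def resolve_num_trajs(spec, total_available):
    """Interpret a num_trajs entry from the dataset config.

    Floats are treated as a fraction of the trajectories available on
    disk; ints are absolute counts, capped at what is available.
    (Hypothetical helper name, for illustration only.)
    """
    if isinstance(spec, float):
        return int(round(spec * total_available))
    return min(spec, total_available)
```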

Custom Dataset

  1. Add a discovery function in laq/model/data.py:

from pathlib import Path
from typing import List

def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # Return the frame directories / trajectories for the given split.
    return sorted(str(p) for p in data_root.iterdir() if p.is_dir())

  2. Add your dataset case in VideoDatasetCoTrain.
  3. Add your config block to laq/configs/config.py.

Training

Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:

bash run_train_laq.sh

Inference and Evaluation

To reproduce codebook analysis and figures shown in the paper:

# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage

# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer

To use the LAQ model to generate training data with latent actions for ViPRA policy pretraining, use the dataset-specific latent generation scripts:

# LIBERO
python -m inference.libero.libero_latent

# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka

# SSv2
python -m inference.ssv2.ssv2_latent

These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:

Sample JSONL Entry:

{
  "instruction": "pick up the red block and place it in the blue bowl",
  "raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
  "image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
  "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
  "latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
  "fields_la": "[instruction],[vision],latent_action",
  "fields_ls": "[instruction],[vision],latent_state", 
  "fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
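Since each line is an independent JSON object, reading the generated shards back needs only the standard library. A minimal loader sketch, with field names taken from the sample entry above:

```python
import json

def load_samples(jsonl_path):
    """Yield one training sample per non-empty line of a latent-action
    JSONL file (sketch; the training code has its own loader)."""
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```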

ViPRA Policy

The ViPRA policy builds on a video-language foundation model, the Large World Model (LWM). We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and flow matching for continuous control.

Environment Setup

cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra

Before training, download the VQ-GAN image tokenizer, the text tokenizer, and the pretrained model parameters from LWM-Chat-1M-Jax, and place them under vipra/lwm/:

mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/

Pretraining Data

We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:

mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/

cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2. Each training sample includes:

  • history frames
  • latent state target
  • latent action tokens from LAQ
  • natural language task text

This dataset is already chunked into 14-step latent action sequences.
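The fixed-horizon chunking described above can be pictured with a small sketch (hypothetical helper, not the release preprocessing code):

```python
def chunk_trajectory(frames, horizon=14, stride=14):
    """Split a trajectory (a list of frame ids) into fixed-horizon
    chunks, dropping any trailing remainder shorter than the horizon.
    (Illustrative sketch; names are made up.)"""
    return [
        frames[i:i + horizon]
        for i in range(0, len(frames) - horizon + 1, stride)
    ]
```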

Vision Cache (Optional, speeds up training)

We also release a VQGAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:

mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/

This contains precomputed VQGAN token sequences for each frame, which can be used instead of running the image tokenizer online.

If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.
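A cache-first lookup with online fallback might look like the following sketch. All names here are hypothetical; the actual data path is in the ViPRA training code:

```python
def get_frame_tokens(frame_path, cache, tokenize_fn):
    """Return VQGAN tokens for a frame, preferring the precomputed
    vision cache and falling back to online tokenization.
    (Illustrative sketch; cache is any dict-like mapping.)"""
    tokens = cache.get(frame_path)
    if tokens is None:
        tokens = tokenize_fn(frame_path)
        cache[frame_path] = tokens  # memoize for later epochs
    return tokens
```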

Running Pretraining

Launch pretraining using the provided script (configured for 8×H200 GPUs):

cd vipra/
bash scripts/pretrain.sh

See vipra/scripts/pretrain.sh for full hyperparameters.


Finetuning

Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:

cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/

For task-specific finetuning, prepare your dataset in JSONL format where each line represents a training sample, following the sample entry shown above.
