[ICLR 2026] [NeurIPS 2025] ViPRA: Video Prediction for Robot Actions



<div align="center"> <picture> <!-- Optional: light/dark variants --> <img src="assets/teaser_vipra.png" alt="ViPRA teaser" style="max-width: 100%; height: auto;"> </picture> <p> <a href="https://arxiv.org/abs/2511.07732"> <img src="https://img.shields.io/badge/arXiv-2511.07732-b31b1b.svg" alt="Paper"> </a> <a href="https://vipra-project.github.io"> <img src="https://img.shields.io/badge/Project-Page-green.svg" alt="Project Page"> </a> <a href="https://github.com/sroutray/vipra"> <img src="https://img.shields.io/badge/Code-GitHub-blue.svg" alt="Code"> </a> <a href="https://huggingface.co/vipra-project"> <img src="https://img.shields.io/badge/🤗-Hugging_Face-yellow.svg" alt="Hugging Face"> </a> </p> <h3> <a href="https://sroutray.github.io">Sandeep Routray</a><sup>1,2</sup>, <a href="https://hengkaipan.github.io">Hengkai Pan</a><sup>1</sup>, <a href="https://unnat.github.io">Unnat Jain</a><sup>2,3</sup>, <a href="https://shikharbahl.github.io">Shikhar Bahl</a><sup>2</sup>, <a href="https://www.cs.cmu.edu/~dpathak/">Deepak Pathak</a><sup>1,2</sup> </h3> <h4><sup>1</sup>Carnegie Mellon University <sup>2</sup>Skild AI <sup>3</sup>University of California, Irvine</h4> <h4>Corresponding author: <a href="mailto:sroutra2@cs.cmu.edu">Sandeep Routray</a></h4> </div>

News


Overview

  • A recipe to learn generalist robot policies from large-scale human and robot videos without action labels.
  • A novel approach to extract motion-centric latent actions that capture fine-grained physical dynamics.
  • A flow matching action decoder with action chunking for high-frequency continuous control.
  • Outperforms prior latent action methods and VLA baselines trained on ground-truth actions.

Latent Action Model

The latent action model learns motion-centric abstract representations from actionless video. These latents capture fine-grained temporal dynamics and are discretized into tokens that serve as "latent actions" for downstream policy learning.

Key Features

  • Actionless Learning: Learns from videos directly; no action annotations required.
  • Motion-Centric: Focuses on fine-grained temporal dynamics rather than static appearance.
  • Multi-Dataset: Trained on diverse human and robot data.
  • Optical Flow Consistency: Uses optical flow for temporal consistency regularization.

Architecture

  • Spatial Encoder: DINOv2-initialized vision transformer for spatial features.
  • Spatio-Temporal Encoder: Non-causal transformer encoder over video clips.
  • Vector Quantizer: Noise Substitution Vector Quantization (NSVQ) for discretizing latent actions.
  • Spatio-Temporal Decoder: Causal transformer decoder for reconstruction.
  • Flow Network: RAFT-based optical flow estimation for consistency loss.
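The NSVQ step above can be illustrated with a minimal numpy sketch. This is not the repository's implementation, and the function and variable names are invented for illustration; the core idea is that during training the non-differentiable quantization error is replaced by random noise of the same magnitude, so gradients can flow through the quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def nsvq(z, codebook, training=True):
    """Noise Substitution Vector Quantization (sketch).

    z: (N, D) continuous latents, codebook: (K, D) code vectors.
    Returns (quantized_or_noise_substituted, codebook_indices).
    """
    # Nearest-neighbour lookup in the codebook.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)
    zq = codebook[idx]
    if not training:
        return zq, idx
    # Training: substitute the quantization error with unit-norm random
    # noise scaled to the error's magnitude. The output stays close to
    # the codebook entry, but the mapping from z remains differentiable.
    err_norm = np.linalg.norm(zq - z, axis=1, keepdims=True)
    v = rng.standard_normal(z.shape)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return z + err_norm * v, idx
```

At inference time the hard nearest-neighbour assignment is used directly, which is what produces the discrete latent action tokens.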

Environment Setup

cd laq/
conda env create -f environment.yml -n laq
conda activate laq

Configuration

Training configs live in laq/configs/config.py. Key parameters:

  • Model: 768-dim transformer, 6 encoder layers, 8 decoder layers.
  • Data: 224×224 crops, 8-frame sequences.
  • Quantization: 32-dim latent space, NSVQ codebook.
  • Losses: L1 reconstruction, LPIPS perceptual loss, optical-flow consistency loss.
  • Training: ~300k steps, batch size 18, bf16 on 8×H200 GPUs, grad norm clip 6.0.
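As a rough illustration, the parameters listed above might be collected into a config dict like the following. The field names here are invented; check laq/configs/config.py for the actual ones:

```python
# Hypothetical mirror of the key training parameters listed above.
# Field names are illustrative; the real config is laq/configs/config.py.
laq_config = dict(
    model=dict(embed_dim=768, encoder_layers=6, decoder_layers=8),
    data=dict(crop_size=224, num_frames=8),
    quantizer=dict(latent_dim=32, method="nsvq"),
    losses=dict(l1=True, lpips=True, flow_consistency=True),
    training=dict(
        steps=300_000,
        batch_size=18,
        precision="bf16",
        grad_norm_clip=6.0,
    ),
)
```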

Dataset Structure Requirements

You can match these layouts or extend laq/model/data.py to support your own.

Something-Something-v2 (SSv2)

ssv2/
├── labels/
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── 20bn-something-something-v2/
│   ├── [video_id].webm
│   └── ...

Example config:

ssv2 = dict(
    root_dir=Path("/path/to/ssv2"),
    split="trainval",   # "train", "val", "trainval", "test", "all"
    stepsize=2,         # frame sampling stride
)

OpenX Datasets (Fractal, Bridge, Kuka)

dataset_name/
├── processed/
│   ├── trajectory_001/
│   │   └── images/
│   │       ├── 000000.jpg
│   │       ├── 000001.jpg
│   │       └── ...
│   ├── trajectory_002/
│   └── ...

Example config:

bridge = dict(
    root_dir=Path("/path/to/bridge"),
    split="trainval",
    num_trajs=dict(trainval=25460, val=2546),
    stepsize=1,
)

LIBERO

LIBERO/
├── libero_10_modified/
│   └── images/trajectory_001/000000.jpg
├── libero_goal_modified/
│   └── images/...
├── libero_object_modified/
│   └── images/...
└── libero_spatial_modified/
    └── images/...

Example config:

libero = dict(
    root_dir=Path("/path/to/LIBERO"),
    split="trainval",
    num_trajs=dict(trainval=1.0, val=0.1),  # float = percentage
    stepsize=1,
)
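The num_trajs convention above (ints are absolute trajectory counts, floats are fractions of the trajectories found on disk) could be resolved with a small helper like this hypothetical sketch; the real logic lives in laq/model/data.py:

```python
def resolve_num_trajs(spec, total_available):
    """Interpret a num_trajs entry from the dataset config.

    Floats are treated as a fraction of the trajectories available on
    disk; ints are absolute counts, capped at what is available.
    (Hypothetical helper name, for illustration only.)
    """
    if isinstance(spec, float):
        return int(round(spec * total_available))
    return min(spec, total_available)
```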

Custom Dataset

  1. Add a discovery function in laq/model/data.py:

from pathlib import Path
from typing import List

def discover_custom_sequences(data_root: Path, mode: str, **kwargs) -> List[str]:
    # Return the frame directories / trajectories for the given split.
    return sorted(str(p) for p in data_root.iterdir() if p.is_dir())

  2. Add your dataset case in VideoDatasetCoTrain.
  3. Add your config block to laq/configs/config.py.

Training

Launch training using the provided script, configured for bf16 training on a single node with 8 H200 GPUs:

bash run_train_laq.sh

Inference and Evaluation

To reproduce codebook analysis and figures shown in the paper:

# Codebook usage analysis (reproduces codebook utilization figures)
python -m codebook_usage

# Rollout transfer evaluation (reproduces reconstruction and transfer results)
python -m rollout_transfer

To use the LAQ model to generate training data with latent actions for ViPRA policy pretraining, use the dataset-specific latent generation scripts:

# LIBERO
python -m inference.libero.libero_latent

# OpenX-style datasets (Fractal, BridgeData V2, Kuka)
python -m inference.openx.openx_latent --dataset bridge
python -m inference.openx.openx_latent --dataset kuka

# SSv2
python -m inference.ssv2.ssv2_latent

These scripts generate training data in JSONL format with multi-GPU processing and automatic shard merging. Each line contains a training sample with latent actions:

Sample JSONL Entry:

{
  "instruction": "pick up the red block and place it in the blue bowl",
  "raw_action": [0.1, -0.2, 0.05, 0.0, 0.0, 0.0, 1.0],
  "image": ["libero_10_modified/images/traj_001/step0000.jpg", "libero_10_modified/images/traj_001/step0001.jpg"],
  "latent_state": ["libero_10_modified/images/traj_001/step0015.jpg"],
  "latent_action_idxs": [3, 7, 1, 4, 2, 6, 0, 5, 1, 3, 7, 2, 4, 0, 6, 1],
  "fields_la": "[instruction],[vision],latent_action",
  "fields_ls": "[instruction],[vision],latent_state", 
  "fields_ls_la": "[instruction],[vision],latent_state,latent_action"
}
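Since each line is an independent JSON object, reading the generated shards back needs only the standard library. A minimal loader sketch, with field names taken from the sample entry above:

```python
import json

def load_samples(jsonl_path):
    """Yield one training sample per non-empty line of a latent-action
    JSONL file (sketch; the training code has its own loader)."""
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```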

ViPRA Policy

The ViPRA policy builds on a video-language foundation model, the Large World Model (LWM). We use LWM-Chat-1M-Jax as the base model and extend it with additional modules for latent action prediction and flow matching for continuous control.

Environment Setup

cd vipra/
conda env create -f environment.yml -n vipra
conda activate vipra

Before training, download the VQ-GAN image tokenizer, the text tokenizer, and the pretrained model parameters from LWM-Chat-1M-Jax, and place them under vipra/lwm/:

mkdir lwm
huggingface-cli download LargeWorldModel/LWM-Chat-1M-Jax --local-dir lwm/

Pretraining Data

We release a pre-tokenized, horizon-14 dynamics dataset on Hugging Face:

mkdir cotrain_data
huggingface-cli download vipra-project/cotrain-dynamics14 --local-dir cotrain_data/

cotrain-dynamics14 merges multiple robot datasets (LIBERO, BridgeData V2, Fractal, Kuka) with human video data from SSv2. Each training sample includes:

  • history frames
  • latent state target
  • latent action tokens from LAQ
  • natural language task text

This dataset is already chunked into 14-step latent action sequences.
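The fixed-horizon chunking described above can be pictured with a small sketch (hypothetical helper, not the release preprocessing code):

```python
def chunk_trajectory(frames, horizon=14, stride=14):
    """Split a trajectory (a list of frame ids) into fixed-horizon
    chunks, dropping any trailing remainder shorter than the horizon.
    (Illustrative sketch; names are made up.)"""
    return [
        frames[i:i + horizon]
        for i in range(0, len(frames) - horizon + 1, stride)
    ]
```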

Vision Cache (Optional, speeds up training)

We also release a VQGAN vision cache on Hugging Face so you don't have to repeatedly tokenize raw pixels:

mkdir vision_cache
huggingface-cli download vipra-project/cotrain-vqgan-vision-cache --local-dir vision_cache/

This contains precomputed VQGAN token sequences for each frame, which can be used instead of running the image tokenizer online.

If you don't use the cache, set vqgan_path to the VQ-GAN weights from LWM-Chat-1M-Jax so ViPRA can tokenize frames on the fly.
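A cache-first lookup with online fallback might look like the following sketch. All names here are hypothetical; the actual data path is in the ViPRA training code:

```python
def get_frame_tokens(frame_path, cache, tokenize_fn):
    """Return VQGAN tokens for a frame, preferring the precomputed
    vision cache and falling back to online tokenization.
    (Illustrative sketch; cache is any dict-like mapping.)"""
    tokens = cache.get(frame_path)
    if tokens is None:
        tokens = tokenize_fn(frame_path)
        cache[frame_path] = tokens  # memoize for later epochs
    return tokens
```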

Running Pretraining

Launch pretraining using the provided script (configured for 8×H200 GPUs):

cd vipra/
bash scripts/pretrain.sh

See vipra/scripts/pretrain.sh for full hyperparameters.


Finetuning

Download the pretrained checkpoint weights, VQ-GAN image tokenizer, and text tokenizer from Hugging Face:

cd vipra && mkdir vipra_checkpoints
huggingface-cli download vipra-project/vipra-7b-pretrained --local-dir vipra_checkpoints/

For task-specific finetuning, prepare your dataset in JSONL format where each line represents a training sample, following the sample entry shown above.
