FVP

[ICCV 2025] FVP: 4D Visual Pre-training for Robot Learning

FVP: 4D Visual Pre-training for Robot Learning

[Website] [arXiv] [ICCV 2025]

FVP is a novel 3D point cloud representation learning pipeline for robotic manipulation. Unlike prior work based on contrastive learning and masked signal modeling, FVP trains 3D visual representations by conditioning on the point cloud of the preceding frame and using a diffusion model to predict the point cloud of the current frame.
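
The frame pairing described above can be sketched in a few lines. This is a minimal illustration rather than the repository's data loader: `make_frame_pairs` is a hypothetical helper, and a frame is assumed to be any per-timestep point cloud object.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_frame_pairs(frames: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the one before it.

    Returns (previous_frame, current_frame) tuples. In an FVP-style setup the
    first element would be the conditioning input and the second the target
    that the diffusion model learns to predict. Hypothetical helper.
    """
    return [(frames[t - 1], frames[t]) for t in range(1, len(frames))]


# A trajectory of T = 4 frames yields T - 1 = 3 training pairs.
# Strings stand in for point clouds here.
trajectory = ["pc_0", "pc_1", "pc_2", "pc_3"]
pairs = make_frame_pairs(trajectory)
print(pairs)
```

The first frame of a trajectory has no predecessor, so it only ever appears as a conditioning input.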

This is a PyTorch implementation of the paper FVP: 4D Visual Pre-training for Robot Learning:

@article{cheng2025fvp,
    author    = {Chengkai Hou and Yanjie Ze and Yankai Fu and Zeyu Gao and Yue Yu and Songbo Hu and Shanghang Zhang and Huazhe Xu},
    title     = {FVP: 4D Visual Pre-training for Robot Learning},
    journal   = {ICCV},
    year      = {2025},
}

:exclamation: This repo contains configs and experiments for both simulation and real-world datasets.

Requirements

3D Diffusion Policy

Please see the DP3 installation instructions.

FVP

In addition to a PyTorch environment, please install:

conda install pyyaml
pip install ema-pytorch tensorboard
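
After installing, a quick check that the packages are importable can save a failed training launch. The sketch below uses only the standard library; `missing_modules` and `REQUIRED_MODULES` are illustrative names, not part of this repo.

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages above. `pyyaml` is imported as `yaml`
# and `ema-pytorch` as `ema_pytorch`.
REQUIRED_MODULES = ["torch", "yaml", "ema_pytorch", "tensorboard"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```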

Simulation Dataset Generation

You can generate a simulation dataset by following the DP3 instructions, for example:

cd your_path/3D-Diffusion-Policy-master
bash scripts/gen_demonstration_adroit.sh hammer

Real-world Dataset Generation

We collect the real-world dataset as a dictionary, which follows the same format as the simulator dataset:

"point_cloud": Array of shape (T, Np, 6), where Np is the number of points and 6 denotes [x, y, z, r, g, b]. Note: we strongly recommend cropping out the table and background so that only the task-relevant points remain in the observation; this proved effective in our real-world experiments.
"image": Array of shape (T, H, W, 3)
"depth": Array of shape (T, H, W)
"agent_pos": Array of shape (T, Nd), where Nd is the action dimension of the robot agent, e.g. 22 for our dexhand tasks (6D end-effector position + 16D joint positions)
"action": Array of shape (T, Nd). We use relative end-effector position control for the robot arm and relative joint-angle position control for the dexhand.

You can follow this example to collect a real-world dataset.
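
Before pre-training on your own recordings, it can help to confirm that an episode matches the layout listed above. The following is a minimal sketch that works on shape tuples only; `check_episode_shapes` is a hypothetical helper, and every number in the example except Nd = 22 is made up for illustration.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]
REQUIRED_KEYS = ["point_cloud", "image", "depth", "agent_pos", "action"]


def check_episode_shapes(shapes: Dict[str, Shape]) -> List[str]:
    """Return a list of problems; an empty list means the layout matches."""
    missing = [key for key in REQUIRED_KEYS if key not in shapes]
    if missing:
        return [f"missing keys: {missing}"]

    problems: List[str] = []
    pc, img, depth = shapes["point_cloud"], shapes["image"], shapes["depth"]
    pos, act = shapes["agent_pos"], shapes["action"]

    if len(pc) != 3 or pc[2] != 6:
        problems.append("point_cloud must be (T, Np, 6)")
    if len(img) != 4 or img[3] != 3:
        problems.append("image must be (T, H, W, 3)")
    if len(depth) != 3:
        problems.append("depth must be (T, H, W)")
    elif len(img) == 4 and depth[1:] != img[1:3]:
        problems.append("depth and image must share H and W")
    if len(pos) != 2 or len(act) != 2:
        problems.append("agent_pos and action must be (T, Nd)")
    elif pos[1] != act[1]:
        problems.append("agent_pos and action must share Nd")

    # Every array is indexed by the same timestep axis T.
    lengths = {key: shapes[key][0] for key in REQUIRED_KEYS if shapes[key]}
    if len(set(lengths.values())) > 1:
        problems.append(f"inconsistent T across keys: {lengths}")
    return problems


example = {
    "point_cloud": (50, 1024, 6),
    "image": (50, 84, 84, 3),
    "depth": (50, 84, 84),
    "agent_pos": (50, 22),
    "action": (50, 22),
}
print(check_episode_shapes(example))
```

With NumPy arrays stored in a dictionary `data`, the input could be built as `{k: v.shape for k, v in data.items()}`.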

FVP Pre-training

In the config file dp3.yaml, change the dataset path to point to your own dataset. Then run FVP pre-training:

python train_gpu.py --config config/dp3.yaml

DP3 Post-training

Load the weights pre-trained with FVP, then run the standard DP3 training command:

bash scripts/train_policy.sh dp3 adroit_hammer 0112 0 0
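
One common way to load pre-trained weights into part of a larger policy is to pick the matching entries out of the checkpoint's state dict. The sketch below shows that step on a plain mapping. The `encoder.` prefix and all key names are hypothetical, and the actual checkpoint layout in this repo may differ.

```python
from typing import Any, Dict, Mapping


def select_by_prefix(state_dict: Mapping[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep entries whose key starts with `prefix`, with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Stand-in for a checkpoint; real values would be tensors.
# The key names are made up for illustration.
checkpoint = {
    "encoder.mlp.0.weight": "w0",
    "encoder.mlp.0.bias": "b0",
    "denoiser.out.weight": "w1",
}
encoder_weights = select_by_prefix(checkpoint, "encoder.")
print(sorted(encoder_weights))

# With PyTorch, the result could then be passed to
# `module.load_state_dict(encoder_weights, strict=False)`.
```

Passing `strict=False` lets PyTorch skip keys that are absent on either side, so it is worth inspecting the returned missing and unexpected keys to confirm the encoder weights were actually picked up.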
