FVP
[ICCV 2025] FVP: 4D Visual Pre-training for Robot Learning
<p align="center"> <img src="picture/teaser.png" width="800"> </p>

FVP is a novel 3D point cloud representation learning pipeline for robotic manipulation. Unlike prior work based on contrastive learning or masked signal modeling, FVP trains 3D visual representations by taking the point cloud of the preceding frame as input and using a diffusion model to predict the point cloud of the current frame.
This is a PyTorch implementation of the paper FVP: 4D Visual Pre-training for Robot Learning:
@article{cheng2025fvp,
author = {Chengkai Hou and Yanjie Ze and Yankai Fu and Zeyu Gao and Yue Yu and Songbo Hu and Shanghang Zhang and Huazhe Xu},
title = {FVP: 4D Visual Pre-training for Robot Learning},
journal = {ICCV},
year = {2025},
}
:exclamation: This repo contains configs and experiments for both simulation and real-world datasets.
Requirements
3D Diffusion Policy
Please see the DP3 installation instructions.
FVP
In addition to the PyTorch environment, please install:
conda install pyyaml
pip install ema-pytorch tensorboard
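As a quick sanity check after installation, the sketch below reports which of the packages above cannot be imported. It is only an illustration: the helper name `missing_modules` is hypothetical and not part of this repo. The import names are `yaml` for pyyaml and `ema_pytorch` for ema-pytorch.

```python
import importlib.util

# Import names for the packages listed above. The import name can differ
# from the install name (pyyaml -> yaml, ema-pytorch -> ema_pytorch).
REQUIRED_MODULES = ["torch", "yaml", "ema_pytorch", "tensorboard"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```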
Simulation Dataset Generation
You can generate a simulation dataset by following the DP3 instructions, for example:
cd your_path/3D-Diffusion-Policy-master
bash scripts/gen_demonstration_adroit.sh hammer
Real-world Dataset Generation
We collect the real-world dataset as a dictionary, which follows the same format as the simulator dataset:
- "point_cloud": Array of shape (T, Np, 6), where Np is the number of points in each point cloud and 6 denotes [x, y, z, r, g, b]. Note: we strongly recommend cropping out the table/background and keeping only the useful points in your observation, which proved effective in our real-world experiments.
- "image": Array of shape (T, H, W, 3)
- "depth": Array of shape (T, H, W)
- "agent_pos": Array of shape (T, Nd), where Nd is the action dimension of the robot agent, e.g. 22 for our dexhand tasks (6d position of end effector + 16d joint position)
- "action": Array of shape (T, Nd). We use relative end-effector position control for the robot arm and relative joint-angle position control for the dex hand.
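The format above can be checked before training with a small validator. This is a minimal sketch, not code from this repo: the function name `check_episode_shapes` is hypothetical, and it operates on plain shape tuples so it runs without any array library. It assumes "image" and "depth" share the same H and W, as the notation above suggests.

```python
def check_episode_shapes(shapes):
    """Validate the shapes of one episode against the format listed above.

    `shapes` maps each key to a shape tuple. Returns (T, Np, Nd) on success
    and raises ValueError otherwise. Hypothetical helper, for illustration.
    """
    required = ("point_cloud", "image", "depth", "agent_pos", "action")
    missing = [key for key in required if key not in shapes]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    pc, img, depth = shapes["point_cloud"], shapes["image"], shapes["depth"]
    pos, act = shapes["agent_pos"], shapes["action"]

    if len(pc) != 3 or pc[2] != 6:
        raise ValueError(f"point_cloud must be (T, Np, 6), got {pc}")
    if len(img) != 4 or img[3] != 3:
        raise ValueError(f"image must be (T, H, W, 3), got {img}")
    if len(depth) != 3 or tuple(depth[1:]) != tuple(img[1:3]):
        raise ValueError(f"depth must be (T, H, W) matching image, got {depth}")
    if len(pos) != 2 or len(act) != 2 or pos[1] != act[1]:
        raise ValueError(f"agent_pos and action must both be (T, Nd), got {pos}, {act}")

    # Every entry must cover the same number of frames T.
    num_frames = pc[0]
    if any(shape[0] != num_frames for shape in (img, depth, pos, act)):
        raise ValueError("all arrays must share the same number of frames T")
    return num_frames, pc[1], pos[1]
```

If the episode is stored as NumPy arrays, the input can be built with `{k: v.shape for k, v in episode.items()}`.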
You can follow this example to collect a real-world dataset.
FVP Pre-training
In the config file dp3.yaml, change the dataset path to point to your own dataset. Then run FVP pre-training:
python train_gpu.py --config config/dp3.yaml
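Because FVP predicts the current frame's point cloud from the preceding frame's, a trajectory of T frames yields T-1 (previous, current) training pairs. The sketch below illustrates that pairing only. The function names are hypothetical, and how this repo actually samples pairs or handles episode boundaries is not specified here.

```python
def consecutive_frame_pairs(num_frames):
    """Return (previous, current) index pairs for one trajectory."""
    return [(t - 1, t) for t in range(1, num_frames)]


def pair_frames(frames):
    """Pair each frame with its predecessor: [(frames[t-1], frames[t]), ...]."""
    return [(frames[prev], frames[cur])
            for prev, cur in consecutive_frame_pairs(len(frames))]
```

In this sketch, pairs are built within a single trajectory, so the last frame of one episode is never paired with the first frame of the next.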
DP3 Post-training
Load the weights pre-trained with FVP, then run the standard DP3 training command:
bash scripts/train_policy.sh dp3 adroit_hammer 0112 0 0
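When a pre-training checkpoint stores more than the point cloud encoder, a common pattern is to keep only the entries under a given key prefix before loading them into the policy's encoder. The sketch below shows that filtering step on a plain dictionary. The helper name and the `"encoder."` prefix are assumptions for illustration; the actual checkpoint layout of this repo is not described here.

```python
def select_by_prefix(state_dict, prefix):
    """Keep entries whose key starts with `prefix`, with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Hypothetical checkpoint contents; real keys depend on the model definition.
checkpoint = {"encoder.mlp.weight": 1, "encoder.mlp.bias": 2, "decoder.out.weight": 3}
encoder_weights = select_by_prefix(checkpoint, "encoder.")
```

With PyTorch, a dictionary filtered this way could then be passed to the encoder module's `load_state_dict`, provided the remaining keys match the module's parameter names.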
