VideoGPA is a self-supervised framework that enhances 3D consistency in Video Diffusion Models.
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
Hongyang Du<sup>*1,2</sup> · Junjie Ye<sup>*1</sup>· Xiaoyan Cong<sup>*2</sup> · Runhao Li<sup>1</sup> · Jingcheng Ni<sup>2</sup>
Aman Agarwal<sup>2</sup> · Zeqi Zhou<sup>2</sup> · Zekun Li<sup>2</sup> · Randall Balestriero<sup>2</sup> · Yue Wang<sup>1</sup>
<sup>1</sup>Physical SuperIntelligence Lab, University of Southern California <br> <sup>2</sup>Department of Computer Science, Brown University <br> <sup>*</sup> Equal Contribution
<a href='https://arxiv.org/abs/2601.23286'><img src='https://img.shields.io/badge/arXiv-2601.23286-b31b1b.svg'></a>
<a href='https://hongyang-du.github.io/VideoGPA-Website/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
Quick Inference Scripts 🚀
This directory contains simplified command-line scripts for generating videos using CogVideoX models. These scripts are designed for quick testing and allow you to run inference directly from the terminal without preparing JSON configuration files.
Both scripts support loading LoRA adapters for customized generation.
📋 Requirements
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
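A lightweight guard like the following (a sketch, not part of the repo) lets scripts fail fast when the interpreter falls outside the supported range:

```python
import sys

# Supported interpreter range stated above (inclusive on both ends).
MIN_VERSION, MAX_VERSION = (3, 10), (3, 12)

def is_supported(version):
    """True when a (major, minor) version falls inside the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION

# Example: check the running interpreter before doing any heavy imports.
ok = is_supported(sys.version_info[:2])
```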
🔘 Checkpoint Download
Automatic Download (Recommended)
Method 1: Using the Download Script
Run the provided Python script to download checkpoint files:
# Download all checkpoints
python download_ckpt.py all
# Or download specific checkpoints
python download_ckpt.py i2v # CogVideoX-I2V-5B
python download_ckpt.py t2v # CogVideoX-5B
python download_ckpt.py t2v15 # CogVideoX1.5-5B
The script will:
- ✅ Check if files already exist (skip re-downloading)
- 🚀 Download missing checkpoints with progress bars
- 📁 Organize files into the correct directory structure
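The skip-if-present behavior can be sketched as follows (the key-to-file mapping mirrors the expected directory structure; the actual download call in download_ckpt.py may differ):

```python
from pathlib import Path

# Adapter files the script is expected to place (hypothetical mapping,
# following the expected checkpoints/ layout).
CHECKPOINTS = {
    "i2v": "VideoGPA-I2V-lora/adapter_model.safetensors",
    "t2v": "VideoGPA-T2V-lora/adapter_model.safetensors",
    "t2v15": "VideoGPA-T2V1.5-lora/adapter_model.safetensors",
}

def missing_checkpoints(root, keys):
    """Return checkpoint files under `root` that still need downloading."""
    return [CHECKPOINTS[k] for k in keys
            if not (Path(root) / CHECKPOINTS[k]).exists()]
```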
Expected Directory Structure
After a successful download, your checkpoints folder should look like:
checkpoints/
├── VideoGPA-I2V-lora/
│ └── adapter_model.safetensors
├── VideoGPA-T2V-lora/
│ └── adapter_model.safetensors
└── VideoGPA-T2V1.5-lora/
└── adapter_model.safetensors
📝 Available Scripts
1. Text-to-Video Generation (t2v_inference.py)
Generate videos from text prompts using CogVideoX-5B.
Basic Usage:
cd generate
python t2v_inference.py "A cat playing with a ball in a garden"
Advanced Usage:
python t2v_inference.py "A flying drone over a city skyline at sunset" \
--output_dir ./my_videos \
--lora_path ./checkpoints/my_lora_adapter \
--gpu_id 0
Arguments:
- prompt (required): Text prompt for video generation
- --output_dir: Directory to save generated videos (default: ./outputs)
- --lora_path: Path to LoRA adapter weights (optional)
- --gpu_id: GPU device ID (default: 0)
Output: Videos saved as {prompt}_seed{seed}.mp4
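The flags above map onto a straightforward argparse interface; this is a sketch of the documented surface, not the script itself:

```python
import argparse

def build_parser():
    """CLI surface matching the documented t2v_inference.py arguments."""
    p = argparse.ArgumentParser(description="Text-to-video with CogVideoX-5B")
    p.add_argument("prompt", help="Text prompt for video generation")
    p.add_argument("--output_dir", default="./outputs",
                   help="Directory to save generated videos")
    p.add_argument("--lora_path", default=None,
                   help="Path to LoRA adapter weights (optional)")
    p.add_argument("--gpu_id", type=int, default=0, help="GPU device ID")
    return p

# Example: parse the basic-usage invocation.
args = build_parser().parse_args(["A cat playing with a ball in a garden"])
```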
2. Image-to-Video Generation (i2v_inference.py)
Generate videos from a static image with text guidance using CogVideoX-5B-I2V.
Basic Usage:
cd generate
python i2v_inference.py "The camera slowly zooms in" ./path/to/image.jpg
Advanced Usage:
python i2v_inference.py "A realistic continuation of the reference scene. Everything must remain completely static: no moving people, no shifting objects, and no dynamic elements. Only the camera is allowed to move. Render physically accurate multi-step camera motion. Camera motion: roll gently to one side, then swing around the room, followed by push forward into the scene." ./image.png \
--output_dir ./i2v_outputs \
--lora_path ./checkpoints/i2v_lora \
--gpu_id 1
Arguments:
- prompt (required): Text prompt describing motion/scene
- image_path (required): Path to input image file
- --output_dir: Directory to save generated videos (default: ./outputs)
- --lora_path: Path to LoRA adapter weights (optional)
- --gpu_id: GPU device ID (default: 0)
Output: Videos saved as {image_name}_seed{seed}.mp4
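The output naming can be reproduced with pathlib (a sketch; the script's exact sanitization may differ):

```python
from pathlib import Path

def output_name(image_path, seed):
    """Build the documented '{image_name}_seed{seed}.mp4' filename."""
    return f"{Path(image_path).stem}_seed{seed}.mp4"
```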
⚙️ Configuration
Both scripts include configurable generation parameters:
NUM_INFERENCE_STEPS = 50 # Number of diffusion steps
GUIDANCE_SCALE = 6.0 # Classifier-free guidance scale
SEED = 42 # Seed for generation
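If you prefer not to edit module-level constants, the same parameters can be grouped into a small config object (a hypothetical refactor, not the scripts' actual structure):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenConfig:
    num_inference_steps: int = 50  # number of diffusion steps
    guidance_scale: float = 6.0    # classifier-free guidance scale
    seed: int = 42                 # seed for reproducible sampling

# Override per run without touching the defaults.
cfg = GenConfig(seed=7)
```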
💾 GPU Memory Requirements
- Minimum VRAM: ~5GB for the base models when run with diffusers in BF16
- Memory optimizations (VAE tiling/slicing) are automatically enabled
🚀 Features
- Video Quality Assessment: Comprehensive metrics for evaluating video generation quality
- DPO Training: Direct Preference Optimization for video generation models
- Multi-Model Support: Compatible with CogVideoX and other video generation models
- Flexible Pipeline: Easy-to-use inference and training pipelines
📁 Code Structure
VideoGPA/
├── data_prep/ # Data preparation scripts
├── train_dpo/ # DPO training scripts
├── pipelines/ # Inference pipelines
├── metrics/ # Quality assessment metrics
├── vggt/ # Video generation model architecture
└── utils/ # Utility functions
🔧 DPO Training (Direct Preference Optimization)
VideoGPA leverages DPO to optimize video generation quality through preference learning. Once you have your generated videos, the training pipeline consists of three steps. Adjust the configs as needed:
Step 1: Score Your Generated Videos
python train_dpo/video_scorer.py
Step 2: Encode Videos to Latent Space
# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/02_encode.py
# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/02_encode.py
# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/02_encode.py
Step 3: Run DPO Training
# For CogVideoX-I2V-5B
python train_dpo/CogVideoX-I2V-5B_lora/03_train.py
# For CogVideoX-5B
python train_dpo/CogVideoX-5B_lora/03_train.py
# For CogVideoX1.5-5B
python train_dpo/CogVideoX1.5-5B_lora/03_train.py
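The three stages can be chained from one driver; this sketch only assembles the per-model command lines (it does not execute them, and the model directory name is the only input):

```python
# Ordered stages of the DPO pipeline; the scorer is shared across models.
STAGES = ["video_scorer.py", "{model}/02_encode.py", "{model}/03_train.py"]

def dpo_commands(model_dir):
    """Return the ordered commands for one model, e.g. 'CogVideoX-5B_lora'."""
    return [["python", "train_dpo/" + stage.format(model=model_dir)]
            for stage in STAGES]
```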
Key Features:
- 🎯 Preference-based learning using winner/loser pairs
- 🔧 Parameter-efficient fine-tuning with LoRA
- 📊 Multiple quality metrics support
- ⚡ Distributed training with PyTorch Lightning
- 💾 Automatic gradient checkpointing and memory optimization
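The preference objective behind these features follows Diffusion-DPO: the policy is trained to denoise the winner better than the loser, relative to a frozen reference model. A dependency-free numeric sketch on scalar per-sample denoising errors (the real loss operates on latent tensors, and the beta value here is illustrative):

```python
import math

def diffusion_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, beta=500.0):
    """Diffusion-DPO objective on scalar denoising errors.

    err_*: policy model's denoising MSE on the winner/loser sample;
    ref_err_*: the frozen reference model's errors on the same samples.
    The loss shrinks when the policy improves on the winner more than
    on the loser, relative to the reference.
    """
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)
    z = beta * diff
    # -log(sigmoid(-z)) == softplus(z), computed in a numerically stable way.
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))
```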
Data Format: Training requires JSON metadata containing preference pairs - multiple videos generated from the same prompt with quality scores. See dataset.py for details.
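A minimal illustration of such metadata and the winner/loser pairing (field names are hypothetical; see dataset.py for the real schema):

```python
import json

# Several generations of the same prompt, each with a quality score
# (e.g. a geometric-consistency metric).
records = [
    {"prompt": "a cat in a garden", "video": "cat_seed42.mp4", "score": 0.81},
    {"prompt": "a cat in a garden", "video": "cat_seed7.mp4", "score": 0.55},
]

def to_pair(group):
    """Turn scored videos for one prompt into a winner/loser pair."""
    ranked = sorted(group, key=lambda r: r["score"], reverse=True)
    return {"prompt": ranked[0]["prompt"],
            "winner": ranked[0]["video"],
            "loser": ranked[-1]["video"]}

meta = json.dumps(to_pair(records))
```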
🙏 Acknowledgements
We would like to express our gratitude to the following projects and researchers:
- CogVideoX - The foundational state-of-the-art video generation model.
- PEFT - For the parameter-efficient fine-tuning framework used for LoRA.
- Diffusion DPO - For the innovative Direct Preference Optimization approach in the diffusion latent space.
Thanks to Dawei Liu for the amazing website design!
🌟 Citation
@misc{du2026videogpadistillinggeometrypriors,
title={VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation},
author={Hongyang Du and Junjie Ye and Xiaoyan Cong and Runhao Li and Jingcheng Ni and Aman Agarwal and Zeqi Zhou and Zekun Li and Randall Balestriero and Yue Wang},
year={2026},
eprint={2601.23286},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.23286},
}
